Preprocessing Text in Python | Set 2


# import required libraries

import nltk

import string

import re

Part of speech tagging:

Part of speech (POS) explains how a word is used in a sentence. A word can take on different contexts and semantic meanings depending on the sentence. Basic natural language processing models such as bag-of-words cannot capture these relationships between words. Hence, we use part-of-speech tagging to mark each word with the part-of-speech tag that fits its context in the data. It is also used to extract relationships between words.
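As a small illustration of the limitation mentioned above (standard library only): a bag-of-words model reduces a sentence to word counts, so two sentences with opposite meanings can become indistinguishable.

```python
from collections import Counter

# bag-of-words keeps only word counts, discarding order and context
bag1 = Counter("man bites dog".split())
bag2 = Counter("dog bites man".split())

# the two sentences mean opposite things, yet their bags are identical
print(bag1 == bag2)  # True
```

Part-of-speech tags add back some of this lost context by recording how each word functions in its sentence.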

from nltk.tokenize import word_tokenize

from nltk import pos_tag

# convert text to word_tokens with their tags

def pos_tagging(text):
    word_tokens = word_tokenize(text)
    return pos_tag(word_tokens)

pos_tagging('You just gave me a scare')

Example:

Input: 'You just gave me a scare'
Output: [('You', 'PRP'), ('just', 'RB'), ('gave', 'VBD'), ('me', 'PRP'),
('a', 'DT'), ('scare', 'NN')]

In this example, PRP stands for personal pronoun, RB for adverb, VBD for past-tense verb, DT for determiner, and NN for noun. We can look up the details of any part-of-speech tag using the Penn Treebank tag set.

# download the tagset data

nltk.download('tagsets')

# retrieve information about a tag

nltk.help.upenn_tagset('NN')


Input: 'NN'
Output: NN: noun, common, singular or mass
common-carrier cabbage knuckle-duster Casino afghan shed thermostat
investment slide humor falloff slick wind hyena override subhumanity



Chunking:

Chunking is the process of extracting phrases from unstructured text and giving them additional structure. It is also known as shallow parsing. It is done on top of part-of-speech tagging: it groups words into "chunks", mostly noun phrases. Chunking is done using regular expressions.

import nltk

from nltk.tokenize import word_tokenize

from nltk import pos_tag

# define a chunking function that takes the text and a regular
# expression representing the grammar as parameters

def chunking(text, grammar):
    word_tokens = word_tokenize(text)

    # tag the words with their part of speech
    word_pos = pos_tag(word_tokens)

    # create a chunk parser using the grammar
    chunkParser = nltk.RegexpParser(grammar)

    # run it on the list of POS-tagged word tokens
    tree = chunkParser.parse(word_pos)

    for subtree in tree.subtrees():
        print(subtree)

    tree.draw()

sentence = 'the little yellow bird is flying in the sky'

grammar = "NP: {<DT>?<JJ>*<NN>}"

chunking(sentence, grammar)

In the above example, the grammar is defined using a simple regular-expression rule. The rule says that an NP (noun phrase) chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) followed by a noun (NN).
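To see this rule in isolation, `nltk.RegexpParser` can be run directly on a hand-tagged list of (word, tag) pairs — a minimal sketch that skips `word_tokenize` and `pos_tag`, so it needs no NLTK data downloads:

```python
import nltk

# the same NP rule: optional determiner, any number of adjectives, a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"

# hand-tagged (word, tag) pairs for the example sentence
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("bird", "NN"),
          ("is", "VBZ"), ("flying", "VBG"), ("in", "IN"),
          ("the", "DT"), ("sky", "NN")]

chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tagged)

# collect only the NP chunks found by the rule
np_chunks = [" ".join(word for word, tag in subtree.leaves())
             for subtree in tree.subtrees(lambda t: t.label() == "NP")]
print(np_chunks)  # ['the little yellow bird', 'the sky']
```

Passing a filter to `subtrees()` keeps only the NP nodes, so the two noun phrases fall out directly.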

Libraries such as spaCy and TextBlob are better suited for chunking.

Example:

Input: 'the little yellow bird is flying in the sky'
Output: (S
  (NP the/DT little/JJ yellow/JJ bird/NN)
  is/VBZ
  flying/VBG
  in/IN
  (NP the/DT sky/NN))
(NP the/DT little/JJ yellow/JJ bird/NN)
(NP the/DT sky/NN)

Named Entity Recognition:

Named entity recognition is used to extract information from unstructured text. It is used to classify the entities present in the text into categories such as person, organization, event, place, etc. It gives us detailed knowledge about the text and the relationships between the different entities.

from nltk.tokenize import word_tokenize

from nltk import pos_tag, ne_chunk


def named_entity_recognition(text):
    # tokenize the text
    word_tokens = word_tokenize(text)

    # part-of-speech tagging of the words
    word_pos = pos_tag(word_tokens)

    # print the tree of named entities
    print(ne_chunk(word_pos))

text = 'Bill works for Python.Engineering so he went to Delhi for a meetup.'

named_entity_recognition(text)

Example:

Input: 'Bill works for Python.Engineering so he went to Delhi for a meetup.'
Output: (S
  (PERSON Bill/NNP)
  works/VBZ
  for/IN
  (ORGANIZATION Python.Engineering/NNP)
  so/RB
  he/PRP
  went/VBD
  to/TO
  (GPE Delhi/NNP)
  for/IN
  a/DT
  meetup/NN
  ./.)
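The printed tree can also be traversed programmatically to collect (entity, label) pairs. A minimal sketch, rebuilding the example tree with `nltk.tree.Tree.fromstring` so it runs without the NLTK model downloads (note: with the real `ne_chunk` output, the leaves are (word, tag) tuples rather than "word/tag" strings):

```python
from nltk.tree import Tree

# rebuild the example NER output as a Tree object
ner_tree = Tree.fromstring(
    "(S (PERSON Bill/NNP) works/VBZ for/IN "
    "(ORGANIZATION Python.Engineering/NNP) so/RB he/PRP went/VBD to/TO "
    "(GPE Delhi/NNP) for/IN a/DT meetup/NN)"
)

# keep only the labelled subtrees, i.e. the named entities
entities = [(" ".join(leaf.split("/")[0] for leaf in subtree.leaves()),
             subtree.label())
            for subtree in ner_tree if isinstance(subtree, Tree)]
print(entities)
# [('Bill', 'PERSON'), ('Python.Engineering', 'ORGANIZATION'), ('Delhi', 'GPE')]
```

Iterating over the top-level tree yields its children; plain tokens come through as strings, while entities come through as labelled subtrees, so an `isinstance` check separates the two.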
