
Preprocessing Text in Python | Set 2


# import required libraries
import nltk
import string
import re

Part of speech tagging:

A part of speech explains how a word is used in a sentence. Within a sentence, a word can have different contexts and semantic meanings. Basic natural language processing models such as bag of words cannot identify these relationships between words. Hence, we use part-of-speech tagging to mark each word with its part-of-speech tag based on its context in the data. It is also used to extract relationships between words.
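To make the bag-of-words limitation concrete, here is a minimal sketch using only the standard library. The helper name `bag_of_words` and the two sample sentences are illustrative assumptions, not part of the original article. It shows that two sentences with opposite meanings produce identical word counts, because the model keeps no information about order or role.

```python
from collections import Counter

def bag_of_words(text):
    # Deliberately naive tokenizer: lowercase, then split on whitespace
    return Counter(text.lower().split())

first = bag_of_words("the dog bit the man")
second = bag_of_words("the man bit the dog")

# Both sentences reduce to the same counts, so the model cannot tell
# which noun is the subject and which is the object.
print(first == second)
```

Part-of-speech tagging, by contrast, works on the ordered token sequence, which is why it can assign a role to each word.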

from nltk.tokenize import word_tokenize
from nltk import pos_tag

# convert text to word tokens with their part-of-speech tags
def pos_tagging(text):
    word_tokens = word_tokenize(text)
    return pos_tag(word_tokens)

pos_tagging('You just gave me a scare')

Example:

Input: 'You just gave me a scare'
Output: [('You', 'PRP'), ('just', 'RB'), ('gave', 'VBD'), ('me', 'PRP'),
('a', 'DT'), ('scare', 'NN')]

In this example, PRP stands for personal pronoun, RB for adverb, VBD for past-tense verb, DT for determiner, and NN for noun. We can get details of all the part-of-speech tags from the Penn Treebank tag set.

# download the tag set documentation
nltk.download('tagsets')

# retrieve information about a tag
nltk.help.upenn_tagset('NN')

Example:

Input: 'NN'
Output: NN: noun, common, singular or mass
common-carrier cabbage knuckle-duster Casino afghan shed thermostat
investment slide humor falloff slick wind hyena override subhumanity
machinist…
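For the handful of tags that appear in the example sentence, a small lookup table can be enough to make tagger output readable. The sketch below is an illustration only: `TAG_DESCRIPTIONS` and `describe_tags` are hypothetical names, the descriptions are the ones given in the text above, and the tagged list is copied from the example output rather than produced by calling the tagger.

```python
# Descriptions for the Penn Treebank tags used in the example sentence
TAG_DESCRIPTIONS = {
    "PRP": "personal pronoun",
    "RB": "adverb",
    "VBD": "past-tense verb",
    "DT": "determiner",
    "NN": "noun",
}

def describe_tags(tagged_tokens):
    # Tags missing from the table are passed through unchanged, not guessed
    return [(word, TAG_DESCRIPTIONS.get(tag, tag)) for word, tag in tagged_tokens]

# Tagged tokens copied from the example output
tagged = [("You", "PRP"), ("just", "RB"), ("gave", "VBD"),
          ("me", "PRP"), ("a", "DT"), ("scare", "NN")]

for word, description in describe_tags(tagged):
    print(f"{word}: {description}")
```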

 

Chunking:

Chunking is the process of extracting phrases from unstructured text and adding structure to it. It is also known as shallow parsing. It is done on top of part-of-speech tagging: it groups words into "chunks", mostly noun phrases. Chunking is done using regular expressions.

import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# define a chunking function that takes the text and a regular
# expression representing the grammar as parameters
def chunking(text, grammar):
    word_tokens = word_tokenize(text)

    # label words with their part of speech
    word_pos = pos_tag(word_tokens)

    # create a chunk parser using the grammar
    chunkParser = nltk.RegexpParser(grammar)

    # parse the list of POS-tagged tokens
    tree = chunkParser.parse(word_pos)

    for subtree in tree.subtrees():
        print(subtree)
    tree.draw()

sentence = 'the little yellow bird is flying in the sky'
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunking(sentence, grammar)

In the above example, the grammar is defined using a simple regular expression rule. This rule says that an NP (noun phrase) chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN).

Libraries such as spaCy and TextBlob are better suited to chunking.

Example:

Input: 'the little yellow bird is flying in the sky'
Output:
(S
  (NP the/DT little/JJ yellow/JJ bird/NN)
  is/VBZ
  flying/VBG
  in/IN
  (NP the/DT sky/NN))
(NP the/DT little/JJ yellow/JJ bird/NN)
(NP the/DT sky/NN)
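To see how a rule of the form "optional DT, any number of JJ, then NN" picks out these two chunks, the matching can be imitated with the standard `re` module over the tag sequence alone. This is a simplified stand-in for illustration, not NLTK's implementation; the function name `chunk_noun_phrases` is hypothetical, and the tagged tokens are copied from the output above instead of being produced by the tagger.

```python
import re

def chunk_noun_phrases(tagged_tokens):
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>"
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    # Optional determiner, any number of adjectives, then a noun
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

    chunks = []
    for match in pattern.finditer(tag_string):
        # Map character offsets back to token indices by counting "<"
        start = tag_string.count("<", 0, match.start())
        end = start + match.group().count("<")
        chunks.append(" ".join(word for word, _ in tagged_tokens[start:end]))
    return chunks

# Tagged tokens copied from the example output
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("bird", "NN"),
          ("is", "VBZ"), ("flying", "VBG"), ("in", "IN"),
          ("the", "DT"), ("sky", "NN")]

print(chunk_noun_phrases(tagged))
```

Because each tag is wrapped in angle brackets, `<NN>` matches only the exact tag NN and not longer tags such as NNS, which is the same distinction the grammar rule makes.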

Named Entity Recognition:

Named entity recognition is used to extract information from unstructured text. It classifies the entities present in the text into categories such as person, organization, event, and place. This gives us detailed knowledge about the text and the relationships between the different entities.

from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk

def named_entity_recognition(text):
    # tokenize the text
    word_tokens = word_tokenize(text)

    # tag each word with its part of speech
    word_pos = pos_tag(word_tokens)

    # print the tree of named entities
    print(ne_chunk(word_pos))

text = 'Bill works for Python.Engineering so he went to Delhi for a meetup.'
named_entity_recognition(text)

Example:

Input: 'Bill works for Python.Engineering so he went to Delhi for a meetup.'
Output:
(S
  (PERSON Bill/NNP)
  works/VBZ
  for/IN
  (ORGANIZATION Python.Engineering/NNP)
  so/RB
  he/PRP
  went/VBD
  to/TO
  (GPE Delhi/NNP)
  for/IN
  a/DT
  meetup/NN
  ./.)
