NLP | Tokenizer training and filtering stop words in a sentence

Let’s look at the following text to understand the concept. This kind of text is very common in any web text corpus.

Example text: A guy: So, what are your plans for the party? B girl: Well! I am not going! A guy: Oh, but u should enjoy.

The code below assumes this text has been saved to a local training file (data_for_training_tokenizer.txt).

Code #1: Training Tokenizer

# Loading libraries
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext

# Raw web text used to train the tokenizer
text = webtext.raw('C:\\Geeksforgeeks\\data_for_training_tokenizer.txt')

# Train a Punkt tokenizer on the raw text itself
sent_tokenizer = PunktSentenceTokenizer(text)
sents_1 = sent_tokenizer.tokenize(text)

print(sents_1[0])
print(sents_1[678])

Output:

'White guy: So, do you have any plans for this evening?'
'Hobo: Got any spare change?'

Code #2: Default Sentence Tokenizer

from nltk.tokenize import sent_tokenize

sents_2 = sent_tokenize(text)

print(sents_2[0])
print(sents_2[678])

Output:

'White guy: So, do you have any plans for this evening?'
'Girl: But you already have a Big Mac... Hobo: Oh, this is all theatrical.'

The difference in the second output is a good demonstration of why it can be useful to train your own sentence tokenizer, especially when your text does not follow the typical paragraph and sentence structure of standard prose.

How does the training work?
The PunktSentenceTokenizer class uses an unsupervised learning algorithm to learn what constitutes a sentence break. It is unsupervised because no labeled training data is needed, only raw text.
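As a rough sketch of what happens under the hood, the same training can be done step by step with NLTK's PunktTrainer, whose learned parameters (for example, the set of abbreviations it discovered in the raw text) can then be inspected. The file path below is simply the training file assumed in Code #1.

# A minimal sketch of incremental Punkt training (assumes the same
# training file as in Code #1)
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
from nltk.corpus import webtext

text = webtext.raw('C:\\Geeksforgeeks\\data_for_training_tokenizer.txt')

# Accumulate sentence-boundary statistics from raw, unlabeled text
trainer = PunktTrainer()
trainer.train(text, finalize=False)
trainer.finalize_training()

# Inspect what the unsupervised algorithm has learned
params = trainer.get_params()
print(params.abbrev_types)   # abbreviations discovered in the text

# Build a tokenizer from the learned parameters
custom_tokenizer = PunktSentenceTokenizer(params)
print(custom_tokenizer.tokenize(text)[0])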

Filtering stop words in the tokenized sentence

Stop words are common words that appear in text but, as a rule, contribute little to the meaning of a sentence, for example "the" and "a". They are largely irrelevant for information retrieval and natural language processing purposes, and most search engines filter them out of search queries and documents.
The NLTK library ships with a stopwords corpus under nltk_data/corpora/stopwords/, which contains word lists for many languages.

Code #3: Stopwords with Python

# Loading library
from nltk.corpus import stopwords

# Using the English stop word list
english_stops = set(stopwords.words('english'))

# Words to filter
words = ["Let's", 'see', 'how', "it's", 'working']

print("Before stopwords removal:", words)
print("After stopwords removal:",
      [word for word in words if word not in english_stops])

Output:

Before stopwords removal: ["Let's", 'see', 'how', "it's", 'working']
After stopwords removal: ["Let's", 'see', 'working']
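Note that the removal above is a plain, case-sensitive set-membership test: NLTK's English list is all lowercase, so capitalized stop words such as "How" or "The" would slip through unchanged. A minimal sketch (the token list here is purely illustrative) that lowercases each token before the test:

from nltk.corpus import stopwords

english_stops = set(stopwords.words('english'))

# Illustrative token list containing capitalized stop words
words = ["Let's", 'How', 'IS', 'it', 'working']

# Lowercase each token before checking membership so that
# capitalized stop words ("How", "IS") are also filtered out
print([word for word in words if word.lower() not in english_stops])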

Code #4: Complete list of languages used in NLTK stopwords

stopwords.fileids()

Output:

['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']
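Any of these file IDs can be passed to stopwords.words() in the same way as 'english'. A small sketch, using 'french' from the list above, that loads another word list and combines it with the English one for multilingual text:

from nltk.corpus import stopwords

# 'french' is one of the file IDs listed by stopwords.fileids()
french_stops = set(stopwords.words('french'))
print(len(french_stops))   # size of the French stop word list

# Word lists can be combined for multilingual text
multi_stops = set(stopwords.words('english')) | french_stops
print('le' in multi_stops, 'the' in multi_stops)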
