
NLP | Training a tokenizer and filtering stopwords in a sentence


Let’s look at the following text to understand the concept. This kind of text is very common in any web text corpus.

  Example of TEXT:  A guy: So, what are your plans for the party? B girl: well! I am not going! A guy: Oh, but u should enjoy. 


Code #1: Training a sentence tokenizer

# Loading libraries
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext

text = webtext.raw('C:\\Geeksforgeeks\\data_for_training_tokenizer.txt')
sent_tokenizer = PunktSentenceTokenizer(text)
sents_1 = sent_tokenizer.tokenize(text)

print(sents_1[0])
print(sents_1[678])


Output:

'White guy: So, do you have any plans for this evening?'
'Hobo: Got any spare change?'

Code #2: Default sentence tokenizer

from nltk.tokenize import sent_tokenize

sents_2 = sent_tokenize(text)

print(sents_2[0])
print(sents_2[678])


Output:

'White guy: So, do you have any plans for this evening?'
'Girl: But you already have a Big Mac... Hobo: Oh, this is all theatrical.'

This difference in the second output is a good demonstration of why it can be useful to train your own sentence tokenizer, especially when your text does not follow a typical paragraph-and-sentence structure.

How does training work?
The PunktSentenceTokenizer class uses an unsupervised learning algorithm to learn what constitutes a sentence break. It is unsupervised because you do not need to supply any labeled training data, just raw text.
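One intuition behind this kind of unsupervised training can be shown with a toy sketch (this is a simplified illustration of the abbreviation-counting idea, not the actual Punkt algorithm): a token ending in a period that is usually followed by a lowercase word is probably an abbreviation, not a sentence boundary.

```python
from collections import defaultdict

def abbreviation_candidates(text, threshold=0.5):
    """Toy illustration: count how often each period-terminated token
    is followed by a lowercase word; frequent ones are likely
    abbreviations rather than sentence ends. (Not the real Punkt.)"""
    followed_by_lower = defaultdict(int)
    total = defaultdict(int)
    tokens = text.split()
    for tok, nxt in zip(tokens, tokens[1:]):
        if tok.endswith('.'):
            word = tok.rstrip('.').lower()
            total[word] += 1
            if nxt[0].islower():
                followed_by_lower[word] += 1
    return {w for w in total
            if followed_by_lower[w] / total[w] > threshold}

sample = ("Mr. smith met dr. jones. They talked. "
          "Mr. brown left. The end came later.")
print(abbreviation_candidates(sample))  # {'mr', 'dr'} (set order may vary)
```

Here "Mr." and "dr." are always followed by lowercase words, so they are flagged as abbreviation candidates, while "jones.", "talked." and "left." are not. Punkt learns statistics of this flavor (plus collocations and sentence-starter words) directly from raw text.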

Filtering stop words in the tokenized sentence

Stopwords are common words that appear in text but, as a rule, do not add much to the meaning of a sentence. They are almost irrelevant for information retrieval and natural language processing purposes; for example, "a" and "the". Most search engines filter stopwords out of search queries and documents.
The NLTK library ships with a stopwords corpus — nltk_data/corpora/stopwords — which contains wordlists for many languages.

Code # 3: Stopwords with Python

# Loading library
from nltk.corpus import stopwords

# Using English stopwords
english_stops = set(stopwords.words('english'))

words = ["Let's", 'see', 'how', "it's", 'working']

print("Before stopwords removal:", words)
print("After stopwords removal:",
      [word for word in words if word not in english_stops])

Output:

Before stopwords removal: ["Let's", 'see', 'how', "it's", 'working']
After stopwords removal: ["Let's", 'see', 'working']
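Note that "Let's" survives the filter because stopword matching is case-sensitive: "let's" is in the stopword set, "Let's" is not. A minimal sketch of case-insensitive filtering follows; it uses a small hand-written stand-in for NLTK's English stopword list so it runs without any corpus download.

```python
# Hand-picked stand-in for stopwords.words('english') (illustrative only)
english_stops = {"let's", 'how', "it's", 'a', 'the'}

words = ["Let's", 'see', 'how', "it's", 'working']

# Lowercase each word before the membership test
filtered = [w for w in words if w.lower() not in english_stops]
print(filtered)  # ['see', 'working']
```

With the lowercase comparison, "Let's" is now removed as well; whether that is desirable depends on whether capitalization carries meaning in your corpus.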

Code #4: Complete list of languages available in NLTK stopwords

stopwords.fileids()

Output:

['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']

