NLP | Tokenizer training and filtering stop words in a sentence



Let's look at the following text to understand the concept. This kind of text is very common in any web text corpus.

Example of text:

A guy: So, what are your plans for the party?
B girl: well! I am not going!
A guy: Oh, but u should enjoy.


Code # 1: Training Tokenizer

# Loading libraries
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext

# Read the raw training text; adjust the path to where the file is stored
text = webtext.raw('C:\\Geeksforgeeks\\data_for_training_tokenizer.txt')

# Train a sentence tokenizer on the raw text, then split the same text
sent_tokenizer = PunktSentenceTokenizer(text)
sents_1 = sent_tokenizer.tokenize(text)

print(sents_1[0])
print("\n" + sents_1[678])

Output:

White guy: So, do you have any plans for this evening?

Hobo: Got any spare change?

Code # 2: Default Sentence Tokenizer

from nltk.tokenize import sent_tokenize

# Split the same text with NLTK's default, pre-trained sentence tokenizer
sents_2 = sent_tokenize(text)

print(sents_2[0])
print("\n" + sents_2[678])

Output:

White guy: So, do you have any plans for this evening?

Girl: But you already have a Big Mac...
Hobo: Oh, this is all theatrical.

The difference in the second output is a good demonstration of why it can be useful to train your own sentence tokenizer, especially when your text does not follow a typical paragraph-and-sentence structure. Here the default tokenizer returns the lines of two speakers as a single sentence.
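To see where two tokenizers start to disagree, a small helper can walk both sentence lists in parallel. This is only a sketch: the function name `first_disagreement` and the two short lists are hypothetical stand-ins for `sents_1` and `sents_2`, built from the outputs shown above.

```python
def first_disagreement(sents_a, sents_b):
    """Return the first index where two sentence lists differ, or None."""
    for index, (a, b) in enumerate(zip(sents_a, sents_b)):
        if a != b:
            return index
    # All shared positions match, but one list may be longer than the other.
    if len(sents_a) != len(sents_b):
        return min(len(sents_a), len(sents_b))
    return None


# Hypothetical stand-ins for sents_1 (trained) and sents_2 (default).
trained = [
    "White guy: So, do you have any plans for this evening?",
    "Hobo: Got any spare change?",
]
default = [
    "White guy: So, do you have any plans for this evening?",
    "Girl: But you already have a Big Mac...\nHobo: Oh, this is all theatrical.",
]

print(first_disagreement(trained, default))  # prints 1
```

With the real lists, the returned index is a convenient place to start inspecting how the two segmentations drift apart.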

How does training work?
The PunktSentenceTokenizer class uses an unsupervised learning algorithm to work out what constitutes a sentence break. It is unsupervised because there is no need to provide any labeled training data, just raw text.
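A minimal sketch of this, assuming NLTK is installed: the tokenizer is trained on a plain string with no labels at all. The training text and the sample sentence below are made up for illustration and are far smaller than a real corpus, so the resulting splits should not be taken as representative.

```python
from nltk.tokenize import PunktSentenceTokenizer

# Hypothetical raw training text in a dialogue style (not from the webtext corpus).
train_text = (
    "Guy: So, what are your plans for the party?\n"
    "Girl: Well! I am not going!\n"
    "Guy: Oh, but you should enjoy.\n"
)

# Passing raw text to the constructor trains the model; no labels are supplied.
tokenizer = PunktSentenceTokenizer(train_text)

# Split a new, unseen piece of text with the trained tokenizer.
sample = "Guy: Are you coming? Girl: No. I have work to do."
sentences = tokenizer.tokenize(sample)

for sentence in sentences:
    print(sentence)
```

In practice the training text would be a large sample of the same kind of text you intend to split, as in Code # 1.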

Filtering stop words in the tokenized sentence

Stop words are common words that appear in text but, as a rule, contribute little to the meaning of a sentence. They are almost irrelevant for information retrieval and natural language processing purposes. Examples are “the” and “a”. Most search engines filter stop words out of search queries and documents.
The NLTK library comes with a stopwords corpus, nltk_data/corpora/stopwords/, which contains word lists for many languages.

Code # 3: Stopwords with Python

# Loading library
from nltk.corpus import stopwords

# Using stop words from English
english_stops = set(stopwords.words('english'))

# Words to filter
words = ["Let's", 'see', 'how', "it's", 'working']

print("Before stopwords removal:", words)
print("After stopwords removal:",
      [word for word in words if word not in english_stops])

Output:

Before stopwords removal: ["Let's", 'see', 'how', "it's", 'working']
After stopwords removal: ["Let's", 'see', 'working']
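One thing to watch for is capitalization: as far as I know, the NLTK English stop word list is all lowercase, so an exact-match test will not catch a capitalized word such as "The". The sketch below shows the difference; the tiny `english_stops` set is a hypothetical stand-in for `set(stopwords.words('english'))` so that the example runs without downloading the corpus.

```python
# Hypothetical mini stop list standing in for set(stopwords.words('english')).
english_stops = {"the", "a", "is", "how", "it's"}

words = ["The", "tokenizer", "is", "working"]

# Exact-match filtering misses "The" because the stop list is lowercase.
exact = [word for word in words if word not in english_stops]

# Lowercasing only for the membership test keeps the original casing in the result.
folded = [word for word in words if word.lower() not in english_stops]

print(exact)   # ['The', 'tokenizer', 'working']
print(folded)  # ['tokenizer', 'working']
```

Whether to fold case depends on the task; for some texts a capitalized word is a name rather than a stop word.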

Code # 4: Complete list of languages used in NLTK stopwords.

stopwords.fileids()

Output:

['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']