
NLP | Tokenizer training and filtering stop words in a sentence


Let’s look at the following text to understand the concept. This kind of text is very common in any web text corpus.

Example text: A guy: So, what are your plans for the party? B girl: Well! I am not going! A guy: Oh, but u should enjoy.


Code #1: Training a sentence tokenizer

# Loading libraries
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext

text = webtext.raw('C:\\Geeksforgeeks\\data_for_training_tokenizer.txt')
sent_tokenizer = PunktSentenceTokenizer(text)
sents_1 = sent_tokenizer.tokenize(text)

print(sents_1[0])
print(sents_1[678])

Output:

'White guy: So, do you have any plans for this evening?'
'Hobo: Got any spare change?'

Code #2: Default sentence tokenizer

from nltk.tokenize import sent_tokenize

sents_2 = sent_tokenize(text)

print(sents_2[0])
print(sents_2[678])

Output:

'White guy: So, do you have any plans for this evening?'
'Girl: But you already have a Big Mac... Hobo: Oh, this is all theatrical.'

The difference in the second output is a good demonstration of why it can be useful to train your own sentence tokenizer, especially when your text does not follow a typical paragraph-and-sentence structure.

How does the training work?
The PunktSentenceTokenizer class uses an unsupervised learning algorithm to learn what constitutes a sentence break. It is unsupervised because no labeled training data is needed, just raw text.
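The same training can be driven explicitly through PunktTrainer, the class that PunktSentenceTokenizer uses internally. A minimal sketch, with a made-up sample text standing in for the chat corpus:

```python
# Sketch: unsupervised Punkt training on raw text only --
# no labeled data and no pre-downloaded model are needed.
from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer

raw = ("Dr. Smith arrived at the office. He greeted everyone. "
       "Mrs. Jones was late. She apologised at once.")

trainer = PunktTrainer()
trainer.train(raw)  # learns abbreviations, collocations, sentence starters

tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize(raw))
```

With enough text, the trainer learns that abbreviations such as "Dr." and "Mrs." do not end a sentence; a toy sample this small will not always split perfectly.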

Filtering stop words in the tokenized sentence

Stop words are common words that appear in text but, as a rule, contribute little to the meaning of a sentence. They are almost irrelevant for information retrieval and natural language processing purposes, for example "a" and "the". Most search engines filter stop words out of search queries and documents.
The NLTK library ships with a stopwords corpus under nltk_data/corpora/stopwords/, which contains wordlists for many languages.

Code # 3: Stopwords with Python

# Loading library
from nltk.corpus import stopwords

# Using the English stop word list
english_stops = set(stopwords.words('english'))

# Words to filter
words = ["Let's", 'see', 'how', "it's", 'working']

print("Before stopwords removal:", words)
print("After stopwords removal:",
      [word for word in words if word not in english_stops])

Output:

Before stopwords removal: ["Let's", 'see', 'how', "it's", 'working']
After stopwords removal: ["Let's", 'see', 'working']
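The same filtering can be applied to a tokenized sentence from the sample dialogue. The sketch below uses a small hand-made stop set (an assumption standing in for stopwords.words('english')) so it runs without the NLTK data download:

```python
# Sketch: filtering stop words out of a tokenized sentence.
# stop_set is a hand-made stand-in for NLTK's English stop word list.
stop_set = {'so', 'what', 'are', 'your', 'for', 'the',
            'i', 'am', 'not', 'but', 'you', 'should'}

sentence = "So, what are your plans for the party?"

# Naive whitespace tokenization with punctuation stripped and lowercasing
tokens = [t.strip('.,!?').lower() for t in sentence.split()]
content_words = [t for t in tokens if t and t not in stop_set]

print(content_words)  # → ['plans', 'party']
```

Only the content-bearing words survive; in practice you would tokenize with nltk.word_tokenize and use the real English stop word list instead.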

Code #4: Complete list of languages available in NLTK stopwords.

stopwords.fileids()

Output:

['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']
