NLP | How Tokenizing Text, Sentences, and Words Works

Tokenization is the process of splitting a string or text into a list of tokens. You can think of a token as a part of a whole: a word is a token in a sentence, and a sentence is a token in a paragraph.

Key points of the article —

  • Tokenization of text into sentences
  • Tokenization of sentences into words
  • Tokenization of sentences using regular expressions

Code #1: Sentence Tokenization — splitting sentences in a paragraph.

from nltk.tokenize import sent_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text)

Output:

['Hello everyone.', 'Welcome to GeeksforGeeks.', 'You are studying NLP article']

How does sent_tokenize work?
The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module, which is already trained and therefore knows very well which characters and punctuation mark the beginning and end of a sentence.
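For example, the pretrained Punkt model can usually tell a sentence-final period from the period in a common abbreviation. A minimal sketch (the sample text is made up for illustration):

from nltk.tokenize import sent_tokenize

# The pretrained English Punkt model treats the period in "Dr." as part
# of an abbreviation rather than a sentence boundary
text = "Dr. Smith arrived. He was late."
print(sent_tokenize(text))
# Expected: ['Dr. Smith arrived.', 'He was late.']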

Code #2: PunktSentenceTokenizer — when we have huge chunks of data, it is efficient to load the tokenizer once and reuse it.

import nltk.data

# Load PunktSentenceTokenizer from the English pickle file
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')

text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
tokenizer.tokenize(text)

Output:

['Hello everyone.', 'Welcome to GeeksforGeeks.', 'You are studying NLP article']
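The gain over calling sent_tokenize repeatedly is that the pickled model is loaded only once and then reused. A minimal sketch of that pattern (the docs list is hypothetical):

import nltk.data

# Load the tokenizer once, then reuse it across many documents
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')

docs = ["First document. It has two sentences.",
        "Second document. Also two sentences."]
for doc in docs:
    print(tokenizer.tokenize(doc))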

Code #3: Tokenizing sentences in another language. You can also tokenize sentences in a different language by loading the corresponding pickle file instead of the English one.

import nltk.data

# Load the Spanish pickle file instead of the English one
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

text = 'Hola amigo. Estoy bien.'
spanish_tokenizer.tokenize(text)

Output:

['Hola amigo.', 'Estoy bien.']

Code #4: Word Tokenization — splitting a sentence into words.

from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks."
word_tokenize(text)

Output:

['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.']

How does word_tokenize work?
word_tokenize() is a wrapper function that calls tokenize() on an instance of the TreebankWordTokenizer class.
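A rough sketch of that wiring, assuming the current NLTK behaviour in which word_tokenize first splits the text into sentences and then runs the Treebank tokenizer on each one:

from nltk.tokenize import sent_tokenize, TreebankWordTokenizer

# Approximate re-implementation of word_tokenize: sentence-split first,
# then tokenize each sentence with the Treebank tokenizer
text = "Hello everyone. Welcome to GeeksforGeeks."
tokenizer = TreebankWordTokenizer()
tokens = [tok for sent in sent_tokenize(text)
          for tok in tokenizer.tokenize(sent)]
print(tokens)
# ['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.']

The sentence-splitting step is also why word_tokenize separates the period after 'everyone', while calling TreebankWordTokenizer directly on the whole text (Code #5 below) leaves 'everyone.' intact.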

Code #5: Using TreebankWordTokenizer

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
text = "Hello everyone. Welcome to GeeksforGeeks."
tokenizer.tokenize(text)

Output:

['Hello', 'everyone.', 'Welcome', 'to', 'GeeksforGeeks', '.']

These tokenizers work by separating words using punctuation marks and spaces. And, as the code output above shows, they do not discard the punctuation, allowing the user to decide what to do with it during preprocessing.
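A minimal sketch of one such preprocessing choice, dropping standalone punctuation tokens (the filtering step here is illustrative, not part of NLTK):

import string
from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks."
tokens = word_tokenize(text)

# One possible choice: drop tokens that are pure punctuation
words_only = [tok for tok in tokens if tok not in string.punctuation]
print(words_only)
# ['Hello', 'everyone', 'Welcome', 'to', 'GeeksforGeeks']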

Code #6: PunktWordTokenizer — it does not separate the punctuation from the words.

from nltk.tokenize import PunktWordTokenizer

# Note: PunktWordTokenizer was removed around NLTK 3.0, so this import
# only works on older NLTK releases
tokenizer = PunktWordTokenizer()
tokenizer.tokenize("Let's see how it's working.")

Output:

['Let', "'s", 'see', 'how', 'it', "'s", 'working', '.']

Code #7: WordPunctTokenizer — it separates the punctuation from the words.

from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Let's see how it's working.")

Output:

['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']

Code #8: Using Regular Expressions with RegexpTokenizer

from nltk.tokenize import RegexpTokenizer

# Keep runs of word characters and apostrophes together as tokens
tokenizer = RegexpTokenizer(r"[\w']+")

text = "Let's see how it's working."
tokenizer.tokenize(text)

Output:

 ["Let's", 'see',' how', "it's", 'working'] 

Code #9: Using the regexp_tokenize() function

from nltk.tokenize import regexp_tokenize

text = "Let's see how it's working."
regexp_tokenize(text, r"[\w']+")

Output:

 ["Let's", 'see',' how', "it's", 'working'] 



