Let's look at the following text to understand the concept. This kind of informal dialogue is very common in web text corpora.
Example text: Guy: So, what are your plans for the party? Girl: Well! I am not going! Guy: Oh, but u should enjoy.
Code # 1: Training Tokenizer
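The code for this snippet is not included in the text, so here is a minimal sketch of how such a tokenizer can be trained with NLTK's PunktSentenceTokenizer. The short training string below is a hypothetical stand-in; in practice you would train on a large raw corpus (e.g. nltk.corpus.webtext after nltk.download('webtext')).

```python
from nltk.tokenize import PunktSentenceTokenizer

# Hypothetical raw training text; in practice, train on a large corpus,
# e.g. nltk.corpus.webtext.raw('overheard.txt').
train_text = (
    "White guy: So, do you have any plans for this evening? "
    "Asian girl: Yeah, being angry! "
    "White guy: Oh, that sounds fun."
)

# "Training" is just passing raw, unlabeled text to the constructor
tokenizer = PunktSentenceTokenizer(train_text)

sentences = tokenizer.tokenize("Hobo: Got any spare change? White guy: No, sorry.")
print(sentences)
```

The trained tokenizer picks up sentence boundaries from the raw text alone, which is why it copes better with chat-style dialogue than a model trained on formal prose.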
'White guy: So, do you have any plans for this evening?'
'Hobo: Got any spare change?'
Code # 2: Default Sentence Tokenizer
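The usual default in NLTK is nltk.sent_tokenize, which is backed by a Punkt model pre-trained on formal English text (it requires nltk.download('punkt')). To keep this sketch self-contained and runnable without downloading that model, a Punkt tokenizer with default, untrained parameters stands in for it here; it exposes the same interface.

```python
from nltk.tokenize import PunktSentenceTokenizer

# Stand-in for nltk.sent_tokenize: a Punkt tokenizer with default
# (untrained) parameters, so no model download is needed for this sketch.
default_tokenizer = PunktSentenceTokenizer()

text = ("White guy: So, do you have any plans for this evening? "
        "Girl: But you already have a Big Mac... "
        "Hobo: Oh, this is all theatrical.")

sentences = default_tokenizer.tokenize(text)
for s in sentences:
    print(s)
```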
'White guy: So, do you have any plans for this evening? Girl: But you already have a Big Mac... Hobo: Oh, this is all theatrical.'
The difference in the second output is a good demonstration of why it can be useful to train your own sentence tokenizer, especially when your text does not follow a typical paragraph-and-sentence structure.
How does learning work?
The PunktSentenceTokenizer class uses an unsupervised learning algorithm to learn what constitutes a sentence break. It is unsupervised because no labeled training data is required: just raw text.
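Training can also be done incrementally with nltk.tokenize.punkt.PunktTrainer, which collects statistics such as likely abbreviations and collocations from raw text. A sketch with a tiny made-up training string:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

trainer = PunktTrainer()
# Feed raw, unlabeled text; more calls to train() can add more text
trainer.train(
    "Mr. Lucien gave a speech. He spoke for an hour. Then Mr. Chen replied.",
    finalize=False,
)
# Compute the learned parameters (abbreviations, collocations, etc.)
trainer.finalize_training(verbose=False)

# Build a tokenizer from the learned parameters
tokenizer = PunktSentenceTokenizer(trainer.get_params())
result = tokenizer.tokenize("She left. He stayed.")
print(result)
```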
Filtering stop words in the tokenized sentence
Stopwords are common words that appear in text but, as a rule, do not contribute to the meaning of a sentence. They carry almost no value for information retrieval and natural language processing purposes: for example, "a" and "the". Most search engines filter stopwords out of search queries and documents.
The NLTK library ships with a stopwords corpus (nltk_data/corpora/stopwords/), which contains word lists for many languages.
Code # 3: Stopwords with Python
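The removal step itself is a simple list comprehension over the tokens. A minimal sketch, using a small hand-picked stopword set so it runs without downloading the NLTK corpus; in practice you would use set(stopwords.words('english')):

```python
# In practice:
#   from nltk.corpus import stopwords
#   stop_words = set(stopwords.words('english'))  # needs nltk.download('stopwords')
# A small hand-picked subset stands in here so the sketch runs offline:
stop_words = {'a', 'an', 'the', 'how', 'it', "it's"}

tokens = ["Let's", 'see', 'how', "it's", 'working']

# Lowercase each token before the lookup, since stopword lists are lowercase
filtered = [w for w in tokens if w.lower() not in stop_words]

print('Before stopwords removal:', tokens)
print('After stopwords removal:', filtered)  # ["Let's", 'see', 'working']
```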
Before stopwords removal: ["Let's", 'see', 'how', "it's", 'working']
After stopwords removal: ["Let's", 'see', 'working']
Code # 4: Complete list of languages used in NLTK stopwords.
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']