Let's look at the following text to understand the concept. This kind of text is very common in any web text corpus.
Example text: A guy: So, what are your plans for the party? B girl: Well! I am not going! A guy: Oh, but you should enjoy.
To download the text file, click here.
Code # 1: Training a Tokenizer
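The original code listing is not shown, so here is a minimal sketch of training a Punkt tokenizer on raw text. The inline `text` string is a hypothetical stand-in for the downloadable training file mentioned above:

```python
from nltk.tokenize import PunktSentenceTokenizer

# Hypothetical stand-in for the downloadable training file
text = ("White guy: So, do you have any plans for this evening? "
        "Hobo: Got any spare change? "
        "Girl: But you already have a Big Mac... "
        "Hobo: Oh, this is all theatrical.")

# Passing raw text to the constructor trains the unsupervised
# Punkt algorithm on it; no labelled data is required.
tokenizer = PunktSentenceTokenizer(text)
sentences = tokenizer.tokenize(text)
print(sentences[0:2])
```

Once trained, the same tokenizer can be applied to any other text with `tokenizer.tokenize(other_text)`.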
Output:
['White guy: So, do you have any plans for this evening?', 'Hobo: Got any spare change?']
Code # 2: Default Sentence Tokenizer
['White guy: So, do you have any plans for this evening?', 'Girl: But you already have a Big Mac... Hobo: Oh, this is all theatrical.']
This difference in the second output is a good demonstration of why it can be useful to train your own sentence tokenizer, especially when your text does not follow a typical paragraph-and-sentence structure.
How does learning work?
The PunktSentenceTokenizer class uses an unsupervised learning algorithm to work out what constitutes a sentence break. It is unsupervised because it needs no labelled training data, just raw text.
Stop words are common words that appear in text but, as a rule, do not affect the meaning of a sentence. They are almost irrelevant for information retrieval and natural language processing purposes; for example, "the" and "a". Most search engines filter stop words out of search queries and documents.
The NLTK library comes with a stopwords corpus, nltk_data/corpora/stopwords/, which contains wordlists for many languages.
Code # 3: Stopwords with Python
Before stopwords removal: ["Let's", 'see', 'how', "it's", 'working'] After stopwords removal: ["Let's", 'see', 'working']
Code # 4: Complete list of languages used in NLTK stopwords.
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']