Text Processing Using NLP | Basics



In this article, we discuss how to fetch text from an online text file and extract the data we need from it. We will use the text file at https://www.w3.org/TR/PNG/iso_8859-1.txt.

The following must be available in your working environment:

  • NLTK library
  • urllib (part of the Python standard library)
  • BeautifulSoup library
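
Before going further, it can help to confirm that these libraries are importable. The snippet below is a minimal sketch, not part of the original walkthrough; the helper name missing_modules and the REQUIRED_MODULES list are our own, and the only assumption is that BeautifulSoup is installed under its usual import name, bs4.

```python
from importlib.util import find_spec

# Import names of the libraries listed above. BeautifulSoup is imported
# as "bs4"; urllib ships with Python itself.
REQUIRED_MODULES = ["nltk", "urllib", "bs4"]


def missing_modules(names):
    """Return the subset of `names` that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

Anything reported as missing would need to be installed before running the steps that follow.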

Step #1: Import the required libraries

import nltk
from bs4 import BeautifulSoup
from urllib.request import urlopen

Some basic information about the above libraries:

  • NLTK library: nltk is a suite of Python libraries and programs for natural language processing, aimed primarily at English text.
  • urllib library: a URL handling library for Python.
  • BeautifulSoup library: a library for extracting data from HTML and XML documents.

Step #2: Extract the entire contents of the text file.

raw = urlopen("https://www.w3.org/TR/PNG/iso_8859-1.txt").read()

The raw data is now loaded, as bytes, into the variable raw.
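
Because read() returns bytes rather than a string, it is sometimes convenient to decode the download right away. The following is a minimal sketch under our own assumptions: the function name fetch_text, the utf-8 fallback encoding, and the 10-second timeout are illustrative choices, not part of the original code.

```python
from urllib.request import urlopen


def fetch_text(url, default_encoding="utf-8", timeout=10):
    """Download `url` and return its body decoded to a str.

    `default_encoding` and `timeout` are illustrative defaults.
    """
    with urlopen(url, timeout=timeout) as response:
        body = response.read()  # bytes, as in the step above
        # Prefer the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or default_encoding
    # Replace undecodable bytes instead of raising an error.
    return body.decode(charset, errors="replace")


if __name__ == "__main__":
    # A data: URL keeps this demonstration offline; pass the article's
    # https:// URL instead to download the real file.
    print(fetch_text("data:text/plain;charset=utf-8,hello%20world"))
```

The with block closes the connection as soon as the body has been read, which the one-line version in the step above leaves to the garbage collector.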

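Step #3: Next, we parse the data with BeautifulSoup so that any HTML/XML tags present in raw can be stripped out. Naming the parser explicitly ("html.parser" is Python's built-in one) avoids a warning about a guessed parser:

raw1 = BeautifulSoup(raw, "html.parser")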

Step #4: Now we extract the plain text from the parsed data in raw1.

raw2 = raw1.get_text()
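
To make concrete what stripping the tags means, here is a simplified stand-in built only on the standard library's html.parser module. It is not how BeautifulSoup works internally and it handles far fewer edge cases; the names TextExtractor and strip_tags are hypothetical and used only for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of a document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are skipped.
        self.parts.append(data)


def strip_tags(markup):
    """Return `markup` with its HTML/XML tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.parts)


if __name__ == "__main__":
    print(strip_tags("<p>Hello <b>world</b></p>"))
```

For a file that is already plain text, such as the one used in this article, the text passes through essentially unchanged.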


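Step #5: Next, we break the text into words (tokens). Note that word_tokenize relies on NLTK's Punkt tokenizer data, which has to be downloaded once with nltk.download() (the resource is named "punkt", or "punkt_tab" in recent NLTK releases).

token = nltk.word_tokenize(raw2)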


Tokenizing serves as preprocessing for the next step, where we build the final text.
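
To give a rough idea of what tokenizing produces, the sketch below splits text into word and punctuation tokens with a regular expression. This is a deliberately simplified stand-in, not NLTK's algorithm: nltk.word_tokenize handles contractions, abbreviations, and sentence boundaries that this pattern ignores. The names TOKEN_PATTERN and simple_tokenize are hypothetical.

```python
import re

# One or more word characters, or a single character that is neither
# a word character nor whitespace (that is, punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split `text` into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(simple_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

Whitespace never ends up in a token, which is why joining the tokens back together in the next step yields uniformly spaced text.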

Step #6: Finally, we join the tokens with single spaces to get our final text.

text2 = " ".join(token)
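
The effect of joining is easiest to see on a small token list. In this sketch the list tokens is a hypothetical example, not output from the article's file:

```python
# Hypothetical tokens, standing in for the result of the tokenizing step.
tokens = ["Hello", ",", "world", "!"]

# Joining on a single space separates every token, punctuation included.
joined = " ".join(tokens)
print(joined)  # Hello , world !

# Splitting on whitespace recovers the same tokens.
print(joined.split() == tokens)  # True
```

So the final text is the original content with all runs of whitespace collapsed to single spaces and punctuation set apart as separate tokens.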


Below is the complete code:

# import the libraries
import nltk
from bs4 import BeautifulSoup
from urllib.request import urlopen

# extract the entire contents of the text file (returned as bytes)
raw = urlopen("https://www.w3.org/TR/PNG/iso_8859-1.txt").read()

# parse the data so that any HTML/XML tags can be removed
raw1 = BeautifulSoup(raw, "html.parser")

# get the text present in 'raw1'
raw2 = raw1.get_text()

# break the text into words
token = nltk.word_tokenize(raw2)

# join the tokens with single spaces to form the final text
text2 = " ".join(token)