Word Processing Using NLP | basics

In this article, we discuss how to fetch text from an online text file and extract the data we need from it. We will use the text file available at https://www.w3.org/TR/PNG/iso_8859-1.txt.

The following must be available in your Python environment:

  • NLTK library
  • urllib (part of the Python standard library)
  • BeautifulSoup library

Step # 1: import the required libraries

import nltk

from bs4 import BeautifulSoup

from urllib.request import urlopen

Some basic information about the above libraries:

  • NLTK library: nltk is a collection of libraries and programs for natural language processing, mainly of English text, written in the Python programming language.
  • urllib library: a URL handling library for Python ...

    Step # 2: extract the entire contents of the text file.

    raw = urlopen("https://www.w3.org/TR/PNG/iso_8859-1.txt").read()

    The raw data (as bytes) is now stored in the raw variable.
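Since urlopen(...).read() returns bytes rather than a string, it can help to decode the result explicitly before any further processing. The sketch below is only an illustration: the helper name fetch_text and its encoding and timeout parameters are assumptions and do not appear in the original article, and UTF-8 is assumed as the default encoding.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8", timeout=10.0):
    """Download url and return the response body decoded as a str.

    fetch_text is a hypothetical helper, not part of urllib.
    """
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes, like the article's raw variable
    # errors="replace" avoids an exception if a byte does not fit the
    # assumed encoding; pass another encoding if the file needs it
    return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    # A data: URL keeps this demo offline; an http(s) URL is used the same way.
    print(fetch_text("data:text/plain,hello%20world"))
```

For the file used in this article, the call would be fetch_text("https://www.w3.org/TR/PNG/iso_8859-1.txt"), possibly with a different encoding argument.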

    Step # 3: Then we process the data to remove any HTML/XML tags that may be present in our raw variable:

    raw1 = BeautifulSoup(raw, "html.parser")

    Step # 4: Now we get the text from the "raw1" variable.

    raw2 = raw1.get_text()
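To make the tag-stripping step more concrete, here is a rough, dependency-free approximation built on the standard library's html.parser module. It is not how BeautifulSoup works internally and it is not part of the original article; the class TextExtractor and the function strip_tags are hypothetical names used only for this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of a document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text found between tags
        self.parts.append(data)


def strip_tags(markup):
    """Return markup with its tags removed (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    print(strip_tags("<p>Hello <b>world</b>!</p>"))
```

BeautifulSoup remains the more robust choice for real pages, since it copes with malformed markup and offers many more ways to navigate the document.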

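    Step # 5: Next we break the text into words.

    # word_tokenize needs NLTK's tokenizer data, which may have to be
    # downloaded first with nltk.download()
    token = nltk.word_tokenize(raw2)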

    This is done as preprocessing for the next step, where we get the final text.

    Step # 6: Finally, we join the words back together to get our final text.

    text2 = " ".join(token)
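The tokenize-then-join idea can be tried without installing anything by using a regular expression as a rough stand-in for nltk.word_tokenize. This is only a sketch under that assumption: the names TOKEN_PATTERN, simple_tokenize and rebuild are hypothetical, and the splitting rules are much simpler than NLTK's (for example, NLTK splits contractions such as "It's" into two tokens, while this pattern keeps them together).

```python
import re

# A word (optionally with an apostrophe part, e.g. "It's"),
# or any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (not NLTK's algorithm)."""
    return TOKEN_PATTERN.findall(text)


def rebuild(tokens):
    """Join tokens with single spaces, as in the final step of the article."""
    return " ".join(tokens)


if __name__ == "__main__":
    tokens = simple_tokenize("Hello,   world!\nIt's fine.")
    print(tokens)
    print(rebuild(tokens))
```

Joining with a single space normalizes all the original whitespace (tabs, newlines, repeated spaces), which is the main effect of the last step.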

    Below is the complete code:

    Word Processing Using NLP | basics _files: Questions

    How do I list all files of a directory?

    5 answers

    How can I list all files of a directory in Python and add them to a list?

    3467

    Answer #1

    os.listdir() will get you everything that's in a directory - files and directories.

    If you want just files, you could either filter this down using os.path:

    from os import listdir
    from os.path import isfile, join
    onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
    

    or you could use os.walk() which will yield two lists for each directory it visits - splitting into files and dirs for you. If you only want the top directory you can break the first time it yields

    from os import walk
    
    f = []
    for (dirpath, dirnames, filenames) in walk(mypath):
        f.extend(filenames)
        break
    

    or, shorter:

    from os import walk
    
    filenames = next(walk(mypath), (None, None, []))[2]  # [] if no file
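The two approaches above can be compared side by side on a throwaway directory. This is a small sketch, not part of the original answer; the function names list_files and list_files_walk are made up for the illustration.

```python
import os
import tempfile


def list_files(directory):
    """Names of regular files directly inside directory (listdir + isfile)."""
    return sorted(
        name
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )


def list_files_walk(directory):
    """Same idea with os.walk, taking only the first (top-level) result."""
    _, _, filenames = next(os.walk(directory), (None, None, []))
    return sorted(filenames)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("a.txt", "b.py"):
            open(os.path.join(tmp, name), "w").close()
        os.mkdir(os.path.join(tmp, "subdir"))
        print(list_files(tmp))
        print(list_files_walk(tmp))
```

One practical difference: os.listdir raises an error for a missing directory, while the os.walk variant quietly returns an empty list.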
    

    Answer #2

    I prefer using the glob module, as it does pattern matching and expansion.

    import glob
    print(glob.glob("/home/adam/*"))
    

    It does pattern matching intuitively

    import glob
    # All files ending with .txt
    print(glob.glob("/home/adam/*.txt")) 
    # All files ending with .txt with depth of 2 folder
    print(glob.glob("/home/adam/*/*.txt")) 
    

    It will return a list with the queried files:

    ["/home/adam/file1.txt", "/home/adam/file2.txt", .... ]
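If the depth of the folders is not known in advance, glob also accepts the ** pattern together with recursive=True (Python 3.5 and later). The following is a minimal sketch; the helper name find_by_extension is an assumption made for this example.

```python
import glob
import os
import tempfile


def find_by_extension(root, extension=".txt"):
    """Return files under root, at any depth, whose names end with extension."""
    # glob.escape protects special characters that root itself may contain
    pattern = os.path.join(glob.escape(root), "**", "*" + extension)
    return sorted(glob.glob(pattern, recursive=True))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, "sub"))
        for rel in ("a.txt", os.path.join("sub", "b.txt"), os.path.join("sub", "c.py")):
            open(os.path.join(tmp, rel), "w").close()
        for path in find_by_extension(tmp, ".txt"):
            print(os.path.relpath(path, tmp))
```

With recursive=True, ** matches zero or more directories, so files directly inside root are included as well.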
    

    Answer #3

    os.listdir() - list in the current directory

    With listdir in os module you get the files and the folders in the current dir

     import os
     arr = os.listdir()
     print(arr)
     
     >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
    

    Looking in a directory

    # raw string, so that "\f" is not read as a form-feed character
    arr = os.listdir(r"c:\files")
    

    glob from glob

    with glob you can specify a type of file to list like this

    import glob
    
    txtfiles = []
    for file in glob.glob("*.txt"):
        txtfiles.append(file)
    

    glob in a list comprehension

    mylist = [f for f in glob.glob("*.txt")]
    

    get the full path of only files in the current directory

    import os
    from os import listdir
    from os.path import isfile, join
    
    cwd = os.getcwd()
    onlyfiles = [os.path.join(cwd, f) for f in os.listdir(cwd) if 
    os.path.isfile(os.path.join(cwd, f))]
    print(onlyfiles) 
    
    ["G:\getfilesname\getfilesname.py", "G:\getfilesname\example.txt"]
    

    Getting the full path name with os.path.abspath

    You get the full path in return

     import os
     files_path = [os.path.abspath(x) for x in os.listdir()]
     print(files_path)
     
     ["F:\documenti\applications.txt", "F:\documenti\collections.txt"]
    

    Walk: going through sub directories

    os.walk returns the root, the directories list and the files list, that is why I unpacked them in r, d, f in the for loop; it, then, looks for other files and directories in the subfolders of the root and so on until there are no subfolders.

    import os
    
    # Getting the current work directory (cwd)
    thisdir = os.getcwd()
    
    # r=root, d=directories, f = files
    for r, d, f in os.walk(thisdir):
        for file in f:
            if file.endswith(".docx"):
                print(os.path.join(r, file))
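The same walk can be wrapped in a generator so that the matching paths can be reused instead of only printed. This is a sketch under my own naming (iter_files is hypothetical); sorting dirnames in place is a documented way to control the order in which os.walk visits the subfolders.

```python
import os
import tempfile


def iter_files(root, suffix=""):
    """Yield full paths of files under root whose names end with suffix."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place change: fixes the order of the descent
        for name in sorted(filenames):
            if name.endswith(suffix):
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, "sub1"))
        for rel in ("x.docx", os.path.join("sub1", "y.docx"), "notes.txt"):
            open(os.path.join(tmp, rel), "w").close()
        for path in iter_files(tmp, ".docx"):
            print(os.path.relpath(path, tmp))
```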
    

    os.listdir(): get files in the current directory (Python 2)

    In Python 2, if you want the list of the files in the current directory, you have to give the argument as "." or os.getcwd() in the os.listdir method.

     import os
     arr = os.listdir(".")
     print(arr)
     
     >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
    

    To go up in the directory tree

    # Method 1
    x = os.listdir("..")
    
    # Method 2
    x= os.listdir("/")
    

    Get files: os.listdir() in a particular directory (Python 2 and 3)

     import os
     arr = os.listdir(r"F:\python")
     print(arr)
     
     >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
    

    Get files of a particular subdirectory with os.listdir()

    import os
    
    x = os.listdir("./content")
    

    os.walk(".") - current directory

     import os
     arr = next(os.walk("."))[2]
     print(arr)
     
     >>> ["5bs_Turismo1.pdf", "5bs_Turismo1.pptx", "esperienza.txt"]
    

    next(os.walk(".")) and os.path.join("dir", "file")

    import os
    arr = []
    # next() returns a single (root, dirs, files) tuple for the top directory
    r, d, f = next(os.walk(r"F:\_python"))
    for file in f:
        arr.append(os.path.join(r, file))

    for f in arr:
        print(f)
    
    >>> F:\_python\dict_class.py
    >>> F:\_python\programmi.txt
    

    next(os.walk("F:\\")) - get the full path - list comprehension

    # the one-element list lets the comprehension unpack the single tuple
    [os.path.join(r, file) for r, d, f in [next(os.walk(r"F:\_python"))] for file in f]
     
     >>> ["F:\_python\dict_class.py", "F:\_python\programmi.txt"]
    

    os.walk - get full path - all files in sub dirs

    x = [os.path.join(r,file) for r,d,f in os.walk(r"F:\_python") for file in f]
    print(x)
    
    >>> ["F:\_python\dict.py", "F:\_python\progr.txt", "F:\_python\readl.py"]
    

    os.listdir() - get only txt files

     arr_txt = [x for x in os.listdir() if x.endswith(".txt")]
     print(arr_txt)
     
     >>> ["work.txt", "3ebooks.txt"]
    

    Using glob to get the full path of the files

    If I should need the absolute path of the files:

    # os.path.abspath from the standard library is used here
    # in place of the third-party path module
    import os
    from glob import glob
    x = [os.path.abspath(f) for f in glob(r"F:\*.txt")]
    for f in x:
        print(f)

    >>> F:\acquistionline.txt
    >>> F:\acquisti_2018.txt
    >>> F:\bootstrap_jquery_ecc.txt
    

    Using os.path.isfile to avoid directories in the list

    import os.path
    listOfFiles = [f for f in os.listdir() if os.path.isfile(f)]
    print(listOfFiles)
    
    >>> ["a simple game.py", "data.txt", "decorator.py"]
    

    Using pathlib from Python 3.4

    import pathlib
    
    flist = []
    for p in pathlib.Path(".").iterdir():
        if p.is_file():
            print(p)
            flist.append(p)
    
     >>> error.PNG
     >>> exemaker.bat
     >>> guiprova.mp3
     >>> setup.py
     >>> speak_gui2.py
     >>> thumb.PNG
    

    With list comprehension:

    flist = [p for p in pathlib.Path(".").iterdir() if p.is_file()]
    

    Alternatively, use pathlib.Path() instead of pathlib.Path(".")

    Use glob method in pathlib.Path()

    import pathlib
    
    py = pathlib.Path().glob("*.py")
    for file in py:
        print(file)
    
    >>> stack_overflow_list.py
    >>> stack_overflow_list_tkinter.py
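pathlib can also search the subfolders with Path.rglob, which behaves like glob with "**/" placed in front of the pattern. A short sketch follows; the function name files_under is an assumption for this example.

```python
import tempfile
from pathlib import Path


def files_under(root, pattern="*"):
    """Return sorted Path objects for files matching pattern at any depth."""
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "pkg").mkdir()
        (base / "a.py").write_text("print('a')\n", encoding="utf-8")
        (base / "pkg" / "b.py").write_text("print('b')\n", encoding="utf-8")
        (base / "pkg" / "notes.txt").write_text("notes\n", encoding="utf-8")
        for path in files_under(base, "*.py"):
            print(path.relative_to(base))
```

The is_file() check matters because a pattern such as "*" also matches directories.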
    

    Get all and only files with os.walk

    import os
    x = [i[2] for i in os.walk(".")]
    y=[]
    for t in x:
        for f in t:
            y.append(f)
    print(y)
    
    >>> ["append_to_list.py", "data.txt", "data1.txt", "data2.txt", "data_180617", "os_walk.py", "READ2.py", "read_data.py", "somma_defaltdic.py", "substitute_words.py", "sum_data.py", "data.txt", "data1.txt", "data_180617"]
    

    Get only files with next and walk in a directory

     import os
     x = next(os.walk("F://python"))[2]
     print(x)
     
     >>> ["calculator.bat","calculator.py"]
    

    Get only directories with next and walk in a directory

     import os
     next(os.walk("F://python"))[1] # for the current dir use (".")
     
     >>> ["python3","others"]
    

    Get all the subdir names with walk

    for r,d,f in os.walk(r"F:\_python"):
        for dirs in d:
            print(dirs)
    
    >>> .vscode
    >>> pyexcel
    >>> pyschool.py
    >>> subtitles
    >>> _metaprogramming
    >>> .ipynb_checkpoints
    

    os.scandir() from Python 3.5 and greater

    import os
    x = [f.name for f in os.scandir() if f.is_file()]
    print(x)
    
    >>> ["calculator.bat","calculator.py"]
    
    # Another example with scandir (a little variation from docs.python.org)
    # This one is more efficient than os.listdir.
    # In this case, it shows the files only in the current directory
    # where the script is executed.
    
    import os
    with os.scandir() as i:
        for entry in i:
            if entry.is_file():
                print(entry.name)
    
    >>> ebookmaker.py
    >>> error.PNG
    >>> exemaker.bat
    >>> guiprova.mp3
    >>> setup.py
    >>> speakgui4.py
    >>> speak_gui2.py
    >>> speak_gui3.py
    >>> thumb.PNG
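Each entry returned by os.scandir also gives access to file information through entry.stat(), so sizes can be collected in the same pass. The sketch below is my own illustration (file_sizes is a hypothetical name), not code from the original answer.

```python
import os
import tempfile


def file_sizes(directory="."):
    """Map each file name in directory to its size in bytes."""
    with os.scandir(directory) as entries:
        return {
            entry.name: entry.stat().st_size
            for entry in entries
            if entry.is_file()
        }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "a.txt"), "w", encoding="utf-8") as fh:
            fh.write("hello")
        os.mkdir(os.path.join(tmp, "sub"))
        print(file_sizes(tmp))
```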
    

    Examples:

    Ex. 1: How many files are there in the subdirectories?

    In this example, we count the files contained in a directory and all of its subdirectories.

    import os

    def count(dir, counter=0):
        "returns number of files in dir and subdirs"
        for pack in os.walk(dir):
            for f in pack[2]:
                counter += 1
        return dir + " : " + str(counter) + " files"

    print(count(r"F:\python"))

    >>> F:\python : 12057 files
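The same count can be written more compactly with sum() over the lists that os.walk yields. This is just an alternative sketch; count_files is a name chosen for the example.

```python
import os
import tempfile


def count_files(root):
    """Return the number of files in root and all of its subdirectories."""
    return sum(len(filenames) for _, _, filenames in os.walk(root))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, "sub"))
        for rel in ("a.txt", "b.txt", os.path.join("sub", "c.txt")):
            open(os.path.join(tmp, rel), "w").close()
        print(tmp, ":", count_files(tmp), "files")
```

Returning the number instead of a formatted string makes the function easier to reuse; the formatting is left to the caller.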
    

    Ex. 2: How to copy all files from one directory to another?

    A script to tidy up your computer: it finds all files of a given type (default: pptx) and copies them into a new folder.

    import os
    import shutil

    # raw string, so that "\f" is not read as a form-feed character
    destination = r"F:\file_copied"
    # os.makedirs(destination)

    def copyfile(dir, filetype="pptx", counter=0):
        "Searches for pptx (or other - pptx is the default) files and copies them"
        for pack in os.walk(dir):
            for f in pack[2]:
                if f.endswith(filetype):
                    fullpath = os.path.join(pack[0], f)
                    print(fullpath)
                    shutil.copy(fullpath, destination)
                    counter += 1
        if counter > 0:
            print("-" * 30)
            print("\t==> Found in: `" + dir + "` : " + str(counter) + " files\n")

    for dir in os.listdir():
        # searches for folders whose names start with `_`
        if dir[0] == "_":
            # copyfile(dir, filetype="pdf")
            copyfile(dir, filetype="txt")


    >>> _compiti18\Compito Contabilità 1\conti.txt
    >>> _compiti18\Compito Contabilità 1\modula4.txt
    >>> _compiti18\Compito Contabilità 1\moduloa4.txt
    >>> ------------------------
    >>> ==> Found in: `_compiti18` : 3 files
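A variant of the copy script can be sketched with pathlib and shutil.copy2, creating the destination folder if it is missing. The function name copy_by_suffix and its parameters are assumptions made for this illustration; note that files with the same name in different subfolders would overwrite each other in the destination.

```python
import shutil
import tempfile
from pathlib import Path


def copy_by_suffix(source, destination, suffix=".pptx"):
    """Copy every file under source whose name ends with suffix.

    Returns the list of paths created in destination.
    """
    dest = Path(destination)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() collects the matches before anything is copied
    for path in sorted(Path(source).rglob("*" + suffix)):
        if path.is_file():
            target = dest / path.name
            shutil.copy2(path, target)  # copy2 also keeps timestamps
            copied.append(target)
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "src"
        (src / "deep").mkdir(parents=True)
        (src / "deep" / "slides.pptx").write_text("demo", encoding="utf-8")
        (src / "readme.txt").write_text("demo", encoding="utf-8")
        for target in copy_by_suffix(src, Path(tmp) / "dest"):
            print(target.name)
```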
    

    Ex. 3: How to get all the files in a txt file

    In case you want to create a txt file with all the file names:

    import os
    mylist = ""
    with open("filelist.txt", "w", encoding="utf-8") as file:
        for eachfile in os.listdir():
            mylist += eachfile + "\n"
        file.write(mylist)
    

    Example: txt with all the files of a hard drive

    """
    We are going to save a txt file with all the files in your directory.
    We will use the function walk()
    """

    import os

    # see all the methods of os
    # print(*dir(os), sep=", ")
    listafile = []
    percorso = []
    with open("lista_file.txt", "w", encoding="utf-8") as testo:
        for root, dirs, files in os.walk("D:\\"):
            for file in files:
                listafile.append(file)
                percorso.append(os.path.join(root, file))
                testo.write(file + "\n")
    listafile.sort()
    print("N. of files", len(listafile))
    with open("lista_file_ordinata.txt", "w", encoding="utf-8") as testo_ordinato:
        for file in listafile:
            testo_ordinato.write(file + "\n")

    with open("percorso.txt", "w", encoding="utf-8") as file_percorso:
        for file in percorso:
            file_percorso.write(file + "\n")

    # open the three result files with their default application (Windows)
    os.system("lista_file.txt")
    os.system("lista_file_ordinata.txt")
    os.system("percorso.txt")
    

    All the files of C: in one text file

    This is a shorter version of the previous code. Change the starting folder if you need to search from another location. On my computer this code generated a text file of about 50 MB, with a little under 500,000 lines, each holding the complete path of a file.

    import os

    with open("file.txt", "w", encoding="utf-8") as filewrite:
        for r, d, f in os.walk("C:\\"):
            for file in f:
                # os.path.join adds the separator between folder and file name
                filewrite.write(f"{os.path.join(r, file)}\n")
    

    How to write a file with all paths in a folder of a type

    With this function you can create a txt file that will have the name of a type of file that you look for (ex. pngfile.txt) with all the full path of all the files of that type. It can be useful sometimes, I think.

    import os

    def searchfiles(extension=".ttf", folder="H:\\"):
        "Create a txt file with all the file of a type"
        with open(extension[1:] + "file.txt", "w", encoding="utf-8") as filewrite:
            for r, d, f in os.walk(folder):
                for file in f:
                    if file.endswith(extension):
                        filewrite.write(f"{os.path.join(r, file)}\n")

    # looking for png files in the hard disk H:
    searchfiles(".png", "H:\\")

    >>> H:\4bs_18\Dolphins5.png
    >>> H:\4bs_18\Dolphins6.png
    >>> H:\4bs_18\Dolphins7.png
    >>> H:\5_18\marketing html\assets\images\logo2.png
    >>> H:\7z001.png
    >>> H:\7z002.png
    

    (New) Find all files and open them with tkinter GUI

    In 2019 I added a small app that searches for all files in a directory and lets you open one by double-clicking its name in the list.

    import tkinter as tk
    import os

    def searchfiles(extension=".txt", folder="H:\\"):
        "insert all files in the listbox"
        for r, d, f in os.walk(folder):
            for file in f:
                if file.endswith(extension):
                    lb.insert(0, os.path.join(r, file))

    def open_file():
        # os.startfile is available on Windows only
        os.startfile(lb.get(lb.curselection()[0]))

    root = tk.Tk()
    root.geometry("400x400")
    bt = tk.Button(root, text="Search", command=lambda:searchfiles(".png", "H:\\"))
    bt.pack()
    lb = tk.Listbox(root)
    lb.pack(fill="both", expand=1)
    lb.bind("<Double-Button>", lambda x: open_file())
    root.mainloop()
    

    # Complete code for the NLP example

    # importing libraries
    import nltk
    from bs4 import BeautifulSoup
    from urllib.request import urlopen

    # extract the entire contents of the text file
    raw = urlopen("https://www.w3.org/TR/PNG/iso_8859-1.txt").read()

    # remove all HTML/XML tags
    raw1 = BeautifulSoup(raw, "html.parser")

    # get the text present in 'raw1'
    raw2 = raw1.get_text()

    # break the text into words
    token = nltk.word_tokenize(raw2)

    # join the words back into the final text
    text2 = " ".join(token)