Python | Speech recognition on large audio files





Libraries required

Pydub: sudo pip3 install pydub

Speech recognition: sudo pip3 install SpeechRecognition




Example:

Input:  peacock.wav 

Output:

exporting chunk0.wav
Processing chunk 0
exporting chunk1.wav
Processing chunk 1
exporting chunk2.wav
Processing chunk 2
exporting chunk3.wav
Processing chunk 3
exporting chunk4.wav
Processing chunk 4
exporting chunk5.wav
Processing chunk 5
exporting chunk6.wav
Processing chunk 6



Python Code:

# importing libraries
import speech_recognition as sr
  
import os
  
from pydub import AudioSegment
from pydub.silence import split_on_silence
  
# a function that splits the audio file into chunks
# and applies speech recognition
def silence_based_conversion(path = "alice-medium.wav"):
  
    # open the audio file stored in
    # the local system as a wav file.
    song = AudioSegment.from_wav(path)
  
    # open a file where we will concatenate  
    # and store the recognized text
    fh = open("recognized.txt", "w+")
          
    # split track where silence is 0.5 seconds 
    # or more and get chunks
    chunks = split_on_silence(song,
        # must be silent for at least 0.5 seconds
        # or 500 ms. adjust this value based on user
        # requirement. if the speaker stays silent for 
        # longer, increase this value. else, decrease it.
        min_silence_len = 500,
  
        # consider it silent if quieter than -16 dBFS
        # adjust this per requirement
        silence_thresh = -16
    )
  
    # create a directory to store the audio chunks.
    try:
        os.mkdir('audio_chunks')
    except(FileExistsError):
        pass
  
    # move into the directory to
    # store the audio files.
    os.chdir('audio_chunks')
  
    i = 0
    # process each chunk
    for chunk in chunks:
              
        # Create a 10 ms silent chunk
        chunk_silent = AudioSegment.silent(duration = 10)
  
        # add 10 ms of silence to the beginning and 
        # end of the audio chunk. This is done so that
        # it doesn't seem abruptly sliced.
        audio_chunk = chunk_silent + chunk + chunk_silent
  
        # export audio chunk and save it in 
        # the current directory.
        print("exporting chunk{0}.wav".format(i))
        # specify the bitrate to be 192 k
        audio_chunk.export("./chunk{0}.wav".format(i), bitrate ='192k', format ="wav")
  
        # the name of the newly created chunk
        filename = 'chunk'+str(i)+'.wav'
  
        print("Processing chunk "+str(i))
  
        # get the name of the newly created chunk
        # in the AUDIO_FILE variable for later use.
        file = filename
  
        # create a speech recognition object
        r = sr.Recognizer()
  
        # recognize the chunk
        with sr.AudioFile(file) as source:
            # remove this if it is not working
            # correctly.
            r.adjust_for_ambient_noise(source)
            audio_listened = r.listen(source)
  
        try:
            # try converting it to text
            rec = r.recognize_google(audio_listened)
            # write the output to the file.
            fh.write(rec+". ")
  
        # catch any errors.
        except sr.UnknownValueError:
            print("Could not understand audio")
  
        except sr.RequestError as e:
            print("Could not request results. check your internet connection")
  
        i += 1
  
    os.chdir('..')

    # close the transcript file so all recognized text is written out
    fh.close()
  
  
if __name__ == '__main__':
          
    print('Enter the audio file path')
  
    path = input()
  
    silence_based_conversion(path)

Output:

recognized.txt:

The peacock is the national bird of India. They have colourful feathers, two legs and 
a small beak. They are famous for their dance. When a peacock dances it spreads its 
feathers like a fan. It has a long shiny dark blue neck. Peacocks are mostly found in 
the fields they are very beautiful birds. The females are known as 'Peahen'. Their 
feathers are used for making jackets, purses etc. We can see them in a zoo. 



How to convert large WAV file to text in Python?

Question from StackOverFlow

I already tried this code to convert my large wav file to text

import speech_recognition as sr
r = sr.Recognizer()

hellow=sr.AudioFile('hello_world.wav')
with hellow as source:
    audio = r.record(source)
try:
    s = r.recognize_google(audio)
    print("Text: "+s)
except Exception as e:
    print("Exception: "+str(e))

But it is not converting it accurately; I think the reason is the 'US' accent. Please tell me how I can convert the whole large wav file accurately.

Answer:

Google's speech to text is very effective; try the link below:

https://cloud.google.com/speech-to-text/

You can choose the language (English US in your case) and also upload files.

As @bigdataolddriver commented, 100% accuracy is not possible yet, and would be worth millions.

Google Speech-to-Text has three types of API: synchronous, asynchronous, and streaming. The asynchronous API allows about 480 minutes of audio per conversion, while the other two allow only about 1 minute. The following sample code does the conversion.

# Import libraries
# Note: the enums and types modules below exist only in
# google-cloud-speech releases before 2.0
from pydub import AudioSegment
import io
import os
from google.cloud import speech
from google.cloud.speech import enums
from google.cloud.speech import types
import wave
from google.cloud import storage

# open() and wave.open() do not expand "~", so expand it here
filepath = os.path.expanduser("~/audio_wav/")     #Input audio file path
output_filepath = os.path.expanduser("~/Transcripts/") #Final transcript path
bucketname = "callsaudiofiles" #Name of the bucket created in the step before

Speech-to-Text supports WAV files with LINEAR16 or MULAW encoded audio.
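
One quick way to check whether a WAV file holds 16-bit PCM samples (which is what LINEAR16 refers to) is to read its sample width with the standard wave module. This is only a minimal sketch: is_linear16 and write_placeholder are hypothetical helper names that are not part of the original answer, and the wave module rejects WAV variants it cannot parse as PCM, so such files are reported as False here.

```python
import os
import tempfile
import wave


def is_linear16(path):
    """Return True if the file opens as PCM WAV with 16-bit samples."""
    try:
        with wave.open(path, "rb") as wave_file:
            return wave_file.getsampwidth() == 2   # 2 bytes = 16 bits
    except wave.Error:
        # Raised for WAV files the wave module cannot parse as PCM.
        return False


def write_placeholder(path, sample_width):
    """Write 0.1 s of placeholder mono audio with the given bytes per sample."""
    with wave.open(path, "wb") as wave_file:
        wave_file.setnchannels(1)
        wave_file.setsampwidth(sample_width)
        wave_file.setframerate(8000)
        wave_file.writeframes(b"\x00" * sample_width * 800)


# Create one 16-bit and one 8-bit file so the sketch runs without real audio.
folder = tempfile.mkdtemp()
pcm16_path = os.path.join(folder, "pcm16.wav")
pcm8_path = os.path.join(folder, "pcm8.wav")
write_placeholder(pcm16_path, 2)
write_placeholder(pcm8_path, 1)

print(is_linear16(pcm16_path), is_linear16(pcm8_path))   # True False
```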

Below is the code to get the frame rate and the number of channels.

def frame_rate_channel(audio_file_name):
    with wave.open(audio_file_name, "rb") as wave_file:
        frame_rate = wave_file.getframerate()
        channels = wave_file.getnchannels()
        return frame_rate,channels
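
Since the limits mentioned in this answer depend on the length of the audio, it may help to check the duration before choosing an API. The sketch below is an illustration only: wav_duration_seconds and pick_api are hypothetical helper names, and the 60-second cutoff is an assumption taken from the rough "~1 minute" figure quoted above, not from an official quota.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the length of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wave_file:
        return wave_file.getnframes() / wave_file.getframerate()


def pick_api(duration_seconds, sync_limit_seconds=60):
    """Suggest which kind of request fits audio of the given length.

    sync_limit_seconds is an assumed cutoff based on the "~1 minute"
    figure quoted in the answer.
    """
    if duration_seconds <= sync_limit_seconds:
        return "synchronous"
    return "asynchronous"


# Build a 2-second silent mono file so the sketch runs without real audio.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wave_file:
    wave_file.setnchannels(1)
    wave_file.setsampwidth(2)        # 2 bytes per sample = 16-bit PCM
    wave_file.setframerate(16000)
    wave_file.writeframes(b"\x00\x00" * 16000 * 2)

duration = wav_duration_seconds(demo_path)
print(duration, pick_api(duration))   # 2.0 synchronous
```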

The code below does the asynchronous conversion.

def google_transcribe(audio_file_name):

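    # Full path of the audio file to transcribe
    file_name = filepath + audio_file_name

    frame_rate, channels = frame_rate_channel(file_name)

    # stereo_to_mono(), upload_blob() and delete_blob() are helpers
    # that are not shown in this answer
    if channels > 1:
        stereo_to_mono(file_name)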

    bucket_name = bucketname
    source_file_name = filepath + audio_file_name
    destination_blob_name = audio_file_name

    upload_blob(bucket_name, source_file_name, destination_blob_name)

    gcs_uri = 'gs://' + bucketname + '/' + audio_file_name
    transcript = ''

    client = speech.SpeechClient()
    audio = types.RecognitionAudio(uri=gcs_uri)

    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=frame_rate,
        language_code='en-US')

    # Detects speech in the audio file
    operation = client.long_running_recognize(config, audio)
    response = operation.result(timeout=10000)

    for result in response.results:
        transcript += result.alternatives[0].transcript

    delete_blob(bucket_name, destination_blob_name)
    return transcript

This is how you write the transcript to a file:

def write_transcripts(transcript_filename, transcript):
    # the with block closes the file even if the write fails
    with open(output_filepath + transcript_filename, "w+") as f:
        f.write(transcript)

Kindly let me know if you need any further clarifications.


Speech recognition is the process of converting speech to text. It is commonly used in voice assistants like Alexa, Siri, etc. The SpeechRecognition library for Python allows us to convert audio to text for further processing. In this article, we will look at converting large or long audio files to text using the SpeechRecognition library in Python.




Processing large audio files

When the input file is a long audio file, speech recognition accuracy decreases. Moreover, the Google Speech Recognition API cannot recognize long audio files with good fidelity. Therefore, we need to process the audio file into smaller chunks and then pass those chunks to the API. This improves accuracy and allows large audio files to be recognized.




Splitting audio based on silence

One way to handle a long audio file is to break it down into chunks of constant size. For example, we can take an audio file 10 minutes long and split it into 60 pieces of 10 seconds each. We can then pass these snippets to the API and convert speech to text by concatenating the results of all these snippets. This method is imprecise. Dividing an audio file into constant-sized chunks can cut sentences in the middle, and we can lose some important words in the process. This is because a chunk may end before a word is fully spoken, and Google will not be able to recognize incomplete words.
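
The arithmetic behind constant-size chunking can be sketched in a few lines. This is a minimal illustration rather than code from the article: fixed_size_ranges is a hypothetical helper that only computes the millisecond boundaries of each piece.

```python
def fixed_size_ranges(total_ms, chunk_ms):
    """Return (start_ms, end_ms) pairs that cover total_ms in fixed steps."""
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


# A 10-minute recording cut into 10-second pieces.
ranges = fixed_size_ranges(total_ms=10 * 60 * 1000, chunk_ms=10 * 1000)
print(len(ranges))    # 60
print(ranges[:2])     # [(0, 10000), (10000, 20000)]
```

A pydub AudioSegment can be sliced by milliseconds, so each pair could be used as song[start:end]. Nothing in these boundaries looks at the audio itself, which is why a word can end up cut in half.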

Another way is to split the audio file on silence. People pause for a short time between sentences. If we can split the audio file into chunks based on these pauses, then we can process the file sentence by sentence and combine the results. This approach is more accurate than the previous one because we do not cut sentences in the middle, so each audio chunk contains an entire sentence without any interruptions. This way we don't need to split the file into chunks of constant length.

The disadvantage of this method is that it is difficult to choose the duration of silence to split on, because different people speak differently: some may pause for 1 second between sentences, while others may pause for as little as 0.5 seconds.
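
The trade-off described above can be seen in a simplified sketch of silence-based splitting. This is not pydub's implementation: split_on_quiet_runs is a hypothetical function that works on a plain list of loudness values (one per fixed-length window, in dB), and it assumes min_silence_windows is at least 1.

```python
def split_on_quiet_runs(levels_db, silence_thresh_db, min_silence_windows):
    """Return (start, end) index ranges of the non-silent stretches.

    A run of at least min_silence_windows values below silence_thresh_db
    counts as a pause. Shorter quiet runs stay inside the current chunk.
    """
    chunks = []
    chunk_start = None   # index where the current chunk began
    quiet_run = 0        # length of the quiet run we are currently in

    for index, level in enumerate(levels_db):
        if level < silence_thresh_db:
            quiet_run += 1
            # The pause is now long enough: close the open chunk.
            if quiet_run == min_silence_windows and chunk_start is not None:
                chunks.append((chunk_start, index - quiet_run + 1))
                chunk_start = None
        else:
            quiet_run = 0
            if chunk_start is None:
                chunk_start = index

    if chunk_start is not None:
        # Close the last chunk, leaving out a short trailing quiet run.
        chunks.append((chunk_start, len(levels_db) - quiet_run))
    return chunks


levels = [-10, -12, -50, -11, -55, -60, -58, -9, -10]
print(split_on_quiet_runs(levels, -40, 3))   # [(0, 4), (7, 9)]
print(split_on_quiet_runs(levels, -40, 1))   # [(0, 2), (3, 4), (7, 9)]
```

With a minimum pause of three windows the brief dip at index 2 stays inside the first chunk; with a minimum of one window the same dip splits it, which is the kind of sensitivity that makes the pause length hard to choose.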




How to Convert Speech to Text in Python

Speech recognition is the ability of computer software to recognize words and sentences in spoken language and convert them into human-readable text. This tutorial will show you how to convert speech to text in Python using the SpeechRecognition library.

With this library, we don't have to build a machine learning model from scratch. It provides handy wrappers for various popular public speech recognition APIs (like Google Cloud Speech API, IBM Speech To Text, etc.).

Okay, let's get started by installing the library using pip:

pip3 install SpeechRecognition pydub

Okay, open up a new Python file and import it:

import speech_recognition as sr

The nice thing about this library is it supports several recognition engines:

  • CMU Sphinx (offline)
  • Google Speech Recognition
  • Google Cloud Speech API
  • Wit.ai
  • Microsoft Bing Voice Recognition
  • Houndify API
  • IBM Speech To Text
  • Snowboy Hotword Detection (offline)

We are going to use Google Speech Recognition here, as it's straightforward and doesn't require any API key.

    Reading from a File

    Make sure you have an audio file in the current directory that contains English speech (if you want to follow along with me, get the audio file here):

    filename = "16-122828-0002.wav"
    

    This file was taken from the LibriSpeech dataset, but you can use any WAV audio file you want; just change the name of the file. Let's initialize our speech recognizer:

    # initialize the recognizer
    r = sr.Recognizer()
    

    The below code is responsible for loading the audio file, and converting the speech into text using Google Speech Recognition:

    # open the file
    with sr.AudioFile(filename) as source:
        # listen for the data (load audio to memory)
        audio_data = r.record(source)
        # recognize (convert from speech to text)
        text = r.recognize_google(audio_data)
        print(text)
    

    This will take a few seconds to finish, as it uploads the file to Google and fetches the output. Here is my result:

    I believe you're just talking nonsense
    

    The above code works well for small or medium-sized audio files. In the next section, we will write code for large files.

    Reading Large Audio Files

    If you want to perform speech recognition of a long audio file, then the below function handles that quite well:

    # importing libraries 
    import speech_recognition as sr 
    import os 
    from pydub import AudioSegment
    from pydub.silence import split_on_silence
    
    # create a speech recognition object
    r = sr.Recognizer()
    
    # a function that splits the audio file into chunks
    # and applies speech recognition
    def get_large_audio_transcription(path):
        """
        Splitting the large audio file into chunks
        and apply speech recognition on each of these chunks
        """
        # open the audio file using pydub
        sound = AudioSegment.from_wav(path)  
        # split audio sound where silence is 500 milliseconds or more and get chunks
        chunks = split_on_silence(sound,
            # experiment with this value for your target audio file
            min_silence_len = 500,
            # adjust this per requirement
            silence_thresh = sound.dBFS-14,
            # keep 500 ms of silence at each end of a chunk, adjustable as well
            keep_silence=500,
        )
        )
        folder_name = "audio-chunks"
        # create a directory to store the audio chunks
        if not os.path.isdir(folder_name):
            os.mkdir(folder_name)
        whole_text = ""
        # process each chunk 
        for i, audio_chunk in enumerate(chunks, start=1):
            # export audio chunk and save it in
            # the 'folder_name' directory.
            chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
            audio_chunk.export(chunk_filename, format="wav")
            # recognize the chunk
            with sr.AudioFile(chunk_filename) as source:
                audio_listened = r.record(source)
                # try converting it to text
                try:
                    text = r.recognize_google(audio_listened)
                except sr.UnknownValueError as e:
                    print("Error:", str(e))
                else:
                    text = f"{text.capitalize()}. "
                    print(chunk_filename, ":", text)
                    whole_text += text
        # return the text for all chunks detected
        return whole_text
    

    Note: You need to install Pydub using pip for the above code to work.

    The above function uses the split_on_silence() function from the pydub.silence module to split audio data into chunks on silence. The min_silence_len parameter is the minimum length of silence, in milliseconds, to be used for a split.

    silence_thresh is the threshold below which audio is considered silence; I have set it to the average dBFS minus 14. The keep_silence argument is the amount of silence, in milliseconds, to leave at the beginning and the end of each detected chunk.

    These parameters won't be perfect for all sound files, so experiment with them for your own large audio files.
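
    To get a feel for what "average dBFS minus 14" means, the sketch below computes a loudness figure for a list of 16-bit samples. It is an illustration under stated assumptions, not pydub's own code: dbfs is a hypothetical helper, and it assumes dBFS is 20 * log10(rms / full scale) with a full scale of 32768 for 16-bit audio.

```python
import math


def dbfs(samples, full_scale=32768):
    """Average loudness of 16-bit samples in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")   # digital silence has no finite level
    return 20 * math.log10(rms / full_scale)


# A square wave at half of full scale sits about 6 dB below full scale.
samples = [16384, -16384] * 1000
average = dbfs(samples)
threshold = average - 14    # same offset as silence_thresh in the function above

print(round(average, 2), round(threshold, 2))   # -6.02 -20.02
```

    Tying the threshold to the file's own average level, rather than using a fixed number, means a quiet recording and a loud recording are treated in a comparable way.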

    After that, we iterate over all the chunks, convert the speech in each one to text, and join the results together. Here is an example run:

    path = "7601-291468-0006.wav"
    print("\nFull text:", get_large_audio_transcription(path))
    

    Note: You can get 7601-291468-0006.wav file here.

    Output:

    audio-chunks\chunk1.wav : His abode which you had fixed in a bowery or country seat. 
    audio-chunks\chunk2.wav : At a short distance from the city. 
    audio-chunks\chunk3.wav : Just at what is now called dutch street. 
    audio-chunks\chunk4.wav : Sooner bounded with proofs of his ingenuity. 
    audio-chunks\chunk5.wav : Patent smokejacks. 
    audio-chunks\chunk6.wav : It required a horse to work some. 
    audio-chunks\chunk7.wav : Dutch oven roasted meat without fire. 
    audio-chunks\chunk8.wav : Carts that went before the horses. 
    audio-chunks\chunk9.wav : Weather cox that turned against the wind and other wrongheaded contrivances. 
    audio-chunks\chunk10.wav : So just understand can found it all beholders. 
    
    Full text: His abode which you had fixed in a bowery or country seat. At a short distance from the city. Just at what is now called dutch street. Sooner bounded with proofs of his ingenuity. Patent smokejacks. It required a horse to work some. Dutch oven roasted meat without fire. Carts that went before the horses. Weather cox that turned against the wind and other wrongheaded contrivances. So just understand can found it all beholders.
    
    So, this function automatically creates a folder, saves the chunks of the original audio file there, and then runs speech recognition on each of them.
    

    Reading from the Microphone

    This requires PyAudio to be installed on your machine. Here is the installation process for each operating system:

    Windows

    You can just pip install it:

    pip3 install pyaudio
    

    Linux

    You need to first install the dependencies:

    sudo apt-get install python-pyaudio python3-pyaudio
    pip3 install pyaudio
    

    MacOS

    You need to first install portaudio, then you can just pip install it:

    brew install portaudio
    pip3 install pyaudio
    

    Now let's use our microphone to convert our speech:

    with sr.Microphone() as source:
        # read the audio data from the default microphone
        audio_data = r.record(source, duration=5)
        print("Recognizing...")
        # convert speech to text
        text = r.recognize_google(audio_data)
        print(text)
    

    This listens to your microphone for 5 seconds and then tries to convert that speech into text.

    It is pretty similar to the previous code, but here we use the Microphone() object to read the audio from the default microphone. We pass the duration parameter to the record() function to stop reading after 5 seconds, and then upload the audio data to Google to get the output text.

    You can also use the offset parameter of the record() function to start recording after offset seconds.

    Also, you can recognize different languages by passing the language parameter to the recognize_google() function. For instance, if you want to recognize Spanish speech, you would use:

    text = r.recognize_google(audio_data, language="es-ES")
    




    Python | Speech recognition on large audio files: StackOverflow Questions

    How do I list all files of a directory?

    How can I list all files of a directory in Python and add them to a list?

    Importing files from different folder

    I have the following folder structure.

    application
    ├── app
    │   └── folder
    │       └── file.py
    └── app2
        └── some_folder
            └── some_file.py
    

    I want to import some functions from file.py in some_file.py.

    I've tried

    from application.app.folder.file import func_name
    

    and some other various attempts but so far I couldn't manage to import properly. How can I do this?

    If Python is interpreted, what are .pyc files?

    I've been given to understand that Python is an interpreted language...
    However, when I look at my Python source code I see .pyc files, which Windows identifies as "Compiled Python Files".

    Where do these come in?

    Find all files in a directory with extension .txt in Python

    How can I find all the files in a directory having the extension .txt in python?

    How to import other Python files?

    How do I import other files in Python?

    1. How exactly can I import a specific python file like import file.py?
    2. How can I import a folder instead of a specific file?
    3. I want to load a Python file dynamically at runtime, based on user input.
    4. I want to know how to load just one specific part from the file.

    For example, in main.py I have:

    from extra import * 
    

    Although this gives me all the definitions in extra.py, when maybe all I want is a single definition:

    def gap():
        print
        print
    

    What do I add to the import statement to just get gap from extra.py?

    How to use glob() to find files recursively?

    This is what I have:

    glob(os.path.join("src","*.c"))
    

    but I want to search the subfolders of src. Something like this would work:

    glob(os.path.join("src","*.c"))
    glob(os.path.join("src","*","*.c"))
    glob(os.path.join("src","*","*","*.c"))
    glob(os.path.join("src","*","*","*","*.c"))
    

    But this is obviously limited and clunky.

    How can I open multiple files using "with open" in Python?

    I want to change a couple of files at one time, iff I can write to all of them. I'm wondering if I somehow can combine the multiple open calls with the with statement:

    try:
      with open("a", "w") as a and open("b", "w") as b:
        do_something()
    except IOError as e:
      print "Operation failed: %s" % e.strerror
    

    If that's not possible, what would an elegant solution to this problem look like?

    How can I iterate over files in a given directory?

    I need to iterate through all .asm files inside a given directory and do some actions on them.

    How can this be done in a efficient way?

    How to serve static files in Flask

    So this is embarrassing. I've got an application that I threw together in Flask and for now it is just serving up a single static HTML page with some links to CSS and JS. And I can't find where in the documentation Flask describes returning static files. Yes, I could use render_template but I know the data is not templatized. I'd have thought send_file or url_for was the right thing, but I could not get those to work. In the meantime, I am opening the files, reading content, and rigging up a Response with appropriate mimetype:

    import os.path
    
    from flask import Flask, Response
    
    
    app = Flask(__name__)
    app.config.from_object(__name__)
    
    
    def root_dir():  # pragma: no cover
        return os.path.abspath(os.path.dirname(__file__))
    
    
    def get_file(filename):  # pragma: no cover
        try:
            src = os.path.join(root_dir(), filename)
            # Figure out how flask returns static files
            # Tried:
            # - render_template
            # - send_file
            # This should not be so non-obvious
            return open(src).read()
        except IOError as exc:
            return str(exc)
    
    
    @app.route("/", methods=["GET"])
    def metrics():  # pragma: no cover
        content = get_file("jenkins_analytics.html")
        return Response(content, mimetype="text/html")
    
    
    @app.route("/", defaults={"path": ""})
    @app.route("/<path:path>")
    def get_resource(path):  # pragma: no cover
        mimetypes = {
            ".css": "text/css",
            ".html": "text/html",
            ".js": "application/javascript",
        }
        complete_path = os.path.join(root_dir(), path)
        ext = os.path.splitext(path)[1]
        mimetype = mimetypes.get(ext, "text/html")
        content = get_file(complete_path)
        return Response(content, mimetype=mimetype)
    
    
    if __name__ == "__main__":  # pragma: no cover
        app.run(port=80)
    

    Someone want to give a code sample or url for this? I know this is going to be dead simple.

    Unzipping files in Python

    I read through the zipfile documentation, but couldn't understand how to unzip a file, only how to zip a file. How do I unzip all the contents of a zip file into the same directory?

    Answer #1

    Recommendation for beginners:

    This is my personal recommendation for beginners: start by learning virtualenv and pip, tools which work with both Python 2 and 3 and in a variety of situations, and pick up other tools once you start needing them.

    PyPI packages not in the standard library:

    • virtualenv is a very popular tool that creates isolated Python environments for Python libraries. If you're not familiar with this tool, I highly recommend learning it, as it is a very useful tool, and I'll be making comparisons to it for the rest of this answer.

    It works by installing a bunch of files in a directory (eg: env/), and then modifying the PATH environment variable to prefix it with a custom bin directory (eg: env/bin/). An exact copy of the python or python3 binary is placed in this directory, but Python is programmed to look for libraries relative to its path first, in the environment directory. It's not part of Python's standard library, but is officially blessed by the PyPA (Python Packaging Authority). Once activated, you can install packages in the virtual environment using pip.

    • pyenv is used to isolate Python versions. For example, you may want to test your code against Python 2.7, 3.6, 3.7 and 3.8, so you'll need a way to switch between them. Once activated, it prefixes the PATH environment variable with ~/.pyenv/shims, where there are special files matching the Python commands (python, pip). These are not copies of the Python-shipped commands; they are special scripts that decide on the fly which version of Python to run based on the PYENV_VERSION environment variable, or the .python-version file, or the ~/.pyenv/version file. pyenv also makes the process of downloading and installing multiple Python versions easier, using the command pyenv install.

    • pyenv-virtualenv is a plugin for pyenv by the same author as pyenv, to allow you to use pyenv and virtualenv at the same time conveniently. However, if you're using Python 3.3 or later, pyenv-virtualenv will try to run python -m venv if it is available, instead of virtualenv. You can use virtualenv and pyenv together without pyenv-virtualenv, if you don't want the convenience features.

    • virtualenvwrapper is a set of extensions to virtualenv (see docs). It gives you commands like mkvirtualenv, lssitepackages, and especially workon for switching between different virtualenv directories. This tool is especially useful if you want multiple virtualenv directories.

    • pyenv-virtualenvwrapper is a plugin for pyenv by the same author as pyenv, to conveniently integrate virtualenvwrapper into pyenv.

    • pipenv aims to combine Pipfile, pip and virtualenv into one command on the command-line. The virtualenv directory typically gets placed in ~/.local/share/virtualenvs/XXX, with XXX being a hash of the path of the project directory. This is different from virtualenv, where the directory is typically in the current working directory. pipenv is meant to be used when developing Python applications (as opposed to libraries). There are alternatives to pipenv, such as poetry, which I won't list here since this question is only about the packages that are similarly named.

    Standard library:

    • pyvenv (not to be confused with pyenv in the previous section) is a script shipped with Python 3 but deprecated in Python 3.6 as it had problems (not to mention the confusing name). In Python 3.6+, the exact equivalent is python3 -m venv.

    • venv is a package shipped with Python 3, which you can run using python3 -m venv (although for some reason some distros separate it out into a separate distro package, such as python3-venv on Ubuntu/Debian). It serves the same purpose as virtualenv, but only has a subset of its features (see a comparison here). virtualenv continues to be more popular than venv, especially since the former supports both Python 2 and 3.

    Answer #2

    os.listdir() - list in the current directory

    With listdir in os module you get the files and the folders in the current dir

     import os
     arr = os.listdir()
     print(arr)
     
     >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
    

    Looking in a directory

    # raw string, so that "\f" is not read as a form feed character
    arr = os.listdir(r"c:\files")
    

    glob from glob

    with glob you can specify a type of file to list like this

    import glob
    
    txtfiles = []
    for file in glob.glob("*.txt"):
        txtfiles.append(file)
    

    glob in a list comprehension

    mylist = [f for f in glob.glob("*.txt")]
    

    get the full path of only files in the current directory

    import os
    from os import listdir
    from os.path import isfile, join
    
    cwd = os.getcwd()
    onlyfiles = [os.path.join(cwd, f) for f in os.listdir(cwd) if 
    os.path.isfile(os.path.join(cwd, f))]
    print(onlyfiles) 
    
    ["G:\getfilesname\getfilesname.py", "G:\getfilesname\example.txt"]
    

    Getting the full path name with os.path.abspath

    You get the full path in return

     import os
     files_path = [os.path.abspath(x) for x in os.listdir()]
     print(files_path)
     
     ["F:\documenti\applications.txt", "F:\documenti\collections.txt"]
    

    Walk: going through sub directories

    os.walk returns the root, the directories list and the files list, that is why I unpacked them in r, d, f in the for loop; it, then, looks for other files and directories in the subfolders of the root and so on until there are no subfolders.

    import os
    
    # Getting the current work directory (cwd)
    thisdir = os.getcwd()
    
    # r=root, d=directories, f = files
    for r, d, f in os.walk(thisdir):
        for file in f:
            if file.endswith(".docx"):
                print(os.path.join(r, file))
    

    os.listdir(): get files in the current directory (Python 2)

    In Python 2, if you want the list of the files in the current directory, you have to give the argument as "." or os.getcwd() in the os.listdir method.

     import os
     arr = os.listdir(".")
     print(arr)
     
     >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
    

    To go up in the directory tree

    # Method 1
    x = os.listdir("..")
    
    # Method 2
    x= os.listdir("/")
    

    Get files: os.listdir() in a particular directory (Python 2 and 3)

     import os
     arr = os.listdir(r"F:\python")
     print(arr)
     
     >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
    

    Get files of a particular subdirectory with os.listdir()

    import os
    
    x = os.listdir("./content")
    

    os.walk(".") - current directory

     import os
     arr = next(os.walk("."))[2]
     print(arr)
     
     >>> ["5bs_Turismo1.pdf", "5bs_Turismo1.pptx", "esperienza.txt"]
    

    next(os.walk(".")) and os.path.join("dir", "file")

     import os
     arr = []
     # next() returns a single (root, dirs, files) tuple
     r, d, f = next(os.walk(r"F:\_python"))
     for file in f:
         arr.append(os.path.join(r, file))
    
     for f in arr:
         print(f)
    
    >>> F:\_python\dict_class.py
    >>> F:\_python\programmi.txt
    

    next(os.walk("F:\\")) - get the full path - list comprehension

     [os.path.join(r,file) for r,d,f in [next(os.walk(r"F:\_python"))] for file in f]
     
     >>> ["F:\_python\dict_class.py", "F:\_python\programmi.txt"]
    

    os.walk - get full path - all files in sub dirs

    x = [os.path.join(r,file) for r,d,f in os.walk(r"F:\_python") for file in f]
    print(x)
    
    >>> ["F:\_python\dict.py", "F:\_python\progr.txt", "F:\_python\readl.py"]
    

    os.listdir() - get only txt files

     arr_txt = [x for x in os.listdir() if x.endswith(".txt")]
     print(arr_txt)
     
     >>> ["work.txt", "3ebooks.txt"]
    

    Using glob to get the full path of the files

    If I should need the absolute path of the files:

    # os.path.abspath is used here in place of the third-party path module
    import os
    from glob import glob
    x = [os.path.abspath(f) for f in glob(r"F:\*.txt")]
    for f in x:
        print(f)
    
    >>> F:\acquistionline.txt
    >>> F:\acquisti_2018.txt
    >>> F:\bootstrap_jquery_ecc.txt
    

    Using os.path.isfile to avoid directories in the list

    import os.path
    listOfFiles = [f for f in os.listdir() if os.path.isfile(f)]
    print(listOfFiles)
    
    >>> ["a simple game.py", "data.txt", "decorator.py"]
    

    Using pathlib from Python 3.4

    import pathlib
    
    flist = []
    for p in pathlib.Path(".").iterdir():
        if p.is_file():
            print(p)
            flist.append(p)
    
     >>> error.PNG
     >>> exemaker.bat
     >>> guiprova.mp3
     >>> setup.py
     >>> speak_gui2.py
     >>> thumb.PNG
    

    With list comprehension:

    flist = [p for p in pathlib.Path(".").iterdir() if p.is_file()]
    

    Alternatively, use pathlib.Path() instead of pathlib.Path(".")

    Use glob method in pathlib.Path()

    import pathlib
    
    py = pathlib.Path().glob("*.py")
    for file in py:
        print(file)
    
    >>> stack_overflow_list.py
    >>> stack_overflow_list_tkinter.py
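
    Path.glob("*.py") only looks at one directory level. As a minimal sketch, assuming you also want files in subdirectories, Path.rglob can be used instead. The helper name list_files and the demo file names are made up for illustration:

```python
import tempfile
from pathlib import Path

def list_files(root, pattern="*.py"):
    """Return sorted relative paths of files under root (recursively) matching pattern."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob(pattern) if p.is_file()
    )

# Demo on a throwaway directory tree (file names are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    for name in ("a.py", "notes.txt", "sub/b.py"):
        (base / name).write_text("")
    found = list_files(base)
    print(found)
```

    The paths are returned relative to the starting folder so that the result does not depend on where the tree lives on disk.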
    

    Get all and only files with os.walk

    import os
    x = [i[2] for i in os.walk(".")]
    y=[]
    for t in x:
        for f in t:
            y.append(f)
    print(y)
    
    >>> ["append_to_list.py", "data.txt", "data1.txt", "data2.txt", "data_180617", "os_walk.py", "READ2.py", "read_data.py", "somma_defaltdic.py", "substitute_words.py", "sum_data.py", "data.txt", "data1.txt", "data_180617"]
    

    Get only files with next and walk in a directory

     import os
     x = next(os.walk("F://python"))[2]
     print(x)
     
     >>> ["calculator.bat","calculator.py"]
    

    Get only directories with next and walk in a directory

     import os
     next(os.walk("F://python"))[1] # for the current dir use (".")
     
     >>> ["python3","others"]
    

    Get all the subdir names with walk

    for r,d,f in os.walk(r"F:\_python"):
        for dirs in d:
            print(dirs)
    
    >>> .vscode
    >>> pyexcel
    >>> pyschool.py
    >>> subtitles
    >>> _metaprogramming
    >>> .ipynb_checkpoints
    

    os.scandir() from Python 3.5 and greater

    import os
    x = [f.name for f in os.scandir() if f.is_file()]
    print(x)
    
    >>> ["calculator.bat","calculator.py"]
    
    # Another example with scandir (a little variation from docs.python.org)
    # This one is more efficient than os.listdir.
    # In this case, it shows the files only in the current directory
    # where the script is executed.
    
    import os
    with os.scandir() as i:
        for entry in i:
            if entry.is_file():
                print(entry.name)
    
    >>> ebookmaker.py
    >>> error.PNG
    >>> exemaker.bat
    >>> guiprova.mp3
    >>> setup.py
    >>> speakgui4.py
    >>> speak_gui2.py
    >>> speak_gui3.py
    >>> thumb.PNG
    

    Examples:

    Ex. 1: How many files are there in the subdirectories?

    In this example, we count the files contained in a directory and all of its subdirectories.

    import os
    
    def count(dir, counter=0):
        "returns number of files in dir and subdirs"
        for pack in os.walk(dir):
            for f in pack[2]:
                counter += 1
        return dir + " : " + str(counter) + " files"
    
    print(count(r"F:\python"))
    
    >>> F:\python : 12057 files
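
    The same count can be written more compactly. This is only a sketch: the function name count_files and the temporary demo tree are assumptions, not part of the original answer.

```python
import os
import tempfile

def count_files(directory):
    """Count the files in directory and all of its subdirectories."""
    return sum(len(files) for _root, _dirs, files in os.walk(directory))

# Build a small throwaway tree: two files at the top, one in a subfolder.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.txt", "b.txt", os.path.join("sub", "c.txt")):
        open(os.path.join(tmp, rel), "w").close()
    total = count_files(tmp)
    print(total)
```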
    

    Ex.2: How to copy all files from a directory to another?

    A script to tidy up your computer: it finds all files of a given type (default: pptx) and copies them into a new folder.

    import os
    import shutil
    
    destination = r"F:\file_copied"
    # os.makedirs(destination)
    
    def copyfile(dir, filetype="pptx", counter=0):
        "Searches for pptx (or other - pptx is the default) files and copies them"
        for pack in os.walk(dir):
            for f in pack[2]:
                if f.endswith(filetype):
                    fullpath = os.path.join(pack[0], f)
                    print(fullpath)
                    shutil.copy(fullpath, destination)
                    counter += 1
        if counter > 0:
            print("-" * 30)
            print("\t==> Found in: `" + dir + "` : " + str(counter) + " files\n")
    
    for dir in os.listdir():
        # searches for folders whose names start with `_`
        if dir[0] == "_":
            # copyfile(dir, filetype="pdf")
            copyfile(dir, filetype="txt")
    
    
    >>> _compiti18\Compito Contabilità 1\conti.txt
    >>> _compiti18\Compito Contabilità 1\modula4.txt
    >>> _compiti18\Compito Contabilità 1\moduloa4.txt
    >>> ------------------------
    >>> ==> Found in: `_compiti18` : 3 files
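
    A variation of the same idea with pathlib, shown here as a sketch rather than a drop-in replacement. The function name copy_by_extension and the folder names are hypothetical, and files with the same name in different subfolders would overwrite each other in the destination:

```python
import shutil
import tempfile
from pathlib import Path

def copy_by_extension(source, destination, extension=".pptx"):
    """Copy every file under source whose name ends with extension into destination."""
    source, destination = Path(source), Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in source.rglob("*" + extension):
        if path.is_file():
            shutil.copy(path, destination / path.name)
            copied.append(path.name)
    return sorted(copied)

# Demo: the destination sits next to the source, not inside it.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "_src", Path(tmp) / "copied"
    (src / "deep").mkdir(parents=True)
    (src / "one.txt").write_text("1")
    (src / "deep" / "two.txt").write_text("2")
    (src / "skip.pdf").write_text("3")
    copied = copy_by_extension(src, dst, ".txt")
    copied_names = sorted(p.name for p in dst.iterdir())
    print(copied)
```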
    

    Ex. 3: How to get all the files in a txt file

    In case you want to create a txt file with all the file names:

    import os
    mylist = ""
    with open("filelist.txt", "w", encoding="utf-8") as file:
        for eachfile in os.listdir():
            mylist += eachfile + "\n"
        file.write(mylist)
    

    Example: txt with all the files of a hard drive

    """
    We are going to save a txt file with all the files in your directory.
    We will use the function walk()
    """
    
    import os
    
    # see all the methods of os
    # print(*dir(os), sep=", ")
    listafile = []
    percorso = []
    with open("lista_file.txt", "w", encoding="utf-8") as testo:
        for root, dirs, files in os.walk("D:\\"):
            for file in files:
                listafile.append(file)
                percorso.append(os.path.join(root, file))
                testo.write(file + "\n")
    listafile.sort()
    print("N. of files", len(listafile))
    with open("lista_file_ordinata.txt", "w", encoding="utf-8") as testo_ordinato:
        for file in listafile:
            testo_ordinato.write(file + "\n")
    
    with open("percorso.txt", "w", encoding="utf-8") as file_percorso:
        for file in percorso:
            file_percorso.write(file + "\n")
    
    os.system("lista_file.txt")
    os.system("lista_file_ordinata.txt")
    os.system("percorso.txt")
    

    All the files of C: in one text file

    This is a shorter version of the previous code. Change the starting folder if you need to begin the search elsewhere. On my computer this code generated a text file of about 50 MB, with slightly fewer than 500,000 lines, each holding the complete path of one file.

    import os
    
    with open("file.txt", "w", encoding="utf-8") as filewrite:
        for r, d, f in os.walk("C:\\"):
            for file in f:
                filewrite.write(f"{os.path.join(r, file)}\n")
    

    How to write a file with the paths of all files of one type in a folder

    This function creates a txt file, named after the file type you are looking for (e.g. pngfile.txt), containing the full paths of all files of that type. It can be useful sometimes, I think.

    import os
    
    def searchfiles(extension=".ttf", folder="H:\\"):
        "Create a txt file with all the files of a type"
        with open(extension[1:] + "file.txt", "w", encoding="utf-8") as filewrite:
            for r, d, f in os.walk(folder):
                for file in f:
                    if file.endswith(extension):
                        filewrite.write(f"{os.path.join(r, file)}\n")
    
    # looking for png files in the hard disk H:
    searchfiles(".png", "H:\\")
    
    >>> H:\4bs_18\Dolphins5.png
    >>> H:\4bs_18\Dolphins6.png
    >>> H:\4bs_18\Dolphins7.png
    >>> H:\5_18\marketing html\assets\images\logo2.png
    >>> H:\7z001.png
    >>> H:\7z002.png
    

    (New) Find all files and open them with tkinter GUI

    In 2019 I wanted to add a little app that searches for all files in a dir and lets you open them by double-clicking the file name in the list.

    import tkinter as tk
    import os
    
    def searchfiles(extension=".txt", folder="H:\\"):
        "insert all files in the listbox"
        for r, d, f in os.walk(folder):
            for file in f:
                if file.endswith(extension):
                    lb.insert(0, os.path.join(r, file))
    
    def open_file():
        # os.startfile is available on Windows only
        os.startfile(lb.get(lb.curselection()[0]))
    
    root = tk.Tk()
    root.geometry("400x400")
    bt = tk.Button(root, text="Search", command=lambda:searchfiles(".png", "H:\\"))
    bt.pack()
    lb = tk.Listbox(root)
    lb.pack(fill="both", expand=1)
    lb.bind("<Double-Button>", lambda x: open_file())
    root.mainloop()
    

    Answer #3

    -----> pip install gensim config --global http.sslVerify false

    Just install any package with the "config --global http.sslVerify false" statement

    You can ignore SSL errors by setting pypi.org and files.pythonhosted.org as trusted hosts.

    $ pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org <package_name>
    

    Note: Sometime during April 2018, the Python Package Index was migrated from pypi.python.org to pypi.org. This means "trusted-host" commands using the old domain no longer work.

    Permanent Fix

    Since the release of pip 10.0, you should be able to fix this permanently just by upgrading pip itself:

    $ pip install --upgrade --trusted-host pypi.org --trusted-host files.pythonhosted.org pip setuptools
    

    Or by just reinstalling it to get the latest version:

    $ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
    

    (… and then running get-pip.py with the relevant Python interpreter).

    pip install <otherpackage> should just work after this. If not, then you will need to do more, as explained below.


    You may want to add the trusted hosts and proxy to your config file.

    pip.ini (Windows) or pip.conf (unix)

    [global]
    trusted-host = pypi.python.org
                   pypi.org
                   files.pythonhosted.org
    

    Alternate Solutions (Less secure)

    Most of the answers could pose a security issue.

    Two of the workarounds that help in installing most of the python packages with ease would be:

    • Using easy_install: if you are really lazy and don't want to waste much time, use easy_install <package_name>. Note that some packages won't be found or will give small errors.
    • Using Wheel: download the Wheel of the python package and use the pip command pip install wheel_package_name.whl to install the package.

    Answer #4

    It helps to install a python package foo on your machine (can also be in virtualenv) so that you can import the package foo from other projects and also from [I]Python prompts.

    It does a job similar to that of pip, easy_install, etc.


    Using setup.py

    Let's start with some definitions:

    Package - A folder/directory that contains __init__.py file.
    Module - A valid python file with .py extension.
    Distribution - How one package relates to other packages and modules.

    Let's say you want to install a package named foo. Then you run:

    $ git clone https://github.com/user/foo  
    $ cd foo
    $ python setup.py install
    

    If instead you don't want to actually install it but would still like to use it, run:

    $ python setup.py develop  
    

    This command will create symlinks to the source directory within site-packages instead of copying things. Because of this, it is quite fast (particularly for large packages).


    Creating setup.py

    If you have your package tree like,

    foo
    ├── foo
    │   ├── data_struct.py
    │   ├── __init__.py
    │   └── internals.py
    ├── README
    ├── requirements.txt
    └── setup.py
    

    Then, you do the following in your setup.py script so that it can be installed on some machine:

    from setuptools import setup
    
    setup(
       name="foo",
       version="1.0",
       description="A useful module",
       author="Man Foo",
       author_email="[email protected]",
       packages=["foo"],  #same as name
       install_requires=["wheel", "bar", "greek"], #external packages as dependencies
    )
    

    Instead, if your package tree is more complex like the one below:

    foo
    ├── foo
    │   ├── data_struct.py
    │   ├── __init__.py
    │   └── internals.py
    ├── README
    ├── requirements.txt
    ├── scripts
    │   ├── cool
    │   └── skype
    └── setup.py
    

    Then, your setup.py in this case would be like:

    from setuptools import setup
    
    setup(
       name="foo",
       version="1.0",
       description="A useful module",
       author="Man Foo",
       author_email="[email protected]",
       packages=["foo"],  #same as name
       install_requires=["wheel", "bar", "greek"], #external packages as dependencies
       scripts=[
                "scripts/cool",
                "scripts/skype",
               ]
    )
    

    Add more stuff to (setup.py) & make it decent:

    from setuptools import setup
    
    with open("README", "r") as f:
        long_description = f.read()
    
    setup(
       name="foo",
       version="1.0",
       description="A useful module",
       license="MIT",
       long_description=long_description,
       author="Man Foo",
       author_email="[email protected]",
       url="http://www.foopackage.com/",
       packages=["foo"],  #same as name
       install_requires=["wheel", "bar", "greek"], #external packages as dependencies
       scripts=[
                "scripts/cool",
                "scripts/skype",
               ]
    )
    

    The long_description is used in pypi.org as the README description of your package.


    And finally, you're now ready to upload your package to PyPI (pypi.org) so that others can install it using pip install yourpackage.

    At this point there are two options.

    • publish to the temporary test.pypi.org server first to familiarize yourself with the procedure, and then publish to the permanent pypi.org server for the public to use your package.
    • publish straight away to the permanent pypi.org server, if you are already familiar with the procedure and have your user credentials (e.g., username, password, package name).

    Once your package name is registered on pypi.org, nobody else can claim or use it. Python packaging suggests the twine package for uploading your package to PyPI. Thus,

    (1) the first step is to locally build the distributions using:

    # prereq: wheel (pip install wheel)  
    $ python setup.py sdist bdist_wheel   
    

    (2) then using twine for uploading either to test.pypi.org or pypi.org:

    $ twine upload --repository testpypi dist/*  
    username: ***  
    password: ***  
    

    It will take a few minutes for the package to appear on test.pypi.org. Once you're satisfied with it, you can then upload your package to the real & permanent index of pypi.org simply with:

    $ twine upload dist/*  
    

    Optionally, you can also sign the files in your package with GPG:

    $ twine upload dist/* --sign 
    


    Answer #5

    tl;dr / quick fix

    • Don't decode/encode willy nilly
    • Don't assume your strings are UTF-8 encoded
    • Try to convert strings to Unicode strings as soon as possible in your code
    • Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
    • Don't be tempted to use quick reload hacks

    Unicode Zen in Python 2.x - The Long Version

    Without seeing the source it's difficult to know the root cause, so I'll have to speak generally.

    UnicodeDecodeError: 'ascii' codec can't decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.

    In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They hold only Unicode code points and can therefore hold any code point from across the entire spectrum. Strings (str) contain encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5, etc. Strings are decoded to Unicode, and Unicode strings are encoded to strings. Files and text data are always transferred as encoded strings.
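
    The same distinction can be sketched in Python 3 terms, where the Python 2 unicode type is called str and the Python 2 str is called bytes. This is an illustration under that assumption, not code from the answer:

```python
text = "Z\u00fcrich"                 # str: a sequence of Unicode code points ("Zürich")
utf8 = text.encode("utf-8")          # bytes: b'Z\xc3\xbcrich'
latin1 = text.encode("iso-8859-1")   # bytes: b'Z\xfcrich'

# One text, two different byte representations of different lengths.
lengths = (len(text), len(utf8), len(latin1))

# Decoding with the matching encoding restores the original text.
roundtrip_ok = utf8.decode("utf-8") == latin1.decode("iso-8859-1") == text
print(lengths, roundtrip_ok)
```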

    The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code - it will convert ASCII or re-wrap existing Unicode strings to a new Unicode string. The Markdown authors can't know the encoding of the incoming string, so they rely on you to decode strings to Unicode strings before passing them to Markdown.

    Unicode strings can be declared in your code using the u prefix to strings. E.g.

    >>> my_u = u"my ünicôdé strįng"
    >>> type(my_u)
    <type 'unicode'>
    

    Unicode strings may also come from files, databases and network modules. When this happens, you don't need to worry about the encoding.

    Gotchas

    Conversion from str to Unicode can happen even when you don't explicitly call unicode().

    The following scenarios cause UnicodeDecodeError exceptions:

    # Explicit conversion without encoding
    unicode("€")
    
    # New style format string into Unicode string
    # Python will try to convert value string to Unicode first
    u"The currency is: {}".format("€")
    
    # Old style format string into Unicode string
    # Python will try to convert value string to Unicode first
    u"The currency is: %s" % "€"
    
    # Append string to Unicode
    # Python will try to convert string to Unicode first
    u"The currency is: " + "€"         
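
    For comparison, a short sketch of how the last scenario behaves in Python 3 (an assumption about the reader's interpreter, since the examples above are Python 2): the implicit conversion is gone, so mixing bytes and str fails immediately with a TypeError instead of a data-dependent UnicodeDecodeError.

```python
euro_bytes = "\u20ac".encode("utf-8")   # b'\xe2\x82\xac', the euro sign as UTF-8 bytes

mix_error = None
try:
    "The currency is: " + euro_bytes    # str + bytes is never converted implicitly
except TypeError as exc:
    mix_error = exc

# Decoding explicitly, with a known encoding, is the fix in both Python versions.
message = "The currency is: " + euro_bytes.decode("utf-8")
print(message)
```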
    

    Examples

    In the following diagram, you can see how the word café is encoded in either "UTF-8" or "Cp1252", depending on the terminal type. In both examples, caf is just regular ASCII. In UTF-8, é is encoded using two bytes. In "Cp1252", é is 0xE9 (which also happens to be the Unicode code point value; it's no coincidence). The correct decode() is invoked and the conversion to a Python Unicode string is successful:

    [Diagram: a string being converted to a Python Unicode string]

    In this diagram, decode() is called with ascii (which is the same as calling unicode() without an encoding given). As ASCII can't contain bytes greater than 0x7F, this will throw a UnicodeDecodeError exception:

    [Diagram: a string being converted to a Python Unicode string with the wrong encoding]
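
    The two diagrams can be reproduced in a few lines. The sketch below uses Python 3 syntax, which is an assumption; the byte values are the ones described in the text above.

```python
utf8_bytes = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9': é takes two bytes
cp1252_bytes = "caf\u00e9".encode("cp1252")   # b'caf\xe9': é is the single byte 0xE9

# Decoding with the matching encoding gives the same text in both cases.
same_text = utf8_bytes.decode("utf-8") == cp1252_bytes.decode("cp1252")

# Decoding as ASCII fails because the bytes above 0x7F are not valid ASCII.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# A wrong but permissive encoding does not fail, it silently produces mojibake.
mojibake = utf8_bytes.decode("cp1252")
print(same_text, ascii_failed, mojibake)
```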

    The Unicode Sandwich

    It's good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.
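
    A minimal sketch of the sandwich, written for Python 3 (an assumption) with a hypothetical function name, shout, that stands in for whatever your program does with the text:

```python
def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    text = raw.decode(encoding)      # bread: decode once, on the way in
    text = text.upper()              # meat: work on Unicode text only
    return text.encode(encoding)     # bread: encode once, on the way out

result = shout("z\u00fcrich".encode("utf-8"))
print(result.decode("utf-8"))
```

    Only the first and last lines of the function know about the encoding; everything in between is free of it.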

    Input / Decode

    Source code

    If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g.

    u"Zürich"
    

    To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as "UTF-8", you would use:

    # encoding: utf-8
    

    This is only necessary when you have non-ASCII in your source code.

    Files

    Usually non-ASCII data is received from a file. The io module provides a TextIOWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file - it can't be easily guessed. For example, for a UTF-8 file:

    import io
    with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
         my_unicode_string = my_file.read() 
    

    my_unicode_string would then be suitable for passing to Markdown. If you get a UnicodeDecodeError from the read() line, then you've probably used the wrong encoding value.

    CSV Files

    The Python 2.7 CSV module does not support non-ASCII characters 😩. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.

    Use it like above but pass the opened file to it:

    from backports import csv
    import io

    def read_rows(path="my_utf8_file.txt"):
        # yield is only valid inside a function, so wrap the loop in a generator
        with io.open(path, "r", encoding="utf-8") as my_file:
            for row in csv.reader(my_file):
                yield row
    

    Databases

    Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.

    MySQL

    In the connection string add:

    charset="utf8",
    use_unicode=True
    

    E.g.

    >>> db = MySQLdb.connect(host="localhost", user="root", passwd="passwd", db="sandbox", use_unicode=True, charset="utf8")
    
    PostgreSQL

    Add:

    psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
    

    HTTP

    Web pages can be encoded in just about any encoding. The Content-type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text.
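
    Decoding manually against the charset field could look like the sketch below (Python 3, standard library only). The helper name charset_from_content_type and the UTF-8 fallback are assumptions made for illustration; real pages may also declare their encoding elsewhere.

```python
from email.message import Message

def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or None if absent.
    return msg.get_content_charset() or default

body = "caf\u00e9".encode("iso-8859-1")          # pretend response body
charset = charset_from_content_type("text/html; charset=ISO-8859-1")
text = body.decode(charset)
fallback = charset_from_content_type("text/html")
print(charset, text, fallback)
```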

    Manually

    If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you've probably got the wrong encoding.

    The meat of the sandwich

    Work with Unicodes as you would normal strs.

    Output

    stdout / printing

    print writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console's encoding. For example, if a Linux shell's locale is en_GB.UTF-8, the output will be encoded to UTF-8. On Windows, you will be limited to an 8-bit code page.

    An incorrectly configured console, such as a corrupt locale, can lead to unexpected print errors. The PYTHONIOENCODING environment variable can force the encoding for stdout.

    Files

    Just like input, io.open can be used to transparently convert Unicodes to encoded byte strings.

    Database

    The same configuration for reading will allow Unicodes to be written directly.

    Python 3

    Python 3 is no more Unicode-capable than Python 2.x; however, it is slightly less confused on the topic. E.g., the regular str is now a Unicode string and the old str is now bytes.

    The default encoding is UTF-8, so if you .decode() a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people's Unicode problems.

    Further, open() operates in text mode by default, so returns decoded str (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.
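
    Both defaults can be inspected directly. A small sketch for Python 3; the locale-derived value is platform dependent, so no particular result is assumed for it:

```python
import locale
import sys

data = b"caf\xc3\xa9"

# bytes.decode() without arguments uses UTF-8 in Python 3.
decoded = data.decode()
default_codec = sys.getdefaultencoding()
print(decoded == data.decode("utf-8"), default_codec)

# The locale's preferred encoding, which text-mode open() typically falls back to
# when no encoding argument is given (varies by platform and settings).
print(locale.getpreferredencoding(False))
```

    Passing encoding="utf-8" to open() explicitly avoids depending on the locale at all.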

    Why you shouldn't use sys.setdefaultencoding("utf8")

    It's a nasty hack (there's a reason you have to use reload) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause and enjoy Unicode zen. See Why should we NOT use sys.setdefaultencoding("utf-8") in a py script? for further details.

    Answer #6

    You can't.

    One workaround is to clone the environment under a new name and then remove the original one.

    First, remember to deactivate your current environment. You can do this with the commands:

    • deactivate on Windows or
    • source deactivate on macOS/Linux.

    Then:

    conda create --name new_name --clone old_name
    conda remove --name old_name --all # or its alias: `conda env remove --name old_name`
    

    Notice there are several drawbacks of this method:

    1. It redownloads packages (you can use --offline flag to disable it)
    2. Time consumed on copying the environment's files
    3. Temporary double disk usage

    There is an open issue requesting this feature.

    Answer #7

    Explanation

    From PEP 328

    Relative imports use a module's __name__ attribute to determine that module's position in the package hierarchy. If the module's name does not contain any package information (e.g. it is set to "__main__") then relative imports are resolved as if the module were a top level module, regardless of where the module is actually located on the file system.

    At some point PEP 338 conflicted with PEP 328:

    ... relative imports rely on __name__ to determine the current module's position in the package hierarchy. In a main module, the value of __name__ is always "__main__", so explicit relative imports will always fail (as they only work for a module inside a package)

    and to address the issue, PEP 366 introduced the top level variable __package__:

    By adding a new module level attribute, this PEP allows relative imports to work automatically if the module is executed using the -m switch. A small amount of boilerplate in the module itself will allow the relative imports to work when the file is executed by name. [...] When it [the attribute] is present, relative imports will be based on this attribute rather than the module __name__ attribute. [...] When the main module is specified by its filename, then the __package__ attribute will be set to None. [...] When the import system encounters an explicit relative import in a module without __package__ set (or with it set to None), it will calculate and store the correct value (__name__.rpartition(".")[0] for normal modules and __name__ for package initialisation modules)

    (emphasis mine)

    If the __name__ is "__main__", __name__.rpartition(".")[0] returns an empty string. This is why there's an empty string literal in the error description:

    SystemError: Parent module '' not loaded, cannot perform relative import
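
    The rpartition step is easy to check by hand. In this sketch the helper name parent_package is made up, and it only covers the "normal module" case from the PEP 366 quote above, not package initialisation modules:

```python
def parent_package(module_name):
    """Return everything before the last dot, or '' if there is no dot."""
    return module_name.rpartition(".")[0]

for name in ("__main__", "package.module", "package.subpackage.module"):
    print(repr(name), "->", repr(parent_package(name)))
```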
    

    The relevant part of CPython's PyImport_ImportModuleLevelObject function:

    if (PyDict_GetItem(interp->modules, package) == NULL) {
        PyErr_Format(PyExc_SystemError,
                "Parent module %R not loaded, cannot perform relative "
                "import", package);
        goto error;
    }
    

    CPython raises this exception if it was unable to find package (the name of the package) in interp->modules (accessible as sys.modules). Since sys.modules is "a dictionary that maps module names to modules which have already been loaded", it's now clear that the parent module must be explicitly absolute-imported before performing a relative import.
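
    You can watch sys.modules being filled in from Python itself. A small sketch using the standard library's json package as a stand-in for your own package:

```python
import importlib
import sys

# Importing a submodule by its absolute name also loads its parent package.
importlib.import_module("json.decoder")

loaded = [name for name in ("json", "json.decoder") if name in sys.modules]
print(loaded)
```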

    Note: The patch from the issue 18018 has added another if block, which will be executed before the code above:

    if (PyUnicode_CompareWithASCIIString(package, "") == 0) {
        PyErr_SetString(PyExc_ImportError,
                "attempted relative import with no known parent package");
        goto error;
    } /* else if (PyDict_GetItem(interp->modules, package) == NULL) {
        ...
    */
    

    If package (same as above) is an empty string, the error message will be

    ImportError: attempted relative import with no known parent package
    

    However, you will only see this in Python 3.6 or newer.

    Solution #1: Run your script using -m

    Consider a directory (which is a Python package):

    .
    ├── package
    │   ├── __init__.py
    │   ├── module.py
    │   └── standalone.py
    

    All of the files in package begin with the same 2 lines of code:

    from pathlib import Path
    print("Running" if __name__ == "__main__" else "Importing", Path(__file__).resolve())
    

    I'm including these two lines only to make the order of operations obvious. We can ignore them completely, since they don't affect the execution.

    __init__.py and module.py contain only those two lines (i.e., they are effectively empty).

    standalone.py additionally attempts to import module.py via relative import:

    from . import module  # explicit relative import
    

    We're well aware that /path/to/python/interpreter package/standalone.py will fail. However, we can run the module with the -m command line option that will "search sys.path for the named module and execute its contents as the __main__ module":

    $ python3 -i -m package.standalone
    Importing /home/vaultah/package/__init__.py
    Running /home/vaultah/package/standalone.py
    Importing /home/vaultah/package/module.py
    >>> __file__
    '/home/vaultah/package/standalone.py'
    >>> __package__
    'package'
    >>> # The __package__ has been correctly set and module.py has been imported.
    ... # What's inside sys.modules?
    ... import sys
    >>> sys.modules["__main__"]
    <module 'package.standalone' from '/home/vaultah/package/standalone.py'>
    >>> sys.modules["package.module"]
    <module 'package.module' from '/home/vaultah/package/module.py'>
    >>> sys.modules["package"]
    <module 'package' from '/home/vaultah/package/__init__.py'>
    

    -m does all the importing stuff for you and automatically sets __package__, but you can do that yourself, as Solution #2 shows.
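
    The effect of -m can also be observed from inside Python with the standard library's runpy module, which provides roughly the same machinery. The sketch below builds a throwaway package (the name demo_pkg and its contents are made up) and runs one of its modules as __main__:

```python
import runpy
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_pkg"          # hypothetical package for the demo
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "module.py").write_text("VALUE = 42\n")
    (pkg / "standalone.py").write_text(
        "from . import module  # explicit relative import\n"
        "result = module.VALUE\n"
    )

    sys.path.insert(0, tmp)
    try:
        # Roughly what `python -m demo_pkg.standalone` does.
        namespace = runpy.run_module("demo_pkg.standalone", run_name="__main__")
    finally:
        sys.path.remove(tmp)

print(namespace["__name__"], namespace["__package__"], namespace["result"])
```

    Even though __name__ is "__main__", the relative import works because __package__ has been set to the parent package.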

    Solution #2: Set __package__ manually

    Please treat it as a proof of concept rather than an actual solution. It isn't well-suited for use in real-world code.

    PEP 366 has a workaround to this problem, however, it's incomplete, because setting __package__ alone is not enough. You're going to need to import at least N preceding packages in the module hierarchy, where N is the number of parent directories (relative to the directory of the script) that will be searched for the module being imported.

    Thus,

    1. Add the parent directory of the Nth predecessor of the current module to sys.path

    2. Remove the current file's directory from sys.path

    3. Import the parent module of the current module using its fully-qualified name

    4. Set __package__ to the fully-qualified name from 3

    5. Perform the relative import

    I'll borrow files from the Solution #1 and add some more subpackages:

    package
    ├── __init__.py
    ├── module.py
    └── subpackage
        ├── __init__.py
        └── subsubpackage
            ├── __init__.py
            └── standalone.py
    

    This time standalone.py will import module.py from the package package using the following relative import

    from ... import module  # N = 3
    

    We'll need to precede that line with the boilerplate code, to make it work.

    import sys
    from pathlib import Path
    
    if __name__ == "__main__" and __package__ is None:
        file = Path(__file__).resolve()
        parent, top = file.parent, file.parents[3]
    
        sys.path.append(str(top))
        try:
            sys.path.remove(str(parent))
        except ValueError: # Already removed
            pass
    
        import package.subpackage.subsubpackage
        __package__ = "package.subpackage.subsubpackage"
    
    from ... import module # N = 3
    

    It allows us to execute standalone.py by filename:

    $ python3 package/subpackage/subsubpackage/standalone.py
    Running /home/vaultah/package/subpackage/subsubpackage/standalone.py
    Importing /home/vaultah/package/__init__.py
    Importing /home/vaultah/package/subpackage/__init__.py
    Importing /home/vaultah/package/subpackage/subsubpackage/__init__.py
    Importing /home/vaultah/package/module.py
    

    A more general solution wrapped in a function can be found here. Example usage:

    if __name__ == "__main__" and __package__ is None:
        import_parents(level=3) # N = 3
    
    from ... import module
    from ...module.submodule import thing
    

    Solution #3: Use absolute imports and setuptools

    The steps are -

    1. Replace explicit relative imports with equivalent absolute imports

    2. Install package to make it importable

    For instance, the directory structure may be as follows

    .
    ├── project
    │   ├── package
    │   │   ├── __init__.py
    │   │   ├── module.py
    │   │   └── standalone.py
    │   └── setup.py
    

    where setup.py is

    from setuptools import setup, find_packages
    setup(
        name = "your_package_name",
        packages = find_packages(),
    )
    

    The rest of the files were borrowed from the Solution #1.

    Installation will allow you to import the package regardless of your working directory (assuming there'll be no naming issues).

    We can modify standalone.py to use this advantage (step 1):

    from package import module  # absolute import
    

    Change your working directory to project and run /path/to/python/interpreter setup.py install --user (--user installs the package in your user site-packages directory) (step 2):

    $ cd project
    $ python3 setup.py install --user
    

    Let's verify that it's now possible to run standalone.py as a script:

    $ python3 -i package/standalone.py
    Running /home/vaultah/project/package/standalone.py
    Importing /home/vaultah/.local/lib/python3.6/site-packages/your_package_name-0.0.0-py3.6.egg/package/__init__.py
    Importing /home/vaultah/.local/lib/python3.6/site-packages/your_package_name-0.0.0-py3.6.egg/package/module.py
    >>> module
    <module 'package.module' from '/home/vaultah/.local/lib/python3.6/site-packages/your_package_name-0.0.0-py3.6.egg/package/module.py'>
    >>> import sys
    >>> sys.modules["package"]
    <module 'package' from '/home/vaultah/.local/lib/python3.6/site-packages/your_package_name-0.0.0-py3.6.egg/package/__init__.py'>
    >>> sys.modules["package.module"]
    <module 'package.module' from '/home/vaultah/.local/lib/python3.6/site-packages/your_package_name-0.0.0-py3.6.egg/package/module.py'>
    

    Note: If you decide to go down this route, you'd be better off using virtual environments to install packages in isolation.
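
    To confirm where a package resolves from without opening an interactive session, a minimal sketch along these lines may help. The helper name locate is hypothetical; it relies only on importlib.util.find_spec from the standard library, and json merely stands in for your own package name.

```python
import importlib.util


def locate(name):
    """Return the file a top-level module or package would be imported from, or None."""
    spec = importlib.util.find_spec(name)  # None when the name cannot be found
    return None if spec is None else spec.origin


# "json" is only a stand-in for your own package name.
print(locate("json"))
print(locate("package_that_is_not_installed"))
```

    Running this from several working directories is a quick way to check that the installed copy, rather than a local directory, is the one being picked up.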

    Solution #4: Use absolute imports and some boilerplate code

    Frankly, the installation is not necessary - you could add some boilerplate code to your script to make absolute imports work.

    I'm going to borrow files from Solution #1 and change standalone.py:

    1. Add the parent directory of package to sys.path before attempting to import anything from package using absolute imports:

      import sys
      from pathlib import Path  # if you haven't already done so
      file = Path(__file__).resolve()
      parent, root = file.parent, file.parents[1]
      sys.path.append(str(root))
      
      # Additionally remove the current file's directory from sys.path
      try:
          sys.path.remove(str(parent))
      except ValueError:  # Already removed
          pass
      
    2. Replace the relative import by the absolute import:

      from package import module  # absolute import
      

    standalone.py runs without problems:

    ~$ python3 -i package/standalone.py
    Running /home/vaultah/package/standalone.py
    Importing /home/vaultah/package/__init__.py
    Importing /home/vaultah/package/module.py
    >>> module
    <module 'package.module' from '/home/vaultah/package/module.py'>
    >>> import sys
    >>> sys.modules["package"]
    <module 'package' from '/home/vaultah/package/__init__.py'>
    >>> sys.modules["package.module"]
    <module 'package.module' from '/home/vaultah/package/module.py'>
    

    I feel that I should warn you: try not to do this, especially if your project has a complex structure.
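
    If the boilerplate is used anyway, it can at least be kept in one place. The following is a minimal sketch, assuming the same layout as above; the function name add_package_parent and its levels_up parameter are hypothetical, not part of the original answer.

```python
import sys
from pathlib import Path


def add_package_parent(script_path, levels_up=1):
    """Append script.parents[levels_up] to sys.path and return it.

    levels_up=1 matches file.parents[1] in the boilerplate above:
    the directory that contains the package directory.
    """
    script = Path(script_path).resolve()
    root = script.parents[levels_up]
    if str(root) not in sys.path:  # avoid adding duplicates on repeated calls
        sys.path.append(str(root))
    return root


# Intended use at the top of standalone.py, before any absolute import:
#     add_package_parent(__file__)
#     from package import module
```

    Unlike the original snippet, this sketch does not remove the script's own directory from sys.path; add that step if name clashes are a concern.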


    As a side note, PEP 8 recommends the use of absolute imports, but states that in some scenarios explicit relative imports are acceptable:

    Absolute imports are recommended, as they are usually more readable and tend to be better behaved (or at least give better error messages). [...] However, explicit relative imports are an acceptable alternative to absolute imports, especially when dealing with complex package layouts where using absolute imports would be unnecessarily verbose.

    Answer #8

    Is this the correct use of conftest.py?

    Yes it is. Fixtures are a potential and common use of conftest.py. The fixtures that you define are shared among all tests in your test suite. However, defining fixtures in the root conftest.py might be wasteful, and it would slow down testing if those fixtures are not used by all tests.

    Does it have other uses?

    Yes it does.

    • Fixtures: Define fixtures for static data used by tests. This data can be accessed by all tests in the suite unless specified otherwise. This could be data as well as helpers of modules which will be passed to all tests.

    • External plugin loading: conftest.py is used to import external plugins or modules. By defining the following global variable, pytest will load the module and make it available to its tests. Plugins are generally files defined in your project or other modules which might be needed in your tests. You can also load a set of predefined plugins as explained here.

      pytest_plugins = "someapp.someplugin"

    • Hooks: You can specify hooks such as setup and teardown methods and much more to improve your tests. For the full set of available hooks, see the hooks reference in the pytest documentation. Example:

        def pytest_runtest_setup(item):
            """Called before ``pytest_runtest_call(item)``."""
            # do some stuff
      
    • Test root path: This is a bit of a hidden feature. Placing a conftest.py in your root path lets pytest recognize your application modules without specifying PYTHONPATH. In the background, pytest modifies your sys.path so that the modules found from the root path become importable.

    Can I have more than one conftest.py file?

    Yes you can and it is strongly recommended if your test structure is somewhat complex. conftest.py files have directory scope. Therefore, creating targeted fixtures and helpers is good practice.

    When would I want to do that? Examples will be appreciated.

    Several cases could fit:

    Creating a set of tools or hooks for a particular group of tests.

    root/mod/conftest.py

    def pytest_runtest_setup(item):
        print("I am mod")
        #do some stuff
    
    
    test root/mod2/test.py will NOT produce "I am mod"
    

    Loading a set of fixtures for some tests but not for others.

    root/mod/conftest.py

    import pytest
    
    @pytest.fixture()
    def fixture():
        return "some stuff"
    

    root/mod2/conftest.py

    import pytest
    
    @pytest.fixture()
    def fixture():
        return "some other stuff"
    

    root/mod2/test.py

    def test(fixture):
        print(fixture)
    

    Will print "some other stuff".

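    Adding directory-specific hook behaviour on top of the root conftest.py.

    root/mod/conftest.py

    def pytest_runtest_setup(item):
        print("I am mod")
        #do some stuff
    

    root/conftest.py

    def pytest_runtest_setup(item):
        print("I am root")
        #do some stuff
    

    Hook implementations are not replaced by more specific ones: pytest calls every pytest_runtest_setup implementation that applies to a test. Running a test inside root/mod therefore prints both "I am root" and "I am mod", while a test elsewhere under root prints only "I am root".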

    You can read more about conftest.py here.

    EDIT:

    What if I need plain-old helper functions to be called from a number of tests in different modules - will they be available to me if I put them in a conftest.py? Or should I simply put them in a helpers.py module and import and use it in my test modules?

    You can use conftest.py to define your helpers. However, you should follow common practice. Helpers can be used as fixtures at least in pytest. For example in my tests I have a mock redis helper which I inject into my tests this way.

    root/helper/redis/redis.py

    import pytest
    
    @pytest.fixture
    def mock_redis():
        return MockRedis()
    

    root/tests/stuff/conftest.py

    pytest_plugins = "helper.redis.redis"
    

    root/tests/stuff/test.py

    def test(mock_redis):
        print(mock_redis.get("stuff"))
    

    This will be a test module that you can freely import in your tests. NOTE that you could potentially name redis.py as conftest.py if your module redis contains more tests. However, that practice is discouraged because of ambiguity.

    If you want to use conftest.py, you can simply put that helper in your root conftest.py and inject it when needed.

    root/tests/conftest.py

    import pytest
    
    @pytest.fixture
    def mock_redis():
        return MockRedis()
    

    root/tests/stuff/test.py

    def test(mock_redis):
        print(mock_redis.get("stuff"))
    

    Another thing you can do is to write an installable plugin. In that case your helper can be written anywhere, but the package needs to define an entry point so that it can be installed in your own and other test suites. See the pytest documentation on writing plugins.

    If you don't want to use fixtures, you could of course define a simple helper and just use the plain old import wherever it is needed.

    root/tests/helper/redis.py

    class MockRedis():
        # stuff
        pass
    

    root/tests/stuff/test.py

    from helper.redis import MockRedis
    
    def test():
        print(MockRedis().get("stuff"))
    

    However, here you might have problems with the path since the module is not in a child folder of the test. You should be able to overcome this (not tested) by adding an __init__.py to your helper package

    root/tests/helper/__init__.py

    from .redis import MockRedis
    

    Or simply adding the helper module to your PYTHONPATH.

    Answer #9

    I would suggest reading PEP 483 and PEP 484 and watching this presentation by Guido on type hinting.

    In a nutshell: Type hinting is literally what the words mean. You hint the type of the object(s) you're using.

    Due to the dynamic nature of Python, inferring or checking the type of an object being used is especially hard. This makes it hard for developers to understand what exactly is going on in code they haven't written and, most importantly, it limits the type checking tools found in many IDEs (PyCharm and PyDev come to mind), which have no indicator of what type the objects are. As a result they resort to trying to infer the type with (as mentioned in the presentation) around a 50% success rate.


    To take two important slides from the type hinting presentation:

    Why type hints?

    1. Helps type checkers: By hinting at what type you want the object to be, the type checker can easily detect if, for instance, you're passing an object with a type that isn't expected.
    2. Helps with documentation: A third person viewing your code will know what is expected where and, ergo, how to use it without running into TypeErrors.
    3. Helps IDEs develop more accurate and robust tools: Development environments will be better suited to suggesting appropriate methods when they know what type your object is. You have probably experienced this with some IDE at some point, hitting the . and having methods/attributes pop up which aren't defined for an object.

    Why use static type checkers?

    • Find bugs sooner: This is self-evident, I believe.
    • The larger your project the more you need it: Again, makes sense. Static languages offer a robustness and control that dynamic languages lack. The bigger and more complex your application becomes the more control and predictability (from a behavioral aspect) you require.
    • Large teams are already running static analysis: I'm guessing this verifies the first two points.

    As a closing note for this small introduction: This is an optional feature and, from what I understand, it has been introduced in order to reap some of the benefits of static typing.

    You generally do not need to worry about it and definitely don't need to use it (especially in cases where you use Python as an auxiliary scripting language). It should be helpful when developing large projects as it offers much needed robustness, control and additional debugging capabilities.


    Type hinting with mypy:

    In order to make this answer more complete, I think a little demonstration would be suitable. I'll be using mypy, the library which inspired Type Hints as they are presented in the PEP. This is mainly written for anybody bumping into this question and wondering where to begin.

    Before I do that let me reiterate the following: PEP 484 doesn't enforce anything; it is simply setting a direction for function annotations and proposing guidelines for how type checking can/should be performed. You can annotate your functions and hint as many things as you want; your scripts will still run regardless of the presence of annotations because Python itself doesn't use them.
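
    A small sketch makes the point concrete. The function name repeat and its arguments are made up for illustration; the call deliberately contradicts the annotations and still runs, because the interpreter only stores them.

```python
def repeat(text: str, times: int) -> str:
    # The annotations are recorded on the function but never checked at runtime.
    return text * times


print(repeat.__annotations__)

# Both arguments contradict the hints, yet the call succeeds: 3 * 4 == 12.
print(repeat(3, 4))  # prints 12
```

    A static checker would flag the call; the interpreter does not.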

    Anyways, as noted in the PEP, hinting types should generally take three forms:

    • Function annotations (PEP 3107).
    • Stub files for built-in/user modules.
    • Special # type: type comments that complement the first two forms. (See: What are variable annotations? for a Python 3.6 update for # type: type comments)

    Additionally, you'll want to use type hints in conjunction with the new typing module introduced in Py3.5. In it, many (additional) ABCs (abstract base classes) are defined along with helper functions and decorators for use in static checking. Most ABCs in collections.abc are included, but in a generic form in order to allow subscription (by defining a __getitem__() method).

    For anyone interested in a more in-depth explanation of these, the mypy documentation is written very nicely and has a lot of code samples demonstrating/describing the functionality of their checker; it is definitely worth a read.

    Function annotations and special comments:

    First, it's interesting to observe some of the behavior we can get when using special comments. Special # type: type comments can be added during variable assignments to indicate the type of an object if one cannot be directly inferred. Simple assignments are generally easily inferred but others, like lists (with regard to their contents), cannot.

    Note: If we want to use any derivative of containers and need to specify the contents for that container we must use the generic types from the typing module. These support indexing.

    # Generic List, supports indexing.
    from typing import List
    
    # In this case, the type is easily inferred as type: int.
    i = 0
    
    # Even though the type can be inferred as of type list
    # there is no way to know the contents of this list.
    # By using type: List[str] we indicate we want to use a list of strings.
    a = []  # type: List[str]
    
    # Appending an int to our list
    # is statically not correct.
    a.append(i)
    
    # Appending a string is fine.
    a.append("i")
    
    print(a)  # [0, 'i']
    

    If we add these commands to a file and execute them with our interpreter, everything works just fine and print(a) just prints the contents of list a. The # type comments have been discarded, treated as plain comments which have no additional semantic meaning.

    By running this with mypy, on the other hand, we get the following response:

    (Python3)$ mypy typeHintsCode.py
    typesInline.py:14: error: Argument 1 to "append" of "list" has incompatible type "int"; expected "str"
    

    Indicating that a list of str objects cannot contain an int, which, statically speaking, is sound. This can be fixed by either abiding by the type of a and only appending str objects or by changing the type of the contents of a to indicate that any value is acceptable (intuitively performed with List[Any] after Any has been imported from typing).
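
    Both fixes can be sketched side by side. The variable names strings and anything are illustrative only; the # type: comment style matches the example above.

```python
from typing import Any, List

# Fix 1: keep List[str] and append only str objects.
strings = []  # type: List[str]
strings.append(str(0))
strings.append("i")

# Fix 2: relax the declared element type so that any value is acceptable.
anything = []  # type: List[Any]
anything.append(0)
anything.append("i")

print(strings)
print(anything)
```

    The first variant keeps the checker useful for later code that reads from the list; the second trades that away for flexibility.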

    Function annotations are added in the form param_name : type after each parameter in your function signature and a return type is specified using the -> type notation before the ending function colon; all annotations are stored in the __annotations__ attribute for that function in a handy dictionary form. Using a trivial example (which doesn't require extra types from the typing module):

    def annotated(x: int, y: str) -> bool:
        return x < y
    

    The annotated.__annotations__ attribute now has the following values:

    {'y': <class 'str'>, 'return': <class 'bool'>, 'x': <class 'int'>}
    

    If we're a complete newbie, or we are familiar with Python 2.7 concepts and are consequently unaware of the TypeError lurking in the comparison of annotated, we can perform another static check, catch the error and save ourselves some trouble:

    (Python3)$ mypy typeHintsCode.py
    typeFunction.py: note: In function "annotated":
    typeFunction.py:2: error: Unsupported operand types for > ("str" and "int")
    

    Among other things, calling the function with invalid arguments will also get caught:

    annotated(20, 20)
    
    # mypy complains:
    typeHintsCode.py:4: error: Argument 2 to "annotated" has incompatible type "int"; expected "str"
    

    These can be extended to basically any use case and the errors caught extend further than basic calls and operations. The types you can check for are really flexible and I have merely given a small sneak peek of its potential. A look in the typing module, the PEPs or the mypy documentation will give you a more comprehensive idea of the capabilities offered.
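
    For comparison, this is the runtime failure that the static check on annotated anticipates. It is a minimal sketch for Python 3, where ordering an int against a str raises TypeError; the exact wording of the message varies between Python versions, so it is printed rather than quoted.

```python
def annotated(x: int, y: str) -> bool:
    return x < y


message = None
try:
    # Arguments that match the annotations still fail at runtime in Python 3.
    annotated(1, "one")
except TypeError as exc:
    message = str(exc)

print(message)
```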

    Stub files:

    Stub files can be used in two different, not mutually exclusive cases:

    • You need to type check a module for which you do not want to directly alter the function signatures
    • You want to write modules and have type-checking but additionally want to separate annotations from content.

    Stub files (with a .pyi extension) are an annotated interface of the module you are making or want to use. They contain the signatures of the functions you want to type-check with the body of the functions discarded. To get a feel of this, given a set of three random functions in a module named randfunc.py:

    def message(s):
        print(s)
    
    def alterContents(myIterable):
        return [i for i in myIterable if i % 2 == 0]
    
    def combine(messageFunc, itFunc):
        messageFunc("Printing the Iterable")
        a = alterContents(range(1, 20))
        return set(a)
    

    We can create a stub file randfunc.pyi, in which we can place some restrictions if we wish to do so. The downside is that somebody viewing the source without the stub won't really get that annotation assistance when trying to understand what is supposed to be passed where.

    Anyway, the structure of a stub file is pretty simple: Add all function definitions with empty bodies (pass filled) and supply the annotations based on your requirements. Here, let's assume we only want to work with int types for our Containers.

    # Stub for randfunc.py
    from typing import Any, Iterable, List, Set, Callable
    
    def message(s: str) -> None: pass
    
    def alterContents(myIterable: Iterable[int])-> List[int]: pass
    
    def combine(
        messageFunc: Callable[[str], Any],
        itFunc: Callable[[Iterable[int]], List[int]]
    )-> Set[int]: pass
    

    The combine function gives an indication of why you might want to use annotations in a different file: they sometimes clutter up the code and reduce readability (big no-no for Python). You could of course use type aliases, but that sometimes confuses more than it helps (so use them wisely).
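
    As a rough illustration of the type aliases mentioned above, the combine signature could be shortened like this. The alias names MessageFunc and ContentsFunc are hypothetical, and unlike the original combine this sketch actually calls itFunc so that the example is self-contained.

```python
from typing import Any, Callable, Iterable, List, Set

# Aliases are ordinary assignments; checkers treat them as the aliased type.
MessageFunc = Callable[[str], Any]
ContentsFunc = Callable[[Iterable[int]], List[int]]


def alterContents(myIterable: Iterable[int]) -> List[int]:
    return [i for i in myIterable if i % 2 == 0]


def combine(messageFunc: MessageFunc, itFunc: ContentsFunc) -> Set[int]:
    messageFunc("Printing the Iterable")
    return set(itFunc(range(1, 20)))


print(combine(print, alterContents))
```

    The signature is easier to scan, at the price of one extra lookup for a reader who wants to know what ContentsFunc stands for.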


    This should get you familiarized with the basic concepts of type hints in Python. Even though the type checker used has been mypy, you should gradually start to see more of them pop up, some internally in IDEs (PyCharm) and others as standard Python modules.

    I'll try and add additional checkers/related packages in the following list when and if I find them (or if suggested).

    Checkers I know of:

    • Mypy: as described here.
    • PyType: By Google, uses different notation from what I gather, probably worth a look.

    Related Packages/Projects:

    • typeshed: Official Python repository housing an assortment of stub files for the standard library.

    The typeshed project is actually one of the best places you can look to see how type hinting might be used in a project of your own. Let's take as an example the __init__ dunders of the Counter class in the corresponding .pyi file:

    class Counter(Dict[_T, int], Generic[_T]):
        @overload
        def __init__(self) -> None: ...
        @overload
        def __init__(self, mapping: Mapping[_T, int]) -> None: ...
        @overload
        def __init__(self, iterable: Iterable[_T]) -> None: ...
    

    Where _T = TypeVar("_T") is used to define generic classes. For the Counter class we can see that it can either take no arguments in its initializer, get a single Mapping from any type to an int or take an Iterable of any type.
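
    The three initializer forms described by those overloads can be tried directly with collections.Counter. The variable names below are illustrative only.

```python
from collections import Counter

empty = Counter()                          # no arguments
from_mapping = Counter({"a": 2, "b": 1})   # a Mapping from keys to int counts
from_iterable = Counter("abca")            # an Iterable of hashable items

print(empty, from_mapping, from_iterable)
```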


    Notice: One thing I forgot to mention was that the typing module has been introduced on a provisional basis. From PEP 411:

    A provisional package may have its API modified prior to "graduating" into a "stable" state. On one hand, this state provides the package with the benefits of being formally part of the Python distribution. On the other hand, the core development team explicitly states that no promises are made with regards to the stability of the package's API, which may change for the next release. While it is considered an unlikely outcome, such packages may even be removed from the standard library without a deprecation period if the concerns regarding their API or maintenance prove well-founded.

    So take things here with a pinch of salt; I'm doubtful it will be removed or altered in significant ways, but one can never know.


    Another topic altogether, but valid in the scope of type hints: PEP 526: Syntax for Variable Annotations is an effort to replace # type comments by introducing new syntax which allows users to annotate the type of variables in simple varname: type statements.

    See What are variable annotations?, as previously mentioned, for a small introduction to these.
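
    A short sketch of the PEP 526 syntax, assuming Python 3.6 or later. The class Point and the variable names are made up for illustration.

```python
from typing import List

# Annotated assignments replace the "# type:" comments used earlier.
i: int = 0
a: List[str] = []
a.append("i")


class Point:
    x: int       # annotation only: no class attribute is created
    y: int = 0   # annotation plus a default value


print(Point.__annotations__)
print(hasattr(Point, "x"), hasattr(Point, "y"))
```

    As with function annotations, nothing is enforced at runtime; the annotations are only recorded for tools to inspect.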

    Answer #10

    Whatever is assigned to the files variable is incorrect. Use the following code.

    import glob
    import os
    
    list_of_files = glob.glob("/path/to/folder/*")  # * matches every file; use e.g. *.csv for a specific format
    latest_file = max(list_of_files, key=os.path.getctime)
    print(latest_file)
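
    Two details of the snippet above are worth keeping in mind: max() raises ValueError when the folder has no matching files, and os.path.getctime reports the last metadata change on Unix but the creation time on Windows. The following is a minimal sketch of a variant that returns None for an empty match and compares modification times instead; the function name latest_file and its pattern parameter are assumptions, not part of the original answer.

```python
from pathlib import Path


def latest_file(folder, pattern="*"):
    """Return the most recently modified file in folder matching pattern, or None."""
    files = [p for p in Path(folder).glob(pattern) if p.is_file()]
    if not files:
        return None  # nothing matched, so there is no latest file
    return max(files, key=lambda p: p.stat().st_mtime)


# Example call (the path is a placeholder, as in the snippet above):
#     print(latest_file("/path/to/folder", "*.csv"))
```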
    

    Tutorials