
Python | Speech recognition on large audio files


Libraries required

Pydub: sudo pip3 install pydub

Speech recognition: sudo pip3 install SpeechRecognition


Input:  peacock.wav 


exporting chunk0.wav
Processing chunk 0
exporting chunk1.wav
Processing chunk 1
exporting chunk2.wav
Processing chunk 2
exporting chunk3.wav
Processing chunk 3
exporting chunk4.wav
Processing chunk 4
exporting chunk5.wav
Processing chunk 5
exporting chunk6.wav
Processing chunk 6

Python Code:

# importing libraries
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

# a function that splits the audio file into chunks
# and applies speech recognition
def silence_based_conversion(path="alice-medium.wav"):

    # open the audio file stored in
    # the local system as a wav file.
    song = AudioSegment.from_wav(path)

    # open a file where we will concatenate
    # and store the recognized text
    fh = open("recognized.txt", "w+")

    # split track where silence is 0.5 seconds
    # or more and get chunks
    chunks = split_on_silence(song,
        # must be silent for at least 0.5 seconds
        # or 500 ms. adjust this value based on user
        # requirement. if the speaker stays silent for
        # longer, increase this value. else, decrease it.
        min_silence_len=500,

        # consider it silent if quieter than -16 dBFS
        # adjust this per requirement
        silence_thresh=-16
    )

    # create a directory to store the audio chunks.
    try:
        os.mkdir('audio_chunks')
    except FileExistsError:
        pass

    # move into the directory to
    # store the audio files.
    os.chdir('audio_chunks')

    i = 0
    # process each chunk
    for chunk in chunks:

        # create a 0.5 second (500 ms) silence chunk
        chunk_silent = AudioSegment.silent(duration=500)

        # add 0.5 sec silence to the beginning and
        # end of the audio chunk. This is done so that
        # it doesn't seem abruptly sliced.
        audio_chunk = chunk_silent + chunk + chunk_silent

        # export the audio chunk and save it in
        # the current directory.
        print("exporting chunk{0}.wav".format(i))

        # specify the bitrate to be 192k
        audio_chunk.export("./chunk{0}.wav".format(i), bitrate='192k', format="wav")

        # the name of the newly created chunk
        filename = 'chunk' + str(i) + '.wav'

        print("Processing chunk " + str(i))

        # store the name of the newly created chunk
        # in the file variable for later use.
        file = filename

        # create a speech recognition object
        r = sr.Recognizer()

        # recognize the chunk
        with sr.AudioFile(file) as source:
            # remove this if it is not working correctly
            r.adjust_for_ambient_noise(source)
            audio_listened = r.listen(source)

        # try converting it to text
        try:
            rec = r.recognize_google(audio_listened)
            # write the output to the file
            fh.write(rec + ". ")

        # catch any errors
        except sr.UnknownValueError:
            print("Could not understand audio")

        except sr.RequestError as e:
            print("Could not request results. Check your internet connection")

        i += 1

    # move back out of the chunks directory
    os.chdir('..')
    fh.close()

if __name__ == '__main__':
    print('Enter the audio file path')
    path = input()
    silence_based_conversion(path)



The peacock is the national bird of India. They have colourful feathers, two legs and 
a small beak. They are famous for their dance. When a peacock dances it spreads its 
feathers like a fan. It has a long shiny dark blue neck. Peacocks are mostly found in 
the fields they are very beautiful birds. The females are known as 'Peahen'. Their 
feathers are used for making jackets, purses etc. We can see them in a zoo. 

How to convert large WAV file to text in Python?

Question from Stack Overflow

I already tried this code to convert my large wav file to text

import speech_recognition as sr
r = sr.Recognizer()

hellow = sr.AudioFile('hello_world.wav')
with hellow as source:
    audio = r.record(source)
try:
    s = r.recognize_google(audio)
    print("Text: "+s)
except Exception as e:
    print("Exception: "+str(e))

But it is not converting accurately; I think the reason is the 'US' accent. Please tell me how I can convert the whole large WAV file accurately.


Google's speech-to-text is very effective; try the link below.

You can choose the language (English US in your case) and also upload files.

As @bigdataolddriver commented, 100% accuracy is not possible yet, and achieving it will be worth millions.

Google Speech-to-Text has three types of APIs:

Synchronous, asynchronous, and streaming. The asynchronous API lets you convert up to ~480 minutes of audio, while the other two only allow ~1 minute. The following is sample code to do the conversion.

filepath = "~/audio_wav/"     #Input audio file path
output_filepath = "~/Transcripts/" #Final transcript path
bucketname = "callsaudiofiles" #Name of the bucket created in the step before

# Import libraries
from pydub import AudioSegment
import io
import os
from google.cloud import speech
from google.cloud.speech import enums
from google.cloud.speech import types
import wave
from google.cloud import storage

Speech-to-Text supports WAV files with LINEAR16 or MULAW encoded audio.

Below is the code to get the frame rate and channel count of a file.

def frame_rate_channel(audio_file_name):
    with wave.open(audio_file_name, "rb") as wave_file:
        frame_rate = wave_file.getframerate()
        channels = wave_file.getnchannels()
        return frame_rate, channels
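As a quick sanity check before uploading, you can also verify with the standard-library wave module that a file really is 16-bit PCM (what the API calls LINEAR16). This helper is a sketch and not part of the Google client library:

```python
import wave

def is_linear16(audio_file_name):
    # LINEAR16 means uncompressed 16-bit PCM samples,
    # i.e. a sample width of 2 bytes per sample
    with wave.open(audio_file_name, "rb") as wave_file:
        return wave_file.getsampwidth() == 2
```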

and the code below does the asynchronous conversion.

def google_transcribe(audio_file_name):

    # The name of the audio file to transcribe
    file_name = filepath + audio_file_name

    frame_rate, channels = frame_rate_channel(file_name)

    # the API expects single-channel audio, so mix stereo down to mono
    if channels > 1:
        sound = AudioSegment.from_wav(file_name)
        sound = sound.set_channels(1)
        sound.export(file_name, format="wav")

    bucket_name = bucketname
    source_file_name = filepath + audio_file_name
    destination_blob_name = audio_file_name

    upload_blob(bucket_name, source_file_name, destination_blob_name)

    gcs_uri = 'gs://' + bucketname + '/' + audio_file_name
    transcript = ''

    client = speech.SpeechClient()
    audio = types.RecognitionAudio(uri=gcs_uri)

    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=frame_rate,
        language_code='en-US')

    # Detects speech in the audio file
    operation = client.long_running_recognize(config, audio)
    response = operation.result(timeout=10000)

    for result in response.results:
        transcript += result.alternatives[0].transcript

    delete_blob(bucket_name, destination_blob_name)
    return transcript

and this is how you write them to a file:

def write_transcripts(transcript_filename, transcript):
    f = open(output_filepath + transcript_filename, "w+")
    f.write(transcript)
    f.close()

Kindly let me know if you need any further clarifications.

Speech recognition

Speech recognition is the process of converting sound into text. It is commonly used in voice assistants like Alexa, Siri, etc. Python provides a library called SpeechRecognition that allows us to convert audio to text for further processing. In this article, we will look at converting large or long audio files into text using the SpeechRecognition library in Python.

Processing large audio files

When the input file is a long audio file, speech recognition accuracy decreases. Moreover, the Google Speech Recognition API cannot recognize long audio files with good fidelity. Therefore, we need to process the audio file into smaller chunks and then pass those chunks to the API. This improves accuracy and allows large audio files to be recognized.

Splitting audio based on silence

One way to handle an audio file is to break it into chunks of constant size. For example, we can take a 10-minute audio file and split it into 60 chunks of 10 seconds each. We can then pass these snippets to the API and convert speech to text by concatenating the results. This method is imprecise: dividing an audio file into constant-sized chunks can cut sentences in the middle, and we can lose important words in the process, because a chunk may end before a word has been fully spoken and Google will not be able to recognize incomplete words.
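The constant-size strategy is easy to sketch. The helper below is a hypothetical, pure-Python illustration: it only computes the (start, end) millisecond ranges that you would then use to slice a pydub AudioSegment (which supports millisecond slicing like song[start:end]).

```python
def fixed_size_chunks(audio_len_ms, chunk_len_ms=10_000):
    """Return (start, end) millisecond ranges that cover the audio
    in constant-sized chunks; the last chunk may be shorter."""
    return [(start, min(start + chunk_len_ms, audio_len_ms))
            for start in range(0, audio_len_ms, chunk_len_ms)]
```

For a 10-minute file (600,000 ms) this yields 60 ranges of 10 seconds each, matching the example above.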

Another way is to split the audio file on silence. People pause briefly between sentences. If we split the audio file into chunks based on these pauses, we can process the file sentence by sentence and concatenate the results. This approach is more accurate than the previous one because we do not cut sentences in the middle: each chunk contains a whole sentence without interruptions. It also frees us from splitting the audio into chunks of constant length.

The disadvantage of this method is that it is hard to choose the silence duration used for splitting, because different users speak differently: some pause for a full second between sentences, while others pause for as little as 0.5 seconds.
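The silence-splitting idea can be sketched without pydub at all. The toy function below is an illustration, not pydub's actual implementation: given a sequence of per-frame loudness values in dBFS, it returns the non-silent regions, treating any run of at least min_silence_len quiet frames as a break between sentences.

```python
def nonsilent_regions(levels, silence_thresh=-16.0, min_silence_len=3):
    """Return (start, end) index ranges of non-silent runs, where a run
    ends once min_silence_len consecutive frames drop below the threshold."""
    regions = []
    start = None      # index where the current non-silent run began
    silent_run = 0    # length of the current run of quiet frames
    for i, level in enumerate(levels):
        if level >= silence_thresh:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_len:
                # enough silence: close the chunk, excluding the quiet tail
                regions.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        regions.append((start, len(levels) - silent_run))
    return regions
```

Note that pauses shorter than min_silence_len stay inside a chunk, which is exactly why the choice of minimum silence length matters: too low and sentences get cut at every breath, too high and several sentences merge into one chunk.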

How to Convert Speech to Text in Python

Speech recognition is the ability of computer software to recognize words and sentences in spoken language and convert them into human-readable text. This tutorial will show you how to convert speech to text in Python using the SpeechRecognition library.

As a result, we don’t have to build a machine learning model from scratch. This library provides us with handy wrappers for various popular public speech recognition APIs (like Google Cloud Speech API, IBM Speech To Text, etc.).

Okay, let's get started. Install the library using pip:

pip3 install SpeechRecognition pydub

Okay, open up a new Python file and import it:

import speech_recognition as sr

The nice thing about this library is that it supports several recognition engines:

  • CMU Sphinx (offline)
  • Google Speech Recognition
  • Google Cloud Speech API
  • Microsoft Bing Voice Recognition
  • Houndify API
  • IBM Speech To Text
  • Snowboy Hotword Detection (offline)

We are going to use Google Speech Recognition here, as it's straightforward and doesn't require any API key.

    Reading from a File

    Make sure you have an audio file in the current directory that contains English speech (if you want to follow along, get the audio file here):

    filename = "16-122828-0002.wav"

    This file was taken from the LibriSpeech dataset, but you can use any WAV audio file you want; just change the file name. Now let's initialize our speech recognizer:

    # initialize the recognizer
    r = sr.Recognizer()

    The below code is responsible for loading the audio file, and converting the speech into text using Google Speech Recognition:

    # open the file
    with sr.AudioFile(filename) as source:
        # listen for the data (load audio to memory)
        audio_data = r.record(source)
        # recognize (convert from speech to text)
        text = r.recognize_google(audio_data)

    This will take a few seconds to finish, as it uploads the file to Google and grabs the output. Here is my result:

    I believe you’re just talking nonsense

    The above code works well for small or medium-sized audio files. In the next section, we will write code for large files.

    Reading Large Audio Files

    If you want to perform speech recognition of a long audio file, then the below function handles that quite well:

    # importing libraries
    import speech_recognition as sr
    import os
    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    # create a speech recognition object
    r = sr.Recognizer()

    # a function that splits the audio file into chunks
    # and applies speech recognition
    def get_large_audio_transcription(path):
        """
        Splitting the large audio file into chunks
        and apply speech recognition on each of these chunks
        """
        # open the audio file using pydub
        sound = AudioSegment.from_wav(path)
        # split audio sound where silence is 500 milliseconds or more and get chunks
        chunks = split_on_silence(sound,
            # experiment with this value for your target audio file
            min_silence_len = 500,
            # adjust this per requirement
            silence_thresh = sound.dBFS-14,
            # keep 500 ms of silence at the chunk boundaries, adjustable as well
            keep_silence = 500,
        )
        folder_name = "audio-chunks"
        # create a directory to store the audio chunks
        if not os.path.isdir(folder_name):
            os.mkdir(folder_name)
        whole_text = ""
        # process each chunk
        for i, audio_chunk in enumerate(chunks, start=1):
            # export audio chunk and save it in
            # the 'folder_name' directory.
            chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
            audio_chunk.export(chunk_filename, format="wav")
            # recognize the chunk
            with sr.AudioFile(chunk_filename) as source:
                audio_listened = r.record(source)
                # try converting it to text
                try:
                    text = r.recognize_google(audio_listened)
                except sr.UnknownValueError as e:
                    print("Error:", str(e))
                else:
                    text = f"{text.capitalize()}. "
                    print(chunk_filename, ":", text)
                    whole_text += text
        # return the text for all chunks detected
        return whole_text

    Note: You need to install Pydub using pip for the above code to work.

    The above function uses the split_on_silence() function from the pydub.silence module to split the audio data into chunks on silence. The min_silence_len parameter is the minimum length of a silence, in milliseconds, to be used for a split.

    silence_thresh is the threshold below which audio is considered silence; I have set it to the average dBFS minus 14. The keep_silence argument is the amount of silence, in milliseconds, to leave at the beginning and end of each detected chunk.

    These parameters won't be perfect for all sound files; experiment with them for your own large audio files.
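For intuition about the dBFS scale that silence_thresh uses, here is a minimal sketch (an illustration assuming 16-bit samples, not pydub's actual code) of how such a loudness value is computed: the RMS of the samples relative to the maximum possible amplitude, on a logarithmic scale.

```python
import math

def dbfs(samples, max_amplitude=32767):
    """Loudness of a block of 16-bit samples in dBFS:
    0.0 is full scale, more negative is quieter."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(rms / max_amplitude)
```

A full-scale signal gives 0.0 dBFS, while quiet passages sit far below zero, which is why a threshold of "average dBFS minus 14" adapts to the overall recording level instead of using one fixed number for every file.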

    After that, we iterate over all the chunks, convert each one to text, and concatenate the results. Here is an example run:

    path = "7601-291468-0006.wav"
    print("Full text:", get_large_audio_transcription(path))

    Note: You can get 7601-291468-0006.wav file here.


    audio-chunks/chunk1.wav : His abode which you had fixed in a bowery or country seat. 
    audio-chunks/chunk2.wav : At a short distance from the city. 
    audio-chunks/chunk3.wav : Just at what is now called dutch street. 
    audio-chunks/chunk4.wav : Sooner bounded with proofs of his ingenuity. 
    audio-chunks/chunk5.wav : Patent smokejacks. 
    audio-chunks/chunk6.wav : It required a horse to work some. 
    audio-chunks/chunk7.wav : Dutch oven roasted meat without fire. 
    audio-chunks/chunk8.wav : Carts that went before the horses. 
    audio-chunks/chunk9.wav : Weather cox that turned against the wind and other wrongheaded contrivances. 
    audio-chunks/chunk10.wav : So just understand can found it all beholders. 
    Full text: His abode which you had fixed in a bowery or country seat. At a short distance from the city. Just at what is now called dutch street. Sooner bounded with proofs of his ingenuity. Patent smokejacks. It required a horse to work some. Dutch oven roasted meat without fire. Carts that went before the horses. Weather cox that turned against the wind and other wrongheaded contrivances. So just understand can found it all beholders.

    So, this function automatically creates a folder for us, puts the chunks of the specified audio file in it, and then runs speech recognition on all of them.

    Reading from the Microphone

    This requires PyAudio to be installed on your machine; here is the installation process depending on your operating system:


    On Windows, you can just pip install it:

    pip3 install pyaudio


    On Linux (Debian-based), you need to first install the dependencies:

    sudo apt-get install python-pyaudio python3-pyaudio
    pip3 install pyaudio


    On macOS, you need to first install portaudio; then you can just pip install it:

    brew install portaudio
    pip3 install pyaudio

    Now let’s use our microphone to convert our speech:

    with sr.Microphone() as source:
        # read the audio data from the default microphone
        audio_data = r.record(source, duration=5)
        # convert speech to text
        text = r.recognize_google(audio_data)

    This will listen to your microphone for 5 seconds and then try to convert that speech into text!

    It is pretty similar to the previous code, but here we use the Microphone() object to read audio from the default microphone, and the duration parameter of record() stops the reading after 5 seconds; the audio data is then uploaded to Google to get the output text.

    You can also use the offset parameter in the record() function to start recording after offset seconds.

    Also, you can recognize different languages by passing the language parameter to the recognize_google() function. For instance, if you want to recognize Spanish speech, you would use:

    text = r.recognize_google(audio_data, language="es-ES")