Python | Speech recognition on large audio files

Michael Zippo 18.07.2021

Libraries required

Pydub: sudo pip3 install pydub
Speech recognition: sudo pip3 install SpeechRecognition

Example:

Input:  peacock.wav

Output:

exporting chunk0.wav
Processing chunk 0
exporting chunk1.wav
Processing chunk 1
exporting chunk2.wav
Processing chunk 2
exporting chunk3.wav
Processing chunk 3
exporting chunk4.wav
Processing chunk 4
exporting chunk5.wav
Processing chunk 5
exporting chunk6.wav
Processing chunk 6

Python Code:

# importing libraries
import speech_recognition as sr
  
import os
  
from pydub import AudioSegment
from pydub.silence import split_on_silence
  
# a function that splits the audio file into chunks
# and applies speech recognition
def silence_based_conversion(path = "alice-medium.wav"):
  
    # open the audio file stored in
    # the local system as a wav file.
    song = AudioSegment.from_wav(path)
  
    # open a file where we will concatenate  
    # and store the recognized text
    fh = open("recognized.txt", "w+")
          
    # split track where silence is 0.5 seconds 
    # or more and get chunks
    chunks = split_on_silence(song,
        # must be silent for at least 0.5 seconds
        # or 500 ms. adjust this value based on user
        # requirement. if the speaker stays silent for 
        # longer, increase this value. else, decrease it.
        min_silence_len = 500,
  
        # consider it silent if quieter than -16 dBFS
        # adjust this per requirement
        silence_thresh = -16
    )
  
    # create a directory to store the audio chunks.
    try:
        os.mkdir('audio_chunks')
    except FileExistsError:
        pass

    # move into the directory to
    # store the audio files.
    os.chdir('audio_chunks')
  
    i = 0
    # process each chunk
    for chunk in chunks:
              
        # Create a 10 ms silence chunk
        chunk_silent = AudioSegment.silent(duration = 10)

        # add 10 ms of silence to the beginning and
        # end of the audio chunk. This is done so that
        # it doesn't seem abruptly sliced.
        audio_chunk = chunk_silent + chunk + chunk_silent
  
        # export audio chunk and save it in
        # the current directory.
        print("exporting chunk{0}.wav".format(i))
        # specify the bitrate to be 192 k
        audio_chunk.export("./chunk{0}.wav".format(i), bitrate='192k', format="wav")

        # the name of the newly created chunk
        filename = 'chunk' + str(i) + '.wav'
  
        print("Processing chunk "+str(i))
  
        # get the name of the newly created chunk
        # in the AUDIO_FILE variable for later use.
        file = filename
  
        # create a speech recognition object
        r = sr.Recognizer()
  
        # recognize the chunk
        with sr.AudioFile(file) as source:
            # remove this if it is not working
            # correctly.
            r.adjust_for_ambient_noise(source)
            audio_listened = r.listen(source)
  
        try:
            # try converting it to text
            rec = r.recognize_google(audio_listened)
            # write the output to the file.
            fh.write(rec+". ")
  
        # catch any errors.
        except sr.UnknownValueError:
            print("Could not understand audio")
  
        except sr.RequestError as e:
            print("Could not request results. check your internet connection")
  
        i += 1
  
    # return to the original directory and close the transcript file
    os.chdir('..')
    fh.close()


if __name__ == '__main__':

    print('Enter the audio file path')
  
    path = input()
  
    silence_based_conversion(path)

Output:

recognized.txt:

The peacock is the national bird of India. They have colourful feathers, two legs and 
a small beak. They are famous for their dance. When a peacock dances it spreads its 
feathers like a fan. It has a long shiny dark blue neck. Peacocks are mostly found in 
the fields they are very beautiful birds. The females are known as 'Peahen'. Their 
feathers are used for making jackets, purses etc. We can see them in a zoo.

How to convert large WAV file to text in Python?

Question from StackOverFlow

I already tried this code to convert my large wav file to text

import speech_recognition as sr
r = sr.Recognizer()

hellow = sr.AudioFile('hello_world.wav')
with hellow as source:
    audio = r.record(source)
try:
    s = r.recognize_google(audio)
    print("Text: "+s)
except Exception as e:
    print("Exception: "+str(e))

But it does not convert the audio accurately; I suspect the reason is the 'US' accent. Please tell me how I can convert a whole large WAV file accurately.

Answer:

Google’s speech to text is very effective; try the link below:

https://cloud.google.com/speech-to-text/

You can choose the language (English US in your case) and also upload files.

As @bigdataolddriver commented, 100% accuracy is not possible yet, and would be worth millions.

Google Speech-to-Text has three types of API: synchronous, asynchronous, and streaming. The asynchronous API allows conversion of up to ~480 minutes of audio, while the others only allow ~1 minute. The following is sample code for the conversion.

filepath = "~/audio_wav/"     #Input audio file path
output_filepath = "~/Transcripts/" #Final transcript path
bucketname = "callsaudiofiles" #Name of the bucket created in the step before

# Import libraries
from pydub import AudioSegment
import io
import os
from google.cloud import speech
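# the enums and types modules exist only in google-cloud-speech
# releases before 2.0; newer releases expose these classes on speech itself
from google.cloud.speech import enums
from google.cloud.speech import types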
import wave
from google.cloud import storage

Speech-to-Text supports WAV files with LINEAR16 or MULAW encoded audio.

The code below gets the frame rate and the number of channels.

def frame_rate_channel(audio_file_name):
    with wave.open(audio_file_name, "rb") as wave_file:
        frame_rate = wave_file.getframerate()
        channels = wave_file.getnchannels()
        return frame_rate,channels
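
As a quick check, the helper can be tried on a small synthetic WAV file written with the standard-library wave module. This is only a sketch: the file name tone.wav and the write_test_wav helper are made up for the illustration and are not part of the original answer.

```python
import math
import os
import struct
import tempfile
import wave


def frame_rate_channel(audio_file_name):
    # same helper as above, repeated so this snippet runs on its own
    with wave.open(audio_file_name, "rb") as wave_file:
        return wave_file.getframerate(), wave_file.getnchannels()


def write_test_wav(path, frame_rate=16000, channels=1, seconds=1):
    # hypothetical helper: writes a 440 Hz sine tone as 16-bit PCM
    with wave.open(path, "wb") as wave_file:
        wave_file.setnchannels(channels)
        wave_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wave_file.setframerate(frame_rate)
        frames = bytearray()
        for n in range(frame_rate * seconds):
            sample = int(10000 * math.sin(2 * math.pi * 440 * n / frame_rate))
            # repeat the same sample for every channel
            frames += struct.pack("<h", sample) * channels
        wave_file.writeframes(bytes(frames))


wav_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_test_wav(wav_path, frame_rate=16000, channels=2)
print(frame_rate_channel(wav_path))  # (16000, 2)
```

A result with more than one channel is the case in which google_transcribe converts the file to mono before uploading it.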

The code below does the asynchronous conversion. It calls three helpers (stereo_to_mono, upload_blob and delete_blob) that are not shown in this answer.

def google_transcribe(audio_file_name):

    file_name = filepath + audio_file_name

    # The name of the audio file to transcribe

    frame_rate, channels = frame_rate_channel(file_name)

    if channels > 1:
        stereo_to_mono(file_name)

    bucket_name = bucketname
    source_file_name = filepath + audio_file_name
    destination_blob_name = audio_file_name

    upload_blob(bucket_name, source_file_name, destination_blob_name)

    gcs_uri = 'gs://' + bucketname + '/' + audio_file_name
    transcript = ''

    client = speech.SpeechClient()
    audio = types.RecognitionAudio(uri=gcs_uri)

    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=frame_rate,
        language_code='en-US')

    # Detects speech in the audio file
    operation = client.long_running_recognize(config, audio)
    response = operation.result(timeout=10000)

    for result in response.results:
        transcript += result.alternatives[0].transcript

    delete_blob(bucket_name, destination_blob_name)
    return transcript

This is how you write the transcripts to a file:

def write_transcripts(transcript_filename, transcript):
    # the with block closes the file even if the write fails
    with open(output_filepath + transcript_filename, "w+") as f:
        f.write(transcript)

Kindly let me know if you need any further clarifications.

Speech recognition is the process of converting spoken audio to text. It is commonly used in voice assistants like Alexa, Siri, etc. Python has a library called SpeechRecognition that allows us to convert audio to text for further processing. In this article, we will look at converting large or long audio files to text using the SpeechRecognition library in Python.

Processing large audio files

When the input is a long audio file, speech recognition accuracy decreases. Moreover, the Google Speech Recognition API cannot recognize long audio files with good fidelity. Therefore, we need to split the audio file into smaller chunks and then pass those chunks to the API. This improves accuracy and allows large audio files to be recognized.

Splitting audio based on silence

One way to handle an audio file is to break it down into chunks of constant size. For example, we can take an audio file 10 minutes long and split it into 60 pieces of 10 seconds each. We can then pass these snippets to the API and convert speech to text by concatenating the results of all these snippets. This method is imprecise. Dividing an audio file into constant-sized chunks can cut sentences in the middle, and we can lose some important words in the process. This is because a chunk may end before a word is fully spoken, and Google will not be able to recognize incomplete words.
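
The arithmetic of constant-size chunking can be sketched in a few lines. The function name fixed_size_chunks is made up for this illustration; it only computes the cut points in milliseconds and does not touch any audio.

```python
def fixed_size_chunks(total_ms, chunk_ms):
    """Return (start, end) millisecond pairs covering the whole recording."""
    return [(start, min(start + chunk_ms, total_ms))
            for start in range(0, total_ms, chunk_ms)]


# a 10-minute recording cut into 10-second pieces
boundaries = fixed_size_chunks(total_ms=10 * 60 * 1000, chunk_ms=10 * 1000)
print(len(boundaries))   # 60
print(boundaries[:2])    # [(0, 10000), (10000, 20000)]
```

With pydub, each pair could be applied as a slice such as sound[start:end], since AudioSegment slices are expressed in milliseconds. The cut points ignore what is being said, which is why a word can end up split across two chunks.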

Another way is to split the audio file on silence. People pause for a short time between sentences. If we can split the audio file into chunks based on these pauses, then we can process the file sentence by sentence and combine the results. This approach is more accurate than the previous one because we do not cut sentences in the middle: each audio chunk contains an entire sentence without interruption. This way we don’t need to split the file into chunks of constant length.

The disadvantage of this method is that it is difficult to choose the duration of silence to split on, because different people speak differently: some may pause for 1 second between sentences, while others may pause for as little as 0.5 seconds.
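
The idea, and its sensitivity to the chosen pause length, can be sketched in plain Python on a list of loudness values. This is a simplified illustration, not pydub's implementation; the function name split_on_quiet_runs and the toy numbers are assumptions made for the example.

```python
def split_on_quiet_runs(levels, min_silence_len, silence_thresh):
    """Split a list of loudness values into chunks separated by long quiet runs.

    levels: one loudness value per time step (for example, per millisecond)
    min_silence_len: how many consecutive quiet steps count as a pause
    silence_thresh: values below this are treated as silence
    """
    chunks = []
    current = []    # steps collected for the chunk being built
    quiet_run = []  # quiet steps seen since the last loud step
    for level in levels:
        if level < silence_thresh:
            quiet_run.append(level)
            continue
        if len(quiet_run) >= min_silence_len:
            # the pause was long enough: close the current chunk
            if current:
                chunks.append(current)
            current = []
        else:
            # a short pause stays inside the chunk
            current.extend(quiet_run)
        quiet_run = []
        current.append(level)
    if current:
        chunks.append(current)
    return chunks


levels = [5, 6, 0, 7, 0, 0, 0, 8, 9]
print(split_on_quiet_runs(levels, min_silence_len=3, silence_thresh=1))
# [[5, 6, 0, 7], [8, 9]]
print(split_on_quiet_runs(levels, min_silence_len=1, silence_thresh=1))
# [[5, 6], [7], [8, 9]]
```

The same input yields two chunks or three depending only on min_silence_len, which is the tuning problem described above: a value that suits one speaker can cut another speaker's sentences apart.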

How to Convert Speech to Text in Python

Speech recognition is the ability of computer software to recognize words and sentences in spoken language and convert them into human-readable text. This tutorial will show you how to convert speech to text in Python using the SpeechRecognition library.

With this library, we don’t have to build a machine learning model from scratch: it provides handy wrappers for various popular public speech recognition APIs (like Google Cloud Speech API, IBM Speech To Text, etc.).

Okay, let’s get started by installing the library using pip:

pip3 install SpeechRecognition pydub

Okay, open up a new Python file and import it:

import speech_recognition as sr

The nice thing about this library is it supports several recognition engines:

CMU Sphinx (offline)

Google Speech Recognition

Google Cloud Speech API

Wit.ai

Microsoft Bing Voice Recognition

Houndify API

IBM Speech To Text

Snowboy Hotword Detection (offline)

We are going to use Google Speech Recognition here, as it’s straightforward and doesn’t require any API key.

Reading from a File

Make sure you have an audio file in the current directory that contains English speech (if you want to follow along with me, get the audio file here):

filename = "16-122828-0002.wav"

This file was taken from the LibriSpeech dataset, but you can use any WAV audio file you want; just change the name of the file. Let’s initialize our speech recognizer:

# initialize the recognizer
r = sr.Recognizer()

The below code is responsible for loading the audio file, and converting the speech into text using Google Speech Recognition:

# open the file
with sr.AudioFile(filename) as source:
    # listen for the data (load audio to memory)
    audio_data = r.record(source)
    # recognize (convert from speech to text)
    text = r.recognize_google(audio_data)
    print(text)

This will take a few seconds to finish, as it uploads the file to Google and fetches the output. Here is my result:

I believe you’re just talking nonsense

The above code works well for small or medium-sized audio files. In the next section, we will write code for large files.

Reading Large Audio Files

If you want to perform speech recognition of a long audio file, then the below function handles that quite well:

# importing libraries 
import speech_recognition as sr 
import os 
from pydub import AudioSegment
from pydub.silence import split_on_silence

# create a speech recognition object
r = sr.Recognizer()

# a function that splits the audio file into chunks
# and applies speech recognition
def get_large_audio_transcription(path):
    """
    Splitting the large audio file into chunks
    and apply speech recognition on each of these chunks
    """
    # open the audio file using pydub
    sound = AudioSegment.from_wav(path)  
    # split audio sound where silence is 500 milliseconds or more and get chunks
    chunks = split_on_silence(sound,
        # experiment with this value for your target audio file
        min_silence_len = 500,
        # adjust this per requirement
        silence_thresh = sound.dBFS-14,
        # keep 500 ms of silence around each chunk, adjustable as well
        keep_silence=500,
    )
    folder_name = "audio-chunks"
    # create a directory to store the audio chunks
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)
    whole_text = ""
    # process each chunk 
    for i, audio_chunk in enumerate(chunks, start=1):
        # export audio chunk and save it in
        # the ’folder_name’ directory.
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")
        # recognize the chunk
        with sr.AudioFile(chunk_filename) as source:
            audio_listened = r.record(source)
            # try converting it to text
            try:
                text = r.recognize_google(audio_listened)
            except sr.UnknownValueError as e:
                print("Error:", str(e))
            else:
                text = f"{text.capitalize()}. "
                print(chunk_filename, ":", text)
                whole_text += text
    # return the text for all chunks detected
    return whole_text

Note: You need to install Pydub using pip for the above code to work.

The above function uses the split_on_silence() function from the pydub.silence module to split audio data into chunks on silence. The min_silence_len parameter is the minimum length of silence, in milliseconds, to be used for a split.

silence_thresh is the threshold below which anything quieter is considered silence; I have set it to the average dBFS minus 14. The keep_silence argument is the amount of silence, in milliseconds, to leave at the beginning and the end of each detected chunk.

These parameters won’t be perfect for all sound files, so experiment with them on your own large audio files.
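
To make the "average dBFS minus 14" threshold concrete, here is a small worked sketch. It assumes the usual definition of dBFS for amplitude values, 20 * log10(rms / full scale), with 32768 as full scale for 16-bit audio; the dbfs function and the sample values are made up for the illustration.

```python
import math


def dbfs(samples, max_amplitude=32768):
    """Average loudness of 16-bit samples in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        # digital silence has no finite level
        return float("-inf")
    return 20 * math.log10(rms / max_amplitude)


# a made-up signal at roughly one tenth of full scale
samples = [3276, -3276] * 100
average = dbfs(samples)
silence_thresh = average - 14
print(round(average, 1))         # -20.0
print(round(silence_thresh, 1))  # -34.0
```

Because the threshold is tied to the recording's own average level, a quiet recording and a loud one are both split relative to how loud their speech actually is, rather than against a fixed value.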

After splitting, we iterate over all chunks, convert each one to text, and join the results. Here is an example run:

path = "7601-291468-0006.wav"
print("\nFull text:", get_large_audio_transcription(path))

Note: You can get 7601-291468-0006.wav file here.

Output:

audio-chunks\chunk1.wav : His abode which you had fixed in a bowery or country seat. 
audio-chunks\chunk2.wav : At a short distance from the city. 
audio-chunks\chunk3.wav : Just at what is now called dutch street. 
audio-chunks\chunk4.wav : Sooner bounded with proofs of his ingenuity. 
audio-chunks\chunk5.wav : Patent smokejacks. 
audio-chunks\chunk6.wav : It required a horse to work some. 
audio-chunks\chunk7.wav : Dutch oven roasted meat without fire. 
audio-chunks\chunk8.wav : Carts that went before the horses. 
audio-chunks\chunk9.wav : Weather cox that turned against the wind and other wrongheaded contrivances. 
audio-chunks\chunk10.wav : So just understand can found it all beholders.

Full text: His abode which you had fixed in a bowery or country seat. At a short distance from the city. Just at what is now called dutch street. Sooner bounded with proofs of his ingenuity. Patent smokejacks. It required a horse to work some. Dutch oven roasted meat without fire. Carts that went before the horses. Weather cox that turned against the wind and other wrongheaded contrivances. So just understand can found it all beholders.

So, this function automatically creates a folder, stores the chunks of the original audio file in it, and then runs speech recognition on each of them.

Reading from the Microphone

This requires PyAudio to be installed on your machine. Here is the installation process, depending on your operating system:

Windows

You can just pip install it:

pip3 install pyaudio

Linux

You need to first install the dependencies:

sudo apt-get install python-pyaudio python3-pyaudio
pip3 install pyaudio

MacOS

You need to first install portaudio, then you can just pip install it:

brew install portaudio
pip3 install pyaudio

Now let’s use our microphone to convert our speech:

with sr.Microphone() as source:
    # read the audio data from the default microphone
    audio_data = r.record(source, duration=5)
    print("Recognizing...")
    # convert speech to text
    text = r.recognize_google(audio_data)
    print(text)

This will listen to your microphone for 5 seconds and then try to convert that speech into text!

It is pretty similar to the previous code, but here we use the Microphone() object to read audio from the default microphone. The duration parameter of the record() function stops reading after 5 seconds; the audio data is then uploaded to Google to get the output text.

You can also use the offset parameter of the record() function to start recording after offset seconds.

Also, you can recognize different languages by passing the language parameter to the recognize_google() function. For instance, if you want to recognize Spanish speech, you would use:

text = r.recognize_google(audio_data, language="es-ES")
