Python | Speech recognition on large audio files

Speech recognition is the process of converting spoken audio to text. It is commonly used in voice assistants such as Alexa and Siri. The SpeechRecognition library for Python lets us convert audio to text for further processing. In this article, we will look at converting large or long audio files to text using the SpeechRecognition library in Python.

Processing large audio files

When the input file is a long audio file, speech recognition accuracy decreases. Moreover, the Google Speech Recognition API cannot recognize long audio files with good fidelity. Therefore, we need to process the audio file into smaller chunks and then pass those chunks to the API. This improves accuracy and allows large audio files to be recognized.

Splitting audio based on silence

One way to handle a long audio file is to split it into chunks of constant size. For example, we can take a 10-minute audio file and split it into 60 chunks of 10 seconds each. We can then pass these chunks to the API and build the full text by concatenating the results for all of them. This method is inaccurate. Splitting an audio file into constant-sized chunks can cut sentences in the middle, and we may lose important words in the process. This is because a chunk may end before a word is fully spoken, and Google will not be able to recognize incomplete words.
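
The fixed-size approach can be sketched as follows. This is a minimal illustration and not code from the article: `fixed_size_spans` is a hypothetical helper that only computes chunk boundaries in milliseconds. With pydub, where slicing an AudioSegment works in milliseconds, each span could then be cut out with `song[start:end]`.

```python
def fixed_size_spans(total_ms, chunk_ms):
    """Return (start, end) millisecond pairs covering total_ms in chunk_ms steps."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    spans = []
    for start in range(0, total_ms, chunk_ms):
        # The last chunk may be shorter than chunk_ms.
        spans.append((start, min(start + chunk_ms, total_ms)))
    return spans


# A 10-minute file cut into 10-second pieces gives 60 spans.
spans = fixed_size_spans(10 * 60 * 1000, 10 * 1000)
print(len(spans), spans[0], spans[-1])
```

Note that the boundaries depend only on the clock, not on what is being said, which is why a word can end up cut in two.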

Another way is to split the audio file on silence. People pause for a short time between sentences. If we can split the audio file into chunks at these silences, we can process the file sentence by sentence and concatenate the results. This approach is more accurate than the previous one because we do not cut sentences in the middle, and each audio chunk contains an entire sentence without interruption. This way, we don't need to split the file into chunks of constant length.
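
To make the idea concrete, here is a simplified sketch of splitting on silence. It is not pydub's implementation: `non_silent_spans` is a hypothetical helper, and it assumes the audio has already been reduced to one loudness value (in dBFS) per frame. A run of quiet frames only counts as a pause if it is long enough.

```python
def non_silent_spans(levels_db, silence_thresh=-16, min_silence_len=3):
    """Return (start, end) frame index pairs of speech separated by long pauses.

    levels_db: loudness per frame in dBFS (hypothetical input).
    A run of at least min_silence_len frames quieter than silence_thresh
    counts as a pause; shorter quiet runs stay inside the current chunk.
    """
    spans = []
    start = None      # first frame of the current chunk, or None
    last_loud = None  # last loud frame seen in the current chunk
    quiet_run = 0     # length of the current run of quiet frames
    for i, level in enumerate(levels_db):
        if level >= silence_thresh:
            if start is None:
                start = i
            last_loud = i
            quiet_run = 0
        else:
            quiet_run += 1
            if start is not None and quiet_run >= min_silence_len:
                # The pause is long enough: close the current chunk.
                spans.append((start, last_loud + 1))
                start = None
    if start is not None:
        spans.append((start, last_loud + 1))
    return spans


levels = [-10, -12, -40, -11, -40, -40, -40, -9, -10]
print(non_silent_spans(levels))
```

In this example the single quiet frame at index 2 is too short to split on, while the three quiet frames at indices 4 to 6 separate the two chunks.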

The disadvantage of this method is that it is difficult to choose the silence duration to split on, because different users speak differently: some may pause for 1 second between sentences, while others may pause for as little as 0.5 seconds.
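
The effect of this choice can be illustrated with a small sketch. The pause lengths below are made-up values for two hypothetical speakers, and `count_splits` is a hypothetical helper that counts how many pauses would trigger a split for a given minimum silence length.

```python
def count_splits(pause_lengths_ms, min_silence_len):
    """Count the pauses long enough to trigger a split."""
    return sum(1 for pause in pause_lengths_ms if pause >= min_silence_len)


# Hypothetical pause lengths in milliseconds; the last value in each list
# is a short hesitation inside a sentence, not a sentence boundary.
slow_speaker = [1000, 1100, 1000, 200]
fast_speaker = [500, 550, 500, 150]

for min_silence_len in (1000, 500, 100):
    print(min_silence_len,
          count_splits(slow_speaker, min_silence_len),
          count_splits(fast_speaker, min_silence_len))
```

With a minimum of 1000 ms the fast speaker is never split, while with 100 ms both speakers are also split at hesitations inside a sentence. A value around 500 ms separates the sentences for both of these made-up speakers.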

Libraries required

Pydub:
  sudo pip3 install pydub

Speech recognition:
  sudo pip3 install SpeechRecognition

Example:

Input:
  peacock.wav

Output:
  exporting chunk0.wav
  Processing chunk 0
  exporting chunk1.wav
  Processing chunk 1
  exporting chunk2.wav
  Processing chunk 2
  exporting chunk3.wav
  Processing chunk 3
  exporting chunk4.wav
  Processing chunk 4
  exporting chunk5.wav
  Processing chunk 5
  exporting chunk6.wav
  Processing chunk 6

Code:

# import libraries
import os

import speech_recognition as sr
from pydub import AudioSegment
from pydub.silence import split_on_silence


# function that breaks the audio file into chunks
# and applies speech recognition
def silence_based_conversion(path="alice-medium.wav"):

    # open the audio file saved in the
    # local system as a wav file.
    song = AudioSegment.from_wav(path)

    # open the file in which we will collect
    # and save the recognized text
    fh = open("recognized.txt", "w+")

    # split the track where silence lasts 0.5 seconds
    # or more and get the chunks
    chunks = split_on_silence(
        song,
        # must be silent for at least 0.5 seconds
        # or 500 ms. adjust this value to suit the
        # speaker. if the speaker pauses for longer,
        # increase this value. otherwise reduce it.
        min_silence_len=500,
        # consider it silent if quieter than -16 dBFS
        # adjust as required
        silence_thresh=-16,
    )

    # create a directory to store the audio chunks.
    try:
        os.mkdir("audio_chunks")
    except FileExistsError:
        pass

    # move into the directory to
    # store the audio files.
    os.chdir("audio_chunks")

    i = 0
    # process each chunk
    for chunk in chunks:

        # create a 10 ms block of silence
        chunk_silent = AudioSegment.silent(duration=10)

        # add the silence to the beginning and the
        # end of the audio chunk. This is done so that
        # it doesn't sound abruptly sliced.
        audio_chunk = chunk_silent + chunk + chunk_silent

        # export the audio chunk and save it in
        # the current directory.
        print("exporting chunk{0}.wav".format(i))
        # specify a bitrate of 192k
        audio_chunk.export("./chunk{0}.wav".format(i), bitrate="192k", format="wav")

        # the name of the newly created chunk
        filename = "chunk" + str(i) + ".wav"

        print("Processing chunk " + str(i))

        # keep the name of the newly created chunk
        # in a variable for later use.
        file = filename

        # create a speech recognition object
        r = sr.Recognizer()

        # recognize the chunk
        with sr.AudioFile(file) as source:
            # remove this if it does not work
            # correctly.
            r.adjust_for_ambient_noise(source)
            audio_listened = r.listen(source)

        try:
            # try converting it to text
            rec = r.recognize_google(audio_listened)
            # write the output to the file.
            fh.write(rec + ". ")

        # catch any errors.
        except sr.UnknownValueError:
            print("Could not understand audio")

        except sr.RequestError as e:
            print("Could not request results. check your internet connection")

        i += 1

    # close the output file and return to the original directory.
    fh.close()
    os.chdir("..")


if __name__ == "__main__":

    print("Enter the audio file path")

    path = input()

    silence_based_conversion(path)


recognized.txt:
  The peacock is the national bird of India. They have colorful feathers, two legs and a small beak. They are famous for their dance. When a peacock dances it spreads its feathers like a fan. It has a long shiny dark blue neck. Peacocks are mostly found in the fields they are very beautiful birds. The females are known as 'Peahen'. Their feathers are used for making jackets, purses etc. We can see them in a zoo.
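
The way the code builds recognized.txt, appending each recognized chunk followed by ". " and skipping chunks that could not be recognized, can be sketched in isolation. `join_transcripts` is a hypothetical helper, and it assumes a failed chunk is represented by None. The sample strings are illustrative only.

```python
def join_transcripts(results):
    """Join per-chunk results, skipping chunks that were not recognized.

    results: list of recognized strings, with None for a chunk that failed.
    """
    return "".join(text + ". " for text in results if text)


chunks_text = [
    "the peacock is the national bird of India",
    None,  # a chunk the recognizer could not understand
    "they are famous for their dance",
]
print(join_transcripts(chunks_text))
```

Because each chunk is expected to hold one sentence, the period added after every chunk is what gives the output file its sentence breaks.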