Audio processing using Pydub and the Google Speech Recognition API



Why break the audio file into chunks?

Processing a long audio file in one pass takes a long time. Here, "processing" can mean anything: for example, we may want to raise or lower the frequency of the audio or, as in this article, recognize the speech it contains. Breaking the file into small audio files called chunks keeps each processing step short and fast.

Required Installations:

 pip3 install pydub
 pip3 install audioread
 pip3 install SpeechRecognition

There are two main steps in the program.

Step # 1: Slice the audio file into small chunks at regular intervals. Slicing can be done with or without overlap. With overlap, each new chunk starts a fixed amount of time before the previous chunk ended, so a word that is cut off at a chunk boundary is still captured whole in the next chunk. For example, if the audio file is 22 seconds long, the interval is 5 seconds, and the overlap is 1.5 seconds, the chunks will be:

 chunk1: 0 - 5 seconds
 chunk2: 3.5 - 8.5 seconds
 chunk3: 7 - 12 seconds
 chunk4: 10.5 - 15.5 seconds
 chunk5: 14 - 19 seconds
 chunk6: 17.5 - 22 seconds

We can ignore this overlap by setting the overlap to 0.
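The chunk timings above follow from simple arithmetic, which can be sketched as a small helper. The function name `chunk_bounds` and its parameters are hypothetical, introduced only for illustration; the sketch assumes all values are in milliseconds and that the overlap is shorter than the interval.

```python
def chunk_bounds(total_ms, interval_ms, overlap_ms=0):
    """Return (start, end) pairs in milliseconds covering total_ms.

    Hypothetical helper: each chunk is interval_ms long and starts
    overlap_ms before the previous chunk ended; the last chunk is
    clipped to the end of the audio.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    if not 0 <= overlap_ms < interval_ms:
        raise ValueError("overlap_ms must be >= 0 and shorter than interval_ms")

    bounds = []
    start = 0
    while start < total_ms:
        end = min(start + interval_ms, total_ms)
        bounds.append((start, end))
        if end >= total_ms:
            break
        # The next chunk starts overlap_ms before this one ended.
        start = end - overlap_ms
    return bounds


# 22 seconds of audio, 5 second interval, 1.5 second overlap
for number, (start, end) in enumerate(chunk_bounds(22000, 5000, 1500), 1):
    print(f"chunk{number}: {start / 1000:g} - {end / 1000:g} seconds")
```

With these inputs the last two chunks come out as 14 - 19 and 17.5 - 22 seconds. Calling the helper with overlap_ms=0 gives back-to-back chunks of 0 - 5, 5 - 10, 10 - 15, 15 - 20 and 20 - 22 seconds.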

Step # 2: Work with the sliced audio files to do whatever the user wants. Here, for demonstration purposes, each chunk is passed through Google's speech recognition engine, and the recognized text is written to a separate text file, recognized.txt. Recognizing speech from a microphone with the same engine is not covered here; this article uses an audio file as input.
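The flow of Step 2 (recognize each chunk, write the text, skip chunks that fail) can be sketched independently of any particular engine. In this minimal sketch, `transcribe_chunks` and its `recognize` parameter are hypothetical names: `recognize` is assumed to be any function that takes a chunk filename and either returns the recognized text or raises an exception.

```python
def transcribe_chunks(filenames, recognize, out_path):
    """Run recognize() on each chunk file and write the text to out_path.

    Hypothetical helper. Returns the list of chunk filenames for which
    recognition failed, so the caller can retry or inspect them.
    """
    failed = []
    with open(out_path, "w") as fh:
        for name in filenames:
            try:
                text = recognize(name)
            except Exception as exc:
                # A sketch-level catch-all; a real program would catch the
                # engine's own errors, for example sr.UnknownValueError
                # and sr.RequestError from the SpeechRecognition library.
                print(f"Skipping {name}: {exc!r}")
                failed.append(name)
                continue
            # Separate the text of consecutive chunks.
            fh.write(text + ". ")
    return failed
```

With the SpeechRecognition library, `recognize` could be a small function that opens the file with `sr.AudioFile`, reads it with `Recognizer.record`, and passes the result to `Recognizer.recognize_google`. Keeping the engine behind a plain function also makes the loop easy to try out without an Internet connection.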

Here's the implementation:

# Import the required libraries
from pydub import AudioSegment
import speech_recognition as sr

# Input audio file to slice
audio = AudioSegment.from_wav("1.wav")

"""
Step # 1 - Slice the audio file into smaller chunks.
"""
# Length of the audio file in milliseconds
n = len(audio)

# Counter for the number of chunks created
counter = 1

# Text file for the recognized text
fh = open("recognized.txt", "w+")

# Interval at which to slice the audio file, in milliseconds.
# If the length is 22 seconds and the interval is 5 seconds,
# the generated chunks will be:
# chunk1: 0 - 5 seconds
# chunk2: 5 - 10 seconds
# chunk3: 10 - 15 seconds
# chunk4: 15 - 20 seconds
# chunk5: 20 - 22 seconds
interval = 5 * 1000

# Length of audio to overlap, in milliseconds.
# If the length is 22 seconds and the interval is 5 seconds,
# with a 1.5 second overlap,
# the generated chunks will be:
# chunk1: 0 - 5 seconds
# chunk2: 3.5 - 8.5 seconds
# chunk3: 7 - 12 seconds
# chunk4: 10.5 - 15.5 seconds
# chunk5: 14 - 19 seconds
# chunk6: 17.5 - 22 seconds
overlap = int(1.5 * 1000)

# Initialize start and end to 0
start = 0
end = 0

# Flag to keep track of the end of the file.
# When the audio ends, the flag is set to 1 and we break.
flag = 0

# Iterate from 0 to the end of the file in steps of one interval
for i in range(0, 2 * n, interval):

    # During the first iteration,
    # start is 0 and end is the interval
    if i == 0:
        start = 0
        end = interval

    # In all other iterations,
    # start is the previous end minus the overlap
    # and end is start plus the interval
    else:
        start = end - overlap
        end = start + interval

    # When end reaches or passes the length of the file,
    # end is set to the file length and
    # the flag is set to 1 to indicate a break.
    if end >= n:
        end = n
        flag = 1

    # Take the audio from start to end
    chunk = audio[start:end]

    # Filename / path for storing the sliced audio
    filename = "chunk" + str(counter) + ".wav"

    # Save the sliced audio file to the specified path
    chunk.export(filename, format="wav")

    # Print information about the current chunk
    print("Processing chunk " + str(counter) + ". Start = "
          + str(start) + " end = " + str(end))

    # Increment counter for the next chunk
    counter = counter + 1

    # Slicing is done for this chunk.
    # Skip the steps below if there is some other use
    # for the sliced audio files.

    """
    Step # 2 - Recognize the chunk and write the text to a file.
    """
    # Google Speech Recognition is used here
    # to take each chunk and recognize the text in it.

    # Specify the audio file to recognize
    AUDIO_FILE = filename

    # Initialize the recognizer
    r = sr.Recognizer()

    # Read the whole chunk from the audio file.
    # record() reads the entire file, whereas listen()
    # may stop at the first pause in speech.
    with sr.AudioFile(AUDIO_FILE) as source:
        audio_listened = r.record(source)

    # Try to recognize the audio and catch exceptions.
    try:
        rec = r.recognize_google(audio_listened)

        # If recognized, write the text to the file.
        fh.write(rec + ". ")

    # If Google could not understand the audio
    except sr.UnknownValueError:
        print("Could not understand audio")

    # If the results cannot be requested from Google.
    # Probably an Internet connection error.
    except sr.RequestError as e:
        print("Could not request results.")

    # Check the flag.
    # If the flag is 1, the end of the audio has been reached.
    # Close the file and break.
    if flag == 1:
        fh.close()
        break

Output:

The recognized text is written to recognized.txt, and every chunk is stored on the local system as chunk1.wav, chunk2.wav, and so on. We have now sliced the audio file with overlap and recognized the content of each chunk.

Advantages of this method:

  • The interval can be set to any length, depending on how long we need the chunks to be.
  • Overlapping helps ensure that no word is lost, even if it is spoken right at the end of an interval.
  • All chunks can be saved as separate audio files and used later if needed.
  • Any processing that can be done on the full audio file can also be done on the chunks, since they are ordinary audio files.

Disadvantages of this method:

  • Using Google Speech Recognition requires an active Internet connection.
  • With overlap, some text post-processing is needed to remove duplicate words that are recognized in two adjacent chunks.
  • Google Speech Recognition accuracy depends on many factors, such as background noise, speaker accent, etc.
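The duplicate-word cleanup mentioned in the list above can be sketched as follows. This is one possible approach, not the article's own method: the hypothetical helper `merge_transcripts` drops the longest run of words at the start of each transcript that repeats the end of the text merged so far. It assumes both chunks recognized the overlapping words identically (ignoring case), which real recognizer output will not always satisfy, and it would also collapse a word the speaker genuinely said twice across a boundary.

```python
def merge_transcripts(pieces, max_overlap_words=10):
    """Join chunk transcripts, removing words repeated at chunk boundaries.

    Hypothetical helper. For each new piece, the longest prefix (up to
    max_overlap_words words) that matches the tail of the merged text is
    treated as overlap and dropped.
    """
    merged = []
    for piece in pieces:
        words = piece.split()
        limit = min(max_overlap_words, len(merged), len(words))
        drop = 0
        # Try the longest candidate overlap first.
        for k in range(limit, 0, -1):
            tail = [w.lower() for w in merged[-k:]]
            head = [w.lower() for w in words[:k]]
            if tail == head:
                drop = k
                break
        merged.extend(words[drop:])
    return " ".join(merged)


print(merge_transcripts([
    "the quick brown fox",
    "brown fox jumps over",
    "over the lazy dog",
]))
```

The example prints "the quick brown fox jumps over the lazy dog". To use this with the program above, the text of each chunk would be collected in a list instead of being written to the file immediately.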