Libraries required:
pydub: sudo pip3 install pydub
SpeechRecognition: sudo pip3 install SpeechRecognition
Example:
Input: peacock.wav
Output:
saving chunk0.wav
Processing chunk 0
saving chunk1.wav
Processing chunk 1
saving chunk2.wav
Processing chunk 2
saving chunk3.wav
Processing chunk 3
saving chunk4.wav
Processing chunk 4
saving chunk5.wav
Processing chunk 5
saving chunk6.wav
Processing chunk 6
Python Code:
# importing libraries
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

# a function that splits the audio file into chunks
# and applies speech recognition
def silence_based_conversion(path="alice-medium.wav"):

    # open the audio file stored in
    # the local system as a wav file.
    song = AudioSegment.from_wav(path)

    # open a file where we will concatenate
    # and store the recognized text
    fh = open("recognized.txt", "w+")

    # split track where silence is 0.5 seconds
    # or more and get chunks
    chunks = split_on_silence(song,
        # must be silent for at least 0.5 seconds
        # or 500 ms. adjust this value based on user
        # requirement. if the speaker stays silent for
        # longer, increase this value. else, decrease it.
        min_silence_len=500,

        # consider it silent if quieter than -16 dBFS
        # adjust this per requirement
        silence_thresh=-16
    )

    # create a directory to store the audio chunks.
    try:
        os.mkdir('audio_chunks')
    except FileExistsError:
        pass

    # move into the directory to
    # store the audio files.
    os.chdir('audio_chunks')

    i = 0
    # process each chunk
    for chunk in chunks:

        # create a 0.5-second (500 ms) silence chunk
        chunk_silent = AudioSegment.silent(duration=500)

        # add 0.5 sec silence to beginning and
        # end of audio chunk. This is done so that
        # it doesn't seem abruptly sliced.
        audio_chunk = chunk_silent + chunk + chunk_silent

        # export audio chunk and save it in
        # the current directory.
        print("saving chunk{0}.wav".format(i))
        # specify the bitrate to be 192 k
        audio_chunk.export("./chunk{0}.wav".format(i), bitrate='192k', format="wav")

        # the name of the newly created chunk
        filename = 'chunk' + str(i) + '.wav'

        print("Processing chunk " + str(i))

        # get the name of the newly created chunk
        # in the AUDIO_FILE variable for later use.
        file = filename

        # create a speech recognition object
        r = sr.Recognizer()

        # recognize the chunk
        with sr.AudioFile(file) as source:
            # remove this if it is not working correctly.
            r.adjust_for_ambient_noise(source)
            audio_listened = r.listen(source)

        try:
            # try converting it to text
            rec = r.recognize_google(audio_listened)
            # write the output to the file.
            fh.write(rec + ". ")

        # catch any errors.
        except sr.UnknownValueError:
            print("Could not understand audio")

        except sr.RequestError:
            print("Could not request results. Check your internet connection")

        i += 1

    os.chdir('..')
    fh.close()


if __name__ == '__main__':
    print('Enter the audio file path')
    path = input()
    silence_based_conversion(path)
Output:
recognized.txt: The peacock is the national bird of India. They have colourful feathers, two legs and a small beak. They are famous for their dance. When a peacock dances it spreads its feathers like a fan. It has a long shiny dark blue neck. Peacocks are mostly found in the fields they are very beautiful birds. The females are known as 'Peahens'. Their feathers are used for making jackets, purses etc. We can see them in a zoo.
How to convert large WAV file to text in Python?
Question from Stack Overflow
I already tried this code to convert my large wav file to text:
import speech_recognition as sr

r = sr.Recognizer()
hellow = sr.AudioFile('hello_world.wav')
with hellow as source:
    audio = r.record(source)
try:
    s = r.recognize_google(audio)
    print("Text: " + s)
except Exception as e:
    print("Exception: " + str(e))
But it is not converting accurately; I suspect the reason is the 'US' accent. Please tell me how I can convert the whole large wav file accurately.
Answer:
Google's speech-to-text is very effective; try the link below:
https://cloud.google.com/speech-to-text/
You can choose the language (English US in your case) and also upload files.
As @bigdataolddriver commented, 100% accuracy is not possible yet, and would be worth millions.
Google speech-to-text has three types of APIs: synchronous, asynchronous, and streaming. The asynchronous API lets you convert up to ~480 minutes of audio, while the others only allow ~1 minute. The following sample code does the conversion.
filepath = "~/audio_wav/" #Input audio file path
output_filepath = "~/Transcripts/" #Final transcript path
bucketname = "callsaudiofiles" #Name of the bucket created in the step before
# Import libraries
from pydub import AudioSegment
import io
import os
from google.cloud import speech
from google.cloud.speech import enums
from google.cloud.speech import types
import wave
from google.cloud import storage
Speech-to-text supports WAV files with LINEAR16 or MULAW encoded audio.
Below is the code to get the frame rate and channel count of the file.
def frame_rate_channel(audio_file_name):
    with wave.open(audio_file_name, "rb") as wave_file:
        frame_rate = wave_file.getframerate()
        channels = wave_file.getnchannels()
        return frame_rate, channels
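For example, run against a 16 kHz mono file, the helper returns the pair (16000, 1). The sketch below generates a short WAV file with the standard-library wave module so it runs on its own (the helper is repeated here so the snippet is self-contained; the file name is made up for the demo):

```python
import wave

def frame_rate_channel(audio_file_name):
    # same helper as above, repeated so this snippet runs on its own
    with wave.open(audio_file_name, "rb") as wave_file:
        return wave_file.getframerate(), wave_file.getnchannels()

# write one second of 16 kHz, 16-bit mono silence to test against
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

print(frame_rate_channel("demo.wav"))  # → (16000, 1)
```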
and the code below does the asynchronous conversion:
def google_transcribe(audio_file_name):

    file_name = filepath + audio_file_name

    # The name of the audio file to transcribe
    frame_rate, channels = frame_rate_channel(file_name)

    if channels > 1:
        stereo_to_mono(file_name)

    bucket_name = bucketname
    source_file_name = filepath + audio_file_name
    destination_blob_name = audio_file_name

    upload_blob(bucket_name, source_file_name, destination_blob_name)

    gcs_uri = 'gs://' + bucketname + '/' + audio_file_name
    transcript = ''

    client = speech.SpeechClient()
    audio = types.RecognitionAudio(uri=gcs_uri)

    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=frame_rate,
        language_code='en-US')

    # Detects speech in the audio file
    operation = client.long_running_recognize(config, audio)
    response = operation.result(timeout=10000)

    for result in response.results:
        transcript += result.alternatives[0].transcript

    delete_blob(bucket_name, destination_blob_name)

    return transcript
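The function above calls three helpers (stereo_to_mono, upload_blob, delete_blob) that the answer does not show. A minimal sketch of what they might look like follows; the function names come from the answer, but these implementations are assumptions (the downmix assumes 16-bit samples, and the blob helpers need the google-cloud-storage package and credentials):

```python
import wave
import array

def stereo_to_mono(audio_file_name):
    """Downmix a 16-bit stereo WAV file to mono in place (assumed implementation)."""
    with wave.open(audio_file_name, "rb") as wav_in:
        params = wav_in.getparams()
        frames = wav_in.readframes(wav_in.getnframes())
    samples = array.array("h", frames)
    # average each left/right sample pair into a single mono sample
    mono = array.array("h", ((samples[i] + samples[i + 1]) // 2
                             for i in range(0, len(samples), 2)))
    with wave.open(audio_file_name, "wb") as wav_out:
        wav_out.setnchannels(1)
        wav_out.setsampwidth(params.sampwidth)
        wav_out.setframerate(params.framerate)
        wav_out.writeframes(mono.tobytes())

def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Upload a local file to a Cloud Storage bucket (sketch)."""
    from google.cloud import storage  # imported lazily; needs google-cloud-storage
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    bucket.blob(destination_blob_name).upload_from_filename(source_file_name)

def delete_blob(bucket_name, blob_name):
    """Delete a blob from a Cloud Storage bucket (sketch)."""
    from google.cloud import storage
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    bucket.blob(blob_name).delete()
```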
and this is how you write the transcripts to a file:
def write_transcripts(transcript_filename, transcript):
    f = open(output_filepath + transcript_filename, "w+")
    f.write(transcript)
    f.close()
Kindly let me know if you need any further clarifications.
Speech recognition is the process of converting sound to text. It is commonly used in voice assistants like Alexa, Siri, etc. Python provides a library called SpeechRecognition that allows us to convert audio to text for further processing. In this article, we will look at converting large or long audio files to text using the SpeechRecognition library in Python.
Processing large audio files
When the input is a long audio file, speech recognition accuracy decreases: the Google Speech Recognition API cannot recognize long audio files with good fidelity. Therefore, we need to split the audio file into smaller chunks and pass those chunks to the API. This improves accuracy and allows large audio files to be recognized.
Splitting audio based on silence
One way to handle an audio file is to break it into chunks of constant size. For example, we can take a 10-minute audio file and split it into 60 pieces of 10 seconds each. We can then pass these snippets to the API and convert speech to text by concatenating the results. This method is imprecise: dividing an audio file into constant-sized chunks can cut sentences in the middle, and we can lose important words in the process, because a chunk may end before a word is fully spoken and Google will not be able to recognize incomplete words.
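The constant-size approach described above can be sketched with nothing but the standard-library wave module (the function name and chunk length are illustrative, not from the original article):

```python
import wave

def split_fixed(path, chunk_seconds=10):
    """Split a WAV file into constant-size chunks; returns the chunk file names."""
    names = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * chunk_seconds
        i = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            name = f"chunk{i}.wav"
            with wave.open(name, "wb") as dst:
                dst.setnchannels(params.nchannels)
                dst.setsampwidth(params.sampwidth)
                dst.setframerate(params.framerate)
                dst.writeframes(frames)
            names.append(name)
            i += 1
    return names
```

Note that the last chunk simply contains whatever frames are left over, which is exactly where a sentence can get cut off mid-word.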
Another way is to split the audio file on silence. People pause briefly between sentences. If we split the audio file into chunks at these silences, we can process the file sentence by sentence and concatenate the results. This approach is more accurate than the previous one because we do not cut sentences in the middle: each audio chunk contains an entire sentence without interruptions. It also frees us from splitting the audio into chunks of constant length.
The disadvantage of this method is that it is hard to choose the silence duration to split on, because different users speak differently: some pause for a full second between sentences, while others pause for as little as 0.5 seconds.
How to Convert Speech to Text in Python
Speech recognition is the ability of computer software to identify words and sentences in spoken language and convert them into human-readable text. This tutorial will show you how to convert speech to text in Python using the SpeechRecognition library. As a result, we don't have to build a machine learning model from scratch: the library provides handy wrappers for various popular public speech recognition APIs (such as the Google Cloud Speech API, IBM Speech to Text, etc.).
Okay, let's get started. Install the library using pip:

pip3 install SpeechRecognition pydub
Now open up a new Python file and import it:
import speech_recognition as sr
The nice thing about this library is that it supports several recognition engines, including CMU Sphinx (offline), Google Speech Recognition, Google Cloud Speech, Wit.ai, Microsoft Azure Speech, Houndify, and IBM Speech to Text.
We are going to use Google Speech Recognition here, as it’s straightforward and doesn’t require any API key.
Reading from a File
Make sure you have an audio file in the current directory that contains English speech (if you want to follow along with me, get the audio file here):
filename = "16-122828-0002.wav"
This file was taken from the LibriSpeech dataset, but you can use any WAV audio file you want; just change the file name. Now let's initialize our speech recognizer:
# initialize the recognizer
r = sr.Recognizer()
The code below is responsible for loading the audio file and converting the speech into text using Google Speech Recognition:
# open the file
with sr.AudioFile(filename) as source:
    # listen for the data (load audio to memory)
    audio_data = r.record(source)
    # recognize (convert from speech to text)
    text = r.recognize_google(audio_data)
    print(text)
This will take a few seconds to finish, as it uploads the file to Google and grabs the output. Here is my result:
I believe you’re just talking nonsense
The above code works well for small or medium-sized audio files. In the next section, we will write code for large files.
Reading Large Audio Files
If you want to perform speech recognition of a long audio file, then the below function handles that quite well:
# importing libraries
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

# create a speech recognition object
r = sr.Recognizer()

# a function that splits the audio file into chunks
# and applies speech recognition
def get_large_audio_transcription(path):
    """
    Split the large audio file into chunks
    and apply speech recognition on each of them
    """
    # open the audio file using pydub
    sound = AudioSegment.from_wav(path)
    # split audio where silence is 500 milliseconds or more and get chunks
    chunks = split_on_silence(sound,
        # experiment with this value for your target audio file
        min_silence_len=500,
        # adjust this per requirement
        silence_thresh=sound.dBFS - 14,
        # keep 500 ms of silence around each chunk, adjustable as well
        keep_silence=500,
    )
    folder_name = "audio-chunks"
    # create a directory to store the audio chunks
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)
    whole_text = ""
    # process each chunk
    for i, audio_chunk in enumerate(chunks, start=1):
        # export audio chunk and save it in
        # the 'folder_name' directory.
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")
        # recognize the chunk
        with sr.AudioFile(chunk_filename) as source:
            audio_listened = r.record(source)
            # try converting it to text
            try:
                text = r.recognize_google(audio_listened)
            except sr.UnknownValueError as e:
                print("Error:", str(e))
            else:
                text = f"{text.capitalize()}. "
                print(chunk_filename, ":", text)
                whole_text += text
    # return the text for all chunks detected
    return whole_text
Note: You need to install pydub using pip for the above code to work.
The above function uses the split_on_silence() function from the pydub.silence module to split the audio data into chunks on silence. The min_silence_len parameter is the minimum length (in milliseconds) of a silence to split on. silence_thresh is the threshold below which audio is considered silence; I have set it to the average loudness (dBFS) minus 14. The keep_silence argument is the amount of silence, in milliseconds, to leave at the beginning and end of each detected chunk.
These parameters won't be perfect for all sound files; experiment with them on your own large audio.
After that, we iterate over all chunks, convert each speech chunk to text, and concatenate the results. Here is an example run:
path = "7601-291468-0006.wav"
print("\nFull text:", get_large_audio_transcription(path))
Note: You can get 7601-291468-0006.wav file here.
Output:
audio-chunks\chunk1.wav : His abode which you had fixed in a bowery or country seat.
audio-chunks\chunk2.wav : At a short distance from the city.
audio-chunks\chunk3.wav : Just at what is now called dutch street.
audio-chunks\chunk4.wav : Sooner bounded with proofs of his ingenuity.
audio-chunks\chunk5.wav : Patent smokejacks.
audio-chunks\chunk6.wav : It required a horse to work some.
audio-chunks\chunk7.wav : Dutch oven roasted meat without fire.
audio-chunks\chunk8.wav : Carts that went before the horses.
audio-chunks\chunk9.wav : Weather cox that turned against the wind and other wrongheaded contrivances.
audio-chunks\chunk10.wav : So just understand can found it all beholders.
Full text: His abode which you had fixed in a bowery or country seat. At a short distance from the city. Just at what is now called dutch street. Sooner bounded with proofs of his ingenuity. Patent smokejacks. It required a horse to work some. Dutch oven roasted meat without fire. Carts that went before the horses. Weather cox that turned against the wind and other wrongheaded contrivances. So just understand can found it all beholders.

So, this function automatically creates a folder for us, puts the chunks of the original audio file we specified into it, and then runs speech recognition on all of them.
Reading from the Microphone
This requires PyAudio to be installed on your machine. Here is the installation process, depending on your operating system:
Windows
You can just pip install it:
pip3 install pyaudio
Linux
You need to first install the dependencies:
sudo apt-get install python-pyaudio python3-pyaudio
pip3 install pyaudio
MacOS
You need to first install portaudio, then you can just pip install it:
brew install portaudio
pip3 install pyaudio
Now let’s use our microphone to convert our speech:
with sr.Microphone() as source:
    # read the audio data from the default microphone
    audio_data = r.record(source, duration=5)
    print("Recognizing...")
    # convert speech to text
    text = r.recognize_google(audio_data)
    print(text)
This will listen to your microphone for 5 seconds and then try to convert the speech into text!
It is pretty similar to the previous code, but here we use a Microphone() object to read audio from the default microphone; the duration parameter of record() stops the read after 5 seconds, and the audio data is then uploaded to Google to get the output text.
You can also use the offset parameter of record() to start recording after offset seconds.
Also, you can recognize different languages by passing the language parameter to recognize_google(). For instance, if you want to recognize Spanish speech, you would use:
text = r.recognize_google(audio_data, language="es-ES")