02: Task of Automatic Speech Recognition (ASR) System
Introduction
Automatic Speech Recognition (ASR) systems are designed to convert spoken language into text. The input to the ASR system is audio, while the output is the corresponding transcript. To effectively process the audio, it is essential to transform it into a set of features that can be utilized in machine learning algorithms to generate the transcript.
Audio Processing: Features Extraction
The journey of audio processing begins with dividing the audio signal into short, overlapping frames. This segmentation is driven by two key parameters:
Frame Window: Defines the duration of each frame. For instance, if the frame window is set at 25 milliseconds, every audio frame will correspond to this time duration.
Frame Shift: Determines the spacing between the starts of consecutive frames, and hence their overlap. With a frame shift of 10 milliseconds, the first audio frame starts at 0 milliseconds, the second at 10 milliseconds, and the third at 20 milliseconds. This overlapping approach is important to maintain continuity in the audio signal.
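The framing step above can be sketched with NumPy. The function name and the 16 kHz sample rate are illustrative assumptions; the 25 ms window and 10 ms shift come from the text.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_window_ms=25, frame_shift_ms=10):
    """Split a 1-D audio signal into overlapping frames.

    Frames that would run past the end of the signal are dropped.
    """
    frame_len = int(sample_rate * frame_window_ms / 1000)   # samples per frame
    frame_shift = int(sample_rate * frame_shift_ms / 1000)  # samples between frame starts
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    return np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len]
        for i in range(num_frames)
    ])

# One second of audio at 16 kHz -> 400-sample frames, shifted by 160 samples.
audio = np.random.randn(16000)
frames = frame_signal(audio, sample_rate=16000)
print(frames.shape)  # (98, 400)
```

Note that each input sample (except near the edges) lands in more than one frame, which is exactly the overlap described above.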
Fourier Transform and Mel Filters
To convert the audio frames from the time domain into the frequency domain, a Fourier Transform is used. The transformation breaks down the audio signal into its constituent frequencies, making the spectral patterns relevant to speech easier to analyze.
Next, we employ Mel filters, a crucial component that simulates human auditory perception. Humans are more adept at discerning differences in lower frequencies than in higher ones: the difference between 1000 Hz and 1100 Hz is more perceptible to us than that between 5000 Hz and 5100 Hz. Mel filters therefore help the ASR system align more closely with human hearing. If n Mel filters are used, the frequency-domain representation is passed through these filters to yield n Mel features.
Log Mel Features
After obtaining the Mel features for each audio frame, we take the logarithm of these values to produce log Mel features. Thus, after processing a 25-millisecond audio frame, we end up with log Mel features represented as n-dimensional vectors, with n = 80 being a common choice.
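The final step is a one-liner. The frame count and the random placeholder values are hypothetical; the small epsilon is a common guard against taking the log of zero for silent frames.

```python
import numpy as np

# Hypothetical Mel features for 98 frames with n = 80 filters each.
mel_features = np.abs(np.random.randn(98, 80))
log_mel = np.log(mel_features + 1e-10)  # epsilon avoids log(0) on silence
print(log_mel.shape)  # (98, 80): one 80-dimensional vector per 25 ms frame
```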
Building ASR Systems
Once we have extracted the log Mel features from the audio frames, the following two paradigms can be adopted to build the ASR system:
Modularized ASR Systems: This approach comprises several components, such as an acoustic model, language model, and pronunciation model. Each of these components works together to create the overall ASR system.
End-to-End ASR Systems: In contrast, this method utilizes a single neural network to directly perform automatic speech recognition, streamlining the process and simplifying system architecture.
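The two paradigms above can be contrasted with a toy sketch. The class and method names are illustrative inventions, not a real toolkit's API; in practice each callable would be a trained model (e.g. an HMM or a neural network).

```python
class ModularASR:
    """Separate acoustic, pronunciation, and language models, combined at decode time."""
    def __init__(self, acoustic_model, pronunciation_model, language_model):
        self.acoustic_model = acoustic_model
        self.pronunciation_model = pronunciation_model
        self.language_model = language_model

    def transcribe(self, log_mel_frames):
        phone_scores = self.acoustic_model(log_mel_frames)        # frames -> phone scores
        word_candidates = self.pronunciation_model(phone_scores)  # phones -> word candidates
        return self.language_model(word_candidates)               # pick the most fluent sequence

class EndToEndASR:
    """A single network maps log Mel features directly to text."""
    def __init__(self, network):
        self.network = network

    def transcribe(self, log_mel_frames):
        return self.network(log_mel_frames)

# Trivial stand-ins so the sketch runs end to end.
modular = ModularASR(lambda x: "AH", lambda p: ["a"], lambda w: " ".join(w))
e2e = EndToEndASR(lambda x: "a")
print(modular.transcribe([[0.0] * 80]))  # a
print(e2e.transcribe([[0.0] * 80]))      # a
```

The design trade-off mirrors the text: the modular pipeline exposes separately trainable, inspectable components, while the end-to-end system collapses them into one network with a simpler interface.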
Keywords
- Automatic Speech Recognition (ASR)
- Audio Signal
- Transcript
- Features Extraction
- Frame Window
- Frame Shift
- Fourier Transform
- Mel Filters
- Log Mel Features
- Modularized ASR System
- End-to-End ASR System
FAQ
What is an Automatic Speech Recognition (ASR) system?
An ASR system converts spoken language into text by processing audio input and generating a transcript as output.
What are the key parameters in audio frame processing?
The key parameters are Frame Window, which determines the duration of each frame, and Frame Shift, which defines the overlap between the frames.
What role does the Fourier Transform play in ASR?
The Fourier Transform helps to convert audio signals from the time domain to the frequency domain, breaking down the audio into its constituent frequencies.
How do Mel filters contribute to the ASR process?
Mel filters simulate human hearing sensitivity, allowing the ASR system to better recognize differences in sound frequencies, particularly at lower frequencies.
What are the two main paradigms for building ASR systems?
The two paradigms are Modularized ASR Systems, which use multiple components, and End-to-End ASR Systems, which rely on a single neural network for recognition.