Kaggle Winning Solutions Walkthroughs: Bengali.AI Speech Recognition with Team AudioAlchemists

Introduction

Welcome to our detailed walkthrough of Team AudioAlchemists’ presentation on the Bengali.AI Speech Recognition competition, where we discuss our experience and findings. The project was a collaborative effort among three members, and our collective background and expertise in the field contributed significantly to our success.

Team Background

Our team consists of three dedicated members:

  1. Must - Team Leader and Lead AI Research Engineer at Soloscope Limited.
  2. S - Graduate Research Fellow at Stevens Institute of Technology.
  3. Mah - Research Engineer at C Limited.

All members completed their Bachelor's degrees at the Bangladesh University of Engineering and Technology and have prior experience in speech recognition technologies, particularly in developing speech-to-text models for the Bengali language. My experience from the previous year's DL Spring competition, hosted by the same organization, proved invaluable for our current project.

Summary of Our Solution

Our solution can be broadly divided into two main components: the Acoustic Model and the Language Model.

Acoustic Model

To build the acoustic model, we employed a model architecture designed specifically for Indic languages. For training data, we combined the competition dataset, which comprised 1,200 hours of annotated audio from a single domain, with two external datasets, OP37 and OP53.

Language Model

For the language model, we trained a 5-gram language model with KenLM, which uses Kneser-Ney smoothing. Our final score was 0.491 on the private leaderboard and 0.413 on the public leaderboard, placing us 31st on the private leaderboard.
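
The write-up does not include the team's decoding code, but the following minimal sketch shows how a KenLM n-gram model is commonly fused with a CTC acoustic model at beam-search decoding time, here via pyctcdecode. The vocabulary, ARPA file path, and alpha/beta weights are all illustrative assumptions, not the team's actual values.

```python
# Sketch: beam-search CTC decoding with a KenLM 5-gram model via pyctcdecode.
# Assumes an ARPA file built beforehand with KenLM's command-line tool, e.g.:
#   lmplz -o 5 < bengali_corpus.txt > bn_5gram.arpa
import numpy as np
from pyctcdecode import build_ctcdecoder

# Toy vocabulary; must match the acoustic model's CTC output order.
# "" denotes the CTC blank token.
vocab = ["", " ", "অ", "আ", "ক"]

decoder = build_ctcdecoder(
    labels=vocab,
    kenlm_model_path="bn_5gram.arpa",  # hypothetical path
    alpha=0.5,  # LM weight (assumed value)
    beta=1.0,   # word-insertion bonus (assumed value)
)

# Frame-level scores from the acoustic model: (time_steps, vocab_size).
logits = np.random.randn(200, len(vocab)).astype(np.float32)
print(decoder.decode(logits))
```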

Data Utilization

We trained our models primarily on the competition dataset, supplemented by the external datasets noted above. The competition data was filtered and cleaned to ensure high quality, removing noisy samples and correcting annotation errors.

Data Cleaning and Preprocessing

We approached our data cleaning as a two-step process (a code sketch of both steps follows the list):

  1. Filtering Outliers: We analyzed the audio transcripts to calculate the audio length-to-transcription length ratio, removing entries that exhibited unusual ratios indicative of annotation errors.

  2. Quality Assessment: Utilizing the metadata provided by the competition organizers, we filtered out data based on certain quality metrics like mean opinion score and loudness, resulting in a more manageable dataset of approximately 500,000 samples for training.
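
A minimal sketch of these two filtering steps is below, assuming per-clip metadata in a pandas DataFrame. The file name, column names, and thresholds are illustrative assumptions, not the team's actual values.

```python
# Sketch of the two-step filtering, assuming one row per clip with columns
# for duration, transcript, mean opinion score (MOS), and loudness.
import pandas as pd

df = pd.read_csv("train_metadata.csv")  # hypothetical metadata file

# Step 1: drop clips whose audio-length-to-transcript-length ratio is an
# outlier, since extreme ratios usually indicate annotation errors.
df["ratio"] = df["duration_sec"] / df["transcript"].str.len().clip(lower=1)
low, high = df["ratio"].quantile([0.01, 0.99])
df = df[df["ratio"].between(low, high)]

# Step 2: keep only clips passing quality thresholds from the organizers'
# metadata (threshold values here are assumptions).
df = df[(df["mos"] >= 2.5) & (df["loudness_db"] >= -40.0)]

print(f"{len(df)} samples retained for training")
```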

During preprocessing, we resampled the audio to 16,000 Hz, applied normalization, and removed irrelevant characters and punctuation from the transcripts. We also used data augmentation, mixing in background noise taken from out-of-domain audio clips.
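
The snippet below sketches this preprocessing pipeline: resampling to 16 kHz, peak normalization, transcript cleanup, and SNR-controlled background-noise mixing. It uses librosa for loading; the function names and the SNR value are assumptions rather than the team's exact implementation.

```python
# Sketch of the preprocessing and augmentation steps described above.
import re
import numpy as np
import librosa

def preprocess_audio(path: str, target_sr: int = 16_000) -> np.ndarray:
    audio, _ = librosa.load(path, sr=target_sr)  # resample while loading
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio   # simple peak normalization

def clean_transcript(text: str) -> str:
    # Keep Bengali characters and whitespace; drop punctuation and symbols.
    text = re.sub(r"[^\u0980-\u09FF\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def add_background_noise(speech: np.ndarray, noise: np.ndarray,
                         snr_db: float = 15.0) -> np.ndarray:
    noise = np.resize(noise, speech.shape)       # tile/crop noise to length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```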

Training Methods

Our base model was fine-tuned from a checkpoint pre-trained on 40 Indic languages. Throughout training we largely kept the default hyperparameters, since our various tuning attempts yielded minimal gains. Using an NVIDIA RTX A4500 GPU with 20 GB of memory, training took roughly five days.
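
As an illustration of this setup, here is a minimal fine-tuning sketch with Hugging Face transformers, assuming a wav2vec 2.0-style CTC checkpoint. The checkpoint id, batch size, and the dataset/collator placeholders are all assumptions, not the team's actual configuration.

```python
# Minimal fine-tuning sketch; hyperparameters stay at or near defaults,
# echoing the write-up. The checkpoint id below is hypothetical.
from transformers import (Trainer, TrainingArguments,
                          Wav2Vec2ForCTC, Wav2Vec2Processor)

checkpoint = "org/indic-wav2vec2-base"  # hypothetical multilingual checkpoint
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(
    checkpoint,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)
model.freeze_feature_encoder()  # common practice for wav2vec 2.0 fine-tuning

args = TrainingArguments(
    output_dir="bengali-asr",
    per_device_train_batch_size=8,  # sized for a 20 GB GPU (assumption)
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    fp16=True,
    save_steps=2000,
    logging_steps=100,
)

# Placeholders: in practice these wrap the filtered ~500k samples and pad
# input_values / labels for CTC.
train_dataset = ...  # e.g. a datasets.Dataset of {"input_values", "labels"}
data_collator = ...  # e.g. a padding collator built around the processor

trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, data_collator=data_collator)
trainer.train()
```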

Key Insights

Through our journey, we learned that data quality mattered more than data quantity: the filtered dataset yielded more robust results than the raw dataset. Building a larger language model helped overcome vocabulary diversity issues but required careful tuning. Furthermore, experiments with audio enhancement techniques, suboptimal model choices, and other variations repeatedly pointed back to the importance of data filtering.

In conclusion, while our approach may seem straightforward, it emerged from numerous experimental trials, many of which were less fruitful.

Keywords

  • Bengali.AI
  • Speech Recognition
  • Acoustic Model
  • Language Model
  • Data Cleaning
  • Preprocessing
  • Audio Quality
  • Machine Learning

FAQ

Q1: What was the main focus of Team AudioAlchemists in the competition?
A1: The main focus was to develop an effective speech recognition system for the Bengali language by combining an acoustic model with a language model.

Q2: How did Team AudioAlchemists handle data quality issues?
A2: They performed data cleaning and filtering based on audio length ratios and metadata quality metrics to ensure high-quality training samples.

Q3: What were the results of the competition?
A3: The team achieved scores of 0.491 on the private leaderboard and 0.413 on the public leaderboard, ranking 31st on the private leaderboard.

Q4: What insights did the team gain during the project?
A4: The team discovered that focusing on data quality yielded better results than simply amassing a large quantity of data, highlighting the importance of effective data cleaning and preprocessing.

Q5: What methods did the team use for preprocessing and augmenting audio data?
A5: The team resampled audio data to 16,000 Hz, applied normalization, cleaned the transcripts, and augmented training audio with background noise from out-of-domain clips.