
    Offline Speech Recognition on Meta Quest: Testing Unity Sentis + Whisper AI


    Introduction

    Imagine controlling your virtual reality (VR) experience with just your voice, all without relying on online AI services. Recently, I've been diving into developing an application on my Meta Quest 3 that integrates speech recognition, with all processes running locally on the device. In this article, I'll walk you through a demo of the app I've built and explain how it was accomplished.

    Overview of the App

    The app is built using the Unity game engine, enabling speech recognition through the Whisper model developed by OpenAI. Specifically, I am utilizing the Whisper tiny model for speech-to-text transcription. The procedure is straightforward: by holding down the trigger button on the left controller, I can record my voice, and pressing the trigger on the right controller sends the recording to the Whisper model for transcription. The transcribed text is displayed in orange on the app interface.
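
To make the controller flow concrete, here is a minimal sketch of how the two triggers might be polled with Unity's XR input API. The class and the Debug.Log placeholders are illustrative; the actual project wires these events into its own mic recorder and Whisper scripts.

```csharp
using UnityEngine;
using UnityEngine.XR;

// Minimal sketch: hold the left trigger to record, press the right
// trigger to kick off transcription. Debug.Log calls stand in for
// the project's actual recorder and Whisper scripts.
public class VoiceTriggerInput : MonoBehaviour
{
    bool wasLeftPressed, wasRightPressed;

    void Update()
    {
        var left  = InputDevices.GetDeviceAtXRNode(XRNode.LeftHand);
        var right = InputDevices.GetDeviceAtXRNode(XRNode.RightHand);

        if (left.TryGetFeatureValue(CommonUsages.triggerButton, out bool leftPressed))
        {
            if (leftPressed && !wasLeftPressed) Debug.Log("Start recording");
            if (!leftPressed && wasLeftPressed) Debug.Log("Stop recording");
            wasLeftPressed = leftPressed;
        }

        if (right.TryGetFeatureValue(CommonUsages.triggerButton, out bool rightPressed))
        {
            if (rightPressed && !wasRightPressed) Debug.Log("Send recording to Whisper");
            wasRightPressed = rightPressed;
        }
    }
}
```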

    Technical Details

To create the app, I leveraged Unity's XR Interaction Toolkit and the OpenXR Meta plugin, although the official Meta XR SDK works just as well if you prefer it. Whisper is an automatic speech recognition (ASR) model trained on an extensive multilingual dataset of around 680,000 hours of speech. Of the various model sizes available, I chose Whisper tiny, the smallest, with 39 million parameters.

    Integrating Whisper into Unity

    To integrate Whisper into my Unity app and enable it to run locally, I utilized Unity's Sentis technology. Sentis is a neural network inference library that allows developers to run AI models directly from their Unity applications. To start integration, I installed the Sentis package via the Unity Package Manager.
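
As a rough sketch, loading the imported model files and creating inference workers looks something like the following. Note that the Sentis API has shifted between releases (for example, worker creation moved from WorkerFactory.CreateWorker to a Worker constructor in Sentis 2.x), so treat the exact names as version-dependent.

```csharp
using UnityEngine;
using Unity.Sentis;

// Sentis 1.x-style loading sketch; names differ slightly in later releases.
public class WhisperLoader : MonoBehaviour
{
    // Assign the imported Whisper model assets in the Inspector.
    public ModelAsset encoderAsset;
    public ModelAsset decoderAsset;

    IWorker encoder;
    IWorker decoder;

    void Start()
    {
        Model encoderModel = ModelLoader.Load(encoderAsset);
        Model decoderModel = ModelLoader.Load(decoderAsset);

        // GPUCompute runs inference on the Quest's GPU; BackendType.CPU
        // is the fallback if compute shaders are unavailable.
        encoder = WorkerFactory.CreateWorker(BackendType.GPUCompute, encoderModel);
        decoder = WorkerFactory.CreateWorker(BackendType.GPUCompute, decoderModel);
    }

    void OnDestroy()
    {
        encoder?.Dispose();
        decoder?.Dispose();
    }
}
```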

Instead of downloading the model from OpenAI's website, I obtained it from Unity's page on Hugging Face, which hosts AI models optimized for Unity Sentis. The Whisper tiny model consists of several essential files that all need to be imported. Note that the ideal location for these files can be tricky on Android because of how Unity packs project folders; I opted to place them in a custom project folder called "AI Models".

    Understanding Whisper's Model Structure

Whisper operates as a multi-stage pipeline consisting of a log-Mel spectrogram stage, an audio encoder, and a decoder. Here's a brief overview of how the transcription process works (a sketch of how the stages chain together follows the list):

1. Log-Mel Spectrogram: The raw audio waveform is converted into a time-frequency representation that is more manageable for the neural network.
2. Audio Encoder: This extracts higher-level features, such as phonetic and linguistic elements, from the spectrogram data.
3. Decoder: Finally, the decoder interprets the encoded features and generates a sequence of text tokens, one step at a time.
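
Here is a schematic sketch of how the three stages could be chained at inference time with Sentis workers (1.x-style API). The tensor names ("input_ids", "encoder_hidden_states"), the token constants, and the greedy ArgMax decoding are illustrative assumptions; the project's actual script handles tokenization and special tokens (language, task, timestamps) in more detail.

```csharp
using System.Collections.Generic;
using Unity.Sentis;

// Schematic data flow only; tensor names, token constants, and the
// greedy decoding loop are illustrative assumptions, not the exact
// sample code.
public static class WhisperPipelineSketch
{
    const int StartToken = 50258; // <|startoftranscript|> in Whisper's multilingual vocab
    const int EndToken   = 50257; // <|endoftext|>
    const int MaxTokens  = 100;

    public static List<int> Transcribe(
        IWorker spectrogram, IWorker encoder, IWorker decoder, float[] samples)
    {
        // 1. Log-Mel spectrogram: raw waveform -> time-frequency features.
        using var audio = new TensorFloat(new TensorShape(1, samples.Length), samples);
        spectrogram.Execute(audio);
        var logMel = spectrogram.PeekOutput() as TensorFloat;

        // 2. Audio encoder: spectrogram -> higher-level acoustic features.
        encoder.Execute(logMel);
        var encoded = encoder.PeekOutput() as TensorFloat;

        // 3. Decoder: emit one token per step, feeding the sequence back in.
        var tokens = new List<int> { StartToken };
        while (tokens[tokens.Count - 1] != EndToken && tokens.Count < MaxTokens)
        {
            using var ids = new TensorInt(new TensorShape(1, tokens.Count), tokens.ToArray());
            decoder.Execute(new Dictionary<string, Tensor> {
                { "input_ids", ids },
                { "encoder_hidden_states", encoded },
            });
            var logits = decoder.PeekOutput() as TensorFloat;
            logits.MakeReadable();

            // Greedy decoding: take the highest-scoring token at the last position.
            int vocab = logits.shape[logits.shape.rank - 1];
            int best = 0;
            for (int v = 1; v < vocab; v++)
                if (logits[0, tokens.Count - 1, v] > logits[0, tokens.Count - 1, best])
                    best = v;
            tokens.Add(best);
        }
        return tokens;
    }
}
```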

    Implementation Steps

The Unity project is set up in a typical manner, with an XR rig called "XR Origin" and a debug panel that logs transcription progress. The core functionality is driven by a mic recorder script that captures audio from the Meta Quest's built-in microphone.
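
A minimal version of such a mic recorder might look like this. The 16 kHz mono sample rate matches what Whisper expects, and the 30-second cap mirrors its input window; passing null to the Microphone API selects the default (built-in) device.

```csharp
using UnityEngine;

// Minimal mic-recorder sketch. Whisper expects 16 kHz mono audio;
// the 30-second cap mirrors its input window.
public class MicRecorder : MonoBehaviour
{
    const int SampleRate = 16000;
    const int MaxSeconds = 30;

    AudioClip clip;

    public void StartRecording()
    {
        // null selects the default device, i.e. the Quest's built-in mic.
        clip = Microphone.Start(null, false, MaxSeconds, SampleRate);
    }

    public float[] StopRecording()
    {
        int recordedSamples = Microphone.GetPosition(null);
        Microphone.End(null);

        // Copy out only the samples actually recorded.
        var samples = new float[recordedSamples];
        clip.GetData(samples, 0);
        return samples;
    }
}
```

On Quest, recording also requires the Android microphone permission (android.permission.RECORD_AUDIO) to be declared and granted.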

The Whisper integration uses a modified version of the "Run Whisper" script from Hugging Face, with a state machine added to manage the transcription process across frames. Splitting the encoder's processing load over multiple frames helps maintain a smoother frame rate during transcription.
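
The general shape of that state machine is sketched below, written against Sentis 2.x names (Worker, ScheduleIterable); earlier releases expose the same layer-by-layer idea under different names, and the per-frame layer budget of 20 here is an arbitrary assumption.

```csharp
using System.Collections;
using UnityEngine;
using Unity.Sentis;

// Illustrative state machine for spreading encoder inference across
// frames (Sentis 2.x API names; earlier versions differ).
public class TranscriptionScheduler : MonoBehaviour
{
    enum State { Idle, Encoding, Decoding }

    const int LayersPerFrame = 20; // arbitrary per-frame budget

    State state = State.Idle;
    Worker encoder;                // created elsewhere from the encoder model
    IEnumerator schedule;

    public void BeginEncoding(Tensor<float> logMel)
    {
        // Returns an enumerator that executes one layer per MoveNext().
        schedule = encoder.ScheduleIterable(logMel);
        state = State.Encoding;
    }

    void Update()
    {
        if (state != State.Encoding) return;

        // Advance a bounded number of layers, then yield to rendering
        // so the frame rate stays acceptable.
        for (int i = 0; i < LayersPerFrame; i++)
        {
            if (!schedule.MoveNext())
            {
                state = State.Decoding; // encoder finished; hand off to decoder
                return;
            }
        }
    }
}
```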

The transcription process still suffers from noticeable frame rate drops and slow transcription times, but as a proof of concept the app shows clear potential for future development.

    Conclusion

    The project serves as an exciting exploration into offline speech recognition on the Meta Quest. While there are still some performance issues to iron out, it offers a promising glimpse into local AI integration in VR. If you're interested in trying it for yourself, I’ve made the complete Unity project available for free on my Patreon page.


    Keywords

    Offline speech recognition, Meta Quest, Unity, Sentis, Whisper AI, speech-to-text transcription, XR Interaction Toolkit, log Mel spectrogram, audio encoder, audio decoder.


    FAQ

    Q1: What is the Whisper tiny model?
    A1: Whisper tiny is a smaller version of OpenAI's Whisper speech recognition model, optimized for local use and containing 39 million parameters.

    Q2: How does speech recognition work in this app?
A2: Users record their voice by holding the trigger on the left controller; pressing the trigger on the right controller sends the recording to Whisper for transcription, and the resulting text is displayed on screen.

    Q3: What technology is used to run AI models in Unity?
A3: Unity's Sentis, a neural network inference library, is used to run AI models locally within Unity applications.

    Q4: Are there any performance issues with the app?
    A4: Yes, while the app demonstrates promising capabilities, users may experience frame rate drops and slower transcription speeds.

    Q5: Where can I download the project?
    A5: The full Unity project is available for free on the creator's Patreon page.
