Azure AI Search - RAG with GPT-4o Realtime API for Audio with Azure OpenAI Service


Introduction

The Azure OpenAI Realtime API makes it possible to implement the retrieval-augmented generation (RAG) pattern in voice-centric applications. This article walks through a small example application that shows how to structure such a system with Azure AI technologies, combining real-time audio processing with retrieval over PDF documents.

Overview of the Application

This example application uses PDFs containing the employee handbook for a fictional company named Koso. The handbook includes various key details about employee benefits, healthcare plans, and company perks. Users can interact with this application through voice commands.

Sample Queries and Responses

Here are examples of the queries you might ask when interacting with the application, along with the expected responses:

  • Query: Do the company's employee perks include scuba diving lessons?
    Response: Yes, the company's employee perks include scuba diving lessons.

  • Query: What are my options for healthcare plans?
    Response: You have two options: Northwind Health Plus and Northwind Standard. The Plus plan offers comprehensive coverage.

  • Query: Not all the plans include emergency care, correct?
    Response: Correct. Only the Northwind Health Plus plan includes emergency care coverage, both in-network and out-of-network. The Northwind Standard plan does not cover emergency services.

Building Blocks of the Application

To build the application effectively, we need to consider several components:

  1. Storage and Update Mechanism:
    We store all content, including the employee handbook PDFs, in Azure Blob Storage. Keeping the source documents in one place means that documents added or modified there can be picked up automatically when the search index is updated.

  2. Search Index Creation:
    To retrieve information during voice interactions, we create a search index over the stored data. Azure AI Search supports both vector and keyword search and can combine the two, with ranking designed to surface the most relevant passages during a conversation.

  3. Application Architecture:
    The application is divided into two key components:

    • Front-End: Implemented as a React application, the front end uses the Realtime API over WebSockets to stream audio in both directions.
    • Back-End: The back end acts as a proxy. It relays audio in both directions and handles the tool invocations needed to search the index. As a result, client applications never need direct access to the retrieval system or to the Azure OpenAI model configuration.
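
To make the search-index step more concrete: Azure AI Search documents Reciprocal Rank Fusion (RRF) as the way it merges keyword and vector rankings in a hybrid query. The sketch below is a plain-Python illustration of that idea, not the service's implementation. The chunk IDs are hypothetical, and k = 60 is used here only as a commonly cited constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) for every list it appears in
    (rank starts at 1). Scores are summed across lists, and documents
    are returned in descending order of fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break ties by document ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical chunk IDs standing in for passages of the handbook PDFs.
keyword_hits = ["perks-p3", "benefits-p1", "health-p2"]
vector_hits = ["health-p2", "perks-p3", "perks-p7"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

In this toy input, `perks-p3` comes out first because it sits near the top of both lists, while chunks found by only one retriever fall to the end.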

Example Code Structure

The front end uses the Realtime API, while the back end, dubbed the RT middle tier, adds tools that run server-side, such as a search tool that uses Azure AI Search for grounding data and semantic reranking.
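
A minimal sketch of how such a middle tier might route messages is shown below. It is an assumption-laden illustration rather than the sample's actual code: the tool name `search`, the `search_handbook` stub, and the `handle_server_event` helper are hypothetical, and the event shapes (`response.output_item.done`, `conversation.item.create`, `response.create`) reflect the Realtime API protocol as I understand it and should be checked against the current API reference. No network calls are made; messages are plain dictionaries.

```python
import json

# Illustrative tool declaration the middle tier could add to the session
# configuration. Name, description, and schema are assumptions.
SEARCH_TOOL = {
    "type": "function",
    "name": "search",
    "description": "Search the employee handbook for relevant passages.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def search_handbook(query):
    """Stub standing in for an Azure AI Search query; returns canned data."""
    return [{"id": "perks-p3", "text": "Perks include scuba diving lessons."}]


# Server-side tool registry: tool name -> callable.
TOOLS = {"search": search_handbook}


def handle_server_event(event):
    """Route one event received from the model.

    Returns (to_client, to_model): events to forward to the browser and
    events to send back to the model. Tool calls are handled here and are
    never forwarded to the client.
    """
    item = event.get("item", {})
    if event.get("type") == "response.output_item.done" and item.get("type") == "function_call":
        args = json.loads(item["arguments"])
        results = TOOLS[item["name"]](**args)
        tool_output = {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": item["call_id"],
                "output": json.dumps(results),
            },
        }
        # Ask the model to continue now that it has the grounding passages.
        return [], [tool_output, {"type": "response.create"}]
    # Everything else (for example, audio deltas) passes through unchanged.
    return [event], []


# Example: a tool call is intercepted, an audio chunk is passed through.
tool_call = {
    "type": "response.output_item.done",
    "item": {
        "type": "function_call",
        "name": "search",
        "call_id": "call_1",
        "arguments": json.dumps({"query": "scuba diving lessons"}),
    },
}
audio_chunk = {"type": "response.audio.delta", "delta": "AAAA"}

print(handle_server_event(tool_call))
print(handle_server_event(audio_chunk))
```

Keeping the tool registry on the server is what lets the browser stay unaware of the search index: the client only ever sees the audio and text events that the middle tier chooses to forward.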

Conclusion

This architecture combines the features of Azure Blob Storage, Azure AI Search, and the Azure OpenAI Realtime API to create interactive, secure applications with real-time voice experiences. We look forward to seeing what innovative solutions developers will create using these technologies.


FAQ

Q: What is the Azure OpenAI Realtime API?
A: The Azure OpenAI Realtime API supports real-time audio interaction: users speak to the application, and an AI model generates the responses.

Q: How does the retrieval-augmented generation (RAG) pattern enhance voice-centric applications?
A: The RAG pattern allows applications to pull in relevant information from storage systems while generating responses, making interactions more informative and relevant.

Q: What technologies are used to build the example application?
A: The application uses Azure Blob Storage for document management, Azure AI Search for indexing and search capabilities, and a React front end powered by the Azure OpenAI Realtime API for audio interaction.

Q: How is data managed and updated within the application?
A: Data is stored in Azure Blob Storage, and updates are automatically reflected in the application when documents are changed or added.

Q: What are the benefits of structuring the application with front-end and back-end components?
A: The separation improves security by keeping API interactions and tool configurations on the server side, while the front end focuses on providing a user-friendly experience.