
Search-Based RAG with DuckDB and GLiNER



Introduction

In this article, we explore Search-Based Retrieval Augmented Generation (RAG). The approach is similar to traditional RAG, but instead of vector search we run a full-text search over document chunks. It was inspired by a blog post from Simon Willison. Let's implement it step by step in IPython.

Step 1: Reading the Transcript

We start by launching IPython and using a transcript from an episode of the "This Day in AI" podcast as our text source. We'll read in the file containing the transcript and use the Rich console to inspect its content. The transcript contains about 60 minutes of AI-related discussions.
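A minimal sketch of this step, assuming the transcript lives in a local file (the filename below is a placeholder, not the actual file from the episode):

    # Read the transcript and peek at it with Rich; "transcript.txt" is a placeholder path.
    from pathlib import Path
    from rich.console import Console

    console = Console()
    text = Path("transcript.txt").read_text(encoding="utf-8")

    console.print(f"{len(text):,} characters loaded")
    console.print(text[:500])  # inspect the first few hundred characters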

Step 2: Splitting the Text

Next, we chop the transcript into smaller chunks of around 300 characters each using the RecursiveCharacterTextSplitter class from LangChain's text splitters. Passing our text to its create_documents method yields 208 chunks, each one treated as a document.
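A sketch of the splitting step; the import path may differ by LangChain version, and chunk_overlap is an assumption since the article only mentions the ~300-character chunk size:

    # Split the transcript into ~300-character chunks; `text` comes from Step 1.
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=0)
    docs = splitter.create_documents([text])

    print(len(docs))             # 208 chunks for this transcript
    print(docs[0].page_content)  # first chunk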

Step 3: Storing the Data

We'll store these document chunks in DuckDB, a fast in-process analytical database. We'll connect to DuckDB, create a table named podcast_transcript, and insert our chunks into it. Each row contains a unique ID, an episode number, a paragraph index, and the text content.
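A sketch of the storage step; the table and column names follow the article, but the exact schema and the episode number used below are assumptions:

    # Store the chunks in DuckDB; `docs` comes from Step 2.
    import duckdb

    con = duckdb.connect("podcast.duckdb")
    con.execute("""
        CREATE TABLE IF NOT EXISTS podcast_transcript (
            id INTEGER PRIMARY KEY,
            episode INTEGER,
            paragraph INTEGER,
            text VARCHAR
        )
    """)

    # Episode number 1 is a placeholder; the paragraph index is the chunk's position.
    rows = [(i, 1, i, d.page_content) for i, d in enumerate(docs)]
    con.executemany("INSERT INTO podcast_transcript VALUES (?, ?, ?, ?)", rows)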

Step 4: Full-Text Search Index

To make the chunks searchable, we set up a full-text search index on the text column of the podcast_transcript table using DuckDB's full-text search extension. With BM25 relevance scoring, we can then query the database for the chunks most relevant to a search term such as "Claude Sonnet."
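A sketch of building the index and running a BM25-scored query, following the fts_main_<table>.match_bm25 pattern from DuckDB's full-text search documentation:

    # Build the FTS index over the text column, keyed on id; `con` comes from Step 3.
    con.execute("INSTALL fts; LOAD fts;")
    con.execute("PRAGMA create_fts_index('podcast_transcript', 'id', 'text')")

    # Retrieve the chunks most relevant to "Claude Sonnet", best matches first.
    results = con.execute("""
        SELECT text, score FROM (
            SELECT text,
                   fts_main_podcast_transcript.match_bm25(id, 'Claude Sonnet') AS score
            FROM podcast_transcript
        )
        WHERE score IS NOT NULL
        ORDER BY score DESC
        LIMIT 5
    """).fetchall()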

Step 5: Building a Query Function

We then create a function query_store that takes in a search query and a limit, and returns the most relevant text chunks. For example, querying using "Apple AI" will return chunks that talk about Apple's AI-related developments.
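A minimal version of query_store, under the same assumptions as the previous sketch:

    def query_store(query: str, limit: int = 5) -> list[str]:
        # Return the `limit` chunks that best match `query` by BM25 score.
        rows = con.execute("""
            SELECT text FROM (
                SELECT text,
                       fts_main_podcast_transcript.match_bm25(id, ?) AS score
                FROM podcast_transcript
            )
            WHERE score IS NOT NULL
            ORDER BY score DESC
            LIMIT ?
        """, [query, limit]).fetchall()
        return [row[0] for row in rows]

    query_store("Apple AI", limit=3)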

Step 6: Extracting Key Terms

To make our searches more focused, we extract key terms or phrases from user queries. Simon's blog post uses a prompt to Claude Sonnet for this, but we will use the GLiNER library instead. GLiNER is a generalist named-entity-recognition model that can extract entities for arbitrary label sets. We create an extract_search_terms function that takes a prompt and returns its key terms.
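A sketch of term extraction with GLiNER; the model checkpoint and the label set are assumptions, since the article only says GLiNER replaces the LLM prompt:

    from gliner import GLiNER

    # Any GLiNER checkpoint works here; this one is just an example.
    ner_model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

    def extract_search_terms(prompt: str) -> list[str]:
        # The label set steers what counts as a "key term" and is a guess.
        labels = ["company", "product", "technology", "person"]
        entities = ner_model.predict_entities(prompt, labels, threshold=0.3)
        return [e["text"] for e in entities]

    extract_search_terms("What did they say about Apple AI and OpenAI?")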

Step 7: Querying the Store with Extracted Terms

We modify the query function to accept an array of search terms extracted by GLiNER. In doing so, our searches become more targeted. For instance, querying about "Apple AI and OpenAI" will return chunks discussing both these entities.
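The article doesn't spell out how results for multiple terms are combined; one reasonable reading is to search per term and merge the results, deduplicating chunks:

    def query_store_multi(terms: list[str], limit: int = 5) -> list[str]:
        # Run one BM25 search per extracted term and union the results.
        chunks: list[str] = []
        for term in terms:
            for chunk in query_store(term, limit=limit):
                if chunk not in chunks:
                    chunks.append(chunk)
        return chunks

    query_store_multi(extract_search_terms("Apple AI and OpenAI"))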

Step 8: Generating Answers

Finally, we generate answers using a local llama.cpp server (llama-server) running the Mistral 7B v0.3 model. We start the server and call it with the OpenAI client library, since the server exposes an OpenAI-compatible API. Our generate_answer function takes a search query and its results, formats them into a structured prompt, and asks the LLM to produce a coherent answer grounded in the retrieved context.
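A sketch of the generation step; the server URL and port, the model name as exposed by the local server, and the prompt wording are all assumptions:

    from openai import OpenAI

    # llama-server exposes an OpenAI-compatible API; the port here is a guess.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    def generate_answer(question: str, context_chunks: list[str]) -> str:
        # Pack the retrieved chunks into a single grounded prompt.
        context = "\n\n".join(context_chunks)
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        response = client.chat.completions.create(
            model="mistral-7b-v0.3",  # whatever model id the local server reports
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    question = "Do the speakers prefer OpenAI or Claude?"
    chunks = query_store_multi(extract_search_terms(question))
    print(generate_answer(question, chunks))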

Example Queries and Output

  • What is focus mode and what's its importance? The LLM generates a succinct answer about 'focus mode' as a feature in AI.

  • Do the speakers prefer OpenAI or Claude? The LLM summarizes the discussion, indicating a preference for Claude over OpenAI.

In summary, this approach enables more effective full-text retrieval and generation using DuckDB, GLiNER, and a language model.

FAQs

Q: What is Search-Based RAG?

A: Search-Based Retrieval Augmented Generation (RAG) leverages full-text search on document chunks to retrieve relevant information before generating responses, unlike traditional RAG which uses vector searches.

Q: Why use DuckDB in this approach?

A: DuckDB provides fast and efficient storage along with support for full-text search, making it suitable for handling and querying large text datasets.

Q: How do we extract key terms from user queries?

A: We use the GLiNER library to extract important terms or phrases from the queries, which helps in generating more accurate search results.

Q: What role does the Llama Server play?

A: The Llama Server, running a model like Mistral 7B, generates coherent answers based on the context retrieved from the full-text search results.

Q: Can this method be used in chatbots?

A: Yes, this method is particularly useful for chatbots, enabling them to understand and respond based on detailed text analysis and retrieval.