Vector Databases Simply Explained! (Embeddings & Indexes)

Introduction

Vector databases have recently gained a lot of attention, with companies raising hundreds of millions of dollars to develop them. People are calling them a new kind of database for the AI era. However, for many projects a vector database is overkill, and a traditional database or even just a NumPy ndarray works fine. Despite this, vector databases are fascinating and enable many great applications, especially when it comes to giving large language models like GPT-4 long-term memory. In this article, we'll give a beginner-friendly explanation of what vector databases are and how they work, examine some use cases, and briefly introduce some options you can use. Let's get started!

Why Vector Databases?

By common estimates, over 80% of the data out there is unstructured: social media posts, images, videos, or audio. This kind of data does not fit easily into a relational database. Take an image as an example. If you want to store it in a relational database and later search for similar images, you usually have to assign keywords or tags to it manually, because similarity is hard to judge from the raw pixel values alone. The same holds true for unstructured text blobs and for audio and video data. So either we assign tags or attributes, often manually, or we use a different representation to store the data. This brings us to vector embeddings and vector databases.

Vector Embeddings and Vector Databases

In simple terms, a vector database indexes and stores vector embeddings for fast retrieval and similarity search. Let's break down these two key components:

Vector Embeddings

First, we need the so-called vector embeddings. These are calculated by machine learning models, and the vector database stores the results. A vector embedding is just a list of numbers that represents the data in a different way. For example, you can calculate an embedding for a single word, a whole sentence, or an image. Now we have numerical data that the computer can work with. One easy thing to do with vectors is to find similar ones by calculating distances and performing a nearest neighbor search. This is easiest to picture in 2D, but in reality these vectors can have hundreds or even thousands of dimensions.
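To make the nearest neighbor idea concrete, here is a minimal sketch in plain Python. The 2D vectors are hand-picked for illustration and do not come from a real embedding model, and the names `embeddings`, `cosine_similarity`, and `nearest_neighbors` are our own. Cosine similarity is used as the similarity measure; Euclidean distance would work in much the same way.

```python
import math

# Hypothetical 2D "embeddings", hand-picked for illustration only.
# A real model would produce vectors with hundreds of dimensions.
embeddings = {
    "cat": [0.9, 0.8],
    "dog": [0.8, 0.9],
    "car": [-0.7, 0.6],
    "truck": [-0.8, 0.5],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, k=2):
    """Brute force: score every stored vector against the query, keep the top k."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in embeddings.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

kitten = [0.85, 0.75]  # made-up query vector that lies close to "cat"
print(nearest_neighbors(kitten))  # → ['cat', 'dog']
```

This brute-force loop compares the query against every stored vector, which is fine for a small collection. A NumPy ndarray would do the same job faster.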

Indexing

Just storing the data as embeddings is not enough. Answering a query by computing the distance to every one of millions of stored vectors would be too slow for many applications. This is why the vectors also need to be indexed, and indexing is the second key element of a vector database. An index is a data structure that speeds up the search: the indexing step maps the vectors to a new structure in which only a small part of the data has to be examined per query, often at the price of returning approximate rather than exact nearest neighbors. There are different ways to build indexes, and the topic is a research field of its own. For now, just know that indexes are needed for efficient search.
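As a rough illustration of what an index can look like, the sketch below groups vectors into buckets around a few centroids and, at query time, scans only the bucket closest to the query. This is a heavily simplified take on the inverted-file (IVF) idea, not how any particular database implements it. The function names, the random toy data, and the choice of random data points as centroids are all assumptions made for this example.

```python
import math
import random

def nearest_centroid(vec, centroids):
    """Return the position of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))

def build_index(vectors, centroids):
    """Group vector ids into buckets keyed by their nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        buckets[nearest_centroid(vec, centroids)].append(vec_id)
    return buckets

def search(query, vectors, centroids, buckets):
    """Scan only the bucket whose centroid is closest to the query."""
    candidates = buckets[nearest_centroid(query, centroids)]
    best = min(candidates, key=lambda vec_id: math.dist(query, vectors[vec_id]), default=None)
    return best, len(candidates)

random.seed(0)  # fixed seed so the toy data is reproducible
vectors = [[random.random(), random.random()] for _ in range(1000)]
centroids = random.sample(vectors, 10)  # assumption: random data points as centroids
buckets = build_index(vectors, centroids)

best_id, scanned = search([0.5, 0.5], vectors, centroids, buckets)
print(f"closest stored vector: {vectors[best_id]}, compared against {scanned} of {len(vectors)}")
```

The trade-off is that the true nearest neighbor may sit in a neighboring bucket and be missed. Real indexes typically probe several buckets or use other structures to balance speed against accuracy.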

Use Cases

Long-term Memory for Large Language Models: Vector databases can be used to equip large language models with long-term memory. This is something you can easily implement with LangChain.
Semantic Search: In scenarios where we need to search not for exact string matches but rather based on the meaning or context of the question, vector databases can be highly useful.
Similarity Search for Multimedia Data: Vector databases can be employed for image, audio, or video similarity searches, where you can find similar items without using keywords to describe them.
Ranking and Recommendation Engines: For online retailers, vector databases can help suggest items similar to past purchases of a customer by simply identifying the nearest neighbors of an item in the database.
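The long-term memory use case boils down to a simple pattern: store text snippets together with their vectors, then pull back the most relevant ones when a new question arrives. Here is a minimal sketch of that pattern. `toy_embed` is a hypothetical stand-in that just counts words, so it captures word overlap rather than meaning. A real system would call an embedding model and a vector database instead, and `MemoryStore` is our own illustrative class, not a LangChain API.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a sparse word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class MemoryStore:
    def __init__(self):
        self.items = []  # list of (embedding, original text) pairs

    def add(self, text):
        self.items.append((toy_embed(text), text))

    def recall(self, question, k=1):
        """Return the k stored snippets most similar to the question."""
        query = toy_embed(question)
        ranked = sorted(self.items, key=lambda item: cosine(query, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]

memory = MemoryStore()
memory.add("the user prefers answers in german")
memory.add("the project deadline is friday")
memory.add("the user owns a dog named rex")

print(memory.recall("when is the project deadline"))  # → ['the project deadline is friday']
```

The recalled snippets would then be placed into the prompt of the language model, so that it can answer with information from earlier conversations.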

Options for Vector Databases

There are numerous vector databases available:

Pinecone
Weaviate
Chroma
Redis
Milvus
Vespa AI

These options come with different features and capabilities. If you are interested in a detailed comparison, feel free to let us know!

Keywords

Vector databases
Vector embeddings
Indexing
Semantic search
Similarity search
Long-term memory
Machine learning
Distance metric

FAQ

Q: What are vector embeddings? A: Vector embeddings are numerical representations of data generated by machine learning models. They allow computers to understand and manipulate unstructured data like text, images, and audio.

Q: Why can't traditional databases handle unstructured data effectively? A: Traditional databases are designed for structured data. Unstructured data like text, images, and videos do not fit neatly into relational tables, making it difficult to perform searches and queries.

Q: How does indexing help in vector databases? A: Indexing maps vectors to a new data structure, enabling faster and more efficient searches. It is essential for handling large datasets and performing real-time queries.

Q: Can you give some examples of use cases for vector databases? A: Yes, vector databases can be used for equipping large language models with long-term memory, performing semantic search, similarity search for multimedia data, and for creating ranking and recommendation engines.

Q: What are some popular vector databases available today? A: Some popular vector databases include Pinecone, Weaviate, Chroma, Redis, Milvus, and Vespa AI.

By understanding vector databases and their potential applications, you can better decide whether they are suitable for your projects. Thank you for reading!