AI Show | Similarity and Scoring in Azure Cognitive Search

[Music]

Hello and welcome to this episode of the AI Show. We're going to talk about similarity and scoring in Azure Cognitive Search. I've got a special guest with me. Tell us who you are and what you do, my friend.

Hey Seth and everyone watching. My name is Ralph Marooch. I'm a software engineer on the Azure Cognitive Search team. I've been on the team for a few years now, mostly focusing on relevance, which is essentially how we rank documents. Today, I'm going to discuss how we do that.

Basic Overview: Azure Cognitive Search

Azure Cognitive Search is a "search as a service" product on Azure. The idea is simple: you provide us with your documents, either by telling us where they are or by pushing them via our API. From there, we offer rich full-text search capabilities.

The Two Main Processes

Most search engines operate through two main processes: indexing and querying.

Indexing: This is an asynchronous process in which we get your documents into our search index.
Querying: This is how we efficiently fetch documents relevant to a specific query, optimizing for speed since users are usually waiting for the results.

Text Processing and Inverted Index

Text processing is one of the most compute-intensive parts of indexing. We apply lexical analysis to break raw text into tokens, then normalize those tokens with techniques like stemming or lemmatization. This normalization, together with the handling of stopwords and possessives, helps increase recall at query time.
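To make the analysis step concrete, here is a minimal sketch in Python. It is not the analyzer the service actually runs: the stopword list, the possessive handling, and the plural-stripping "stemmer" are all simplified assumptions, and the function name analyze is hypothetical.

```python
import re

# Tiny illustrative stopword list; real analyzers ship much larger ones.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}

def analyze(text):
    """Lowercase, tokenize, strip possessives, drop stopwords, crudely stem."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    result = []
    for token in tokens:
        # Strip a trailing possessive: "hotel's" -> "hotel".
        if token.endswith("'s"):
            token = token[:-2]
        token = token.strip("'")
        if not token or token in STOPWORDS:
            continue
        # Very crude plural stripping, standing in for a real stemmer.
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]
        result.append(token)
    return result

print(analyze("The hotel's rooms and the pools"))  # ['hotel', 'room', 'pool']
```

Because "rooms" and "room" collapse to the same token, a query for either form can match the same document, which is the recall gain described above.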

Once we have the tokens, we create an inverted index. This data structure enables quick identification of documents matching the query terms without scanning the entire content.
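As a rough illustration of the data structure, the sketch below builds a toy inverted index that maps each token to the documents containing it, along with a per-document term count. The sample documents and the function name build_inverted_index are made up for this example, and tokenization is reduced to whitespace splitting.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

# Hypothetical sample documents.
docs = {
    1: "beach hotel with pool",
    2: "city hotel near station",
    3: "beach house",
}

index = build_inverted_index(docs)
print(sorted(index["hotel"]))  # [1, 2]
print(sorted(index["beach"]))  # [1, 3]
```

Looking up a term is a single dictionary access, so the cost depends on the number of matching documents rather than on the size of the whole collection.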

Query Processing and Ranking

Upon receiving a search query, we apply a lightweight version of lexical analysis to extract query tokens. These tokens are then matched against the inverted index to retrieve the candidate documents.
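The matching step can be sketched as a set lookup per query token. This is a simplified illustration with a hand-written index; the require_all flag is a hypothetical switch between "any term" and "all terms" matching, not a parameter of the service.

```python
# Hypothetical inverted index: token -> set of matching document ids.
index = {
    "beach": {1, 3},
    "hotel": {1, 2},
    "pool": {1},
}

def match(query, index, require_all=False):
    """Return the ids of documents matching any (or all) of the query tokens."""
    tokens = query.lower().split()
    postings = [index.get(token, set()) for token in tokens]
    if not postings:
        return set()
    if require_all:
        return set.intersection(*postings)
    return set.union(*postings)

print(sorted(match("beach hotel", index)))                    # [1, 2, 3]
print(sorted(match("beach hotel", index, require_all=True)))  # [1]
```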

After retrieving the documents, we need a way to rank them. This is where the similarity score comes in. The score is computed from term frequency (how often a term appears in a document) and document frequency (how many documents contain the term). Most algorithms, such as the well-known TF-IDF, rely on these two variables. Production systems use more elaborate variants like BM25, which also accounts for document length and normalizes term frequency.
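For readers who want to see how these variables combine, here is a minimal sketch of a BM25-style score. It uses a commonly cited form of the formula with conventional parameter values (k1 = 1.2, b = 0.75); those values, the toy corpus, and the function name bm25_score are assumptions for illustration, and the service's internal implementation may differ in detail.

```python
import math

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query over a tokenized corpus."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    score = 0.0
    for term in query_tokens:
        tf = doc_tokens.count(term)  # term frequency in this document
        if tf == 0:
            continue
        df = sum(1 for doc in corpus if term in doc)  # document frequency
        # Rare terms (low df) get a higher inverse document frequency.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term frequency saturates, and longer documents are penalized.
        length_norm = 1 - b + b * len(doc_tokens) / avg_len
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score

# Hypothetical tokenized corpus.
corpus = [
    ["beach", "hotel"],
    ["beach", "hotel", "with", "a", "large", "pool"],
    ["city", "apartment"],
]

for doc in corpus:
    print(round(bm25_score(["beach"], doc, corpus), 3), doc)
```

The first two documents both contain "beach" once, but the shorter one scores higher because of the length normalization; the third document scores zero.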

Customizing Similarity and Scoring: Scoring Profiles

While the default settings in Azure Cognitive Search work well for general purposes, more fine-tuned control is often necessary. This is where scoring profiles come in. Developers can create customized scoring profiles to better tailor relevance based on domain knowledge.

Field Weights: In structured documents, different fields can be assigned different weights. For example, the "name" field might be five times as important as the "review" field.
Functions: Non-searchable fields like ratings, distances, or tags can be used to adjust the similarity score. Types of functions include magnitude, freshness, distance, and tags. These functions compute a boosting value that is multiplied by the raw similarity score to derive the final score.
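The two mechanisms above can be sketched together as follows. This is only an illustration of the idea: the per-field scores, the linear interpolation inside magnitude_boost, and all function names are hypothetical, and the exact arithmetic the service applies is not spelled out in this episode.

```python
def weighted_text_score(field_scores, weights):
    """Sum per-field similarity scores, scaling each by its field weight."""
    return sum(score * weights.get(field, 1.0)
               for field, score in field_scores.items())

def magnitude_boost(value, start, end, boost):
    """Interpolate linearly from 1.0 (no boost) at start to boost at end."""
    fraction = (value - start) / (end - start)
    fraction = min(1.0, max(0.0, fraction))  # clamp values outside the range
    return 1.0 + (boost - 1.0) * fraction

def final_score(field_scores, weights, rating):
    """Multiply the raw weighted score by the boost from a rating field."""
    raw = weighted_text_score(field_scores, weights)
    return raw * magnitude_boost(rating, start=1, end=5, boost=2.0)

# "name" counts five times as much as "review", as in the example above.
weights = {"name": 5.0, "review": 1.0}
field_scores = {"name": 1.0, "review": 1.0}  # hypothetical per-field scores

print(final_score(field_scores, weights, rating=1))  # 6.0
print(final_score(field_scores, weights, rating=3))  # 9.0
print(final_score(field_scores, weights, rating=5))  # 12.0
```

Two documents with identical text matches end up ranked by rating, which is the kind of domain knowledge a scoring profile is meant to encode.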

Implementing Scoring Profiles

Scoring profiles can be defined via the REST API or through the Azure portal, which offers an intuitive interface for selecting and configuring your weights and functions.
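As a sketch of what such a definition might look like, the snippet below builds a scoring profile as a Python dictionary and prints it as JSON. The property names reflect my understanding of the index definition schema and should be checked against the current REST API reference; the profile name, field names, and values are hypothetical. No request is sent.

```python
import json

# Hypothetical profile: weight "name" over "review" and boost by "rating".
scoring_profile = {
    "name": "boostNameAndRating",  # hypothetical profile name
    "text": {"weights": {"name": 5, "review": 1}},
    "functions": [
        {
            "type": "magnitude",
            "fieldName": "rating",  # hypothetical numeric field
            "boost": 2,
            "interpolation": "linear",
            "magnitude": {
                "boostingRangeStart": 1,
                "boostingRangeEnd": 5,
                "constantBoostBeyondRange": False,
            },
        }
    ],
    "functionAggregation": "sum",
}

# Assumed placement: a list of profiles inside the index definition.
index_fragment = {"scoringProfiles": [scoring_profile]}
print(json.dumps(index_fragment, indent=2))
```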

Conclusion

That's a quick dive into how similarity and scoring work in Azure Cognitive Search, and how you can customize it for your needs. If you have feedback or would like to share how you're using relevance in your searches, please contact us at azuresearchrelevance@microsoft.com. We are keen to hear from you!

Thanks for tuning in and see you next time.

[Music]

Keywords

Azure Cognitive Search
Indexing
Querying
Lexical Analysis
Inverted Index
Term Frequency
Document Frequency
TF-IDF
BM25
Scoring Profiles
Field Weights
Magnitude Function
Freshness
Distance
Tags

FAQ

Q: What is Azure Cognitive Search? A: Azure Cognitive Search is a "search as a service" product on Azure offering rich full-text search capabilities.

Q: What are the two main processes in Azure Cognitive Search? A: The two main processes are indexing (getting documents into the search index) and querying (retrieving documents relevant to a specific query).

Q: How does text processing work in Azure Cognitive Search? A: Text processing involves lexical analysis to extract tokens, applying techniques like stemming and lemmatization, and handling stopwords and possessives. The resulting tokens are then used to create an inverted index.

Q: What is an inverted index? A: An inverted index is a data structure that facilitates quick identification of documents matching specific query terms without scanning the entire content.

Q: How is the similarity score calculated? A: Similarity scores are calculated based on term frequency (how often a term appears in a document) and document frequency (how common a term is across documents). Algorithms like TF-IDF and BM25 are typically used.

Q: What are scoring profiles? A: Scoring profiles allow developers to fine-tune search relevance by assigning weights to different fields and applying functions to non-searchable fields like ratings and distances.

Q: How can scoring profiles be implemented? A: Scoring profiles can be implemented via the REST API or through the Azure portal, where an intuitive interface helps you configure weightings and functions.