SIGIR 2024: Large Language Models Can Accurately Predict Searcher Preferences
Introduction
Labeling document relevance is a crucial task in information retrieval and, increasingly, in generative AI. This article discusses our work on using large language models (LLMs) to label document relevance effectively, covering our methods, findings, and the implications for the future of search technologies.
The Importance of Labeling in Information Retrieval
Labeling documents as relevant or irrelevant is essential across various tasks within information retrieval systems. Such labels help in assessing search performance and improving user experiences. Ideally, labels should come from real users who can clearly indicate which documents met their needs for a given query. However, obtaining labels from users can be costly and impractical.
The Quality-Cost Continuum of Labels
Labels can be placed on a continuum of quality against cost. At one extreme are the highest-quality labels, derived from actual users, which are also the most expensive to obtain; at the other are low-cost, lower-quality labels produced by crowd workers or other non-expert annotators. We found that, when prompted well, LLMs occupy a sweet spot on this continuum, providing high-quality labels at much lower cost.
Our Methodology
We developed a methodology in which a small set of high-quality human-generated labels was used to develop and evaluate LLM prompts for generating a much larger set of machine-generated labels. We also established a validation cycle: machine-generated labels were sampled and reviewed by expert annotators, and discrepancies between machine and human labels were handled systematically, allowing ongoing refinement of our labeling approach.
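To make the evaluation half of this cycle concrete, the sketch below compares a batch of machine-generated labels against gold human labels using Cohen's kappa. This is a minimal illustration, not the study's actual measurement code: the 0-3 ordinal scale and the example label values are our own assumptions.

```python
# A minimal sketch of scoring machine labels against gold human labels.
# The 0-3 ordinal scale and the example values are illustrative assumptions.
from sklearn.metrics import cohen_kappa_score

gold = [3, 0, 2, 1, 0, 3]     # expert labels for sampled (query, document) pairs
machine = [3, 0, 2, 2, 0, 3]  # LLM labels for the same pairs

# Quadratic weighting credits near-misses on an ordinal relevance scale.
kappa = cohen_kappa_score(gold, machine, weights="quadratic")
print(f"Agreement with expert labels (quadratic kappa): {kappa:.3f}")
```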
The Labeling Process
- Building the Corpus: Starting from human-generated labels for real user queries, we designed prompts instructing LLMs to label additional documents (a minimal sketch of such a prompt follows this list).
- Validation: A random sample of machine-generated labels was reviewed by humans; flagged errors were escalated up a hierarchy of annotators for further validation.
- Iteration: The process was iterative, with prompts refined based on the quality of the outputs and on feedback from annotators and the human review process.
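The sketch below shows what one such labeling call might look like. The prompt wording, the 0-2 scale, the `preamble` hook, and the model name are illustrative assumptions in the general style the paper describes, not the authors' exact prompt; the OpenAI client is used here only as a stand-in for whatever LLM endpoint is available.

```python
# A minimal sketch of asking an LLM for a relevance label. Prompt wording,
# the 0-2 scale, and the model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = """Given a query and a web page, provide a score on an
integer scale of 0 to 2, where 2 = highly relevant, 1 = partially relevant,
and 0 = not relevant.

Query: {query}
Web page: {passage}

Answer with the score only."""

def label_relevance(query: str, passage: str,
                    preamble: str = "", model: str = "gpt-4") -> int:
    """Ask the model for one relevance label; `preamble` lets callers prepend
    extra prompt components (role, aspects, examples)."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # near-deterministic labels simplify validation
        messages=[{
            "role": "user",
            "content": preamble + PROMPT_TEMPLATE.format(query=query,
                                                         passage=passage),
        }],
    )
    return int(response.choices[0].message.content.strip())

print(label_relevance("symptoms of vitamin d deficiency",
                      "Vitamin D deficiency can cause fatigue and bone pain."))
```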
Findings and Results
We observed that LLMs, particularly GPT-4, outperformed both traditional crowdsourced labels and expert annotations on several relevance-labeling tasks, while producing these high-quality labels much faster and at lower cost.
Prompt Sensitivity
Our exploration of prompt design revealed that subtle variations in wording can lead to significant differences in annotation quality. A well-structured prompt could outperform others considerably, and certain prompt components interacted non-linearly, so prompts had to be designed and evaluated as a whole rather than one component at a time.
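One simple way to handle such interactions is to evaluate every combination of prompt components against the gold labels rather than tuning each component in isolation. The sketch below does this with a full grid; the component wordings and the tiny gold sample are made up for illustration, and it reuses `label_relevance()` from the earlier sketch.

```python
# A sketch of grid-searching prompt components. Because components interact
# non-linearly, every combination is scored against gold labels instead of
# tuning each component in isolation. Component wordings, the gold sample,
# and label_relevance() (from the earlier sketch) are illustrative.
from itertools import product
from sklearn.metrics import cohen_kappa_score

ROLES = ["", "You are a search quality rater. "]
ASPECTS = ["", "First consider the intent behind the query. "]

GOLD = [  # (query, passage, expert label) triples, made up for illustration
    ("symptoms of vitamin d deficiency",
     "Vitamin D deficiency can cause fatigue, bone pain, and weakness.", 2),
    ("symptoms of vitamin d deficiency",
     "Apple pie is best served warm with vanilla ice cream.", 0),
    ("symptoms of vitamin d deficiency",
     "Vitamin D is found in oily fish and fortified milk.", 1),
]

def agreement(role: str, aspect: str) -> float:
    """Kappa between expert labels and LLM labels under one prompt variant."""
    machine = [label_relevance(q, p, preamble=role + aspect)
               for q, p, _ in GOLD]
    return cohen_kappa_score([label for _, _, label in GOLD], machine)

# Score the full grid; the best variant may pair components that look weak
# on their own, which is exactly why one-at-a-time tuning falls short.
best_role, best_aspect = max(product(ROLES, ASPECTS),
                             key=lambda v: agreement(*v))
```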
Implications and Future Work
Our findings suggest that LLMs can be integral tools for generating high-quality, large-scale labels in search and information retrieval systems. As labeling technology improves, it will enable new ways of assessing and enhancing the search experience, such as measuring the conditional relevance of documents and adapting to user behavior dynamically.
Conclusion
Our research confirms that LLMs, given appropriate prompts, can deliver better label quality, speed, and cost-effectiveness than traditional sources of document relevance labels. This advance paves the way for innovative search technologies and a deeper understanding of searchers' preferences.
Keywords
- Large Language Models (LLMs)
- Document Relevance
- Information Retrieval
- Labeling
- User Queries
- Annotators
- Cost-Effectiveness
FAQ
Q1: What are large language models?
A1: Large language models are advanced AI systems capable of understanding and generating human-like text, enabling them to perform tasks such as labeling and content generation.
Q2: How does labeling affect information retrieval?
A2: Labeling indicates which documents are relevant or irrelevant for users, significantly impacting the performance and effectiveness of search systems.
Q3: What was the primary finding of the research?
A3: The research found that machine-generated labels from LLMs, specifically GPT-4, offered higher quality than those generated by the average human annotator or crowdsourced efforts, while being faster and cheaper.
Q4: How were the labels validated in the study?
A4: A random sample of machine-generated labels was reviewed by human annotators, and any discrepancies were elevated for further review by experts to ensure quality.
Q5: What is the significance of prompt designs in using LLMs?
A5: The design of prompts significantly influences the quality of outputs from LLMs. Small changes in wording can lead to substantial differences in the generated labels' quality.