
Evaluating Large Language Models (LLMs) - The Recent Practices & New Approaches with LiveCodeBench

Introduction

It's slightly after 11, so why don't we kick this thing off? Hi everybody, my name is Greg Chase. I am the host of the AI Foundry.org podcast and a community organizer here at AI Foundry.ai. I am pleased to be joining my colleague Julia to discuss fascinating academic developments in AI, particularly focusing on how people bring AI into production. For this episode, we'll dive into the process of evaluating different large language models (LLMs) for various applications based on new and updated benchmarks.

Exciting Updates from Hugging Face

Julia began by sharing some exciting news: Hugging Face recently updated their Open LLM Leaderboard with more challenging benchmarks, making the scores more revealing and less tightly clustered. This update is particularly timely for our discussion.

The Significance of the Leaderboards

Julia highlighted how the previous leaderboard showed little difference between models, with many models almost touching human baseline performance. This longstanding plateau created an obvious need for more challenging benchmarks. Fortunately, Hugging Face has risen to the occasion.

Detailed Breakdown of the New Benchmarks

A significant development is the set of new, harder benchmarks included in the evaluation, among them MMLU Pro and GSM8K:

  • MMLU Pro: an updated version of the Massive Multitask Language Understanding benchmark, with more intricate and difficult questions.
  • GSM8K: grade-school math word problems that test multi-step mathematical reasoning, an area many models still struggle with.
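
To illustrate why math benchmarks like GSM8K are scored so strictly, here is a minimal sketch of how a final numeric answer might be extracted from a model's free-form solution and compared with the reference. The "####" delimiter follows GSM8K's reference-answer format; the sample texts and helper names are hypothetical, not taken from the real dataset or the leaderboard's actual harness.

```python
import re

def extract_final_number(text: str):
    """Pull the last number appearing in a model's free-form solution text."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def extract_reference_answer(reference: str) -> str:
    """GSM8K references end with '#### <answer>'; keep only that value."""
    return reference.split("####")[-1].strip().replace(",", "")

def is_correct(model_output: str, reference: str) -> bool:
    predicted = extract_final_number(model_output)
    expected = extract_reference_answer(reference)
    return predicted is not None and float(predicted) == float(expected)

# Hypothetical example for illustration only.
reference = "Natalia sold 48 clips in April and half as many in May ... #### 72"
model_output = "She sold 48 + 24 = 72 clips in total. The answer is 72."
print(is_correct(model_output, reference))  # True
```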

We compared previous and new benchmark scores using plots, which showed that models generally scored lower on the newer, more challenging benchmarks. For instance, Llama 3 70B dropped from roughly 80% on the older benchmark to below 50% on the new one.
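
To reproduce that kind of comparison at home, a rough sketch like the one below charts old-versus-new averages side by side. Only the Llama 3 70B pair echoes the roughly 80% versus below-50% figures mentioned above; the other model names and values are illustrative placeholders, not real leaderboard numbers.

```python
import matplotlib.pyplot as plt
import numpy as np

# Average leaderboard scores (percent). Only the Llama 3 70B pair reflects the
# figures discussed above; the rest are illustrative placeholders.
models = ["Llama 3 70B", "Model B", "Model C"]
old_scores = [80, 75, 70]   # previous leaderboard
new_scores = [48, 42, 35]   # updated, harder benchmark suite

x = np.arange(len(models))
width = 0.35

fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(x - width / 2, old_scores, width, label="Old benchmarks")
ax.bar(x + width / 2, new_scores, width, label="New benchmarks")
ax.set_xticks(x)
ax.set_xticklabels(models)
ax.set_ylabel("Average score (%)")
ax.set_title("Scores drop on the harder, updated benchmarks")
ax.legend()
plt.tight_layout()
plt.show()
```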

High Variability Across Different Tasks

An interesting observation was the variance in performance across different tasks within the new leaderboard. Unlike the previous uniformity, some models excelled at specific tasks while lagging in others. For instance, the Qwen model, which performed excellently on the new leaderboard overall, showed a stark disparity between its mathematics score and its general understanding scores.
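
One simple way to quantify that task-to-task spread is to compute each model's standard deviation across its individual benchmark scores: a high spread flags a model that is strong in some areas and weak in others. The scores below are hypothetical placeholders, not actual leaderboard values.

```python
import statistics

# Hypothetical per-task scores (percent); real values would come from the
# leaderboard's published results.
scores = {
    "Model A": {"MMLU-Pro": 55, "Math": 20, "Reasoning": 50, "Instruction": 60},
    "Model B": {"MMLU-Pro": 45, "Math": 42, "Reasoning": 44, "Instruction": 47},
}

for model, per_task in scores.items():
    values = list(per_task.values())
    print(f"{model}: mean={statistics.mean(values):.1f}, "
          f"task spread (stdev)={statistics.stdev(values):.1f}")
```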

How Models Are Scored

Greg posed an interesting question about how models are scored. Julia explained that Hugging Face runs these tests themselves using automated systems to ensure fairness and consistency.

Noteworthy Models & Categories

We discussed various models, including Llama 3 and the Qwen models, noting how their new scores aligned with community predictions. The Qwen models, for instance, performed considerably better than expected in some domains, revealing the complexity and unique capabilities of each model.

Understanding Different Benchmarks

We delved deeper into specific benchmarks like MMLU Pro, which focuses on disciplines from law to computer science and philosophy. The updated leaderboards show a diverse set of challenges that make it difficult for models to score uniformly well across all areas.
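
For a sense of what an MMLU Pro-style item looks like in practice, here is a minimal sketch that formats a multiple-choice question (MMLU Pro expands the option set to as many as ten choices) and checks the letter a model returns. The question, options, and helper functions are hypothetical stand-ins for illustration, not the benchmark's actual harness.

```python
import string

def build_prompt(question: str, options: list) -> str:
    """Format a multiple-choice question with lettered options (A, B, C, ...)."""
    lines = [question]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def score_choice(model_reply: str, correct_letter: str) -> bool:
    """Exact match on the first letter the model produces."""
    return model_reply.strip().upper()[:1] == correct_letter.upper()

# Hypothetical item with ten options, in the spirit of MMLU Pro.
question = "Which data structure gives O(1) average-time lookups by key?"
options = ["Array", "Linked list", "Hash table", "Binary heap", "Stack",
           "Queue", "B-tree", "Trie", "Skip list", "Graph"]
prompt = build_prompt(question, options)
print(score_choice("C. Hash table", correct_letter="C"))  # True
```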

The Process of Evaluation

Julia explained what goes into these comprehensive evaluations:

  • Prompts: Specific questions are posed to models.
  • Answers: Responses are matched against expected answers.
  • Consistency: Automated systems ensure that results are reproducible and fair.
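
Putting those three pieces together, a stripped-down evaluation loop might look like the sketch below: fixed prompts, deterministic (greedy) generation for reproducibility, and exact-match scoring against expected answers. The generate callable and the tiny dataset are hypothetical placeholders; the real leaderboard relies on Hugging Face's automated evaluation infrastructure rather than a toy loop like this.

```python
from typing import Callable

def evaluate(generate: Callable[[str], str], dataset: list) -> float:
    """Run every prompt through the model and score by exact match."""
    correct = 0
    for example in dataset:
        # Deterministic decoding (e.g. temperature 0 / greedy) keeps runs reproducible.
        prediction = generate(example["prompt"]).strip().lower()
        if prediction == example["answer"].strip().lower():
            correct += 1
    return correct / len(dataset)

# Hypothetical toy dataset and model stub for illustration.
dataset = [
    {"prompt": "What is 2 + 2? Answer with a number only.", "answer": "4"},
    {"prompt": "Name the capital of France. One word.", "answer": "paris"},
]
fake_model = lambda prompt: "4" if "2 + 2" in prompt else "Paris"
print(f"Accuracy: {evaluate(fake_model, dataset):.2f}")  # Accuracy: 1.00
```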

Interesting Papers & Future Challenges

We also touched upon recent academic papers, such as those from the University of Waterloo and the University of Toronto, that contribute to the development of more robust benchmarks. For example, MMLU Pro is a recently revised benchmark that attempts to test deeper logical connections and more complex reasoning in LLMs.

Community Queries

Several questions were tackled, such as whether any companies are using LLM APIs to augment databases with qualitative data, and whether there are tournaments between models similar to chess tournaments. The answer highlighted that models are compared in ongoing head-to-head competitions, but primarily through standardized benchmarks.

Conclusion

In summary, evaluating LLMs is an evolving field that requires ongoing updates and community collaboration. The new benchmarks from Hugging Face represent a significant step forward in distinguishing the strengths and weaknesses of each model.

FAQ

  1. Why did Hugging Face update their leaderboard? Hugging Face updated the leaderboard to introduce more challenging benchmarks, making it easier to distinguish between different models' performance.

  2. What are the MMLU Pro and GSM 8K benchmarks? MMLU Pro is an updated, harder version of the Massive Multitask Language Understanding benchmark, while GSM8K focuses on grade-school mathematical reasoning.

  3. How are models scored on the Hugging Face leaderboard? Scoring is done by running automated inference tests on the models using specific question sets and then matching the model's answers to expected results.

  4. What’s unique about the new benchmark scores? The new benchmarks show much greater variability in model performance across different tasks, unlike the older benchmarks where scores were closely clustered.

  5. Are there any tournaments between LLMs? Currently, ongoing competitions like the LM Arena host continual evaluations, but standardized benchmarking remains the primary method for comparing models.