Why GPT Outperformed Gemini and Ollama in Calculation

Introduction

While building a budgeting app, I ran into a significant problem: the large language models (LLMs) I tried differed widely in how reliably they handled mathematical tasks. That made me rethink which model should manage calculations in my application.

To compare the three LLMs (Gemini, Ollama, and GPT), I gave each of them the same dataset. I wrote a function to handle the calculation, updated the instructions so the model was told to use it, and then ran the evaluation. The results revealed a clear hierarchy in performance.
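To make the setup concrete, here is a minimal sketch of the kind of function a model could be told to call instead of doing the arithmetic itself. The function name `sum_transactions`, the sample data, and the tool description are all assumptions made for illustration; the actual function and dataset used in the test are not shown here, and the exact tool format differs from one provider to another.

```python
from decimal import Decimal

# Hypothetical sample data; the dataset used in the test is not shown.
TRANSACTIONS = [
    {"id": 1, "category": "groceries", "amount": "42.10"},
    {"id": 2, "category": "rent", "amount": "950.00"},
    {"id": 3, "category": "groceries", "amount": "17.95"},
]


def sum_transactions(transactions, category=None):
    """Return the matching transactions and their exact total.

    Decimal is used so that cent amounts add up without
    floating-point rounding errors.
    """
    matched = [
        t for t in transactions
        if category is None or t["category"] == category
    ]
    total = sum((Decimal(t["amount"]) for t in matched), Decimal("0"))
    return {"transactions": matched, "total": str(total)}


# Assumed JSON-Schema-style description of the function. Many
# function-calling APIs accept something of this general shape,
# but the field names here are illustrative, not a specific API.
SUM_TRANSACTIONS_TOOL = {
    "name": "sum_transactions",
    "description": "Sum the user's transactions, optionally by category.",
    "parameters": {
        "type": "object",
        "properties": {"category": {"type": "string"}},
        "required": [],
    },
}

result = sum_transactions(TRANSACTIONS, category="groceries")
print(result["total"])  # 60.05
```

The point of this design is that the model only decides which function to call and with which arguments, while the total itself is computed by ordinary code.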

Gemini did not follow the instructions: instead of calling the function, it performed the calculation itself, and the final total was wrong. Ollama’s performance was just as disappointing. It failed to return the relevant transactions, and its response contained letters even though the schema explicitly required numerical values.
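Failures like these can be caught automatically by checking each response before the app trusts it. The sketch below assumes a hypothetical response shape, a JSON object with a numeric `total` and a list of integer `transaction_ids`; the real schema from the test is not shown, so treat the field names as placeholders.

```python
import json


def validate_response(raw):
    """Return a list of problems found in a model's JSON response.

    An empty list means the response matches the assumed schema.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    if not isinstance(data, dict):
        return ["response is not a JSON object"]

    errors = []

    # bool is a subclass of int in Python, so exclude it explicitly.
    total = data.get("total")
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        errors.append("total must be a number")

    ids = data.get("transaction_ids")
    if not isinstance(ids, list) or not ids:
        errors.append("transaction_ids must be a non-empty list")
    elif any(isinstance(i, bool) or not isinstance(i, int) for i in ids):
        errors.append("transaction_ids must contain only integers")

    return errors


print(validate_response('{"transaction_ids": [1, 3], "total": 60.05}'))
print(validate_response('{"transaction_ids": [], "total": "sixty"}'))
```

The first call prints an empty list. The second reports both problems described above: a total made of letters and no relevant transactions.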

In stark contrast, GPT was the clear winner in this test. It called the function I provided and returned the correct answer in the format the instructions specified. This experience raises an important question: is it worth investing further in these language models, given the discrepancies in their performance?

Keywords

GPT
Gemini
Ollama
Calculation
Budgeting app
Performance evaluation
Numerical values
Function call

FAQ

Q: What was the purpose of testing the three LLMs?
A: The purpose was to evaluate their capabilities for handling calculations in a budgeting app.

Q: What were the outcomes of the tests?
A: GPT successfully called the provided function and returned the correct answer, while Gemini and Ollama failed to meet the expected requirements.

Q: Why is Gemini considered less effective in this context?
A: Gemini opted to calculate totals itself and produced incorrect results, despite being instructed to use a predefined function.

Q: What issues did Ollama encounter?
A: Ollama failed to return the relevant transactions, and its response included letters where the schema required numerical values.

Q: What conclusion can be drawn from this evaluation?
A: GPT proved to be the most reliable model for calculations in this instance, raising questions about the effectiveness and worth of the other models when it comes to mathematical accuracy.

