ChatGPT vs Llama - Which LLM writes better OpenAPI?

Introduction

In the rapidly evolving world of artificial intelligence, large language models (LLMs) have shown that they can generate a variety of content, including code. GitHub Copilot and ChatGPT are notable examples, but how adept are these models at designing an API? In this article, we pit two powerful LLMs, ChatGPT 4 and Llama 3.1, against each other to evaluate how well they generate OpenAPI documents for a fictional link-shortening application called "Mini Links."

The Task

The prompt provided to both models was straightforward: create an API for Mini Links using the OpenAPI specification. The application should support creating, updating, listing, and deleting links, and should also allow for user management. The requirements included adherence to security standards, use of JSON Schema, and provision of examples. Output was to be formatted as JSON.
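To make the task concrete, here is a minimal sketch of the kind of document the prompt asks for, built in Python and printed as JSON. It is not the output of either model. The path names, schema fields, operation IDs, and the bearer-token security scheme are all assumptions chosen for illustration, and only the list and create operations are shown.

```python
import json


def build_mini_links_spec() -> dict:
    """Return a small OpenAPI 3.0 document for a hypothetical Mini Links API."""
    # Hypothetical JSON Schema for a link; the field names are illustrative.
    link_schema = {
        "type": "object",
        "required": ["id", "url"],
        "properties": {
            "id": {"type": "string", "example": "abc123"},
            "url": {
                "type": "string",
                "format": "uri",
                "example": "https://example.com/a/long/path",
            },
        },
    }
    link_ref = {"$ref": "#/components/schemas/Link"}
    return {
        "openapi": "3.0.3",
        "info": {"title": "Mini Links", "version": "1.0.0"},
        "paths": {
            "/links": {
                "get": {
                    "operationId": "listLinks",
                    "summary": "List links",
                    "responses": {
                        "200": {
                            "description": "A list of links",
                            "content": {
                                "application/json": {
                                    "schema": {"type": "array", "items": link_ref}
                                }
                            },
                        }
                    },
                },
                "post": {
                    "operationId": "createLink",
                    "summary": "Create a link",
                    "requestBody": {
                        "required": True,
                        "content": {"application/json": {"schema": link_ref}},
                    },
                    "responses": {
                        "201": {
                            "description": "The created link",
                            "content": {"application/json": {"schema": link_ref}},
                        }
                    },
                },
            }
        },
        "components": {
            "schemas": {"Link": link_schema},
            # Assumed scheme: HTTP bearer tokens, applied to every operation.
            "securitySchemes": {"bearerAuth": {"type": "http", "scheme": "bearer"}},
        },
        "security": [{"bearerAuth": []}],
    }


if __name__ == "__main__":
    print(json.dumps(build_mini_links_spec(), indent=2))
```

A full answer to the prompt would also need update and delete operations on individual links, plus the user-management endpoints.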

Round 1: ChatGPT 4

ChatGPT 4 was the first contender to respond to the prompt. Its response was generated quickly, and the resulting OpenAPI document was saved for evaluation. The Rate My Open API tool gave the document an overall score of 55.

Breakdown of ChatGPT's Score:

Documentation: 56
Completeness: 53
SDK Generation: 78
Security: 40

Although the score was not high, the OpenAPI document was functional and contained no severe errors. ChatGPT was then given the scoring feedback and asked to improve its output accordingly. The revised document achieved an impressive score of 96, with significant improvements on key elements, although the revision also introduced some new errors.
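The score, feed back, and regenerate cycle described here can be sketched as a small loop. This is an illustrative outline only: the article ran a single round by hand, and the `score` and `revise` callbacks below are hypothetical stand-ins for the rating tool and the model, not real APIs of either.

```python
from typing import Callable, List, Tuple

# Hypothetical callback types: `Scorer` rates a document and lists its issues,
# `Reviser` asks a model to rewrite the document given those issues.
Scorer = Callable[[str], Tuple[int, List[str]]]
Reviser = Callable[[str, List[str]], str]


def refine(document: str, score: Scorer, revise: Reviser, rounds: int = 1) -> Tuple[str, int]:
    """Run up to `rounds` feedback rounds and return the best document and its score."""
    best_doc = document
    best_score, issues = score(document)
    for _ in range(rounds):
        candidate = revise(best_doc, issues)
        candidate_score, candidate_issues = score(candidate)
        if candidate_score <= best_score:
            # The revision did not help, so keep the earlier version and stop.
            break
        best_doc, best_score, issues = candidate, candidate_score, candidate_issues
    return best_doc, best_score
```

Keeping only revisions that raise the score is a design choice of this sketch. It guards against a revision that fixes the reported issues but introduces new ones.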

Round 2: Llama 3.1

The spotlight then turned to Llama 3.1. The same prompt was used, and the response came back remarkably fast. The generated JSON was saved and submitted to Rate My Open API for evaluation. Unfortunately, Llama received a modest overall score of 47.

Breakdown of Llama's Score:

Documentation: Low 50s
Completeness: Low
SDK Generation: 62
Security: 30

The scoring indicated that Llama's initial document had more substantial deficiencies than ChatGPT's. Like ChatGPT, Llama received the feedback and was asked to incorporate the suggested improvements. However, its revised document received an overall score of only 49, just two points above its original score.
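To put the two feedback rounds side by side, the short calculation below works out the gain for each model from the overall scores reported in this article. The relative-gain figure is only one simple way to compare them and is not a metric produced by the rating tool.

```python
from typing import Dict, Tuple

# Overall scores reported in the article: (initial, after feedback).
SCORES: Dict[str, Tuple[int, int]] = {
    "ChatGPT 4": (55, 96),
    "Llama 3.1": (47, 49),
}


def improvement(before: int, after: int) -> Tuple[int, float]:
    """Return the absolute gain in points and the gain relative to the initial score, in percent."""
    gain = after - before
    return gain, round(100 * gain / before, 1)


if __name__ == "__main__":
    for model, (before, after) in SCORES.items():
        gain, percent = improvement(before, after)
        print(f"{model}: {before} -> {after} (+{gain} points, +{percent}%)")
```

ChatGPT gained 41 points and Llama gained 2.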

Conclusion

In this head-to-head comparison, ChatGPT 4 outperformed Llama 3.1, demonstrating a superior ability to understand and implement the OpenAPI specification. ChatGPT not only improved significantly after receiving feedback, but also produced a more complete and secure API design from the outset.

While Llama shows promise, it will need further refinement to compete at this level. This competition serves as a compelling demonstration of the capabilities of LLMs in API design, and future contests will further explore these abilities across different models. Stick around for more evaluations of LLM performance in API design and other coding tasks, and consider trying out Rate My Open API to evaluate your own API documents.

Keywords

OpenAPI
ChatGPT
Llama
API design
Rate My Open API
JSON schema
AI models
Performance evaluation

FAQ

Q: What was the task for the LLMs?
A: The task was to design an OpenAPI document for a link-shortening application called Mini Links, which involved managing links and users.

Q: How did ChatGPT perform?
A: ChatGPT initially scored 55 but improved to 96 after applying feedback to its OpenAPI document.

Q: What was Llama's initial score?
A: Llama scored 47 on its first attempt and improved to only 49 after receiving feedback.

Q: Which model outperformed the other?
A: ChatGPT 4 outperformed Llama 3.1 in both initial scoring and improvement capability.

Q: Where can I evaluate my own OpenAPI documents?
A: You can evaluate your OpenAPI documents at Rate My Open API.