Is "Strawberry" really that good? (gpt-4o VS gpt-o1-preview)

Introduction

Welcome back to the channel! In this article, we explore OpenAI's new reasoning models, gpt-o1-preview and gpt-o1-mini, which were released just a few days ago. I've spent some time testing these models, and I'm excited to share my findings.

Overview of OpenAI's Models

OpenAI's flagship model, GPT-4, has been an incredible asset, known for its advanced reasoning capabilities, large context window, and exceptional outputs in areas such as coding, math, and storytelling. However, the new reasoning-focused models, gpt-o1-preview and gpt-o1-mini, are specifically crafted to enhance reasoning.

These models take a moment to "think" before responding, which significantly improves output quality, especially for coding and mathematical problems. During testing, I found that these models excel at logical reasoning and can produce superior results compared to their predecessors.

Testing gpt-o1-preview

One of the first examples I wanted to demonstrate involved a common-sense reasoning problem about a physical scenario:

Prompt: “A small strawberry is put in a normal cup, which is upside down on a table, then put inside the microwave. Where is the strawberry, and explain your reasoning behind it?”

The new model displayed a thoughtful approach, effectively evaluating the scenario and giving a comprehensive answer. This is a notable improvement in a context where previous models often struggled.

During the testing process, I observed that gpt-o1-preview outperformed GPT-4 in reasoning tasks, achieving an impressive rank in competitive programming questions and excelling in physics, biology, and chemistry problems. The gpt-o1-mini variant is designed to be faster and more cost-effective while still maintaining high-quality outputs.

Comparing Outputs: Business Plan Example

To illustrate the performance differences, I tested gpt-o1-preview against GPT-4 with the same business plan prompt:

Prompt: “Come up with a detailed business plan for a mining company, considering all steps in the supply chain and fulfillment.”

Both models provided comprehensive plans, but gpt-o1-preview took longer to process, mapping out the steps thoroughly. It offered more detailed insights, highlighting aspects like environmental responsibility and a structured implementation timeline.

Coding Example: Snake Game

Next, I looked at coding capabilities by requesting a simple snake game in JavaScript, along with HTML and CSS.

The outputs varied significantly. While GPT-4 provided a functioning and responsive game, gpt-o1-preview’s version lacked the same level of completeness. This disparity suggested that the gpt-o1-preview is still in its early stages and may require further refinement.

Conclusion

In conclusion, the gpt-o1-preview and gpt-o1-mini models show great promise in reasoning tasks, making them excellent tools for more complex problems. However, the inconsistencies noted in the coding tests indicate that there is still work to be done. As these models continue to evolve, they may yield even better results moving forward.

Keep an eye out for future tests as I explore the API capabilities of these models. I plan to compare them in various applications to gauge their performance further.

Keywords

OpenAI
gpt-4
gpt-o1-preview
gpt-o1-mini
reasoning models
business plan
coding
snake game
physics
environmental responsibility

FAQ

1. What are the new models introduced by OpenAI?
OpenAI introduced gpt-o1-preview and gpt-o1-mini, which are focused on enhancing reasoning capabilities.

2. How do these new models compare to GPT-4?
While GPT-4 is known for its strong outputs, the new models excel in reasoning tasks and produce higher quality outputs when given time to think.

3. What kind of tasks do gpt-o1-preview and gpt-o1-mini excel at?
These models particularly excel in coding, mathematical reasoning, and logical problem-solving tasks.

4. Are the new models more cost-effective than GPT-4?
Yes, gpt-o1-mini is approximately 80% cheaper through the API, while still providing quality outputs for applications requiring reasoning.

5. What are some examples of prompts tested with the new models?
Examples include a detailed business plan for a mining company and coding a snake game in JavaScript, HTML, and CSS.