Productionizing GenAI Models – Lessons from the world's best AI teams: Lukas Biewald


Introduction

As we explore the challenges and successes of productionizing AI, it's evident that many within this field are grappling with similar issues. A surprising majority (>70%) of you have already incorporated large language models (LLMs) into your organizational practices. However, the reality on the ground reveals that this transition isn't as seamless as anticipated.

The discussion often comes down to build versus buy: develop a custom solution or purchase a ready-made one. Interestingly, a notable portion of you are building custom solutions (30%), while fewer rely on off-the-shelf products such as GitHub Copilot.

The State of AI Production

The democratization of AI is not happening the way many expected, through graphical interfaces or AutoML; instead, it is arriving through chat and conversational interfaces. It's clear that generative applications are not just a trendy novelty but are increasingly embedded in every sector. Despite this significant uptake, the challenges of productionizing AI remain formidable. The gap between demos and actual production applications is stark: AI products are often easy to showcase but difficult to refine for real-world use.

At Weights & Biases (W&B), we have worked with various AI professionals, including Foundation Model Builders and AI Engineers, to address these challenges. Our platform has successfully supported numerous AI applications across various industries. However, we've recognized a crucial shift: many software developers without prior AI experience are now contributing to building AI applications. This development is noteworthy because it expands the talent pool available for AI projects.

The Unique Challenges of AI Engineering

What differentiates AI engineering from software development is its experimental nature. Software progresses relatively linearly as features are added or updated. AI development, in contrast, is exploratory: building with LLMs involves a great deal of trial and error, and the results are often unpredictable. The intellectual property (IP) generated along the way is not merely the final models or prompts but the learnings gathered from all of those experiments.

Reproducibility is paramount. For effective iteration and collaboration, it's vital to track every experiment. When the essential learnings leave with a departing engineer, the IP does too. Thus, it becomes critical to have systems in place that document these experiments, allowing for collaborative growth and efficiency.
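
As a concrete sketch of what that documentation might look like, here is how an experiment's configuration and outcome could be logged with the wandb Python client; the project name, config values, and metrics are illustrative placeholders:

```python
import wandb

# Start a tracked run; the project name and config values here are
# illustrative, not from any real project.
run = wandb.init(
    project="alexa-skill-llm",
    config={"base_model": "llama-2-7b", "prompt_version": "v3"},
)

# ... run the experiment: build the prompt, call the model, score outputs ...

# Log the results so the learnings outlive any single engineer.
run.log({"intent_accuracy": 0.87, "eval_examples": 200})
run.finish()
```

Anyone on the team can then reproduce or compare runs instead of reconstructing them from a departed colleague's memory.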

A Real-World Example

To illustrate these principles, I turned a personal project into an example of AI production: my daughter asking Alexa to play her favorite song. I was surprised to find that even a task this simple could expose the limitations of readily available technology. Designing my simple conversational interface required building a skill library: a set of functions that turn interpreted speech into actionable API calls.
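
A minimal sketch of what such a skill library can look like is below; the skill names and the stand-in API calls are hypothetical:

```python
from typing import Callable

# Hypothetical skill registry: each skill maps an interpreted command
# to an actionable API call. Names and endpoints are illustrative.
SKILLS: dict[str, Callable[..., str]] = {}

def skill(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function as a callable skill."""
    SKILLS[fn.__name__] = fn
    return fn

@skill
def play_song(title: str) -> str:
    # A real implementation would call a music-service API here.
    return f"POST /music/play title={title!r}"

@skill
def set_timer(minutes: int) -> str:
    return f"POST /timer/start minutes={minutes}"

def dispatch(name: str, args: dict) -> str:
    """Invoke a named skill with the arguments the model extracted."""
    return SKILLS[name](**args)
```

Keeping the registry flat makes it easy to list the available skills in a prompt and to add new ones without touching the routing code.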

Using models like Llama and Whisper, I initially struggled with accuracy but went through an iterative process of prompt engineering, error analysis, and fine-tuning. With the tools available today, I was able to raise the model's accuracy enough to produce a functioning application. Each step, from prompt engineering to model switching to fine-tuning, taught me lessons that mirror those experienced in professional AI development.
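
Condensed, that loop looked roughly like the sketch below. It assumes the openai-whisper package for transcription; the complete() helper is a hypothetical stand-in for whichever Llama inference path you use:

```python
import json
import whisper  # pip install openai-whisper

PROMPT = (
    "You route voice commands to skills.\n"
    "Available skills: play_song(title), set_timer(minutes).\n"
    'Respond with JSON only: {"skill": "...", "args": {...}}\n'
    "Command: "
)

def complete(prompt: str) -> str:
    # Stand-in for a Llama inference call
    # (e.g. via llama.cpp, Ollama, or a hosted endpoint).
    raise NotImplementedError

def handle(audio_path: str) -> dict:
    # 1. Transcribe speech to text with Whisper.
    text = whisper.load_model("base").transcribe(audio_path)["text"]
    # 2. Ask the model to map the command to a structured skill call.
    reply = complete(PROMPT + text)
    # 3. Parse the call; the failures here (bad JSON, wrong skill)
    #    are exactly what error analysis and fine-tuning target.
    return json.loads(reply)
```

The dictionary this returns can then be handed to the dispatch function from the skill library sketched above.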

Key Lessons Learned

  1. Build an Evaluation Framework: One of the foundational steps in the transition to production is developing a robust evaluation framework. Unfortunately, many current assessments rely on subjective impressions rather than data-driven results. Proper evaluation lets companies make informed decisions about when to keep or replace a model (a minimal sketch of such a harness follows this list).

  2. Start with a Prototype: Avoid the common pitfall of over-polishing your initial work. Rapid iteration through lightweight prototypes helps incorporate real user feedback more effectively.

  3. Incorporate User Feedback: Ensuring that end-user feedback is integrated into the process leads to faster iterations and a better user experience.

  4. Iterate, Iterate, Iterate: The journey from demo to deployment requires continual improvement and refining of the AI applications.
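
To ground the first lesson, here is a minimal sketch of the kind of harness that replaces subjective impressions with numbers; the labeled commands and the route_command callable are hypothetical:

```python
# A tiny data-driven evaluation harness; the labeled examples and the
# expected skill calls are illustrative.
EVAL_SET = [
    ("play let it go", {"skill": "play_song", "args": {"title": "let it go"}}),
    ("set a timer for ten minutes", {"skill": "set_timer", "args": {"minutes": 10}}),
]

def evaluate(route_command) -> float:
    """Fraction of commands mapped to the exact expected skill call."""
    correct = 0
    for command, expected in EVAL_SET:
        try:
            if route_command(command) == expected:
                correct += 1
        except Exception:
            pass  # crashes and unparseable outputs count as failures
    return correct / len(EVAL_SET)
```

Comparing this score before and after a prompt change, model swap, or fine-tune is what turns "it feels better" into an informed keep-or-replace decision.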

Conclusion

At our booth at this conference, we’re excited to delve deeper into these themes and discuss how our tools can aid you in facing the complex challenges of getting AI applications into production.


Keywords

  • Productionizing AI
  • LLMs
  • Custom Solutions
  • Evaluation Framework
  • User Feedback
  • Iteration
  • Reproducibility
  • Prompt Engineering

FAQ

1. What is the difference between productionizing AI and software development?
AI development is primarily experimental and non-linear, while software development follows a more structured and linear approach.

2. Why is reproducibility important in AI projects?
Reproducibility ensures that learnings from various experiments are documented, allowing teams to iterate and collaborate effectively.

3. What steps can be taken to improve the accuracy of AI models?
Approaches such as prompt engineering, model switching, and fine-tuning, along with thorough evaluation and iterative testing, can substantially enhance the performance of AI models.

4. How can companies effectively gather user feedback?
Companies can incorporate user feedback through prototypes, direct surveys, and observational studies post-deployment, aligning product adjustments with end-user needs.

5. What tools can help in productionizing AI applications?
Weights & Biases offers a platform designed specifically for tracking experiments, building evaluation frameworks, and facilitating collaboration among AI engineering teams.