How I Programmed My AI Vtuber

Introduction

In this article, I'll show how I built an AI program that can game, talk, interact with chat, and even share political beliefs. She is lightweight enough to run on a tiny laptop. The article walks through the planning, setup, and execution phases, leading up to the moment we go live with our AI VTuber.

Planning Phase

To start, I created a simple flowchart to outline the AI's process. The AI will read a comment from the YouTube live stream chat, send it to a language model like GPT-3 to generate a response, and finally use a text-to-speech program to vocalize the response. This cycle will repeat continuously.

I had two special criteria:

The program should be lightweight enough to run on the crappiest hardware possible.
It should be a basic starting point so that others can build upon it.
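
The read, respond, speak cycle from the flowchart could be sketched roughly as follows. This is a minimal sketch, not the program's actual code: `run_pipeline` and its parameters are hypothetical names, and the language model and text-to-speech steps are passed in as plain functions so the loop itself stays small and easy to swap parts in and out of.

```python
from typing import Callable, Iterable


def run_pipeline(
    messages: Iterable[str],
    generate_reply: Callable[[str], str],
    speak: Callable[[str], None],
) -> int:
    """Read each chat message, generate a reply, speak it; return how many were spoken."""
    handled = 0
    for message in messages:
        message = message.strip()
        if not message:
            continue  # skip empty messages rather than paying for an API call
        reply = generate_reply(message)
        if reply:
            speak(reply)
            handled += 1
    return handled
```

For a dry run without API keys, something like `run_pipeline(["hello!"], lambda m: f"You said: {m}", print)` exercises the loop with stand-in functions.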

Reading the Chat

The first step is to read the messages sent by viewers. I explored the official YouTube API for this but found it inconvenient and restrictive due to quotas. Instead, I found a GitHub project that reads chat messages directly: you supply the stream ID, and it returns the messages as they arrive.
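
As a rough illustration of this step, the sketch below pulls the stream ID out of a YouTube URL and then polls the chat. The GitHub project is not named here, so the `pytchat` library is used purely as a stand-in that works this way; `extract_video_id` and `iter_chat_messages` are hypothetical helper names.

```python
from typing import Iterator, Tuple
from urllib.parse import parse_qs, urlparse


def extract_video_id(url_or_id: str) -> str:
    """Return the stream (video) ID from a YouTube URL, or the input if it is already an ID."""
    text = url_or_id.strip()
    if "/" not in text and "?" not in text:
        return text  # already a bare ID
    parsed = urlparse(text)
    query_id = parse_qs(parsed.query).get("v")
    if query_id:
        return query_id[0]  # https://www.youtube.com/watch?v=<id>
    # youtu.be/<id> and youtube.com/live/<id> keep the ID as the last path segment
    segments = [part for part in parsed.path.split("/") if part]
    if segments:
        return segments[-1]
    raise ValueError(f"no video ID found in {url_or_id!r}")


def iter_chat_messages(video_id: str) -> Iterator[Tuple[str, str]]:
    """Yield (author, message) pairs until the live chat ends."""
    import pytchat  # third-party stand-in; an assumption, not necessarily the project used

    chat = pytchat.create(video_id=video_id)
    while chat.is_alive():
        for item in chat.get().sync_items():
            yield item.author.name, item.message
```

The import sits inside `iter_chat_messages` so the URL helper can be used and tested without the library installed.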

[Figure: flowchart of the AI's process]

Language Model Integration

Next, I needed to send the viewers' messages to a language model. I chose GPT-3 for its advanced capabilities and versatility. GPT-3 is hosted remotely rather than locally, which keeps the program lightweight for the user. It is also moderated, so it is less likely to go off on radical tangents and get the channel banned. The integration involved copying and modifying existing code to fit our needs.

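# GPT-3 completion call using the legacy (pre-1.0) openai Python SDK
import openai

# Placeholder: replace with your own key
openai.api_key = 'your-openai-api-key'

def get_gpt3_response(prompt):
    """Send the prompt to GPT-3 and return the generated reply text."""
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=50  # keeps replies short, and so quick to speak
    )
    # The first choice holds the completion; strip surrounding whitespace
    return response.choices[0].text.strip()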
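
Before a viewer's message reaches the model, it usually helps to wrap it in a prompt that sets the character's persona. The sketch below is one possible way to do that, not the code used on stream: `build_prompt`, the persona text, and the 200-character cap are all illustrative assumptions.

```python
MAX_MESSAGE_CHARS = 200  # hypothetical cap to keep prompts short and cheap


def build_prompt(persona: str, author: str, message: str) -> str:
    """Wrap one chat message in a persona-led prompt for a completion model."""
    # Collapse whitespace so a multi-line message cannot break the prompt layout,
    # then truncate very long messages.
    cleaned = " ".join(message.split())[:MAX_MESSAGE_CHARS]
    return (
        f"{persona}\n\n"
        f"Viewer {author} says: {cleaned}\n"
        "Reply:"
    )
```

The returned string can be passed as the `prompt` argument of `get_gpt3_response`. Ending on `Reply:` nudges a completion model to continue with the character's answer rather than with more chat.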

Text-to-Speech Integration

For text-to-speech (TTS), I chose 11 Labs (ElevenLabs), since they appear to offer a private version of Tortoise TTS. Despite some concerns about pricing and tiers, their API is easy to call with just an API key.

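# Text-to-speech request to ElevenLabs (check the current API reference for details)
import requests

def text_to_speech_11labs(api_key, text, voice_id):
    """Return synthesized speech for the text as raw audio bytes."""
    # The voice ID goes in the URL path and the key in an xi-api-key header
    response = requests.post(
        f'https://api.elevenlabs.io/v1/text-to-speech/{voice_id}',
        headers={'xi-api-key': api_key},
        json={'text': text},
        timeout=30
    )
    response.raise_for_status()  # fail loudly on quota or auth errors
    return response.content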
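
The function above returns raw audio bytes, which still have to reach a player. One simple option, sketched here under the assumption that the service returns MP3 data, is to write each reply to its own temporary file and hand the path to whatever audio player is in use. `save_audio` is a hypothetical helper, not part of the original program.

```python
import tempfile
from pathlib import Path
from typing import Optional


def save_audio(audio_bytes: bytes, directory: Optional[str] = None, suffix: str = ".mp3") -> Path:
    """Write synthesized audio to a uniquely named file and return its path."""
    if not audio_bytes:
        raise ValueError("no audio data to save")
    # delete=False keeps the file on disk after the handle closes,
    # so a separate player process can open it.
    with tempfile.NamedTemporaryFile(dir=directory, suffix=suffix, delete=False) as handle:
        handle.write(audio_bytes)
    return Path(handle.name)
```

Unique file names avoid overwriting a clip that is still playing while the next reply is being synthesized. The caller is responsible for deleting files once they have been played.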

VTuber Model and Streaming Setup

For the VTuber model, I used a fan-created model of the Touhou character Koishi Komeiji and loaded it into VTube Studio. I mapped a virtual camera to another program that plays Touhou in the background, so the model's head and mouth movements followed the game's actions. This added a visual element to the stream and made it more engaging.

Going Live

When everything was set up, it was time to go live. The stream involved:

Reading live chat messages
Sending them to GPT-3 for responses
Converting the responses into speech with the 11 Labs API
Streaming Touhou gameplay in the background

Here are some of the things that the AI said during the stream:

"No, Ukraine is a troubled country that has been plagued by corruption and civil unrest for years."
"Taiwan has existed for thousands of years and is an independent state with its own government, economy, culture, and military."

Open Sourcing the Program

I am open-sourcing the program, which means anyone can build upon it, optimize it, or add their own features. This program is just a starting point, and there’s much room for improvement, such as using a fine-tuned language model and running Tortoise TTS locally.

GitHub Repository

Conclusion

I hope this inspires you to create your own AI VTubers. Let's see more innovation in this exciting space.

Summary

This article documents the creation of a lightweight AI VTuber capable of interacting with a live chat, using GPT-3 for language understanding and 11 Labs for text-to-speech synthesis. It also covers the setup of a VTuber model and its integration with Touhou gameplay.

Keywords

AI VTuber
GPT-3
Text-to-Speech
YouTube Live Chat API
OpenAI
11 Labs
VTube Studio
Touhou

FAQ

Q: What was the main objective of creating this AI VTuber? A: The main objective was to create a lightweight VTuber program that can interact with live chat and run on low-spec hardware.

Q: Why was GPT-3 chosen for this project? A: GPT-3 was chosen for its advanced language capabilities and versatility, allowing it to generate coherent and contextually appropriate responses.

Q: Why was 11 Labs chosen for text-to-speech? A: 11 Labs provides a high-quality text-to-speech API that is relatively easy to integrate with Python.

Q: What are the limitations of using the official YouTube API for reading chat? A: The official YouTube API has a cumbersome setup process and imposes quotas, and exceeding the quota can disrupt the stream.

Q: How was the VTuber model created? A: The VTuber model was a fan-created model of the Touhou character Koishi Komeiji, loaded into VTube Studio for integration.

Q: Is the program open-source? A: Yes, the program is open-sourced, allowing anyone to build upon it and add custom features.