Training Any Language in AI Voice Cloning - Tortoise TTS
Today, I'm going to cover how you can train other languages in Tortoise TTS. This assumes you're generally familiar with training Tortoise TTS; if you're not, check out my previous videos first, which cover Tortoise TTS training and the AI voice cloning repository.
Successful Training of Japanese
First, I'm going to go over my successful training of Japanese. To demonstrate, I have the AI voice cloning repository open, along with sample audio of the actual voice I trained on. I'll also play a sample of that voice reading a new input prompt with the trained model. Here's the short sentence it read:
Audio Sample: あなたの音声を聞かせてください ("Please let me hear your voice")
Generated Sample: あなたの音声を聞かせてください
It also works with longer sentences, staying coherent without much issue. This particular voice was trained on Subaru from "Re:Zero", fine-tuned on top of the base Japanese model I had trained first.
Key Components for Training
To train a new language in Tortoise TTS, you'll need an extensive amount of data in the language you want to train, as well as a tokenizer. For me, the tokenizer was the hardest part to understand. Tortoise uses a specialized tokenizer with a small vocabulary of 255 tokens, designed explicitly for the Tortoise architecture.
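To make the tokenizer requirement concrete, here's a minimal sketch of how a replacement tokenizer could be trained with the Hugging Face tokenizers library, keeping the vocabulary around the size Tortoise expects. The file name transcripts_romaji.txt and the exact special tokens are assumptions for illustration, not something taken from the repository's own scripts.

```python
# Minimal sketch (not the repository's script): train a small BPE tokenizer
# sized for Tortoise. Assumes transcripts_romaji.txt holds one romanized
# transcript per line; the special tokens mirror the stock Tortoise tokenizer.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=255,  # keep the vocabulary as small as the original tokenizer
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],
)
tokenizer.train(files=["transcripts_romaji.txt"], trainer=trainer)
tokenizer.save("ja_tokenizer.json")  # point the training config at this file
```

Whatever tokenizer you end up with, spot-check that common words from your dataset split into sensible pieces before committing to a long training run.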
Understanding Tokenizers
Tokenizers break words into different pairings or tokens so that the model can relate these pairings to audio. If the language isn't compatible with the standard tokenizer (like with Japanese Kanji), you may either have to create a new tokenizer or convert texts into a compatible form. In my case, I converted all Japanese text into Latin characters, or "Romaji".
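As a concrete example of that conversion step, here's a minimal sketch using the pykakasi library to turn Japanese text into Hepburn romaji. The library choice is just my illustration; any kana/kanji-to-romaji converter that covers your dataset will do.

```python
# Minimal sketch: convert Japanese text (kanji/kana) into romaji with pykakasi.
import pykakasi

kks = pykakasi.kakasi()

def to_romaji(text: str) -> str:
    # convert() returns segments with several readings; 'hepburn' is the
    # Latin-alphabet (romaji) form.
    return " ".join(segment["hepburn"] for segment in kks.convert(text))

print(to_romaji("あなたの音声を聞かせてください"))
# e.g. "anatano onsei wo kikasetekudasai"
```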
Preparing Data
I used 840 hours of Japanese audio data. Although I heard that you need around 10,000 hours, my model performed well with just 840 hours. The key is to represent the entire language with as much variety as possible. I used a custom dataset maker outside of Tortoise to expedite the preparation process.
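For reference, the training side ultimately wants a plain-text index pairing each audio clip with its (romanized) transcript, LJSpeech-style. The sketch below shows roughly what a custom dataset maker needs to emit; the filenames are placeholders, and the exact layout should be checked against your own setup.

```python
# Minimal sketch: write an LJSpeech-style train.txt pairing clips with their
# romanized transcripts. Paths and filenames here are placeholders.
from pathlib import Path

clips = [
    ("wavs/subaru_0001.wav", "anata no onsei wo kikasete kudasai"),
    ("wavs/subaru_0002.wav", "kyou wa ii tenki desu ne"),
]

with Path("train.txt").open("w", encoding="utf-8") as f:
    for wav_path, transcript in clips:
        f.write(f"{wav_path}|{transcript}\n")
```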
Training Configuration
For the training configuration, several settings need to be tweaked (a rough sketch of these values follows the list):
- Set the number of epochs based on your dataset size.
- Crank the learning rate to 0.01 for faster convergence.
- Set text learning rate ratio to 1.
- Use cosine annealing with multiple learning rate restarts.
- Adjust batch sizes and gradient accumulation according to GPU capabilities.
- Save the training progress frequently to avoid data loss due to crashes.
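Put together, those settings look roughly like the sketch below. The key names are placeholders for the fields exposed in the AI voice cloning UI and its config, not the exact identifiers; the only hard numbers are the ones mentioned above.

```python
# Illustrative sketch only: key names are placeholders, values follow the
# recommendations above. None means "pick based on your dataset and GPU".
training_settings = {
    "epochs": None,                 # scale to your dataset size
    "learning_rate": 0.01,          # high LR for training a new language
    "text_lr_ratio": 1.0,           # text learning rate ratio
    "lr_scheduler": "cosine_annealing_with_restarts",
    "batch_size": None,             # as large as your GPU allows
    "gradient_accumulation": None,  # raise this if batch size must stay small
    "save_frequency": None,         # save often; long runs do crash
}
```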
Fine-Tuning
If you already have a well-trained language model, you can fine-tune it on a specific voice. Use that model as the base: update the source model setting to point at your trained language model, set the learning rate lower to avoid overtraining, and proceed with training as usual.
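As a rough sketch under the same placeholder naming, the fine-tuning pass mostly swaps the base model and drops the learning rate. The path and the exact learning rate here are illustrative, not values from my run.

```python
# Illustrative sketch: fine-tune a single voice on top of the trained
# language model. Path and learning rate are placeholders to tune yourself.
finetune_settings = {
    "source_model": "models/japanese_base.pth",  # hypothetical path to the trained base
    "learning_rate": 1e-5,                       # well below the 0.01 used for the base run
    "text_lr_ratio": 1.0,
    "epochs": None,                              # far fewer than the language run needs
}
```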
Troubleshooting
During my training, I often ran into crashes that disrupted the long training runs, so I highly recommend setting frequent save intervals, even though it consumes more storage space.
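If a run does crash, you can usually resume from the last saved state instead of starting over by pointing the config's resume-state path at the newest checkpoint (see FAQ 6 below). The directory layout and field name in this sketch are placeholders for whatever your setup produces.

```python
# Illustrative sketch: find the newest saved training state to resume from.
# The directory layout and key name are placeholders for your own setup.
from pathlib import Path

state_dir = Path("training/japanese_base/training_state")
latest = max(state_dir.glob("*.state"), key=lambda p: p.stat().st_mtime, default=None)

resume_settings = {
    "resume_state": str(latest) if latest else None,  # None -> start fresh
}
print(resume_settings)
```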
Conclusion
Training any language in Tortoise TTS involves patience, data preparation, and tweaking configurations. With proper understanding and careful adjustments, you can achieve satisfactory results in voice cloning for different languages.
Keywords
- Tortoise TTS
- AI Voice Cloning
- Tokenizer
- Training Configuration
- Japanese Language Model
- Fine-Tuning
- Dataset Preparation
- Cosine Annealing
- Learning Rate
- Gradient Accumulation
FAQ
1. What do you need to train a new language in Tortoise TTS?
- You need a lot of data from the target language and a compatible tokenizer.
2. How do you handle languages with non-Latin scripts like Japanese?
- Convert the non-Latin scripts into their Latin equivalents (e.g., Romaji).
3. What is a key issue when setting up a tokenizer?
- Ensuring the tokenizer correctly represents your dataset for accurate training.
4. How much data is necessary for training?
- Although more is better, 800-1000 hours of high-quality audio data can yield good results.
5. What learning rate settings did you find most effective?
- A high learning rate (0.01) for initial training and a lower rate for fine-tuning specific voices.
6. How do you manage continuous training if the process is interrupted?
- Use the resume state path in the configuration to continue training from the last saved checkpoint.
7. What should you monitor to avoid overtraining?
- Watch the loss metrics; a loss that goes to zero too quickly indicates potential overtraining and model degradation.
8. What are the challenges faced during long training sessions?
- Possible crashes and the need for frequent save intervals to prevent data loss.
9. How do you fine-tune a specific voice?
- Use the trained language model as a base and adjust learning rates for precise tuning.