
    NaturalSpeech2: A Multilingual Text-to-Speech Synthesis System


    Introduction

    In this article, we will discuss the implementation details of NaturalSpeech2, a multilingual text-to-speech synthesis system. NaturalSpeech2 leverages latent diffusion models for zero-shot speech synthesis. It uses a neural audio codec with residual vector quantizers to convert speech waveforms into compact latent representations. A latent diffusion model then predicts these representations from the input text, and the codec decoder converts them back into speech waveforms.

    Overview of NaturalSpeech2

    The NaturalSpeech2 system consists of several components, including a neural audio codec, a phoneme encoder, a duration predictor, a pitch predictor, and a latent diffusion model.

    1. The neural audio codec converts the speech waveform into a latent representation using an encoder-decoder architecture. This latent representation is then reconstructed back into speech using the decoder.

    2. The phoneme encoder, duration predictor, and pitch predictor process the input text and provide contextual information as a condition for the diffusion model.

    3. The latent diffusion model predicts the latent representation conditioned on the text input and the contextual information. The resulting latent vectors are then used by the codec decoder to produce the final synthesized speech waveform, as sketched below.
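
    To make the data flow concrete, here is a minimal sketch of how these components could be wired together at inference time. The function and module names are our own illustrative stand-ins, not the authors' code, and the call interfaces are assumptions.

    ```python
    def synthesize(phonemes, prompt_latent, phoneme_encoder, duration_predictor,
                   pitch_predictor, diffusion_model, codec_decoder):
        """Illustrative NaturalSpeech2-style inference flow (interfaces assumed)."""
        hidden = phoneme_encoder(phonemes)          # phoneme hidden vectors
        duration = duration_predictor(hidden)       # per-phoneme durations
        pitch = pitch_predictor(hidden)             # pitch contour
        # The diffusion model predicts the continuous latent, conditioned on
        # the text hiddens, duration/pitch information, and the speech prompt.
        latent = diffusion_model(hidden, duration, pitch, prompt_latent)
        return codec_decoder(latent)                # latent -> waveform
    ```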

    The paper also introduces several design choices in NaturalSpeech2, such as using continuous vectors instead of discrete tokens, leveraging diffusion models instead of autoregressive models, and incorporating speech prompting mechanisms for in-context learning and zero-shot synthesis.
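
    To illustrate the continuous-vector design choice, the toy NumPy example below runs residual vector quantization (RVQ) on a single latent frame. The codebook sizes and dimensions are arbitrary, and this is a simplification of the codec's actual quantizer: the point is that the summed codeword reconstruction stays a continuous vector, which is the representation style the paper favors over the discrete token ids.

    ```python
    import numpy as np

    def residual_vector_quantize(z, codebooks):
        """Toy RVQ: quantize one latent frame with a stack of residual codebooks.

        z: (dim,) continuous latent; codebooks: list of (K, dim) arrays.
        Returns discrete token ids and the continuous summed reconstruction.
        """
        residual = z.copy()
        ids, recon = [], np.zeros_like(z)
        for cb in codebooks:
            idx = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
            ids.append(idx)        # one discrete token per quantizer stage
            recon += cb[idx]       # running continuous reconstruction
            residual -= cb[idx]    # the next stage quantizes what is left
        return ids, recon

    rng = np.random.default_rng(0)
    z = rng.normal(size=8)                                    # one latent frame
    codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]  # 4 quantizer stages
    ids, recon = residual_vector_quantize(z, codebooks)
    print(ids)    # discrete view: one token id per stage
    print(recon)  # continuous view: what the diffusion model would work with
    ```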

    Dataset and Model Configuration

    In the training phase, NaturalSpeech2 uses the English subset of the Multilingual LibriSpeech (MLS) dataset for the neural audio codec and the diffusion model. This dataset contains 44,000 hours of transcribed speech data derived from LibriVox audiobooks, with 2,742 male speakers and 2,748 female speakers. For evaluation, two benchmark datasets, LibriSpeech and VCTK, are employed, consisting of 40 distinct speakers and 5.4 hours of annotated speech data.

    The model configuration includes a six-layer Transformer phoneme encoder, 30-layer 1D convolutional duration and pitch predictors, a six-layer Transformer speech prompt encoder, and a 40-layer WaveNet architecture for the diffusion model.
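
    For reference, these reported sizes can be collected into a single configuration sketch. The key names below are our own; the paper does not publish a config file in this form.

    ```python
    # Layer counts as reported above, gathered into one hypothetical config.
    NATURALSPEECH2_CONFIG = {
        "phoneme_encoder":       {"arch": "transformer", "layers": 6},
        "duration_predictor":    {"arch": "conv1d",      "layers": 30},
        "pitch_predictor":       {"arch": "conv1d",      "layers": 30},
        "speech_prompt_encoder": {"arch": "transformer", "layers": 6},
        "diffusion_model":       {"arch": "wavenet",     "layers": 40},
    }
    ```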

    Keywords

    1. NaturalSpeech2
    2. Text-to-speech synthesis
    3. Latent diffusion models
    4. Neural audio codec
    5. Residual vector quantizers
    6. Phoneme encoder
    7. Duration predictor
    8. Pitch predictor
    9. Speech prompting mechanism
    10. Multilingual text-to-speech

    FAQ

    1. What is the purpose of NaturalSpeech2? NaturalSpeech2 is a multilingual text-to-speech synthesis system that leverages latent diffusion models to generate high-quality and expressive speech. It aims to achieve zero-shot synthesis and improve the robustness and expressiveness of the synthesized speech across different speaker identities and styles.

    2. How does NaturalSpeech2 generate speech waveforms? NaturalSpeech2 uses a neural audio codec to convert speech waveforms into compact latent representations. These latent representations are then passed through a diffusion model, which predicts the latent vectors conditioned on the input text and the contextual information. The codec decoder then reconstructs the latent vectors back into speech waveforms.

    3. What is the advantage of using continuous vectors over discrete tokens? Using continuous vectors instead of discrete tokens improves speech reconstruction quality and shortens the sequence the model must predict. Continuous vectors also carry higher information density per frame, allowing for more fine-grained speech synthesis.

    4. How does the speech prompting mechanism work in NaturalSpeech2? The speech prompting mechanism is designed to facilitate in-context learning and enhance zero-shot synthesis in NaturalSpeech2. During training, a random segment of the latent representation is taken as the speech prompt. During inference, a reference speech from a specific speaker is used as the prompt. This allows the model to generate speech that follows the characteristics of the prompt, improving the zero-shot inference capabilities.
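
    As a rough illustration of the training-time prompting described above, the snippet below cuts a random segment of the utterance latent to serve as the prompt. The segment length and the exact handling of the prompted region are our assumptions, not details from the paper.

    ```python
    import numpy as np

    def sample_speech_prompt(latent, prompt_frames, seed=None):
        """Pick a random contiguous segment of the latent as the speech prompt."""
        rng = np.random.default_rng(seed)
        n_frames = latent.shape[0]
        start = int(rng.integers(0, max(1, n_frames - prompt_frames + 1)))
        return latent[start:start + prompt_frames]

    latent = np.zeros((500, 128))                 # toy: 500 frames, 128 dims
    prompt = sample_speech_prompt(latent, prompt_frames=100, seed=0)
    print(prompt.shape)                           # (100, 128)
    ```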

    5. What datasets are used to train and evaluate NaturalSpeech2? NaturalSpeech2 is trained on the English subset of the Multilingual LibriSpeech (MLS) dataset, which contains transcribed speech data derived from LibriVox audiobooks. For evaluation, benchmark datasets such as LibriSpeech and VCTK are employed to assess the performance of NaturalSpeech2.

    6. What are the main components of NaturalSpeech2? The main components of NaturalSpeech2 include a neural audio codec, a phoneme encoder, a duration predictor, a pitch predictor, and a latent diffusion model. These components work together to convert text input into synthesized speech waveforms.

    7. How does the diffusion model in NaturalSpeech2 differ from autoregressive models? The diffusion model in NaturalSpeech2 is non-autoregressive, which means it does not have the bottleneck issues and error propagation often associated with autoregressive models. This makes the diffusion model more stable and robust for speech synthesis.

    8. What is the role of the latent diffusion model in NaturalSpeech2? The latent diffusion model predicts the latent vector representation conditioned on the input text and the contextual information. It generates the latent vectors that are then used by the codec decoder to produce the final synthesized speech waveform.
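
    To show what predicting the latent with a diffusion model can look like procedurally, here is a schematic DDPM-style denoising loop. This is a generic sketch rather than the paper's exact formulation, and the `denoiser` callable (assumed to predict the noise at each step) is a stand-in for the conditioned WaveNet.

    ```python
    import torch

    @torch.no_grad()
    def sample_latent(denoiser, condition, shape, steps=100):
        """Schematic DDPM-style sampler for the speech latent (generic sketch)."""
        z = torch.randn(shape)                      # start from pure noise
        betas = torch.linspace(1e-4, 0.02, steps)   # toy noise schedule
        alphas = 1.0 - betas
        alpha_bars = torch.cumprod(alphas, dim=0)
        for t in reversed(range(steps)):
            eps = denoiser(z, t, condition)         # predicted noise at step t
            # Standard DDPM posterior-mean update from the predicted noise.
            z = (z - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
                / torch.sqrt(alphas[t])
            if t > 0:                               # add noise except at t = 0
                z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
        return z                                    # handed to the codec decoder
    ```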

    9. How is the duration and pitch information incorporated into NaturalSpeech2? The duration and pitch predictors in NaturalSpeech2 process the input text and provide informative hidden vectors as conditions for the diffusion model. The ground truth duration and pitch information from the speech waveform are used as targets for training these predictors.
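
    Below is a minimal sketch of how such predictors are typically trained against the extracted targets; the specific losses are common TTS choices and an assumption on our part, not taken from the paper.

    ```python
    import torch.nn.functional as F

    def predictor_loss(pred_duration, gt_duration, pred_pitch, gt_pitch):
        """Regress predicted duration/pitch onto values extracted from speech."""
        # Simple L2 regression; TTS systems often regress log-duration instead.
        return F.mse_loss(pred_duration, gt_duration) + \
               F.mse_loss(pred_pitch, gt_pitch)
    ```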

    10. What are the benefits of using NaturalSpeech2 for text-to-speech synthesis? NaturalSpeech2 offers high-fidelity and expressive speech synthesis with strong zero-shot synthesis capabilities. It outperforms existing speech synthesis models and provides robustness and expressiveness across different speaker identities and styles.
