Get better-sounding AI voice output from Elevenlabs.

Introduction

Turning text into speech that sounds almost lifelike is no longer just a dream, thanks to Elevenlabs. This guide walks you through the settings, sliders, voice selection, and prompting techniques that help you master Elevenlabs' text-to-speech capabilities.

Voice Selection

Selecting the right voice is like casting the right human actor. If you need a fast-talking, punchy delivery, opting for someone like Morgan Freeman wouldn't make much sense. In the same way, when you browse the Elevenlabs voice library or create a voice in the Voice Lab, make sure the sample clip matches the style of your project.

Choosing the Right Model

Elevenlabs Multilingual V2

Languages: 29
Features: Very stable, accurate, handles accents well, and offers language diversity.

Elevenlabs Multilingual V1

Languages: 9
Notes: Experimental model, less accurate; avoid unless necessary.

Elevenlabs English V1

Languages: English only
Notes: Fastest but least accurate; trained on a smaller data set.

Elevenlabs Turbo V2

Languages: English only
Features: Fast generations, but lacks a style slider and may not be as accurate as Multilingual V2.

For most projects, Multilingual V2 is your best bet. It's stable, natural, and accurate.

Setting Sliders

Stability Slider

Lower: More emotional range but can lead to odd performances and overly fast speech.
Higher: More stable voice but can become monotonous.
Starting Point: The default setting, or anywhere between 40 and 50.

Similarity Slider

Lower: Less like the original voice.
Higher: More like the original voice but can include artifacts.
Starting Point: 75-80 is a good setting.

Style Exaggeration

Zero: Style exaggeration off.
Higher: Emphasizes the style of the original voice but can decrease stability.

Speaker Boost

Checkbox: Increases similarity to the original recording but slows down generation.

Generation is non-deterministic: even with identical text and settings, each run produces slightly different results. The sweet spot for many is 40-50 for stability and 75-80 for similarity.
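If you drive Elevenlabs through its REST API rather than the web interface, the sliders correspond to a voice_settings object in the request body. The sketch below only builds that body and sends nothing. The field names (stability, similarity_boost, style, use_speaker_boost) and the model ID eleven_multilingual_v2 reflect the public API as understood at the time of writing, so check them against the current documentation. The helper name build_tts_payload and the percentage-to-fraction conversion are assumptions made for illustration.

```python
def build_tts_payload(text, stability_pct=45, similarity_pct=75,
                      style_pct=0, speaker_boost=True,
                      model_id="eleven_multilingual_v2"):
    """Build a text-to-speech request body from slider values given as 0-100."""
    for name, value in (("stability", stability_pct),
                        ("similarity", similarity_pct),
                        ("style", style_pct)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    return {
        "text": text,
        "model_id": model_id,  # assumed ID for Multilingual V2
        "voice_settings": {
            # The API is assumed to expect fractions between 0.0 and 1.0.
            "stability": stability_pct / 100,
            "similarity_boost": similarity_pct / 100,
            "style": style_pct / 100,
            "use_speaker_boost": speaker_boost,
        },
    }


payload = build_tts_payload("Hello there.", stability_pct=45, similarity_pct=78)
print(payload["voice_settings"])
```

Keeping the settings in one function makes it easy to regenerate the same line with small slider changes and compare the takes.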

Prompting

Adding Pauses

Programmatic Syntax: <break time="1.5s"/> adds a 1.5-second pause.
Dashes: Use em dashes or multiple dashes.
Ellipses: Three dots for hesitation, e.g., "I... guess so."
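When a script is assembled in code, the break tag can be inserted programmatically. This is a minimal sketch; the helper names break_tag and join_with_pauses are hypothetical and simply produce the tag syntax shown above.

```python
def break_tag(seconds):
    """Return a pause tag such as <break time="1.5s"/>."""
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    return f'<break time="{seconds:g}s"/>'


def join_with_pauses(sentences, seconds=1.5):
    """Join sentences, placing a pause tag between each pair."""
    separator = f" {break_tag(seconds)} "
    return separator.join(s.strip() for s in sentences)


script = join_with_pauses(["Wait for it.", "There it is."], seconds=1.5)
print(script)
# prints: Wait for it. <break time="1.5s"/> There it is.
```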

Pronunciation

Programmatic Syntax: Use SSML tags with IPA or CMU ARPAbet (precise but more complex).
Phonetic Spelling: Fun and flexible. E.g., respell "samurai" as "samoorai," then try other variants until it sounds right.
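Phonetic respelling is easy to apply consistently if the substitutions live in one place. The sketch below swaps whole words for their respelled versions before the text is submitted; the function name respell and the dictionary contents are illustrative assumptions, and the right respelling for any word is found by ear.

```python
import re


def respell(text, respellings):
    """Replace whole words with phonetic respellings, ignoring case."""
    for word, spoken in respellings.items():
        pattern = r"\b" + re.escape(word) + r"\b"
        # A function replacement avoids backslash handling in the respelling.
        text = re.sub(pattern, lambda _match, s=spoken: s, text,
                      flags=re.IGNORECASE)
    return text


print(respell("The samurai bowed.", {"samurai": "samoorai"}))
# prints: The samoorai bowed.
```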

Emotion

Contextual Cues: Write the text like a book, including cues such as "he said angrily."
Punctuation: Commas, periods, exclamation marks, and question marks help guide intonation.
Caps Lock: Emphasizing words or sentences with all caps often works.
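The cues above can be applied mechanically when lines come from a script file. This is a small sketch with a hypothetical helper, with_cue, that wraps a line in quotes, appends a narrative cue, and puts chosen words in all caps; whether a given cue helps is something to judge by listening.

```python
import re


def with_cue(line, cue, emphasize=()):
    """Format dialogue like a line from a novel, with optional all-caps emphasis."""
    for word in emphasize:
        pattern = r"\b" + re.escape(word) + r"\b"
        line = re.sub(pattern, lambda _match, w=word: w.upper(), line)
    return f'"{line}" {cue}.'


print(with_cue("I told you not to touch that!", "he said angrily",
               emphasize=["not"]))
# prints: "I told you NOT to touch that!" he said angrily.
```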

Pacing

Avoid Multi-Clipping: When cloning a voice, submit one sample file with natural pauses rather than several short clips.
Editing Software: Use tools like Descript to assemble one clean file.
Write Descriptively: Add textual cues for the desired pacing, e.g., "he said slowly."

Combining these tips with the sliders can help you get the best performance out of a voice. Lowering the similarity slider when using prompts can make the AI more flexible in how it responds to them.

Additional Tips and Tricks

Keep generating until you get a take you like. Think of it as working with a human actor: if the first take doesn't work, try again until you have one that does.
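In a scripted workflow, "try again" can mean requesting several takes up front and picking the best by ear. The sketch below assumes you supply your own generate function that returns one take per call; collect_takes and fake_generate are hypothetical names, and the stand-in generator only returns labeled strings so the example runs without any API access.

```python
from itertools import count


def collect_takes(generate, text, takes=3):
    """Call generate(text) several times and return every take for review."""
    if takes < 1:
        raise ValueError("takes must be at least 1")
    return [generate(text) for _ in range(takes)]


# Stand-in for demonstration only; a real generator would return audio.
_take_numbers = count(1)


def fake_generate(text):
    return f"take {next(_take_numbers)}: {text}"


for take in collect_takes(fake_generate, "Welcome back.", takes=3):
    print(take)
```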

Keywords

Text-to-speech
Elevenlabs
Voice selection
Multilingual V2
Stability slider
Similarity slider
Style exaggeration
Speaker boost
Pauses
Pronunciation
Emotion
Pacing

FAQ

Q: Which Elevenlabs model should I use for the best overall performance?

A: Multilingual V2 is generally the best option for its stability, accuracy, and wide language support.

Q: How can I ensure that the generated speech has the right emotional tone?

A: You can write your script with emotional cues and adjust the stability slider for more or less emotional range. Adding punctuation and using descriptive text can also help guide the AI.

Q: How can I add pauses in the generated speech?

A: Use programmatic syntax like <break time="1.5s"/>, or try dashes, em dashes, or ellipses for brief pauses.

Q: What should I do if the AI pronounces a word incorrectly?

A: You can use phonetic spelling to adjust pronunciation or employ SSML tags with IPA or CMU ARPAbet for precise control.

Q: Why does my cloned voice sound too fast?

A: This could be due to submitting multiple sample clips without pauses. Try merging your samples into a single file with natural gaps.

Q: Is Elevenlabs generation deterministic?

A: No, each generation will be slightly different, even with identical settings. Use higher stability settings for more consistency, and keep generating until you get the desired result.

Q: How can I reduce unwanted background noise in my cloned voice?

A: Ensure your original recordings are as clean as possible, free of background noise, sibilance, or electronic interference.

Q: Can I use Elevenlabs for free?

A: Yes, there is a free tier available to test out these features and tips.