Get better sounding AI voice output from Elevenlabs.
Introduction
Turning text into speech that sounds almost lifelike is no longer just a dream, thanks to Elevenlabs. This guide walks through the settings, sliders, voice selection, and prompting techniques that help you master Elevenlabs' text-to-speech capabilities.
Voice Selection
Selecting the right voice is like picking the right human actor. If you need a fast-talking, punchy voice, opting for someone like Morgan Freeman wouldn't make much sense. Similarly, when browsing the Elevenlabs voice library or creating one in the Voice Lab, ensure the sample clip matches the style of your project.
Choosing the Right Model
Elevenlabs Multilingual V2
- Languages: 29
- Features: Very stable, accurate, handles accents well, and offers language diversity.
Elevenlabs Multilingual V1
- Languages: 9
- Notes: Experimental model, less accurate; avoid unless necessary.
Elevenlabs English V1
- Languages: English only
- Notes: Fastest but least accurate; trained on a smaller data set.
Elevenlabs Turbo V2
- Languages: English only
- Features: Fast generations, but lacks a style slider and may not be as accurate as Multilingual V2.
For most projects, Multilingual V2 is your best bet. It's stable, natural, and accurate.
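The guidance above can be written down as a small decision helper. This is only a sketch: the model identifier strings are assumptions based on Elevenlabs' public naming and should be checked against the current model list, and `pick_model` is a hypothetical helper, not part of any Elevenlabs library.

```python
# Assumed model identifiers; verify against the current Elevenlabs model list.
MULTILINGUAL_V2 = "eleven_multilingual_v2"
TURBO_V2 = "eleven_turbo_v2"


def pick_model(language: str = "en", need_speed: bool = False,
               need_style: bool = False) -> str:
    """Pick a model id following the article's advice.

    Turbo V2 is English only and has no style slider, so it is chosen
    only for English text where speed matters and style is not needed.
    Everything else falls back to Multilingual V2.
    """
    is_english = language.lower().startswith("en")
    if is_english and need_speed and not need_style:
        return TURBO_V2
    return MULTILINGUAL_V2


if __name__ == "__main__":
    print(pick_model("en", need_speed=True))   # fast English narration
    print(pick_model("de", need_speed=True))   # non-English text
```

Multilingual V1 and English V1 are left out on purpose, since the article recommends avoiding them unless necessary.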
Setting Sliders
Stability Slider
- Lower: More emotional range but can lead to odd performances and overly fast speech.
- Higher: More stable voice but can become monotonous.
- Starting Point: Default setting or between 40-50.
Similarity Slider
- Lower: Less like the original voice.
- Higher: More like the original voice but can include artifacts.
- Starting Point: 75-80 is a good setting.
Style Exaggeration
- Zero: Style exaggeration off.
- Higher: Emphasizes the style of the original voice but can decrease stability.
Speaker Boost
- Checkbox: Increases similarity to the original recording but slows down generation.
Generation is non-deterministic: even with identical settings, each run produces slightly different results. The sweet spot for many is 40-50 for stability and 75-80 for similarity.
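If you drive Elevenlabs from code rather than the web interface, the slider positions have to be turned into request settings. The sketch below assumes the sliders run from 0 to 100 and map to 0.0-1.0 values, and that the request body uses the field names `stability`, `similarity_boost`, `style`, and `use_speaker_boost` under `voice_settings`. Treat those names as assumptions to confirm in the API reference. `build_request_body` is a hypothetical helper and makes no network call.

```python
import json


def slider_to_unit(percent: float) -> float:
    """Convert a 0-100 slider position to a 0.0-1.0 setting value."""
    if not 0 <= percent <= 100:
        raise ValueError(f"slider value out of range: {percent}")
    return round(percent / 100, 2)


def build_request_body(text: str, stability: float = 45, similarity: float = 75,
                       style: float = 0, speaker_boost: bool = True,
                       model_id: str = "eleven_multilingual_v2") -> dict:
    """Assemble a text-to-speech request body (field names are assumed)."""
    return {
        "text": text,
        "model_id": model_id,
        "voice_settings": {
            "stability": slider_to_unit(stability),
            "similarity_boost": slider_to_unit(similarity),
            "style": slider_to_unit(style),
            "use_speaker_boost": speaker_boost,
        },
    }


if __name__ == "__main__":
    # Defaults follow the article's starting points: stability 40-50, similarity 75-80.
    print(json.dumps(build_request_body("Hello there."), indent=2))
```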
Prompting
Adding Pauses
- Programmatic Syntax: <break time="1.5s"/> adds a 1.5-second pause.
- Dashes: Use em dashes or multiple dashes.
- Ellipses: Three dots for hesitation e.g., "I... guess so."
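When a script is assembled in code, the break tag can be generated instead of typed by hand. A minimal sketch, with `break_tag` and `join_with_pauses` as hypothetical helper names; it only builds the text and does not check any limit the service may place on pause length.

```python
def break_tag(seconds: float) -> str:
    """Return a break tag such as <break time="1.5s"/>."""
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    # The :g format drops trailing zeros, so 2.0 becomes "2" and 1.5 stays "1.5".
    return f'<break time="{seconds:g}s"/>'


def join_with_pauses(sentences: list[str], seconds: float = 1.5) -> str:
    """Join sentences with a pause of the given length between them."""
    separator = f" {break_tag(seconds)} "
    return separator.join(s.strip() for s in sentences)


if __name__ == "__main__":
    print(join_with_pauses(["Wait for it.", "Now go."]))
```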
Pronunciation
- Programmatic Syntax: Use SSML with IPA or CMU ARPAbet (complex).
- Phonetic Spelling: Fun and flexible. Respell the word the way it should sound, e.g., "samurai" as "samoorai," and try variations until one works.
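Both pronunciation approaches can be sketched in a few lines. The `phoneme` tag follows the standard SSML form. The alphabet names `ipa` and `cmu-arpabet` are assumptions about what the service accepts, and the ARPAbet transcription in the usage example is purely illustrative. `phoneme_tag` and `respell` are hypothetical helpers.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed alphabet names; confirm against the Elevenlabs documentation.
ALPHABETS = {"ipa", "cmu-arpabet"}


def phoneme_tag(word: str, transcription: str, alphabet: str = "ipa") -> str:
    """Wrap a word in an SSML phoneme tag with the given transcription."""
    if alphabet not in ALPHABETS:
        raise ValueError(f"unknown alphabet: {alphabet}")
    return (f"<phoneme alphabet={quoteattr(alphabet)} "
            f"ph={quoteattr(transcription)}>{escape(word)}</phoneme>")


def respell(text: str, respellings: dict[str, str]) -> str:
    """Replace whole words with phonetic respellings, ignoring case."""
    for word, spoken in respellings.items():
        pattern = rf"\b{re.escape(word)}\b"
        # A function replacement keeps backslashes in `spoken` literal.
        text = re.sub(pattern, lambda _m, s=spoken: s, text, flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    print(phoneme_tag("Madison", "M AE1 D IH0 S AH0 N", "cmu-arpabet"))
    print(respell("The samurai bowed.", {"samurai": "samoorai"}))
```

Respelling is the quicker of the two to try, while the phoneme tag gives precise control once you have a transcription you trust.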
Emotion
- Contextual Cues: Write the text like a book, including cues such as "he said angrily."
- Punctuation: Commas, periods, exclamation marks, and question marks help guide intonation.
- Caps Lock: Emphasizing words or sentences with all caps often works.
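The cues above are plain text edits, so they are easy to apply in code when preparing many lines at once. A minimal sketch with two hypothetical helpers: `with_cue` writes a line the way a book would, and `emphasize` puts chosen words in all caps.

```python
import re


def with_cue(line: str, cue: str, speaker: str = "she") -> str:
    """Write a line as book-style dialogue, e.g. '"Stop!" he said angrily.'"""
    return f'"{line.strip()}" {speaker} said {cue}.'


def emphasize(text: str, words: list[str]) -> str:
    """Upper-case each listed word wherever it appears as a whole word."""
    for word in words:
        pattern = rf"\b{re.escape(word)}\b"
        text = re.sub(pattern, lambda m: m.group(0).upper(), text,
                      flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    print(with_cue("I told you not to touch that!", "angrily", speaker="he"))
    print(emphasize("I never said that.", ["never"]))
```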
Pacing
- Avoid Multi-Clipping: When cloning a voice, submit one sample file with natural pauses rather than several separate clips.
- Editing Software: Use tools like Descript to create one clean file.
- Write Descriptively: Add textual cues for the desired pacing e.g., "he said slowly."
Combining these tips with the sliders can help you get the optimal voice. Lowering the similarity slider when using prompts can make the AI more flexible.
Additional Tips and Tricks
Keep generating until you get a take you like. Think of it as working with a human actor: if the first take doesn't work, try again until it does.
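Because each generation differs, it can help to request several takes in one go and pick the best by ear. The sketch below assumes you already have some function that turns text into audio bytes. Here it is passed in as `synthesize`, a hypothetical callable standing in for whatever client code you use. The `.mp3` extension is also an assumption about the returned format.

```python
from pathlib import Path
from typing import Callable


def generate_takes(synthesize: Callable[[str], bytes], text: str,
                   takes: int = 3, out_dir: str = "takes") -> list[Path]:
    """Generate several takes of the same text and save each to its own file."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for n in range(1, takes + 1):
        path = folder / f"take_{n:02d}.mp3"
        # Each call is a fresh generation, so every take can sound different.
        path.write_bytes(synthesize(text))
        paths.append(path)
    return paths
```

Calling `generate_takes(my_tts_function, "Welcome back.", takes=5)` would leave five numbered files in the `takes` folder to compare.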
Keywords
- Text-to-speech
- Elevenlabs
- Voice selection
- Multilingual V2
- Stability slider
- Similarity slider
- Style exaggeration
- Speaker boost
- Pauses
- Pronunciation
- Emotion
- Pacing
FAQ
Q: Which Elevenlabs model should I use for the best overall performance?
- A: Multilingual V2 is generally the best option for its stability, accuracy, and wide language support.
Q: How can I ensure that the generated speech has the right emotional tone?
- A: You can write your script with emotional cues and adjust the stability slider for more or less emotional range. Adding punctuation and using descriptive text can also help guide the AI.
Q: How can I add pauses in the generated speech?
- A: Use programmatic syntax like <break time="1.5s"/>, or try adding dashes, em dashes, or ellipses for brief pauses.
Q: What should I do if the AI pronounces a word incorrectly?
- A: You can use phonetic spelling to adjust pronunciation or employ SSML tags with IPA or CMU ARPAbet for precise control.
Q: Why does my cloned voice sound too fast?
- A: This could be due to submitting multiple sample clips without pauses. Try merging your samples into a single file with natural gaps.
Q: Are the Elevenlabs settings deterministic?
- A: No, each generation will be slightly different. Use higher stability settings and keep generating until you get the desired result.
Q: How can I reduce unwanted background noise in my cloned voice?
- A: Ensure your original recordings are as clean as possible, free of background noise, sibilance, or electronic interference.
Q: Can I use Elevenlabs for free?
- A: Yes, there is a free tier available to test out these features and tips.