Speech to Text & Text to Speech with Microsoft and OutSystems

Introduction

Hello, and welcome to the fourth episode of the AI Dev Series featuring Jean Salvado from Microsoft. In previous episodes, we've explored various Azure AI services, and today we're diving into Azure Speech Services, specifically Speech to Text and Text to Speech functionalities.

Azure Text to Speech Offerings

Azure offers a variety of text-to-speech services:

Pre-built Neural Voices: Available out of the box and supported in multiple regions. These come with accessible APIs, SDKs, and a web portal for easy interaction.
Custom Neural Voice: This service includes Professional, Lite, and Personal options. Each caters to different requirements, such as shorter or longer audio samples for custom training.
Multiple Styles and Languages: A single voice can support multiple styles (e.g., poetry, e-learning) and languages, providing flexibility in usage.

The demo illustrates this with different voices reading poetry in various languages, including the line "In the Garden of Life, Every Rose Has Its Dawn."

Creating Custom Neural Voices

Personal Voice: Can be created with as little as six seconds of audio. The result is not perfect, but it is strikingly similar to the original voice.
Professional vs. Personal: Professional requires more data and produces higher quality. It's used for brand voices and other commercial needs.

We experimented with custom neural voices and showcased the utility of different styles, such as angry, sad, and whispering voices, using pre-existing samples.
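As a minimal sketch of how a speaking style is requested, the snippet below builds an SSML document that wraps the text in an `mstts:express-as` element. The voice name `en-US-JennyNeural` and the default style are assumptions chosen for illustration, not values from the session; which styles are available depends on the voice, so check the voice list before relying on one.

```python
from xml.sax.saxutils import escape, quoteattr


def build_styled_ssml(text, voice="en-US-JennyNeural", style="whispering", lang="en-US"):
    """Return an SSML string asking `voice` to read `text` in `style`.

    The voice and style defaults are illustrative assumptions.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        f'<mstts:express-as style={quoteattr(style)}>'
        f'{escape(text)}'  # escape &, <, > so the SSML stays well-formed
        '</mstts:express-as></voice></speak>'
    )


if __name__ == "__main__":
    # Same sentence, three of the styles mentioned in the session.
    for s in ("angry", "sad", "whispering"):
        print(build_styled_ssml("Every rose has its dawn.", style=s))
```

Keeping the style as a parameter makes it easy to compare how the same sentence sounds across styles once the SSML is sent to the service.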

Use Cases for Text to Speech

Text to speech technologies find applications across multiple domains:

Voice Assistants and Gaming: Implementing voices for more realistic interactions.
Media and Entertainment: Stereo-optimized dubs for films, audiobooks, etc.
Accessibility: Helping people with disabilities access content better.
Healthcare and Education: Training aids and real-time consultation transcriptions.

Speech to Text Technology

Azure Speech offers both real-time and batch speech-to-text transcription. A relatively new addition is Fast Transcription, which can transcribe a 30-minute audio file in under a minute.
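For orientation, here is a hedged sketch of what a speech-to-text call can look like at the HTTP level. It targets the REST endpoint for short audio clips (not Fast Transcription or batch, which use different APIs) and only builds the request; nothing is sent. The region, key, and WAV bytes are placeholders, and the function name is hypothetical.

```python
import urllib.request


def build_stt_request(region, key, wav_bytes, language="en-US"):
    """Build (but do not send) a short-audio speech-to-text request.

    `region` and `key` come from the Speech resource in the Azure Portal;
    the values used below are placeholders.
    """
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        "/speech/recognition/conversation/cognitiveservices/v1"
        f"?language={language}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        # Assumes 16 kHz PCM audio in a WAV container.
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=wav_bytes, headers=headers, method="POST")


if __name__ == "__main__":
    req = build_stt_request("westeurope", "<your-key>", b"RIFF...")
    print(req.get_method(), req.full_url)
    # To actually call the service: urllib.request.urlopen(req).read()
```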

Custom Speech

For noisy environments or specialized jargon, you can create custom speech models. The process involves creating a project, uploading audio for testing and training, training the model, and deploying it.

Responsible AI

Microsoft places significant importance on AI ethics, ensuring all technologies follow strict responsible AI principles.

Integration with OutSystems

The OutSystems AI connector supports basic text-to-speech and speech-to-text functionalities with straightforward setup and usage:

Setup:
- Create a speech service in the Azure Portal.
- Retrieve the endpoint and keys for configuration.
Implementation:
- Utilize simple actions for converting speech to text and vice versa.
- Handle optional configurations for language, voice selection, etc.
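The OutSystems actions hide the underlying HTTP call, and the session does not show the connector's internals. Purely as an illustration of what the endpoint and key from the setup step are used for, the sketch below builds a text-to-speech REST request in Python without sending it. The region, key, voice name, output format, and `User-Agent` value are placeholder assumptions.

```python
import urllib.request


def build_tts_request(region, key, ssml, output_format="riff-24khz-16bit-mono-pcm"):
    """Build (but do not send) a text-to-speech request carrying SSML."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,          # key from the Azure Portal
        "Content-Type": "application/ssml+xml",    # request body is SSML
        "X-Microsoft-OutputFormat": output_format, # audio format of the reply
        "User-Agent": "speech-demo",               # placeholder client name
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


if __name__ == "__main__":
    ssml = (
        '<speak version="1.0" xml:lang="en-US">'
        '<voice name="en-US-JennyNeural">Hello from OutSystems.</voice>'
        "</speak>"
    )
    req = build_tts_request("westeurope", "<your-key>", ssml)
    print(req.get_method(), req.full_url)
    # Sending it with urllib.request.urlopen(req) would return audio bytes.
```

The optional configurations mentioned above map onto this request: language and voice live in the SSML body, while the audio format is a header.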

Demo Scenario

The demo turns a text-based personal assistant application into a multi-modal interactive assistant. The flow records speech, converts it to text, sends the text to an AI service for a response, and then converts the response back into speech.
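The flow of one conversational turn can be sketched as three chained steps. The function and parameter names below are hypothetical; each step is passed in as a callable so the sketch runs with stand-ins and does not assume any particular SDK or connector.

```python
from typing import Callable


def voice_turn(
    audio_in: bytes,
    speech_to_text: Callable[[bytes], str],
    ask_assistant: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> bytes:
    """Run one turn: recorded speech -> text -> AI reply -> spoken reply."""
    question = speech_to_text(audio_in)   # 1. transcribe the recording
    answer = ask_assistant(question)      # 2. get a text response from the AI service
    return text_to_speech(answer)         # 3. synthesize the response as audio


if __name__ == "__main__":
    # Stand-in implementations so the flow can be exercised offline.
    fake_stt = lambda audio: audio.decode("utf-8")
    fake_ai = lambda text: f"You said: {text}"
    fake_tts = lambda text: text.encode("utf-8")

    reply_audio = voice_turn(b"book a meeting", fake_stt, fake_ai, fake_tts)
    print(reply_audio)
```

Injecting the three steps keeps the orchestration separate from the services, so the existing text-based assistant logic can be reused unchanged in the middle step.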

Conclusion

This session highlights the potential of Azure Speech Services and how easily they integrate with OutSystems, encouraging broader adoption of multi-modal AI applications.

Keywords

Azure Speech Services
Text to Speech
Speech to Text
Custom Neural Voices
Multi-modality
Real-time Transcription
Fast Transcription
Custom Speech
OutSystems Integration

FAQ

Q1: What are the offerings in Azure Text to Speech services?

Azure provides pre-built neural voices, custom neural voices in Professional, Lite, and Personal options, and support for multiple styles and languages.

Q2: How can I create a custom neural voice?

You can create a custom voice by training it with your own audio samples. Depending on your needs, you can choose the Professional, Lite, or Personal option, each of which requires a different amount of audio data.

Q3: What is Fast Transcription in Azure Speech to Text services?

Fast Transcription can transcribe a 30-minute audio file in under a minute, which is much faster than waiting for real-time transcription of the same recording.

Q4: How do I set up Azure Speech Services in OutSystems?

Create a speech service in the Azure Portal, retrieve endpoint details and keys, and configure them in OutSystems.

Q5: What are some use cases for Azure Speech Services?

The services can be used in voice assistants, gaming, media, healthcare, education, and accessibility applications.

Q6: What is the importance of responsible AI principles?

Responsible AI principles ensure that AI technologies are used ethically and responsibly, preventing misuse and protecting user data and rights.
