Choose an Azure AI speech recognition and generation technology

04/28/2025

Azure AI services help workload designers and developers create intelligent, market-ready, and responsible applications by using prebuilt and customizable APIs and models.

This article covers AI services that provide speech recognition and generation capabilities such as speech-to-text and text-to-speech conversions, audio translation, and speaker recognition. It also includes reading support for people with learning differences.

Note

To gather insights on terms or phrases or get detailed contextual analysis of spoken or written language, see Choose an Azure AI targeted language processing technology.

Services

The following AI services can provide speech recognition and generation capabilities for your workload.

Azure AI Speech provides speech-to-text, text-to-speech, speech translation, and speaker recognition capabilities.
- Use Speech when you need to transcribe or translate spoken speech and identify speakers in a conversation. You can also use Speech to generate natural-sounding speech from text. The Whisper system in the OpenAI models performs speech-to-text, not speech generation, so compare it with Speech only for transcription scenarios.
- Don't use Speech for chat, content summarization, moderation, or guiding users through scripts. Use other models for those tasks.
Immersive Reader is a tool that implements proven techniques to improve reading comprehension for emerging readers, language learners, and people with learning differences.
- Use Immersive Reader to provide an improved readability experience tailored for language learners or people with learning differences.
- Don't use Immersive Reader for traditional text-to-speech use cases.

Speech

Speech provides speech-to-text and text-to-speech capabilities with a Speech resource. You can transcribe speech-to-text with high accuracy, produce natural-sounding text-to-speech voices, translate spoken audio, and use speaker recognition during conversations. Create custom voices, add specific words to your base vocabulary, or build your own models. Run Speech anywhere, whether in the cloud or at the edge in containers.
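Text-to-speech requests are commonly expressed in Speech Synthesis Markup Language (SSML), which names the voice and language to use. The following is a minimal sketch that only builds the SSML string and doesn't call the service. The helper name `build_ssml` is hypothetical, and the default voice `en-US-JennyNeural` is an assumption. Check the voice list for your region before you rely on it.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, voice: str = "en-US-JennyNeural", lang: str = "en-US") -> str:
    """Wrap plain text in a minimal single-voice SSML document.

    The voice and language defaults are illustrative assumptions.
    """
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        # escape() protects characters such as & and < in the spoken text.
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )


if __name__ == "__main__":
    print(build_ssml("Chapter 1: Tom & Jerry"))
```

You can pass the resulting string to whichever Speech SDK or REST call your workload uses for synthesis.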

Speech is available for multiple languages and regions.

Capabilities

The following table provides a list of capabilities available in Speech.

Capability	Description
Batch transcription	Transcribes a large amount of audio data in storage. Both the speech-to-text REST API and Speech CLI support batch transcription.
Intent recognition	An intent is something that the user wants to do, such as book a flight, check the weather, or make a call. Intent recognition enables your applications, tools, and devices to determine what the user wants to initiate or do based on options. You define user intent in the intent recognizer or conversational language understanding model.
Pronunciation assessment	Evaluates speech pronunciation and gives speakers feedback on the accuracy and fluency of spoken audio.
Speaker recognition	Speaker recognition can help determine who is speaking in an audio clip. The service verifies and identifies speakers through their unique voice characteristics by using voice biometry.
Speech-to-text	Converts audio streams to text in real time or in batch processing.
Text-to-speech	Enables your applications, tools, or devices to convert text into humanlike synthesized speech.
Speech translation	Provides multiple-language speech-to-speech and speech-to-text translation of audio streams.
Video translation	Translates and generates videos in multiple languages automatically.
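To illustrate the batch transcription capability, the following sketch prepares a request for the speech-to-text REST API without sending it. It assumes version 3.2 of the API, a regional endpoint, and key-based authentication. The function name and default values are hypothetical, and newer API versions use a different path, so confirm the details against the REST API reference for your resource.

```python
import json
import urllib.request


def build_batch_transcription_request(
    region: str,
    key: str,
    content_urls: list[str],
    locale: str = "en-US",
    display_name: str = "Sample batch transcription",
) -> urllib.request.Request:
    """Build (but don't send) a request that creates a batch transcription job."""
    # Assumption: v3.2 endpoint shape for the speech-to-text REST API.
    url = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions"
    body = {
        "contentUrls": list(content_urls),  # audio files already in storage
        "locale": locale,
        "displayName": display_name,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_batch_transcription_request(
        "westus", "<your-key>", ["https://example.com/audio/call1.wav"]
    )
    print(req.get_method(), req.full_url)
```

To submit the job, pass the request to `urllib.request.urlopen` and then poll the transcription resource that the service returns.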

Use cases

The following table describes some of the ways that you can use Speech.

Use case	Capability to use	Description
Audio content creation	Text-to-speech	Make interactions with chatbots and voice assistants more natural and engaging by using neural voices. Convert digital texts such as e-books into audiobooks and enhance in-car navigation systems.
Call center transcription	Speech-to-text	Transcribe calls in real time or process a batch of calls, redact personally identifying information, and extract insights such as sentiment to help with your call center use case.
Captioning	Speech-to-text	Synchronize captions with your input audio, apply profanity filters, get partial results, apply customizations, and identify spoken languages for multilingual scenarios.
Language learning	Pronunciation assessment, speech-to-text, and text-to-speech	Provide pronunciation assessment feedback to language learners, support real-time transcription for remote learning conversations, and read aloud teaching materials with neural voices.
Voice assistants	Text-to-speech	Create natural, humanlike conversational interfaces for applications and experiences. The voice assistant feature provides fast and reliable interaction between a device and an assistant implementation.
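For the captioning use case, synchronizing captions with audio mostly comes down to converting recognition timings into caption timestamps. The following sketch assumes that each recognized phrase arrives with an offset and a duration in ticks of 100 nanoseconds, which is how the Speech SDK reports timings, and formats the phrases as SubRip (SRT) caption blocks. The function names and the tuple layout are hypothetical.

```python
TICKS_PER_MS = 10_000  # one tick is 100 nanoseconds


def ticks_to_srt(ticks: int) -> str:
    """Convert a tick count to an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = ticks // TICKS_PER_MS
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"


def to_srt(phrases: list[tuple[int, int, str]]) -> str:
    """Format (offset_ticks, duration_ticks, text) tuples as SRT blocks."""
    blocks = []
    for index, (offset, duration, text) in enumerate(phrases, start=1):
        start = ticks_to_srt(offset)
        end = ticks_to_srt(offset + duration)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # Blank line between blocks, as the SRT format expects.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0, 15_000_000, "Welcome to the call."),
                  (20_000_000, 10_000_000, "How can I help?")]))
```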

Immersive Reader

Immersive Reader, part of AI services, is an inclusively designed tool that implements proven techniques to improve reading comprehension for new readers, language learners, and people with learning differences such as dyslexia. With the Immersive Reader client library, you can use the same technology used in Microsoft Word and Microsoft OneNote to provide an enhanced experience for your workload's users.

Capabilities

The following capabilities are available for your workload to help users achieve their reading comprehension goals.

- Isolate content to improve readability.
- Display pictures for common words and terms.
- Help readers understand parts of speech and grammar by highlighting verbs, nouns, and pronouns.
- Read content aloud, such as user-selected text in your workload's UI.
- Translate content into many languages in real time. This capability helps improve comprehension for readers who are learning a new language.
- Break words into syllables to improve readability or to help readers sound out new words.
