Text to speech avatar converts text into a digital video of a photorealistic human (either a standard avatar or a custom text to speech avatar) speaking with a natural-sounding voice. You can synthesize the text to speech avatar video asynchronously or in real time. Developers can build applications integrated with text to speech avatar through an API, or use Text to speech avatar in Foundry to create video content without coding.
By using text to speech avatar's advanced models, you can deliver life-like and high-quality synthetic talking avatar videos for various applications while adhering to responsible AI practices.
Tip
To convert text to speech with a no-code approach, try the Microsoft Foundry Text to Speech avatar.
Avatar capabilities
Text to speech avatar capabilities include:
- Converts text into a digital video of a photorealistic human speaking with natural-sounding voices powered by Azure AI text to speech.
- Provides a collection of standard avatars. See Standard avatars for a full list of supported standard avatars.
- Azure AI text to speech generates the voice of the avatar. For more information, see Avatar voice and language.
- Synthesizes text to speech avatar video asynchronously with the batch synthesis API or in real time.
- Offers the Text to speech avatar tool in Microsoft Foundry for creating video content without coding.
- Enables real-time avatar conversations through Voice Live in Foundry.
- Supports creating a voice agent with an avatar in Voice Live.
By using text to speech avatar's advanced neural network models and Photo avatar's VASA-1 models, you can deliver lifelike and high-quality synthetic talking avatar videos for various applications while adhering to responsible AI practices.
Avatar voice and language
You can choose from a range of standard voices for the avatar. The language support for text to speech avatar is the same as the language support for text to speech. For details, see Language and voice support for the Speech service. You can access standard text to speech avatars through the Microsoft Foundry Text to Speech avatar or via API.
The voice in the synthetic video can be an Azure Speech in Foundry Tools standard voice or a custom voice of a voice talent that you select.
Avatar type
- Video Avatar: The avatar is generated by a model that is fine-tuned on a video recording. It supports half-body and full-body representations.
- Photo Avatar: The avatar is created from a single input image used as the prompt and is limited to a head-only representation.
Avatar video output
For a video avatar, or an avatar with a body, the output resolution defaults to 1920 x 1080 for both batch synthesis and real-time synthesis. You can also choose to train custom avatars at 4K resolution. The frame rate is 25 frames per second (FPS). For batch synthesis, the codec can be H264, HEVC, or AV1 if the format is mp4, and VP9 or AV1 if the format is webm. Only VP9 can contain an alpha channel. For real-time synthesis, the codec is H264. You can configure the video bitrate in the request for both batch synthesis and real-time synthesis; the default value is 2,000,000. For more detailed configuration options, see the sample code.
Photo avatar resolution is 512x512 for both batch synthesis and real-time synthesis.
Video Avatar
| | Batch synthesis | Real-time synthesis |
|---|---|---|
| Resolution | 1920 x 1080/3840 x 2160 | 1920 x 1080/3840 x 2160 |
| FPS | 25 | 25 |
| Codec | H264/HEVC/VP9/AV1 | H264 |
Photo Avatar
| | Batch synthesis | Real-time synthesis |
|---|---|---|
| Resolution | 512x512 | 512x512 |
| FPS | 25 | 25 |
| Codec | H264/HEVC/VP9 | H264 |
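The codec and format rules stated above for video avatars can be checked before a request is sent. The following is a minimal sketch of such a check, not part of any Speech SDK: `VideoOutputSettings`, `validate_video_output`, and the constant names are hypothetical, and the rules are transcribed only from the description in this section. The alpha channel rule is applied to batch output only, because that is the context in which this section states it.

```python
from __future__ import annotations

from dataclasses import dataclass

# Rules transcribed from this section. All names are hypothetical helpers,
# not identifiers from the Speech service API.
BATCH_CODECS_BY_FORMAT = {
    "mp4": {"h264", "hevc", "av1"},
    "webm": {"vp9", "av1"},
}
REALTIME_CODECS = {"h264"}
ALPHA_CAPABLE_CODECS = {"vp9"}
DEFAULT_BITRATE = 2_000_000  # default value quoted in this section


@dataclass(frozen=True)
class VideoOutputSettings:
    mode: str                     # "batch" or "realtime"
    codec: str                    # for example "h264" or "vp9"
    container: str = "mp4"        # only consulted for batch synthesis
    alpha: bool = False           # request an alpha channel (batch only here)
    bitrate: int = DEFAULT_BITRATE


def validate_video_output(settings: VideoOutputSettings) -> list[str]:
    """Return a list of problems; an empty list means the settings are consistent."""
    problems: list[str] = []
    codec = settings.codec.lower()

    if settings.mode == "realtime":
        if codec not in REALTIME_CODECS:
            problems.append("real-time synthesis uses the H264 codec")
    elif settings.mode == "batch":
        allowed = BATCH_CODECS_BY_FORMAT.get(settings.container.lower())
        if allowed is None:
            problems.append(f"unknown format: {settings.container}")
        elif codec not in allowed:
            problems.append(f"codec {codec} is not listed for format {settings.container}")
        if settings.alpha and codec not in ALPHA_CAPABLE_CODECS:
            problems.append("only VP9 can contain an alpha channel")
    else:
        problems.append(f"unknown mode: {settings.mode}")

    if settings.bitrate <= 0:
        problems.append("bitrate must be positive")
    return problems
```

For example, `validate_video_output(VideoOutputSettings(mode="batch", codec="vp9", container="webm", alpha=True))` returns an empty list, while requesting VP9 inside an mp4 file reports one problem.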
Custom text to speech avatar
You can create custom text to speech avatars that are unique to your product or brand. For a custom video avatar, all it takes to get started is 10 minutes of video recordings. For a custom photo avatar, you only need one photo. If you fine-tune a professional voice for the actor, the avatar can be highly realistic.
Several options are available for the voice part of a custom avatar:
1. Voice sync for avatar
Voice sync for avatar is the most efficient custom voice option for a custom video avatar. It trains alongside the custom avatar by using audio from the training video. The voice exclusively associates with the custom avatar and can't be used independently. Voice sync for avatar is only available for the custom video avatar. For more information, see Voice sync for avatar.
2. Professional voice
Professional voice is a type of custom voice that provides higher voice quality. Professional voice fine-tuning and custom text to speech avatar have separate processes for obtaining limited access and training models. You can use them independently or together. If you plan to use a fine-tuned professional voice with a text to speech avatar, you need to deploy or copy the professional voice model to one of the regions that support avatars.
3. Personal voice
Personal voice provides audio quality comparable to voice sync for avatar and can be used either with avatars or independently.
For more information, see What is custom text to speech avatar.
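The availability rules for the three custom voice options can be summarized in a few lines of code. This is only an illustrative sketch of the constraints described above: the `AvatarType` and `VoiceOption` enums and both functions are hypothetical names, not part of the Speech service API.

```python
from __future__ import annotations

from enum import Enum


class AvatarType(Enum):
    VIDEO = "video"   # custom video avatar
    PHOTO = "photo"   # custom photo avatar


class VoiceOption(Enum):
    VOICE_SYNC = "voice sync for avatar"
    PROFESSIONAL = "professional voice"
    PERSONAL = "personal voice"


def custom_voice_options(avatar_type: AvatarType) -> list[VoiceOption]:
    """List the custom voice options this section describes for an avatar type."""
    options = [VoiceOption.PROFESSIONAL, VoiceOption.PERSONAL]
    # Voice sync for avatar is only available for the custom video avatar.
    if avatar_type is AvatarType.VIDEO:
        options.insert(0, VoiceOption.VOICE_SYNC)
    return options


def usable_without_avatar(option: VoiceOption) -> bool:
    """Voice sync is tied to its avatar; the other two can be used independently."""
    return option is not VoiceOption.VOICE_SYNC
```

Calling `custom_voice_options(AvatarType.PHOTO)` returns only the professional and personal voice options, which reflects that voice sync trains on audio from the avatar's training video.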
Sample code
Sample code for text to speech avatar is available on GitHub. The samples cover the most popular scenarios.
Pricing
- Text to speech is billed separately from the avatar throughout an avatar real-time session or batch content creation.
- Voice sync for avatar (through custom avatar training) costs the same as a personal voice for voice creation and synthesis. The storage of the voice is free.
- To learn how billing works for the text to speech avatar feature, see text to speech avatar pricing note.
- For detailed pricing, see Speech service pricing. Avatar pricing is visible only for service regions where the feature is available. For the current list of supported regions, see the Speech service regions table.
Available locations
For the current list of regions that support text to speech avatar, see the Speech service regions table.
Responsible AI
Microsoft cares about the people who use AI and the people who are affected by it as much as it cares about technology. For more information, see the Responsible AI transparency notes and disclosure for voice and avatar talent.