This article answers commonly asked questions about the text to speech (TTS) capability. If you can't find answers to your questions here, check out other support options.
General
How does the billing work for text to speech?
Text to speech usage is billed per character. Check the definition of billable characters in the pricing note.
What is the rate limit for the text to speech synthesis requests?
The text to speech synthesis rate scales automatically as the service receives more requests. A default rate limit is set per Speech resource. The limit is adjustable with business justification, and no extra charges are incurred for rate limit increases. For more details, see Speech service quotas and limits.
How would we disclose to the end user that the voice is a synthetic voice?
We recommend that every user follow our code of conduct when using the text to speech capability. There are several ways to disclose the synthetic nature of the voice, including implicit and explicit bylines. Refer to Disclosure design guidelines.
How can I reduce the latency for my voice app?
We provide several tips for you to lower the latency and bring the best performance to your users. See Lower speech synthesis latency using Speech SDK.
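For example, one common tip is to consume the audio as a stream so playback can start before the whole utterance is synthesized. Here's a minimal sketch of that approach with the Speech SDK for Python; the subscription key, region, and voice name are placeholders:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

# No audio_config: we consume the audio ourselves instead of playing it directly.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

# start_speaking_text_async returns as soon as the first audio is available,
# so playback can begin before synthesis of the full utterance finishes.
result = synthesizer.start_speaking_text_async("Hello, world!").get()
stream = speechsdk.AudioDataStream(result)

audio_buffer = bytes(16000)
filled = stream.read_data(audio_buffer)
while filled > 0:
    # Hand each chunk to your audio player as it arrives.
    filled = stream.read_data(audio_buffer)
```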
What output audio formats does text to speech support?
Azure AI text to speech supports various streaming and non-streaming audio formats at commonly used sampling rates. All prebuilt neural voices are created to support high-fidelity audio output at 48 kHz and 24 kHz. The audio can be resampled to other rates as needed. See Audio outputs.
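For illustration, a minimal sketch of requesting a 48 kHz output format with the Speech SDK for Python; the key and region are placeholders:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")

# Request 48 kHz high-fidelity output; other members of SpeechSynthesisOutputFormat
# cover the remaining streaming and non-streaming formats.
speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Riff48Khz16BitMonoPcm
)

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("High fidelity output.").get()
```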
Can the voice be customized to stress specific words?
Adjusting the emphasis is supported for some voices depending on the locale. See the emphasis tag.
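As an illustrative sketch, here's SSML using the emphasis element sent through the Speech SDK for Python; the voice name is an example, and emphasis support varies by voice and locale:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# The emphasis element accepts levels such as reduced, moderate, and strong.
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-GuyNeural">
    I can help you <emphasis level="strong">right now</emphasis>.
  </voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()
```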
Can we have multiple strengths for each emotion, such as sad and slightly sad?
Adjusting the style degree is supported for some voices depending on the locale. See the mstts:express-as tag.
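For example, a minimal sketch of adjusting style strength with the styledegree attribute via the Speech SDK for Python; the voice and style shown are examples, and availability varies by voice and locale:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# styledegree ranges from 0.01 (barely perceptible) to 2 (double the default intensity).
# The sample sentence means "I'm really sad today."
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="zh-CN">
  <voice name="zh-CN-XiaomoNeural">
    <mstts:express-as style="sad" styledegree="0.5">今天真的好难过。</mstts:express-as>
  </voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()
```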
Is there a mapping between Viseme IDs and mouth shape?
Yes. See Get facial position with viseme.
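For example, here's a minimal sketch of subscribing to viseme events with the Speech SDK for Python; each event carries a viseme ID that maps to a mouth shape, plus its offset in the audio:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

def on_viseme(evt: speechsdk.SpeechSynthesisVisemeEventArgs):
    # audio_offset is in ticks (100 nanoseconds); divide by 10,000 for milliseconds.
    print(f"Viseme ID {evt.viseme_id} at {evt.audio_offset / 10000:.0f} ms")

synthesizer.viseme_received.connect(on_viseme)
result = synthesizer.speak_text_async("Visemes drive facial animation.").get()
```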
Audio Content Creation
How can I reference a lexicon file that I created on the Audio Content Creation platform in my code?
First, open the lexicon file on the Audio Content Creation platform and obtain the lexicon file ID, which is located before "?fileKind=CustomLexiconFile" in the file path. For example, if the file path is https://speech.microsoft.com/portal/d391a094f76846acbcd11dc2ba835f4f/audiocontentcreation/file/6cbc2527-8d57-4c1b-b9d9-3ea6d13ca95c?fileKind=CustomLexiconFile, the lexicon file ID is 6cbc2527-8d57-4c1b-b9d9-3ea6d13ca95c. Then, switch a file that references this lexicon to SSML format on the Audio Content Creation platform. In the SSML file, locate the <!--ID=FCB XML node, where you can find the URI of the lexicon file keyed by that file ID. Finally, reference the lexicon file URI with the SSML lexicon element in your code. For instance, if you locate the XML node <!--ID=FCB5B6FB566-33CA-4B68-BEAF-B013C53B3368;Version=1|{"Files":{"6cbc2527-8d57-4c1b-b9d9-3ea6d13ca95c":{"FileKind":"CustomLexiconFile","FileSubKind":"CustomLexiconFile","Uri":"https://cvoiceprodwus2.blob.core.windows.net/acc-public-files/d391a094f76846acbcd11dc2ba835f4f/e9a6a5a2-9cef-47f4-b961-d175be75d92f.xml"}}}, the lexicon file URI is https://cvoiceprodwus2.blob.core.windows.net/acc-public-files/d391a094f76846acbcd11dc2ba835f4f/e9a6a5a2-9cef-47f4-b961-d175be75d92f.xml.
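Putting it together, a minimal sketch that references the example lexicon URI above via the SSML lexicon element, using the Speech SDK for Python; the key, region, and voice name are placeholders:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# The lexicon element must appear inside the voice element it applies to.
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <lexicon uri="https://cvoiceprodwus2.blob.core.windows.net/acc-public-files/d391a094f76846acbcd11dc2ba835f4f/e9a6a5a2-9cef-47f4-b961-d175be75d92f.xml"/>
    BTW, we will arrive at 3pm.
  </voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()
```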
Custom neural voice
How much data is required to create a custom neural voice?
Custom neural voice (CNV) supports two project types: CNV Pro and CNV Lite. With CNV Pro, at least 300 lines of recordings (or approximately 30 minutes of speech) are required as training data for custom neural voice. We recommend 2,000 lines of recordings (or approximately 2-3 hours of speech) to create a voice for production use. With CNV Lite, you can create a voice with just 20 recorded samples. CNV Lite is best for quick trials, or when you don't have access to professional voice actors. For the script selection criteria, see Record custom voice samples.
Can we include duplicate text sentences in the same set of training data?
No. The service flags duplicate sentences and keeps only the first one imported. For the script selection criteria, see Record custom voice samples.
Can we include multiple styles in the same set of training data?
We recommend that you keep the style consistent within one set of training data. If the styles differ, put them into different training sets, and consider using the multi-style voice training feature of custom neural voice. For the script selection criteria, see Record custom voice samples.
Does switching styles via SSML work for custom neural voices?
Switching styles via SSML works for both prebuilt multi-style voices and CNV multi-style voices. With multi-style training, you can create a voice that speaks in different styles, and you can also adjust these styles via SSML.
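As a sketch, here's how switching styles within a single request might look for a deployed CNV multi-style voice, using the Speech SDK for Python; the voice name, endpoint ID, and style names are hypothetical placeholders for your own trained model:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")
# A custom neural voice also requires the endpoint ID of your deployed model.
speech_config.endpoint_id = "YourEndpointId"
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# Each express-as element switches the voice to one of its trained styles.
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="YourCustomVoiceName">
    <mstts:express-as style="cheerful">Great news, your order shipped today!</mstts:express-as>
    <mstts:express-as style="sad">Unfortunately, one item is out of stock.</mstts:express-as>
  </voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()
```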
How does cross-lingual voice work with languages that have different pronunciation structure and assembly?
Sentence structure and pronunciation naturally vary across languages such as English and Japanese. Each neural voice is trained with audio data recorded by native-speaking voice talent. For cross-lingual voice, we transfer the major features, such as timbre, to sound like the original speaker while preserving the correct pronunciation. For example, a cross-lingual voice speaks Japanese in the native way while still sounding similar to (but not exactly like) the original English speaker.
Can I use custom neural voice to customize pronunciation for my domain?
Custom neural voice enables you to create a brand voice for your business. You can optimize it for your domain as well. We recommend you include domain-specific samples in your training data for higher naturalness. However, the pronunciation is defined by the Speech service by default, and we don't support pronunciation customization during the CNV training. If you want to customize pronunciation for your voice, use SSML. See Pronunciation with Speech Synthesis Markup Language (SSML).
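For example, a minimal sketch of overriding pronunciation with the SSML phoneme element via the Speech SDK for Python; the voice name and IPA transcription are illustrative:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# The phoneme element overrides the default pronunciation of the enclosed text.
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>
  </voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()
```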
After one training, can I train my voice again?
Yes, you can train your voice again. Each training creates a new voice model, and you're charged for each training.
Is the model version the same as the engine version?
No. The model version is different from the engine version. The model version refers to the version of the training recipe for your model, and varies by the features supported and the time the model was trained. Azure AI services text to speech engines are updated from time to time to capture the latest language model that defines the pronunciation of the language. After you've trained your voice, you can apply it to the new language model by updating to the latest engine version. When a new engine is available, you're prompted to update your neural voice model. See Update engine version for your voice model.
Can we limit the number of trainings using Azure Policy or other features? Or is there any way to avoid false training?
If you want to limit training permissions, you can restrict user roles and access. Refer to Role-based access control for Speech resources.
Can Microsoft add a mechanism to prevent unauthorized use or misuse of our voice when it's created?
Your voice model can only be used by you, with your own token. Microsoft also doesn't use your data. See Data, privacy, and security. You can also request to add watermarks to your voice to protect your model. See Microsoft Azure Neural TTS introduces the watermark algorithm for synthetic voice identification.
Do you have any tips about contracts or negotiation with voice actors?
We have no recommendations on contracts; the terms are up to you and the voice talent to negotiate. However, you should make sure the voice talent understands the capabilities of text to speech, including its potential risks, and provides explicit consent to the creation of a synthetic version of their voice in both the contract and a verbal statement. See Disclosure for voice talent.
Do we need to return the written permission from the voice talent back to Microsoft?
Microsoft doesn't need the written permission, but you must obtain consent from your voice talent. The voice talent is also required to record a consent statement, which must be uploaded to Speech Studio before training can begin. See Set up voice talent for custom neural voice.