This article answers commonly asked questions about the text to speech (TTS) capability. If you can't find answers to your questions here, check out other support options.
General
How does the billing work for text to speech?
Text to speech usage is billed per character. Check the definition of billable characters in the pricing note.
What is the rate limit for the text to speech synthesis requests?
The text to speech synthesis rate scales automatically as the service receives more requests. A default rate limit is set per Speech resource. The limit is adjustable with business justification, and no extra charges are incurred for rate limit increases. For more details, see Speech service quotas and limits.
How would we disclose to the end user that the voice is a synthetic voice?
We recommend that every user follow our code of conduct when using the text to speech capability. There are several ways to disclose the synthetic nature of the voice, including implicit and explicit bylines. Refer to Disclosure design guidelines.
How can I reduce the latency for my voice app?
We provide several tips for you to lower the latency and bring the best performance to your users. See Lower speech synthesis latency using Speech SDK.
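For example, one common tip is to consume the audio as a stream so playback can start before the whole utterance is synthesized. Here's a minimal sketch of that approach with the Speech SDK for Python; the subscription key, region, and voice name are placeholders:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

# No audio_config: we consume the audio ourselves instead of playing it directly.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

# start_speaking_text_async returns as soon as the first audio is available,
# so playback can begin before synthesis of the full utterance finishes.
result = synthesizer.start_speaking_text_async("Hello, world!").get()
stream = speechsdk.AudioDataStream(result)

audio_buffer = bytes(16000)
filled = stream.read_data(audio_buffer)
while filled > 0:
    # Hand each chunk to your audio player as it arrives.
    filled = stream.read_data(audio_buffer)
```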
What output audio formats does text to speech support?
Azure AI text to speech supports various streaming and non-streaming audio formats at commonly used sampling rates. All prebuilt neural voices are created to support high-fidelity audio output at 48 kHz and 24 kHz. The audio can be resampled to other rates as needed. See Audio outputs.
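For illustration, a minimal sketch of requesting a 48 kHz output format with the Speech SDK for Python; the key and region are placeholders:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")

# Request 48 kHz high-fidelity output; other members of SpeechSynthesisOutputFormat
# cover the remaining streaming and non-streaming formats.
speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Riff48Khz16BitMonoPcm
)

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("High fidelity output.").get()
```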
Can the voice be customized to stress specific words?
Adjusting the emphasis is supported for some voices depending on the locale. See the emphasis tag.
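As an illustrative sketch, here's SSML using the emphasis element sent through the Speech SDK for Python; the voice name is an example, and emphasis support varies by voice and locale:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# The emphasis element accepts levels such as reduced, moderate, and strong.
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-GuyNeural">
    I can help you <emphasis level="strong">right now</emphasis>.
  </voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()
```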
Can we have multiple strengths for each emotion, such as sad and slightly sad?
Adjusting the style degree is supported for some voices depending on the locale. See the mstts:express-as tag.
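For example, a minimal sketch of adjusting style strength with the styledegree attribute via the Speech SDK for Python; the voice and style shown are examples, and availability varies by voice and locale:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# styledegree ranges from 0.01 (barely perceptible) to 2 (double the default intensity).
# The sample sentence means "I'm really sad today."
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="zh-CN">
  <voice name="zh-CN-XiaomoNeural">
    <mstts:express-as style="sad" styledegree="0.5">今天真的好难过。</mstts:express-as>
  </voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()
```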
Is there a mapping between Viseme IDs and mouth shape?
Yes. See Get facial position with viseme.
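For example, here's a minimal sketch of subscribing to viseme events with the Speech SDK for Python; each event carries a viseme ID that maps to a mouth shape, plus its offset in the audio:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

def on_viseme(evt: speechsdk.SpeechSynthesisVisemeEventArgs):
    # audio_offset is in ticks (100 nanoseconds); divide by 10,000 for milliseconds.
    print(f"Viseme ID {evt.viseme_id} at {evt.audio_offset / 10000:.0f} ms")

synthesizer.viseme_received.connect(on_viseme)
result = synthesizer.speak_text_async("Visemes drive facial animation.").get()
```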
Audio Content Creation
How can I reference a lexicon file that I created on the Audio Content Creation platform in my code?
First, open the lexicon file on the Audio Content Creation platform and obtain the lexicon file ID, which is located before "?fileKind=CustomLexiconFile" in the file path. For example, if the file path is https://speech.microsoft.com/portal/d391a094f76846acbcd11dc2ba835f4f/audiocontentcreation/file/6cbc2527-8d57-4c1b-b9d9-3ea6d13ca95c?fileKind=CustomLexiconFile, the lexicon file ID is 6cbc2527-8d57-4c1b-b9d9-3ea6d13ca95c. Then, switch a file that references this lexicon to SSML format on the Audio Content Creation platform. In the SSML file, locate the <!--ID=FCB XML node, where you can find the URI of the lexicon file keyed by that file ID. Finally, reference the lexicon file URI with the SSML lexicon element in your code. For instance, if you locate the XML node <!--ID=FCB5B6FB566-33CA-4B68-BEAF-B013C53B3368;Version=1|{"Files":{"6cbc2527-8d57-4c1b-b9d9-3ea6d13ca95c":{"FileKind":"CustomLexiconFile","FileSubKind":"CustomLexiconFile","Uri":"https://cvoiceprodwus2.blob.core.windows.net/acc-public-files/d391a094f76846acbcd11dc2ba835f4f/e9a6a5a2-9cef-47f4-b961-d175be75d92f.xml"}}}, the lexicon file URI is https://cvoiceprodwus2.blob.core.windows.net/acc-public-files/d391a094f76846acbcd11dc2ba835f4f/e9a6a5a2-9cef-47f4-b961-d175be75d92f.xml.
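Putting it together, a minimal sketch that references the example lexicon URI above via the SSML lexicon element, using the Speech SDK for Python; the key, region, and voice name are placeholders:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# The lexicon element must appear inside the voice element it applies to.
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <lexicon uri="https://cvoiceprodwus2.blob.core.windows.net/acc-public-files/d391a094f76846acbcd11dc2ba835f4f/e9a6a5a2-9cef-47f4-b961-d175be75d92f.xml"/>
    BTW, we will arrive at 3pm.
  </voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()
```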
Custom neural voice
How much data is required to create a custom neural voice?
Custom neural voice (CNV) supports two project types: CNV Pro and CNV Lite. With CNV Pro, at least 300 lines of recordings (or approximately 30 minutes of speech) are required as training data for custom neural voice. We recommend 2,000 lines of recordings (or approximately 2-3 hours of speech) to create a voice for production use. With CNV Lite, you can create a voice with just 20 recorded samples. CNV Lite is best for quick trials, or when you don't have access to professional voice actors. For the script selection criteria, see Record custom voice samples.
Can we include duplicate text sentences in the same set of training data?
No. The service flags duplicate sentences and keeps only the first one imported. For the script selection criteria, see Record custom voice samples.
Can we include multiple styles in the same set of training data?
We recommend that you keep the style consistent within one set of training data. If the styles differ, put them into different training sets, and consider using the multi-style voice training feature of custom neural voice. For the script selection criteria, see Record custom voice samples.
Does switching styles via SSML work for custom neural voices?
Switching styles via SSML works for both prebuilt multi-style voices and CNV multi-style voices. With multi-style training, you can create a voice that speaks in different styles, and you can also adjust these styles via SSML.
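As a sketch, here's how switching styles within a single request might look for a deployed CNV multi-style voice, using the Speech SDK for Python; the voice name, endpoint ID, and style names are hypothetical placeholders for your own trained model:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")
# A custom neural voice also requires the endpoint ID of your deployed model.
speech_config.endpoint_id = "YourEndpointId"
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# Each express-as element switches the voice to one of its trained styles.
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="YourCustomVoiceName">
    <mstts:express-as style="cheerful">Great news, your order shipped today!</mstts:express-as>
    <mstts:express-as style="sad">Unfortunately, one item is out of stock.</mstts:express-as>
  </voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()
```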
How does cross-lingual voice work with languages that have different pronunciation structure and assembly?
Sentence structure and pronunciation naturally vary across languages such as English and Japanese. Each neural voice is trained with audio data recorded by native-speaking voice talent. For cross-lingual voice, we transfer the major features, such as timbre, to sound like the original speaker while preserving the correct pronunciation. For example, a cross-lingual voice speaks Japanese in the native way while still sounding similar to (but not exactly like) the original English speaker.
Can I use custom neural voice to customize pronunciation for my domain?
Custom neural voice enables you to create a brand voice for your business. You can optimize it for your domain as well. We recommend you include domain-specific samples in your training data for higher naturalness. However, the pronunciation is defined by the Speech service by default, and we don't support pronunciation customization during the CNV training. If you want to customize pronunciation for your voice, use SSML. See Pronunciation with Speech Synthesis Markup Language (SSML).
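For example, a minimal sketch of overriding pronunciation with the SSML phoneme element via the Speech SDK for Python; the voice name and IPA transcription are illustrative:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# The phoneme element overrides the default pronunciation of the enclosed text.
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>
  </voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()
```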
After one training, can I train my voice again?
Yes, you can train your voice again. Each training creates a new voice model, and you're charged for each training.
Is the model version the same as the engine version?
No. The model version is different from the engine version. The model version refers to the version of the training recipe for your model, and varies by the features supported and the time the model was trained. Azure AI services text to speech engines are updated from time to time to capture the latest language model that defines the pronunciation of the language. After you've trained your voice, you can apply it to the new language model by updating to the latest engine version. When a new engine is available, you're prompted to update your neural voice model. See Update engine version for your voice model.
Can we limit the number of trainings using Azure Policy or other features? Or is there any way to avoid false training?
If you want to limit training permissions, you can restrict user roles and access. Refer to Role-based access control for Speech resources.
Can Microsoft add a mechanism to prevent unauthorized use or misuse of our voice when it's created?
Your voice model can only be used by you, with your own token. Microsoft also doesn't use your data. See Data, privacy, and security. You can also request to add watermarks to your voice to protect your model. See Microsoft Azure Neural TTS introduces the watermark algorithm for synthetic voice identification.
Do you have any tips about contracts or negotiation with voice actors?
We have no recommendations on contracts; the terms are up to you and the voice talent to negotiate. However, you should make sure the voice talent understands the capabilities of text to speech, including its potential risks, and provides explicit consent to the creation of a synthetic version of their voice in both the contract and a verbal statement. See Disclosure for voice talent.
Do we need to return the written permission from the voice talent back to Microsoft?
Microsoft doesn't need the written permission, but you must obtain consent from your voice talent. The voice talent is also required to record a consent statement, which must be uploaded to Speech Studio before training can begin. See Set up voice talent for custom neural voice.