
Azure Speech DragonHDLatestNeural produces electrical/static noise during speech in both raw MP3 and RIFF WAV outputs

August 0 Reputation points
2026-04-16T23:20:12.64+00:00

Hi,

I’m seeing a consistent audio quality issue with Azure Speech HD voices, and I want to confirm whether this is a known DragonHDLatestNeural problem or something else in my setup.

The issue is not just “bad playback” or a mislabeled WAV file. The sound is an electrical/static noise that appears whenever the voice is speaking. Silence is mostly fine, but the static rides on top of the spoken audio.

What I tested:

  • Voice: en-US-Steffan:DragonHDLatestNeural
  • Service: Azure Speech TTS
  • Input: SSML/plain narration text
  • Output formats tested directly from Azure:
    • audio-48khz-192kbitrate-mono-mp3
    • riff-48khz-16bit-mono-pcm

Important detail:

  • I tested both a direct raw MP3 output and a direct RIFF WAV output from Azure.
  • The problem is still present in the raw Azure output itself.
  • Because of that, this does not look like the usual “RAW PCM saved as .wav” header/container issue.
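
One way to back up the container claim is to inspect the first bytes of the file Azure returned. The following is a minimal sketch using only the Python standard library; the function name `describe_wav_header` is hypothetical, and it assumes the `fmt ` chunk is the plain 16-byte PCM layout.

```python
import struct

def describe_wav_header(data: bytes) -> dict:
    """Report whether `data` starts with a RIFF/WAVE header and, if so, its format fields."""
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        # Headerless (raw) PCM or another container ends up here.
        return {"is_riff_wave": False}
    offset = 12  # first chunk starts right after the 12-byte RIFF header
    while offset + 8 <= len(data):
        chunk_id = data[offset:offset + 4]
        (chunk_size,) = struct.unpack_from("<I", data, offset + 4)
        if chunk_id == b"fmt " and offset + 8 + 16 <= len(data):
            fmt, channels, rate, _byte_rate, _align, bits = struct.unpack_from(
                "<HHIIHH", data, offset + 8
            )
            return {
                "is_riff_wave": True,
                "audio_format": fmt,  # 1 means uncompressed PCM
                "channels": channels,
                "sample_rate": rate,
                "bits_per_sample": bits,
            }
        # Chunks are padded to an even number of bytes.
        offset += 8 + chunk_size + (chunk_size & 1)
    return {"is_riff_wave": True, "fmt_found": False}
```

For a `riff-48khz-16bit-mono-pcm` response, the expectation would be PCM format 1, one channel, 48000 Hz, and 16 bits per sample; anything else would point back toward a container or labeling problem.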

What I want to understand:

  1. Is this a known issue with DragonHDLatestNeural voices?
  2. Can HD voices sometimes generate static/noise artifacts on certain text or SSML patterns?
  3. Are there recommended output formats, regions, or voice settings for avoiding this?
  4. Is there any guidance for diagnosing whether this is voice-model output vs. service-side encoding?
Azure Language in Foundry Tools

An Azure service that provides natural language capabilities including sentiment analysis, entity extraction, and automated question answering.


2 answers

  1. Thanmayi Godithi 9,445 Reputation points Microsoft External Staff Moderator
    2026-04-25T07:37:55.9666667+00:00

    Hi @August,

    Based on current documentation, there is no officially documented known issue where DragonHDLatestNeural voices (including en-US-Steffan:DragonHDLatestNeural) consistently generate electrical or static noise as an expected behavior. Azure Speech HD voices are designed to produce high‑fidelity audio, and static artifacts are not described as a normal characteristic of the model or supported output formats.

    That said, since you’re able to reproduce the noise directly in the raw Azure‑generated MP3 and RIFF PCM outputs, this does help rule out common client‑side causes such as WAV header/container mismatches or local playback issues. The behavior appearing only during voiced segments (with silence remaining clean) does suggest a synthesis‑stage artifact rather than a decoding problem.
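
The "noise only during voiced segments" observation can be put into rough numbers before filing a report. The sketch below is only an illustration, not an Azure-provided diagnostic: it assumes a well-formed 16-bit mono RIFF WAV that Python's `wave` module can open, and the helper names and the 960-sample frame length (20 ms at 48 kHz) are arbitrary choices. The `hf_ratio` value is a crude proxy for high-frequency content, not a calibrated noise metric.

```python
import array
import math
import sys
import wave

def load_mono_pcm16(path):
    """Read a 16-bit mono WAV file into an array of signed samples."""
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1 or w.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        samples = array.array("h")
        samples.frombytes(w.readframes(w.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples

def frame_stats(samples, frame_len=960):
    """Return a list of (rms, hf_ratio) tuples, one per full frame.

    hf_ratio is the energy of the sample-to-sample difference divided by
    the frame energy; broadband static tends to push it up.
    """
    stats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame)
        diff_energy = sum((b - a) ** 2 for a, b in zip(frame, frame[1:]))
        rms = math.sqrt(energy / frame_len)
        hf_ratio = diff_energy / energy if energy else 0.0
        stats.append((rms, hf_ratio))
    return stats
```

Comparing `hf_ratio` on loud (high `rms`) frames between the affected voice and a control voice reading the same text would give a number to attach to a support ticket, alongside the audio itself.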

    To further isolate the scope, here are the recommended diagnostics:

    1. Swap voices and regions
    • Test another HD voice (for example, en-US-Ava:DragonHDLatestNeural) in the same region.
    • As a control, test a non‑HD neural voice (such as en-US-AriaNeural).
    • If the noise occurs only on Steffan or only in a specific region, that points toward a voice‑ or region‑specific issue rather than your client setup.
    2. Test alternate output formats
    • In addition to 48 kHz MP3 (192 kbps) and RIFF PCM, try 24 kHz formats (MP3 or RIFF PCM) and compare.
    • Continue inspecting the raw binary output without any post‑processing to confirm the behavior remains consistent.
    3. Minimize input complexity
    • Test with a very short, plain‑text phrase (for example, “Hello world”).
    • If the artifact disappears, it may be correlated with specific SSML constructs or text patterns, which has been seen in isolated HD‑voice quality reports.
    4. Cross‑validate call paths
    • If you’re using the Speech SDK, ensure it’s on the latest version.
    • Validating via the REST API as well is a good way to rule out SDK or streaming effects.
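
The voice, format, and call-path checks above can be combined into one small test matrix. The sketch below builds (but does not send) text to speech REST requests with the standard library. Several things in it are assumptions to adjust: the region and key are placeholders you supply, the `User-Agent` value and file naming are arbitrary, the 24 kHz format names should be checked against the current output-format list, and HD voice availability in your region is not guaranteed.

```python
import itertools
import urllib.request
from xml.sax.saxutils import escape, quoteattr

VOICES = [
    "en-US-Steffan:DragonHDLatestNeural",  # voice from the question
    "en-US-Ava:DragonHDLatestNeural",      # second HD voice
    "en-US-AriaNeural",                    # non-HD control
]
FORMATS = [
    "audio-48khz-192kbitrate-mono-mp3",
    "riff-48khz-16bit-mono-pcm",
    "audio-24khz-96kbitrate-mono-mp3",
    "riff-24khz-16bit-mono-pcm",
]

def build_request(region, key, voice, output_format, text="Hello world"):
    """Build one synthesis request; nothing is sent until urlopen is called."""
    ssml = (
        "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
        "xml:lang='en-US'>"
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice></speak>"
    )
    return urllib.request.Request(
        f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1",
        data=ssml.encode("utf-8"),
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": output_format,
            "User-Agent": "tts-noise-repro",  # arbitrary client name
        },
    )

def test_matrix(region, key):
    """Yield (filename, request) for every voice/format combination."""
    for voice, fmt in itertools.product(VOICES, FORMATS):
        ext = "mp3" if fmt.endswith("mp3") else "wav"
        filename = f"{voice.replace(':', '_')}__{fmt}.{ext}"
        yield filename, build_request(region, key, voice, fmt)
```

To run it, iterate over `test_matrix(region, key)`, call `urllib.request.urlopen(request).read()` for each entry, and write the returned bytes to `filename` unmodified so the files can be compared side by side.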

    Kindly review the above steps and let me know if you need further assistance.


  2. Q&A Assist
    2026-04-16T23:20:28.8366667+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.
    1. The provided information does not document any known issue where DragonHDLatestNeural voices (such as en-US-Steffan:DragonHDLatestNeural) consistently produce electrical/static noise in the synthesized audio.
    2. The documentation describes DragonHDLatestNeural voices as high‑definition neural voices based on large language models, with enhanced contextual understanding and expressive capabilities, but it does not mention static/noise artifacts as an expected or known behavior for specific text or SSML patterns.
    3. Supported usage and configuration:
      • DragonHDLatestNeural voices are used through the standard Azure Speech SDK and REST APIs, with the same output formats as non‑HD voices.
      • HD voices can be referenced in SSML using the voicename:basemodel:version format, for example:
             <speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'>
               <voice name='en-US-Ava:DragonHDLatestNeural' parameters='temperature=0.8'>Here is a test</voice>
             </speak>
        
      • The documentation lists supported SSML elements for DragonHD voices (for example, <phoneme>, <say-as>, <sub>, <p>, <s>), but does not associate any of them with noise/static issues.
      • There is no guidance in the provided material that recommends specific output formats, regions, or voice parameters (such as temperature, top_p, top_k, or cfg_scale) as a way to avoid static or encoding artifacts. Parameter tuning is documented only for controlling expressiveness, stability, and style for DragonHDOmni voices, not for mitigating noise.
    4. Diagnosing model vs. encoding issues:
      • The context does not provide a diagnostic procedure specific to distinguishing between voice‑model artifacts and service‑side encoding problems for Azure Speech HD voices.
      • It does, however, emphasize that HD voices are standard Azure Speech voices that share the same SDK/REST surface and that they support multiple languages and advanced prosody, without calling out any special handling or known defects in the audio encoding pipeline.
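
Since the documentation does not tie any SSML element to noise, one practical approach is to inventory which elements a failing input actually uses and then remove them one at a time. The following is a minimal sketch under that assumption; `ssml_element_counts` is a hypothetical helper, and it only checks XML well-formedness, not whether the service accepts the markup.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def ssml_element_counts(ssml: str) -> Counter:
    """Count SSML elements by local tag name, ignoring XML namespaces.

    Raises xml.etree.ElementTree.ParseError if the input is not well-formed.
    """
    root = ET.fromstring(ssml)
    counts = Counter()
    for element in root.iter():
        # Namespaced tags look like "{namespace-uri}voice"; keep the local part.
        counts[element.tag.rsplit("}", 1)[-1]] += 1
    return counts
```

Running this over an input that produces static and one that does not gives a quick diff of the constructs involved (for example `phoneme`, `say-as`, or `sub`), which narrows down what to bisect next.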

    Within the supplied documentation, there is no statement that:

    • DragonHDLatestNeural voices are known to generate static/electrical noise; or
    • Certain SSML patterns or text inputs are known to cause such artifacts; or
    • Particular formats/regions/settings are recommended specifically to avoid static.
