
Can I use the Azure Speech-to-Text fast transcription REST API for short audio to perform pronunciation assessment? How do I use it?

稀渺 陈 20 Reputation points
2025-10-04T05:16:15.5566667+00:00

My problem is that the mode I am using right now is too expensive for my work, at $1.30 per hour. I want to try using fast transcription mode to perform pronunciation assessment, which would cost about $0.66 per hour. Can I?

Here is the example code from the VS Code extension Azure AI Speech Toolkit; please tell me how to switch it over.

import base64
import time
import uuid

import requests

# NOTE: speech_key, service_region, AUDIO_PCM_FILE and WaveHeader16K16BitMono
# are module-level values defined elsewhere in the toolkit sample.

def pronunciation_assessment_with_rest_api():
    """Performs pronunciation assessment asynchronously with REST API for a short audio file.
    See more information at https://learn.microsoft.com/azure/ai-services/speech-service/rest-speech-to-text-short
    """

    # A generator which reads audio data chunk by chunk.
    # The audio_source can be any audio input stream which provides read() method,
    # e.g. audio file, microphone, memory stream, etc.
    def get_chunk(audio_source, chunk_size=1024):
        yield WaveHeader16K16BitMono
        while True:
            time.sleep(chunk_size / 32000)  # to simulate human speaking rate
            chunk = audio_source.read(chunk_size)
            if not chunk:
                global upload_finish_time
                upload_finish_time = time.time()
                break
            yield chunk

    # Build pronunciation assessment parameters
    locale = "en-US"
    audio_file = open(AUDIO_PCM_FILE, "rb")
    reference_text = "Good morning."
    enable_prosody_assessment = True
    phoneme_alphabet = "SAPI"  # IPA or SAPI
    enable_miscue = True
    nbest_phoneme_count = 5
    pron_assessment_params_json = (
        '{"GradingSystem":"HundredMark","Dimension":"Comprehensive","ReferenceText":"%s",'
        '"EnableProsodyAssessment":"%s","PhonemeAlphabet":"%s","EnableMiscue":"%s",'
        '"NBestPhonemeCount":"%s"}'
        % (reference_text, enable_prosody_assessment, phoneme_alphabet, enable_miscue, nbest_phoneme_count)
    )
    pron_assessment_params_base64 = base64.b64encode(bytes(pron_assessment_params_json, "utf-8"))
    pron_assessment_params = str(pron_assessment_params_base64, "utf-8")

    # https://learn.microsoft.com/azure/ai-services/speech-service/how-to-get-speech-session-id#provide-session-id-using-rest-api-for-short-audio
    session_id = uuid.uuid4().hex

    # Build request
    url = f"https://{service_region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
    url = f"{url}?format=detailed&language={locale}&X-ConnectionId={session_id}"
    headers = {
        "Accept": "application/json;text/xml",
        "Connection": "Keep-Alive",
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Ocp-Apim-Subscription-Key": speech_key,
        "Pronunciation-Assessment": pron_assessment_params,
        "Transfer-Encoding": "chunked",
        "Expect": "100-continue",
    }

    print(f"II URL: {url}")
    print(f"II Config: {pron_assessment_params_json}")

    # Send request with chunked data
    response = requests.post(url=url, data=get_chunk(audio_file), headers=headers)
    get_response_time = time.time()
    audio_file.close()

    # Show Session ID
    print(f"II Session ID: {session_id}")

    if response.status_code != 200:
        print(f"EE Error code: {response.status_code}")
        print(f"EE Error message: {response.text}")
        exit()
    else:
        print(f"II Response: {response.json()}")

    latency = get_response_time - upload_finish_time
    print(f"II Latency: {int(latency * 1000)}ms")
Azure AI Speech

An Azure service that integrates speech processing into apps and services.

3 answers

  1. Mark Thomas 0 Reputation points
    2026-03-18T00:49:18.18+00:00

    As others have said, you can’t get pronunciation assessment on the fast transcription endpoint. It sounds like you’re trying to build something like SpeechSuper API. If you don’t care about building your own tool, you could always just use their API (but the pricing of that can also add up quickly since they charge based on the number of requests). 

    As a cheaper alternative, you could try an open-source pronunciation assessment tool like https://github.com/Thiagohgl/ai-pronunciation-trainer. The only caveat is this one uses Whisper for ASR and I’ve actually found that Whisper isn’t the best transcription option for all languages, so it depends on what languages you’re working with. 


  2. SRILAKSHMI C 16,305 Reputation points Microsoft External Staff Moderator
    2025-10-06T06:07:08.4166667+00:00

    Hello 稀渺 陈,

    Welcome to Microsoft Q&A, and thank you for reaching out.

    I understand that you want to perform pronunciation assessment on short audio while trying to reduce costs by using Azure’s fast transcription mode. Here’s some guidance:

    Your current REST API example targets the conversation (real-time) endpoint, which is why it is billed at $1.30 per hour. This mode streams audio and performs recognition in real time, supporting full pronunciation assessment features.

    Azure also provides the fast transcription REST API, which is more cost-efficient (around $0.66 per hour) because it processes the uploaded audio as a whole instead of maintaining a real-time session.

    Unfortunately, pronunciation assessment is not supported on the fast transcription endpoint. It requires specific headers and models available only through the real-time/conversation endpoint or the short-audio REST API (for clips up to 30 seconds). Therefore, you cannot switch your REST call to fast transcription while retaining pronunciation scoring.
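    As a side note on those required headers: the `Pronunciation-Assessment` header value is just base64-encoded JSON, so it can be built with `json.dumps` rather than manual string formatting. A minimal sketch (`build_pa_header` is a hypothetical helper name; the field names follow the question's sample):

```python
import base64
import json

def build_pa_header(reference_text: str, grading_system: str = "HundredMark",
                    phoneme_alphabet: str = "IPA", enable_miscue: bool = True) -> str:
    """Serialize pronunciation assessment parameters to the base64-encoded
    JSON string expected in the Pronunciation-Assessment request header."""
    params = {
        "GradingSystem": grading_system,
        "Dimension": "Comprehensive",
        "ReferenceText": reference_text,
        "PhonemeAlphabet": phoneme_alphabet,
        "EnableMiscue": enable_miscue,
    }
    return base64.b64encode(json.dumps(params).encode("utf-8")).decode("ascii")
```

    Using `json.dumps` avoids the quoting pitfalls of building JSON with %-formatting, for example when the reference text itself contains double quotes.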

    To reduce costs while keeping pronunciation assessment:

    - Use the short-audio REST API for clips up to 30 seconds to reduce costs compared to longer real-time sessions.
    - Segment longer audio into smaller batches (less than 1–2 minutes) and send them to the real-time endpoint to minimize hourly billing.
    - Consider purchasing a commitment tier for Azure Speech services to gain additional cost savings.
    - Monitor usage carefully to ensure you stay within your budget while performing pronunciation assessment.

    To enable pronunciation assessment, make sure your code targets the endpoint that supports it and includes the required headers. If you’d like, we can provide a sample modified workflow showing how to optimize cost for pronunciation assessment using shorter audio segments.
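    The segmentation suggestion above can be sketched with the standard-library `wave` module. This assumes 16 kHz 16-bit mono PCM WAV input (the format the question's sample uses); `split_wav` is a hypothetical helper, and each returned chunk is a complete WAV file that can be posted to the short-audio endpoint on its own:

```python
import io
import wave

def split_wav(data: bytes, max_seconds: int = 30) -> list[bytes]:
    """Split a PCM WAV file into chunks of at most max_seconds each,
    re-wrapping every chunk with its own valid WAV header."""
    chunks = []
    with wave.open(io.BytesIO(data), "rb") as src:
        frames_per_chunk = src.getframerate() * max_seconds
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(frames)
            chunks.append(buf.getvalue())
    return chunks
```

    Splitting on fixed 30-second boundaries can cut a word in half, so in practice you would want to split on silence near the boundary; this sketch only shows the re-wrapping mechanics.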

    Please refer to the Speech to text REST API for short audio and the pronunciation assessment documentation.

    I hope this helps. Do let me know if you have any further queries.


    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".

    Thank you!


  3. Divyesh Govaerdhanan 10,850 Reputation points Volunteer Moderator
    2025-10-05T21:07:29.9666667+00:00

    Hello,

    Welcome to Microsoft Q&A,

    Pronunciation Assessment (PA) isn’t supported by the Fast Transcription API. PA runs on a dedicated STT model and is available via the Speech SDK or the REST API for short audio (≤ 30 s clips), not the fast endpoint.

    https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-pronunciation-assessment?pivots=programming-language-python

    To avoid wasted spend, don't call the Fast Transcription endpoint (for example: https://{region}.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?...). It doesn't accept the PA header and won't return PA fields. Use it only for quick plain transcripts, diarization, language ID, etc.
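    For contrast, the endpoint that does accept the PA header is the short-audio one the question's own sample already targets. A minimal sketch of the URL it builds (`short_audio_pa_url` is a hypothetical helper name):

```python
def short_audio_pa_url(region: str, locale: str = "en-US") -> str:
    """URL of the short-audio speech-to-text endpoint, which accepts the
    Pronunciation-Assessment header. The fast transcription endpoint
    ({region}.api.cognitive.microsoft.com/speechtotext/...) does not."""
    return (f"https://{region}.stt.speech.microsoft.com"
            "/speech/recognition/conversation/cognitiveservices/v1"
            f"?format=detailed&language={locale}")
```

    Note the different host: short-audio recognition lives under `stt.speech.microsoft.com`, while fast transcription lives under `api.cognitive.microsoft.com`.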

    Please upvote and accept the answer if it helps!

