
Can I use the Azure Speech-to-Text fast transcription REST API for short audio to perform pronunciation assessment? How do I use it?

稀渺 陈 20 Reputation points
2025-10-04T05:16:15.5566667+00:00

My problem is that the mode I am using right now is too expensive for my work, at $1.30 per hour. I want to try using fast transcription mode to perform pronunciation assessment, which would cost about $0.66 per hour. Can I?

Here is the example code from the VS Code extension Azure AI Speech Toolkit; please tell me how to switch it over.

import base64
import time
import uuid

import requests

# NOTE: speech_key, service_region, AUDIO_PCM_FILE and WaveHeader16K16BitMono
# are module-level values defined elsewhere in the toolkit sample.

def pronunciation_assessment_with_rest_api():
    """Performs pronunciation assessment asynchronously with REST API for a short audio file.
    See more information at https://learn.microsoft.com/azure/ai-services/speech-service/rest-speech-to-text-short
    """

    # A generator which reads audio data chunk by chunk.
    # The audio_source can be any audio input stream which provides read() method,
    # e.g. audio file, microphone, memory stream, etc.
    def get_chunk(audio_source, chunk_size=1024):
        yield WaveHeader16K16BitMono
        while True:
            time.sleep(chunk_size / 32000)  # to simulate human speaking rate
            chunk = audio_source.read(chunk_size)
            if not chunk:
                global upload_finish_time
                upload_finish_time = time.time()
                break
            yield chunk

    # Build pronunciation assessment parameters
    locale = "en-US"
    audio_file = open(AUDIO_PCM_FILE, "rb")
    reference_text = "Good morning."
    enable_prosody_assessment = True
    phoneme_alphabet = "SAPI"  # IPA or SAPI
    enable_miscue = True
    nbest_phoneme_count = 5
    pron_assessment_params_json = (
        '{"GradingSystem":"HundredMark","Dimension":"Comprehensive","ReferenceText":"%s",'
        '"EnableProsodyAssessment":"%s","PhonemeAlphabet":"%s","EnableMiscue":"%s",'
        '"NBestPhonemeCount":"%s"}'
        % (reference_text, enable_prosody_assessment, phoneme_alphabet, enable_miscue, nbest_phoneme_count)
    )
    pron_assessment_params_base64 = base64.b64encode(bytes(pron_assessment_params_json, "utf-8"))
    pron_assessment_params = str(pron_assessment_params_base64, "utf-8")

    # https://learn.microsoft.com/azure/ai-services/speech-service/how-to-get-speech-session-id#provide-session-id-using-rest-api-for-short-audio
    session_id = uuid.uuid4().hex

    # Build request
    url = f"https://{service_region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
    url = f"{url}?format=detailed&language={locale}&X-ConnectionId={session_id}"
    headers = {
        "Accept": "application/json;text/xml",
        "Connection": "Keep-Alive",
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Ocp-Apim-Subscription-Key": speech_key,
        "Pronunciation-Assessment": pron_assessment_params,
        "Transfer-Encoding": "chunked",
        "Expect": "100-continue",
    }

    print(f"II URL: {url}")
    print(f"II Config: {pron_assessment_params_json}")

    # Send request with chunked data
    response = requests.post(url=url, data=get_chunk(audio_file), headers=headers)
    get_response_time = time.time()
    audio_file.close()

    # Show Session ID
    print(f"II Session ID: {session_id}")

    if response.status_code != 200:
        print(f"EE Error code: {response.status_code}")
        print(f"EE Error message: {response.text}")
        exit()
    else:
        print(f"II Response: {response.json()}")

    latency = get_response_time - upload_finish_time
    print(f"II Latency: {int(latency * 1000)}ms")
Azure AI Speech

An Azure service that integrates speech processing into apps and services.

3 answers

  1. Mark Thomas 0 Reputation points
    2026-03-18T00:49:18.18+00:00

    As others have said, you can’t get pronunciation assessment on the fast transcription endpoint. It sounds like you’re trying to build something like SpeechSuper API. If you don’t care about building your own tool, you could always just use their API (but the pricing of that can also add up quickly since they charge based on the number of requests). 

    As a cheaper alternative, you could try an open-source pronunciation assessment tool like https://github.com/Thiagohgl/ai-pronunciation-trainer. The only caveat is this one uses Whisper for ASR and I’ve actually found that Whisper isn’t the best transcription option for all languages, so it depends on what languages you’re working with. 


  2. SRILAKSHMI C 16,305 Reputation points Microsoft External Staff Moderator
    2025-10-06T06:07:08.4166667+00:00

    Hello 稀渺 陈,

    Welcome to Microsoft Q&A, and thank you for reaching out.

    I understand that you want to perform pronunciation assessment on short audio while trying to reduce costs by using Azure’s fast transcription mode. Here’s some guidance:

    Your current REST API example targets the conversation (real-time) endpoint, which is why it is billed at $1.30 per hour. This mode streams audio and performs recognition in real time, supporting full pronunciation assessment features.

    Azure also provides the fast transcription REST API, which is more cost-efficient (around $0.66 per hour) because it processes the uploaded audio as a whole instead of maintaining a real-time session.

    Unfortunately, pronunciation assessment is not supported on the fast transcription endpoint. It requires specific headers and models available only through the real-time/conversation endpoint or the short-audio REST API (for clips up to 30 seconds). Therefore, you cannot switch your REST call to fast transcription while retaining pronunciation scoring.
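    As a side note on those required headers: the `Pronunciation-Assessment` header value is just base64-encoded JSON, so it can be built with `json.dumps` rather than manual string formatting. A minimal sketch (`build_pa_header` is a hypothetical helper name; the field names follow the question's sample):

```python
import base64
import json

def build_pa_header(reference_text: str, grading_system: str = "HundredMark",
                    phoneme_alphabet: str = "IPA", enable_miscue: bool = True) -> str:
    """Serialize pronunciation assessment parameters to the base64-encoded
    JSON string expected in the Pronunciation-Assessment request header."""
    params = {
        "GradingSystem": grading_system,
        "Dimension": "Comprehensive",
        "ReferenceText": reference_text,
        "PhonemeAlphabet": phoneme_alphabet,
        "EnableMiscue": enable_miscue,
    }
    return base64.b64encode(json.dumps(params).encode("utf-8")).decode("ascii")
```

    Using `json.dumps` avoids the quoting pitfalls of building JSON with %-formatting, for example when the reference text itself contains double quotes.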

    To reduce costs while keeping pronunciation assessment:

    - Use the short-audio REST API for clips up to 30 seconds to reduce costs compared to longer real-time sessions.
    - Segment longer audio into smaller batches (less than 1–2 minutes) and send them to the real-time endpoint to minimize hourly billing.
    - Consider purchasing a commitment tier for Azure Speech services to gain additional cost savings.
    - Monitor usage carefully to ensure you stay within your budget while performing pronunciation assessment.

    To enable pronunciation assessment, make sure your code targets the endpoint that supports it and includes the required headers. If you’d like, we can provide a sample modified workflow showing how to optimize cost for pronunciation assessment using shorter audio segments.
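    The segmentation suggestion above can be sketched with the standard-library `wave` module. This assumes 16 kHz 16-bit mono PCM WAV input (the format the question's sample uses); `split_wav` is a hypothetical helper, and each returned chunk is a complete WAV file that can be posted to the short-audio endpoint on its own:

```python
import io
import wave

def split_wav(data: bytes, max_seconds: int = 30) -> list[bytes]:
    """Split a PCM WAV file into chunks of at most max_seconds each,
    re-wrapping every chunk with its own valid WAV header."""
    chunks = []
    with wave.open(io.BytesIO(data), "rb") as src:
        frames_per_chunk = src.getframerate() * max_seconds
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(frames)
            chunks.append(buf.getvalue())
    return chunks
```

    Splitting on fixed 30-second boundaries can cut a word in half, so in practice you would want to split on silence near the boundary; this sketch only shows the re-wrapping mechanics.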

    Please refer to the Speech to text REST API for short audio and the pronunciation assessment documentation.

    I hope this helps. Do let me know if you have any further queries.


    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".

    Thank you!


  3. Divyesh Govaerdhanan 10,850 Reputation points Volunteer Moderator
    2025-10-05T21:07:29.9666667+00:00

    Hello,

    Welcome to Microsoft Q&A,

    Pronunciation Assessment (PA) isn’t supported by the Fast Transcription API. PA runs on a dedicated STT model and is available via the Speech SDK or the REST API for short audio (≤ 30 s clips), not the fast endpoint.

    https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-pronunciation-assessment?pivots=programming-language-python

    To avoid wasted spend, don't call the Fast Transcription endpoint (for example: https://{region}.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?...). It doesn't accept the PA header and won't return PA fields. Use it only for quick plain transcripts, diarization, language ID, etc.
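    For contrast, the endpoint that does accept the PA header is the short-audio one the question's own sample already targets. A minimal sketch of the URL it builds (`short_audio_pa_url` is a hypothetical helper name):

```python
def short_audio_pa_url(region: str, locale: str = "en-US") -> str:
    """URL of the short-audio speech-to-text endpoint, which accepts the
    Pronunciation-Assessment header. The fast transcription endpoint
    ({region}.api.cognitive.microsoft.com/speechtotext/...) does not."""
    return (f"https://{region}.stt.speech.microsoft.com"
            "/speech/recognition/conversation/cognitiveservices/v1"
            f"?format=detailed&language={locale}")
```

    Note the different host: short-audio recognition lives under `stt.speech.microsoft.com`, while fast transcription lives under `api.cognitive.microsoft.com`.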

    Please upvote and accept the answer if it helps!

