ConversationTranscriber stops recognizing audio after several Start/Stop cycles with PushAudioInputStream.

Julien S 66 Reputation points
2025-11-24T15:33:03.0733333+00:00

I am developing a client/server application that uses Microsoft Speech SDK’s ConversationTranscriber for speaker diarization. The production application runs on .NET Framework 4.6.2, written in C# 7.3, and uses Microsoft.CognitiveServices.Speech 1.47.0. To reproduce the issue, I created a simplified WPF test application using NAudio to simulate audio streaming from a client to the server via PushAudioInputStream.

Repro Steps

  1. Start the app and begin transcription (calls StartTranscribingAsync()).
  2. Start recording (starts pushing audio).
  3. Stop recording (only stops pushing audio).
  4. Repeat steps 2 and 3 several times.

After a few cycles, the transcriber begins emitting empty results every ~100–200 ms, even though audio is still being pushed. No errors are thrown, and the session appears to remain open. The PushAudioInputStream and ConversationTranscriber are created once at startup and reused for all cycles.
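For reference, the recording start/stop in the WPF test app is wired roughly like this (a simplified sketch; the NAudio details are abbreviated, and PushAudio is the helper shown in the snippet further down):

    private WaveInEvent _waveIn;

    public void InitializeCapture()
    {
        _waveIn = new WaveInEvent { WaveFormat = new WaveFormat(16000, 16, 1) };
        _waveIn.DataAvailable += (s, e) =>
        {
            // Copy only the bytes actually recorded and push them to the SDK stream.
            var chunk = new byte[e.BytesRecorded];
            Buffer.BlockCopy(e.Buffer, 0, chunk, 0, e.BytesRecorded);
            PushAudio(chunk);
        };
    }

    // Step 2: start pushing audio.
    public void StartRecording() => _waveIn.StartRecording();

    // Step 3: only stop pushing audio; the transcriber keeps running.
    public void StopRecording() => _waveIn.StopRecording();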

What Actually Happens

After several start/stop cycles, the transcriber continues to emit recognition events, but each result is empty (blank text) and occurs at a fixed interval of about 100 ms. Audio is still being pushed into the PushAudioInputStream, no errors are thrown, and the session remains active.

Expected Behavior

After multiple start/stop cycles, the transcriber should continue to process audio normally and produce accurate transcription results for the speech being pushed into the PushAudioInputStream. Recognition events should only occur when actual speech is detected, and should not emit empty results at fixed intervals.

Questions

  1. Is this a known limitation or bug with the Azure Speech SDK?
  2. Is there a recommended way to pause/resume audio input while preserving the conversation context and speaker mapping?
  3. Any advice or workarounds would be appreciated!

Code Snippet

I can provide a full Visual Studio solution to reproduce the issue if required.

    private PushAudioInputStream _pushStream;
    private ConversationTranscriber _transcriber;

    public async Task InitializeAsync()
    {
        var speechConfig = SpeechConfig.FromSubscription("<YOUR_KEY>", "<YOUR_REGION>");
        speechConfig.SpeechRecognitionLanguage = "en-US";

        var format = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1);
        _pushStream = AudioInputStream.CreatePushStream(format);
        var audioConfig = AudioConfig.FromStreamInput(_pushStream);

        _transcriber = new ConversationTranscriber(speechConfig, audioConfig);

        _transcriber.Transcribed += (s, e) =>
        {
            Console.WriteLine($"Result: '{e.Result.Text}'");
        };
    }

    public async Task StartAsync() => await _transcriber.StartTranscribingAsync();
    public async Task StopAsync() => await _transcriber.StopTranscribingAsync();

    public void PushAudio(byte[] buffer) => _pushStream.Write(buffer);
    public void CloseStream() => _pushStream.Close();

Answer accepted by question author
  1. Manas Mohanty 13,340 Reputation points Moderator
    2025-11-27T11:50:47.7066667+00:00

    Hi Julien S!

    Good day. Thank you for sharing your observations.

    After several cycles, the SDK can fall into a state where it keeps emitting “final” results with empty text (often at ~100–200 ms cadence) even though audio is still being pushed. This typically presents as Transcribed events whose Result.Reason is NoMatch rather than RecognizedSpeech. [github.com]

    Below is a concise way to reproduce and then stabilize the scenario, plus code you can drop into your WPF test app.


    Why this happens (and what to avoid)

    • Reusing the same PushAudioInputStream and ConversationTranscriber across many start/stop cycles can leave the internal session/segmentation state in a weird loop, producing empty final results. Similar reuse issues have been reported in SDK samples/issues when a push stream is reused without a full close/reset. The safest pattern is to create a fresh PushAudioInputStream and AudioConfig (and often a new transcriber) for each session, while reusing the SpeechConfig. [github.com], [github.com]
    • The SDK doesn’t provide a true pause/resume on the input stream; “pause” is effectively StopTranscribingAsync(), and “resume” is StartTranscribingAsync() (often with a new stream). The official method description even calls Stop “used to pause the conversation,” but in practice you’ll get the most predictable results by closing the stream and starting with a new stream/transcriber. [Conversati...soft Learn | Learn.Microsoft.com], [PushAudioI...soft Learn | Learn.Microsoft.com]
    • If you only stop pushing audio (without stopping transcribing) you’ll often see periodic empty results because the recognizer’s VAD/segmentation still advances while receiving near‑silence or malformed buffers. Always stop transcribing before you stop/tear down capture. [learn.microsoft.com]

    Minimal fix: reset the stream/transcriber each cycle

    Key ideas:

    1. Keep a single SpeechConfig for the lifetime of the app.
    2. For each “recording session,” construct a new PushAudioInputStream, AudioConfig, and ConversationTranscriber.
    3. On stop, call StopTranscribingAsync(), then Close() the push stream, dispose the transcriber, and await SessionStopped before starting a new session.
    4. In your event handlers, ignore NoMatch results (they look like “empty text”). [github.com]

    Attached sample code for reference.

    using Microsoft.CognitiveServices.Speech;
    using Microsoft.CognitiveServices.Speech.Audio;
    using Microsoft.CognitiveServices.Speech.Transcription;
    using System;
    using System.Threading.Tasks;

    public sealed class TranscribeHarness : IDisposable
    {
        private readonly SpeechConfig _speechConfig;
        private ConversationTranscriber _transcriber;
        private PushAudioInputStream _pushStream;
        private AudioConfig _audioConfig;

        public TranscribeHarness(string key, string region)
        {
            _speechConfig = SpeechConfig.FromSubscription(key, region);
            _speechConfig.SpeechRecognitionLanguage = "en-US";

            // Optional: enable SDK logging during troubleshooting
            _speechConfig.SetProperty(PropertyId.Speech_LogFilename, "speechsdk.log");
            _speechConfig.EnableAudioLogging(); // available in several SDKs; safe to try
        }

        public void InitializeNewSession()
        {
            // 16 kHz, 16-bit, mono PCM
            var format = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1);
            _pushStream = AudioInputStream.CreatePushStream(format);
            _audioConfig = AudioConfig.FromStreamInput(_pushStream);
            _transcriber = new ConversationTranscriber(_speechConfig, _audioConfig);

            // Wire events once per transcriber instance
            _transcriber.Transcribing += (s, e) =>
            {
                // Interim text for UI (may be empty for silence)
                Console.WriteLine($"[Transcribing] {e.Result.Text}");
            };

            _transcriber.Transcribed += (s, e) =>
            {
                if (e.Result.Reason == ResultReason.RecognizedSpeech)
                {
                    Console.WriteLine($"[Final] {e.Result.Text} (SpeakerId={e.Result.SpeakerId})");
                }
                else if (e.Result.Reason == ResultReason.NoMatch)
                {
                    // Ignore blank finals to avoid flooding the UI/logs
                    Console.WriteLine("[Final: NoMatch]");
                }
            };

            _transcriber.Canceled += (s, e) =>
            {
                Console.WriteLine($"[Canceled] Reason={e.Reason} ErrorCode={e.ErrorCode} Details={e.ErrorDetails}");
            };

            _transcriber.SessionStarted += (s, e) => Console.WriteLine($"[SessionStarted] {e.SessionId}");
            _transcriber.SessionStopped += (s, e) => Console.WriteLine($"[SessionStopped] {e.SessionId}");
        }

        public Task StartAsync() => _transcriber.StartTranscribingAsync(); // "pause/resume" pattern is Start/Stop

        public void PushAudio(byte[] buffer) => _pushStream?.Write(buffer);

        public async Task StopAndResetAsync()
        {
            // Stop transcribing first to avoid periodic empties during silence
            if (_transcriber != null) await _transcriber.StopTranscribingAsync();

            // Close stream so the service sees an actual EOS
            _pushStream?.Close();

            // Dispose objects before next session
            _transcriber?.Dispose();
            _audioConfig?.Dispose();
            _transcriber = null;
            _audioConfig = null;
            _pushStream = null;
        }

        public void Dispose() => StopAndResetAsync().GetAwaiter().GetResult();
    }

    Usage in your cycle:

    C#

    await harness.StopAndResetAsync(); // cleanly end previous session
    harness.InitializeNewSession(); // new stream/transcriber
    await harness.StartAsync(); // start transcribing
    // ... call harness.PushAudio(buffer) while recording ...
    // when user stops recording:
    await harness.StopAndResetAsync(); // stop & teardown; then loop back to InitializeNewSession()
    
    

    This pattern aligns with the SDK’s guidance on Stop → Start for conversation sessions and avoids reusing the same stream/transcriber object across cycles. [Conversati...soft Learn | Learn.Microsoft.com], [PushAudioI...soft Learn | Learn.Microsoft.com]

    Preserving conversation context & speaker mapping

    If you need continuity of speaker labels (e.g., “Speaker1/Speaker2” or enrolled participants), use a Conversation object and join a new transcriber to the existing conversation when resuming. That lets you reset the audio pipeline while keeping the same conversation context. See the C# quickstart sample (shows ConversationTranscriber with conversation and how to handle RecognizedSpeech vs NoMatch). [github.com]

    Note: Without enrolled voices, real‑time diarization can assign generic speakers (Speaker1/2) but isn’t guaranteed to keep identities stable across separate sessions; anchoring to a Conversation helps, but fully stable mapping typically requires voice signatures for participants. [github.com]
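
    As a rough sketch of that pattern (the names below follow the older conversation-transcription quickstart the sample links; recent SDK releases, including 1.47, have reworked this surface around Meeting/MeetingTranscriber, so verify which API your package version actually exposes):

        // Sketch only: resume by joining a fresh transcriber/stream to the same conversation.
        // Conversation/JoinConversationAsync come from the older quickstart; newer SDK
        // versions may expose Meeting/MeetingTranscriber instead.
        var conversation = await Conversation.CreateConversationAsync(speechConfig, "my-conversation-id");

        var pushStream = AudioInputStream.CreatePushStream(
            AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1));
        var audioConfig = AudioConfig.FromStreamInput(pushStream);

        // New transcriber per resume, anchored to the existing conversation/speaker context.
        var transcriber = new ConversationTranscriber(audioConfig);
        await transcriber.JoinConversationAsync(conversation);
        await transcriber.StartTranscribingAsync();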

    Additional checks to stabilize your WPF repro

    1. Event filtering: Only treat ResultReason.RecognizedSpeech as real output; log/ignore NoMatch. (Sample shows this exact check.) [github.com]
    2. Stop before stopping capture: Call StopTranscribingAsync() before stopping NAudio or muting the client feed; otherwise the recognizer continues segmenting silence and may emit empty finals (see the ordering sketch after this list). [learn.microsoft.com]
    3. Close the stream on stop; do not keep the same PushAudioInputStream for the next session. (Closing signals EOS and resets the pipeline cleanly.) [PushAudioI...soft Learn | Learn.Microsoft.com]
    4. Create a fresh transcriber each cycle. Reuse of push stream/transcriber across sessions is a known source of odd behavior in push‑stream scenarios. [github.com], [github.com]
    5. Audio format: Your 16 kHz/16‑bit/mono PCM format is correct. If you later move to compressed input (MP3/OGG‑Opus/etc.), ensure GStreamer is installed and on PATH or you’ll see unrecognized/empty results. [docs.azure.cn]
    6. Enable SDK logs while diagnosing (Speech_LogFilename, EnableAudioLogging) to capture segmentation and EOS events for a session. (Multiple examples use this property while troubleshooting.) [stackoverflow.com]
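
    To make check 2 concrete, the start/stop handlers on the WPF/NAudio side would look roughly like this (a sketch assuming the TranscribeHarness above and a hypothetical _waveIn NAudio field):

        private readonly TranscribeHarness _harness; // created once with key/region
        private WaveInEvent _waveIn;                 // NAudio capture device (hypothetical field)

        private async Task OnStartRecordingClickedAsync()
        {
            // Fresh stream/transcriber per cycle, then start capturing/pushing audio.
            _harness.InitializeNewSession();
            await _harness.StartAsync();
            _waveIn?.StartRecording();
        }

        private async Task OnStopRecordingClickedAsync()
        {
            // Stop transcribing (and close/dispose the stream) before touching capture,
            // so the recognizer never keeps segmenting silence on a half-open session.
            await _harness.StopAndResetAsync();
            _waveIn?.StopRecording();
        }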

    Addressing your specific questions

    Is this a known limitation/bug? There are public reports of empty results / end‑of‑stream quirks and reuse issues around push streams and conversation sessions; the common workaround is not reusing the same PushAudioInputStream/recognizer across cycles and instead closing & recreating per session. [learn.microsoft.com], [stackoverflow.com], [github.com]

    • Recommended way to pause/resume while preserving context? The supported pattern is StopTranscribingAsync() (pause) and StartTranscribingAsync() (resume). To preserve context/speaker mapping across resumes, anchor your sessions to a Conversation and re‑join with a new transcriber when you resume; avoid reusing the old stream/transcriber. [Conversati...soft Learn | Learn.Microsoft.com], [github.com]
    • Workarounds/advice: Use the reset‑per‑cycle approach above; filter out NoMatch finals; verify that your NAudio buffers aren’t all zeros/silence during “recording stopped” phases (a quick guard is sketched below); if you later use compressed input, install GStreamer and prefer WAV/PCM 16 kHz mono when possible for predictability.
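
    A minimal guard for the “buffers aren’t zeros” check might look like this (purely illustrative, extending the PushAudio helper from your snippet):

        public void PushAudio(byte[] buffer, int bytesRecorded)
        {
            // Skip buffers that are entirely zero; they usually mean the capture side is
            // muted/stopped and would only feed silence into the recognizer.
            bool allZero = true;
            for (int i = 0; i < bytesRecorded; i++)
            {
                if (buffer[i] != 0) { allZero = false; break; }
            }
            if (allZero) return;

            _pushStream?.Write(buffer, bytesRecorded);
        }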

    References

    • Microsoft Q&A: ConversationTranscriber stops recognizing audio after several Start/Stop cycles with PushAudioInputStream (matches your scenario) [learn.microsoft.com]
    • Stack Overflow / Q&A threads on EOS & reuse behaviors with PushAudioInputStream and ConversationTranscriber (empty outputs / EOS, and reuse pitfalls) [stackoverflow.com], [stackoverflow.com], [github.com]
    • .NET API docs: ConversationTranscriber.StopTranscribingAsync() and PushAudioInputStream.Close() (pause/stop vs reset) [Conversati...soft Learn | Learn.Microsoft.com], [PushAudioI...soft Learn | Learn.Microsoft.com]
    • .NET API docs: PushAudioInputStream class (supported write/close semantics) [PushAudioI...soft Learn | Learn.Microsoft.com]
    • C# sample: handling RecognizedSpeech vs NoMatch in Transcribed event (ignore empty results) and conversation usage pattern [github.com]
    • Compressed input guidance (GStreamer requirements) for MP3/OGG, etc. [docs.azure.cn]

    Please let us know if it helps address the issue.

    Thank you

    1 person found this answer helpful.