Hi Julien S!
Good day. Thank you for sharing your observations.
After several cycles, the SDK can fall into a state where it keeps emitting “final” results with empty text (often at ~100–200 ms cadence) even though audio is still being pushed. This typically presents as Transcribed events whose Result.Reason is NoMatch rather than RecognizedSpeech. [github.com]
Below is a concise way to reproduce and then stabilize the scenario, plus code you can drop into your WPF test app.
Why this happens (and what to avoid)
- Reusing the same `PushAudioInputStream` and `ConversationTranscriber` across many start/stop cycles can leave the internal session/segmentation state in a loop that produces empty final results. Similar reuse issues have been reported in SDK samples/issues when a push stream is reused without a full close/reset. The safest pattern is to create a fresh `PushAudioInputStream` and `AudioConfig` (and often a new transcriber) for each session, while reusing the `SpeechConfig`. [github.com], [github.com]
- The SDK doesn’t provide a true pause/resume on the input stream; “pause” is effectively `StopTranscribingAsync()`, and “resume” is `StartTranscribingAsync()` (often with a new stream). The official method description even calls Stop “used to pause the conversation,” but in practice you’ll get the most predictable results by closing the stream and starting with a new stream/transcriber. [learn.microsoft.com], [learn.microsoft.com]
- If you only stop pushing audio (without stopping transcribing), you’ll often see periodic empty results because the recognizer’s VAD/segmentation still advances while receiving near‑silence or malformed buffers. Always stop transcribing before you stop or tear down capture. [learn.microsoft.com]
Minimal fix: reset the stream/transcriber each cycle
Key ideas:
- Keep a single `SpeechConfig` for the lifetime of the app.
- For each “recording session,” construct a new `PushAudioInputStream` → `AudioConfig` → `ConversationTranscriber`.
- On stop, call `StopTranscribingAsync()`, then `Close()` the push stream, dispose the transcriber, and await `SessionStopped` before starting a new session (a small sketch of this wait follows the sample below).
- In your event handlers, ignore `NoMatch` results (they look like “empty text”). [github.com]
Attached sample code for reference.
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Transcription;
using System;
using System.Threading.Tasks;
public sealed class TranscribeHarness : IDisposable
{
    private readonly SpeechConfig _speechConfig;
    private ConversationTranscriber _transcriber;
    private PushAudioInputStream _pushStream;
    private AudioConfig _audioConfig;

    public TranscribeHarness(string key, string region)
    {
        _speechConfig = SpeechConfig.FromSubscription(key, region);
        _speechConfig.SpeechRecognitionLanguage = "en-US";

        // Optional: enable SDK file logging while troubleshooting
        _speechConfig.SetProperty(PropertyId.Speech_LogFilename, "speechsdk.log");
        _speechConfig.EnableAudioLogging(); // service-side audio/content logging; useful while diagnosing
    }

    public void InitializeNewSession()
    {
        // 16 kHz, 16-bit, mono PCM
        var format = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1);
        _pushStream = AudioInputStream.CreatePushStream(format);
        _audioConfig = AudioConfig.FromStreamInput(_pushStream);
        _transcriber = new ConversationTranscriber(_speechConfig, _audioConfig);

        // Wire events once per transcriber instance
        _transcriber.Transcribing += (s, e) =>
        {
            // Interim text for UI (may be empty for silence)
            Console.WriteLine($"[Transcribing] {e.Result.Text}");
        };

        _transcriber.Transcribed += (s, e) =>
        {
            if (e.Result.Reason == ResultReason.RecognizedSpeech)
            {
                Console.WriteLine($"[Final] {e.Result.Text} (SpeakerId={e.Result.SpeakerId})");
            }
            else if (e.Result.Reason == ResultReason.NoMatch)
            {
                // Ignore blank finals to avoid flooding the UI/logs
                Console.WriteLine("[Final: NoMatch]");
            }
        };

        _transcriber.Canceled += (s, e) =>
        {
            Console.WriteLine($"[Canceled] Reason={e.Reason} ErrorCode={e.ErrorCode} Details={e.ErrorDetails}");
        };

        _transcriber.SessionStarted += (s, e) => Console.WriteLine($"[SessionStarted] {e.SessionId}");
        _transcriber.SessionStopped += (s, e) => Console.WriteLine($"[SessionStopped] {e.SessionId}");
    }

    public Task StartAsync() => _transcriber.StartTranscribingAsync(); // "pause/resume" pattern is Start/Stop

    public void PushAudio(byte[] buffer) => _pushStream?.Write(buffer);

    public async Task StopAndResetAsync()
    {
        // Stop transcribing first to avoid periodic empties during silence
        if (_transcriber != null) await _transcriber.StopTranscribingAsync();

        // Close stream so the service sees an actual EOS
        _pushStream?.Close();

        // Dispose objects before next session
        _transcriber?.Dispose();
        _audioConfig?.Dispose();
        _transcriber = null;
        _audioConfig = null;
        _pushStream = null;
    }

    public void Dispose() => StopAndResetAsync().GetAwaiter().GetResult();
}
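The key ideas above mention awaiting SessionStopped before starting the next session; the harness as written stops and disposes without that wait. Below is a minimal sketch of how you might add it, assuming a TaskCompletionSource completed from the SessionStopped handler (the variable names and the 5-second timeout are illustrative, not part of the SDK):
C#
// Sketch: subscribe before stopping, then wait (bounded) for SessionStopped.
var sessionStopped = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
_transcriber.SessionStopped += (s, e) => sessionStopped.TrySetResult(true);

await _transcriber.StopTranscribingAsync();
_pushStream.Close();

// Bounded wait so the UI never hangs if the event does not arrive.
await Task.WhenAny(sessionStopped.Task, Task.Delay(TimeSpan.FromSeconds(5)));
// ...then dispose the transcriber/audio config as in StopAndResetAsync().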
Usage in your cycle:
C#
await harness.StopAndResetAsync(); // cleanly end previous session
harness.InitializeNewSession(); // new stream/transcriber
await harness.StartAsync(); // start transcribing
// ... call harness.PushAudio(buffer) while recording ...
// when user stops recording:
await harness.StopAndResetAsync(); // stop & teardown; then loop back to InitializeNewSession()
This pattern aligns with the SDK’s guidance on Stop → Start for conversation sessions and avoids reusing the same stream/transcriber object across cycles. [learn.microsoft.com], [learn.microsoft.com]
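For completeness, here is one way the WPF capture side could feed the harness with NAudio. This is a sketch under the assumption that you capture with NAudio’s WaveInEvent at 16 kHz/16-bit/mono; the all-zero check is only a cheap diagnostic guard, not an SDK requirement:
C#
using NAudio.Wave; // assumes the NAudio package is referenced
using System;
using System.Linq;

var waveIn = new WaveInEvent { WaveFormat = new WaveFormat(16000, 16, 1) }; // must match the push stream format

waveIn.DataAvailable += (s, e) =>
{
    // Copy only the valid bytes; e.Buffer can be larger than BytesRecorded.
    var chunk = new byte[e.BytesRecorded];
    Buffer.BlockCopy(e.Buffer, 0, chunk, 0, e.BytesRecorded);

    // Diagnostic guard: skip buffers that are pure zeros (helps spot a muted/idle device).
    if (!chunk.All(b => b == 0))
        harness.PushAudio(chunk);
};

harness.InitializeNewSession();
await harness.StartAsync();
waveIn.StartRecording();

// When the user stops recording: stop transcribing first, then stop capture.
await harness.StopAndResetAsync();
waveIn.StopRecording();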
Preserving conversation context & speaker mapping
If you need continuity of speaker labels (e.g., “Speaker1/Speaker2” or enrolled participants), use a Conversation object and join a new transcriber to the existing conversation when resuming. That lets you reset the audio pipeline while keeping the same conversation context. See the C# quickstart sample (shows ConversationTranscriber with a conversation and how to handle RecognizedSpeech vs NoMatch); a short sketch follows the note below. [github.com]
Note: Without enrolled voices, real‑time diarization can assign generic speakers (Speaker1/2) but isn’t guaranteed to keep identities stable across separate sessions; anchoring to a
`Conversation` helps, but fully stable mapping typically requires voice signatures for participants. [github.com]
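A rough sketch of that “anchor to a conversation” approach is below. Note it follows the older conversation-transcription surface, where the transcriber is constructed from an AudioConfig and joined via JoinConversationAsync; newer SDK releases take the SpeechConfig directly (as in the harness above) and may not expose Conversation, so treat this as an assumption to verify against your SDK version (the conversation id and variable names are illustrative):
C#
// Create the conversation once and keep it across pause/resume cycles.
var conversation = await Conversation.CreateConversationAsync(speechConfig, "meeting-001");

// On each resume: fresh stream/config/transcriber, joined to the SAME conversation.
var pushStream = AudioInputStream.CreatePushStream(AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1));
var audioConfig = AudioConfig.FromStreamInput(pushStream);
var transcriber = new ConversationTranscriber(audioConfig);
await transcriber.JoinConversationAsync(conversation);
await transcriber.StartTranscribingAsync();

// ... push audio ...

// On pause: stop, close, dispose; the Conversation object stays alive for the next resume.
await transcriber.StopTranscribingAsync();
pushStream.Close();
transcriber.Dispose();
audioConfig.Dispose();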
Additional checks to stabilize your WPF repro
- Event filtering: Only treat `ResultReason.RecognizedSpeech` as real output; log/ignore `NoMatch`. (The sample shows this exact check.) [github.com]
- Stop before stopping capture: Call `StopTranscribingAsync()` before stopping NAudio or muting the client feed; otherwise the recognizer continues segmenting silence and may emit empty finals. [learn.microsoft.com]
- Close the stream on stop; do not keep the same `PushAudioInputStream` for the next session. (Closing signals EOS and resets the pipeline cleanly.) [learn.microsoft.com]
- Create a fresh transcriber each cycle. Reuse of the push stream/transcriber across sessions is a known source of odd behavior in push‑stream scenarios. [github.com], [github.com]
- Audio format: Your 16 kHz/16‑bit/mono PCM format is correct. If you later move to compressed input (MP3/OGG‑Opus/etc.), ensure GStreamer is installed and on PATH or you’ll see unrecognized/empty results (see the sketch after this list). [docs.azure.cn]
- Enable SDK logs while diagnosing (`Speech_LogFilename`, `EnableAudioLogging`) to capture segmentation and EOS events for a session. (Multiple examples use this property while troubleshooting.) [stackoverflow.com]
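On the compressed-input point above: if you later move away from PCM, the push stream has to be created with a compressed format descriptor and GStreamer must be installed and reachable on PATH. A small sketch (MP3 is just an example container):
C#
// Requires GStreamer for non-PCM containers; otherwise you'll see unrecognized/empty results.
var mp3Format = AudioStreamFormat.GetCompressedFormat(AudioStreamContainerFormat.MP3);
var compressedStream = AudioInputStream.CreatePushStream(mp3Format);
var compressedAudioConfig = AudioConfig.FromStreamInput(compressedStream);
// Use compressedAudioConfig when constructing the transcriber, exactly as with the PCM stream.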
Addressing your specific questions
- Is this a known limitation/bug? There are public reports of empty results / end‑of‑stream quirks and reuse issues around push streams and conversation sessions; the common workaround is not reusing the same `PushAudioInputStream`/recognizer across cycles and instead closing & recreating per session. [learn.microsoft.com], [stackoverflow.com], [github.com]
- Recommended way to pause/resume while preserving context? The supported pattern is `StopTranscribingAsync()` (pause) and `StartTranscribingAsync()` (resume). To preserve context/speaker mapping across resumes, anchor your sessions to a `Conversation` and re‑join with a new transcriber when you resume; avoid reusing the old stream/transcriber. [learn.microsoft.com], [github.com]
- Workarounds/advice? Use the reset‑per‑cycle approach above; filter out `NoMatch` finals; verify that your NAudio buffers aren’t zeros/silence during “recording stopped” phases (the capture sketch earlier includes a simple zero‑buffer check); if you use compressed input later, install GStreamer and prefer WAV/PCM 16 kHz mono when possible for predictability.
References
- Microsoft Q&A: ConversationTranscriber stops recognizing audio after several Start/Stop cycles with PushAudioInputStream (matches your scenario) [learn.microsoft.com]
- Stack Overflow / Q&A threads on EOS & reuse behaviors with `PushAudioInputStream` and `ConversationTranscriber` (empty outputs / EOS, and reuse pitfalls) [stackoverflow.com], [stackoverflow.com], [github.com]
- .NET API docs: `ConversationTranscriber.StopTranscribingAsync()` and `PushAudioInputStream.Close()` (pause/stop vs reset) [learn.microsoft.com], [learn.microsoft.com]
- .NET API docs: `PushAudioInputStream` class (supported write/close semantics) [learn.microsoft.com]
- C# sample: handling `RecognizedSpeech` vs `NoMatch` in the `Transcribed` event (ignore empty results) and conversation usage pattern [github.com]
- Compressed input guidance (GStreamer requirements) for MP3/OGG, etc. [docs.azure.cn]
Please let us know if it helps address the issue.
Thank you