Has Diarization in the Speech SDK been implemented for overlapping audio of multiple speakers speaking simultaneously?

Shyamal Goel 0 Reputation points
2024-10-01T09:16:56.08+00:00

To the Microsoft Support Team,

We have been using the ConversationTranscriber of the Azure Speech SDK to implement Diarization in our project, and have encountered an issue with which we need your assistance.

In our project, the Transcriber works well when two or more speakers speak separately, i.e., their audio does not overlap. In this scenario, each speaker and their spoken audio is recognized correctly.
But when two or more speakers speak simultaneously, i.e., their audio overlaps, the Transcriber does not identify the speakers separately. Instead, it merges their speech and attributes it to a single speaker. Sometimes it only picks up fragments of the different utterances, returning erroneous results.

Our project setup is as follows:

  1. We have a GStreamer C++ project in which we are integrating the Azure Speech SDK.
  2. The project receives an OPUS audio stream containing the speakers' audio in real time.
  3. The OPUS audio stream is decoded into a raw audio stream (format: S16LE, rate: 16000 Hz, channels: mono).
  4. Samples from this raw audio stream are pushed to a push stream (PushAudioInputStream) whenever they become available. The push stream has been configured as the audio input for the Transcriber.
  5. The asynchronous transcription process runs in the background and transcribes audio from the push stream (a condensed sketch of this setup is shown below).
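For reference, below is a condensed sketch of how we wire the push stream to the ConversationTranscriber, based on the ConversationTranscriptionWithPushAudioStream sample linked further down. The subscription key, region, and the buffer variables mentioned in the comments are placeholders; in our actual project the Write() calls happen inside the GStreamer appsink callback.

```cpp
#include <iostream>
#include <speechapi_cxx.h>

using namespace Microsoft::CognitiveServices::Speech;
using namespace Microsoft::CognitiveServices::Speech::Audio;
using namespace Microsoft::CognitiveServices::Speech::Transcription;

int main()
{
    // Subscription key and region are placeholders.
    auto speechConfig = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    speechConfig->SetSpeechRecognitionLanguage("en-US");

    // Push stream matching the raw audio produced by our GStreamer pipeline
    // (roughly: ... ! opusdec ! audioconvert ! audioresample !
    //  audio/x-raw,format=S16LE,rate=16000,channels=1 ! appsink):
    // 16 kHz sample rate, 16-bit signed little-endian PCM, mono.
    auto format = AudioStreamFormat::GetWaveFormatPCM(16000, 16, 1);
    auto pushStream = AudioInputStream::CreatePushStream(format);
    auto audioConfig = AudioConfig::FromStreamInput(pushStream);

    auto transcriber = ConversationTranscriber::FromConfig(speechConfig, audioConfig);

    // Each final result carries the recognized text and a SpeakerId (e.g. "Guest-1").
    transcriber->Transcribed.Connect([](const ConversationTranscriptionEventArgs& e)
    {
        if (e.Result->Reason == ResultReason::RecognizedSpeech)
        {
            std::cout << "Speaker " << e.Result->SpeakerId << ": " << e.Result->Text << std::endl;
        }
    });

    transcriber->StartTranscribingAsync().get();

    // In our project, samples are written from the GStreamer appsink callback
    // whenever a new raw buffer becomes available, roughly like this:
    //   pushStream->Write(reinterpret_cast<uint8_t*>(bufferData), bufferSize);
    // When the stream ends:
    //   pushStream->Close();
    //   transcriber->StopTranscribingAsync().get();

    return 0;
}
```

With this setup, non-overlapping utterances come back with distinct SpeakerId values as expected; it is only the overlapping case that misbehaves.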

We have been using the following documentation as reference:

  1. https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/cpp/windows/console/samples/conversation_transcriber_samples.cpp (ConversationTranscriptionWithPushAudioStream())
  2. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-stt-diarization?tabs=linux&pivots=programming-language-cpp

As mentioned above, we get correct results when speakers speak separately. But when they speak simultaneously, we get erroneous results.

We would like to know whether Diarization using the ConversationTranscriber has been implemented for overlapping speakers. If so, could you kindly assist us in identifying what might be going wrong with our project setup or our approach to implementing the Transcriber? Are we using the correct functions from the Speech SDK to implement Diarization of overlapping audio? Could you also provide us with the relevant documentation or working examples to help us further?

Thanks and regards,
Shyamal Goel
