Has Diarization in the Speech SDK been implemented for overlapping audio of multiple speakers speaking simultaneously?

Shyamal Goel 0 Reputation points
2024-10-01T09:16:56.08+00:00

To the Microsoft Support Team,

We have been using the ConversationTranscriber of the Azure Speech SDK to implement Diarization in our project, and have encountered an issue with which we need your assistance.

In our project, the Transcriber works well when two or more speakers speak separately, i.e., their audio does not overlap. In this scenario, each speaker and their spoken audio is recognized correctly.
But when two or more speakers speak simultaneously, i.e., their audio overlaps, the Transcriber does not identify the speakers separately. Instead, it merges their speech and attributes it to a single speaker. Sometimes it only picks up fragments of the different utterances, returning erroneous results.

Our project setup is as follows:

  1. We have a GStreamer C++ project in which we are integrating the Azure Speech SDK.
  2. The project receives an OPUS audio stream containing the speakers' audio in real time.
  3. The OPUS audio stream is decoded into a raw audio stream (format: S16LE, rate: 16000 Hz, channels: mono).
  4. Samples from this raw audio stream are pushed to a push stream (PushAudioInputStream) whenever they become available. The push stream has been configured as the audio input for the Transcriber.
  5. The asynchronous transcription process runs in the background and transcribes audio from the push stream (a condensed sketch of this setup is shown below).
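For reference, below is a condensed sketch of how we wire the push stream to the ConversationTranscriber, based on the ConversationTranscriptionWithPushAudioStream sample linked further down. The subscription key, region, and the buffer variables mentioned in the comments are placeholders; in our actual project the Write() calls happen inside the GStreamer appsink callback.

```cpp
#include <iostream>
#include <speechapi_cxx.h>

using namespace Microsoft::CognitiveServices::Speech;
using namespace Microsoft::CognitiveServices::Speech::Audio;
using namespace Microsoft::CognitiveServices::Speech::Transcription;

int main()
{
    // Subscription key and region are placeholders.
    auto speechConfig = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    speechConfig->SetSpeechRecognitionLanguage("en-US");

    // Push stream matching the raw audio produced by our GStreamer pipeline
    // (roughly: ... ! opusdec ! audioconvert ! audioresample !
    //  audio/x-raw,format=S16LE,rate=16000,channels=1 ! appsink):
    // 16 kHz sample rate, 16-bit signed little-endian PCM, mono.
    auto format = AudioStreamFormat::GetWaveFormatPCM(16000, 16, 1);
    auto pushStream = AudioInputStream::CreatePushStream(format);
    auto audioConfig = AudioConfig::FromStreamInput(pushStream);

    auto transcriber = ConversationTranscriber::FromConfig(speechConfig, audioConfig);

    // Each final result carries the recognized text and a SpeakerId (e.g. "Guest-1").
    transcriber->Transcribed.Connect([](const ConversationTranscriptionEventArgs& e)
    {
        if (e.Result->Reason == ResultReason::RecognizedSpeech)
        {
            std::cout << "Speaker " << e.Result->SpeakerId << ": " << e.Result->Text << std::endl;
        }
    });

    transcriber->StartTranscribingAsync().get();

    // In our project, samples are written from the GStreamer appsink callback
    // whenever a new raw buffer becomes available, roughly like this:
    //   pushStream->Write(reinterpret_cast<uint8_t*>(bufferData), bufferSize);
    // When the stream ends:
    //   pushStream->Close();
    //   transcriber->StopTranscribingAsync().get();

    return 0;
}
```

With this setup, non-overlapping utterances come back with distinct SpeakerId values as expected; it is only the overlapping case that misbehaves.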

We have been using the following documentation as reference:

  1. https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/cpp/windows/console/samples/conversation_transcriber_samples.cpp (ConversationTranscriptionWithPushAudioStream())
  2. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-stt-diarization?tabs=linux&pivots=programming-language-cpp

As mentioned above, we get correct results when speakers speak separately. But when they speak simultaneously, we get erroneous results.

We would like to know whether Diarization using the ConversationTranscriber has been implemented for overlapping speakers. If so, could you kindly assist us in identifying what might be going wrong with our project setup or our approach to implementing the Transcriber? Are we using the correct functions from the Speech SDK to implement Diarization of overlapping audio? Could you also provide us with the relevant documentation or working examples to help us further?

Thanks and regards,
Shyamal Goel
