Hello Majad,
Greetings! Thanks for raising this question in the Q&A forum.
The intermittent audio input detection failure you're experiencing on SIP calls with the GPT Realtime model is most likely caused by a combination of two things: a subtle race condition in how the Realtime model initializes audio input processing at the start of a session, and differences in codec or audio format negotiation timing between individual SIP calls. Because the issue is inconsistent (some calls work, some don't) and started appearing only recently without any changes on your end, it points to a service-side change or regression introduced after April that affects how the SIP audio stream is picked up at session start.
This type of intermittent audio failure is a pattern that has been seen with the current GPT Realtime models. It appears to be related to how the model handles audio stream initialization internally, and it is not necessarily an error in your integration logic.
Here are the steps I'd recommend to investigate and work around this:
First, check the Azure Service Health dashboard at https://status.azure.com and filter for Azure OpenAI or Azure AI Foundry in your deployed regions (East US 2 or Sweden Central). There may be an ongoing or recently resolved incident that correlates with the late-April timeframe when things started breaking.
Review the Azure OpenAI What's New page at https://learn.microsoft.com/en-us/azure/ai-foundry/openai/whats-new for any updates pushed after April. The Realtime API has recently received SIP support for telephony connections, and new model features have been added — it's possible a backend rollout introduced a regression in audio detection timing for existing SIP integrations.
On your backend, check whether the session.created event is being fully received and acknowledged before your SIP trunk starts sending audio. A common cause of intermittent input failure is that audio from the caller arrives at the model endpoint before the session is fully ready. Add a small buffer or gate the audio stream until you receive session.created confirmation.
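To make the gating idea concrete, here is a minimal sketch in Python. It assumes your backend sees both the caller's audio frames and the server events from the Realtime API; the AudioGate class, its method names, and the buffer limit are hypothetical names chosen for illustration, not part of any SDK.

```python
from collections import deque
from typing import Callable, Deque


class AudioGate:
    """Hypothetical helper: holds audio frames until session.created is seen."""

    def __init__(self, forward: Callable[[bytes], None], max_buffered_frames: int = 500) -> None:
        self._forward = forward
        # Bounded buffer: the oldest frames are dropped if the session never becomes ready.
        self._pending: Deque[bytes] = deque(maxlen=max_buffered_frames)
        self._ready = False

    def on_event(self, event: dict) -> None:
        """Call for every server event received from the Realtime API."""
        if event.get("type") == "session.created" and not self._ready:
            self._ready = True
            # Flush buffered audio in arrival order before live audio resumes.
            while self._pending:
                self._forward(self._pending.popleft())

    def on_audio(self, frame: bytes) -> None:
        """Call for every audio frame arriving from the SIP side."""
        if self._ready:
            self._forward(frame)
        else:
            self._pending.append(frame)
```

You would pass your own send function as forward, for example gate = AudioGate(send_to_realtime), where send_to_realtime is whatever your integration already uses to push audio. The bounded deque is a design choice to avoid unbounded memory growth on calls where the session never comes up.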
Verify your turn_detection configuration in the session.update event. If you're using server_vad (Voice Activity Detection), confirm that the threshold, silence_duration_ms, and prefix_padding_ms values are appropriately set. A threshold that is too high, or a prefix_padding_ms that is too small, can cause the model to miss the beginning of a caller's speech, especially for calls where audio starts immediately.
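As a rough illustration, the snippet below builds a session.update payload with explicit server_vad settings. The numeric values are placeholders for experimentation, not recommendations, and the helper name build_turn_detection_update is hypothetical. Where turn_detection sits inside the session object can differ between API versions, so treat the nesting here as an assumption and check it against the version you deploy.

```python
import json


def build_turn_detection_update(
    threshold: float = 0.5,
    prefix_padding_ms: int = 300,
    silence_duration_ms: int = 500,
) -> str:
    """Return a session.update event (as JSON text) with explicit server_vad settings."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold is expected to be between 0.0 and 1.0")
    if prefix_padding_ms < 0 or silence_duration_ms < 0:
        raise ValueError("millisecond values must be non-negative")
    event = {
        "type": "session.update",
        "session": {
            # Assumed placement; some API versions nest this under an audio input section.
            "turn_detection": {
                "type": "server_vad",
                "threshold": threshold,
                "prefix_padding_ms": prefix_padding_ms,
                "silence_duration_ms": silence_duration_ms,
            }
        },
    }
    return json.dumps(event)
```

Logging the exact payload you send on failing calls makes it easier to compare them against calls that succeed.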
Check the audio codec and format being negotiated on the SIP calls that fail versus those that succeed. The Realtime API accepts PCM16 audio at 24 kHz as well as G.711 (u-law and A-law), and SIP trunks commonly negotiate G.711. If some SIP calls are negotiating a different codec (such as G.729), or a transcoder in the path introduces even a slight delay or format mismatch, the model may fail to pick up the initial audio stream.
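If your SIP trunk or SBC lets you capture the SDP for each call, a small script can help you compare failing and working calls. The sketch below only reads a=rtpmap lines; the function name and the sample SDP are made up for illustration, and static payload types that appear only in the m= line (without an rtpmap entry) would not be listed.

```python
import re
from typing import Dict

# Matches lines such as "a=rtpmap:0 PCMU/8000".
_RTPMAP = re.compile(r"^a=rtpmap:(\d+)\s+([^/\s]+)/(\d+)", re.MULTILINE)


def codecs_from_sdp(sdp: str) -> Dict[int, str]:
    """Map RTP payload type -> 'NAME/clock_rate' for every rtpmap line in the SDP."""
    return {int(pt): f"{name}/{rate}" for pt, name, rate in _RTPMAP.findall(sdp)}


# Made-up SDP fragment offering G.711 u-law, G.711 A-law, and G.729.
SAMPLE_SDP = (
    "v=0\r\n"
    "m=audio 49170 RTP/AVP 0 8 18\r\n"
    "a=rtpmap:0 PCMU/8000\r\n"
    "a=rtpmap:8 PCMA/8000\r\n"
    "a=rtpmap:18 G729/8000\r\n"
)

if __name__ == "__main__":
    print(codecs_from_sdp(SAMPLE_SDP))
```

Running this over the offer and answer of a failing call and a working call should show quickly whether the two ended up on different codecs.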
Also verify your OPENAI_WEBHOOK_SECRET environment variable matches the secret from webhook creation, and ensure you're passing raw request body bytes — not parsed JSON — to the unwrap function, and that no middleware is modifying the request body before verification. These subtle configuration issues can cause intermittent failures that look like audio processing problems.
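To see why the raw bytes matter, here is a generic HMAC illustration. It is not the actual signature scheme used by the webhook library (that detail is handled inside the unwrap function); the secret, the payload, and the sign helper are all made up. The point is only that parsing and re-serializing JSON changes the bytes, so any signature computed over the original body stops matching.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, body: bytes) -> str:
    """Generic HMAC-SHA256 over the body; stands in for a real webhook signature."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


SECRET = b"example-secret"  # made-up value for illustration
RAW_BODY = b'{"id":"evt_example","type":"example.event"}'  # bytes exactly as received

# What the sender would have computed over the bytes it sent.
EXPECTED = sign(SECRET, RAW_BODY)

# Middleware that parses and re-serializes JSON produces different bytes
# (json.dumps inserts spaces after ':' and ','), so the signature changes.
RESERIALIZED = json.dumps(json.loads(RAW_BODY)).encode()

if __name__ == "__main__":
    print("raw body verifies:", hmac.compare_digest(sign(SECRET, RAW_BODY), EXPECTED))
    print("re-serialized verifies:", hmac.compare_digest(sign(SECRET, RESERIALIZED), EXPECTED))
```

In a web framework this usually means reading the request body as bytes before any JSON body parser runs, and passing those bytes to the verification call unchanged.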
Since gpt-realtime-1.5 also showed the same issue, this helps rule out a model-version-specific bug and points more toward the SIP session initialization or backend infrastructure. I would strongly recommend raising an Azure Support ticket with the following details: your deployment region, the model version used, sample call timestamps where the failure occurred, and the specific event logs from your backend showing what events were received (or not received) from the Realtime API during the failed calls. This will allow the Azure OpenAI engineering team to check for any backend changes that may have impacted SIP audio handling after April.
The most actionable next step while you wait for support is to add a guard in your backend that buffers incoming SIP audio and only begins forwarding it to the Realtime API once the session.created event has been confirmed — this alone has resolved similar intermittent input detection issues for other developers.
If this answer helps you, kindly accept it so that it can help others who have similar questions.
Best Regards,
Jerald Felix.