GPT Realtime models do not properly detect input audio on all SIP calls

Question

Majad 0

I would like to request assistance with using OpenAI Realtime models over SIP connections, with the models provisioned through Microsoft Foundry.

I use the ‘gpt-realtime’ (GA version) model to handle SIP calls through a SIP trunk provider, along with a backend that authorizes calls and monitors events. Over the last week I’ve had intermittent issues where the model is unable to receive or process any audio input when the call starts. The behavior is inconsistent: some calls do not present the issue at all.

Up until the end of April, the service functioned normally, and all calls were able to interact with the model. The issue has appeared only in recent days.

I’ve ruled out issues with the SIP trunk provider. We suspect there might be issues with the gpt-realtime SIP connection or configuration. gpt-realtime-1.5 was also tested but presented similar issues.

As such, I would appreciate your support clarifying the following:

- If any changes have occurred to the core functionality of this model regarding SIP/API usage (I’ve followed the guide “Migration from Preview to GA version of Realtime API - Microsoft Foundry | Microsoft Learn” and applied its changes, but I would like to know if any of these are critical for call flow, or if any other changes have occurred to SIP communications or Realtime models that are affecting my case).

- If any issues or interferences related to gpt-realtime models and/or SIP communication are currently happening that might have an impact on functionality.

I appreciate your support with this matter.

Answer 1

Hello Jerald,

Thank you for your response. Based on my testing, I was able to gather the following findings:

- No incidents were identified in Azure Service Health related to the use of OpenAI Realtime or SIP in any of the supported regions.
- The latest updates to the Realtime models on the Microsoft website do not indicate changes that would impact this implementation, as they mainly apply to the gpt-realtime-2 model.
- Adjustments were made on the backend to ensure that the session.created event is successfully established; however, interruptions during certain calls still persist.
- I am currently using server_vad. My initial configuration values were:
  - Threshold: 0.6
  - Prefix padding (ms): 300
  - Silence duration (ms): 800

  I have tested several variations, primarily lowering the threshold and increasing both prefix padding and silence duration, but none of these changes resolved the issue.
- I verified that the audio codec and format are correctly supported by the SIP trunk provider, using G.711 µ-law (ulaw). The provider also reported that the ACK/BYE messages are being sent by the model, which appears to terminate the call almost immediately.
- The OpenAI_Webhook_Secret remains consistent across all requests, and, as mentioned previously, every call successfully results in a valid session.created event.

Answer 2

Hello Majad,

Greetings! Thanks for raising this question in the Q&A forum.

The intermittent audio input detection failure you're experiencing on SIP calls with the GPT Realtime model is most likely caused by a combination of two things: a subtle race condition in how the Realtime model initializes audio input processing at the start of a session, and potential codec or audio format negotiation timing differences between individual SIP calls. Since the issue is inconsistent (some calls work, some don't) and started appearing only recently without any changes on your end, it strongly suggests a backend service-side change or regression introduced after April that is affecting how the SIP audio stream gets picked up at session start.

This type of intermittent audio failure is a known behavior pattern with the current GPT Realtime models. It is related to how the model handles audio stream initialization internally, and is not necessarily an error in your integration logic.

Here are the steps I'd recommend to investigate and work around this:

First, check the Azure Service Health dashboard at https://status.azure.com and filter for Azure OpenAI or Azure AI Foundry in your deployed regions (East US 2 or Sweden Central). There may be an ongoing or recently resolved incident that correlates with the late-April timeframe when things started breaking.

Review the Azure OpenAI What's New page at https://learn.microsoft.com/en-us/azure/ai-foundry/openai/whats-new for any updates pushed after April. The Realtime API has recently received SIP support for telephony connections, and new model features have been added — it's possible a backend rollout introduced a regression in audio detection timing for existing SIP integrations.

On your backend, check whether the session.created event is being fully received and acknowledged before your SIP trunk starts sending audio. A common cause of intermittent input failure is that audio from the caller arrives at the model endpoint before the session is fully ready. Add a small buffer or gate the audio stream until you receive session.created confirmation.
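The gating idea above can be sketched as follows. This is a minimal illustration, assuming your backend actually sits in the audio path and receives both caller audio frames and Realtime server events; the `AudioGate` class, the `forward` callback, and the buffer size are hypothetical names and values, not part of any SDK.

```python
from collections import deque
from typing import Callable, Deque


class AudioGate:
    """Hold audio frames until session.created is seen, then pass them through."""

    def __init__(self, forward: Callable[[bytes], None], max_buffered_frames: int = 250) -> None:
        # `forward` is a hypothetical callback that sends one frame to the Realtime API.
        self._forward = forward
        # Bounded buffer: once full, the oldest frames are discarded
        # (250 frames is about 5 seconds if each frame carries 20 ms of audio).
        self._pending: Deque[bytes] = deque(maxlen=max_buffered_frames)
        self._ready = False

    def on_server_event(self, event: dict) -> None:
        """Open the gate on session.created and flush buffered frames in arrival order."""
        if event.get("type") == "session.created" and not self._ready:
            self._ready = True
            while self._pending:
                self._forward(self._pending.popleft())

    def on_audio_frame(self, frame: bytes) -> None:
        """Forward immediately once the session is ready; otherwise buffer the frame."""
        if self._ready:
            self._forward(frame)
        else:
            self._pending.append(frame)
```

Frames that arrive before `session.created` are replayed in order once the event is seen, so the start of the caller's speech is not lost.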

Verify your turn_detection configuration in the session.update event. If you're using server_vad (server-side Voice Activity Detection), confirm that the threshold, silence_duration_ms, and prefix_padding_ms values are appropriately set. Too high a threshold can cause the VAD to ignore quiet speech, and too little prefix padding can clip the beginning of a caller's speech, especially for calls where audio starts immediately.
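For reference, a session.update payload carrying these settings might be built like this. It is a sketch only: the default values are the ones reported in Answer 1, the helper name `build_vad_update` is hypothetical, and the nesting under `session.audio.input` reflects my understanding of the GA event shape (the preview API placed `turn_detection` directly under `session`), so check it against the current API reference before relying on it.

```python
import json


def build_vad_update(threshold: float = 0.6,
                     prefix_padding_ms: int = 300,
                     silence_duration_ms: int = 800) -> dict:
    """Build a session.update event that configures server_vad turn detection."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    if prefix_padding_ms < 0 or silence_duration_ms < 0:
        raise ValueError("padding and silence durations must be non-negative")
    return {
        "type": "session.update",
        "session": {
            "type": "realtime",
            # Assumed GA nesting; verify against the API reference.
            "audio": {
                "input": {
                    "turn_detection": {
                        "type": "server_vad",
                        "threshold": threshold,
                        "prefix_padding_ms": prefix_padding_ms,
                        "silence_duration_ms": silence_duration_ms,
                    }
                }
            },
        },
    }


# Example: a more sensitive variant with extra prefix padding.
print(json.dumps(build_vad_update(threshold=0.4, prefix_padding_ms=500), indent=2))
```

Logging the exact payload sent on each call makes it easier to confirm that failing and successful calls used identical VAD settings.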

Check the audio codec and format being negotiated on the SIP calls that fail versus those that succeed. The Realtime API's PCM16 format is 24 kHz, but it also accepts G.711 µ-law and A-law, which is what most SIP trunks negotiate. If some SIP calls are negotiating a different codec (such as G.729), or your transcoder introduces even a slight delay or format mismatch, the model may fail to pick up the initial audio stream.
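One way to compare failing and successful calls is to extract the offered audio codecs from each call's SDP body, if your SIP trunk provider can export it. The sketch below is a simplified parser written for illustration, not a full SDP implementation; the function name and the sample SDP are made up, and the static payload type table covers only a few common entries from RFC 3551.

```python
import re

# A few static RTP payload types from RFC 3551 that commonly appear on SIP trunks.
STATIC_PAYLOAD_TYPES = {"0": "PCMU/8000", "8": "PCMA/8000", "9": "G722/8000", "18": "G729/8000"}


def offered_audio_codecs(sdp: str) -> list[str]:
    """Return the audio codecs in the order listed on the m=audio line."""
    payload_types: list[str] = []
    rtpmap: dict[str, str] = {}
    in_audio = False
    for raw_line in sdp.splitlines():
        line = raw_line.strip()
        if line.startswith("m="):
            in_audio = line.startswith("m=audio")
            if in_audio:
                # m=audio <port> <proto> <fmt> <fmt> ...
                payload_types = line.split()[3:]
        elif in_audio:
            match = re.match(r"a=rtpmap:(\d+)\s+(\S+)", line)
            if match:
                rtpmap[match.group(1)] = match.group(2)
    # Prefer the explicit rtpmap entry, then fall back to the static table.
    return [rtpmap.get(pt) or STATIC_PAYLOAD_TYPES.get(pt, f"unknown({pt})")
            for pt in payload_types]


example_sdp = """v=0
o=- 0 0 IN IP4 192.0.2.1
s=-
c=IN IP4 192.0.2.1
t=0 0
m=audio 40000 RTP/AVP 0 8 101
a=rtpmap:0 PCMU/8000
a=rtpmap:101 telephone-event/8000
"""
print(offered_audio_codecs(example_sdp))
```

Running this over the SDP of a few failing and a few working calls shows quickly whether the codec list or its ordering differs between the two groups.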

Also verify your OPENAI_WEBHOOK_SECRET environment variable matches the secret from webhook creation, and ensure you're passing raw request body bytes — not parsed JSON — to the unwrap function, and that no middleware is modifying the request body before verification. These subtle configuration issues can cause intermittent failures that look like audio processing problems.
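To illustrate why the raw bytes matter, here is a generic HMAC-SHA256 example using only the Python standard library. It is not the signature scheme the SDK's unwrap function actually implements; the secret, payload, and helper names are invented for the demonstration. The point is only that parsing and re-serializing JSON changes the bytes, so any signature computed over the original body no longer matches.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature over the exact body bytes."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Compare signatures in constant time."""
    return hmac.compare_digest(sign(secret, body), signature)


secret = b"example-secret"
# Made-up payload; note the double space after the first comma.
raw_body = b'{"type": "example.event",  "data": {"call_id": "example"}}'
signature = sign(secret, raw_body)

# Middleware that parses and re-serializes the JSON normalizes whitespace,
# so the bytes handed to verification differ from what the sender signed.
reserialized = json.dumps(json.loads(raw_body)).encode()

print(verify(secret, raw_body, signature))      # True
print(verify(secret, reserialized, signature))  # False
```

If your framework parses the body automatically, capture the raw bytes before any JSON middleware runs and pass those to verification.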

Since gpt-realtime-1.5 also showed the same issue, this helps rule out a model-version-specific bug and points more toward the SIP session initialization or backend infrastructure. I would strongly recommend raising an Azure Support ticket with the following details: your deployment region, the model version used, sample call timestamps where the failure occurred, and the specific event logs from your backend showing what events were received (or not received) from the Realtime API during the failed calls. This will allow the Azure OpenAI engineering team to check for any backend changes that may have impacted SIP audio handling after April.
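To gather the timestamps and event evidence for the ticket, a small helper like the following could flag calls that reached session.created but never reported detected speech. This is a sketch under assumptions: it presumes your backend already logs each server event as a `(call_id, timestamp, event_type)` tuple, and it treats `input_audio_buffer.speech_started` as the signal that input audio was detected. The function name and sample data are hypothetical.

```python
from typing import Iterable


def calls_without_speech(events: Iterable[tuple[str, str, str]]) -> dict[str, str]:
    """Map call_id -> session.created timestamp for calls where no speech was ever detected."""
    created_at: dict[str, str] = {}
    heard: set[str] = set()
    for call_id, timestamp, event_type in events:
        if event_type == "session.created":
            # Keep the first session.created timestamp seen for each call.
            created_at.setdefault(call_id, timestamp)
        elif event_type == "input_audio_buffer.speech_started":
            heard.add(call_id)
    return {cid: ts for cid, ts in created_at.items() if cid not in heard}


# Made-up log entries for illustration.
sample_log = [
    ("call-a", "2025-01-01T10:00:00Z", "session.created"),
    ("call-a", "2025-01-01T10:00:02Z", "input_audio_buffer.speech_started"),
    ("call-b", "2025-01-01T10:05:00Z", "session.created"),
    ("call-b", "2025-01-01T10:05:01Z", "session.updated"),
]
print(calls_without_speech(sample_log))
```

The resulting call IDs and timestamps are the samples worth attaching to the support request.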

The most actionable next step while you wait for support is to add a guard in your backend that buffers incoming SIP audio and only begins forwarding it to the Realtime API once the session.created event has been confirmed — this alone has resolved similar intermittent input detection issues for other developers.

If this answer helps, kindly accept it, which will help others who have similar questions.

Best Regards,

Jerald Felix.
