
Azure OpenAI gpt-realtime – Intermittent ResponseFailed (server_error) during peak hours

Himanshu Changwal 10 Reputation points
2026-04-14T18:55:45.5733333+00:00

We are using the Azure OpenAI gpt-realtime model over WebSocket for a production use case and are seeing intermittent failures during peak traffic hours. Our resource is in eastus2.

Error Details

{
  "name": "ResponseFailed",
  "response_id": "resp_XXXXXXXXXXXXXXXXXXXXXXXXX",
  "status": "failed",
  "error_type": "server_error",
  "details": {
    "type": "failed",
    "error": {
      "type": "server_error",
      "code": null,
      "message": "The server had an error while processing your request. Sorry about that! Please contact us by filing an Azure support request in the portal. You can find information on how to do that here: https://learn.microsoft.com/azure/ai-services/cognitive-services-support-options if the error persists. (include session ID in your message: sess_XXXXXXXXXXXXXXX). We recommend you retry your request. If the problem persists, contact us by filing an Azure support request in the portal. You can find information on how to do that here: https://learn.microsoft.com/azure/ai-services/cognitive-services-support-options . (Please include the session ID sess_XXXXXXXXXXXXXXX in your message.)"
    }
  },
  "error_name": "ResponseFailed"
}
  • Happens randomly (not all requests fail)
  • Retries with exponential backoff help only marginally

Observed Pattern

  • Total sessions (25-hour window): ~10,000
  • Total failures: 290 (2.9%)

Key observations:

  • Failures are very low during off-peak hours
  • Significant spike during 07:00–09:00 PDT
    • Example:
      • 06:00 → ~997 sessions → 8 failures
      • 08:00 → ~893 sessions → 135 failures

Current Setup

  • Using WebSocket-based realtime API
  • Multiple concurrent sessions (one per user interaction)
  • Sessions may stay open for multiple interactions
  • TPM limits already increased on Azure

What We Tried

  • Increased tokens per minute (TPM) → no improvement
  • Added basic retry → helps partially
  • Verified request payloads → no structural issues

Questions

  1. Are there recommended limits on concurrent WebSocket sessions per deployment?
  2. Could this be due to:
    • session reuse / long-lived sessions?
    • hidden concurrency throttling?
    • region-specific capacity issues?
  3. Are these failures expected (~2–5%), and should they be handled via retries?
  4. Any best practices for:
    • session lifecycle (rotation, timeout)?
    • connection pooling?
    • load distribution across deployments?

4 answers

  1. 24250632 0 Reputation points
    2026-04-14T22:57:41.1733333+00:00

    {
      "name": "ResponseFailed",
      "response_id": "resp_XXXXXXXXXXXXXXXXXXXXXXXXX",
      "status": "failed",
      "error_type": "server_error",
      "details": {
        "type": "failed",
        "error": {
          "type": "server_error",
          "code": null,
          "message": "The server had an error while processing your request. Sorry about that! Please contact us by filing an Azure support request in the portal. You can find information on how to do that here: https://learn.microsoft.com/azure/ai-services/cognitive-services-support-options if the error persists. (include session ID in your message: sess_XXXXXXXXXXXXXXX). We recommend you retry your request. If the problem persists, contact us by filing an Azure support request in the portal. You can find information on how to do that here: https://learn.microsoft.com/azure/ai-services/cognitive-services-support-options . (Please include the session ID sess_XXXXXXXXXXXXXXX in your message.)"
        }
      },
      "error_name": "ResponseFailed"
    }


  2. Himanshu Changwal 10 Reputation points
    2026-04-14T21:18:42.9066667+00:00

    We have observed an increase in failure rates today and require immediate attention. Please advise on steps to mitigate and resolve this issue.
    [Screenshot 2026-04-14 at 2.17.15 PM]


  3. Ian D 0 Reputation points
    2026-04-14T19:13:54.7966667+00:00

    We’re seeing something very similar on our side (also in East US 2), but using the standard /openai/v1/responses API rather than realtime. We went through the same debugging steps (payload validation, retry logic, TPM limits, config) and nothing obvious was wrong. What helped was looking at metrics instead of just the surface errors:

    • Latency spikes (we’ve seen Time to Last Byte up to ~8–9s)
    • Azure metrics showing 499s (client closing the request)
    • No corresponding 408s in Azure metrics, even though the client surfaces timeouts

    The pattern we’re seeing:

    • request starts fine
    • backend slows down intermittently
    • client times out / cancels
    • surfaces as 408 / incomplete responses / sometimes even content_filter depending on the client

    It’s inconsistent and seems to line up with higher-traffic windows, which matches what you’re describing pretty closely. Also worth noting: this wasn’t happening for us prior to last week under a similar workload, so it feels more like a backend/regional change than anything on our side. It looks like regional performance variability or capacity constraints in East US 2.

    Curious how others are handling cost in these scenarios: since some requests partially execute before timing out, it’s not clear to me how billing is applied when responses don’t complete.


  4. Q&A Assist
    2026-04-14T18:56:17.38+00:00
    AI-generated content: this answer was created with AI from Q&A Assist.

    The error pattern described matches intermittent server-side issues or regional capacity constraints rather than a client-side payload or quota problem.

    From the available information, the following points are supported:

    1. Realtime WebSocket connection behavior and limits
    • The GPT Realtime API is accessed via a secure WebSocket connection to the /realtime endpoint of an Azure OpenAI resource.
    • Correct construction of the WebSocket URL is critical; using the wrong path or mixing GA/preview formats results in errors (including 404 or auth-related issues), but not the intermittent server_error pattern described.
      • GA WebSocket URL format:
            wss://<resource>.openai.azure.com/openai/v1/realtime?model=<gpt-realtime-deployment-name>
        
      • Preview WebSocket URL format:
            wss://<resource>.openai.azure.com/openai/realtime?api-version=2025-04-01-preview&deployment=<realtime-preview-deployment-name>
        
    • The documentation does not specify a hard limit on concurrent WebSocket sessions per deployment. It does, however, distinguish recommended protocol usage:
      • WebRTC: best for low-latency, client-side real-time audio.
      • WebSocket: best for server-to-server and batch processing, with moderate latency.
      • SIP: for telephony integration.
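    As a quick illustration of the two URL formats above, a small helper can keep all clients consistent (the resource and deployment names are placeholders, not real endpoints):

```python
from urllib.parse import urlencode

def realtime_ws_url(resource: str, deployment: str, ga: bool = True) -> str:
    """Build an Azure OpenAI Realtime WebSocket URL.

    `resource` and `deployment` are placeholders for your own names.
    GA uses /openai/v1/realtime with a `model` query parameter and no
    api-version; preview uses /openai/realtime with api-version + deployment.
    """
    base = f"wss://{resource}.openai.azure.com"
    if ga:
        return f"{base}/openai/v1/realtime?{urlencode({'model': deployment})}"
    return (f"{base}/openai/realtime?"
            + urlencode({"api-version": "2025-04-01-preview",
                         "deployment": deployment}))

# Example (hypothetical names):
# realtime_ws_url("myresource", "gpt-realtime")
# -> "wss://myresource.openai.azure.com/openai/v1/realtime?model=gpt-realtime"
```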

    Given this, the intermittent server_error responses during peak hours are consistent with transient service-side issues or regional load, not with a documented per-session concurrency limit.

    2. Whether a 2–5% failure rate is “expected” and how to handle it
    • The FAQ for Azure OpenAI notes that when the service performs processing, charges can apply even if the status code is not 200 (for example, 400 due to content filter or 408 due to timeout). It also notes that some 500-level errors can occur and recommends retry with backoff.
    • For known 500-level issues (for example, “invalid Unicode output” or “Unexpected special token”), the guidance is:
      • Reduce temperature.
      • Ensure client has retry logic.
      • Reattempting often results in a successful response.

    By analogy, intermittent server_error failures on Realtime are expected to be handled via robust retry logic and resiliency patterns. However, a sustained 2–5% failure rate concentrated in a specific time window and region is not documented as a normal baseline; it is a signal to:

    • Implement resilient client behavior (retries, backoff, failover), and
    • Open an Azure support request including session IDs and timestamps so the service team can check regional capacity or backend issues.
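    A minimal retry-with-backoff sketch along these lines (the function name, attempt counts, and delays are illustrative, not part of any Azure SDK):

```python
import random
import time

def with_backoff(operation, max_attempts: int = 5,
                 base_delay: float = 0.5, max_delay: float = 8.0,
                 sleep=time.sleep):
    """Retry `operation` on exception, with exponential backoff plus full jitter.

    `operation` is any zero-argument callable (e.g. one realtime request);
    `sleep` is injectable so the logic can be tested without real delays.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the error
            # full jitter: sleep a random amount up to the capped exponential
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
```

    In practice the total number of attempts and the overall per-interaction timeout should be capped so user-facing latency stays acceptable.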
    3. Region and capacity considerations
    • Other Azure services (for example, Document Intelligence in West US 2) show that region-specific service-side issues can cause timeouts or failures even when client code and payloads are correct. The recommended actions there are:
      • Check Azure Status and Service Health for incidents in the region.
      • Try a different region to see if the issue is regional.

    For the described pattern (spike in failures in eastus2 during 07:00–09:00 PDT), similar guidance applies:

    • Check Azure Service Health for eastus2 and the Azure OpenAI resource.
    • If possible, deploy an additional Azure OpenAI resource in a second region and route a portion of traffic there during peak hours to see if failures drop.
    4. Best practices for session lifecycle and resiliency

    From the Realtime and migration guidance, plus general Azure OpenAI recommendations, the following practices are supported:

    • Use the correct GA vs preview endpoint and query parameters:
      • GA: /openai/v1/realtime?model=... (no api-version).
      • Preview: /openai/realtime?api-version=...&deployment=....
      • Mixing these can cause connection/auth issues; ensure all clients are consistent.
    • Handle WebSocket and session-level failures explicitly:
      • Monitor for session.created and other events to verify sessions are established correctly.
      • On ResponseFailed / server_error, close the WebSocket and create a new session rather than reusing a potentially corrupted one.
    • Session lifecycle:
      • Avoid unnecessarily long-lived sessions if traffic is bursty; prefer:
        • One session per interaction or per short-lived conversation window, or
        • Periodic rotation of sessions (for example, after N interactions or M minutes).
      • This reduces the chance that a long-lived session is affected by transient backend issues.
    • Retry strategy:
      • Use exponential backoff with jitter.
      • Limit the total number of retries and overall timeout per user interaction to keep UX acceptable.
      • On repeated server_error for the same session, create a new session (new WebSocket) before retrying.
    • Load distribution:
      • If the application is latency-tolerant and server-to-server, WebSockets are appropriate; if low latency is critical, consider WebRTC as recommended.
      • For high-scale scenarios, consider:
        • Multiple deployments or resources (possibly in multiple regions).
        • Application-level routing to spread load across deployments/regions.
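    The session-rotation idea above can be sketched as follows (the `connect` callable and both thresholds are placeholders; no documented service limit dictates these values):

```python
import time

class RotatingSession:
    """Rotate a realtime session after N interactions or M seconds.

    `connect` stands in for whatever opens a new WebSocket session;
    the default thresholds are illustrative, not documented limits.
    """
    def __init__(self, connect, max_interactions: int = 20,
                 max_age_s: float = 300.0, clock=time.monotonic):
        self._connect = connect
        self._max_interactions = max_interactions
        self._max_age_s = max_age_s
        self._clock = clock
        self._session = None
        self._interactions = 0
        self._opened_at = 0.0

    def get(self):
        """Return a session, opening a fresh one if either threshold is hit."""
        expired = (self._session is None
                   or self._interactions >= self._max_interactions
                   or self._clock() - self._opened_at >= self._max_age_s)
        if expired:
            self._session = self._connect()
            self._interactions = 0
            self._opened_at = self._clock()
        self._interactions += 1
        return self._session
```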
    5. What to do next for this specific scenario

    Based on the documented behavior and guidance:

    • Validate endpoint usage:
      • Ensure all clients use the GA or preview format consistently and do not include api-version on GA endpoints, as this can cause authentication/connection issues after migration.
    • Implement stronger resiliency:
      • On server_error:
        • Log session_id and response_id.
        • Close the WebSocket and establish a new session.
        • Retry the interaction with backoff.
    • Check for regional issues and consider multi-region:
      • Use Azure Service Health to check eastus2 for incidents affecting Azure OpenAI.
      • If possible, deploy a second Azure OpenAI resource in another region and route a portion of traffic there during 07:00–09:00 PDT to see if failure rates drop.
    • Engage Azure Support:
      • The error message explicitly instructs to file an Azure support request and include the session ID. For a persistent, time-windowed 2–5% server_error rate, this is appropriate so the product team can investigate backend capacity or reliability issues in eastus2.
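    Routing a portion of traffic to a second region, as suggested above, can be as simple as a weighted pick per session (the endpoint names and the 20% share are hypothetical; in practice the share could be skewed upward during the peak window):

```python
import random

def pick_endpoint(primary: str, secondary: str,
                  secondary_share: float = 0.2,
                  rng=random.random) -> str:
    """Send roughly `secondary_share` of sessions to the second-region
    deployment; `rng` is injectable for deterministic testing."""
    return secondary if rng() < secondary_share else primary

# Example (hypothetical endpoints):
# pick_endpoint("eastus2-deployment", "westus-deployment", 0.2)
```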

    The documentation does not expose a specific numeric limit for concurrent WebSocket sessions per deployment, nor does it state that a 2–5% server_error rate is expected. The supported guidance is to:

    • Use the correct Realtime endpoint format.
    • Implement robust retry and session recreation logic.
    • Check for regional issues and, if needed, use multiple regions.
    • Open a support case with session IDs when server_error persists.

