
Azure OpenAI gpt-realtime – Intermittent ResponseFailed (server_error) during peak hours

Himanshu Changwal 10 Reputation points
2026-04-14T18:55:45.5733333+00:00

We are using the Azure OpenAI gpt-realtime model over WebSocket for a production use case and are seeing intermittent failures during peak traffic hours. Our resource is in eastus2.

Error Details

{
  "name": "ResponseFailed",
  "response_id": "resp_XXXXXXXXXXXXXXXXXXXXXXXXX",
  "status": "failed",
  "error_type": "server_error",
  "details": {
    "type": "failed",
    "error": {
      "type": "server_error",
      "code": null,
      "message": "The server had an error while processing your request. Sorry about that! Please contact us by filing an Azure support request in the portal. You can find information on how to do that here: https://learn.microsoft.com/azure/ai-services/cognitive-services-support-options if the error persists. (include session ID in your message: sess_XXXXXXXXXXXXXXX). We recommend you retry your request. If the problem persists, contact us by filing an Azure support request in the portal. You can find information on how to do that here: https://learn.microsoft.com/azure/ai-services/cognitive-services-support-options . (Please include the session ID sess_XXXXXXXXXXXXXXX in your message.)"
    }
  },
  "error_name": "ResponseFailed"
}
  • Happens randomly (not all requests fail)
  • Retries with exponential backoff help only marginally

Observed Pattern

  • Total sessions (25-hour window): ~10,000
  • Total failures: 290 (2.9%)

Key observations:

  • Failures are very low during off-peak hours
  • Significant spike during 07:00–09:00 PDT
    • Example:
      • 06:00 → ~997 sessions → 8 failures
      • 08:00 → ~893 sessions → 135 failures

Current Setup

  • Using WebSocket-based realtime API
  • Multiple concurrent sessions (one per user interaction)
  • Sessions may stay open for multiple interactions
  • TPM limits already increased on Azure

What We Tried

  • Increased tokens per minute (TPM) → no improvement
  • Added basic retry → helps partially
  • Verified request payloads → no structural issues

Questions

  1. Are there recommended limits on concurrent WebSocket sessions per deployment?
  2. Could this be due to:
    • session reuse / long-lived sessions?
    • hidden concurrency throttling?
    • region-specific capacity issues?
  3. Are these failures expected (~2–5%), and should they be handled via retries?
  4. Any best practices for:
    • session lifecycle (rotation, timeout)?
    • connection pooling?
    • load distribution across deployments?

4 answers

  1. 24250632 0 Reputation points
    2026-04-14T22:57:41.1733333+00:00

    {
      "name": "ResponseFailed",
      "response_id": "resp_XXXXXXXXXXXXXXXXXXXXXXXXX",
      "status": "failed",
      "error_type": "server_error",
      "details": {
        "type": "failed",
        "error": {
          "type": "server_error",
          "code": null,
          "message": "The server had an error while processing your request. Sorry about that! Please contact us by filing an Azure support request in the portal. You can find information on how to do that here: https://learn.microsoft.com/azure/ai-services/cognitive-services-support-options if the error persists. (include session ID in your message: sess_XXXXXXXXXXXXXXX). We recommend you retry your request. If the problem persists, contact us by filing an Azure support request in the portal. You can find information on how to do that here: https://learn.microsoft.com/azure/ai-services/cognitive-services-support-options . (Please include the session ID sess_XXXXXXXXXXXXXXX in your message.)"
        }
      },
      "error_name": "ResponseFailed"
    }


  2. Himanshu Changwal 10 Reputation points
    2026-04-14T21:18:42.9066667+00:00

    We have observed an increase in failure rates today and require immediate attention. Please advise on steps to mitigate and resolve this issue.
    [Screenshot 2026-04-14 at 2.17.15 PM]


  3. Ian D 0 Reputation points
    2026-04-14T19:13:54.7966667+00:00

    We’re seeing something very similar on our side (also in East US 2), but using the standard /openai/v1/responses API rather than realtime. We went through the same debugging steps (payload validation, retry logic, TPM limits, config) and nothing obvious was wrong. What helped was looking at metrics instead of just the surface errors:

    • Latency spikes (we’ve seen Time to Last Byte up to ~8–9s)
    • Azure metrics showing 499s (client closing the request)
    • No corresponding 408s in Azure metrics, even though the client surfaces timeouts

    The pattern we’re seeing:

    • request starts fine
    • backend slows down intermittently
    • client times out / cancels
    • surfaces as 408 / incomplete responses / sometimes even content_filter depending on the client

    It’s inconsistent and seems to line up with higher-traffic windows, which matches what you’re describing pretty closely. Also worth noting: this wasn’t happening for us prior to last week under a similar workload, so it feels more like a backend/regional change than anything on our side. It looks like regional performance variability or capacity constraints in East US 2.

    Curious how others are handling cost in these scenarios: since some requests partially execute before timing out, it’s not clear to me how billing is applied when responses don’t complete.


  4. Q&A Assist
    2026-04-14T18:56:17.38+00:00
    AI-generated content: this answer was created with AI from Q&A Assist.

    The error pattern described matches intermittent server-side issues or regional capacity constraints rather than a client-side payload or quota problem.

    From the available information, the following points are supported:

    1. Realtime WebSocket connection behavior and limits
    • The GPT Realtime API is accessed via a secure WebSocket connection to the /realtime endpoint of an Azure OpenAI resource.
    • Correct construction of the WebSocket URL is critical; using the wrong path or mixing GA/preview formats results in errors (including 404 or auth-related issues), but not the intermittent server_error pattern described.
      • GA WebSocket URL format:
            wss://<resource>.openai.azure.com/openai/v1/realtime?model=<gpt-realtime-deployment-name>
        
      • Preview WebSocket URL format:
            wss://<resource>.openai.azure.com/openai/realtime?api-version=2025-04-01-preview&deployment=<realtime-preview-deployment-name>
        
    • The documentation does not specify a hard limit on concurrent WebSocket sessions per deployment. It does, however, distinguish recommended protocol usage:
      • WebRTC: best for low-latency, client-side real-time audio.
      • WebSocket: best for server-to-server and batch processing, with moderate latency.
      • SIP: for telephony integration.
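    As a quick illustration of the two URL formats above, a small helper can keep all clients consistent (the resource and deployment names are placeholders, not real endpoints):

```python
from urllib.parse import urlencode

def realtime_ws_url(resource: str, deployment: str, ga: bool = True) -> str:
    """Build an Azure OpenAI Realtime WebSocket URL.

    `resource` and `deployment` are placeholders for your own names.
    GA uses /openai/v1/realtime with a `model` query parameter and no
    api-version; preview uses /openai/realtime with api-version + deployment.
    """
    base = f"wss://{resource}.openai.azure.com"
    if ga:
        return f"{base}/openai/v1/realtime?{urlencode({'model': deployment})}"
    return (f"{base}/openai/realtime?"
            + urlencode({"api-version": "2025-04-01-preview",
                         "deployment": deployment}))

# Example (hypothetical names):
# realtime_ws_url("myresource", "gpt-realtime")
# -> "wss://myresource.openai.azure.com/openai/v1/realtime?model=gpt-realtime"
```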

    Given this, the intermittent server_error responses during peak hours are consistent with transient service-side issues or regional load, not with a documented per-session concurrency limit.

    2. Whether a 2–5% failure rate is “expected” and how to handle it
    • The FAQ for Azure OpenAI notes that when the service performs processing, charges can apply even if the status code is not 200 (for example, 400 due to content filter or 408 due to timeout). It also notes that some 500-level errors can occur and recommends retry with backoff.
    • For known 500-level issues (for example, “invalid Unicode output” or “Unexpected special token”), the guidance is:
      • Reduce temperature.
      • Ensure client has retry logic.
      • Reattempting often results in a successful response.

    By analogy, intermittent server_error failures on Realtime are expected to be handled via robust retry logic and resiliency patterns. However, a sustained 2–5% failure rate concentrated in a specific time window and region is not documented as a normal baseline; it is a signal to:

    • Implement resilient client behavior (retries, backoff, failover), and
    • Open an Azure support request including session IDs and timestamps so the service team can check regional capacity or backend issues.
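    A minimal retry-with-backoff sketch along these lines (the function name, attempt counts, and delays are illustrative, not part of any Azure SDK):

```python
import random
import time

def with_backoff(operation, max_attempts: int = 5,
                 base_delay: float = 0.5, max_delay: float = 8.0,
                 sleep=time.sleep):
    """Retry `operation` on exception, with exponential backoff plus full jitter.

    `operation` is any zero-argument callable (e.g. one realtime request);
    `sleep` is injectable so the logic can be tested without real delays.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the error
            # full jitter: sleep a random amount up to the capped exponential
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
```

    In practice the total number of attempts and the overall per-interaction timeout should be capped so user-facing latency stays acceptable.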
    3. Region and capacity considerations
    • Other Azure services (for example, Document Intelligence in West US 2) show that region-specific service-side issues can cause timeouts or failures even when client code and payloads are correct. The recommended actions there are:
      • Check Azure Status and Service Health for incidents in the region.
      • Try a different region to see if the issue is regional.

    For the described pattern (spike in failures in eastus2 during 07:00–09:00 PDT), similar guidance applies:

    • Check Azure Service Health for eastus2 and the Azure OpenAI resource.
    • If possible, deploy an additional Azure OpenAI resource in a second region and route a portion of traffic there during peak hours to see if failures drop.
    4. Best practices for session lifecycle and resiliency

    From the Realtime and migration guidance, plus general Azure OpenAI recommendations, the following practices are supported:

    • Use the correct GA vs preview endpoint and query parameters:
      • GA: /openai/v1/realtime?model=... (no api-version).
      • Preview: /openai/realtime?api-version=...&deployment=....
      • Mixing these can cause connection/auth issues; ensure all clients are consistent.
    • Handle WebSocket and session-level failures explicitly:
      • Monitor for session.created and other events to verify sessions are established correctly.
      • On ResponseFailed / server_error, close the WebSocket and create a new session rather than reusing a potentially corrupted one.
    • Session lifecycle:
      • Avoid unnecessarily long-lived sessions if traffic is bursty; prefer:
        • One session per interaction or per short-lived conversation window, or
        • Periodic rotation of sessions (for example, after N interactions or M minutes).
      • This reduces the chance that a long-lived session is affected by transient backend issues.
    • Retry strategy:
      • Use exponential backoff with jitter.
      • Limit the total number of retries and overall timeout per user interaction to keep UX acceptable.
      • On repeated server_error for the same session, create a new session (new WebSocket) before retrying.
    • Load distribution:
      • If the application is latency-tolerant and server-to-server, WebSockets are appropriate; if low latency is critical, consider WebRTC as recommended.
      • For high-scale scenarios, consider:
        • Multiple deployments or resources (possibly in multiple regions).
        • Application-level routing to spread load across deployments/regions.
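    The session-rotation idea above can be sketched as follows (the `connect` callable and both thresholds are placeholders; no documented service limit dictates these values):

```python
import time

class RotatingSession:
    """Rotate a realtime session after N interactions or M seconds.

    `connect` stands in for whatever opens a new WebSocket session;
    the default thresholds are illustrative, not documented limits.
    """
    def __init__(self, connect, max_interactions: int = 20,
                 max_age_s: float = 300.0, clock=time.monotonic):
        self._connect = connect
        self._max_interactions = max_interactions
        self._max_age_s = max_age_s
        self._clock = clock
        self._session = None
        self._interactions = 0
        self._opened_at = 0.0

    def get(self):
        """Return a session, opening a fresh one if either threshold is hit."""
        expired = (self._session is None
                   or self._interactions >= self._max_interactions
                   or self._clock() - self._opened_at >= self._max_age_s)
        if expired:
            self._session = self._connect()
            self._interactions = 0
            self._opened_at = self._clock()
        self._interactions += 1
        return self._session
```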
    5. What to do next for this specific scenario

    Based on the documented behavior and guidance:

    • Validate endpoint usage:
      • Ensure all clients use the GA or preview format consistently and do not include api-version on GA endpoints, as this can cause authentication/connection issues after migration.
    • Implement stronger resiliency:
      • On server_error:
        • Log session_id and response_id.
        • Close the WebSocket and establish a new session.
        • Retry the interaction with backoff.
    • Check for regional issues and consider multi-region:
      • Use Azure Service Health to check eastus2 for incidents affecting Azure OpenAI.
      • If possible, deploy a second Azure OpenAI resource in another region and route a portion of traffic there during 07:00–09:00 PDT to see if failure rates drop.
    • Engage Azure Support:
      • The error message explicitly instructs to file an Azure support request and include the session ID. For a persistent, time-windowed 2–5% server_error rate, this is appropriate so the product team can investigate backend capacity or reliability issues in eastus2.
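    Routing a portion of traffic to a second region, as suggested above, can be as simple as a weighted pick per session (the endpoint names and the 20% share are hypothetical; in practice the share could be skewed upward during the peak window):

```python
import random

def pick_endpoint(primary: str, secondary: str,
                  secondary_share: float = 0.2,
                  rng=random.random) -> str:
    """Send roughly `secondary_share` of sessions to the second-region
    deployment; `rng` is injectable for deterministic testing."""
    return secondary if rng() < secondary_share else primary

# Example (hypothetical endpoints):
# pick_endpoint("eastus2-deployment", "westus-deployment", 0.2)
```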

    The documentation does not expose a specific numeric limit for concurrent WebSocket sessions per deployment, nor does it state that a 2–5% server_error rate is expected. The supported guidance is to:

    • Use the correct Realtime endpoint format.
    • Implement robust retry and session recreation logic.
    • Check for regional issues and, if needed, use multiple regions.
    • Open a support case with session IDs when server_error persists.

