The error pattern described matches intermittent server-side issues or regional capacity constraints rather than a client-side payload or quota problem.
From the available information, the following points are supported:
- Realtime WebSocket connection behavior and limits
- The GPT Realtime API is accessed via a secure WebSocket connection to the /realtime endpoint of an Azure OpenAI resource.
- Correct construction of the WebSocket URL is critical; using the wrong path or mixing GA and preview formats results in errors (including 404 or auth-related issues), but not the intermittent server_error pattern described.
- GA WebSocket URL format:
wss://<resource>.openai.azure.com/openai/v1/realtime?model=<gpt-realtime-deployment-name>
- Preview WebSocket URL format:
wss://<resource>.openai.azure.com/openai/realtime?api-version=2025-04-01-preview&deployment=<realtime-preview-deployment-name>
- The documentation does not specify a hard limit on concurrent WebSocket sessions per deployment. It does, however, distinguish recommended protocol usage:
- WebRTC: best for low-latency, client-side real-time audio.
- WebSocket: best for server-to-server and batch processing, with moderate latency.
- SIP: for telephony integration.
Given this, the intermittent server_error responses during peak hours are consistent with transient service-side issues or regional load, not with a documented per-session concurrency limit.
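To keep all clients on one consistent format, the two URL shapes above can be produced by a single helper. A minimal sketch in Python; the function name and defaults are illustrative, not an SDK API, and the resource/deployment values are placeholders:

```python
def realtime_ws_url(resource: str, deployment: str, ga: bool = True,
                    api_version: str = "2025-04-01-preview") -> str:
    """Build the Realtime WebSocket URL in either GA or preview format.

    GA uses ?model=... with no api-version; preview uses a different
    path with ?api-version=...&deployment=....
    """
    base = f"wss://{resource}.openai.azure.com/openai"
    if ga:
        return f"{base}/v1/realtime?model={deployment}"
    return f"{base}/realtime?api-version={api_version}&deployment={deployment}"
```

Centralizing URL construction like this makes it much harder for one client to accidentally mix GA and preview conventions.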
- Whether a 2–5% failure rate is “expected” and how to handle it
- The FAQ for Azure OpenAI notes that when the service performs processing, charges can apply even if the status code is not 200 (for example, 400 due to content filter or 408 due to timeout). It also notes that some 500-level errors can occur and recommends retry with backoff.
- For known 500-level issues (for example, “invalid Unicode output” or “Unexpected special token”), the guidance is:
- Reduce temperature.
- Ensure client has retry logic.
- Reattempting often results in a successful response.
By analogy, intermittent server_error failures on the Realtime API should be handled via robust retry logic and resiliency patterns. However, a sustained 2–5% failure rate concentrated in a specific time window and region is not documented as a normal baseline; it is a signal to:
- Implement resilient client behavior (retries, backoff, failover), and
- Open an Azure support request including session IDs and timestamps so the service team can check regional capacity or backend issues.
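The retry-with-backoff guidance can be sketched as a small helper. `TransientServerError` and `retry_with_backoff` are illustrative names (not library APIs), and the caller supplies the actual request function:

```python
import random
import time

class TransientServerError(Exception):
    """Placeholder for a 500-level / server_error response."""

def retry_with_backoff(send_request, max_attempts: int = 5,
                       base_delay: float = 0.5, max_delay: float = 8.0):
    """Call send_request, retrying transient failures with
    exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except TransientServerError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter
```

Full jitter (a uniform draw up to the backoff ceiling) spreads retries out so that many clients failing at once do not retry in lockstep.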
- Region and capacity considerations
- Other Azure services (for example, Document Intelligence in West US 2) show that region-specific service-side issues can cause timeouts or failures even when client code and payloads are correct. The recommended actions there are:
- Check Azure Status and Service Health for incidents in the region.
- Try a different region to see if the issue is regional.
For the described pattern (spike in failures in eastus2 during 07:00–09:00 PDT), similar guidance applies:
- Check Azure Service Health for eastus2 and the Azure OpenAI resource.
- If possible, deploy an additional Azure OpenAI resource in a second region and route a portion of traffic there during peak hours to see if failures drop.
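One way to trial a second region only during the problem window is a time-gated router. A sketch assuming a `westus2` secondary and the observed 07:00–09:00 window in Pacific time; both the region names and the window are assumptions to adapt:

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

PEAK_START, PEAK_END = time(7, 0), time(9, 0)  # observed 07:00-09:00 PDT window

def pick_region(now_utc: datetime, primary: str = "eastus2",
                secondary: str = "westus2") -> str:
    """Route to the secondary region during the peak failure window,
    otherwise stay on the primary. Region names are placeholders."""
    local = now_utc.astimezone(ZoneInfo("America/Los_Angeles")).time()
    return secondary if PEAK_START <= local < PEAK_END else primary
```

Comparing failure rates between the two regions during the window then gives direct evidence of whether the issue is regional.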
- Best practices for session lifecycle and resiliency
From the Realtime and migration guidance, plus general Azure OpenAI recommendations, the following practices are supported:
- Use the correct GA vs preview endpoint and query parameters:
- GA: /openai/v1/realtime?model=... (no api-version).
- Preview: /openai/realtime?api-version=...&deployment=....
- Mixing these can cause connection/auth issues; ensure all clients are consistent.
- Handle WebSocket and session-level failures explicitly:
- Monitor for session.created and other events to verify sessions are established correctly.
- On ResponseFailed / server_error, close the WebSocket and create a new session rather than reusing a potentially corrupted one.
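These event-handling rules can be condensed into a small decision function. The event field shapes below are assumptions based on common Realtime event conventions, so verify them against actual payloads before relying on them:

```python
def handle_event(event: dict, state: dict) -> str:
    """Decide how to react to a Realtime server event.

    Returns "continue" to keep the session, or "recreate" to close
    the WebSocket and open a fresh session (never reuse a session
    that has reported server_error).
    """
    etype = event.get("type", "")
    if etype == "session.created":
        # record the session ID for logging and support cases
        state["session_id"] = event.get("session", {}).get("id")
        return "continue"
    if etype == "error" and event.get("error", {}).get("type") == "server_error":
        return "recreate"
    if etype == "response.done" and \
       event.get("response", {}).get("status") == "failed":
        return "recreate"  # failed response: do not reuse the session
    return "continue"
```

Keeping the decision in one place makes it easy to log every recreate with the session ID that triggered it.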
- Session lifecycle:
- Avoid unnecessarily long-lived sessions if traffic is bursty; prefer:
- One session per interaction or per short-lived conversation window, or
- Periodic rotation of sessions (for example, after N interactions or M minutes).
- This reduces the chance that a long-lived session is affected by transient backend issues.
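The rotation policy (after N interactions or M minutes) can be a pure predicate checked before each interaction; the thresholds here are illustrative:

```python
def should_rotate(interactions: int, age_seconds: float,
                  max_interactions: int = 50,
                  max_age_seconds: float = 15 * 60) -> bool:
    """Rotate the session after N interactions or M minutes,
    whichever comes first. Thresholds are placeholders to tune."""
    return interactions >= max_interactions or age_seconds >= max_age_seconds
```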
- Retry strategy:
- Use exponential backoff with jitter.
- Limit the total number of retries and overall timeout per user interaction to keep UX acceptable.
- On repeated server_error for the same session, create a new session (new WebSocket) before retrying.
- Load distribution:
- If the application is latency-tolerant and server-to-server, WebSockets are appropriate; if low latency is critical, consider WebRTC as recommended.
- For high-scale scenarios, consider:
- Multiple deployments or resources (possibly in multiple regions).
- Application-level routing to spread load across deployments/regions.
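Application-level routing can start as simple weighted random selection over deployments; the endpoint names and weights below are placeholders:

```python
import random

def choose_endpoint(endpoints: dict[str, float], rng=random) -> str:
    """Pick an endpoint by weight, e.g. {"eastus2": 0.7, "westus2": 0.3}.
    Weights are relative and need not sum to 1."""
    names = list(endpoints)
    return rng.choices(names, weights=[endpoints[n] for n in names], k=1)[0]
```

Shifting weight away from a region during its peak-failure window is then a one-line configuration change rather than a code change.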
- What to do next for this specific scenario
Based on the documented behavior and guidance:
- Validate endpoint usage:
- Ensure all clients use the GA or preview format consistently and do not include api-version on GA endpoints, as this can cause authentication/connection issues after migration.
- Implement stronger resiliency:
- On server_error:
- Log session_id and response_id.
- Close the WebSocket and establish a new session.
- Retry the interaction with backoff.
- Check for regional issues and consider multi-region:
- Use Azure Service Health to check eastus2 for incidents affecting Azure OpenAI.
- If possible, deploy a second Azure OpenAI resource in another region and route a portion of traffic there during 07:00–09:00 PDT to see if failure rates drop.
- Engage Azure Support:
- The error message explicitly instructs to file an Azure support request and include the session ID. For a persistent, time-windowed 2–5% server_error rate, this is appropriate so the product team can investigate backend capacity or reliability issues in eastus2.
The documentation does not expose a specific numeric limit for concurrent WebSocket sessions per deployment, nor does it state that a 2–5% server_error rate is expected. The supported guidance is to:
- Use the correct Realtime endpoint format.
- Implement robust retry and session recreation logic.
- Check for regional issues and, if needed, use multiple regions.
- Open a support case with session IDs when server_error persists.