
Frequent Instability Across All LLM Models

Dan Hastings 5 Reputation points
2026-04-15T10:31:58.51+00:00

I have been noticing a lot of instability across a range of models in Azure AI Foundry.

Model Timeouts

Completion timeouts are now problematic across all models. Some models are completely unusable and never respond, others are very slow, and others fail sporadically at random. What makes me think this is an issue with Azure is that I have seen it happen with many models for simple queries like "tell me a joke", which still time out after 600 seconds.

  • Grok: pretty much never works; it almost always times out after 600 seconds. It is very rare that I can get a completion from Grok models, even for a very simple prompt like "tell me a joke".
  • Deepseek: The Deepseek models do complete more often, but it can take up to a minute to tell a joke or answer a similarly simple prompt. Streaming has not helped: it still returns a few characters and then nothing for several seconds.
  • OpenAI: Until recently, OpenAI had been far more stable, but now it is also timing out a lot. It is happening daily, and while we have retries in place to handle it, the time wasted waiting for responses that never arrive is becoming a problem. It regularly returns errors like the following:

{
    "error": {
        "message": "azure returned an error: upstream connect error or disconnect/reset before headers. reset reason: connection timeout",
        "status": 503,
        "code": "UPSTREAM_SERVER_ERROR",
        "provider": "azure",
        "providerUrl": "https://xxx.cognitiveservices.azure.com/openai/v1/responses",
        "model": "azure/gpt-5.1-codex-mini",
        "upstreamStatus": 503,
        "upstreamResponse": "upstream connect error or disconnect/reset before headers. reset reason: connection timeout"
    }
}

Aggressive Content Filter

This has been showing up a lot in the past week, and I can't work out why it is happening so often. I have seen it return this error when reviewing source code, which makes no sense at all. It has become a common error, and every time I have investigated, I can't find anything that explains it.

status 422: {"error":{"message":"Provider returned an incomplete response (reason: content_filter). The provider's content safety filter blocked the model output.","status":422,"code":"INCOMPLETE_RESPONSE","incompleteReason":"content_filter"}}

This one is quite problematic in that we don't know why it is blocking in the first place. The error is so vague that we can't identify anything different in the content, especially when it is source code.

Azure AI Content Safety

An Azure service that enables users to identify content that is potentially offensive, risky, or otherwise undesirable. Previously known as Azure Content Moderator.


1 answer

  1. Q&A Assist
    2026-04-15T10:32:35.67+00:00
    AI generated content. This answer was created with AI from Q&A Assist.

    The described behavior aligns with two main areas covered in the context: timeouts/instability and increasingly frequent content-filter blocks.

    1. Timeouts, slow responses, and 503 errors

    From the context, there are several relevant points:

    1. Client-side timeouts vs service latency
      • For non-reasoning models, guidance is:
        • Streaming: set client-side timeout up to 60 seconds.
        • Non‑streaming: up to 29 minutes.
      • For reasoning models, requests (including streaming) can take up to 29 minutes because all reasoning tokens are generated before the first token is returned.
      • Long-running calls can be canceled by client-side timeouts if those timeouts are set too low.
      Actionable steps:
      • Explicitly set client-side timeouts according to the guidance for each workload and model type, and tune them to traffic patterns.
      • For long or complex prompts, increase the timeout above 60 seconds if not using streaming, or ensure the HTTP client/proxy in front of the app is not enforcing a shorter timeout than the SDK.
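The timeout guidance above can be captured in a small helper. This is a hypothetical sketch (the function name and structure are mine, not an Azure API); the numbers come directly from the guidance: up to 60 seconds for streamed non-reasoning calls, and up to 29 minutes for reasoning models or non-streamed calls.

```python
def recommended_timeout(reasoning: bool, streaming: bool) -> float:
    """Return a client-side timeout in seconds for one completion call,
    per the guidance above: 60 s for streamed non-reasoning calls,
    29 minutes for reasoning models or non-streamed calls."""
    if not reasoning and streaming:
        return 60.0          # incremental tokens should arrive quickly
    return 29 * 60.0         # reasoning or non-streaming: allow up to 29 min
```

A value like this can then be passed as the client-level timeout when constructing the SDK client (for example, the `timeout` option accepted by the openai Python library's clients), so the app's own timeout is never shorter than what the model type needs.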
    2. Use streaming to avoid client/proxy timeouts
      • Streaming is recommended when calls take a long time to process, because many clients and intermediary layers have their own timeouts.
      • Streaming ensures incremental data is received, improving perceived latency and reducing the chance that intermediaries cancel the request before any data is seen.
      In the DeepSeek/OpenAI scenarios where only a few characters arrive and then nothing for a long time, this suggests either:
      • The model is still generating but is slow (large context, complex reasoning, or cache misses), or
      • Some intermediary (gateway, proxy, or client) is interrupting the stream.
      Actionable steps:
      • Confirm that the HTTP client, reverse proxies, and any API gateways allow long‑lived streaming connections and do not buffer or cut them off early.
      • If streaming is enabled but still timing out, increase the upstream timeout in the gateway/load balancer to be at least as large as the model guidance (or the app’s own timeout, whichever is smaller).
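To tell a slow-but-alive stream apart from one an intermediary has silently dropped, the stream can be wrapped with a stall guard. This is a hypothetical helper, not part of any SDK; it raises as soon as the gap between consecutive chunks exceeds a threshold, instead of hanging until the outer 600 s timeout fires.

```python
import time

def iter_with_stall_guard(chunks, max_gap_s=30.0):
    """Yield streamed chunks, raising TimeoutError when the time since
    the previous chunk exceeds max_gap_s (a stalled or dropped stream)."""
    last = time.monotonic()
    for chunk in chunks:
        gap = time.monotonic() - last
        if gap > max_gap_s:
            raise TimeoutError(f"stream stalled for {gap:.1f}s between chunks")
        last = time.monotonic()
        yield chunk
```

Wrapping the SDK's streaming iterator this way makes "a few characters and then nothing" fail fast with a distinct error, which also helps separate slow generation from an intermediary cutting the connection.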
    3. Service-side issues and region-specific instability
      • The context shows that sometimes issues are region-specific and not clearly surfaced in error messages. A deployment that failed repeatedly in one region succeeded immediately when moved to another region, even though the error did not mention a regional problem.
      • Another example shows Document Intelligence requests timing out both from code and from the Studio, indicating a service-side or regional issue rather than client code.
      Actionable steps:
      • Check Azure Status and the Service Health blade in the Azure portal for the regions where the Foundry models are deployed to see if there are incidents or degraded performance.
      • If possible, test the same prompts against a deployment in a different region to see if the behavior is regional.
      • If the same simple prompts (like “tell me a joke”) consistently time out across multiple models and regions, capture correlation IDs and timestamps and open a support request so the service team can investigate upstream 503/timeout patterns.
    4. HTTP 429 and 5xx handling
      • For rate limits (429), the guidance is to implement retry with exponential backoff and respect Retry-After.
      • For 5xx errors like the 503 “upstream connect error or disconnect/reset before headers. reset reason: connection timeout”, this indicates the upstream service did not respond in time.
      Actionable steps:
      • Ensure robust retry logic for transient 5xx errors with backoff and jitter.
      • Log and surface the upstreamStatus, upstreamResponse, and any correlation IDs to support for deeper analysis.
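A minimal sketch of the retry pattern described above, with exponential backoff, full jitter, and a Retry-After hint taking precedence when the server provides one. The `do_request` callable and its `(status, body, retry_after)` return shape are assumptions for illustration, not a real SDK interface.

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}

def call_with_retries(do_request, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Call do_request() -> (status, body, retry_after_seconds) and retry
    transient statuses with exponential backoff and full jitter,
    honoring an explicit Retry-After hint when the server sends one."""
    status, body = None, None
    for attempt in range(max_attempts):
        status, body, retry_after = do_request()
        if status not in RETRYABLE:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts; surface the last error to the caller
        cap = min(max_delay, base_delay * (2 ** attempt))
        time.sleep(retry_after if retry_after is not None else random.uniform(0, cap))
    return status, body
```

Full jitter (a uniform random delay up to the backoff cap) spreads retries out so that many clients hitting the same 503 do not all retry in lockstep.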
    5. Performance tuning
      • Separate workloads: mixing very different workloads on the same endpoint can hurt latency due to batching and cache contention.
      • Prompt size and generation size both affect latency; large prompts and large max_tokens will increase response time.
      • Batching can reduce the number of requests and sometimes improve overall throughput.
      Actionable steps:
      • Use separate deployments/endpoints for very different workloads (e.g., short chat vs long code analysis) to avoid cross‑impact on latency.
      • Review prompt sizes and generation lengths for the slowest workloads and reduce where possible.

    2. Aggressive content filter and 422 content_filter errors

    The context describes Azure OpenAI’s content filtering system:

    • Prompts and completions are run through an ensemble of classification models to detect potentially harmful content.
    • If a prompt is flagged, the API returns an error with error.code = "contentFilter" and a message like “Your task failed as a result of our safety system.”
    • It is also possible for the generated output itself to be filtered, in which case the error message is “Generated image was filtered as a result of our safety system” (for images), or a similar contentFilter error for text.
    • The system increases safety but also adds latency.

    Applied to the 422 error:

    • The message "Provider returned an incomplete response (reason: content_filter). The provider's content safety filter blocked the model output." indicates that the model started generating a response, but the content safety system blocked part or all of that output.
    • This can happen even when the input seems benign (e.g., source code) if the model’s output is classified as falling into a restricted category (for example, if the code or comments resemble disallowed content, or if the model attempts to generate something that matches a blocked pattern).

    Actionable steps:

    1. Understand that both input and output are filtered
      • Even if the input is just source code, the model’s attempted completion may contain patterns that trigger filters.
    2. Inspect prompts and partial outputs where possible
      • Log the exact prompt and any partial completion (if available) when a content_filter error occurs to look for patterns (specific libraries, function names, comments, or strings that might resemble disallowed content).
    3. Request content filter policy adjustments for low-risk workloads
      • The context notes that content filtering can be modified for certain lower‑risk use cases to improve performance and reduce unnecessary blocking.
      • For scenarios like internal source-code review where the risk profile is lower, it is possible to request modifications to the default content filtering policies.
    4. Account for filter latency and failures in the client
      • The content filtering system adds latency; for workloads that are very sensitive to latency and are low risk, adjusted policies can help.
      • In the client, treat content_filter as a distinct failure mode and decide whether to:
        • Retry with a more constrained prompt (e.g., ask for a summary instead of full code rewrite), or
        • Fall back to a different workflow (e.g., manual review) when blocked.
    5. Use RAG or system prompts to constrain outputs
      • For code-review scenarios, constrain the model via system prompts to only produce certain types of output (e.g., “Only output a bullet list of issues, no code snippets”), which can reduce the chance of generating content that triggers filters.
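Treating content_filter as a distinct failure mode can be as simple as classifying the error payload before deciding whether to retry. This is an illustrative sketch based on the error shapes shown in this thread; the function and return values are hypothetical, not an Azure API.

```python
import json

TRANSIENT = {429, 500, 502, 503, 504}

def classify_error(payload: str) -> str:
    """Map an error body (JSON string) to a handling strategy:
    'content_filter' -> constrain the prompt or fall back to manual review,
    'transient'      -> retry with backoff,
    'fatal'          -> surface to the caller immediately."""
    err = json.loads(payload).get("error", {})
    if err.get("incompleteReason") == "content_filter" or err.get("code") == "contentFilter":
        return "content_filter"
    if err.get("status") in TRANSIENT or err.get("upstreamStatus") in TRANSIENT:
        return "transient"
    return "fatal"
```

Routing content_filter errors away from the retry path matters because retrying the same prompt against the same filter rarely succeeds and just adds latency.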

    3. Next steps to stabilize the environment

    Given the symptoms across multiple providers/models:

    1. Explicitly set and tune client-side timeouts per model type (reasoning vs non‑reasoning, streaming vs non‑streaming) following the quotas/limits guidance.
    2. Verify that all intermediaries (API gateways, load balancers, proxies) allow long‑running and streaming requests and are not enforcing shorter timeouts.
    3. Separate high-latency/complex workloads from simple chat workloads into different deployments/endpoints.
    4. Monitor Azure Service Health and, if possible, test in an alternate region to rule out regional service issues.
    5. Implement structured logging for all errors, including 503s and content_filter 422s, capturing timestamps, regions, model names, and correlation IDs, and use these when opening a support case.
    6. For the aggressive content filter behavior, review the affected prompts/outputs and consider requesting content filter policy adjustments for low-risk, internal scenarios like source-code analysis.
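The structured logging in step 5 can be sketched as one JSON line per failure. The field names and the helper itself are assumptions for illustration; the point is that every record carries the timestamp, model, region, correlation ID, and status/code details a support case needs.

```python
import datetime
import json

def log_model_error(model, region, correlation_id, error, sink=print):
    """Emit one structured JSON line per model failure, capturing the
    fields a support case needs (timestamp, model, region, correlation
    ID, and status/code details from the error payload)."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,
        "region": region,
        "correlation_id": correlation_id,
        "status": error.get("status"),
        "code": error.get("code"),
        "upstream_status": error.get("upstreamStatus"),
        "message": error.get("message"),
    }
    sink(json.dumps(record))
    return record
```

Emitting one machine-parseable line per error makes it straightforward to aggregate failures by model, region, or error code before opening a support request.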

