The described behavior aligns with two main areas covered in the context: timeouts/instability and increasingly frequent content-filter blocks.
1. Timeouts, slow responses, and 503 errors
From the context, there are several relevant points:
- Client-side timeouts vs service latency
- For non-reasoning models, guidance is:
- Streaming: set client-side timeout up to 60 seconds.
- Non‑streaming: up to 29 minutes.
- For reasoning models, requests (including streaming) can take up to 29 minutes because all reasoning tokens are generated before the first token is returned.
- Long-running calls can be canceled by client-side timeouts if those timeouts are set too low.
- Explicitly set client-side timeouts according to the guidance for each workload and model type, and tune them to traffic patterns.
- For long or complex prompts, increase the timeout above 60 seconds if not using streaming, or ensure the HTTP client/proxy in front of the app is not enforcing a shorter timeout than the SDK.
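As a minimal sketch of that timeout guidance, assuming the OpenAI Python SDK is used against an Azure endpoint (the endpoint, key, API version, and deployment names below are placeholders):

```python
import httpx
from openai import AzureOpenAI

ENDPOINT = "https://<your-resource>.openai.azure.com"  # placeholder
API_VERSION = "2024-06-01"  # assumption: substitute a current API version

# Non-reasoning, streaming workloads: keep the read timeout around 60 seconds.
streaming_client = AzureOpenAI(
    azure_endpoint=ENDPOINT,
    api_key="<api-key>",
    api_version=API_VERSION,
    timeout=httpx.Timeout(connect=10.0, read=60.0, write=10.0, pool=10.0),
)

# Non-streaming or reasoning-model workloads: allow up to 29 minutes.
long_running_client = AzureOpenAI(
    azure_endpoint=ENDPOINT,
    api_key="<api-key>",
    api_version=API_VERSION,
    timeout=httpx.Timeout(connect=10.0, read=29 * 60.0, write=10.0, pool=10.0),
)
```

Any HTTP client or proxy in front of the app still needs its own timeout raised to at least these values, otherwise it will cancel the request first.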
- Use streaming to avoid client/proxy timeouts
- Streaming is recommended when calls take a long time to process, because many clients and intermediary layers have their own timeouts.
- Streaming ensures incremental data is received, improving perceived latency and reducing the chance that intermediaries cancel the request before any data is seen.
- If a streaming request still hangs or is cut off, the context points to two likely causes:
- The model is still generating but is slow (large context, complex reasoning, or cache misses), or
- Some intermediary (gateway, proxy, or client) is interrupting the stream.
- Confirm that the HTTP client, reverse proxies, and any API gateways allow long‑lived streaming connections and do not buffer or cut them off early.
- If streaming is enabled but still timing out, increase the upstream timeout in the gateway/load balancer to be at least as large as the model guidance (or the app’s own timeout, whichever is smaller).
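A streaming call with the same SDK looks roughly like the sketch below (the deployment name is a placeholder; with Azure content filtering enabled, some chunks can arrive without choices, so the loop guards for that):

```python
# `streaming_client` is the client configured with the ~60 s timeout above.
stream = streaming_client.chat.completions.create(
    model="<chat-deployment-name>",  # Azure deployment name, not the base model name
    messages=[{"role": "user", "content": "tell me a joke"}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue  # e.g., chunks carrying only content-filter metadata
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```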
- Service-side issues and region-specific instability
- The context shows that sometimes issues are region-specific and not clearly surfaced in error messages. A deployment that failed repeatedly in one region succeeded immediately when moved to another region, even though the error did not mention a regional problem.
- Another example shows Document Intelligence requests timing out both from code and from the Studio, indicating a service-side or regional issue rather than client code.
- Check Azure Status and the Service Health blade in the Azure portal for the regions where the Foundry models are deployed to see if there are incidents or degraded performance.
- If possible, test the same prompts against a deployment in a different region to see if the behavior is regional.
- If the same simple prompts (like “tell me a joke”) consistently time out across multiple models and regions, capture correlation IDs and timestamps and open a support request so the service team can investigate upstream 503/timeout patterns.
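One rough way to check whether the behavior is regional is to run the same simple prompt against deployments in two regions and record latency and error details; the resource names and regions below are hypothetical:

```python
import time
from openai import AzureOpenAI

regions = {
    "eastus": "https://<resource-eastus>.openai.azure.com",
    "swedencentral": "https://<resource-sweden>.openai.azure.com",
}

for region, endpoint in regions.items():
    client = AzureOpenAI(
        azure_endpoint=endpoint,
        api_key="<api-key>",
        api_version="2024-06-01",
        timeout=60.0,
    )
    start = time.monotonic()
    try:
        client.chat.completions.create(
            model="<chat-deployment-name>",
            messages=[{"role": "user", "content": "tell me a joke"}],
        )
        print(f"{region}: ok in {time.monotonic() - start:.1f}s")
    except Exception as exc:  # keep the error class and timing per region for support
        print(f"{region}: failed after {time.monotonic() - start:.1f}s: {exc!r}")
```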
- HTTP 429 and 5xx handling
- For rate limits (429), the guidance is to implement retry with exponential backoff and respect the `Retry-After` header.
- For 5xx errors like the 503 “upstream connect error or disconnect/reset before headers. reset reason: connection timeout”, the upstream service did not respond in time.
- Ensure robust retry logic for transient 5xx errors with backoff and jitter (a sketch follows this list).
- Log and surface `upstreamStatus`, `upstreamResponse`, and any correlation IDs to support for deeper analysis.
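A minimal retry sketch along those lines is shown below. The exception classes come from the OpenAI Python SDK; reading `Retry-After` off the error's response is an assumption about what the service returns, so the code degrades to plain exponential backoff when the header is absent:

```python
import random
import time

from openai import APIConnectionError, APIStatusError, RateLimitError


def call_with_retries(make_request, max_attempts=5):
    """Retry transient 429/5xx failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return make_request()
        except RateLimitError as exc:
            if attempt == max_attempts - 1:
                raise
            # Prefer the service's Retry-After hint when it is present.
            retry_after = exc.response.headers.get("Retry-After")
            delay = float(retry_after) if retry_after else 2 ** attempt
        except (APIStatusError, APIConnectionError) as exc:
            status = getattr(exc, "status_code", None)
            if status is not None and status < 500:
                raise  # other 4xx errors are not transient
            if attempt == max_attempts - 1:
                raise
            delay = 2 ** attempt
        time.sleep(delay + random.uniform(0, 1))  # jitter spreads out retry bursts
```

Usage wraps the actual call, e.g. `call_with_retries(lambda: client.chat.completions.create(...))`.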
- Performance tuning
- Separate workloads: mixing very different workloads on the same endpoint can hurt latency due to batching and cache contention.
- Prompt size and generation size both affect latency; large prompts and large `max_tokens` values will increase response time.
- Batching can reduce the number of requests and sometimes improve overall throughput.
- Use separate deployments/endpoints for very different workloads (e.g., short chat vs long code analysis) to avoid cross‑impact on latency.
- Review prompt sizes and generation lengths for the slowest workloads and reduce where possible.
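As a sketch with hypothetical deployment names, one way to express that separation is a small per-workload routing table that also caps `max_tokens`:

```python
# Hypothetical deployments: short chat and long code analysis kept apart.
WORKLOADS = {
    "chat": {"deployment": "chat-short", "max_tokens": 512},
    "code_analysis": {"deployment": "code-review-long", "max_tokens": 4096},
}


def run(client, workload, messages):
    cfg = WORKLOADS[workload]
    return client.chat.completions.create(
        model=cfg["deployment"],       # separate deployments avoid cross-impact on latency
        messages=messages,
        max_tokens=cfg["max_tokens"],  # bound generation length per workload
    )
```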
2. Aggressive content filter and 422 content_filter errors
The context describes Azure OpenAI’s content filtering system:
- Prompts and completions are run through an ensemble of classification models to detect potentially harmful content.
- If a prompt is flagged, the API returns an error with `error.code = "contentFilter"` and a message like “Your task failed as a result of our safety system.”
- It is also possible for the generated output itself to be filtered, in which case the error message is “Generated image was filtered as a result of our safety system” (for images), or a similar `contentFilter` error for text.
- The system increases safety but also adds latency.
Applied to the 422 error:
- The message "Provider returned an incomplete response (reason: content_filter). The provider's content safety filter blocked the model output." indicates that the model started generating a response, but the content safety system blocked part or all of that output.
- This can happen even when the input seems benign (e.g., source code) if the model’s output is classified as falling into a restricted category (for example, if the code or comments resemble disallowed content, or if the model attempts to generate something that matches a blocked pattern).
Actionable steps:
- Understand that both input and output are filtered
- Even if the input is just source code, the model’s attempted completion may contain patterns that trigger filters.
- Inspect prompts and partial outputs where possible
- Log the exact prompt and any partial completion (if available) when a `content_filter` error occurs to look for patterns (specific libraries, function names, comments, or strings that might resemble disallowed content).
- Request content filter policy adjustments for low-risk workloads
- The context notes that content filtering can be modified for certain lower‑risk use cases to improve performance and reduce unnecessary blocking.
- For scenarios like internal source-code review where the risk profile is lower, it is possible to request modifications to the default content filtering policies.
- Account for filter latency and failures in the client
- The content filtering system adds latency; for workloads that are very sensitive to latency and are low risk, adjusted policies can help.
- In the client, treat `content_filter` as a distinct failure mode (see the sketch after this list) and decide whether to:
- Retry with a more constrained prompt (e.g., ask for a summary instead of a full code rewrite), or
- Fall back to a different workflow (e.g., manual review) when blocked.
- Use RAG or system prompts to constrain outputs
- For code-review scenarios, constrain the model via system prompts to only produce certain types of output (e.g., “Only output a bullet list of issues, no code snippets”), which can reduce the chance of generating content that triggers filters.
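Putting the last few points together, a hedged sketch of client-side handling is shown below: it catches a prompt-level block (surfaced as an error by the SDK), detects an output-level block via `finish_reason`, logs the details, and retries once with a more constrained system prompt before falling back to manual review. Error shapes vary by provider, so treat this as illustrative rather than definitive:

```python
import logging

from openai import BadRequestError

logger = logging.getLogger("content_filter")

CONSTRAINED_SYSTEM_PROMPT = (
    "Only output a bullet list of issues found in the code. "
    "Do not reproduce code snippets."
)


def review_code(client, deployment, source_code):
    messages = [{"role": "user", "content": f"Review this code:\n{source_code}"}]
    try:
        response = client.chat.completions.create(model=deployment, messages=messages)
    except BadRequestError as exc:
        # Prompt-level block: the safety system rejected the input itself.
        logger.warning("prompt blocked by content filter: %s", exc)
        return None  # fall back to manual review

    choice = response.choices[0]
    if choice.finish_reason == "content_filter":
        # Output-level block: retry once with a constrained system prompt.
        logger.warning("output blocked; retrying with constrained prompt")
        retry = client.chat.completions.create(
            model=deployment,
            messages=[{"role": "system", "content": CONSTRAINED_SYSTEM_PROMPT}, *messages],
        )
        if retry.choices[0].finish_reason == "content_filter":
            return None  # still blocked: hand off to manual review
        return retry.choices[0].message.content

    return choice.message.content
```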
3. Next steps to stabilize the environment
Given the symptoms across multiple providers/models:
- Explicitly set and tune client-side timeouts per model type (reasoning vs non‑reasoning, streaming vs non‑streaming) following the quotas/limits guidance.
- Verify that all intermediaries (API gateways, load balancers, proxies) allow long‑running and streaming requests and are not enforcing shorter timeouts.
- Separate high-latency/complex workloads from simple chat workloads into different deployments/endpoints.
- Monitor Azure Service Health and, if possible, test in an alternate region to rule out regional service issues.
- Implement structured logging for all errors, including 503s and `content_filter` 422s, capturing timestamps, regions, model names, and correlation IDs, and use these when opening a support case.
- For the aggressive content filter behavior, review the affected prompts/outputs and consider requesting content filter policy adjustments for low-risk, internal scenarios like source-code analysis.
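A minimal structured-logging sketch for failed calls might look like the following; the `apim-request-id` correlation header is an assumption and should be adjusted to whatever your gateway and the service actually return:

```python
import datetime
import json
import logging

from openai import APIStatusError

logger = logging.getLogger("foundry_errors")


def log_failed_call(exc: APIStatusError, region: str, deployment: str):
    """Emit one JSON line per failure so a support case can reference exact requests."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "region": region,
        "deployment": deployment,
        "status_code": exc.status_code,
        # Assumed correlation header name; verify against real responses.
        "correlation_id": exc.response.headers.get("apim-request-id"),
        "error_code": getattr(exc, "code", None),
        "message": str(exc),
    }
    logger.error(json.dumps(record))
```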
References:
- Azure OpenAI frequently asked questions
- Microsoft Foundry Models quotas and limits
- Performance and latency
- Azure OpenAI image generation models (PowerShell, C#, JavaScript, and REST API)
- Document Intelligence classification and extraction requests timing out - Microsoft Q&A
- Azure Deployment Takes 70 Minutes to Fail with Obscure Error - Microsoft Q&A