Intermittent & Sustained HTTP 429 Responses – Azure OpenAI

Anshika Gupta 0 Reputation points Microsoft Employee
2026-03-27T22:02:29.6666667+00:00

We are experiencing intermittent as well as sustained HTTP 429 (Too Many Requests) responses when using Azure OpenAI models deployed via Azure AI Foundry, even though our configured Tokens‑Per‑Minute (TPM) and Requests‑Per‑Minute (RPM) limits appear sufficient for the workload.

 

This behavior is observed across specific model + region combinations. This issue is currently impacting our ability to reliably run preview and production‑like workflows.

 

Questions for Azure OpenAI / AI Services Support

  1. Are there known scenarios where regional capacity constraints or model‑level throttling can trigger 429 responses even when subscription‑level TPM/RPM quotas are not exhausted?
  2. Are there known regional stability considerations for specific models that we should be aware of?
  3. What deployment or configuration best practices (for example, region strategy or deployment types) are recommended to improve reliability for this workload pattern?

 

Our Goals

  • Stabilize request success rates
  • Understand whether the issue is quota‑, capacity‑, or region‑related
  • Align our deployment strategy with Azure OpenAI best practices for reliability
Azure OpenAI Service

An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.


3 answers

  1. Manas Mohanty 16,190 Reputation points Microsoft External Staff Moderator
    2026-04-13T06:24:58.98+00:00

    Hello Anshika Gupta,

    Hope you found the above insights helpful.

    To emphasize, 429 rate-limit errors occur whenever model usage exceeds the per-minute or per-day quota.

    We suggest that customers:

    1. Increase the TPM of existing deployments (per the screenshot above, many of these 429 rate limits indicate quota limits are already being exhausted).
    2. Use exponential retry and adjust timeouts/thresholds – Reference: Handle rate limits (OpenAI).
    3. Load balance with deployments in other regions.
    4. Optimize agent instructions/prompts to keep the output token count within a certain word budget.
    5. Pre-estimate token counts with the tiktoken library and adjust instructions accordingly to save tokens and avoid rate limits.
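    As a sketch of point 5, you can estimate a request's token footprint client-side before sending it (a minimal sketch; the TPM limit and the ~4 chars/token fallback heuristic are illustrative assumptions, not the service's exact estimator):

```python
# Sketch: pre-estimate a request's token footprint so the client can
# throttle itself instead of hitting a 429. Values are illustrative.
try:
    import tiktoken  # pip install tiktoken
    def count_tokens(text: str) -> int:
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
except ImportError:
    def count_tokens(text: str) -> int:
        return max(1, len(text) // 4)  # rough ~4 chars/token fallback

def estimated_request_tokens(prompt: str, max_tokens: int) -> int:
    """Prompt tokens plus the completion tokens you are asking for."""
    return count_tokens(prompt) + max_tokens

TPM_LIMIT = 30_000  # assumed deployment quota, for illustration
prompt = "Summarize this incident report in three bullet points: ..."
cost = estimated_request_tokens(prompt, max_tokens=512)
print(f"~{cost} tokens/call; ~{TPM_LIMIT // cost} calls/minute before throttling")
```

    Shrinking the instruction or lowering max_tokens directly lowers the estimate, which is what keeps you under the per-minute counter.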

    Regarding Regional loads

    East US 2 and Sweden Central are in-demand regions and see occasional load spikes due to accumulated long prompts and tool invocations.

    Reference thread:

    https://learn.microsoft.com/en-us/answers/questions/1851574/resolving-429-errors-in-azure-openai-due-to-rate-l

    I have also tried to reach you through a private message regarding a review of the same.

    Thank you.


  2. SRILAKSHMI C 16,975 Reputation points Microsoft External Staff Moderator
    2026-04-05T02:56:18.0233333+00:00

    Hello Anshika Gupta,

    Thanks for laying this out clearly.

    What you’re seeing is a known and expected behavior pattern in Azure OpenAI, and it can definitely happen even when your subscription-level TPM/RPM looks underutilized.

    Let me consolidate everything into a clear explanation and actionable guidance.

    Why you’re getting 429s despite sufficient TPM/RPM

    Yes, 429 Too Many Requests can occur even when your quota isn’t fully consumed.

    What’s actually happening

    Azure OpenAI enforces limits at multiple layers, not just what you see in quota:

    1. Regional & model-level capacity constraints

    Capacity is tied to specific model + region combinations

    Even if your subscription has high TPM:

    • That region may be saturated
    • That model may be under heavy demand

    so you can hit 429 before your quota is reached.

    2. Transient scaling delays

    Azure OpenAI scales dynamically.

    If your traffic suddenly spikes:

    • Backend may not scale instantly
    • You’ll see temporary 429s until capacity catches up
    3. Per-deployment throttling

    Even within your quota:

    • Each deployment has its own throughput ceiling
    • Backend clusters also enforce limits

    So subscription quota does not equal guaranteed usable throughput at any given moment.

    4. Burst traffic patterns

    Even if total TPM is within limits:

    • Sending requests in short bursts
    • High concurrency spikes

    can trigger 429s due to instantaneous load, not average usage.

    5. Provisioned deployment saturation

    If you’re using provisioned throughput, once PTUs hit ~100% utilization, you’ll get 429s until usage drops.

    6. Request configuration impact

    Large requests can silently consume quota faster:

    • High max_tokens
    • Using best_of

    These increase estimated token usage and can push you into throttling.

    7. Preview / newer model behavior
    • Preview models often have:
      • Lower capacity
      • Higher contention
    • More prone to throttling

    Regional stability considerations

    Yes, regions behave differently.

    Not all regions have equal capacity for every model

    High-demand regions (e.g., East US 2) can throttle more

    Some regions may experience:

    • Temporary saturation
    • Maintenance events

    If the same workload works in another region, it’s almost always capacity-related, not code-related.

    What you should do

    1. Adopt a multi-region strategy

    Deploy the same model in 2+ regions and route traffic via failover or load balancing.

    This avoids regional hot spots.

    2. Use multiple deployments

    Instead of one large deployment, split the load across multiple deployments. This reduces per-deployment throttling.

    3. Implement retry with exponential backoff and jitter

    429 handling is required for production.

    Best practice:

    • Respect retry-after / retry-after-ms
    • Use exponential backoff (e.g., 1s → 2s → 4s)
    4. Smooth traffic
    • Queue requests
    • Rate-limit concurrency

    Prevents sudden spikes that trigger throttling

    5. Right-size your requests
    • Reduce max_tokens if not needed
    • Avoid best_of unless necessary

    Frees up token capacity

    6. Consider Provisioned throughput
    • Guarantees allocated capacity
    • Provides predictable behavior
    • Still returns 429 when fully utilized
    7. Monitor the right metrics

    Use Azure Monitor:

    • Request rate
    • Token usage
    • Throttled requests
    • Latency spikes
    • Provisioned-Managed Utilization
    8. Scale proactively

    If you expect growth, increase TPM ahead of time. Don’t wait until throttling starts.
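    A minimal retry sketch for step 3 (the RateLimitError wrapper is a hypothetical stand-in for catching a 429 from whatever client you use; the official SDKs also ship their own retry policies):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429; retry_after carries the server's hint (seconds)."""
    def __init__(self, retry_after=None):
        self.retry_after = retry_after

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` on 429 with exponential backoff (1s -> 2s -> 4s ...) plus
    jitter, preferring the service's Retry-After hint when it is present."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            delay = e.retry_after if e.retry_after is not None else base_delay * 2 ** attempt
            time.sleep(delay + random.uniform(0, 0.5))  # jitter avoids thundering herds
```

    With a real HTTP client you would populate retry_after from the Retry-After (seconds) or retry-after-ms (milliseconds) response headers of the 429.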

    How to confirm the root cause

    You can validate quickly:

    • Retry succeeds after a delay → capacity issue
    • Same workload fails only in one region → regional constraint
    • Works in another deployment → deployment-level throttling

    Please refer to these resources:

    Azure OpenAI quotas and limits (Tokens-Per-Minute, Requests-Per-Minute) https://learn.microsoft.com/azure/ai-foundry/openai/quotas-limits?tabs=REST

    Provisioned deployments in Azure AI Foundry Models https://learn.microsoft.com/azure/ai-foundry/openai/how-to/provisioned-get-started?tabs=REST

    Model availability by region https://learn.microsoft.com/azure/ai-foundry/openai/concepts/models#model-summary-table-and-region-availability

    General best practices to remain within rate limits https://learn.microsoft.com/azure/ai-services/openai/quotas-limits#general-best-practices-to-remain-within-rate-limits

    I hope this helps. Do let me know if you have any further queries.

    Thank you!


  3. Q&A Assist
    2026-03-27T22:02:49.4066667+00:00
    AI generated content. This answer was created with AI from Q&A Assist.
    1. Scenarios where 429 occurs without exhausting subscription TPM/RPM

    Yes. 429 responses can occur even when subscription‑level TPM/RPM quotas appear sufficient, due to how rate limiting and capacity are enforced:

    • Per‑deployment TPM/RPM estimation:
      • Each request is evaluated using an estimated max processed‑token count that includes:
        1. Prompt tokens
        2. max_tokens
        3. best_of
      • This estimate is added to a running per‑deployment token counter that resets every minute. If the estimated TPM is reached at any point in that minute, further requests receive HTTP 429 until the counter resets, even if subscription‑level quota is not fully used.
      • Because the estimate is approximate and based on character count, rate limiting can trigger earlier than expected compared to exact token counts.
    • RPM burst behavior:
      • RPM limits assume requests are evenly distributed over the minute.
      • Azure OpenAI evaluates request rate over short windows (typically 1 or 10 seconds). If the number of requests in that window exceeds what is expected for the configured RPM, 429s are returned until the next evaluation period.
      • This means bursty traffic can see 429s even when the one‑minute average is below the RPM limit.
    • Provisioned throughput capacity limits:
      • For provisioned deployments, when allocated capacity is exceeded, the API returns HTTP 429 as a fast‑fail signal. This is by design and continues until utilization drops below 100%.
      • This is independent of subscription‑level quota; it reflects the capacity of a specific provisioned deployment.
    • System high demand / regional capacity:
      • 429 can also occur when “the system is experiencing high demand and cannot process your request due to capacity or latency limits,” even if configured quota is not exhausted. In this case, retrying after the suggested time or using a different deployment/region is recommended.
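    The per-deployment, per-minute estimation described above can be illustrated with a small sketch (the TPM value and the ~4 chars/token heuristic are assumptions for illustration; the real service uses its own character-based estimate and resets the counter every minute):

```python
# Sketch: a per-minute token counter like the one kept per deployment.
# Each request is charged its *estimated* footprint up front;
# once the minute's budget is exhausted, further requests get 429.
class TokenBudget:
    def __init__(self, tpm_limit):
        self.tpm_limit = tpm_limit
        self.used = 0  # the real counter resets every minute

    def try_admit(self, prompt_chars, max_tokens, best_of=1):
        # Rough character-based estimate (~4 chars/token assumed).
        estimate = prompt_chars // 4 + max_tokens * best_of
        if self.used + estimate > self.tpm_limit:
            return False  # the service would answer HTTP 429 here
        self.used += estimate
        return True

budget = TokenBudget(tpm_limit=10_000)
print(budget.try_admit(2_000, 4_000))  # True  (estimate 4_500, used 4_500)
print(budget.try_admit(2_000, 4_000))  # True  (used 9_000)
print(budget.try_admit(2_000, 4_000))  # False (would exceed 10_000 -> 429)
```

    Note that max_tokens and best_of are charged whether or not the completion actually uses them, which is why oversized values trigger throttling early.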
    2. Regional stability and model‑level considerations

    From the documented behavior:

    • 429s can be triggered when “the system is experiencing high demand,” which is effectively a regional/model capacity condition.
    • Capacity availability is region‑specific and dynamic. Some regions or model SKUs may have tighter capacity at certain times, which can manifest as intermittent or sustained 429s for particular model+region combinations.
    • Provisioned throughput deployments explicitly surface capacity pressure via 429 when utilization reaches 100%.
    • The guidance includes:
      • Attempt deployment or traffic at a different time when capacity may be more available.
      • Ensure quota is available in all acceptable regions and use the model capacities API and Foundry experience to discover alternative regions.

    While the context does not list specific “unstable” regions or models, it clearly indicates that:

    • 429s can be due to regional capacity/high demand even when quotas look sufficient.
    • Behavior can vary by region and deployment type (standard vs provisioned throughput).
    3. Deployment and configuration best practices to improve reliability

    To stabilize success rates and align with best practices, the following patterns are recommended in the documentation:

    a. Tune request parameters to reduce rate‑limit pressure

    • Set max_tokens and best_of to the minimum values that meet scenario needs.
    • Avoid unnecessarily large max_tokens when responses are expected to be small.
    • This directly reduces the estimated TPM per request and lowers the chance of 429.

    b. Smooth traffic and implement robust retry logic

    • Avoid sharp changes in workload; increase load gradually.
    • Distribute requests evenly over time to match RPM expectations.
    • Implement retry logic with exponential backoff, honoring Retry-After / retry-after-ms headers.
    • For provisioned deployments, client‑side retries can maximize throughput per PTU if higher per‑call latency is acceptable.
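    Honoring both headers from section b might look like this (a sketch; the header names are as documented for the service, while the lowercase-keyed dict is a simplifying assumption, since real HTTP headers are case-insensitive):

```python
# Sketch: derive a sleep duration from a 429 response's headers.
# retry-after-ms is milliseconds and is checked first; Retry-After is seconds.
def backoff_seconds(headers, attempt, base=1.0):
    if "retry-after-ms" in headers:
        return int(headers["retry-after-ms"]) / 1000.0
    if "retry-after" in headers:
        return float(headers["retry-after"])
    return base * 2 ** attempt  # no hint: fall back to exponential backoff

print(backoff_seconds({"retry-after-ms": "1500"}, attempt=0))  # 1.5
print(backoff_seconds({"retry-after": "2"}, attempt=0))        # 2.0
print(backoff_seconds({}, attempt=3))                          # 8.0
```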

    c. Use appropriate deployment types and region strategy

    • For workloads needing higher reliability and more predictable latency, consider provisioned throughput instead of Standard-only deployments:
      • Provisioned deployments provide dedicated capacity and consistent processing time.
      • When capacity is exceeded, 429 is returned quickly so traffic can be redirected or retried.
    • Use multiple regions and/or multiple deployments:
      • Ensure quota is available in more than one acceptable region.
      • Use the model capacities API and Foundry experience to identify regions with available capacity.
      • Implement a fallback strategy to redirect traffic to alternate deployments/regions when 429s persist.
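    The fallback strategy in item c can be sketched as follows (the deployment names and the send callables are hypothetical placeholders for real per-region clients):

```python
# Sketch: try a primary deployment, fall back to alternates on 429.
# `deployments` would be real per-region clients in practice.
class Throttled(Exception):
    """Stands in for an HTTP 429 from one deployment."""

def call_with_fallback(deployments, request):
    last_error = None
    for name, send in deployments:
        try:
            return name, send(request)
        except Throttled as e:
            last_error = e  # this region/deployment is saturated; try the next
    raise last_error  # every deployment returned 429

def saturated(req):
    raise Throttled()

def healthy(req):
    return f"completion for {req!r}"

name, result = call_with_fallback(
    [("eastus2", saturated), ("swedencentral", healthy)],
    "hello",
)
print(name, result)
```

    In production you would combine this with the per-deployment retry logic, so a transient 429 retries in place while a sustained one redirects traffic.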

    d. Monitor utilization and rate‑limit behavior

    • For provisioned deployments, monitor Provisioned‑Managed Utilization V2 in Azure Monitor to see when utilization approaches or exceeds 100%.
    • Use metrics and logs to correlate 429 spikes with:
      • Traffic bursts
      • Changes in max_tokens / best_of
      • Specific regions or deployments

    e. General rate‑limit best practices

    • Use quota management to increase TPM on high‑traffic deployments and reduce TPM where needs are limited.
    • Test different load increase patterns and validate behavior under production‑like conditions.

    Summary mapped to goals

    • Stabilize request success rates: Reduce per‑request token estimates, smooth traffic, implement exponential‑backoff retries, and consider provisioned throughput plus multi‑region failover.
    • Understand quota vs capacity vs region: Use TPM/RPM configuration, utilization metrics, and 429 patterns (plus Retry-After headers) to distinguish between per‑deployment rate limits, provisioned capacity saturation, and regional high demand.
    • Align with Azure OpenAI reliability best practices: Follow rate‑limit tuning guidance, use quota management, adopt provisioned throughput where appropriate, and architect with multiple regions/deployments and robust retry/redirect logic.

