Intermittent & Sustained HTTP 429 Responses – Azure OpenAI

Anshika Gupta 0 Reputation points Microsoft Employee
2026-03-27T22:02:29.6666667+00:00

We are experiencing intermittent as well as sustained HTTP 429 (Too Many Requests) responses when using Azure OpenAI models deployed via Azure AI Foundry, even though our configured Tokens‑Per‑Minute (TPM) and Requests‑Per‑Minute (RPM) limits appear sufficient for the workload.

 

This behavior is observed across specific model + region combinations. This issue is currently impacting our ability to reliably run preview and production‑like workflows.

 

Questions for Azure OpenAI / AI Services Support

  1. Are there known scenarios where regional capacity constraints or model‑level throttling can trigger 429 responses even when subscription‑level TPM/RPM quotas are not exhausted?
  2. Are there known regional stability considerations for specific models that we should be aware of?
  3. What deployment or configuration best practices (for example, region strategy or deployment types) are recommended to improve reliability for this workload pattern?

 

Our Goals

  • Stabilize request success rates
  • Understand whether the issue is quota‑, capacity‑, or region‑related
  • Align our deployment strategy with Azure OpenAI best practices for reliability
Azure OpenAI Service

An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.


3 answers

  1. Manas Mohanty 16,190 Reputation points Microsoft External Staff Moderator
    2026-04-13T06:24:58.98+00:00

    Hello Anshika Gupta,

    Hope you found the above insights helpful.

    To emphasize, 429 rate-limit errors occur whenever model usage exceeds the per-minute or per-day quota.

    We suggest that customers:

    1. Increase the TPM of existing deployments (per the screenshot above, many of these 429 rate limits indicate quota limits are already being exhausted).
    2. Use exponential retry and adjust timeouts/thresholds – Reference: Handle rate limits (OpenAI).
    3. Load balance with deployments in other regions.
    4. Optimize agent instructions/prompts to keep the output token count within a certain word budget.
    5. Pre-estimate token counts with the tiktoken library and adjust instructions accordingly to save tokens and avoid rate limits.
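    As a sketch of point 5, you can estimate a request's token footprint client-side before sending it (a minimal sketch; the TPM limit and the ~4 chars/token fallback heuristic are illustrative assumptions, not the service's exact estimator):

```python
# Sketch: pre-estimate a request's token footprint so the client can
# throttle itself instead of hitting a 429. Values are illustrative.
try:
    import tiktoken  # pip install tiktoken
    def count_tokens(text: str) -> int:
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
except ImportError:
    def count_tokens(text: str) -> int:
        return max(1, len(text) // 4)  # rough ~4 chars/token fallback

def estimated_request_tokens(prompt: str, max_tokens: int) -> int:
    """Prompt tokens plus the completion tokens you are asking for."""
    return count_tokens(prompt) + max_tokens

TPM_LIMIT = 30_000  # assumed deployment quota, for illustration
prompt = "Summarize this incident report in three bullet points: ..."
cost = estimated_request_tokens(prompt, max_tokens=512)
print(f"~{cost} tokens/call; ~{TPM_LIMIT // cost} calls/minute before throttling")
```

    Shrinking the instruction or lowering max_tokens directly lowers the estimate, which is what keeps you under the per-minute counter.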

    Regarding Regional loads

    East US 2 and Sweden Central are in-demand regions and see occasional load spikes due to accumulated long prompts and tool invocations.

    Reference thread:

    https://learn.microsoft.com/en-us/answers/questions/1851574/resolving-429-errors-in-azure-openai-due-to-rate-l

    I have also tried to reach you through a private message regarding a review of the same.

    Thank you.


  2. SRILAKSHMI C 16,975 Reputation points Microsoft External Staff Moderator
    2026-04-05T02:56:18.0233333+00:00

    Hello Anshika Gupta,

    Thanks for laying this out clearly.

    What you’re seeing is a known and expected behavior pattern in Azure OpenAI, and it can definitely happen even when your subscription-level TPM/RPM looks underutilized.

    Let me consolidate everything into a clear explanation and actionable guidance.

    Why you’re getting 429s despite sufficient TPM/RPM

    Yes, 429 Too Many Requests can occur even when your quota isn’t fully consumed.

    What’s actually happening

    Azure OpenAI enforces limits at multiple layers, not just what you see in quota:

    1. Regional & model-level capacity constraints

    Capacity is tied to specific model + region combinations

    Even if your subscription has high TPM:

    • That region may be saturated
    • That model may be under heavy demand

    so you can hit 429 before your quota is reached.

    2. Transient scaling delays

    Azure OpenAI scales dynamically.

    If your traffic suddenly spikes:

    • Backend may not scale instantly
    • You’ll see temporary 429s until capacity catches up
    3. Per-deployment throttling

    Even within your quota:

    • Each deployment has its own throughput ceiling
    • Backend clusters also enforce limits

    So subscription quota does not equal guaranteed usable throughput at any given moment.

    4. Burst traffic patterns

    Even if total TPM is within limits:

    • Sending requests in short bursts
    • High concurrency spikes

    can trigger 429s due to instantaneous load, not average usage.

    5. Provisioned deployment saturation

    If you’re using provisioned throughput, once PTUs hit ~100% utilization, you’ll get 429s until usage drops.

    6. Request configuration impact

    Large requests can silently consume quota faster:

    • High max_tokens
    • Using best_of

    These increase estimated token usage and can push you into throttling.

    7. Preview / newer model behavior
    • Preview models often have:
      • Lower capacity
      • Higher contention
    • More prone to throttling

    Regional stability considerations

    Yes, regions behave differently.

    Not all regions have equal capacity for every model

    High-demand regions (e.g., East US 2) can throttle more

    Some regions may experience:

    • Temporary saturation
    • Maintenance events

    If the same workload works in another region, it’s almost always capacity-related, not code-related.

    What you should do

    1. Adopt a multi-region strategy

    Deploy the same model in 2+ regions and route traffic via failover or load balancing.

    This avoids regional hot spots.

    2. Use multiple deployments

    Instead of one large deployment, split the load across multiple deployments. This reduces per-deployment throttling.

    3. Implement retry with exponential backoff and jitter

    429 handling is required for production.

    Best practice:

    • Respect retry-after / retry-after-ms
    • Use exponential backoff (e.g., 1s → 2s → 4s)
    4. Smooth traffic
    • Queue requests
    • Rate-limit concurrency

    Prevents sudden spikes that trigger throttling

    5. Right-size your requests
    • Reduce max_tokens if not needed
    • Avoid best_of unless necessary

    Frees up token capacity

    6. Consider Provisioned throughput
    • Guarantees allocated capacity
    • Provides predictable behavior
    • Still returns 429 when fully utilized
    7. Monitor the right metrics

    Use Azure Monitor:

    • Request rate
    • Token usage
    • Throttled requests
    • Latency spikes
    • Provisioned-Managed Utilization
    8. Scale proactively

    If you expect growth, increase TPM ahead of time. Don’t wait until throttling starts.
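    A minimal retry sketch for step 3 (the RateLimitError wrapper is a hypothetical stand-in for catching a 429 from whatever client you use; the official SDKs also ship their own retry policies):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429; retry_after carries the server's hint (seconds)."""
    def __init__(self, retry_after=None):
        self.retry_after = retry_after

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` on 429 with exponential backoff (1s -> 2s -> 4s ...) plus
    jitter, preferring the service's Retry-After hint when it is present."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            delay = e.retry_after if e.retry_after is not None else base_delay * 2 ** attempt
            time.sleep(delay + random.uniform(0, 0.5))  # jitter avoids thundering herds
```

    With a real HTTP client you would populate retry_after from the Retry-After (seconds) or retry-after-ms (milliseconds) response headers of the 429.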

    How to confirm the root cause

    You can validate quickly:

    • Retry succeeds after a delay → capacity issue
    • Same workload fails only in one region → regional constraint
    • Works in another deployment → deployment-level throttling

    Please refer to these resources:

    Azure OpenAI quotas and limits (Tokens-Per-Minute, Requests-Per-Minute) https://learn.microsoft.com/azure/ai-foundry/openai/quotas-limits?tabs=REST

    Provisioned deployments in Azure AI Foundry Models https://learn.microsoft.com/azure/ai-foundry/openai/how-to/provisioned-get-started?tabs=REST

    Model availability by region https://learn.microsoft.com/azure/ai-foundry/openai/concepts/models#model-summary-table-and-region-availability

    General best practices to remain within rate limits https://learn.microsoft.com/azure/ai-services/openai/quotas-limits#general-best-practices-to-remain-within-rate-limits

    I hope this helps. Do let me know if you have any further queries.

    Thank you!


  3. Q&A Assist
    2026-03-27T22:02:49.4066667+00:00
    AI generated content. This answer was created with AI from Q&A Assist.
    1. Scenarios where 429 occurs without exhausting subscription TPM/RPM

    Yes. 429 responses can occur even when subscription‑level TPM/RPM quotas appear sufficient, due to how rate limiting and capacity are enforced:

    • Per‑deployment TPM/RPM estimation:
      • Each request is evaluated using an estimated max processed‑token count that includes:
        1. Prompt tokens
        2. max_tokens
        3. best_of
      • This estimate is added to a running per‑deployment token counter that resets every minute. If the estimated TPM is reached at any point in that minute, further requests receive HTTP 429 until the counter resets, even if subscription‑level quota is not fully used.
      • Because the estimate is approximate and based on character count, rate limiting can trigger earlier than expected compared to exact token counts.
    • RPM burst behavior:
      • RPM limits assume requests are evenly distributed over the minute.
      • Azure OpenAI evaluates request rate over short windows (typically 1 or 10 seconds). If the number of requests in that window exceeds what is expected for the configured RPM, 429s are returned until the next evaluation period.
      • This means bursty traffic can see 429s even when the one‑minute average is below the RPM limit.
    • Provisioned throughput capacity limits:
      • For provisioned deployments, when allocated capacity is exceeded, the API returns HTTP 429 as a fast‑fail signal. This is by design and continues until utilization drops below 100%.
      • This is independent of subscription‑level quota; it reflects the capacity of a specific provisioned deployment.
    • System high demand / regional capacity:
      • 429 can also occur when “the system is experiencing high demand and cannot process your request due to capacity or latency limits,” even if configured quota is not exhausted. In this case, retrying after the suggested time or using a different deployment/region is recommended.
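    The per-deployment, per-minute estimation described above can be illustrated with a small sketch (the TPM value and the ~4 chars/token heuristic are assumptions for illustration; the real service uses its own character-based estimate and resets the counter every minute):

```python
# Sketch: a per-minute token counter like the one kept per deployment.
# Each request is charged its *estimated* footprint up front;
# once the minute's budget is exhausted, further requests get 429.
class TokenBudget:
    def __init__(self, tpm_limit):
        self.tpm_limit = tpm_limit
        self.used = 0  # the real counter resets every minute

    def try_admit(self, prompt_chars, max_tokens, best_of=1):
        # Rough character-based estimate (~4 chars/token assumed).
        estimate = prompt_chars // 4 + max_tokens * best_of
        if self.used + estimate > self.tpm_limit:
            return False  # the service would answer HTTP 429 here
        self.used += estimate
        return True

budget = TokenBudget(tpm_limit=10_000)
print(budget.try_admit(2_000, 4_000))  # True  (estimate 4_500, used 4_500)
print(budget.try_admit(2_000, 4_000))  # True  (used 9_000)
print(budget.try_admit(2_000, 4_000))  # False (would exceed 10_000 -> 429)
```

    Note that max_tokens and best_of are charged whether or not the completion actually uses them, which is why oversized values trigger throttling early.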
    2. Regional stability and model‑level considerations

    From the documented behavior:

    • 429s can be triggered when “the system is experiencing high demand,” which is effectively a regional/model capacity condition.
    • Capacity availability is region‑specific and dynamic. Some regions or model SKUs may have tighter capacity at certain times, which can manifest as intermittent or sustained 429s for particular model+region combinations.
    • Provisioned throughput deployments explicitly surface capacity pressure via 429 when utilization reaches 100%.
    • The guidance includes:
      • Attempt deployment or traffic at a different time when capacity may be more available.
      • Ensure quota is available in all acceptable regions and use the model capacities API and Foundry experience to discover alternative regions.

    While the context does not list specific “unstable” regions or models, it clearly indicates that:

    • 429s can be due to regional capacity/high demand even when quotas look sufficient.
    • Behavior can vary by region and deployment type (standard vs provisioned throughput).
    3. Deployment and configuration best practices to improve reliability

    To stabilize success rates and align with best practices, the following patterns are recommended in the documentation:

    a. Tune request parameters to reduce rate‑limit pressure

    • Set max_tokens and best_of to the minimum values that meet scenario needs.
    • Avoid unnecessarily large max_tokens when responses are expected to be small.
    • This directly reduces the estimated TPM per request and lowers the chance of 429.

    b. Smooth traffic and implement robust retry logic

    • Avoid sharp changes in workload; increase load gradually.
    • Distribute requests evenly over time to match RPM expectations.
    • Implement retry logic with exponential backoff, honoring Retry-After / retry-after-ms headers.
    • For provisioned deployments, client‑side retries can maximize throughput per PTU if higher per‑call latency is acceptable.
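    Honoring both headers from section b might look like this (a sketch; the header names are as documented for the service, while the lowercase-keyed dict is a simplifying assumption, since real HTTP headers are case-insensitive):

```python
# Sketch: derive a sleep duration from a 429 response's headers.
# retry-after-ms is milliseconds and is checked first; Retry-After is seconds.
def backoff_seconds(headers, attempt, base=1.0):
    if "retry-after-ms" in headers:
        return int(headers["retry-after-ms"]) / 1000.0
    if "retry-after" in headers:
        return float(headers["retry-after"])
    return base * 2 ** attempt  # no hint: fall back to exponential backoff

print(backoff_seconds({"retry-after-ms": "1500"}, attempt=0))  # 1.5
print(backoff_seconds({"retry-after": "2"}, attempt=0))        # 2.0
print(backoff_seconds({}, attempt=3))                          # 8.0
```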

    c. Use appropriate deployment types and region strategy

    • For workloads needing higher reliability and more predictable latency, consider provisioned throughput instead of Standard-only deployments:
      • Provisioned deployments provide dedicated capacity and consistent processing time.
      • When capacity is exceeded, 429 is returned quickly so traffic can be redirected or retried.
    • Use multiple regions and/or multiple deployments:
      • Ensure quota is available in more than one acceptable region.
      • Use the model capacities API and Foundry experience to identify regions with available capacity.
      • Implement a fallback strategy to redirect traffic to alternate deployments/regions when 429s persist.
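    The fallback strategy in item c can be sketched as follows (the deployment names and the send callables are hypothetical placeholders for real per-region clients):

```python
# Sketch: try a primary deployment, fall back to alternates on 429.
# `deployments` would be real per-region clients in practice.
class Throttled(Exception):
    """Stands in for an HTTP 429 from one deployment."""

def call_with_fallback(deployments, request):
    last_error = None
    for name, send in deployments:
        try:
            return name, send(request)
        except Throttled as e:
            last_error = e  # this region/deployment is saturated; try the next
    raise last_error  # every deployment returned 429

def saturated(req):
    raise Throttled()

def healthy(req):
    return f"completion for {req!r}"

name, result = call_with_fallback(
    [("eastus2", saturated), ("swedencentral", healthy)],
    "hello",
)
print(name, result)
```

    In production you would combine this with the per-deployment retry logic, so a transient 429 retries in place while a sustained one redirects traffic.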

    d. Monitor utilization and rate‑limit behavior

    • For provisioned deployments, monitor Provisioned‑Managed Utilization V2 in Azure Monitor to see when utilization approaches or exceeds 100%.
    • Use metrics and logs to correlate 429 spikes with:
      • Traffic bursts
      • Changes in max_tokens / best_of
      • Specific regions or deployments

    e. General rate‑limit best practices

    • Use quota management to increase TPM on high‑traffic deployments and reduce TPM where needs are limited.
    • Test different load increase patterns and validate behavior under production‑like conditions.

    Summary mapped to goals

    • Stabilize request success rates: Reduce per‑request token estimates, smooth traffic, implement exponential‑backoff retries, and consider provisioned throughput plus multi‑region failover.
    • Understand quota vs capacity vs region: Use TPM/RPM configuration, utilization metrics, and 429 patterns (plus Retry-After headers) to distinguish between per‑deployment rate limits, provisioned capacity saturation, and regional high demand.
    • Align with Azure OpenAI reliability best practices: Follow rate‑limit tuning guidance, use quota management, adopt provisioned throughput where appropriate, and architect with multiple regions/deployments and robust retry/redirect logic.

