Hello Anshika Gupta,
Thanks for laying this out clearly. What you're seeing is a known and expected behavior pattern in Azure OpenAI, and it can definitely happen even when your subscription-level TPM/RPM looks underutilized. Let me consolidate everything into a clear explanation and actionable guidance.
Why you’re getting 429s despite sufficient TPM/RPM
Yes, 429 Too Many Requests can occur even when your quota isn’t fully consumed.
What’s actually happening
Azure OpenAI enforces limits at multiple layers, not just what you see in quota:
- Regional & model-level capacity constraints
Capacity is tied to specific model + region combinations.
Even if your subscription has high TPM:
- That region may be saturated
- That model may be under heavy demand
So you can receive 429s before your quota is reached.
- Transient scaling delays
Azure OpenAI scales dynamically.
If your traffic suddenly spikes:
- Backend may not scale instantly
- You’ll see temporary 429s until capacity catches up
- Per-deployment throttling
Even within your quota:
- Each deployment has its own throughput ceiling
- Backend clusters also enforce limits
So subscription quota is not the same as guaranteed usable throughput at any given moment.
- Burst traffic patterns
Even if total TPM is within limits:
- Short bursts of requests
- High-concurrency spikes
can trigger 429s based on instantaneous load, not average usage.
- Provisioned deployment saturation
If you're using provisioned throughput, once PTU utilization hits ~100% you'll get 429s until usage drops.
- Request configuration impact
Large requests can silently consume quota faster:
- A high max_tokens value
- Using best_of
These increase token usage and can push you into throttling.
- Preview / newer model behavior
Preview models often have:
- Lower capacity
- Higher contention
- A higher likelihood of throttling
Regional stability considerations
Yes, regions behave differently.
- Not all regions have equal capacity for every model
- High-demand regions (e.g., East US 2) can throttle more
Some regions may experience:
- Temporary saturation
- Maintenance events
If the same workload works in another region, it's almost always capacity-related, not code-related.
What you should do
- Adopt a multi-region strategy
Deploy the same model in two or more regions.
Route traffic via failover or load balancing.
This avoids regional hot spots.
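As a minimal sketch of the failover idea (every name here is a hypothetical placeholder, not the Azure SDK): try the primary regional deployment first and fall through to the next region when it returns a 429.

```python
class RateLimitError(Exception):
    """Stand-in for the 429 error your SDK raises (hypothetical)."""

def call_region(region: str, prompt: str) -> str:
    """Placeholder for a real Azure OpenAI call against a regional deployment."""
    if region == "eastus2":  # simulate a saturated region
        raise RateLimitError(f"429 from {region}")
    return f"response from {region}"

def call_with_failover(prompt: str, regions: list) -> str:
    """Try each regional deployment in order; fall through to the next on 429."""
    last_err = None
    for region in regions:
        try:
            return call_region(region, prompt)
        except RateLimitError as err:
            last_err = err  # this region is saturated; try the next one
    raise last_err  # every region throttled

print(call_with_failover("hi", ["eastus2", "swedencentral"]))
# prints: response from swedencentral
```

In production you'd replace call_region with your actual SDK call and keep the region list in priority order (cheapest/closest first).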
- Use multiple deployments
Instead of one large deployment, split the load across multiple deployments. This reduces per-deployment throttling.
- Implement retry with exponential backoff and jitter
429 handling is required for production.
Best practice:
- Respect the retry-after / retry-after-ms response headers
- Use exponential backoff with jitter (e.g., 1s → 2s → 4s)
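A minimal sketch of that retry loop (the RateLimitError class is a stand-in for whatever 429 exception your SDK raises; the retry_after attribute represents the value parsed from the retry-after header):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for your SDK's 429 exception (hypothetical)."""
    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds, from the retry-after header

def with_backoff(call, max_attempts=5):
    """Retry on 429: honor retry-after if present, else exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the 429
            # prefer the server's hint; otherwise back off 1s, 2s, 4s... plus jitter
            delay = err.retry_after if err.retry_after is not None else 2 ** attempt
            time.sleep(delay + random.uniform(0, 0.5))

# demo: fail twice with 429, then succeed
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError(retry_after=0.01)
    return "ok"

print(with_backoff(flaky))  # ok
```

The jitter term matters: without it, many clients that were throttled together retry together and trigger the same spike again.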
- Smooth traffic
- Queue requests
- Rate-limit concurrency
This prevents the sudden spikes that trigger throttling.
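One simple way to rate-limit concurrency is a semaphore that caps in-flight requests; a small sketch (not tied to any SDK):

```python
import threading

class ConcurrencyLimiter:
    """Cap the number of in-flight requests so bursts become a steady stream."""
    def __init__(self, max_in_flight: int):
        self._sem = threading.Semaphore(max_in_flight)

    def run(self, call):
        with self._sem:  # blocks while max_in_flight calls are already active
            return call()

limiter = ConcurrencyLimiter(max_in_flight=4)
results = []
threads = [
    threading.Thread(target=lambda: results.append(limiter.run(lambda: "done")))
    for _ in range(10)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 10
```

All ten requests complete, but never more than four run at once, which keeps instantaneous load below the threshold that triggers 429s.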
- Right-size your requests
- Reduce max_tokens when a large completion isn't needed
- Avoid best_of unless necessary
This frees up token capacity.
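To see why these settings matter, here is an illustrative (not the service's exact) view of how a request's token budget scales with max_tokens and best_of:

```python
def estimated_quota_tokens(prompt_tokens: int, max_tokens: int, best_of: int = 1) -> int:
    """Rough illustration of how request settings inflate the token budget
    counted against your rate limit (sketch, not the service's exact formula)."""
    return prompt_tokens + max_tokens * best_of

# trimming max_tokens from 4000 to 500 shrinks the per-request budget sharply
print(estimated_quota_tokens(200, 4000))  # 4200
print(estimated_quota_tokens(200, 500))   # 700
```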
- Consider Provisioned throughput
- Guarantees allocated capacity
- Provides predictable behavior
- Still returns 429 when fully utilized
- Monitor the right metrics
Use Azure Monitor:
- Request rate
- Token usage
- Throttled requests
- Latency spikes
- Provisioned-Managed Utilization
- Scale proactively
If you expect growth, increase TPM quota ahead of time; don't wait until throttling starts.
How to confirm the root cause
You can validate quickly:
- Retry succeeds after a delay → transient capacity issue
- Same workload fails only in one region → regional constraint
- Works in another deployment → deployment-level throttling
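Those checks can be summed up as a tiny triage helper (purely illustrative):

```python
def diagnose(retry_succeeds: bool, fails_in_one_region_only: bool,
             works_in_other_deployment: bool) -> str:
    """Map the quick checks above to the most likely root cause."""
    if works_in_other_deployment:
        return "deployment-level throttling"
    if fails_in_one_region_only:
        return "regional capacity constraint"
    if retry_succeeds:
        return "transient capacity issue"
    return "inconclusive: review quota configuration and metrics"

print(diagnose(retry_succeeds=True, fails_in_one_region_only=False,
               works_in_other_deployment=False))  # transient capacity issue
```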
Please refer to these resources:
Azure OpenAI quotas and limits (Tokens-Per-Minute, Requests-Per-Minute) https://learn.microsoft.com/azure/ai-foundry/openai/quotas-limits?tabs=REST
Provisioned deployments in Azure AI Foundry Models https://learn.microsoft.com/azure/ai-foundry/openai/how-to/provisioned-get-started?tabs=REST
Model availability by region https://learn.microsoft.com/azure/ai-foundry/openai/concepts/models#model-summary-table-and-region-availability
General best practices to remain within rate limits https://learn.microsoft.com/azure/ai-services/openai/quotas-limits#general-best-practices-to-remain-within-rate-limits
I hope this helps. Do let me know if you have any further queries.
Thank you!