
How to increase Azure AI Foundry throughput for deployed LLM under high parallel load?

Vitalii Horbovyi 25 Reputation points
2026-04-19T02:45:13.61+00:00

Hi,

I am experiencing significant throughput degradation when handling parallel user requests to GPT-4.1-mini via Azure AI Foundry, and I would like your guidance on the best architectural approach for our situation.

Current situation:

Each user session triggers approximately 10 sequential LLM calls, with each call consuming roughly 10,000 tokens. In isolation (single user), the full flow takes approximately 50-70 seconds.

However, under parallel load the performance degrades significantly: with 4 concurrent users, the same flow takes approximately 200-300 seconds (40 requests in flight). This is already unacceptable for our use case, and we are concerned about what will happen at 10-50 concurrent users.

What I have already tried or considered:

  1. Multiple deployments within a single subscription (Pay-As-You-Go). I created several GPT-4.1-mini deployments of type Global Standard within the same region and subscription in Azure AI Foundry, expecting that load balancing across deployments would increase overall throughput. However, after reading the Microsoft documentation on quotas, I understand that standard quota is subscription-scoped, not deployment-scoped. Therefore, adding more deployments within the same subscription does not increase throughput.
  2. Batch API. We evaluated the Batch API but it does not fit our use case, as we require real-time responses from the model.
  3. Provisioned Throughput Units (PTU). We have evaluated PTU but it is not financially viable for our business at this stage. Our margins do not support this option.

Since standard quota is subscription-bound, would deploying one Azure AI Foundry instance per Azure subscription - each in the same region, each with one GPT-4.1-mini deployment - and routing requests across them via a gateway effectively multiply available throughput?

For example:

  • Subscription A → Azure AI Foundry instance: 1× GPT-4.1-mini (Poland Central)
  • Subscription B → Azure AI Foundry instance: 1× GPT-4.1-mini (Poland Central)
  • Subscription C → Azure AI Foundry instance: 1× GPT-4.1-mini (Poland Central)
  • Gateway → load balances across all three
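To make the routing concrete, here is a minimal sketch of the gateway behaviour I have in mind (the endpoint URLs are placeholders, and `send` stands in for the real HTTP call, e.g. via the Azure OpenAI SDK; this is an illustration, not production code):

```python
import itertools

# Placeholder endpoints: one Azure AI Foundry resource per subscription.
ENDPOINTS = [
    "https://sub-a-foundry.openai.azure.com",
    "https://sub-b-foundry.openai.azure.com",
    "https://sub-c-foundry.openai.azure.com",
]

_rr = itertools.cycle(range(len(ENDPOINTS)))


def route_request(payload, send, endpoints=ENDPOINTS):
    """Round-robin across subscription endpoints; on HTTP 429 fall
    through to the next one.

    `send` is any callable (endpoint, payload) -> (status_code, body),
    so the real HTTP request is pluggable and testable.
    """
    start = next(_rr)
    last = None
    for i in range(len(endpoints)):
        endpoint = endpoints[(start + i) % len(endpoints)]
        status, body = send(endpoint, payload)
        if status != 429:  # success or a non-throttling error: return it
            return endpoint, status, body
        last = (endpoint, status, body)
    return last  # every quota pool throttled: surface the last 429
```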

Would this approach actually increase throughput proportionally to the number of subscriptions? Are there any limitations, compliance considerations, or technical blockers we should be aware of? Is this a good way to scale the system?

Additional questions:

  1. Is there any other approach - aside from PTU and multi-subscription load balancing - that could meaningfully increase throughput for Pay-As-You-Go standard deployments under high parallel load?
  2. What is the recommended scalable architecture for a workload like ours?

Thanks in advance!

Azure OpenAI Service

An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.


2 answers

  1. Ghulam Muhayyu Din 0 Reputation points
    2026-04-19T17:12:23.48+00:00

    Hello Vitalii,

    Your observation about throughput degradation is a common challenge when transitioning from isolated testing to high parallel load.

    To answer your primary question: Yes, the multi-subscription gateway approach you proposed will technically multiply your throughput. Because Azure OpenAI rate limits (TPM and RPM) are scoped per region, per subscription, and per model, routing through three subscriptions in Poland Central will grant you three distinct quota pools.

    However, this is generally considered an anti-pattern. Scaling via multiple subscriptions introduces unnecessary administrative overhead, complex billing, and fragmented security. Best practices dictate using separate subscriptions only for distinct environments (like Dev vs. Prod), not for bypassing regional quotas.

    Instead of a multi-subscription architecture, the most effective Pay-As-You-Go strategy is Multi-Region scaling within a single subscription.

    Leverage Regional Quota Pools: Because your quota is allocated per region within a single subscription, you can easily multiply your total available TPM/RPM by deploying your GPT-4.1-mini model across multiple regions (e.g., Poland Central, Sweden Central, and East US).

    Implement Azure API Management (APIM): Place Azure APIM in front of these regional deployments.

    Use Smart Load Balancing & Circuit Breakers: Configure APIM to distribute requests across your multiple regional endpoints. By implementing a circuit breaker policy, APIM will detect when a specific region is overwhelmed (returning 429 rate limit errors) and automatically reroute subsequent requests to the next available region. This prevents cascading failures and ensures high availability.
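    As a rough client-side illustration of that circuit-breaker behaviour (the region names, the injectable `send` callable, and the cooldown value are all placeholders; in a real deployment this logic would live in an APIM policy rather than application code):

```python
import time


class CircuitBreakerRouter:
    """Skip an endpoint for `cooldown` seconds after it returns 429,
    so traffic flows to the next healthy regional deployment."""

    def __init__(self, endpoints, cooldown=30.0, clock=time.monotonic):
        self.endpoints = list(endpoints)
        self.cooldown = cooldown
        self.clock = clock  # injectable for testing
        self.tripped_until = {ep: 0.0 for ep in self.endpoints}

    def call(self, payload, send):
        """`send` is (endpoint, payload) -> (status_code, body)."""
        now = self.clock()
        healthy = [ep for ep in self.endpoints
                   if self.tripped_until[ep] <= now]
        for ep in healthy or self.endpoints:  # if all tripped, try anyway
            status, body = send(ep, payload)
            if status == 429:
                # Trip the breaker for this region and try the next one.
                self.tripped_until[ep] = now + self.cooldown
                continue
            return ep, status, body
        raise RuntimeError("all regional endpoints are rate limited")
```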

    A Note on Global Standard: You mentioned using "Global Standard" deployments. These are already designed to dynamically route your traffic to the datacenter with the best availability across Azure's global infrastructure. If you are still hitting rate limits on Global Standard, your immediate next step should be to submit a quota increase request through Azure AI Foundry, as Global Standard typically offers the highest initial throughput limits. If that is insufficient, transition to the Multi-Region + APIM architecture described above.


  2. Vinodh247 42,051 Reputation points MVP Volunteer Moderator
    2026-04-19T06:38:06.88+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    Your current design is token-heavy and chatty, which is why it collapses under parallel load. Fixing that will give you far more gain than just adding more subscriptions.

    Short answer: Yes, your multi-subscription approach will increase throughput, but it is a workaround, not the recommended long-term architecture.

    Paragraph answer: In Azure AI Foundry standard (pay-as-you-go) deployments, throughput is primarily constrained by subscription-level quotas (tokens per minute and requests per minute). Because of this, adding multiple deployments inside the same subscription does not help, but spreading deployments across multiple subscriptions does effectively multiply available throughput, provided each subscription has its own quota allocation. So your design (A/B/C subscriptions + gateway load balancing) will scale linearly in practice. However, this comes with operational overhead (quota management, auth, monitoring, cost tracking) and potential soft limits if Microsoft detects coordinated scaling patterns or applies regional capacity constraints.

    A better architecture for your case (high parallel, multi-step LLM workflows) is to reduce pressure on the model rather than only scaling horizontally. The biggest issue in your design is not just concurrency, but token volume (10 calls × 10k tokens per user).

    You should aggressively optimize here:

    • collapse multi-step chains into fewer prompts (prompt engineering or tool calling)
    • cache deterministic or semi-deterministic responses (semantic cache layer)
    • use smaller or mixed models where possible (route some steps away from GPT-4.1-mini)
    • implement request queuing + rate shaping instead of pure parallel fan-out
    • introduce async pipelines where the user experience allows partial streaming instead of blocking on full flows
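    For the request queuing + rate shaping point, a minimal asyncio sketch (the `call_llm` coroutine and the concurrency cap are placeholders for your actual client code):

```python
import asyncio

# Placeholder cap: queue excess requests instead of fanning every
# user request out to the deployment in parallel.
MAX_IN_FLIGHT = 4


async def shaped_call(semaphore, call_llm, prompt):
    """Wait for a slot so at most MAX_IN_FLIGHT requests hit the
    deployment at once; queued requests wait instead of drawing 429s."""
    async with semaphore:
        return await call_llm(prompt)


async def run_session(call_llm, prompts, max_in_flight=MAX_IN_FLIGHT):
    sem = asyncio.Semaphore(max_in_flight)
    return await asyncio.gather(
        *(shaped_call(sem, call_llm, p) for p in prompts)
    )
```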

    If you still need raw throughput scaling without PTU, combine:

    • multi-subscription sharding (what you proposed)
    • multi-region deployments (if latency allows)
    • intelligent gateway (token-aware routing + backpressure)

    Please 'Upvote' (Thumbs-up) and 'Accept' as answer if the reply was helpful. This will benefit other community members who face the same issue.

