Configuring Circuit Breaker & Load Balancer in APIM for Multiple LLMs (Phi, Llama, etc.)

Vivek Kumar 45 Reputation points
2025-03-20T15:58:02.14+00:00

I understand that we cannot configure circuit breakers and load balancing directly through the Azure Portal in API Management (APIM). However, I need guidance on how to integrate these features properly.

I have deployed several different large language models (LLMs), such as Phi and Llama, as serverless APIs in Azure AI Foundry and added them as APIs in APIM. The challenge is that Azure has only provided configurations for OpenAI in its GitHub repositories, but I need a way to set up circuit breakers and load balancing for custom LLM endpoints as well as Azure OpenAI endpoints.

  • How can I properly configure a circuit breaker in APIM to handle failures and timeouts for these LLM APIs?
  • What is the best approach for load balancing between multiple LLM instances in APIM?
  • Are there any detailed steps or code snippets available to implement this setup for non-OpenAI models?

I would appreciate any examples, ARM templates, Bicep scripts, or REST API configurations, along with guidance on how to use them. Thanks in advance!

Azure API Management

1 answer

  1. Nima Kamoosi 5 Reputation points Microsoft Employee
    2025-03-31T19:28:27.9+00:00

    @Vivek Kumar A few detailed answers:

    • Assuming all your backends are hosted by Azure AI Foundry (or Azure OpenAI), they will expose an OpenAI-compatible endpoint for each deployment/model. The correct way to set up load balancing is to create a pool backend that contains multiple single backends, each corresponding to one Azure AI model/deployment endpoint (see the Bicep sketch after this list).
    • You can configure circuit breakers on the single backends, and configure 5xx and 429 status-code ranges on both pool backends and single backends (via circuit breaker rules). See this related blog as well as our general documentation on backends.
      • Support for propagating the Retry-After header is not currently available, but we are working on it.
    • For non-OpenAI models, as long as they expose an OpenAI-compatible endpoint, the setup is the same (the sketches below apply unchanged). We may share detailed steps in the future through a blog or other collateral such as templates.
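
    A minimal Bicep sketch of that layout, assuming an existing APIM instance named my-apim and two Azure AI Foundry serverless deployments; every name, URL, threshold, and duration below is a hypothetical placeholder rather than a recommendation. Pool backends and circuit breaker rules require a recent API version; 2023-09-01-preview is used here:

```bicep
// Sketch only: one single backend per LLM deployment, each with a circuit
// breaker, fronted by a pool backend. All names and URLs are placeholders.
resource apim 'Microsoft.ApiManagement/service@2023-09-01-preview' existing = {
  name: 'my-apim' // your existing APIM instance
}

// One entry per Azure AI Foundry serverless deployment (Phi, Llama, ...).
var llmEndpoints = [
  { name: 'phi-backend', url: 'https://phi-endpoint.eastus2.models.ai.azure.com' }
  { name: 'llama-backend', url: 'https://llama-endpoint.eastus2.models.ai.azure.com' }
]

// A single backend per deployment, each with its own circuit breaker rule.
resource llmBackends 'Microsoft.ApiManagement/service/backends@2023-09-01-preview' = [for ep in llmEndpoints: {
  parent: apim
  name: ep.name
  properties: {
    url: ep.url
    protocol: 'http'
    circuitBreaker: {
      rules: [
        {
          name: 'breakerRule'
          failureCondition: {
            count: 3                 // trip after 3 matching failures...
            interval: 'PT5M'         // ...within a 5-minute window
            statusCodeRanges: [
              { min: 429, max: 429 } // throttling
              { min: 500, max: 599 } // server errors
            ]
          }
          tripDuration: 'PT1M'       // keep the backend out of rotation for 1 minute
        }
      ]
    }
  }
}]

// The pool backend that load-balances across the single backends.
resource llmPool 'Microsoft.ApiManagement/service/backends@2023-09-01-preview' = {
  parent: apim
  name: 'llm-backend-pool'
  properties: {
    description: 'Load-balanced pool across LLM deployments'
    type: 'Pool'
    pool: {
      services: [for (ep, i) in llmEndpoints: {
        id: '/backends/${llmBackends[i].name}'
        priority: 1 // equal priority for all members here
        weight: 1   // equal weights; skew for weighted distribution
      }]
    }
  }
}
```

    When a rule trips, the gateway takes that backend out of the pool for the tripDuration and distributes traffic across the remaining members; equal priorities round-robin, while a lower priority number marks a preferred failover tier.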
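
    To route an API's traffic through the pool, reference it by backend-id in a set-backend-service policy. Continuing the same sketch, where my-llm-api stands in for whichever API you imported for your LLM deployments:

```bicep
// Sketch: attach a policy to an existing API so requests go to the pool.
resource llmApi 'Microsoft.ApiManagement/service/apis@2023-09-01-preview' existing = {
  parent: apim
  name: 'my-llm-api' // placeholder: your imported LLM API
}

resource llmApiPolicy 'Microsoft.ApiManagement/service/apis/policies@2023-09-01-preview' = {
  parent: llmApi
  name: 'policy'
  properties: {
    format: 'rawxml'
    value: '''
<policies>
  <inbound>
    <base />
    <!-- Route every request to the load-balanced pool defined above -->
    <set-backend-service backend-id="llm-backend-pool" />
  </inbound>
  <backend><base /></backend>
  <outbound><base /></outbound>
  <on-error><base /></on-error>
</policies>
'''
  }
}
```

    The policy is indifferent to which model sits behind each backend, which is why the non-OpenAI case works the same way.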
    1 person found this answer helpful.
