Configuring Circuit Breaker & Load Balancer in APIM for Multiple LLMs (Phi, Llama, etc.)

Vivek Kumar 45 Reputation points
2025-03-20T15:58:02.14+00:00

I understand that we cannot configure circuit breakers and load balancing directly through the Azure Portal in API Management (APIM). However, I need guidance on how to integrate these features properly.

I have deployed several different large language models (LLMs), such as Phi and Llama, as serverless APIs in Azure AI Foundry and added them as APIs in APIM. The challenge is that Azure has only provided configurations for OpenAI in its GitHub repositories, but I need a way to set up circuit breakers and load balancing for custom LLM endpoints as well as Azure OpenAI endpoints.

  • How can I properly configure a circuit breaker in APIM to handle failures and timeouts for these LLM APIs?
  • What is the best approach for load balancing between multiple LLM instances in APIM?
  • Are there any detailed steps or code snippets available to implement this setup for non-OpenAI models?

I would appreciate any examples, ARM templates, Bicep scripts, or REST API configurations, along with guidance on how to use them. Thanks in advance!

Azure API Management

1 answer

  1. Nima Kamoosi 5 Reputation points Microsoft Employee
    2025-03-31T19:28:27.9+00:00

    @Vivek Kumar A few detailed answers:

    • Assuming all your backends are hosted by Azure AI Foundry (or Azure OpenAI), they will expose an OpenAI-compatible endpoint for each deployment/model. The correct way to set up load balancing is to create a pool backend that contains multiple single backends, each corresponding to one Azure AI model/deployment endpoint (see the Bicep sketch after this list).
    • You can configure circuit breakers on the single backends, and configure 5xx and 429 status-code ranges on both pool backends and single backends (via circuit breaker rules). See this related blog as well as our general documentation on backends.
      • Support for propagating the Retry-After header is not currently available, but we are working on it.
    • For non-OpenAI models, as long as they expose an OpenAI-compatible endpoint, the setup is the same (the sketches below apply unchanged). We may share detailed steps in the future through a blog or other collateral such as templates.
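
    A minimal Bicep sketch of that layout, assuming an existing APIM instance named my-apim and two Azure AI Foundry serverless deployments; every name, URL, threshold, and duration below is a hypothetical placeholder rather than a recommendation. Pool backends and circuit breaker rules require a recent API version; 2023-09-01-preview is used here:

```bicep
// Sketch only: one single backend per LLM deployment, each with a circuit
// breaker, fronted by a pool backend. All names and URLs are placeholders.
resource apim 'Microsoft.ApiManagement/service@2023-09-01-preview' existing = {
  name: 'my-apim' // your existing APIM instance
}

// One entry per Azure AI Foundry serverless deployment (Phi, Llama, ...).
var llmEndpoints = [
  { name: 'phi-backend', url: 'https://phi-endpoint.eastus2.models.ai.azure.com' }
  { name: 'llama-backend', url: 'https://llama-endpoint.eastus2.models.ai.azure.com' }
]

// A single backend per deployment, each with its own circuit breaker rule.
resource llmBackends 'Microsoft.ApiManagement/service/backends@2023-09-01-preview' = [for ep in llmEndpoints: {
  parent: apim
  name: ep.name
  properties: {
    url: ep.url
    protocol: 'http'
    circuitBreaker: {
      rules: [
        {
          name: 'breakerRule'
          failureCondition: {
            count: 3                 // trip after 3 matching failures...
            interval: 'PT5M'         // ...within a 5-minute window
            statusCodeRanges: [
              { min: 429, max: 429 } // throttling
              { min: 500, max: 599 } // server errors
            ]
          }
          tripDuration: 'PT1M'       // keep the backend out of rotation for 1 minute
        }
      ]
    }
  }
}]

// The pool backend that load-balances across the single backends.
resource llmPool 'Microsoft.ApiManagement/service/backends@2023-09-01-preview' = {
  parent: apim
  name: 'llm-backend-pool'
  properties: {
    description: 'Load-balanced pool across LLM deployments'
    type: 'Pool'
    pool: {
      services: [for (ep, i) in llmEndpoints: {
        id: '/backends/${llmBackends[i].name}'
        priority: 1 // equal priority for all members here
        weight: 1   // equal weights; skew for weighted distribution
      }]
    }
  }
}
```

    When a rule trips, the gateway takes that backend out of the pool for the tripDuration and distributes traffic across the remaining members; equal priorities round-robin, while a lower priority number marks a preferred failover tier.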
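
    To route an API's traffic through the pool, reference it by backend-id in a set-backend-service policy. Continuing the same sketch, where my-llm-api stands in for whichever API you imported for your LLM deployments:

```bicep
// Sketch: attach a policy to an existing API so requests go to the pool.
resource llmApi 'Microsoft.ApiManagement/service/apis@2023-09-01-preview' existing = {
  parent: apim
  name: 'my-llm-api' // placeholder: your imported LLM API
}

resource llmApiPolicy 'Microsoft.ApiManagement/service/apis/policies@2023-09-01-preview' = {
  parent: llmApi
  name: 'policy'
  properties: {
    format: 'rawxml'
    value: '''
<policies>
  <inbound>
    <base />
    <!-- Route every request to the load-balanced pool defined above -->
    <set-backend-service backend-id="llm-backend-pool" />
  </inbound>
  <backend><base /></backend>
  <outbound><base /></outbound>
  <on-error><base /></on-error>
</policies>
'''
  }
}
```

    The policy is indifferent to which model sits behind each backend, which is why the non-OpenAI case works the same way.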
    1 person found this answer helpful.
