@Vivek Kumar A few detailed answers:
- Assuming all your backends are hosted by Azure AI Foundry (or Azure OpenAI), they will expose OpenAI-compatible endpoints for each deployment/model. The correct way to set up load balancing is to create a Pool Backend and include multiple Single Backends in it, each corresponding to one Azure AI model/deployment endpoint.
- You can configure Circuit Breakers on the Single Backends, and you can scope the failure conditions to 5xx and 429 status code ranges; the Pool Backend then routes around any Single Backend whose circuit breaker has tripped. See this related blog as well as our general documentation on backends.
- Support for propagating the Retry-After header from the backend is not currently available, but we are working on it.
- For non-OpenAI models, the setup is the same as long as they expose an OpenAI-compatible endpoint. We may share detailed steps in the future through a blog post or other collateral such as templates.
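To make the first two points concrete, here is a minimal Bicep sketch of a Single Backend with a circuit breaker on 429/5xx plus a Pool Backend that references it. The service name, backend names, endpoint URL, api-version, and threshold values are all illustrative assumptions; check the current APIM backends documentation for the exact schema supported by your API version.

```bicep
resource apim 'Microsoft.ApiManagement/service@2023-09-01-preview' existing = {
  name: 'my-apim' // assumed existing APIM instance name
}

// Single Backend per Azure OpenAI / AI Foundry deployment,
// with a circuit breaker that trips on throttling (429) or server errors (5xx)
resource aoaiBackend1 'Microsoft.ApiManagement/service/backends@2023-09-01-preview' = {
  parent: apim
  name: 'aoai-backend-1' // illustrative name
  properties: {
    url: 'https://my-aoai-1.openai.azure.com/openai' // illustrative endpoint
    protocol: 'http'
    circuitBreaker: {
      rules: [
        {
          name: 'breakOnThrottleOrServerError'
          failureCondition: {
            count: 3           // trip after 3 failures...
            interval: 'PT1M'   // ...within a 1-minute window (example values)
            statusCodeRanges: [
              { min: 429, max: 429 }
              { min: 500, max: 599 }
            ]
          }
          tripDuration: 'PT1M' // stay open for 1 minute before retrying
        }
      ]
    }
  }
}

// Pool Backend that load-balances across the Single Backends above
resource aoaiPool 'Microsoft.ApiManagement/service/backends@2023-09-01-preview' = {
  parent: apim
  name: 'aoai-pool' // illustrative name
  properties: {
    type: 'Pool'
    pool: {
      services: [
        { id: aoaiBackend1.id }
        // add one entry per additional Single Backend
      ]
    }
  }
}
```

In your API policy you would then route traffic to the pool, e.g. `<set-backend-service backend-id="aoai-pool" />`, and APIM distributes requests across the healthy members of the pool.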