What is provisioned throughput?

Note

If you're looking for what's recently changed with the provisioned throughput offering, see the update article for more information.

The provisioned throughput offering is a model deployment type that allows you to specify the amount of throughput you require in a model deployment. The Azure OpenAI service then allocates the necessary model processing capacity and ensures it's ready for you. Provisioned throughput provides:

  • Predictable performance: Stable maximum latency and throughput for uniform workloads.
  • Allocated processing capacity: A deployment configures the amount of throughput. Once deployed, the throughput is available whether it's used or not.
  • Cost savings: High-throughput workloads might provide cost savings versus token-based consumption.

When to use provisioned throughput

You should consider switching from standard deployments to provisioned managed deployments when you have well-defined, predictable throughput and latency requirements. Typically, this occurs when the application is ready for production or has already been deployed in production and there's an understanding of the expected traffic. This allows users to accurately forecast the required capacity and avoid unexpected billing. Provisioned managed deployments are also useful for applications that have real-time, latency-sensitive requirements.

Key concepts

Provisioned Throughput Units (PTU)

Provisioned throughput units (PTUs) are generic units of model processing capacity that you can use to size provisioned deployments to achieve the required throughput for processing prompts and generating completions. Provisioned throughput units are granted to a subscription as quota and are used to define costs. Each quota is specific to a region and defines the maximum number of PTUs that can be assigned to deployments in that subscription and region. For information about the costs associated with the provisioned managed offering and PTUs, see Understanding costs associated with PTU.
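
You can inspect quota and usage for a subscription in a region from the command line. The following is a minimal sketch using the Azure CLI; it assumes the az cognitiveservices usage list command is available in your CLI version and that provisioned (PTU) quota entries appear in its output for the chosen region.

# Sketch: list Azure AI services quota and usage for a region. Provisioned (PTU)
# quota entries, if present, appear alongside other usages for the subscription.
# The eastus location is a placeholder; substitute your target region.
az cognitiveservices usage list \
--location eastus \
--output table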

Deployment types

When creating a provisioned deployment in Azure AI Foundry, the deployment type on the Create Deployment dialog can be set to the Global Provisioned-Managed, DataZone Provisioned-Managed, or regional Provisioned-Managed deployment type depending on the data processing needs for the given workload.

When creating a provisioned deployment in Azure OpenAI via the CLI or API, the sku-name can be set to GlobalProvisionedManaged, DataZoneProvisionedManaged, or ProvisionedManaged depending on the data processing needs for the given workload. To adapt the Azure CLI example command below to a different deployment type, update the sku-name parameter to match the deployment type you wish to deploy.

az cognitiveservices account deployment create \
--name <myResourceName> \
--resource-group <myResourceGroupName> \
--deployment-name MyDeployment \
--model-name gpt-4o \
--model-version 2024-08-06  \
--model-format OpenAI \
--sku-capacity 15 \
--sku-name GlobalProvisionedManaged

Capacity transparency

Azure OpenAI is a highly sought-after service where customer demand might exceed service GPU capacity. Microsoft strives to provide capacity for all in-demand regions and models, but selling out a region is always a possibility. This constraint can limit some customers' ability to create a deployment of their desired model, version, or number of PTUs in a desired region, even if they have quota available in that region. Generally speaking:

  • Quota places a limit on the maximum number of PTUs that can be deployed in a subscription and region; it is not a guarantee of capacity availability.
  • Capacity is allocated at deployment time and is held for as long as the deployment exists. If service capacity isn't available, the deployment fails.
  • Customers use real-time information on quota and capacity availability to choose an appropriate region for their scenario with the necessary model capacity.
  • Scaling down or deleting a deployment releases capacity back to the region, as in the example after this list. There is no guarantee that the capacity will be available should the deployment be scaled up or re-created later.
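
As a minimal illustration of the last point, deleting a provisioned deployment with the Azure CLI releases its allocated capacity back to the region. The resource, resource group, and deployment names below are placeholders.

# Deleting a deployment releases its allocated capacity back to the region.
# There is no guarantee the same capacity is available if you re-create it later.
az cognitiveservices account deployment delete \
--name <myResourceName> \
--resource-group <myResourceGroupName> \
--deployment-name MyDeployment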

Regional capacity guidance

To find the capacity available for your deployments, use the model capacities API or the Azure AI Foundry deployment experience, both of which provide real-time information on capacity availability.

In Azure AI Foundry, the deployment experience identifies when a region lacks the capacity needed to deploy the model. This check considers the desired model, version, and number of PTUs. If capacity is unavailable, the experience directs users to select an alternative region.

Details on the deployment experience can be found in the Azure OpenAI Provisioned get started guide.

The model capacities API can be used to programmatically identify the maximum-sized deployment of a specified model. The API considers both your quota and service capacity in the region.
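
As a rough sketch, the model capacities API can be called with az rest as shown below. The api-version value and query parameters are assumptions that may need to be adjusted against the current REST reference; substitute your subscription ID and the model name and version you're interested in.

# Sketch: query model capacities through the control plane with az rest.
# The api-version shown is an assumption; check the current REST reference.
az rest --method get \
--url "https://management.azure.com/subscriptions/<subscriptionId>/providers/Microsoft.CognitiveServices/modelCapacities?api-version=2024-04-01-preview&modelFormat=OpenAI&modelName=gpt-4o&modelVersion=2024-08-06"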

If an acceptable region isn't available to support the desired model, version, and/or PTUs, customers can also try the following steps:

  • Attempt the deployment with a smaller number of PTUs.
  • Attempt the deployment at a different time. Capacity availability changes dynamically based on customer demand and more capacity might become available later.
  • Ensure that quota is available in all acceptable regions. The model capacities API and Azure AI Foundry experience consider quota availability in returning alternative regions for creating a deployment.

How can I monitor capacity?

The Provisioned-Managed Utilization V2 metric in Azure Monitor measures a given deployment's utilization in 1-minute increments. All provisioned deployment types are optimized to ensure that accepted calls are processed with a consistent model processing time (actual end-to-end latency depends on a call's characteristics).
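
As a rough sketch, the same utilization metric can be pulled with the Azure CLI. The metric name and aggregation shown below are assumptions; confirm the exact metric name exposed for your Azure OpenAI resource in Azure Monitor before relying on it.

# Sketch: query the provisioned utilization metric for an Azure OpenAI resource.
# The metric name is an assumption; verify it in Azure Monitor for your resource.
az monitor metrics list \
--resource <resourceId> \
--metric "AzureOpenAIProvisionedManagedUtilizationV2" \
--interval PT1M \
--aggregation Average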

How utilization performance works

Provisioned deployments provide you with an allocated amount of model processing capacity to run a given model.

In all provisioned deployment types, when capacity is exceeded, the API returns a 429 HTTP status code. This fast response enables the user to make decisions on how to manage their traffic. Users can redirect requests to a separate deployment, to a standard pay-as-you-go instance, or use a retry strategy to manage a given request. The service continues to return the 429 HTTP status code until the utilization drops below 100%.

What should I do when I receive a 429 response?

The 429 response isn't an error, but instead part of the design for telling users that a given deployment is fully utilized at a point in time. By providing a fast-fail response, you have control over how to handle these situations in a way that best fits your application requirements.

The retry-after-ms and retry-after headers in the response tell you the time to wait before the next call will be accepted. How you choose to handle this response depends on your application requirements. Here are some considerations:

  • You can consider redirecting the traffic to other models, deployments, or experiences. This option is the lowest-latency solution because the action can be taken as soon as you receive the 429 signal. For ideas on how to effectively implement this pattern, see this community post.
  • If you're okay with longer per-call latencies, implement client-side retry logic, as in the sketch after this list. This option gives you the highest amount of throughput per PTU. The Azure OpenAI client libraries include built-in capabilities for handling retries.
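
The following is a minimal client-side retry sketch in bash that honors the retry-after-ms header when a 429 is returned. It assumes the AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, and DEPLOYMENT environment variables are set, that the api-version value is current, and that curl 7.84 or later is installed; production applications would typically rely on the Azure OpenAI client libraries' built-in retry support instead.

# Minimal 429 retry sketch (illustrative only; requires bash and curl 7.84+).
# Assumes AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, and DEPLOYMENT are set;
# the api-version value is an assumption and may need updating.
url="${AZURE_OPENAI_ENDPOINT}/openai/deployments/${DEPLOYMENT}/chat/completions?api-version=2024-10-21"
body='{"messages":[{"role":"user","content":"Hello"}],"max_tokens":50}'

for attempt in 1 2 3 4 5; do
    # Capture the HTTP status code and the retry-after-ms header (if present).
    read -r status retry_ms < <(curl -s -o response.json \
        -w '%{http_code} %header{retry-after-ms}' \
        -H "api-key: ${AZURE_OPENAI_API_KEY}" \
        -H "Content-Type: application/json" \
        -d "${body}" "${url}")

    if [ "${status}" != "429" ]; then
        echo "HTTP ${status} on attempt ${attempt}"
        cat response.json
        break
    fi

    # Back off for the interval the service suggests, rounding up to whole seconds.
    wait_ms=${retry_ms:-1000}
    echo "429 received; retrying in ${wait_ms} ms"
    sleep $(( (wait_ms + 999) / 1000 ))
done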

How does the service decide when to send a 429?

In all provisioned deployment types, each request is evaluated individually according to its prompt size, expected generation size, and model to determine its expected utilization. This is in contrast to pay-as-you-go deployments, which have custom rate-limiting behavior based on the estimated traffic load. For pay-as-you-go deployments, this can lead to HTTP 429 errors being generated before defined quota values are exceeded if traffic isn't evenly distributed.

For provisioned deployments, we use a variation of the leaky bucket algorithm to maintain utilization below 100% while allowing some burstiness in the traffic. The high-level logic is as follows:

  1. Each customer has a set amount of capacity they can utilize on a deployment.

  2. When a request is made:

    a. When the current utilization is above 100%, the service returns a 429 code with the retry-after-ms header set to the time until utilization is below 100%.

    b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining the prompt tokens, less any cached tokens, and the specified max_tokens in the call. A customer can receive up to a 100% discount on their prompt tokens depending on the size of their cached tokens. If the max_tokens parameter isn't specified, the service estimates a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For the highest concurrency, ensure that the max_tokens value is as close as possible to the true generation size. (A rough arithmetic sketch of this estimate follows this list.)

  3. When a request finishes, we now know the actual compute cost for the call. To ensure an accurate accounting, we correct the utilization using the following logic:

    a. If the actual > estimated, then the difference is added to the deployment's utilization.

    b. If the actual < estimated, then the difference is subtracted.

  4. The overall utilization is decremented at a continuous rate based on the number of PTUs deployed.
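
The token accounting in step 2b can be sketched as simple arithmetic. The helper below is purely illustrative and isn't the service's actual implementation; it assumes the token counts are already known.

# Illustrative only: rough utilization increment for a single request, following
# the estimate described in step 2b (prompt tokens, minus cached tokens, plus the
# max_tokens specified on the call). Not the service's actual code.
estimate_increment() {
    local prompt_tokens=$1 cached_tokens=$2 max_tokens=$3
    echo $(( prompt_tokens - cached_tokens + max_tokens ))
}

# Example: 1,000 prompt tokens, 400 of them cached, max_tokens set to 300.
estimate_increment 1000 400 300    # prints 900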

Note

Calls are accepted until utilization reaches 100%. Bursts just over 100% may be permitted in short periods, but over time, your traffic is capped at 100% utilization.

Diagram showing how subsequent calls are added to the utilization.

How many concurrent calls can I have on my deployment?

The number of concurrent calls you can achieve depends on each call's shape (prompt size, max_tokens parameter, and so on). The service continues to accept calls until utilization reaches 100%. To determine the approximate number of concurrent calls, you can model the maximum requests per minute for a particular call shape in the capacity calculator. If the system generates fewer output tokens than the max_tokens parameter specifies, the provisioned deployment accepts more requests.

What models and regions are available for provisioned throughput?

Global provisioned managed model availability

Models: o3-mini (2025-01-31), o1 (2024-12-17), gpt-4o (2024-05-13), gpt-4o (2024-08-06), gpt-4o (2024-11-20), gpt-4o-mini (2024-07-18)

Regions (availability can vary by model and version; use the model capacities API or the Azure AI Foundry deployment experience to confirm current availability):

  • australiaeast
  • brazilsouth
  • canadaeast
  • eastus
  • eastus2
  • francecentral
  • germanywestcentral
  • italynorth
  • japaneast
  • koreacentral
  • northcentralus
  • norwayeast
  • polandcentral
  • southafricanorth
  • southcentralus
  • southeastasia
  • southindia
  • spaincentral
  • swedencentral
  • switzerlandnorth
  • switzerlandwest
  • uaenorth
  • uksouth
  • westeurope
  • westus
  • westus3

Note

The provisioned version of gpt-4 Version: turbo-2024-04-09 is currently limited to text only.

Next steps