Hi Hải Phạm,
Thanks for reaching out to Microsoft Q&A.
The 1000 TPM and 10 RPM configuration on your GPT-4o-mini deployment sets rate limits on how much traffic the deployment will accept within a given time frame.
Here's a detailed explanation:
What 1000 TPM and 10 RPM mean:
1000 TPM (Tokens Per Minute):
- This is the total number of tokens (prompt tokens plus generated completion tokens) your deployment can process per minute, shared across all users and processes that call it.
- A single chat completion with a long prompt or a large max_tokens setting can consume a large share of this budget on its own.
- If your traffic exceeds the 1000-token budget in a minute, subsequent requests fail with a 429 Too Many Requests response.
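To get a feel for how quickly a 1000 TPM budget is consumed, here is a minimal sketch using the rough rule of thumb that English text averages about 4 characters per token (for exact counts use a real tokenizer such as tiktoken; `estimate_tokens`, `TPM_LIMIT`, and `max_completion` are illustrative names, not part of any SDK):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: English text averages ~4 characters per token."""
    return max(1, len(text) // 4)

# A modest prompt plus headroom for the completion, checked against the budget.
prompt = "Summarize the following article in three bullet points. " * 10
TPM_LIMIT = 1000          # the deployment's tokens-per-minute quota
max_completion = 300      # tokens reserved for the model's reply
fits = estimate_tokens(prompt) + max_completion <= TPM_LIMIT
```

Even this single, fairly small request uses almost half the minute's token budget, so two or three such calls back-to-back would exhaust 1000 TPM.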
- 10 RPM (Requests Per Minute):
- This caps the number of API calls the deployment will accept per minute, regardless of how few tokens each call uses.
- It means that all callers sharing the deployment (ex: every application using your endpoint and API key) can together make at most 10 requests per minute.
- If you send more than 10 requests in one minute, you will encounter a 429 response, even if you are nowhere near the 1000 TPM token budget.
Why are you encountering delays?
When you call the Chat Completion API:
- The 10 RPM limit means the deployment can sustain at most one request every 6 seconds on average (60 seconds / 10 requests).
- Azure OpenAI evaluates these limits over short windows (on the order of 1 to 10 seconds), so a quick burst of calls can be throttled even if your per-minute totals look fine.
- Both limits apply to the deployment as a whole and are shared across all clients, so traffic from other processes counts against the same budget.
Practical Example:
Scenario 1 (Exceeding RPM):
- You call the API once and then immediately call it again within the same minute.
- Because 10 RPM works out to roughly one request every 6 seconds, the second request can be rejected with a 429 status even though the deployment is otherwise idle.
Scenario 2 (Exceeding TPM):
- A single chat completion with a long prompt and a long response can consume several hundred tokens, so only a handful of such calls fit inside the 1000 TPM budget before requests start returning 429.
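When a 429 does occur, the standard remedy is to retry with exponential backoff. Below is a minimal, SDK-agnostic sketch; `RateLimitError` stands in for whatever exception your client library raises on a 429, and `call_with_backoff` is a hypothetical helper name:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the 429 error your SDK raises."""

def call_with_backoff(func, max_retries=5, base_delay=1.0):
    """Call func(); on a 429-style error, wait and retry with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Sleep base_delay, 2x, 4x, ... plus jitter so parallel callers
            # don't all retry at the same instant.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

In real code, `func` would wrap your chat-completions call, and you should honor the Retry-After header when the service includes one in the 429 response.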
Solution or Workarounds:
Optimize Request Frequency:
- Space out your API calls to ensure you stay under the 10 RPM limit.
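Concretely, that spacing can be enforced with a tiny pacing helper. This is a sketch (`RequestPacer` is an illustrative name; in production you might prefer a library-provided rate limiter):

```python
import time

class RequestPacer:
    """Block just long enough between calls to stay under an RPM limit."""

    def __init__(self, rpm: int):
        self.min_interval = 60.0 / rpm   # 10 RPM -> one call every 6 seconds
        self.last_call = float("-inf")   # no previous call yet

    def wait(self):
        """Sleep if needed so calls are at least min_interval apart."""
        gap = time.monotonic() - self.last_call
        if gap < self.min_interval:
            time.sleep(self.min_interval - gap)
        self.last_call = time.monotonic()
```

Usage: create `pacer = RequestPacer(rpm=10)` once, then call `pacer.wait()` immediately before each API request.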
Increase the Deployment Quota:
- 1000 TPM is close to the minimum allocation for this model. If your subscription has unused quota for GPT-4o-mini in the region, you can raise the deployment's TPM (the RPM allowance scales with it) in the Azure portal, or request a quota increase from Microsoft.
Check Rate Limits via API Response Headers:
- Responses from Azure OpenAI APIs include rate-limit headers (e.g., x-ratelimit-limit-requests and x-ratelimit-remaining-requests) that help you monitor usage and back off before hitting limits.
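For example, a small helper can read those headers from any HTTP response object that exposes them as a dict (a sketch; header availability can vary by API version, so it treats a missing header as unknown rather than zero):

```python
def read_rate_limits(headers):
    """Extract request-rate limit info from response headers, if present."""
    def _to_int(value):
        return int(value) if value is not None else None

    return {
        "limit": _to_int(headers.get("x-ratelimit-limit-requests")),
        "remaining": _to_int(headers.get("x-ratelimit-remaining-requests")),
    }
```

When `remaining` drops to zero (or near it), pause until the next minute window rather than sending a request you know will be rejected.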
Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.