what does 1000 TPM actually mean in deployments

Hải Phạm 0 Reputation points
2024-11-23T05:38:16.2533333+00:00

I have a GPT-4o-mini deployment and its config is 1000 TPM with a corresponding 10 RPM.
But every time I call an API such as the chat completion API, I need to wait 1 minute to be able to call the API again. If not, I will receive a 429 response status. So what do 1000 TPM and 10 RPM do in this deployment? Can anyone tell me?

Azure OpenAI Service

1 answer

  1. Vinodh247 24,091 Reputation points MVP
    2024-11-23T10:17:48.3766667+00:00

    Hi Hải Phạm,

    Thanks for reaching out to Microsoft Q&A.

    The 1000 TPM and 10 RPM configuration on your GPT-4o-mini deployment sets the deployment's rate limits: TPM is the number of tokens per minute and RPM is the number of requests per minute the deployment will accept.

    Here's a detailed explanation:

    What 1000 TPM and 10 RPM mean:

    1. 1000 TPM (Tokens Per Minute):

    • This is the deployment-level limit on the total number of tokens (prompt tokens plus completion tokens) the deployment will process per minute.
    • 1000 TPM is the smallest quota allocation, so a single chat completion with a few hundred tokens of prompt and a few hundred tokens of output can consume most or all of the minute's budget.
    • Once the token budget is exhausted, subsequent requests fail with a 429 Too Many Requests response until the window resets.

    2. 10 RPM (Requests Per Minute):

    • This is the deployment-level limit on how many API calls can be made per minute, no matter how small each request is.
    • The limit is shared by every client, key, and application using the deployment; it is not applied per client.
    • Sending more than 10 requests in a minute returns a 429 response even if the 1000 TPM token budget has not been reached.

    Why are you encountering delays?

    When you call the Chat Completion API:

    • Azure OpenAI admits a request based on an estimate of its token usage (your prompt tokens plus the max_tokens allowed for the completion). With only 1000 TPM, a single chat completion can use up the entire minute's token budget, so the next call is rejected with a 429 until the window resets. This is why you find yourself waiting about a minute between calls.
    • The limits are also enforced over short sub-minute windows, so a burst of calls can trigger a 429 even if your per-minute average is under the limit.
    • Separately, the 10 RPM limit means that, averaged out, you can make at most one request every 6 seconds (60 seconds / 10 requests).
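    The 6-second spacing can be enforced client-side. A minimal pacing sketch (the `call_api` parameter is a placeholder for your actual chat-completion call, not part of any SDK):

```python
import time

def min_interval_seconds(rpm):
    """Minimum spacing between calls to stay under an RPM limit."""
    return 60.0 / rpm

def paced_calls(prompts, rpm=10, call_api=lambda p: None):
    """Invoke call_api once per prompt, never faster than the RPM limit allows."""
    interval = min_interval_seconds(rpm)  # 6 s at 10 RPM
    results = []
    for i, prompt in enumerate(prompts):
        if i > 0:
            time.sleep(interval)  # wait out the per-request interval
        results.append(call_api(prompt))
    return results
```

    Note that pacing only addresses the RPM limit; at 1000 TPM the token budget can still run out after a single large request.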

    Practical Example:

    Scenario 1 (token budget exhausted):

    • You send one chat completion with a 400-token prompt and max_tokens=600. That single request accounts for roughly 1000 tokens, the entire minute's budget, so a second call within the same minute fails with a 429 status.

    Scenario 2 (request limit exhausted):

    • Several clients share the same deployment, so the 10 RPM limit is shared across all of them. If they collectively make more than 10 calls in a minute, later calls receive 429 responses even when the calls are small and the token budget is untouched.
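    Because TPM in Azure OpenAI counts tokens rather than calls, a rough budget check is useful. The sketch below uses the common ~4-characters-per-token heuristic (an approximation, not the model's real tokenizer), and `max_tokens` is whatever completion allowance you plan to request:

```python
def estimate_request_tokens(prompt, max_tokens):
    """Rough token cost of one chat completion: prompt estimate + completion allowance.

    Uses the ~4 characters-per-token heuristic; a real tokenizer gives exact
    counts, but this is close enough for budgeting against a TPM limit.
    """
    prompt_tokens = max(1, len(prompt) // 4)
    return prompt_tokens + max_tokens

# A 1600-character prompt (~400 tokens) plus a 600-token completion allowance
# already accounts for the whole 1000 TPM budget.
cost = estimate_request_tokens("x" * 1600, max_tokens=600)
print(cost)  # 1000
```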
      

    Solution or Workarounds:

    1. Optimize Request Frequency:

    • Space out your API calls so you stay under the 10 RPM limit (no more than one call every 6 seconds on average).

    2. Reduce Tokens Per Request:

    • Shorten prompts and lower max_tokens so each call consumes less of the 1000 TPM budget.

    3. Increase the Deployment Quota:

    • 1000 TPM is the minimum allocation. If your subscription has unused quota for the model, raise the deployment's TPM in the Azure portal; the RPM limit scales up with it.

    4. Handle 429 Responses Gracefully:

    • On a 429 response, read the Retry-After header and retry after that delay instead of failing outright.

    5. Check Rate Limits via API Response Headers:

    • Responses from Azure OpenAI APIs include rate limit headers (e.g., x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens) to help you monitor usage and avoid hitting the limits.
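    A minimal sketch of 429-aware retrying (the `send_request` stub is hypothetical and stands in for your HTTP call; the point is the backoff logic: trust Retry-After when the server sends it, otherwise back off exponentially):

```python
import time

def backoff_delay(attempt, retry_after=None, base=2.0, cap=60.0):
    """Delay before the next retry: honor a server-provided Retry-After value,
    otherwise use capped exponential backoff (2, 4, 8, ... seconds)."""
    if retry_after is not None:
        return retry_after
    return min(cap, base * (2 ** attempt))

def call_with_retries(send_request, max_retries=5):
    """send_request() returns (status_code, retry_after, body); retries on 429."""
    for attempt in range(max_retries):
        status, retry_after, body = send_request()
        if status != 429:
            return body
        time.sleep(backoff_delay(attempt, retry_after))
    raise RuntimeError("still throttled after retries")
```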

    Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.

