Azure AI Translator Container Memory Leak: Symptoms, Causes, and Solutions Inquiry

BrityMeeting 공용계정
2025-07-18T01:48:28.0933333+00:00

We set up the Azure AI Translator container on 4 servers a month ago (June 2025).

There is no problem with the translation itself, but we have confirmed that there is a memory leak as shown in the image below.

I would like to ask how to identify the cause of the memory leak and resolve it.

Azure AI Translator

1 answer

  1. Amira Bedhiafi, Volunteer Moderator
    2025-10-27T15:16:02.16+00:00

    Hello! Thank you for posting on Microsoft Learn Q&A.

    First, verify that this is real RSS (resident set size) growth and not page cache or host noise. Run the following on each host/container:

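    # live view of per-container memory and CPU usage
    docker stats cpaas_trans
    
    # inside the container; assumes PID 1 is the service (docker exec cpaas_trans cat /proc/1/comm shows its name)
    docker exec cpaas_trans sh -c 'grep -E "VmRSS|VmSize" /proc/1/status; echo "---"; grep -E "MemAvailable|^Cached" /proc/meminfo; echo "---"; cat /proc/1/smaps_rollup'
    
    # per-mapping view, largest RSS first (good for spotting growing C/C++ heap regions)
    docker exec cpaas_trans cat /proc/1/smaps | awk '/^[0-9a-f]+-[0-9a-f]+ /{m=$0} /^Rss:/{print $2, "kB", m}' | sort -rn | head -20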
    

    If VmRSS tracks the line in your Grafana graph while Cached on the host stays flat, the growth is real process memory (most likely heap).
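To tell a leak from a one-time warm-up, it helps to log VmRSS over time rather than reading it once. The sketch below assumes the container name cpaas_trans used above and that PID 1 inside it is the service; `rss_kb`, `sample_rss`, and `rss_growth` are hypothetical helper names, not part of Docker or the Translator container.

```shell
# Hypothetical helper: print the VmRSS value (kB) from text in
# /proc/<pid>/status format; reads the named files, or stdin if none.
rss_kb() {
  awk '$1 == "VmRSS:" { print $2 }' "$@"
}

# Hypothetical helper: every INTERVAL seconds, COUNT times, append
# "<epoch-seconds> <VmRSS kB>" for PID 1 of CONTAINER to OUT.
# Assumption: PID 1 inside the container is the translator service.
sample_rss() {
  container=$1; interval=$2; count=$3; out=$4
  i=0
  while [ "$i" -lt "$count" ]; do
    kb=$(docker exec "$container" cat /proc/1/status | rss_kb)
    echo "$(date +%s) $kb" >> "$out"
    i=$((i + 1))
    if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
  done
}

# Hypothetical helper: print the net RSS change (kB) between the first
# and last samples in a file written by sample_rss.
rss_growth() {
  awk 'NR == 1 { first = $2 } { last = $2 } END { print last - first }' "$1"
}
```

For example, `sample_rss cpaas_trans 600 144 rss.log` records roughly 24 hours at 10-minute intervals, and `rss_growth rss.log` prints the net change in kB. Growth that continues through quiet periods suggests a leak; growth that levels off after startup is more consistent with model loading or caches.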

    Next, rule out the following easy causes:

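    • cgroups limits: you set --memory 46g on a 48 GB host. Leave at least 6–8 GB of headroom for the OS and page cache:
        docker run ... --memory 38g --memory-swap 38g --memory-reservation 30g ...
      
      (Setting --memory-swap to the same value as --memory prevents the container from using swap, so memory growth hits the limit visibly instead of silently spilling into swap. Docker ignores --memory-swap 0 and treats it as unset, which does not disable swap.)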
    • logging: you're using --log-driver=syslog; confirm that syslog on the host rotates its logs so they don't add memory or disk I/O pressure.
    • concurrency: check whether bursts of concurrent requests line up with the small step-ups in RSS.
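The headroom rule in the cgroups bullet can be turned into a small calculation. This is a sketch under assumptions: `suggest_memory_limit` and `show_container_limit` are hypothetical helpers, the headroom (in GiB) is passed in by you, and the cgroup files read are the standard v2 and v1 locations inside a container.

```shell
# Hypothetical helper: suggest a --memory value that leaves HEADROOM_GB of
# host RAM for the OS and page cache. Reads MemTotal (kB) from a file in
# /proc/meminfo format and rounds the host total down to whole GiB.
suggest_memory_limit() {
  meminfo=$1
  headroom_gb=$2
  total_gb=$(awk '/^MemTotal:/ { print int($2 / 1048576) }' "$meminfo")
  limit_gb=$((total_gb - headroom_gb))
  if [ "$limit_gb" -le 0 ]; then
    echo "headroom exceeds host memory" >&2
    return 1
  fi
  echo "${limit_gb}g"
}

# Hypothetical helper: show the memory limit the kernel actually enforces
# for a running container (cgroup v2 file first, then the cgroup v1 file).
show_container_limit() {
  docker exec "$1" sh -c 'cat /sys/fs/cgroup/memory.max 2>/dev/null || cat /sys/fs/cgroup/memory/memory.limit_in_bytes'
}
```

On the host, `suggest_memory_limit /proc/meminfo 10` prints a value for `--memory` (a host reporting exactly 48 GiB gives `38g`). After restarting, `show_container_limit cpaas_trans` confirms the enforced limit in bytes, or `max` on cgroup v2 when no limit is set.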

    Then lower the limits on one node: --memory 38g --memory-swap 38g --cpus 16.

    Also spin up one canary container with only -e Languages=en,pt and mirror 10% of the traffic to it for 48 hours.
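The canary step could be scripted along these lines. This is a sketch: `run_canary`, the container name `cpaas_trans_canary`, and the `DRY_RUN` switch are hypothetical. You pass your existing image reference and the remaining `-e`/`-v`/`-p` options from your current run command as extra arguments. Setting --memory-swap equal to --memory keeps the canary off swap. Mirroring 10% of traffic happens at your load balancer or proxy and is not shown.

```shell
# Hypothetical helper: start a canary with tighter limits and a reduced
# language set. With DRY_RUN set, it only prints the docker command.
# Usage: run_canary NAME IMAGE LANGUAGES [extra docker run options...]
run_canary() {
  name=$1; image=$2; languages=$3; shift 3
  # --memory-swap equal to --memory means no swap for this container.
  ${DRY_RUN:+echo} docker run -d --name "$name" \
    --memory 38g --memory-swap 38g --cpus 16 \
    -e "Languages=$languages" "$@" "$image"
}
```

Running it once with `DRY_RUN=1` lets you review the full command before starting the container for real.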

    Add an hourly dump of /proc/1/smaps_rollup to quantify how much, and what kind of, memory grows. On the client side, cap the HTTP keep-alive lifetime and the size of idle connection pools. Finally, try a newer image tag on a separate canary.
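A minimal sketch of the hourly dump, assuming the host can run `docker exec` against cpaas_trans; `snapshot_rollup`, `rollup_series`, and the log path used below are hypothetical. Comparing fields such as Anonymous (heap-like memory) with Rss across snapshots shows whether the growth is anonymous or file-backed.

```shell
# Hypothetical helper: append one timestamped snapshot of smaps_rollup-format
# text (read from stdin) to LOG, preceded by a "=== <UTC timestamp>" line.
snapshot_rollup() {
  log=$1
  {
    echo "=== $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    cat
  } >> "$log"
}

# Hypothetical helper: print "<timestamp> <value kB>" for one field
# (e.g. Rss, Anonymous, Private_Dirty) from every snapshot in LOG.
rollup_series() {
  log=$1; field=$2
  awk -v f="$field:" '/^=== / { ts = $2 } $1 == f { print ts, $2 }' "$log"
}
```

Call it hourly from a script run by cron or a systemd timer, for example `docker exec cpaas_trans cat /proc/1/smaps_rollup | snapshot_rollup /var/log/cpaas_trans_rollup.log`, then run `rollup_series /var/log/cpaas_trans_rollup.log Anonymous` to list the trend.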
