Services in ACI are intermittently impacted, causing timeouts

Question

Liang Zhang 0 Microsoft Employee

1. The execution time curve fluctuates sharply at the beginning of every hour.
User's image

2. Sometimes several server instances suddenly run more slowly, then recover on their own, as shown below.
User's image

Himanshu Shekhar 6,420 Reputation points Microsoft External Staff Moderator

2026-06-03T08:19:02.6866667+00:00
Liang Zhang

Thank you for sharing the details and graphs.

From the provided trend, we can observe intermittent latency spikes and temporary execution slowdowns across ACI instances, especially around the beginning of hourly intervals. Since the services automatically recover afterward, this currently appears to be an intermittent platform/backend performance degradation rather than a complete application failure.

To help isolate whether the behavior is originating from the application layer, container runtime, or underlying infrastructure, could you please help validate/share the following:

Affected ACI region and timeframe (with timezone)

Correlation between timeout spikes and container restarts/redeployments

Container resource utilization during the impacted window (CPU/Memory)

Any backend dependency calls impacted during the same period

Whether the issue is affecting all containers or only specific instances

Himanshu Shekhar 6,420 Reputation points Microsoft External Staff Moderator

2026-06-04T08:01:58.06+00:00

Liang Zhang - I’m just checking whether the response was helpful. Please let me know if you have any queries. We have also reached out via private messages, so please review those as well.

Answer 1

The described behavior is consistent with intermittent performance issues on the compute side rather than a pure network fault.

In similar Azure containerized workloads, intermittent timeouts and sharp spikes in latency are commonly caused by:

Component or application performance problems inside the container (for example, CPU or memory pressure, garbage collection, or restarts), which manifest as timeouts at the client.
Platform limits such as SNAT port exhaustion when many outbound connections are opened in a short time window.

From the available guidance:

Intermittent timeouts with an otherwise healthy network are typically linked to component performance issues. The recommended approach is to inspect resource usage and health from the “inside out”: check CPU/memory usage and container restarts, and review logs around the time of the spikes to see whether the application is recycling or entering a CrashLoopBackOff‑like pattern.
For services that make many outbound calls, sudden latency spikes and timeouts can also be caused by exhausting SNAT ports. A rapid succession of client or container requests to the same backend host and port can consume the preallocated SNAT ports. Until ports are reclaimed (after several minutes), new connections may hang or time out.
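
One way to limit the connection churn described above is to keep a connection to the backend open and reuse it, so that repeated requests share one outbound port instead of opening a new connection per call. The following is a minimal sketch using only the Python standard library. The class name ReusedConnectionClient and the example URL are hypothetical, and whether SNAT exhaustion is actually the cause in this deployment is an assumption that still has to be confirmed from metrics.

```python
import http.client
import threading
from urllib.parse import urlsplit


class ReusedConnectionClient:
    """Hypothetical helper: one keep-alive connection per thread to one backend.

    Reusing an open TCP connection means repeated requests share a single
    outbound port instead of opening a new connection (and port) per call.
    """

    def __init__(self, base_url: str, timeout: float = 10.0):
        parts = urlsplit(base_url)
        self._use_tls = parts.scheme == "https"
        self._host = parts.hostname
        self._port = parts.port  # None falls back to the default (80 or 443)
        self._timeout = timeout
        # http.client connections are not thread-safe, so keep one per thread.
        self._local = threading.local()

    def _new_connection(self):
        cls = (http.client.HTTPSConnection if self._use_tls
               else http.client.HTTPConnection)
        conn = cls(self._host, self._port, timeout=self._timeout)
        self._local.conn = conn
        return conn

    def _send(self, conn, path):
        conn.request("GET", path)
        response = conn.getresponse()
        # The body must be read fully before the connection can be reused.
        return response.status, response.read()

    def get(self, path: str):
        conn = getattr(self._local, "conn", None) or self._new_connection()
        try:
            return self._send(conn, path)
        except (ConnectionError, http.client.HTTPException):
            # The peer may have closed an idle connection; reopen and retry once.
            # Retrying is only safe here because GET is idempotent.
            conn.close()
            return self._send(self._new_connection(), path)


# Example usage (hypothetical endpoint):
#   client = ReusedConnectionClient("https://backend.example.com")
#   status, body = client.get("/health")
```

Creating the client once at startup and sharing it across request handlers is the point of the design: a client constructed per request would open a fresh connection each time and bring back the churn it is meant to avoid.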

Based on this, the next steps are:

During the top‑of‑hour spikes, inspect container metrics (CPU, memory, restarts) and logs to see whether the application is under heavy load, recycling, or throwing errors.
If the containers call external services, review connection patterns (for example, many short‑lived connections to the same endpoint) and consider connection pooling or reducing connection churn to avoid SNAT exhaustion.
Correlate timeout periods with any scheduled jobs or batch workloads that run at the beginning of each hour, as these may be triggering the bursts in resource usage or outbound connections.
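
To check whether the slowdowns really cluster at the top of the hour, the correlation step above can be sketched as a small script that groups latency samples by minute-of-hour. This is only an illustration: the function names are hypothetical, the sample data is synthetic, and it assumes the timestamps come from your own logs or exported metrics in one consistent timezone.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_latency_by_minute(samples):
    """Return {minute_of_hour: median latency} for (timestamp, latency_ms) pairs."""
    buckets = defaultdict(list)
    for timestamp, latency_ms in samples:
        buckets[timestamp.minute].append(latency_ms)
    return {minute: median(values) for minute, values in sorted(buckets.items())}


def top_of_hour_ratio(samples, window_minutes=5):
    """Median latency in the first window_minutes of each hour divided by
    the median latency in the rest of the hour. Returns None when either
    group has no samples."""
    early = [lat for ts, lat in samples if ts.minute < window_minutes]
    rest = [lat for ts, lat in samples if ts.minute >= window_minutes]
    if not early or not rest:
        return None
    return median(early) / median(rest)


# Synthetic data for illustration only: one sample every 5 minutes over
# 3 hours, slow at minute 0 and normal otherwise.
samples = [
    (datetime(2026, 6, 3, hour, minute), 900.0 if minute < 5 else 120.0)
    for hour in range(3)
    for minute in range(0, 60, 5)
]

print(median_latency_by_minute(samples))
print(top_of_hour_ratio(samples))
```

A ratio well above 1 that repeats across many hours points toward something scheduled (a cron job, batch workload, or periodic cache refresh), whereas a ratio near 1 suggests the slowdowns are not tied to the clock.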
