How to scale Sync-over-Async HTTP Gateways without hitting Azure Service Bus Session limits?

Shivam Saluja 0 Reputation points
2026-04-13T17:37:00.5033333+00:00

We are building an edge API Gateway to bridge legacy synchronous HTTP clients with long-running, asynchronous AI workers (tasks taking 45+ seconds).

To prevent 504 Gateway Timeouts and thread exhaustion, we are using a Sync-over-Async pattern: the REST controller receives the request, drops it onto a Service Bus queue, and holds the HTTP connection open while it waits for the worker's reply.

The standard recommendation is to use Service Bus Sessions to correlate the request to the reply. However, as we scale, we are hitting severe bottlenecks:

Stateful Routing: The Gateway pod must hold a session lock. If that pod crashes, the session stays locked and the reply is trapped until the lock expires.

Standard Tier Limits: A traffic spike easily exhausts the 1,000 concurrent session limit on the Service Bus Standard tier.

Horizontal Scaling: Load balancers distribute incoming requests statelessly, but the session requires stateful tracking to get the reply back to the specific pod holding the open HTTP connection.

Is there a recommended architectural pattern to achieve request-reply correlation over Service Bus at high scale without relying on Sessions, keeping the Gateway pods 100% stateless?

Azure Service Bus

An Azure service that provides cloud messaging as a service and hybrid integration.

2 answers

  1. Q&A Assist
    2026-04-13T17:37:19.0233333+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.

    Use a pure asynchronous request-reply pattern and remove the requirement to hold the HTTP connection open. That keeps the gateway stateless and avoids Service Bus session limits entirely.

    Key elements (gateway and worker code sketches follow below):

    1. Decouple HTTP request from worker reply
      • Front-end gateway:
        • Validates the request.
        • Enqueues a command message to Service Bus (no session required).
        • Immediately returns 202 Accepted with a URL to a status resource in the Location header.
      • This follows the Asynchronous Request-Reply pattern with HTTP polling, where the client no longer waits synchronously for the worker to complete.
    2. Use an HTTP status endpoint instead of in-memory correlation
      • The Location header points to a status endpoint, for example /operations/{operationId}.
      • The gateway persists operation state in a durable store (cache/DB) keyed by operationId, not in process memory.
      • Any gateway pod can handle GET /operations/{operationId} because all state is externalized.
      • While work is in progress, the status endpoint returns 200 OK with a body indicating the current state.
      • When work completes, the status endpoint either:
        • Returns the result directly, or
        • Returns 303 See Other redirecting to the final resource URL.
    3. Correlate via IDs, not Service Bus sessions
      • Include a correlation/operation ID in the command message sent to Service Bus.
      • The worker processes the message asynchronously and publishes a completion message or writes the result directly to the status store using that ID.
      • No Service Bus sessions are required; correlation is done at the application level via the ID.
    4. Make the gateway and workers fully stateless
      • Gateway pods:
        • Only send messages to Service Bus and read/write operation state from a shared store.
        • Do not hold open HTTP connections waiting for Service Bus replies.
      • Workers:
        • Consume from Service Bus using standard queues/topics.
        • Write results and state transitions to the shared store.
      • This aligns with asynchronous messaging guidance where services scale independently and failures in downstream services do not block the front end.
    5. Apply asynchronous messaging patterns for scale
      • Use queues and the Queue-Based Load Leveling and Competing Consumers patterns so workers can scale horizontally and process messages at their own rate.
      • Ensure message processing is idempotent so retries do not corrupt state.
      • Use Service Bus built-in retry and dead-letter capabilities for robustness.
    6. Manage client experience
      • Since operations are long-running (45+ seconds), design the UX around asynchronous completion:
        • Show “in progress” state based on the status endpoint.
        • Optionally send notifications (email, push, webhook) when the operation completes.
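
    A minimal sketch of the gateway side, assuming FastAPI, redis-py as the shared status store, and the azure-servicebus package (the queue name, route paths, and Redis key scheme are illustrative choices, not requirements):

    ```python
    # Gateway sketch: enqueue work, return 202 + Location, serve status polls.
    # Assumptions: FastAPI, redis-py as the shared store, azure-servicebus.
    # "ai-work-queue" and the "op:{id}" key scheme are illustrative names.
    import json
    import os
    import uuid

    import redis
    from azure.servicebus import ServiceBusClient, ServiceBusMessage
    from fastapi import FastAPI, Response

    app = FastAPI()
    store = redis.Redis()  # any durable shared store works; Redis is an example
    sb = ServiceBusClient.from_connection_string(os.environ["SB_CONN_STR"])

    @app.post("/analyze")
    def submit(payload: dict) -> Response:
        operation_id = str(uuid.uuid4())
        # State lives in the shared store, so ANY pod can answer status polls.
        store.set(f"op:{operation_id}", json.dumps({"status": "Running"}))
        msg = ServiceBusMessage(json.dumps(payload))
        msg.correlation_id = operation_id  # app-level correlation, no session
        with sb.get_queue_sender("ai-work-queue") as sender:
            sender.send_messages(msg)
        # Return immediately; the client polls the Location URL.
        return Response(status_code=202,
                        headers={"Location": f"/operations/{operation_id}"})

    @app.get("/operations/{operation_id}")
    def get_status(operation_id: str):
        raw = store.get(f"op:{operation_id}")
        if raw is None:
            return Response(status_code=404)
        # In-progress polls get 200 OK with the current state; once finished,
        # this could instead return 303 See Other to the final resource URL.
        return json.loads(raw)
    ```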

    This architecture:

    • Eliminates the need for Service Bus sessions and their concurrency limits.
    • Keeps gateway pods stateless and horizontally scalable.
    • Avoids stateful routing and session lock issues because no HTTP connection is tied to a specific worker or queue consumer.
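
    The matching worker side, under the same assumptions (run_model is a placeholder for the actual 45+ second AI job):

    ```python
    # Worker sketch: competing consumer that externalizes all state.
    import json
    import os

    import redis
    from azure.servicebus import ServiceBusClient

    store = redis.Redis()
    sb = ServiceBusClient.from_connection_string(os.environ["SB_CONN_STR"])

    def run_model(payload: dict) -> dict:
        ...  # placeholder for the long-running AI task

    with sb.get_queue_receiver("ai-work-queue") as receiver:
        for msg in receiver:  # competing consumers: add worker pods to scale out
            result = run_model(json.loads(str(msg)))
            # Idempotent: a redelivered message just overwrites the same key.
            store.set(f"op:{msg.correlation_id}",
                      json.dumps({"status": "Succeeded", "result": result}))
            receiver.complete_message(msg)
    ```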

  2. Rakesh Mishra 8,100 Reputation points Microsoft External Staff Moderator
    2026-04-13T18:44:51.1833333+00:00

    Hey Shivam,

    Great question, and nice work teasing out the root of the scaling issue with Sessions. You are completely right that pushing the routing logic down to the broker is the key to achieving stateless gateways, but you need to be very careful about how you implement dynamic subscriptions.

    The naive flow (creating a temporary subscription per request) is considered an anti-pattern at scale and will quickly hit Azure Service Bus quotas. Specifically:

    • The 2,000 Limit: Service Bus caps each topic at a maximum of 2,000 subscriptions. A traffic spike of 2,000+ concurrent requests exhausts that per-topic quota, and every request beyond it fails outright.
    • Control-Plane Latency: Creating and tearing down subscriptions are management API operations. They are heavily rate-limited and far too slow to put on the hot path of an HTTP request.
    • Orphaned Entities: If a pod crashes before deleting its per-request subscription, the subscription leaks. Even with AutoDeleteOnIdle, it takes a minimum of 5 minutes for the broker to clean it up.

    The Recommended Pattern: Per-Pod Temporary Subscriptions + In-Memory Correlation

    To achieve 100% statelessness without hitting limits, you should create a dynamic subscription per Gateway Pod, not per request. Here is the optimized flow (a Python sketch follows the list):

    1. Gateway Pod Startup (Control Plane)
      • When a Gateway pod spins up, it generates a unique ID (e.g., Pod-42).
      • It creates a single temporary subscription on the shared ReplyTopic with a SQL Filter: ReplyToPod = 'Pod-42'.
      • It configures AutoDeleteOnIdle = 5 minutes so the broker cleans it up if the pod scales down or crashes.
    2. Handling the HTTP Request (Data Plane)
      • The pod receives an HTTP request and generates a CorrelationId (e.g., Req-A1).
      • It registers a TaskCompletionSource (or CompletableFuture) into an in-memory ConcurrentDictionary, keyed by the CorrelationId.
      • It sends the work message to the worker queue, stamping both CorrelationId = 'Req-A1' and ReplyToPod = 'Pod-42'.
    3. Worker Processing
      • The AI worker does its 45+ second job and publishes the reply to the ReplyTopic, passing along the same CorrelationId and ReplyToPod properties.
    4. Broker-Driven Fan-in
      • Service Bus evaluates the SQL filter and pushes the message strictly to Pod-42's subscription.
      • The message pump on Pod-42 reads the message, extracts the CorrelationId, pulls the corresponding Task from the ConcurrentDictionary, and completes it—returning the HTTP response.
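
    Here is a rough Python sketch of steps 1, 2, and 4 (asyncio.Future and a plain dict play the roles of TaskCompletionSource and ConcurrentDictionary; the entity names ReplyTopic and ai-work-queue and the ReplyToPod property are illustrative, not fixed conventions):

    ```python
    # Per-pod subscription + in-memory correlation, sketched with asyncio.
    import asyncio
    import datetime
    import json
    import os
    import uuid

    from azure.servicebus import ServiceBusMessage
    from azure.servicebus.aio import ServiceBusClient
    from azure.servicebus.management import (
        ServiceBusAdministrationClient, SqlRuleFilter)

    CONN = os.environ["SB_CONN_STR"]
    POD_ID = f"pod-{uuid.uuid4().hex[:8]}"   # unique per gateway pod
    pending: dict[str, asyncio.Future] = {}  # CorrelationId -> parked request

    def create_pod_subscription() -> None:
        """Step 1 (control plane): run ONCE at pod startup, never per request."""
        admin = ServiceBusAdministrationClient.from_connection_string(CONN)
        admin.create_subscription(
            "ReplyTopic", POD_ID,
            auto_delete_on_idle=datetime.timedelta(minutes=5),  # GC on crash
        )
        # Swap the default match-all rule for a filter on this pod's ID.
        admin.delete_rule("ReplyTopic", POD_ID, "$Default")
        admin.create_rule("ReplyTopic", POD_ID, "OnlyMine",
                          filter=SqlRuleFilter(f"ReplyToPod = '{POD_ID}'"))

    async def handle_http_request(sb: ServiceBusClient, payload: dict) -> dict:
        """Step 2 (data plane): park the HTTP request on an in-memory Future."""
        correlation_id = str(uuid.uuid4())
        pending[correlation_id] = asyncio.get_running_loop().create_future()
        msg = ServiceBusMessage(json.dumps(payload))
        msg.correlation_id = correlation_id
        msg.application_properties = {"ReplyToPod": POD_ID}
        async with sb.get_queue_sender("ai-work-queue") as sender:
            await sender.send_messages(msg)
        try:
            # Resolves when reply_pump() completes the Future.
            return await asyncio.wait_for(pending[correlation_id], timeout=55)
        finally:
            pending.pop(correlation_id, None)  # avoid leaks on timeout

    async def reply_pump(sb: ServiceBusClient) -> None:
        """Step 4: the broker fans replies in; complete the matching Future."""
        async with sb.get_subscription_receiver("ReplyTopic", POD_ID) as receiver:
            async for msg in receiver:
                fut = pending.get(msg.correlation_id)
                if fut is not None and not fut.done():
                    fut.set_result(json.loads(str(msg)))
                await receiver.complete_message(msg)
    ```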

    Why this is the ultimate scaling fix:

    • Zero Hot-Path Latency: You execute zero control-plane operations during a request.
    • Massive Scale: If you scale to 500 gateway pods, you only have 500 subscriptions on the topic—well below the 2,000 hard limit.
    • No sticky routing needed: Standard load balancers can route the incoming HTTP request to any pod, and Service Bus handles routing the reply back to the exact pod holding that client's open HTTP connection.
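
    Purely as an illustration, the pieces sketched above could be wired into a FastAPI app like this, so the pump runs in the background for the life of the pod:

    ```python
    # Illustrative wiring for the sketch above (FastAPI is an assumption).
    import asyncio
    from contextlib import asynccontextmanager

    from azure.servicebus.aio import ServiceBusClient
    from fastapi import FastAPI

    sb = None  # ServiceBusClient, created at startup

    @asynccontextmanager
    async def lifespan(app: FastAPI):
        global sb
        create_pod_subscription()                   # one control-plane call, at startup
        sb = ServiceBusClient.from_connection_string(CONN)
        pump = asyncio.create_task(reply_pump(sb))  # background fan-in pump
        yield
        pump.cancel()
        await sb.close()

    app = FastAPI(lifespan=lifespan)

    @app.post("/analyze")
    async def analyze(payload: dict) -> dict:
        # The HTTP connection stays open right here until the reply arrives.
        return await handle_http_request(sb, payload)
    ```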

    (Note: Since your AI workers take 45+ seconds, ensure your edge Load Balancers/Ingress controllers have their idle timeout increased to 60s+, otherwise the client will receive a 504 Gateway Timeout before the Service Bus reply even makes it back to the pod! If you cross the 60s boundary, you may need to abandon Sync-over-Async entirely and adopt an Asynchronous HTTP 202 Polling pattern).

    Hope this helps you unblock your scale-out safely. Let me know in the comments if it works.

    Note: This response is drafted with the help of AI systems.

