Claude models in azure foundry's model router not doing prompt caching

Question

Claude models in azure foundry's model router not doing prompt caching

Sharma, Deeksha 0

Hi ,

I'm testing the model router with latest antrhopic models. Seems like they don't have prompt caching enabled as I don't see any field in usage dictionary.

For eg :

{ "model": "claude-opus-4-7", "usage": { "completion_tokens": 2043, "prompt_tokens": 1327, "total_tokens": 3370 } }

This is the usage dictionary returned.

My question is wouldn't it defeat the purpose of having a router if we're not have any prompt caching for anthropic models. The costs would in fact be super hight than the baseline model costs.

Manish Deshpande 7,520 Reputation points Microsoft External Staff Moderator

2026-07-02T20:25:01.1133333+00:00
Hello @Sharma, Deeksha

Thanks for the detailed example it actually points straight to the answer, and the short version is: this isn't the router disabling caching, and your costs aren't higher than baseline. Let me break it down.

1. The empty usage fields don't mean caching is off. Model Router normalizes usage into the OpenAI-style schema, where cache hits appear as cached_tokens under prompt_tokens_details. Claude doesn't use that field — natively it reports caching through cache_creation_input_tokens and cache_read_input_tokens (Anthropic bills cache writes and cache reads differently).
So for a Claude-backed response, the absence of cached_tokens/prompt_tokens_details is expected and doesn't by itself prove caching didn't happen.

2. In your specific example, caching couldn't apply anyway. Your request shows prompt_tokens: 1327. For Claude models in Foundry, the minimum cacheable prompt is 2,048 tokens — below that threshold, nothing is eligible to be cached. So this particular call would show no cache benefit no matter what.

3. Claude caching is not automatic it needs cache breakpoints. Unlike OpenAI models (which cache identical prefixes automatically), Claude/Anthropic prompt caching only activates when you explicitly mark cache breakpoints (cache_control) in the request, on a prompt of at least 2,048 tokens. If your calls don't set those breakpoints, Claude won't cache that's an Anthropic API behavior, not a router limitation.

4. On the cost concern the router doesn't make Claude more expensive than baseline. A request the router sends to Claude is billed at that Claude model's own token rates the same as if you called the Claude deployment directly. Prompt caching is an optional discount layered on top; not getting it means you miss the discount, not that you pay a premium above the base model. The router's value is right-fit model selection (cheaper/faster models for simpler prompts), which stands on its own even before caching.

To confirm caching works on Claude, test it deliberately: Call your Claude deployment directly (bypass the router) with a prompt of ≥2,048 tokens, set cache breakpoints (cache_control) per Anthropic's guidance, and send it twice in a row. On the second call, check the usage object for a non-zero cache_read_input_tokens. If it's non-zero, caching is working the router just isn't surfacing Claude's cache fields in the normalized schema. You can also cross-check billed input tokens in Azure Monitor (Metrics on your Foundry resource, filtered to the Claude deployment) or Cost Management.

One more note: through the router, two consecutive identical prompts may be routed to different underlying models, which produces no cache hit by design — so the direct-deployment test above is the reliable way to validate caching.

If the direct test with breakpoints still shows no cache reads, that's worth a support ticket referencing this behavior so engineering can confirm Claude cache support/surfacing on your router version.

References:

Model router concepts (prompt caching):
https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/model-router

Prompt caching with Azure OpenAI (cached_tokens / prompt_tokens_details):
https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/prompt-caching

Claude models in Foundry (cache breakpoints, ITPM):
https://learn.microsoft.com/en-us/azure/foundry/foundry-models/concepts/claude-models?tabs=pay-go

Foundry Models from partners — Anthropic (min cacheable prompt 2,048 tokens):
https://learn.microsoft.com/en-us/azure/foundry/foundry-models/concepts/models-from-partners#anthropic

Anthropic prompt caching (cache_control, cache token fields):
https://platform.claude.com/docs/en/build-with-claude/prompt-caching

Thanks,
Manish.
Sharma, Deeksha 0 Reputation points

2026-07-03T05:44:05.26+00:00
Hi @Manish Deshpande , Thanks, I tested directly and could get the cached tokens data . My requirement is to calculate and compare the cost saving using model router vs a baseline claude model. What would you suggest as a good way to do that.

In the foundry monitor section all the requests details are not updated in realtime so I need to do the cost calculation manually .

Claude opus 4.7 "usage": { "input_tokens": 25, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 4827, "cache_creation": { "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0 }, "output_tokens": 200, "output_tokens_details": { "thinking_tokens": 0 }, "service_tier": "standard", "inference_geo": "not_available" ``` }
Manish Deshpande 7,520 Reputation points Microsoft External Staff Moderator

2026-07-04T02:57:59.1133333+00:00
Hello @Sharma, Deeksha

Glad the direct call confirmed the cache accounting that tells us caching itself is fine, it was purely a reporting gap through the router.

For the Model Router vs. baseline cost comparison, there isn't a built-in report that does this for you, because it's a counterfactual: Azure can only ever bill you for calls that actually happened, so the "what if every request went to one fixed model instead" side has to be calculated manually from your logged usage. Here's the approach I'd use:

Log per-request data at the application layer, not from the portal.** For every call through Model Router, capture the model field (tells you which underlying model actually served that request) and the full usage block (input_tokens, cache_creation_input_tokens — split into ephemeral_5m_input_tokens / ephemeral_1h_input_tokens, cache_read_input_tokens, output_tokens). Note thinking_tokens under output_tokens_details is already included inside output_tokens, not additional — so you don't need to add it separately.

Pull current published per-model rates. Rates differ by model (Opus vs. Sonnet vs. Haiku), so grab the rate for every model in your routing subset from the Claude pricing page. Cache writes are billed at 1.25× the input rate for the 5-minute TTL and 2× for the 1-hour TTL; cache reads at 0.1×; everything else at standard input/output rates.

Calculate "Actual" cost. For each logged request, apply the rate of whichever model actually handled it (from the model field) to that request's token breakdown, and sum across all requests.

Calculate "Baseline" cost. Take the exact same per-request token counts, but re-rate every request using only your baseline model's published rate, as if every call had gone there instead. Savings = Baseline − Actual. One caveat worth flagging: this baseline figure is necessarily an approximation it assumes the baseline model would have hit the same cache prefixes and thresholds, which may not hold exactly if cache minimums differ by model.

Cross-check the "Actual" side against billed data. Once it settles, Azure Cost Management shows real post-consumption CCU charges split by model useful to sanity-check your own math. Note there's an approximately 5-hour delay between a billing event and when it shows up in Cost Management, which is almost certainly what you're seeing as "not updated in real time" that's documented, expected behavior, not a bug. For near-real-time (not cost, but token/request volume) data, Monitoring > Metrics on your Foundry resource, split by underlying model, updates much faster and is a good way to independently verify your routing distribution matches what you logged.

Docs for reference:

Claude Consumption Units (CCU) billing in Microsoft Foundry:
https://learn.microsoft.com/en-us/azure/foundry/foundry-models/concepts/claude-models-billing

Claude pricing (per-model rates, CCU conversion, cache multipliers):
https://platform.claude.com/docs/en/about-claude/pricing

Monitor Model Deployments in Foundry Models (the ~5-hour Cost Management delay, and the Cost Management link):
https://learn.microsoft.com/en-us/azure/foundry/foundry-models/how-to/monitor-models

How model router works (monitoring routing distribution, split by underlying model):
https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/model-router-how-it-works

Prompt caching (Anthropic) — cache write/read multipliers:
https://platform.claude.com/docs/en/build-with-claude/prompt-caching

Thanks,
Manish.

Your answer

Sharma, Deeksha 0 Reputation points

2026-07-03T05:44:05.26+00:00

Hi @Manish Deshpande , Thanks, I tested directly and could get the cached tokens data . My requirement is to calculate and compare the cost saving using model router vs a baseline claude model. What would you suggest as a good way to do that.

In the foundry monitor section all the requests details are not updated in realtime so I need to do the cost calculation manually .

Claude opus 4.7 "usage": { "input_tokens": 25, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 4827, "cache_creation": { "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0 }, "output_tokens": 200, "output_tokens_details": { "thinking_tokens": 0 }, "service_tier": "standard", "inference_geo": "not_available" ``` }