Mistral low context window tokens in Foundry

Question

Mistral low context window tokens in Foundry

Tiago Gomes 0

Hi,

I'm using a Mistral Medium 3.5 serverless (Global Standard) deployment in Microsoft Foundry and I'm hitting an input token limit that doesn't match the model's specs. Sending a large document (~59K tokens) returns:

{"object":"error","message":"Input length (59376 tokens) exceeds the maximum allowed length (18487 tokens). Use a shorter input or enable --allow-auto-truncate.","type":"BadRequestError","param":null,"code":400}

The model officially supports a 256K context window, so an 18,487 token cap seems wrong. Also, --allow-auto-truncate is a server-side flag I can't control from the API, so this looks like it's coming from the inference backend itself.

Is this a known limitation for Mistral models on Foundry, or a misconfigured deployment? Is there anything I can change on my side (region, deployment type, settings) to get the full context window?

Thanks!

0 comments

Answer accepted by question author

Jerald Felix 16,095 Volunteer Moderator

Hello Tiago Gomes,

Greetings! Thanks for raising this question in the Q&A forum.

What you are seeing is not something you can fix from the client side. The 18487 token cap and the reference to --allow-auto-truncate are both artifacts of the underlying inference server configuration (this looks like a vLLM-style backend) that Microsoft Foundry uses to host the Mistral Medium 3.5 serverless (Global Standard) deployment. That flag and the effective context length are set when the model container is provisioned on Azure's side, not by anything you send in the request body, so there is no request parameter, SDK setting, or client-side workaround that lets you raise it. The gap between the advertised 256K context window and the much smaller enforced limit indicates the serverless deployment's max_model_len (or equivalent context configuration) was set lower than the model's actual spec when it was published to the catalog. Since Mistral Medium 3.5 was only added to Microsoft Foundry very recently, this is consistent with a backend deployment misconfiguration rather than an intentional product limitation.

Confirm it is not a redeployment issue on your end Delete the existing serverless deployment and create a fresh one from the model catalog rather than reusing an older deployment. Sometimes newly published model cards get backend configuration fixes that only apply to new deployments created after the fix, so a clean redeploy is worth ruling out first.

Check region availability Try creating the deployment in a different supported region if one is available to you. Backend container versions can roll out region by region, so a region that received a later build may not have the truncated context configuration.

File an Azure Support ticket Since the limit is enforced server-side and not documented or configurable through the API, this needs to be escalated to Azure Support (or a Microsoft Foundry model-catalog related GitHub issue if you have access to file one) with the following details so engineering can trace the deployment:

Model: Mistral Medium 3.5 (Global Standard, serverless)
Deployment region:
Resource / deployment name:
Error: Input length (59376 tokens) exceeds the maximum allowed length (18487 tokens)
Expected context window per model card: 256K tokens

Include the exact error payload you already have, since the specific numeric limit (18487) helps the support team identify which backend config value is misapplied.

In the interim, chunk your input Until the deployment-side limit is corrected, keep requests under roughly 18K input tokens by splitting large documents into smaller chunks or using a retrieval/summarization pass before sending content to the model, so your application keeps working while the platform-side issue is being resolved.

If this answer helps you kindly accept the answer which will help others who have similar questions.

Best Regards,

Jerald Felix.

0 comments

1 additional answer

Your answer

Answer 1

Hello @Tiago Gomes

Wanted to add few points to Jerald's response.

Thanks for the clear repro the numbers here don't line up with what Mistral publishes, and they also don't fully line up with what Microsoft documents for this model on Foundry, so it's worth separating the two.

What's actually documented for mistral-medium-3-5 on Foundry today

Mistral publishes a 256K (262,144-token) context window for Medium 3.5 on their own model card. But Microsoft's "Foundry Models sold by Azure" reference currently lists mistral-medium-3-5 (still tagged Preview) with an input limit of 128,000 tokens and an output limit of 128,000 tokens, for both Global Standard and Data Zone Standard deployments:

https://learn.microsoft.com/en-us/azure/foundry/foundry-models/concepts/models-sold-directly-by-azure#mistral-models-sold-by-azure

So even in the best case, Foundry's documented ceiling for this model is 128K — not the full 256K Mistral advertises. That part isn't a misconfiguration on your side, it's a platform-level cap Microsoft applies to this Preview deployment. (The same page documents similar sub-native caps elsewhere, e.g., GPT-4.1's provisioned deployments capping below its 1M native window, so this pattern isn't unique to Mistral.)

Why 18,487 is still the real problem

That said, 18,487 tokens is nowhere near the documented 128,000, so the limit your endpoint is actually enforcing doesn't match Microsoft's own published spec either. The exact phrasing of your error — "Input length (X tokens) exceeds the maximum allowed length (Y tokens)" — is the standard message vLLM returns when a request exceeds the max sequence length a serving instance was launched with. That lines up with what the earlier reply suspected: this looks like the serving backend behind your specific deployment is sized/configured with a much smaller effective context window than either Mistral's or Microsoft's documented limits not something you're doing wrong on the client side.

On --allow-auto-truncate`

You're right that this isn't something you can set. The Azure AI Model Inference API doesn't expose a context-window or truncation parameter at all — the only mechanism for passing a non-standard parameter through is the extra-parameters: pass-through header, and even then it's up to the underlying model/server to honor it:

https://learn.microsoft.com/en-us/rest/api/microsoft-foundry/modelinference/#extensibility

The truncation behavior referenced in your error is coming from the inference server itself, not from anything exposed to you as a client-configurable setting.

What I'd try next

Double-check you're deployed on mistral-medium-3-5 and not an older mistral-medium-3 — they have different documented limits.
Spin up a second deployment on the other deployment type (Global Standard vs. Data Zone Standard) and re-run the same request. If the cap differs between the two, that's strong diagnostic evidence and might unblock you faster than waiting on a fix:

https://learn.microsoft.com/en-us/azure/foundry/foundry-models/concepts/deployment-types

Since the enforced limit (18,487) contradicts Microsoft's own documented spec (128,000) for this model, I'd treat this as an Azure Support case rather than just a redeploy — recreating a serverless deployment doesn't let you pick the underlying serving configuration, so it's a reasonable quick thing to try but unlikely to be a reliable fix on its own. When you file, reference the exact error and the 128K figure from the docs above so support can check the specific backend instance behind your endpoint.
In the meantime, chunk the document and retrieve only the relevant sections per call so you stay under the current effective limit. Even once this specific issue is resolved, I'd plan around 128K as your practical ceiling on this model via Foundry, since that's what's documented today regardless of Mistral's native 256K.

Thanks,
Manish.