Hello Tiago Gomes,
Greetings! Thanks for raising this question in the Q&A forum.
What you are seeing is not something you can fix from the client side. The 18487-token cap and the reference to --allow-auto-truncate are both artifacts of the underlying inference server configuration that Microsoft Foundry uses to host the Mistral Medium 3.5 serverless (Global Standard) deployment. The --allow-auto-truncate flag is an SGLang server option, so the backend appears to be SGLang or a similar open-source inference server such as vLLM. That flag and the effective context length are set when the model container is provisioned on Azure's side, not by anything you send in the request body, so there is no request parameter, SDK setting, or client-side workaround that lets you raise the limit.
The gap between the advertised 256K context window and the much smaller enforced limit suggests that the serverless deployment's context length setting (for example --context-length in SGLang or max_model_len in vLLM) was set lower than the model's actual spec when it was published to the catalog. Since Mistral Medium 3.5 was added to Microsoft Foundry only very recently, this is consistent with a backend deployment misconfiguration rather than an intentional product limitation.
Confirm it is not a redeployment issue on your end: Delete the existing serverless deployment and create a fresh one from the model catalog rather than reusing an older deployment. Sometimes newly published model cards get backend configuration fixes that only apply to new deployments created after the fix, so a clean redeploy is worth ruling out first.
Check region availability: Try creating the deployment in a different supported region if one is available to you. Backend container versions can roll out region by region, so a region that received a later build may not have the truncated context configuration.
File an Azure Support ticket: Since the limit is enforced server-side and is neither documented nor configurable through the API, this needs to be escalated to Azure Support (or raised as a GitHub issue against the Microsoft Foundry model catalog, if you have access to file one). Include the following details so engineering can trace the deployment:
Model: Mistral Medium 3.5 (Global Standard, serverless)
Deployment region:
Resource / deployment name:
Error: Input length (59376 tokens) exceeds the maximum allowed length (18487 tokens)
Expected context window per model card: 256K tokens
Include the exact error payload you already have, since the specific numeric limit (18487) helps the support team identify which backend config value is misapplied.
In the interim, chunk your input: Until the deployment-side limit is corrected, keep requests under roughly 18K input tokens by splitting large documents into smaller chunks or by running a retrieval/summarization pass before sending content to the model, so your application keeps working while the platform-side issue is being resolved.
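As a rough illustration of the chunking workaround, here is a minimal Python sketch. It is not an official Foundry or Mistral utility: the helper names (parse_limit, estimate_tokens, chunk_text) are hypothetical, and the 3-characters-per-token ratio is only a conservative guess, not the model's real tokenizer. If you have access to the actual tokenizer, count tokens with that instead.

```python
import math
import re
from typing import List, Optional, Tuple

# Pattern based on the error text quoted in this thread.
_LIMIT_RE = re.compile(
    r"Input length \((\d+) tokens\) exceeds the maximum allowed length \((\d+) tokens\)"
)


def parse_limit(error_message: str) -> Optional[Tuple[int, int]]:
    """Return (input_tokens, max_tokens) parsed from the error text, or None."""
    match = _LIMIT_RE.search(error_message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def estimate_tokens(text: str, chars_per_token: float = 3.0) -> int:
    """Rough token estimate. ASSUMPTION: about 3 characters per token."""
    return math.ceil(len(text) / chars_per_token)


def chunk_text(text: str, max_tokens: int = 16000,
               chars_per_token: float = 3.0) -> List[str]:
    """Split text on blank lines into chunks that stay within max_tokens
    according to estimate_tokens. Oversized paragraphs are hard-split."""
    max_chars = int(max_tokens * chars_per_token)
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        # Hard-split any paragraph that is too long on its own.
        pieces = [paragraph[i:i + max_chars]
                  for i in range(0, len(paragraph), max_chars)] or [""]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    err = ("Input length (59376 tokens) exceeds the maximum allowed "
           "length (18487 tokens)")
    limits = parse_limit(err)
    # Leave headroom below the enforced limit for the prompt template.
    budget = (limits[1] - 2000) if limits else 16000
    document = "\n\n".join(["Some paragraph text. " * 50] * 200)
    parts = chunk_text(document, max_tokens=budget)
    print(len(parts), max(estimate_tokens(p) for p in parts))
```

Each chunk can then be sent as a separate request, or summarized first and the summaries combined. The headroom of 2000 tokens is an arbitrary choice for illustration, so adjust it to the size of your system prompt and instructions.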
If this answer helps, kindly accept it so that others with similar questions can find it.
Best Regards,
Jerald Felix.