Azure AI Foundry, formerly known as Azure AI Services or Azure Cognitive Services, is a unified collection of prebuilt AI capabilities within the Microsoft Foundry platform.
Use the following checks and actions when an Azure AI Foundry evaluation is stuck in the Starting state (or never moves to Running/Completed), even though chat in the playground works:
- Verify the evaluation job status and cancel if it is stuck
  - If using the SDK, check the run status.
  - If the run has been in Running/Starting for a long time with no progress, cancel it with `client.evals.runs.cancel(run_id, eval_id=eval_id)`.
  - After canceling, create a new evaluation run.
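The check-then-cancel step above can be wrapped in a small polling helper. This is a sketch only: it assumes an OpenAI-style `client` exposing `evals.runs.retrieve` and `evals.runs.cancel` (matching the cancel call shown above); `eval_id`, `run_id`, and the timeout values are placeholders to adapt to your setup.

```python
import time

def cancel_if_stuck(client, eval_id, run_id, timeout_s=900, poll_s=30):
    """Poll the evaluation run; cancel it if it stays non-terminal past timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        run = client.evals.runs.retrieve(run_id, eval_id=eval_id)
        if run.status in ("completed", "failed", "canceled"):
            return run.status  # terminal state: nothing to cancel
        time.sleep(poll_s)
    # Still queued/running after the timeout: treat as stuck and cancel it.
    client.evals.runs.cancel(run_id, eval_id=eval_id)
    return "canceled"
```

After canceling, create a fresh evaluation run rather than retrying the stuck one.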
- Check Azure OpenAI model capacity and quota
  - A common cause of long-running or stuck evaluation jobs is insufficient capacity on the Azure OpenAI deployment used by the evaluation.
  - In the Azure portal, open the Azure OpenAI (or Foundry model deployment) resource used for the evaluation and:
    - Verify the deployment is healthy.
    - Increase the model capacity / tokens-per-minute (TPM) quota if it is low or saturated.
  - After increasing capacity, rerun the evaluation.
- Confirm authentication and permissions
  - If the evaluation is created via the SDK and authentication is misconfigured, the job may never progress.
  - Ensure `DefaultAzureCredential` is correctly set up (run `az login` if using the Azure CLI) and that the identity used has the Azure AI User role on the Foundry project.
  - Verify the project endpoint URL is correct and includes both the account and project names.
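As a quick stdlib-only sanity check for the endpoint rule above, you could validate the URL shape before creating the client. The expected pattern here is an assumption based on the common `https://<account>.services.ai.azure.com/api/projects/<project>` form; adjust it if your endpoint differs.

```python
from urllib.parse import urlparse

def looks_like_project_endpoint(url: str) -> bool:
    """Check that a Foundry project endpoint names both an account and a project."""
    parsed = urlparse(url)
    # Account name is the subdomain of the assumed services.ai.azure.com host.
    host_ok = parsed.scheme == "https" and parsed.netloc.endswith(".services.ai.azure.com")
    parts = [p for p in parsed.path.split("/") if p]
    # Project name must appear as /api/projects/<project-name>.
    path_ok = len(parts) == 3 and parts[:2] == ["api", "projects"]
    return host_ok and path_ok
```

A malformed endpoint (missing the project segment, wrong scheme) fails this check before any network call is made.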
- Validate evaluation dataset and mapping
  - If the evaluation uses a dataset (CSV/JSONL), schema or mapping issues can cause failures:
    - Ensure the JSONL file has one valid JSON object per line.
    - Confirm `data_mapping` field names exactly match the dataset fields (case-sensitive).
    - Ensure `item_schema` properties align with the dataset fields.
  - Fix any schema/mapping issues and recreate the evaluation run.
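The JSONL checks above can be automated before submitting the run. This is a sketch: `expected_fields` stands in for whatever field names your `data_mapping` references, and the report format is illustrative.

```python
import json

def validate_jsonl(path, expected_fields):
    """Return a list of problems: invalid JSON lines or missing mapped fields."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            # Field names must match exactly, including case.
            missing = [k for k in expected_fields if k not in row]
            if missing:
                problems.append(f"line {lineno}: missing fields {missing}")
    return problems
```

An empty result means every line parsed and contained all the mapped fields; otherwise fix the reported lines and recreate the evaluation run.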
- Handle rate limiting
  - If the environment is hitting rate limits, evaluation creation or execution can stall:
    - If you receive `429 Too Many Requests` when creating runs, respect the `retry-after` header and use exponential backoff.
    - If jobs fail with `429` during execution, reduce the dataset size or split it into smaller batches, and/or increase the TPM quota for the model deployment.
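A generic retry wrapper for that pattern might look like the sketch below. `is_rate_limited` and `retry_after` are hypothetical callables you would implement against your SDK's 429 exception type, and the delay values are illustrative, not prescribed.

```python
import random
import time

def with_backoff(call, is_rate_limited, retry_after=None,
                 max_attempts=5, base_delay=1.0):
    """Retry `call` on rate-limit errors, honoring a retry-after hint when given."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if not is_rate_limited(exc) or attempt == max_attempts - 1:
                raise  # not a 429, or out of attempts: propagate
            # Prefer the server's retry-after hint; otherwise back off
            # exponentially with a little jitter.
            delay = retry_after(exc) if retry_after else None
            if delay is None:
                delay = base_delay * (2 ** attempt + random.random())
            time.sleep(min(delay, 60))
```

Use it to wrap run creation, e.g. `with_backoff(lambda: client.evals.runs.create(...), is_rate_limited)`, where `is_rate_limited` inspects the exception for a 429 status.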
- Re-run after adjustments
  - After adjusting capacity, fixing the schema, or correcting permissions, submit a new evaluation run and monitor its status.
If the job still stays in Starting after these steps, capture the run ID and any error details and open a support ticket; the fact that chat works in the playground already confirms the model deployment itself is healthy.