Azure AI Foundry, formerly known as Azure AI Services or Azure Cognitive Services, is a unified collection of prebuilt AI capabilities within the Microsoft Foundry platform.
Use the following checks and actions when an Azure AI Foundry evaluation is stuck in the Starting state (or never moves to Running/Completed), even though chat in the playground works:
- Verify the evaluation job status and cancel if it is stuck
  - If using the SDK, check the run status.
  - If the run has been in Running/Starting for a long time with no progress, cancel it with `client.evals.runs.cancel(run_id, eval_id=eval_id)`.
  - After canceling, create a new evaluation run.
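The check-then-cancel step above can be wrapped in a small polling helper. This is a sketch only: it assumes an OpenAI-style `client` exposing `evals.runs.retrieve` and `evals.runs.cancel` (matching the cancel call shown above); `eval_id`, `run_id`, and the timeout values are placeholders to adapt to your setup.

```python
import time

def cancel_if_stuck(client, eval_id, run_id, timeout_s=900, poll_s=30):
    """Poll the evaluation run; cancel it if it stays non-terminal past timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        run = client.evals.runs.retrieve(run_id, eval_id=eval_id)
        if run.status in ("completed", "failed", "canceled"):
            return run.status  # terminal state: nothing to cancel
        time.sleep(poll_s)
    # Still queued/running after the timeout: treat as stuck and cancel it.
    client.evals.runs.cancel(run_id, eval_id=eval_id)
    return "canceled"
```

After canceling, create a fresh evaluation run rather than retrying the stuck one.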
- Check Azure OpenAI model capacity and quota
  - A common cause of long-running or stuck evaluation jobs is insufficient capacity on the Azure OpenAI deployment used by the evaluation.
  - In the Azure portal, open the Azure OpenAI (or Foundry model deployment) resource used for the evaluation and:
    - Verify the deployment is healthy.
    - Increase the model capacity / tokens-per-minute (TPM) quota if it is low or saturated.
  - After increasing capacity, rerun the evaluation.
- Confirm authentication and permissions
  - If the evaluation is created via the SDK and authentication is misconfigured, the job may never progress.
  - Ensure `DefaultAzureCredential` is correctly set up (run `az login` if using the Azure CLI) and that the identity used has the Azure AI User role on the Foundry project.
  - Verify the project endpoint URL is correct and includes both the account and project names.
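As a quick stdlib-only sanity check for the endpoint rule above, you could validate the URL shape before creating the client. The expected pattern here is an assumption based on the common `https://<account>.services.ai.azure.com/api/projects/<project>` form; adjust it if your endpoint differs.

```python
from urllib.parse import urlparse

def looks_like_project_endpoint(url: str) -> bool:
    """Check that a Foundry project endpoint names both an account and a project."""
    parsed = urlparse(url)
    # Account name is the subdomain of the assumed services.ai.azure.com host.
    host_ok = parsed.scheme == "https" and parsed.netloc.endswith(".services.ai.azure.com")
    parts = [p for p in parsed.path.split("/") if p]
    # Project name must appear as /api/projects/<project-name>.
    path_ok = len(parts) == 3 and parts[:2] == ["api", "projects"]
    return host_ok and path_ok
```

A malformed endpoint (missing the project segment, wrong scheme) fails this check before any network call is made.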
- Validate evaluation dataset and mapping
  - If the evaluation uses a dataset (CSV/JSONL), schema or mapping issues can cause failures:
    - Ensure the JSONL file has one valid JSON object per line.
    - Confirm `data_mapping` field names exactly match the dataset fields (case-sensitive).
    - Ensure `item_schema` properties align with the dataset fields.
  - Fix any schema/mapping issues and recreate the evaluation run.
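The JSONL checks above can be automated before submitting the run. This is a sketch: `expected_fields` stands in for whatever field names your `data_mapping` references, and the report format is illustrative.

```python
import json

def validate_jsonl(path, expected_fields):
    """Return a list of problems: invalid JSON lines or missing mapped fields."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            # Field names must match exactly, including case.
            missing = [k for k in expected_fields if k not in row]
            if missing:
                problems.append(f"line {lineno}: missing fields {missing}")
    return problems
```

An empty result means every line parsed and contained all the mapped fields; otherwise fix the reported lines and recreate the evaluation run.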
- Handle rate limiting
  - If the environment is hitting rate limits, evaluation creation or execution can stall:
    - If you receive `429 Too Many Requests` when creating runs, respect the `retry-after` header and use exponential backoff.
    - If jobs fail with `429` during execution, reduce the dataset size or split it into smaller batches, and/or increase the TPM quota for the model deployment.
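A generic retry wrapper for that pattern might look like the sketch below. `is_rate_limited` and `retry_after` are hypothetical callables you would implement against your SDK's 429 exception type, and the delay values are illustrative, not prescribed.

```python
import random
import time

def with_backoff(call, is_rate_limited, retry_after=None,
                 max_attempts=5, base_delay=1.0):
    """Retry `call` on rate-limit errors, honoring a retry-after hint when given."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if not is_rate_limited(exc) or attempt == max_attempts - 1:
                raise  # not a 429, or out of attempts: propagate
            # Prefer the server's retry-after hint; otherwise back off
            # exponentially with a little jitter.
            delay = retry_after(exc) if retry_after else None
            if delay is None:
                delay = base_delay * (2 ** attempt + random.random())
            time.sleep(min(delay, 60))
```

Use it to wrap run creation, e.g. `with_backoff(lambda: client.evals.runs.create(...), is_rate_limited)`, where `is_rate_limited` inspects the exception for a 429 status.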
- Re-run after adjustments
  - After adjusting capacity, fixing the schema, or correcting permissions, submit a new evaluation run and monitor its status.
If the job still stays in Starting after these steps, capture the run ID and any error details and open a support ticket; the fact that chat works in the playground already confirms the model deployment itself is healthy.