Batch jobs are stuck on "validating" and then all fail afterwards

Question

Kiptoo Towett 0 Microsoft Employee

Batch jobs are stuck on "validating" and then all fail afterwards

Karnam Venkata Rajeswari 565 Reputation points Microsoft External Staff Moderator

2026-03-13T13:20:10.88+00:00

Hello Kiptoo Towett,

Welcome to Microsoft Q&A, and thank you for reaching out.

We need more details to assist you with this case. I’ve reached out to you via private message to request the information that will help us investigate further.

Thank you
Karnam Venkata Rajeswari 565 Reputation points Microsoft External Staff Moderator

2026-03-16T16:51:57.7266667+00:00

Hello Kiptoo Towett,

We are checking with the product group (PG) team and will let you know of any updates.
Ussumane Soare 0 Reputation points

2026-03-23T11:41:34.19+00:00

Hello,

I'm also having the same issue.

I have a GPT-4.1 Data Zone Batch deployment on Azure AI Foundry that I'm using to run a small batch job. Since yesterday, I have had jobs stuck in "Validating" and "Cancelling". I can understand jobs being stuck in "Validating", but how can jobs be stuck in "Cancelling"? I also don't see any Service Health issues reported for this situation. I'm in the Sweden Central region; is there a problem in this region? What alternative ways are there for me to proceed with my jobs?

Thank you
Becker, Christian 0 Reputation points

2026-03-23T12:38:32.1833333+00:00

Same for me, I noticed it today: all new jobs are stuck on "validating". When will this be fixed? (west-europe and sweden-central)
Manas Mohanty 15,795 Reputation points Microsoft External Staff Moderator

2026-03-23T18:00:28.42+00:00
Hi Kiptoo Towett and all,

Commenting here with the latest status on the issue.

The customer is facing a quota-exhausted error in the East US 2 region.

Since they had substantial quota available, we suggested the following:

Reduce the full batch to mini-batches.

Test in other regions, as East US latency had dropped on the customer side.

Leverage multiple batch deployments to reduce latency and improve throughput.

REST API commands were shared to cancel an existing batch job if it has been stalled for a long time: https://learn.microsoft.com/en-us/rest/api/azureopenai/batch/cancel?view=rest-azureopenai-2024-10-21&tabs=HTTP
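As an illustration of the mini-batch suggestion, here is a minimal sketch that splits the lines of a batch input file into smaller chunks, each of which could be uploaded as its own job. The function name `split_into_mini_batches`, the chunk size, and the sample request lines are hypothetical and only for illustration; a real input file would contain complete request objects.

```python
from typing import Iterator, List


def split_into_mini_batches(lines: List[str], max_requests: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most max_requests JSONL lines."""
    if max_requests < 1:
        raise ValueError("max_requests must be at least 1")
    # Skip blank lines so each chunk contains only real requests.
    requests = [line for line in lines if line.strip()]
    for start in range(0, len(requests), max_requests):
        yield requests[start:start + max_requests]


# Hypothetical input: five placeholder request lines, split into chunks of two.
sample = [f'{{"custom_id": "task-{i}"}}' for i in range(5)]
chunks = list(split_into_mini_batches(sample, max_requests=2))
print([len(chunk) for chunk in chunks])  # prints [2, 2, 1]
```

Each chunk can then be written to its own .jsonl file and submitted separately, so a stalled job only blocks a fraction of the work.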

Please let us know if the above suggestion helped resolve the issue at your side.

Thank you.
Manas Mohanty 15,795 Reputation points Microsoft External Staff Moderator

2026-03-24T17:45:33.57+00:00

Hi Kiptoo Towett and all

I have shared the context of the situation, along with the accumulated pointers and observations, on an internal channel, and I will be reaching out to the product group via an engineering ticket later today.

Thank you.

1 answer


Answer 1

Hi Kiptoo Towett

Thank you for confirming that the issue has now been mitigated on the backend.

I am re-attaching the suggestions above as workarounds in case of temporary regional outages:

Reduce the full batch to mini-batches.
Test in other regions (East US and Sweden Central are in-demand regions).
Leverage multiple batch deployments to reduce latency and improve throughput.
Use the REST API to cancel an existing batch job if it has been stalled for a long time: https://learn.microsoft.com/en-us/rest/api/azureopenai/batch/cancel?view=rest-azureopenai-2024-10-21&tabs=HTTP
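
For the cancel step, the following is a rough sketch of how the request could be built with the Python standard library. The URL path and the `api-key` header reflect my reading of the REST reference linked above, so please verify them against that page; the endpoint, batch ID, and key shown here are placeholders, and `build_cancel_request` is a hypothetical helper name.

```python
import urllib.request


def build_cancel_request(endpoint: str, batch_id: str, api_key: str,
                         api_version: str = "2024-10-21") -> urllib.request.Request:
    """Build (but do not send) a POST request that cancels one batch job."""
    # Assumed path shape: {endpoint}/openai/batches/{batch-id}/cancel
    url = f"{endpoint.rstrip('/')}/openai/batches/{batch_id}/cancel?api-version={api_version}"
    return urllib.request.Request(url, method="POST", headers={"api-key": api_key})


# Placeholder values, not real resources.
req = build_cancel_request("https://example-resource.openai.azure.com", "batch_123", "<api-key>")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

The request is only constructed here, not sent, so it can be inspected before it is used against a real resource.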

You can monitor the events from a Log Analytics workspace or set up a monitor with the SDK.

If the processing time exceeds the usual processing time, the monitor should execute the steps above as a precautionary measure.
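
A monitor of this kind could look roughly like the sketch below. It polls a job's status and calls a fallback once a time budget is exceeded. The names `watch_batch`, `get_status`, and `on_stall` are hypothetical; in practice `get_status` would wrap an SDK or REST call that retrieves the batch job, and `on_stall` would cancel the job and resubmit it elsewhere. The set of terminal status names is also an assumption and should be checked against the service documentation.

```python
import time
from typing import Callable

# Assumed terminal states; verify the exact names for your API version.
TERMINAL = {"completed", "failed", "cancelled", "expired"}


def watch_batch(get_status: Callable[[], str], on_stall: Callable[[], None],
                max_seconds: float, poll_seconds: float = 60.0,
                sleep: Callable[[float], None] = time.sleep,
                clock: Callable[[], float] = time.monotonic) -> str:
    """Poll until the job finishes, or run on_stall once max_seconds is exceeded."""
    start = clock()
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if clock() - start > max_seconds:
            on_stall()  # e.g. cancel the job and resubmit in another region
            return "stalled"
        sleep(poll_seconds)


# Simulated run: a job that never leaves "validating", with a fake clock
# that advances 100 seconds per reading and a sleep that does nothing.
ticks = iter(range(0, 1000, 100))
actions = []
result = watch_batch(
    get_status=lambda: "validating",
    on_stall=lambda: actions.append("cancel and resubmit"),
    max_seconds=250,
    sleep=lambda seconds: None,
    clock=lambda: next(ticks),
)
print(result, actions)  # prints: stalled ['cancel and resubmit']
```

Passing the clock and sleep functions in as parameters keeps the example testable without waiting; with the defaults it uses real time.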

Please take a minute to accept this answer if you found my recommendation helpful.

Thank you.
