Question
Monday, July 14, 2014 11:39 AM
Hi,
We have experienced a lot of disk errors on a few of our Azure virtual machines. This weekend it happened again. In the event log we find that the first event is written by ESENT:
"svchost (1300) A request to write to the file "C:\Windows\system32\LogFiles\Sum\Svc.log" at offset 1306624 (0x000000000013f000) for 4096 (0x00001000) bytes succeeded, but took an abnormally long time (165 seconds) to be serviced by the OS. In addition, 0 other I/O requests to this file have also taken an abnormally long time to be serviced since the last message regarding this problem was posted 466 seconds ago. This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem."
Then we get a lot of disk warnings stating that "The IO operation at logical block address..." was retried. This goes on until I logged on this morning to fix a stopped SQL Server (the server runs SQL Express). The event log shows that SQL Server stopped quite soon after the error occurred, with SQLServerLogMgr::LogWriter complaining about disk IO errors.
Has anyone else experienced the same thing?
The first half hour of event log messages:
Error 12.07.2014 06:28:56 MSSQLSERVER 9001 Server
Error 12.07.2014 06:28:56 MSSQLSERVER 9001 Server
Error 12.07.2014 06:28:56 MSSQLSERVER 17053 Server
Warning 12.07.2014 06:28:54 Ntfs (Microsoft-Windows-Ntfs) 153
Warning 12.07.2014 06:28:42 disk 153 None
[lots of similar events deleted to save some screen real estate]
Warning 12.07.2014 05:50:32 ESENT 533 General
Warning 12.07.2014 05:48:00 ESENT 508 Performance
All replies (11)
Tuesday, July 22, 2014 8:53 PM ✅ Answered
You can try to run SQLIO while the issue is occurring and compare the measured IOPS with what is generally achievable on Azure disks. That also depends on the cache settings, so it is better to test against a data disk.
For more info, please refer to the Performance Guidance on http://msdn.microsoft.com/en-us/library/dn248436.aspx
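For reference, a typical SQLIO run might look like this (a sketch only: the drive letter, file size, and test parameters are assumptions to adjust for your workload). The test file is created first, then SQLIO issues 8 KB random writes against it:

    # Pre-create a 1 GB test file on the data disk (F: is a placeholder)
    fsutil file createnew F:\sqliotest.dat 1073741824
    # 60 seconds of 8 KB random writes, 2 threads, 8 outstanding I/Os,
    # unbuffered (-BN), with latency statistics (-LS)
    .\sqlio.exe -kW -t2 -s60 -o8 -b8 -frandom -BN -LS F:\sqliotest.dat

Comparing the reported IOs/sec and average latency with the documented per-disk targets should show whether you are hitting the throttle or seeing genuinely abnormal latency.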
Monday, July 14, 2014 7:41 PM
Hi,
Have you tried rebooting the VM?
You can also resize the VM and check to see if it resolves the issue. If it does, then you can resize it back to the original size.
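For reference, a resize of a classic VM could also be scripted with the service-management Azure PowerShell cmdlets of that era (a sketch; the service name, VM name, and target size are placeholders):

    # Resize the VM ("Large" is the classic name for A3); requires the
    # service-management (classic) Azure PowerShell module
    Get-AzureVM -ServiceName "myservice" -Name "myvm" |
        Set-AzureVMSize -InstanceSize "Large" |
        Update-AzureVM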
If the issue persists, I recommend that you open a Technical Support Ticket as it seems to be a performance issue and will require dedicated troubleshooting.
Best Regards,
Mekh.
Thursday, July 17, 2014 10:54 AM
Hi,
Yes, rebooting returns the server to normal for a while.
I will try to resize the machine.
Thank you,
Hugo
Thursday, July 17, 2014 11:46 AM
Disk errors on a SQL VM may indicate Storage throttling. What size is that VM?
Please remember that Disks are limited in terms of IOPS - see http://msdn.microsoft.com/library/azure/dn197896.aspx
Also please note that there was a minor performance issue for VM and Storage in some datacenters recently (West Europe, if my memory serves me well). See the Azure Dashboard for more info: http://azure.microsoft.com/en-us/status/#history
Monday, July 21, 2014 12:16 PM
Hi,
Any updates on the issue? Did the resizing help?
Please feel free to let us know if you need further assistance.
Regards,
Mekh.
Tuesday, July 22, 2014 10:07 AM
Hi,
No, unfortunately not. I resized the VM and the change completed successfully. However, the server never became available on the network again. I could not connect to it using RDP, nor could I ping it from other servers on the same VNET. I tried forced reboots and size changes, but it never answered a ping again.
As a last resort I deleted the VM, keeping the attached disk. I tried to create a new VM using the same disk, but the process got stuck on "installing VM agent".
At this point I just gave up and created a new VM from scratch, attached the disk as an additional disk and salvaged any data I needed.
As a side note: I was using Azure Backup as my only backup solution on this server. It turns out you cannot recreate a server using Azure Backup, as "restore to previous location" does not work when you reregister a server.
Regards,
Hugo
Tuesday, July 22, 2014 10:49 AM
It was a Standard A2 virtual machine.
I am aware of the IOPS limitation, but with write delays of as much as 103 seconds I found it unlikely that the IOPS limit was my problem. I may well be wrong. I will have to figure out a way to monitor IOPS, I guess. Any tips? The built-in monitoring in the Azure portal lets me monitor bytes, but not IOPS.
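For reference, one way to watch IOPS and write latency from inside the guest is the PhysicalDisk performance counters, for example via PowerShell (a sketch; the sample interval and count are arbitrary):

    # Sample total IOPS and average write latency every 5 seconds, 12 times
    $counters = '\PhysicalDisk(_Total)\Disk Transfers/sec',
                '\PhysicalDisk(_Total)\Avg. Disk sec/Write'
    Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 12

Write latencies measured in whole seconds, as in the events above, point to a stalled device rather than throttling, which typically shows up as latency in the tens of milliseconds.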
My disk problems seem to have appeared several days before the published performance issues.
Regards,
Hugo
Monday, July 28, 2014 2:23 AM
Hi Hugo,
How is everything going now? I would appreciate it if you could give us an update.
In addition, when deploying SQL Server in Azure, it is recommended to use a virtual machine size of A3 or higher, especially for SQL Server Enterprise Edition.
Best regards,
Susie
Wednesday, July 30, 2014 6:48 AM
Hi,
As posted earlier, I had to replace the faulty VM. My replacement VM does not show any of the symptoms of the faulty VM, despite the same load and size.
The database is SQL Express 2012 R2. The server serves as the HA database for a single RDS Broker in HA mode (necessary due to certificate and farm naming issues). There are fewer than 20 users in total for this particular farm, so it seems very strange that IO should be a problem. But I will try to use SQLIO if I encounter the issue again.
Thank you all for trying to help me.
Best regards,
Hugo
Monday, September 1, 2014 1:12 PM
I have had this very same problem with two VMs, on different days. The problem persists until we shut down and start the VM again; a restart alone fails to resolve it. Resizing a VM follows a procedure similar to shutting it down and starting it again. Is there any update or fix available for this problem?
Thanks.
Tuesday, September 2, 2014 8:35 AM
Yes, shutdown will deallocate the resources, so when you start the VM again it will generally start on a different node (the same as a resize). A restart will generally keep the VM on the same node.
The fix is being deployed to all datacenters. There is no ETA, but it should take two or three more weeks.
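For reference, the shutdown-and-start cycle described above could be scripted with the classic Azure PowerShell cmdlets (a sketch; the service and VM names are placeholders). Stop-AzureVM deallocates by default, which is what allows the VM to come back on a different node:

    # Stop and deallocate (no -StayProvisioned, so the VM leaves its node);
    # -Force suppresses the prompt when this is the last VM in the deployment
    Stop-AzureVM -ServiceName "myservice" -Name "myvm" -Force
    # Starting after deallocation generally places the VM on a new node
    Start-AzureVM -ServiceName "myservice" -Name "myvm"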