

iScsiPrt errors with persistent disconnections

Question

Thursday, July 4, 2013 8:39 PM

What we need:

==========

Determine why we are seeing persistent iScsiPrt errors in a Windows Server 2012 Hyper-V cluster

  

Environment:
==========

4x identical hosts Dell R720 running Windows server 2012 standard

4x - iSCSI NICs: Intel(R) Gigabit 4P I350-t Adapter - Driver 12.1.76.0 - FW Family 13.0.0

MTU set to 9000

Chimney disabled, Auto-Tuning disabled, congestion provider set to None

All offload features were disabled in the iSCSI NIC properties

No teaming configuration is in place

RSS/TOE/TOEv2, Virtual Machine Queue (VMQ) and energy saving options are disabled

2x PCT6224 - FW 3.3.5.5

2 stack cables interconnecting them

STP port-fast enabled on all ethernet ports

MTU set to 9216 on all Ethernet ports

Flow control is active on all ports

Speed auto-negotiated at 1000 Mb/s full duplex

Multicast, broadcast, and unicast storm control disabled

No errors logged on any Ethernet or stack ports

3x EqualLogic arrays

2x PS6100X + 1x PS4100E, all running firmware 6.0.4
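Since jumbo frames are enabled end to end (MTU 9000 on the hosts, 9216 on the switches), a quick per-host sanity check of the configuration above is a don't-fragment ping at the largest payload that fits a 9000-byte MTU. This is a sketch; 10.10.10.10 is a placeholder for the actual EqualLogic group/portal IP.

```shell
:: Verify jumbo frames pass end-to-end from a host to the storage.
:: Payload 8972 = 9000 (MTU) - 20 (IP header) - 8 (ICMP header).
:: -f sets don't-fragment, so a reply proves the whole path carries 9000.
:: 10.10.10.10 is a placeholder for your EQL group IP.
ping -f -l 8972 10.10.10.10
```

If any hop drops to a smaller MTU, the ping fails with "Packet needs to be fragmented but DF set" instead of timing out silently.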

Current situation:

===========

Customer is experiencing timeout and disconnections:

* All four hosts are logging the following System events every minute:

ID 9 / Source iScsiPrt / Target did not respond in time for a SCSI request. The CDB is given in the dump data.

ID 39 / Source iScsiPrt / Initiator sent a task management command to reset the target. The target name is given in the dump data.

ID 129 / Source iScsiPrt / No Description

These events show up only in the Windows Server 2012 environment; the issue was isolated and does not occur with Windows Server 2008 on the same SAN.

Troubleshooting steps so far:

===================

  • Disabled Windows Firewall and made sure the non-SAN subnets are excluded in the HIT Kit
  • Made sure the latest drivers from Dell Drivers & Downloads are installed for the iSCSI NICs
  • Jumbo frame size set up correctly
  • Disabled TOE, RSS, and Large Send Offload; configured flow control
  • Applied:
    netsh interface tcp set global autotuninglevel=disabled
    netsh int tcp set global chimney=disabled
    netsh interface tcp set global rss=disabled
  • Followed some practices from here: http://en.community.dell.com/support-forums/storage/f/3775/p/19480319/20326067.aspx
  • RSS/TOE/TOEv2, Virtual Machine Queue (VMQ) and energy saving options are disabled
  • All nodes are updated with the latest rollup update.
  • Teaming not configured.
  • Changed binding order on NICS
  • Disabled NetBIOS on the iSCSI NICs
  • Made sure that each NIC is NOT set to register its connection in DNS
  • Removed File and Printer Sharing and Client for Microsoft Networks
  • Following updates were applied:

http://support.microsoft.com/kb/2791465/en-US (KB2779768)
http://support.microsoft.com/kb/2795944/en-US
http://support.microsoft.com/kb/2822241/en-US
http://support.microsoft.com/kb/2808584/en-US
http://support.microsoft.com/?id=2838669
http://support.microsoft.com/?id=2813630 (superseded by KB2838669)
http://support.microsoft.com/?id=2796000
http://support.microsoft.com/?id=2795997
http://support.microsoft.com/?id=2795993
http://support.microsoft.com/kb/2838669

  • Disabled TCP Delay ACK in Server
  • Switches: STP port-fast enabled on all Ethernet ports + MTU set to 9216 on all Ethernet ports + flow control active on all ports + unicast storm control disabled
  • Captured iSCSI traffic with Wireshark
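For reference, the "Disabled TCP Delay ACK" step above is normally applied per iSCSI interface in the registry (the usual guidance for EqualLogic on Windows). A sketch, where {GUID} is a placeholder for each iSCSI NIC's interface GUID as found under Tcpip\Parameters\Interfaces:

```shell
:: Disable delayed ACK on an iSCSI interface.
:: {GUID} is a placeholder -- substitute the interface GUID of each iSCSI
:: NIC, listed under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces.
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{GUID}" /v TcpAckFrequency /t REG_DWORD /d 1 /f
:: A reboot (or disable/enable of the NIC) is needed for the change to take effect.
```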

Can anyone help us?

VP

All replies (4)

Wednesday, June 18, 2014 7:20 PM ✅Answered

Yes, we fixed it. We disabled the AV on both nodes. Try that.


Wednesday, June 18, 2014 6:47 PM

Did you ever fix this? We have exactly the same issue, though our situation may be a little different.

Two sites (with different subnets) combined in one Hyper-V 2012 R2 Failover Cluster.
On each site 3 x R720 and 1 x PS6100X.
All 6 servers are connected to both EQL group IPs.

We don't get these 3 events when we create an individual Hyper-V Failover Cluster for each site.


Wednesday, February 18, 2015 5:54 PM

What's the AV??


Thursday, March 5, 2015 6:21 PM

Hello,

I have three Hyper-V 2012 (not R2) nodes and I'm experiencing similar issues.

I get iscsiprt errors suddenly after a storage migration of a big VM ends: I launch the storage migration, it starts correctly, and roughly 60 seconds after it ends, iscsiprt errors appear in the node's event log. I lose iSCSI connectivity to the iSCSI target on the storage the VM was on before the migration. The migrated VM keeps working because it has already moved to the "new" storage, but all the CSVs on the "old" storage become inaccessible: I see them offline, or online with no access, so all the VMs on the old storage go offline.

I noticed that it doesn't happen every time I storage-migrate a VM, but only sometimes and only with big VMs.

I tried both MPIO and MCS, since our storage supports both. I also tuned some iSCSI parameters on the initiator; I list the parameters I tried at the end of this message.

The only things I haven't done yet are: disable RSS, disable TCP autotuninglevel, and disable delayed ACK.

Can someone help me understand what is causing these problems? And what is the AV you are referring to?

These are the parameters I tuned to increase timeouts; the problem is not yet solved:

HKLM\SYSTEM\CurrentControlSet\Services\mpio\Parameters
UseCustomPathRecoveryInterval 0 -> 1
PDORemovePeriod 20 -> 120
PathRecoveryInterval 40 -> 40

HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E97B-E325-11CE-BFC1-08002BE10318}\<Instance Number>\Parameters
EnableNOPOut 0 -> 1
MaxRequestHoldTime 60 -> 90
LinkDownTime 15 -> 35

Thanks in Advance,

Davide