We are running a newly designed vSphere 4.0 environment connected to a large LeftHand iSCSI environment. Recently we discovered a major problem: a couple of VMs were freezing completely for about 30 seconds. The problem seemed to occur only on VMs running on one specific host, so it was time to do some research on that host.
The first quick conclusion I could draw was that the vmkernel log was being flooded (multiple entries per second) with error messages coming from the Path Selection Policy (PSP):
Dec 2 15:41:13 esxhostname vmkernel: 0:00:37:21.082 cpu14:4118)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device "naa.6000eb36b7210cc2000000000000017a".
Dec 2 15:41:13 esxhostname vmkernel: 0:00:37:21.082 cpu14:4118)WARNING: NMP: nmp_IssueCommandToDevice: I/O could not be issued to device "naa.6000eb36b7210cc2000000000000017a" due to Not found
Dec 2 15:41:13 esxhostname vmkernel: 0:00:37:21.082 cpu14:4118)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.6000eb36b7210cc2000000000000017a": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
Dec 2 15:41:13 esxhostname vmkernel: 0:00:37:21.082 cpu14:4118)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.6000eb36b7210cc2000000000000017a" is blocked. Not starting I/O from device.
Dec 2 15:41:14 esxhostname vmkernel: 0:00:37:22.084 cpu0:4285)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device "naa.6000eb36b7210cc2000000000000017a".
Dec 2 15:41:14 esxhostname vmkernel: 0:00:37:22.084 cpu2:4231)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.6000eb36b7210cc2000000000000017a" - issuing command 0x4100010f2e40
Dec 2 15:41:14 esxhostname vmkernel: 0:00:37:22.084 cpu2:4231)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device "naa.6000eb36b7210cc2000000000000017a".
Dec 2 15:41:14 esxhostname vmkernel: 0:00:37:22.084 cpu2:4231)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.6000eb36b7210cc2000000000000017a" - failed to issue command due to Not found (APD), try again...
Dec 2 15:41:14 esxhostname vmkernel: 0:00:37:22.084 cpu2:4231)WARNING: NMP: nmp_DeviceAttemptFailover: Logical device "naa.6000eb36b7210cc2000000000000017a": awaiting fast path state update...
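For reference, this is roughly how I kept an eye on the flood from the Service Console. It is only a sketch, assuming the classic ESX 4.0 console with the log at /var/log/vmkernel; the naa ID is the example device from the log above.

# Count how many PSP/NMP warnings have been logged for the dead device so far
grep -c "naa.6000eb36b7210cc2000000000000017a" /var/log/vmkernel

# Watch it live: a healthy host stays quiet, a flooding host prints several lines per second
tail -f /var/log/vmkernel | grep "vmw_psp_rr"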
Further investigation at that moment showed that a volume had been deleted from the LeftHand SAN and that ESX obviously didn't handle this well, causing ALL VMs on the troubled host to freeze completely. To the user it only looks as if the server is losing its network connection, but in fact it's a real freeze that varies from 15 to 30 seconds (in our environment). So to get a grip on the situation I froze (to stay in terms 😉) all further LUN removals, since I first wanted to reproduce this in our production-like test environment.
While troubleshooting this morning, Arne Fokkema pointed me to an article Chad Sakac published yesterday which contains some interesting information about this topic; read it here.
For my test results I repeated these steps over and over again:
- Healthy ESX Host with volume connected;
- From LeftHand Console: Unassign the iSCSI volume from the ESX host (the iSCSI session towards the volume stays connected, even after a rescan, since the volume still exists and the iSCSI session isn't terminated);
- From LeftHand Console: Delete the volume
- The deletion of the volume is instantly detected by ESX and multiple entries are written to the vmkernel log (*);
- After that, every 5 minutes the Path Selection Policy (PSP) reports that it cannot select a path for the device (log: WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device "naa.6000eb36b7210cc2000000000000016e".)
(*) Now the tricky part: when a volume is deleted from the LeftHand console this is, as stated, recorded in the vmkernel log. Sometimes the log stays quiet after the first entries (written when the volume is deleted) and only later, seemingly at random, fills up with entries about paths that cannot be found and other error messages; at other times the log instantly floods and just keeps on flooding for hours, or even days!
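To see what ESX still thinks it has once the volume is gone, I checked the device and its paths from the Service Console. Again just a sketch: the naa ID is the example device from my test, so substitute your own.

# Is the deleted device still known to the NMP, and in what state?
esxcli nmp device list | grep -A 5 "naa.6000eb36b7210cc2000000000000016e"

# Which paths does the host still claim for it?
esxcfg-mpath -b | grep -A 2 "naa.6000eb36b7210cc2000000000000016e"

# Map the remaining VMFS datastores to their backing devices
esxcfg-scsidevs -m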
My test results showed that the VMs freeze only in the last case: whenever the vmkernel log keeps on flooding. At this moment it's unclear to me why this behavior is random, but looking back at our production problems, at all the volumes we've deleted in the past, and at my experiences of today, I can only conclude that the flooding of the vmkernel log happens at random. Needless to say, the actual freezing of VMs (when the vmkernel log is flooded) is also random: sometimes it occurs once an hour and sometimes multiple times an hour.
Last but not least, rescanning the vmhba solved my problems with the log flooding; however, a rescan of the vmhba doesn't always complete anymore while the system is being flooded!
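For completeness, the rescan itself is a one-liner from the Service Console. The adapter name vmhba33 is just a stand-in for the software iSCSI adapter in our setup; check your own adapter name first.

# List the adapters so you pick the right vmhba
esxcfg-scsidevs -a

# Rescan the (software iSCSI) adapter; afterwards the dead device and its warnings should disappear
esxcfg-rescan vmhba33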
So better be safe than sorry and follow the steps in the correct order when deleting a volume from the SAN (apparently this is an ESX problem that occurs with multiple SAN vendors), just like Chad stated in his article.
First delete the datastore from ESX, then remove the volume from the SAN, and finish with a rescan of the vmhba. (Another workaround is available for vSphere 4.0 Update 1.)
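As a rough sketch of that order from the host side (the datastore removal itself is done in the vSphere Client, and vmhba33 is again an assumption for a software iSCSI setup):

# 1. In the vSphere Client: move or remove every VM on the datastore, then delete the datastore itself.
# 2. From the Service Console, note which naa device backed it and confirm it no longer carries a VMFS volume:
esxcfg-scsidevs -m
# 3. Only now unassign and delete the volume on the LeftHand side.
# 4. Finish with a rescan so the dead paths are cleaned up:
esxcfg-rescan vmhba33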
Clearly I would love to hear other users' experiences on this specific topic, since I'm not quite happy with this "at random" behavior.
Update: This patch resolves the "freeze" problem (verified in our LeftHand environment).
Stephen Vogel
/ December 11, 2009
I'm having the same issues, but it's not related to a deleted volume or LUN. In my case, I have two ESX 4 hosts attached to a LeftHand SAN (single node). There are two volumes/datastores that are shared by each ESX host. VMs on ESX1 are fine, but any VMs on ESX2 suffer from poor performance (30-40 second freezes) and tons of Disk and symmpi errors in the System Event Viewer.
I have tried rescanning the storage network but the problem has not gotten any better. Do you see any Disk errors in your VMs' Windows Event logs? I have had these problems on and off, but they have always affected only 1 of my 2 ESX cluster servers.
Kenneth van Ditmarsch
/ December 12, 2009
Hi Stephen,
No, I didn't see errors in the Windows event log, since my drops were below 60 seconds (and the Windows disk timeout defaults to 60 seconds).
What does the vmkernel log of the troubled ESX host tell you? To me this sounds like a misconfiguration at the ESX level.
Have you tried moving one of the troubled VMs to the good ESX host to isolate the problem?
Cheers,
Kenneth
Brian
/ January 21, 2010
Issue with workaround described in:
Virtual machines might stop responding when any LUN on the ESX/ESXi host is in an all-paths-down condition – http://kb.vmware.com/kb/1016626
Unpresenting a LUN containing a datastore from ESX 4.x and ESXi 4.x – http://kb.vmware.com/kb/1015084
Tony
/ February 3, 2010
We experienced the same issue, and since we have over 30 hosts it ended up bringing down our EVA8001. It turns out this is a known, reported issue with HP-type SANs and vSphere. When you unpresent a LUN, the hosts go into a panic state and bombard the SAN to the point that it can no longer respond. The easiest workaround is to turn the machine off when you unpresent a LUN, or you can mask the individual LUNs, which is a pain if you have a lot of hosts. VMware released a patch on the 5th of January.
The KB article is 1016291, released with reference ESX 4.0 Patch 03. This KB states that the fix is delivered in the patch ESX400-200912401-BG. PR 467188, found in the "PRs Fixed" section, is the PR for the APD (All Paths Down) situation.
KB Link: http://kb.vmware.com/kb/1016291
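(A quick way to check whether that bulletin is already on a host is something like the following from the Service Console; a sketch, assuming a classic ESX 4.0 install with esxupdate available and using the bulletin ID quoted above.)

# Show the host's build, then look for the APD fix among the installed bulletins
vmware -v
esxupdate query | grep ESX400-200912401-BG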
Mike Fiedler
/ October 14, 2010
Thanks for this article.
It was very helpful in determining root cause of our outages and the appropriate patch.
It applies not only to LeftHand SAN setups but also to FC-connected arrays.