Recently, Dennis Agterberg, a fellow contractor, experienced major issues with random LUN loss on all of his ESXi hosts. The environment consists of HP Virtual Connect FC modules connected to 8Gb Brocade switches.
From a vCenter perspective, repetitive “lost access to volume … due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly” messages appear, while the vmkernel log shows messages like:
2012-05-02T01:34:02.805Z cpu3:4099)<3>lpfc820 0000:02:00.2: 0:(0):0717 FCP command x12 residual underrun converted to error Data: xff xff x24
2012-05-02T01:34:02.805Z cpu3:4099)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x12 (0x412400d086c0) to dev “naa.60060160d8901f00e859cf520978e111” on path “vmhba0:C0:T7:L3” Failed: H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.Act:EVAL
2012-05-02T01:34:02.805Z cpu3:4099)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device “naa.60060160d8901f00e859cf520978e111” state in doubt; requested fast path state update…
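If you want to check whether your own hosts are logging the same pattern, a quick way (assuming ESXi 5.x with the default log location) is to search the vmkernel log from the ESXi shell:
grep -i "state in doubt" /var/log/vmkernel.log
grep -i "H:0x7" /var/log/vmkernel.log
Both strings come straight from the messages above; H:0x7 is the host status reported on the failing commands.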
In short, Dennis traced the problem to an incorrect portCfgFillWord mode setting on the Brocade FC switches.
So what’s this fill word stuff? A fill word is a primitive signal needed to maintain bit and word synchronization between two adjacent ports, and it becomes more error prone at the increased clock speed of 8Gb links. Read the complete article about fill words here.
The HP Virtual Connect Release Notes indicate that the portCfgFillWord mode needs to be set to 3 whenever you are connecting to a Brocade 8Gb FC Switch:
When VC 8Gb 20-port FC Module and VC FlexFabric 10Gb/24-port Module Fibre Channel uplink ports are configured to operate at 8Gb speed and connect to HP B-series (Brocade) Fibre Channel SAN switches, the minimum supported version of the Brocade Fabric OS (FOS) is v6.4.x. In addition, the Fill Word on those switch ports must be configured with option Mode 3 to prevent connectivity issues at 8Gb speed.
On HP B-series (Brocade) FC switches, use the portCfgFillWord (portCfgFillWord <Port#><Mode>) command to configure this setting.
Mode    Link Init / Fill Word
Mode 0: IDLE/IDLE
Mode 1: ARBF/ARBF
Mode 2: IDLE/ARBF
Mode 3: If ARBF/ARBF fails, use IDLE/ARBF
Modes 2 and 3 are compliant with FC-FS-3 specifications (standards specify the IDLE/ARBF behavior of Mode 2, which is used by Mode 3 if ARBF/ARBF fails after 3 attempts). For most environments, Brocade recommends using Mode 3, as it provides more flexibility and compatibility with a wide range of devices. In the event that the default setting or Mode 3 does not work with a particular device, contact your switch vendor for further assistance.
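For reference, checking and changing the setting looks roughly like this on the Brocade CLI (a sketch only, assuming FOS 6.4.x on a fixed-port switch; port 10 is just a placeholder, and the port may briefly re-initialize when the mode is changed):
switchshow
portcfgshow 10
portcfgfillword 10 3
portcfgshow 10
switchshow confirms the port state and negotiated speed, portcfgshow displays the per-port configuration (including the fill word on 8Gb-capable platforms), and portcfgfillword applies mode 3 as recommended above.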
Basically, the key takeaway here is to mind the portCfgFillWord mode whenever you set up or migrate to an 8Gb Brocade FC infrastructure.
Stacy
/ July 22, 2012
Question about the symptoms – When the intermittent LUN loss occurred, did the LUNs stay disconnected, or were “access to volume restored” messages seen in vCenter shortly after?
Dennis Agterberg
/ July 23, 2012
Hi Stacy,
Access to the volumes was restored shortly after.
In vCenter I saw these messages:
Lost access to volume 4f969274-8bb27a5a-cfbe-0017a4770410 (volume name) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
And then these:
Successfully restored access to volume 4f969274-8bb27a5a-cfbe-0017a4770410 (volume name) following connectivity issues.
Regards,
Dennis
Joseph Mancera
/ August 20, 2012
After modifying the portCfgFillWord, did it resolve the issue?
Dennis Agterberg
/ August 21, 2012
Hi Joseph,
Yes, the issues we had were resolved.
Regards,
Dennis
Carl Wallace
/ August 24, 2012
I assume this change was made on the ports to which the HP equipment connected?
Non-8Gb gear uses IDLE/IDLE (mode 0) and 8Gb uses mode 2 or 3, but really… mode 3 just scans modes 1 and 2, in the same way a NIC uses ‘AUTO’ rather than hard-coding the speed/duplex.
Chris
/ August 28, 2012
We just ran into this issue as well. The difference in our environment was that the VNX storage ports had their fill word set incorrectly. So, just a reminder: check both ends, i.e. the host and storage port configs, when LUNs start disappearing or showing up with crazy capacity numbers.
Dave
/ September 14, 2012
If I set all ports to fill word mode 3, will 4Gb/sec connections have issues with this setting, or can they handle fill word 3?
Dennis Agterberg
/ October 5, 2012
@Carl, the changes were indeed only made to the ports where the HP FlexFabric was connected.
@Dave, I would check with the vendor of the products first.
Le Ton Phat
/ October 28, 2012
Hi Dennis,
I had some problems with lost storage devices, the same as you, but my environment consists of Brocade FC modules (Brocade M5424) connected to 8Gb Brocade switches (Brocade 300). Could you answer some questions about the configuration?
1. What is the impact if I configure the fill word? Does it need a reboot or drop traffic for a short time?
2. Which ports should I configure?
– Switch module: the downlinks to the blade servers or the uplinks to the Brocade 300?
– Brocade 300: the ports where the switch modules are connected?
Thanks & best regards.
Dennis Agterberg
/ October 30, 2012
Hi Le Ton Phat,
I would advise you to ask your vendor or contact Brocade. The setting depends on the devices connected, the FOS version, whether it is an ISL, etc.
There is a Brocade article (I only found a link on an HP site; I myself received the document from Brocade when I contacted them) http://h30499.www3.hp.com/hpeb/attachments/hpeb/bladescategory04/859/1/FOS%208G%20Link%20Init%20Fillword%20Behavior%20v1.pdf that explains a bunch of things.
In regards to downtime: we changed the settings outside office hours but did not shut down the environment.
FYI, I also have a Dell blade enclosure with 2 M8428-k converged switches in it, connected to the Brocade switches. We ended up configuring the Brocade ports connected to the M8428 to mode 3 as well.
Good luck & best regards,
Dennis
Le Ton Phat
/ October 30, 2012
Hi Dennis,
Thank you for your reply. I changed the settings but the problems still occur. I think I missed another setting, as I am a newbie with Brocade switches.
I will explain the issue in detail & I hope you can analyse it.
My storage is a NetApp connected to a Brocade 300 (acting as the core), and the Brocade 300 then connects to the chassis with the Brocade M5424 modules.
On my Brocade 300 I only configured zoning & the fill word (mode 3); all other settings are default.
On the FC switch modules I configured Access Gateway mode, with all other settings default.
My issue:
Whenever one of the hosts on any chassis reboots and the FC HBA driver loads (ESXi 5 is installed), the other hosts connected in the same way as the rebooted host re-login to the SAN, and “lost path redundancy” events occur in vCenter. I am waiting for support from VMware.
Please tell me if you have any KB article for this issue.
Note: I did not configure the fill word on the FC switch modules.
Thanks & best regards.
Le Ton Phat
/ November 10, 2012
Hi Dennis,
I solved my problems. The root cause was the zoning size. My zone included 16 hosts + 2 storage ports, so changing any object in the zone had an impact on all of them. Best practice is 1 zone = 1 host + 2 storage ports (i.e. 1 initiator + 2 targets).
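For reference, a single-initiator zone on the Brocade CLI looks roughly like this (a sketch only: the alias names and WWNs are placeholders, and a zone config named "fabric_a" is assumed to already exist):
alicreate "esx01_hba0", "10:00:00:00:c9:00:00:01"
alicreate "storage_spa0", "50:06:01:60:00:00:00:01"
alicreate "storage_spb0", "50:06:01:68:00:00:00:01"
zonecreate "z_esx01_hba0", "esx01_hba0; storage_spa0; storage_spb0"
cfgadd "fabric_a", "z_esx01_hba0"
cfgsave
cfgenable "fabric_a"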
Dennis Agterberg
/ November 16, 2012
Hi Le Ton Phat,
Thanks for reporting the root cause back. I always use single-initiator zoning with 1 initiator and 1 (or more) SP ports.
The issues we had on our Dell enclosure were not gone after reconfiguring the FillWord mode. It turned out to be a faulty SFP module.
Glad your issues are solved.
Andrew
/ December 10, 2012
Is anyone else experiencing these errors after upgrading to ESXi 5.1?
portCfgFillWord was set to 3 on our Brocade switches a few months ago (thanks to this very article!) and the problem (and performance nightmares) disappeared. Since upgrading one of our clusters from 5.0 to 5.1 the errors have returned.
Thoughts?
Bruce
/ January 4, 2013
We have:
– ESXi 5 hosts
– Hitachi AMS 2500 storage
– Dell M710HD blades
– Brocade/Dell M5424 8Gbps FC switch
– Brocade 300 switch
The LUNs were disappearing randomly and powering on VMs was not even possible.
It turns out the fill word setting defaults to 2 on the Brocade 300, and the 300 was a new addition to the system. That made the AMS 2500 freak out with fill word 2. In the end, I changed the fill word to 0 for the Hitachi ports and 3 for everything else. The problem is now resolved.
Dominic DouzeTrees
/ February 27, 2015
This fixed our issue with 8Gb connectivity to a NetApp filer via an EMC-branded Brocade switch, which was affecting VMware datastores.
In addition, this also fixed the terrible disk write latency we were observing, which was causing massive performance issues for latency-sensitive servers.
Coincidentally, this did not seem to affect the VNX connected in the same way, although the port did show errors, which cleared after implementation of the new value.
EMC, NetApp and VMware were all unable to diagnose this despite having full supportsaves etc.
Thank you for sharing!
Jim
/ April 4, 2015
In my case I checked everything but still cannot find the root cause. I checked every interface on the SAN switch; there are no CRC error messages, although I am not sure whether that counter is enabled or there genuinely are no errors. I can, however, see some dropped packets on some interfaces.
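For reference, I looked at the switch counters roughly like this on the Brocade CLI (port 10 is just an example):
porterrshow
portstatsshow 10
portshow 10
porterrshow lists the per-port error counters (crc_err, enc_out, disc_c3 and so on), while portstatsshow and portshow give the detailed statistics and login state for a single port.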
tungvs
/ September 13, 2016
Hi Kenneth,
Currently I have the same problem as you have described. We have 9 chassis (HP BL460 G9) with Brocade 300 FC modules embedded, which connect to a Brocade 6510 SAN switch. At the other end, we have 3 Dell SC8000 storage arrays connected to the Brocade 6510.
We have randomly encountered lost connectivity from the blade servers to the storage LUNs. The problem has taken place in all the HP chassis at the same time. I wonder if your problem is the same.
How many of your servers have encountered this problem? Did they disconnect at the same time? And do all your servers connect to the storage via the same Brocade 8Gb switch?
Thanks.