Since my blog about Understanding HP Flex-10 Mappings with VMware ESX/vSphere is quite a hit (judging by the page views per day), I decided to also write about the testing scenarios that should all be walked through before taking a design like this into production.
In my blog I stated:
Last word of advice: while implementing a technical environment like this it’s crucial to test every possible failure, from a single ESX Host to all the separate components. I’ve written very detailed documents about it.
So let’s take a look at these testing scenarios, which can be divided into three main subjects:
- Hardware (ex. power redundancy)
- Connectivity and failover within the hardware (in my design this is Virtual Connect, but it could also be regular (SAN) switch configurations, depending on the modules present in the enclosure)
- Connectivity and failover within the OS (vSphere Configuration)
As a short introduction: I’ve been working with HP c-Class components ever since the first c7000 enclosure was placed in the Netherlands. In that time I’ve seen many HP c-Class implementations where people just rely on the fact that “everything is redundant” and thus assume that it simply works. Like Travis Dane (Under Siege 2) said: Did you see the body? Assumption is the mother of all F*CK UPS!
My statement is clear: it isn’t working until you’ve seen the behavior in a failure scenario yourself.
So what hardware should we all test in the c-Class enclosure?
- Power Redundancy: test what happens if either 3 power supplies fail on a single-phase enclosure or 1 power line feed fails on a three-phase enclosure;
- Fans: randomly pull out some fans in the enclosure and check whether the OA notices; (don’t pull out too many fans, keep it real; there’s no chance that 8 fans fail simultaneously in a 10-fan enclosure);
- Onboard Administrator Redundancy: does the second OA take over full functionality when the first one fails? (Pull it out of its sleeve to test this; there’s no power-off function on this component);
- Onboard Administrator Link Loss Failover (if configured);
- Redundancy on the Interconnect Modules (I will describe this later on in detail since this also covers the Connectivity and failover behavior in hardware and OS).
- Verify that all the HP component firmware versions are compatible with each other; see the BladeSystem Matrix.
For all subjects above it’s important to verify that enclosure alerting works, via alert mails and/or SNMP traps.
Connectivity and failover within the hardware
So now let’s look at the Virtual Connect Configuration I got in place:
In the image above you can see that only Interconnect Bays 5 and 6 have external connections. You could therefore conclude that these 2 Interconnects are the only important ones for testing a failover at the hardware level. This is not true, and I’ll show you why in the next picture.
The red lines indicate 10 Gb connections between all individual Interconnect Bays, which tied together form the “Virtual Connect Domain”.
The horizontal lines are the X0 ports, which are internally connected by the c7000 backplane. The vertical and diagonal lines are 0.5 meter CX4 cables. (Note that since IC1 and 2 and IC5 and 6 are Flex-10 modules, they are horizontally linked with 2 links (20 Gb), as designed by HP.)
So imagine a packet coming from A (Onboard Port 2) that needs to get out at point B. By powering down Interconnect Bay 1, this packet can only find its way to B via the diagonal CX4 cable. Knowing this, it’s obvious that you should test powering down every individual Interconnect Module.
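You can reason about this kind of “which paths survive?” question before you ever pull a module. Below is a toy sketch that models the stacking links as a graph and checks reachability with a module down. The edge list is my assumption based on the diagram (horizontal backplane/X0 links plus vertical and diagonal CX4 cables), not an exact Virtual Connect domain map, so adjust it to your own cabling.

```python
from collections import deque

# Assumed stacking topology from the diagram; edit to match your cabling.
LINKS = [
    ("IC1", "IC2"), ("IC3", "IC4"), ("IC5", "IC6"),  # horizontal (backplane/X0)
    ("IC1", "IC3"), ("IC2", "IC4"),                  # vertical CX4
    ("IC3", "IC5"), ("IC4", "IC6"),                  # vertical CX4
    ("IC1", "IC4"), ("IC2", "IC3"),                  # diagonal CX4 (assumed)
]

def reachable(src, dst, failed=()):
    """Breadth-first search over the stacking links, skipping failed modules."""
    if src in failed or dst in failed:
        return False
    adjacency = {}
    for a, b in LINKS:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for neighbor in adjacency[node]:
            if neighbor not in seen and neighbor not in failed:
                seen.add(neighbor)
                queue.append(neighbor)
    return False
```

In this model `reachable("IC2", "IC5", failed=("IC1",))` still finds a path via the remaining CX4 cables, which mirrors the packet example above; of course the model only tells you a path exists, the actual failover behavior still has to be tested on the real enclosure.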
Powering down an Interconnect Module can be done from the Onboard Administrator; this isn’t a graceful shutdown and is therefore a good test.
Powering off IC modules is in fact a double test: first you are testing the Virtual Connect Domain behavior, and second the failover from within ESX (I will dive into this subject later on).
Please make sure that you’ve opened a ping -t to different IP addresses in the enclosure (e.g. to VMs and the Service Console) to get a view of packet losses and re-established connections.
Word of advice: in my experience the failovers mostly work fine and re-enabling an IC module causes the real problems. Write down exactly what happens, count ping losses and report them to the network team, since most of the time this is caused by network misconfiguration.
To give you an example, I had scenarios where the failover went fine but re-enabling the IC module caused the network to send a topology change, which blocked the whole network for a short period of time.
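Instead of eyeballing a handful of ping -t windows, you can let a small script do the counting for you. This is a minimal sketch (Linux-style ping flags; the target addresses are hypothetical placeholders for a VM and a Service Console interface) that records per-target results and reports the total losses plus the longest consecutive outage, which is exactly the number the network team will ask for.

```python
import subprocess
import time

def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo to host succeeds (Linux-style ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def monitor(hosts, samples, interval_s=1.0):
    """Ping every host once per interval and record the results."""
    history = {host: [] for host in hosts}
    for _ in range(samples):
        for host in hosts:
            history[host].append(ping_once(host))
        time.sleep(interval_s)
    return history

def summarize(results):
    """Count total losses and the longest consecutive outage in a result list."""
    longest = streak = 0
    for ok in results:
        streak = 0 if ok else streak + 1
        longest = max(longest, streak)
    return {"sent": len(results), "lost": results.count(False), "longest_outage": longest}

# Hypothetical addresses for a VM and a Service Console interface:
# for host, results in monitor(["10.0.0.10", "10.0.0.2"], samples=60).items():
#     print(host, summarize(results))
```

Run it before you power down the IC module and keep it running through the power-on as well; as noted above, re-enabling the module is where the interesting losses tend to show up.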
So, after testing IC1, 2, 3 and 4 we arrive at IC5 and 6, which differ from the rest since they have active links to the outside.
When powering down for example IC5, Virtual Connect Manager has to failover the active link to IC6.
To verify the Shared Uplink Set (SUS) failover behavior, just watch the pings to the components that are using that specific SUS; you can also check Virtual Connect Manager itself:
Now that we’ve tested powering down and powering on all the IC modules, let’s take a look at the last main subject.
Connectivity and failover within the OS
These tests all start again with powering down and powering on IC modules. Let’s take a look at the exact steps.
Powering down IC1 causes the downlink to Onboard NIC Port 1 to fail (since it is hardwired via the c7000 backplane), leaving Onboard NIC Port 1 without a connection.
Since Onboard NIC Port 1 is divided into FlexNics (as described here) these FlexNics will all fail as illustrated in the image below.
So the failure of IC1 causes vmnic0, vmnic2, vmnic4 and vmnic6 to fail in ESX.
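That bay-to-vmnic relationship is regular enough to capture in a few lines, which is handy as a checklist when ticking off failed vmnics during the test. The IC bay 1 ↔ Onboard NIC Port 1 ↔ vmnic0/2/4/6 mapping comes straight from the text above; IC bay 2 ↔ Port 2 ↔ vmnic1/3/5/7 is the symmetric assumption, and this sketch only models the onboard ports wired to bays 1 and 2 in this design.

```python
# Which vmnics disappear when an Interconnect bay is powered down.
FLEXNICS_PER_PORT = 4  # a Flex-10 port is carved into four FlexNICs

def failed_vmnics(ic_bay):
    """vmnics lost in ESX when the given IC bay (1 or 2) goes down."""
    if ic_bay not in (1, 2):
        raise ValueError("only the onboard ports map to bays 1 and 2 here")
    offset = ic_bay - 1        # port 1 -> even vmnics, port 2 -> odd vmnics
    return [offset + 2 * i for i in range(FLEXNICS_PER_PORT)]

print(failed_vmnics(1))  # [0, 2, 4, 6], matching the failure described above
```

If your blades also have mezzanine NICs mapped to bays 3–6, extend the mapping accordingly; the point is simply to know in advance which vmnics should go down, so anything extra (or missing) in the ESX NIC list is a red flag.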
From vSwitch perspective this looks like this:
As designed, this shouldn’t cause anything in ESX to fail, since all the vSwitches are still served by the other NIC port. Test it anyway, just to be sure!
The example above applies to all the IC modules, since all the NICs are configured to be used from within vSphere.
Besides the failover behavior of the vSwitches, the following subjects also need to be tested:
- VMware VMotion to every host in the Cluster;
- Test High Availability (power down a host, or see this blog from Maish Saidel-Keesing which explains how to disable a specific vmknic);
- In my design I also tested what happens when I power down a complete enclosure (which in fact means that 2 of the 4 ESX Hosts become unavailable);
- If applicable, test the RAID of the physical server hosting ESX.
So hopefully I gave you some helpful hints in this blog, happy testing! 🙂