I recently had a design discussion about the best way to configure the VMware Fault Tolerance Logging network. During the discussion we quickly established that you want to ensure redundancy for the VMware FT Logging network. The most interesting part of the discussion was how to configure the vSwitch dedicated to FT traffic.
Before discussing the teaming/load balancing options, let's focus a bit more on the VMware FT interface. VMware FT keeps the Primary and Secondary VMs in lockstep using VMware vLockstep technology, which ensures that the Primary and Secondary VMs execute the same x86 instructions in an identical sequence. The Primary VM captures all nondeterministic events and sends them across the VMware FT logging network to the Secondary VM. The Secondary VM receives and then replays those nondeterministic events in the same sequence as the Primary VM, typically with a very small lag time.
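The record-and-replay idea behind vLockstep can be illustrated with a toy sketch (plain Python, purely illustrative; this is not VMware's actual protocol or wire format): the primary logs every nondeterministic input, and a secondary that replays the same log in the same order deterministically ends up in the same state.

```python
import random

def primary_execute(steps):
    """Toy 'Primary VM': runs a computation, logging every
    nondeterministic input (here, a random draw) as it occurs."""
    log = []
    state = 0
    for _ in range(steps):
        event = random.randint(0, 9)   # nondeterministic input
        log.append(event)              # sent over the FT logging network
        state = state * 31 + event     # deterministic given the inputs
    return state, log

def secondary_replay(log):
    """Toy 'Secondary VM': replays the logged events in the same
    order, so it deterministically reaches the identical state."""
    state = 0
    for event in log:
        state = state * 31 + event
    return state

primary_state, event_log = primary_execute(1000)
# Replaying the log in order reproduces the primary's state exactly:
assert secondary_replay(event_log) == primary_state
```

This is why the logging network matters so much: only the nondeterministic inputs cross the wire, and the secondary must receive them in order and with low latency to stay in lockstep.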
The FT interface is configured per ESX(i) host as a vmkernel port and needs a unique IP address assigned. This unique IP address is used to send out the logging traffic, as shown in the figure below.
Looking at a typical design with two vmnics configured for FT you can set the failover configuration to either active/active or active/passive as shown in the overview below.
or
The active/passive configuration is pretty straightforward: if the active adapter fails, the standby adapter takes over seamlessly.
The active/active configuration, however, can cause some confusion, since there are a few options in there that I'd like to explain in more detail.
Active/Active with default “Route based on the originating virtual port ID” load balancing
Since the FT interface has only one MAC address, all traffic is sent out over a single vmnic in this configuration. The vmnic that will be used is the first vmnic listed under "Active Adapters", as shown in the overview below. vmnic1 will carry all FT logging traffic, and vmnic5 will only be used if vmnic1 fails.
Note that you can change which vmnic is active on the fly by simply moving another vmnic to the first position in the list.
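To make the pinning behavior concrete, here is a minimal sketch (Python, illustrative only; the real port-ID-to-uplink mapping is internal to the vSwitch, and the modulo here is just a stand-in): with a single FT vmkernel port there is a single originating port ID, so the policy always resolves to one active adapter.

```python
def uplink_for_port(port_id, active_uplinks):
    """'Route based on the originating virtual port ID', simplified:
    each virtual switch port is pinned to one active uplink and
    stays there until that uplink fails."""
    return active_uplinks[port_id % len(active_uplinks)]

active = ["vmnic1", "vmnic5"]   # order of the 'Active Adapters' list
ft_port_id = 16                 # hypothetical port ID of the FT vmkernel port

# One vmkernel port means one port ID, so all FT traffic uses one vmnic:
nic = uplink_for_port(ft_port_id, active)

# If vmnic1 fails, the remaining active adapter carries everything:
failover_nic = uplink_for_port(ft_port_id, ["vmnic5"])
```

Whatever the port ID is, the result is constant for that port, which is why this policy never spreads a single FT interface's traffic over both vmnics.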
Active/Active with “Route Based on source MAC hash”
As with "Route based on the originating virtual port ID", the FT interface presents only a single source MAC address, so the MAC hash always resolves to the same vmnic and all FT logging traffic uses one uplink.
Active/Active with “Route based on IP Hash”
IP hash load balancing is another policy that you can use on the FT logging network. This policy uses both the source and destination IP addresses of the ESX(i) hosts that are running FT-protected VMs. EtherChannel/link aggregation is mandatory on the physical network to correctly use the IP hash algorithm.
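A commonly cited simplification of the vSwitch IP hash is to XOR the least significant byte of the source and destination IP addresses and take the result modulo the number of active uplinks. The sketch below (Python, illustrative; the host IPs and vmnic names are made up) shows both the point made here and in the comments: a given pair of FT IP addresses always lands on the same uplink, while different host pairs may be spread across uplinks.

```python
import ipaddress

def ip_hash_uplink(src, dst, uplinks):
    """'Route based on IP hash', simplified: XOR the last octet of
    the source and destination IPs, modulo the active uplink count."""
    s = int(ipaddress.ip_address(src)) & 0xFF
    d = int(ipaddress.ip_address(dst)) & 0xFF
    return uplinks[(s ^ d) % len(uplinks)]

uplinks = ["vmnic1", "vmnic5"]

# One primary/secondary host pair always hashes to the same uplink:
a = ip_hash_uplink("10.0.0.1", "10.0.0.2", uplinks)
b = ip_hash_uplink("10.0.0.1", "10.0.0.2", uplinks)
assert a == b

# A second FT host pair with different addresses can land on the other uplink:
c = ip_hash_uplink("10.0.0.1", "10.0.0.3", uplinks)
assert a != c
```

So IP hash only spreads FT logging traffic when multiple host pairs (and thus multiple source/destination IP combinations) are involved; a single FT pair still uses one physical NIC.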
In my opinion there is a risk in using the "Route based on IP Hash" load balancing policy. Generally you only enable FT on important VMs that need to meet very strict availability SLAs. The IP hash method will utilize both vmnics, spreading out the load, but consider what happens when one vmnic fails: the combined load must then fit on the remaining uplink. You don't want to end up with congestion on the FT network, which could impact Fault Tolerance itself, since it is a key element in your high availability.
The key take-away here is to lay out the design options and let you make the best selection for your specific scenario.
Alastair Cooke
/ December 14, 2010
I like avoiding IP hash based load balancing; I've seen too many failures due to misconfiguration.
Particularly as each combination of source and destination IP addresses still only uses one physical NIC. Consequently, all FT traffic for a single protected VM will only use one physical NIC.
Duncan
/ December 14, 2010
I don't get the article. Isn't the impact of ANY "load balancing" mechanism that there is a risk you will end up running on a single link?
In the case of IP hash, a single link will be used for a single VM, as the result of the hash algorithm will always be the same (source-destination!). Only when you are running multiple FT-protected VMs might a situation occur where you are balancing the load between NICs.
In the case of Virtual Port ID, only a single port ID is used and no balancing will take place no matter what.
In this case it seems to me that with IP hash the chances of running into congestion issues are reduced during normal operation, while with Virtual Port ID the risk is always there.
In this case I personally feel there is a bigger risk when using VPID than when using IP hash.
Erik Zandboer
/ December 14, 2010
Load balancing based on IP hash is not going to help at all in an FT network. Like Alastair posted above, you have a single source and a single destination IP, so the hashing algorithm will always come up with the same uplink (when both are available, that is). IP hashing also introduces a lot of unnecessary overhead: each packet (L2!) has to be inspected for its L3 payload and hashed, only to arrive at the same result every time. I think (although I never actually measured it) latency will be introduced when using IP hashing because of this. And latency is what you do not want, especially in an FT network.
Erik Zandboer
/ December 14, 2010
Duncan you beat me to it 🙂
Load balancing could work when you have more than two hosts and more than one FT-protected VM, but I suspect the load balancing is not worth the extra overhead (and latency) when using the IP-hashing algorithm.
Kenneth van Ditmarsch
/ December 14, 2010
I'm not stating that it is going to help with anything; I'm only outlining the available options here.
If multiple VMs and ESX(i) hosts are used, you get different Fault Tolerance IP addresses, which can result in different uplinks being used.
Kenneth van Ditmarsch
/ December 14, 2010
Correct, there's always a risk of ending up with one link.
I've discussed the congestion issues with a network admin, and he stated that it's easier to monitor one physical switch port for congestion (Virtual Port ID, using one vmnic) than to monitor two physical switch ports (IP hash, using multiple vmnics) and verify that together they don't exceed the bandwidth a single vmnic can deliver.
Duncan
/ December 14, 2010
Correct, IP hash could help balance the load depending on the number of VMs, ESX(i) hosts and the outcome of the hash algorithm… I do still think the article lacks a solid closing statement.