This article originated as a post in Citrix's international community forums, which is why it was written in English. Questions and discussions keep coming up about the effects that occur when the HA feature is enabled in a XenServer pool, so I would like to explain the background and the mechanics here. Once these are understood, the behaviour becomes comprehensible and also largely predictable, which is essential for sensible planning.
The case, in its very short version: a pool of multiple XenServer hosts loses network connectivity, due to the failover of an HP Flex-10 fabric or for other reasons, and this sends all hosts into a reboot loop.
I think this behaviour is more or less by design (don't take this as an official statement; I am not in a position to speak for the vendor or their specialists!).
A XenServer pool is able to act and react without a management server, because all pool members decide on their own what their state and the state of the pool is. They do so by checking the network heartbeat AND the storage heartbeat for the availability of all other hosts. If both heartbeats fail, a host analyzes in what manner they have failed: Is nobody else reachable via either of the heartbeats? Then I am all alone, and the problem is most probably located in or at my very self -> fencing strikes! And it strikes by triggering a low-level (hypervisor-level) hard reboot.
Imagine a different (more probable) situation: a pool with 5 servers. Due to split connectivity, three servers are on one "side" and two on the other. Each server can reach the pool members on its own side via at least one heartbeat, but not those on the other side. In this case (where some heartbeats are still good and others are bad) the smaller group will fence and reboot. The larger group will elect a new master if necessary and restart the VMs that were running on the now fenced servers. If the two groups are equal in size, the current pool master makes the difference, i.e. the group containing the master counts as the larger group.
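The decision each host makes can be sketched in a few lines of Python. This is only an illustration of the rules as described above, not the actual XenServer HA implementation: the function name `should_fence`, its parameters and the host names are made up for this example, and it assumes a simple two-way split where every host knows the full pool membership and the current master.

```python
from typing import Set


def should_fence(me: str, net_peers: Set[str], storage_peers: Set[str],
                 pool: Set[str], master: str) -> bool:
    """Return True if host `me` should fence itself (hypothetical model).

    net_peers / storage_peers: hosts that `me` still sees via the network
    heartbeat and the storage heartbeat, respectively.
    """
    # A peer counts as alive if at least one heartbeat still reaches it.
    group = (net_peers | storage_peers | {me}) & pool

    # Nobody else reachable via either heartbeat: assume the fault is mine.
    if len(group) == 1:
        return True

    # Compare my group against the rest of the pool (two-way split assumed).
    others = pool - group
    if len(group) != len(others):
        return len(group) < len(others)  # the smaller group fences

    # Equal sizes: the group containing the master counts as the larger one.
    return master not in group


pool = {"xs1", "xs2", "xs3", "xs4", "xs5"}
# 3/2 split, old master on the smaller side: xs1..xs3 survive, xs4/xs5 fence.
print(should_fence("xs1", {"xs2"}, {"xs2", "xs3"}, pool, master="xs4"))  # False
print(should_fence("xs4", {"xs5"}, set(), pool, master="xs4"))           # True
```

Note that a host which sees nobody at all always fences in this model, no matter whether it is the master.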
This is a very intelligent concept and probably the only way to make this work automatically, but it has one downside, namely this case: Everybody loses all heartbeats. Everybody is alone. Everybody fences. Nobody is master.
Bold claim: this case should be avoided by a good architecture/design on the fabric side. 😉
With today's "converged infrastructures", the two heartbeats may not be that different anymore, because storage and networking no longer run on two separate fabrics in all cases. Yes, this weakens the otherwise very strong concept of two heartbeats (the storage heartbeat actually serves as a "quorum" for deciding who is there and who is not).