I’ve set up an OpenNebula 5.10.1 environment with one controller and two satellite nodes. The datastore is LINSTOR/DRBD 9. Live migration works fine; I’m quite happy with it.
To reach the next level I’d like to set up fault tolerance that migrates machines from a broken satellite to a running one.
As it’s only a test environment, I don’t use fencing at all.
So that’s my hook config that I’ve added to OpenNebula using ‘onehook create’:
ARGUMENTS = "$TEMPLATE -m -p 2 -u"
COMMAND   = "ft/host_error.rb"
NAME      = "host_error"
STATE     = "ERROR"
REMOTE    = "no"
RESOURCE  = HOST
TYPE      = state
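For reference, the registration step itself is just this — a sketch, assuming the template above is saved as host_error.tmpl (the file name is my choice, not required by OpenNebula):

```shell
# Register the hook template with the OpenNebula frontend.
# host_error.tmpl is assumed to contain the template shown above.
onehook create host_error.tmpl
# Verify it shows up in the hook list.
onehook list
```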
When shutting down one of the satellite nodes the test-vm gets migrated to a running node after a while. Yay.
Still I’ve got three open questions regarding the fault-tolerance subject:
(1) Is there an easy/builtin way to tag a VM to enable fault tolerance? I want only specific VMs to respawn on the other host in case of an error.
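What I had in mind is something like a custom user attribute that a patched hook could filter on — FT_ENABLE here is purely my own invention, not a builtin attribute:

```shell
# Purely hypothetical: tag a VM with a custom user attribute that a
# modified host_error.rb could check before migrating it.
# FT_ENABLE is my own invention, not an OpenNebula builtin.
cat > ft_tag.txt <<'EOF'
FT_ENABLE = "yes"
EOF
# Append the attribute to the VM's user template (VM ID 34 = my testvm).
onevm update 34 ft_tag.txt --append
```

Is there an existing mechanism like this, or would I really have to patch the hook?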
(2) That’s kinda critical for me: how do I handle a connection error between the controller and the satellites? In my (later) setup the controller node is not in the same datacenter as the satellites. In the worst case the controller might lose its connection to both satellites simultaneously.
That’s my test so far:
$ onehost list
  ID NAME        CLUSTER   TVM    ALLOCATED_CPU     ALLOCATED_MEM STAT
   1 satelliteB  default     0    0 / 3200 (0%)   0K / 188.7G (0%) on
   0 satelliteA  default     1  200 / 3200 (6%) 1.5G / 188.7G (0%) on
$ onevm list
  ID USER     GROUP    NAME   STAT UCPU   UMEM HOST       TIME
  34 oneadmin oneadmin testvm runn  2.0 240.6M satelliteA 0d 03h43
$ ip route add blackhole $satelliteA_IP ; ip route add blackhole $satelliteB_IP
$ sleep 300
$ onehook log --hook-id 0
[…]
   0   1 02/13 13:55    0 SUCCESS
   0   2 02/13 13:57    0 SUCCESS
The logs show that OpenNebula tries to migrate the VM from satelliteA to satelliteB at 13:55. I don’t know why it reports a “SUCCESS”, since no satellite can be reached and the VM is still running on satelliteA. Two minutes later it tries to migrate the VM back from satelliteB to satelliteA, which, obviously, has no effect other than another “SUCCESS” message. That’s kinda weird.
(3) How can I prevent the fail-over when a satellite is (for whatever reason) unreachable, but the VM running on it is actually still alive and is still seen as “running” by OpenNebula? As an example, I could imagine the management interface of that satellite being down while the VM uses another NIC that’s still up.
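What I’d naively try is a guard in front of the hook that probes the guest on its own NIC before allowing the fail-over. This is only a sketch of the idea — vm_probe and VM_IP are illustrative names I made up, not OpenNebula features:

```shell
#!/bin/sh
# Hypothetical guard in front of host_error.rb: only treat the host as
# dead if the guest itself stops answering as well.
# vm_probe and VM_IP are my own inventions, not OpenNebula features.
vm_probe() {
    # Ping the guest on its own NIC; any liveness check would do here.
    ping -c 2 -W 1 "$1" >/dev/null 2>&1
}

if vm_probe "${VM_IP:-127.0.0.1}"; then
    echo "guest still reachable, suppressing fail-over"
else
    echo "guest unreachable, allowing fail-over"
fi
```

Is there a supported place to plug such a check into the FT hook, or does everyone just patch host_error.rb?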