Problem migrating VMs after the host error hook is launched

Hello Support,
Our plan is to migrate from OpenNebula 5.2 to 5.4. We are in the lucky situation of having some spare servers to test the new version, so we set up a frontend cluster with three nodes (the new HA mode) and added two KVM nodes. We use GlusterFS, so the datastores are shared across all frontends and KVM nodes. We deployed two VMs on each node for testing purposes. Live migration and cold migration work fine when triggered from the frontend (for example with the commands sketched below). Everything works fine except the host error hook.
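For reference, these are the kinds of commands we trigger from the frontend (a minimal sketch; VM ID 76 and host ID 0 are just the values that appear in the logs below):

# Cold migration: the VM is saved on the source host and restarted on the target
onevm migrate 76 0

# Live migration: the VM keeps running while it is transferred to the target
onevm migrate --live 76 0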
Our assumption is the following (our hook configuration is sketched after the list):

  1. The host goes down, initiated by an IPMI poweroff
  2. OpenNebula recognizes that the host is down (we can see this via "onehost list" and in oned.log)
  3. Since only cold migration can work when a host is down, we assume that this is what should happen
  4. The host hook is launched, the fencing works, and the VMs seem to be rescheduled (we can see this in sched.log and host_error.log). What confuses us is that oned.log shows "Live-migrating VM XY"
  5. Sometimes only one VM is "migrated" to the other host, sometimes both, but they always get stuck in state MIGRATE
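For context, our host error hook follows the stock sample in /etc/one/oned.conf, roughly like this (a sketch; the exact host_error.rb arguments depend on the 5.4 build, and -p 1 matches the "Wait 1 cycles." line in host_error.log below):

# Fault-tolerance hook: fired when a host enters the ERROR state.
# -m reschedules the VMs away from the failed host, -p 1 waits one
# monitoring cycle before acting.
HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "ft/host_error.rb",
    arguments = "$ID -m -p 1",
    remote    = "no" ]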

oned.log:

Mon Aug 28 13:28:45 2017 [Z0][InM][D]: Host one-mgmrz-01 (0) successfully monitored.
Mon Aug 28 13:28:46 2017 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 1 one-mgmrz-02; else exit 42; fi'
Mon Aug 28 13:28:46 2017 [Z0][InM][I]: ssh: connect to host one-mgmrz-02 port 22: Connection timed out
Mon Aug 28 13:28:46 2017 [Z0][InM][I]: ExitCode: 255
Mon Aug 28 13:28:48 2017 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 1 one-mgmrz-02; else exit 42; fi'
Mon Aug 28 13:28:48 2017 [Z0][InM][I]: ssh: connect to host one-mgmrz-02 port 22: No route to host
Mon Aug 28 13:28:48 2017 [Z0][InM][I]: ExitCode: 255
Mon Aug 28 13:28:51 2017 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 1 one-mgmrz-02; else exit 42; fi'
Mon Aug 28 13:28:51 2017 [Z0][InM][I]: ssh: connect to host one-mgmrz-02 port 22: No route to host
Mon Aug 28 13:28:51 2017 [Z0][InM][I]: ExitCode: 255
Mon Aug 28 13:28:54 2017 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 1 one-mgmrz-02; else exit 42; fi'
Mon Aug 28 13:28:54 2017 [Z0][InM][I]: ssh: connect to host one-mgmrz-02 port 22: No route to host
Mon Aug 28 13:28:54 2017 [Z0][InM][I]: ExitCode: 255
Mon Aug 28 13:28:54 2017 [Z0][ONE][E]: Error monitoring Host one-mgmrz-02 (1): -
Mon Aug 28 13:29:05 2017 [Z0][InM][D]: Host one-mgmrz-01 (0) successfully monitored.
Mon Aug 28 13:29:12 2017 [Z0][MKP][D]: Monitoring marketplace OpenNebula Public (0)
Mon Aug 28 13:29:13 2017 [Z0][InM][D]: Monitoring datastore default (1)
Mon Aug 28 13:29:13 2017 [Z0][InM][D]: Monitoring datastore files (2)
Mon Aug 28 13:29:13 2017 [Z0][InM][D]: Monitoring datastore system-qcow2 (100)
Mon Aug 28 13:29:13 2017 [Z0][InM][D]: Monitoring datastore image-qcow2 (101)
Mon Aug 28 13:29:13 2017 [Z0][InM][D]: Monitoring datastore image-raw (102)
Mon Aug 28 13:29:13 2017 [Z0][ImM][D]: Datastore default (1) successfully monitored.
Mon Aug 28 13:29:13 2017 [Z0][ImM][D]: Datastore image-raw (102) successfully monitored.
Mon Aug 28 13:29:14 2017 [Z0][ImM][D]: Datastore image-qcow2 (101) successfully monitored.
Mon Aug 28 13:29:14 2017 [Z0][ImM][D]: Datastore files (2) successfully monitored.
Mon Aug 28 13:29:14 2017 [Z0][MKP][D]: Marketplace OpenNebula Public (0) successfully monitored.
Mon Aug 28 13:29:14 2017 [Z0][VMM][D]: VM 76 successfully monitored: DISK_SIZE=[ID=0,SIZE=8] SNAPSHOT_SIZE=[ID=0,DISK_ID=0,SIZE=8] DISK_SIZE=[ID=1,SIZE=1]
Mon Aug 28 13:29:14 2017 [Z0][ImM][D]: Datastore system-qcow2 (100) successfully monitored.
Mon Aug 28 13:29:25 2017 [Z0][InM][D]: Host one-mgmrz-01 (0) successfully monitored.
Mon Aug 28 13:29:29 2017 [Z0][DiM][D]: Setting rescheduling flag on VM 76
Mon Aug 28 13:29:29 2017 [Z0][HKM][D]: Message received: EXECUTE SUCCESS 1 error:
Mon Aug 28 13:29:35 2017 [Z0][DiM][D]: Live-migrating VM 76
Mon Aug 28 13:29:37 2017 [Z0][InM][D]: Monitoring host one-mgmrz-02 (1)
Mon Aug 28 13:29:40 2017 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 1 one-mgmrz-02; else exit 42; fi'
Mon Aug 28 13:29:40 2017 [Z0][InM][I]: ssh: connect to host one-mgmrz-02 port 22: No route to host
Mon Aug 28 13:29:40 2017 [Z0][InM][I]: ExitCode: 255
Mon Aug 28 13:29:42 2017 [Z0][MKP][D]: Monitoring marketplace OpenNebula Public (0)
Mon Aug 28 13:29:43 2017 [Z0][InM][D]: Monitoring datastore default (1)
Mon Aug 28 13:29:43 2017 [Z0][InM][D]: Monitoring datastore files (2)
Mon Aug 28 13:29:43 2017 [Z0][InM][D]: Monitoring datastore system-qcow2 (100)
Mon Aug 28 13:29:43 2017 [Z0][InM][D]: Monitoring datastore image-qcow2 (101)
Mon Aug 28 13:29:43 2017 [Z0][InM][D]: Monitoring datastore image-raw (102)
Mon Aug 28 13:29:43 2017 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 1 one-mgmrz-02; else exit 42; fi'
Mon Aug 28 13:29:43 2017 [Z0][InM][I]: ssh: connect to host one-mgmrz-02 port 22: No route to host
Mon Aug 28 13:29:43 2017 [Z0][InM][I]: ExitCode: 255
Mon Aug 28 13:29:43 2017 [Z0][ImM][D]: Datastore files (2) successfully monitored.
Mon Aug 28 13:29:43 2017 [Z0][ImM][D]: Datastore default (1) successfully monitored.
Mon Aug 28 13:29:43 2017 [Z0][ImM][D]: Datastore image-qcow2 (101) successfully monitored.
Mon Aug 28 13:29:44 2017 [Z0][ImM][D]: Datastore image-raw (102) successfully monitored.
Mon Aug 28 13:29:44 2017 [Z0][VMM][D]: VM 76 successfully monitored: DISK_SIZE=[ID=0,SIZE=8] SNAPSHOT_SIZE=[ID=0,DISK_ID=0,SIZE=8] DISK_SIZE=[ID=1,SIZE=1]
Mon Aug 28 13:29:44 2017 [Z0][ImM][D]: Datastore system-qcow2 (100) successfully monitored.
Mon Aug 28 13:29:44 2017 [Z0][MKP][D]: Marketplace OpenNebula Public (0) successfully monitored.
Mon Aug 28 13:29:46 2017 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 1 one-mgmrz-02; else exit 42; fi'
Mon Aug 28 13:29:46 2017 [Z0][InM][I]: ssh: connect to host one-mgmrz-02 port 22: No route to host
Mon Aug 28 13:29:46 2017 [Z0][InM][I]: ExitCode: 255
Mon Aug 28 13:29:49 2017 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 1 one-mgmrz-02; else exit 42; fi'
Mon Aug 28 13:29:49 2017 [Z0][InM][I]: ssh: connect to host one-mgmrz-02 port 22: No route to host
Mon Aug 28 13:29:49 2017 [Z0][InM][I]: ExitCode: 255
Mon Aug 28 13:29:49 2017 [Z0][ONE][E]: Error monitoring Host one-mgmrz-02 (1): -
Mon Aug 28 13:29:50 2017 [Z0][InM][D]: Host one-mgmrz-01 (0) successfully monitored.

vm.log:

Mon Aug 28 13:10:40 2017 [Z0][VM][I]: New LCM state is RUNNING
Mon Aug 28 13:28:54 2017 [Z0][VM][I]: New LCM state is UNKNOWN
Mon Aug 28 13:29:14 2017 [Z0][LCM][I]: VM found again by the drivers
Mon Aug 28 13:29:14 2017 [Z0][VM][I]: New LCM state is RUNNING
Mon Aug 28 13:29:35 2017 [Z0][VM][I]: New LCM state is MIGRATE

sched.log:

Mon Aug 28 13:29:34 2017 [Z0][SCHED][D]: Getting scheduled actions information. Total time: 0.01s
Mon Aug 28 13:29:34 2017 [Z0][VM][D]: Pending/rescheduling VM and capacity requirements:
ACTION VM CPU Memory PCI System DS Image DS
RESCHED 76 50 524288 0 2048 DS 101: 0
Mon Aug 28 13:29:34 2017 [Z0][HOST][D]: Discovered Hosts (enabled):
ID : 0
CLUSTER_ID : 0
MEM_USAGE : 0
CPU_USAGE : 0
MAX_MEM : 8127736
MAX_CPU : 400
FREE_DISK : 442067
RUNNING_VMS : 0
PUBLIC : 0
DSID FREE_MB
0 442067
PCI ADDRESS CLASS VENDOR DEVICE VMID
Mon Aug 28 13:29:34 2017 [Z0][VM][D]: VMs in VMGroups:
Mon Aug 28 13:29:34 2017 [Z0][SCHED][D]: Getting VM and Host information. Total time: 0.04s
Mon Aug 28 13:29:34 2017 [Z0][VMGRP][D]: VM Group Scheduling information
Mon Aug 28 13:29:34 2017 [Z0][SCHED][D]: Setting VM groups placement constraints. Total time: 0.00s
Mon Aug 28 13:29:34 2017 [Z0][SCHED][D]: Match Making statistics:
Number of VMs: 1
Total time: 0s
Total Cluster Match time: 0s
Total Host Match time: 0.00s
Total Host Ranking time: 0.00s
Total DS Match time: 0s
Total DS Ranking time: 0s
Mon Aug 28 13:29:34 2017 [Z0][SCHED][D]: Scheduling Results:
Virtual Machine: 76
PRI ID - HOSTS
0 0
PRI ID - DATASTORES
0 100
Mon Aug 28 13:29:35 2017 [Z0][SCHED][D]: Dispatching VMs to hosts:
VMID Priority Host System DS
76 0 0 100
Mon Aug 28 13:29:35 2017 [Z0][SCHED][D]: Dispatching VMs to hosts. Total time: 0.55s
Mon Aug 28 13:30:05 2017 [Z0][SCHED][D]: Getting scheduled actions information. Total time: 0.01s
Mon Aug 28 13:30:05 2017 [Z0][SCHED][D]: Getting VM and Host information. Total time: 0.00s

host_error.log:

[2017-08-28 13:28:54 +0200][HOST 1][I] Hook launched
[2017-08-28 13:28:54 +0200][HOST 1][I] hostname: one-mgmrz-02
[2017-08-28 13:28:54 +0200][HOST 1][I] Wait 1 cycles.
[2017-08-28 13:28:54 +0200][HOST 1][I] Sleeping 30 seconds.
[2017-08-28 13:29:24 +0200][HOST 1][I] Fencing enabled
[2017-08-28 13:29:29 +0200][HOST 1][I] Success: Rebooted
[2017-08-28 13:29:29 +0200][HOST 1][I] Fencing success
[2017-08-28 13:29:29 +0200][HOST 1][I] states: 3, 5, 8
[2017-08-28 13:29:29 +0200][HOST 1][I] vms: ["76"]
[2017-08-28 13:29:29 +0200][HOST 1][I] resched 76
[2017-08-28 13:29:29 +0200][HOST 1][I] Hook finished

Some Questions:

  1. Why does oned.log show "Mon Aug 28 13:29:14 2017 [Z0][VMM][D]: VM 76 successfully monitored: DISK_SIZE=[ID=0,SIZE=8] SNAPSHOT_SIZE=[ID=0,DISK_ID=0,SIZE=8] DISK_SIZE=[ID=1,SIZE=1]" when the host the VM was running on is down? At the same time, vm.log shows "VM found again by the drivers". That should not be possible.
  2. Why does oned.log show "Mon Aug 28 13:29:29 2017 [Z0][DiM][D]: Setting rescheduling flag on VM 76
    Mon Aug 28 13:29:29 2017 [Z0][HKM][D]: Message received: EXECUTE SUCCESS 1 error:
    Mon Aug 28 13:29:35 2017 [Z0][DiM][D]: Live-migrating VM 76"? Shouldn't this be a cold migration? And what does "EXECUTE SUCCESS 1 error:" mean?
  3. Why is the VM stuck in state MIGRATE?
  4. How can we fix this behaviour?

THX 4 UR HELP

  1. It seems that there is a bug that causes the VM to change to the RUNNING state when the disk monitoring information arrives. I've opened an issue:

https://dev.opennebula.org/issues/5331

  2. "EXECUTE SUCCESS 1 error:" means that the hook was executed successfully and wrote no error message.

  3. The VM changed from UNKNOWN to RUNNING before the scheduler picked it up for rescheduling, so the scheduler assumed it was really running and attempted a live migration. In the UNKNOWN state the VM is not migrated; it is redeployed instead. This was caused by the bug you found.

  4. We aim to fix the bug in version 5.4.1. Meanwhile, as a workaround, you can disable disk monitoring: comment out the disk monitoring code in /var/lib/one/remotes/tm/qcow2/monitor and then run onehost sync --force on the frontend.
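A rough sketch of the workaround steps on the frontend (the exact lines to comment out depend on your build of the monitor script, so treat this as illustrative):

# As oneadmin on the frontend: comment out the section of the TM monitor
# script that reports per-VM disk sizes.
vi /var/lib/one/remotes/tm/qcow2/monitor

# Push the modified remote scripts to all hosts:
onehost sync --force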

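In case someone else hits the stuck MIGRATE state: an untested sketch, assuming the standard onevm recover options in 5.4 (check onevm recover --help first), to manually resolve the interrupted action:

# Tell OpenNebula the pending migration of VM 76 failed, so the VM
# leaves the MIGRATE state and can be rescheduled again:
onevm recover --failure 76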

A BIG THX for the help. Now everything works as "expected".