Host error hook not working (5.4.13)

The host error hook fails with a fencing error when a hypervisor goes down.

Versions of the related components and OS (frontend, hypervisors, VMs):
opennebula 5.4.13
centos7

Steps to reproduce:

  • uncomment the host hook in oned.conf on all HA frontend nodes
    HOST_HOOK = [
    NAME = "error",
    ON = "ERROR",
    COMMAND = "ft/host_error.rb",
    ARGUMENTS = "$ID -m -p 0",
    REMOTE = "no" ]
  • restart oned on all HA frontend nodes
  • shut down a hypervisor that has a VM running on it

Current results:
oned.log:
Wed Jul 11 18:46:07 2018 [Z0][InM][I]: Command execution fail: ‘if [ -x “/var/tmp/one/im/run_probes” ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 2 ord-virt-004; else exit 42; fi’
Wed Jul 11 18:46:07 2018 [Z0][InM][I]: ssh: connect to host ord-virt-004 port 22: Invalid argument
Wed Jul 11 18:46:07 2018 [Z0][InM][I]: ExitCode: 255
Wed Jul 11 18:46:09 2018 [Z0][ReM][D]: Req:3520 UID:0 one.zone.raftstatus invoked
Wed Jul 11 18:46:09 2018 [Z0][ReM][D]: Req:3520 UID:0 one.zone.raftstatus result SUCCESS, “<SERVER_ID>0</…”
Wed Jul 11 18:46:09 2018 [Z0][ReM][D]: Req:9296 UID:0 one.vmpool.info invoked , -2, -1, -1, -1
Wed Jul 11 18:46:09 2018 [Z0][ReM][D]: Req:9296 UID:0 one.vmpool.info result SUCCESS, “<VM_POOL>13<…”
Wed Jul 11 18:46:09 2018 [Z0][ReM][D]: Req:9968 UID:0 one.vmpool.info invoked , -2, -1, -1, -1
Wed Jul 11 18:46:09 2018 [Z0][ReM][D]: Req:9968 UID:0 one.vmpool.info result SUCCESS, “<VM_POOL>13<…”
Wed Jul 11 18:46:11 2018 [Z0][InM][I]: Command execution fail: ‘if [ -x “/var/tmp/one/im/run_probes” ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 2 ord-virt-004; else exit 42; fi’
Wed Jul 11 18:46:11 2018 [Z0][InM][I]: ssh: connect to host ord-virt-004 port 22: Invalid argument
Wed Jul 11 18:46:11 2018 [Z0][InM][I]: ExitCode: 255
Wed Jul 11 18:46:14 2018 [Z0][InM][I]: Command execution fail: ‘if [ -x “/var/tmp/one/im/run_probes” ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 2 ord-virt-004; else exit 42; fi’
Wed Jul 11 18:46:14 2018 [Z0][InM][I]: ssh: connect to host ord-virt-004 port 22: Invalid argument
Wed Jul 11 18:46:14 2018 [Z0][InM][I]: ExitCode: 255
Wed Jul 11 18:46:18 2018 [Z0][InM][D]: Monitoring host ord-virt-003 (1)
Wed Jul 11 18:46:18 2018 [Z0][InM][I]: Command execution fail: ‘if [ -x “/var/tmp/one/im/run_probes” ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 2 ord-virt-004; else exit 42; fi’
Wed Jul 11 18:46:18 2018 [Z0][InM][I]: ssh: connect to host ord-virt-004 port 22: Invalid argument
Wed Jul 11 18:46:18 2018 [Z0][InM][I]: ExitCode: 255
Wed Jul 11 18:46:18 2018 [Z0][ONE][E]: Error monitoring Host ord-virt-004 (2): -
Wed Jul 11 18:46:19 2018 [Z0][ReM][D]: Req:8960 UID:0 one.system.config invoked
Wed Jul 11 18:46:19 2018 [Z0][ReM][D]: Req:8960 UID:0 one.system.config result SUCCESS, “<AUTH_MAD>…”
Wed Jul 11 18:46:19 2018 [Z0][ReM][D]: Req:6400 UID:0 one.host.info invoked , 2
Wed Jul 11 18:46:19 2018 [Z0][ReM][D]: Req:6400 UID:0 one.host.info result SUCCESS, “2<NAM…”
Wed Jul 11 18:46:19 2018 [Z0][HKM][D]: Message received: LOG I 2 Command execution fail: /var/lib/one/remotes//hooks/ft/host_error.rb 2 -m -p 0
Wed Jul 11 18:46:19 2018 [Z0][HKM][D]: Message received: LOG I 2 ExitCode: 255
Wed Jul 11 18:46:19 2018 [Z0][HKM][D]: Message received: EXECUTE FAILURE 2 error: -

host_error.log:
[2018-07-11 18:46:19 +0000][HOST 2][I] Hook launched
[2018-07-11 18:46:19 +0000][HOST 2][I] hostname: ord-virt-004
[2018-07-11 18:46:19 +0000][HOST 2][I] Fencing enabled
[2018-07-11 18:46:19 +0000][HOST 2][E]
[2018-07-11 18:46:19 +0000][HOST 2][E] Fencing error
[2018-07-11 18:46:19 +0000][HOST 2][E] Exiting due to previous error.

Expected results:
The VM should be restarted on an available node.

Hi, it seems the master cannot reach the hypervisor, see:

ssh: connect to host ord-virt-004 port 22: Invalid argument

Please make sure that the oneadmin user can log in to all nodes without a password, so that
"ssh oneadmin@ord-virt-004" works without a password prompt (so use SSH keys) or any other interaction.
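
A quick way to test this from the frontend is sketched below. It is only a minimal sketch: the function name check_passwordless_ssh is made up for illustration, and the 5 second timeout is an arbitrary choice. BatchMode=yes makes ssh fail instead of asking for a password, so the check cannot hang on a prompt.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenNebula): returns 0 only if oneadmin
# can log in to the given host without any interaction.
check_passwordless_ssh() {
    host="$1"
    # BatchMode=yes: fail instead of prompting for a password or passphrase.
    # ConnectTimeout=5: do not block for long on a host that is down.
    ssh -o BatchMode=yes -o ConnectTimeout=5 "oneadmin@${host}" true
}

# Example usage, run as oneadmin on the frontend:
# check_passwordless_ssh ord-virt-004 && echo "ok" || echo "failed"
```

A host that is powered off will also fail this check, so it only says something about the SSH setup while the node is up.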

hope this helps!

EDIT: oops, misunderstood that part of the logs, you wanted to show the fencing part in the end, not the errors in the beginning - my bad :upside_down_face:

Do you have fencing configured for the hosts, for use by the host_error script?

Without proper fencing you could end up in a split-brain situation with two instances of the same VM running, which could lead to data corruption if a shared datastore is in use.

Please read this docs section thoroughly.

If you are absolutely sure about what you are doing, add -u to the arguments of the host_error script and restart the opennebula service.

Hope this helps,
Anton Todorov


Well, ord-virt-004 is the host that is down, so I expect that oneadmin cannot ssh to it while it is down.

Thanks for the info, I just noticed that fence_host.sh was not configured, as its first line is still "exit 1"!
Can I suggest adding echo "Fence host not configured, please edit ft/fence_host.sh" just before the "exit 1"?
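
Roughly what I have in mind, as a sketch only. The function name fence_not_configured is made up for illustration and is not in the shipped script, and sending the message to stderr is my own assumption about where it would be most visible.

```shell
#!/bin/sh
# Hypothetical guard for an unedited fence_host.sh: print a hint, then fail.
fence_not_configured() {
    # stderr is an assumption; a plain echo to stdout would also work.
    echo "Fence host not configured, please edit ft/fence_host.sh" >&2
    return 1
}

# In the script this would stand in for the bare "exit 1":
# fence_not_configured || exit 1
```

The exit status stays 1, so the hook would still stop with "Fencing error" as it does today, but with a hint about the cause.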

Thank you

Hi,

You could open a feature request at https://github.com/OpenNebula/one/issues

Best Regards,
Anton Todorov

Done here: https://github.com/OpenNebula/one/issues/2282