Host in ERROR after hard reboot

Please, describe the problem here and provide additional information below (if applicable) …

Hello,

While doing an HA VM test, we hard-rebooted a host to watch the VM migrate to another host. The migration worked fine, but the host is now stuck in ERROR state.
We tried reinstalling the OpenNebula RPM with no luck, and we removed the host and re-added it, still no luck.
We also tried onehost sync --force, but that made no difference.

In oned.log we see this:

Wed Aug 21 10:59:50 2019 [Z0][InM][I]: Command execution failed (exit code: 134): 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 60 5 HV1; else exit 42; fi'
Wed Aug 21 10:59:50 2019 [Z0][InM][I]: /var/tmp/one/im/run_probes: line 34: 17851 Aborted ./$i $ARGUMENTS
Wed Aug 21 10:59:50 2019 [Z0][InM][E]: Error executing collectd-client.rb
Wed Aug 21 10:59:53 2019 [Z0][InM][I]: Command execution failed (exit code: 134): 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 60 5 HV1; else exit 42; fi'
Wed Aug 21 10:59:53 2019 [Z0][InM][I]: /var/tmp/one/im/run_probes: line 34: 18360 Aborted ./$i $ARGUMENTS
Wed Aug 21 10:59:53 2019 [Z0][InM][E]: Error executing collectd-client.rb


Our setup consists of an FC storage backend with cLVM and a GFS2 filesystem, two hosts, and one frontend.
Besides that, there is no other visible error.

We see UDP communication back and forth between the frontend and the host, meaning they are talking to each other, but some probe script is failing on the host side.

Any suggestions?

Versions of the related components and OS (frontend, hypervisors, VMs):
OpenNebula 5.8.1

Steps to reproduce:
Hard reboot the host.

Current results:
Host in ERROR state, sometimes we see RETRY state and then back to ERROR.

Expected results:
Status=OK

For anyone with the same problem: I was getting the Aborted errors because collectd-client-shepherd.sh was killing the wrong PID.
I commented out the kill -6 part and added a no-op command in its place, and that did the trick.
It appears that, under some special conditions, the collectd scripts cannot handle PIDs correctly.
Note that this disables the SIGABRT to any running collectd-client process, so do a killall ruby to clean up, as the processes will otherwise keep spawning over time.
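A minimal sketch of the edit, assuming the shepherd script sends SIGABRT (signal 6) to a PID it detects; the actual contents of collectd-client-shepherd.sh may differ between versions, and $pid here is only a placeholder:

```shell
#!/bin/bash
# Hypothetical excerpt of collectd-client-shepherd.sh after the workaround.
# pid stands in for whatever PID the real script detects.
pid=12345

# Original behavior, now commented out: SIGABRT to the detected PID.
# kill -6 "$pid"

# No-op replacement (the bash builtin ':'), so the surrounding
# conditional body remains syntactically valid.
:

echo "skipped SIGABRT for pid $pid"
```

The ':' builtin does nothing and always succeeds, which is why it works as a stand-in for the removed kill.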

I added killall ruby to cron every 5 minutes; it works like a charm and the process count doesn't grow over time.
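For reference, a crontab entry along these lines (added with crontab -e as root) does the cleanup; the exact killall path is an assumption and may differ on your distribution:

```shell
# Hypothetical crontab line: every 5 minutes, kill stray ruby probe
# processes so they don't accumulate (-q suppresses "no process found").
*/5 * * * * /usr/bin/killall -q ruby
```

Be aware this kills every ruby process on the host, so it is only safe if nothing else on the hypervisor runs ruby.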