Host in ERROR after hard reboot

Please, describe the problem here and provide additional information below (if applicable) …

Hello,

While doing an HA VM test, we hard-rebooted a host to watch the VM migrate to another host. The migration worked fine, but the host is now stuck in ERROR state.
We tried reinstalling the OpenNebula RPM with no luck, and we removed the host and re-added it, still no luck.
We also tried onehost sync --force, but that made no difference.

In oned.log we see this:

Wed Aug 21 10:59:50 2019 [Z0][InM][I]: Command execution failed (exit code: 134): 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 60 5 HV1; else exit 42; fi'
Wed Aug 21 10:59:50 2019 [Z0][InM][I]: /var/tmp/one/im/run_probes: line 34: 17851 Aborted ./$i $ARGUMENTS
Wed Aug 21 10:59:50 2019 [Z0][InM][E]: Error executing collectd-client.rb
Wed Aug 21 10:59:53 2019 [Z0][InM][I]: Command execution failed (exit code: 134): 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 60 5 HV1; else exit 42; fi'
Wed Aug 21 10:59:53 2019 [Z0][InM][I]: /var/tmp/one/im/run_probes: line 34: 18360 Aborted ./$i $ARGUMENTS
Wed Aug 21 10:59:53 2019 [Z0][InM][E]: Error executing collectd-client.rb


Our setup consists of an FC storage backend with cLVM and a GFS2 filesystem, two hosts, and one frontend.
Besides that, there is no other visible error.

We see UDP communication back and forth between the frontend and the host, meaning they are talking to each other, but some probe script is failing on the host side.

Any suggestions?

Versions of the related components and OS (frontend, hypervisors, VMs):
OpenNebula 5.8.1

Steps to reproduce:
Hard reboot the host.

Current results:
Host in ERROR state, sometimes we see RETRY state and then back to ERROR.

Expected results:
Status=OK

For anyone with the same problem: I was getting the Aborted errors because collectd-client-shepherd.sh was killing the wrong PID.
I commented out the kill -6 part and added a no-op command in its place, and that did the trick.
It appears that, under some special conditions, the collectd scripts cannot handle PIDs correctly.
Note that this disables the SIGABRT to any running collectd-client process, so do a killall ruby to clean up, as the processes will otherwise keep spawning over time.
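A minimal sketch of the edit, assuming the shepherd script sends SIGABRT (signal 6) to a PID it detects; the actual contents of collectd-client-shepherd.sh may differ between versions, and $pid here is only a placeholder:

```shell
#!/bin/bash
# Hypothetical excerpt of collectd-client-shepherd.sh after the workaround.
# pid stands in for whatever PID the real script detects.
pid=12345

# Original behavior, now commented out: SIGABRT to the detected PID.
# kill -6 "$pid"

# No-op replacement (the bash builtin ':'), so the surrounding
# conditional body remains syntactically valid.
:

echo "skipped SIGABRT for pid $pid"
```

The ':' builtin does nothing and always succeeds, which is why it works as a stand-in for the removed kill.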

I added killall ruby to cron every 5 minutes; it works like a charm and the process count doesn't grow over time.
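For reference, a crontab entry along these lines (added with crontab -e as root) does the cleanup; the exact killall path is an assumption and may differ on your distribution:

```shell
# Hypothetical crontab line: every 5 minutes, kill stray ruby probe
# processes so they don't accumulate (-q suppresses "no process found").
*/5 * * * * /usr/bin/killall -q ruby
```

Be aware this kills every ruby process on the host, so it is only safe if nothing else on the hypervisor runs ruby.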