Major issue: error from service: CheckAuthorization

Hi all,

After a few weeks of testing with no major issues, we put our OpenNebula deployment into production this weekend.
We have 36 VMs on 3 hosts.
Today, we ran into problems:
Each host was detected by OpenNebula as down, but the OS was actually running and accessible over SSH.
We have this error in syslog:

Aug 22 22:13:58 adnpvirt07 libvirtd[2117]: error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Aug 22 22:13:58 adnpvirt07 libvirtd[2117]: End of file while reading data: Erreur d’entrée/sortie [Input/output error]

oneadmin is a member of the libvirt group.
This issue caused corruption of the running virtual disks and caused some problems with our production.
In addition, the hook on host error is enabled to provide HA for our hosts.

After rebooting each host (and applying an apt-get upgrade), everything seems fine, but I want to understand where the problem is so I can fix it.

We are using OpenNebula 5.0.2

Thanks for your help,
Yannick

The image corruption seems to occur because the OpenNebula monitoring system detects the host as down while the VMs are still running on it. OpenNebula then restarts the VMs on other hosts, but the disk images are already in use… I think a fencing method must be used on host monitoring failure to avoid this kind of failure.

I will try to apply this.

A feature request is already open to be able to fence a device from the host hook: http://dev.opennebula.org/issues/4659

Hi Yannick,

I’ve extracted the fencing part of the FT host hook from our addon into the following patch: host_error.rb.patch (1.4 KB)

It adds an additional argument ‘-s’ to host_error.rb:

host_error.rb <other-args> -s /usr/sbin/myfencing-script.sh
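
For context, the hook is registered in /etc/one/oned.conf, so the ‘-s’ option goes into the hook’s ARGUMENTS there. The entry below is modeled on the stock 5.x fault-tolerance hook definition; treat the flags other than -s as illustrative defaults:

HOST_HOOK = [
    NAME      = "error",
    ON        = "ERROR",
    COMMAND   = "ft/host_error.rb",
    ARGUMENTS = "$ID -m -p 5 -s /usr/sbin/myfencing-script.sh",
    REMOTE    = "no" ]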

Then, on a host failure event, and if there are VMs on the host, host_error.rb will call the given script, passing the hostname in the FT_HOSTNAME environment variable. Your /usr/sbin/myfencing-script.sh should be something like:

#!/bin/bash

# make sure the sbin directories are in PATH when run as oneadmin
PATH=/sbin:/usr/sbin:/bin:/usr/bin:$PATH

# power off the failed host via its IPMI interface
ipmitool ... -H $FT_HOSTNAME.ipmi.fqdn chassis power off

# record the fencing attempt and its exit code in syslog
logger -t ${0##*/} "fence $FT_HOSTNAME $?"
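
Before relying on it, make the script executable and give it a manual run as oneadmin against a test host (the hostname below is a placeholder):

chmod 755 /usr/sbin/myfencing-script.sh
sudo -u oneadmin FT_HOSTNAME=testhost /usr/sbin/myfencing-script.sh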

Keep in mind that it will be called as the oneadmin user…

Hope this helps,

Anton Todorov

Hi,

Could you explain how to apply this patch?
In addition, $FT_HOSTNAME is the hostname of the host that failed, is that correct?
If so, I can use it (with a case statement) to manage different fencing methods.

Thanks,
Yannick

Hi Yannick,

To apply the patch, use the following (adjust the paths if they differ from the defaults):

cd /var/lib/one/remotes/hooks/ft
cp host_error.rb host_error.rb.orig
patch -p0 < /path/to/host_error.rb.patch

Yes, $FT_HOSTNAME is the hostname of the failed node as it is seen in OpenNebula.

Correct. Feel free to change the script to fit your needs.

Kind Regards,
Anton Todorov

I think an additional patch is needed to make sure the host hook does not miss a failed node:

# If the host came back, exit! avoid duplicated VMs
exit 0 if host.state != 3

change to

# If the host came back, exit! avoid duplicated VMs
exit 0 if host.state != 3 and host.state != 5

where host state 5 is MONITORING_ERROR (monitoring the host from the error state).

When a host fails, its state changes to ERROR and the host hook is triggered. But if you enable --pause, the host will alternate between ERROR and MONITORING_ERROR. Without the above change, if the host hook waits some time and then checks the host again, the host state may happen to be MONITORING_ERROR at that moment, and the hook will decide (IMO wrongly) that the host is back, so no action is taken.
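
If you want to watch this alternation from the CLI while testing, the host state is visible with the standard onehost command (interval and host ID below are arbitrary):

watch -n 5 'onehost show <host-id> | grep -i state'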

Kind Regards,
Anton Todorov

I cannot test this on my production servers, so we must wait until we add new servers to the cluster (next month).
I plan to develop a bash script with (see the sketch after the list):

  1. an array with server hostnames as keys and fencing commands as values.
  2. a lookup in the array to execute the correct command.
  3. some log messages and sanity checks (such as verifying the executable or script being invoked).
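
A minimal sketch of that idea (hostnames, IPMI addresses, and the credentials file below are hypothetical placeholders):

#!/bin/bash

PATH=/sbin:/usr/sbin:/bin:/usr/bin:$PATH

# 1. map: server hostname -> fencing command (placeholder values)
declare -A FENCE_CMD=(
    [adnpvirt07]="ipmitool -I lanplus -H adnpvirt07-ipmi.example -U admin -f /var/lib/one/.ipmipass chassis power off"
    [adnpvirt08]="ipmitool -I lanplus -H adnpvirt08-ipmi.example -U admin -f /var/lib/one/.ipmipass chassis power off"
)

# 2. look up the command for the failed host
CMD=${FENCE_CMD[$FT_HOSTNAME]}
if [ -z "$CMD" ]; then
    logger -t ${0##*/} "no fencing command configured for $FT_HOSTNAME"
    exit 1
fi

# 3. check that the invoked executable exists, then log the result
if ! command -v ${CMD%% *} >/dev/null 2>&1; then
    logger -t ${0##*/} "fencing executable not found: ${CMD%% *}"
    exit 1
fi

$CMD
logger -t ${0##*/} "fence $FT_HOSTNAME $?"

This runs as oneadmin, so the IPMI credentials file must be readable by that user.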

Yannick

This error is back.
It’s a monitoring problem. I have disabled the host error hook, so the VMs keep running on the host and they work fine.
The change made in /etc/libvirt/libvirt.conf on my Debian 8 hosts didn’t solve the issue.

Any help to fix this issue is welcome.

Thanks,
Yannick

I was wrong.
I hadn’t applied the setting in the correct file :frowning:
I had applied it in /etc/libvirt/libvirt.conf and not in /etc/libvirt/libvirtd.conf
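
The exact setting isn’t quoted above, but the usual workaround for the polkit CheckAuthorization timeout is to disable polkit on the libvirt read-write socket in /etc/libvirt/libvirtd.conf; the group name below is an assumption, adjust it to your setup:

# /etc/libvirt/libvirtd.conf
auth_unix_rw = "none"           # skip polkit auth on the rw socket
unix_sock_group = "oneadmin"    # hypothetical group; use whatever group oneadmin belongs to
unix_sock_rw_perms = "0770"

Then restart libvirtd.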

Just to be sure, you can use the libvirt disk locking mechanism:
https://libvirt.org/locking.html
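
A minimal sketch of the lockd variant of that mechanism (option names per the libvirt locking docs; the lockspace directory must be on storage shared by all hosts for cross-host protection):

# /etc/libvirt/qemu.conf -- make the QEMU driver use virtlockd
lock_manager = "lockd"

# /etc/libvirt/qemu-lockd.conf -- indirect leases in a shared lockspace
file_lockspace_dir = "/var/lib/libvirt/lockd/files"

Then restart virtlockd and libvirtd on each host.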

Hi Anton,

Thanks for your contribution and patch. I improved the hook, so your patch doesn’t apply exactly, but I borrowed the same fundamental idea.

https://github.com/OpenNebula/one/blob/master/share/hooks/host_error.rb
https://github.com/OpenNebula/one/blob/master/share/hooks/fence_host.sh

Closing #4659