Hosts entering the ERROR state

Hello,

I have an OpenNebula (ONe) cluster with ~25 hosts running CentOS 7 and ONe 4.14, and occasionally some of the hosts enter the ERROR state with the following message visible in Sunstone:

ERROR Mon May 23 09:19:40 2016 : Error monitoring Host myhost25 (48): error: failed to connect to the hypervisor
error: error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
ERROR MESSAGE --8<------
Error executing kvm.rb
ERROR MESSAGE ------>8--
ARCH=x86_64
MODELNAME="Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz"

The existing VMs deployed on the host keep running without problems. I have discovered that running "service libvirtd restart" fixes the problem, so it might be that libvirtd stops accepting requests from kvm.rb for some reason. The other hosts keep running OK, and the next time the problem occurs it is on a different host. I see a host entering the ERROR state about once or twice a week.
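
Until the root cause is known, the restart workaround described above can be automated. The following is only a sketch under assumptions that do not come from this thread: the `qemu:///system` URI, the 30-second limit, the helper names `responds_within` and `check_libvirtd`, and running it periodically as root (for example from cron).

```shell
#!/bin/bash
# Watchdog sketch: restart libvirtd only when it stops answering.
# Assumed, not taken from the thread: the qemu:///system URI, the
# 30 second limit, and the function names used here.

# Succeeds if the given command exits successfully within the time limit.
responds_within() {
    local limit="$1"
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
}

# Ask libvirtd for the domain list; restart the service if that fails.
check_libvirtd() {
    if responds_within 30 virsh -c qemu:///system list; then
        echo "libvirtd OK"
    else
        echo "libvirtd not responding, restarting"
        service libvirtd restart
    fi
}

# Run the check only when called with --run, so the file can also be sourced.
if [ "${1:-}" = "--run" ]; then
    check_libvirtd
fi
```

A restart like this only hides the symptom, so it is worth logging each restart to see how often it actually fires.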

Can you help me to find out where the problem could be?

Thanks!

Hi,
I’m seeing the same problem with ONE 4.14 and Debian (both 7 and 8).

Here’s a log from libvirtd on one host that died yesterday:
2016-05-22 22:50:04.295+0000: 11577: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2016-05-22 22:50:04.317+0000: 11573: error : virNetSocketReadWire:1571 : End of file while reading data: Input/output error
2016-05-22 22:50:34.376+0000: 11575: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2016-05-22 22:50:34.376+0000: 11573: error : virNetSocketReadWire:1571 : End of file while reading data: Input/output error
2016-05-22 22:51:04.387+0000: 11576: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2016-05-22 22:51:04.388+0000: 11573: error : virNetSocketReadWire:1571 : End of file while reading data: Input/output error
2016-05-22 22:51:34.399+0000: 11579: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2016-05-22 22:51:34.399+0000: 11573: error : virNetSocketReadWire:1571 : End of file while reading data: Input/output error
2016-05-22 22:52:04.408+0000: 11578: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2016-05-22 22:52:04.409+0000: 11573: error : virNetSocketReadWire:1571 : End of file while reading data: Input/output error
2016-05-22 22:54:59.479+0000: 11577: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2016-05-22 22:54:59.480+0000: 11573: error : virNetSocketReadWire:1571 : End of file while reading data: Input/output error
2016-05-22 22:55:07.505+0000: 11580: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2016-05-22 22:55:07.506+0000: 11573: error : virNetSocketReadWire:1571 : End of file while reading data: Input/output error
2016-05-22 22:55:38.703+0000: 11581: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2016-05-22 22:55:38.703+0000: 11573: error : virNetSocketReadWire:1571 : End of file while reading data: Input/output error
2016-05-22 22:56:09.900+0000: 11575: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2016-05-22 22:56:09.901+0000: 11573: error : virNetSocketReadWire:1571 : End of file while reading data: Input/output error
2016-05-22 22:56:41.100+0000: 11578: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2016-05-22 22:56:41.100+0000: 11573: error : virNetSocketReadWire:1571 : End of file while reading data: Input/output error
2016-05-22 22:59:59.545+0000: 11574: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2016-05-22 22:59:59.545+0000: 11573: error : virNetSocketReadWire:1571 : End of file while reading data: Input/output error
2016-05-22 23:04:59.609+0000: 11577: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2016-05-22 23:04:59.609+0000: 11573: error : virNetSocketReadWire:1571 : End of file while reading data: Input/output error
2016-05-22 23:09:59.673+0000: 11576: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2016-05-22 23:09:59.673+0000: 11573: error : virNetSocketReadWire:1571 : End of file while reading data: Input/output error
2016-05-22 23:14:59.735+0000: 11575: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2016-05-22 23:14:59.735+0000: 11573: error : virNetSocketReadWire:1571 : End of file while reading data: Input/output error
2016-05-22 23:19:59.795+0000: 11578: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2016-05-22 23:19:59.795+0000: 11573: error : virNetSocketReadWire:1571 : End of file while reading data: Input/output error
2016-05-22 23:24:59.858+0000: 11574: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2016-05-22 23:24:59.858+0000: 11573: error : virNetSocketReadWire:1571 : End of file while reading data: Input/output error
2016-05-22 23:29:59.922+0000: 11577: error : virDBusCall:1537 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
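
To see when the failures start and how often they repeat, a log like this can be summarized per hour. This is a rough sketch that assumes the raw libvirtd log format shown here, with one entry per line in the file and the date and time as the first two fields. The function name `count_polkit_timeouts` and the log path in the usage line are assumptions; syslog or journal output carries an extra prefix and would need the field numbers adjusted.

```shell
#!/bin/bash
# Count "CheckAuthorization" failures per hour in a raw libvirtd log.
# Reads the files given as arguments, or standard input if none are given.
count_polkit_timeouts() {
    # $1 is the date, $2 the time (e.g. 22:50:04.295+0000:), so the
    # first two characters of $2 are the hour.
    awk '/virDBusCall/ && /CheckAuthorization/ {
             hour = $1 " " substr($2, 1, 2)
             n[hour]++
         }
         END { for (h in n) print h ":00", n[h] }' "$@" | sort
}
```

Usage, assuming libvirtd logs to a file at this location: `count_polkit_timeouts /var/log/libvirt/libvirtd.log`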

Hi

We are also affected by this issue. Same problem: after about a week of running, the libvirtd service bites the dust and we get this message:

service libvirtd status -l

Redirecting to /bin/systemctl status -l libvirtd.service
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2016-06-23 00:59:50 CEST; 1 weeks 5 days ago
     Docs: man:libvirtd(8)
           http://libvirt.org
 Main PID: 19551 (libvirtd)
   CGroup: /system.slice/libvirtd.service
           └─19551 /usr/sbin/libvirtd

Jul 05 13:42:48 hyp115.swablu.os libvirtd[19551]: error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Jul 05 13:42:48 hyp115.swablu.os libvirtd[19551]: End of file while reading data: Input/output error

OpenNebula then changes the host status to ERROR and the running VMs to UNKNOWN, but the VMs are still running and virsh connect still works.

In this case we are running:

  • CentOS 7.2
  • libvirt 1.2.17-13
  • OpenNebula 4.14.2

We didn’t have this issue before. It seems to be related to the new libvirt version and the OpenNebula probes, but I’m not sure. So far we haven’t seen this issue in our OpenNebula 5.0 testing cluster. As @Yenya said, the workaround is to restart the libvirtd service, but it’s a bit annoying.

Has anyone else noticed this issue with OpenNebula and libvirtd?

Cheers
Alvaro

I can confirm that this issue also affects OpenNebula 5.0 hypervisors running libvirtd 1.2.17-13.

Hi,
Is it possible that you are hitting the issue described by @vholer?

Kind Regards,
Anton Todorov

Hi @atodorov_storpool

Nice tip! Yes, it seems to be related to a polkit issue. We will try the workaround suggested by @vholer for the moment.

Reading the bug discussion it should be fixed in polkit-0.113…
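
To see which hypervisors still run an older polkit, the installed version can be compared against 0.113. This is only a sketch: the function names are made up for illustration, it assumes an RPM-based host such as CentOS, and it relies on GNU `sort -V`. Distribution packages sometimes backport fixes, so an older version number alone does not prove the host is affected.

```shell
#!/bin/bash
# True if version $1 is greater than or equal to version $2.
# Uses GNU sort -V (version sort) to find the smaller of the two.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report the installed polkit version relative to 0.113 (RPM hosts only).
report_polkit() {
    local installed
    installed="$(rpm -q --qf '%{VERSION}\n' polkit)" || return 1
    if version_ge "$installed" "0.113"; then
        echo "polkit $installed: at or above 0.113"
    else
        echo "polkit $installed: older than 0.113"
    fi
}

# Query the package database only when called with --run.
if [ "${1:-}" = "--run" ]; then
    report_polkit
fi
```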

Cheers and thanks!
Alvaro

By the way, we have built a new CentOS polkit-0.113 rpm from the Fedora 22 srpm package and installed it on one hypervisor to see what happens after a while…

Cheers
Alvaro

I know it’s an old thread, but I have this issue on a Dell Optiplex desktop computer (i7 CPU) running the latest CentOS 7.4, OpenNebula 5.4.6 and libvirt-3.2.0. From time to time libvirtd seems to crash with the message from this thread. @Anton Todorov’s fix seems to work as expected! (I had most of the settings in my libvirtd.conf file, except for auth_unix_ro = "none" and auth_unix_rw = "none", which were disabled.)

We had the same problem, and my research led me to this fix.
However, even after applying it, our libvirtd still crashes.
About once a week I need to restart the libvirtd daemon.

CentOS Linux release 7.4.1708 (Core)
OpenNebula 5.4.1.1

libvirt-3.2.0-14.el7_4.9.x86_64

libvirtd.conf has the fix as described:
auth_unix_ro = "none"
auth_unix_rw = "none"
unix_sock_group = "oneadmin"
unix_sock_ro_perms = "0770"
unix_sock_rw_perms = "0770"
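
Since an earlier post mentions two of these settings being present but disabled, it can help to print what is actually active in the file. A small sketch, assuming the default /etc/libvirt/libvirtd.conf path; the function names `conf_value` and `check_conf` are made up for illustration. Commented-out lines, and values pasted with typographic quotes instead of plain double quotes, show up as empty.

```shell
#!/bin/bash
# Print the active (uncommented) value of a key in a libvirtd.conf-style
# file. Prints nothing if the key is missing, commented out, or not
# wrapped in plain double quotes.
conf_value() {
    local key="$1" file="$2"
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*\"\(.*\)\"[[:space:]]*\$/\1/p" "$file" | tail -n 1
}

# Show the five settings from the fix as they are currently set.
check_conf() {
    local file="${1:-/etc/libvirt/libvirtd.conf}"
    local key
    for key in auth_unix_ro auth_unix_rw unix_sock_group \
               unix_sock_ro_perms unix_sock_rw_perms; do
        printf '%s = %s\n' "$key" "$(conf_value "$key" "$file")"
    done
}
```

Running `check_conf` with no argument reads the default path. libvirtd only picks up changes to this file after a restart.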

/var/log/messages:

Apr 15 06:10:29 Node-1 libvirtd: 2018-04-15 04:10:29.145+0000: 187146: error : virDBusCall:1570 : error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Apr 15 06:10:29 Node-1 libvirtd: 2018-04-15 04:10:29.149+0000: 187143: error : virNetSocketReadWire:1808 : End of file while reading data: Input/output error

Anyone else who is still having this problem, or has another fix?

Hi,

I have been facing this issue often, with the host node going into the ERROR state:

Tue Jan 29 08:02:52 2019 : Error monitoring Host xxxxx.com (21): Timeout executing 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 60 21 mwncloud5.maintenis.com; else exit 42; fi'

I have already tried these steps:

  1. Removed /var/tmp/one and then ran onehost sync --force.
  2. Restarted the libvirtd daemon.

Neither solved the problem. Are there any other solutions?

Or what type of premium support should I buy to solve this case?

Hello,

Here are some general details about the OpenNebula Systems Support Subscriptions.
http://opennebula.systems/opennebula-support/

You will see that there are various options available to align with your SLA requirements (schedule and speed needed for issue resolution). Take a look at the documents here, and you can reach out directly to me if you would like to discuss options in more detail.

Michael Abdou
mabdou@opennebula.systems.

Best regards.