High CPU usage after a while since 5.8.0

Since upgrading from 5.6.2 to 5.8.0 we have been experiencing high CPU usage on oned threads. After approximately 24h we have 2 threads stuck at 100%, and after 48h we have even more threads stuck at 100%.
If I strace those threads, I can see what looks like connections to the XML-RPC port (I can see some HTTP headers), followed by a lot of connection timeouts.
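A quick way to pin down which oned threads are hot before attaching strace (a sketch assuming Linux's /proc layout; the fallback to the current shell's PID is only there so the snippet runs standalone when oned is not present):

```shell
# Rank the threads of oned by accumulated CPU time (utime + stime jiffies,
# fields 14 and 15 of /proc/<pid>/task/<tid>/stat).
pid=$(pgrep -x oned || echo $$)   # falls back to this shell's PID for illustration
for tid in /proc/"$pid"/task/*; do
    awk '{print $1, $14 + $15}' "$tid"/stat
done | sort -rnk2 | head -5
# Then attach strace to the busiest thread ID, e.g.:
#   strace -tt -e trace=network -p <tid>
```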

If we restart OpenNebula, we are fine for approximately 24h.

I’ve changed the two timeout-related keys in oned.conf, KEEPALIVE_TIMEOUT and TIMEOUT, so we now have:

MAX_CONN           = 240
MAX_CONN_BACKLOG   = 480
#KEEPALIVE_TIMEOUT  = 15
KEEPALIVE_TIMEOUT  = 30
#KEEPALIVE_MAX_CONN = 30
#TIMEOUT            = 15
TIMEOUT            = 30

I don’t know yet whether this helps. Is that a good idea?

Next I will try an onedb purge to remove old DONE VMs, and also to clean up the long VM history.
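For reference, the purge would look roughly like this (a sketch assuming the default SQLite backend at /var/lib/one/one.db; check `onedb --help` for your version, and use the MySQL connection options instead if that is your backend):

```shell
# Always back up first; onedb operates directly on the database.
onedb backup --sqlite /var/lib/one/one.db

# Trim long per-VM history records.
onedb purge-history --sqlite /var/lib/one/one.db

# Remove VMs in the DONE state.
onedb purge-done --sqlite /var/lib/one/one.db
```

Stop oned (or at least quiesce writes) before running these against a live database.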

In oned.log I can only see some slow-query warnings, mostly about replacing values in vm_pool. I don’t know if it’s related… nothing about connection timeouts, though.
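If the slow queries on vm_pool come from table bloat, the row counts can be checked directly; a minimal sketch, assuming a MySQL backend with the default database name `opennebula` (adjust credentials and names to your setup):

```shell
# Hypothetical check: large vm_pool/history tables would explain slow
# REPLACE statements and support doing the purge above.
mysql -u oneadmin -p opennebula -e \
  "SELECT COUNT(*) AS vm_rows FROM vm_pool; SELECT COUNT(*) AS history_rows FROM history;"
```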

Any other leads I could follow?

Best regards,
Edouard


Versions of the related components and OS (frontend, hypervisors, VMs):

OpenNebula 5.8.0 on CentOS 7
1681 VMs

Steps to reproduce:

It was fine with 5.6.2; upgrading to 5.8.0 makes this problem happen.

Current results:

oned threads stuck at 100% CPU

Expected results:

No threads stuck at 100%

Can you send the output of the following command when one of the oned threads is at 100%:

sudo gdb -q -ex 'thread apply all bt' -ex 'detach' -ex 'quit' `which oned` `pgrep oned` > oned.trace

You can PM the file.

Hi @ruben,

We now also have this issue on our production machine after the CentOS 7.8 upgrade.
Our plan is to also upgrade OpenNebula to 5.12 in production as soon as possible, but we don't know if there is any workaround we can apply in the meantime.

For now, just restarting the OpenNebula service fixes the issue for a while.
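As a stopgap only (an assumption on my part, not an official fix), the restart could be scheduled ahead of the ~24h failure window until the upgrade lands; a sketch for an /etc/cron.d entry on CentOS 7:

```shell
# /etc/cron.d/opennebula-restart  (hypothetical file; adjust the schedule
# to your observed failure window)
# Restart oned daily at 04:00, before threads start pegging at 100% CPU.
0 4 * * * root systemctl restart opennebula
```

Note this will briefly interrupt API clients and Sunstone sessions, so pick a quiet hour.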

Cheers
Álvaro