Upgrade 5.8.x to 5.10: opennebula service fails to start (MAD did not answer INIT command)

Hi,

The following steps have been done:

/etc/yum.repos.d/opennebula.repo
out: baseurl=https://downloads.opennebula.org/repo/5.8/CentOS/7/x86_64
in: baseurl=https://downloads.opennebula.org/repo/5.10/CentOS/7/x86_64

# yum update
Put the mysql section from oned.conf.rpmsave back into oned.conf
$ onedb upgrade -v -S localhost -u oneadmin -p '!Password' -d opennebula
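For reference, the steps above could be scripted roughly as follows. This is only a sketch: switch_repo_version is a hypothetical helper name, the backup file path is an assumption, and sed -i is assumed to be GNU sed as shipped with CentOS 7. The onedb backup step is not part of the original steps; it is added here as a precaution before the migration.

```shell
# Hypothetical helper: rewrite the OpenNebula version in a repo file's baseurl.
# Usage: switch_repo_version <repo_file> <old_version> <new_version>
switch_repo_version() {
    repo_file="$1"
    old="$2"
    new="$3"
    # Escape the dots so "5.8" is matched literally, not as a regex wildcard.
    old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
    sed -i "s|/repo/${old_re}/|/repo/${new}/|" "$repo_file"
}

# Intended sequence on the front-end (left commented out on purpose):
#   switch_repo_version /etc/yum.repos.d/opennebula.repo 5.8 5.10
#   yum update
#   # restore the mysql section in /etc/one/oned.conf by hand
#   onedb backup -v -S localhost -u oneadmin -d opennebula /var/lib/one/pre_5.10.sql  # assumed path
#   onedb upgrade -v -S localhost -u oneadmin -d opennebula
```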

And now I have two services that fail to start:

$ systemctl list-units --state=failed
  UNIT                   LOAD   ACTIVE SUB    DESCRIPTION
● opennebula-hem.service loaded failed failed OpenNebula Hook Execution Service
● opennebula.service     loaded failed failed OpenNebula Cloud Controller Daemon

Last /var/log/oned.log entries:

Wed Nov 27 10:22:35 2019 [Z0][ONE][I]: Checking database version.
Wed Nov 27 10:22:35 2019 [Z0][ONE][I]: oned is using version 5.10.0 for local_db_versioning
Wed Nov 27 10:22:35 2019 [Z0][ONE][I]: oned is using version 5.10.0 for db_versioning
Wed Nov 27 10:22:35 2019 [Z0][ACL][I]: Starting ACL Manager...
Wed Nov 27 10:22:35 2019 [Z0][ACL][I]: ACL Manager started.
Wed Nov 27 10:22:35 2019 [Z0][HKM][I]: Starting Hook Manager...
Wed Nov 27 10:22:35 2019 [Z0][HKM][I]: Loading Hook Manager driver.
Wed Nov 27 10:22:35 2019 [Z0][HKM][I]: Hook Manager started.
Wed Nov 27 10:22:35 2019 [Z0][MAD][E]: MAD did not answer INIT command
Wed Nov 27 10:22:35 2019 [Z0][ONE][E]: Could not load driver

Please advise how to solve the issue.


Versions of the related components and OS (frontend, hypervisors, VMs):

OS: CentOS Linux release 7.7.1908 (Core)
Kernel: 3.10.0-1062.4.3.el7.x86_64

Installed Packages
Name : opennebula
Arch : x86_64
Version : 5.10.0
Release : 1.el7

Hello @gray380,

Could you please run /usr/lib/one/mads/one_hm, write INIT to it, and share the output?

Hello Christian,

Pretty empty:

[root@one-srv-01 sadm]# /usr/lib/one/mads/one_hm
[root@one-srv-01 sadm]# /usr/lib/one/mads/one_hm INIT
[root@one-srv-01 sadm]#
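Note that the request was to write INIT to the running driver, not to pass it as an argument. A minimal sketch of that, assuming the driver reads commands from standard input and answers on standard output (probe_mad is a hypothetical helper name, not an OpenNebula tool):

```shell
# Hypothetical helper: send one command line to a driver on stdin and
# print whatever the driver writes back on stdout.
# Usage: probe_mad <driver_path> [command]
probe_mad() {
    driver="$1"
    cmd="${2:-INIT}"
    printf '%s\n' "$cmd" | "$driver"
}

# On the front-end this would be, for example:
#   probe_mad /usr/lib/one/mads/one_hm INIT
```

If the driver crashes while loading, nothing may be printed on stdout, so the system journal is worth checking as well.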

BTW /etc/one/hm/hmrc is empty (contains only comments)

More findings:

Nov 27 11:31:09 one-srv-01 systemd[1]: Started OpenNebula Hook Execution Service.
Nov 27 11:31:09 one-srv-01 ruby[10784]: Unable to load this gem. The libzmq library exists, but cannot be loaded.
Nov 27 11:31:09 one-srv-01 ruby[10784]: libzmq library was found at:
Nov 27 11:31:09 one-srv-01 ruby[10784]: ["/usr/lib64/libzmq.so", "/usr/lib64/libzmq.so"]
Nov 27 11:31:09 one-srv-01 ruby[10784]: /usr/share/one/gems/gems/ffi-rzmq-core-1.0.7/lib/ffi-rzmq-core/libzmq.rb:67:in `rescue in <module:LibZMQ>': The libzmq library (or DLL) could not be loaded (LoadError)
Nov 27 11:31:09 one-srv-01 ruby[10784]: from /usr/share/one/gems/gems/ffi-rzmq-core-1.0.7/lib/ffi-rzmq-core/libzmq.rb:10:in `<module:LibZMQ>'
Nov 27 11:31:09 one-srv-01 ruby[10784]: from /usr/share/one/gems/gems/ffi-rzmq-core-1.0.7/lib/ffi-rzmq-core/libzmq.rb:7:in `<top (required)>'
Nov 27 11:31:09 one-srv-01 ruby[10784]: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:55:in `require'
Nov 27 11:31:09 one-srv-01 ruby[10784]: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:55:in `require'
Nov 27 11:31:09 one-srv-01 ruby[10784]: from /usr/share/one/gems/gems/ffi-rzmq-core-1.0.7/lib/ffi-rzmq-core.rb:3:in `<top (required)>'
Nov 27 11:31:09 one-srv-01 ruby[10784]: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:55:in `require'
Nov 27 11:31:09 one-srv-01 ruby[10784]: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:55:in `require'
Nov 27 11:31:09 one-srv-01 ruby[10784]: from /usr/share/one/gems/gems/ffi-rzmq-2.0.7/lib/ffi-rzmq.rb:66:in `<top (required)>'
Nov 27 11:31:09 one-srv-01 ruby[10784]: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:135:in `require'
Nov 27 11:31:09 one-srv-01 ruby[10784]: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:135:in `rescue in require'
Nov 27 11:31:09 one-srv-01 ruby[10784]: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:144:in `require'
Nov 27 11:31:09 one-srv-01 ruby[10784]: from /usr/lib/one/onehem/onehem-server.rb:47:in `<main>'

libzmq.so:

[sadm@one-srv-01 ~]$ ls -al /usr/lib64/libzmq.so
lrwxrwxrwx 1 root root 15 Nov 27 09:37 /usr/lib64/libzmq.so -> libzmq.so.5.0.0

[sadm@one-srv-01 ~]$ yum whatprovides /usr/lib64/libzmq.so
Matched from:
Filename    : /usr/lib64/libzmq.so

zeromq3-devel-3.2.5-1.el7.x86_64 : Development files for zeromq3
Repo        : epel
Matched from:
Filename    : /usr/lib64/libzmq.so

zeromq-devel-4.1.4-6.el7.x86_64 : Development files for zeromq
Repo        : @epel
Matched from:
Filename    : /usr/lib64/libzmq.so

[sadm@one-srv-01 ~]$ yum whatprovides /usr/lib64/libzmq.so.5.0.0
Matched from:
Filename    : /usr/lib64/libzmq.so.5.0.0

zeromq-4.1.4-6.el7.x86_64 : Software library for fast, message-based applications
Repo        : @epel
Matched from:
Filename    : /usr/lib64/libzmq.so.5.0.0

Can I switch off the Hook Execution Service in some way as workaround?

Hello @gray380,

Do you have SELinux enabled? If yes, can you please check for any related denials?

grep -i deni /var/log/audit/audit.log

If there is nothing suspicious, can you please attach a list of all your packages and their versions?

rpm -qa

(ideally, also please do a verification of the installed files via rpm -Va).

If there is still nothing strange, can you please fall back to installing the Ruby dependencies via install_gems as described here? http://docs.opennebula.org/5.10/deployment/opennebula_installation/frontend_installation.html#step-4-ruby-runtime-installation-optional

  1. first drop the symlink: unlink /usr/share/one/gems
  2. then continue with /usr/share/one/install_gems

Thank you,
Vlastimil

Hi,

SELinux is disabled:

[root@one-srv-01 sadm]# getenforce
Disabled

installed_packages.txt (30.7 KB)

The following commands were executed:

# rpm -Va
# test -L /usr/share/one/gems && unlink /usr/share/one/gems
# /usr/share/one/install_gems

The services were restarted but still failed.
And a new warning message has appeared in the system log:

Nov 27 13:42:53 one-srv-01 mysqld[1586]: 2019-11-27 13:42:53 1706 [Warning] Aborted connection 1706 to db: 'opennebula' user: 'oneadmin' host: 'localhost' (Got an error reading communication packets)

Meanwhile I’ve downgraded to 5.8.x.

Okay, I’ve tried the upgrade again with the help of the following link: Upgrading from OpenNebula 5.8.x

And faced the same issue:

[sadm@one-srv-01 ~]$ sudo systemctl start opennebula
Job for opennebula.service failed because the control process exited with error code. See "systemctl status opennebula.service" and "journalctl -xe" for details.
[sadm@one-srv-01 ~]$ sudo systemctl status opennebula
● opennebula.service - OpenNebula Cloud Controller Daemon
   Loaded: loaded (/usr/lib/systemd/system/opennebula.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Thu 2019-12-05 15:31:17 EET; 23s ago
  Process: 17242 ExecStopPost=/usr/share/one/follower_cleanup (code=exited, status=0/SUCCESS)
  Process: 17222 ExecStart=/usr/bin/oned -f (code=exited, status=255)
  Process: 17219 ExecStartPre=/usr/sbin/logrotate -f /etc/logrotate.d/opennebula -s /var/lib/one/.logrotate.status (code=exited, status=0/SUCCESS)
 Main PID: 17222 (code=exited, status=255)

Dec 05 15:31:12 one-srv-01 systemd[1]: Failed to start OpenNebula Cloud Controller Daemon.
Dec 05 15:31:12 one-srv-01 systemd[1]: Unit opennebula.service entered failed state.
Dec 05 15:31:12 one-srv-01 systemd[1]: opennebula.service failed.
Dec 05 15:31:17 one-srv-01 systemd[1]: opennebula.service holdoff time over, scheduling restart.
Dec 05 15:31:17 one-srv-01 systemd[1]: Stopped OpenNebula Cloud Controller Daemon.
Dec 05 15:31:17 one-srv-01 systemd[1]: start request repeated too quickly for opennebula.service
Dec 05 15:31:17 one-srv-01 systemd[1]: Failed to start OpenNebula Cloud Controller Daemon.
Dec 05 15:31:17 one-srv-01 systemd[1]: Unit opennebula.service entered failed state.
Dec 05 15:31:17 one-srv-01 systemd[1]: opennebula.service failed.
[sadm@one-srv-01 ~]$ systemctl list-units --state=failed
  UNIT                   LOAD   ACTIVE SUB    DESCRIPTION
● opennebula-hem.service loaded failed failed OpenNebula Hook Execution Service
● opennebula.service     loaded failed failed OpenNebula Cloud Controller Daemon

oned.log:

Thu Dec  5 15:31:11 2019 [Z0][ONE][I]: Log level:3 [0=ERROR,1=WARNING,2=INFO,3=DEBUG]
Thu Dec  5 15:31:11 2019 [Z0][ONE][I]: Support for xmlrpc-c > 1.31: yes
Thu Dec  5 15:31:11 2019 [Z0][ONE][I]: Set up 50 DB connections using encoding latin1
Thu Dec  5 15:31:11 2019 [Z0][ONE][I]: Checking database version.
Thu Dec  5 15:31:11 2019 [Z0][ONE][I]: oned is using version 5.10.0 for local_db_versioning
Thu Dec  5 15:31:11 2019 [Z0][ONE][I]: oned is using version 5.10.0 for db_versioning
Thu Dec  5 15:31:11 2019 [Z0][ACL][I]: Starting ACL Manager...
Thu Dec  5 15:31:11 2019 [Z0][ACL][I]: ACL Manager started.
Thu Dec  5 15:31:11 2019 [Z0][HKM][I]: Starting Hook Manager...
Thu Dec  5 15:31:11 2019 [Z0][HKM][I]: Loading Hook Manager driver.
Thu Dec  5 15:31:11 2019 [Z0][HKM][I]: Hook Manager started.
Thu Dec  5 15:31:12 2019 [Z0][MAD][E]: MAD did not answer INIT command
Thu Dec  5 15:31:12 2019 [Z0][ONE][E]: Could not load driver

For some reason, this time I failed to downgrade the way I did after the first attempt.
oned reports a DB version mismatch:

Thu Dec  5 16:21:43 2019 [Z0][ONE][I]: Log level:3 [0=ERROR,1=WARNING,2=INFO,3=DEBUG]
Thu Dec  5 16:21:43 2019 [Z0][ONE][I]: Support for xmlrpc-c > 1.31: yes
Thu Dec  5 16:21:43 2019 [Z0][ONE][I]: Checking database version.
Thu Dec  5 16:21:43 2019 [Z0][ONE][E]: Database version mismatch ( local_db_versioning). Installed OpenNebula 5.8.5 needs DB version '5.8.0', and existing DB version is '5.10.0'.
Thu Dec  5 16:21:43 2019 [Z0][ONE][E]: Use onedb to upgrade DB.

But onedb itself says version is okay:

[oneadmin@one-srv-01 ~]$ onedb upgrade -v -S localhost -u oneadmin -d opennebula
Version read:
Shared tables 5.6.0 : OpenNebula 5.8.0 daemon bootstrap
Local tables  5.8.0 : OpenNebula 5.8.0 daemon bootstrap

MySQL dump stored in /var/lib/one/mysql_localhost_opennebula_2019-12-5_16:20:46.sql
Use 'onedb restore' or restore the DB using the mysql command:
mysql -u user -h server -P port db_name < backup_file


>>> Running migrators for shared tables
Database already uses version 5.6.0

>>> Running migrators for local tables
Database already uses version 5.8.0
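To put the two views side by side, the versions oned complains about can be pulled out of its log and compared with what onedb reads. A rough sketch, assuming the log line has exactly the wording shown above (db_mismatch is a hypothetical helper name):

```shell
# Hypothetical helper: extract the needed and existing DB versions from the
# "Database version mismatch" line in an oned log file.
# Usage: db_mismatch <log_file>
db_mismatch() {
    sed -n "s/.*needs DB version '\([^']*\)', and existing DB version is '\([^']*\)'.*/needed=\1 existing=\2/p" "$1"
}

# On the front-end, compare both sources:
#   db_mismatch /var/log/one/oned.log
#   onedb version -v -S localhost -u oneadmin -d opennebula
```

If the two disagree, it may be worth checking that oned.conf and the onedb command line point at the same server and database; that is only a guess.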

Okay, I’ve fixed the DB issue (dropped the opennebula database from MySQL and restored it with onedb).
So the opennebula service is up and running.
At least for now :)

But what is wrong with the upgrade?