OpenNebula High Availability - Sunstone does not start

I have configured two OpenNebula front-end servers with high availability on CentOS 7, following this guide:

https://docs.opennebula.org/5.2/advanced_components/ha/frontend_ha_setup.html#what-to-do-after-a-fail-over-event

But I am getting the following errors, and opennebula-sunstone does not automatically start on the other server during a simulated fail-over event. The output below does not give me much to go on; how do I go about troubleshooting this?

pcs status
Cluster name: opennebula
Stack: corosync
Current DC: oneserver2 (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Mon Feb 20 02:20:08 2017 Last change: Mon Feb 20 01:09:44 2017 by root via cibadmin on oneserver1

2 nodes and 8 resources configured

Online: [ oneserver1 oneserver2 ]

Full list of resources:

 fence_server1 (stonith:fence_virsh): Started oneserver2
 fence_server2 (stonith:fence_virsh): Started oneserver1
 Cluster_VIP (ocf::heartbeat:IPaddr2): Started oneserver2
 opennebula (systemd:opennebula): Started oneserver2
 opennebula-sunstone (systemd:opennebula-sunstone): Stopped
 opennebula-gate (systemd:opennebula-gate): Started oneserver2
 opennebula-flow (systemd:opennebula-flow): Started oneserver2
 opennebula-novnc (systemd:opennebula-novnc): Started oneserver2

Failed Actions:
* opennebula-sunstone_start_0 on oneserver2 'not running' (7): call=41, status=complete, exitreason='none',
    last-rc-change='Mon Feb 20 02:18:11 2017', queued=0ms, exec=2139ms
* opennebula_start_0 on oneserver1 'not running' (7): call=27, status=complete, exitreason='none',
    last-rc-change='Mon Feb 20 02:19:08 2017', queued=0ms, exec=2241ms
* opennebula-sunstone_start_0 on oneserver1 'not running' (7): call=26, status=complete, exitreason='none',
    last-rc-change='Mon Feb 20 02:19:06 2017', queued=0ms, exec=2167ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
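
For reference, beyond pcs status, the only other places I know to look are the journal and Sunstone's own logs (assuming the default CentOS 7 log locations):

# journal entries for the unit on the node that tried to start it
journalctl -u opennebula-sunstone --since "2017-02-20 02:15"
# Sunstone's own logs
tail -n 100 /var/log/one/sunstone.log /var/log/one/sunstone.error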

I’m facing exactly this issue on 5.2.1; does anybody have an update? CC: @Ruben
I tried adding a delay to the systemd unit file for the Sunstone service so it would wait for opennebula.service to come up first, but that didn’t change anything.
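
In case it helps, what I added was roughly this, under [Service] in the opennebula-sunstone unit (the 30-second value is arbitrary), followed by a systemctl daemon-reload:

# crude delay so Sunstone starts only after oned has had time to come up
ExecStartPre=/bin/sleep 30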

Another thing: when I disconnect node1 from the network (say, in a production scenario where the network goes down on the active node), the passive node does not take over the services. It only tells me that the active node is in an ‘UNCLEAN’ state.

To get around that, you have to change the member hostnames in /etc/corosync/corosync.conf to their IP addresses, from this:

nodelist {
node {
ring0_addr: ONE1 <-hostname
nodeid: 1
}

to this:

nodelist {
node {
ring0_addr: 10.0.0.1 <- IP Address
nodeid: 1
}

After that, re-authorize the nodes with the IP addresses instead.
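
On CentOS 7 with pcs, I believe that comes down to something like this after editing corosync.conf (the hacluster password and IPs are placeholders):

# re-authorize pcsd against both nodes, now by IP
pcs cluster auth 10.0.0.1 10.0.0.2 -u hacluster -p <hacluster-password>
# push the edited corosync.conf to all nodes, then restart the stack
pcs cluster sync
pcs cluster stop --all && pcs cluster start --all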

Followed the steps but still having the same problem.

Running OpenNebula 5.2.1 on two Ubuntu 16.04 VMs, I have a similar issue: Pacemaker has trouble starting opennebula-sunstone:

Stack: corosync
Current DC: one-a (version 1.1.14-70404b0) - partition with quorum
2 nodes and 5 resources configured

Online: [ one-a one-b ]

Full list of resources:

 Resource Group: opennebula-cluster
     VIP        (ocf::heartbeat:IPaddr2):       Started one-b
     opennebula (systemd:opennebula.service):   Started one-b
     opennebula-gate    (systemd:opennebula-gate.service):      Started one-b
     opennebula-sunstone        (systemd:opennebula-sunstone.service):  Stopped
     opennebula-flow    (systemd:opennebula-flow.service):      Stopped

Failed Actions:
* opennebula_monitor_30000 on one-b 'not running' (7): call=25, status=complete, exitreason='none',
    last-rc-change='Mon Mar 13 10:54:37 2017', queued=0ms, exec=0ms
* opennebula-sunstone_start_0 on one-b 'not running' (7): call=28, status=complete, exitreason='none',
    last-rc-change='Mon Mar 13 10:53:38 2017', queued=0ms, exec=2015ms
* opennebula-sunstone_start_0 on one-a 'not running' (7): call=28, status=complete, exitreason='none',
    last-rc-change='Fri Mar 10 19:19:56 2017', queued=0ms, exec=2012ms
* opennebula_monitor_30000 on one-a 'OCF_PENDING' (196): call=25, status=complete, exitreason='none',
    last-rc-change='Fri Mar 10 19:20:25 2017', queued=0ms, exec=0ms

I’ve put ExecStartPre=/usr/bin/logger "sunstone systemd script invoked" into /lib/systemd/system/opennebula-sunstone.service, rebooted that node, and as expected I don’t get this message anywhere (the output above comes from that reboot of node one-b):

root@one-b:~# /usr/bin/logger "TEST: sunstone systemd script invoked"
root@one-b:~# grep -r "sunstone systemd script invoked" /var/log/
/var/log/syslog:Mar 13 11:34:46 one-b root: TEST: sunstone systemd script invoked
root@one-b:~# systemctl status opennebula-sunstone.service 
● opennebula-sunstone.service - OpenNebula Web UI Server
   Loaded: loaded (/lib/systemd/system/opennebula-sunstone.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

Mär 13 10:53:40 one-b systemd[1]: Stopped OpenNebula Web UI Server.

I’m no systemd expert, but to me it seems as if the service isn’t started at all, and I can’t make out how it is supposed to work in the first place:

root@one-b:~# grep opennebula.*service /lib/systemd/system/*.service 
/lib/systemd/system/opennebula-novnc.service:Before=opennebula-sunstone.service
/lib/systemd/system/opennebula-scheduler.service:After=opennebula.service
/lib/systemd/system/opennebula-scheduler.service:BindTo=opennebula.service
/lib/systemd/system/opennebula.service:Before=opennebula-scheduler.service
/lib/systemd/system/opennebula.service:BindTo=opennebula-scheduler.service
/lib/systemd/system/opennebula-sunstone.service:After=opennebula.service
/lib/systemd/system/opennebula-sunstone.service:After=opennebula-novnc.service
/lib/systemd/system/opennebula-sunstone.service:BindTo=opennebula-novnc.service

opennebula-sunstone.service wants to be started after opennebula.service and opennebula-novnc.service, and the latter wants to be started before opennebula-sunstone.service. opennebula-novnc.service does not seem to be started explicitly anywhere; I guess that magic happens due to systemd? It is at least running on one-b:

root@one-b:~# systemctl status opennebula-novnc.service 
● opennebula-novnc.service - OpenNebula noVNC Server
   Loaded: loaded (/lib/systemd/system/opennebula-novnc.service; disabled; vendor preset: enabled)
   Active: active (running) since Mo 2017-03-13 10:54:10 CET; 1h 39min ago
  Process: 1399 ExecStart=/usr/bin/novnc-server start (code=exited, status=0/SUCCESS)
 Main PID: 1638 (python2)
    Tasks: 1
   Memory: 27.9M
      CPU: 1.923s
   CGroup: /system.slice/opennebula-novnc.service
           └─1638 python2 /usr/share/one/websockify/websocketproxy.py --target-config=/var/lib/one/sunstone_vnc_tokens 29876

Mär 13 10:53:38 one-b systemd[1]: Starting OpenNebula noVNC Server...
Mär 13 10:54:10 one-b novnc-server[1399]: VNC proxy started
Mär 13 10:54:10 one-b systemd[1]: Started OpenNebula noVNC Server.

I’m at a loss here: starting Sunstone manually via systemd "just works" at this point, starting it via Pacemaker doesn’t, and it looks like it isn’t even attempted ("status=complete, exitreason='none'", see above)?
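
For debugging, I suppose the next step is to clear the failed action and watch the journal on that node while Pacemaker retries the start, something like:

# clear the failed start so Pacemaker retries, and follow both units' journals
crm resource cleanup opennebula-sunstone
journalctl -f -u opennebula-sunstone.service -u pacemaker.service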

Got it working, after fiddling with the systemd unit file …

sed -i -e "s/^After=opennebula-novnc.service/#After=opennebula-novnc.service/g" -e "s/^BindTo=opennebula-novnc.service/#BindTo=opennebula-novnc.service/g" /lib/systemd/system/opennebula-sunstone.service

… and starting the novnc service via Pacemaker:

root@one-b:~# crm config show
node 1: one-a
node 2: one-b
primitive VIP IPaddr2 \
	params ip=10.11.12.13 \
	op monitor interval=10s
primitive opennebula systemd:opennebula.service \
	op monitor interval=60s
primitive opennebula-flow systemd:opennebula-flow.service \
	op monitor interval=60s
primitive opennebula-gate systemd:opennebula-gate.service \
	op monitor interval=60s
primitive opennebula-novnc systemd:opennebula-novnc.service \
	op monitor interval=60s
primitive opennebula-sunstone systemd:opennebula-sunstone.service \
	op monitor interval=60s timeout=30s \
	meta target-role=Started
group opennebula-cluster VIP opennebula opennebula-gate opennebula-novnc opennebula-sunstone opennebula-flow
property cib-bootstrap-options: \
	have-watchdog=false \
	dc-version=1.1.14-70404b0 \
	cluster-infrastructure=corosync \
	cluster-name=opennebula \
	stonith-enabled=false \
	no-quorum-policy=ignore \
	last-lrm-refresh=1490027009

Works for me on two independent Ubuntu 16.04-based setups; hope that helps others as well.
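
For the CentOS 7 / pcs setups in this thread, I would guess the equivalent is roughly the following (resource and group names are simply the ones from my crm config above; note that on CentOS the unit files live under /usr/lib/systemd/system/, so the sed path would differ):

pcs resource create VIP ocf:heartbeat:IPaddr2 ip=10.11.12.13 op monitor interval=10s
pcs resource create opennebula systemd:opennebula op monitor interval=60s
pcs resource create opennebula-gate systemd:opennebula-gate op monitor interval=60s
pcs resource create opennebula-novnc systemd:opennebula-novnc op monitor interval=60s
pcs resource create opennebula-sunstone systemd:opennebula-sunstone op monitor interval=60s timeout=30s
pcs resource create opennebula-flow systemd:opennebula-flow op monitor interval=60s
pcs resource group add opennebula-cluster VIP opennebula opennebula-gate opennebula-novnc opennebula-sunstone opennebula-flow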

If you unplug the network cable on one front-end server (simulating a real-world scenario where network connectivity is lost on one front-end), do the services automatically come up on the second one?

Mine just says “UNCLEAN” and the services do not come up. I don’t think this is true high availability, or is it?

I did an “ip link set down ens3” on one-a, and one-b took over within a few seconds.

 Resource Group: opennebula-cluster
     VIP        (ocf::heartbeat:IPaddr2):       Started one-b
     opennebula (systemd:opennebula.service):   Started one-b
     opennebula-gate    (systemd:opennebula-gate.service):      Started one-b
     opennebula-novnc   (systemd:opennebula-novnc.service):     Started one-b
     opennebula-sunstone        (systemd:opennebula-sunstone.service):  Started one-b
     opennebula-flow    (systemd:opennebula-flow.service):      Started one-b

As the data is stored on a Galera MariaDB cluster, which is accessed through MaxScale running on one-a and one-b and bound to the VIP, that is the expected behaviour.
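
To illustrate roughly how this looks in /etc/one/oned.conf on both nodes, oned points at the VIP, so whichever node is active talks to the same database (values here are placeholders; the port depends on the MaxScale listener):

DB = [ BACKEND = "mysql",
       SERVER  = "10.11.12.13",
       PORT    = 3306,
       USER    = "oneadmin",
       PASSWD  = "changeme",
       DB_NAME = "opennebula" ]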

Thanks!
But when I do “ip link set down ens3” on the one-a server, the one-b server just says (as displayed in my screenshot above) that the one-a server has gone “UNCLEAN”. Duh!

one-b does not take over the services at all.

P.S. I have tested this without configuring any DB cluster. But that shouldn’t be the reason, right? The DB would be used to sync the VMs and all the ‘data’ between one-a and one-b; even without a DB, the services should at least be taken over by one-b?

Are you using the SQLite3 “DB” or an external MySQL-type one?

EDIT: @BattleX100 I see in your screenshot that a) you are using fencing and b) you have an error regarding fencing. Might the “UNCLEAN” stem from that? I don’t fence at all (since it violates the KISS principle in this case; if the ONe VM dies, that doesn’t mean the HV has issues).

Not using any DB at all for now.

Just testing that all services fail over from one server to the other. BTW, I’m doing this on CentOS 7.
But server B just tells me that server A has gone ‘UNCLEAN’ instead of taking over the services.

See on the left side: services are running on the first server; then I run the command ifdown ens3.
Now the second server just says that the first one has gone UNCLEAN. And that’s it. It does not take over the services.

You have two distinct VMs, so each VM would operate on a local copy of /var/lib/one/one.db? So, if you add a host on one-a and a failover happens, one-b would not have that host configured in OpenNebula, right? Unless your one.db is on shared storage (in which case you really need to fence), I don’t think that’s a viable setup?

http://serverfault.com/questions/656374/pacemaker-node-is-unclean-offline — I assume UNCLEAN comes from your failed fencing attempt. And of course Pacemaker must not start anything if it’s unsure whether the conditions to do so (one-a dead and buried) are met.

The fencing attempt should not fail, as both servers are configured in the cluster and manually fencing one server or the other works correctly. Also, if I shut down the cluster on server A, server B takes over within a few seconds.

It is only when server A loses network connectivity (or something similar) that this UNCLEAN error starts happening.

You have an error thrown by Pacemaker regarding fencing; see your screenshot. You need to fix that before Pacemaker will even consider starting the resources. Try without stonith/fencing and then work out your fencing issues in Pacemaker, or the other way around.
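
A quick sketch of both directions (node and fence-device names as in the pcs status at the top of the thread; the fence_virsh parameters are placeholders):

# Option 1: take fencing out of the picture while you test failover
pcs property set stonith-enabled=false

# Option 2: keep fencing and verify it really works, from each node in turn
pcs stonith fence oneserver1                    # should hard-reset oneserver1
fence_virsh -a <hypervisor-ip> -l <user> -p <password> -n <domain> -o status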