OpenNebula 5.6 RAFT two nodes

(Razvan Crainea) #1


I’m running OpenNebula 5.6.1 on Debian 9, with two nodes in an HA scenario using RAFT. Everything runs OK until the leader fails (or is shut down using systemctl). When I stop the leader (systemctl stop opennebula), the second node, which (I hope) should become leader, gets stuck in the candidate state:
0 error - - - - -
1 candidate 1006 27400 0 -1 -1

The only logs I see are:
Mon Oct 15 19:25:00 2018 [Z0][RCM][I]: Error requesting vote from follower 0:libcurl failed to execute the HTTP POST transaction, explaining: Failed to connect to port 2633: Connection refused
Mon Oct 15 19:25:00 2018 [Z0][RCM][I]: No leader found, starting new election in 2790ms

I expect to see these errors, since the leader node is down, but I also expect the failover node to take over and become leader. Are my expectations correct, or is there something wrong with my scenario?

Thank you!

(Anton Todorov) #2

Hi @razvanc,

OpenNebula recommends 3 or 5 nodes

The RAFT consensus algorithm needs N/2+1 nodes (integer division, i.e. a majority) available to form a quorum. With two nodes the quorum is 2, so the remaining node in your case cannot win an election on its own; it stays a candidate, waiting for the other node to become available again.
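To make the N/2+1 rule concrete, here is a minimal sketch of the arithmetic in Python. The function names `quorum` and `tolerated_failures` are made up for illustration and are not part of OpenNebula; the only assumption is that N/2 means integer division.

```python
def quorum(n: int) -> int:
    """Smallest number of nodes that forms a majority of an n-node cluster."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can be down while a majority is still reachable."""
    return n - quorum(n)


if __name__ == "__main__":
    # Compare the two-node setup with the recommended 3 and 5 nodes.
    for n in (2, 3, 5):
        print(f"nodes={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

With 2 nodes the quorum is 2 and no failure is tolerated, which matches the stuck candidate above; 3 nodes tolerate one failure and 5 nodes tolerate two.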

Hope this helps.

Best Regards,
Anton Todorov

(Razvan Crainea) #3

Hi, Anton!

Thank you for your prompt response! This was actually what I was thinking too, but I couldn’t pinpoint the hard requirement of having N/2+1 nodes; I only saw the recommendation you pointed out. TBH, I didn’t read the RAFT specification, I am sorry about that!
Do you know if there is a way to add a 3rd, lightweight node (not an actual installation), just to ensure consensus? I am not sure whether just deploying a generic RAFT implementation would help, since it would still need to implement some of the OpenNebula logic.


(Anton Todorov) #4

Hi Razvan,

I am not sure whether it is possible to add just a voting beacon. Please feel free to open a feature request though.

Best Regards,
Anton Todorov

(Razvan Crainea) #5

Done! One can follow the feature request here.

Thank you very much for your help!

(petr108m) #6

status of feature
Code committed to upstream release/hotfix branches

Does this mean it is available to download and install?
Can you provide details?

(Razvan Crainea) #7

According to the ticket, nothing has been done yet; those are just checklist items to be ticked off when completed.

(Ruben S. Montero) #8


AFAIK this is not possible for RAFT. A node must be leader, follower or candidate. Note that log entries are committed once a majority of followers have replicated the entry, so the algorithm assumes that any of them could take the leadership in case of failure…
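As a rough illustration of the commit rule mentioned above, the sketch below computes the highest log index that is stored on a majority of servers. It is a simplification and not OpenNebula's code: the name `commit_index` and the `match_index` list (one last-replicated index per server, leader included) are assumptions for the example, and real RAFT additionally requires the entry to belong to the leader's current term.

```python
from typing import Sequence


def commit_index(match_index: Sequence[int]) -> int:
    """Highest log index replicated on a majority of servers.

    match_index holds one entry per server (leader included): the last
    log index known to be stored on that server.
    """
    majority = len(match_index) // 2 + 1
    ordered = sorted(match_index, reverse=True)
    # The majority-th largest value is present on at least `majority` servers.
    return ordered[majority - 1]


if __name__ == "__main__":
    print(commit_index([10, 10, 7]))  # three servers, one lagging
    print(commit_index([10, 7]))      # two servers: both must hold the entry
```

In the two-server case every entry has to reach both servers before it counts as committed, which is why a vote-only node that stores no log would not fit this scheme.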

I guess the lightweight approach would be to create a VM with your third oned server running in it…