HA upgrade path

I’m currently preparing our upgrade from 5.4 to 5.4.1, and I have to say that I’m not entirely happy with the procedure.

To have a stable, reproducible setup we prebuild our OpenNebula master hosts with Packer, which works very well in normal production use. For the upgrade, though, this puts me in a pretty unpleasant position, as the upgrade description requires me to stop all nodes and upgrade them. Sure, I can do that manually, but it breaks the HA idea and is cumbersome at best, which means longer downtime.

Our setup with prebuilt nodes might not be widely used, but upgrading production systems manually is not really something I want to do, and maybe I’m not alone in that. I see that a DB upgrade needs downtime, and I’m fine with that. It would still be great to be able to upgrade node by node in the future, at least on minor releases.

For our approach with prebuilt images it would be great to be able to add new nodes running the new version to the old cluster. That would at least allow me to cleanly migrate to new hosts on the new version without either manually upgrading the old ones or rebuilding the whole cluster.

The best outcome for the HA setup would obviously be a zero-downtime upgrade. That would require the DB sync to push only data that fits the new DB schema. It would then be possible to add new hosts with the upgraded version, switch the leader to one of those nodes, and remove the old ones. I don’t have enough insight into the data sync to say how much effort that would take. It would just be awesome to be able to do zero-downtime upgrades (or close to zero, for the time needed to switch to a new leader).

I think the stop-all-nodes requirement can be relaxed for minor version upgrades, as the DB version and the API are not going to change. When that condition is met, nodes can be upgraded one by one.
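
As a rough illustration of that rule, the decision could be scripted like this. It is only a sketch: the helper names are made up, and treating "same first two version components" as the test for "DB version and API unchanged" is an assumption, not something OpenNebula guarantees.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '5.4.1' into (5, 4, 1)."""
    return tuple(int(part) for part in version.split("."))


def upgrade_strategy(current: str, target: str) -> str:
    """Return 'rolling' or 'stop-cluster' (hypothetical helper).

    Assumption: releases that share the first two components
    (e.g. 5.4 -> 5.4.1) keep the same DB version and API, so the
    nodes may be upgraded one by one.
    """
    cur, new = parse_version(current), parse_version(target)
    if new <= cur:
        raise ValueError("target must be newer than current")
    if cur[:2] == new[:2]:
        return "rolling"
    return "stop-cluster"


if __name__ == "__main__":
    print(upgrade_strategy("5.4", "5.4.1"))
```

Whether a given release really leaves the DB version untouched should still be checked against its release notes before relying on such a shortcut.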

However, for major releases with changes in the DB schema we would need to stop the cluster. In that case we could speed up the process by breaking the cluster, i.e.:

1.- Stop the follower HA nodes and remove them from the cluster (onezone delete…)
2.- With only one front-end enabled, stop OpenNebula, upgrade it and restart the service.
3.- Upgrade each follower and add it back to the cluster
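
The three steps above could be scripted roughly as follows. This is a dry-run sketch that only builds the list of commands and runs nothing. It assumes the `onezone server-del` and `onezone server-add` subcommands and a systemd unit named `opennebula`. The package and database upgrade, and any database copy to the followers, are left as placeholder comments because they depend on the installation. The zone ID, server IDs, names and RPC endpoints are made-up inputs.

```python
from typing import Dict, List


def break_cluster_plan(zone_id: int, followers: Dict[int, Dict[str, str]]) -> List[str]:
    """Build (not run) the commands for the stop/upgrade/re-add procedure.

    followers maps a server ID to {"name": ..., "rpc": ...}; every value
    passed in here is illustrative.
    """
    plan: List[str] = []
    # Step 1: stop each follower and remove it from the zone.
    for server_id in sorted(followers):
        name = followers[server_id]["name"]
        plan.append(f"# stop OpenNebula on {name}")
        plan.append(f"onezone server-del {zone_id} {server_id}")
    # Step 2: upgrade the single remaining front-end.
    plan.append("systemctl stop opennebula")
    plan.append("# upgrade packages and database on the remaining front-end")
    plan.append("systemctl start opennebula")
    # Step 3: upgrade each follower, then add it back to the zone.
    for server_id in sorted(followers):
        info = followers[server_id]
        plan.append(f"# upgrade packages on {info['name']}")
        plan.append(
            f"onezone server-add {zone_id} --name {info['name']} --rpc {info['rpc']}"
        )
    return plan


if __name__ == "__main__":
    example = {1: {"name": "server-1", "rpc": "http://192.0.2.11:2633/RPC2"}}
    for line in break_cluster_plan(0, example):
        print(line)
```

Printing the plan first makes it easy to review the order before anything is executed on a production front-end.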

Probably this procedure could be further improved by implementing the ability to “disable a follower”, i.e. disable replication on it. This way you do not have to remove/add the followers (with the associated change in ID).

I’ve filed an issue to take a look at it: https://dev.opennebula.org/issues/5382

Hi ruben,

that sounds pretty good.
Downtime is expected if the DB schema has to be upgraded, and that’s all fine, but the addition of a follower disable sounds very neat, as it would save the time needed to “rebuild” the cluster.

My upgrade with node replacements will most likely run like this:

  • remove the follower nodes and scale down to a solo setup
  • upgrade OpenNebula on the remaining node manually
  • re-add new nodes and rebuild the cluster
  • remove the manually upgraded node from the zone
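
To sanity-check the order of those four steps before scripting them, here is a small model of the zone membership. It is purely illustrative: the `Zone` class, the node names and the "all members must run the same version before a node joins" rule are assumptions made for the sketch, and nothing in it talks to OpenNebula.

```python
class Zone:
    """Toy model of HA zone membership (hypothetical, not an OpenNebula API)."""

    def __init__(self, members):
        # Map node name -> installed version.
        self.members = dict(members)

    def remove(self, name):
        if len(self.members) == 1:
            raise RuntimeError("refusing to remove the last node")
        del self.members[name]

    def add(self, name, version):
        # Assumed rule: a node may only join members running its own version.
        if any(v != version for v in self.members.values()):
            raise RuntimeError("version mismatch, upgrade first")
        self.members[name] = version


def replace_nodes(zone, keep, new_nodes, new_version):
    """Apply the four steps; `keep` is the node that is upgraded in place."""
    for name in [n for n in zone.members if n != keep]:
        zone.remove(name)                  # 1. scale down to a solo setup
    zone.members[keep] = new_version       # 2. manual upgrade of the survivor
    for name in new_nodes:
        zone.add(name, new_version)        # 3. add the prebuilt new nodes
    zone.remove(keep)                      # 4. drop the manually upgraded node
    return zone


if __name__ == "__main__":
    old = Zone({"old-1": "5.4", "old-2": "5.4", "old-3": "5.4"})
    result = replace_nodes(old, "old-1", ["new-1", "new-2", "new-3"], "5.4.1")
    print(sorted(result.members))
```

The model leaves out leader election, so step 4 silently assumes that leadership has already moved to one of the new nodes.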

I need to test it, but that should be easy to script and run with a downtime of a few minutes (which would save me from going to work on the weekend).

Edit:
Tested it in staging and it ran through in ~70 sec … that’s a downtime I can live with. It will take longer in production as the DB is a lot bigger, but it should still be finished in ~2 min.