Shared mode for the ceph based system datastores

Hi, I want to add an option to the ceph driver and a few new datastore drivers which I'm currently developing, to allow them to work with system datastores that have a shared filesystem between nodes.

This option will be handled by the internal transfer scripts: tm/premigrate, tm/postmigrate and tm/mv; if it is set, they will return 0 without making any changes.
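As a minimal sketch (assuming DS_SHARED would be read from the datastore template; here it is just an environment variable for illustration), the early-exit check in tm/premigrate, tm/postmigrate or tm/mv could look like this:

```shell
#!/bin/bash
# Hypothetical sketch of the shared-mode early exit described above.
# DS_SHARED is an assumed name; a real driver would read it from the
# datastore template, not the environment.
if [ "${DS_SHARED:-no}" = "yes" ]; then
    # Source and destination hosts see the same filesystem,
    # so there is nothing to transfer.
    exit 0
fi

# ... otherwise fall through to the normal ssh-based transfer ...
```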

I was thinking about using SHARED=yes in this PR, but that won't work, because this option is hardcoded and used by OpenNebula itself.

Another option is needed to disable uploading the checkpoint file into the datastore by the vmm/save.<TM_MAD> and vmm/restore.<TM_MAD> scripts, or to just leave it as a file if the system datastore is already shared.

@ruben mainly this question is for you, how would you like to organize that?

I'm thinking of adding two new common options for system datastores:

DS_SHARED=<yes|no>                  # datastore has a shared filesystem
CHECKPOINT_LOCATION=<system|file>   # upload the checkpoint file to ceph or not
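For illustration, a SYSTEM datastore template using the proposed attributes could look like this (these names are only the suggestion above, not an existing OpenNebula feature):

```
NAME    = "ceph_system"
TYPE    = "SYSTEM_DS"
TM_MAD  = "ceph"
DS_SHARED           = "yes"     # nodes mount a common filesystem for this DS
CHECKPOINT_LOCATION = "file"    # keep the checkpoint as a plain file
```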

Do you agree with that?

I've already done such changes, so I'd like to share my findings and how it is implemented in addon-storpool :wink:

The configuration of SHARED=<YES|NO> in the TM_MAD_CONF is for internal use by OpenNebula. Based on its value, OpenNebula accounts the Datastore's capacity per Host (SHARED=NO) or globally for the Datastore (SHARED=YES).

The migration is indeed hard-coded: tm/mv is used for cold migration; tm/premigrate, tm/postmigrate and tm/failmigrate are used for live migrations.

Generally, there is only one difference in the backing filesystem when dealing with migration: on a “shared” filesystem there is no need to delete the VM's base path after a successful VM transfer, while on an “ssh” filesystem the VM's base path must be copied before the migration.

My alternative to DS_SHARED=<yes|no> is a variable in the Datastore Template named SP_SYSTEM=<ssh|shared>, so I am OK with such a change.

Regarding the checkpoint file, there is a third option that I prefer and use when possible (it needs a decent version of qemu-kvm; qemu-kvm-ev for CentOS 7 is fine): store the checkpoint file directly on a block device and also use it for restore.
To achieve this I am patching vmm/save and vmm/restore to have tm/save.<TM_MAD>-pre and tm/restore.<TM_MAD>-post. (We'd be happy to create a pull request for them :slight_smile:) So there is no need for a transient file on the Host's filesystem (or a shared one…), just a symlink to the block device. The only tricky part is how to determine the size of the block device. To be on the safe side I am using the VM's memory x2 as the size of the block device :slight_smile:
I am almost confident it will work for Ceph too, just not tested…

So I am not sure how the above could fit in the definition of the CHECKPOINT_LOCATION variable…

Cheers,
Anton

Hi Anton,

Heh, I needed to execute these actions locally on the frontend side, so I've just overridden them as local actions for the kvm vm_mad driver:

 VM_MAD = [
     NAME           = "kvm",
-    ARGUMENTS      = "-t 15 -r 0 kvm",
+    ARGUMENTS      = "-t 15 -r 0 kvm -l save=save_linstor_un,restore=restore_linstor_un",
 ]

Otherwise I don't like this idea much; it looks like the vmm/save.<TM_MAD> and vmm/restore.<TM_MAD> actions are more a concern of the TM drivers than of the VMM ones. TM driver actions always execute locally on the frontend, therefore the vmm/save.<TM_MAD> and vmm/restore.<TM_MAD> actions should also be executed locally, in my opinion.
It may be done by modifying the one_vmm_exec.rb driver executor, like we've decided here.

This step would allow the TM_MAD author to handle these actions without adding extra patches to the standard VMM driver.
This is a slightly breaking change, so it's up for discussion.

To be honest, I don't like the idea of storing the whole VM directory on the storage system; filesystems are usually not as reliable as plain block devices. They can hang on mount/unmount operations, so it is better to avoid using them. In my opinion, all the information stored there (symlinks and context CDs) can be computed automatically during VM deploy.

The only thing which should be saved is the checkpoint file, and it can simply be uploaded as a block file. Anyway, copying the checkpoint file to one location, then uploading it into the storage, and doing the same to restore it back is quite annoying. I want to solve it somehow.

I've just checked: virsh can save the checkpoint directly onto a block device; restore is more problematic, but still possible:

virsh -c qemu:///system restore one-58 < <(cat /dev/<device>)

I would like to develop this idea, instead of just saving the checkpoint into a filesystem stored on the shared block device. :slight_smile:

Well, I am using the local interface to replace deploy and snapshot*; the change of the deploy script became mandatory due to this issue with the volatile disks that appeared with a recent libvirt update…

Anyway, totally agree that these operations could be executed on the front-end.

I'd say that when there is block storage there is no need to have a shared filesystem at all; it is just one more service to take care of. The symlinks are created by the TM_MAD scripts anyway.
Also the context CD works fine when it is dumped on a block device too :wink:

As I've already said, with recent qemu-kvm it works fine. You just need to create the block device in advance, attach it to the host and create a symlink to the checkpoint (save-pre), dump the VM's state (save) and detach the block device (save-post). Same with restore: attach in advance (restore-pre), do the restore (restore), and at the end detach/destroy the checkpoint device (restore-post).

I've achieved this by just patching save/restore to have both pre/post steps. But honestly, I'd prefer them to be called on the front-end :slight_smile:

So totally there is room for a lot of improvements :+1:

Cheers,
Anton


Another option might be to add a simple check into these actions, e.g.:

  1. Write a random string into some file in the system_ds location on SRC_HOST.
  2. Check if the file exists and contains the same string on DST_HOST:
    • If it is the same, then exit (shared mode).
    • If not, then continue the migration over ssh (non-shared mode).

We could also compare the latest deployment hash, but it is better to have an explicit option, I think.
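The probe above can be sketched roughly as follows (a hypothetical helper; the function name, host arguments and datastore path are illustrative, not part of any existing driver):

```shell
#!/bin/bash
# Hypothetical sketch: drop a random token into the system datastore path
# on the source host and check whether the destination host sees it.
detect_shared_mode() {
    local src_host="$1" dst_host="$2" ds_path="$3"
    local token="probe-$RANDOM-$$"
    local probe="$ds_path/.shared_probe.$$"

    ssh "$src_host" "echo '$token' > '$probe'"

    if [ "$(ssh "$dst_host" "cat '$probe' 2>/dev/null")" = "$token" ]; then
        echo shared   # same filesystem visible on both hosts
    else
        echo ssh      # fall back to copying over ssh
    fi

    ssh "$src_host" "rm -f '$probe'"
}

# Usage (illustrative):
#   MODE=$(detect_shared_mode "$SRC_HOST" "$DST_HOST" /var/lib/one/datastores/0)
```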

@atodorov_storpool

Greetings, I found an easy way to predict the checkpoint file size and prepare a block device for it in advance.

virsh dommemstat

The rss parameter shows the real memory usage.
Empirically I found out that the resulting checkpoint file is always a bit smaller than this value.

So my vmm/save driver action now looks like:

# First I suspend the VM:
virsh suspend $DEPLOY_ID
# Then get the real memory usage (the rss value is reported in KiB):
virsh dommemstat $DEPLOY_ID | awk '$1 == "rss" {print $2}'
# Then prepare the `/dev/blkdevice` device of at least that size in my block storage
...
# Then run the save operation:
virsh save $DEPLOY_ID /dev/blkdevice

Action vmm/restore looks even simpler:

# Run the restore command (virsh restore takes only the state file):
virsh restore /dev/blkdevice
# Then remove the `/dev/blkdevice` device from my block storage
...

This is working really well!

You can see the concrete implementation in my new linstor_un driver:

Great! I’ll do some tests too :slight_smile:

Regarding vmm save/restore, I prefer to follow the open nature of OpenNebula and, when possible, make the changes in a way that does not break compatibility with other storages. That's why I patch save/restore to extend them to work with pre/post scripts and make the changes in those extra scripts. Actually, I just borrow the missing part of the code from save/restore and add it where relevant: take the “pre” part from vmm/restore and add it to vmm/save, and insert the “post” part from vmm/save into vmm/restore. This way you have:

vmm/save
     call if exists vmm/save.<tm_mad>-pre # routine borrowed from vmm/restore
     do the default task
     call if exists vmm/save.<tm_mad>
vmm/restore
    call if exists vmm/restore.<tm_mad>
    do the default task
    call if exists vmm/restore.<tm_mad>-post # routine borrowed from vmm/save
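The "call if exists" dispatch above can be sketched as a small helper (DRIVER_PATH, TM_MAD and the hook naming are assumptions for illustration, not the exact addon-storpool code):

```shell
#!/bin/bash
# Minimal sketch of the optional-hook dispatch described above.
run_hook() {
    # Run the given hook script with the remaining arguments, but only if
    # it exists and is executable; missing hooks are silently skipped.
    local hook="$1"; shift
    if [ -x "$hook" ]; then
        "$hook" "$@" || exit $?
    fi
}

# Inside vmm/save one would call, around the default task (illustrative):
#   run_hook "$DRIVER_PATH/save.$TM_MAD-pre" "$@"   # borrowed "pre" part
#   ... default save task ...
#   run_hook "$DRIVER_PATH/save.$TM_MAD" "$@"
```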

This way OpenNebula will work with several SYSTEM Datastores backed by different storage backends. By replacing save/restore with scripts dedicated to a single Datastore you are actually vendor-locking the entire OpenNebula installation to this Datastore…

I’ve just created a pull request with the patches to vmm save/restore.

Cheers,
Anton

@atodorov_storpool

Hey, I'm still thinking that these actions should be handled by the TM driver, not by the VM driver. I did some investigation and found we can do the same thing as with the premigrate and postmigrate scripts, by adding a few new actions: presave, postsave, prerestore and postrestore. That would be more correct, and then we can update the existing drivers to stop using the save.ceph, restore.ceph format.

I've already tried to do that, but I need some help because I'm missing something:

The changes compile, but the VM hangs on the save/restore operation, and I don't know why.

Well, I think it is something in between. Following the logic, the checkpoint file is more a part of the VM than of the storage system. When the code is in the tm_mad you should handle all possible virtualisation technologies: it is not KVM only, XEN is not so well supported but should work, and other virtualisation technologies could be managed by ONE too (there is no reason not to have VirtualBox, for example). So, on one side, the tm_mad must know how to get the size of the checkpoint file. That is why I think it is better to drop a file in the exact vmm_mad with the name of the tm_mad than to have tm_mad/.kvm, etc.

They are two sides of the same coin in the end, but I think it is better to just add the missing 1/3 of the code to the vmm_mad and drop tm_mad files there than to patch the core. This way you know that this piece of code is for an exact vmm_mad. Changing the core means that these scripts must be available and called even when they are not needed or not capable of doing the task…

Cheers,
Anton

Yeah, that's a bit easier, but it creates a mess for future development.

Why do we have the normal premigrate and postmigrate TM scripts in the core, but no presave, postsave and prerestore, postrestore there?

Setting up the checkpoint location and uploading it into the storage is the storage driver's business, isn't it? Then it should be handled exactly by the TM driver, the same way as the premigrate and postmigrate operations.

I can tell you more: your storpool driver does not require execution on the frontend, but some other drivers, like mine, may require that.
That's why TM driver actions always execute on the frontend. And these hooks, save.<TM_MAD> and restore.<TM_MAD>, break this idyll, because they are executed by the VM driver directly on the host.

Having presave, postsave and prerestore, postrestore actions is a simple and clean solution for handling these actions in the TM driver, which should prepare the location and set up the symlink for saving the checkpoint file.

That's not fully true, because the save operation will be executed by the VM driver as before; the TM driver will just set the stage for it.

So we will have the following chain for suspend:

#  driver  action    execution  description
1  tm      presave   frontend   prepares the place and sets up the checkpoint file symlink
2  vmm     save      remote     saves the VM state into the checkpoint file (or device) under the symlink
3  tm      postsave  frontend   (optional) disconnects the device with the checkpoint from the compute node

And for the restore:

#  driver  action       execution  description
1  tm      prerestore   frontend   connects the device with the checkpoint to the compute node and sets up the symlink
2  vmm     restore      remote     restores the VM from the symlink
3  tm      postrestore  frontend   removes the old checkpoint file
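For illustration, the symlink setup performed by the proposed tm/presave could look roughly like this (the function name, argument order and device path are assumptions, since the action does not exist yet):

```shell
#!/bin/bash
# Hypothetical tm/presave sketch: point the VM's checkpoint path at a
# pre-created block device on the source host. The block device itself
# would be created/attached by the storage backend beforehand
# (backend-specific, omitted here).
presave_link_checkpoint() {
    local host="$1" checkpoint_path="$2" device="$3"
    ssh "$host" "ln -sf '$device' '$checkpoint_path'"
}

# Usage (illustrative):
#   presave_link_checkpoint "$SRC_HOST" \
#       "$DS_PATH/$VMID/checkpoint" "/dev/my_storage/one-$VMID-checkpoint"
```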

PS: context is also a TM-driver operation, because it operates on and is stored in the system datastore.

By the way, I’ve prepared prototype for that:

If the changes are not taken into the core for some reason, we can use it as an extended API for handling save/restore operations in the TM driver.

But I still think it should be in the core, because the plain ceph driver requires it as well.

I am not an OpenNebula developer and do not know the exact reason, but I could guess from OpenNebula's behavior that premigrate/postmigrate are called in the IMAGE datastore context, not the SYSTEM datastore context: to prepare for migration the images from different IMAGE datastores that are used as disks in the VM that will be migrated.

As I said, I think it is something in between. The counter-question is: who should determine the size of the checkpoint file in the first place? :wink:

The tm_mad drivers are called with exact values, like “create a storage device with size X and make sure it is accessible at location Y”. If we agree on this, then what should be set as the size of the checkpoint file?
Both our (empirical) findings regarding the size of the checkpoint file are derived, more or less, from the running VM. So it is the task of the VM_MAD to determine the size of the checkpoint file and then request the tm_mad to provide the storage for it. That is the reason behind vm_mad/save.ceph, vm_mad/save.storpool.

Another example: when tm_mad/presave is called for LXD, I believe it is LXD's VMM_MAD business to determine the size of the checkpoint file, not the TM_MAD's… (Yes, it is not supported currently, but we must look to the future.) With the current approach of dropping TM_MAD-related files in the VM_MAD's space we indirectly define the capabilities of the TM_MAD. I still believe that it is not tm_mad/{pre,post}{save,restore}'s responsibility to deal with the matter.

Anyway, if the above concerns are properly addressed I am OK with adapting addon-storpool accordingly. And yes, it is an improvement over the current state, but I am just not convinced (yet :slight_smile: ) that it is the best way to deal with the matter.

tm_mad/context is fed with the exact content as files to encapsulate in an extremely volatile ISO image in the SYSTEM Datastore. And OpenNebula expects it to be a file (also hard-coded in the core, but this is another beer :smile: ). I hope we can agree that it does not depend on the virtualisation technology.

Finally, I'd like to stress that I appreciate the discussion (no offence at all, honestly). I totally agree that all orchestration should be done on the front-end, for endless reasons. VM_MAD has a lot of room for improvement. For example (my personal opinion), the entire domain XML compilation should be extracted into an external, easy-to-extend module…

addon-storpool does some of the operations on the Hosts because this is the way OpenNebula works, and how older versions of OpenNebula work (we can't force the customers to upgrade to the latest version), and, not least, to do all of this with minimal effort where possible. Also, each change to the core should be examined from all possible angles, and the least intrusive changes made so as not to break other things.

I hope you will agree that this is the root of this discussion? :slight_smile:

That said, the last word belongs to the ONE developers like @ruben and co. After all, we do not know the full roadmap or the reasons behind a lot of the decisions that were taken…

Cheers,
Anton Todorov

RE: tmsave/tmrestore sounds fair :wink:

It will be handled by the TM driver; in the same way as the cpds and snap_create_live actions it will use virsh, which is somewhat of a standard for all vmm drivers.

This is not always needed, e.g. if you have a shared filesystem you don't need to know the exact size; we can just place it as a file, right?

However it can be organized like:

  • vmm/suspend --> tm/presave --> vmm/save --> tm/postsave
  • tm/prerestore --> vmm/restore --> tm/postrestore --> vmm/unsuspend

But in my opinion this is too complicated, because it requires developing many more changes to add the new vmm actions and a suspend state.
Also, not every tm driver requires it, so I'm thinking of handling the suspend/unsuspend operations in the tm driver itself when it is actually needed, e.g. for ceph, storpool and linstor_un.

Simple! LXD has no save and restore support in its imported actions, so it does not support saving and restoring state to disk. In the future, if it does, it will use the same checkpoint file with the same libvirt interface, I guess. All other VMM drivers which support save/restore operations have a uniform interface via virsh.

Yep, sure, thanks for the discussion. I also found a lot of useful things in it for myself. :slight_smile:
Let's wait for someone from the core developers to comment on this change.


Hi guys,

Just a brief note that we are processing all your wonderful feedback to share with you our opinion on the proposed approaches.

We'll be updating either this post or the associated issues.
