ONE 5.6.1: Live migration of some guests never ends

Hi All,

We recently upgraded from 5.4.13 to 5.6.1. We use CEPH as the datastore and KVM as the hypervisor. For most VMs live migration works fine, but for some larger VMs (such as one-63, with 8 CPU/VCPU and 8G RAM) it never finishes.
What could be the culprit for this strange behavior?
On the old version (5.4.13) it took around 2 to 2.5 minutes to live-migrate the same VM.
Here is some output from the logs:

Fri Nov 16 13:38:53 2018 [Z0][VM][I]: New LCM state is MIGRATE
Fri Nov 16 13:38:54 2018 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_premigrate.
Fri Nov 16 13:38:55 2018 [Z0][VMM][I]: ExitCode: 0
Fri Nov 16 13:38:55 2018 [Z0][VMM][I]: Successfully execute network driver operation: pre.
************ Here I ran out of patience and hit RECOVER => SUCCESS
Fri Nov 16 14:08:53 2018 [Z0][VM][I]: New LCM state is RUNNING

But the KVM process stays on the originating host, [root@blackmirror3 ~]# (output from the top command):
6447 ? Sl 7210:58 /usr/libexec/qemu-kvm -name guest=one-63,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-22-one-63/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off

A second process also exists on the host that I want to migrate to, [root@blackmirror4 ~]# (output from the top command):
8736 ? Sl 11:41 /usr/libexec/qemu-kvm -name guest=one-63,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-one-63/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off

If I kill process 8736 on the blackmirror4 host, the VM keeps working but is not migrated.
If I kill the process on the blackmirror3 host instead, the VM gets destroyed.

Please help!

BR,
nalexandrov

PS: Hosts are CentOS7.5 and are also CEPH-nodes. VMs are CentOS7.5 too.

Hello, I don't think this is related to OpenNebula. You can tune the live migration process itself: use post-copy instead of the default pre-copy behavior or, better, post-copy-after-pre-copy. Another option is to stick with the default and more stable pre-copy, but with compression enabled.

In my experience from 1-2 years ago, post-copy is not as safe as pre-copy, and sometimes the VM gets destroyed. So I stick with compressed pre-copy.
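
In virsh terms, these strategies roughly correspond to the flags below. This is only a dry-run sketch that prints the commands rather than running them; the helper name migrate_cmd and the destination URI are made up for illustration, and the flags need a libvirt/QEMU version that supports them.

```shell
#!/bin/sh
# Dry run: print the virsh command for a given migration strategy.
# migrate_cmd is a hypothetical helper, not part of OpenNebula or libvirt.
migrate_cmd() {
    # $1 = strategy, $2 = libvirt domain name, $3 = destination libvirt URI
    case $1 in
        precopy)                opts="" ;;             # libvirt default
        precopy-compressed)     opts="--compressed" ;; # default pre-copy plus compression
        postcopy)               opts="--postcopy" ;;   # only enables post-copy; the switch is
                                                       # triggered with "virsh migrate-postcopy"
        postcopy-after-precopy) opts="--postcopy --postcopy-after-precopy" ;;
        *) echo "unknown strategy: $1" >&2; return 1 ;;
    esac
    echo "virsh migrate --live${opts:+ $opts} $2 $3"
}

# Example with the VM from this thread; the destination URI is an assumption.
migrate_cmd precopy-compressed one-63 qemu+tcp://blackmirror4/system
```

With OpenNebula you would not normally run virsh migrate by hand; the same flags are what ends up in MIGRATE_OPTIONS.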

Hello Kristian and thanks for the answer!

Did you mean adding this line:
MIGRATE_OPTIONS="--postcopy --postcopy-after-precopy" to the file /var/lib/one/remotes/etc/vmm/kvm/kvmrc ?
If so, it still doesn't work for me: same results.
Also, I don't know how to enable compression.
Live migration worked fine before the upgrade.
Maybe something is missing in the procedure: https://docs.opennebula.org/5.6/intro_release_notes/upgrades/upgrade_54.html

BR,
nalexandrov

Hi, I use MIGRATE_OPTIONS=--compressed, and I also edited these files:

/var/lib/one/remotes/vmm/kvm/migrate
/var/lib/one/remotes/vmm/kvm/migrate_local

by adding this line just before exec_and_log:

virsh --connect $LIBVIRT_URI migrate-compcache $deploy_id --size 1073741824

and in migrate_local:

virsh --connect $QEMU_PROTOCOL://$src_host/system migrate-compcache $deploy_id --size 1073741824
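
For reference, the --size value is given in bytes, so 1073741824 is 1 GiB. A small dry-run sketch of how that number comes about (compcache_cmd is a hypothetical helper that only prints the command; the URI and VM name are examples taken from this thread):

```shell
#!/bin/sh
# Print the migrate-compcache command for a cache size given in MiB.
# compcache_cmd is an illustrative helper; it does not call virsh itself.
compcache_cmd() {
    # $1 = libvirt URI, $2 = domain name, $3 = cache size in MiB
    size_bytes=$(( $3 * 1024 * 1024 ))   # virsh expects the size in bytes
    echo "virsh --connect $1 migrate-compcache $2 --size $size_bytes"
}

# 1024 MiB = 1 GiB = 1073741824 bytes
compcache_cmd qemu+tcp://localhost/system one-63 1024
```

Running virsh migrate-compcache without --size just prints the current cache size, which is a quick way to check that the setting took effect.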

You can check the migration status by running:

virsh domjobinfo one-{ID}
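
To watch a migration over time instead of running the command by hand, something like the sketch below could be used on the source host. The function names are made up, and the field labels ("Job type", "Memory remaining", "Dirty rate") are assumed to match what your libvirt version prints; a field that is missing simply shows up empty.

```shell
#!/bin/sh
# Read "virsh domjobinfo" output on stdin and print the value of one field.
job_field() {
    # $1 = field label without the colon, e.g. "Memory remaining"
    awk -F': *' -v key="$1" '
        $1 == key { sub(/[ \t]+$/, "", $2); print $2; exit }'
}

# Poll the job until virsh reports no active job or the domain is gone.
watch_migration() {
    domain=$1
    interval=${2:-5}                      # seconds between polls
    while :; do
        info=$(virsh domjobinfo "$domain") || return 1
        type=$(printf '%s\n' "$info" | job_field "Job type")
        if [ "$type" = "None" ]; then
            break
        fi
        printf '%s  memory remaining: %s  dirty rate: %s\n' \
            "$(date +%T)" \
            "$(printf '%s\n' "$info" | job_field "Memory remaining")" \
            "$(printf '%s\n' "$info" | job_field "Dirty rate")"
        sleep "$interval"
    done
}

# Usage: watch_migration one-63 10
```

If "Memory remaining" stops shrinking while the guest is busy, the guest is dirtying memory faster than it can be transferred, and pre-copy will not converge on its own. A stuck job can be cancelled on the source host with virsh domjobabort instead of killing the qemu-kvm process.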

You should also read this for some of the config variables:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/virtualization_administration_guide/sect-virtualization-kvm_live_migration-live_kvm_migration_with_virsh

I use TCP communication with libvirt. I have this in kvmrc:

export LIBVIRT_URI=qemu+tcp://localhost/system
export QEMU_PROTOCOL=qemu+tcp

You can read more about it here: https://docs.opennebula.org/5.6/deployment/open_cloud_host_setup/kvm_driver.html#tuning-extending
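
Pulling the settings from this post together, the relevant part of kvmrc might look roughly like this. Treat it as a sketch rather than a verified configuration: the file path is the one mentioned earlier in the thread, and qemu+tcp assumes libvirtd on the hosts is set up to accept TCP connections.

```shell
# /var/lib/one/remotes/etc/vmm/kvm/kvmrc (path as quoted earlier in this thread)

# Talk to libvirt over TCP; assumes libvirtd listens on TCP on every host.
export LIBVIRT_URI=qemu+tcp://localhost/system
export QEMU_PROTOCOL=qemu+tcp

# Extra flags for the migrate command: compressed pre-copy.
MIGRATE_OPTIONS=--compressed
```

Changes under /var/lib/one/remotes are made on the front-end, so they usually have to be pushed to the hosts (for example with onehost sync --force) before they take effect.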

Thanks, Kristian, for the examples - I’ll experiment with those values.
I figured out that the slow live migration was caused by the heavy load on the VM!
virsh domjobinfo one-{ID} helped a lot.
:beer:

Welcome :slight_smile: