We recently upgraded OpenNebula from 5.4.13 to 5.6.1. We use Ceph as the datastore and KVM as the hypervisor. Live migration works fine on most VMs, but on some larger ones (such as one-63, with 8 CPU/VCPU and 8 GB RAM) it never finishes.
What could be the culprit for this strange behavior?
On the old version (5.4.13), live-migrating the same VM took around 2 to 2:30 minutes.
Here is some output from the logs:
Fri Nov 16 13:38:53 2018 [Z0][VM][I]: New LCM state is MIGRATE
Fri Nov 16 13:38:54 2018 [Z0][VMM][I]: Successfully execute transfer manager driver operation:tm_premigrate.
Fri Nov 16 13:38:55 2018 [Z0][VMM][I]: ExitCode: 0
Fri Nov 16 13:38:55 2018 [Z0][VMM][I]: Successfully execute network driver operation: pre.
************ At this point I ran out of patience and hit RECOVER => SUCCESS:
Fri Nov 16 14:08:53 2018 [Z0][VM][I]: New LCM state is RUNNING
But the KVM process stays on the originating host:
[root@blackmirror3 ~]# (output from the top command)
6447 ? Sl 7210:58 /usr/libexec/qemu-kvm -name guest=one-63,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-22-one-63/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off
And there is also one on the host that I want to migrate to:
[root@blackmirror4 ~]# (output from the top command)
8736 ? Sl 11:41 /usr/libexec/qemu-kvm -name guest=one-63,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-one-63/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off
If I kill process 8736 on the blackmirror4 host, the VM keeps working, but it is not migrated.
If I kill the process on the blackmirror3 host instead, the VM gets destroyed.
Please help!
BR,
nalexandrov
PS: The hosts run CentOS 7.5 and are also Ceph nodes. The VMs run CentOS 7.5 too.
Hello, I don't think this is related to OpenNebula. You can tune the live migration process: use post-copy instead of the default pre-copy behavior or, better, post-copy after pre-copy. Another way is to stick with the default, more stable pre-copy, but with compression enabled.
In my experience from 1-2 years ago, post-copy is not as safe as pre-copy, and sometimes the VM gets destroyed. So I stick with compressed pre-copy.
Did you mean adding this line: MIGRATE_OPTIONS="--postcopy --postcopy-after-precopy" to the file /var/lib/one/remotes/etc/vmm/kvm/kvmrc?
Because if you did, it still doesn't work for me: same results.
Also, I don't know how to enable compression.
Live migration worked fine before the upgrade.
Maybe something is missing in the procedure: https://docs.opennebula.org/5.6/intro_release_notes/upgrades/upgrade_54.html
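On the compression question, here is a minimal sketch of what the kvmrc line might look like. It assumes that the KVM driver passes MIGRATE_OPTIONS unchanged to virsh migrate, and it uses the virsh migrate flag --compressed; whether compression actually helps depends on the libvirt/QEMU versions on the hosts, so treat it as something to try rather than a confirmed fix.

```shell
# Fragment for /var/lib/one/remotes/etc/vmm/kvm/kvmrc (the file is sourced as shell).
# Assumption: the driver appends $MIGRATE_OPTIONS to its "virsh migrate --live" call.

# Default pre-copy migration, with compression of the transferred memory pages.
MIGRATE_OPTIONS="--compressed"

# The post-copy variant discussed above, left commented out for comparison:
# MIGRATE_OPTIONS="--postcopy --postcopy-after-precopy"
```

Changes under /var/lib/one/remotes only take effect on the hosts after the remotes are pushed out again, which is normally done as oneadmin with onehost sync --force. If an edited kvmrc seems to have no effect, that is worth checking first.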
Thanks, Kristian, for the examples - I’ll experiment with those values.
I figured out that the slow live migration was due to the VM's heavy load! virsh domjobinfo one-{ID} helped a lot.
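For anyone hitting the same thing, here is a small sketch of how virsh domjobinfo can be polled while a migration is running. The helper names job_progress and watch_migration are made up for illustration, and the field labels ("Data remaining", "Memory remaining", "Dirty rate") are assumed from typical libvirt output, so they may differ between versions.

```shell
#!/bin/sh
# Keep only the progress-related lines from "virsh domjobinfo" output on stdin.
# The field labels are an assumption based on typical libvirt output.
job_progress() {
    grep -E '^(Time elapsed|Data remaining|Memory remaining|Dirty rate):'
}

# Hypothetical helper: poll a domain until libvirt reports no active job
# (or until virsh fails, e.g. because the domain has left this host).
# Usage: watch_migration one-63 [interval_seconds]
watch_migration() {
    domain="$1"
    interval="${2:-5}"
    while out=$(virsh domjobinfo "$domain") &&
          ! printf '%s\n' "$out" | grep -q '^Job type: *None'; do
        printf '%s\n' "$out" | job_progress
        echo '---'
        sleep "$interval"
    done
}
```

Run on the source host as, for example, watch_migration one-63. If the remaining memory stops shrinking while the dirty rate stays high, the guest is probably dirtying pages faster than pre-copy can transfer them, which would match the heavy-load explanation above.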