We recently upgraded from 5.4.13 to 5.6.1. We use Ceph as the datastore and KVM as the hypervisor. Live migration works fine for most VMs, but for some larger ones (such as one-63, with 8 CPU/VCPU and 8 GB RAM) it never completes.
What could be causing this behavior?
On the old version (5.4.13), live-migrating the same VM took around 2 to 2.5 minutes.
Here is some output from the logs:
Fri Nov 16 13:38:53 2018 [Z0][VM][I]: New LCM state is MIGRATE
Fri Nov 16 13:38:54 2018 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_premigrate.
Fri Nov 16 13:38:55 2018 [Z0][VMM][I]: ExitCode: 0
Fri Nov 16 13:38:55 2018 [Z0][VMM][I]: Successfully execute network driver operation: pre.
************ At this point, about 30 minutes later, I ran out of patience and hit RECOVER => SUCCESS
Fri Nov 16 14:08:53 2018 [Z0][VM][I]: New LCM state is RUNNING
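To put a number on the wait, here is a minimal Python sketch that computes the time between the MIGRATE and RUNNING state changes from the log timestamps above. The helper names are hypothetical, and it assumes every log line starts with a timestamp in this format, followed by " [".

```python
from datetime import datetime

# Two lines copied from the log excerpt above.
LOG_LINES = [
    "Fri Nov 16 13:38:53 2018 [Z0][VM][I]: New LCM state is MIGRATE",
    "Fri Nov 16 14:08:53 2018 [Z0][VM][I]: New LCM state is RUNNING",
]


def parse_timestamp(line):
    """Return the timestamp at the start of a log line (hypothetical helper)."""
    # Assumption: the timestamp is everything before the first " [".
    stamp = line.split(" [", 1)[0]
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")


def elapsed_minutes(start_line, end_line):
    """Return the number of minutes between two log lines."""
    delta = parse_timestamp(end_line) - parse_timestamp(start_line)
    return delta.total_seconds() / 60


print(elapsed_minutes(LOG_LINES[0], LOG_LINES[1]))  # prints 30.0
```

So the VM sat in the MIGRATE state for 30 minutes before I recovered it, compared with 2 to 2.5 minutes for a complete migration on the old version.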
However, the KVM process is still running on the source host, [root@blackmirror3 ~]# (output from the top command):
6447 ? Sl 7210:58 /usr/libexec/qemu-kvm -name guest=one-63,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-22-one-63/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off
A second KVM process is also running on the destination host, [root@blackmirror4 ~]# (output from the top command):
8736 ? Sl 11:41 /usr/libexec/qemu-kvm -name guest=one-63,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-one-63/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off
If I kill process 8736 on the blackmirror4 host, the VM keeps working but is not migrated.
If I kill the process on the blackmirror3 host instead, the VM is destroyed.
PS: The hosts run CentOS 7.5 and are also Ceph nodes. The VMs run CentOS 7.5 too.