First, the hard facts:
VM Hosts: Debian 8
ONE Management: Ubuntu 14.04 LTS
VM Guests: Ubuntu 14.04 LTS
NO SHARED FILESYSTEM (only ssh to deploy the VMs)
All VMs are running with these parameters:
/usr/bin/qemu-system-x86_64 -name one-125 -S -machine pc-i440fx-2.1,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-xx.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/data/one/0/xx/disk.0,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/data/one/0/xxx/disk.1,if=none,id=drive-ide0-0-0,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=46,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=02:00:xx:xx:xx:xx,bus=pci.0,addr=0x3 -vnc 0.0.0.0:125 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on
So I ran into this weird issue a few days ago:
The VM was running with 2 vCPUs (0.25 CPU) and 4 GB of memory, doing some basic VM-ing with an average load of 0.5-0.9. Suddenly, the load spiked well above a healthy level and got stuck at a load average of 10. I/O wait also shot to the moon and all services stopped responding: no ssh, no console via Sunstone, no “force” reboot/shutdown, nothing. All operations timed out.
Sunstone showed me 100% “real” CPU too.
Workaround: find the PID of the KVM process, kill it, wait for oned to notice that the VM is gone, and then resume it.
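For reference, this is roughly how I locate the process. It is only a sketch: the helper name `find_kvm_pid` is made up, it expects the text output of `ps -eo pid,args`, and it assumes the VM shows up as `-name one-<id>` on the qemu command line, as in the parameters above.

```python
import re


def find_kvm_pid(ps_output, vm_name):
    """Return the PID of the qemu process started with `-name <vm_name>`, or None.

    `ps_output` is expected to be the text printed by `ps -eo pid,args`.
    """
    # Match "-name one-125" but not "-name one-1250".
    pattern = re.compile(r"\s-name\s+" + re.escape(vm_name) + r"(?=[\s,]|$)")
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        # Skip the header line and anything that does not start with a PID.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, cmdline = parts
        if "qemu" in cmdline and pattern.search(" " + cmdline):
            return int(pid)
    return None
```

On the host I feed it the output of `subprocess.check_output(["ps", "-eo", "pid,args"]).decode()` and then send SIGKILL to the returned PID with `os.kill(pid, signal.SIGKILL)`.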
How do I:
- find the root cause on the KVM/libvirt side? (No other VMs on the same host were affected, only this one.) OR
- avoid high I/O wait with some tweaks?
- handle VMs stuck in this state “automatically” with OpenNebula? (I don’t want to get up at 3 in the morning to kill a process and start it again)
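To make the last point more concrete, here is a rough sketch of the kind of host-side check I have in mind. Everything in it is an assumption on my part: the function names are made up, the 95% threshold and the five-sample window are arbitrary, and pegged CPU on the qemu process is only the symptom I observed, not proof that the guest is really stuck. It reads `utime` and `stime` from `/proc/<pid>/stat` and flags the VM when the process stays pegged for several samples in a row.

```python
def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split after the last ')'.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def cpu_percent(ticks_before, ticks_after, interval_s, ticks_per_s):
    """CPU usage of the process over the interval, in percent of one core."""
    return 100.0 * (ticks_after - ticks_before) / ticks_per_s / interval_s


def looks_stuck(samples, threshold=95.0, min_samples=5):
    """True if the last `min_samples` CPU readings are all >= threshold.

    Both defaults are arbitrary guesses, not tuned values.
    """
    recent = samples[-min_samples:]
    return len(recent) == min_samples and all(s >= threshold for s in recent)
```

The ticks-per-second value would come from `os.sysconf("SC_CLK_TCK")` on the host. What I am missing is the part after `looks_stuck()` returns True: how to let OpenNebula itself kill and resume the VM instead of me doing it by hand.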
If you need more information, I’m happy to share it with you.
EDIT: I currently have another VM in this “high-load” state.