VM goes from RUNNING state to POWEROFF state even though it is running on the cluster/hypervisor

Hi all,

Gene Liu reported this back in February. I don’t know whether that was last year or this year, but the issue is still occurring in 4.14. Below is the log of one of my recently deployed VMs in vCenter:

Sat Oct 3 09:08:35 2015 [Z0][VM][I]: New state is ACTIVE
Sat Oct 3 09:08:35 2015 [Z0][VM][I]: New LCM state is PROLOG
Sat Oct 3 09:08:35 2015 [Z0][VM][I]: New LCM state is BOOT
Sat Oct 3 09:08:35 2015 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/13/deployment.0
Sat Oct 3 09:08:35 2015 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Sat Oct 3 09:08:42 2015 [Z0][VMM][I]: Successfully execute virtualization driver operation: deploy.
Sat Oct 3 09:08:42 2015 [Z0][VMM][I]: Successfully execute network driver operation: post.
Sat Oct 3 09:08:42 2015 [Z0][VM][I]: New LCM state is RUNNING
Sat Oct 3 09:15:03 2015 [Z0][LCM][I]: VM running but monitor state is POWEROFF
Sat Oct 3 09:15:03 2015 [Z0][VM][I]: New LCM state is SHUTDOWN_POWEROFF
Sat Oct 3 09:15:03 2015 [Z0][VM][I]: New state is POWEROFF
Sat Oct 3 09:15:03 2015 [Z0][VM][I]: New LCM state is LCM_INIT

What could be causing this? If I go to Infrastructure > Hosts > select the vCenter node > Wilds, I can see my newly deployed VM, and if I import it from there, OpenNebula starts monitoring it. For some odd reason, when the VM is first deployed the monitoring either fails or never starts.

Thanks,
Jesus

There are two cases where a VM enters the poweroff state:

  • the monitoring probe returns state ‘d’, which for vCenter means the VM is powered off

  • the monitoring probe does not return any info about the VM (for instance, if the VM has been deleted in vCenter); see the sketch below

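For reference, a minimal sketch of how that mapping could look, written as C++-style pseudocode (the actual vCenter probe is not implemented like this, and the names and string values here are assumptions):

    #include <string>

    // Sketch only, not OpenNebula code: derive the one-letter VM state the
    // monitoring probe reports, matching the two cases listed above.
    // 'a' = active, 'p' = suspended, 'd' = powered off or not reported.
    char probe_state(const std::string &power_state, bool found_in_vcenter)
    {
        if (!found_in_vcenter)           return 'd'; // case 2: no info about the VM
        if (power_state == "poweredOff") return 'd'; // case 1: VM is powered off
        if (power_state == "suspended")  return 'p';
        if (power_state == "poweredOn")  return 'a';
        return 'e';                                  // anything unexpected -> error
    }
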
We have reviewed the code carefully and built several test cases, but could not reproduce the problem. Could you please send the output of the following commands when a VM enters the poweroff state (as oneadmin, on the front-end):

$ /var/lib/one/remotes/im/run_probes vcenter x x x x VCENTER_CLUSTER_NAME

$ onevm show vid

where:

  • VCENTER_CLUSTER_NAME is the name of the vCenter cluster as shown in the “onehost list” command.
  • vid is the ID of the VM in poweroff state

Hi Tino,

It seems the issue is not present anymore. I always build OpenNebula from source, and a few nights ago I noticed this issue, but with the latest source code/build I no longer see it. I’ll report back if I run into it again, but for now we can table it.

Thanks for responding.
Jesus

Hi,

We are infrequently seeing the same behaviour Jesus described. After starting a VM the state is “ACTIVE”. After a successful boot on the cluster, the monitor thinks the VM is in the poweroff state. After 5 minutes the VM is found again as running. Could this be a race condition between the VM and IM drivers?

This is the log of the VM:
Thu Aug 25 11:12:29 2016 [Z0][VM][I]: New LCM state is LCM_INIT
Thu Aug 25 11:12:52 2016 [Z0][VM][I]: New state is ACTIVE
Thu Aug 25 11:12:52 2016 [Z0][VM][I]: New LCM state is BOOT_POWEROFF
Thu Aug 25 11:12:52 2016 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/699/deployment.2
Thu Aug 25 11:12:53 2016 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Thu Aug 25 11:13:25 2016 [Z0][VMM][I]: Successfully execute virtualization driver operation: deploy.
Thu Aug 25 11:13:25 2016 [Z0][VMM][I]: Successfully execute network driver operation: post.
Thu Aug 25 11:13:25 2016 [Z0][VM][I]: New LCM state is RUNNING
Thu Aug 25 11:13:25 2016 [Z0][HKM][I]: Success executing Hook: mail.sh: .
Thu Aug 25 11:13:28 2016 [Z0][LCM][I]: VM running but monitor state is POWEROFF
Thu Aug 25 11:13:28 2016 [Z0][VM][I]: New LCM state is SHUTDOWN_POWEROFF
Thu Aug 25 11:13:28 2016 [Z0][VM][I]: New state is POWEROFF
Thu Aug 25 11:13:28 2016 [Z0][VM][I]: New LCM state is LCM_INIT
Thu Aug 25 11:16:52 2016 [Z0][VMM][I]: VM found again by the drivers
Thu Aug 25 11:16:52 2016 [Z0][VM][I]: New state is ACTIVE
Thu Aug 25 11:16:52 2016 [Z0][VM][I]: New LCM state is RUNNING
Thu Aug 25 11:16:52 2016 [Z0][HKM][I]: Success executing Hook: mail.sh: .

The corresponding log in oned.log:
Thu Aug 25 11:11:50 2016 [Z0][VMM][D]: VM 699 successfully monitored: STATE=a CPU=0.00 MEMORY=68608 NETRX=0 NETTX=0 ESX_HOST=xyz GUEST_STATE=notRunning VMWARETOOLS_RUNNING_STATUS=guestToolsNotRunning VMWARETOOLS_VERSION=0 VMWARETOOLS_VERSION_STATUS=guestToolsNotInstalled
Thu Aug 25 11:13:28 2016 [Z0][VMM][D]: VM 699 successfully monitored: STATE=d
Thu Aug 25 11:18:34 2016 [Z0][VMM][D]: VM 699 successfully monitored: STATE=a CPU=0.00 MEMORY=75776 NETRX=0 NETTX=0 ESX_HOST=abc GUEST_STATE=notRunning VMWARETOOLS_RUNNING_STATUS=guestToolsNotRunning VMWARETOOLS_VERSION=0 VMWARETOOLS_VERSION_STATUS=guestToolsNotInstalled
Thu Aug 25 11:20:14 2016 [Z0][VMM][D]: VM 699 successfully monitored: STATE=a CPU=0.00 MEMORY=75776 NETRX=0 NETTX=0 ESX_HOST=abc GUEST_STATE=notRunning VMWARETOOLS_RUNNING_STATUS=guestToolsNotRunning VMWARETOOLS_VERSION=0 VMWARETOOLS_VERSION_STATUS=guestToolsNotInstalled

Note that after starting the VM, vSphere chooses another ESX host (DRS) on which to boot it. The 5 minute delay is typical and corresponds to a snippet of code in MonitorThread.cc:

        // Move the VM to power off if it is not reported by the Host and:
        // 1.- It has a history record
        // 2.- It is supposed to be in RUNNING state
        // 3.- It has been monitored at least once
        if (vm->hasHistory() &&
            vm->get_last_poll() != 0 &&
             ( vm->get_lcm_state() == VirtualMachine::RUNNING ||
               vm->get_lcm_state() == VirtualMachine::SHUTDOWN ||
               vm->get_lcm_state() == VirtualMachine::SHUTDOWN_POWEROFF ||
               vm->get_lcm_state() == VirtualMachine::SHUTDOWN_UNDEPLOY))
        {
            lcm->trigger(LifeCycleManager::MONITOR_POWEROFF, *its);
        }
        // If the guest is shut down before the poll reports it at least
        // once, the VM gets stuck in running. An individual poll action
        // is triggered after 5min (arbitrary number)
        else if (vm->hasHistory() &&
                vm->get_last_poll() == 0 &&
                vm->get_lcm_state() == VirtualMachine::RUNNING &&
                (time(0) - vm->get_running_stime() > 300))
        {
            vmm->trigger(VirtualMachineManager::POLL,vm->get_oid());
        }
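
For completeness, the recovery at 11:16:52 (“VM found again by the drivers”) looks like the complementary path. A minimal sketch of what I assume happens, by analogy with the MONITOR_POWEROFF trigger above (the actual handler and trigger name may differ):

        // Sketch only: 'state' stands for the one-letter state returned by the
        // probe. When a VM that oned already moved to POWEROFF is reported as
        // running again, the LifeCycleManager is asked to put it back into
        // RUNNING, which produces the "VM found again by the drivers" line.
        if (vm->get_state() == VirtualMachine::POWEROFF && state == 'a')
        {
            lcm->trigger(LifeCycleManager::MONITOR_POWERON, vm->get_oid());
        }

If that is right, the open question is why the probe briefly reported STATE=d right after deploy; the DRS relocation during boot seems the likely suspect.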

Any ideas, Tino?

Best regards,

Sebastiaan Smit

@tinova
So, I got the same problem:
Sat Feb 19 01:12:57 2022 [Z0][ReM][D]: Req:3552 UID:-1 IP:127.0.0.1 one.vmpool.info invoked , -2, -1, -1, -1, ""
Sat Feb 19 01:12:57 2022 [Z0][ReM][E]: Req:3552 UID:- one.vmpool.info result FAILURE [one.vmpool.info] User couldn't be authenticated, aborting call.
Sat Feb 19 01:12:57 2022 [Z0][InM][D]: VM_STATE update from host: 0. VM id: 11, state: POWEROFF
I’m using a vOneCloud 6.2.0 deployment in vCenter.

Also, I’ve resolved this problem.
The oneadmin password was changed via tty commands; it seems the password change did not apply properly, so monitoring could not authenticate with the new password.
The problem was resolved after restarting the vOneCloud instance.

Update: the vOneCloud docs state that the appliance should be rebooted after the password is changed.