What to do with a ghost VM?

The Problem:

Hi there guys. My team and I recently upgraded one of our OpenNebula clusters, but after the upgrade one VM somehow became a ghost: it only shows up in Sunstone and in the output of onevm list. If I try to open the VM page I get:

image

And when I try to show it via the OpenNebula CLI I get the very same error:

onevm show 62
[one.vm.info] Error getting virtual machine [62].

I can’t do anything with that VM: I can’t delete, show, resume, or even undeploy it! And because I can’t undeploy it, the VM still has 2 NICs attached, which I now can’t release.


Context:

  • OpenNebula is running on an upgraded Ubuntu 22.04.4 LTS.
  • The previous version was OpenNebula 6.4, and we upgraded to 6.8.
  • To upgrade, we followed the official documentation.

We noticed that the VM’s log (the one under /var/log/one/{$ID}.log) has completely disappeared! BUT the disks attached to that VM did not disappear from our SDS.


What have we tried in order to recover or at least delete the VM?

The list below shows what we have done to try to recover or delete the VM, with no luck at all:

  • We tried the recover options from Sunstone
  • We tried the recover using the cli: onevm recover --delete-db $ID
  • We tried with onedb purge-history --id $ID

Every one of the previous attempts returned the same [one.vm.info] Error getting virtual machine [62].


What do we want?

At this point we only want to know what the heck happened to that VM and how we can get rid of it!

We also want to delete it, as we were already able to replicate the VM in a new template using the disks from the SDS.

Thanks!!!

Try running onedb fsck to correct inconsistencies in the database. Things like references to a missing VM should be handled by it. Is the VM actually running on the hypervisor node? Also try looking in /var/log/one/oned.log for errors referencing said VM. Please also send the output of the following SQL commands run against the OpenNebula database:

  • select oid from vm_pool;
  • select body from vm_pool where oid=62;

As for what happened, it is hard to tell. Ideally you have a backup of the database from before the upgrade, so you can repeat the upgrade process and inspect what happens by running onedb upgrade with --verbose.

Hi there @dclavijo, thanks so much for your kind answer.

Yes, I do have a backup of the database from before any upgrade. My team and I recreated the situation in a virtual environment to learn more about this.

Before proceeding: I tried the onedb restore -f -v command (please note the -v) but I got no feedback on the command line. Is there any way to get feedback in my terminal from onedb restore?


So… We repeated the process and, hoping to get more info about this problem, we raised the log level to 5 in /etc/one/oned.conf:

LOG = [
  SYSTEM      = "file",
  DEBUG_LEVEL = 5,
  USE_VMS_LOCATION = "NO"
]

But no luck at all, meaning that we got no info about this VM in /var/log/one/oned.log.
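One way to double-check that the log really has no mention of the VM is a whole-word search for its ID. This is just a small sketch: vm_log_lines is a made-up helper name, and matching a bare number will also catch unrelated values that happen to equal the ID.

```shell
#!/bin/sh
# vm_log_lines ID [LOG_FILE]
# Prints (with line numbers) every log line containing ID as a whole word,
# so 62 matches "VM 62" or "[62]" but not "6200".
vm_log_lines() {
    vm_id="$1"
    log_file="${2:-/var/log/one/oned.log}"
    grep -n -w -- "$vm_id" "$log_file"
}
```

For example, `vm_log_lines 62` searches the default oned.log; a second argument points it at another file.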

Here is the output of the onedb upgrade --verbose:
image

And we still got the very same error:

image


As for the command outputs you asked for, I will leave them here (sorry about the long texts):


I guess we could somehow delete the VM from the database, but is there any way to recover the VM from this state?

Have you tried running onedb fsck? In the output you sent, the VM appears in the vm_pool table with a clean XML template in its body column, yet somehow your system is not able to query it.

You can manually tinker with the database entries using onedb change-body and onedb update-body to correct problems like this. You’d have to make the VM database entry match what is really happening on the hypervisor node. Take a look at this section.

You wrote that you executed onedb purge-history --id $ID, but in the VM body I can see HISTORY_RECORDS->HISTORY->SEQ = 16. This is strange, because after purge-history the SEQ should be 1.
This suggests the issue is in the VM history records. Please paste the output of the following SQL queries:

  • select * from history where vid = 62; - this is to check the VM history
  • select * from local_db_versioning; - to check the history of OpenNebula upgrades
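To compare the SEQ stored in the VM body with what the history table holds, something like the following may help. It is only a sketch: max_body_seq is a made-up helper, it expects the body XML on standard input, and it pulls the numbers out with grep rather than a real XML parser.

```shell
#!/bin/sh
# max_body_seq
# Reads a VM body (XML) on stdin and prints the highest <SEQ> value found.
# Handles both <SEQ>16</SEQ> and <SEQ><![CDATA[16]]></SEQ>.
max_body_seq() {
    grep -oE '<SEQ>(<!\[CDATA\[)?[0-9]+' |
        grep -oE '[0-9]+$' |
        sort -n |
        tail -n 1
}
```

With the body saved to a file, `max_body_seq < vm62_body.xml` (file name is a placeholder) gives the number to compare against the highest sequence in the history rows for vid = 62.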

Do you remember whether onevm show 62 worked on version 6.4?