Error during undeploy

Hi,

I am trying to undeploy a VM in Sunstone. It fails with the following error (from the log):

Tue May 29 11:54:25 2018 [Z0][VM][I]: New LCM state is EPILOG_UNDEPLOY
Tue May 29 11:54:32 2018 [Z0][TM][I]: Command execution fail: /var/lib/one/remotes/tm/ssh/mv localhost:/var/lib/one//datastores/0/5 server.anonymous.nowhere:/var/lib/one//datastores/0/5 5 0
Tue May 29 11:54:32 2018 [Z0][TM][I]: mv: Moving localhost:/var/lib/one/datastores/0/5 to server.anonymous.nowhere:/var/lib/one/datastores/0/5
Tue May 29 11:54:32 2018 [Z0][TM][E]: mv: Command “set -e -o pipefail
Tue May 29 11:54:32 2018 [Z0][TM][I]:
Tue May 29 11:54:32 2018 [Z0][TM][I]: tar -C /var/lib/one/datastores/0 --sparse -cf - 5 | ssh server.anonymous.nowhere ‘tar -C /var/lib/one/datastores/0 --sparse -xf -’
Tue May 29 11:54:32 2018 [Z0][TM][I]: rm -rf /var/lib/one/datastores/0/5” failed: tar: 5: Cannot stat: No such file or directory
Tue May 29 11:54:32 2018 [Z0][TM][I]: tar: Exiting with failure status due to previous errors
Tue May 29 11:54:32 2018 [Z0][TM][E]: Error copying disk directory to target host
Tue May 29 11:54:32 2018 [Z0][TM][I]: ExitCode: 2
Tue May 29 11:54:32 2018 [Z0][TM][E]: Error executing image transfer script: Error copying disk directory to target host
Tue May 29 11:54:32 2018 [Z0][VM][I]: New LCM state is EPILOG_UNDEPLOY_FAILURE

After that the Image corresponding to the VM is in state ERROR.
I use OpenNebula 5.4.0 and libvirt 3.9.0 on CentOS 7.

Could this be related to the mount points of my datastores?
From my fstab:
/dev/opennebula_datastore_1/vm_images /var/lib/one/datastores/1 xfs defaults 0 0
/dev/data/opennebula_datastore_0 /var/lib/one/datastores/0 xfs defaults 0 0

Cheers,
Felix

Same problem here. Did you manage to solve it?

My workaround is simply not to use the undeploy feature; I just delete VMs instead. I was not able to fix it. It seems the undeploy script is trying to move a file that has already been deleted earlier in the undeploy process.

I have created a GitHub issue; let’s see what the developers think of it.

Can you elaborate on the setup?

If you are using a shared filesystem for the datastores, why are you using ssh as the TM_MAD?

BR,
Anton

Debian 9, OpenNebula 5.6.0

I have not changed any datastore settings; these are the defaults that were present after the installation.
Sorry, but what do you mean by “TM_MAD”?

Thanks!

Hi,

TM_MAD is the transfer manager driver configured for the given datastore. You can find more information in “Open cloud storage setup - Filesystem Datastore”.

The ssh driver deletes the destination path before copying the VM disk image files back to the front-end during undeploy. So if the system datastore is on a shared filesystem, you will see the behavior described above. The same happens if the source and the destination are the same machine under different names: both sides point at the same filesystem, so deleting the destination also deletes the source…
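
To illustrate the order of operations described above, here is a minimal Python sketch. It is not the driver's code: `move_vm_dir` and `demo` are hypothetical names, and a temporary directory stands in for /var/lib/one/datastores/0/5. It assumes source and destination end up being the same directory, as on a single server.

```python
import shutil
import tempfile
from pathlib import Path


def move_vm_dir(src: Path, dst: Path) -> None:
    """Mimic a delete-then-copy move (hypothetical helper, not driver code)."""
    # Step 1: clear the destination so the copy cannot collide with old files.
    shutil.rmtree(dst, ignore_errors=True)
    # Step 2: copy the source directory to the destination.
    # If src and dst are the same directory, step 1 has already removed it.
    shutil.copytree(src, dst)
    # Step 3: remove the source once the copy has succeeded.
    shutil.rmtree(src)


def demo() -> str:
    """Run the move with source and destination pointing at one directory."""
    with tempfile.TemporaryDirectory() as root:
        vm_dir = Path(root) / "datastores" / "0" / "5"
        vm_dir.mkdir(parents=True)
        (vm_dir / "disk.0").write_text("fake disk image")
        try:
            move_vm_dir(vm_dir, vm_dir)
        except FileNotFoundError:
            return "source missing"
        return "moved"


if __name__ == "__main__":
    print(demo())
```

Calling `demo()` returns "source missing": the copy step fails because clearing the destination already removed the only copy, which is consistent with the tar “Cannot stat” error in the log.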

Hope this helps,

Best Regards,
Anton Todorov

Hi Anton,

Thank you very much for your answer!
So I have to change TM_MAD to “shared”, right?
And do I have to change it on the files and default datastore, or only one of them?

Thanks!

Well, it depends on your setup.
Could you clarify whether you are using a single-server setup or multiple servers with a shared filesystem for the datastores?

BR,
Anton

I’m using a single server setup.

There are two options then:
a) Fix the hostnames so that the source and the destination paths are the same. The ssh driver will then detect that it is dealing with the same host and do nothing.
b) Try changing the TM_MAD for both the SYSTEM and IMAGE datastores to “shared”.

BR,
Anton

Thank you.
Unfortunately changing TM_MAD to shared didn’t help.
But I would be very happy if you could explain in more detail what you mean by option a.

EDIT: I have attached some screenshots.


Regards
v3ng

Did you try instantiating a new VM and then undeploying it?

Here is the relevant log from the other thread:

Sat Jul 21 11:00:58 2018 [Z0][TM][I]: Command execution failed (exit code: 2): /var/lib/one/remotes/tm/ssh/mv localhost:/var/lib/one//datastores/0/51 virt.xxxx.de:/var/lib/one//datastores/0/51 51 0
Sat Jul 21 11:00:58 2018 [Z0][TM][I]: mv: Moving localhost:/var/lib/one/datastores/0/51 to virt.xxxxx.de:/var/lib/one/datastores/0/51

Here the tm/ssh/mv script is told to move the VM’s home folder from the hypervisor named localhost to the front-end named virt.xxxxx.de. On a single host, both source and destination should have the same host:path, and tm/ssh/mv will then hit the following condition: https://github.com/OpenNebula/one/blob/master/src/tm_mad/ssh/mv#L62

That will skip the entire directory move because it is the same host.
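
A rough sketch of that kind of check, in Python rather than the driver's shell. It assumes the script compares the literal host:path strings after tidying duplicate slashes; `normalize` and `is_same_location` are made-up names, and the linked script remains the authoritative reference.

```python
import re


def normalize(location: str) -> str:
    """Collapse repeated slashes in the path part of a host:path string."""
    host, _, path = location.partition(":")
    return f"{host}:{re.sub(r'/+', '/', path)}"


def is_same_location(src: str, dst: str) -> bool:
    """Assumed check: plain comparison of the normalized host:path strings."""
    return normalize(src) == normalize(dst)


if __name__ == "__main__":
    # Arguments as they appear in the log above.
    src = "localhost:/var/lib/one//datastores/0/51"
    dst = "virt.xxxx.de:/var/lib/one//datastores/0/51"
    print(is_same_location(src, dst))   # host names differ
    print(is_same_location(src, src))   # identical host:path
```

With the arguments from the log, the host parts differ (localhost and virt.xxxx.de), so a comparison like this would not trigger and the move would go ahead.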

BR,
Anton

I just deployed a new VM and now it seems to work!

But I’ll also have a look at the SSH problem you mentioned.
I have to figure out where it takes the hostname from; when I added the host, I set “localhost” as the hostname.

Thanks!

You should probably map the hostname to the loopback IP address (127.0.0.1 in /etc/hosts)…

BR,
Anton

I have adapted my /etc/hosts, but still the same problem.
When changing TM_MAD to shared however it works.

# IPv4

127.0.0.1 localhost.localdomain localhost
78.xx.xx.98  virt.xxxx.de
127.0.0.1       virt.xxxx.de   virt.xxxx.de
::1     virt.xxxx.de   virt.xxxx.de

#
# IPv6
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
2a01:xxx:xxx:71a3::2  virt.xxxx.de

Still the same problem, I can’t figure out what’s causing it…
Any idea? @atodorov_storpool

ping4 virt.xxxx.de
PING virt.xxxx.de (127.0.0.1) 56(84) bytes of data.
64 bytes from virt.xxxx.de (127.0.0.1): icmp_seq=1 ttl=64 time=0.071 ms

ping6 virt.xxxx.de
PING virt.xxxx.de(virt.xxxx.de (::1)) 56 data bytes
64 bytes from virt.xxxx.de (::1): icmp_seq=1 ttl=64 time=0.061 ms

My /etc/hosts:

78.xx.xx.98 virt.xxxx.de
127.0.0.1 virt.xxxx.de virt.xxxx.de
::1 virt.xxxx.de virt.xxxx.de

::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
2a01:xxx:xxx:71a3::2 virt.xxxx.de

Log of VM:
Fri Jul 27 16:07:54 2018 [Z0][VM][I]: New state is ACTIVE
Fri Jul 27 16:07:54 2018 [Z0][VM][I]: New LCM state is PROLOG
Fri Jul 27 16:07:56 2018 [Z0][VM][I]: New LCM state is BOOT
Fri Jul 27 16:07:56 2018 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/66/deployment.0
Fri Jul 27 16:07:57 2018 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_context.
Fri Jul 27 16:07:57 2018 [Z0][VMM][I]: ExitCode: 0
Fri Jul 27 16:07:57 2018 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Fri Jul 27 16:07:57 2018 [Z0][VMM][I]: ExitCode: 0
Fri Jul 27 16:07:57 2018 [Z0][VMM][I]: Successfully execute virtualization driver operation: deploy.
Fri Jul 27 16:07:57 2018 [Z0][VMM][I]: ExitCode: 0
Fri Jul 27 16:07:57 2018 [Z0][VMM][I]: Successfully execute network driver operation: post.
Fri Jul 27 16:07:58 2018 [Z0][VM][I]: New LCM state is RUNNING
Fri Jul 27 16:08:51 2018 [Z0][VM][I]: New LCM state is SAVE_STOP
Fri Jul 27 16:08:52 2018 [Z0][VMM][I]: /var/tmp/one/vmm/kvm/save: line 58: warning: command substitution: ignored null byte in input
Fri Jul 27 16:08:52 2018 [Z0][VMM][I]: ExitCode: 0
Fri Jul 27 16:08:52 2018 [Z0][VMM][I]: Successfully execute virtualization driver operation: save.
Fri Jul 27 16:08:52 2018 [Z0][VMM][I]: ExitCode: 0
Fri Jul 27 16:08:52 2018 [Z0][VMM][I]: Successfully execute network driver operation: clean.
Fri Jul 27 16:08:52 2018 [Z0][VM][I]: New LCM state is EPILOG_STOP
Fri Jul 27 16:08:53 2018 [Z0][TM][I]: Command execution failed (exit code: 2): /var/lib/one/remotes/tm/ssh/mv localhost:/var/lib/one//datastores/0/66 virt.xxxx.de:/var/lib/one//datastores/0/66 66 0
Fri Jul 27 16:08:53 2018 [Z0][TM][I]: mv: Moving localhost:/var/lib/one/datastores/0/66 to virt.xxxx.de:/var/lib/one/datastores/0/66
Fri Jul 27 16:08:53 2018 [Z0][TM][E]: mv: Command “set -e -o pipefail
Fri Jul 27 16:08:53 2018 [Z0][TM][I]:
Fri Jul 27 16:08:53 2018 [Z0][TM][I]: tar -C /var/lib/one/datastores/0 --sparse -cf - 66 | ssh virt.xxxx.de ‘tar -C /var/lib/one/datastores/0 --sparse -xf -’
Fri Jul 27 16:08:53 2018 [Z0][TM][I]: rm -rf /var/lib/one/datastores/0/66” failed: tar: 66: Cannot stat: No such file or directory
Fri Jul 27 16:08:53 2018 [Z0][TM][I]: tar: Exiting with failure status due to previous errors
Fri Jul 27 16:08:53 2018 [Z0][TM][E]: Error copying disk directory to target host
Fri Jul 27 16:08:53 2018 [Z0][TM][E]: Error executing image transfer script: Error copying disk directory to target host
Fri Jul 27 16:08:53 2018 [Z0][VM][I]: New LCM state is EPILOG_STOP_FAILURE

You are still hitting the same issue: the files are being moved within the same host, but the mv script cannot recognize that it is the same host, so it deletes the files instead of bailing out.
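
To see which names the driver was actually given, the mv arguments can be pulled out of the TM log line. The sketch below is a hypothetical helper (`transfer_hosts` is not an OpenNebula tool), and the regular expression is only an assumption based on the log format seen in this thread.

```python
import re
from typing import Optional, Tuple

# Matches the arguments of a tm/<driver>/mv call: "<host>:<path> <host>:<path>".
MV_CALL = re.compile(r"/tm/\w+/mv\s+([^\s:]+):\S+\s+([^\s:]+):\S+")


def transfer_hosts(log_line: str) -> Optional[Tuple[str, str]]:
    """Return (source_host, destination_host) from a TM log line, if present."""
    match = MV_CALL.search(log_line)
    return (match.group(1), match.group(2)) if match else None


# The failing line from the VM log above.
LINE = (
    "Fri Jul 27 16:08:53 2018 [Z0][TM][I]: Command execution failed "
    "(exit code: 2): /var/lib/one/remotes/tm/ssh/mv "
    "localhost:/var/lib/one//datastores/0/66 "
    "virt.xxxx.de:/var/lib/one//datastores/0/66 66 0"
)

if __name__ == "__main__":
    print(transfer_hosts(LINE))
```

For the line above this gives localhost as the source and virt.xxxx.de as the destination. Editing /etc/hosts changes what those names resolve to, not the names themselves, so if the check compares names it would still see two different hosts.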

BR,
Anton

Yes, but I thought fixing the hostname resolution should fix the issue, which is unfortunately not the case.

It was a wild guess. Something else needs to be fixed, but I am out of ideas.

BR,
Anton