OpenNebula & DFS like GlusterFS, Sheepdog & Ceph

Hello,

we at ungleich.ch are testing OpenNebula with Ceph, Gluster and Sheepdog backends. So far we have collected various results,
roughly summarized as:

  • Very bad performance (<30 MiB/s write speed) and VM kernel panics on Ceph
  • Good to great performance with GlusterFS 3.4.2 and 3.6.2 on Ubuntu 14.04 and 3.6.2 on CentOS 7: >50 MiB/s in the VM
  • Bad performance with Sheepdog (~11 MiB/s in the VM), though based on only a small amount of test data from a short test
  • We mostly looked at the Sheepdog integration status - as of 2015-02-15 there seems to be some cleanup required before things work smoothly
  • We think that in theory Sheepdog would be the best fit for a VM cluster, as it is simple and designed solely for VM images
  • We were running Sheepdog in a qemu-only cluster before with great performance
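For reference, in-VM numbers like these can be reproduced with a simple sequential-write test inside the guest; a minimal sketch (file names and sizes here are arbitrary, not what we actually used):

# quick-and-dirty sequential write with O_DIRECT, ~2 GiB
dd if=/dev/zero of=/var/tmp/writetest.img bs=1M count=2048 oflag=direct

# or a slightly more controlled run with fio
fio --name=seqwrite --rw=write --bs=1M --size=2G --direct=1 --filename=/var/tmp/fio.test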

We are interested in your experiences with various filesystems and wanted to share ours here as well.

Try LizardFS http://lizardfs.com/

We are using it on our OpenNebula-powered platform (NodeWeaver). You can simply mount /var/lib/one on all nodes to the LizardFS root and you're good to go, just as if it were an NFS share, but far more scalable.
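For anyone who wants to try it, the mount itself is a one-liner per node; a minimal sketch (assuming your master host is reachable as mfsmaster, and using the MooseFS-derived mfsmount client that LizardFS ships):

# mount the LizardFS root onto OpenNebula's working directory (repeat on every node)
mfsmount /var/lib/one -H mfsmaster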

Very reliable, and we achieved great performance with a bit of SSD caching and some tuning; some numbers are below.


A great advantage of LizardFS is that you can use it with only minor modifications to the shared-filesystem drivers and host everything ONE-related in a single, reliable datastore (the TM driver for MooseFS and LizardFS is here: http://wiki.opennebula.org/ecosystem:moosefs ).
On a two-node setup (2 rotational devices + 2 EnhanceIO SSD caches) we got 11K write IOPS, and we easily reach 90 MB/s within the VMs.
Another advantage is the copy-on-write snapshot capability, which greatly extends what you can do in OpenNebula with thinly provisioned images, without performance problems.
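As an illustration of that point, a snapshot of an image inside the mount is a single command (a sketch using the MooseFS-derived tool names; the paths are placeholders, not from our setup):

# lazy, copy-on-write copy of a VM image
mfsmakesnapshot /var/lib/one/datastores/1/source-image.qcow2 /var/lib/one/datastores/1/source-image-snap.qcow2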


Hello Carlo,
Short question: did you evaluate any other caching systems (e.g. flashcache, bcache, dm-cache)?
I am currently testing different setups - right now bcache and LSI CacheCade - and bcache looks promising (rough setup sketch below). Perhaps you made similar tests and can share your experiences.
I also tried Gluster (which was very good from a performance point of view, especially on 10 Gbit networks, but the usability is not yet very nice).
And at the last ONE conf everybody seemed happy with Ceph, but that requires a more expensive hardware footprint.
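A rough sketch of the bcache part of such a test setup (device names are placeholders; make-bcache comes from bcache-tools):

# SSD as cache device, HDD as backing device
make-bcache -C /dev/nvme0n1
make-bcache -B /dev/sdb

# attach the backing device to the cache set (UUID from 'bcache-super-show /dev/nvme0n1')
echo <cset-uuid> > /sys/block/bcache0/bcache/attach

# the filesystem (e.g. a Gluster brick) then goes on top of /dev/bcache0
mkfs.xfs /dev/bcache0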

Best, Michael

Sheepdog was horrendously unstable when I last used it. Sometimes, simple storage host reboots would destroy all data in the cluster.

I’ve been using Ceph with KVM on Ubuntu for a few years now (since the Argonaut release) and have had very few problems with it. I’ve only recently added OpenNebula to the mix, but it dropped right in with no changes needed to my Ceph config.

The biggest drawback is that Ceph is a network hog. Make sure you have lots and lots of bandwidth for it; if you don’t, you may start seeing enough I/O lag on your VMs to cause problems.
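One thing that helps (a sketch, not taken from my actual config; the subnets are placeholders) is giving Ceph a dedicated replication network so OSD recovery traffic doesn’t compete with client I/O. The relevant ceph.conf options are:

[global]
    public network  = 192.168.1.0/24    # client / VM traffic
    cluster network = 192.168.2.0/24    # OSD replication and recovery traffic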

I’ve not used gluster so I can’t speak to it.

We are using bcache-backed GlusterFS bricks. Don’t expect miracles in benchmarks, but there is a measurable increase in IOPS. Keep in mind, though, that for KVM at least there are quite a few tunables that should be addressed before looking at SSD caching as a performance enhancer.
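For example, the libvirt disk definition is where several of those tunables live; a sketch, not a universal recommendation (the image path is a placeholder):

<disk type='file' device='disk'>
  <!-- cache='none' with io='native' and virtio is a common starting point on shared storage -->
  <driver name='qemu' type='qcow2' cache='none' io='native'/>
  <source file='/var/lib/one/datastores/0/example-disk.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>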

I’m also not a fan of the qemu-glusterfs integration; it doesn’t feel complete yet and there is still some work to be done. Keeping the shared-filesystem layer separate from the hypervisor also makes support easier. We are using a glusterfs-fuse backed shared filesystem and it’s working great so far with qemu images.

We are using Sheepdog for a non-production cluster. It is much more stable in version 0.9.1. You also need QEMU version 1.7 or higher for auto-failover support.
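If you want to try it, QEMU talks to Sheepdog directly; a minimal sketch (the VDI name is a placeholder, and 'dog' is the CLI that replaced 'collie'):

# create a 10G VDI on the Sheepdog cluster and list it
qemu-img create sheepdog:one-test-disk 10G
dog vdi list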

We are using GlusterFS with shared storage and the performance is good. The bad point is that sometimes a node hangs and must be rebooted (and the documentation is quite poor, IMHO).

Hey,

great thread! I am just looking into the same issue, and so far I plan to try GlusterFS first.
@nico_opennebula_org : could you describe the storage hardware you use?
I am running on IBM PureFlex nodes and a Storwize storage solution attached via 10 Gbit FCoE.

Cheers,
Christian

Hey Christian,

we are using a very simple architecture: each of our clusters consists of 2 nodes. Each node has two network cards, one connected to the public network and one connected directly to the other host.

The hosts use only the replicated volume type, and we build n of these Gluster clusters.
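For reference, such a two-node replicated volume boils down to something like this (host names and brick paths are placeholders, not our actual ones):

gluster peer probe node2
gluster volume create one replica 2 node1:/data/gluster/brick node2:/data/gluster/brick
gluster volume start one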

Hardware-wise they are mid-range servers (16-128 GiB RAM, 8-32 cores, 1-12 TB of storage).

Cheers,

Nico

I finally managed to set up Gluster in my environment. I use three IBM PureFlex nodes, each attached via a 10 Gbit FCoE uplink to its own volume managed by an IBM Storwize V7000. These volumes are served by a GlusterFS server on each node and accessed by OpenNebula from each node respectively. The volume has a combined size of 6 TB and runs in standard distributed mode (that is: no replication here).

Results from first tests:

  • Overall performance is pretty good (will conduct benchmarks later)
  • Deploying 40 VMs is very quick (<4 minutes)
  • Of 40 simultaneous deployments, 6-10 VMs fail to deploy and have to be re-deployed

Does anybody else see failing deployments with GlusterFS as well?

What options are you using? We’ve had no issues with GlusterFS 3.4.3 (our current version) and we have tested massive loads (50-60 Gbps) across our distributed-replicate clusters. The options on the VMs and on the storage matter. I would hold back on the bleeding edge (GlusterFS 3.6.x) if possible. Are you sure your deployment issues are storage related?

Hi,

Thanks for your reply!
I am using bleeding edge gluster as it seems:

glusterfs.x86_64               3.6.2-1.el7   @glusterfs-epel
glusterfs-api.x86_64           3.6.2-1.el7   @glusterfs-epel
glusterfs-cli.x86_64           3.6.2-1.el7   @glusterfs-epel
glusterfs-fuse.x86_64          3.6.2-1.el7   @glusterfs-epel
glusterfs-libs.x86_64          3.6.2-1.el7   @glusterfs-epel
glusterfs-server.x86_64        3.6.2-1.el7   @glusterfs-epel

And these are my volume settings, including the options from the ‘virt’ group, which the OpenNebula documentation advises you to set (the command that applies them is sketched after the listing):

Volume Name: one
Type: Distribute
Volume ID: e64309f5-88d8-4d55-9272-16611acebe25
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: molokai:/data/gluster/brick
Brick2: lanai:/data/gluster/brick
Brick3: maui:/data/gluster/brick
Options Reconfigured:
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: on
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
storage.owner-gid: 9869
storage.owner-uid: 9869
server.allow-insecure: on
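(Most of the options above are what the predefined ‘virt’ group applies, plus the owner and allow-insecure settings; a sketch of how they are set, assuming the volume name ‘one’ from above:)

gluster volume set one group virt
gluster volume set one storage.owner-uid 9869
gluster volume set one storage.owner-gid 9869
gluster volume set one server.allow-insecure on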

As for the VM options: what do you mean? I use the qcow2 driver on images that are in qcow2 format. What other options could I set on the VMs regarding GlusterFS?

Hi Shankhadeep Shome,

I’m just wondering whether we are exposing all the parameters needed for tuning VM performance. Are you currently relying on RAW? Could we benefit from exposing some of these parameters in Sunstone as advanced options?

Cheers

The exact error message is this:

Wed Mar 11 12:35:57 2015 [Z0][TM][I]: Command execution fail: /var/lib/one/remotes/tm/shared/clone molokai:/var/lib/one/datastores/114/3fbc702cdff9fdced57c7b95c33b2459 lanai:/var/lib/one//datastores/120/169/disk.0 169 114
Wed Mar 11 12:35:57 2015 [Z0][TM][I]: clone: Cloning /var/lib/one/datastores/114/3fbc702cdff9fdced57c7b95c33b2459 in lanai:/var/lib/one//datastores/120/169/disk.0
Wed Mar 11 12:35:57 2015 [Z0][TM][E]: clone: Command "cd /var/lib/one/datastores/120/169; cp /var/lib/one/datastores/114/3fbc702cdff9fdced57c7b95c33b2459 /var/lib/one/datastores/120/169/disk.0" failed: Warning: Permanently added 'lanai,141.22.29.23' (ECDSA) to the list of known hosts.
Wed Mar 11 12:35:57 2015 [Z0][TM][I]: sh: line 3: cd: /var/lib/one/datastores/120/169: No such file or directory
Wed Mar 11 12:35:57 2015 [Z0][TM][I]: cp: cannot create regular file '/var/lib/one/datastores/120/169/disk.0': No such file or directory
Wed Mar 11 12:35:57 2015 [Z0][TM][E]: Error copying molokai:/var/lib/one/datastores/114/3fbc702cdff9fdced57c7b95c33b2459 to lanai:/var/lib/one//datastores/120/169/disk.0
Wed Mar 11 12:35:57 2015 [Z0][TM][I]: ExitCode: 1
Wed Mar 11 12:35:57 2015 [Z0][TM][E]: Error executing image transfer script: Error copying molokai:/var/lib/one/datastores/114/3fbc702cdff9fdced57c7b95c33b2459 to lanai:/var/lib/one//datastores/120/169/disk.0

So the problem results from a directory not being created. I checked, and the directory really does not get created.
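(A quick sanity check, using the host and datastore IDs from the log above, would be whether the system datastore path exists and is actually on the Gluster mount on the target host:)

ssh lanai 'df -h /var/lib/one/datastores/120 && ls -ld /var/lib/one/datastores/120'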

These are the current storage/VM settings; I am just using qcow2 over glusterfs-fuse.

root@XXXXXXX:~# gluster volume info
 
Volume Name: PRODVMCLUSTERSTORE1
Type: Distributed-Replicate
Volume ID: 4cf1fbfd-caf7-44eb-a9fd-081dbd69a979
Status: Started
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Bricks:
Brick1: sgwa-glusterfs:/GLUSTERBRICK1/GLUSTERBRICK1
Brick2: sgwc-glusterfs:/GLUSTERBRICK1/GLUSTERBRICK1
Brick3: sgwb-glusterfs:/GLUSTERBRICK1/GLUSTERBRICK1
Brick4: sgwd-glusterfs:/GLUSTERBRICK1/GLUSTERBRICK1
Brick5: sgwa-glusterfs:/GLUSTERBRICK2/GLUSTERBRICK2
Brick6: sgwc-glusterfs:/GLUSTERBRICK2/GLUSTERBRICK2
Brick7: sgwb-glusterfs:/GLUSTERBRICK2/GLUSTERBRICK2
Brick8: sgwd-glusterfs:/GLUSTERBRICK2/GLUSTERBRICK2
Brick9: sgwa-glusterfs:/GLUSTERBRICK3/GLUSTERBRICK3
Brick10: sgwc-glusterfs:/GLUSTERBRICK3/GLUSTERBRICK3
Brick11: sgwb-glusterfs:/GLUSTERBRICK3/GLUSTERBRICK3
Brick12: sgwd-glusterfs:/GLUSTERBRICK3/GLUSTERBRICK3
Brick13: sgwa-glusterfs:/GLUSTERBRICK4/GLUSTERBRICK4
Brick14: sgwc-glusterfs:/GLUSTERBRICK4/GLUSTERBRICK4
Brick15: sgwb-glusterfs:/GLUSTERBRICK4/GLUSTERBRICK4
Brick16: sgwd-glusterfs:/GLUSTERBRICK4/GLUSTERBRICK4
Options Reconfigured:
server.allow-insecure: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-uid: 2500
storage.owner-gid: 2500
 
root@XXXXXX:~# cat /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
/dev/mapper/SYSTEM-STORAGEOS /               ext4    errors=remount-ro 0       1
/dev/mapper/SYSTEM-SWAP none            swap    sw              0       0
/dev/FSSTORAGEVOL/GLUSTERBRICK1  /GLUSTERBRICK1  xfs     inode64         0       0
/dev/FSSTORAGEVOL/GLUSTERBRICK2  /GLUSTERBRICK2  xfs     inode64         0       0
/dev/FSSTORAGEVOL/GLUSTERBRICK3  /GLUSTERBRICK3  xfs     inode64         0       0
/dev/FSSTORAGEVOL/GLUSTERBRICK4  /GLUSTERBRICK4  xfs     inode64         0       0
 
root@XXXXXX:~# lvs | grep GLUSTER
  GLUSTERBRICK1 FSSTORAGEVOL -wi-ao---   4.00t
  GLUSTERBRICK2 FSSTORAGEVOL -wi-ao---   4.00t
  GLUSTERBRICK3 FSSTORAGEVOL -wi-ao---   4.00t
  GLUSTERBRICK4 FSSTORAGEVOL -wi-ao---   4.00t

Client Side Mount Settings

mount -t glusterfs -o backupvolfile-server=sgwd-glusterfs,log-level=WARNING,log-file=/var/log/gluster.log sgwa-glusterfs:/PRODVMCLUSTERSTORE1 /var/lib/one/datastores/100

VM Settings

<disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/var/lib/one/datastores/100/firs_storage_server2.qcow2'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>