LizardFS, anyone?


(Gandalf Corvotempesta) #1

Anyone using LizardFS in production willing to share some details?
How many VMs? Which hardware configuration is suggested?
Any advice on getting the best performance?

I’m evaluating both Gluster and LizardFS.


(Florian Heigl) #2

For LizardFS you could do it like me and buy the whole thing preconfigured from NodeWeavers :wink:

What I’m running on my home cloud:
2 Nodes

  • ~3.5TB each of high-endurance SSD per node (Samsung SV843 and HGST HSUML400 - both are around 7PBW. That means each has an endurance rating of, idk, 20 normal Samsungs…)
  • 64GB Ram each
  • Dual 10GE Backend

One node is a low-power XeonD, the other is a highend Xeon E5.
I’m planning to soon add another low-power node which would have the same spec as the existing one.

I’ve found performance gets bad if I launch over 80 VMs, which is OK in my book.

I can’t say much about “best” performance. I’m getting good performance, which means around 4k random/rw iops in my standard fio-based test.
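The post doesn't show the actual fio job, so the following is only a guess at what a small mixed random read/write test could look like. The block size, queue depth, file size, runtime, read/write mix and target path are all assumptions. The function only builds the command line; it does not run fio.

```python
import shlex

def build_fio_command(target_path, block_size="4k", iodepth=32,
                      size="4G", runtime_s=60):
    """Build (but do not run) a fio command line for a mixed random read/write test.

    All defaults are illustrative assumptions, not the settings used in the post.
    """
    return [
        "fio",
        "--name=randrw-test",
        f"--filename={target_path}",
        "--rw=randrw",          # mixed random reads and writes
        "--rwmixread=50",       # assumed 50/50 read/write split
        f"--bs={block_size}",
        "--ioengine=libaio",
        "--direct=1",           # bypass the page cache
        f"--iodepth={iodepth}",
        f"--size={size}",
        f"--runtime={runtime_s}",
        "--time_based",
        "--output-format=json",
    ]

# Hypothetical path on the mounted shared filesystem.
cmd = build_fio_command("/mnt/lizardfs/fio-testfile")
print(shlex.join(cmd))
```

Running the same job from inside a VM and from the host would help separate storage limits from virtualization overhead.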
I expect the performance to scale up with the third node. I won’t need more compute power than those 3 combined, so if I wanted even more performance I would probably add some more ARM64-based servers which only do storage, not compute.

LizardFS has pretty low requirements on memory and CPU. It is a lot more forgiving than GlusterFS was years ago when I played with it. (I know it’s better now. Still, LizardFS feels quite robust to me).

If you replace a disk you’ll see some re-filling traffic and that traffic is quite overwhelming if you have two nodes only. You would ideally have 10 nodes of 2 disks to have a nicely distributed recovery.
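To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes copies of a chunk never sit on the same node and that the surviving copies are spread evenly over all disks on the other nodes; real placement is not that even, and the function name is made up for illustration.

```python
def refill_share_per_source_disk(nodes, disks_per_node):
    """Fraction of the refill data each disk on the *other* nodes has to supply.

    Simplified model: replicas are spread evenly over every disk that is
    not on the node holding the replaced disk.
    """
    source_disks = (nodes - 1) * disks_per_node
    if source_disks <= 0:
        raise ValueError("need at least one other node to refill from")
    return 1.0 / source_disks

for nodes, disks in [(2, 1), (3, 1), (10, 2)]:
    share = refill_share_per_source_disk(nodes, disks)
    print(f"{nodes} nodes x {disks} disk(s): each source disk supplies {share:.1%}")
```

With two nodes the single peer carries the whole refill, while in the 10 x 2 layout the same load is split across 18 disks.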

I’ve also done a test run of LizardFS itself last year. That was 4 nodes with 64GB each, and only one GigE interconnect.
For that I also used SSD caching via bcache, and the data was on a few large SATA disks.
The working set was FTP and a few TB in size (so it didn’t fit in RAM, and also not in the SSD cache).
The shared filesystem handled a few 100 FTP clients downloading/uploading without any bottlenecks.
NodeWeaver includes some special patches for better locality (read where the VM is running) and I didn’t even have those patches in my benchmark.

Since a normal setup takes only about 15 minutes, I’d recommend you give it a shot.


(Jan "Yenya" Kasprzak) #3

Hello,

Can I add CEPH to the mix? @gandalf, why did you rule out CEPH? I am also interested in a comparison between CEPH and LizardFS.

I use CEPH myself, and so far it looks good. On another project, we used GlusterFS, and in my opinion, CEPH is much better than GlusterFS.

My setup is as follows: I have ~25 OpenNebula/CEPH nodes (5+ years old, so I don’t expect extreme performance), each with two HDDs. I have a RAID-1 volume for / and swap, and the rest of the disk space is used without RAID as two CEPH OSDs.

Using 5+ year old disks, which had been powered off for at least two years, allowed me to assess the robustness of CEPH :-). So far I have not experienced any loss of data, even though out of the original ~50 disks about 10 have failed and been replaced so far.

I use only CEPH RBD, not the CEPH filesystem, although I plan to use S3/Swift via radosgw as object storage for a different project.

Recently, I added an SSD-only pool and an SSD-first pool (the primary replica is kept on SSD, the other replicas on spinning rust), and I plan to do performance testing of ONe VMs soon (probably after the upgrade to 5.2).

There is also an older thread on the same topic, which mentions even more storage systems:

So, can anybody compare CEPH to LizardFS?

Thanks,

-Yenya


(Florian Heigl) #4

For the record, I found I had wrongly connected one of the two nodes, so the perf I saw equals only a gigabit connection ;>


(Florian Heigl) #5

I think you did absolutely right with having 25 Ceph nodes!
I often tell people ‘No, just DON’T’ if they wanna build 2-7 node Ceph clusters, it can’t get good results in a very small (in Ceph terms) setup.

I’ve not used Ceph with OpenNebula, that’s why I can’t comment too much.
The LizardFS feature set is a lot smaller, but it exactly matches what I really need (i.e. I do not need geo-replication with site failover, but I do need “runs well even if I only need 2-3 nodes”).

For what I have it’s great. Performance is nice, too.

I’d probably also go with Ceph for anything above 20 dedicated storage nodes, especially when expecting a lot more growth.

My gut numbers are: LizardFS makes sense from 2-100 nodes, and Ceph from 20-1000.
Ceph has a few little extras to ease large setup maintenance, like udev rules for disk plugging and all that kind of stuff.
I don’t need any of that, I need simple, fast, and also something that is very forgiving if I make a mistake. Ceph doesn’t forgive. :slight_smile:


(Jan "Yenya" Kasprzak) #6

I have found an interesting difference: as I said, I use CEPH on raw (non-RAID) disks. In the LizardFS whitepaper, they constantly mention setups using RAID on storage nodes, so my guess is they might have less robust handling of a disk failure.

I know, handling disk failures (and even near-failures) is hard, and I have been positively surprised that CEPH handles them correctly.


(Gandalf Corvotempesta) #7

I’ve not ruled out Ceph.
Gluster is way easier to configure, run and understand than Ceph, and has similar performance.

Additionally, a standard Gluster cluster starts with 3 servers and no metadata server. Ceph requires at least 6 servers (3 OSDs, 3 mons).

I really prefer Gluster, and with libgfapi integrated in qemu, performance is much better than with FUSE.


(Jan "Yenya" Kasprzak) #8

Just a minor correction: yes, you need 3 storage nodes for meaningful data replication, but CEPH happily runs with just one mon. Moreover, mons and OSDs can be colocated on the same server. So for a minimal meaningful CEPH configuration you need 3 servers, and run 3 OSDs and 1 or 3 mons on them.

I don’t get your last sentence mentioning fuse: qemu has CEPH/RADOS support built in, and uses the RADOS protocol natively via librados. No need for fuse there.

Anyway, do you have any experience with disk failures? How well does GlusterFS handle them? Does it also handle partial failures (such as a bad sector resulting in an unreadable part of a single file)?

Thanks!

-Yenya


(Gandalf Corvotempesta) #9

Running with 1 mon is a sure way to run into trouble.
The minimum is 3 mons. Yes, they can be colocated with the OSDs, but that is not good practice.
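The 1-versus-3 question comes down to majority quorum: the monitors need a strict majority of their number reachable to keep working. A small sketch of the arithmetic; the function names are just for illustration and nothing in the code is Ceph-specific.

```python
def quorum_size(monitors):
    """Smallest strict majority of `monitors`."""
    return monitors // 2 + 1

def tolerated_failures(monitors):
    """How many monitors can fail while a majority is still reachable."""
    return monitors - quorum_size(monitors)

for n in (1, 2, 3, 5):
    print(f"{n} mon(s): quorum={quorum_size(n)}, can lose {tolerated_failures(n)}")
```

One mon and two mons both tolerate zero failures, which is why odd counts starting at 3 are the usual recommendation.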

Gluster doesn’t have mons at all, thus there is no need to locate multiple services on the same nodes.

I wasn’t referring to Ceph with that sentence; I was referring to Gluster. Gluster with libgfapi performs very well, and with the sharding feature enabled, even healing doesn’t cause VM lockups. I’ve tested a rebalance and a full heal with a single VM running a batch of kernel extract+compile+delete+rsync and so on, generating high I/O, and none of these processes was stopped during the whole healing phase. I unplugged the network cable from 1 server; Gluster immediately noticed this and no I/O was lost. When I plugged the cable back in, healing started and, again, no I/O was lost.
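Why sharding helps with healing can be shown with a little arithmetic: if a VM image is stored as fixed-size shards, only the shards written while a brick was offline need to be healed, not the whole image. A rough sketch; the 64 MiB shard size, the 100 GiB image and the function name are assumptions for illustration, not values from this thread.

```python
SHARD_SIZE = 64 * 1024**2  # assumed shard size, for illustration only

def shards_touched(offset, length, shard_size=SHARD_SIZE):
    """Return the indices of the fixed-size shards covered by a write."""
    if length <= 0:
        return []
    first = offset // shard_size
    last = (offset + length - 1) // shard_size
    return list(range(first, last + 1))

image_size = 100 * 1024**3                      # hypothetical 100 GiB VM image
total_shards = -(-image_size // SHARD_SIZE)     # ceiling division
dirty = shards_touched(offset=5 * 1024**3, length=10 * 1024**2)
print(f"{len(dirty)} of {total_shards} shards would need healing")
```

Without sharding, the unit of healing in this model would be the entire image file.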

I also disconnected a single disk from the kernel (I don’t remember the exact command I used, but I did not physically remove the disk), and Gluster immediately noticed the brick failure.

I still have to test the single bad sector case, but I’ll probably go with RAID (or, better, ZFS with RAIDZ), so a single bad sector should not be an issue. The same would happen with Ceph.

But let me ask on the Gluster mailing list, I’m curious.


(Gandalf Corvotempesta) #10

Anyway, Gluster has bit-rot detection that could solve this issue automatically by resyncing from the other nodes.
The bad sector is automatically remapped by the disk itself, so a new write should land on a good portion of the disk.
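The general idea behind checksum-based bit-rot detection, as a generic sketch: record a checksum while the data is known to be good, recompute it during a later scrub, and treat a mismatch as the signal to re-sync from another replica. This is not Gluster's code; the function names and the in-memory "stored" checksum are made up for illustration.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Checksum recorded while the data is known to be good."""
    return hashlib.sha256(data).hexdigest()

def scrub(data: bytes, stored_checksum: str) -> bool:
    """Return True if the data still matches the checksum recorded earlier."""
    return checksum(data) == stored_checksum

good = b"vm disk block contents"
signature = checksum(good)

corrupted = bytearray(good)
corrupted[3] ^= 0x01                        # flip a single bit

print(scrub(good, signature))               # True
print(scrub(bytes(corrupted), signature))   # False
```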


(Gandalf Corvotempesta) #11

The kernel detects a disk failure, sets the FS to read-only and disables the disk.
When the FS is read-only, the Gluster brick process pointing to that disk will exit, and thus Gluster detects a failed brick.


(Jan "Yenya" Kasprzak) #12

I think your notion of kernel behaviour during disk failures is not correct. First of all, this is heavily filesystem dependent. Moreover, when the bad sector is in the file data (as opposed to filesystem metadata), read(2) returns something like ENXIO, and the filesystem continues operating. When the bad sector is in the filesystem metadata, most filesystems remount themselves read-only (AFAIK with ext*fs, the exact behaviour can be set via tune2fs(8) as “remount r/only”, “panic”, and “ignore” for the brave :-). When the disk is bad to the point of generating an unplug/replug sequence (e.g. SATA channel reset), the filesystem starts returning ENXIO for all operations, but it is still mounted. On systemd-based distributions, systemd sometimes detects an unplugged disk (if it is mounted via an /etc/fstab entry), and umount(2)s it.

The kernel itself does not disable the disk, nor does it remount it read-only in response to every type of failure.
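The data-sector case is visible from userspace: a failed read surfaces as an error on that one call, while reads of other regions of the same file keep working. A rough sketch of a reader that records which chunks fail instead of giving up; the function name and chunk size are arbitrary choices for illustration, and it assumes a Unix-like system where `os.pread` is available.

```python
import errno
import os

def scan_file(path, chunk_size=1024 * 1024):
    """Read a file chunk by chunk; return (bytes_read, failed_chunks).

    failed_chunks lists (offset, errno_name) for each chunk whose read failed.
    """
    failed = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        for offset in range(0, size, chunk_size):
            try:
                data = os.pread(fd, chunk_size, offset)
                bytes_read += len(data)
            except OSError as exc:
                # An unreadable region fails this call only; keep scanning.
                name = errno.errorcode.get(exc.errno, str(exc.errno))
                failed.append((offset, name))
    finally:
        os.close(fd)
    return bytes_read, failed
```

On a healthy file the second element is an empty list. Whether a storage daemon reacts to such per-read errors, or only to a read-only remount, is exactly the difference being discussed here.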


(Gandalf Corvotempesta) #13

Every time I had bad sectors on disks, the filesystem was automatically remounted as RO (yes, this is tunable). This is enough for Gluster to start its healing procedures.


(Gandalf Corvotempesta) #14

So, nobody is using Lizard in production willing to describe their infrastructure? How many nodes? How many disks and so on.