In the previous two installments we covered what ZFS is, and then went into some detail installing ZFS on a Mac OS X system and creating a simple mirrored pool on 2 USB disks with it. Now we will install the Linux version as it exists today on a Linux (Ubuntu) machine and take it for a spin.
Since the ZFS code was released a few years ago, geeks have been hoping and waiting for it to be ported to Linux. However, because of licensing differences (which you can read about here), developers have been hesitant to port the CDDL-licensed code for use in the GPL-licensed Linux kernel. But there is another solution in the form of FUSE, or Filesystem in Userspace. Filesystems can be built to run with FUSE without having to worry as much about licensing restrictions: the FUSE library that a filesystem implementation links against is LGPL-licensed and thus can be dynamically linked to code under any license.
And someone has implemented ZFS as a FUSE filesystem in the zfs-fuse project. Initial reaction to the project was varied, but the feeling among many Linux users was that this was not a “real” implementation of ZFS, that it couldn’t be used in production environments or on desktops, and that it would be slow. Such criticism is probably unfounded: recent benchmarks show that the FUSE-based NTFS-3G filesystem can approach and match the speed of some in-kernel filesystems, and the company behind NTFS-3G is known to have a commercial version of the filesystem that is even faster. In addition, the developer in charge of porting ZFS to FUSE has stated a number of times on the zfs-fuse Google group that he believes FUSE is not in any way a performance bottleneck. We shall see :)
Since these tests are being done on an Ubuntu 8.10 system, I decided to forgo building the zfs-fuse package from source and use binaries provided on the ppa.launchpad.net site:
Add these repositories to /etc/apt/sources.list.d/zfs-fuse.list:
<pre>
deb http://ppa.launchpad.net/brcha/ubuntu intrepid main
deb-src http://ppa.launchpad.net/brcha/ubuntu intrepid main
</pre>
If you are using 8.04 LTS, replace intrepid with hardy in those lines.
Then update the package database and install zfs-fuse:
<pre>
sudo apt-get update
</pre>
<pre>
sudo apt-get install zfs-fuse
</pre>
APT should find any needed dependencies and install them.
This code is synced with Solaris Nevada build 98, which is much newer than the OS X code I tested. The on-disk format is at version 13 with this build. The developer describes the current codebase as feature complete, with only bugs being fixed right now, though there is significant development going on upstream, for instance the ZFS encryption project.
I’ll start by creating a single-drive zpool on an 80GB internal drive:
<pre>
<strong>root@live:~#</strong> sudo zpool create wd80 /dev/sdb
<strong>root@live:~#</strong> zpool list
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
wd80   74.5G    72K  74.5G     0%  ONLINE  -
<strong>root@live:~#</strong> zpool status
  pool: wd80
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        wd80        ONLINE       0     0     0
          sdb       ONLINE       0     0     0

errors: No known data errors

<strong>root@live:~#</strong> mount
/dev/sdd1 on / type ext3 (rw,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw)
wd80 on /wd80 type fuse (rw,allow_other)
</pre>
Notice that the zpool is created with a filesystem on it by default, and that filesystem has been mounted automatically. There are a number of options that can be set both for zpools and for individual filesystems:
<pre>
<strong>root@live:~#</strong> zpool get all wd80
NAME  PROPERTY     VALUE       SOURCE
wd80  size         74.5G       -
wd80  used         72K         -
wd80  available    74.5G       -
wd80  capacity     0%          -
wd80  altroot      -           default
wd80  health       ONLINE      -
wd80  guid         13649828496710842371  -
wd80  version      13          default
wd80  bootfs       -           default
wd80  delegation   on          default
wd80  autoreplace  off         default
wd80  cachefile    -           default
wd80  failmode     wait        default
<strong>root@live:~#</strong> zfs get all wd80
NAME  PROPERTY              VALUE                  SOURCE
wd80  type                  filesystem             -
wd80  creation              Thu Mar  5 22:08 2009  -
wd80  used                  67.5K                  -
wd80  available             73.3G                  -
wd80  referenced            18K                    -
wd80  compressratio         1.00x                  -
wd80  mounted               yes                    -
wd80  quota                 none                   default
wd80  reservation           none                   default
wd80  recordsize            128K                   default
wd80  mountpoint            /wd80                  default
wd80  sharenfs              off                    default
wd80  checksum              on                     default
wd80  compression           off                    default
wd80  atime                 on                     default
wd80  devices               on                     default
wd80  exec                  on                     default
wd80  setuid                on                     default
wd80  readonly              off                    default
wd80  zoned                 off                    default
wd80  snapdir               hidden                 default
wd80  aclmode               groupmask              default
wd80  aclinherit            restricted             default
wd80  canmount              on                     default
wd80  shareiscsi            off                    default
wd80  xattr                 on                     default
wd80  copies                1                      default
wd80  version               3                      -
wd80  utf8only              off                    -
wd80  normalization         none                   -
wd80  casesensitivity       sensitive              -
wd80  vscan                 off                    default
wd80  nbmand                off                    default
wd80  sharesmb              off                    default
wd80  refquota              none                   default
wd80  refreservation        none                   default
wd80  primarycache          all                    default
wd80  secondarycache        all                    default
wd80  usedbysnapshots       0                      -
wd80  usedbydataset         18K                    -
wd80  usedbychildren        49.5K                  -
wd80  usedbyrefreservation  0                      -
</pre>
You can guess what most of those options do, but one of the interesting ones is the mountpoint property. You can set the directory where a ZFS filesystem will be mounted, and the system will respect that choice. The setting is stored within the filesystem itself, so if a drive is moved from one machine to another, the mountpoint choice stays with the drive. This is a bit of a departure from traditional UNIX systems, where the mountpoint for a filesystem is either set in /etc/fstab or chosen dynamically by a higher-level system in one of the various desktop packages.
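As a rough sketch of changing that property, assuming the wd80 pool from above and a hypothetical target directory /mnt/storage (not something from my setup), the commands would look something like this. The run helper is my own illustration, not part of ZFS: it only prints each command unless DRY_RUN=0 is set, so nothing gets remounted by accident.

```shell
#!/bin/sh
# Dry-run wrapper (illustrative helper, not part of ZFS): prints commands
# by default and only executes them when DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

POOL=wd80                 # pool created earlier in this post
NEW_MOUNT=/mnt/storage    # hypothetical directory, pick your own

# Tell ZFS where the pool's root filesystem should be mounted.
run zfs set mountpoint="$NEW_MOUNT" "$POOL"
# Read the property back to confirm what was stored.
run zfs get mountpoint "$POOL"
```

Run it as-is to see the two commands, then again with DRY_RUN=0 as root once the path looks right.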
One of the reasons I have been investigating ZFS on my operating systems of choice is the checksum feature: ZFS writes a 256-bit checksum for every block in the pool, then reads that checksum back and compares it whenever the data is read. In this way ZFS can tell when data was not written correctly or has become corrupted on disk. This integrity checking is entirely transparent to the user, but if needed you can manually initiate a check of the entire pool, which Sun calls a scrub. With one command ZFS will crawl the entire pool and compare every checksum to its corresponding data block. If it finds any that are incorrect, you will see a non-zero error count here:
<blockquote><pre>
<strong>root@live:~#</strong> zpool scrub wd80
<strong>root@live:~#</strong> zpool status
  pool: wd80
 state: ONLINE
 scrub: scrub in progress for 0h1m, 0.58% done, 4h46m to go
config:

        NAME        STATE     READ WRITE CKSUM
        wd80        ONLINE       0     0     0
          /dev/sdb  ONLINE       0     0     0

errors: No known data errors</pre></blockquote>
You can see that it does take a while to scrub the filesystem when there is data present, and at this particular time I had a large number of test files on the pool. If the scrub finds a problem, the READ, WRITE, or CKSUM column will indicate what kind of problem it was. The counters are also updated in real time, so if ZFS finds a problem during normal use you can issue the zpool status command and find out about it.
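To make the checksum idea concrete, here is a toy model using an ordinary temporary file and sha256sum. This is only a sketch of the principle: ZFS does the work per block inside the pool, and sha256sum is my stand-in here, not necessarily the algorithm ZFS uses by default. The file contents and function names are made up for the example.

```shell
#!/bin/sh
# Toy model of checksum-on-read with a plain file. Nothing here touches ZFS.
block=$(mktemp)
printf 'some important data' > "$block"

# Compute the checksum of a file's contents.
checksum_of() { sha256sum < "$1" | cut -d' ' -f1; }

# "Write" time: record the checksum alongside the data.
stored=$(checksum_of "$block")

# "Read" time: recompute the checksum and compare it with the stored one.
verify() {
    if [ "$(checksum_of "$1")" = "$2" ]; then
        echo "OK"
    else
        echo "CKSUM error"
    fi
}

before=$(verify "$block" "$stored")

# Simulate silent corruption by overwriting a single byte in place.
printf 'X' | dd of="$block" bs=1 seek=5 conv=notrunc 2>/dev/null
after=$(verify "$block" "$stored")

echo "before corruption: $before"
echo "after corruption:  $after"
rm -f "$block"
```

The first check passes and the second fails, which is the same kind of mismatch that would show up in the CKSUM column of zpool status.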
Now I’ll create a few arbitrary filesystems on top of the wd80 zpool:
<pre>
<strong>root@live:~#</strong> zfs create wd80/music
<strong>root@live:~#</strong> zfs create wd80/documents
<strong>root@live:~#</strong> zfs create wd80/downloads
<strong>root@live:~#</strong> zfs create wd80/movies
<strong>root@live:~#</strong> zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
wd80             170K  73.3G    22K  /wd80
wd80/documents    18K  73.3G    18K  /wd80/documents
wd80/downloads    18K  73.3G    18K  /wd80/downloads
wd80/movies       18K  73.3G    18K  /wd80/movies
wd80/music        18K  73.3G    18K  /wd80/music
</pre>
Notice that although I have created 4 individual filesystems, I never specified a size, nor was I asked for one. ZFS makes all of the available space in a zpool available to the filesystems created on it, and they share that space. As mentioned in part 1, it helps to think of it this way: you give ZFS drives to create a virtual pool of space, on top of which you then create filesystems. One reason this distinction is important is that at any time you can add storage to the pool and the existing filesystems will make use of it.
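If you do want limits on top of the shared pool, the quota and reservation properties from the zfs get all listing are the knobs for that. Below is a hedged sketch: the sizes are arbitrary examples, /dev/sdX is a placeholder for a real second disk, and the run helper is my own dry-run wrapper that only prints commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run wrapper (illustrative helper, not part of ZFS): prints commands
# by default and only executes them when DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Cap how much of the shared pool downloads may consume (example size).
run zfs set quota=20G wd80/downloads
# Guarantee documents a minimum amount of pool space (example size).
run zfs set reservation=5G wd80/documents
# Show the resulting limits for both filesystems.
run zfs get quota,reservation wd80/downloads wd80/documents
# Grow the pool with another disk; /dev/sdX is a placeholder device name.
run zpool add wd80 /dev/sdX
```

Be careful with the last command on a real system: adding a plain disk stripes it into the pool, so double-check the device name before setting DRY_RUN=0.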
These filesystems also have compression turned off by default. Sun claims that for certain workloads compression actually speeds things up by trading some CPU load for I/O load. They also claim that because of the copy-on-write design most writes are sequential, which should speed up disk writes substantially. I would imagine this also significantly reduces fragmentation on a ZFS filesystem. Performance is acceptable: writing from /dev/zero to an empty file on the pool runs at 22MB/s for me. I have read that ZFS doesn’t have to do much work to store zeros, though, so I tested a raw copy from a real device as well:
<pre>
<strong>root@live:/wd80#</strong> dd if=/dev/sdc of=uncompressedtest bs=128k count=1000
1000+0 records in
1000+0 records out
131072000 bytes (131 MB) copied, 8.83505 s, 14.8 MB/s
</pre>
This is a respectable level of performance for a beta code implementation running as a FUSE filesystem, and the developer says performance can increase substantially. I also tested writing the same data to one of my ZFS filesystems with compression enabled:
<pre>
<strong>root@live:/wd80/music#</strong> zfs set compression=on wd80/music
<strong>root@live:/wd80/music#</strong> dd if=/dev/sdc of=compressedtest bs=128k count=1000
1000+0 records in
1000+0 records out
131072000 bytes (131 MB) copied, 6.90277 s, 19.0 MB/s
</pre>
You can see that for this data, compression increased performance by a fair margin. For data that is highly compressible, the speed will increase even further. Also note that this is just one disk. Because of the dynamic striping used in ZFS, performance should increase in a roughly linear fashion with every disk added to a pool. Speeds of up to 300MB/s have been reported by zfs-fuse users with large arrays, so clearly it is possible for this code to perform very well even though it is not optimized yet.
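For anyone who wants to check the "fair margin" claim, the numbers can be recomputed from the byte count and elapsed times dd printed in the two runs above. This is just arithmetic with awk; the variable and function names are my own.

```shell
#!/bin/sh
# Recompute dd's throughput figures from the two test runs above.
LC_ALL=C; export LC_ALL   # keep awk's decimal point a "." in any locale

BYTES=131072000           # 1000 blocks of 128 KiB, as reported by dd
T_PLAIN=8.83505           # seconds, compression off
T_COMP=6.90277            # seconds, compression on

# dd reports decimal megabytes (1 MB = 1,000,000 bytes).
mbps() { awk -v b="$1" -v t="$2" 'BEGIN { printf "%.1f", b / t / 1000000 }'; }

plain=$(mbps "$BYTES" "$T_PLAIN")
comp=$(mbps "$BYTES" "$T_COMP")
# Relative speedup of the compressed run over the uncompressed one, in percent.
gain=$(awk -v a="$T_PLAIN" -v b="$T_COMP" 'BEGIN { printf "%.0f", (a / b - 1) * 100 }')

echo "uncompressed: $plain MB/s"
echo "compressed:   $comp MB/s"
echo "speedup:      $gain%"
```

This prints 14.8 MB/s and 19.0 MB/s, matching dd, and works out to a speedup of about 28% for this particular data.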
While I don’t have 3 identical drives on which to build a raidz array, you can just as easily use file-backed storage for testing purposes. The speed will of course be only that of the one underlying device, but speed is not the focus of the next test.
I created a small, simple file-backed raidz1 pool from three files of roughly 128MB each, filled with zeros:
<pre>
<strong>root@live:~#</strong> touch sda sdb sdc
<strong>root@live:~#</strong> dd if=/dev/zero of=sda bs=128k count=1000
<strong>root@live:~#</strong> dd if=/dev/zero of=sdb bs=128k count=1000
<strong>root@live:~#</strong> dd if=/dev/zero of=sdc bs=128k count=1000
<strong>root@live:~#</strong> zpool create myraidz raidz /root/sda /root/sdb /root/sdc
<strong>root@live:~#</strong> zpool list
NAME      SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
myraidz   360M   141K   360M     0%  ONLINE  -
</pre>
Because this raid pool is actually sitting on the flash drive I use for the root filesystem on this box, it is quite slow, and zpool iostat shows very low bandwidth readings:
<blockquote><pre>
<strong>root@live:/myraidz#</strong> zpool iostat
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
myraidz     37.8M   322M      0      2  6.01M  8.19M
----------  -----  -----  -----  -----  -----  -----
</pre></blockquote>
Now I will delete one of the backing files while the pool is in use and see how long it takes ZFS to notice:
<blockquote><pre>
<strong>root@live:/root#</strong> rm sdc
<strong>root@live:/root#</strong> zpool scrub myraidz
<strong>root@live:/root#</strong> zpool status
  pool: myraidz
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: scrub completed after 0h0m with 0 errors on Fri Mar 6 03:10:40 2009
config:

        NAME           STATE     READ WRITE CKSUM
        myraidz        DEGRADED     0     0     0
          raidz1       DEGRADED     0     0     0
            /root/sda  ONLINE       0     0     0
            /root/sdb  ONLINE       0     0     0
            /root/sdc  UNAVAIL      0     0     0  cannot open
</pre></blockquote>
It turns out ZFS didn’t notice until I scrubbed the pool, though I suspect a busier pool would have made ZFS notice quickly. The pool keeps going even though a disk is missing, with no rebuilding period or offline time. It would still be wise to replace the missing device with an identical one (zpool replace) so that ZFS can rebuild onto it and regain the redundancy.
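Why can the pool carry on with a device missing? Single-parity raidz stores XOR parity alongside the data, so any one missing piece can be recomputed from the rest. Here is a toy version with one byte per device. The values are arbitrary, and real raidz spreads data and parity across all the devices with variable stripe widths, so treat this as an illustration of the principle only.

```shell
#!/bin/sh
# Toy single-parity example with one byte per "device" (arbitrary values).
d1=165   # byte stored on the first device  (0xA5)
d2=60    # byte stored on the second device (0x3C)

# The parity byte for this stripe is the XOR of the data bytes.
parity=$(( d1 ^ d2 ))

# The second device disappears; rebuild its byte from the survivors.
recovered=$(( d1 ^ parity ))

echo "parity=$parity recovered=$recovered original=$d2"
```

The recovered byte matches the original, which is the same trick ZFS relies on to keep serving data from a degraded raidz1 and to rebuild a replacement device.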
That’s it for now. I may do some additional tests, perhaps creating a zpool on top of an encrypted device such as a device-mapper LUKS volume or TrueCrypt. I suspect it would function similarly, though perhaps more slowly in TrueCrypt’s case since it is another FUSE filesystem. Comments are welcome :)
One Comment
Fascinating, Captain…
But if your files/filesystems are spread out upon disks as new disks are added, how do you save files that are on a failing disk? Do you just dd the files to the new disk?
This would make life easier.
Thanks for the info… definitely something I’ll have to keep an eye out for.