In our previous story we went through a basic rundown of what ZFS is, its advantages, and which operating systems currently support it or plan to. Apple’s OS X is one of those operating systems, and as noted before, Apple is making some of its pre-release code available for testing by adventurous users and supergeeks. Caution is warranted here: this is highly experimental code, and while ZFS itself has a design goal of ensuring the integrity of files stored on it, this code cannot be guaranteed to meet that goal. Use of this code is at your own risk.
Apple is providing an OS X implementation of ZFS through Mac OS Forge. Both binaries and source are available, and Apple’s currently released code appears to be synced with upstream Solaris Nevada (SNV) build 72. The current Nevada build appears to be 108, so the code may be missing some significant features. Comments from the developers of Apple’s ZFS branch on their mailing list indicate that they have reached a point where kernel modifications are needed to keep everything integrated and running smoothly, and those changes are being made in the unreleased Snow Leopard code base. According to them, rather than fork development, they have been doing most of their new ZFS work as part of the Snow Leopard builds, which are not public. So while we don’t have the most up-to-date code, we can test it as-is to get an idea of how it works, weigh its advantages and disadvantages, and compare ZFS to HFS+, the current OS X filesystem.
There are problems, of course, some of them detailed here: http://zfs.macosforge.org/trac/wiki/issues. For the truly adventurous, however, these can be worked around with ease. The most serious is a kernel panic that happens if a pool is not exported before its drive is removed. For instance, if you disconnect a USB or FireWire drive without exporting it first, the kernel will panic, all your open work will be lost, and the system will need to be restarted. In short, don’t do that.
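A minimal sketch of the habit that avoids the panic: export the pool and confirm it is gone before pulling the cable. The helper name `safe_unplug` is made up for illustration, and I am assuming `zpool export` and `zpool list -H -o name` behave on this build the way they do on Solaris.

```shell
# safe_unplug POOL: export the pool, then confirm it is no longer imported.
# Hypothetical helper; only `zpool export` and `zpool list` are real commands.
safe_unplug() {
    pool="$1"
    if ! zpool export "$pool"; then
        echo "export of $pool failed; do not unplug the drive" >&2
        return 1
    fi
    # After a successful export the pool should no longer be listed.
    if zpool list -H -o name 2>/dev/null | grep -Fqx "$pool"; then
        echo "$pool is still imported; do not unplug the drive" >&2
        return 1
    fi
    echo "$pool exported; safe to disconnect"
}
```

With the mirror built later in this post, that would be `safe_unplug myvol`, run before touching either USB cable.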
To start, let’s grab the binaries from the downloads page: http://zfs.macosforge.org/trac/wiki/downloads. The installation instructions are also on that page. Basically, you copy the right files to the right places, set the correct permissions on them (OS X doesn’t like to load kernel extensions with incorrect permissions), and either reboot or manually load the kext:
sudo kextload /System/Library/Extensions/zfs.kext
If you are unfamiliar with how ZFS makes use of disk drives, it aggregates the drives you give it in one of several ways: striping across all the devices to pool storage without any redundancy, a simple mirror, or one of the RAID-like configurations that Sun calls RAIDZ1 and RAIDZ2, named for the number of disks the pool can lose without losing data. So you are not really making a filesystem on a device; that is the wrong way to look at it. You are giving ZFS storage on which it can implement filesystems, a subtle distinction but an important one once you get into more complex redundant pools with large numbers of disks.
For our purposes here we will simply give it two USB disks on which to implement a mirrored filesystem. Note that each of these disks has been prepared by erasing all partitions and making a single partition of type ZFS. This ensures that the disk actually has a label and a partition, which is sometimes necessary when exporting the pool and importing it on another OS like Solaris or Linux. It also ensures that OS X won’t bug you to initialize the disk when it can’t find a partition table anywhere on it.
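To put rough numbers on those layouts, here is a small sketch that estimates usable space for a pool of equal-sized drives. The function name `usable_gib` is my own, the figures ignore metadata and reserved space, the mirror case assumes one mirror containing every disk, and the RAIDZ cases assume you supply more disks than parity needs. The comments show the matching `zpool create` form, with `tank` and `d1`, `d2` and so on as placeholder names.

```shell
# usable_gib LAYOUT DISKS SIZE_GIB
# Rough usable capacity for a pool built from DISKS equal drives of SIZE_GIB each.
# Real pools report somewhat less because of metadata and reserved space.
usable_gib() {
    layout="$1"; disks="$2"; size="$3"
    case "$layout" in
        # zpool create tank d1 d2 ...         (striped, no redundancy)
        stripe) echo $(( disks * size )) ;;
        # zpool create tank mirror d1 d2      (every disk holds a full copy)
        mirror) echo "$size" ;;
        # zpool create tank raidz d1 d2 d3    (one disk's worth of parity)
        raidz1) echo $(( (disks - 1) * size )) ;;
        # zpool create tank raidz2 d1 d2 d3 d4 (two disks' worth of parity)
        raidz2) echo $(( (disks - 2) * size )) ;;
        *) echo "unknown layout: $layout" >&2; return 1 ;;
    esac
}
```

For two 465 GiB disks, `usable_gib stripe 2 465` prints 930 while `usable_gib mirror 2 465` prints 465, which is the trade made in this post.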
Pardon the formatting, but you get the idea.
<blockquote><pre>
Mini:~ Steve$ diskutil partitiondisk /dev/disk1 GPTFormat ZFS %noformat% 100%
Started partitioning on disk disk1
Creating partition map
[ + 0%..10%..20%..30%..40%..50%..60%..70%..80%..90%..100% ]
Finished partitioning on disk disk1
/dev/disk1
   #:                       TYPE NAME       SIZE       IDENTIFIER
   0:      GUID_partition_scheme         *465.8 Gi   disk1
   1:                        EFI          200.0 Mi   disk1s1
   2:                        ZFS          464.9 Gi   disk1s2

Mini:~ Steve$ diskutil partitiondisk /dev/disk2 GPTFormat ZFS %noformat% 100%
Started partitioning on disk disk2
Creating partition map
[ + 0%..10%..20%..30%..40%..50%..60%..70%..80%..90%..100% ]
Finished partitioning on disk disk2
/dev/disk2
   #:                       TYPE NAME       SIZE       IDENTIFIER
   0:      GUID_partition_scheme         *465.1 Gi   disk2
   1:                        EFI          200.0 Mi   disk2s1
   2:                        ZFS          464.9 Gi   disk2s2

Mini:~ Steve$ zpool create myvol mirror /dev/disk1s2 /dev/disk2s2

Mini:~ Steve$ zpool status
  pool: myvol
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        myvol                     ONLINE       0     0     0
          mirror                  ONLINE       0     0     0
            /dev/disk1s2          ONLINE       0     0     0
            /dev/disk2s2          ONLINE       0     0     0

errors: No known data errors

Mini:~ Steve$ mount
/dev/disk0s2 on / (hfs, local, journaled) <strong><----root</strong>
...other filesystems omitted....
myvol on /Volumes/myvol (zfs, local, nodev, nosuid, mounted by Steve) <strong><----our now mounted ZFS mirror</strong>
</pre></blockquote>
In the example above you can see two USB disks being formatted with a GPT partition table with one user-visible partition of type ZFS. Then a mirrored pool is created from the two disks and is immediately brought online. By default ZFS creates a single filesystem on top of a pool, but you can also create individual filesystems on the pool, and each will use only as much space as it needs while remaining an independent filesystem. By independent I mean that you can take snapshots of ONLY the data on one filesystem rather than the entire pool if you wish, and you can also move the filesystem to another pool with a few simple commands. If you give ZFS enough space you can have a fully redundant, integrity-checked file storage system with integrated versioning and backup capabilities at the block level. Did I mention it is free? Yeah.
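As a rough sketch of what those few simple commands look like, assuming this build exposes the same `zfs create`, `zfs snapshot`, `zfs send` and `zfs receive` subcommands as Sun’s releases (I have not verified each of them on the OS X port). The wrapper names, the `myvol/photos` filesystem, the `monday` snapshot label and the second pool `backup` are all hypothetical.

```shell
# Hypothetical names throughout: myvol/photos, the label "monday", and a
# second pool called "backup". Only the zfs subcommands themselves are real.

# make_fs FS: create an independent filesystem inside an existing pool.
make_fs() {
    zfs create "$1"
}

# snapshot_fs FS LABEL: snapshot only this filesystem, not the whole pool.
snapshot_fs() {
    zfs snapshot "$1@$2"
}

# copy_fs FS LABEL DEST: stream one snapshot into a filesystem on another pool.
copy_fs() {
    zfs send "$1@$2" | zfs receive "$3"
}

# Example session:
#   make_fs myvol/photos
#   snapshot_fs myvol/photos monday
#   copy_fs myvol/photos monday backup/photos
```

Once the copy has been checked on the other pool, removing the original filesystem completes the move.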
Storage speed with this build of the ZFS driver seems acceptable. I noticed no difference copying a large number of files of every size to the volume as a ZFS pool compared to when it was an HFS+ volume. So performance doesn’t seem to be an issue, though there is some slight overhead from various features of the filesystem: checksums require some calculation, for instance, and when a full disk scrub is being performed the kernel_task process shows a bit more activity than normal, but not much. For some very basic benchmarking, I used the Unix dd command to copy raw data from another drive, connected to the system through FireWire 400, to a file on the ZFS mirrored volume.
Due to the way ZFS is designed, data is supposed to be consistent on disk at all times. What does that mean? It means that at any given time, the data on disk and what the operating system thinks is on disk should be in sync. I will leave the complex explanations to Sun’s documentation, linked in the first part of this series. It did surprise me, however, to find that with a single command I could verify the integrity of my data on disk while the filesystem was online and in use:
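For anyone wanting to repeat that kind of test, here is a minimal sketch of a dd timing run. The helper name `bench_copy` and the device and file paths in the usage line are assumptions, not the exact command from my session, and reading a raw device normally needs root.

```shell
# bench_copy SRC DST MIB: copy MIB mebibytes from SRC to DST with dd and
# print the elapsed wall-clock seconds. The block size is given in bytes
# because the suffix spelling differs between BSD dd (1m) and GNU dd (1M).
bench_copy() {
    src="$1"; dst="$2"; mib="$3"
    start=$(date +%s)
    dd if="$src" of="$dst" bs=1048576 count="$mib" 2>/dev/null || return 1
    sync    # ask the OS to flush buffered writes before stopping the clock
    end=$(date +%s)
    echo "$mib MiB in $(( end - start )) s"
}
```

Something like `bench_copy /dev/rdisk3 /Volumes/myvol/dd-test.img 1024` would copy 1 GiB from a hypothetical third disk onto the mirror. Whole-second timing is crude, so use a large enough copy for the number to mean anything.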
<blockquote><pre>
Mini:~ Steve$ zpool scrub myvol

Mini:~ Steve$ zpool status
  pool: myvol
 state: ONLINE
 scrub: scrub completed after 0h28m with 0 errors on Mon Mar  2 23:26:40 2009
config:

        NAME               STATE     READ WRITE CKSUM
        myvol              ONLINE       0     0     0
          mirror           ONLINE       0     0     0
            /dev/disk1s2   ONLINE       0     0     0
            /dev/disk2s2   ONLINE       0     0     0
</pre></blockquote>
What scrub does is crawl the filesystem and check every single block of data against the 256-bit checksum stored for that block. On every write, a checksum is calculated and written to disk, so that when the data is read back ZFS knows whether it is correct. Any sysadmin can tell you that this is highly valuable, because silent data corruption is a huge problem. Data loss can become severe when you think you can trust a disk but in reality it is not storing what you want it to store. Consider that if a disk fails or shows signs of early failure and you have a backup, you can stop trusting the disk and go with the backup. But what happens if you think you can trust that disk while it slowly loses small bits of data? Nothing large enough to get your attention, but significant blocks: the header of a binary file the operating system uses, perhaps, or a MySQL table, maybe the one storing the user data for a large website. Depending on the location of that corrupt block, it could ruin your day. In a situation like this, as long as the volume is redundant, ZFS will actually find the corrupt block and <strong><em>fix it</em></strong> for you.
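The checksum-on-write, verify-on-read idea can be imitated in user space. This is only an analogy: the helper names are mine, the POSIX `cksum` utility (a CRC-32) stands in for the 256-bit checksums ZFS uses, and ZFS does this per block inside the filesystem rather than per file with a sidecar.

```shell
# record_sum FILE: store a checksum beside the file at write time.
record_sum() {
    cksum < "$1" > "$1.sum"
}

# verify_sum FILE: recompute the checksum at read time and compare it with
# the stored one. Succeeds only if the contents are unchanged.
verify_sum() {
    [ "$(cksum < "$1")" = "$(cat "$1.sum")" ]
}
```

A file that silently changes after `record_sum` makes `verify_sum` fail. That is the detection half of the story. The repair half needs a second good copy, which is what the mirror provides.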
Even in a zpool that is not redundant, you will get early warning that the disk is unreliable before you lose large quantities of important data. One of the advantages of ZFS is that this data integrity comes standard, which is a significant difference from some hardware RAID systems. There is an assumption that if you are using hardware RAID, your data is safe even if a disk fails. That may be true in some situations, but not always. It is possible to silently lose data even on a hardware RAID setup.
Now let’s look at what happens when you remove a device from a zpool in OS X without exporting it first. Right now, if that device holds the only remaining copy of the data in the pool, the kernel will call panic(). The reason given by the developers is that ZFS promises to be consistent on disk at all times, so if a device is suddenly removed from a pool and it held the only copy of the data in that pool, the filesystem can no longer guarantee the data and gives up trying. Whatever state the disk is in at that point, the data on it is <strong><em>supposed</em></strong> to be valid and checksummed, but who knows. The filesystem design itself is sound and reliable, but again, this is pre-release code.
Upon removing one disk from the mirror I created, this is what happened:
<blockquote><pre>
Mini:~ Steve$ zpool scrub myvol
Mini:~ Steve$ zpool status
  pool: myvol
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: scrub completed with 0 errors on Thu Mar  5 10:31:21 2009
config:

        NAME                      STATE     READ WRITE CKSUM
        myvol                     DEGRADED     0     0     0
          mirror                  DEGRADED     0     0     0
            /dev/disk1s2          ONLINE       0     0     0
            /dev/disk2s2          UNAVAIL      0     0     0  cannot open

errors: No known data errors
</pre></blockquote>
ZFS probably would have noticed the missing disk on its own, but I chose to scrub the volume to make it check for me. As you can see, the data is still available: ZFS knows the disk is gone but keeps going. It also notes that if you ‘online’ the disk it can continue right where it left off. I don’t know, however, whether it would then replicate changes made in this degraded state back to the reattached disk once it is connected again. I suspect it would, or it wouldn’t be a mirror any longer. I do know that it is possible to take one of the disks of a RAIDZ pool offline and replace it while the data stays online.
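A small sketch of how one might script that check and reattach. The helper names are hypothetical, and it assumes `zpool status` prints a `state:` line as it does in the output in this post and that `zpool online POOL DEVICE` works here as on Solaris. On Sun’s builds, onlining a device starts a resilver that copies over whatever changed while it was away. I have not confirmed that on this OS X build.

```shell
# pool_state POOL: print the value of the "state:" line from `zpool status`.
pool_state() {
    zpool status "$1" | awk '$1 == "state:" { print $2; exit }'
}

# reattach POOL DEVICE: bring a returned device online, then report the
# pool state so you can see whether it is still DEGRADED.
reattach() {
    zpool online "$1" "$2" && pool_state "$1"
}
```

After plugging the second disk back in, `reattach myvol /dev/disk2s2` followed by a later `pool_state myvol` should eventually report ONLINE if the mirror has caught up.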
And of course I then had to test what happens when all of the disks in a pool are removed, since I knew about the kernel panic issue (actually a feature, as we will get to in a moment). So I saved what I was working on and yanked the USB cable of the remaining disk. It took a second, but the kernel then panicked and wrote some stuff to the screen referencing the ZFS kernel extension, and I rebooted the machine. As I mentioned, this is actually a feature: in later builds of ZFS from Sun there is a zpool setting that controls what happens in this situation, and the options are currently to continue, to wait, or to panic. The OS X build available right now does not support the other two options, nor does it offer the choice as a zpool option. If it did, you could set the pool to wait, which at the least would not lose all of the other data in open programs on the machine as a panic does.
That about wraps up this relatively brief look at the state of ZFS on OS X right now. There are probably newer builds in the Snow Leopard seeds, but I don’t have access to them and I probably could not test and post about them if I did. If you have questions, please write a comment and I will try to answer quickly. The next post in this series will cover zfs-fuse, the ZFS implementation for Linux, which will be more in-depth thanks to its more recent codebase and feature-complete nature.
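In Sun’s later builds that setting is the pool property `failmode`. Here is a sketch of how it would be applied on a build that has it. Again, the OS X build tested here does not, and the wrapper name is my own.

```shell
# set_failmode POOL MODE: validate MODE, then apply it with `zpool set`.
# Assumes a ZFS build that supports the failmode pool property.
set_failmode() {
    pool="$1"; mode="$2"
    case "$mode" in
        wait|continue|panic) zpool set "failmode=$mode" "$pool" ;;
        *) echo "failmode must be wait, continue, or panic" >&2; return 1 ;;
    esac
}
```

On a supporting system, `set_failmode myvol wait` would make the pool block I/O until the device comes back instead of taking the whole machine down.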
ZFS detects and repairs silent corruption. That is the main point of using ZFS:
http://opensolaris.org/os/community/zfs/docs/zfs_last.pdf
“Measurements at CERN (using Linux)
● Wrote a simple application to write/verify 1GB file
● Write 1MB, sleep 1 second, etc. until 1GB has been written
● Read 1MB, verify, sleep 1 second, etc.
● Ran on 3000 rack servers with HW RAID card
● After 3 weeks, found 152 instances of silent data corruption
● Previously thought “everything was fine”
● HW RAID only detected “noisy” data errors
● Need end-to-end verification to catch silent data corruption”
http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt
“When a drive returns garbage, since RAID5 does not EVER check parity on read (RAID3 & RAID4 do BTW and both perform better for databases than RAID5 to boot) if you write a garbage sector back garbage parity will be calculated and your RAID5 integrity is lost! Similarly if a drive fails and one of the remaining drives is flaky the replacement will be rebuilt with garbage also propagating the problem to two blocks instead of just one.”
http://74.125.77.132/search?q=cacheA0-6Ufp8a8J:gridmon.dl.ac.uk/nfnn/slides/RichardHughesJonesNFNN2.pdf+%22One+failure+for+every+10**15+bits+read%22&hl=sv&ct=clnk&cd=2&gl=se
“One failure for every 10**15 bits read. 1 week). Data loss possible”