UPDATE: Due to a WordPress bug, the pictures for this article are all missing from the server. I am quite angry about this, but I’m going to leave the post as-is because the text is still relevant and useful.

Use any complex operating system or software package long enough, and you will see an upgrade break it. Unforeseen changes made by users or administrators, mistakes made by software packagers, disk corruption: many different things can cause an upgrade to go bad. You can back up data and plan for problems, but not everything can be prevented, and significant downtime can occur even if you take the right precautions before letting something or someone upgrade or change system files or databases.

Some operating systems provide sophisticated package managers that can assist in performing upgrades or help roll back changes if something goes wrong. One example is the APT package manager from the Debian operating system. APT supports binary and source installation, complex dependencies, package signing and more. It also happens to be used by Ubuntu and a number of other operating system distributions, which will become important shortly.

In the past, problems with upgrades could be mitigated by using volume managers that can take a snapshot of a filesystem, or by using a development system to test the changes before making them on a production server. However, this can require significant planning and expertise, and not everyone can benefit from these technologies. Most home users and small businesses don’t even have multiple servers or test systems, and rarely if ever use complex volume management.

But there is a new filesystem being integrated into various operating systems that can take care of many of these problems. ZFS is every serious geek’s favorite toy right now, and as of March 2009 almost all common operating systems support it in some form, even Linux. ZFS provides some features that are of interest to our little upgrade problem, for instance live writable snapshots of a filesystem, checksums on data, and bootable clones. Some of those things are already possible with high-end enterprise hardware and uber-geek Linux systems. However, there is a new operating system that combines the advantages of the APT package manager with the ZFS capabilities described above and rolls them into a nice little tool that neatly solves the upgrade problem without expensive hardware.

Enter Nexenta, an open source OS which allows us to make use of mature and stable components from Ubuntu such as the GNU userland tools, high quality software repositories and large user base, with Solaris features like native ZFS, Zones, and DTrace. The result is an operating system that combines many of the strengths of Ubuntu and Solaris and provides them in one easy to install and use package.

Because the majority of the system is based on Ubuntu, it uses deb packages and the apt package management system for managing software. One of the nicest features of Nexenta I found by accident while trying to update the system with apt-get:

[Screenshot: Apt-get fail]

Because of the way Nexenta works, not everything comes from Ubuntu. Some core packages come from Sun; those are the SUNW packages listed. The developers of Nexenta have modified some of the tools that come with the operating system to integrate ZFS clones and to prevent the system from upgrading these packages in place. As you can see, apt-get refused to upgrade them in place and appears to have tried to run apt-clone on its own. However, this is beta software (Nexenta Core Platform 2, beta2), and there is what appears to be a bug, or perhaps it just wants us to upgrade things manually :)

So what exactly is apt-clone? It doesn’t come from Ubuntu or Sun. It is a special wrapper around several other tools, written in Perl by the Nexenta developers. The manual page describes it as a “ZFS integrated APT package handling utility”. The ZFS integration described involves automatic cloning of the filesystem before upgrading, automatic handling of grub menus, and management of those ZFS clones after the upgrade.
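To make the "clone before upgrading" idea a bit more concrete, here is a minimal Python sketch of the two ZFS steps involved: take a snapshot of the current root filesystem, then create a writable clone from it. This is not how apt-clone itself is implemented (I have not read its source). The function name, the pool name "syspool", and the snapshot naming scheme are all assumptions for illustration. The sketch only builds the command lines and does not run anything.

```python
from typing import List


def clone_commands(pool: str, current: str, new: str) -> List[List[str]]:
    """Build the ZFS commands that snapshot `current` and clone it to `new`.

    Nothing is executed here; the caller decides whether to run the commands.
    """
    # Hypothetical naming scheme: the snapshot is named after the new clone.
    snapshot = f"{pool}/{current}@{new}"
    return [
        # Read-only, point-in-time copy of the current root filesystem.
        ["zfs", "snapshot", snapshot],
        # Writable clone based on that snapshot; upgrades would go here.
        ["zfs", "clone", snapshot, f"{pool}/{new}"],
    ]


if __name__ == "__main__":
    # "syspool" is an assumed pool name; the clone names follow the article.
    for cmd in clone_commands("syspool", "rootfs-nmu-000", "rootfs-nmu-001"):
        print(" ".join(cmd))
```

Because the clone shares unchanged blocks with the original, it costs very little space until the upgrade starts changing files.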

So what can you do with apt-clone? Let’s take a look. First, let’s upgrade those packages apt-get refused to upgrade for us:

apt-clone upgrade

It takes the same syntax as apt-get, so apt-clone upgrade will allow us to use ZFS to our advantage while upgrading packages.

[Screenshot: apt-clone finished]

Not only has it created a clone of our root filesystem, it has even updated grub for us. The commands listed will allow us to either roll back or activate the new system once we test it to make sure it works. So let’s allow it to reboot and see what happens :)

[Screenshot: apt-clone finished]

When grub loads, we can see that the new cloned system with the upgraded packages has been added to the menu. If there is a problem, however, the original grub entry is still present. We can reboot into that item and nothing will have changed, then delete the upgraded snapshot and keep right on going with no significant downtime.

Now let’s list the available root snapshots:

[Screenshot: Root clones]

We can see in the A column that the old rootfs-nmu-000 clone is still the default “active” one. However, we are currently booted into rootfs-nmu-001, as shown by the C column.

And now let’s activate the new one, since the upgrade went well:

[Screenshot: Activate new snapshot]
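Since the screenshots are gone, here is a small Python model of the two flags from the clone listing and of what activating a clone changes. It is only an illustration of the idea. The class and function names are made up, and I am assuming "active" means "booted by default on the next restart" and "current" means "the clone the running system was booted from", as described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RootClone:
    name: str
    active: bool   # "A": booted by default on the next restart
    current: bool  # "C": the clone the running system was booted from


def activate(clones: List[RootClone], name: str) -> List[RootClone]:
    """Return a new list in which only `name` is marked active."""
    if name not in [c.name for c in clones]:
        raise ValueError(f"unknown clone: {name}")
    # The "current" flag is untouched: activating does not change what is running.
    return [RootClone(c.name, c.name == name, c.current) for c in clones]


# State right after booting into the upgraded clone for testing.
clones = [
    RootClone("rootfs-nmu-000", active=True, current=False),
    RootClone("rootfs-nmu-001", active=False, current=True),
]
# State after deciding the upgrade went well.
after = activate(clones, "rootfs-nmu-001")
```

The point of keeping the two flags separate is that you can test-drive the new clone without committing to it. Until you activate it, a plain reboot takes you back to the old system.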

The tools provided allow you to keep multiple ZFS clones and boot into any of them. By default the system will retain 32 of them, and grub will list 16 of those clones in its menu. This too can be changed with the apt-clone tool. If you are sure you will not need the prior clone, you can get rid of it and perhaps save some space, though in this example the space used by the original root clone is only 62MB:

[Screenshot: Remove old clone]

I have removed the old clone now and the system will operate as normal with the new upgrades.
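The retention limits mentioned above (32 clones kept, 16 shown in the grub menu) can be sketched in a few lines of Python. The two defaults come from what I saw on my system. Everything else is an assumption: the function name is made up, I am guessing that the oldest clones are the ones dropped first, and I am relying on the zero-padded names sorting in creation order.

```python
from typing import List, Tuple


def apply_limits(
    clones: List[str], keep: int = 32, menu: int = 16
) -> Tuple[List[str], List[str], List[str]]:
    """Split clone names into (retained, in_menu, pruned).

    Assumes zero-padded names such as rootfs-nmu-007 sort oldest to newest.
    """
    if keep < 0 or menu < 0:
        raise ValueError("limits must not be negative")
    ordered = sorted(clones)
    split = max(len(ordered) - keep, 0)
    pruned = ordered[:split]      # oldest clones beyond the retention limit
    retained = ordered[split:]    # the newest `keep` clones
    # Guard against menu == 0, because retained[-0:] would return everything.
    in_menu = retained[-menu:] if menu else []
    return retained, in_menu, pruned


names = [f"rootfs-nmu-{i:03d}" for i in range(40)]
retained, in_menu, pruned = apply_limits(names)
print(len(retained), len(in_menu), len(pruned))
```

With 40 clones and the default limits, 32 are retained, the newest 16 of those appear in the menu, and the 8 oldest are candidates for removal.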

So what does this mean? For Nexenta users it is great. Going forward, Nexenta will be an important player in the enterprise market, and there are already commercial software packages based on Nexenta that make use of its native ZFS support to provide storage. One such tool is NexentaStor, a stable storage management system that can integrate with VMware and provide high availability storage and easy management.

But what about Linux? It is true that ZFS on Linux is a FUSE module right now, which slightly complicates things, though it may be possible for Linux to gain the sort of upgrade security we see here even with the zfs-fuse package. However, a number of other technologies on Linux already provide some of these advantages. The LVM layer can do writable snapshots, and grub2, when it becomes the default bootloader in Linux distributions, should be able to boot into those snapshots. There is also a new filesystem being created in part as a response to ZFS, called Btrfs, which will have many of the advantages of ZFS and may be capable of enabling “unbreakable upgrades” for average Linux distributions like Ubuntu in the future as well.
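For comparison, the LVM equivalent of "snapshot before upgrading" is a single lvcreate call. The Python sketch below only builds that command line. The function name, the volume group "vg0", the logical volume "root", and the snapshot size are all hypothetical, and getting a bootloader to actually boot from the snapshot is a separate problem that this does not address.

```python
from typing import List


def lvm_snapshot_command(vg: str, lv: str, snap: str, size: str = "1G") -> List[str]:
    """Build an lvcreate command that snapshots logical volume `lv` in group `vg`.

    `size` is the space reserved for blocks that change after the snapshot.
    Nothing is executed here.
    """
    return [
        "lvcreate",
        "--snapshot",        # create a snapshot instead of a new volume
        "--name", snap,      # name of the snapshot volume
        "--size", size,      # copy-on-write space for changed blocks
        f"/dev/{vg}/{lv}",   # origin volume to snapshot
    ]


if __name__ == "__main__":
    # "vg0" and "root" are made-up names for illustration.
    print(" ".join(lvm_snapshot_command("vg0", "root", "root-pre-upgrade")))
```

Unlike a ZFS clone, an LVM snapshot needs its copy-on-write space sized up front, so a large upgrade can fill a snapshot that was made too small.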

At some point it may be possible for Canonical and other Linux distributors to integrate fail-safe upgrades for critical system packages with no real effort on the part of the end user. You can already boot into an older kernel on Ubuntu by simply selecting it in the grub menu (assuming it is still there), but what about other packages? What if something like Xorg breaks? That exact scenario happened in the recent past: an update to Xorg resulted in Ubuntu users booting into a system with no functional GUI. What if they could simply reboot and select “last known stable system” in the grub menu? With some minor effort, as we see here with Nexenta, that would be possible, and it would reduce Canonical’s support costs substantially.

And what about Mac OS X? Apple is already integrating ZFS in 10.6 “Snow Leopard”. If Apple supports root on ZFS, they could implement an option in their core system upgrades that would snapshot the old system before installing upgrades and allow the EFI bootloader to boot back into the old system, where nothing has changed, if something goes wrong. All of this would happen without reinstalling, using the installation disc, or any significant downtime. If the installer breaks something, reboot and hit “old system” and there you are, back the way it was, instantly.

While some operating systems provide the ability to roll back changes, they usually require significant time, or an experienced user with some external repair tools. The goal here is to be instant: if something goes wrong with an upgrade, it should be possible to boot into the old system without running any terminal commands, booting any rescue system or installation disc, or otherwise manually rolling things back. With Nexenta we have what appears to be a very solid implementation of this sort of fail-safe upgrade system, and I can only hope others will take notice and work to implement it in other operating systems in the future.