Fedora Project

In a previous article, I demonstrated how Nexenta, an OpenSolaris/Debian hybrid operating system, provides failsafe operating system upgrades using features of the ZFS filesystem. At the time, I hoped that Linux distributions would provide the same capability using native technologies such as LVM, GRUB2, or perhaps Btrfs, a native Linux filesystem providing many of the same features that ZFS currently offers.

Now it appears that Fedora will be the first of the major Linux distributions to offer users the “unbreakable upgrades” that Nexenta has been providing for quite some time. Though the Fedora wiki page outlining the new functionality lists the completion status as 0%, the feature is so useful that even if it misses the Fedora 13 cutoff (mid 2010), I have no doubt it will be included in a future release.

Implementing this functionality requires support both in the kernel and in a number of userspace components, such as the Yum package manager and the GRUB bootloader. The mainline kernel already includes an early version of Btrfs, but so far it has been clearly marked as experimental due to a number of showstopper bugs that I and others have hit during testing. By the time Fedora 13 is released, the Btrfs code base will have gone through a substantial amount of improvement, and those problems may be resolved.
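To make the idea concrete, here is a toy sketch of the snapshot-then-upgrade-then-rollback cycle, assuming the feature works by recording the state of the system before the package manager changes anything. It is an in-memory model only: the `ToySystem` class and its methods are hypothetical names invented for illustration, not part of Yum, GRUB, or Btrfs, and a real copy-on-write filesystem would share unchanged blocks rather than copy everything as this sketch does.

```python
import copy


class ToySystem:
    """In-memory stand-in for a root filesystem: path -> file contents."""

    def __init__(self, files):
        self.files = dict(files)
        self.snapshots = {}

    def snapshot(self, name):
        # Record the current state under a name, before anything changes.
        self.snapshots[name] = copy.deepcopy(self.files)

    def upgrade(self, changes):
        # Apply package changes; a value of None means the file is removed.
        for path, contents in changes.items():
            if contents is None:
                self.files.pop(path, None)
            else:
                self.files[path] = contents

    def rollback(self, name):
        # Restore the recorded state, discarding everything the upgrade did.
        self.files = copy.deepcopy(self.snapshots[name])


system = ToySystem({"/bin/app": "v1", "/etc/app.conf": "stable settings"})
system.snapshot("pre-upgrade")

# A bad upgrade replaces the binary and removes the configuration file.
system.upgrade({"/bin/app": "v2 (broken)", "/etc/app.conf": None})
broken_state = dict(system.files)

# Rolling back returns the system to exactly what it was before the upgrade.
system.rollback("pre-upgrade")
print(system.files)
```

The point of the design is that the snapshot is taken before the upgrade touches anything, so recovery does not depend on knowing what the upgrade changed or why it failed.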

As for why this functionality is so valuable, here is what I posted in the Nexenta article:

Use any complex operating system or software package long enough, and you will see an upgrade break it. Unforeseen changes made by users or administrators, mistakes made by software packagers, disk corruption: many different things can cause an upgrade to go bad. You can back up data and plan for problems, but not everything can be prevented, and significant downtime can occur even if you take the right precautions before letting something or someone upgrade or change system files or databases.

In a production environment, things happen. You can’t always take a machine offline for every little security patch, and even patches that have been tested and are known to be OK on a development machine might cause a production box to grind to a halt.

In a larger business deployment, a cluster or a failover setup can prevent real downtime in many cases, but such things are not always affordable or even possible in other situations, such as a home environment or a smaller business. In either case, the ability to simply reboot and roll back system changes at the binary level will save a substantial amount of time for both administrators and users, and it can’t come fast enough.