I can't even count the hours I have spent investigating how to make my backups reliable. Because I need to process several hundred gigabytes, practically every free backup solution available on the Linux market (April 2010) is useless to me! Please correct me if I have overlooked something important, but I will discuss these "backup" programs in detail. Note that by backup I do not mean archiving, which is quite a different topic.
The oldest and still the most popular way of doing backups is rsync. Its delta transmission for transferring data while saving bandwidth is clever, well tested, and works really well. If you want revisioned backups you will probably use one of the scripts that build rotating backups on top of rsync. But this approach works on a per-file basis only. For files that change as a whole it is still a good idea, but that is not the typical case: imagine a 20 GB virtualization image that changes only sparsely, and you end up with a lot of wasted space. Besides that, the typical rsync script cannot handle renamed files properly, so reordering your family video collection becomes the most senseless waste of space (doing better is certainly possible: dirvish, for instance, is said to use MD5 sums to find identical files and hard-link them). From what I have read, Apple's Time Machine seems to work in a similar way.
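For reference, the rotating trick those scripts use boils down to something like this. Just a sketch: the source path, the backup root and the "latest" symlink are my own choices, not taken from any particular script.

```groovy
// Rotating rsync backup sketch: unchanged files become hard links into
// the previous run, only changed files take up new space.
def src   = "/home/user/"                 // hypothetical source
def root  = "/backup/rsync"               // hypothetical backup root
def stamp = new Date().format("yyyy-MM-dd_HH-mm")

def rsync = ["rsync", "-a", "--delete",
             "--link-dest=${root}/latest",        // hard-link unchanged files to the last run
             src, "${root}/${stamp}/"].execute()
rsync.waitForProcessOutput(System.out, System.err)

// repoint "latest" to the run we just finished
["ln", "-sfn", "${root}/${stamp}", "${root}/latest"].execute().waitFor()
```

Note that the hard links only help for files that did not change at all, which is exactly the per-file limitation described above.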
The next step in backup tools was to combine delta transmission with a storage format that only keeps those deltas. rdiff-backup does exactly that, but it is not very useful either. First of all, only the latest backup is accessible directly from the filesystem. Being able to access the backed-up data with standard tools is surely one of the most important features a backup solution should provide, especially on Linux systems, which may change faster than any small company can adapt its product (imagine an Acronis True Image-like solution on Linux). Moreover, rdiff-backup cannot detect renames, and the most important point is that it does not cope with really big files (20 GB+): it just tries and then gives up. Using a source code revision tool like svn or git underneath (e.g. Flyback) is out of the question.
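For completeness, basic usage looks roughly like this (paths are made up). Only the mirror of the latest run is browsable as plain files; anything older has to go through the tool again.

```groovy
// rdiff-backup keeps a plain mirror of the latest run plus an
// rdiff-backup-data directory with reverse deltas for older versions.
def run = { List<String> cmd -> cmd.execute().waitForProcessOutput(System.out, System.err) }

run(["rdiff-backup", "/home/user", "/backup/rdiff"])        // latest state stays readable as normal files
run(["rdiff-backup", "-r", "3D",                            // restoring an older version needs the tool
     "/backup/rdiff/some/file", "/tmp/file-three-days-ago"])
```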
Now take a look at another standard approach: using tar, zip, or other container formats. These formats have several drawbacks. Most of them are quite old and therefore cannot deal with large files. There are extensions to tar and zip that address this issue, but it is never clear whether your whole toolchain supports them or will just damage your backup. Another problem is making revisioned backups with these containers. Tar can track increments in an extra file, but that is limited to the file level again. I just tried tarring a few gigabytes on an Ubuntu 9.10 64-bit machine and it simply failed. And I suppose you cannot feel good about having your backup in one single file that is prone to damage.
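The "extra file" I mean is GNU tar's listed-incremental snapshot file; a rough sketch with invented paths, and note that the granularity is still the whole file:

```groovy
// GNU tar incremental backups: the .snar file records what has already
// been saved, so later runs only archive files that changed since.
// Still file level: a slightly modified 20 GB image goes in again in full.
def run = { List<String> cmd -> cmd.execute().waitForProcessOutput(System.out, System.err) }

run(["tar", "--listed-incremental=/backup/tar/state.snar",
     "-czf", "/backup/tar/level0.tar.gz", "/home/user"])
// later, reusing the same .snar file, only the changes end up in the archive
run(["tar", "--listed-incremental=/backup/tar/state.snar",
     "-czf", "/backup/tar/level1.tar.gz", "/home/user"])
```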
Let's come to the big players like Bacula or Amanda. These tools had the chance to do better by using their own "crawler" clients, which collect the data on each machine and can store it in any special format on the server. However, they are not able to handle duplicates or renamed files gracefully, so I didn't even check how they handle deltas. Furthermore, their architecture makes them rather complicated to set up and maintain. That disqualifies these tools as well.
Then there are the offline imagers like Mondo Rescue or Clonezilla. The mere fact that you have to shut down your system and boot into the backup OS says it all: who is going to do that every day? Nothing more to say.
So what is left? Not much, because the problem cannot be solved at the file level; we need to look at filesystems instead. The most important thing is a deduplicating filesystem. Snapshots for the revisions would be great as well, although thanks to hard links they are not strictly necessary. Btrfs supports snapshots but has no dedup feature, so using snapshots for file revisions can become quite inefficient. Besides that, btrfs cannot really be recommended yet because the on-disk format may still change in the future (the changes are supposed to be rolling, though). There is another candidate, LessFS. It seems to have no snapshot feature in its current release and looks quite amateurish to me; I could not even compile it on an Ubuntu 9.10 machine because of unresolvable library dependencies (it wants newer versions). Furthermore there is no way to tell whether the backup has been damaged, and it is FUSE-based as well. That leaves just one other amateurish-looking project: ZFS-On-Fuse. This fork of the original ZFS-to-FUSE port has integrated the current ZFS features (ZFS itself is thoroughly professional), the most interesting one from my point of view being deduplication, besides secure checksums, RAID functions, self-healing and so on. ZFS has a lot of benefits when used as a backup target and does everything you need. Whether it is ready for production is debatable, but it is obviously the only next-generation filesystem that has been developed, tested and used at this scale.
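To make the dedup-plus-snapshots idea concrete, this is roughly what it looks like in ZFS terms (pool, dataset and device names are made up):

```groovy
// Deduplication plus snapshots: identical blocks are stored only once
// across all revisions, and every snapshot is a cheap read-only revision.
def run = { List<String> cmd -> cmd.execute().waitForProcessOutput(System.out, System.err) }

run(["zpool", "create", "backup", "/dev/sdb"])          // hypothetical backup disk
run(["zfs", "create", "backup/data"])
run(["zfs", "set", "dedup=on", "backup/data"])          // block-level deduplication
run(["zfs", "set", "checksum=sha256", "backup/data"])   // strong checksums for verification
run(["zfs", "snapshot", "backup/data@2010-04-18"])      // one revision, nearly for free
```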
To conclude, I append a small Groovy script that shows the principal idea of using ZFS-On-Fuse and rsync to make REAL backups. This gives you backups that are space efficient, bandwidth efficient, easily accessible, reliable (well, the FUSE port still has to prove itself) and verifiable (just run a scrub).
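The principal idea boils down to something like the following sketch; pool, dataset and path names here are my own and not necessarily those of the actual script.

```groovy
// Sketch: rsync into a deduplicating ZFS dataset, snapshot the dataset
// as the new revision, then scrub the pool to verify the stored data.
def run = { List<String> cmd ->
    def p = cmd.execute()
    p.waitForProcessOutput(System.out, System.err)
    assert p.exitValue() == 0 : "command failed: ${cmd}"
}

def source  = "/home/user/"                      // hypothetical source
def dataset = "backup/data"                      // hypothetical ZFS dataset
def target  = "/backup/data"                     // its mount point
def stamp   = new Date().format("yyyy-MM-dd_HH-mm")

run(["rsync", "-a", "--delete", source, target])   // delta transfer into the pool
run(["zfs", "snapshot", "${dataset}@${stamp}"])    // this run becomes a revision
run(["zpool", "scrub", "backup"])                  // verify checksums against the stored data
```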
ZFS-On-Fuse is not ready for this purpose. I tweaked some options in /etc/zfs/zfsrc; the most effective change was to increase max-arc-size into the gigabyte range, which gives usable performance (still below 10 MB/s with dedup on). But the worst part is that the FUSE daemon reproducibly died while copying one very large file. I am not sure why this happens (the source disk has a faulty sector); further experiments with ext4 showed that the file is also missing after the rsync run, but in contrast to ZFS the filesystem at least stays accessible.
I will need to check btrfs stability and its snapshot feature. Early experiments showed that you have to be careful to get efficient incremental backups. rsync's default behaviour does not help here, because it first writes a new copy of the file and then replaces the old one (compare the --inplace option). So btrfs cannot store only the changed parts of the original file (as you would want e.g. for VirtualBox images) but keeps a whole new one, which "leaks" the unchanged data.
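A sketch of how I would combine the two so that the copy-on-write snapshot can actually keep sharing the unchanged blocks (subvolume and path names are invented):

```groovy
// Snapshot the current backup subvolume first, then let rsync rewrite
// only the changed blocks in place; the snapshot keeps sharing the rest.
def run = { List<String> cmd -> cmd.execute().waitForProcessOutput(System.out, System.err) }

def stamp = new Date().format("yyyy-MM-dd")
run(["btrfs", "subvolume", "snapshot", "/backup/current", "/backup/snap-${stamp}"])
run(["rsync", "-a", "--inplace", "--no-whole-file",   // overwrite changed blocks instead of writing a new file
     "/vm/images/", "/backup/current/"])
```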
I just tried the latest btrfs code available for Ubuntu 9.10. Btrfs is currently totally incapable of detecting and repairing silent data corruption! I tested with a mirrored filesystem on two loopback devices and made one of them faulty with a hex editor (in the data part of a file). Access to the file simply fails!
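Roughly how the test can be reconstructed (using dd instead of a hex editor; image sizes, mount point, file and offset are all made up, and the overwrite has to land in the file's data blocks):

```groovy
// Rough reconstruction of the corruption test on a two-device btrfs mirror.
def sh = { String cmd -> ["sh", "-c", cmd].execute().waitForProcessOutput(System.out, System.err) }

sh("dd if=/dev/zero of=/tmp/d0.img bs=1M count=512")
sh("dd if=/dev/zero of=/tmp/d1.img bs=1M count=512")
sh("losetup /dev/loop0 /tmp/d0.img && losetup /dev/loop1 /tmp/d1.img")
sh("mkfs.btrfs -d raid1 -m raid1 /dev/loop0 /dev/loop1")
sh("btrfs device scan && mount /dev/loop0 /mnt/test")
sh("cp /some/test/file /mnt/test/ && umount /mnt/test")
// damage one half of the mirror, then read the file back through the filesystem
sh("dd if=/dev/urandom of=/tmp/d0.img bs=1 count=64 seek=300000000 conv=notrunc")
sh("mount /dev/loop0 /mnt/test && md5sum /mnt/test/file")
```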
Tested with the latest testing branch of zfs-fuse: ZFS handles this case as expected. A scrub identifies the error and even repairs it without any problem.
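On the ZFS side the check is just this (pool name made up):

```groovy
// A scrub re-reads every block against its checksum and repairs broken
// copies from the intact mirror half; zpool status reports the result.
def run = { List<String> cmd -> cmd.execute().waitForProcessOutput(System.out, System.err) }

run(["zpool", "scrub", "backup"])
run(["zpool", "status", "-v", "backup"])
```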