In our ever-expanding quest to make things a bit more streamlined, I'm trying to simplify our offsite backup system.
Currently, we're trying to use JungleDisk to back up /opt/zimbra/backup, but for the last year it's been an uphill battle: JD has trouble with more than ~500,000 files or so.
As a secondary plan, I have been doing an rsync of /opt/zimbra/backup from each of our Zimbra servers to a dedicated "storage" machine.
On a typical weekday, the rsync update is only about 6-8GB of "changes" in /opt/zimbra/backup from the previous night's differential backup.
Every Saturday, however, our servers all do a "full" backup, dumping approximately 252 GB in new files to the collective /opt/zimbra/backup of our servers, which can be copied over to our storage server in fairly short order the next day.
The problem, of course, is that everything between our Zimbra servers and this storage box is gigabit Ethernet, but to get that data offsite we're working with a 2MB/s link, or sneakernet.
Moving 10GB nightly isn't terrible, but the weekly backups are so large that I just can't sync them in a reasonable time frame, and driving an hour each way to swap in a hard disk every week seems... well, dumb. This isn't the '90s!
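To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the link really sustains 2MB/s for the whole transfer, ignores protocol overhead, and treats 1 GB as 1000 MB, so real transfers will likely take longer.

```python
def transfer_hours(size_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_gb at a sustained rate (assumes 1 GB = 1000 MB)."""
    seconds = size_gb * 1000 / rate_mb_per_s
    return seconds / 3600


# Figures from the post: ~10 GB nightly, ~252 GB for the weekly full, 2 MB/s link.
nightly = transfer_hours(10, 2.0)
weekly = transfer_hours(252, 2.0)

print(f"nightly 10 GB:  {nightly:.1f} h")
print(f"weekly 252 GB: {weekly:.1f} h")
```

Under those assumptions the nightly sync fits comfortably overnight, while the weekly full needs well over a day of uninterrupted transfer.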
For those of you who skip the long story, here's what I'm thinking:
1) We are using rsync -avHu.
-a, --archive archive mode; equals -rlptgoD (no -H,-A,-X)
-r, --recursive recurse into directories
-l, --links copy symlinks as symlinks
-p, --perms preserve permissions
-t, --times preserve modification times
-g, --group preserve group
-o, --owner preserve owner (super-user only)
-D same as --devices --specials
--devices preserve device files (super-user only)
-v, --verbose increase verbosity
-H, --hard-links preserve hard links
-u, --update skip files that are newer on the receiver
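One way to see what those flags would actually send, without moving any data, is a dry run. The sketch below only builds and prints the command line. The destination host and path are made up for illustration, and --dry-run and --stats are standard rsync options I've added on top of the -avHu we use.

```python
import shlex


def build_rsync_cmd(src: str, dest: str, dry_run: bool = True) -> list[str]:
    """Build an rsync command using the -avHu flags from the post."""
    cmd = ["rsync", "-avHu", "--stats"]  # --stats prints a transfer summary
    if dry_run:
        cmd.append("--dry-run")  # report what would be sent, copy nothing
    cmd += [src, dest]
    return cmd


# Hypothetical destination; substitute the real storage host and path.
cmd = build_rsync_cmd("/opt/zimbra/backup/", "storage.example.com:/srv/zimbra-backup/")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list handed to subprocess.run, once the paths are right.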
I read somewhere that the BLOB files that Zimbra saves are often referred to via hard links, so theoretically I only need to copy a given BLOB over once, where it can be referred to by dozens of hard links, saving space. Does this seem to be about right, or am I missing something with rsync?
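One way to check the hard-link assumption on disk is to look for paths that share an inode. The sketch below walks a tree and groups regular files by device and inode number; the demo runs against a throwaway temp directory rather than a real backup, and the file names are invented. Worth keeping in mind: as far as I understand rsync, -H can only detect hard links between files that are inside the same transfer set.

```python
import os
import stat
import tempfile
from collections import defaultdict


def hard_link_groups(root: str) -> dict[tuple[int, int], list[str]]:
    """Map (device, inode) to the paths under root that share that inode."""
    groups = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # Only regular files with more than one link are candidates.
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                groups[(st.st_dev, st.st_ino)].append(path)
    # Keep only inodes reachable via at least two paths inside root.
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}


# Demo on a temporary directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "blob1")
    with open(original, "wb") as f:
        f.write(b"message body")
    os.link(original, os.path.join(tmp, "blob1-link"))
    for (_dev, inode), paths in hard_link_groups(tmp).items():
        print(inode, paths)
```

Pointing hard_link_groups at /opt/zimbra/backup on one server should show whether the blobs there really are hard-linked, and how many paths share each one.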
2) Related to BLOBs, I'm seeing a bunch of fairly large BLOB .zip files going across the wire with every rsync. Is the Zimbra internal backup re-writing new BLOB files when it could be re-using existing ones? Perhaps there is something I could tweak in the way our backups run to minimize re-sending extra data?
3) Another thought I had is finding some software that could examine the data left on our "storage" server, or perhaps even right on /opt/zimbra/backup from each server, and de-duplicate the data for transfer, and perhaps do block-level updates to a remote storage place. This is a feature we love JungleDisk for, but it cannot seem to keep up with this huge number of files.
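To get a feel for how much duplicate data there is before shopping for a tool, something like the following could be run against the storage box. It is only a whole-file sketch (identical files found by size, then SHA-256), not the block-level de-duplication described above, and paths that are already hard-linked to each other will also show up as duplicates.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large blobs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_groups(root: str) -> list[list[str]]:
    """Return lists of paths under root whose contents are byte-identical."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate, skip hashing
        for path in paths:
            by_hash[sha256_of(path)].append(path)
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Calling duplicate_groups on a copy of /opt/zimbra/backup would list each set of identical files; with hundreds of thousands of files it will take a while, since every candidate is read in full.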
4) We could also play with doing something insane, like using tar to turn each server's /opt/zimbra/backup into one single large file and feeding it to a third-party service like JD. My biggest problem with this is that if we ever had an event and needed our offsite backups, we would be waiting for JD to send them back to us. We would much rather have them somewhere we can physically get to, if not in our own secure location.
5) We are moving office locations, and may be able to get a much faster internet connection. The WAN where our Zimbra systems sit has a theoretical 100mbit pipe, of which we currently have 30, which tests to 2MB/s. Depending on pricing we may be able to upgrade to a 50mbit pipe, or even more, but I feel like something is fundamentally wrong with throwing so many resources at moving what is mostly the same data over, and over, and over again.
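For comparing the pipe sizes, the unit conversion is worth spelling out, since line speeds are quoted in megabits and our measured figure is in megabytes. This sketch assumes 8 bits per byte with no overhead and 1 GB = 1000 MB, so the hours shown are best-case numbers for the 252 GB weekly full.

```python
def mbit_to_mbyte_per_s(mbit_per_s: float) -> float:
    """Convert a line rate in megabits/s to megabytes/s (no overhead assumed)."""
    return mbit_per_s / 8


def hours_for(size_gb: float, mbit_per_s: float) -> float:
    """Best-case hours to move size_gb over a link of the given line rate."""
    return size_gb * 1000 / mbit_to_mbyte_per_s(mbit_per_s) / 3600


for mbit in (30, 50, 100):
    rate = mbit_to_mbyte_per_s(mbit)
    print(f"{mbit:>3} Mbit/s = {rate:.2f} MB/s, 252 GB in {hours_for(252, mbit):.1f} h")
```

Note that 30 Mbit/s works out to 3.75 MB/s on paper, so the 2MB/s we measure is already well under the line rate; a bigger pipe may not scale the real throughput proportionally.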
6) Any other ideas? Am I going about this all wrong?