Thoughts on rsync and S3

rsync is a great utility. I bloody love it, and use it to do all my backups: over the network via SSH onto remote filesystems, and locally onto USB drives. I also use it to restore my home directory when I switch machines. Great.

However, the network part of my backup solution costs too much. I use Strongspace at the moment, which works really well, but at $8 a month for 4GB of storage I have to keep a close eye on my files to make sure nothing massive maxes out my account. There are alternatives, of course, and the GDrive might solve all of my problems; but for now I'm stuck with not enough network storage.

S3 gives me the right price, but there is no mature client solution which does the job for me (JungleDisk looked promising, but it seems a bit buggy on my Linux machine, where it sometimes just hangs, and Nautilus doesn't properly mount WebDAV shares onto the filesystem, so I can't rsync to it). So I've been working on an S3 library for Ruby to allow me to sync a local filesystem to S3 automatically, only transferring the files which have changed.

I've been reading up on rsync a bit to work out what it does, and it's pretty clever: here's a technical report which explains the inner workings. It does a block-by-block copy from the source file to the target file, using a so-called "rolling checksum" on each block to decide whether: a) the source contains the block, but the target does not; or b) the block in the source file exists somewhere in the target file. In this way, only changed blocks are copied from the source file to the target file (plus some checksums and block indexes). This is what makes rsync so fast.
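To make the "rolling" part concrete, here is a minimal sketch of the weak checksum as I read it in the technical report. It is in Python purely for illustration, it is not rsync's actual code, and the function names are my own. The point is that sliding the window along by one byte only needs the byte that leaves and the byte that enters, not a rescan of the whole block.

```python
M = 1 << 16  # modulus for each half of the weak checksum


def weak_checksum(block):
    """Compute both halves (a, b) of the weak checksum from scratch."""
    n = len(block)
    a = sum(block) % M
    # The first byte gets weight n, the last byte gets weight 1.
    b = sum((n - i) * byte for i, byte in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte to the right without rescanning the block."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b


def combine(a, b):
    """Pack both halves into a single 32-bit value."""
    return a + (b << 16)


if __name__ == "__main__":
    data = b"the quick brown fox jumps over the lazy dog"
    n = 8
    a, b = weak_checksum(data[:n])
    for k in range(1, len(data) - n + 1):
        a, b = roll(a, b, data[k - 1], data[k - 1 + n], n)
        assert (a, b) == weak_checksum(data[k:k + n])
    print("rolling and recomputed checksums agree")
```

A weak checksum like this only nominates candidate blocks; the report pairs it with a stronger checksum to confirm a match, which this sketch leaves out.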

I'm not sure whether you could work things out the same way with S3. Each resource (accessible via a key into a bucket) on S3 has an MD5 checksum associated with it, so you could get a checksum for individual blocks or file fragments; but I'm not sure I'd want to split a file over multiple S3 keys to be able to do block-level copying. Although a bucket can contain as many keys as you like (even though you are limited to 100 buckets), so this might be possible. But to retrieve a file you would need to recompose it from the fragments on S3, which is a bit crap.

The obvious approach would be to associate one key with each object you put onto S3. This has the advantage of making each file addressable by URL, rather than requiring a reconstruction of the file from S3 fragments. Then when you sync your filesystem to S3, you compare the MD5 checksum of the local file to the MD5 checksum of the remote file. This could be pretty painful, though, and take forever: one call to S3 to get the target checksum, a checksum on the local file, then transfer of the whole file up to S3 if it has changed.

This could perhaps be streamlined: instead of requesting the object using GET, you could use a HEAD request, which just gets the metadata; and perhaps the client could send an If-None-Match header with the request, passing the MD5 checksum (which S3 exposes as the object's ETag) as the value for the header. In this case, if the MD5 checksum in the request matches the checksum on S3, S3 will return a 304 response code (not modified), so we know the object hasn't changed and we don't need to parse the MD5 checksum out of the response. This could save a few cycles.
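Here is a rough sketch of the client side of that idea, in Python for illustration. It makes no network calls: the three function names are hypothetical, and I'm assuming the object was stored with a single plain PUT, so that its ETag is the quoted hex MD5 of the content.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Hex MD5 of a file, read in chunks so big files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def conditional_headers(local_md5):
    """Headers for a HEAD request asking for a 304 if the ETag matches."""
    # ETags are quoted strings, so the checksum is wrapped in double quotes.
    return {"If-None-Match": '"%s"' % local_md5}


def needs_upload(status_code):
    """Interpret the HEAD response: 304 unchanged, 200 differs, 404 missing."""
    if status_code == 304:
        return False
    if status_code in (200, 404):
        return True
    raise ValueError("unexpected status %d" % status_code)
```

The local hash still has to be computed for every file, so this only saves the parsing and comparison on the client, not the expensive part.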

An alternative might be to put a piece of "file modified" metadata onto S3 (not the same as the date on the S3 resource, but a copy of the file modification time of the local file when it was transferred). Then just compare the local file modification time to the S3 metadata when the algorithm is deciding whether a file needs to be transferred. The file stat to get the modification time is likely to be far faster than an MD5 hash.
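Something like the following is what I have in mind, again as a Python sketch rather than real library code. S3 stores user metadata under headers prefixed with x-amz-meta-, but the specific key name below is one I made up, and both functions are hypothetical.

```python
import os

# Hypothetical metadata key; only the x-amz-meta- prefix is S3's convention.
MTIME_HEADER = "x-amz-meta-mtime"


def upload_headers(path):
    """Metadata to send with the upload: the local file's modification time."""
    return {MTIME_HEADER: str(int(os.stat(path).st_mtime))}


def changed_since_upload(path, remote_headers):
    """True if the local mtime differs from the one recorded on S3."""
    recorded = remote_headers.get(MTIME_HEADER)
    if recorded is None:
        # No record means the file was never sent (or sent by another tool).
        return True
    return int(os.stat(path).st_mtime) != int(recorded)
```

I compare for inequality rather than "newer than", so a file restored from an older copy would also be picked up. An HTTP library may change the case of header names, which a real client would need to allow for.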

Yet another approach would be to just keep track locally of file paths and modification times (e.g. in a database) from the last time they were sent to S3. I think this is what JungleDisk does. Anything that has been added/removed/changed will be transferred without needing to reference S3 at all, so no expensive network operations. However, this only works if S3 is being synced from a single location; with multiple locations, each machine's local record says nothing about what the others have uploaded. Is this enough, or should local files always be compared to S3 to determine whether they should be transferred? Maybe a local database of file modification times is enough?
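A minimal sketch of such a local record, using SQLite from Python. The table name, column names and function names are all my own invention, and I have no idea whether JungleDisk does anything like this internally.

```python
import os
import sqlite3


def open_manifest(db_path=":memory:"):
    """Open (or create) the table recording what was last sent to S3."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sent "
        "(path TEXT PRIMARY KEY, mtime INTEGER NOT NULL)"
    )
    return conn


def scan(root):
    """Map each file path (relative to root) to its modification time."""
    found = {}
    for folder, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(folder, name)
            found[os.path.relpath(full, root)] = int(os.stat(full).st_mtime)
    return found


def diff(conn, current):
    """Compare the current scan with the manifest, without touching S3."""
    known = dict(conn.execute("SELECT path, mtime FROM sent"))
    added = sorted(p for p in current if p not in known)
    changed = sorted(p for p in current if p in known and known[p] != current[p])
    removed = sorted(p for p in known if p not in current)
    return added, changed, removed


def record(conn, current):
    """Replace the manifest with the current state after a successful sync."""
    with conn:
        conn.execute("DELETE FROM sent")
        conn.executemany("INSERT INTO sent VALUES (?, ?)", current.items())
```

A sync run would call scan, pass the result to diff, transfer or delete accordingly, and only then call record, so that a failed run gets retried next time.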

One more idea: perhaps you could combine the database of local file modification times with a cache of MD5 checksums for those local files. You could then:

  1. Find any new files and transfer them to S3, while their checksum is generated and stored in the local database in the background
  2. Find any files which have changed, generate new checksums, cache them, and then compare those generated checksums to S3 checksums: the file gets transferred if its checksum differs from the one on S3
  3. Determine which local files have disappeared and optionally remove them from S3; then delete their cached MD5 checksum from the database
  4. Any files which haven't changed can have their cached checksum compared to the S3 checksum, without needing to hash the file again
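The four steps above could be sketched as a single planning function. This is Python for illustration only; the function name, its arguments and the shape of the cache are assumptions of mine, and it only decides what to do rather than talking to S3.

```python
def plan_sync(local, cache, remote, hash_file, delete_missing=True):
    """Decide what to transfer.

    local:     {path: mtime} for files on disk now
    cache:     {path: (mtime, md5)} recorded at the last sync
    remote:    {path: md5} as reported by S3
    hash_file: callable returning the MD5 of a local path
    """
    uploads, deletes, new_cache = [], [], {}
    for path, mtime in sorted(local.items()):
        cached = cache.get(path)
        if cached is None:
            # Step 1: a new file is always sent, and its checksum cached.
            checksum = hash_file(path)
            uploads.append(path)
        elif cached[0] != mtime:
            # Step 2: changed locally, so rehash and compare with S3.
            checksum = hash_file(path)
            if remote.get(path) != checksum:
                uploads.append(path)
        else:
            # Step 4: unchanged locally, so the cached checksum is enough.
            checksum = cached[1]
            if remote.get(path) != checksum:
                uploads.append(path)
        new_cache[path] = (mtime, checksum)
    if delete_missing:
        # Step 3: files that have disappeared locally; they are also left
        # out of new_cache, which drops their cached checksum.
        deletes = sorted(p for p in cache if p not in local)
    return uploads, deletes, new_cache
```

Only new and changed files get hashed, so a run over a mostly unchanged directory costs one stat per file plus whatever it takes to fetch the checksums from S3.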

This has the advantage of some local caching, but isn't dependent on it; so you could use this approach to sync multiple machines to a single S3 bucket (unlike the database-only method). And the local cache could be optional, so you could use this approach from any machine, even if it was unable to do the caching (though with the ubiquity of file-based databases like SQLite it's unlikely you'll be working on a system with no database).

You could use this to do two-way synchronisation too, potentially. So if a file hasn't changed locally, you could get the client to compare to the S3 version, and fetch that to replace the local file if they are different. Though I think I will be concentrating on doing it in one direction first.

Comments

Brackup

Check out Brackup; it may be exactly what you want.

Has anyone tried s3rsync.com?

Hi,

Has anyone tried s3rsync.com?
They claim to provide full rsync on top of S3.
Does it really work?

Dan

Haven't tried it yet, but sounds interesting.

Unison?

Just a note that I'm able to run Unison on JungleDisk on OS X. JungleDisk gave some weird errors when I tried to bulk-copy from within Finder, so I'm a little sketched out, but Unison seems to work just fine. Still, I'm a little skeptical about how wonderfully cheap S3 storage actually is when you factor in bandwidth costs. I can get a DreamHost account with 20 gigs of storage and a terabyte of transfer for under $10; does S3 really improve so much on that? (I say as I'm uploading 10 gigs of backup data for a client to S3 because it's just so easy to set up!)

Cheers Ethan, good point. I hadn't thought about buying hosting just for the storage, but that might be a good way to go.

rsync using s3 implementation

You might want to take a look at this if you haven't seen it already. I'd read the first post to get an idea of what his goals were and then skip to the last post to see where they've gotten with it. Sounds pretty impressive.

http://developer.amazonwebservices.com/connect/thread.jspa?threadID=1027...

Someone had the exact same thought as you and has already gotten quite a ways along the way. Unfortunately for me, it's on Linux and relies on that at the moment, but it sounds like you might be able to use it.

Oh, thanks for your S3 library, it was a good starting point for my own application.

Thanks for that Gabriel. I had thought that writing a FUSE layer would be a good way to go, so interesting to see someone taking this approach.

Glad the code was useful. I'm doing a fair bit of hacking on it this week!

Thanks for that. Looks interesting. I'm concentrating on the sync code at the moment, and need to get some ideas about how to make this efficient. Is it just me or is S3 really slow? Is it the Ruby HTTP libraries? Not sure.

Slow?

Do you mean slow as in actual download speed (bandwidth), or the sending of commands and the processing time?

I can't say that I had really focused on speeds - but I know it's not breaking any records...

The upload to S3 is the thing which seems slow: my estimates put it at around 12 seconds per megabyte (on a connection which is a lot faster than that). This is OK for small files, but would mean more than 3 hours of upload per gigabyte (12 seconds x 1024 megabytes is about 3.4 hours). If you've got a few big files, that's not good enough. Perhaps my figures are wrong, or my code is really hokey. I've even been considering writing some C code (which would entail learning C, unfortunately) so I can make the upload more efficient. Perhaps I should do some experiments with Java/Python programs for S3 access, to get an idea of how fast they go.

Ruby S3 library

I had started on a Ruby library for S3 that acted more like a filesystem (think root, directories, files etc.) than buckets, but quickly abandoned it.

I just put up anonymous access if it might be useful to you:

https://saucyworks.devguard.com/svn/projects/saucy3/
anon/anon

It's very rough - but it's always nice to have others' implementations lying around if you need a quick 'how do others do X' refresher.

Thanks Ryan: some good ideas in there.