<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://townx.org" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>townx - Thoughts on rsync and S3 - Comments</title>
 <link>http://townx.org/blog/elliot/thoughts_on_rsync_and_s3</link>
 <description>Comments for &quot;Thoughts on rsync and S3&quot;</description>
 <language>en</language>
<item>
 <title>Brackup</title>
 <link>http://townx.org/blog/elliot/thoughts_on_rsync_and_s3#comment-38297</link>
 <description>&lt;p&gt;check out Brackup - may be exactly what you want.&lt;/p&gt;</description>
 <pubDate>Wed, 02 Jul 2008 15:42:09 -0500</pubDate>
 <dc:creator>Daniel Von Fange</dc:creator>
 <guid isPermaLink="false">comment 38297 at http://townx.org</guid>
</item>
<item>
 <title>Haven&#039;t tried it yet, but</title>
 <link>http://townx.org/blog/elliot/thoughts_on_rsync_and_s3#comment-34387</link>
 <description>&lt;p&gt;Haven&#039;t tried it yet, but sounds interesting.&lt;/p&gt;</description>
 <pubDate>Thu, 17 Apr 2008 18:19:46 -0500</pubDate>
 <dc:creator>elliot</dc:creator>
 <guid isPermaLink="false">comment 34387 at http://townx.org</guid>
</item>
<item>
 <title>Has anyone tried s3rsync.com?</title>
 <link>http://townx.org/blog/elliot/thoughts_on_rsync_and_s3#comment-26626</link>
 <description>&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;Has anyone tried s3rsync.com?&lt;br /&gt;
They claim to provide full rsync on top of S3.&lt;br /&gt;
Does it really work?&lt;/p&gt;

&lt;p&gt;Dan&lt;/p&gt;</description>
 <pubDate>Sun, 23 Mar 2008 01:45:40 -0500</pubDate>
 <dc:creator>Dan m</dc:creator>
 <guid isPermaLink="false">comment 26626 at http://townx.org</guid>
</item>
<item>
 <title>Thanks Ryan: some good ideas</title>
 <link>http://townx.org/blog/elliot/thoughts_on_rsync_and_s3#comment-4984</link>
 <description>&lt;p&gt;Thanks Ryan: some good ideas in there.&lt;/p&gt;</description>
 <pubDate>Fri, 08 Sep 2006 04:21:58 -0500</pubDate>
 <dc:creator>elliot</dc:creator>
 <guid isPermaLink="false">comment 4984 at http://townx.org</guid>
</item>
<item>
 <title>Cheers Ethan, good point. I</title>
 <link>http://townx.org/blog/elliot/thoughts_on_rsync_and_s3#comment-3356</link>
 <description>&lt;p&gt;Cheers Ethan, good point. I hadn&#039;t thought about buying hosting just for the storage, but that might be a good way to go.&lt;/p&gt;</description>
 <pubDate>Thu, 31 Aug 2006 16:32:39 -0500</pubDate>
 <dc:creator>elliot</dc:creator>
 <guid isPermaLink="false">comment 3356 at http://townx.org</guid>
</item>
<item>
 <title>Unison?</title>
 <link>http://townx.org/blog/elliot/thoughts_on_rsync_and_s3#comment-3127</link>
 <description>&lt;p&gt;Just a note that I&#039;m able to run Unison on JungleDisk on OS X.  JungleDisk gave some weird errors when I tried to bulk-copy from within Finder, so I&#039;m a little sketched out, but Unison seems to work just fine.  Still, I&#039;m a little skeptical about how wonderfully cheap S3 storage actually is when you factor in Bandwidth costs.  I can get a dreamhost account with 20 gigs storage and a terabyte of transfer for under $10, does s3 really improve so much on that? (I say as I&#039;m uploading 10 gigs of backup data for a client to s3 because it&#039;s just so easy to set up!)&lt;/p&gt;</description>
 <pubDate>Mon, 28 Aug 2006 23:29:17 -0500</pubDate>
 <dc:creator>Ethan</dc:creator>
 <guid isPermaLink="false">comment 3127 at http://townx.org</guid>
</item>
<item>
 <title>Thanks for that Gabriel. I</title>
 <link>http://townx.org/blog/elliot/thoughts_on_rsync_and_s3#comment-2780</link>
 <description>&lt;p&gt;Thanks for that Gabriel. I had thought that writing a &lt;span class=&quot;caps&quot;&gt;FUSE &lt;/span&gt;layer would be a good way to go, so interesting to see someone taking this approach.&lt;/p&gt;

&lt;p&gt;Glad the code was useful. I&#039;m doing a fair bit of hacking on it this week!&lt;/p&gt;</description>
 <pubDate>Tue, 22 Aug 2006 03:54:24 -0500</pubDate>
 <dc:creator>elliot</dc:creator>
 <guid isPermaLink="false">comment 2780 at http://townx.org</guid>
</item>
<item>
 <title>rsync using s3 implementation</title>
 <link>http://townx.org/blog/elliot/thoughts_on_rsync_and_s3#comment-2769</link>
 <description>&lt;p&gt;You might want to take a look at this if you haven&#039;t seen it already.  I&#039;d read the first post to get an idea of what his goals were and then skip to the last post to see where they&#039;ve gotten with it.  Sounds pretty impressive. &lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://developer.amazonwebservices.com/connect/thread.jspa?threadID=10271&amp;amp;tstart=0&quot; title=&quot;http://developer.amazonwebservices.com/connect/thread.jspa?threadID=10271&amp;amp;tstart=0&quot;&gt;http://developer.amazonwebservices.com/connect/thread.jspa?threadID=1027...&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Someone had the exact same thought as you and has already gotten quite a ways along the way.  Unfortunately for me, it&#039;s on Linux and relies on that at the moment, but it sounds like you might be able to use it.&lt;/p&gt;

&lt;p&gt;Oh, thanks for your S3 library, it was a good starting point for my own application.&lt;/p&gt;</description>
 <pubDate>Mon, 21 Aug 2006 22:05:08 -0500</pubDate>
 <dc:creator>gabriel</dc:creator>
 <guid isPermaLink="false">comment 2769 at http://townx.org</guid>
</item>
<item>
 <title>The upload to S3 is the</title>
 <link>http://townx.org/blog/elliot/thoughts_on_rsync_and_s3#comment-2574</link>
 <description>&lt;p&gt;The upload to S3 is the thing which seems slow: my estimates put it at around 12 seconds per megabyte (on a connection which is a lot faster than that). This is OK for small files, but would mean an upload of well over 3 hours per gigabyte. If you&#039;ve got a few big files, that&#039;s not good enough. Perhaps my figures are wrong, or my code is really hokey. I&#039;ve even been considering writing some C code (which would entail learning C, unfortunately) so I can make the upload more efficient. Perhaps I should do some experiments with Java/Python programs for S3 access, to get an idea of how fast they go.&lt;/p&gt;</description>
 <pubDate>Thu, 17 Aug 2006 14:52:44 -0500</pubDate>
 <dc:creator>elliot</dc:creator>
 <guid isPermaLink="false">comment 2574 at http://townx.org</guid>
</item>
<item>
 <title>Slow?</title>
 <link>http://townx.org/blog/elliot/thoughts_on_rsync_and_s3#comment-2572</link>
 <description>&lt;p&gt;Do you mean slow as in actual download speed (bandwidth), or the sending of commands and the processing time?&lt;/p&gt;

&lt;p&gt;I can&#039;t say that I had really focused on speeds - but I know it&#039;s not breaking any records...&lt;/p&gt;</description>
 <pubDate>Thu, 17 Aug 2006 14:28:00 -0500</pubDate>
 <dc:creator>Ryan Daigle</dc:creator>
 <guid isPermaLink="false">comment 2572 at http://townx.org</guid>
</item>
<item>
 <title>Thanks for that. Looks</title>
 <link>http://townx.org/blog/elliot/thoughts_on_rsync_and_s3#comment-2562</link>
 <description>&lt;p&gt;Thanks for that. Looks interesting. I&#039;m concentrating on the sync code at the moment, and need to get some ideas about how to make this efficient. Is it just me or is S3 really slow? Is it the Ruby &lt;span class=&quot;caps&quot;&gt;HTTP &lt;/span&gt;libraries? Not sure.&lt;/p&gt;</description>
 <pubDate>Thu, 17 Aug 2006 10:26:21 -0500</pubDate>
 <dc:creator>elliot</dc:creator>
 <guid isPermaLink="false">comment 2562 at http://townx.org</guid>
</item>
<item>
 <title>Ruby S3 library</title>
 <link>http://townx.org/blog/elliot/thoughts_on_rsync_and_s3#comment-2560</link>
 <description>&lt;p&gt;I had started on a ruby library for S3 that acted more like a filesystem (think root, directories, files etc...) than buckets but quickly abandoned it.&lt;/p&gt;

&lt;p&gt;I just put up anonymous access if it might be useful to you:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://saucyworks.devguard.com/svn/projects/saucy3/&quot; title=&quot;https://saucyworks.devguard.com/svn/projects/saucy3/&quot;&gt;https://saucyworks.devguard.com/svn/projects/saucy3/&lt;/a&gt;&lt;br /&gt;
anon/anon&lt;/p&gt;

&lt;p&gt;It&#039;s very rough - but it&#039;s always nice to have others&#039; implementations lying around if you need a quick &#039;how do others do X&#039; refresher.&lt;/p&gt;</description>
 <pubDate>Thu, 17 Aug 2006 09:33:28 -0500</pubDate>
 <dc:creator>Ryan Daigle</dc:creator>
 <guid isPermaLink="false">comment 2560 at http://townx.org</guid>
</item>
<item>
 <title>Thoughts on rsync and S3</title>
 <link>http://townx.org/blog/elliot/thoughts_on_rsync_and_s3</link>
 <description>&lt;p&gt;rsync is a great utility. I bloody love it, and use it to do all my backups: over the network via &lt;span class=&quot;caps&quot;&gt;SSH &lt;/span&gt;onto remote filesystems, and locally onto &lt;span class=&quot;caps&quot;&gt;USB &lt;/span&gt;drives. I also use it to restore my home directory when I switch machines. Great.&lt;/p&gt;

&lt;p&gt;However, the network part of my backup solution is expensive (in money). I use &lt;a href=&quot;http://strongspace.com/&quot;&gt;Strongspace&lt;/a&gt; at the moment, which works really well. But it is too expensive: $8 a month for 4GB of storage, which means I have to keep a close eye on my files to make sure nothing massive goes over and maxes out my account. There are alternatives, of course, and the &lt;a href=&quot;http://blogs.zdnet.com/Google/?p=121&quot;&gt;GDrive&lt;/a&gt; might solve all of my problems; but for now I&#039;m stuck with not enough network storage. &lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://s3.amazonaws.com/&quot;&gt;S3&lt;/a&gt; gives me the right price, but there is no mature client solution which does the job for me (&lt;a href=&quot;http://www.jungledisk.com/&quot;&gt;JungleDisk&lt;/a&gt; looked promising, but it seems a bit buggy on my Linux machine - it sometimes just hangs - and Nautilus doesn&#039;t properly mount WebDAV shares onto the filesystem, so I can&#039;t rsync to it). So I&#039;ve been working on an S3 library for Ruby to allow me to sync a local filesystem to S3 automatically, only transferring the files which have changed.&lt;/p&gt;

&lt;p&gt;I&#039;ve been reading up on rsync a bit to work out what it does, and it&#039;s pretty clever: here&#039;s a &lt;a href=&quot;http://rsync.samba.org/tech_report/tech_report.html&quot;&gt;technical report which explains the inner workings&lt;/a&gt;. It does a block by block copy from the source file to the target file, using a so-called &quot;rolling checksum&quot; on each block to decide whether: a) the source contains the block, but the target does not; or b) the block in the source file exists somewhere in the target file. In this way, only changed blocks are copied from the source file to the target file (plus some checksums and block indexes). This is what makes rsync so fast.&lt;/p&gt;
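
&lt;p&gt;To make the rolling part concrete, here is a minimal sketch of the weak checksum as I read it from the technical report, in Python purely for illustration. The function names and the block length of 8 are my own choices, and real rsync pairs this cheap checksum with a stronger hash before it trusts a match.&lt;/p&gt;

```python
M = 65536  # modulus from the rsync technical report (2**16)


def weak_checksum(block):
    """Compute the two checksum halves (a, b) for a block of bytes from scratch."""
    n = len(block)
    a = sum(block) % M
    # Byte i (0-based) is weighted by its distance from the end of the block.
    b = sum((n - i) * byte for i, byte in enumerate(block)) % M
    return a, b


def roll(a, b, old_byte, new_byte, block_len):
    """Slide the window one byte: drop old_byte from the front, append new_byte."""
    a = (a - old_byte + new_byte) % M
    b = (b - block_len * old_byte + a) % M
    return a, b


def combine(a, b):
    """Pack both halves into a single 32-bit value."""
    return a + M * b


# Illustrative data and block length (assumptions, not rsync defaults).
data = b"the quick brown fox jumps over the lazy dog"
block_len = 8
a, b = weak_checksum(data[:block_len])
for start in range(1, len(data) - block_len + 1):
    a, b = roll(a, b, data[start - 1], data[start + block_len - 1], block_len)
    # Rolling must agree with recomputing the checksum for the new window.
    assert (a, b) == weak_checksum(data[start:start + block_len])
```

&lt;p&gt;The point is that moving the window along by one byte costs a couple of additions rather than a full pass over the block, which is what makes checking every offset in a file affordable.&lt;/p&gt;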

&lt;p&gt;I&#039;m not sure whether you could work things out the same way with &lt;span class=&quot;caps&quot;&gt;S3.&lt;/span&gt; Each resource (accessible via a key into a bucket) on S3 has an &lt;span class=&quot;caps&quot;&gt;MD5 &lt;/span&gt;checksum associated with it, so you could get a checksum for individual blocks or file fragments; but I&#039;m not sure I&#039;d want to split a file over multiple S3 keys to be able to do block-level copying. It might be possible, since a bucket can contain as many keys as you like (even though you are limited to 100 buckets). But to retrieve a file you would need to recompose it from the fragments on &lt;span class=&quot;caps&quot;&gt;S3, &lt;/span&gt;which is a bit crap.&lt;/p&gt;

&lt;p&gt;The obvious approach would be to associate one key with each object you put onto &lt;span class=&quot;caps&quot;&gt;S3.&lt;/span&gt; This has the advantage of making each file addressable by &lt;span class=&quot;caps&quot;&gt;URL, &lt;/span&gt;rather than requiring a reconstruction of the file from S3 fragments. Then when you sync your filesystem to &lt;span class=&quot;caps&quot;&gt;S3, &lt;/span&gt;you do a comparison of the &lt;span class=&quot;caps&quot;&gt;MD5 &lt;/span&gt;checksum of the local file to the &lt;span class=&quot;caps&quot;&gt;MD5 &lt;/span&gt;checksum of the remote file. This could be pretty painful, though, and take forever: one call to S3 to get the target checksum, a checksum on the local file, then transfer of the whole file up to S3 if it has changed. This could perhaps be streamlined: instead of requesting the object using &lt;span class=&quot;caps&quot;&gt;GET, &lt;/span&gt;you could use a &lt;span class=&quot;caps&quot;&gt;HEAD &lt;/span&gt;request, which just gets the metadata; and perhaps the client could send an &lt;strong&gt;If-None-Match&lt;/strong&gt; header with the request, passing the &lt;span class=&quot;caps&quot;&gt;MD5 &lt;/span&gt;checksum as the value for the header - in this case, if the &lt;span class=&quot;caps&quot;&gt;MD5 &lt;/span&gt;checksum in the request matches the checksum on &lt;span class=&quot;caps&quot;&gt;S3,&lt;/span&gt; S3 will return a 304 response code (not modified), so we know the object hasn&#039;t changed and we don&#039;t need to parse the &lt;span class=&quot;caps&quot;&gt;MD5 &lt;/span&gt;checksum out of the response. This could save a few cycles.&lt;/p&gt;
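
&lt;p&gt;A rough sketch of the client side of that check, in Python for illustration. It assumes the ETag that S3 reports is the hex MD5 of the object, which holds for a simple single-request upload; the function names are hypothetical, and sending the HEAD request itself is left out.&lt;/p&gt;

```python
import hashlib


def local_md5(path, chunk_size=65536):
    """Hex MD5 of a file, read in chunks so large files are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def if_none_match_header(md5_hex):
    """Build the conditional header; ETag values are wrapped in double quotes."""
    # Assumption: the remote ETag is the plain hex MD5 of the object body.
    return {"If-None-Match": '"' + md5_hex + '"'}


def needs_upload(status_code):
    """304 means the remote checksum matched; any other status means transfer."""
    return status_code != 304
```

&lt;p&gt;With this shape the only thing the client reads from the response is the status code, so a missing object (404) and a changed object (200) both fall through to an upload.&lt;/p&gt;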

&lt;p&gt;An alternative might be to put a piece of &quot;file modified&quot; metadata onto S3 (not the same as the date on the S3 resource, but a copy of the file modification time of the local file when it was transferred). Then just compare the local file modification time to the S3 metadata when the algorithm is deciding whether a file needs to be transferred. The file stat to get the modification time is likely to be far faster than an &lt;span class=&quot;caps&quot;&gt;MD5 &lt;/span&gt;hash.&lt;/p&gt;
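
&lt;p&gt;Something like the following, sketched in Python, is what I have in mind. The x-amz-meta- prefix is how S3 carries user metadata; the key name mtime and both function names are assumptions of mine, and remote_headers stands for whatever dictionary of response headers your HTTP client hands back.&lt;/p&gt;

```python
import os


def mtime_metadata(path):
    """Header carrying the local modification time (whole seconds) for the upload."""
    # Assumption: the metadata key is called mtime; any name after the prefix would do.
    return {"x-amz-meta-mtime": str(int(os.stat(path).st_mtime))}


def changed_since_upload(path, remote_headers):
    """True if the file is new to S3 or its modification time differs from the stored one."""
    stored = remote_headers.get("x-amz-meta-mtime")
    if stored is None:
        return True
    return int(os.stat(path).st_mtime) != int(stored)
```

&lt;p&gt;Comparing for inequality rather than for newer-than means a file restored from an older backup also counts as changed.&lt;/p&gt;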

&lt;p&gt;Yet another approach would be to just keep track locally of file paths and modification times (e.g. in a database) from the last time they were sent to &lt;span class=&quot;caps&quot;&gt;S3.&lt;/span&gt; I think this is what JungleDisk does. Anything that has been added/removed/changed will be transferred without needing to reference S3 at all, so no expensive network operations. However, this will only work if S3 is only being sync&#039;d from one location, and won&#039;t work if you are trying to sync from multiple locations. Is this enough, or should local files always be compared to S3 to determine whether they should be transferred? Maybe a local database of file modification times is enough?&lt;/p&gt;
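
&lt;p&gt;As a sketch of the local-database idea (Python and SQLite for illustration; the table name sent and all function names are made up, and I have no idea whether JungleDisk actually works like this):&lt;/p&gt;

```python
import os
import sqlite3


def open_db(db_path=":memory:"):
    """Open (or create) the database of what was last sent to S3."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS sent (path TEXT PRIMARY KEY, mtime INTEGER)")
    return conn


def scan(root):
    """Map every file path under root to its modification time in whole seconds."""
    found = {}
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            found[path] = int(os.stat(path).st_mtime)
    return found


def diff(conn, current):
    """Return (to_upload, to_delete) by comparing a scan against the recorded state."""
    recorded = dict(conn.execute("SELECT path, mtime FROM sent"))
    to_upload = sorted(p for p, m in current.items() if recorded.get(p) != m)
    to_delete = sorted(p for p in recorded if p not in current)
    return to_upload, to_delete


def record(conn, current):
    """Store the scanned state once the transfers have succeeded."""
    conn.execute("DELETE FROM sent")
    conn.executemany("INSERT INTO sent VALUES (?, ?)", list(current.items()))
    conn.commit()
```

&lt;p&gt;Nothing here talks to S3 at all, which is both the attraction and the weakness: the database only knows what this one machine sent.&lt;/p&gt;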

&lt;p&gt;One more idea: perhaps you could combine the database of local file modification times with a cache of &lt;span class=&quot;caps&quot;&gt;MD5 &lt;/span&gt;checksums for those local files. You could then:&lt;/p&gt;


&lt;ol&gt;
&lt;li&gt;Find any new files and transfer them to &lt;span class=&quot;caps&quot;&gt;S3, &lt;/span&gt;while their checksum is generated and stored in the local database in the background&lt;/li&gt;
&lt;li&gt;Find any files which have changed, generate new checksums, cache them, and then compare those generated checksums to S3 checksums: the file gets transferred if its checksum differs from the one on S3&lt;/li&gt;
&lt;li&gt;Determine which local files have disappeared and optionally remove them from S3; then delete their cached &lt;span class=&quot;caps&quot;&gt;MD5 &lt;/span&gt;checksum from the database&lt;/li&gt;
&lt;li&gt;Any files which haven&#039;t changed can have their cached checksum compared to the checksum of the S3 resource&lt;/li&gt;
&lt;/ol&gt;
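
&lt;p&gt;The steps above could be sketched roughly like this, in Python for illustration. Everything here is hypothetical naming: plan_sync only decides what to do, the three dictionaries stand in for a directory scan, the local cache and a listing of S3 checksums, and for simplicity new files are hashed up front rather than in the background.&lt;/p&gt;

```python
def plan_sync(local_mtimes, cache, remote_md5, md5_of):
    """Decide what to upload and delete.

    local_mtimes: path to current modification time
    cache:        path to (mtime, md5) from the previous run; updated in place
    remote_md5:   path to the checksum S3 reports
    md5_of:       callable that hashes a local file
    """
    upload, delete = [], []
    for path, mtime in sorted(local_mtimes.items()):
        cached = cache.get(path)
        if cached is None or cached[0] != mtime:
            # Steps 1 and 2: new or modified file, so (re)generate and cache the checksum.
            cache[path] = (mtime, md5_of(path))
        # Steps 2 and 4: whatever its local state, compare the checksum to the one on S3.
        if remote_md5.get(path) != cache[path][1]:
            upload.append(path)
    for path in sorted(set(cache) - set(local_mtimes)):
        # Step 3: the file has gone locally; forget it and mark the S3 copy for removal.
        del cache[path]
        delete.append(path)
    return upload, delete
```

&lt;p&gt;An empty cache just means every file gets hashed on the first run, so the cache speeds things up without being required.&lt;/p&gt;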



&lt;p&gt;This has the advantage of some local caching, but isn&#039;t dependent on it; so you could use this approach to sync multiple machines to a single S3 bucket (unlike the database-only method). And the local cache could be optional, so you could use this approach from any machine, even if it was unable to do the caching (though with the ubiquity of file-based databases like &lt;span class=&quot;caps&quot;&gt;SQL&lt;/span&gt;ite it&#039;s unlikely you&#039;ll be working on a system with no database).&lt;/p&gt;

&lt;p&gt;You could use this to do two-way synchronisation too, potentially. So if a file hasn&#039;t changed locally, you could get the client to compare to the S3 version, and fetch that to replace the local file if they are different. Though I think I will be concentrating on doing it in one direction first.&lt;/p&gt;</description>
 <comments>http://townx.org/blog/elliot/thoughts_on_rsync_and_s3#comments</comments>
 <category domain="http://townx.org/tech">tech</category>
 <pubDate>Thu, 17 Aug 2006 04:05:50 -0500</pubDate>
 <dc:creator>elliot</dc:creator>
 <guid isPermaLink="false">383 at http://townx.org</guid>
</item>
</channel>
</rss>
