Incremental backups in AWS Glacier using Duplicity

As the CTO of my household, I’m responsible for keeping our irreplaceable data safe. To that end, I have a household networked storage device that runs RAID to guard against the failure of a single drive. But what if that thing gets stolen, or the house burns down? Our data would be gone forever.

The answer, obviously, is offsite backups. I’ve been looking around for a while for a sane way to do this, given that I have hundreds of gigs of data, mostly photographs, that I need to back up. Duplicity seems like a good option, since it’s incremental and has an S3 backend. But I wasn’t looking forward to paying hundreds of dollars a year for my wife’s increasingly large raw-photography collection, and I didn’t really trust my dinky web host, my only offsite server, to keep the data safe.

But now there’s Glacier, which offers the durability of S3 at a tenth of the price. The trade-off is that you can’t get immediate access to your files anymore; it takes several hours to make them available for download. But since this is meant to handle the catastrophic failure of my on-site backup, that’s acceptable.
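
To give a concrete idea of what “making them available” involves: restoring a Glacier-archived object back to readable S3 storage is an explicit, slow request. Here’s a minimal sketch with boto 2.x (the bucket and key names below are made up, and this isn’t part of glacierplicity itself):

    import boto

    # Hypothetical bucket and key names, purely for illustration.
    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-offsite-backups')
    key = bucket.get_key('photos/duplicity-full.20130101T000000Z.vol1.difftar.gpg')

    # Ask S3 to stage a temporary, downloadable copy of the archived object
    # for 7 days. The copy only becomes readable several hours later, which
    # is why Glacier works for disaster recovery but not everyday restores.
    key.restore(days=7)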

However, when I went to implement this solution the brain-dead simple way, I quickly ran into a couple big problems:

  • The first backup takes a long time to complete. I mean a looooooong time, measured in weeks. And since it’s all one big archive from duplicity’s perspective, there’s no way to retry without starting over completely.
  • When files transition to Glacier, duplicity can’t download them immediately. But it needs immediate access to the *.manifest files it creates and uploads on each run, or else it can’t do an incremental backup.

The first issue made the whole process unwieldy and error-prone, but the second one was a total non-starter, and sent me back to the drawing board for a couple of weeks. Then I found this blog post with a perfect, if hacky, solution to the problem of the manifest files. I decided to write it up, together with an even hackier solution to the giant-sized archive problem I was having, as a Python script.

The result is glacierplicity, my first public non-work GitHub repository! Feast upon its bounty! I hope that S3 hurries up and makes its lifecycle rules expressive enough to obsolete this script, but I’m not holding my breath.
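
For reference, this is roughly what a lifecycle rule looks like through boto 2.x (the bucket name is a placeholder, and this snippet isn’t taken from the script itself). The catch is that rules can only match on a key prefix, so there’s no clean way to keep duplicity’s *.manifest files out of a bucket-wide transition to Glacier:

    import boto
    from boto.s3.lifecycle import Lifecycle, Rule, Transition

    # Placeholder bucket name, for illustration only.
    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-offsite-backups')

    # Transition everything in the bucket to Glacier one day after upload.
    # Prefix is the only filter available, so *.manifest files can't be
    # excluded by suffix; that gap is what the script has to work around.
    to_glacier = Transition(days=1, storage_class='GLACIER')
    rule = Rule(id='archive-everything', prefix='', status='Enabled',
                transition=to_glacier)

    lifecycle = Lifecycle()
    lifecycle.append(rule)
    bucket.configure_lifecycle(lifecycle)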

Also, if you’re going to try to run this on your DNS-323 or another NAS like I am, you might find the following resources very helpful:

  • http://www.drak0.com/2008/06/09/dns323-duplicity-encrypted-offsite-backup-bliss/#more-4
  • http://www.drak0.com/2008/06/13/duplicity-on-dns323-shell-scripts/
  • https://code.google.com/p/dns323-builds/

At one point I had to build Python 2.6.5 from source to be able to use the most recent version of boto. Provided this backup cron job works, I think it will all have been worthwhile.



7 Responses


  1. Nate says

    Is there a reason you have the dir_size_threshold hack with multiple archives rather than the approach of just using a single bucket with a large duplicity volsize? i.e., run duplicity with something like "--volsize 2000"? Wouldn’t this solve the same problem in a much simpler way?

    • Zach Musgrave says

      I wasn’t familiar with the volsize option when I wrote this, but it looks like that wouldn’t help. Duplicity will break each backup chain into lots of little volumes of size volsize. For my version of duplicity (a bit old) and using the S3 backend, that number is 5MB. In practice, this means that it uploads 5MB objects into S3, rather than arbitrarily large ones.

      The main reason I have the dir_size_threshold hack is to make the script resilient. I have several hundred gigs of photos to back up. I tried this at first with a single large archive, and it kept dying in the middle of the multi-week backup process. Every time this happened, duplicity had to start from scratch. Breaking things into multiple archives means that when the script fails in the middle, you don’t lose all the work you have done so far.

      Because S3 limits you to 100 buckets, I think a better approach would be to combine many archives into a single bucket using different paths. But that’s more complicated than I wanted to attempt on my first try.
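
      Roughly, the partitioning idea looks like the sketch below. This is not the actual glacierplicity code; the threshold value, bucket naming, and duplicity invocation are simplified placeholders (GPG options, excludes, and loose files sitting directly in a split directory are all glossed over):

      import os
      import subprocess

      DIR_SIZE_THRESHOLD = 10 * 1024 ** 3  # 10 GB, illustrative only

      def dir_size(path):
          """Total size in bytes of everything under path."""
          total = 0
          for root, dirs, files in os.walk(path):
              for name in files:
                  try:
                      total += os.path.getsize(os.path.join(root, name))
                  except OSError:
                      pass
          return total

      def backup_archive(path, bucket_name):
          """Run one self-contained duplicity archive for this subtree."""
          # Encryption/GPG flags omitted; the bucket must already exist.
          subprocess.check_call([
              'duplicity', path, 's3+http://%s' % bucket_name,
          ])

      def partition_and_backup(path, prefix='glacierplicity'):
          # Small subtrees get their own archive; big ones are split by
          # recursing into their children, so a failure mid-run only loses
          # the one archive that was in flight, not weeks of uploads.
          if dir_size(path) <= DIR_SIZE_THRESHOLD:
              backup_archive(path, '%s-%s' % (prefix, os.path.basename(path).lower()))
          else:
              for entry in sorted(os.listdir(path)):
                  child = os.path.join(path, entry)
                  if os.path.isdir(child):
                      partition_and_backup(child, prefix)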

      • Nate says

        But if you specified "--volsize 2000" instead of "--volsize 5", you could have a smaller number of ~2GBish chunks. It should also solve your problem of having to restart transfers, since having to restart a 2GB chunk isn’t as catastrophic as having to restart hundreds of GB.

        I’m setting up a system to do something similar to what you are, but with several TBs of data. Hopefully I’m not missing anything.

        • Zach Musgrave says

          What you’re missing (I think) is the difference between archives and volumes. Each archive is broken up into multiple volumes. Each archive upload must be essentially atomic, so it doesn’t matter what volume size you specify; if you lost power in the middle of uploading an archive of several terabytes (which would take months on my cable internet connection), you would need to start from scratch the next time the script runs. This problem is, in my opinion, the main weakness of duplicity as it exists today, and is the main reason I implemented this hack.

          There are probably better ways to accomplish the partitioning process. This way works reasonably well for me, since I have a pretty hierarchical directory structure (photos organized by year, then by broad sets of photos). Seems like getting much better partitioning than this would be best accomplished by a patch to duplicity itself.

          • Nate says

            OK, I think I understand where you’re coming from better now. You’re right that I didn’t realize the archive uploads needed to be atomic.

            But do they really need to be atomic? I just did some tests here with duplicity 0.6.21 and I can clearly see it makes an attempt to support resuming from the latest volume of the archive uploaded. It was somewhat confusing to figure out the caveats, but after looking at the duplicity source I learned two things:

            1) Duplicity will always re-upload the latest volume uploaded, even if it was 100% complete, because it can’t guarantee its integrity. Checking its integrity would involve downloading the volume, so it’s simpler to just re-upload it.

            2) When resuming a failed backup, duplicity will always attempt to download the first volume to verify that the encryption settings haven’t changed since you last ran it. This was confusing, because the console messages suggested that duplicity was starting over at volume 1, with no hints that it’s checking encryption settings on the remote volume 1.

            For me, I don’t ever plan on changing my GPG password in between failed uploads, so I patched my duplicity as follows to get rid of this unnecessary download:

            --- duplicity.orig    2013-05-09 14:57:47.111610990 -0500
            +++ duplicity         2013-05-09 14:58:50.535610862 -0500
            @@ -343,7 +343,11 @@
                 mf = globals.restart.last_backup.get_local_manifest()
                 globals.restart.checkManifest(mf)
                 globals.restart.setLastSaved(mf)
            -    validate_encryption_settings(globals.restart.last_backup, mf)
            +    # Modified because we will not change encryption settings in between failed uploads
            +    # Don't unnecessarily download the first volume
            +    skip_encryption_validate = True
            +    if not skip_encryption_validate:
            +        validate_encryption_settings(globals.restart.last_backup, mf)
                 mf.fh = man_outfp
                 last_block = globals.restart.last_block
                 log.Notice("Restarting after volume %s, file %s, block %s" %

          • Zach Musgrave says

            That’s really interesting. I was running into the same problem on a retry: duplicity would die because it couldn’t download the first volume, it having already transitioned into Glacier by the time of the retry. I’ll have to give this a shot on my end. The only drawback to this approach is that you will need to keep everything in S3 for the length of that several-month initial backup, since you don’t want your .manifest files transitioning to Glacier during this time. Seems like your patch, along with this one, is the best long-term solution:

            https://bugs.launchpad.net/duplicity/+bug/1170161

  2. Jay says

    Hey Zach,
    I’ve been using your script with great success on a FreeBSD box. Got something like 200GB successfully stored on Glacier now.

    There’s one thing I’ve noticed, however: I’m unable to get ignore_directories to be recognised. No matter what I do, the directories never get ignored. I’ve tried absolute and relative paths.

    example:

    backup_dirs = ['/home/jay/test']
    ignore_directories = ['00Catalogs','dir2']

    Those 2 directories are subdirectories under /home/jay/test.

    Both directories still get included in the backup.

    I’m running Python version 2.7.

    Any ideas?

    Cheers


