Thursday, July 22, 2010

Personal digital curation: a software category that does not exist, but ought to

If you're a typical computer user today, you have lots of data that you've created or participated in creating. The data takes several forms, typically including:

  • Personal documents: writing, drawing, photographs, home movies, etc.
  • Purchased media: music, movies, software, etc.
  • Electronic records: receipts, account records, etc.
  • Communication: email, chat logs, & other social media messages.

There are several things about how you keep this data today which ought to bother you.

First, it's probably inadequately backed up. By this, I mean that you don't back it up frequently enough, and when you do, it's probably either (a) on a USB hard drive or (b) on cloud storage. Taken alone, either of these is inadequate. Typical hard drive backups are inadequate because on most file systems, the data's neither pervasively checksummed nor stored with redundant error-correcting codes; as a result, all your files are subject to random corruption. There's also the small matter that you probably don't archive your backups regularly to a remote location that's safe if, e.g., your home burns down. Cloud storage alone is inadequate because (1) any compromise to your account could result in malicious deletion or corruption of your data and (2) Amazon/Google/... may seem like permanent institutions today, but on the scale of decades I am somewhat dubious; DEC and SGI and Sun were once lords of Silicon Valley; ultimately one must consider Phlebas and all that.

Second, the data's stored in a mishmash of formats, some of which will be exceptionally difficult to read in years to come. Microsoft's file formats are particularly egregious, but I also have my doubts that, for example, today's video file formats, or an iPhoto or Picasa metadata database, will be readable by commonly available software in twenty years.

Third, a lot of your data's stored in multiple related forms, and the relationship between those forms is totally ad hoc and not captured by future-proof software. For example, you might have a batch of raw photos, of which you pick a few to clean up, rescale to lower resolution, and upload to the web. So now you've got multiple versions of the photo. If you need to go trawling through this mess some years from now, you're in for a lot of curatorial tedium reorganizing it and figuring out what's redundant and discardable versus what's a pristine original that you must keep.

Conquering any one of these problems, let alone all of them, requires serious geekery today. For example, if you want to have good backups, you need to store data both in cloud storage and on multiple media, and you need software that records and verifies the checksums of all your files. The other two problems are just as gnarly, if not more so.
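To make the checksum part concrete, here is a minimal sketch in Python of what "records and verifies the checksums of all your files" could look like. It is an illustration, not a description of any existing backup tool: the function names (`sha256_of`, `build_manifest`, `verify`) and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the files under root against a previously recorded manifest."""
    current = build_manifest(root)
    return {
        # Recorded earlier but no longer present.
        "missing": sorted(set(manifest) - set(current)),
        # Present in both, but the content hash differs (edit or corruption).
        "changed": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
        # Present now but never recorded.
        "new": sorted(set(current) - set(manifest)),
    }
```

A checksum mismatch cannot tell a deliberate edit from bit rot, so a real tool would also need to track when you meant to change a file. The manifest itself should live outside the tree it describes, and in more than one place.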

This may seem like a totally anorakish concern that doesn't matter to most people. Maybe most people are OK with most of their data being ephemeral, except for the rare object that they print out into physical form. Maybe it's just me, because I've been thinking about posterity a lot lately — including photos and video, my raw data generation rate has risen to something like a couple of GB per month. But I suspect I'm merely one of the people on the leading edge of this problem. Someday, everybody will generate a couple of GB per month and they really will want to share the family albums with their grandchildren without inordinate curatorial effort.

So, as far as I can tell, there's a big gaping hole in the market for personal digital curation software — software that would help you not only back up your data (there's plenty of software out there for that) but that would take care of ensuring the posterity of your data. This implies at least (1) backing it up to multiple distributed locations, (2) transcoding it into future-proof forms, and (3) remembering the relationship between different parts of your data.

This software would not be simple to build. It would have to be cross-platform. To offer a credible promise of future-proofness, it would have to be built on well-documented protocols and file formats so that if your organization went bust, someone else could write software, from scratch, that at least recovers the data. It would have to include software that manages common file types like photos, hook into existing software that manages them, or compute relationships between the files after the fact (for example, it would have to replace Picasa, hook into it, or figure out by post hoc analysis when two files are really variants of each other). It would have to be performant. It would need a nice UI.
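As a rough illustration of "remembering the relationship" in a well-documented format, the sketch below records which files are derived from which, and stores that record as plain JSON so that someone could read it back without the original software. Everything here is hypothetical: the `Asset` record, its field names, and the helper functions are invented for the example. It only covers relationships that are recorded explicitly; it does not attempt the harder post hoc analysis of detecting variants.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Asset:
    path: str                           # location relative to the archive root
    derived_from: Optional[str] = None  # path of the file this one was made from
    transform: str = ""                 # free-text note, e.g. "rescaled for web"


def index(assets: list[Asset]) -> dict[str, Asset]:
    """Look assets up by path."""
    return {a.path: a for a in assets}


def originals(assets: list[Asset]) -> list[Asset]:
    """Assets with no recorded parent: the pristine files you must keep."""
    return [a for a in assets if a.derived_from is None]


def root_of(by_path: dict[str, Asset], path: str) -> str:
    """Follow derived_from links back to the original (KeyError if a link dangles)."""
    seen = set()
    while by_path[path].derived_from is not None:
        if path in seen:
            raise ValueError(f"cycle in derivation chain at {path}")
        seen.add(path)
        path = by_path[path].derived_from
    return path


def to_json(assets: list[Asset]) -> str:
    """Serialize to plain JSON, a format likely to stay readable for decades."""
    return json.dumps([asdict(a) for a in assets], indent=2, sort_keys=True)


def from_json(text: str) -> list[Asset]:
    return [Asset(**item) for item in json.loads(text)]
```

With the photo example from earlier, a web upload can be traced back to its raw original:

```python
assets = [
    Asset("raw/IMG_1.cr2"),
    Asset("edit/IMG_1.tif", "raw/IMG_1.cr2", "cleaned up"),
    Asset("web/IMG_1.jpg", "edit/IMG_1.tif", "rescaled for web"),
]
print(root_of(index(assets), "web/IMG_1.jpg"))  # raw/IMG_1.cr2
```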

I suppose the difficulty of building such software is one reason it hasn't been done. Much easier to just build a social networking doodad or a little timewasting mobile app or whatever the next Valley flavor of the month is. On the other hand, I think there's actually a reasonable (although perhaps tough to pitch) business case. There are probably at least tens of thousands of digital obsessives in the world who'd pay Photoshop CS-level prices for a credible digital curation package. The need to support new file formats or cloud storage APIs as they come online could provide a steady stream of upgrade revenue. If you built it right, then there's the potential for standard licensing deals where you bundle value-subtracted versions of the software with new computers, digital cameras, and other doodads.

Oh well. Anyway, add it to the list of stuff that I wish existed but does not, and also the list of things I wish I had the time and focus to write but will probably not get to in my lifetime.

UPDATE 2010-08-03: Apparently you can actually learn something by blogging in ignorance and waiting until Reddit sends some commenters your way. There's an IT service category called digital asset management (DAM), and it's a big deal for enterprises (which shouldn't be surprising). (In library science, the analogous problem is called digital curation, which IMO is closer to the problem I care about.) The question, I suppose, is whether DAM can be scaled down, made sufficiently comprehensive, and encapsulated in a mixture of consumer-grade software and services so that individuals can have credible assurance that their data will be preserved on decades-long time scales. I'm somewhat dubious that Expression Media or Bridge can really offer that kind of promise (for example, those packages seem media-focused; do they back up stuff like email and source code?) but of course I haven't looked very deeply. Thanks interwebs!


  1. What do you currently use for an online backup solution?

  2. Argh, such a mishmash it's not even really a solution.

    I have a Flickr pro account which I maintain for my family and I put a lot of my digital originals there, although I don't/can't share everything with my family. I have a Picasa account onto which, with Google paid storage, I can dump full-sized postprocessed photos; but since I send those albums to people, I don't upload my originals to that account. And then I have some git/svn repos on my Dreamhost account, but that's obviously not all my documents. And the rest of my stuff I dump onto a USB hard drive now and then.

    So, really, it's just frustrating and incomplete. I guess I should look into Amazon S3 but I'm not sure that would be economical in the long term. Maybe it would.

  3. You're describing a pursuit of useless hoarding. Sometimes it may be a blessing to lose things and forget. Nothing is meant to last forever. Having a fresh start can produce far more interesting results than you could've achieved by lugging a boatload of everything you ever possessed. Imagine the "worst": you lost all your digital stuff. Would you really be so hurt? Look at the real world: even the best curated collections of art and literature get lost and permanently damaged. The world sighs, shrugs its shoulders and moves on. What matters most is what you do with the stuff you currently have, not what you could have done with it.

  4. Alex:

    So you're saying it's "not that bad" to lose your entire lifetime of pictures? Of music or digital art you have ever created? I disagree. Not all people have a problem with hoarding, and I dare say that many can in fact choose between "this matters to me, so I should back it up" and "I can handle losing this".

  5. What about Evernote? I think that is at least in the category of personal digital curation software. It stores the data locally and in the cloud.

  6. DAM - digital asset management. it's more of an entire occupation rather than a "software category", but it functions as that, too. if you get good enough at doing your own data up with that kind of robustness, you'd have a job waiting for you at quite a few creative shops. good luck, sir.