Thursday, July 22, 2010

Personal digital curation: a software category that does not exist, but ought to

If you're a typical computer user today, you have lots of data that you've created or participated in creating. The data takes several forms, typically including:

  • Personal documents: writing, drawing, photographs, home movies, etc.
  • Purchased media: music, movies, software, etc.
  • Electronic records: receipts, account records, etc.
  • Communication: email, chat logs, & other social media messages.

There are several things about how you keep this data today which ought to bother you.

First, it's probably inadequately backed up. By this, I mean that you don't back it up frequently enough, and when you do, it's probably either (a) on a USB hard drive or (b) on cloud storage. Taken alone, either of these is inadequate. Typical hard drive backups are inadequate because on most file systems, the data's neither pervasively checksummed nor stored with redundant error-correcting codes; as a result, all your files are subject to random corruption. There's also the small matter that you probably don't archive your backups regularly to a remote location that's safe if, e.g., your home burns down. Cloud storage alone is inadequate because (1) any compromise to your account could result in malicious deletion or corruption of your data and (2) Amazon/Google/... may seem like permanent institutions today, but on the scale of decades I am somewhat dubious; DEC and SGI and Sun were once lords of Silicon Valley; ultimately one must consider Phlebas and all that.

Second, the data's stored in a mishmash of formats, some of which will be exceptionally difficult to read in years to come. Microsoft's file formats are particularly egregious, but I also have my doubts that, for example, today's video file formats, or an iPhoto or Picasa metadata database, will be readable by commonly available software in twenty years.

Third, a lot of your data's stored in multiple related forms, and the relationship between those forms is totally ad hoc and not captured by future-proof software. For example, you might have a batch of raw photos, of which you pick a few to clean up, rescale to lower resolution, and upload to the web. So now you've got multiple versions of the photo. If you need to go trawling through this mess some years from now, you're in for a lot of curatorial tedium reorganizing it and figuring out what's redundant and discardable versus what's a pristine original that you must keep.

Conquering any one of these problems, let alone all of them, requires serious geekery today. For example, if you want to have good backups, you need to store data both in cloud storage and on multiple media, and you need software that records and verifies the checksums of all your files. The other two problems are just as gnarly, if not more so.
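To make the "records and verifies the checksums" part concrete, here's a minimal sketch in Python. It is not a real backup tool; the function names (sha256_of, build_manifest, verify_manifest) and the choice of SHA-256 are my own assumptions for illustration. It only detects corruption; it can't repair anything, which is why you'd still want multiple copies.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the files under root against a previously recorded manifest."""
    current = build_manifest(root)
    return {
        # Recorded earlier but no longer present.
        "missing": sorted(k for k in manifest if k not in current),
        # Present in both, but the contents no longer match.
        "changed": sorted(
            k for k in manifest if k in current and current[k] != manifest[k]
        ),
        # Present now but never recorded.
        "new": sorted(k for k in current if k not in manifest),
    }
```

The idea would be to save the manifest somewhere outside the directory being checked (say, as JSON alongside each backup copy), then rerun verify_manifest periodically; a file that shows up as "changed" when you know you never edited it is a candidate for silent corruption.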

This may seem like a totally anorakish concern that doesn't matter to most people. Maybe most people are OK with most of their data being ephemeral, except for the rare object that they print out into physical form. Maybe it's just me, because I've been thinking about posterity a lot lately — including photos and video, my raw data generation rate has risen to something like a couple of GB per month. But I suspect I'm merely one of the people on the leading edge of this problem. Someday, everybody will generate a couple of GB per month and they really will want to share the family albums with their grandchildren without inordinate curatorial effort.

So, as far as I can tell, there's a big gaping hole in the market for personal digital curation software — software that would help you not only back up your data (there's plenty of software out there for that) but that would take care of ensuring the posterity of your data. This implies at least (1) backing it up to multiple distributed locations, (2) transcoding it into future-proof forms, and (3) remembering the relationship between different parts of your data.

This software would not be simple to build. It would have to be cross-platform. To offer a credible promise of future-proofness, it would have to be built on well-documented protocols and file formats so that if your organization went bust, someone else could write software, from scratch, that at least recovers the data. It would have to either include software that manages common file types like photos, or to hook into existing software that manages them, or compute relationships between the files after the fact (for example, it would have to either replace Picasa, or hook into it, or be able to figure out by post hoc analysis when two files were really variants of each other). It would have to be performant. It would need a nice UI.

I suppose the difficulty of building such software is one reason it hasn't been done. Much easier to just build a social networking doodad or a little timewasting mobile app or whatever the next Valley flavor of the month is. On the other hand, I think there's actually a reasonable (although perhaps tough to pitch) business case. There are probably at least tens of thousands of digital obsessives in the world who'd pay Photoshop CS-level prices for a credible digital curation package. The need to support new file formats or cloud storage APIs as they come online could provide a steady stream of upgrade revenue. If you built it right, then there's the potential for standard licensing deals where you bundle value-subtracted versions of the software with new computers, digital cameras, and other doodads.

Oh well. Anyway, add it to the list of stuff that I wish existed but does not, and also the list of things I wish I had the time and focus to write but will probably not get to in my lifetime.

UPDATE 2010-08-03: Apparently you can actually learn something by blogging in ignorance and waiting until Reddit sends some commenters your way. There's an IT service category called digital asset management (DAM), and it's a big deal for enterprises (which shouldn't be surprising). (In library science, the analogous problem is called digital curation, which IMO is closer to the problem I care about.) The question, I suppose, is whether DAM can be scaled down, made sufficiently comprehensive, and encapsulated in a mixture of consumer-grade software and services so that individuals can have credible assurance that their data will be preserved on decades-long time scales. I'm somewhat dubious that Expression Media or Bridge can really offer that kind of promise (for example, those packages seem media-focused; do they back up stuff like email and source code?) but of course I haven't looked very deeply. Thanks interwebs!

Tuesday, July 20, 2010

Ubuntu 10.04 (Lucid) on Dell Mini 12

After a recent system update, my Dell Mini 12 went on the fritz: wireless networking stopped working reliably. Obviously, that's completely unacceptable in a device like this. I guess I could have tried messing around with the configuration files and drivers, but Dell's oddball Ubuntu 8.04 (Hardy) lpia distribution has been feeling long in the tooth lately anyway. So, following my mostly satisfactory Lucid workstation experience, I decided to try upgrading my Mini 12 to Lucid.

And, once again, almost everything worked just fine.

I prepared a USB drive with the installer (actually just a memory card reader plus the card from my camera) according to Ubuntu's instructions. Then I rebooted (using F12 to bring up Dell's boot menu, and then selecting USB), and selected installation.

The installer was exceptionally sluggish — for which I blame the Mini 12's underpowered hardware — but otherwise the installation went through without a hitch.

If you try the same with your Dell Mini 12, you'll want to look at these notes before you run the installation. Two post-install tweaks are necessary:

  • To get acceptable graphics performance, you'll need to enable the Poulsbo GMA500 proprietary driver.
  • For the wireless networking you'll have to enable the Broadcom STA wireless driver from the System -> Administration -> Hardware Drivers menu.

Overall I'm mostly satisfied. Lucid both looks and feels much slicker than Hardy, from the fonts to the windows. The desktop distribution's UI works fine on the 12 inch screen. And once you perform the tweaks above, most everything in the hardware works fine, including wireless, bluetooth, trackpad, sound, and the webcam.

The fly in the ointment this time? Suspend and resume are sometimes flaky. In particular, sometimes resume either fails completely or requires that I switch virtual terminals a couple of times (Ctrl-Alt-F2, Ctrl-Alt-F7) to jog it out of its slumber. Given the way I use this device, it's actually less of a big deal than you might expect (basically, when suspend fails, I just hard-reboot and restart my web browsers and emacs), but if this matters to you a lot then you might want to hold off. I guess I could try debugging the problem, but like I said it hasn't been that important to me.

So, OK, I have to admit that owning this computer overall hasn't been a seamless experience. (But then nothing is these days, not even my MacBook Pro from work; I've had many travails with MacPorts and Fink and...) When I bought the Mini 12, my goal was to see whether a sub-$700 computer could keep me satisfied for more than one year, which would make it more cost-effective than a $2k computer which typically lasts me 3 years. In that sense, the experiment succeeded: it's lasted over a year, and I've gotten good mileage out of it. Meanwhile, I have not cursed at my Mini especially more than I've cursed at any other computing device I've ever owned. And the biggest positive qualities (compactness, light weight, near-silent operation) remain salient even today.

Monday, July 19, 2010

Virtual cosmetics, redux

OK, well, I thought they'd do faces before bodies but the basic motivation isn't far off (research paper link).

UPDATE 2011-01-09: Updated YouTube link; the original appears to have had its permissions toggled to private. Also added link to original research.