digital archives and the “bit rot” problem
Today, I went to a presentation given by Vint Cerf and Robert Kahn. One of the problems they presented as still unsolved was the problem of retaining information in a readable format in the long term. Vint made a pretty funny joke about trying to open a PPT file from 1997 in the year 3000. Even using Windows 3000 with Office 2998, the file may not necessarily be readable.
This is a problem that many people have experienced first-hand. Have any old 5 1/4″ floppies lying around? Think you can still read them? And assuming you can, can you then read the file formats, which may be proprietary to software that no longer exists?
Lo and behold, a couple of hours after the talk, there’s a Slashdot story on the very same topic, pointing to an American Scientist article titled “Avoiding a Digital Dark Age”.
Here are my thoughts on the matter. There are three separate layers here: the media longevity, the media format, the file format, and each needs to be designed with the same longevity goal in mind.
I will give an example here of how one software product, Bacula, handles this problem. It uses an open, documented format for its file contents, so you can print out the specification on paper, if you like, and then sit down, re-implement the code, and be able to read its files. It also uses the same format on different media, be it tape or disk. AFAIK, this design decision was made after seeing the evolution of ‘tar’ and GNU tar. Even though they share the same name, some versions of tar produce files that other versions cannot read.
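To make the "re-implement it from the printed spec" idea concrete, here is a minimal sketch in Python. The record layout below is entirely made up for illustration (the `ARC1` magic, the field order, and the function names are my assumptions); it is not Bacula's actual format. The point is only that when the layout is written down, a reader can be rebuilt from the description alone.

```python
import io
import struct
import zlib

# Hypothetical record layout (NOT Bacula's real format), integers big-endian:
#   magic     4 bytes  b"ARC1"
#   name_len  2 bytes  unsigned
#   data_len  4 bytes  unsigned
#   crc32     4 bytes  unsigned, CRC-32 of the data bytes
#   name      name_len bytes, UTF-8
#   data      data_len bytes
MAGIC = b"ARC1"
HEADER = struct.Struct(">4sHII")


def write_record(stream, name, data):
    """Append one record to a binary stream."""
    encoded = name.encode("utf-8")
    stream.write(HEADER.pack(MAGIC, len(encoded), len(data), zlib.crc32(data)))
    stream.write(encoded)
    stream.write(data)


def read_records(stream):
    """Yield (name, data) pairs, checking the magic and checksum of each record."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return  # clean end of stream
        if len(header) < HEADER.size:
            raise ValueError("truncated header")
        magic, name_len, data_len, crc = HEADER.unpack(header)
        if magic != MAGIC:
            raise ValueError("not an ARC1 record")
        name = stream.read(name_len).decode("utf-8")
        data = stream.read(data_len)
        if len(data) != data_len or zlib.crc32(data) != crc:
            raise ValueError(f"record {name!r} is damaged")
        yield name, data


# The same bytes could go to a disk file or a tape device; BytesIO stands in here.
archive = io.BytesIO()
write_record(archive, "notes.txt", b"readable in the year 3000?")
archive.seek(0)
records = list(read_records(archive))
```

Because the reader only touches a generic byte stream, nothing in it cares whether the bytes came from tape or disk, which is the same separation between media and file format described above. The checksum is there so that bit rot is at least detected rather than silently read back.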
So the key is to use an open, documented format. Furthermore, it needs to be truly Free Software, not merely open source while still being encumbered by a patent, for example.
February 24th, 2010 at 7:04 pm
Great insight!
I love how you simplified the problem:
“There are three separate layers here: the media longevity, the media format, the file format, and each needs to be designed with the same longevity goal in mind.”
Have you heard of Millenniata? They have created a new optical storage medium that, once written, is backwards compatible with industry-standard DVD players. Where DVD-Rs are normally recorded by making changes in an organic dye, their discs (called M-DISC) are written by making permanent physical changes. Their tag line is “Write Once, Read Forever”.