Formats for Archiving?

The last session for today has started. Colin Webb (how appropriate!), Director of the National Library of Australia's Preservation Services, sets the scene, noting that 'preservation' means maintaining the ability to access content. Layers of responsibility include byte stream integrity, byte stream identity, and the preservation of intellectual content for each digital object that is preserved, but also the preservation of original context, current context, and 'significant properties' or essential characteristics.

However, there are some reasons for hope here: the incentive is one of taking steps and building collections, and this has driven some very promising projects already. Also, the preservation problem may break down into some more manageable segments: byte stream protection, means of access, and metadata and systems. Additionally, it is possible to make informed decisions given the limitations of known means of access; we can work on specifics and push towards automation and towards a collaboration beyond research (building networks of capacity).

What needs to happen, then, is imperfect preservation done well (since perfect preservation may always remain an ideal rather than a real goal), and an acceptance of change rather than attempts to build ultimate solutions. There is also a need for effective and continued stewardship of projects.

Next up is Stephen Abrams from Harvard University Library. He speaks on the role of format registries in digital preservation. Repositories need to ensure that the digital objects represented are effectively catalogued and presented in appropriate and future-ready formats. Information about such formats needs to be stored alongside the objects themselves, otherwise objects become useless and unintelligible. Use cases for formats include identification (what is this object?), validation (is this object properly formatted?), characterisation (what are an object's properties?), assessment (what is the risk of obsolescence?) and processing (what can be done with the object?).
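To make the first of these use cases a little more concrete, here is a minimal sketch of identification by leading byte signature ("magic number"). It is only an illustration and not how any of the tools mentioned in the talk actually work: the `SIGNATURES` table and the `identify` function are hypothetical names, and the table covers just a handful of well-known formats.

```python
from typing import Optional

# Hypothetical, deliberately tiny signature table (assumption: a real
# registry would hold far more entries and far richer descriptions).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"%PDF-": "application/pdf",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def identify(data: bytes) -> Optional[str]:
    """Return a format label if the data starts with a known signature."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    # Unknown signature: the object cannot be identified from this table.
    return None


if __name__ == "__main__":
    print(identify(b"%PDF-1.4\n..."))     # a PDF-like byte stream
    print(identify(b"no known signature"))  # falls through to None
```

A matching signature only answers "what is this object?"; checking whether the object is properly formed (validation) would need format-specific parsing on top of this.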

Formats are important no matter which preservation strategy is employed, but they matter especially for migration from older to newer formats, and for emulation of older formats within newer environments (such as the Universal Virtual Computer, UVC). Institutional archives, too, are often required to accept material of unknown provenance (the Library of Congress has an Archive Ingest and Handling Test, AIHT, to examine content intake) - and while some 90% of materials are likely to be in a small number of known formats (ASCII, HTML, JPG, etc.), the rest will be in a large number of far less well-known formats (and amongst the better known formats there may also be some poorly formed files - this is especially true of HTML).

A new generation of format-aware tools is now emerging (such as, once again, JHOVE and the NZNL Preservation Metadata Extraction Tool); a great deal of knowledge on formats is encapsulated in these tools. To enable the further development of such tools, a comprehensive format repository or registry is required, but information can be hard to come by, especially for some of the more obscure formats. The MIME types registry is not enough in this respect; its format descriptions remain relatively general and non-specific. The registry would need to be inclusive and trustworthy (an honest broker dealing with proprietary information in an appropriate manner) - but will it happen?

The UK National Archives already run the PRONOM file format registry, which may be one such registry. Further, the Digital Library Foundation supported work towards a Global Digital Format Registry in 2002, and some more specific models have gradually emerged from this - most likely a distributed network of cooperating registries built around a standard exchange and abstract data model.
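As a rough illustration of what cooperating registries sharing one data model might look like, here is a small sketch. Everything in it is an assumption made for illustration: `FormatRecord`, `Registry` and `federated_lookup` are hypothetical names, the identifiers are placeholders, and none of this reflects the actual PRONOM or Global Digital Format Registry designs.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class FormatRecord:
    """Hypothetical shared record that every registry would exchange."""
    identifier: str                  # placeholder ID, not a real registry ID
    name: str
    version: str = ""
    mime_type: str = ""
    extensions: Tuple[str, ...] = ()


class Registry:
    """One node in an assumed network of cooperating registries."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._records: Dict[str, FormatRecord] = {}

    def register(self, record: FormatRecord) -> None:
        self._records[record.identifier] = record

    def lookup(self, identifier: str) -> Optional[FormatRecord]:
        return self._records.get(identifier)


def federated_lookup(registries: Iterable[Registry],
                     identifier: str) -> Optional[FormatRecord]:
    """Ask each registry in turn and return the first matching record."""
    for registry in registries:
        record = registry.lookup(identifier)
        if record is not None:
            return record
    return None


if __name__ == "__main__":
    national = Registry("national")
    university = Registry("university")
    university.register(FormatRecord("example/1", "Example Text Format",
                                     version="1.0", mime_type="text/plain",
                                     extensions=("txt",)))
    print(federated_lookup([national, university], "example/1"))
```

The point of the shared record type is that a client does not need to know which registry holds the description of an obscure format; it only needs the common model.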

The View from the Top
I'm afraid my battery ran out before the last speaker, Andreas Stanescu from the OCLC. He presented on INFORM, a methodology for file format risk assessment - that is, for assessing the longevity of file formats in order to avoid getting stuck with obsolete formats which then need to be converted at high cost to more future-proof formats.

Tonight we're going to be dining in the Great Hall of New Parliament House - I'm looking forward to it. Hopefully John Howard isn't around.