Snurblog — Axel Bruns
Approaches to Internet Content Preservation

Snurb — Friday 30 March 2012 14:23
Internet Content Preservation | DHA 2012

Canberra.
The final speakers in this DHA 2012 session are Monica Omodei and Gordon Mohr. Monica, from the National Library of Australia, begins by pointing out the importance of Internet content as raw data for humanities research – and even when the live Web is the object of study, its ephemeral nature means that archives of Web content are absolutely crucial for verifiability and reproducibility.

Relevant examples of such research include social network research, lexicography, linguistics, network science, and political science, amongst many others. Common collection strategies for developing archives of online content include thematic and topical archiving, resource-specific archiving (e.g. audiovisual materials), broad surveys (e.g. domain-wide crawls), exhaustive archiving (closure crawls of a specific Web space), and frequency-based capture. Such collection efforts typically draw on input from domain experts, operate iteratively, and use registry data or trusted directories to determine what to capture.

Existing archives in this space include the Internet Archive’s Web Archive, which includes some 175 billion Web instances, has historic data stretching back to 1996, is publicly accessible, allows time-based URL research, offers API access, and is not overly constrained by legislation (it operates under a fair use policy, with fast content take-downs); however, because of its size, there is no full index, and keyword search is not possible. Also, it is fully automated, and hands-on quality assurance is therefore not possible. Common uses for such an archive include content discovery, nostalgic queries, Web restoration and file recovery, domain name valuation, collaborative R&D, prior art and copyright infringement research, legal cases, and topical and trends analysis.
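As an illustration of that API access, here is a minimal sketch in Python that asks the Wayback Machine for the capture of a page closest to a given date. It uses the availability endpoint as the Internet Archive documents it today, which may differ from the interfaces that were on offer in 2012; the target URL and timestamp are placeholders.

```python
import requests

# Ask the Wayback Machine which archived capture lies closest to a given date.
# The endpoint and parameters follow the Internet Archive's current public
# documentation; treat the exact URL and response shape as assumptions.
params = {"url": "example.com", "timestamp": "20120330"}
resp = requests.get("https://archive.org/wayback/available", params=params)
resp.raise_for_status()

closest = resp.json().get("archived_snapshots", {}).get("closest", {})
if closest.get("available"):
    # e.g. "20120330..." and "https://web.archive.org/web/20120330.../http://example.com/"
    print(closest["timestamp"], closest["url"])
else:
    print("No archived capture found for this URL.")
```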

The National Library’s Pandora archive exemplifies a different approach: it is a selective archive with quality checks, operating since 1996 with a bibliocentric approach (selected Websites are catalogued and included in Trove). The downside is that this labour-intensive process means the archive must remain relatively small, and because archiving is permissions-based (in the absence of legal deposit legislation in Australia), content is not included if its owners refuse permission to archive it. There is also a need to further update the crawling technology used by Pandora.

To complement this selective approach, the National Library also takes annual snapshots of the .au domain (and of sites hosted on servers in Australia), commissioned from the Internet Archive. There is no public access to this dataset due to legal restrictions, though this may change soon (and a separate .gov.au crawl will be made publicly accessible); text-based search of the archive is possible, but again not yet publicly available.

Various other approaches also exist, and are undertaken by a number of other institutions; most of these use the standard Heritrix crawler (and data formats) popularised by the Internet Archive. What is necessary now is to start bringing those archives together, using shared data structures and APIs. But researchers will probably still need to create their own archives in many cases (and Heritrix is freely available for this purpose; other, commercial archiving services also operate in this area).
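For researchers who do build their own archives with Heritrix, a minimal sketch of what working with the resulting data might look like follows. It assumes the Python warcio library and a hypothetical local file crawl.warc.gz produced by a Heritrix crawl, and simply lists the URLs and HTTP status codes of the captured responses.

```python
# Minimal sketch: iterate over a Heritrix-produced WARC file and print the
# captured responses. Assumes the warcio library and a local file named
# crawl.warc.gz (both hypothetical placeholders, not part of the talk).
from warcio.archiveiterator import ArchiveIterator

with open("crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            status = record.http_headers.get_statuscode()
            print(status, url)
```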

Gordon Mohr from the Internet Archive now takes over, and notes some of the tools which are available to researchers – such as the MementoFox browser plugin which makes archived materials available as users browse the Web. All of this is connected to the broader challenge of Web data mining, which is increasingly used to support large-scale data analysis. Such large-scale work relies on ‘big data’ tools such as Hadoop (for highly scalable data processing and storage – we’re talking petabytes here) and Pig, as well as data formats such as WARC (used by Heritrix), CDX (an index format for content manifests), or WAT (for extracting and storing key WARC metadata such as titles and links).
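To make the Memento side of this concrete: MementoFox builds on the Memento protocol (RFC 7089), in which a client asks a TimeGate for the archived copy closest to a desired datetime by sending an Accept-Datetime header and following the resulting redirect. The sketch below assumes the Wayback Machine acts as such a TimeGate at its /web/ endpoint; the target page and date are placeholders.

```python
import requests

# Memento-style lookup: request the capture of a page closest to a desired
# datetime via the Accept-Datetime header (RFC 7089). The TimeGate URL below
# assumes the Wayback Machine's /web/ endpoint supports Memento negotiation.
timegate = "https://web.archive.org/web/" + "http://example.com/"
headers = {"Accept-Datetime": "Fri, 30 Mar 2012 00:00:00 GMT"}

resp = requests.get(timegate, headers=headers, allow_redirects=False)
print(resp.status_code)                  # typically a redirect to the closest memento
print(resp.headers.get("Location"))      # URL of that archived snapshot, if provided
```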
