Project Description

Motivation

With the rise of the Internet, digital information has become seemingly ubiquitous as a medium for communication and expression [LK98]. With ever more information being made available in digital form, the need to archive these documents and to preserve access to them becomes essential, a need that is only beginning to be appreciated to its full extent. Organizations like the Long Now Foundation (http://www.longnow.org), with its clock/library projects, seek to promote slower and better thinking and to focus our collective creativity on the next 10,000 years, pointing out the need to preserve the material being created [Bra00].

Nevertheless, one might argue that a high percentage of the data on the Internet is junk: useless or even misleading information that hardly anybody actually looks at. Furthermore, most of the information worth keeping for future generations is available in traditional media anyway: texts, music, films, and pictures all have their traditional physical media.

However, with the new possibilities offered by digital media, new forms of information representation emerge. The expressive power of hypertext documents with their non-linear link structure, as well as multimedia documents integrating video, sound, and interactive components, cannot be adequately represented in traditional forms. Furthermore, apart from the mere information contained in the documents, the network formed by cross-referenced documents offers another dimension of information largely unmatched in conventional media.

As the popularity of the Internet rises, the emergence of new communities adds a social component to the web. When the Internet was populated solely by a few scientists and geeks, interaction mechanisms and the available content were limited and tailored to their needs and habits. Yet the type of information made available, as well as the way it is presented, changes drastically as new groups of users discover the Internet, using it and adapting it to their needs, forming and evolving into new and diverse communities.

This is only the beginning of a process in which our cultural heritage increasingly incorporates digital forms. It results in a shift of focus, making the archiving of digital material not primarily a question of storing asserted facts and figures, but of preserving the big picture, of capturing the sociological and cultural dimension of the Internet and its ``inhabitants''. Even now, the material gathered by pioneering initiatives has proven a useful resource to historians analyzing, for example, elections [Kah97]. As we start to recognize the importance of preserving at least parts of the entities that make up the web, more and more projects are being initiated to address the technical challenges associated with such an endeavor.

The issues pertaining to the archiving of digital material can be summarized as (1) acquisition, (2) storage, (3) preservation, and (4) access provision. First of all, the necessity to capture the characteristics of the web, i.e. to obtain the documents, their content and look-and-feel, as well as their role within the larger network of interlinked information, poses serious challenges. The vast amount of material available on the web, the lack of a central listing, and the volatility and frequency of change call for carefully designed methods of data acquisition to meet the goals of building an Internet archive.

Secondly, once documents have been collected, long-term storage is a tacit requirement for any archive, yet it demands careful consideration. At the current pace of technological change, storage media become obsolete every few years, and so does the technology required to read the archived material. This requires migration strategies to move data from one storage system to the next generation. A similar migration step is also required when the physical lifetime of a storage medium is reached, independent of the lifespan of the technology as such.
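To make this concrete, the following Python sketch shows the kind of verified copy step such a migration involves: every file is copied to the new medium and its checksum is compared against the source, so that nothing is silently corrupted in transit. The mount points and the choice of SHA-256 are illustrative assumptions, not part of any specific archive setup.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def migrate(src_root: Path, dst_root: Path) -> None:
    """Copy every file from the old medium to the new one,
    verifying each copy against its source checksum."""
    for src in src_root.rglob("*"):
        if not src.is_file():
            continue
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        if sha256_of(src) != sha256_of(dst):
            raise IOError(f"checksum mismatch while migrating {src}")

# Hypothetical mount points for the old and new storage generations:
# migrate(Path("/mnt/old_medium"), Path("/mnt/new_medium"))
```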

However, the most serious challenge and threat to the archiving of digital material lies within the concept of digital objects themselves. Unlike conventional books, or even ancient stone tablets, digital objects cannot be read as they are. Rather, they require special software to access and interpret them. The rapid pace of technological change may render most documents unreadable within a few years, as the original software is replaced by new, incompatible versions. Unfortunately, even archiving the software necessary for interpreting digital artifacts does not provide a solution. Software is typically designed for a specific hardware platform, i.e. a specific type of computer system consisting of a specific set of chips. These systems would need to be preserved as well, resulting in a museum of outdated hardware that nobody is able to maintain. Several approaches are being pursued to tackle this problem, such as emulating obsolete systems or converting the digital objects into formats that can be interpreted easily even on future systems.

Last but not least, once such an archive has been created, providing access to its content poses additional challenges, which to a large degree depend on the needs of its users and their goals, be it locating a specific document or analyzing the evolution of social groups in cyberspace. This again has a strong influence on all aspects addressed so far, i.e. the information acquired for archiving, the storage media containing the archive and their means of access, as well as the form in which the documents are made accessible.

Obviously, legal issues merit special consideration, as any such project touches on privacy, copyright, and ethical questions. While hardly any coherent legal framework for the creation of Internet archives exists, there is general agreement that web sites may be deemed published, and thus publicly available, material that may be archived. Yet providing access to the documents stored in the archive constitutes a new situation. Thus, apart from special research projects, no project currently provides open access to its archive. We refer to [Lan96] for detailed coverage of this topic.

Bra00
S. Brand.
The Clock of the Long Now.
Phoenix Publishers, London, UK, 2000.

Kah97
B. Kahle.
Preserving the Internet.
Scientific American, March 1997.
http://www.sciam.com/0397issue/0397kahle.html.

Lan96
B. Lang.
The legal deposit of electronic publications.
Working Series of the General Information Programme and UNISIST CII-96/WS/10, Working Group of the Conference of Directors of National Libraries (CDNL), December 1996.
http://www.unesco.org/webworld/memory/legaldep.htm.

LK98
P. Lyman and B. Kahle.
Archiving digital cultural artifacts: Organizing an agenda for action.
D-Lib Magazine, 4, July-August 1998.
http://www.dlib.org/dlib/july98/07lyman.html.

The Austrian Pilot Project AOLA

The Austrian On-Line Archive (AOLA) is a cooperation between the Austrian National Library and the Department of Software Technology of the Vienna University of Technology, with the goal of taking periodic snapshots of the Austrian web space. An amendment to the Austrian Deposit Law passed in July 2000 extended the deposit regulations to off-line electronic media such as CD-ROMs, which are to be collected by the Austrian National Library. Furthermore, as part of this amendment, a pilot study addressing the collection and archiving of on-line documents was conceived. Preparations for the pilot study commenced in 1999, and its first phase officially started in 2001.

The project is based on a Linux system which, for the pilot phase, is equipped with 240 gigabytes of hard-disk space plus a six-fold tape drive for final storage. As for the software, an open-source approach is pursued. This is not primarily for budget reasons, but rather to ensure independence from commercial providers and to allow easy insight into the project. Additionally, it opens the possibility of close cooperation with various other projects in this field. The AOLA project follows a harvesting approach, attempting to download the Austrian web space. In doing so, it archives only files whose authors allow them to be indexed by automatic crawlers, strictly obeying robot exclusion files. Furthermore, if site owners object to their site being included in the archive, yet have failed to specify this in the form of a robot exclusion file, they may have their site removed from the archive on request. While we have been in direct contact with numerous site managers who noticed the activity of our crawler, none of them has objected to their data being included in the archive so far. Rather, we have experienced strong support for our activities, showing that information providers on the web recognize the importance of such an archive and are interested in being part of the Austrian On-Line Archive.
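As a minimal sketch of what strictly obeying robot exclusion files means in practice, the following Python fragment checks a site's robots.txt before a page may enter the archive, using the standard urllib.robotparser module. The user-agent string and the example URL are illustrative assumptions, not the actual identifiers used by the AOLA crawler.

```python
from urllib import robotparser
from urllib.parse import urlsplit

USER_AGENT = "AOLA-harvester"  # illustrative crawler name, not the real one

def allowed_to_archive(url: str) -> bool:
    """Honour the site's robots.txt: only pages the site owner
    permits crawlers to fetch may enter the archive."""
    parts = urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = robotparser.RobotFileParser(robots_url)
    parser.read()  # download and parse the exclusion file
    return parser.can_fetch(USER_AGENT, url)

# Hypothetical usage inside the crawl loop:
# if allowed_to_archive("http://www.example.at/index.html"):
#     ...  # only then is the page fetched and stored
```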

Run 1

Initially, the project started with the Nedlib harvester, which required several modifications and extensions to fit the project's needs. Between May 7 and 16, 2001, a first attempt was made to take a snapshot of the Austrian web space. Approximately 1 gigabyte of data was collected per day from the .at domain, as well as from several manually selected sites in other domains, among them .com, .cc, and .org. In that time, about 666,000 unique URLs were harvested from 1,210 different sites, and a total of 8.3 gigabytes of data was stored. During this first pilot run, numerous problems with the Nedlib harvester were discovered. Some of them could be fixed; for example, the handling of malformed links was improved. Other bugs and errors were too severe to be fixed. After several providers complained that some of their URLs were being downloaded again and again, blocking other data traffic, this initial crawl had to be stopped. This repeated downloading of identical pages was also the reason for the rather low download rate of only 1 GB per day, with actual data transfer rates being much higher.
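Such repeated fetches are typically prevented by canonicalizing each URL and tracking what has already been downloaded during the crawl. The following sketch illustrates this kind of safeguard; the normalization rules shown are generic assumptions, not the actual fix applied to the Nedlib harvester.

```python
from urllib.parse import urlsplit, urlunsplit

seen: set[str] = set()  # canonical URLs fetched so far in this crawl

def normalize(url: str) -> str:
    """Reduce trivially different spellings of the same URL
    (case of host, missing path, fragment) to one canonical form."""
    parts = urlsplit(url)
    host = parts.netloc.lower().rstrip(".")
    path = parts.path or "/"
    # Drop the fragment: it never changes the document that is fetched.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

def should_fetch(url: str) -> bool:
    """Fetch each distinct URL at most once per crawl."""
    key = normalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```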

Nevertheless, a considerable amount of experience was gained for the second pilot run. The initial run showed that the Nedlib harvester is, in principle, designed in a way that could meet the requirements of the AOLA project, but it is not yet stable enough and hence still needs refinement.

Run 2

As it cannot be foreseen when a stable version of the Nedlib crawler will become available, the Combine harvester used by the Swedish initiative is now employed instead. As the Combine harvester was initially developed for indexing purposes rather than for web archiving, several adaptations had to be made. Here, the project cooperates closely with the Swedish Kulturarw3 project, benefiting from their experience and the effort they have already put into modifying the original harvester. Even so, some functionality desirable for an archiving system could not be included so far.
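The difference between the two purposes matters: an indexer needs only the extracted text of a page, while an archive must preserve the complete response so that look-and-feel can be reproduced later. The following sketch shows the kind of record an archiving harvester has to keep; the field names are purely illustrative and do not reflect Combine's actual storage format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ArchiveRecord:
    """What an archiving harvester keeps, beyond what an indexer
    needs: the raw bytes and headers, not just extracted text."""
    url: str
    fetched_at: datetime     # time of capture, for ordering snapshots
    status: int              # HTTP status code as returned by the server
    headers: dict[str, str]  # full response headers (Content-Type etc.)
    body: bytes              # the unmodified payload

def make_record(url: str, status: int,
                headers: dict[str, str], body: bytes) -> ArchiveRecord:
    return ArchiveRecord(url, datetime.now(timezone.utc),
                         status, dict(headers), body)
```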

By now, the AOLA project has successfully reached the second pilot phase, using the adapted Combine harvester. The harvester is currently collecting data at a rate of about 7 GB per day and had created an archive of 150 GB, comprising more than 2.7 million pages from about 21,000 sites, by June 21, 2001.

Apart from the problems concerning the collection of the digital material, the other aspects mentioned previously have to be considered. To guarantee long-term preservation, migration currently seems to be the only practically applicable approach, but for the future a combination with emulation is essential. Furthermore, the technical aspects of providing access have to be dealt with, which in turn requires an appropriate legal framework. Yet we would again like to stress the broad public support for the creation of such an archive that we have experienced.