

Kulturarw3 - The Swedish Archive

As early as September 1996, the Swedish national library, Kungliga Biblioteket, launched a project entitled Kulturarw3 - The Swedish Archiw3e (see footnote 12) with the goal of collecting, preserving, and providing access to Swedish electronic documents [MAP00].

The source is all publicly accessible material available via the Internet, with the scope limited to the entire Swedish national web space. The Swedish top-level domain .se accounts for 55 percent of the repository; sites registered under .com, .org, .nu, and numerous other domains are included as well. These additional web servers are selected manually if they are found to be physically located in Sweden or if they are considered to be of Swedish interest, so-called "Suecana".
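The scope rule described above can be sketched as a simple membership test. This is only an illustration, not the project's actual selection code: the function name `in_scope` and the contents of `MANUALLY_SELECTED_HOSTS` are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical hand-maintained list of hosts outside .se that were judged to be
# located in Sweden or of Swedish interest ("Suecana"). Placeholder values only.
MANUALLY_SELECTED_HOSTS = {"example.com", "example.nu"}


def in_scope(url: str) -> bool:
    """Return True if the URL belongs to the assumed Swedish web space."""
    host = (urlsplit(url).hostname or "").lower()
    # Everything under the national top-level domain is collected.
    if host.endswith(".se"):
        return True
    # Other domains only count if the host, or a parent of it, was selected manually.
    return any(host == h or host.endswith("." + h) for h in MANUALLY_SELECTED_HOSTS)
```

A crawler front end could call `in_scope(url)` before queueing a link, so that the manual list stays the single place where non-.se sites are admitted.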

The project thus pursues a comprehensive approach: no selection is performed on the material whatsoever, although the team is aware that completeness is impossible. The data is acquired by taking snapshots. For this purpose, a modified version of the Combine robot is used; it is discussed thoroughly in Section 4.5, since it was also used in the AOLA project.

The downloaded documents are stored together with their metadata in self-contained files in MIME format (cf. Section 2.3.3). The data is stored and managed by a newly purchased tape robot implementing Hierarchical Storage Management. On request, this system transfers data from slow but capacious tapes to hard disks (cf. Section 2.3.1). Since data is more likely to be used if it stems from the same web server and the same point in time as the document most recently accessed, files from one web server and one snapshot are grouped together.
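As a rough illustration of what a self-contained MIME file and the grouping by web server and snapshot could look like, the following sketch uses Python's standard `email` package. The header names `X-Archive-URL` and `X-Archive-Snapshot`, the function names, and the snapshot label are assumptions for the example; the archive's real field names and file layout are not given here.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from urllib.parse import urlsplit


def pack_document(url: str, snapshot: str, content_type: str, body: bytes) -> bytes:
    """Wrap one downloaded document and its metadata in a single MIME message."""
    maintype, subtype = content_type.split("/", 1)
    msg = EmailMessage()
    # Hypothetical metadata headers, chosen for this sketch only.
    msg["X-Archive-URL"] = url
    msg["X-Archive-Snapshot"] = snapshot
    # Bytes content is base64-encoded by default, so binary data survives intact.
    msg.set_content(body, maintype=maintype, subtype=subtype)
    return msg.as_bytes()


def unpack_document(packed: bytes):
    """Recover the metadata and the original bytes from a packed document."""
    msg = message_from_bytes(packed, policy=policy.default)
    metadata = {
        "url": str(msg["X-Archive-URL"]),
        "snapshot": str(msg["X-Archive-Snapshot"]),
        "content_type": msg.get_content_type(),
    }
    return metadata, msg.get_payload(decode=True)


def storage_group(url: str, snapshot: str):
    """Key under which files of one web server and one snapshot are kept together."""
    return (urlsplit(url).hostname or "", snapshot)
```

Files that share a `storage_group` key would be written next to each other, so that a request for one of them brings its likely successors from tape to disk in the same transfer.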

The archive currently contains 3.4 terabytes of data in 130 million files gathered in eight distinct snapshots. In the seventh run alone, performed in spring 2000, more than 1.2 terabytes were collected from 96,600 sites. The massive amount of data accumulated in this run underlines the steep growth of the Internet. An excerpt of the statistics compiled for the seventh run, listing document types, their number, and their size, is presented in Table 3.1.


Table 3.1: 7th run - statistics (excerpt)

MIME type extension         #documents (in thousands)   size (gigabyte)
text/html                   16,166                      244
image/gif                    6,228                      118
image/jpeg                   6,199                      255
text/plain                     814                      117
application/pdf                319                       86
application/octet-stream       217                       95
application/zip                142                       64
audio/x-pn-realaudio           102                        9
application/msword              75                        9
application/postscript          67                       28
...                            ...                      ...
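To make the figures in Table 3.1 easier to compare, the mean size per document can be derived from the two columns. The sketch below is a small worked calculation under two assumptions: the dots in the document counts are thousands separators, and one gigabyte is taken as 10^6 kilobytes. The names `SEVENTH_RUN` and `mean_kilobytes` are made up for the example.

```python
# MIME type -> (documents in thousands, size in gigabytes), copied from Table 3.1.
SEVENTH_RUN = {
    "text/html": (16166, 244),
    "image/gif": (6228, 118),
    "image/jpeg": (6199, 255),
    "text/plain": (814, 117),
    "application/pdf": (319, 86),
    "application/octet-stream": (217, 95),
    "application/zip": (142, 64),
    "audio/x-pn-realaudio": (102, 9),
    "application/msword": (75, 9),
    "application/postscript": (67, 28),
}


def mean_kilobytes(docs_thousands: int, size_gb: int) -> float:
    """Mean document size in kilobytes, assuming 1 GB = 10**6 kB (decimal units)."""
    return size_gb * 1_000_000 / (docs_thousands * 1_000)


# Totals over the listed rows only; the table is an excerpt, so these are lower bounds.
total_docs_thousands = sum(docs for docs, _ in SEVENTH_RUN.values())
total_size_gb = sum(size for _, size in SEVENTH_RUN.values())

for mime_type, (docs, size) in SEVENTH_RUN.items():
    print(f"{mime_type:26s} {mean_kilobytes(docs, size):10.1f} kB per document")
```

Under these assumptions the listed rows alone cover roughly 30 million documents and about one terabyte, which is in line with the more than 1.2 terabytes reported for the whole run.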


Digital preservation is a major concern; the aim is to find long-term forms of storage that will facilitate migration to future software and hardware environments. However, this issue is scheduled to be addressed in the next stage of the project.

In principle, access has been made possible by a module that allows surfing the collections in both network space and time, as far as they are covered by the snapshots taken. Furthermore, an indexing mechanism is planned, and searching on metadata is conceivable. At present, however, the archive is not accessible due to copyright barriers. A report by the ministry of education proposes restricting access to scholars affiliated with recognised institutions, but the team members of the Kulturarw3 project are of the opinion that such a "limitation would be contrary to the democratic aim of the Swedish deposit law to guarantee free access to information." [Man00]
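Surfing in "network space and time" amounts to resolving a URL together with a point in time to the snapshot that covers it. The following is a minimal sketch of such a lookup, not a description of the project's access module: the class `SnapshotIndex`, its methods, and the dates used in the usage lines are all hypothetical.

```python
from bisect import bisect_right, insort
from datetime import date
from typing import Optional


class SnapshotIndex:
    """Maps each URL to the sorted dates on which a snapshot captured it."""

    def __init__(self) -> None:
        self._captures = {}  # url -> sorted list of capture dates

    def add(self, url: str, captured: date) -> None:
        # insort keeps the per-URL list ordered, so lookups can use binary search.
        insort(self._captures.setdefault(url, []), captured)

    def at(self, url: str, when: date) -> Optional[date]:
        """Return the most recent capture of url taken on or before 'when'."""
        dates = self._captures.get(url, [])
        position = bisect_right(dates, when)
        return dates[position - 1] if position else None


# Usage with made-up capture dates.
index = SnapshotIndex()
index.add("http://www.kb.se/", date(1999, 10, 1))
index.add("http://www.kb.se/", date(2000, 4, 1))
version = index.at("http://www.kb.se/", date(2000, 1, 15))
```

Following a link inside a retrieved page would then repeat the lookup for the link target with the same date, which keeps the reader within one time slice of the archive as far as the snapshots allow.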



Footnotes

12. Kulturarw3 - The Swedish Archiw3e: http://kulturarw3.kb.se/html/kulturarw3.eng.html

Andreas Aschenbrenner