Kulturarw3 - The Swedish Archive

As early as September 1996, the Swedish national library, Kungliga Biblioteket, launched a project entitled Kulturarw3 - The Swedish Archiw3e¹² with the goal of collecting, preserving, and providing access to Swedish electronic documents [MAP00].

The source is all publicly accessible material available via the Internet, with the scope limited to the entire Swedish national web-space. Besides the Swedish domain .se, which constitutes 55 percent of the repository, sites registered under .com, .org, .nu and numerous other domains are included. These additional web-servers are selected manually if they are found to be physically located in Sweden or if they are considered to be of Swedish interest, so-called "Suecana".
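The scope rule described above can be illustrated with a minimal sketch. It assumes the manual selection is available as a plain set of host names; the set `MANUALLY_SELECTED_HOSTS`, its entries, and the function `in_scope` are hypothetical names chosen for illustration and do not come from the Kulturarw3 software.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the manually maintained list of web-servers
# outside .se (located in Sweden or of Swedish interest). The hosts are made up.
MANUALLY_SELECTED_HOSTS = {"www.example.com", "sweden.example.org"}


def in_scope(url: str) -> bool:
    """Return True if the URL falls inside the scope rule sketched above."""
    host = (urlsplit(url).hostname or "").lower()
    # Everything registered under the Swedish top-level domain is in scope.
    if host.endswith(".se"):
        return True
    # Hosts under other domains are in scope only if selected manually.
    return host in MANUALLY_SELECTED_HOSTS
```

A harvester could call such a check before queueing a link, so that links leading out of the national web-space are not followed.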

A comprehensive approach is thus pursued: no selection is performed on the material whatsoever, while it is acknowledged that completeness is impossible. The data is acquired by taking snapshots. For this purpose, a modified version of the Combine-robot is applied, which is discussed thoroughly in Section 4.5, since it was also used in the AOLA-project.

The downloaded documents are retained together with their metadata in self-contained files using the MIME-format (cf. Section 2.3.3). The data is stored and managed by a newly purchased tape-robot implementing Hierarchical Storage Management. On request, this system transfers data from slow yet capacious tapes to hard-disks (cf. Section 2.3.1). Since data is more likely to be used if it stems from the same web-server and the same point in time as the most recently accessed document, files from one web-server and one snapshot are grouped together.
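To make the storage idea more concrete, the following sketch wraps a downloaded document and its metadata into one self-contained MIME message and groups such messages by web-server and snapshot. It uses Python's standard email package; the header names `X-Archive-URL` and `X-Archive-Fetched`, as well as all function names, are assumptions made for this example and not the actual Kulturarw3 file layout.

```python
from collections import defaultdict
from email import message_from_bytes, policy
from email.message import EmailMessage
from urllib.parse import urlsplit


def pack_document(url: str, fetched_at: str, content_type: str, payload: bytes) -> bytes:
    """Wrap a document and its metadata in one self-contained MIME message."""
    maintype, subtype = content_type.split("/", 1)
    msg = EmailMessage()
    msg["X-Archive-URL"] = url             # hypothetical metadata header
    msg["X-Archive-Fetched"] = fetched_at  # hypothetical metadata header
    # Bytes payloads are base64-encoded, so binary formats survive unchanged.
    msg.set_content(payload, maintype=maintype, subtype=subtype)
    return msg.as_bytes()


def unpack_document(raw: bytes):
    """Recover the metadata and the original bytes from a packed message."""
    msg = message_from_bytes(raw, policy=policy.default)
    metadata = {
        "url": str(msg["X-Archive-URL"]),
        "fetched_at": str(msg["X-Archive-Fetched"]),
        "content_type": msg.get_content_type(),
    }
    return metadata, msg.get_payload(decode=True)


def group_by_server_and_snapshot(records):
    """Group packed documents by (host, snapshot number).

    `records` is an iterable of (url, snapshot_no, packed_bytes) tuples.
    """
    groups = defaultdict(list)
    for url, snapshot_no, packed in records:
        host = (urlsplit(url).hostname or "").lower()
        groups[(host, snapshot_no)].append(packed)
    return dict(groups)
```

Keeping each group together on the storage medium means that one transfer from tape to hard-disk brings along the documents most likely to be requested next.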

The archive currently contains 3.4 terabytes of data in 130 million files gathered in eight distinct snapshots. In the seventh run alone, performed in spring 2000, more than 1.2 terabytes were collected from 96,600 sites. The massive amount of data accumulated in this run underlines the steep growth rate of the Internet. An excerpt of the statistics compiled for the seventh run, listing document types, their number, and size, is presented in Table 3.1.

Table 3.1: 7th run - statistics (excerpt)

MIME type	#documents (in thousands)	size (gigabyte)
text/html	16,166	244
image/gif	6,228	118
image/jpeg	6,199	255
text/plain	814	117
application/pdf	319	86
application/octet-stream	217	95
application/zip	142	64
audio/x-pn-realaudio	102	9
application/msword	75	9
application/postscript	67	28
...	...	...
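The figures in Table 3.1 also allow a rough estimate of the average document size per MIME type. The sketch below assumes that the document counts are given in thousands with the dot read as a thousands separator (so that text/html stands for about 16.2 million documents) and that one gigabyte equals 10^9 bytes; both readings are assumptions, and only a few rows are reproduced.

```python
# Selected rows of Table 3.1: MIME type -> (documents in thousands, size in gigabytes).
TABLE_3_1 = {
    "text/html": (16_166, 244),
    "image/gif": (6_228, 118),
    "image/jpeg": (6_199, 255),
    "text/plain": (814, 117),
    "application/pdf": (319, 86),
    "application/zip": (142, 64),
}


def average_size_kb(mime_type: str) -> float:
    """Average document size in kilobytes (assuming 1 GB = 10**9 bytes)."""
    docs_thousands, size_gb = TABLE_3_1[mime_type]
    total_bytes = size_gb * 10**9
    total_docs = docs_thousands * 1_000
    return total_bytes / total_docs / 1_000


if __name__ == "__main__":
    for mime_type in TABLE_3_1:
        print(f"{mime_type:<20} {average_size_kb(mime_type):8.1f} KB")
```

Under these assumptions an HTML page averages roughly 15 KB and a PDF file roughly 270 KB, which illustrates why comparatively rare document types still account for a considerable share of the storage.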

Digital preservation is a major concern, with the aim of finding long-term forms of storage that will facilitate migration to future software and hardware environments. However, this issue is to be addressed only in the next stage of the project.

In principle, access has been made possible by a module that allows surfing the collections in both network space and time, as far as covered by the snapshots taken. Furthermore, an indexing mechanism is planned, and searching on metadata is conceivable. At present, however, the archive is not accessible due to copyright barriers. A report of the Ministry of Education proposes restricting access to scholars affiliated with recognised institutions; the team members of the Kulturarw3-project, however, are of the opinion that such a "limitation would be contrary to the democratic aim of the Swedish deposit law to guarantee free access to information." [Man00]
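Surfing in time, as described above, amounts to resolving a URL against the snapshot taken closest before a requested date. The following sketch shows one way such a lookup could work; the class `SnapshotIndex`, its methods, and the storage locations used in the example are hypothetical and do not describe the actual access module.

```python
from bisect import bisect_right
from datetime import date


class SnapshotIndex:
    """Maps each URL to its archived versions, ordered by capture date."""

    def __init__(self):
        self._versions = {}  # url -> sorted list of (capture date, location)

    def add(self, url: str, captured_on: date, location: str) -> None:
        versions = self._versions.setdefault(url, [])
        versions.append((captured_on, location))
        versions.sort()

    def lookup(self, url: str, wanted: date):
        """Return the latest (date, location) captured on or before `wanted`.

        Returns None if the URL was not archived by that date.
        """
        versions = self._versions.get(url, [])
        dates = [captured_on for captured_on, _ in versions]
        position = bisect_right(dates, wanted)
        return versions[position - 1] if position else None
```

Following a link from an archived page would repeat the lookup with the target URL and the date of the page currently viewed, so that navigation stays within the same period as far as the snapshots allow.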

Footnotes