Next: Conclusions Up: Long-Term Preservation of Digital Previous: Figures - Automatic Retrieval   Contents


Lessons Learned

In the course of this thesis we have discussed a number of issues related to creating an archive for preserving digital material over the long term. Furthermore, we have succeeded in laying the foundation for an Austrian National Archive that guards cultural heritage and passes it on to future generations.

In this chapter we summarise some of the main issues related to creating such an archive. We follow the structure established in Chapter 2, dedicating one or more paragraphs to each of its sections. In addition, recommendations are given on how to tackle the individual challenges from the current point of view.

In a nutshell, the main steps to building an archive can be summarised as follows.

Ten Steps to Building an Archive

Start now! -- Digital information is in imminent danger of fading away and being lost forever. Awareness of this fact is rising. In the absence of an optimal countermeasure, however, the threat of losing our common memory persists. While many aspects of creating an archive remain a matter of research, and technological constraints inhibit our effort to be as complete yet consistent as possible, it is crucial to embark on the task and tackle the challenges, preserving the material to as large a degree as possible.

Choose your source -- The initial purpose for creating the archive points directly to the data to be guarded for the long term. Nevertheless, these steps have to be taken carefully, since they form the foundation of the archive. Obviously, defining the exact goals is essential. Subsequently, the source and the scope of the data to be retained can be specified. These decisions will considerably influence the operation of the archive as well as its content, and thereby the services for the user. Changing one's mind on this issue might require major restructuring.

When creating an archive with the World Wide Web as its source, one should consider including quality-controlled sites (e.g. governmental sites, company web-sites), but also "open" sites such as private home-pages, discussion groups etc. Also consider other sources such as mailing-lists, newsgroup-postings, FTP-archives, multi-user interaction sites, and new sources yet to evolve.

Acquire the data -- Based on the decisions concerning source and scope of the archive, a method for acquisition has to be derived. The documents can be accepted passively, relying on donations or building on deposit regulations, provided the appropriate framework exists, be it legal or internal to a company. Alternatively, the material can be actively collected. In this case it can be selected manually in order to create a well-sorted and consistent collection. Automatic means, on the other hand, offer a very comprehensive method of gathering the data. At the same time, the manpower required even for automatic acquisition must not be underestimated, considering the handling and monitoring of the tools as well as their adaptation as technology advances.

Combining a selective and an automatic approach appears to be the most beneficial strategy for an initiative with demands as great as those of a national archive. Besides aiming for a very comprehensive profile by applying automatic tools, topics of special interest can call for the creation of specialised collections. In addition to this active collection, deposit regulations should be sought, since they potentially yield a coherent collection composed of consistent items, containing material that might otherwise be unavailable.

Build on solid equipment -- Capacious on the one hand, allowing retrieval in due time on the other; robust enough to endure, yet flexible enough to be exchanged when it is in danger of becoming obsolete: the equipment has to satisfy many demands. Besides the repository for long-term storage, the system for ingestion of the material has to be provided. This includes suitable equipment as well as the necessary staff, both involving considerable expenses. Yet economising here could rapidly result in an incomplete or inconsistent collection.

For the part of the system that actually acquires the data, a very powerful solution is required. It needs to offer the space and speed necessary to guarantee sound processing, that is, gathering the data and subsequently formatting the collection items. If, for instance, the material to be included in the archive is acquired by harvesting the source in a sweep, the system performing the ingest should have a RAID system big enough to store one complete snapshot plus the space required for operating the harvester.
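The sizing rule for the ingest system can be written down as a small calculation. The following is only a sketch: the function name, the optional `headroom` factor and all figures in the usage note are assumptions made for illustration and are not taken from the thesis.

```python
import math


def ingest_capacity_bytes(snapshot_bytes: int,
                          harvester_workspace_bytes: int,
                          headroom: float = 0.0) -> int:
    """Usable capacity the ingest system should offer.

    Follows the rule of thumb: one complete snapshot plus the space the
    harvester needs to operate. `headroom` is an assumed, optional safety
    margin (0.5 means 50 % extra) that is not part of the original rule.
    """
    if snapshot_bytes < 0 or harvester_workspace_bytes < 0 or headroom < 0:
        raise ValueError("sizes and headroom must be non-negative")
    base = snapshot_bytes + harvester_workspace_bytes
    # Round up so the result never falls below the computed requirement.
    return math.ceil(base * (1 + headroom))
```

Called as `ingest_capacity_bytes(snapshot, workspace)`, the function returns the plain sum. Both inputs have to be estimated beforehand, for example from a trial sweep.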

In order to provide storage space for a huge repository, hard-disk arrays are recommended to keep the information directly accessible. Additional copies of the data should be made to tapes, which allow distributed, redundant storage.

Be concerned about the preservation strategy -- The longevity of the archive's content depends vitally on selecting a suitable approach to digital preservation. Whether striving for an adequate non-digital representation, preserving the technology, converting to superseding (preferably standard) formats, or emulating obsolete technology, the applied method, or conceivably combination of methods, has first and foremost to be apt for the specific task. Yet the feasibility and ultimate efficiency of a strategy can only be roughly estimated at this point in time. Owing to the long-term character of this particular challenge, as well as the fact that the issue has come up only recently, virtually no experience exists in practical application. A single optimal solution might be long in coming; in fact, it is doubtful that such a solution exists at all. Nevertheless, a viable near-optimal strategy has to be established before important material is lost.

From the current point of view, a recommended approach tackles the problem along two lines. On the one hand, metadata is compiled for the original document and other measures are taken to facilitate Emulation later on. On the other hand, following the Conversion strategy, the document is continually converted to standard formats for immediate access.

Organise your repository thoroughly -- A sound organisation facilitates processes significantly and subsequently improves the quality of the collection. Starting at ingestion, effective tools have to be provided. If an automatic collection is performed, the installed programs have to ensure the consistency of the material. Furthermore, the storage concepts and the archival management have to be such that maintenance work, the implementation of the preservation strategy, and resource retrieval can be performed as conveniently as possible, using automatic means wherever possible. At the same time, the system should be flexible enough to accommodate new features and be adaptable to changing requirements.

A sound organisation will build on the use of metadata. Storing all elements together with the associated collection items guarantees a robust framework. Additionally, frequently used information should be copied to supplementary indices and databases to have it available in a fast and convenient way.
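One possible way to realise this arrangement is sketched below: each collection item is stored in its own directory together with a metadata file, and a supplementary index is derived from those files so that it can be rebuilt at any time. The directory layout, the file names `content.bin` and `metadata.json`, and the metadata fields in the usage example are assumptions chosen for illustration, not the layout used by the archive.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def store_item(repo: Path, item_id: str, payload: bytes, metadata: dict) -> Path:
    """Store a collection item and its metadata side by side in one directory."""
    item_dir = repo / item_id
    item_dir.mkdir(parents=True, exist_ok=True)
    (item_dir / "content.bin").write_bytes(payload)
    (item_dir / "metadata.json").write_text(
        json.dumps(metadata, sort_keys=True), encoding="utf-8")
    return item_dir


def build_index(repo: Path, field: str) -> Dict[str, List[str]]:
    """Derive a supplementary index (field value -> item ids) from the stored metadata.

    The index is only a copy of information held with the items, so it can be
    discarded and rebuilt without loss.
    """
    index: Dict[str, List[str]] = {}
    for meta_file in sorted(repo.glob("*/metadata.json")):
        metadata = json.loads(meta_file.read_text(encoding="utf-8"))
        if field in metadata:
            index.setdefault(str(metadata[field]), []).append(meta_file.parent.name)
    return index


if __name__ == "__main__":
    # Hypothetical items and metadata fields, for demonstration only.
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        store_item(repo, "item-0001", b"<html>...</html>", {"format": "text/html"})
        store_item(repo, "item-0002", b"%PDF-", {"format": "application/pdf"})
        print(build_index(repo, "format"))
```

Keeping the metadata with the item means the repository stays interpretable even if every index is lost; the index exists only to make frequent lookups fast.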

Pursue usability as your ultimate goal -- Caring for instant usability of the archive is a paramount objective. Interfaces to access the collections should be tailored to the needs of the user. However, if the target group is rather large and diverse, this becomes a daunting task.

How navigation through the resources of the archive can be realised depends, obviously, on the collection items themselves and hence on their acquisition in the first place. If a selective strategy was pursued, the resources can be arranged as a well-structured subject gateway. Alternatively, if automatic tools were applied, the repository is surfable, allowing an interface similar to that of a normal web-browser with the additional possibility of viewing the objects as they were at the various times they were acquired. Installing a service for searching the archive makes usage more efficient.
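Viewing an object as it was at a given time amounts to looking up the most recent acquisition at or before that time. The sketch below shows one way to do this with a sorted list and binary search; the class name `SnapshotTimeline` and the storage locations used in the example are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional


class SnapshotTimeline:
    """Acquisition times of one object, kept sorted for time-based lookup."""

    def __init__(self) -> None:
        self._times: List[datetime] = []
        self._locations: List[str] = []

    def add(self, acquired_at: datetime, location: str) -> None:
        """Record that a copy acquired at `acquired_at` is stored at `location`."""
        pos = bisect_right(self._times, acquired_at)
        self._times.insert(pos, acquired_at)
        self._locations.insert(pos, location)

    def version_at(self, when: datetime) -> Optional[str]:
        """Return the location of the latest copy acquired at or before `when`."""
        pos = bisect_right(self._times, when)
        if pos == 0:
            return None  # the object had not been acquired yet
        return self._locations[pos - 1]


if __name__ == "__main__":
    # Hypothetical sweeps and paths, for demonstration only.
    timeline = SnapshotTimeline()
    timeline.add(datetime(2001, 3, 1), "sweep-2001-03/index.html")
    timeline.add(datetime(2001, 1, 1), "sweep-2001-01/index.html")
    print(timeline.version_at(datetime(2001, 2, 1)))
```

A browsing interface would presumably keep one such timeline per address and resolve every link against the date the user has selected.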

Furthermore, providing basic tools for scanning and analysing the collections using data-mining techniques should be considered. Yet it can be expected that ever new applications will come up for specific projects. Thus, the extensibility of the system is its best service.

Consider legal aspects -- Delicate issues such as the copyright of documents need to be addressed. By installing a rights management framework, publishers can be offered an instrument for keeping control over their work. For an archive striving to preserve the digital cultural heritage of a country, the ultimate aim is an amendment to the deposit law, underpinning the importance of the initiative and making it a national concern. Furthermore, if automatic means are applied for data collection, the repository could eventually contain illegal material. Since this can only be prevented with considerable effort, probably entailing the loss of other, important documents, such eventualities have to be covered by an appropriate legal framework. Ultimately, one should aim at free, public access to the archive.

Care for a solid financial basis -- Planning the economics of an organisation building an archive is difficult, owing to two characteristic attributes. Firstly, the long-term character of the endeavour calls for a solid financial basis for many years to come. Secondly, there is only limited experience, since the preservation of digital material is a very young field of research. Following the development of technology and participating in research is integral to guaranteeing the quality of the archive, yet it involves high expenses. Despite these adversities, costs have to be calculated and anticipated as accurately as possible.

Furthermore, apart from purely technology-related costs, significant expenses for personnel must be expected. Staff are required for implementing and monitoring data acquisition, maintaining the archive, developing and adapting solutions, and incorporating changes in the digital environment.

Monitor the changing digital environment and adapt to it -- Virtually any digital environment is subject to continuous change. This is all the more true for an open and highly dynamic construct such as the Internet. Therefore, constant surveillance has to be implemented and systems have to be adapted whenever necessary. After all, the archive is built to endure changing data formats and the advance of technology; it is designed to incorporate new sources and to tackle the further challenges that are to be expected.


Andreas Aschenbrenner