DELOS Summer School 2007 on Digital Preservation


At Santa Croce in Fossabanda (Pisa, Italy) from 3 June 2007 to 8 June 2007


Markus Opitz


Table of Contents

1. Introduction
1.1. General description
1.2. Location and lecture environment
1.3. Time horizon
1.4. Participants
1.5. Breakout sessions
1.6. Aims and goals of the summer school
2. Lectures
2.1. Repositories
2.1.1. Decision making
2.1.2. Criteria catalog
2.1.3. Assessment of existing products
2.2. Metadata
2.2.1. Definition
2.2.2. Approach via genres
2.3. PREMIS
2.3.1. Aim of the project
2.3.2. PREMIS data dictionary
2.3.3. Issues of the data dictionary
2.4. kopal and DIAS
2.4.1. Archive material
2.4.2. Technical and development partners
2.4.3. Usage purpose
2.4.4. Workflow
2.4.5. Object specification
2.4.6. Software
2.5. File formats and registries
2.5.1. Vulnerabilities
2.5.2. Recommended formats
2.5.3. Registries
2.6. XCEL and XCDL
2.6.1. Automate with XCEL and XCDL
3. Conclusion and outlook

 


1. Introduction

This chapter gives an overall description of the summer school: where it was held, the time schedule, and the participants and their backgrounds. It finishes by outlining the aims and goals of the summer school and what each participant should be able to take home from this unique experience in the heart of Italy.

1.1. General description

The summer school took place in Pisa, Italy. Its audience consisted of young researchers and professionals with an interest in the topic.

 

The major topics and issues treated in this course can be outlined by the following questions:

· Why should we preserve digital materials?
· Who should do it?
· What should they do?
· Where should they do it?
· When should it be done?
· How should it be done?
· For how long should we do it?
· How should we fund preservation?

1.2. Location and lecture environment

The summer school took place at the

 

Santa Croce in Fossabanda Conference Centre

Piazza Santa Croce, 5

56125 Pisa, Italy

Web site: http://www.fossabanda.it

 

Santa Croce in Fossabanda was an excellent venue for the summer school, as it had a suitable meeting room on the ground floor overlooking the majestic churchyard. At the same time, the location met all the requirements of a modern conference: a video projector was of great use, and WLAN access was available to all participants.

1.3. Time horizon

The summer school on digital preservation was an intensive course consisting of nine half-day lecture sessions during the week of 3 to 8 June 2007. Each of the established lecturers led a session. The schedule in detail was:

Sunday, 3 June 2007

18:30 – 19:30: Overview of the summer school and introduction to the lecturers

19:30 – 21:30: Opening reception

Monday, 4 June 2007

09:00 – 12:45: Introduction to digital preservation & preservation challenges

14:00 – 17:30: Metadata approaches for digital preservation

Tuesday, 5 June 2007

09:00 – 12:45: Registries and file formats

14:00 – 17:30: Approaches to preservation (including breakout activity and demonstration of migration and emulation)

Wednesday, 6 June 2007

09:00 – 12:45: Identifying, evaluating and selecting preservation methods

14:00 – 17:30: Afternoon tour to Lucca

Thursday, 7 June 2007

09:00 – 12:45: Current and emerging storage technologies

14:00 – 17:30: Repository models: from experimentation to services

Friday, 8 June 2007

09:00 – 12:45: A systematic approach to preserving authenticity

14:00 – 17:30: Mechanisms for assessing preservation service performance


1.4. Participants

It was very pleasing that most of the lecturers were available for further discussion and for deepening one's knowledge of a particular area.

 

Here is a list of these scientists and professionals:

 

UK, Michael Day, UKOLN

IT, Maria Guercio, University of Urbino & CASPAR

NL, Hans Hofman, National Archives of the Netherlands

UK, Perla Innocenti, Digital Curation Centre

AT, Andreas Rauber, Vienna University of Technology

DE, Peter Rödig, Universität der Bundeswehr München

UK, Seamus Ross, HATII at the University of Glasgow

DE, Stefan Strathmann, NESTOR

DE, Manfred Thaller, University of Cologne

US, Ken Thibodeau, NARA

 

The participants, among them many students from different universities, usually had a background in computer science or digital preservation, or experience with archives and libraries. Most of the students attending this seminar were master's students or PhD candidates.

1.5. Breakout sessions

The summer school was not all lectures, however. In between the talks, students were encouraged to form small groups with other participants. Within time slots of usually one hour, they had the chance to gather their ideas on a given scenario and to present their thoughts and findings to the other groups in the previously mentioned auditorium.

 

Sometimes a lecturer was present at their side to counsel them and provide additional background information so that they could solve the given task. And because people were distributed rather randomly, the group work turned out to be very interesting, as people from different backgrounds (e.g. technical versus archival) came up with different ideas, thoughts and approaches.

1.6. Aims and goals of the summer school

The Summer School aimed to give students a coherent understanding of the issues surrounding digital preservation within the context of digital library development and management, and an introduction to the emerging research in the area of digital curation and preservation.

 

By the end of the summer school, students had gained an appreciation of the issues surrounding digital preservation within the context of digital library development and management. They were expected to have developed a grasp of the core research in digital curation and preservation and to know the techniques and practices that underlie this area. Students should further be aware of how these requirements can be integrated into approaches to digital library development. They acquired a coherent and practical understanding of the activities surrounding digital preservation and gained experience with workflow modelling, metadata definition, and ingest process management.

They also acquired an appreciation of the different approaches to selecting and appraising potential digital acquisitions, and gained knowledge of approaches to repository design and deployment. Further, they obtained a working knowledge of the issues surrounding assessment, audit and certification of digital repositories. Last but not least, they developed a firm understanding of the issues of authenticity, integrity, and reliability in relation to digital libraries.

2. Lectures

2.1. Repositories

This lecture covered the relevant ideas about how to build a repository. It was given by Peter Rödig of the Institut für Softwaretechnologie at the Fakultät für Informatik of the Universität der Bundeswehr München, Germany.

The main goal of this lecture was to understand why building a repository that can store digital objects safely over a long time is a complex but nevertheless important undertaking.

 

There are several challenges in building a repository, and many factors have to be taken into account. Digital objects are inherently complex, which makes it difficult to choose file formats suitable for preservation. Individual requirements for a repository are quite heterogeneous: an archive of images certainly differs in its needs from an archive of personal records. Common criteria for quality (trustworthiness) are rather abstract and need to be clarified. Some frameworks are far away from the implementation level, and even where implementations exist, the repository software is often complex; the documentation of such products and the quality of a system are also still an issue. Finally, the increasing virtualization of technical infrastructure, e.g. GRID computing, raises new challenges that have to be met.

 

2.1.1. Decision making

 

Taking all these issues into account, the problem can be decomposed in the following way: first, reference models, technologies and use cases help to build a criteria catalog, a so-called rating schema. Applying this schema to existing archiving products on the market yields product descriptions that are comparable with each other. Combined with an individual requirements analysis, these descriptions then support the product selection.

2.1.2. Criteria catalog

 

Taking the above arguments into account leads to the following criteria catalog for building a digital repository:

· Overall system architecture (design principles, explicit long-term features, object organization, metadata organization, rights, roles, functions for ingest, access, archival storage, administration)
· System/application integration (library system, publishing system, product data management system, other archives)
· Organizational integration (federation, cooperation, user communities)
· Software architecture
· Hardware basis

 

2.1.3. Assessment of existing products

 

As a last step, the lecturer gave an overview of several products available on the market in the area of digital repositories. A comparison was drawn using the following criteria: object formats, object identification, relationships between objects, versions and variants, metadata schemas, and the relation of objects to their metadata.

2.2. Metadata

Is metadata just a buzzword? Not at all: metadata is crucial for repositories. One wants to be able to find things in an archive, and metadata does not merely support this process, it is a requirement.

2.2.1. Definition

Metadata is, in general, data about data: information about other data. Here, the data to be described are collections such as documents, books, databases, and files of all kinds. Metadata can capture important properties and characteristics of a digital object, and in order to find objects by given search criteria it is absolutely essential to be able to rely on meaningful metadata.

 

Metadata can be applied to almost any kind of digital material: web sites, databases, web pages, digital images, books, archival records, collections, museum objects, services, organisations, events, geographical locations, concepts and so on. It enables efficient management of large bodies of digital material (e.g. selection, appraisal, ingest, preservation, and access).

 

But as resources are limited, manual collection faces limits: it is time-consuming and expensive, and the exponential growth in the number of digital objects being created makes continued manual collection impossible. Another issue is that a repository cannot rely on contributors to provide the necessary metadata, because they usually do not do so in an appropriate manner.

 

Automatic extraction is therefore the obvious solution to this problem. Examples of useful tools are listed below; a toy extraction sketch follows the list.

· Technical metadata extraction: National Library of New Zealand Metadata Extraction Tool, the UK National Archives' DROID, etc.
· Descriptive metadata extraction: DC-dot (http://www.ukoln.ac.uk/metadata/dcdot/), Automatic Metadata Generation (Katholieke Universiteit Leuven) and so on.
· Descriptive metadata extraction is based on structured documents such as web pages or XML and is limited to shallow description.
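To give a flavour of what such shallow descriptive extraction involves, here is a toy extractor in the spirit of DC-dot, built only on the Python standard library. The field names and the title fallback rule are illustrative assumptions, not the behaviour of any tool listed above.

    from html.parser import HTMLParser

    class DCExtractor(HTMLParser):
        """Collect <title> and <meta name=... content=...> values from HTML."""
        def __init__(self):
            super().__init__()
            self.metadata = {}
            self._in_title = False

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "title":
                self._in_title = True
            elif tag == "meta" and "name" in attrs and "content" in attrs:
                # DC-aware pages carry fields such as DC.creator;
                # plain pages at least offer description/keywords.
                self.metadata[attrs["name"].lower()] = attrs["content"]

        def handle_endtag(self, tag):
            if tag == "title":
                self._in_title = False

        def handle_data(self, data):
            if self._in_title:
                self.metadata.setdefault("dc.title", data.strip())

    page = """<html><head><title>Annual Report 2006</title>
    <meta name="DC.creator" content="J. Smith">
    <meta name="description" content="Financial results"></head></html>"""
    extractor = DCExtractor()
    extractor.feed(page)
    print(extractor.metadata)
    # {'dc.title': 'Annual Report 2006', 'dc.creator': 'J. Smith', 'description': 'Financial results'}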

2.2.2. Approach via genres

Collecting metadata from almost any source with almost any content is rather easy for humans with their background knowledge. For applications, the same task is extremely difficult.

 

The idea is therefore to classify a given document into a genre, so that a specialized tool can extract the metadata of that genre rather efficiently. A metadata collector created for a certain document type achieves better and more meaningful recognition of important metadata.

 

To give an idea of which genres can be used in classifying digital objects, the following list contains the most important groups, together comprising 70 genres: book, article, short composition, serial, correspondence, treatise, information structure, evidential document, and visually dominant document. As an example of the genres a group can comprise, "book" includes: Academic Monograph, Book of Fiction, Poetry Book, Handbook, Other Book.
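The genre-based approach can be pictured as a dispatch: classify first, then hand the document to a genre-specific extractor. The classification heuristic and the extractor stubs below are invented purely to illustrate the architecture; real genre classifiers use far richer features.

    def classify_genre(filename: str) -> str:
        """Toy classifier; illustration only."""
        if "monograph" in filename or "thesis" in filename:
            return "book"
        if filename.endswith((".tex", ".bib")):
            return "article"
        return "other"

    def extract_book_metadata(path: str) -> dict:      # illustrative stub
        return {"genre": "book", "source": path}

    def extract_article_metadata(path: str) -> dict:   # illustrative stub
        return {"genre": "article", "source": path}

    EXTRACTORS = {"book": extract_book_metadata,
                  "article": extract_article_metadata}

    def extract(path: str) -> dict:
        genre = classify_genre(path)
        # Unknown genres fall back to a shallow generic extractor.
        extractor = EXTRACTORS.get(genre, lambda p: {"genre": "other", "source": p})
        return extractor(path)

    print(extract("smith_monograph.pdf"))   # {'genre': 'book', 'source': ...}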

2.3. PREMIS

This section is about the lecture on the PREMIS project, given by Stefan Strathmann, who works on the PREMIS project at the Niedersächsische Staats- und Universitätsbibliothek Göttingen. The following sections describe the aim of the project, the PREMIS data dictionary, and open issues.

2.3.1. Aim of the project

The goal of the project is to define an implementable set of "core" preservation metadata elements, with broad applicability within the digital preservation community. A data dictionary is to be drafted to support the core preservation metadata element set. Another goal is to examine and evaluate alternative strategies for the encoding, storage, and management of preservation metadata within a digital preservation system, as well as for the exchange of preservation metadata among systems.

In addition, it is planned to conduct pilot programs for testing the group's recommendations and best practices in a variety of system settings, and to explore opportunities for the cooperative creation and sharing of preservation metadata.

 

The outputs of the project have already been published in the Implementation Survey report (September 2004) and the PREMIS Data Dictionary 1.0 (May 2005), both available on the Internet.

2.3.2. PREMIS data dictionary

OAIS remains the conceptual foundation of the PREMIS data dictionary, albeit with some differences in terminology. The data dictionary is a translation of the OAIS-based 2002 framework into a set of implementable semantic units.

 

Preservation metadata is, roughly, "the information a repository uses to support the digital preservation process". The PREMIS data dictionary defines metadata that supports "maintaining viability, renderability, understandability, authenticity, and identity in a preservation context" and can be seen as a new 'canonical' definition of preservation metadata.

 

The data dictionary is implementation independent, i.e. it does not define how the metadata should be stored. It is based on a simple data model that defines five types of entities: intellectual entities, objects, events, agents, and rights. The dictionary itself defines semantic units for objects, events, agents and rights.
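As a rough illustration of that data model, the sketch below renders the five entity types as Python dataclasses. The selection of fields is a drastic simplification invented for this report; the real dictionary defines many more semantic units per entity.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Agent:
        identifier: str
        name: str
        agent_type: str                  # e.g. person, organization, software

    @dataclass
    class Event:
        identifier: str
        event_type: str                  # e.g. ingest, fixity check, migration
        date_time: str
        linked_agents: List[str] = field(default_factory=list)

    @dataclass
    class Rights:
        identifier: str
        basis: str                       # e.g. license, statute

    @dataclass
    class DigitalObject:
        identifier: str
        format_name: str
        fixity_digest: str
        linked_events: List[str] = field(default_factory=list)
        linked_rights: List[str] = field(default_factory=list)

    @dataclass
    class IntellectualEntity:
        identifier: str
        title: str
        representations: List[DigitalObject] = field(default_factory=list)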

2.3.3. Issues of the data dictionary

The PREMIS Data Dictionary is an important contribution to the ongoing development of preservation metadata. It is, however, implementation independent: it brings its own definition of the semantics and only a suggested XML binding. Non-PREMIS elements do not conflict or overlap with PREMIS semantic units, although more harmonisation between them is needed. An important open issue is the exchange of objects: mandatory metadata must be extractable and packaged together with the object.

 

2.4. kopal and DIAS

The kopal project is dedicated to solving the long-term preservation of digital objects in a cooperative way. It is implemented following the international standards for long-term archiving and metadata within the OAIS framework (Open Archival Information System). A transparent integration into existing information and library systems is a major goal of the project.

 

The partners Deutsche Nationalbibliothek, Niedersächsische Staats- und Universitätsbibliothek Göttingen, Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG) and IBM Deutschland GmbH collaborate on a technical solution for long-term preservation that not only tries to save digital objects but also ensures their future availability.

2.4.1. Archive material

The project itself started in July 2004. The goal is to take all data of the partners Deutsche Nationalbibliothek and Niedersächsische Staats- und Universitätsbibliothek Göttingen (SUB Göttingen) into the long-term archive. This covers all kinds of digital data, ranging from objects in various formats like PDF, TIFF, or TeX to complex objects (e.g. digital videos).

2.4.2. Technical and development partners

The technical operation is carried out by the partner Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG). The development part was taken over by IBM Deutschland GmbH, which provides professional adaptation of software components and ensures stable long-term support.

2.4.3. Usage purpose

Right from the beginning, the kopal project has integrated different partners at different locations. The long-term archive for digital information will also assist other companies or institutions in keeping their data usable and accessible. kopal therefore offers a range of possibilities that exceeds library applications alone: academic, business and administrative usage are fields of application as well.

2.4.4. Workflow

The workflow can be divided into the following steps (a sketch in code follows the list):

Selection: the institution selects digital objects that are to be preserved for long-term future use, taking content-specific and formal criteria into consideration (e.g. the orientation of the library).

Collecting and creating metadata: to ensure correct storage and retrieval, additional information is stored along with the object. This technical information is necessary in order to update and migrate digital objects from time to time. The metadata is taken from information systems and also extracted from the objects themselves using special tools.

Creation of a package: the digital objects are bundled together with their metadata into a special format, the so-called UOF (Universal Object Format).

Inserting and ingest of the package: before the system stores the data in the archive, it checks the data for completeness and formal correctness.

Converting into an archive package: the metadata is transferred to the data management component. The content files and the metadata file are saved on mass storage managed by DIAS.

Access: via the data management interface the metadata, and consequently the data itself, can be accessed efficiently.

Delivery: depending on the requesting party, the metadata and the content data can be delivered in the desired format.

Usage of data: users usually access archived digital objects via an information system. The user can be made aware that he is accessing long-term archived data, and he can choose whether to retrieve the data in a format that is not necessarily up to date.
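To make the steps concrete, here is a minimal sketch of such an ingest pipeline in Python. It is an illustration only: the packaging layout, the metadata fields, and all function names are assumptions and do not correspond to actual koLibRI or DIAS interfaces.

    import hashlib
    import io
    import json
    import zipfile

    def harvest_metadata(name: str, data: bytes) -> dict:
        """Collecting and creating metadata (here only trivial technical fields)."""
        return {"filename": name,
                "size": len(data),
                "sha1": hashlib.sha1(data).hexdigest()}

    def build_package(name: str, data: bytes, metadata: dict) -> bytes:
        """Creation of a package: bundle content and metadata, loosely UOF-style."""
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as z:
            z.writestr(f"content/{name}", data)
            z.writestr("metadata.json", json.dumps(metadata))
        return buf.getvalue()

    def validate_package(package: bytes) -> bool:
        """Ingest check: completeness and formal correctness of the package."""
        with zipfile.ZipFile(io.BytesIO(package)) as z:
            return "metadata.json" in z.namelist() and z.testzip() is None

    data = b"example content"
    meta = harvest_metadata("report.pdf", data)
    pkg = build_package("report.pdf", data, meta)
    print(validate_package(pkg))   # True -> package would now go to archival storage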

2.4.5. Object specification

In order to reconstruct and maintain archived digital objects, there is a clear need for structured archive packages. kopal and its archiving system DIAS use the UOF (Universal Object Format). It describes the package structure including metadata and serves as archival format and exchange format at the same time. The UOF can contain arbitrary file structures and store content-specific as well as technical metadata; no limitations are imposed regarding particular file formats or media categories. It is therefore possible to store common file formats like TIFF, PDF, XML and so on, but ISO images of CDs or video files can be handled as well.

 

METS (Metadata Encoding and Transmission Standard), LMER (Langzeitarchivierungsmetadaten für elektronische Ressourcen) and DC (Dublin Core) are used for handling metadata; other XML metadata can be used as well. Specific technical metadata can be extracted with the software tool JHOVE, which was originally developed for the journal archive JSTOR.

A core part of the metadata in the UOF documents the complete migration history: all conversions that were necessary to ensure the long-term availability of the record are listed. This data is preserved along with the technical information on the individual files of an object, so that users can access the content at a later time.
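The migration history can be pictured as an append-only list of conversion events attached to the object. The record layout below is invented for illustration and is much simpler than the actual LMER process metadata.

    import datetime
    import json

    history = []   # migration history of one archived object

    def record_migration(source_fmt: str, target_fmt: str, tool: str) -> None:
        """Append one conversion event to the object's migration history."""
        history.append({"date": datetime.date.today().isoformat(),
                        "from": source_fmt,
                        "to": target_fmt,
                        "tool": tool})

    record_migration("WordPerfect 5.1", "PDF/A-1", "converter-x 0.9")
    record_migration("PDF/A-1", "PDF/A-2", "converter-y 1.2")
    print(json.dumps(history, indent=2))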

2.4.6. Software

The kopal archiving system consists of the DIAS-Core, developed by IBM, and the kopal tools, developed by the partners SUB Göttingen and Deutsche Nationalbibliothek. The DIAS-Core is based on standard IBM software components. The software packages created by the Deutsche Nationalbibliothek and the SUB Göttingen are a close match to the DIAS-Core; these kopal tools are distributed under an open source license as the "kopal Library for Retrieval and Ingest" (koLibRI).

2.5. File formats and registries

Digital long-term preservation is all about files, and as there is no file without a format, it is crucial to look more closely at issues with file formats. Manfred Thaller presented this important aspect of digital preservation in a very striking manner.

 

First of all, it was discussed what a file format is and why it is so important. The definition given was: "A deterministic specification how the properties of a digital object can reversibly be converted into a linear byte stream (bit stream)."
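The word "reversibly" carries the weight in this definition: serialising an object into a byte stream and parsing the stream back must reproduce the object. The following toy round trip illustrates this with a made-up fixed-layout format ("TOYF"); it is not any real format.

    import struct

    MAGIC = b"TOYF"   # invented 4-byte signature of our toy format

    def serialize(width: int, height: int, pixels: list) -> bytes:
        """Convert the object's properties into a linear byte stream."""
        assert len(pixels) == width * height
        return MAGIC + struct.pack(f">HH{len(pixels)}B", width, height, *pixels)

    def deserialize(stream: bytes):
        """Recover the properties from the byte stream (the reverse direction)."""
        if stream[:4] != MAGIC:
            raise ValueError("not a TOYF byte stream")
        width, height = struct.unpack(">HH", stream[4:8])
        pixels = list(stream[8:8 + width * height])
        return width, height, pixels

    obj = (2, 2, [0, 64, 128, 255])
    assert deserialize(serialize(*obj)) == obj   # the reversibility requirement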

 

The content of a file always has the following characteristics:

  <basic information>       What to do?                                        mandatory
  <rendering information>   How to do it?                                      useful
  <storage information>     How to move it from persistent to deployed form?   historical
  <data>                    What to deploy?                                    mandatory

2.5.1. Vulnerabilities

Next, Mr. Thaller from the Universität zu Köln talked about the vulnerabilities of file formats, which can destroy or at least harm digital objects stored in a repository.

 

Bit rot is commonly defined as the event in which the small electric charge of a bit in memory disperses, possibly altering program code and thus a digital object. The term is also used for stored data gradually decaying over many years, for example on CDs, DVDs or other kinds of consumer storage.

 

The problem is that such errors frequently go unnoticed: unless an explicit detection mechanism (checksums and the like) is in place, software will not recognize the corruption.
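This is why repositories apply explicit fixity checks: a cryptographic digest is stored at ingest time and recomputed periodically, and a mismatch signals silent corruption. A minimal sketch follows; the choice of SHA-256 and the surrounding workflow are assumptions, not any specific repository's practice.

    import hashlib
    from pathlib import Path

    def sha256(path: Path) -> str:
        """Compute a file's digest in chunks, so large objects fit in memory."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def check_fixity(path: Path, stored_digest: str) -> bool:
        """Recompute the digest and compare with the value stored at ingest.
        A mismatch indicates silent corruption such as bit rot."""
        return sha256(path) == stored_digest

    # At ingest:    digest = sha256(Path("master.tiff")); store digest with the object.
    # Periodically: if not check_fixity(Path("master.tiff"), digest): flag for repair.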

 

Another problem is the obsolescence of software. It develops over time, and it is very complicated to establish measures against it. While bit rot only affects individual files, the obsolescence of software is more severe, as it affects all files of a given type in an archive.

 

The reasons for this can be diverse: for example, the software able to read the format no longer exists, or the format specification, the implied algorithm or a required object has been lost.

2.5.2. Recommended formats

As we have seen, file formats are vulnerable to different extents. The FCLA (Florida Center for Library Automation) has published format information on various file formats.

 

The next table shows the ranking of textual information (http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf):

Media: Text

  High confidence level:
  - Plain text (encoding: US-ASCII, UTF-8, UTF-16 with BOM)
  - XML (includes XSD/XSL/XHTML, etc.; with included or accessible schema and character encoding explicitly specified)
  - PDF/A-1 (ISO 19005-1) (*.pdf)

  Medium confidence level:
  - Cascading Style Sheets (*.css)
  - DTD (*.dtd)
  - Plain text (ISO 8859-1 encoding)
  - PDF (*.pdf) (embedded fonts)
  - Rich Text Format 1.x (*.rtf)
  - HTML 4.x (include a DOCTYPE declaration)
  - SGML (*.sgml)
  - Open Office (*.sxw/*.odt)
  - Office Open XML (*.docx)

  Low confidence level:
  - PDF (*.pdf) (encrypted)
  - Microsoft Word (*.doc)
  - WordPerfect (*.wpd)
  - DVI (*.dvi)
  - All other text formats not listed here

Media: Audio
  […]

Media: Video
  […]

2.5.3. Registries

There is a clear need for registries of file format information. Repository functions are performed on a format-specific basis, and presenting complex content appropriately is difficult because the interpretation of otherwise opaque content streams depends on knowing how typed content is represented. One also has to take into account that interchange requires mutual agreement on format syntax and semantics.

 

Potential use cases for a file format registry include the following (a sketch of the identification case follows the list):

· Identification - I have a digital object; what format is it?
· Validation - I have an object purportedly of format F; is it?
· Transformation - I have an object of format F, but need G; how can I produce it?
· Characterization - I have an object of format F; what are its significant properties?
· Risk assessment - I have an object of format F; is it at risk of obsolescence?
· Delivery - I have an object of format F; how can I render it?
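The identification use case is the easiest to make concrete: registries such as PRONOM associate formats with signature byte sequences ("magic bytes"). The sketch below uses a tiny, hand-picked signature table; it is an invented excerpt, not actual registry data.

    # A tiny invented signature table; real registries such as PRONOM hold
    # far richer signatures with versions, offsets and priorities.
    SIGNATURES = [
        (b"%PDF-", "Portable Document Format (PDF)"),
        (b"\x89PNG\r\n\x1a\n", "Portable Network Graphics (PNG)"),
        (b"\xff\xd8\xff", "JPEG image"),
        (b"PK\x03\x04", "ZIP container (also ODT/DOCX)"),
    ]

    def identify(path: str) -> str:
        """Answer 'I have a digital object; what format is it?' via magic bytes."""
        with open(path, "rb") as f:
            header = f.read(16)
        for magic, name in SIGNATURES:
            if header.startswith(magic):
                return name
        return "unknown format"

    # identify("thesis.pdf") -> 'Portable Document Format (PDF)'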

 

One might ask why not simply use the common MIME types. There are two major reasons against that approach. The first is the insufficient depth of detail of those types: there are neither requirements regarding syntax and semantic description nor requirements for complete disclosure, especially of proprietary formats. The second is their insufficient granularity: different PDF standards, for example, might require radically different workflows.

2.6. XCEL and XCDL

Moving on from the description of file formats, Mr. Thaller continued by outlining the problems that arise when converting an old file format into a new one.

 

The scenario is that a file exists at a certain date in a certain (old) format. This format can be problematic for long-term preservation, as new software might no longer support it. At the same time, software systems exist that convert old formats into new ones.

 

Before rushing into an irreversible conversion, one has to make sure that the following questions are answered. Does the new format contain all the information that the old format contains? Does the new format contain all information that is crucial for the use of the information in the old format? And is conversion process A better than conversion process B, i.e. does process A preserve more information?

 

To answer these questions, the information in the old and the new format must be made comparable: the retrieved information must be expressible in a third format that is independent of the other two. Different parts of the information contained in a format also have different levels of importance; consider, for example, the colours versus the metadata in an image file.

 

To express the information contained in a file in an abstract way, the XCDL (Extensible Characterisation Definition Language) can be used. XCDL makes it possible to express the information content of binary files. As a last step, an algorithm is needed that compares two such descriptions (XCDLs).
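Assuming both the old and the new file have already been characterised into flat property sets (a stand-in for real XCDL documents, which are structured XML), the comparison step can be sketched as follows; the property names are invented.

    def compare(props_old: dict, props_new: dict) -> dict:
        """Report properties lost or changed by a conversion."""
        lost = {k: props_old[k] for k in props_old.keys() - props_new.keys()}
        changed = {k: (props_old[k], props_new[k])
                   for k in props_old.keys() & props_new.keys()
                   if props_old[k] != props_new[k]}
        return {"lost": lost, "changed": changed}

    before = {"width": 640, "height": 480, "colour_depth": 24, "exif.author": "A. N. Other"}
    after = {"width": 640, "height": 480, "colour_depth": 8}

    print(compare(before, after))
    # {'lost': {'exif.author': 'A. N. Other'}, 'changed': {'colour_depth': (24, 8)}}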

2.6.1. Automate with XCEL and XCDL

The step of describing the information contained in binary files usually has to be applied to a large number of digital objects, so there is a clear need to automate the process.

 

One means of automating the process is to create a program for every single file format. A better approach is to provide a language in which arbitrary binary files can be transformed into an XCDL description. This requires the definition of the XCEL (Extensible Characterisation Extraction Language): an XCEL describes how information is to be extracted from a given file.

 

The vision is that XCEL and XCDL are sufficient to describe arbitrary file formats and thus to extract the properties of any file. The XCDLs describe the properties (characteristics) extracted from digital files, while XCEL provides the mapping between procedural descriptions of where information is located within a physical file and its abstract interpretation beyond the format specification.
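To see why one extraction language beats one program per format, consider a toy interpreter driven by a declarative format description, here reusing the invented "TOYF" layout from the earlier sketch; the rule format is made up and vastly simpler than XCEL.

    import struct

    # Toy stand-in for an XCEL description: (property name, byte offset, struct code).
    TOYF_RULES = [("width", 4, ">H"), ("height", 6, ">H")]

    def characterise(stream: bytes, rules) -> dict:
        """One generic engine, many format descriptions: produce the property set."""
        props = {}
        for name, offset, code in rules:
            size = struct.calcsize(code)
            (props[name],) = struct.unpack(code, stream[offset:offset + size])
        return props

    sample = b"TOYF" + struct.pack(">HH", 640, 480)
    print(characterise(sample, TOYF_RULES))   # {'width': 640, 'height': 480}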

3. Conclusion and outlook

We hope that this overview of the DELOS summer school on digital preservation 2007 was interesting for the readers of this report. The topic of digital preservation will not become outdated; only files and their standards will. What should be clear by now is that this topic will remain relevant for a long time.

 

A lot of scientific work and research is going on, in companies as well as academic institutions, and every contribution can help to preserve our knowledge for future generations.