Markus Opitz
This chapter contains an overall description of the conference. It deals with basic information about where it was held as well as the time schedule. Next, all the participants and their backgrounds are described in detail. The chapter finishes by outlining the aims and goals of the summer school and what each participant should be able to take home from this unique experience in the heart of Tuscany.
The summer school took place in Pisa, Italy. The audience of this summer school consisted of young researchers and professionals with an interest in this topic.
The major topics and issues treated in this course can be outlined by the following questions:
· Why should we preserve digital materials?
· Who should do it?
· What should they do?
· Where should they do it?
· When should it be done?
· How should it be done?
· For how long should we do it?
· How should we fund preservation?
The summer school took place at the Santa Croce in Fossabanda Conference Centre, Piazza Santa Croce 5, 56125 Pisa, Italy (web site: http://www.fossabanda.it).
Santa Croce in Fossabanda was an excellent place for this summer school as it had a suitable meeting room. It was located on the ground floor, overlooking the majestic churchyard. At the same time, all necessities for a modern conference were met by this location: a video projector was of great use, and WLAN access was available for all participants of the conference.
The summer school on digital preservation was an intensive course and consisted of nine half-day sessions during the week from 3 June 2007 to 8 June 2007. Each of the established lecturers led a session. The time frame in detail was:
Sunday, 3 June 2007: 18:30 – 19:30 Overview of the summer school and introduction to the lecturers; 19:30 – 21:30 Opening reception
Monday, 4 June 2007: 09:00 – 12:45 Introduction to digital preservation & preservation challenges; 14:00 – 17:30 Metadata approaches for digital preservation
Tuesday, 5 June 2007: 09:00 – 12:45 Registries and file formats; 14:00 – 17:30 Approaches to preservation (including breakout activity and demonstration of migration and emulation)
Wednesday, 6 June 2007: 09:00 – 12:45 Identifying, evaluating and selecting preservation methods; 14:00 – 17:30 Afternoon tour
Thursday, 7 June 2007: 09:00 – 12:45 Current and emerging storage technologies; 14:00 – 17:30 Repository models: from experimentation to services
Friday, 8 June 2007: 09:00 – 12:45 A systematic approach to preserving authenticity; 14:00 – 17:30 Mechanisms for assessing preservation service performance
It was very nice to see that most of the lecturers were available for further discussion and for deepening one's knowledge in a particular area.
Here is a list of the scientific members and professionals:
· Maria Guercio, University of Urbino & CASPAR (IT)
· Hans Hofman, National Archives of the Netherlands (NL)
· Andreas Rauber, Vienna University of Technology (AT)
· Peter Rödig, Universität der Bundeswehr München (DE)
· Stefan Strathmann, NESTOR (DE)
· Manfred Thaller, Universität zu Köln (DE)
· Ken Thibodeau, US National Archives and Records Administration (US)
The participants, among them many students from different universities, usually had a background in computer science, digital preservation, or archiving and libraries. Most of the students joining this seminar were in their master's studies or were doing a PhD.
Nevertheless, the summer school was not all about the lectures. In between the interesting talks, students were encouraged to form small groups with other participants. Within time slots of usually one hour, they had the chance to gather their ideas on a given scenario and to present their thoughts and findings to the other groups in the previously mentioned auditorium.
Sometimes a lecturer was present at their side to counsel them and provide additional background information so that they were able to solve the given task. And because people were assigned rather randomly, the group work turned out to be very interesting, as people from different backgrounds (e.g. technical versus archiving) came up with different ideas, thoughts and approaches.
The Summer
School aimed to give students a coherent understanding of the issues
surrounding digital preservation within the context of digital library
development and management, and an introduction to the emerging research in the
area of digital curation and preservation.
By the end of the summer school, students had gained an appreciation of the issues surrounding digital preservation within the context of digital library development and management. They were expected to develop a grasp of the core research in the area of digital curation and preservation and to know about the techniques and practices that underlie this area. Students should further be aware of how these requirements can be integrated into approaches to digital library development. They acquired a coherent and practical understanding of activities surrounding digital preservation and gained experience with issues in the area of workflow modelling, metadata definition, and ingest process management.
They also acquired an appreciation of the different approaches to selecting and appraising potential digital acquisitions and gained knowledge of approaches to repository design and deployment. Further, they obtained a working knowledge of the issues surrounding assessment, audit and certification of digital repositories. Last, but not least, they developed a firm understanding of the issues of authenticity, integrity, and reliability in relation to digital libraries.
This lecture covered the relevant ideas about how to build a repository. It was given by Peter Rödig, who works at the Institut für Softwaretechnologie of the Fakultät für Informatik at the Universität der Bundeswehr München.
The main goal of this lecture was to understand why building and creating a repository that can store digital objects safely over a long time is a complex but nevertheless important task.
There are several challenges in building a repository, and many factors have to be taken into account. For example, digital objects are inherently complex, and thus it is difficult to choose file formats suitable for preservation. Individual requirements for a repository are quite heterogeneous: an archive of images certainly differs in its needs from an archive of personal records. Common criteria for quality (trustworthiness) are rather abstract and need to be clarified. Some frameworks may be far away from the implementation level, and even where implementations exist, the repository software is often complex. Documentation of such products and the quality of a system are also still an issue. The ongoing virtualization of technical infrastructure, e.g. grid computing, raises new challenges that have to be met.
Taking all these issues into account, the problem can be decomposed in the following way: first, reference models, technologies and use cases help to build a criteria catalogue, a so-called rating schema. From the archiving products available on the market, product descriptions can then be extracted that are comparable with each other. Finally, additional information from an individual requirements analysis helps to carry out the product selection.
Taking into account all the above arguments leads to the following criteria catalogue for building a digital repository:
· Overall system architecture (design principles, explicit long-term features, object organization, metadata organization, rights, roles, functions for ingest, access, archival storage, administration)
· System/application integration (library system/publishing system/product data management system/other archives)
· Organizational integration (federation/cooperation/user communities)
· Software architecture
· Hardware basis
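To make the idea of a rating schema more concrete, the following short Python sketch shows how weighted criteria could be used to compare candidate products; the criteria names, weights, products and scores are purely illustrative and not taken from the lecture.

    # Minimal sketch of a rating schema for comparing repository products.
    # All criteria, weights, products and scores are illustrative only.
    criteria_weights = {
        "long_term_features": 3,    # explicit long-term preservation features
        "metadata_support": 2,      # metadata organisation and schemas
        "system_integration": 2,    # integration with library/publishing systems
        "software_architecture": 1,
    }

    products = {
        "Product A": {"long_term_features": 4, "metadata_support": 3,
                      "system_integration": 2, "software_architecture": 5},
        "Product B": {"long_term_features": 2, "metadata_support": 5,
                      "system_integration": 4, "software_architecture": 2},
    }

    def weighted_score(scores):
        """Combine per-criterion scores (0-5) into one comparable number."""
        return sum(weight * scores.get(criterion, 0)
                   for criterion, weight in criteria_weights.items())

    for name, scores in sorted(products.items(), key=lambda p: -weighted_score(p[1])):
        print(name, weighted_score(scores))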
As a last step, the lecturer gave an overview of several products available on the market in the area of digital repositories. A comparison was drawn using the following criteria: object formats, object identification, relationships between objects, versions, variants, metadata schemas, and the relation between objects and their metadata.
Is that a buzzword? Not at all. Metadata is crucial for repositories. The argument for metadata is that one wants to be able to find something in an archive, and metadata does not merely support this process, it is a requirement.
Metadata in general is data about data, that is, information about other data. The data to describe in this case are collections such as documents, books, databases, files of all kinds and so on. Metadata can contain useful information about important properties and characteristics of a digital object. In order to be able to find objects given certain search criteria, it is absolutely essential to be able to rely on meaningful metadata.
Metadata can be applied to almost any kind of digital material: web sites, databases, web pages, digital images, books, archival records, collections, museum objects, services, organisations, events, geographical locations, concepts and so on. It enables efficient management of large amounts of digital material (e.g. selection, appraisal, ingest, preservation, and access).
But as resources are limited, manual collection of metadata faces limitations. It is time consuming and expensive, and given the exponential growth of digital objects being created, continued manual collection is impossible. Another issue is that a repository cannot rely on contributors to provide the necessary metadata, because usually people do not do it in an appropriate manner.
Automatic extraction is therefore the obvious solution to this problem. Examples of useful tools are:
· Technical metadata extraction: National Library of New Zealand Metadata Extraction Tool, UK National Archives DROID, etc.
· Descriptive metadata extraction: DC-dot (http://www.ukoln.ac.uk/metadata/dcdot/), Automatic Metadata Generation (Katholieke Universiteit Leuven), and so on.
· Descriptive metadata extraction is based on structured documents such as web pages or XML and is limited to shallow description.
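To give a rough impression of what automatic extraction of technical metadata involves (independent of the specific tools listed above), the following Python sketch collects a few basic technical properties of a file; real extraction tools produce far richer, format-specific metadata, and the file name used here is hypothetical.

    import hashlib
    import mimetypes
    import os
    from datetime import datetime, timezone

    def basic_technical_metadata(path):
        """Collect a few simple technical metadata fields for a file."""
        stat = os.stat(path)
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        mime_type, _ = mimetypes.guess_type(path)  # guessed from the name only
        return {
            "filename": os.path.basename(path),
            "size_bytes": stat.st_size,
            "last_modified": datetime.fromtimestamp(stat.st_mtime, timezone.utc).isoformat(),
            "mime_type": mime_type or "application/octet-stream",
            "sha256": digest,
        }

    print(basic_technical_metadata("example.pdf"))  # hypothetical file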
Collecting metadata from almost any source with almost any content is rather easy for humans, with their background knowledge. At the same time, it is extremely difficult for applications to do the same task.
So the idea is to classify a given document into a certain genre, so that a specialized tool can extract metadata for that genre in a rather efficient way. A metadata collector created for a certain document type achieves better and more meaningful recognition of important metadata.
To get an idea of which genres can be used in classifying digital objects, the following list contains the most important groups, which together comprise about 70 genres: book, article, short composition, serial, correspondence, treatise, information structure, evidential document, and visually dominant document. As an example of what genres a group can comprise, the group "book" includes Academic Monograph, Book of Fiction, Poetry Book, Handbook, and Other Book.
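As a loose sketch of the genre-based approach, the following Python fragment first classifies a document into a genre and then dispatches it to a genre-specific extractor; the classifier and extractors are only stubs, not any of the tools discussed at the summer school.

    # Sketch only: a trivial genre classifier plus genre-specific extractor stubs.
    def classify_genre(text):
        """Toy classifier; a real one would use trained models or richer heuristics."""
        lowered = text.lower()
        if "abstract" in lowered and "references" in lowered:
            return "article"
        if "isbn" in lowered:
            return "book"
        return "other"

    def extract_article_metadata(text):
        # An article extractor would look for title, authors, abstract, journal, ...
        first_line = (text.splitlines() or [""])[0]
        return {"genre": "article", "title": first_line[:100]}

    def extract_book_metadata(text):
        # A book extractor would look for title, author, ISBN, publisher, ...
        first_line = (text.splitlines() or [""])[0]
        return {"genre": "book", "title": first_line[:100]}

    EXTRACTORS = {"article": extract_article_metadata, "book": extract_book_metadata}

    def extract_metadata(text):
        genre = classify_genre(text)
        return EXTRACTORS.get(genre, lambda t: {"genre": genre})(text)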
This section is about the PREMIS project and was presented by Stefan Strathmann, who works on the PREMIS project at the Niedersächsische Staats- und Universitätsbibliothek Göttingen. In the following sections we describe the aim of the project, the PREMIS data dictionary and further issues.
The goal of
the project is to define an implementable set of "core" preservation
metadata elements, with broad applicability within the digital preservation
community. A data dictionary is to be drafted to support the core preservation
metadata element set. Another goal is to examine and evaluate alternative
strategies for the encoding, storage, and management of preservation metadata
within a digital preservation system, as well as for the exchange of
preservation metadata among systems.
Besides, it
is planned to conduct pilot programs for testing the group’s recommendations
and best practices in a variety of systems settings and to explore
opportunities for the cooperative creation and sharing of preservation
metadata.
The outputs
of the project were already published in the Implementation Survey report
(Sept. 2004) and the PREMIS Data Dictionary 1.0 (May 2005), both available on
the Internet.
OAIS remains the conceptual foundation for the PREMIS data dictionary, albeit with some differences in terminology. The data dictionary is a translation of the OAIS-based 2002 framework into a set of implementable semantic units.
Preservation metadata is, more or less, "the information a repository uses to support the digital preservation process". The PREMIS data dictionary defines metadata that supports "maintaining viability, renderability, understandability, authenticity, and identity in a preservation context" and can be seen as a new 'canonical' definition of preservation metadata.
The Data Dictionary is implementation independent, i.e. it does not define how the metadata should be stored. It is based on a simple data model that defines five types of entities.
The PREMIS
data dictionary defines semantic units for objects, events, agents and rights.
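To give a feel for what such semantic units cover, here is a loose sketch of a preservation metadata record for a single object, written as a plain Python structure; the field names are chosen for readability and are not taken verbatim from the PREMIS Data Dictionary or its XML binding.

    # Simplified, PREMIS-inspired preservation metadata for one object (illustrative only).
    premis_like_record = {
        "object": {
            "identifier": {"type": "local", "value": "obj-0001"},
            "characteristics": {
                "fixity": {"algorithm": "SHA-256", "digest": "<message digest>"},
                "size": 348764,
                "format": {"name": "PDF/A-1", "registry": "PRONOM"},
            },
        },
        "events": [
            {"type": "ingestion", "dateTime": "2007-06-05T10:15:00Z",
             "outcome": "success", "linkedAgent": "ingest-service"},
        ],
        "agents": [{"identifier": "ingest-service", "type": "software"}],
        "rights": [{"basis": "copyright", "grantedPermission": "preserve and migrate"}],
    }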
The PREMIS Data Dictionary is an important contribution to the ongoing development of preservation metadata. It is implementation independent: it brings its own definition of semantics, and a suggested XML binding exists. Non-PREMIS elements do not conflict or overlap with PREMIS semantic units, although there is a need for more harmonisation between them. An important issue is the exchange of objects: mandatory metadata needs to be able to be extracted and packaged together with the object.
The kopal project is dedicated to solving long-term preservation of digital objects in a cooperative way. The kopal project is implemented following the international standards for long-term archiving and metadata within the OAIS framework (Open Archival Information System). A transparent integration into existing information and library systems is a major goal of this project.
The partners Deutsche Nationalbibliothek, Niedersächsische Staats- und Universitätsbibliothek Göttingen, Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG) and IBM Deutschland GmbH collaborate on a technical solution for long-term preservation which not only tries to save digital objects but also ensures their future availability.
The project itself started in July 2004. The goal is to preserve all data of the partners Deutsche Nationalbibliothek and Niedersächsische Staats- und Universitätsbibliothek Göttingen (SUB Göttingen) in the long-term archive. That refers to all kinds of digital data, ranging across various formats such as PDF, TIFF and others.
The technical operation is carried out by the partner Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG). The development part was taken over by IBM Deutschland GmbH, which provides a professional adaptation of software components and ensures stable long-term support.
Right from the beginning, the kopal project has integrated different partners at different locations. The long-term archive for digital information will also assist other companies or institutions in keeping their data usable and accessible. kopal therefore offers a wide range of possibilities that go beyond use by libraries alone; academic, business and administrative usage are also fields of application.
The
workflow can be divided into the following steps:
Selection: the institution selects digital objects that are meant to be preserved for long-term future use. Content-specific and formal criteria are taken into consideration (e.g. the orientation of the library).
Collection and creation of metadata: to ensure correct storage and retrieval of information, additional information is stored along with the object. This technical information is necessary to be able to update and migrate digital objects from time to time. The metadata is taken from information systems and also extracted from the objects themselves using special tools.
Creation of a package: the digital objects are bundled along with their metadata into a special format, the so-called UOF (Universal Object Format).
Insertion and ingest of the package: before the system stores the data in the archive, it checks the data for completeness and formal correctness.
Conversion into an archive package: the metadata is transferred to the data management component. The content files and the metadata file are saved on mass storage managed by DIAS.
Access: using the data management interface, the metadata, and consequently the data itself, can be accessed efficiently.
Delivery: depending on the requesting party, the metadata and the content data can be delivered in the desired format.
Usage of data: users usually access digital objects via an information system. The user can be made aware of the fact that he is accessing long-term stored data. At the same time he can choose to retrieve the data in its original, not necessarily up-to-date, format.
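Read as a whole, the workflow is essentially a pipeline. The following Python sketch only mirrors its structure; the function bodies are placeholders and do not correspond to actual kopal or koLibRI code.

    # Structural sketch of the ingest workflow described above (placeholders only).
    def select_objects(candidates):
        """Selection: apply content-specific and formal criteria."""
        return [obj for obj in candidates if obj.get("selected_for_preservation")]

    def collect_metadata(obj):
        """Take descriptive metadata from information systems and extract
        technical metadata from the object itself."""
        obj["metadata"] = {"descriptive": {}, "technical": {}}
        return obj

    def build_package(obj):
        """Bundle object and metadata into a submission package (UOF in kopal)."""
        return {"content": obj, "metadata": obj["metadata"]}

    def validate_package(package):
        """Check completeness and formal correctness before ingest."""
        return "content" in package and "metadata" in package

    def ingest(package, archive):
        """Convert into an archive package and hand it over to archival storage."""
        archive.append(package)

    archive = []
    candidates = [{"name": "report.pdf", "selected_for_preservation": True}]
    for candidate in select_objects(candidates):
        package = build_package(collect_metadata(candidate))
        if validate_package(package):
            ingest(package, archive)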
In order to be able to reconstruct and maintain archived digital objects, there is a clear need for structured archive packages. kopal and its archiving system DIAS use the UOF (Universal Object Format). It describes the package structure including metadata and serves as archiving format and exchange format at the same time. The UOF can contain any file structure and store content-specific as well as technical metadata. No limitations are imposed regarding particular file formats or media categories. It is therefore possible to store arbitrary file formats like TIFF, PDF, XML and so on, but ISO images of CDs or video files can be handled as well.
METS (Metadata Encoding and Transmission Standard), LMER (Langzeitarchivierungsmetadaten für elektronische Ressourcen) and DC (Dublin Core) are used for the handling of metadata. Besides these, other XML metadata can be used as well. Specific technical metadata can be extracted using the software tool JHOVE, which was originally developed for the journal archive JSTOR.
A core part of the metadata in the UOF documents the complete migration history. All conversions are listed that were necessary to ensure the long-term availability of the record. This data is preserved along with the technical information on the individual files of an object so that users can access the content at a later time.
The kopal archiving system consists of the DIAS-Core developed by IBM and the kopal tools, which were developed by the partners SUB Göttingen and Deutsche Nationalbibliothek. The DIAS-Core is based on standard IBM software components that were developed for this purpose. The Deutsche Nationalbibliothek and the SUB Göttingen created software packages that fit the DIAS-Core closely. The kopal tools are distributed under an open source license and referred to as the "kopal Library for Retrieval and Ingest" (koLibRI).
Digital long-term preservation of data is all about files. And as there is no file without a format, it is crucial to look at issues with file formats more closely. Manfred Thaller presented this important aspect of digital preservation in a very striking manner.
First of
all it was discussed what a file format is and why it is so important. The
definition given was: “A deterministic specification how the properties of a
digital object can reversibly be converted into a linear byte stream (bit
stream).”
The content of a file always has the following characteristics:
· <basic information>: What to do? (mandatory)
· <rendering information>: How to do it? (useful)
· <storage information>: How to move it from persistent to deployed form? (historical)
· <data>: What to deploy? (mandatory)
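The "reversible conversion into a linear byte stream" from the definition above can be illustrated with a deliberately tiny example: a toy format that serialises a handful of properties into bytes and parses them back. This is purely illustrative and not a real preservation format.

    import json
    import struct

    # Toy "file format": a 4-byte big-endian length prefix followed by UTF-8 encoded JSON.
    # The conversion is deterministic and reversible, as the definition demands.
    def to_byte_stream(properties: dict) -> bytes:
        payload = json.dumps(properties, sort_keys=True).encode("utf-8")
        return struct.pack(">I", len(payload)) + payload

    def from_byte_stream(stream: bytes) -> dict:
        (length,) = struct.unpack(">I", stream[:4])
        return json.loads(stream[4:4 + length].decode("utf-8"))

    properties = {"width": 640, "height": 480, "colour_depth": 24}
    assert from_byte_stream(to_byte_stream(properties)) == properties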
Next, Mr. Thaller from the Universität zu Köln talked about the vulnerabilities of file formats, which can destroy or at least damage digital objects stored in a repository.
Bit rot is commonly defined as the event in which the small electric charge of a bit in memory disperses, possibly altering program code and thus a digital object.
Bit rot can also be used to describe the gradual decay of stored data over the course of many years, or the decay of data stored on CDs, DVDs or other types of consumer storage.
The problem is that such errors often remain undetected by software: unless explicit integrity mechanisms (checksums and so on) are in place, nothing will recognise them.
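This is exactly why repositories record fixity information at ingest time: a stored checksum makes later bit rot detectable, as the following small Python sketch shows (the file name is hypothetical).

    import hashlib

    def sha256_of(path):
        """Compute the SHA-256 digest of a file in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # At ingest time the digest is stored as fixity metadata ...
    stored_digest = sha256_of("archived_object.tiff")

    # ... and a periodic audit recomputes and compares it.
    if sha256_of("archived_object.tiff") != stored_digest:
        print("Integrity check failed: possible bit rot, restore from a replica.")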
Another problem is the obsolescence of software. This occurs over time, and it is very complicated to establish measures against it. While bit rot only affects certain individual files, the obsolescence of software is more severe, as it affects all files of a given type in an archive.
The reasons for this can be diverse: for example, the software able to read the format no longer exists, or the format specification, the implied algorithm, or another required component has been lost.
As we have seen, file formats are vulnerable to different extents. The FCLA (Florida Center for Library Automation) therefore ranks file formats by the level of confidence it has in their long-term preservability. The next table shows the ranking for textual information (http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf):
Text, high confidence level:
· Plain text (encoding: US-ASCII, UTF-8, UTF-16 with BOM)
· XML (includes XSD/XSL/XHTML, etc.; with included or accessible schema and character encoding explicitly specified)
· PDF/A-1 (ISO 19005-1) (*.pdf)
Text, medium confidence level:
· Cascading Style Sheets (*.css)
· DTD (*.dtd)
· Plain text (ISO 8859-1 encoding)
· PDF (*.pdf) (embedded fonts)
· Rich Text Format 1.x (*.rtf)
· HTML 4.x (including a DOCTYPE declaration)
· SGML (*.sgml)
· Open Office (*.sxw/*.odt)
· Office Open XML (*.docx)
Text, low confidence level:
· PDF (*.pdf) (encrypted)
· Microsoft Word (*.doc)
· WordPerfect (*.wpd)
· DVI (*.dvi)
· All other text formats not listed here
The table also contains corresponding rankings for audio and video formats, which are not reproduced here.
There is a clear need for registries of file formats when storing digital objects. Repository functions are performed on a format-specific basis, and the interpretation of otherwise opaque content streams depends on knowledge of how typed content is represented, which makes it difficult to handle complex content appropriately. One also has to take into account that interchange requires mutual agreement on format syntax and semantics.
Potential use cases for a file format registry can be identified as follows:
· Identification: I have a digital object; what format is it?
· Validation: I have an object purportedly of format F; is it?
· Transformation: I have an object of format F, but need G; how can I produce it?
· Characterization: I have an object of format F; what are its significant properties?
· Risk assessment: I have an object of format F; is it at risk of obsolescence?
· Delivery: I have an object of format F; how can I render it?
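For the first of these use cases, identification, a common technique is matching file signatures ("magic bytes"), as in the simplified Python sketch below; real registries and tools such as PRONOM and DROID maintain far more extensive signature databases, and the file name here is hypothetical.

    # Simplified format identification by file signature (magic bytes).
    SIGNATURES = {
        b"%PDF-": "PDF",
        b"\x89PNG\r\n\x1a\n": "PNG",
        b"\xff\xd8\xff": "JPEG",
        b"PK\x03\x04": "ZIP container (e.g. ODF, OOXML)",
    }

    def identify(path):
        with open(path, "rb") as f:
            header = f.read(16)
        for signature, format_name in SIGNATURES.items():
            if header.startswith(signature):
                return format_name
        return "unknown"

    print(identify("mystery_file"))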
One might ask why the common MIME types are not simply used. But there are two major arguments against that approach. The first is that those types are characterized by an insufficient depth of detail: there are neither requirements regarding syntax and semantic description nor requirements for complete disclosure, especially of proprietary formats. The other reason is the insufficient granularity of these standards; different PDF standards, for example, might require radically different workflows.
Moving on from the description of file formats, Mr. Thaller continued by outlining the problems that arise during the conversion of an old file format to a new one.
The
scenario is that a file exists at a certain date in a certain (old) format.
This format can be problematic for long-term preservation as new software might
not support the old format anymore. But at the same time there exist software
systems which convert old formats to new formats.
Before rushing into a non-reversible conversion, one has to make sure that the following questions are answered in order to preserve digital information. Does the new format contain all the information that the old format contains? Does the new format contain all information that is crucial for the usage of the information that the old format contains? And is conversion process A better than conversion process B, i.e. does conversion process A preserve more information?
To answer these questions it is necessary to make the information of the old and the new format comparable. To do so, the information must be expressible in a third format that is independent of the two other formats.
Different parts of the information contained in a format have different levels of importance; an example would be the colours versus the metadata in an image file.
To express the information contained in a file in an abstract way, the XCDL (Extensible Characterisation Definition Language) can be used. XCDL makes it possible to express the information of binary files. As a last step, there is a need for an algorithm that compares two such descriptions (XCDLs).
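The comparison step can be pictured as follows: given two property sets extracted from a file before and after conversion, a comparator reports which significant properties were preserved, changed or lost. The sketch below uses plain Python dictionaries as stand-ins for actual XCDL documents, and the example values are invented.

    # Sketch of comparing two extracted property sets (stand-ins for XCDL documents).
    def compare_properties(before, after):
        """Report preserved, changed and lost properties between two descriptions."""
        preserved = {key for key in before if after.get(key) == before[key]}
        changed = {key for key in before if key in after and after[key] != before[key]}
        lost = set(before) - set(after)
        return {"preserved": preserved, "changed": changed, "lost": lost}

    tiff_properties = {"width": 640, "height": 480, "colour_depth": 24, "icc_profile": "sRGB"}
    jpeg_properties = {"width": 640, "height": 480, "colour_depth": 24}

    print(compare_properties(tiff_properties, jpeg_properties))
    # Here the image dimensions survive the conversion, but the ICC profile is lost.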
The step of describing the information contained in binary files usually has to be applied to a large number of digital objects. Therefore there is a clear need to automate this process.
One way to automate the process is to create a separate program for every single file format description. A better approach is to provide a language that makes it possible to transform arbitrary binary files into an XCDL description. This requires the definition of an XCEL (Extensible Characterisation Extraction Language), which describes how information is to be extracted from a given file.
The vision is that XCEL and XCDL are sufficient to describe arbitrary file formats and thus the extraction of properties of any file. The XCDLs describe the properties (characteristics) extracted from digital files. XCEL provides the mapping between a procedural description of where information is located within a physical file and its abstract interpretation, beyond the specification of the format.
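The role of XCEL can be approximated by a declarative description that tells a generic extractor where properties live in the byte stream. The following toy example, which is not real XCEL, pulls the width and height fields out of a PNG header; real XCEL descriptions are far richer, covering conditionals, repetition and full chunk parsing.

    import struct

    # Toy declarative "extraction description": where each property lives in the file.
    PNG_EXTRACTION_SPEC = [
        {"property": "width",  "offset": 16, "length": 4, "type": ">I"},
        {"property": "height", "offset": 20, "length": 4, "type": ">I"},
    ]

    def extract(path, spec):
        """Generic extractor driven entirely by the declarative description."""
        with open(path, "rb") as f:
            data = f.read()
        properties = {}
        for field in spec:
            raw = data[field["offset"]:field["offset"] + field["length"]]
            (properties[field["property"]],) = struct.unpack(field["type"], raw)
        return properties

    print(extract("image.png", PNG_EXTRACTION_SPEC))  # hypothetical file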
We hope that this overview of the summer school and its topics has been useful. A lot of scientific work and research is going on, in companies as well as academic institutions, and every contribution can help to save our knowledge for future generations.