Thursday, December 01, 2016

Implementing Automatic Digital Preservation for a Mass Digitization Workflow

Implementing Automatic Digital Preservation for a Mass Digitization Workflow. Henrike Berthold, Andreas Romeyke, Jörg Sachse.  Short paper, iPres 2016.  (Proceedings p. 54-56 / PDF p. 28-29). 
     This short paper describes the authors' preservation workflow for digitized documents, their in-house mass digitization workflow based on the Kitodo software, and the three major challenges they encountered:
  1. validating and checking the target file format and the constraints on it,
  2. handling updates of content already submitted to the preservation system,
  3. checking the integrity of all archived data in an affordable way.
They produce several million scans a year and preserve these digital documents in their Rosetta-based archive, which is complemented by a submission application for pre-ingest processing, an access application that prepares the preserved master data for reuse, and a storage layer that ensures three redundant copies of the data in permanent storage and a backup of the data in the processing and operational storage. They have customized Rosetta operations with plugins they developed. In the workflow, the data format of each file is identified and validated, and technical metadata are extracted. AIPs are added to the permanent storage (disk and LTO tapes). The storage layer, which uses hierarchical storage management, creates two more copies and manages them.

To ensure robustness, only single-page, uncompressed TIFF files are accepted. They use the open-source tool checkit-tiff to check files against a specified configuration. To deal with AIP updates, files can be submitted multiple times: the first transfer is an ingest, and all subsequent transfers are updates. Rosetta ingest functions can add, delete, or replace a file. Rosetta can also manage multiple versions of an AIP, so older versions of digital objects remain accessible to users.
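
A minimal sketch of this kind of constraint check; this is not checkit-tiff or its configuration format, just an illustration of the two constraints named above using the Pillow library:

# Illustrative only: checkit-tiff validates TIFF tags against a site-specific
# configuration file; this sketch approximates two of the constraints named
# above (single page, no compression) using Pillow.
from PIL import Image, ImageSequence

def check_tiff(path):
    problems = []
    with Image.open(path) as img:
        # TIFF Compression tag (259): a value of 1 means uncompressed.
        compression = img.tag_v2.get(259)
        if compression != 1:
            problems.append(f"compression tag is {compression}, expected 1 (none)")
        pages = sum(1 for _ in ImageSequence.Iterator(img))
        if pages != 1:
            problems.append(f"expected a single page, found {pages}")
    return problems

if __name__ == "__main__":
    import sys
    for issue in check_tiff(sys.argv[1]):
        print("FAIL:", issue)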

They manage three copies of the data, totalling 120 TB. An integrity check of all digital documents, including all three copies, is not feasible because of the time required to read all data from tape storage and check it. To get reliable results without checking all data in the archive, they use two different methods:

  • Sample method: a 1% sample of the archival copies is checked for integrity each year (sketched below).
  • Fixed bit pattern method: a file with a specified, fixed bit pattern is run through the workflow and checked quarterly.
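
A minimal sketch of the sampling idea, assuming a simple manifest of path/checksum pairs and SHA-256 fixity values; both the manifest layout and the algorithm are assumptions for illustration, not details from the paper:

# Illustrative sketch: verify a 1% random sample of archived files against
# recorded SHA-256 checksums. The manifest format (path,checksum per row)
# is an assumption, not the archive's actual implementation.
import csv, hashlib, random

def sha256(path, chunk=1024 * 1024):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def check_sample(manifest_csv, fraction=0.01):
    with open(manifest_csv, newline="") as f:
        records = list(csv.reader(f))   # rows of (path, expected_checksum)
    sample = random.sample(records, max(1, int(len(records) * fraction)))
    failures = [path for path, expected in sample if sha256(path) != expected]
    return len(sample), failures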

Their current challenges include supporting new media types (digital video, audio, photographs, and PDF documents), unifying pre-ingest processing, and automating processes (e.g. performing tests of new software versions).


Wednesday, November 30, 2016

To Act or Not to Act - Handling File Format Identification Issues in Practice

To Act or Not to Act - Handling File Format Identification Issues in Practice. Matthias Töwe, Franziska Geisser, Roland E. Suri. Poster, iPres 2016.  (Proceedings p. 288-89 / PDF p. 145).
     Format identification output needs to be assessed within an institutional context, taking provenance information into account, to determine what actions to take. This poster presents ways to address file format identification and validation issues that are mostly independent of the specific tools and systems employed; the authors use Rosetta, DROID, PRONOM, and JHOVE. Archives rely on technical file format information and therefore want to derive as much information about digital objects as possible before ingest. But issues occur in everyday practice, such as:
  • how to proceed without compromising preservation options
  • how to make efforts scalable 
  • issues with different types of data
  • issues related to the tool's internal logic
  • metadata extraction, which is also format-related
 The use cases vary depending on the customers, types of material, and formats. They range from safeguarding research data for a limited period of time (ten years at minimum) to publishing and preserving data in the long term. Understanding the use cases’ characteristics helps provide "a better understanding of what actually matters most in each case."

Ideally, format identification should yield reliable and unambiguous information on the format of a given file; however, a number of problems make the process more complicated. Handling files on an individual basis does not scale well, which may mean that unsatisfactory decisions need to be taken to keep the volume of data manageable. Some criteria to consider:
  • Usability: can the file be used as expected with standard software?
  • Tool errors: is an error known to be tool-related?
  • Understanding: is the error actually understood?
  • Seriousness: does the error concern the format's significant properties?
  • Correctability: is there a documented solution to the error?
  • Risk of correcting: what risks are associated with correcting the error?
  • Effort: what effort is required to correct the error?
  • Authenticity: is the file’s authenticity more relevant than format identification?
  • Provenance: can the data producer help resolve this and future errors?
  • Intended preservation: what solution is acceptable for shorter preservation periods?
There are no simple rules to resolve these, so other considerations are needed to determine what actions to take:
  • Should format identification be handled at ingest or as a pre-ingest activity?
  • How to document measures taken to resolve identified problems?
  • Can unknown formats be admitted to the archive? 
  • Should the format identification be re-checked later? 
  • Do we rely on PRONOM or do we need local registries? 
  • How to preserve formats for which no applications exist?
"Format validation can fail when file properties are not in accord with its format’s specification. However, it is not immediately clear if such deviations prevent current usability of a file orcompromise the prospects for a file’s long term preservability." If the file is usable today, does that mean it is valid? Digital archives need to "balance the efforts for making files valid vs. making files pass validation in spite of known issues."

The failure to extract significant properties has no immediate consequences, and institutions need to decide if they will correct the issues, especially if the metadata is embedded and the file must be altered, which creates the risk of unknowingly introducing other changes. The authors of this poster act even more cautiously when fixing metadata extraction issues that require working with embedded metadata or file properties.

Tuesday, November 29, 2016

German hbz Consortium Selects Ex Libris Rosetta Digital Asset Management and Preservation Solution

German hbz Consortium Selects Ex Libris  Rosetta Digital Asset Management and Preservation Solution. Press Release. ProQuest. 29 November 2016.
     Hochschulbibliothekszentrum des Landes Nordrhein‑Westfalen has chosen the Ex Libris Rosetta digital asset management and preservation solution. More than 40 member institutions will be able to deposit digital collections in the central Rosetta system. “Our preservation and management plans across the entire North-Rhine Westphalia region include both artifacts and modern research output. With Rosetta, we will be able to preserve a wide range of data and manage digital assets on both the consortium and institutional level. Rosetta meets our current and long-term needs.”

Tuesday, November 22, 2016

Every little bit helps: File format identification at Lancaster University

Every little bit helps: File format identification at Lancaster University. Rachel MacGregor. Digital Archiving at the University of York. 21 November 2016.
   The post is about Rachel's work on identifying research data; it follows on from the work of Filling the Digital Preservation Gap and provides an interesting comparison with the statistics reported previously. The aim was to understand the nature of research data and to inform their approaches to preservation. A summary of the statistics:
Of 24,705 files: 

  • 11,008 (44.5%) were identified by DROID and 13,697 (55.5%) were not.
  • 99.3% were given one file identification and 76 files had multiple identifications. 
    • 59 files had two possible identifications
    • 13 had 3 identifications
    • 4 had 4 possible identifications. 
    • 50 of these were either 8-bit or 7-bit ASCII text files.  
    • The remaining 26 were identified by container as various types of Microsoft files.

Of the 11008 identified files:

  • 89.34% were identified by signature
  • 9.2% were identified by extension
  • 1.46% identified by container
When adjusted for the 7,000 gzip files, the percentages identified were:
  • 68% (2505) by signature
  • 27.5% (1013) by extension
  • 4.5% (161) by container
These results differed from York's, but not dramatically so.

Only 38 files (0.3%) were identified as having a file extension mismatch, though closer inspection may reveal more. Most of these were Microsoft files with multiple identifications, plus a set of .lsm files identified as TIFFs.

In all, 59 different file formats were identified; GZIP was the most frequently occurring, followed by XML.

Files that weren't identified
  • 13,697 files were not identified by DROID, of which 4,947 (36%) had file extensions.
  • 64% of the unidentified files had no file extension.
  • The most frequent extensions among unidentified files were: dat, data, cell, and param.
Gathering this information helps contribute towards our overall understanding of file format types. "Every little bit helps."
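
A minimal sketch of how counts like these can be tallied from a DROID CSV export; the column names (TYPE, METHOD, PUID) are assumptions based on DROID's report format and should be checked against the export produced by your DROID version:

# Illustrative sketch: tally identification methods from a DROID CSV export.
import csv
from collections import Counter

def summarize(droid_csv):
    methods = Counter()
    total = identified = 0
    with open(droid_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("TYPE") == "Folder":      # skip directory rows
                continue
            total += 1
            if row.get("PUID"):                  # a PUID means the file was identified
                identified += 1
                methods[row.get("METHOD", "")] += 1
    print(f"{identified}/{total} files identified ({identified / total:.1%})")
    for method, count in methods.most_common():
        print(f"  {method or 'unknown'}: {count} ({count / identified:.1%})")

summarize("droid_export.csv")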

Monday, November 21, 2016

The Digital Preservation Gap(s)

The Digital Preservation Gap(s). Somaya Langley. Digital Preservation at Oxford and Cambridge. 18 November 2016.
     This is a broader comment on the various gaps in the digital preservation field. Some of these are:
  • Silo-ing of different areas of practice and knowledge (developers, archivists etc.)
  • Lack of understanding of front-line staff working with born-digital materials 
  • Archivists, curators and librarians wanting a ‘magic wand’ to deal with ‘all things digital’
  • Tools that are limited or currently do not exist
  • Lack of knowledge to run the few available tools
  • Lack of knowledge of how to approach problem-solving
At iPres "the discussion still began with the notion that digital preservation commences at the point where files are in a stable state, such as in a digital preservation system (or digital asset management system). Appraisal and undertaking data transfers wasn’t considered at all, yet it is essential to capture metadata (including technical metadata) at this very early point. (Metadata captured at this early point may turn into preservation metadata in the long run.)" First-hand experiences of acquiring born-digital collections provide greater understanding of what it takes to do this type of work and will help in developing policies.

It is important to understand common real-world use cases and experiences in acquiring born-digital collections. "Archivists have an incredible sense of how to manage the relationship with a donor who is handing over their life’s work, ensuring the donor entrusts the organisation with the ongoing care of their materials", but preservation practitioners who are traditionally trained archivists, curators, and librarians often lack technical skill sets. On the other hand, technologists often lack first-hand experience liaising with donors. Both groups would benefit from each other's expertise, and sharing approaches to problem-solving is definitely important. The term ‘digital stewardship’ may be more helpful in acquiring and managing born-digital materials.

Saturday, November 19, 2016

Software Sustainability and Preservation: Implications for Long-term Access to Digital Heritage

Software Sustainability and Preservation: Implications for Long-term Access to Digital Heritage. Jessica Meyerson, David Rosenthal, Euan Cochrane. Panel, iPres 2016.  (Proceedings p. 294-5 / PDF p. 148).

     Digital content requires software for interpretation, processing, and use, and sustaining that software's functionality beyond its normal life span is a challenge. It may not be possible, economically or otherwise, for software vendors to maintain software long term. Virtualization and emulation are two techniques that may be viable options for long-term access to objects, and there are current efforts to preserve essential software needed to access or render digital content. Examples include the earlier KEEP Emulation Framework project and, currently, the bwFLA Emulation as a Service (EaaS) project, which has demonstrated the ability to provide access to emulated and virtualized environments via a simple web browser and as part of operational archival and library workflows.

Memory institutions and software vendors have valuable digital heritage software collections that need to be maintained. A growing number of digital objects require software in order to be used and viewed. Yale University, the Society of American Archivists and others are working to resolve legal barriers to software preservation practices. The preservation community "continues to evolve their practices and strive for more comprehensive and complete technical registries to support and coordinate software preservation efforts".


Friday, November 18, 2016

Challenges and benefits of a collaboration of the Collaborators

Challenges and benefits of a collaboration of the Collaborators. William Kilbride, et al. Panel, iPres 2016.  (Proceedings p. 296-7 / PDF p. 149).
     The importance of collaboration in digital preservation has been emphasized by many professionals in the field. Because of rapid technological developments, the increase of digital material, and the growing complexity of digital objects, "no one institution can do digital preservation on its own". The distribution of digital preservation tasks and responsibilities has led to a network of relationships between various groups, such as the DPC and other institutions, founded as a “collaborative effort to get digital preservation on the agenda of key decision-makers and funders”. These organizations help encourage collaboration so that libraries, archives, museums, and experts can work together to ensure the long-term preservation and accessibility of digital sources.

A logical next step is to establish a larger collaborative infrastructure to preserve all relevant digital data from the public sector. This would require storage facilities, but also knowledge and manpower to ensure proper management of the facilities. There must be agreement about which tasks and  responsibilities can be performed by the institutions themselves, and which could be carried out in collaboration with others. This seems to be the right time to join forces, to be more effective in our work, and to share our experiences. This can help answer questions about prioritization, solutions and policies for the next steps in international collaboration.


Thursday, November 17, 2016

Preserving Data for the Future : Research Data Management in an Academic Library Consortium

Preserving Data for the Future : Research Data Management in an Academic Library Consortium. Alan Darnell. PASIG 2016.
     A presentation about managing and preserving data. Three major points:
1. Get the Data. It is important to get the data as soon as possible. The availability of research data declines rapidly; time is the enemy of preservation.

2. Preserve the Data. The goal is to automate the transfer of data to a secure repository, in this case from Dataverse to Archivematica. There are issues that need to be resolved, such as scalability, file size, increasing volume of materials, unrecognized file types, etc. There are tools that can help. The AIPs in a repository need continued management of formats and checksums.

3. Ensure that the Data is Usable. Reproducibility of results is a key measure of the usability of data. The data management process needs to capture more of the context of the research process that created the data, including software, metadata, and all research materials, including notebooks. A data management plan is important in this process.

Wednesday, November 16, 2016

A Doomsday Scenario: Exporting CONTENTdm Records to XTF

A Doomsday Scenario: Exporting CONTENTdm Records to XTF. Andrew Bullen. D-Lib Magazine. November/December 2016.
     Because of budgetary concerns, the Illinois State Library asked Andrew Bullen to explore how their CONTENTdm collections could be migrated to another platform. (The Illinois Digital Archives repository is based on CONTENTdm.) He chose methods that would allow him to migrate the collections quickly using existing tools, particularly PHP, Perl, and XTF, which they use as the platform for a digital collection of electronic Illinois state documents. The article shows the Perl code he wrote, metadata and record examples, and walks through the process. He started A Cookbook of Methods for Using CONTENTdm APIs. Each collection presented different challenges and required custom programming. He recommends reviewing the metadata elements of each collection, normalizing like elements as much as possible, and planning which elements can be indexed and how faceted browsing could be implemented. The test was to see if the data could be reasonably converted, so not all parts were implemented. In a real migration, CONTENTdm's APIs could be used as a data transfer medium.
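
A generic sketch of the kind of record transformation involved (this is not the author's Perl code): metadata elements from an exported record are normalized and written out as a simple XML document that an indexer such as XTF could ingest. The field names here are assumptions for the example:

# Illustrative only: turn a normalized metadata record into a minimal XML
# document for indexing. Field names are placeholders, not CONTENTdm fields.
import xml.etree.ElementTree as ET

def record_to_xml(record, out_path):
    root = ET.Element("document")
    for field, value in record.items():
        if not value:
            continue                      # skip empty metadata elements
        ET.SubElement(root, field).text = str(value)
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)

record_to_xml(
    {"title": "Sample item", "creator": "Unknown", "date": "1912", "subject": ""},
    "item_0001.xml",
)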

Tuesday, November 15, 2016

Digital Preservation for Libraries, Archives, & Museums, a review

Digital Preservation for Libraries, Archives, & Museums. 2nd edition.  Edward M. Corrado, Heather Moulaison Sandy. Rowman & Littlefield. 2016.
     I don't usually include publisher reviews here, but I got to know Edward Corrado when he worked with the Rosetta system at Binghamton. I received an advance copy of this book and provided a review for it. This is a very thorough book on a very large topic and I thought the review worth including.

This very thorough and well researched book on digital preservation is for libraries, archives and museums of all sizes.  It covers a wide range of digital preservation topics which will prove useful for managers and technical staff alike.  The foreword to the book states that digital preservation is not a problem but an opportunity. The topics covered in this book help the reader understand how to implement these opportunities within their own organization. Digital preservation cannot be done in isolation from the rest of the organization; it needs to be an integral part of the whole. The authors demonstrate that with the proper resources and technical expertise, organizations can preserve "today's digital content long into the future". 

The table of contents of the book shows the range of topics covered:

Parts of the book:
I. Introduction to Digital Preservation,
II. Management Aspects,
III. Technology Aspects, and
IV. Content-Related Aspects.

Sections of the book
1. What is Digital Preservation? What it is not.
2. Getting Started with the Digital Preservation Triad: Management, Technology, Content
3. Management for Digital Preservation
4. The OAIS Reference Model
5. Organizing Digital Content
6. Consortia and Membership Organizations
7. Human Resources and Education
8. Sustainable Digital Preservation, financial factors
9. Digital Repository Software and Digital Preservation Systems
10. The Digital Preservation Repository and Trust
11.  Metadata for Digital Preservation
12. File Formats and Software for Digital Preservation
13. Emulation
14. Selecting Content
15. Preserving Research Data
16. Preserving Humanities Content
17. Digital Preservation of Selected Specialized Formats
Appendix A: Select Resources in Support of Digital Preservation

A few quotes and thoughts from the book that I thought especially useful:
  • three interrelated activities: management-related activities, technological activities, and content-centered activities.
  • technology cannot --- and should not --- be the sole concern of digital preservation. 
  • concerned with the life cycle of the digital object in a robust and all-inclusive way.
  • digital preservation is in many ways a management issue.  It requires interaction with the process and procedures of all parts of an organization.
  • Regardless of the role any particular staff member plays in digital preservation, one of the most important attributes required is passion for digital preservation.
  • Ultimately, digital preservation is an exercise in risk management.
  • Primarily, digital preservation is something that must be accepted on the basis of trust; organizations can help build trust using self-assessments, certification, and audit tools.
  • Digital preservation allows information professionals and those working in cultural heritage institutions to preserve, for the long-term, content that otherwise, if not cared for, would unquestionably be lost.
It helps to answer some basic questions:
  • How can I preserve the digital content available in my institution for the future?
  • What do I need to know to carry out this work?
  • How can I plan for the future in terms of the technology, human resources, and collections?
  • How do I know if I’m on the right track with my digital preservation efforts?


Monday, November 14, 2016

The (information) machine stops

The (information) machine stops. Gary McGath. Mad File Format Science Blog. March 14, 2016.
     The “Digital Dark Age” discussion comes up again.  Instead of asking what could trigger a Digital Dark Age, we ought to ask
  1. what conditions are necessary and sufficient for the really long-term preservation of information,
  2. what will minimize the risk of widespread loss of today’s history, literature, and news?
Our storage ability has increased but the durability of that storage has decreased. We deal with obsolescence and format, file, and device failures. "Anything we put on a disk today will almost certainly be unusable by 2050. The year 3016 just seems unimaginably far. Yet we still have records today from 1016, 16, and even 984 B.C.E. How can our records of today last a thousand years?"

The current practices rely on curation, migration, and hoping that storage providers will be around forever. Or that some institutions will take up the task of preservation and continue it forever. This requires "an unbroken chain of human activity to keep information alive". History shows that information is often neglected or destroyed, and in reality, only a tiny fraction has survived. "Today’s leading forms of digital storage simply can’t survive that degree of neglect." Abby Smith Rumsey writes, "The new paradigm of memory is more like growing a garden. Everything that we entrust to digital code needs regular tending, refreshing, and periodic migration to make sure that it is still alive, whether we intend to use it in a year, a hundred years, or maybe never." It is not a safe assumption that "things will always be the way they are today, maybe with some gradual improvement or decline, but nothing that will seriously disrupt the way we and future generations live."

However, we have to remember that people and information have survived many types of catastrophes. The original question in the post was "If an uninterrupted succession of custodians isn’t the best way to keep history alive, what is? The answer must be something that’s resilient in the face of interruptions." An important part of this is to avoid reliance on fragile protection; the keys are durability and decentralization. "The hard parts are avoiding physical degradation, hardware obsolescence, and format obsolescence. Physical durability isn’t out of reach. Devices like the M-disc have impressive durability."

"The way to address obsolescence is with designs simple enough that they can be reconstructed." We need decentralized archives in many places with different approaches. "The problem is solvable. The mistake is thinking that an indefinite chain of short-term solutions can add up to a long-term solution."


Wednesday, November 09, 2016

Autonomous Preservation Tools in Minimal Effort Ingest

Autonomous Preservation Tools in Minimal Effort Ingest. Asger Askov Blekinge, Bolette Ammitzbøll Jurik, Thorbjørn Ravn Andersen. Poster, iPres 2016. (Proceedings p. 259-60 / PDF p. 131).
     This poster presents the concept of Autonomous Preservation Tools developed by the State and University Library, Denmark, expanding on their idea of Minimal Effort Ingest. In Minimal Effort Ingest, incoming data is secured quickly, even when resources are sparse, and most preservation actions are handled within the repository when resources become available, rather than by a static ingest workflow.

From these concepts they created the idea of Autonomous Preservation Tools, which behave more like software agents than a static workflow system. The process is more flexible and allows for easy updates or changes to the workflow steps. A fixed workflow is replaced with a decentralised, implicit workflow that defines the set of events an AIP must go through. Rather than a static workflow that must process AIPs in a fixed way, the Autonomous Preservation Tools "can discover AIPs to process on their own". Because AIPs maintain a record of past events, tools can determine whether an AIP has been processed or whether other tool actions must be performed first. The workflow thus consists of the tools finding and processing items until every item has been processed, which becomes an alternative method of processing.
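
A minimal sketch of the autonomous-tool idea, with each tool declaring the event it records and the events it depends on; the event names and AIP structure here are assumptions for illustration, not the library's implementation:

# Illustrative sketch: tools discover AIPs they can act on by inspecting each
# AIP's event history, so no central, fixed workflow is needed.
class Tool:
    def __init__(self, name, requires, produces):
        self.name, self.requires, self.produces = name, set(requires), produces

    def can_process(self, aip):
        done = {event["type"] for event in aip["events"]}
        return self.produces not in done and self.requires <= done

    def run(self, aip):
        # ... perform the actual preservation action here ...
        aip["events"].append({"type": self.produces, "agent": self.name})

def process_all(aips, tools):
    progress = True
    while progress:                       # keep going until no tool can act
        progress = False
        for tool in tools:
            for aip in aips:
                if tool.can_process(aip):
                    tool.run(aip)
                    progress = True

aips = [{"id": "aip-1", "events": [{"type": "ingested", "agent": "ingest"}]}]
tools = [
    Tool("characterizer", requires={"ingested"}, produces="characterized"),
    Tool("validator", requires={"characterized"}, produces="validated"),
]
process_all(aips, tools)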

Establishing Digital Preservation At the University of Melbourne

Establishing Digital Preservation At the University of Melbourne. Jaye Weatherburn. Poster, iPres 2016.  (Proceedings p. 274-5 / PDF p. 138).
     The University of Melbourne’s Digital Preservation Strategy is to make the "University’s digital product of enduring value available into the future, thus enabling designated communities to access digital assets of cultural, scholarly, and corporate significance over time". The long-term, ten-year vision of their strategy looks at four interrelated areas in phases over the next three years:
  1. Research Outputs
  2. Research Data and Records
  3. University Records
  4. Cultural Collections
The key principles around which action is required: Culture, Policy, Infrastructure, and Organization. The University’s research strategy recognizes the importance of their digital assets by declaring that "the digital research legacy of the University must be showcased, managed, and preserved into the future". The project team members need to start a comprehensive advocacy campaign to illustrate the importance of preservation. Instead of digital preservation being perceived as a bureaucratic and financial burden it needs to be seen as a useful tool for academic branding and profiling, as well as important for the long-term sustainability of their research.

Tuesday, November 08, 2016

Digital Preservation with the Islandora Framework at Qatar National Library

Digital Preservation with the Islandora Framework at Qatar National Library. Armin Straube, Arif Shaon, Mohammed Abo Ouda. Poster, iPres 2016.  (Proceedings p. 270-271 / PDF p. 136).
     This poster outlines how Qatar National Library is creating a digital preservation solution. Their preservation strategy is to build a trustworthy digital repository based on established digital preservation standards and certification. The guiding principles that serve as benchmarks for their digital preservation efforts and inform their decision-making process are:
  • Accessibility: permanent accessibility and usability
  • Integrity: verify checksums, storage redundancies, monitoring and managing storage hardware.
  • Persistent identifiers
  • Metadata: capture technical metadata and record in PREMIS
  • Preservation planning and risk assessment
  • Standards compliance and trustworthiness
  • Development and research via collaboration
The digital repository is based on Islandora integrated with Fedora Commons, with different preservation functions to be developed as Drupal modules. The repository stores image objects (digitized books, maps, photos, etc.) in both TIFF and JPEG 2000 formats; audio-visual collections in MP4 and WAV; and web archives in WARC format from Heritrix. The library will develop a file format policy that will enhance the basis of its risk assessment.


Monday, November 07, 2016

The Nuclear Bunker Preserving Movie History

The Nuclear Bunker Preserving Movie History. George Willeman, film archivist.  Great Big Story. Oct 10, 2016.
     A short video about the Library of Congress movie and film archive, which is located in an underground bunker in Culpeper, Virginia. The bunker was originally a gold storage unit and later a fallout shelter during the Cold War. Today the Library of Congress stores film there; it is used to ensure the survival of the nation's films through restoration and preservation. Besides storing films, the center also specializes in repairing and processing films of many different types and sizes. It also houses nitrate films in 124 nitrate film vaults. Some of the films are over 100 years old. They preserve old and new films for their historical value, not necessarily their monetary value. The purpose is to remember what the early times were like, what we had and what we did. "Film is one of the absolute best ways of doing that."

Thursday, November 03, 2016

Formation of Task Force for Email Archiving

Mellon Foundation and Digital Preservation Coalition Sponsor Formation of Task Force for Email Archives. Press Release.  November 1, 2016.
      The task force has been created and is charged over the next 12 months to assess current frameworks, tools, and approaches being taken toward these critical historical sources. Personal correspondence is an essential primary source for historians and scholars across disciplines and helps "future generations understand and learn from history, providing evidence of the functions and activities of governments, businesses, nonprofit organizations, families, and individuals". Today's correspondence is digital, and emails especially are far more difficult to gather and preserve in an accessible format. "This is a topic of deep concern.  Preserved correspondence helps students of the past develop a nuanced understanding of events, much more so than published or other widely circulated sources." Email has remained resistant to preservation efforts and is currently not systematically acquired by most institutions.

"As archives include more born-digital collections, the complex technical issues around preserving email are more prevalent and increasingly important. The technical issues around email preservation are compounded by the sheer scale of the collections." Solutions need to be community supported, large-scale with preservation options.  The task force will focus on these three issues:
  1. articulating this technical framework, 
  2. suggesting how existing tools fit within this framework,
  3. beginning to identify any missing elements.
Preservation of email cannot be a single, comprehensive solution, but the "interaction of a variety of solutions covering the entire range of archival activities, from appraising the research value of email to helping researchers discover and use it." It requires an interoperable toolkit.

Wednesday, November 02, 2016

Should We Keep Everything Forever? Determining Long-Term Value of Research Data

Should We Keep Everything Forever? Determining Long-Term Value of Research Data. Bethany Anderson, et al. Poster, iPres 2016. (Proceedings p. 284-5 / PDF p. 143).
     The poster describes efforts at the University of Illinois to launch an institutional data repository called the Illinois Data Bank. The Research Data Service is committed to preserving and providing access to published research datasets for a minimum of five years after the date of publication in the Data Bank. They developed preservation review processes and guidelines for datasets that will help promote the discoverability and use of open research data, and they offer a preservation and access solution that researchers trust.

The framework includes guidelines and processes for reviewing published datasets after their five-year commitment ends and deciding whether they should be retained or deaccessioned. This systematic appraisal approach helps them assess the long-term viability of a dataset, its value to research communities, and its preservation viability.

The Preservation review guidelines for the Illinois Data Bank are:

Evaluated by Curators/Librarians/Archivists
  • Cost to Store:  estimated cost of continuing to store
  • Cost to Preserve: estimated cost of continuing or escalating preservation
  • Access: use metrics to determine interest in this dataset
  • Citations:  has the dataset been cited in any publications
  • Restrictions: are there access or re-use restrictions
Evaluated by Domain Experts
  • Possibility of Re-creation
  • Cost of Re-creation
  • Impact of Study: did the study for this dataset significantly impact research
  • Uniqueness of Study
  • Quality of Study
  • Quality of Dataset
  • Current Relevance to contemporary research questions
Evaluated by Curators/Librarians/Archivists and Domain Experts
  • Are other copies available
  • Understandability: is the metadata & documentation for access / reuse sufficient
  • Dependencies: what are the software and environment dependencies
  • Appropriateness of Repository: is there a better repository for the dataset

Tuesday, November 01, 2016

Audio Visual Archiving: Philosophy and Principles

Audio Visual Archiving: Philosophy and Principles. Ray Edmondson. UNESCO. 2016. PDF, 102pp.
     Audiovisual heritage comprises a large and increasingly important part of the world's cultural heritage. Currently, among the major issues for Audio Visual Archiving are digitization and format obsolescence. The field is complex and requires skills, technology and budgets.

There is a lack of professional recognition of the community and a lack of formal training standards and courses. Audiovisual archiving is still emerging as an academic discipline. The greatest challenge of digitization is not one of technology or economics, but of scholarship, education and ethics. A major challenge of preservation is not only to migrate analogue works that are at risk, but to keep up with new born-digital productions while at the same time preserving the technology and skills of the analogue era.

Preservation in AV archiving, that is, ensuring the permanent accessibility of audiovisual content with maximum integrity, is a never-ending management task. "Nothing has ever been preserved – at best, it is being preserved!" AV media have always been in a state of continuous evolution.

To preserve their collections and make them accessible, audiovisual archives have to maintain obsolete technology as well as keeping abreast of new technology, and retain the relevant skill base for both. Content is migrated to newer formats to maintain its accessibility, while older carriers may still need to be maintained for their artefact and informational value.  Digital formats are not simply replacing analogue formats; they both have a future.  It is unlikely that there is any “ultimate” format.
Some notes and definitions:
  • Audiovisual documents are no less important, and in some contexts more important, than other kinds of documents or artefacts.
  • The responsibilities of audiovisual archivists include maintaining the authenticity, and guaranteeing the integrity, of the works in their care. Selection, protection and accessibility of this content should be governed by publicly declared policies rather than political pressures.
  • Preservation and access are two sides of the same coin, but they are so interdependent that access can be seen as an integral part of preservation.
  • Preservation, without the objective of access, has no point. The relatively fragile and fugitive nature of the audiovisual media and its technology place these functions at the centre of the management and culture of audiovisual archives.
  • Digital preservation combines policies, strategies and actions to ensure access to reformatted and born digital content regardless of the challenges of media failure and technological change. The goal of digital preservation is the accurate rendering of authenticated content over time.
  • "An audiovisual archive is an organization or department of an organization which has a statutory or other mandate for providing managed access to a collection of audiovisual documents and the audiovisual heritage by collecting, preserving and promoting." 
  • The function of building, documenting, managing and preserving a collection is central and presumes that the collection will be accessible.
  • An audiovisual archivist is a person formally qualified or accredited as such, or who is occupied at the level of a skilled professional in an audiovisual archive, in developing, preserving or providing managed access to its collection, or the serving of its clientele.
  • The preservation and accessibility of moving images and sound recordings eventually involves copying or migration. Documenting the processes involved and choices made in copying from generation to generation is essential to preserving the integrity of the work
  • All key areas of an archive’s operation – including collection development, preservation, access and collection management – should have a deliberate policy basis.
  • "Permanent access is the goal of preservation: without this, preservation has no purpose except
    as an end in itself."
  • Putting long-term preservation at risk in order to satisfy sudden, short-term access demand is
     a risk that should be avoided 
Collection development embraces four distinct procedures:
  1. selection: involves research and judgment, leading to acquisition
  2. acquisition: involves technical and physical choices, contractual negotiation and transaction, shipment, examination and inventorying of carriers
  3. deselection: a judgmental process based on later circumstances, including changes in selection policy
  4. disposal: the ethical divesting of carriers from a collection
The philosophy and principles of audiovisual archiving will always be a work in progress.

Monday, October 31, 2016

MIT task force releases preliminary “Future of Libraries” report

MIT task force releases preliminary “Future of Libraries” report. Peter Dizikes. MIT News Office. October 24, 2016.
    An MIT task force released a preliminary report about making MIT’s library system an “open global platform” enabling the “discovery, use, and stewardship of information and knowledge” for future generations. It contains general recommendations to develop “a global library for a global university,” yet strengthen the library’s relationship with the local academic community and public sphere.  “For the MIT Libraries, the better world we seek is one in which there is abundant, equitable, meaningful access to knowledge and to the products of the full life cycle of research. Enduring global access to knowledge requires sustainable models for ensuring that past and present knowledge are available long into the future.”

The MIT task force arranged ideas into four “pillars":
  1. Community and Relationships: interactions with local and global users
  2. Discovery and Use: the provision of information
  3. Stewardship and Sustainability: management and protection of scholarly resources
  4. Research and Development: library practices and needs
The report suggests a flexible approach simultaneously serving students, faculty, staff, alumni, cooperating scholars, and the local and the global scholarly community. It recommends study of changes allowing quiet study as well as new types of instruction and collaboration. The library system needs to enhance its ability to disseminate MIT research, provide better  digital access to content, and generate open platforms for sharing and preserving knowledge. The report encourages the institution to help find solutions for the “preservation of digital research,” which the report says is a “major unsolved problem.”

The report advocates finding the right balance between analog and digital resources, since "the materiality of certain physical resources continues to matter for many kinds of research and learning." They see this as a high priority.

The Future of Libraries site has a link to the full PDF report.

Copyright is Not Inevitable, Divine, or Natural Right

Copyright is Not Inevitable, Divine, or Natural Right. Kenneth Sawdon. ALA Intellectual Freedom Blog. October 19, 2016.
     A copyright lawsuit was decided in India that allows academia to create unlicensed coursepacks and allows students to photocopy portions of textbooks used in their classes. The Court dismissed the case brought by publishers and "held that coursepacks and photocopies of chapters from textbooks are not infringing copyright, whether created by the university or a third-party contractor, and do not require a license or permission". Unlicensed custom coursepacks are not covered under fair use in the U.S., but they are in India.

The ruling included this quote about what copyright is:
"Copyright, specially in literary works, is thus not an inevitable, divine, or natural right that confers on authors the absolute ownership of their creations. It is designed rather to stimulate activity and progress in the arts for the intellectual enrichment of the public. Copyright is intended to increase and not to impede the harvest of knowledge. It is intended to motivate the creative activity of authors and inventors in order to benefit the public."
This ruling doesn’t suggest that everything is fair game, but only that the use of textbook excerpts in India is fair use. "Stopping a university or third-party from providing coursepacks or textbook excerpts merely prevents the students from getting the most convenient source for information that they are free to use."  The Court held that when texts are used for imparting education and not commercial sale, it can’t infringe on copyright of the publishers. In the United States the defense for fair use involving coursepacks failed a legal challenge.

Saturday, October 29, 2016

Beta Wayback Machine – Now with Site Search!

Beta Wayback Machine – Now with Site Search! Vinay Goel. Internet Archive Blogs. October 24, 2016.
     The Wayback Machine has provided access to the Internet Archive's archived websites for 15 years. Previously the URL was the main access. There is a new beta keyword search that returns a list of relevant archived websites with additional information.

Friday, October 28, 2016

A Method for Acquisition and Preservation of Emails

A Method for Acquisition and Preservation of Emails. Claus Jensen, Christen Hedegaard. iPres 2016. (Proceedings p. 72-6/ PDF p. 37-39).
     The paper describes new methods for the acquisition of emails from a broad range of sources not directly connected with the responsible organization, as well as for ingesting them into the repository. Some of the requirements:

Non-technical requirements
  • Maximum emulation of traditional paper-based archiving criteria, procedures
  • High level of security against loss, degradation, falsification, and unauthorized access
  • A library record should exist, even if documents are not publicly available 
  • Simple procedure for giving access to third-party by donor
  • Maximum degree of auto-archiving
  • Minimum degree of curator involvement after Agreement
Technical-oriented requirements
  • No new software programs for the donor to learn
  • No need for installation of software on the donor’s machine
  • As much control over the complete system  as possible
  • Automated workflows as much as possible
  • Independence from security restrictions on the donor system imposed by others 
New requirements for the second prototype
  • The system should be based on standard email components
  • Easy to use for both curator and donors
  • Donors’ self-deposit
  • System based on voluntary/transparent deposit 
  • It should be independent of technical platforms  
  • Donor ability to transfer emails to the deposit area at any time
  • Donor should always have access to donated emails
  • Varying levels of access for external use 
  • Donors must be able to organize and reorganize emails.
  • Donors must be able to recover deleted emails within a certain time-frame
  • The original email header metadata must be preserved
  • The donors must be able to deposit other digital content besides emails
The Royal Library created two areas for each donor, the deposit area and the donation area. The repository supports linked data and uses RDF within its data model to create relations between the objects. By ingesting the different email representations, the system is able to perform file characterization on the email container files, the individual emails, and the attachments.
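
A minimal sketch of splitting an email container into pieces that can be characterized separately; this is illustrative only, using Python's mailbox module, and is not the Royal Library's implementation:

# Illustrative sketch: explode an mbox container into individual messages and
# their attachments so each representation can be characterized on its own.
import mailbox
from pathlib import Path

def explode_mbox(mbox_path, out_dir):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, msg in enumerate(mailbox.mbox(mbox_path)):
        # Whole message, headers included, preserved as received.
        (out / f"message_{i:05d}.eml").write_bytes(msg.as_bytes())
        for j, part in enumerate(msg.walk()):
            filename = part.get_filename()
            if filename:                  # parts with a filename are attachments
                payload = part.get_payload(decode=True) or b""
                (out / f"message_{i:05d}_att_{j}_{filename}").write_bytes(payload)

explode_mbox("donor_deposit.mbox", "exploded")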

"The email project is still active, and there is still time to explore alternative or supplementing methods for the acquisition of emails. Also the task of finding good ways of disseminating the email collections has not yet begun."


Thursday, October 27, 2016

Exit Strategies and Techniques for Cloud-based Preservation Services

Exit Strategies and Techniques for Cloud-based Preservation Services. Matthew Addis. iPres 2016. (Proceedings p. 276-7/ PDF p. 139).
   This poster discusses the need for an exit strategy for organisations that use cloud-based preservation services, and the importance of understanding what is involved in migrating to or from a cloud-hosted service. It specifically looks at Arkivum and Archivematica. Topics include contractual agreements, data escrow, open-source software licensing, use of independent third-party providers, and tested processes and procedures to mitigate risks. The top two issues are
  1. the need for an exit strategy when using a cloud preservation service, and
  2. the need to establish trust and perform checks on the quality of the service
It mentions that “full support for migrating between preservation environments has yet to be implemented in a production preservation service.” The approach used in the poster includes:
  • Data escrow
  • Log files of the software versions and updates
  • Ability to export database and configuration
  • Ability to test a migration
 It is important to remember in a migration test that “production pipelines may contain substantial amounts of data and hence doing actual migration tests of the whole service on a regular basis will typically not be practical”. “Hosted preservation services offer many benefits but their adoption can be hampered by concerns over vendor lock-in and inability to migrate away from the service, i.e. lack of exit-plan.”

Wednesday, October 26, 2016

Research data is different

Research data is different. Simon Wilson. Digital Archiving blog. 5 August 2016.
     A blog post about some born digital archives at Hull.  It is not academic research data but instead comes from a variety of sources. By using DROID to look at 270,867 accessioned files they discovered the following:
  • 97.96% of files were identified by DROID 
  • 228 different format types were identified.
  • The most common format is fmt/40 (MS Word 97-2003) with 120,595 files (44.5%).  
  •   The top formats they found were:
    Microsoft Word Document (97-2003)                 44.52%
    Microsoft Word for Windows (2007 and later)     5.63%
    Microsoft Excel 97 Workbook                              5.08%
    Graphics Interchange Format                              4.15%
    Acrobat PDF 1.4 - Portable Document Format     3.12%
    JPEG File Interchange Format (1.01)                    2.72%
    Microsoft Word Document (6.0 / 95)                    2.46%
    Acrobat PDF 1.3 - Portable Document Format     2.39%
    JPEG File Interchange Format (1.02)                    1.83%
    Hypertext Markup Language (v4)                         1.67%
 The number and type of formats found in their collections differed from those at other institutions that hold research data. An important next step is to look at the identified file formats and determine a migration strategy for each. Knowing the number and frequency of the formats in the collections will allow efforts to be prioritized.


Tuesday, October 25, 2016

Checksum 101: A bit of information about Checksums

Checksum 101: A bit of information about Checksums. Ross Spencer. Archives NZ Workshop. 2 October 2016.
    A slide presentation providing a good introduction to checksums. Why do we use checksums?
  • Policy: Provides Integrity
  • Moving files: Validation after the move
  • Working with files: Uniquely identifying what we’re working with
  • Security:  a by-product of file integrity
A checksum algorithm does the computation, and there are a variety of types: MD5, CRC32, SHA, etc. A checksum algorithm is a one-way function that cannot be reversed. DROID can handle MD5, SHA1, and SHA256. Why use multiple checksums? This helps to avoid potential collisions, though the probabilities are low. The presentation shows the different types of checksums and how they are generated.

Checksums help ensure uniqueness, and processes can be automated better with file checksums. Some people may have a preference for which checksums to use. Using checksums will help future-proof the systems and provide greater security.
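
A minimal illustration of computing several checksums of a file in one pass with Python's hashlib (not taken from the presentation itself):

# Compute MD5, SHA-1, and SHA-256 of a file while reading it only once.
import hashlib

def checksums(path, algorithms=("md5", "sha1", "sha256")):
    hashes = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            for h in hashes.values():
                h.update(block)           # one read feeds all algorithms
    return {name: h.hexdigest() for name, h in hashes.items()}

print(checksums("example.tif"))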

Monday, October 24, 2016

Our Preservation Levels

Our Preservation Levels. Chris Erickson. October 24, 2016.
     After looking at the levels used by various groups, we have decided on four levels for our preservation plan. We want to keep it simple, so that the level is easy to determine and meaningful for our workflows. Our Rosetta preservation system is a dark archive that can harvest digital materials from several publicly accessible content management systems. The curator or subject specialist for the collection will determine the level of preservation together with the preservation priority and will indicate that on the Digital Preservation Decision Form.

The Preservation Levels
0.   No preservation. Regular backups only (for example: Shared network drive that is  backed up regularly by IT)
1.   Basic preservation. A copy on M-Disc in Special Collections besides an access copy in our CMS, which is backed up by IT. No other preservation processing
2.   Full preservation. A master copy in Rosetta, with format migration, descriptive and preservation metadata, fixity checks, multiple copies (tape, data center, Granite Mountain Vault)
3.  Extended preservation. Full preservation services plus either DPN or remote/internet storage copy for materials that are appropriate for DPN
The intention is to recognize that some materials do not need full preservation services, nor long term storage in DPN. We will evaluate the levels next year and see if they are working the way we expect.

Thursday, October 20, 2016

Digital Storage In Space Rises Above The Cloud

Digital Storage In Space Rises Above The Cloud.  Tom Coughlin. Forbes. October 13,  2016.
     A start-up company (Cloud Constellation) plans to build an array of Earth-orbiting data center satellites that would provide a space-based infrastructure for cloud service providers. Communication to and from the satellite network would use tight-beam radio, creating a private network with no traffic over the public Internet and no public data transmission headers. The company says that latencies will be lower than those of conventional Internet transmission.

The digital storage in these orbiting data centers will be solid-state drives, and the internal temperature of the satellites will be kept at about 70 degrees Fahrenheit. The budget to build the initial phase of this satellite network is estimated at $400 M, much less than the cost of building a terrestrial global data center network with an equivalent level of security. Data is encrypted on the way to the satellite chain, inside the satellite storage, and when the data is transmitted back to earth. This should provide secure storage and transport of data without interruption or exposure to public networks. It could protect critical and sensitive data for potential clients, including university archives and libraries. The first phase is planned to be operational in 2018 or 2019. Soon many companies and organizations will have an option to store their data securely in outer space.

Wednesday, October 19, 2016

Filling the Digital Preservation Gap. Phase Three report - October 2016.

Filling the Digital Preservation Gap. A Jisc Research Data Spring project. Phase Three report - October 2016. Jenny Mitcham, et al. 19 October 2016. [PDF]
     This is a report on phase 3 of the Filling the Digital Preservation Gap project. It is important to consider how digital preservation functionality can be incorporated into Research Data Management workflows.
  • Phase 1: addressed the need for digital preservation as part of the research data management infrastructure
  • Phase 2: practical steps to enhance their preservation system for research data 
  • Phase 3 has the following aims:
    • To establish proof of concept implementations of Archivematica at the Universities of Hull and York, integrated with other research data systems at each institution
    • To investigate the problem of unidentified research data file formats and consider practical steps for increasing the representation of research data formats in PRONOM
    • To continue to disseminate the outcomes of the project both nationally and internationally and to a variety of different audiences

"Preserving digital data isn’t solely reliant on the implementation of a digital preservation system, it is also necessary to think about related challenges that will be encountered and how they may be addressed."  In working with formats it was clear that DROID does not look inside the zip files, and not all files were assigned a file format identification. Of the 3752 files analysed at York, only 1382 (37%) were assigned a file format identification by DROID. At the University of Hull a similar exercise had quite different results, with 89% of files assigned an identification by DROID. At Lancaster University the identification rate was 46%. Of the files, 70% of the files were TIFF images. Of the files that were not automatically identified, files with no extension made up 26% of the total.

"One possible solution to the file format problem as described would be to limit the types of files that would be accepted within the digital repository. This is a tried and tested approach for certain disciplines and data archives" and follows the NDSA level one recommendations, to “... encourage use of a limited set of known open formats ...”. This may be a problem with preserving research data, since researchers use a wide range of specialist hardware and software and it will be "hard for the repository and research support staff to provide appropriate advice on suitable formats. For much of the data there will be no obvious preservation format for that data."

The University of York encourages researchers (in training sessions and on webpages) to consider file formats throughout their projects, including the longevity and accessibility of the formats they select, but the researcher decides which formats to deposit their data in. The university accepts these formats and will preserve them on a best-efforts basis. "Understanding the file format moves us one step closer to preservation and reuse over the longer term." To help the research data community, their recommendations include:
  • For data curators: 
    • Greater engagement with researchers on the value and necessity of recognising and recording the file formats they will use/generate to inform effective data curation.
  • For researchers:
    • Supply adequate metadata about submitted datasets. Clear and accurate metadata about file formats and hardware/software dependencies will aid file format identification and future preservation work. 
    • Be open to sharing sample files for testing and to aid signature development where appropriate.

Appendix 2 contains A Draft PCDM-based Data Model for Datasets