Tuesday, June 28, 2016

Protecting the Long-Term Viability of Digital Composite Objects through Format Migration

Protecting the Long-Term Viability of Digital Composite Objects through Format Migration. Elizabeth Roke, Dorothy Waugh. iPres 2015 Poster. November, 2015.
     The poster discusses work done at Emory University’s Manuscript, Archives, and Rare Book Library to "review policy on disk image file formats used to capture and store digital content in our Fedora repository". The goal was to migrate existing disk images to formats better suited to long-term digital preservation. Trusted Repositories Audit & Certification (TRAC) requires that digital repositories monitor changes in technology in order to respond to them. The Advanced Forensic Format (AFF) offered a good solution for capturing forensic disk images along with disk image metadata, but it has since been superseded by the Expert Witness Compression Format (EWF), accessed through Joachim Metz's libewf library of tools. They decided to acquire raw disk images or, when that is not possible, tar files, because raw disk images may be less vulnerable to obsolescence.

In attempting to migrate formats, they had to develop methods for migrating the files and set up the repository to accept the new files. They also rely on PREMIS metadata. The migration of disk images from a proprietary or unsupported format to a raw file format has made these objects easier to manage and preserve, and mitigates the threat of obsolescence for the near term. There have been some consequences: some metadata is no longer available, the process will be more complicated and require other workflows, and files will no longer contain embedded metadata. "The migration to a raw file format has made the digital file itself easier to preserve."
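The fixity bookkeeping behind such a migration can be sketched in a few lines of Python. This is an illustration, not Emory's actual workflow: the file names and event fields are invented stand-ins for a PREMIS-style migration record. Hash the source image and the migrated raw image, then record both in one event.

```python
import hashlib, os, tempfile

def sha256_of(path, bufsize=1 << 20):
    """Stream a file through SHA-256 so large disk images never load whole into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def migration_event(source, target):
    """A PREMIS-style record tying the source image's fixity to the migrated copy's."""
    return {"eventType": "migration",
            "source": {"file": os.path.basename(source), "sha256": sha256_of(source)},
            "outcome": {"file": os.path.basename(target), "sha256": sha256_of(target)}}

# Stand-in files playing the roles of the EWF source and the exported raw image:
with tempfile.TemporaryDirectory() as d:
    src, dst = os.path.join(d, "disk.E01"), os.path.join(d, "disk.raw")
    for p in (src, dst):
        with open(p, "wb") as f:
            f.write(b"\x00" * 512)  # placeholder sector data
    event = migration_event(src, dst)
```

Keeping both hashes in the event record is what lets a later audit show the raw image still carries exactly the bits the source did.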

Monday, June 27, 2016

A Digital Dark Now? : Digital Information Loss at Three Archives in Sweden

A Digital Dark Now? : Digital Information Loss at Three Archives in Sweden. Anna-Maria Underhill and Arrick Underhill. Master’s thesis. Lund University. 2016. [PDF]
     The purpose of this study is to examine the loss of digital information at three Swedish archives. Digital preservation is a complex issue that most archival institutions struggle with, and focusing on successes to the exclusion of failures runs the risk of creating a blind spot for existing problems. The definition of digital information in this study includes digital objects and their metadata, as well as digital internal work documents that serve as contextual support for an archive’s collections. Results are analyzed through the transition from the Records Lifecycle Model to the Records Continuum Model, an ontological understanding of digital information, the SPOT model for risk assessment, and the OAIS Reference Model.

Some of the conclusions re-affirm previous research, such as the need to prioritize organizational issues. Others look at the current state of digital preservation at these archives which includes the delicate balancing act "between setting up systems for successful future digital preservation while managing existing digital collections which may not have been preserved correctly". Some institutions are unable to undertake a more proactive form of digital preservation because of the nature of the materials they preserve. The study points out that "when discussing digital preservation, the tendency remains to think of digitized material first rather than born digital information". The loss of a file may be only a part of the loss; there is also a loss of metadata and the connections between information, which may be more common than the loss of entire digital objects. "Finally, one question has followed this study from the beginning to the end: How can you know that you have lost something you never knew existed".
  • When discussing digital preservation, it is important to clarify that storage is not the same thing as preservation. 
  • The survival of information depends upon the maintenance of its infrastructure and on its migration to contemporary formats. 
  • Authenticity can be a major issue for digital records and is important to their evidentiality.
  • Emulation is another option for digital preservation, which targets the operating environment of the information rather than the file. 
  • Emulation will eventually require migration, and can become too complicated to be viable in the long run.
  • Sometimes digital preservation fails to preserve what it intends to save, which can be termed information loss.
  • Obsolescence is currently one of the greatest threats to successful digital preservation. If a file cannot be read, then it is nearly the same thing as a document having been destroyed. 
  • "Without the provenance and the contextual links between records, records cannot be demonstrated to be authentic and reliable, evidentiality is lost and the use of the records for knowledge and understanding about what has happened will be difficult."

One definition of short, medium, and long-term preservation is:
  • Short-term preservation – solutions that are used for a short time, 5 years maximum.
  • Medium-term preservation – solutions that are used during a system’s lifetime, 10 years maximum.
  • Long-term preservation – solutions that are used after the originating system’s lifetime, the number of years varies, usually from 10 to 50 years.
"Dark archives are often used in order to separate the original master copies of a file from the copies that users actually access. These dark archives are generally only accessed when new material is being placed in them, and are otherwise protected in order to maintain the authenticity of the originals by placing them in an environment that is as tamper and error proof as possible"

Six essential properties of digital objects which must be preserved:
  • Availability
  • Identity
  • Persistence
  • Renderability
  • Understandability
  • Authenticity
The study showed types of actual and potential information loss:
  • Loss of parts or whole digital objects during migration
  • Loss of the connections between analog and digital information belonging to the same archive
  • Loss of information due to it having been saved in an incorrect format
  • Loss of data in connection with technological changes
  • Loss of digital information when stored together with analog
  • Loss of information due to obsolete hardware
  • Loss of metadata due to databases written in code that is not open source
The reasons behind such actual and potential information loss were:
  • Human error during the production of information
  • An analog understanding and treatment of digital information
  • A lack of organizational structure and strategies for digital preservation
  • Lack of resources
  • Technological limitations
  • Lack of competencies among staff who produce digital information

Friday, June 24, 2016

File-format analysis tools for archivists

File-format analysis tools for archivists. Gary McGath. LWN. May 26, 2016.
     Preserving files for the long term is more difficult than just copying them to a drive; other issues are involved. "Will the software of the future be able to read the files of today without losing information? If it can, will people be able to tell what those files contain and where they came from?"

Digital data is more problematic than analog materials, since file formats change. Detailed tools can check the quality of digital documents, analyze the files and report problems. Some concerns:

  • Exact format identification: Knowing the MIME type isn't enough.
  • Format durability: Software can fade into obsolescence if there isn't enough interest to keep it updated.
  • Strict validation: Archiving accepts files in order to give them to an audience that doesn't even exist yet. This means it should be conservative in what it accepts.
  • Metadata extraction: A file with a lot of identifying metadata, such as XMP or Exif, is a better candidate for an archive than one with very little. An archive adds a lot of value if it makes rich, searchable metadata available.
Some open-source applications address these concerns, such as:
  • JHOVE (JSTOR-Harvard Object Validation Environment)
  • ExifTool
  • FITS File Information Tool Set
"Identifying formats and characterizing files is a tricky business. Specifications are sometimes ambiguous."  There are different views on how much error, if any, is acceptable. "Being too fussy can ban perfectly usable files from archives."

"Specialists are passionate about the answers, and there often isn't one clearly correct answer. It's not surprising that different tools with different philosophies compete, and that the best approach can be to combine and compare their outputs"
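The first concern above, exact format identification, comes down to looking at bytes rather than trusting file names or MIME labels. A minimal sketch, with a signature table covering only a few well-known formats and a deliberately mislabelled file:

```python
# Leading-byte signatures ("magic numbers") for a handful of common formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"GIF89a": "GIF image",
}

def identify(data: bytes) -> str:
    """Match leading bytes against known signatures; a name or MIME label can lie."""
    for magic, fmt in SIGNATURES.items():
        if data.startswith(magic):
            return fmt
    return "unknown"

# A file handed over as "report.pdf" that actually begins with a PNG header:
mislabelled = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
```

Tools like JHOVE and FITS do far more (validation, metadata extraction), but this byte-level check is the kernel of what "exact format identification" means.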

Wednesday, June 22, 2016

Five Star File Format Signature Development

Five Star File Format Signature Development. Ross Spencer. Open Preservation Foundation blog. 14 Jun 2016.
     Discussion about formats and the importance of developing identification techniques for text formats. DROID is a useful tool but it has its limitations. For those wanting to be involved in defining formats, there are five principles of file format signature development:
  1. Tell the community about your identification gaps
  2. Share sample files
  3. Develop basic signatures
  4. Where feasible, engage openly with the community
  5. Seek supporting evidence
Developing file format signatures is really reverse engineering.
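The "develop basic signatures" step can begin as simply as comparing sample files for a shared byte prefix, which is one crude form of the reverse engineering the post describes. A toy sketch with invented sample data (real signature work, e.g. for PRONOM/DROID, also handles offsets and variable bytes):

```python
def common_prefix(samples):
    """Longest byte prefix shared by all samples: a first candidate signature."""
    if not samples:
        return b""
    shortest = min(samples, key=len)
    for i in range(len(shortest)):
        if any(s[i] != shortest[i] for s in samples):
            return shortest[:i]
    return shortest

# Three invented sample files of the same unidentified format:
samples = [b"FMT1\x00\x01dataA", b"FMT1\x00\x01other", b"FMT1\x00\x01zz"]
candidate = common_prefix(samples)
```

This is also why principle 2, sharing sample files, matters: the more independent samples, the less likely the candidate signature captures content rather than format structure.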

Tuesday, June 21, 2016

Vienna Principles: A Vision for Scholarly Communication

Vienna Principles: A Vision for Scholarly Communication. Peter Kraker, et al. June 2016.
     The twelve principles of Scholarly Communication are:
  1. Accessibility: be immediately and openly accessible by anyone
  2. Discoverability: should facilitate search, exploration and discovery.
  3. Reusability: should enable others to effectively build on top of each other’s work.
  4. Reproducibility: should provide reproducible research results.
  5. Transparency: should provide open means for judging the credibility of a research result.
  6. Understandability: should provide research in an understandable way adjusted to different stakeholders.
  7. Collaboration: should foster collaboration and participation between researchers and their stakeholders.
  8. Quality Assurance: should provide transparent and competent review.
  9. Evaluation: should support fair evaluation.
  10. Validated Progress: should promote both the production of new knowledge and the validation of existing knowledge.
  11. Innovation: should embrace the possibilities of new technology.
  12. Public Good: should expand the knowledge commons.

Monday, June 20, 2016

Preserving Transactional Data

Preserving Transactional Data. Sara Day Thomson. DPC Technology Watch Report 16-02. May 2016.
     This report examines the requirements for preserving transactional data and the challenges in re-using these data for analysis or research. "Transactional" will be used to refer to "data that result from single, logical interactions with a database and the ACID properties (Atomicity, Consistency, Isolation, Durability) that support reliable records of interactions."

Transactional data, created through interactions with a database, can come from many sources and different types of information. "Preserving transactional data, whether large or not, is imperative for the future usability of big data, which is often comprised of many sources of transactional data." Such data have potential for future developments in consumer analytics and in academic research, and "will only lead to new discoveries and insights if they are effectively curated and preserved to ensure appropriate reproducibility."

The organizations that collect transactional data aim to manage and preserve it for business purposes as part of their records management. There are strategies for database preservation, as well as tools and standards that address data re-use. The strategies for managing and preserving big transactional data must adapt to both SQL and NoSQL environments. Some significant challenges include the large amounts of data, rapidly changing data, and different sources of data creation.

Some notes:
  • understanding the context and how the data were created may be critical in preserving the meaning behind the data
  • data purpose: preservation planning is critical in order to make preservation actions fit for purpose while keeping preservation cost and complexity to a minimum
  • how data are collected or created can have an impact on long-term preservation, particularly when database systems have multiple entry points, leading to inconsistency and variable data quality.
  • Current technical approaches to preserving transactional data primarily focus on the preservation of databases. 
  • Database preservation may not capture the complexities and rapid changes enabled by new technologies and processing methods 
  • As with all preservation planning, the relevance of a specific approach depends on the organization’s objectives.
There are several approaches to preserving databases:
  • Encapsulation
  • Emulation 
  • Migration/Normalization
  • Archival Data Description Markup Language (ADDML)
  • Standard Data Format for Preservation (SDFP) 
  • Software Independent Archiving of Relational Databases (SIARD)
"Practitioners of database preservation typically prefer simple text formats based on open standards. These include flat files, such as Comma Separated Value (CSV), annotated textual documents, such as Extensible Markup Language (XML), and the international and open Structured Query Language (SQL)." The end goal is to keep data in a transparent and vendor-neutral form so they can be reintegrated into a future database.
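A minimal sketch of the flat-file approach, using SQLite as a stand-in for a live transactional system (the table and rows are invented): dump one table, header first, to CSV.

```python
import csv, io, sqlite3

# A throwaway SQLite database standing in for a live transactional system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, item TEXT, qty INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "widget", 3), (2, "gadget", 1)])

def export_table_csv(conn, table):
    """Dump one table, header row first, to a CSV string (a vendor-neutral flat file)."""
    cur = conn.execute(f"SELECT * FROM {table}")  # table name assumed trusted here
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur.fetchall())
    return out.getvalue()

flat = export_table_csv(conn, "orders")
```

A real database preservation run (e.g. via SIARD) would also carry the schema, constraints, and documentation; the CSV here captures only the table contents.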

Best practices:
  1. choose the best possible format, either preserving the database in its original format or migrating to an alternative format.
  2. after a database is converted, encapsulate it by adding descriptive, technical, and other relevant documentation to understand the preserved data.
  3. submit database to a preservation environment that will curate it over time.
Research is continuing in the collection, curation, and analysis of data; digital preservation standards and best practices will make the difference between just data and "curated collections of rich information".
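Step 2 of the best practices, encapsulation, can be as simple as bundling the exported data with its documentation in one archive. A sketch using a tar container (the file names and metadata fields are illustrative, not a formal SIP layout):

```python
import io, json, tarfile, time

def encapsulate(data_csv: bytes, description: dict) -> bytes:
    """Bundle exported data and its documentation into one tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        members = [("data/orders.csv", data_csv),
                   ("metadata.json", json.dumps(description, indent=2).encode())]
        for name, payload in members:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            info.mtime = int(time.time())
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

package = encapsulate(b"id,item\n1,widget\n",
                      {"source": "orders table", "exported": "2016-06-20"})
# Read the package back to confirm both members are present.
names = sorted(tarfile.open(fileobj=io.BytesIO(package)).getnames())
```

The point of travelling together is that the documentation cannot drift away from the data it describes; step 3 then hands the whole package to a preservation environment.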

Friday, June 17, 2016

The Web’s Past is Not Evenly Distributed

The Web’s Past is Not Evenly Distributed. Ed Summers. Maryland Institute for Technology in the Humanities. May 27, 2016.
     This post discusses ways to structure content "with the grain of the Web so that it can last (a bit) longer." The web was created without a central authority to make sure all the links work, and permission is not needed to link to a site. The result is a web where, according to one estimate, about 5% of links break per year.

"The Web dwells in a never-ending present. It is—elementally—ethereal, ephemeral, unstable, and unreliable. Sometimes when you try to visit a Web page what you see is an error message: Page Not Found. This is known as link rot, and it’s a drag, but it’s better than the alternative. Jill Lepore." If we didn’t have a partially broken Web, where content constantly changes and links break, it’s quite possible we wouldn’t have a Web at all. Some things to take note of:
  • problems with naming things
  • redirects
  • proxies
  • web archives
  • static sites
  • data export
"Being able to export your content from one site to the next is extremely important for the long term access to your data. In many ways the central challenge we face in the preservation of Web content, and digital content generally, is the recognition that each system functions as a relay in a chain of systems that make the content accessible."
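The 5%-per-year figure implies a simple decay model. Assuming a constant, independent annual loss rate (a simplification; real link rot is burstier), the surviving fraction of a page's links after n years is (1 - 0.05)^n:

```python
def surviving_fraction(years, annual_loss=0.05):
    """Expected share of links still resolving after `years`,
    assuming a constant, independent annual loss rate."""
    return (1 - annual_loss) ** years

# At 5% loss per year, a page from a decade ago keeps under 60% of its links.
after_10 = surviving_fraction(10)
```

Compounding is what makes the problem quiet but relentless: no single year looks bad, yet over a decade roughly two links in five are gone.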

"Our knowledge of the past has always been mediated by the collective care of those who care to preserve it, and the Web is no different."

Thursday, June 16, 2016

Current Game Preservation is Not Enough

Current Game Preservation is Not Enough. Eric Kaltman. Eric Kaltman's blog. 6 June, 2016.
     The current preservation practices we use for games and software must be reconsidered for modern computer games. The standard preservation model considers three major areas of interest:
  1. the physical extent of the game, 
  2. the data stored on it, and 
  3. the hardware necessary to run it. 
The long-term physical maintenance of games is not particularly good, since the media and hardware degrade over time and the data will become unreadable as the media fail. The model also does not reflect current game technology or the networked world. "Solutions? What are some ways to combat this looming preservation quagmire? (It’s also not looming, since it’s already here.)"
  1. Consider what we are trying to save when we preserve video games. Is it to save the ability to play a historical game at some point in the future, or to record the act of play itself?
  2. Get the people creating games to dedicate time to basic preservation activities, such as providing records of development, production processes, and legacy source code that would help to recreate or recover the games.
  3. There needs to be more pressure and motivation from society to legitimize games as cultural production worth saving, and to create institutional structures to fight for preservation activity, similar to what is being done for film.
  4. This all applies to more than just games: software in general may be in an even worse situation.
The post refers to two YouTube presentations that the author gave on game preservation.

Wednesday, June 15, 2016

Keep Calm and do Practical Records Preservation

Keep Calm and do Practical Records Preservation. Matthew Addis. Conference on European Electronic Data Management and eHealth Topics. 23 May 2016.
     The presentation looks at some of the practical tools and approaches that can be used to ensure that digital content remains safe, secure, accessible and readable over multiple decades. It covers mostly "practical and simple steps towards doing digital preservation for electronic content" but also some ways to determine how well prepared you are for preservation. Some things you need to show:
  • ongoing integrity and authenticity of content in an auditable way.
  • that content is secured and access is controlled.
  • ability to access content when needed that is readable and understandable.
  • ability to do this over decades, which is a very long time in the IT world.
  • have an archivist with clear responsibility for making all this happen
  • have appropriate processes that manage all the risks proportionally.
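The first item in the list above, demonstrating integrity in an auditable way, usually means checksum manifests. A minimal sketch, using in-memory byte strings as stand-ins for stored files:

```python
import hashlib

def checksum(data: bytes) -> str:
    """SHA-256 hex digest of a content byte string."""
    return hashlib.sha256(data).hexdigest()

def build_manifest(files: dict) -> dict:
    """Map each logical name to the expected hash of its content."""
    return {name: checksum(data) for name, data in files.items()}

def audit(files: dict, manifest: dict) -> list:
    """Names whose current hash no longer matches the manifest."""
    return [name for name, data in files.items()
            if checksum(data) != manifest.get(name)]

store = {"minutes.pdf": b"approved text", "ledger.csv": b"1,2,3\n"}
manifest = build_manifest(store)
store["ledger.csv"] = b"1,2,4\n"   # silent corruption or tampering
failed = audit(store, manifest)
```

Run on a schedule and logged, such audits are what turns "the content is intact" from an assertion into evidence.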
A really simple definition of Digital Preservation from the Library of Congress: "the management of content in a pro-active way so that it remains accessible and usable over time." 

"Focus on the basic steps that need to be done now in order to support something bigger and better in the future." Know what you have and get the precious stuff in a safe storage environment.

Tuesday, June 14, 2016

Digital Preservation: We have to get this right

"We have to get this right." Jennifer Paustenbaugh. Digital Preservation. Harold B. Lee Library, Brigham Young University. June, 2016.
     Here are some recent email comments from Jennifer Paustenbaugh, our University Librarian, on digital preservation:
  • “We have to get this right. If we don't, then not much else that we’re doing in research libraries matters. If we don’t fully develop a sustainable digital preservation program, we could negatively impact whole areas of research, because materials created right now could just disappear. I think about gaps that exist in records because of man-made events and natural disasters. This could be a disaster of our own making.” 
  • "I truly believe that of all the things we’re doing in the library, this is the thing that has the potential to make the biggest difference to scholars 20 or 50 years from now. Much of the digital content that we are preserving will be gone forever if we don’t do this right. It’s a role that at once is formidable and humbling. And for most people, it will probably never be important until something that is vital to their research is just missing (and forever unavailable) from the historical record."

Monday, June 13, 2016

Macro & Micro Digital Preservation Services & Tools

Rosetta Users Group 2016: Macro & Micro Digital Preservation Services & Tools. Chris Erickson. June 7, 2016. [PDF slides]
      This is my presentation at the Rosetta User Group / Advisory Group held this past week in New York (I always enjoy these meetings; probably my favorite conference).
  • Preservation Micro Services: free-standing applications that perform a single or limited number of tasks in the larger preservation process. The presentation includes some of those we use in our processes, both from internet sites and those that we have created in-house. Micro services are often used in the following processes: 
    • Capture
    • Appraisal
    • Processing
    • Description
    • Preservation
    • Access
  • Preservation Macro Services: Institutional services and directions that assist organizations in identifying and implementing a combination of policies, strategies, and tactics to effectively meet their preservation needs. Some of these are:
    • Digital Preservation Policy Framework
    • Workflows
    • Storage plans
    • Financial Commitment and
    • Engaging the Community
"Practitioners can only make use of individual micro-services tools if they understand which roles they play in the larger digital curation and preservation process...." Richard Fyffe 

“We have to get this right. If we don't, then not much else that we’re doing in research libraries matters. If we don’t fully develop a sustainable digital preservation program, we could negatively impact whole areas of research, because materials created right now could just disappear. I think about gaps that exist in records because of man-made events and natural disasters. This could be a disaster of our own making.” Jennifer Paustenbaugh. University Librarian. 

Since starting in this position in 2002, our digital preservation challenges have changed and increased. Re-evaluating where we are heading and how we proceed is important. A combination of broad visions and practical applications can ensure the future use of digital assets.

Thursday, June 02, 2016

The Three C’s of Digital Preservation: Contact, Context, Collaboration

The Three C’s of Digital Preservation: Contact, Context, Collaboration. Brittany. DigHist Blog. May 5, 2016.
     The post looks at three themes from learning about digital preservation: "every contact leaves a trace, context is crucial, and collaboration is the key".

Contact: A digital object is more than we see, and we need to take into consideration the hardware, software, code, and everything that runs underneath it. There are "layers and layers of platforms on top of platforms for any given digital object", the software, the browser, the operating system and others. These layers or platforms are constantly obsolescing or changing and "cannot be relied upon to preserve the digital objects.  Especially since most platforms are proprietary and able to disappear in an instant."

Context is Crucial: "There’s no use in saving everything about a digital object if we don’t have any context to go with it." Capture the human experience with the digital objects. 

Collaboration is the Key: "There are a number of roles played by different people in digital preservation, and these roles are conflating and overlapping." As funding becomes tighter and the digital world more complex, "collaboration is going to become essential for a lot of digital preservation projects".   

There are still many unanswered questions that need to be asked and answered.

Tuesday, May 31, 2016

Introduction to Free and/or Open Source Tools for Digital Preservation

Introduction to Free and/or Open Source Tools for Digital Preservation. Max Eckard. University of Michigan. May 16, 2016.
     The post refers to a workshop that was given as part of the Personal Digital Archiving 2016 conference, entitled "Introduction to Free and/or Open Source Tools for Digital Preservation". This workshop introduced participants to a mix of open source and/or free software to review personal digital archives and "perform preservation actions on content to ensure its long-term authenticity, integrity, accessibility, and security". The presentation slides and google doc file are available and contain all the links and additional information.

The table of contents:
      Still Images
      Text(ual) Content
      Audio and Video

Monday, May 30, 2016

US nuclear force still uses floppy disks

US nuclear force still uses floppy disks. BBC News Services. 26 May 2016.
     A government report shows that the US nuclear weapons forces (intercontinental ballistic missiles, nuclear bombers and tanker support aircraft) still use a 1970s-era computer system and 8-inch floppy disks. The GAO said these are "legacy systems" which need to be replaced; legacy systems cost about $61bn a year to maintain. "This system remains in use because, in short, it still works." "However, to address obsolescence concerns, the floppy drives are scheduled to be replaced with secure digital devices by the end of 2017." According to the report, the US treasury systems still use a system written in "assembly language code".

Saturday, May 28, 2016

List of analog media inspection templates/forms

List of analog media inspection templates/forms. Katherine Nagels, et al. May 6, 2016.
     This is a list of freely available analog media inspection templates, forms, or reports. Anyone is free to add contributions. Appropriate additions may include:
  • inspection reports/forms/templates
  • condition reports/forms/templates
  • instructional guides for inspecting or assessing the condition of analog film, audio, or video



Thursday, May 26, 2016

The Governance of Long-Term Digital Information

The Governance of Long-Term Digital Information. IGI 2016 Benchmark. Information Governance Initiative. May 18, 2016. [PDF]
     “The critical role of digital . . .archives in ensuring the future accessibility of information with enduring value has taken a back seat to enhancing access to current and actively used materials. As a consequence, digital preservation remains largely experimental and replete with the risks . . . representing a time bomb that threatens the long-term viability of [digital archives].”

1. We have a problem. Nearly every organization has digital information they want to keep for 10 or more years.
2. The problem is technological, most often a storage problem.
3. The problem is business related. It is not related to just archives, libraries or museums. 
4. The problem is a legal problem. Legal requirements are the main reason organizations keep digital information longer than ten years.
5. We know what we must do, but are we doing it? In a survey, 97 percent said they are aware that digital information is at risk of obsolescence, but three-fourths are just thinking about it or have no strategy. Only 16% have a standards-based digital preservation system.
  • “Most records today are born digital."
  • Digital assets should be considered business-critical information and steps taken to keep them usable long into the future
  • Most organizations are not storing their long-term digital assets in a manner sufficient to ensure their long-term protection and accessibility.
How are they being kept? According to a survey:
  • Shared Network Drive                                68%
  • Business Applications (e.g. CRM, ERP)        52%
  • Enterprise Content Management System     47%
  • Disk or Tape Backup Systems                      44%
  • Records Management System                      43%
  • Application-specific Archiving (e.g. email)  33%
  • Removable Media (e.g. CD or USB)              22%
  • Enterprise Archiving System                       14%
  • Long-term Digital Preservation System        11%
  • Other                                                          9%
  • Commodity Cloud Storage (e.g. Amazon)      8%
  • I don't know                                                 1%

Where to start? Some recommendations:
  • Triage right now the materials that are in serious danger of being lost, damaged, or rendered inaccessible.
  • Conduct a formal assessment so that you can benefit from strategic planning and economies of scale.
  • Address the Past, Protect the Future
  • Catalog the Consequences of not being able to access and rely upon your own information
  • Build Your Rules for Protection and accessibility
  • Assess the IT Environment

Thursday, May 19, 2016

One Billion Drive Hours and Counting: Q1 2016 Hard Drive Stats

One Billion Drive Hours and Counting: Q1 2016 Hard Drive Stats. Andy Klein. Backblaze. May 17, 2016.
     Backblaze reports statistics for the first quarter of 2016 on 61,590 operational hard drives used to store encrypted customer data in their data center. The hard drives in the data center, past and present, have totaled over one billion hours in operation to date. The data in these hard drive reports has been collected since April 10, 2013, and the website shows the statistical reports of drive operations and failures every year since then. The report shows the drives (and drive models) by various manufacturers, the number in service, the time in service, and failure rates. The drives in the data center come from four manufacturers; most are from HGST and Seagate. Notes:
  • The overall annual failure rate of 1.84% is the lowest quarterly number they have ever seen.
  • The Seagate 4TB drive leads in “hours in service”.
  • The early HGST drives, especially the 2TB and 3TB drives, have lasted a long time and have provided excellent service over the past several years.
  • HGST has the most hours in service.
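An annualized failure rate like Backblaze's is, roughly, failures per hundred drive-years. A sketch of the arithmetic (the failure count below is hypothetical, chosen only to land near the reported 1.84%; it is not from the report):

```python
def annualized_failure_rate(failures, drive_days):
    """Failure rate expressed as failures per 100 drive-years."""
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years

# Hypothetical quarter: 61,590 drives running ~90 days each, 279 failures.
afr = annualized_failure_rate(279, 61590 * 90)
```

Normalizing by drive-days rather than drive count is what makes quarters with different fleet sizes and deployment dates comparable.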


IBM Scientists Achieve Storage Memory Breakthrough

IBM Scientists Achieve Storage Memory Breakthrough. Press release. 17 May 2016.
     IBM Research demonstrated reliably storing 3 bits of data per cell using phase-change memory. This technology doesn't lose data when powered off and can endure at least 10 million write cycles, compared to 3,000 write cycles for an average flash USB stick. This provides "fast and easy storage" to capture the exponential growth of data.

Wednesday, May 18, 2016

Floppy Disk Format Identifier Tool

Floppy Disk Format Identifier Tool. Euan Cochrane. Digital Continuity Blog. May 13, 2016.
     Euan created this tool https://github.com/euanc/DiskFormatID (which he documents in this great blog post) to:
  1. “Automatically” identify floppy disk formats from KryoFlux stream files.
  2. Enable “simple” disk imaging workflows that don’t include a disk format identification step during the data capture process.
The tool processes copies of floppy disk data saved in the KryoFlux stream file format, creates a set of disk image files formatted according to assumptions about the disk’s format, and allows the user to try mounting the image files as file systems. It requires the KryoFlux software to function. The documentation also provides detailed information on how to use it, along with other interesting information.

Friday, May 13, 2016

JHOVE 1.14 released

JHOVE 1.14 released. Open Preservation Foundation. 12 May 2016.
     "The latest version of JHOVE, the open source file format identification, validation and characterisation tool for digital preservation, is now available to download." This version has three new format modules: gzip, WARC and PNG. Among other features, it has a black box testing module and support for Unicode 7.0.0.

Thursday, May 12, 2016

The Center for Jewish History Adopts Rosetta for Digital Preservation and Asset Management

The Center for Jewish History Adopts Rosetta for Digital Preservation and Asset Management. Ex Libris. Press Release. May 12, 2016.
     After a thorough search process, the Center for Jewish History selected the Ex Libris Rosetta digital asset management and preservation solution. They wanted a system to handle their comprehensive list of requirements for both long‑term digital preservation and robust management of digital assets, including the ability to interface with their other systems.

The Center’s partners are American Jewish Historical Society, American Sephardi Federation, Leo Baeck Institute, Yeshiva University Museum, and YIVO Institute for Jewish Research.  The collections include more than five miles of archival documents, over 500,000 volumes, and thousands of artworks, textiles, ritual objects, recordings, films, and photographs.

Monday, May 09, 2016

Looking Across the Digital Preservation Landscape

Looking Across the Digital Preservation Landscape. Margaret Heller. ACRL TechConnect Blog. April 25, 2016.
     "When it comes to digital preservation, everyone agrees that a little bit is better than nothing." The article cited refers to two presentations from Code4Lib 2016, “Can’t Wait for Perfect: Implementing “Good Enough” Digital Preservation” by Shira Peltzman and Alice Sara Prael, and “Digital Preservation 101, or, How to Keep Bits for Centuries” by Julie Swierczek. This article mentions two major items about digital preservation:
  1. Digital preservation doesn’t have to be hard, but it does have to be intentional.
  2. Digital preservation requires institutional commitment. 
Understanding all the basic issues and what your options are can be daunting. They had a committee that started by examining born-digital materials, but expanded its focus to all digital materials because that made it easier to test their ideas. Tasks they accomplished included creating a rough inventory of digital materials, writing a workflow manual, and securing networked storage to replace the removable hard drives used for backups. "While backups aren’t exactly digital preservation, we wanted to at the very least secure the backups we did have". The inventory and workflow manual are living documents and are useful for identifying gaps in the processes.
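The "rough inventory" step described above can be sketched as a small script. This is an illustrative sketch, not the committee's actual tool: the function name and CSV layout are assumptions, and a real inventory would likely record more descriptive metadata.

```python
import csv
import hashlib
import os

def inventory(root, out_csv):
    """Walk a directory tree and record path, size, and MD5 digest per file."""
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "bytes", "md5"])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                digest = hashlib.md5()
                with open(path, "rb") as fh:
                    # Hash in 1 MiB chunks so large files aren't read into memory.
                    for chunk in iter(lambda: fh.read(1 << 20), b""):
                        digest.update(chunk)
                writer.writerow([path, os.path.getsize(path), digest.hexdigest()])
```

Even a minimal listing like this gives a baseline for later fixity checking, which fits the post's point that "a little bit is better than nothing."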

They also looked at the end-to-end systems available for digital preservation, such as Preservica, ArchivesDirect, and Rosetta. Migrating from one system to another later may involve some very difficult processes, so institutions may tend to stay with their current provider. Another option is to join a preservation network, such as the Digital Preservation Network (DPN) or APTrust, whose larger preservation goal is ensuring long-term access to material even if the owning institution disappears.

Sustainable financing is, for many, the crux of the digital preservation problem. "It’s possible to do a sort of ok job with digital preservation for nothing or very cheap, but to ensure long term preservation requires institutional commitment for the long haul, just as any library collection requires."

Digital preservation is receiving more attention lately, and hopefully more libraries will see it as a priority.

Thursday, April 28, 2016

Preserving the Fruit of Our Labor: Establishing Digital Preservation Policies and Strategies

Preserving the Fruit of Our Labor: Establishing Digital Preservation Policies and Strategies at the University of Houston Libraries. Santi Thompson, et al. iPres 2015. November 2015.
     Paper that presents the library's digital preservation efforts. They formed a Digital Preservation Task Force to assess previous digital preservation practices and make recommendations on future efforts. The group was charged to establish a digital preservation policy and to identify the strategies, actions, and tools needed to sustain long-term access to the library's digital objects. Specifically, the group was to:
  • Define the policy’s scope and levels of preservation
  • Articulate digital preservation priorities by outlining current practices, identifying preservation gaps and areas for improvement, and establishing goals to address gaps 
  • Determine the tools, infrastructure, and other resources needed to address unmet needs and to sustain preservation activities in the future 
  • Align priorities with digital preservation standards, best practices, and storage services
  • Recommend roles, responsibilities, and next steps for implementing the strategy and policy 

The primary tool used for policy creation was the Action Plan for Developing a Digital Preservation Program. It helps institutions establish a high-level framework with policies and procedures, and address the resources needed to sustain a digital preservation program for the long term. The group also:
  • Selected and studied Action Plan for Developing a Digital Preservation Program to construct digital preservation policies
  • Drafted high-level policy framework
  • Outlined roles and responsibilities for internal and external stakeholders
  • Defined digital assets including digitization quality and metadata specifications; collection selection, acquisition policies, and procedures; and access and use policies
  • Identified and described key functional entities for the digital preservation system, including ingest, archival storage, preservation planning and administration, and access
  • Drafted potential start-up and ongoing costs for digital preservation
  • Focused on evaluating software
Principles outlined in their Digital Preservation Policy include collaboration, partnerships, and technological innovation. As more library resources and services become digital, the responsibilities must expand to include the identification, stewardship, and preservation of designated digital content.

The Digital Preservation Policy consists of three main sections: Policy Framework, Policies and Procedures, and Technological Infrastructure. Sections in the Digital Preservation Policy Framework include:
  • Purpose
  • Objectives
  • Mandate
  • Scope
  • Challenges
  • Principles
  • Roles and Responsibilities
  • Collaboration
  • Selection and Acquisition
  • Access and Use

The Policies and Procedures section describes digital preservation policies, procedures, roles, and responsibilities in greater detail than the policy framework. It outlines requirements concerning digital assets, including recommended specifications for digital objects, preferred file formats, and personnel, as well as the acquisition, transfer, and access of content.

The Technological Infrastructure section outlines digital preservation system functions and requirements in greater detail than the policy framework and includes:
  • The rules and requirements for Submission Information Packages (SIPs), Archival Information Packages (AIPs), and Dissemination Information Packages (DIPs)
  • The workflow for ingesting, updating, storing, and managing digital objects
  • The metadata requirements
  • The strategic priorities for future digital preservation efforts and risk management

Monday, April 25, 2016

Why Analog-To-Digital Video Preservation, Why Now

Why Analog-To-Digital Video Preservation, Why Now. Bay Area Video Coalition. April 4, 2016.
     The first part is from an article that revisits an earlier publication: How I Learned (Almost) Everything I Know About ½” Video from Dr. Martin Luther King, Jr. By Moriah Ulinskas, Former Director of Preservation. Originally published October 5th, 2011. It describes preserving a video recording of Martin Luther King, Jr. and the difficulties involved. Some quotes from the article and the website in general:
  • "I tell all our clients and partners that they have 5, maybe 10 years left in which they can have these works preserved and transferred and then these recordings are gone for good."
  • "These are the legacy recordings I refer to with such urgency when I talk about the immediacy and importance of video preservation. These moments of political and cultural significance that inspired someone, 40 years ago, to hook up a camera and record this tape which we’ve inherited from dusty basements and disregarded shelves."
  • "If we do not do diligence in transferring these recordings to new formats, as the originals become impossibly obsolete, these are the moments and the messages we will lose forever."

Some items from the rest of the website:
  • As audio and video technologies have changed, and as old formats age and disintegrate, we are at risk of losing significant media that documents the art, culture and history of our diverse communities. Link
  • Analog media preservation is necessary because of two central factors: technical obsolescence and deterioration. Experts say that magnetic media has an estimated lifespan for playback of 10-15 years, and companies have already ceased manufacture of analog playback decks, the devices required to digitize and preserve analog media.

Audio / Video Preservation Tools
  • QCTools (Quality Control Tools for Video Preservation) is a free, open-source tool that helps conservators and archivists inspect, analyze and understand their digitized video files, in order to prioritize archival quality control, detect common errors in digitization, facilitate targeted response, and thus increase trust in video digitization efforts. 
  • A/V Artifact Atlas. An open-source guide used to define and identify common technical issues and problems with audio and video signals. The intent of the guide is to assist with and promote the reformatting of archival media content.
  • AV Compass. A suite of free online resources to help with organizing and preserving media collections. It includes step-by-step educational videos, PDF guides, an overview of preservation concepts, and a simple tool for creating inventories. This guide helps users create a preservation plan and take specific steps to make that plan happen.

Saturday, April 23, 2016

Closing the Gap in Born-Digital and Made-Digital Curation

Closing the Gap in Born-Digital and Made-Digital Curation. Jessica Tieman, Mike Ashenfelder. The Signal. April 21, 2016. 
     The post is about an upcoming symposium that refers to “Digital Frenemies”. The author observes that a trend in digital stewardship divides expertise into “made digital” and “born digital.” The landscape of the digital preservation field should not be divided like that. "Rather, the future will be largely defined by the symbiotic relationships between content creation and format migration. It will depend on those endeavors where our user communities intersect rather than lead to us to focus on challenges specific to our individual areas of the field."

Friday, April 22, 2016

Providing Access to Disk Image Content: A Preliminary Approach and Workflow

Providing Access to Disk Image Content:  A Preliminary Approach and Workflow. Walker Sampson, Alexandra Chassanoff. iPres 2015. November 2015.   Abstract    Poster
     The paper describes a proposed workflow that collecting institutions acquiring disk images can use to support the capture, analysis, and eventual access of born-digital collections. These materials present certain challenges; some institutions use open-source digital forensics environments like BitCurator for the capture and analysis of born-digital materials.

The workflow is for the research archives at the University of Colorado Boulder; they do not have a digital repository or collection management software deployed. However, it "addresses the immediate needs of the material, such as bit-level capture and triage, while remaining flexible enough to have the outputs integrate with a future digital repository and collection management software." It allows researchers to access a bit-level copy of a floppy disk found in an archival collection. Access is typically regarded as the last milestone of processing work.

The workflow for processing born-digital materials starts with obtaining the physical disk, which is photographed before a disk image is created. The BitCurator Reporting Tool generates analytic reports, and other analyses can be run at this stage as well. The total output from BitCurator is placed into a single BagIt package and uploaded to managed storage with redundant copies; this will become the AIP in a future repository. The disk image can then be used to provide public access.
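The BagIt packaging step above can be sketched with only the standard library. In practice the Library of Congress `bagit` Python library would normally do this (and handles tag files, checksums, and validation more completely); this minimal version just shows the bag layout: a `data/` payload directory, a `bagit.txt` declaration, and an MD5 payload manifest.

```python
import hashlib
import os
import shutil

def make_bag(src_dir, bag_dir):
    """Package src_dir as a minimal BagIt-style bag under bag_dir."""
    data_dir = os.path.join(bag_dir, "data")
    shutil.copytree(src_dir, data_dir)  # payload files go under data/
    # Bag declaration required by the BagIt specification.
    with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    # Payload manifest: "<digest>  <path relative to bag root>" per file.
    entries = []
    for dirpath, _dirs, files in os.walk(data_dir):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digest = hashlib.md5(fh.read()).hexdigest()
            rel = os.path.relpath(path, bag_dir).replace(os.sep, "/")
            entries.append(f"{digest}  {rel}\n")
    with open(os.path.join(bag_dir, "manifest-md5.txt"), "w") as f:
        f.writelines(entries)
```

The manifest is what makes the bag self-verifying: any later fixity audit can re-hash `data/` and compare against `manifest-md5.txt`.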

Scientific Archives in the Age of Digitization

Scientific Archives in the Age of Digitization. Brian Ogilvie. The University of Chicago Press Journals. March 2016.
     Historians are increasingly working with material that has been digitized; they need to be aware "of the scope of digitization, the reasons why material is chosen to be digitized, and limitations on the dissemination of digitized sources."  Some physical aspects of sources, and of collections of sources, are lost in their digital versions. Some notes from the article:
  • "Digitization of unique archival material occupies an ambiguous place between access and publication."
  • digitized archives reproduce unique archival material with finding aids but without significant editorial commentary that allows for open-ended historical inquiry without the need to travel to archives  
  • the digitized archive also raises questions and challenges for historical practice, specifically 
    • the digitizing decision and funding
    • balancing digital access against some owners’ interests in restricting access
    • aspects of the physical archive that may be lost in digitization
    • the possibility of combining resources from a number of physical archives
  • most digitization projects have been selective in their scope
  • scholars cannot assume that material has been digitized, nor that all material has been digitized, unless the archive specifically states that
  • digitized material is not always freely available, e.g. subscription based archives
  • many archivists "fear that their traditional task of preparing detailed collection inventories is under threat owing to dwindling resources and the demand for digitization."

Digital Preservation notes:
  • projects have undeniable benefits for the preservation of documents and access to them.
  • In the interest of preserving their holdings and disseminating them to a broad public, archives are increasingly digitizing their collections. 
  • historians interested in digital preservation of archives, and electronic access to them, would be well advised to seek out collaborations with archivists.

Thursday, April 21, 2016

Expanding NDSA Levels of Preservation

Expanding NDSA Levels of Preservation. Shira Peltzman, Mike Ashenfelder. The Signal. April 12, 2016.
     Alice Prael and Shira Peltzman have been working on a project to update the NDSA Levels of Digital Preservation to include a metric for access. The NDSA Levels is a tool to help organizations manage digital preservation risks. The matrix contains a tiered list of technical steps that correspond to levels of complexity and preservation activities: Storage and Geographic Location, File Fixity and Data Integrity, Information Security, Metadata and File Formats. Access is one of the "foundational tenets of digital preservation. It follows that if we are unable to provide access to the materials we’re preserving, then we aren’t really doing such a great job of preserving those materials in the first place."

They have added an Access row to the NDSA Levels designed to help measure and enhance progress in providing access. The updated Levels of Preservation:

The four levels are Level One (Protect Your Data), Level Two (Know Your Data), Level Three (Monitor Your Data), and Level Four (Repair Your Data). The rows of the matrix:

Storage and Geographic Location
  • Level 1: Two complete copies that are not collocated; for data on heterogeneous media (optical disks, hard drives, etc.) get the content off the medium and into your storage system
  • Level 2: At least three complete copies; at least one copy in a different geographic location; document your storage system(s) and storage media and what you need to use them
  • Level 3: At least one copy in a geographic location with a different disaster threat; obsolescence monitoring process for your storage system(s) and media
  • Level 4: At least 3 copies in geographic locations with different disaster threats; have a comprehensive plan in place that will keep files and metadata on currently accessible media or systems

File Fixity and Data Integrity
  • Level 1: Check file fixity on ingest if it has been provided with the content; create fixity info if it wasn’t provided with the content
  • Level 2: Check fixity on all ingests; use write-blockers when working with original media; virus-check high risk content
  • Level 3: Check fixity of content at fixed intervals; maintain logs of fixity info and supply audit on demand; ability to detect corrupt data; virus-check all content
  • Level 4: Check fixity of all content in response to specific events or activities; ability to replace/repair corrupted data; ensure no one person has write access to all copies

Information Security
  • Level 1: Identify who has read, write, move, and delete authorization to individual files
  • Level 2: Restrict who has those authorizations to individual files
  • Level 3: Document access restrictions for content
  • Level 4: Maintain logs of who performed what actions on files, including deletions and preservation actions; perform audit of logs

Metadata
  • Level 1: Inventory of content and its storage location; ensure backup and non-collocation of inventory
  • Level 2: Store administrative metadata; store transformative metadata and log events
  • Level 3: Store standard technical and descriptive metadata
  • Level 4: Store standard preservation metadata

File Formats
  • Level 1: When you can give input into the creation of digital files, encourage use of a limited set of known open file formats and codecs
  • Level 2: Inventory of file formats in use
  • Level 3: Monitor file format obsolescence issues
  • Level 4: Perform format migrations, emulation and similar activities as needed

Access (proposed)
  • Level 1: Determine designated community1; ability to ensure the security of the material while it is being accessed, which may include physical security measures (e.g. someone staffing a reading room) and/or electronic measures (e.g. a locked-down viewing station, restrictions on downloading material, restricting access by IP address, etc.)
  • Level 2: Ability to identify and redact personally identifiable information (PII) and other sensitive material; have publicly available catalogs, finding aids, inventories, or collection descriptions so that researchers can discover material
  • Level 3: Create Submission Information Packages (SIPs) and Archival Information Packages (AIPs) upon ingest2; ability to generate Dissemination Information Packages (DIPs) on ingest3; have a publicly available access policy
  • Level 4: Store Representation Information and Preservation Description Information4; ability to provide access to obsolete media via its native environment and/or emulation
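The fixity requirements in the matrix ("check fixity of content at fixed intervals", "ability to detect corrupt data") come down to re-hashing stored content against a manifest recorded at ingest. A minimal sketch, where the manifest format and the `read_bytes` callable are illustrative assumptions rather than anything prescribed by the NDSA Levels:

```python
import hashlib

def verify_fixity(manifest, read_bytes):
    """Return the paths whose current SHA-256 digest differs from the manifest.

    manifest: dict of path -> expected hex digest, recorded at ingest.
    read_bytes: callable(path) -> bytes, abstracting where the copies
    live (local disk, a replica at another site, etc.).
    """
    failures = []
    for path, expected in sorted(manifest.items()):
        actual = hashlib.sha256(read_bytes(path)).hexdigest()
        if actual != expected:
            failures.append(path)  # corrupt or altered copy detected
    return failures
```

Running this on a schedule gives Level 3's interval checking; running it against each replica, and restoring failed files from a copy that still verifies, is the substance of Level 4's replace/repair step.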

1 Designated Community essentially means “users”; the term comes from the Reference Model for an Open Archival Information System (OAIS).
2 The Submission Information Package (SIP) is the content and metadata received from an information producer by a preservation repository. An Archival Information Package (AIP) is the set of content and metadata managed by a preservation repository, and organized in a way that allows the repository to perform preservation services.
3 Dissemination Information Package (DIP) is distributed to a consumer by the repository in response to a request, and may contain content spanning multiple AIPs.
4 Representation Information refers to any software, algorithms, standards, or other information that is necessary to properly access an archived digital file. Or, as the Preservation Metadata and the OAIS Information Model put it, “A digital object consists of a stream of bits; Representation Information imparts meaning to these bits.” Preservation Description Information refers to the information necessary for adequate preservation of a digital object. For example, Provenance, Reference, Fixity, Context, and Access Rights Information.

[I've been asked to add the footnotes, which I have done. By way of clarification, my notes are the things that I want to remember from the articles I read. The real source for the concepts is the actual article itself; the link is provided at the top of the notes. - chris]

Wednesday, April 20, 2016

On the Marginal Cost of Scholarly Communication

On the Marginal Cost of Scholarly Communication. Tiffany Bogich, et al. Science.ai by Standard Analytics. 18 April, 2016.
     An article that looks at the marginal cost of scholarly communication from the perspective of an agent looking to start an independent, peer-reviewed scholarly journal. It found that vendors can accommodate all of the services required for scholarly communication for between $69 and $318 per article, and with alternate software solutions replacing the vendor services, the marginal cost of scholarly communication would drop to between $1.36 and $1.61 per article, almost all of which is the cost of DOI registration. The development of high quality “plug-and-play” open source software solutions would have a significant impact in reducing the marginal cost of scholarly communication, making it more open to experimentation and innovation. For the cost of long term journal preservation, the article looked at CLOCKSS and Portico.

Tuesday, April 19, 2016

Requirements on Long-Term Accessibility and Preservation of Research Results with Particular Regard to Their Provenance

Requirements on Long-Term Accessibility and Preservation of Research Results with Particular Regard to Their Provenance. Andreas Weber, Claudia Piesche. ISPRS Int. J. Geo-Inf. 11 April 2016.
     The importance of long-term accessibility increased when the “OECD Principles and Guidelines for Access to Research Data from Public Funding” was published. The description of the long-term accessibility of research data now has to be a part of research proposals and a precondition for the funding of projects.
The demand for long-term preservation of research data has developed slowly and is established in only a few research areas. Existing solutions for the long-term storage of research data are specialized and usually not designed for public use or reuse.

At universities, support for the preservation of research data is mostly limited to the provision of highly available disk storage and appropriate backup solutions. Collaboration is limited, and tools to support searching metadata are very rare. "The institutions that could play an important role, like libraries or IT centers, hesitate to build up solutions, because policies for the treatment of research results are not yet installed by the administration." Solutions to manage research data would also need a very sophisticated rights management system to protect data from unauthorized access while still providing access.

"Long-term preservation in a more classical sense means the bit stream preservation, and aims at a subsequent use of data in content as well as in technical purpose." A solution for the long-term preservation of research data should be compliant with OAIS. To access the specific research data, a unique identifier would be needed, and the storage has to satisfy the "norms of long-term preservation".

"Currently the most important standard is the Open Archival Information System (OAIS) reference model. The OAIS model specifies how digital assets can be preserved through subsequent preservation strategies. It is a high-level reference model, and therefore is not bound to specific technology. Although the model is complex, systems for the long-term storage of digital data will have to meet the requirements."

Monday, April 18, 2016

Calculating All that Jazz: Accurately Predicting Digital Storage Needs Utilizing Digitization Parameters for Analog Audio and Still Image Files

Calculating All that Jazz: Accurately Predicting Digital Storage Needs Utilizing Digitization Parameters for Analog Audio and Still Image Files. Krista White. ALCTS. 14 Apr 2016.

  The library science literature does not show a reliable way to calculate digital storage needs when digitizing analog materials such as documents, photographs, and sound recordings in older formats. "Library professionals and library assistants who lack computer science or audiovisual training are often tasked with writing digital project proposals, grant applications or providing rationale to fund digitization projects for their institutions." Digital project managers need tools to accurately predict the amount of storage for digital objects and also estimate the starting and ongoing costs for the storage. This paper provides two formulae for calculating digital storage space for uncompressed, archival master image and document files and sound files.

Estimates from earlier sources:
  • thirty megabytes of storage for every hour of compressed audio
  • one megabyte for a page of uncompressed, plain text (bitmap format)
  • three gigabytes for two hours of moving image media
  • 90 megabytes for uncompressed raster image files
  • 600 megabytes for one hour of uncompressed audio recording
  • “nearly a gigabyte of disk space” for one minute of uncompressed digital video
  • 100 gigabytes (GB) of storage for 100 hours of audio tape
These estimates can be adjusted to alter both file size and quality, depending on the choice of digitization standard, the combination of variables used in a chosen standard, and the quantity of digital storage required.
Some additional notes from the article:
  • As the experiments demonstrate, the formulae for still image and audio recordings are extremely accurate. They will prove invaluable to digital archivists, digital librarians and the average user in helping to plan digitization projects, as well as in evaluating hardware and software for these projects. 
  • Digital project managers armed with the still image and audio formulae will be able to calculate file sizes using different standards to determine which standard will suit the project needs. 
  • Knowing the parameters of the still image and audio formulae will allow managers to evaluate equipment on the basis of the flexibility of the software and hardware before purchase. 
  • Using the still image and audio calculation formulae in workflows will help digital project managers create more efficient project plans and tighter grant proposals. 
  • The variables in the audio storage formula: length of the original audio recording, sampling rate, bit depth, and number of audio channels. 
  • Formula for Calculating File Sizes of Uncompressed, Still Images:
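The article's exact formulae are not reproduced in these notes, but the standard arithmetic for uncompressed file sizes follows from the variables listed above: an image's size is its pixel count (inches times dpi in each dimension) times bit depth over 8, and PCM audio is sampling rate times bytes per sample times channels times duration. A sketch of that standard math (function names are mine, not the article's):

```python
def image_size_bytes(width_inches, height_inches, dpi, bit_depth):
    """Uncompressed raster image: pixel count times bytes per pixel."""
    pixels = (width_inches * dpi) * (height_inches * dpi)
    return pixels * bit_depth / 8

def audio_size_bytes(duration_seconds, sampling_rate, bit_depth, channels):
    """Uncompressed PCM audio: bytes per sample across all channels."""
    return duration_seconds * sampling_rate * (bit_depth / 8) * channels
```

As a sanity check against the earlier estimates: an 8 × 10 inch scan at 600 dpi in 24-bit color comes to 86.4 MB, close to the "90 megabytes for uncompressed raster image files" figure, and one hour of 44.1 kHz, 16-bit stereo audio comes to about 635 MB, in line with "600 megabytes for one hour of uncompressed audio recording."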


The article also includes a table comparing calculated file sizes with the actual sizes; the table itself is not reproduced in these notes.