Wednesday, November 30, 2016

To Act or Not to Act - Handling File Format Identification Issues in Practice

To Act or Not to Act - Handling File Format Identification Issues in Practice. Matthias Töwe, Franziska Geisser, Roland E. Suri. Poster, iPres 2016.  (Proceedings p. 288-89 / PDF p. 145).
     Format identification output needs to be assessed within an institutional context, and provenance information should also be considered when determining actions. This poster presents ways to address file format identification and validation issues that are mostly independent of the specific tools and systems employed; the authors themselves use Rosetta, DROID, PRONOM, and JHOVE. Archives rely on technical file format information and therefore want to derive as much information about digital objects as possible before ingest. But issues occur in everyday practice, such as:
  • how to proceed without compromising preservation options
  • how to make efforts scalable 
  • issues with different types of data
  • issues related to the tool's internal logic
  • metadata extraction which is also format related
 The use cases vary depending on the customers, types of material, and formats. They range from safeguarding research data for a limited period of time (ten years at minimum) to publishing and preserving data in the long term. Understanding the use cases’ characteristics helps provide "a better understanding of what actually matters most in each case."

Ideally, format identification should yield reliable and unambiguous information on the format of a given file. However, a number of problems complicate the process. Handling files on an individual basis does not scale well, which may mean that unsatisfactory decisions need to be taken to keep the volume of data manageable. Some criteria to consider:
  • Usability: can the file be used as expected with standard software?
  • Tool errors: is an error known to be tool-related?
  • Understanding: is the error actually understood?
  • Seriousness: does the error concern the format's significant properties?
  • Correctability: is there a documented solution to the error?
  • Risk of correcting: what risks are associated with correcting the error?
  • Effort: what effort is required to correct the error?
  • Authenticity: is the file’s authenticity more relevant than format identification?
  • Provenance: can the data producer help resolve this and future errors?
  • Intended preservation: what solution is acceptable for shorter preservation periods?
There are no simple rules to resolve these, so other considerations are needed to determine what actions to take:
  • Should format identification be handled at ingest or as a pre-ingest activity?
  • How to document measures taken to resolve identified problems?
  • Can unknown formats be admitted to the archive? 
  • Should the format identification be re-checked later? 
  • Do we rely on PRONOM or do we need local registries? 
  • How to preserve formats for which no applications exist?
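
One way to approach two of the points above, scalability and documenting the measures taken, is to decide once per class of error rather than once per file. The following is only a minimal sketch in Python under that assumption. The record fields, function names, and sample values are hypothetical; they are not taken from the poster and do not reflect the actual output of Rosetta, DROID, or JHOVE.

```python
from collections import defaultdict
from dataclasses import dataclass
import json


@dataclass(frozen=True)
class FormatIssue:
    """One identification or validation problem for one file (hypothetical record)."""
    path: str      # file the tool reported on
    tool: str      # name of the reporting tool, e.g. "JHOVE"
    puid: str      # format identifier reported, or "" if the format is unknown
    message: str   # error text as reported by the tool
    producer: str  # provenance: who delivered the file


def group_issues(issues):
    """Group issues by (tool, puid, message) so one decision can cover many files."""
    groups = defaultdict(list)
    for issue in issues:
        groups[(issue.tool, issue.puid, issue.message)].append(issue)
    return dict(groups)


def document_decision(key, issues, decision, rationale):
    """Return a JSON string documenting the measure taken for one group of issues."""
    tool, puid, message = key
    record = {
        "tool": tool,
        "puid": puid,
        "message": message,
        "decision": decision,    # free text, chosen by the archive
        "rationale": rationale,  # why this decision was considered acceptable
        "files": sorted(i.path for i in issues),
        "producers": sorted({i.producer for i in issues}),
    }
    return json.dumps(record, indent=2, ensure_ascii=False)


# Invented sample data, for illustration only.
sample = [
    FormatIssue("a.pdf", "JHOVE", "fmt/x", "example validation error", "Lab A"),
    FormatIssue("b.pdf", "JHOVE", "fmt/x", "example validation error", "Lab B"),
    FormatIssue("c.dat", "DROID", "", "format not identified", "Lab A"),
]
for key, members in group_issues(sample).items():
    print(document_decision(key, members, "accept as is", "file opens in standard software"))
```

Grouping does not make the decision itself any easier; it only ensures that each decision and its rationale are recorded once and can be traced back to the affected files and their producers.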
"Format validation can fail when file properties are not in accord with its format’s specification. However, it is not immediately clear if such deviations prevent current usability of a file orcompromise the prospects for a file’s long term preservability." If the file is usable today, does that mean it is valid? Digital archives need to "balance the efforts for making files valid vs. making files pass validation in spite of known issues."

The failure to extract significant properties has no immediate consequences, and institutions need to decide if they will correct the issues, especially if the metadata is embedded and the file must be altered, which creates the risk of unknowingly introducing other changes. The authors of this poster act even more cautiously when fixing metadata extraction issues that require working with embedded metadata or file properties.
