One crucial factor in the preservation of research data is the file format. Depending on the format selected, you have to be prepared for the fact that the data might only be readable for a few years. Programmes, formats and operating systems change over time, which means that, in many cases, the usability of the data can’t be guaranteed.
Following on from February’s blog entry, in which we presented various methods to create PDF/A documents, this time we summarise some of the criteria that play a role in the suitability of file formats for long-term archiving.
Format requirements for long-term archiving
Long-term digital archives recommend that their customers use a small selection of file formats. The list of recommendations from the Digital Curation office, for example, differentiates according to the storage duration.
Formats for long-term archiving ought to meet certain core requirements:
- Ideally, the format should correspond to open standards with full documentation.
- It should be widely used.
- Preferably, the format should be supported by a larger number of tools, i.e. not just by the specific software of one provider.
- Moreover, files should be available in an uncompressed form, or at least be compressed using a documented, lossless method.
- If possible, encryption should be avoided.
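The core requirements above can be read as a simple checklist. The following is a minimal sketch of that idea, not an actual archive policy: the `FormatProfile` fields, the function name `unmet_requirements` and the two example profiles are all hypothetical and only serve to illustrate the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatProfile:
    """Hypothetical description of a file format, one field per requirement."""
    name: str
    open_standard: bool       # open, fully documented standard?
    widely_used: bool         # in widespread use?
    multi_tool_support: bool  # supported by tools from more than one provider?
    compression: str          # "none", "lossless" (documented) or "lossy"
    encrypted: bool           # is the content encrypted?


def unmet_requirements(profile):
    """Return the core requirements that the given profile does not meet."""
    issues = []
    if not profile.open_standard:
        issues.append("not an open, fully documented standard")
    if not profile.widely_used:
        issues.append("not widely used")
    if not profile.multi_tool_support:
        issues.append("tied to the software of a single provider")
    if profile.compression not in ("none", "lossless"):
        issues.append("neither uncompressed nor losslessly compressed")
    if profile.encrypted:
        issues.append("encrypted")
    return issues


# Two made-up profiles; the values are illustrative assumptions only.
example_open = FormatProfile("example-open", True, True, True, "none", False)
example_vendor = FormatProfile("example-vendor", False, False, False, "lossy", True)

for profile in (example_open, example_vendor):
    print(profile.name, unmet_requirements(profile))
```

In practice the requirements are weighed against each other rather than applied as a strict pass or fail, so a list of unmet points is probably more useful than a single yes/no answer.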
Use in research
Obviously, these file format guidelines are formulated from a long-term archiving perspective. Suitable formats are not available for all purposes, and consistently implementing the requirements in research is unrealistic. However, the guidelines should also trigger a discussion on which format is meant to fulfil which purpose. Does it make sense to create, in addition to the manufacturer-specific output format of a measuring device, another format that is better suited to exchanging data with partners? What information is lost in the process? Does the actual data remain intact?
Validity as a criterion for long-term archives
Unfortunately, the selection of an essentially suitable file format is no guarantee of the longevity of the individual file. Long-term digital archives therefore try to verify whether the individual file actually conforms to the respective specification and is valid. This can reveal errors that a particular application causes systematically when creating files, as well as problems that only affect certain files. Such deviations can jeopardise the usability and long-term preservation of data.
The poster “To Act or Not to Act – Handling File Format Identification and Validation Issues in Practice” by the ETH Library’s Digital Curation team, which won the Best Poster Award at the 13th International Conference on Digital Preservation (iPRES 2016) at the end of 2016, provides an insight into the practical consequences of this analysis.
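To give a rough idea of what checking an individual file can involve, here is a minimal sketch that compares the first bytes of a file with the signature expected for its extension. It is a much weaker test than validation against a full specification, and it is not the procedure used by any particular archive. The function name `signature_matches` and the small `SIGNATURES` table are assumptions made for this example; the byte sequences themselves are the well-known file signatures of PDF, PNG and TIFF.

```python
from pathlib import Path

# Known leading bytes ("magic numbers") for a few formats.
# The selection is illustrative and far from complete.
SIGNATURES = {
    ".pdf": (b"%PDF-",),
    ".png": (b"\x89PNG\r\n\x1a\n",),
    ".tif": (b"II*\x00", b"MM\x00*"),
    ".tiff": (b"II*\x00", b"MM\x00*"),
}


def signature_matches(path):
    """Check whether a file starts with the signature its extension implies.

    Returns True or False for known extensions and None if the
    extension is not in the SIGNATURES table.
    """
    path = Path(path)
    expected = SIGNATURES.get(path.suffix.lower())
    if expected is None:
        return None
    with path.open("rb") as handle:
        # Read only as many bytes as the longest candidate signature.
        header = handle.read(max(len(sig) for sig in expected))
    return any(header.startswith(sig) for sig in expected)
```

A result of `False` only indicates that the extension and the content disagree. A result of `True` does not mean that the file is valid, since the rest of the file could still deviate from the specification.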
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International Public License.
DOI Link: 10.16911/ethz-ib-2790-en