This article appears in Library Review, Vol. 54 (9), pp 508-513
Purpose - With heavy ongoing investment in the creation, storage and delivery of electronic content, it is important to consider the long-term preservation of the resources produced.
Methodology/Approach - A viewpoint paper based on extensive practitioner experience with the management of digitisation, digital preservation, and quality assurance procedures.
Findings - The choice of file and media formats for the content can have a significant effect on long-term access to electronic content.
Practical implications - Offers insights into the issues surrounding the choice of open or proprietary formats. The paper also examines some of the pitfalls of a proprietary approach and suggests strategies which might be employed for managing digital content formats in the long term.
Originality/value of paper - An attempt to provide clear, experience-based strategies on how best to engage in the long-term management of digital content formats.
Keywords - Digital Preservation, Digital Documents, Records Management, Digital Libraries
Paper Type - Viewpoint
Digital content has become an increasingly important element in many library collections over recent years. At institutional, regional, national and international levels, large sums of money are being invested in the creation of such content, and the means of storage and delivery to users. Google made headlines in 2004, pledging to spend between $150 million and $200 million over a decade on digitising some 15 million books from library collections in the USA and the UK (Riding, 2005). Also in the UK, the NOF-Digi programme has invested £50 million across 150 projects to produce and publish online material that supports lifelong learning for all (Nicholson & Macgregor, 2003).
Consideration must be given at an early stage to ensuring the longevity of digital resources, in order to protect and maximise the return on the investment in content creation. One of the key components in ensuring resource longevity is the choice of file and media formats used to create, store, and deliver digital content, and the strategies that are employed to manage these in the long term.
Guidance from funding bodies and advisory services now generally recommends, and in some cases mandates, a standards-based approach to the entire process, arguing that electronic content should be created, stored, maintained and disseminated using open standards whenever possible. An example of such guidance can be found in UKOLN (2003).
The UK Joint Information Systems Committee (JISC) Quality Assurance Focus (QA Focus, 2003) identified the following as the characteristics of open standards:
An open standards approach brings a wide range of benefits including:
While preference should always be given to an open standards approach, it is important to realise that situations will arise where an open approach is not possible and proprietary formats will be chosen instead. These formats are owned by an organisation or group (e.g. Microsoft), may sometimes be accepted as de facto standards through sheer ubiquity, and might even be referred to as standards, but cannot be regarded as open since the owner could theoretically choose to change the format or the conditions of usage at any time.
The main focus of this article is on the proprietary approach: it considers some of the reasons why organisations may choose a proprietary format, the problems this might cause in the future, and some of the strategies which may be employed to manage digital content formats - both open and proprietary - in the long term.
Organisations or individuals may choose to utilise proprietary rather than open formats for a number of reasons:
Delayed development of open formats: For certain content types there may be no suitable open format available at the time that the content is being created;
Organisational expertise: Proprietary software and formats (e.g. Microsoft Office) may already be widely deployed within an organisation, with staff trained and comfortable in their use;
Resourcing: There may be a reluctance to move to an open standards approach due to the additional training and software costs required, particularly when ubiquitous proprietary solutions are already easily available.
The choice of proprietary media and/or storage formats can lead to digital preservation problems in the future, arising from both the choice of digital media and the file formats encoded on that media.
When a physical media format is chosen for the storage of electronic content, consideration must be given to the possibility of that format becoming obsolete over time. This can particularly be a problem with new storage technologies, where a number of similar formats may be competing or coexisting in the marketplace, e.g. the competition between VHS and Betamax format video recorders, or the current market for recordable DVD technology, in which several competing standards are vying for dominance (D'Ambrise, 2004). There is always the possibility that one format will eventually dominate - whether through technological superiority or the power of marketing - thus marginalising competitors and, over time, rendering any opposing formats obsolete.
Darlington et al. (2003) outline a famous example of media obsolescence: that of the BBC Domesday Project, a collection of digital content created in 1986 to mark the 900th anniversary of the original Domesday Book. The content was stored using a proprietary laser disk format, the media and players for which subsequently became unavailable, rendering the output of the innovative project virtually inaccessible. Darlington et al. outline the painstaking work undertaken in 2002 and 2003 to preserve the content, noting that the work took place just in time, while some original systems and hardware were still available and workable.
It is clear that physical storage media (CDs, tapes, etc.), the associated storage hardware, and the necessary software for reading/writing the media must be considered and maintained together, as each becomes effectively useless without the others. If hardware develops faults over time, it may become impossible to retrieve the content from the media, and the faults may damage the media themselves, compounding the problem. Equally, pristine hardware cannot protect against data loss due to compromised media. As some degree of physical degradation is inevitable over time, the strategies outlined later should be employed to mitigate loss.
The choice of proprietary file formats adds further complexity to ensuring long-term access to electronic content. Proprietary software applications are regularly updated with new versions. While functionality may not change markedly from one version to its immediate successor, cumulative changes to a file format may become more significant in the longer term, potentially jeopardising backwards compatibility.
Maintaining copies of legacy software may seem desirable, but can be fraught with problems. Just like application software, operating systems are periodically upgraded and may, in the long term, simply cease to support legacy packages as underlying system architectures develop. For example, the release of Service Pack 2 for Windows XP in 2004 prompted reports of functionality problems with over 200 applications (Leyden, 2004). Maintaining older operating systems may not be an attractive solution, particularly in an online networked environment, where there is an increased risk of new security problems emerging in unsupported legacy systems.
As outlined above, the choice of media and file formats for the storage of electronic content could cause serious problems for the long-term accessibility of the materials, particularly where a proprietary format has been used. Whatever the choice of approach, strategies must be put in place to manage digital formats over the long term, in order to mitigate (or avoid altogether) the problems outlined earlier.
These strategies are grouped under six headings, though most of the strategic elements within are interlinked and few will be successful in the long term if pursued in isolation. While some of these elements are more applicable to the proprietary approach, the strategies are generally valid across all electronic content, regardless of format.
Each of these strategic components may be problematic within organisations - particularly in project-funded environments, where staffing and other technical resources may not be readily available beyond the funded lifespan of a project.
Somewhat ironically, the preservation of digital resources begins with the preservation of staff knowledge and with sound knowledge management practices. Quality documentation is a key component of any preservation strategy, and it is important that information about the technical decisions taken at each stage of the creation, storage and maintenance process is available in the long term, possibly after the staff who had direct knowledge and experience of the process have moved on.
Migration involves ensuring that all electronic content is held in a format which is usable and accessible by current software and hardware, keeping content up to date with the latest developments and guarding against format obsolescence. Where content is stored using a proprietary format, it is particularly desirable to migrate to a suitable open standard format, as and when one becomes available.
Migration is potentially time-consuming and expensive, and could represent a significant drain on organisational resources in the long term, particularly as the need to migrate may depend on the progress of a volatile technology industry. However, these costs must be balanced against the initial investment in content creation and the value of long-term access to the content.
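As an illustration of how the scale of a migration task might be estimated before resources are committed, the following is a minimal sketch in Python rather than a prescribed method. It counts the files in a collection that are held in proprietary formats for which an open target has been identified. The mapping of extensions (Microsoft Office binary formats to OpenDocument equivalents) and the function name are assumptions made for the example; an organisation would substitute its own format policy.

```python
from collections import Counter
from pathlib import Path

# Illustrative mapping only (an assumption of this sketch, not a
# recommendation): proprietary extension -> candidate open extension.
MIGRATION_TARGETS = {".doc": ".odt", ".xls": ".ods", ".ppt": ".odp"}


def migration_candidates(root, targets=MIGRATION_TARGETS):
    """Count files under root, per extension, that have a listed open target."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        suffix = path.suffix.lower()  # treat .DOC and .doc alike
        if path.is_file() and suffix in targets:
            counts[suffix] += 1
    return dict(counts)
```

File extensions are only a rough guide to format, so counts produced in this way are best treated as a first estimate of the work involved.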
Refreshment is the periodic transfer of electronic content to newer storage media (e.g. CD, DVD or DAT tape). This helps to guard against data loss due to media degradation. The timing of refreshment cycles should be informed by manufacturers' information on, and practitioners' experience of, the typical lifespan of the physical media concerned. It is advisable to check a random sample of used storage media on a regular basis - at least annually - to ensure that the physical media remain accessible and the contents remain intact. If problems emerge within the sample, then urgent refreshment action should be taken. A prudent strategy would be to ensure that content is held on at least two types of digital media and in different physical locations.
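The regular sampling check described above can be partly automated. The following Python sketch is one possible approach, not a definitive procedure. It assumes that a manifest of SHA-256 checksums was recorded when the content was first written to the media; the function names and the sample size are hypothetical choices made for the example.

```python
import hashlib
import random
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_sample(manifest, root, sample_size, seed=None):
    """Check a random sample of files against their recorded checksums.

    manifest maps relative file paths to the SHA-256 digest recorded
    when the content was written (an assumption of this sketch).
    Returns the sorted relative paths that are missing, unreadable,
    or whose contents no longer match the recorded digest.
    """
    rng = random.Random(seed)
    names = sorted(manifest)
    sample = rng.sample(names, min(sample_size, len(names)))
    failures = []
    for name in sample:
        try:
            intact = sha256_of(Path(root) / name) == manifest[name]
        except OSError:
            intact = False  # missing or unreadable counts as a failure
        if not intact:
            failures.append(name)
    return sorted(failures)
```

Any path returned by `check_sample` would be a signal to begin the urgent refreshment action described above, and to widen the check beyond the sample.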
In the event of system or media obsolescence, organisations may choose to create or use emulation software, to mimic the behaviour of obsolete hardware and operating systems, and enable use of legacy software. The emergence of a significant market in legacy emulators would seem a real possibility as and when access problems begin to be widespread.
To mitigate the degradation of storage media and access devices, these should be stored and operated in suitable environmental conditions, ideally within the environmental tolerances specified by manufacturers. Storage media should be handled as infrequently as possible, with minimal movement that involves exposing the media to significantly different environmental conditions. Backup media should ideally be stored offsite, as a precaution against disasters that may damage onsite resources.
Digital content is inherently vulnerable to loss or damage from hardware or software faults. Resources must therefore be allocated to the backup and recovery requirements of an organisation. Initial backups should be created at the time a resource is created, with a regular routine implemented so that further backups are created during the lifetime of the resource. The recovery phase must also be considered. Procedures for data recovery should be tested periodically to ensure that data can be restored from backup media, and that the media remain compatible with changes in backup technology.
The huge sums being invested in the creation of electronic content have the potential to create a golden digital heritage for future generations. For this potential to be realised, however, attention must be given at all stages of the content creation, storage and delivery process to the digital content formats being employed, and steps must be taken to actively manage content formats over time, to guard against the dangers of creeping technical obsolescence or long-term degradation of resources.
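A periodic recovery test of the kind described above might end with a comparison of the restored files against the live originals. The Python sketch below shows one way this could be done using the standard library's byte-by-byte file comparison. It assumes that a backup has already been restored to a scratch directory by whatever backup tool is in use; the function name and directory layout are hypothetical.

```python
import filecmp
from pathlib import Path


def verify_restore(original_dir, restored_dir):
    """Compare a test restore against the live originals.

    Returns (missing, different): relative paths absent from the
    restored copy, and paths whose bytes differ from the original.
    """
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    missing, different = [], []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file():
            missing.append(relative.as_posix())
        elif not filecmp.cmp(source, restored, shallow=False):
            # shallow=False forces a comparison of file contents
            different.append(relative.as_posix())
    return missing, different
```

Two empty lists indicate that every original file was restored intact. Files changed since the backup was taken will also be reported as different, so the test is most informative when run against content that is not being edited.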
D'Ambrise, R. (2004), DVD Update: From Double Layers to Blue Lasers, Computer Technology Review, Vol.24 No.5, pp.30-32.
Darlington, J., Finney, A., & Pearce, A. (2003), Domesday Redux: The rescue of the BBC Domesday Project videodisc, Ariadne, No.36, available at http://www.ariadne.ac.uk/issue36/tna/ (accessed 25 May 2005).
Leyden, J. (2004), 200 apps clash with XP SP2, The Register, 17th August 2004, available at http://www.theregister.co.uk/2004/08/17/xp_sp2_glitches/ (accessed 25 May 2005).
New Opportunities Fund. (2004), NOF-Digitise Programme Manual: Digital Preservation, NOF-Digitise Technical Advisory Service, University of Bath, available at http://www.ukoln.ac.uk/nof/support/manual/digital-preservation/ (accessed 25 May 2005).
Nicholson, D. & Macgregor, G. (2003), "NOF-Digi": Putting UK Culture Online, OCLC Systems & Services, Vol.19 No.3, pp.96-99.
QA Focus. (2003), What are Open Standards?, UKOLN, University of Bath, available at http://www.ukoln.ac.uk/qa-focus/documents/briefings/briefing-11/html/ (accessed 25 May 2005).
Riding, A. (2005), France detects a cultural threat in Google, The New York Times, April 11th 2005.
UKOLN. (2003), Technical Guidelines for Digital Content Creation Programmes, Working Draft Version 0.05, UKOLN, University of Bath, available at http://www.minervaeurope.org/structure/workinggroups/servprov/documents/techguid005draft.pdf (accessed 25 May 2005).
Lin, L.S., Ramaiah, C.K. & Wal, P.K. (2003), Problems in the preservation of electronic records, Library Review, Vol.52 No.3, pp.117-125.