Development issues for the Glasgow Digital Library

Alan Dawson

Centre for Digital Library Research, University of Strathclyde, Glasgow G1 1XH

May 2003

alan.dawson@strath.ac.uk

This article appears in the book Computers for librarians: An introduction to the electronic library (Stuart Ferguson with Rodney Hebels, ed. 2003)

Background

The Glasgow Digital Library (GDL) [1] is intended to be a regional, collaborative, distributed digital library, based in Glasgow, the largest city in Scotland. It aims to combine theory and practice to create a digital library that can support teaching, learning, research and public information. The GDL was initially funded as a two-year research project to investigate and report on planning, implementing, operating and evaluating the library. This meant that a digital library service had to be created in order to research its management and operation, although funding was provided only to carry out the research, not create a user service. The paradox was partially resolved by obtaining additional funding for specific digitisation projects, which allowed the library to create its own content. By early 2003 the GDL incorporated six main digital collections, with a total of around 5000 items.

This chapter summarises the main issues encountered in developing the Glasgow Digital Library. A more detailed account is given by the author in the book Digital Libraries: Policy, Planning and Practice (Law & Andrews, eds, 2003).

Library purpose and philosophy

The principles of operation underpinning the GDL helped determine its collection development policy and priorities for implementation. As a regional initiative based in a university, the GDL is a hybrid between public library and higher education library, so its content is not geared to particular courses nor limited to a specific target audience. It has a philosophy of thinking globally before acting locally, aiming to balance the needs of local users with a worldwide and long-term perspective on digital resources, and is therefore committed to using international standards.

Collection development policy

Initially there were three main priorities:

These priorities were incorporated into a brief but important collection development policy agreed by all project partners.

Standards

Choice of standards for the GDL was determined by its international outlook and concern for long-term interoperability. Key standards are Dublin Core and MARC 21 for metadata, LCSH for subject vocabulary, AACR2 for resource descriptions, XHTML and W3C web accessibility guidelines for web pages, TIFF and JPEG for image format, and Z39.50 for cross-searching with other catalogues. These standards are supplemented by editorial guidelines and authority files for place names and resource types. Although local in scope, these are equally important for providing consistency of resource description and assisting information retrieval.

Compliance with these standards was rarely technically difficult but required clear policies, guidelines and disciplined work practices. Adherence to standards was regarded as an inherent part of digital library development rather than a burden to be imposed after content creation. The process was assisted by a coherent information environment and guidelines for content contributors, though central validation and editing were also carried out.

Content creation

Objects do not select themselves for digitisation. Research has to be carried out, items selected for value and relevance, captions written and edited, titles applied or invented. Even with a basic formula of text, images and web pages, decisions have to be made about image resolution, file size and format, character sets and workflow processes.

The GDL carried out some digitisation internally on a simple desktop scanner, some was handled by project partners, and some by a specialist service (including large images and glass slides). Research and content selection was also distributed amongst project staff, partners and external contractors with specialist knowledge. Images were captured and archived at high resolution (300 dpi TIFF) then copied to lower-quality, faster-loading JPEG files for web use. Optical character recognition was carried out where feasible and the results were carefully checked before publication. Digitisation itself was quick (images can be captured in seconds) but the prerequisite selection and the subsequent manipulation and management processes were far more time-consuming.
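
The derivative-creation step can be sketched in a few lines of code. The GDL did not use this script; the example below uses Python with the Pillow imaging library, and the file names, sizes and quality settings are illustrative assumptions only.

import os
from PIL import Image

def make_derivatives(tiff_path, out_dir, web_width=600, thumb_width=150):
    """Create a web-resolution JPEG and a thumbnail from an archival TIFF.
    Widths and quality settings here are illustrative, not the GDL's values."""
    base = os.path.splitext(os.path.basename(tiff_path))[0]
    with Image.open(tiff_path) as img:
        img = img.convert('RGB')  # JPEG cannot hold all TIFF colour modes
        for width, suffix, quality in ((web_width, '', 80),
                                       (thumb_width, '_thumb', 70)):
            scale = width / img.width
            copy = img.resize((width, int(img.height * scale)))
            copy.save(os.path.join(out_dir, base + suffix + '.jpg'),
                      'JPEG', quality=quality)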

The aim of digitisation was to create an accurate copy of the source material, but in practice images were routinely manipulated after creation; they were often resized and cropped, sometimes lightened and sharpened, and thumbnail images sometimes created. This is common practice and does not usually affect authenticity of the material, yet the ease of digital manipulation and enhancement illustrates that the issues of preservation and authenticity are less clear-cut than might be imagined.

Metadata

Inevitably, metadata creation and management was a major issue. The emphasis throughout was on ensuring accuracy, consistency and completeness in key high-value fields, especially title, description, date and subject terms. One major question was whether to catalogue the original item or the digital copy, e.g. was the creator of a booklet the person who wrote it or the organisation responsible for digitising it? However, the biggest issue of all was how to handle the whole metadata creation and management process. Flexibility was achieved by using a database structure that was not tied to any particular metadata scheme. Content extraction programs were written (using Visual Basic) to generate material in different formats for different purposes. For example, separate forename and surname fields in one collection were combined to produce a standard name format for web display but a surname-first format for use in Dublin Core metadata and MARC records.
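
As a rough illustration of this kind of content extraction (the GDL programs themselves were written in Visual Basic), the sketch below shows the two name formats being generated from hypothetical forename and surname values:

def display_name(forename, surname):
    """Natural order for web display, e.g. 'James Watt'."""
    return f"{forename} {surname}".strip()

def index_name(forename, surname):
    """Surname-first order for Dublin Core and MARC, e.g. 'Watt, James'."""
    return f"{surname}, {forename}".strip(', ')

print(display_name('James', 'Watt'))  # James Watt
print(index_name('James', 'Watt'))    # Watt, James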

Classification

Maintaining Dewey Decimal Classification as well as a controlled subject vocabulary was considered but judged unsustainable for the GDL. Library of Congress Subject Headings (LCSH) were therefore used as the primary means of linking diverse collections into a coherent information structure. To overcome the LCSH problem of cultural bias toward North America, it was supplemented by controlled local subject terms where this was considered essential. The GDL subject terms were used in the web interface, with LCSH terms included in metadata where international compatibility was required [2]. Controlled authority files for place names and people names were also used to provide library-wide consistency and an alternative to the collection-centred view. These classifications helped illustrate the scope and scale of collections and influenced topic chunking, as well as making searching more reliable.
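
A minimal sketch of how such a local vocabulary can sit alongside LCSH is shown below: the local term is used for display, while the mapped heading goes into exported metadata. The terms and headings are invented examples, not the GDL's actual authority file.

# Hypothetical mapping from local subject terms to LCSH headings;
# the actual GDL vocabulary and headings would differ.
LOCAL_TO_LCSH = {
    'Glasgow shipbuilding': 'Shipbuilding--Scotland--Glasgow',
    'Clyde steamers': 'Steamboats--Scotland',
}

def subject_terms(local_term):
    """Return the local term for the web interface and the LCSH heading
    (falling back to the local term) for exported metadata."""
    return local_term, LOCAL_TO_LCSH.get(local_term, local_term)

display_term, lcsh_term = subject_terms('Clyde steamers')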

Development methodology

A robust system for organising diverse collections of digital objects is essential for any heterogeneous digital library, with mechanisms for handling multiple contributors, collections, file formats and access methods, as well as controlling updating procedures.

The initial solution adopted for the GDL was technically straightforward but enabled rapid development. Microsoft Access databases were required as a condition of funding for four of the first six collections, so Access was chosen as the primary content and metadata repository for all six collections. Library content and metadata were then generated from Access in different formats for different purposes, e.g. web pages, Dublin Core metadata and MARC 21 records [3]. This was achieved by adding HTML markup and MARC tags to the database and using Visual Basic programs to automate the integration and export of content, markup and metadata.
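
The export step might look something like the following sketch, which uses Python and SQLite purely to illustrate the approach (the GDL itself used Visual Basic against Access, and the table, column and metadata fields shown are assumptions):

import sqlite3  # stands in here for the Access databases the GDL actually used

PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>{title}</title>
  <meta name="DC.title" content="{title}" />
  <meta name="DC.description" content="{description}" />
  <meta name="DC.date" content="{date}" />
</head>
<body>
  <h1>{title}</h1>
  <p>{description}</p>
</body>
</html>
"""

def export_pages(db_path, out_dir):
    """Write one XHTML page, with embedded Dublin Core metadata, per item
    record (table and column names are invented for this sketch)."""
    conn = sqlite3.connect(db_path)
    for item_id, title, description, date in conn.execute(
            "SELECT item_id, title, description, date FROM items"):
        with open(f"{out_dir}/{item_id}.html", 'w', encoding='utf-8') as out:
            out.write(PAGE.format(title=title, description=description, date=date))
    conn.close()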

In effect this methodology created a modular, flexible, inexpensive content management system using common desktop software. The modular approach enabled content creation to be readily distributed amongst contributors, facilitated re-use of metadata, and allowed additional collections to be plugged in to the library relatively quickly. This approach was effective but required a consistent item naming scheme for all objects in the library and a disciplined approach from those involved.
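
A naming scheme of this kind can be enforced with a simple validation script. The pattern below is a hypothetical example, not the GDL's actual scheme:

import re

# Hypothetical scheme: three-letter collection code, four-digit item number,
# optional derivative suffix, e.g. 'spr0042.jpg' or 'spr0042_thumb.jpg'.
ITEM_NAME = re.compile(r'^[a-z]{3}\d{4}(_thumb)?\.(tif|jpg|htm)$')

def invalid_names(filenames):
    """Return the filenames that break the naming scheme."""
    return [name for name in filenames if not ITEM_NAME.match(name)]

print(invalid_names(['spr0042.jpg', 'spr0042_thumb.jpg', 'Photo 42.JPG']))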

Interface design

There were five main requirements for the GDL web interface: consistency, flexibility, scalability, accessibility and feasibility. Consistency meant devising a design template that could provide visual coherence across all collections without imposing blanket uniformity. Flexibility meant enabling users to access library content in different ways, across as well as within collections. Scalability meant creating an interface that would work with only three or four collections yet be able to cope with dozens or hundreds. Accessibility meant meeting requirements of funding bodies and standard web accessibility guidelines [4]. Feasibility meant creating something quickly and inexpensively.

The solution was to make a virtue out of necessity and go for simplicity. Content of the library was judged inherently interesting, with plenty of striking images, so all that was needed was a clear set of labels and some example images to illustrate each collection, together with options for navigating the library as a whole, by place or subject, as an alternative to the collection-centred view. Cascading stylesheets were used throughout to provide flexibility and design consistency. The result was minimalist but acceptable and practicable.

Searching and browsing

The GDL aims to offer several search options (across the entire library, within a single collection, or within a single field) as well as options to cross-search the GDL with other digital libraries and library catalogues. However, web users have become used to the simplest possible search interface, and there are real difficulties in offering complex search options, and summarising the meaning of these options, via a simple user interface.

In the short term, priority was given to creating a browseable interface to each collection, offering flexible access to content. Search options are being added one collection at a time, using different software solutions for different collections [5]. This is not ideal for a user service, but from a research perspective it is useful to investigate and understand different search mechanisms. The longer term aim is to use intelligent scripting to provide cascading search facilities, i.e. to search highest-value fields first (titles and subject terms), then other metadata fields (if no matches are found), then full-text searching or cross-searching.
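
The cascading idea can be expressed as a short function that tries the highest-value fields first and only falls back to broader fields when nothing matches. The field names below are assumptions for the purpose of illustration:

def cascading_search(query, records):
    """Search successively broader groups of fields, returning matches from
    the first group that yields any (field names are illustrative)."""
    field_groups = [
        ['title', 'subject'],        # highest-value fields first
        ['description', 'creator'],  # then other metadata
        ['fulltext'],                # full text as a last resort
    ]
    q = query.lower()
    for fields in field_groups:
        matches = [r for r in records
                   if any(q in r.get(f, '').lower() for f in fields)]
        if matches:
            return matches
    return []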

Learning materials

In the absence of time or funding to spend on authoring learning materials, the GDL approach has been to semi-automate the creation of educational resources. This was achieved, to an extent, by drawing on the values of controlled metadata fields for people, organisations and dates to automatically generate illustrated indexes that link related items and draw out themes, timelines and inter-relationships between objects. Better results could have been obtained by additional research and hand-crafting of materials, but this was simply not possible with the resources available. The automated process was feasible, and the added value created was judged to be worth the effort involved. It is also satisfying to see metadata being used to worthwhile effect. All the library contents can still be accessed as independent digital objects and assembled in different ways for different purposes.
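
In outline, the semi-automation amounts to grouping items by the values of controlled fields and writing out an index for each value. The sketch below, with invented records and field names, shows the grouping step:

from collections import defaultdict

def build_index(items, field):
    """Group item titles by the value of one controlled metadata field."""
    index = defaultdict(list)
    for item in items:
        value = item.get(field)
        if value:
            index[value].append(item['title'])
    return dict(index)

# Invented records standing in for rows drawn from the collection databases.
items = [
    {'title': 'Portrait of James Watt', 'person': 'James Watt', 'date': '1792'},
    {'title': 'Watt memorial lecture', 'person': 'James Watt', 'date': '1824'},
]
people_index = build_index(items, 'person')  # one index page per person
timeline = build_index(items, 'date')        # one timeline entry per year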

Copyright

Copyright restrictions caused inconvenience for the GDL rather than major problems, as much material is historical and out of copyright. In other cases standard practice was followed: permission was requested from copyright holders before any copyright material was digitised. The main implementation decision was to proceed anyway in those few cases where the copyright holder could not be traced.

Preservation

A complementary principle to 'think globally, act locally' is 'think long-term, act short-term'. The GDL aims to create, describe and manage content with an indefinite life-span, including historical material that may still be of interest in hundreds of years' time. There are three main categories of concern: physical storage media (disks, CDs, tapes etc), content format (relational databases, Word documents, web pages, image files etc) and information structures (MARC, DC, LCSH etc). For content and metadata, the solutions are in structures and standards. Textual content held in a consistent manner in a structured database can always be exported to another format. Storage media are more of a problem, as there is no knowing what technical developments will take place in future, so the policy is to keep master copies of data on two types of media (CD and disk) while being prepared to migrate in future.

Conclusions

Digital library developers can only do what is possible with the resources available. The solutions adopted will vary from case to case, but a clear understanding of the issues involved and techniques available can help maximise value obtained from limited resources. Few of the solutions adopted for the GDL to date have been technologically innovative, but they were achievable, they suited its purpose and philosophy, and they sustained its potential for long-term scalability and interoperability with other digital libraries.

Notes

  1. The Glasgow Digital Library is available at http://gdl.cdlr.strath.ac.uk/
  2. The CDLR subscribes to the OCLC Connexion service and is evaluating its use for collaborative cataloguing and other purposes. http://www.oclc.org/connexion/
  3. MARC records were created in text format and converted to machine-readable MARC format using MarcEdit software. http://www.onid.orst.edu/~reeset/marcedit/
  4. W3C Web Content Accessibility Guidelines. http://www.w3.org/TR/WAI-WEBCONTENT/
  5. The ease of importing Access databases into SQL Server made this the simplest solution to implement, with ASP or Cold Fusion scripts controlling the interface and search requests.