Twenty issues in e-book creation

Alan Dawson and Jake Wallis

Centre for Digital Library Research, University of Strathclyde, Glasgow G1 1XH

January 2005

alan.dawson@strath.ac.uk

This article appears in Against The Grain Volume 17, Number 1, February 2005

Abstract

Discussion of e-books in libraries and universities covers a range of issues such as selection, purchasing, licensing, management and user support. One topic that rarely arises is that of e-book creation. Most librarians see themselves as consumers rather than creators of e-books. Yet libraries have a collective treasure trove of books and other materials that may be suitable for digitization. This article summarises numerous issues that arise when creating e-books and publishing them in open-access XHTML format on the web, and describes policies that can help resolve these issues.

Background

Models of commercial e-book distribution cause problems for many potential purchasers. One reason for the complexity of e-books as products is the extent to which they unsettle traditional certainties within the publishing and library worlds about the nature of the book (Dillon, 2001). Commercial distributors and technology companies jostle to get a stranglehold on the market by developing proprietary delivery formats. The publishing industry's approach to online distribution focuses on controlling access, aiming to ensure that profit compares favourably with investment. The attitudes of libraries inversely reflect those of publishers, with concerns over restrictive access models and value for money, while sharing publishers' uncertainty regarding the level of demand. The intensely commercial environment in which e-books are provided to libraries can distract from the treasures that many libraries already hold within their collections, and which could potentially be converted to e-books.

A literature review of the issues associated with e-books provides reams of material on licensing, digital rights management, commercial e-book publishing and e-book usability, but very little on the process of e-book creation. Similarly, there is no shortage of published guidance about digitization (e.g. Morrison, Popham & Wikander, 2000; Liu, 2004). Although this guidance mentions relevant protocols and standards for metadata, archiving, interoperability etc., it fails to give the practical detail required by those wishing to create accessible e-books.

As the information environment continues to evolve, the trend is for users to focus on ease of retrieval, and to search for aggregated material from diverse sources, freely available in digital format. Electronic publication can add value to resources (Lonsdale & Armstrong, 2000); however, access controls and proprietary formats can inhibit the uptake of e-books and their embedding in virtual learning environments. An over-emphasis on the printed book as metaphor in constructing e-books (Diaz, 2003), and on traditional publishing mechanisms in their delivery, continues to limit the extent of their functionality and benefits to users.

Given the complexity of the commercial information environment, there is a refreshing simplicity to the mission statement of Project Gutenberg (http://www.gutenberg.org/) - 'To encourage the creation and distribution of e-books'. Project Gutenberg is a collection of more than 13,000 classic works of literature, freely available in digital form and produced by volunteers. Its model and motivation for creation and distribution are quite different from those of commercial publishers.

An initiative based at the University of Strathclyde's Centre for Digital Library Research (CDLR) is developing methods for producing e-books more akin to Project Gutenberg than commercial models, whilst using its expertise to add value in ways that a voluntary collaboration such as Project Gutenberg cannot. This e-book project is helping to prevent the marginalisation, and assist in the preservation, of valuable material available in Glasgow's library collections, and to publish it via the Glasgow Digital Library (http://gdl.cdlr.strath.ac.uk/). Despite difficulties of sustainability, the initiative manages to side-step many of the problems related to the commercial distribution of e-books by making content freely available over the web. However, this creates a new set of problems that need to be addressed.

E-book formats

One good reason why libraries are wary of e-books is the variety of proprietary formats used. This makes their management and support difficult, and is alien to the principles of universal access that underpin the web. It is therefore not surprising that librarians may welcome e-books provided in portable document format (PDF). Although this is also proprietary, it is widely used for ejournals and other forms of online publishing, so will be familiar to most web users. PDF preserves all formatting, and is useful for printing, so some may regard it as an ideal format for e-book publishing. However, PDF has many disadvantages too, which are often overlooked but worth summarising:

PDF is a convenient means of digitally distributing documents designed to be printed, but has few advantages over other proprietary e-book or document formats such as Microsoft Reader or Microsoft Word.

The obvious question that arises is why PDF is so widely used for e-books and other documents in preference to the basic web document format - HTML - which has none of these drawbacks. For commercial publishers, the restrictions inherent in PDF are appealing, allowing them to retain tight control over the appearance and re-use of their products. However, for non-commercial publications, PDF may be chosen simply because it is perceived to be easier to produce. While this is arguable, the fact is that use of HTML introduces numerous issues that do not arise with PDF. Rather than deal with these issues, in order to provide a more flexible and usable format, many content providers default to offering PDF despite its disadvantages for users.

The remainder of this article summarises some issues that have been encountered in producing e-books in XHTML format (XML-compliant HTML), and the policies adopted by CDLR to resolve them. The aim is not to be prescriptive but to illustrate the choices made in specific circumstances, in the belief that this level of detail will be useful to others planning similar initiatives. All the issues mentioned below have arisen during digitization of just six substantial books.

Digitization issues

1. Preservation vs accessibility

Issue: The demands of digital preservation and user accessibility are not incompatible but involve different priorities and may require compromises.

Policy: Priority is given to making the content freely available, easily usable and readily searchable via open-access standards, hence the choice of XHTML for publication format. The page design of the printed books is not transferred to digital format (other than the cover or title page), but the original text and structure are preserved with a high degree of accuracy, and presented in accordance with guidelines for electronic textbook design (Wilson & Landoni, 2002) and accessibility (W3C, 1999/2004).

2. Equipment selection

Issue: Accessible e-book creation requires digitising a printed book using an effective but non-destructive process (unless the book is already held in digital form).

Policy: A standard flatbed desktop scanner is used for relatively small books that are not noticeably damaged by being fully opened at each page. A digital camera is used for large or valuable books that may be damaged by repeated scanning. If digital preservation of images is considered important, and the book is not suitable for flatbed scanning, a specialist agency is used for capturing images at high resolution, though this significantly increases digitization costs.

3. Capturing text and images

Issue: An efficient procedure is required to minimise scanning or photography costs, but different settings may be necessary for text and images.

Policy: If a digital camera is used then a single photograph of each page will sometimes suffice. The resulting image file can be interpreted by specialist software (such as ABBYY FineReader) to create machine-readable text, with any pictures being ignored. The same image file can be cropped by image editing software (such as Paint Shop Pro) to remove text. However, better image quality is possible by taking a close-up of the image only. If a scanner is used then two passes are usually required: one to capture the image and one to capture and interpret the text.

4. Object naming

Issue: A coherent system is required for managing and publishing image files, documents and web pages.

Policy: For each e-book a convention for file naming is defined and applied consistently to all component files, for ease of identification, cross-referencing, and generation of persistent digital object identifiers. This helps enable the automated creation of e-books with embedded images.
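
A naming convention of this kind can be sketched in a few lines of Python. The pattern shown (book identifier, object kind, four-digit sequence number) and the function names are hypothetical illustrations, not the convention actually used by CDLR; 'keacam' is borrowed from the URL of one of the published e-books purely as a sample identifier.

```python
import re

# Hypothetical object kinds: scanned page, illustration, published section.
KINDS = ("page", "img", "sec")
NAME_PATTERN = re.compile(r"([a-z0-9]+)-(page|img|sec)-(\d{4})\.([a-z0-9]+)")


def object_name(book_id: str, kind: str, number: int, ext: str) -> str:
    """Build a file name such as 'keacam-img-0012.jpg' (assumed pattern)."""
    if not re.fullmatch(r"[a-z0-9]+", book_id):
        raise ValueError("book_id must be lower-case letters and digits only")
    if kind not in KINDS:
        raise ValueError(f"unknown object kind: {kind}")
    # Zero-padding keeps files in reading order when sorted by name.
    return f"{book_id}-{kind}-{number:04d}.{ext}"


def parse_object_name(name: str) -> tuple[str, str, int, str]:
    """Split a file name back into (book_id, kind, number, ext)."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name}")
    return match.group(1), match.group(2), int(match.group(3)), match.group(4)
```

Because every name can be both generated and parsed, a production script can derive the image file for a given section, or check a directory for files that break the convention, without manual lookup.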

Text management issues

5. Text file format

Issue: Most optical character reading (OCR) software attempts to interpret formatting detail such as lists, tables, bold and italic text, superscripts etc, and to save this formatting information along with the text in a rich format such as RTF or Word. Although superficially useful, this formatting can be counter-productive, as it is prone to error, and does not directly translate to formatting via HTML markup.

Policy: In most cases the results of OCR are saved in plain text files, with any formatting produced during OCR deliberately discarded. Structures such as lists and tables are later reproduced using styles in Word documents, from where they may be precisely converted to XHTML. The only formatting that is sometimes preserved during digitization is bold, italic and underlined text. Though rarely used in older books, this formatting can be accurately converted to Word and then XHTML markup.

6. Proof reading

Issue: All OCR software is prone to error.

Policy: All text is read and corrected by a specialist proof-reader. This is by far the most time-consuming step in the e-book creation process, but is regarded as essential for producing credible and high-quality e-books. In order to avoid repeated handling of large and valuable books, image files of the text are sometimes printed and used as a surrogate original for checking purposes.

7. Error correction

Issue: Most printed books contain some spelling or typesetting mistakes, factual errors, misleading punctuation, or other forms of error, which it is possible to correct in the digital version.

Policy: This is an issue where compromise is required between preservation and functionality. Limited error correction is considered justifiable as part of the process of producing useful machine-readable e-books. Indisputable spelling or typographical errors are corrected, while apparent factual errors are reproduced unchanged. This policy is publicised along with the text. If a book includes an errata page or slip, the changes specified are applied to the digitised text, the errata retained, and a note inserted explaining that the errata are no longer applicable.

8. Symbols and character sets

Issue: Many books contain symbols or foreign-language characters not found on English-language keyboards, whose inclusion can detract from text searchability.

Policy: Where characters in ordinary words can easily be represented by keyboard characters, e.g. the diphthong æ can easily be typed as ae, then the plain text version is used. However, in place names, personal names and other proper nouns, the original form of the word is retained and represented by using HTML entities. Although this makes the names difficult to search for, this is usually compensated for by including the names in an index.
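
The two treatments can be illustrated with a short Python sketch. The substitution table and the function names are assumptions made for the example; only html.entities.codepoint2name is a real standard-library table (of HTML 4 entity names).

```python
from html.entities import codepoint2name

# Hypothetical table of keyboard substitutions for ordinary words.
PLAIN_FORMS = {"æ": "ae", "Æ": "Ae", "œ": "oe", "Œ": "Oe"}


def plain_text(word: str) -> str:
    """Ordinary words: replace ligatures with their keyboard spelling."""
    return "".join(PLAIN_FORMS.get(ch, ch) for ch in word)


def with_entities(name: str) -> str:
    """Proper nouns: keep the original characters, written as HTML entities."""
    out = []
    for ch in name:
        code = ord(ch)
        if code < 128 and ch not in '&<>"':
            out.append(ch)                           # plain ASCII passes through
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")  # named entity, e.g. &AElig;
        else:
            out.append(f"&#{code};")                 # numeric fallback
    return "".join(out)
```

With these definitions, plain_text("mediæval") gives "mediaeval", while with_entities("Ælfric") gives "&AElig;lfric"; both words are invented samples rather than examples from the digitised books.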

9. Typefaces and typographic conventions

Issue: Books may use typefaces that are not available to web users. Many older books use typesetting conventions that do not translate well to online publication, e.g. upper-case headings, full stops after headings, and open quotation marks at the start of every line within long quotations.

Policy: Any use of bold, italic or underlined text is retained in the e-book. However, the font used for printing is not retained. External stylesheets are used to specify fonts currently regarded as suitable for online usage, such as Trebuchet or Verdana. Typographic conventions that were common in older books are regarded as artefacts of publishing that do not need to be preserved in e-books that use more modern formatting conventions. It is therefore considered acceptable to change the case of headings, punctuation in headings, and the appearance of quotations, provided the text itself is unchanged.

Image management issues

10. Image file format

Issue: A compromise is required between image quality, file size and usability.

Policy: JPEG format is considered adequate for e-book publication, but if digital preservation is important as well as online publication, images are archived as lossless TIFF files and converted to JPEG for publication. A resolution of 300 dots per inch is often used for preservation, but lower resolutions of 150dpi or 200dpi are adequate for some types of image, e.g. line drawings.
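
The effect of resolution on file size is simple arithmetic, sketched here in Python. The 5 x 8 inch page and the 3 bytes per pixel (24-bit colour) are assumptions chosen for illustration, and the figures are for uncompressed image data, so they apply to uncompressed archival masters rather than to JPEG files.

```python
def pixel_dimensions(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel width and height of a scan of the given physical size."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_px: int, height_px: int, bytes_per_pixel: int = 3) -> int:
    """Size of the raw image data, before any compression is applied."""
    return width_px * height_px * bytes_per_pixel


# Assumed 5 x 8 inch page, 24-bit colour.
for dpi in (300, 200, 150):
    w, h = pixel_dimensions(5, 8, dpi)
    print(dpi, "dpi:", w, "x", h, "pixels,", uncompressed_bytes(w, h), "bytes")
```

Halving the resolution from 300dpi to 150dpi quarters the amount of data (10,800,000 bytes against 2,700,000 bytes for the assumed page), which is why the lower resolutions are attractive where image detail permits.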

11. Image captions

Issue: Images in books may or may not have captions.

Policy: Any existing captions are faithfully reproduced, though not necessarily in exactly the same position in relation to the image, and are used as the alt text for <img> tags. If no printed caption exists, one is created for use as alt text, to meet accessibility criteria, but is not displayed on the published page. Lists of captions may be used to create a linked list of illustrations.
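
As a minimal sketch of this rule in Python: the function name, the 'caption' class and the sample file name are hypothetical, and the markup is only one plausible way of expressing the policy in XHTML.

```python
from html import escape


def figure_markup(src: str, caption: str, printed: bool) -> str:
    """Return XHTML for an image.

    The caption always becomes the alt text; it is displayed on the page
    only if it was printed in the original book.
    """
    img = f'<img src="{escape(src)}" alt="{escape(caption)}" />'
    if printed:
        # 'caption' is an assumed class name for styling via a stylesheet.
        return f'{img}\n<p class="caption">{escape(caption)}</p>'
    return img
```

Calling figure_markup with printed=False yields the <img> element alone, so a caption written by the digitisation team reaches screen readers without appearing to be part of the original book.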

12. Image positioning

Issue: Images in books may be printed on a different page or section to the referring text.

Policy: Images can be moved. It makes more sense to include an image in the appropriate place in an e-book than to faithfully reproduce its original position in the book.

Publication issues

13. Content structuring

Issue: Books may or may not be conveniently divided into chapters and chunks of a suitable size for online access (Wilson & Landoni, 2002).

Policy: Wherever possible, the natural structure of a book (chapters, sections and paragraphs, but not pages) is used to determine the structure of the corresponding e-book. Similarly, structures such as lists, tables, notes and quotations are retained in the e-book. However, in order to produce web pages of reasonable length, e-books may have a finer structure than the original, with additional section or paragraph breaks inserted if necessary. If a book has a running heading, applying to one or more pages, this may be adopted to function as a section heading. New section headings may be inserted as a last resort if existing ones are misleading, and should be identified to users as a component of the e-book rather than the original book. In some books an extensive index helps compensate for inadequate section headings.

14. Contents pages

Issue: Printed contents pages may include different chapter or section headings from those used in the body of the book.

Policy: The methodology for e-book creation involves automatic generation of contents pages that link to the relevant headings used within the book. This ensures consistency and functionality, but may mean that discrepancies in the original contents pages are lost.
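
The idea of generating the contents page from the headings actually used can be sketched as follows. This is an illustration in Python under assumed inputs (a list of file name and heading pairs held in reading order), not the database-driven method used by CDLR.

```python
from html import escape


def contents_page(sections: list[tuple[str, str]]) -> str:
    """Build a linked contents list from (file name, heading) pairs.

    The headings are taken from the sections themselves, so the contents
    page cannot disagree with the body of the e-book.
    """
    items = [
        f'  <li><a href="{escape(href)}">{escape(heading)}</a></li>'
        for href, heading in sections
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

The same list of pairs could also drive next and previous links between sections, which is one benefit of holding the structure in one place.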

15. Notes

Issue: Many books have footnotes whose location is dependent on specific pagination.

Policy: It is more important for notes to be associated with the referring text than to appear in exactly the same position, or with the same number, as in the original book. Footnotes are therefore converted to endnotes, appearing at the end of e-book sections. If necessary the symbol used to identify notes is changed to a unique character, to enable automatic linking from note to reference. Notes may also be renumbered to ensure uniqueness.
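
The conversion can be sketched in Python under one loudly stated assumption: that the proof-read text carries each note inline, at the point of reference, inside a hypothetical [[...]] marker. That marker, the function name and the id scheme are inventions for the example, not the CDLR production format.

```python
import re
from html import escape

NOTE = re.compile(r"\[\[(.+?)\]\]")  # hypothetical inline note marker


def to_endnotes(section_id: str, text: str) -> tuple[str, str]:
    """Replace inline notes with numbered links and collect them as endnotes."""
    notes: list[str] = []

    def link(match: re.Match) -> str:
        notes.append(match.group(1))
        n = len(notes)  # notes are renumbered from 1 within each section
        return (f'<sup><a id="{section_id}-ref{n}" '
                f'href="#{section_id}-note{n}">{n}</a></sup>')

    body = NOTE.sub(link, text)
    items = [
        f'<li id="{section_id}-note{n}">{escape(note)} '
        f'<a href="#{section_id}-ref{n}">back</a></li>'
        for n, note in enumerate(notes, start=1)
    ]
    endnotes = "<ol>\n" + "\n".join(items) + "\n</ol>" if items else ""
    return body, endnotes
```

Prefixing every id with the section identifier keeps the anchors unique, which is what allows links to run automatically in both directions between note and reference.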

16. Cross-references

Issue: Books may include internal cross-references, to a page or image or section in the same book, or external references to other publications.

Policy: Links are a fundamental function of the web, and e-books should include them where appropriate so that they function properly, even if this requires making minor changes to wording of the text. References to page numbers may therefore be reworded to refer to sections in order to work effectively as links. Similarly, text references to images 'opposite' or 'on the next page' may be changed to read 'above' or 'below' etc, so that the text makes sense in an online context.

Cataloguing and indexing issues

17. Title tags and metadata granularity

Issue: In order for e-book sections to be meaningfully retrieved via search engines, each web page requires a title tag, and perhaps other metadata, that varies from page to page yet retains its identity as part of a book.

Policy: The database-driven methodology used for e-book creation ensures that all e-book sections have customised title tags, including the book title and the section heading. This has proven extremely effective in assisting retrieval via search engines (Dawson, 2004).
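
A title tag of this kind could be assembled as in the Python sketch below. The order of the two parts and the colon separator are assumptions; the source states only that both the book title and the section heading are included.

```python
from html import escape


def page_title(book_title: str, section_heading: str) -> str:
    """Combine book title and section heading into one XHTML title element."""
    book = escape(book_title, quote=False)
    section = escape(section_heading, quote=False)
    return f"<title>{book}: {section}</title>"
```

Every page produced this way has a distinct title, yet all pages of one book share a common prefix that identifies them in a list of search results.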

18. Names and authority files

Issue: The names used to refer to people and places change over time, so that many proper names in older books differ from the forms of name in current usage. This poses problems for indexing and searching. For example, islands of the St Kilda group currently known as Boreray, Soay and Stac Lee are referred to in one e-book as Borrera, Soa and Stack Lee (Kearton, 1897).

Policy: The forms of name given in the original book are left unchanged in the e-book. The use of extensive indexes alleviates the problem to an extent, but further research is continuing into how to apply the concepts of established names and authority files to e-book collections.
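
To illustrate the general concept of an authority file (not a method adopted by CDLR, whose research on this point was still in progress), the following Python sketch expands a search term to cover the established name and its variants. The data structure and function name are assumptions; the names are those of the St Kilda example above.

```python
# Established form of each name, followed by variant forms found in older books.
AUTHORITY = {
    "Boreray": ["Borrera"],
    "Soay": ["Soa"],
    "Stac Lee": ["Stack Lee"],
}


def search_terms(query: str) -> list[str]:
    """Return every known form of a name, or the query itself if unknown."""
    wanted = query.lower()
    for established, variants in AUTHORITY.items():
        forms = [established, *variants]
        if wanted in (form.lower() for form in forms):
            return forms
    return [query]
```

A search for either 'Soay' or 'Soa' would then be run against both forms, so the text of the e-book can stay unchanged while remaining findable under the current name.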

19. Summaries

Issue: Funding was received to write summary biographies for one historical e-book (Maclehose, 1886). These summaries are useful to users but do not form part of the original text.

Policy: The summaries are included before each chapter, but displayed in a different font and style from the main text. An explanatory note is included to make clear these were not part of the original book.

20. Indexes

Issue: Book indexes that refer to page numbers do not translate well to e-books that use sections as the basic structure. Simply reproducing existing indexes is of limited value and does not take advantage of the e-book publication medium.

Policy: The CDLR aims to preserve the 'access richness' of the original texts (Diaz, 2003) by digitising and fully inter-linking their back-of-the-book indexes, which are regarded as useful and significant components of the original works. The original index entries are retained but are enhanced by converting page number references into active links to relevant web pages and paragraphs. Research is continuing into methods for automating this process and for creating aggregated indexes from related works. This exploration may offer significant scholarly benefit, as an aggregated index of several texts can assist fresh comparative, analytical and structural perspectives on the content, facilitating interpretations not previously available.
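
One way such a conversion might be automated is sketched below in Python. It assumes a hypothetical table recording the first printed page of each e-book section, and it links only to the section page rather than to an individual paragraph; the table, file names and sample index entry are all invented for the example.

```python
import re
from bisect import bisect_right

# Hypothetical table: first printed page of each section, in ascending order.
SECTION_STARTS = [(1, "sec-01.htm"), (15, "sec-02.htm"), (42, "sec-03.htm")]
FIRST_PAGES = [first for first, _ in SECTION_STARTS]


def section_for_page(page: int) -> str:
    """Find the section file whose page range contains the printed page."""
    i = bisect_right(FIRST_PAGES, page) - 1
    if i < 0:
        raise ValueError(f"page {page} precedes the first section")
    return SECTION_STARTS[i][1]


def link_index_entry(entry: str) -> str:
    """Wrap each page number in an index entry in a link to its section."""
    def link(match: re.Match) -> str:
        page = match.group()
        return f'<a href="{section_for_page(int(page))}">{page}</a>'

    return re.sub(r"\b\d+\b", link, entry)
```

The printed page numbers are kept as the visible link text, so the original index entries survive intact. Index headings that themselves contain digits, such as dates, would need separate handling, which is one reason a sketch like this falls short of full automation.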

Further issues

Various other issues arise during publication where decisions need to be made on pragmatic and aesthetic grounds rather than as matters of policy, for example concerning page design, navigation, use of stylesheets, retention of page numbers, and making e-books searchable (individually and collectively). There are also many issues concerning the automation of production methodology and the creation of indexes which have not been covered here but will be described in subsequent publications.

Principles for e-book creation

The policies described above were not specified in advance but were defined as necessary during the production process in an attempt to balance the requirements of online usability and digital preservation. The policies are not arbitrary but are based on the following underlying principle:

The original substantive text must be retained unaltered. However, minor changes may be made to structural elements of the text that are a product of printing in book format, rather than inherent in the original work, in order to enhance the value of publication in e-book format.

Implementation of this principle leads to the following set of rules:

The policies adopted and described above are intended to reflect these rules and principles.

Conclusions

Experience of creating usable e-books in non-proprietary format for publication via the Glasgow Digital Library has shown that e-book creation is more complex than might be expected. The mechanics of publication can be simplified and automated by adopting an efficient methodology. While Google has started working with major libraries to carry out automated digitization of their collections on a large scale (Google, 2004), there is great scope for smaller libraries to digitise historical materials of local relevance not available elsewhere. In doing so, numerous policy decisions need to be made regardless of the production process, but these can be simplified if they are based on an underlying principle that reflects an appropriate compromise between digital preservation and usability. Creating open-access searchable e-books does require more thought and effort than creating e-books in PDF or image format, but the resulting publications are far more flexible and useful, are more easily retrieved via search engines, and better meet the needs and expectations of most web users.

References

Dawson, A. Creating metadata that work for digital libraries and Google. Library Review. Vol 53 (7), 2004. pp.347-350. Available URL http://cdlr.strath.ac.uk/pubs/dawsona/ad200402.htm (checked 15 December 2004)

Diaz, P. Usability of hypermedia educational e-books. D-Lib Magazine. Vol 9 (3), 2003. Available URL http://www.dlib.org/dlib/march03/diaz/03diaz.html (checked 22 November 2004).

Dillon, D. E-books: the University of Texas experience, part 2. Library Hi Tech. Vol 19 (4), 2001. pp.350-362.

Ferris, P. F. The effects of computers on traditional writing. Journal of Electronic Publishing. Vol 8 (1), 2002. Available URL http://www.press.umich.edu/jep/08-01/ferris.html (checked 22 November 2004).

Google Inc. Google checks out library books. Google press release, 14 Dec 2004. Available URL http://www.google.com/press/pressrel/print_library.html (checked 15 December 2004).

Kearton, R. With nature and a camera: being the adventures and observations of a field naturalist and an animal photographer. 1898. Available URL http://gdl.cdlr.strath.ac.uk/keacam/ (checked 14 December 2004).

Liu, Y. Q. Best practices, standards and techniques for digitizing library materials: a snapshot of library digitization practices in the USA. Online Information Review. Vol 28 (5), 2004.

Lonsdale, R.E. and Armstrong, C.J. New perspectives in electronic publishing: an investigation into the publishing of electronic scholarly monographs. Program. Vol 34 (1), 2000. pp.29-41.

Maclehose, J. Memoirs and portraits of one hundred Glasgow men who have died during the last thirty years and in their lives did much to make the city what it now is. Second edition, 1886. Available URL http://gdl.cdlr.strath.ac.uk/100men/ (checked 14 December 2004).

Morrison, A., Popham, M. and Wikander, K. Creating and documenting electronic texts: a guide to good practice. Arts and Humanities Data Service, 2000. Available URL http://ota.ahds.ac.uk/documents/creating/ (checked 13 December 2004).

Senserini, A. et al. Archiving and accessing web pages. D-Lib Magazine. Vol 10 (11), 2004. Available URL http://www.dlib.org/dlib/november04/hodge/11hodge.html (checked 14 December 2004).

W3C. Web accessibility initiative (1999-2004). Available URL http://www.w3.org/WAI/ (checked 14 December 2004).

Wilson, R. and Landoni, M. Electronic textbook design guidelines, Eboni project, 2002. Available URL http://ebooks.strath.ac.uk/eboni/guidelines/