32.

Conversion of Primary Sources

Marilyn Deegan and Simon Tanner

Material Types

The primary source materials for humanists come in many data forms, and the digital capture of that data needs to be carefully considered in relation to both its format and the substrate upon which it is carried. It is difficult to list these comprehensively, but it is fair to say that anything that can be photographed can be digitized, and that some materials can be photographed or digitized with more fidelity than others. This point matters because many digitization projects aim to present some kind of "true" representation of the original, and some materials can get closer to this goal than others.

The materials of interest to humanists are likely to include the following, and there are probably many more examples that could be given. However, this list covers a sufficiently broad area as to encompass most of the problems that might arise in digitization.

Documents

This category covers a huge variety of materials, from all epochs of history, and includes both manuscript and printed artifacts. Manuscripts can be from any period, in any language, and written on a huge variety of surfaces: paper, parchment, birch bark, papyrus, lead tablets, wood, stone, etc. They will almost certainly have script or character set issues, and so may require special software for display or analysis. They may be music manuscripts, which raise issues of notation and the representation of a whole range of special characters; or a series of personal letters, which, being loose-leafed, pose their own organizational problems. Manuscripts may be composite, containing a number of different (often unrelated) texts, and there may be difficulties with bindings. They may be very large or very small, which has implications for scanning mechanisms and resolutions. With handwritten texts there can be no automated recognition of characters (although some interesting work is being done in this area by the developers of palm computers and tablet PCs; it remains to be seen whether handwriting recognition developed for data input will feed into the recognition of scripts on manuscript materials of the past), and they may also contain visual images or illuminated letters. The documents that humanists might want to digitize also include printed works from the last 500 years, which come in a huge variety: books, journals, newspapers, posters, letters, typescript, gray literature, musical scores, ephemera, advertisements, and many other printed sources. Earlier materials and incunabula will have many of the same problems as manuscripts, and they may be written on paper or parchment. With printed materials there can be a wide range of font and typesetting issues, and there is often illustrative content to deal with as well. Printed materials also come in a huge range of sizes, from posters to postage stamps.

Visual materials

These may be on many different kinds of substrates: canvas, paper, glass, film, fabric, etc., and for humanists are likely to include manuscript images, paintings, drawings, many different types of photographs (film negatives, glass plate negatives, slides, prints), stained glass, fabrics, maps, architectural drawings, etc.

Three-dimensional objects and artifacts

With present technologies, it is not possible to create a "true" facsimile of such objects, but there are now three-dimensional modeling techniques that can give good representations of three-dimensional materials. These are likely to include the whole range of museum objects, sculpture, architecture, buildings, archaeological artifacts.

Time-based media

There is increasing interest in the digitization of film, video, and sound, and indeed there have been large advances in the techniques available to produce time-based media in born-digital form in the commercial world, and also to digitize analogue originals. The Star Wars movies are a good example of the historical transition from analogue to digital. The first three movies were shot entirely on analogue film and later digitally remastered; all the films were then made available on DVD. Finally, the most recent film, Attack of the Clones, was shot entirely in digital form.

The conversion of three-dimensional objects and of time-based media is more problematic than that of text or still images. It draws upon newer technologies, the standards are not as well supported, and the file sizes produced are very large. The hardware and software to manipulate such materials are also generally more costly.

The Nature of Digital Data

If the nature of humanities data is very complex, the nature of digital data in its underlying form is seemingly very simple: all digital data, from whatever original they derive, have the same underlying structure, that of the "bit" or binary digit. A bit is an electronic impulse that can be represented by two states, "on" or "off", also written as "1" or "0." A "byte" consists of 8 bits, and 1 byte represents 1 alphanumeric character. A 10-letter word, for example, would be 10 bytes. Bits and bytes are linked together in chains of millions of electronic impulses; this is known as the "bit stream." A "kilobyte" is 1,024 bytes, and a "megabyte" 1,024 kilobytes. Digital images are represented by "pixels" or picture elements – dots on the computer screen or printed on paper. Pixels can carry a range of values, but at the simplest level one pixel equals one bit, and is represented in binary form as "black" (off) or "white" (on). Images captured at this level are "bi-tonal" – pure black and white. Images can also be represented as 8-bit images, which have 256 shades of gray or color, or as 24-bit images, which have millions of colors – more than the eye can distinguish. The number of bits chosen to represent each pixel is known as the "bit depth", and devices capable of displaying and printing images of higher bit depths (36 or 48 bits) are now emerging.
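The storage arithmetic described above can be sketched in a few lines of Python (a minimal illustration, not from the chapter; the function names are our own):

```python
BITS_PER_BYTE = 8

def bytes_for_text(num_chars):
    """One byte per alphanumeric character: a 10-letter word is 10 bytes."""
    return num_chars

def bytes_for_image(width_px, height_px, bit_depth):
    """Uncompressed image size: bit_depth bits for each pixel."""
    return width_px * height_px * bit_depth // BITS_PER_BYTE

# The same 1,000 x 1,000 pixel image captured at three bit depths:
print(bytes_for_image(1000, 1000, 1))   # bi-tonal: 125,000 bytes
print(bytes_for_image(1000, 1000, 8))   # 256 grays: 1,000,000 bytes
print(bytes_for_image(1000, 1000, 24))  # millions of colors: 3,000,000 bytes
```

The jump from bi-tonal to 24-bit capture multiplies storage by twenty-four, which is why bit-depth decisions matter so much for archival projects.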

Bit depth is the number of bits per pixel; "resolution" is the number of pixels (or printed dots) per inch, known as ppi or dpi. The higher the resolution, the higher the information density of a digital image. The resolution of most computer screens is generally in the range of 75 to 150 pixels per inch. This is adequate for display purposes (unless the image needs to be enlarged on-screen to show fine detail), but visual art or photographic content displayed at this resolution is inadequate for printing (especially in color), though images of black and white printed text or line art are often acceptable. High-density images of originals (manuscripts, photographs, etc.) need to be captured in the range of 300–600 ppi for print-quality output. Note that this will depend on the size of the original materials: 35 mm slides or microfilm originals will need to be captured at much higher resolutions, and scanners are now available offering resolutions of up to 4,000 dpi for such materials. These issues are discussed further below.
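Because resolution ties pixel counts to the physical size of the original, the trade-offs above can be made concrete with a small calculation (an illustrative sketch under the definitions just given; the figures are examples, not recommendations):

```python
def scan_pixels(width_inches, height_inches, ppi):
    """Pixel dimensions of a scan: physical size multiplied by resolution."""
    return round(width_inches * ppi), round(height_inches * ppi)

def uncompressed_megabytes(width_inches, height_inches, ppi, bit_depth):
    """Approximate uncompressed file size in megabytes (1 MB = 1,024 KB)."""
    w, h = scan_pixels(width_inches, height_inches, ppi)
    return w * h * bit_depth / 8 / (1024 * 1024)

# A 4 x 6 inch photographic print captured at 600 ppi in 24-bit color:
w, h = scan_pixels(4, 6, 600)
print(w, h)  # 2400 3600
print(round(uncompressed_megabytes(4, 6, 600, 24), 1))  # 24.7 (MB)
```

Doubling the resolution quadruples the pixel count and the file size, which is why print-quality capture of large originals so quickly produces very large files.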

Almost any kind of information can be represented in these seemingly simple structures, as patterns of the most intricate complexity can be built up. Most primary sources as outlined above are capable of digital representation, and when digital they are susceptible to manipulation, interrogation, transmission, and cross-linking in ways that are beyond the capacity of analogue media. Creating an electronic photocopy of a plain page of text is not a complex technical process with modern equipment, but being able to then automatically recognize all the alphanumeric characters it contains, plus the structural layout and metadata elements, is a highly sophisticated operation. Alphanumeric symbols are the easiest objects to represent in digital form, and digital text has been around for as long as there have been computers.

As will be clear from what is outlined above, there are diverse source materials in the humanities and, using a variety of techniques, they are all amenable to digital capture. Nowadays there are many different approaches that can be taken to capture such artifacts, dependent upon (a) the materials themselves; (b) the reasons for capturing them; (c) the technical and financial resources available to the project; and (d) the potential uses and users. One point that should be emphasized is that digital capture, especially capture of materials with significant image content, is a skilled process that is best entrusted to professionals if high-quality archive images are required. However, for the production of medium- or lower-quality materials for Web delivery or for use in teaching, and for learning about the issues and processes attendant upon digital capture, scholars and students may find it valuable to experiment with the range of medium-cost, good-quality capture devices now available commercially.

There is rarely only one method for the capture of original source materials, and so careful planning and assessment of all the cost, quality, conservation, usage, and access needs for the resultant product needs to be done in order to decide which of several options should be chosen. It is vital when making these decisions that provision is made for the long-term survivability of the materials as well as for immediate project needs: the costs of sustaining a digital resource are usually greater than those of creating it (and are very difficult to estimate) so good planning is essential if the investments made in digital conversion are to be profitable. (See chapters 31 and 37, this volume, for more information on project design and long-term preservation of digital materials.) However, when working with sources held by cultural institutions, the range of digitization options may be limited by what that institution is willing to provide: rarely will an institution allow outsiders to work with its holdings to create digital surrogates, especially of rare, unique, or fragile materials. Many now offer digitization services as adjuncts or alternatives to their photographic services, at varying costs, and projects will need to order digital files as they would formerly have ordered photographs. See Tanner and Deegan (2002) for a detailed survey of such services in the UK and Europe.

The Advantages and Disadvantages of Digital Conversion

Digital conversion, done properly, is a difficult, time-consuming and costly business, and some of the properties of digital objects outlined above can prove disadvantageous to the presentation and survivability of cultural materials – the resultant digital object, for instance, is evanescent and mutable in a way that the analogue original isn't. It can disappear in a flash if a hard drive crashes or a CD is corrupt; it can be changed without trace with an ease that forgers can only dream about. However, the digitization of resources opens up new modes of use for humanists, enables a much wider potential audience, and gives a renewed means of viewing our cultural heritage. These advantages may outweigh the difficulties and disadvantages, provided the project is well thought out and well managed – and this applies however large or small the project might be. The advantages of digitization for humanists include:

the ability to republish out-of-print materials

rapid access to materials held remotely

potential to display materials which are in inaccessible formats, for instance, large volumes or maps

"virtual reunification" – allowing dispersed collections to be brought together

the ability to enhance digital images in terms of size, sharpness, color contrast, noise reduction, etc.

the potential for integration into teaching materials

enhanced searchability, including full text

integration of different media (images, sounds, video etc.)

the potential for presenting a critical mass of materials for analysis or comparison.

Any individual, group or institution considering digitization of primary sources will need to evaluate potential digitization projects using criteria such as these. They will also need to assess the actual and potential user base, and consider whether this will change when materials are made available in digital form. Fragile originals which are kept under very restricted access conditions may have huge appeal to a wide audience when made available in a form which does not damage the originals. A good example of this is the suffrage banners collection at the Women's Library (formerly known as the Fawcett Library). This is a unique collection of large-sized women's suffrage banners, many woven from a variety of materials – cotton and velvet, for instance, often with appliqué lettering – and now in a fragile state. The sheer size of the banners and their fragility means that viewing the originals is heavily restricted, but digitization of the banners has opened up the potential for much wider access for scholarly use, adding a vital dimension to what is known about suffrage marches in the UK at the beginning of the twentieth century (see <http://ahds.ac.uk/suffrage.htm>).

An important question that should be asked at the beginning of any digitization project concerns exactly what it is that the digitization is aiming to capture. Is the aim to produce a full facsimile of the original that when printed out could stand in for the original? Some projects have started with that aim, and then found that a huge ancillary benefit was gained by also having the digital file for online access and manipulation. The Digital Image Archive of Medieval Music (DIAMM) project, for instance, had as its original goal the capture of a specific corpus of fifteenth-century British polyphony fragments for printed facsimile publication in volumes such as the Early English Church Music (EECM) series. However, early studies showed that there was much to be gained from obtaining high-resolution digital images in preference to slides or prints. This was not only because of the evidence for growing exploitation of digital resources at that time (1997), but also because many of these fragments were badly damaged and digital restoration offered opportunities not possible with conventional photography. The project therefore decided to capture manuscript images in the best quality possible using high-end digital imaging equipment, set up according to the most rigorous professional standards; to archive the images in an uncompressed form; to enhance and reprocess the images in order to wring every possible piece of information from them; and to preserve all these images – archive and derivative – for the long term. That has proved to be an excellent strategy for the project, especially as the image enhancement techniques have revealed hitherto unknown pieces of music on fragments that had been scraped down and overwritten with text: digitization has not just enhanced existing humanities sources, it has allowed the discovery of new ones (see <http://www.diamm.ac.uk>).

Digital techniques can also allow a user experience of unique textual materials that is simply not possible with the objects themselves. Good-quality images can be integrated with other media for a more complete user experience. In 2001, the British Library produced a high-quality digital facsimile of the fifteenth-century Sherborne Missal that is on display in its galleries on the largest touch screen in the UK. The unique feature of this resource is that the pages can be turned by hand, and it is possible to zoom in at any point on the page at the touch of a finger on the screen. High-quality sound reproduction accompanies the images, allowing users to hear the religious offices which make up the text being sung by a monastic choir. This is now available on CD-ROM, and such facsimiles of other unique objects are now being produced by the British Library. See <http://www.bl.uk/collections/treasures/missal.html>.

The Cornell Brittle Books project also started with the aim of recreating analogue originals: books printed on acid paper were crumbling away, and the goal was to replace these with print surrogates on non-acid paper. Much experimentation with microfilm and digitization techniques in the course of this project has produced results which have helped to set the benchmarks and standards for the conversion of print materials to digital formats all around the world. See Chapman et al. (1999) for further details.

Another aim of digital conversion might be to capture the content of a source without necessarily capturing its form. So an edition of the work of a literary author might be rekeyed and re-edited in electronic form without particular reference to the visual characteristics of an existing print or manuscript version. Or, if the aim is to add searchability to a written source while preserving the visual form, text might be converted to electronic form and then attached to the image.

With a visual source such as a fine art object or early manuscript, what level of information is needed? The intellectual content or the physical detail of brushstrokes, canvas grain, the pores of the skin of the animal used to make the parchment, scratched glosses? Is some kind of analysis or reconstruction the aim? The physics department at the University of Bologna has developed a digital x-ray system for the analysis of paintings (Rossi et al. 2000) and the Beowulf Project, a collaboration between the British Library and the University of Kentucky, has used advanced imaging techniques for the recovery of letters and fragments obscured by clumsy repair techniques, and demonstrated the use of computer imaging to restore virtually the hidden letters to their place in the manuscript. See <http://www.bl.uk/collections/treasures/beowulf.html>.

With three-dimensional objects, no true representation of three-dimensional space can be achieved within the two-dimensional confines of a computer screen, but some excellent results are being achieved using three-dimensional modeling and virtual reality techniques. The Virtual Harlem project, for instance, has produced a reconstruction of Harlem, New York, during the time of the Harlem Renaissance in the 1920s, using modeling and VR immersion techniques. (See <http://www.evl.uic.edu/cavern/harlem/> and Carter 1999.) The Cistercians in Yorkshire project is creating imaginative reconstructions of Cistercian abbeys as they might have been in the Middle Ages, using three-dimensional modeling and detailed historical research. (See <http://cistercians.shef.ac.uk/>.)

Methods of Digital Capture

Text

Humanists have been capturing, analyzing, and presenting textual data in digital form for as long as there have been computers capable of processing alphanumeric symbols. Thankfully, long gone are the days when pioneers such as Busa and Kenny laboriously entered text on punchcards for processing on mainframe machines – now it is an everyday matter for scholars and students to sit in the library transcribing text straight into a laptop or palmtop computer, and the advent of tablet PCs could make this even easier and more straightforward. Gone, too, are the days when every individual or project invented codes, systems, or symbols of their own to identify special features, and when any character that could not be represented in ASCII had to be recoded in some arcane form. Now the work of standards bodies such as the Text Encoding Initiative (TEI, <www.tei-c.org>), the World Wide Web Consortium (W3C, <www.w3c.org>) and the Dublin Core Metadata Initiative (DCMI, <http://dublincore.org/>) has provided standards and schemas for the markup of textual features and for the addition of metadata to non-textual materials that renders them reusable, interoperable, and exchangeable. The work done to develop the Unicode standard, too, means that there is a standard way of encoding characters for most of the languages of the world so that they too are interoperable and exchangeable (<http://www.unicode.org/unicode/standard/WhatIsUnicode.html>).

In considering the different features of electronic text, and the methods of their capture, it is important to make the distinction between machine-readable electronic text and machine-viewable electronic text. A bitmap of a page, for instance, is machine-displayable, and can show all the features of the original of which it is a representation. But it is not susceptible to any processing or editing, so though it is human-readable, it is not machine-readable. In machine-readable text, every individual entity, as well as the formatting instructions and other codes, is represented separately and is therefore amenable to manipulation. Most electronic texts are closer to one of these two forms than the other, although there are some systems emerging which process text in such a way as to have the elements (and advantages) of both. These are discussed further below.

For the scholar working intensively on an individual text, there is probably no substitute for transcribing the text directly on to the computer and adding appropriate tagging and metadata according to a standardized framework. The great benefit of the work done by the standards bodies discussed above is that the schemas proposed by them are extensible: they can be adapted to the needs of individual scholars, sources and projects. Such rekeying can be done using almost any standard word processor or text editor, though there are specialist packages like XMetaL which allow the easy production and validation of text marked up in XML (Extensible Markup Language).

Many scholars and students, empowered by improvements in the technology, are engaging in larger collaborative projects which require more substantial amounts of text to be captured. There are now some very large projects such as the Early English Books Online Text Creation Partnership (EEBO TCP, <http://www.lib.umich.edu/eebo/>), which is creating accurately keyboarded and tagged editions of up to 25,000 volumes from the EEBO corpus of 125,000 volumes of English texts from between 1473 and 1700, which has been microfilmed and digitized by ProQuest. The tagged editions will be linked to the image files. For projects like these, specialist bureaux provide rekeying and tagging services at acceptable costs with assured levels of accuracy (up to 99.995 percent). This accuracy is achieved through double or triple rekeying: two or three operators key the same text, then software is used to compare the versions with each other. Any differences are highlighted and can then be corrected manually. This is much faster, less prone to subjective or linguistic errors, and therefore cheaper than proofreading and correction. There will always be a need for quality control of text produced by any method, but most specialist bureaux give an excellent, accurate, reliable service: the competition is fierce in this business, which has driven costs down and quality up. For an example of a project that effectively used the double rekeying method see the Old Bailey Court Proceedings, 1674 to 1834 (<http://www.oldbaileyonline.org/>).
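The comparison step at the heart of double rekeying can be sketched with Python's standard difflib module (an illustration of the principle only; production bureaux use their own comparison software, and the sample texts below are invented):

```python
import difflib

def rekeying_diffs(key_a, key_b):
    """Return the spans where two independently keyed texts disagree."""
    matcher = difflib.SequenceMatcher(None, key_a, key_b)
    return [(op, key_a[i1:i2], key_b[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes()
            if op != "equal"]

# Two operators key the same sentence; the second makes typical errors:
operator_1 = "The quick brown fox jumps over the lazy dog."
operator_2 = "The quick brovvn fox jumps ouer the lazy dog."
for op, left, right in rekeying_diffs(operator_1, operator_2):
    print(op, repr(left), repr(right))  # each span is flagged for manual correction
```

Because two operators rarely make the same mistake at the same position, characters on which both versions agree can be accepted with high confidence, and only the flagged spans need human attention.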

Optical character recognition

The capture of text by rekeying either by an individual or by a specialist bureau is undertaken either because very high levels of accuracy are needed for textual analysis or for publication, or because the originals are not susceptible to any automated processes, which is often the case with older materials such as the EEBO corpus discussed above. While this level of accuracy is sometimes desirable, it comes at a high cost. Where modern, high-quality printed originals exist, it may be possible to capture text using optical character recognition (OCR) methods, which can give a relatively accurate result. Accuracy can then be improved using a variety of automated and manual methods: passing the text through spellcheckers with specialist dictionaries and thesauri, and manual proofing. Research is currently being done to improve OCR packages and to enable them to recognize page structures and even add tagging: the European Union-funded METAe Project (the Metadata Engine Project) is developing automatic processes for the recognition of complex textual structures, including text divisions such as chapters, sub-chapters, page numbers, headlines, footnotes, graphs, caption lines, etc. See <http://meta-e.uibk.ac.at/>. Olive Software's Active Paper Archive can also recognize complex page structures from the most difficult of texts: newspapers. This is described more fully below. OCR techniques were originally developed to provide texts for the blind. OCR engines can operate on a wide range of character sets and fonts, though they have problems with non-alphabetic character sets because of the large number of symbols, and also with cursive scripts such as Arabic. Software can be "trained" on new texts and unfamiliar characters so that accuracy improves over time and across larger volumes of data. Though this requires human intervention in the early stages, the improvement in accuracy over large volumes of data is worth the initial effort.

Though OCR can give excellent results if (a) the originals are modern and in good condition and (b) there is good quality control, projects must consider carefully the costs and benefits of deciding between a rekeying approach and OCR. Human time is always the most costly part of any operation, and it can prove more time-consuming and costly to correct OCR (even when it seems relatively accurate) than to opt for bureau rekeying with guaranteed accuracy. It is worth bearing in mind that seemingly accurate results (between 95 and 99 percent, for instance) still mean between 1 and 5 incorrect characters per 100. Assuming an average of 5 characters per word, a 1 percent character error rate equates to a word error rate of 1 in 20 or higher.
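The rule of thumb in the previous paragraph can be written out explicitly (a rough worst-case estimate, assuming each character error falls in a different word; the function name is our own):

```python
def approx_word_error_rate(char_accuracy, chars_per_word=5):
    """Worst-case word error rate implied by a character accuracy figure."""
    char_error_rate = 1 - char_accuracy
    return min(1.0, char_error_rate * chars_per_word)

# A seemingly impressive 99 percent character accuracy:
rate = approx_word_error_rate(0.99)
print(round(rate, 2))  # 0.05, i.e. roughly 1 word in every 20 wrong
```

Seen in these terms, a "99 percent accurate" OCR output of a 100,000-word text could contain on the order of 5,000 faulty words, which is why headline character-accuracy figures can be misleading.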

OCR with fuzzy matching

OCR, as suggested above, is an imperfect method of text capture which can require a great deal of post-processing if accurate text is to be produced. In an ideal world, one would always aim to produce electronic text to the highest possible standards of accuracy; indeed, for some projects and purposes, accurate text to the highest level attainable is essential, and worth what it can cost in terms of time and financial outlay. However, for other purposes, speed of capture and volume are more important than quality, and so some means has to be found to overcome the problems of inaccurate OCR. What needs to be taken into account is the reason a text is to be captured digitally and made available. If the text has important structural features which need to be encoded in the digital version, and these cannot be captured automatically, or if a definitive edition is to be produced in either print or electronic form from the captured text, then high levels of accuracy are paramount. If, however, retrieval of the information contained within large volumes of text is the desired result, then it may be possible to work with the raw OCR output from scanners, without post-processing. A number of text retrieval products are now available which allow searches to be performed on inaccurate text using "fuzzy matching" techniques. However, at the moment, fuzzy searching will only work with suitable linguistic and contextual dictionaries, and is therefore largely limited to the English language. Other languages with small populations of speakers are poorly represented, as are pre-1900 linguistic features.
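A crude form of fuzzy matching can be illustrated with edit-distance similarity from Python's standard library (the commercial products mentioned above layer linguistic and contextual dictionaries on top of this kind of matching; the OCR text and threshold below are invented for illustration):

```python
import difflib

def fuzzy_find(query, ocr_words, threshold=0.8):
    """Return OCR words similar enough to the query to count as matches."""
    query = query.lower()
    return [w for w in ocr_words
            if difflib.SequenceMatcher(None, query, w.lower()).ratio() >= threshold]

# Typical OCR confusions: 'rn' read for 'm', '1' for 'i':
ocr_output = "the parliarnent assembled at Westm1nster in Novernber".split()
print(fuzzy_find("parliament", ocr_output))   # finds 'parliarnent'
print(fuzzy_find("Westminster", ocr_output))  # finds 'Westm1nster'
```

The threshold is the crucial design choice: set too low, it over-retrieves (the problem noted below in connection with FMO); set too high, it misses exactly the damaged words it exists to catch.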

Hybrid solutions: page images with underlying searchable text

An increasing number of projects and institutions are choosing to deliver textual content in digital form through a range of hybrid solutions. The user is presented with a facsimile image of the original for printing and viewing, and attached to each page of the work is a searchable text file. This text file can be produced by rekeying, as in the EEBO project described above, or by OCR with or without correction. Decisions about the method of production of the underlying text will depend on the condition of the originals and the level of accuracy of retrieval required. With the EEBO project, the materials date from the fifteenth to the eighteenth centuries, and they have been scanned from microfilm, so OCR is not a viable option. The anticipated uses, too, envisage detailed linguistic research, which means that retrieval of individual words or phrases will be needed by users. The highest possible level of accuracy is therefore a requirement for this project.

The Forced Migration Online (FMO) project based at the Refugee Studies Centre, University of Oxford, is taking a different approach. FMO is a portal to a whole range of materials and organizations concerned with the study of the phenomenon of forced migration worldwide, with content contributed by an international group of partners. One key component of FMO is a digital library of gray literature and of journals in the field. The digital library is produced by attaching text files of uncorrected OCR to page images: the OCR text is used for searching and is hidden from the user; the page images are for viewing and printing. What is important to users of FMO is documents or parts of documents dealing with key topics, rather than that they can retrieve individual instances of words or phrases. This type of solution can deliver very large volumes of material at significantly lower cost than rekeying, but the trade-offs in some loss of accuracy have to be understood and accepted. Some of the OCR inaccuracies can be mitigated by using fuzzy search algorithms, but this can give rise to problems of over-retrieval. FMO (<www.forcedmigration.org>) uses Olive Software's Active Paper Archive, a product which offers automatic zoning and characterization of complex documents, as well as OCR and complex search and retrieval using fuzzy matching. See Deegan (2002) and <http://www.oclc.org/digitalpreservation/digitizing/newspaper/>.
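The hybrid model just described can be sketched as a simple inverted index from hidden OCR text to page images (an illustration only: the file names and text are invented, and real systems such as those discussed here add fuzzy matching, zoning, and relevance ranking):

```python
def build_page_index(pages):
    """pages maps page-image filename -> its hidden OCR text.
    Returns a word -> set-of-page-images index for retrieval."""
    index = {}
    for image_file, ocr_text in pages.items():
        for word in ocr_text.lower().split():
            index.setdefault(word, set()).add(image_file)
    return index

pages = {
    "page_001.tif": "report on refugee movements in the region",
    "page_002.tif": "annual migration statistics and tables",
}
index = build_page_index(pages)
# The user's search runs over the OCR text, but the result shown
# is the page image, never the (possibly inaccurate) raw text:
print(sorted(index.get("migration", set())))  # ['page_002.tif']
```

Because the OCR text is never displayed, recognition errors degrade only recall, not the reading experience, which is the trade-off such projects accept in exchange for volume and low cost.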

One document type that responds well to hybrid solutions is newspapers, which are high-volume, low-value (generally), mixed media, and usually large in size. Newspaper digitization is being undertaken by a number of companies, adopting different capture strategies and business models. ProQuest are using a rekeying solution to capture full text from a number of historic newspapers, and are selling access to these through their portal, History Online (<http://historyonline.chadwyck.co.uk>). These newspapers include The Times, the New York Times and the Wall Street Journal. A number of Canadian newspapers are digitizing page images and offering searchability using the Cold North Wind software program (<http://www.coldnorthwind.com/>); and the TIDEN project, a large collaboration of Scandinavian libraries, is capturing content using advanced OCR and delivering page images with searchability at the page level using a powerful commercial search engine, Convera's RetrievalWare (<http://tiden.kb.se/>). OCLC (the Online Computer Library Center) has an historic newspaper digitization service which provides an end-to-end solution: the input is microfilm or TIFF images, the output a fully searchable archive with each individual item (article, image, advertisement) separated and marked up in XML. Searches may be carried out on articles, elements within articles (title, byline, text), and image captions. OCLC also uses Olive Software's Active Paper Archive.

Images

As we suggest above, humanists deal with a very large range of image-based materials, and different capture strategies may be employed according to potential usage and cost factors. Many humanists, too, may be considering digital images as primary source materials rather than as secondary surrogates: increasingly photographers are turning from film to digital, and artists are creating digital art works from scratch. Many digital images needed by humanists are taken from items outside their control: objects that are held in cultural institutions. These institutions have their own facilities for creating images which scholars and students will need to use. If they don't have such facilities, analogue surrogates (usually photographic) can be ordered and digitization done from the surrogate. The costs charged by institutions vary a great deal (see Tanner and Deegan 2002). Sometimes this depends on whether a reproduction fee is being charged. Institutions may offer a bewildering choice of digital and analogue surrogate formats with a concomitant complexity of costs. Humanists embarking on digital projects, whether or not they will be doing their own capture, therefore need a thorough understanding of the materials; of the digitization technologies and the implications of any technical choices; of the metadata that need to be added to render the images useful and usable (which is something that will almost certainly be added by the scholar or student); and of the potential cost factors. Understanding of the materials is taken as a given here, so this section will concentrate upon technical issues.

Technical issues in image capture

Most image materials to be captured by humanists will need a high level of fidelity to the original. This means that capture should be at an appropriate resolution, relative to the format and size of the original, and at an appropriate bit depth. As outlined above, resolution is the measure of information density in an electronic image and is usually measured in dots per inch (dpi) or pixels per inch (ppi): the more dots per inch, the denser the image information, and the higher the level of detail that can be captured. The definition of "high" resolution depends upon factors such as the size of the original medium, the nature of the information, and the eventual use. Thus 600 dpi would be considered high resolution for a photographic print, but low resolution for a 35 mm slide. Careful consideration must be given to the amount of information content required relative to the eventual file size. It must be remembered that resolution is always a function of two things: (1) the size of the original and (2) the number of dots or pixels. Resolution is expressed in different ways according to which part of the digital process is being discussed: hardware capabilities, the absolute value of the digital image, capture of the analogue original, or printed output. Hardware capabilities are referred to differently as well: the resolution of a flatbed scanner, which has a fixed relationship with originals (because they are placed on a fixed platen and the scanner passes over them at a fixed distance), is expressed in dpi. With digital cameras, which have a variable dpi in relation to the originals, given that they can be moved closer or further away, resolution is expressed in absolute terms, either by their x and y dimensions (12,000 × 12,000, say, for the highest-quality professional digital cameras) or by the total number of pixels (4 million, for instance, for a good-quality compact camera).
The digital image itself is best expressed in absolute terms: if expressed in dpi, the size of the original always needs to be known to be meaningful.
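The relationship between dpi and absolute pixel dimensions can be shown with a small worked calculation (a sketch only; the dimensions chosen for the originals are illustrative):

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Absolute pixel dimensions of a scan of an original of the given size (inches)."""
    return round(width_in * dpi), round(height_in * dpi)

# A 10 x 8 inch photographic print scanned at 600 dpi:
print(pixel_dimensions(10, 8, 600))        # (6000, 4800)

# A 35 mm slide (about 1.42 x 0.95 inches) at the same 600 dpi yields far
# fewer pixels -- which is why 600 dpi counts as "low" resolution for film:
print(pixel_dimensions(1.42, 0.95, 600))   # (852, 570)
```

This is why the absolute pixel count is the unambiguous measure: the same dpi figure describes very different amounts of captured information depending on the size of the original.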

The current choices of hardware for digital capture include flatbed scanners, used for reflective and transmissive materials. These can currently deliver up to 5,000 dpi, but such scanners can cost tens of thousands of dollars, so most projects can realistically only afford scanners in the high-end range of 2,400 to 3,000 dpi. Dedicated 35 mm film scanners, used for transmissive materials such as slides and film negatives, can deliver up to 4,000 dpi. Drum scanners may also be considered, as they can deliver much higher relative resolutions and quality, but they are generally not used in this context because the process is destructive to the original photographic transparency and the unit cost of creation is higher. Digital cameras can be used for any kind of material, but are generally recommended for those materials not suitable for scanning with flatbed or film scanners: tightly bound books or manuscripts, art images, and three-dimensional objects such as sculpture or architecture. Digital cameras are becoming popular as replacements for conventional film cameras in the domestic and professional markets, and so there is now a huge choice. High-end cameras for purchase by image studios cost tens of thousands of dollars, but such have been the recent advances in the technologies that superb results can be gained from cameras costing much less than this – when capturing images from smaller originals, even some compact cameras can deliver archive-quality scans. However, they need to be set up professionally, and professional stands and lighting must be used.

For color scanning, the current recommendation for bit depth is that high-quality originals be captured at 24 bit, which renders more than 16 million colors – more than the human eye can distinguish, and enough to give photorealistic output when printed. For black and white materials with tone, 8 bits per pixel is recommended, which gives 256 levels of gray, enough to give photorealistic printed output.

The kinds of images humanists will need for most purposes are likely to be of the highest possible quality, for two reasons: first, humanists will generally need fine levels of detail in the images; secondly, the images will in many cases have been taken from rare or unique originals, which might also be very fragile. Digital capture, wherever possible, should be done once only, producing a digital surrogate that will satisfy all anticipated present and future uses. This surrogate is known as the "digital master" and should be kept under preservation conditions. (See chapter 37, this volume.) Any manipulations or post-processing should be carried out on copies of this master image. The digital master will probably be a very large file: the highest-quality digital cameras (12,000 × 12,000 pixels) produce files of up to 350 MB, which means that it is not possible to store more than one on a regular CD-ROM disk. A 35 mm color transparency captured at 2,700 dpi (the norm for most slide scanners) in 24-bit color would give a file size of around 25 MB, which means that some two dozen images could be stored on one CD-ROM. The file format generally used for digital masters is TIFF (Tagged Image File Format), a de facto standard for digital imaging. There are many other file formats available, but TIFF can be recommended as the safest choice for the long term. "Tagged" means that various types of information can be stored in the file header of a TIFF file.
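As a rough cross-check on these file sizes, the uncompressed size of an image is simply its pixel count multiplied by its bit depth (a sketch; real files add headers, and cameras pack data in different ways, which is why quoted figures differ somewhat):

```python
def uncompressed_size_mb(width_px, height_px, bit_depth):
    """Uncompressed image size in megabytes (taking 1 MB = 2**20 bytes)."""
    return width_px * height_px * bit_depth / 8 / 2**20

# A 12,000 x 12,000 pixel capture in 24-bit colour:
print(round(uncompressed_size_mb(12000, 12000, 24)))  # 412 (MB)

# The same pixel array in 8-bit greyscale is one third the size:
print(round(uncompressed_size_mb(12000, 12000, 8)))   # 137 (MB)
```

The arithmetic makes plain why a digital master cannot simply be made "as big as possible": every doubling of resolution quadruples the storage required.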

The actual methods of operation of scanning equipment vary, and it is not the intention of this chapter to give details of such methodologies. There are a number of books which give general advice, and hardware manufacturers have training manuals and online help. Anyone who wishes to learn digital capture techniques in detail is advised to seek out professional training through their own institution, or from one of the specialist organizations offering courses: some of these are listed in the bibliography at the end of this chapter.

Compression and derivatives

It is possible to reduce the file sizes of digital images using compression techniques, though this is often not recommended for digital masters. Compression comes in two forms: "lossless", meaning that no data is lost through the process, and "lossy", meaning that data is lost and can never be recovered. Two lossless compression techniques are often used for TIFF master files and can be recommended here: the LZW compression algorithm for materials with color content, and CCITT Group 4 compression for materials with 1-bit, black and white content.
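The defining property of lossless compression – that decompression restores the data exactly – can be demonstrated with a short round trip. LZW itself is not available in the Python standard library, so zlib's DEFLATE algorithm stands in for it here; the lossless property being demonstrated is the same:

```python
import zlib

# Lossless compression round-trip: the decompressed bytes are identical to
# the input, byte for byte. Repetitive data (like flat areas of an image)
# compresses particularly well.
original = b"scanline " * 1000
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))   # 9000 bytes shrinks to a few dozen
print(restored == original)             # True -- no data lost
```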

Derivative images from digital masters are usually created using lossy compression methods, which can give much greater reduction in file sizes than lossless compression for color and greyscale images. Lossy compression is acceptable for many uses of the images, especially for Web or CD-ROM delivery. However, excessive compression can cause problems in the viewable images, creating artifacts such as pixelation, dotted or stepped lines, regularly repeated patterns, moiré, halos, etc. For the scholar seeking the highest level of fidelity to the originals this is likely to be unacceptable, and so experimentation will be needed to find the best compromise between file size and visual quality. The main format for derivative images delivered on the Web or on CD-ROM is currently JPEG. This is a lossy compression format that can offer considerable reduction in file sizes if the highest levels of compression are used, but at the cost of some compromise in quality. It can, however, give color thumbnail images of only around 7 KB and screen-resolution images of around 60 KB – a considerable benefit if bandwidth is an issue.
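The size/quality trade-off in lossy compression comes largely from quantization: sample values are rounded to a coarser scale, and the rounding can never be undone. A toy sketch of this step (not the actual JPEG DCT pipeline, which quantizes frequency coefficients rather than raw pixels):

```python
# Coarser quantization (a bigger step) means fewer distinct values to
# store -- hence smaller files -- but rounding error that no amount of
# decompression can recover.
def quantize(values, step):
    return [round(v / step) * step for v in values]

pixels = [12, 13, 14, 200, 201, 203]
print(quantize(pixels, 4))    # [12, 12, 16, 200, 200, 204] -- mild loss
print(quantize(pixels, 32))   # [0, 0, 0, 192, 192, 192] -- fine detail gone
```

The second result shows the origin of artifacts such as banding and pixelation: nearby values collapse into a single level, and the distinctions between them are irrecoverable.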

Audio and video capture

As technology has advanced since the 1990s, more and more humanists are becoming interested in the capture of time-based media. Media studies is an important and growing area, and historians of the modern period too derive great benefit from having digital access to time-based primary sources such as news reports, film, etc. Literary scholars also benefit greatly from access to plays, and to filmed versions of literary works.

In principle the conversion of audio or video from an analogue medium to a digital data file is simple; it is in the detail that complexity occurs. The biggest problem in the digital capture of video and audio is the resultant file sizes:

There is no other way to say it. Video takes up a lot of everything: bandwidth, storage space, time, and money. We might like to think that all things digital are preferable to all things analog but the brutal truth is that while analog formats like video might not be exact or precise they are remarkably efficient when it comes to storing and transmitting vast amounts of information.

(Wright 2001a)

Raw, uncompressed video is about 21 MB/sec.

(Wright 2001b)
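Wright's 21 MB/sec figure makes the storage problem concrete; a quick calculation (taking 1 GB = 1,024 MB):

```python
RAW_RATE_MB_PER_SEC = 21  # Wright's figure for raw, uncompressed video

def raw_video_size_gb(minutes):
    """Storage needed for uncompressed video, in gigabytes."""
    return minutes * 60 * RAW_RATE_MB_PER_SEC / 1024

print(round(raw_video_size_gb(1), 2))   # 1.23 GB for a single minute
print(round(raw_video_size_gb(60)))     # roughly 74 GB for one hour
```

A single hour of raw footage, in other words, would have filled the entire hard disk of a well-specified PC of the period – hence the central role of compression in any video digitization project.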

There are suppliers that can convert audio and video, and many scholars will want to outsource such work. However, an understanding of the processes is of great importance.

The first stage in the capture process is to have the original video or audio in a format that is convertible to digital formats. There are over thirty types of film and video stock and many types of audio formats that the original content may be recorded upon. Most humanists, however, should only need to concern themselves with a limited number of formats, including:

VHS or Betamax video

mini-DV for digital video cameras

DAT tape for audio

cassette tape for audio

CD-ROM for audio

The equipment needed to capture video and audio includes:

video player capable of high-quality playback of VHS and mini-DV tapes

hi-fi audio player capable of high-quality playback of DAT and cassette tapes

small TV for review and location of video clips

high-quality speakers and/or headphones for review and location of audio clips

Once a suitable video or audio clip has been identified on the original via a playback device, the playback device has to be connected to the capture device for rendering in a digital video (DV) or digital audio (DA) format on a computer system. This is normally achieved by connecting input/output leads from the playback device to an integrated digital capture card in the computer, or via a device known as a "breakout box," which merely allows more types of input/output leads to be used in connecting to and from the playback device. Whatever the method of connection, there must be a capture card resident in the machine to allow digital data capture from the playback device. Usually, separate capture cards are recommended for video and for audio: although all video capture cards will accept audio input, the quality required for pure audio capture is such that a bespoke audio card is preferable. Indeed, it is probably best to plan the capture of audio and video on separate PCs: each card requires its own input/output (I/O) range, interrupt request (IRQ) number, and its own software to control the capture, and cards are temperamental, tending to claim the same I/O ranges, interrupts, and memory spaces, leading to system conflicts and reliability problems. This of course adds to cost if a project is planning to capture both audio and video.

The benefit of a capture card is that it provides render-free, real-time digital video editing and DV and analogue input/output, usually augmented with direct output to DVD, Internet streaming capabilities, and a suite of digital video production tools. Render-free, real-time capture and editing is important because capture and editing can then happen without the long waits (which can be many minutes) associated with software-based rendering and editing. The capture from video or audio will be to DV or DA, usually at a compression rate of around 5:1, which gives 2.5–3.5 MB/sec. This capture is to a high standard even though there is compression built into the process. It is, however, too large for display and use on the Internet, and so further compression and file format changes are required once the material has been edited into a suitable form for delivery.
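At 2.5–3.5 MB/sec, the storage needed for an hour of captured DV can be estimated as follows (taking 1 GB = 1,024 MB; the figures are those given in the text):

```python
# Storage for one hour of DV capture at each end of the quoted data rate:
for rate_mb_s in (2.5, 3.5):
    gb_per_hour = rate_mb_s * 3600 / 1024
    print(f"{rate_mb_s} MB/sec -> {gb_per_hour:.1f} GB per hour")
# 2.5 MB/sec -> 8.8 GB per hour
# 3.5 MB/sec -> 12.3 GB per hour
```

Even with 5:1 compression built in, an hour of captured footage runs to roughly 9–12 GB – manageable for an editing workstation, but far too large for delivery, which is why the further compression described below is unavoidable.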

Editing of Captured Content

The editing of the content captured is done via software tools known as non-linear editing suites. These allow the content to be manipulated, edited, spliced, and otherwise changed to facilitate the production of suitable content for the prospective end user. The ability to do this in real time is essential to the speed and accuracy of the eventual output. Also, the editing suite should have suitable compressors for output to Web formats and Internet streaming.

Compressors for the Web

When video or audio is viewed on the Web, the content has to go through a process of compression and decompression, performed by software known as a CODEC (compressor/decompressor). There is initial compression at the production end using a suitable CODEC (e.g., Sorenson, QDesign) to gain the desired file size, frames per second, transmission rate, and quality. This content may then be saved in a file format suitable for the architecture of the delivery network and the expected user environment – possibly QuickTime, Windows Media, or Real. The user then uses viewer software to decompress and view or hear the content.

Compression is a difficult balancing act between gaining the smallest file size and retaining suitable quality. In video, compression works by taking a reference frame that contains all the information available; subsequent frames are then represented as changes from the reference frame. The number of frames between reference frames is a defining factor in compression and quality: the fewer reference frames, the more compressed the file, but the lower the quality of the images is likely to be. If the number of reference frames is increased to improve visual quality, the file size will also rise. As the nature of the content in video can differ radically, this compression process has to be done on a case-by-case basis. A video of a "talking head" interview will require fewer reference frames than, say, sports content, because the amount of change from frame to frame is less in the "talking head" example. Thus higher compression is possible for some content than for others when aiming for the same visual quality – there is no "one size fits all" equation in audio and video compression.
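Reference-frame compression can be sketched in a few lines: store one full frame, then record only the pixels that change in each subsequent frame. The frames and pixel values below are invented to show why low-change content compresses better:

```python
# Toy inter-frame encoder: one full reference frame, then per-frame deltas
# holding only the pixels that changed. A "talking head" (few changes per
# frame) produces tiny deltas; fast-moving content produces large ones.
def encode(frames):
    keyframe = frames[0]
    deltas = []
    for prev, cur in zip(frames, frames[1:]):
        deltas.append({i: v for i, (p, v) in enumerate(zip(prev, cur)) if p != v})
    return keyframe, deltas

def decode(keyframe, deltas):
    frames, cur = [list(keyframe)], list(keyframe)
    for delta in deltas:
        for i, v in delta.items():
            cur[i] = v
        frames.append(list(cur))
    return frames

frames = [[10, 10, 10, 10], [10, 10, 99, 10], [10, 10, 99, 11]]
key, deltas = encode(frames)
print(deltas)                          # [{2: 99}, {3: 11}]
print(decode(key, deltas) == frames)   # True -- sequence reconstructed
```

Real CODECs work on blocks and motion vectors rather than individual pixels, but the principle is the same: the cost of each frame depends on how much it differs from its reference.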

It is generally assumed with images that the main cost will be in the initial capture and that creating surrogates for end-user viewing on the Internet will be quick and cheap. In the video sphere this paradigm is turned on its head, with the initial capture usually cheaper than the creation of the compressed surrogate for end-user viewing as a direct consequence of the increased amount of human intervention needed to make a good-quality Internet version at low bandwidth.

Metadata

Given that humanists will almost certainly be capturing primary source data at a high quality and with long-term survival as a key goal, it is important that this data be documented properly so that curators and users of the future understand what it is that they are dealing with. Metadata is one of the critical components of digital resource conversion and use, and is needed at all stages in the creation and management of the resource. Any creator of digital objects should take as much care in the creation of the metadata as they do in the creation of the data itself – time and effort expended at the creation stage recording good-quality metadata is likely to save users much grief, and to result in a well-formed digital object which will survive for the long term.

It is vitally important that projects and individuals give a great deal of thought to this documentation of data right from the start. Having archive-quality digital master files is useless if the filenames mean nothing to anyone but the creator, and there is no indication of date of creation, file format, type of compression, etc. Such information is known as technical or administrative metadata. Descriptive metadata refers to the attributes of the object being described and can be extensive, covering such attributes as "title", "creator", "subject", "date", "keywords", and "abstract" – many of the things, in fact, that would be recorded in a traditional cataloguing system. If materials are being captured by professional studios or bureaux, then some administrative and technical metadata will probably be added at source, and it may be possible to request or supply project-specific metadata. Descriptive metadata can only be added by experts who understand the nature of the source materials, and it is an intellectually challenging task in itself to produce good descriptive metadata.
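A minimal sketch of such a metadata record, with separate administrative/technical and descriptive sections, might look like the following. The field names and values are purely illustrative, loosely modelled on Dublin Core; a real project should adopt a published schema:

```python
# Hypothetical metadata record for one digital master file.
record = {
    "administrative": {                      # technical/administrative metadata
        "filename": "ms_f042r_master.tif",   # illustrative filename
        "date_created": "2003-05-14",
        "file_format": "TIFF 6.0",
        "compression": "none",
        "resolution_dpi": 600,
        "bit_depth": 24,
    },
    "descriptive": {                         # descriptive metadata
        "title": "Folio 42 recto",
        "creator": "unknown scribe",
        "subject": ["manuscript", "illumination"],
        "date": "c.1450",
    },
}

for section, fields in record.items():
    print(section, "->", ", ".join(fields))
```

Even this toy record shows the division of labor described above: the administrative section could be filled in automatically at capture time, while the descriptive section demands the judgment of a subject expert.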

Publications

Understanding the capture processes for primary source materials is essential for humanists intending to engage in digital projects, even if they are never going to carry out conversion activities directly. Knowing the implications of the various decisions that have to be taken in any project is of vital importance for short- and long-term costs as well as for the long-term survivability of the materials to which time, care, and funds have been devoted. The references below cover most aspects of conversion for most types of humanities sources, and there are also many websites with bibliographies and reports on digital conversion methods. The technologies and methods change constantly, but the underlying principles outlined here should endure.

References for Further Reading

Baca, M. (ed.) (1998). Introduction to Metadata: Pathways to Digital Information. Los Angeles, CA: Getty Research Institute.

Carter, B. (1999). From Imagination to Reality: Using Immersion Technology in an African American Literature Course. Literary and Linguistic Computing 14: 55–65.

Chapman, S., P. Conway, and A. R. Kenney (1999). Digital Imaging and Preservation Microfilm: The Future of the Hybrid Approach for the Preservation of Brittle Books. RLG DigiNews. Available at http://www.rlg.org/preserv/diginews/diginews3-1.html.

Davies, A. and P. Fennessey (2001). Digital Imaging for Photographers. London: Focal Press.

Deegan, M., E. Steinvel, and E. King (2002). Digitizing Historic Newspapers: Progress and Prospects. RLG DigiNews. Available at http://www.rlg.org/preserv/diginews/diginews6-4.html#feature2.

Deegan, M. and S. Tanner (2002). Digital Futures: Strategies for the Information Age. London: Library Association Publishing.

Feeney, M. (ed.) (1999). Digital Culture: Maximising the Nation's Investment. London: National Preservation Office.

Getz, M. (1997). Evaluating Digital Strategies for Storing and Retrieving Scholarly Information. In S. H. Lee (ed.), Economics of Digital Information: Collection, Storage and Delivery. Binghamton, NY: Haworth Press.

Gould, S. and R. Ebdon (1999). IFLA/UNESCO Survey on Digitisation and Preservation. IFLA Offices for UAP and International Lending. Available at http://www.unesco.org/webworld/mdm/survey_index_en.html.

Hazen, D., J. Horrell, and J. Merrill-Oldham (1998). Selecting Research Collections for Digitization. CLIR. Available at http://www.clir.org/pubs/reports/hazen/pub74.html.

Kenney, A. R. and O. Y. Rieger (eds.) (2000). Moving Theory into Practice: Digital Imaging for Libraries and Archives. Mountain View, CA: Research Libraries Group.

Klijn, E. and Y. de Lusenet (2000). In the Picture: Preservation and Digitisation of European Photographic Collections. Amsterdam: Koninklijke Bibliotheek.

Lacey, J. (2002). The Complete Guide to Digital Imaging. London: Thames and Hudson.

Lagoze, C. and S. Payette (2000). Metadata: Principles, Practices and Challenges. In A. R. Kenney and O. Y. Rieger (eds.), Moving Theory into Practice: Digital Imaging for Libraries and Archives (pp. 84–100). Mountain View, CA: Research Libraries Group.

Lawrence, G. W., et al. (2000). Risk Management of Digital Information: A File Format Investigation. Council on Library and Information Resources. Available at http://www.clir.org/pubs/reports/reports.html.

Parry, D. (1998). Virtually New: Creating the Digital Collection. London: Library and Information Commission.

Robinson, P. (1993). The Digitization of Primary Textual Sources. London: Office for Humanities Communication Publications, 3, King's College, London.

Rossi, M., F. Casali, A. Bacchilega, and D. Romani (2000). An Experimental X-ray Digital Detector for Investigation of Paintings. 15th World Conference on Non-Destructive Testing, Rome, 15–21 October.

Smith, A. (1999). Why Digitize? CLIR. Available at http://www.clir.org/pubs/abstract/pub80.html.

Tanner, S. and M. Deegan (2002). Exploring Charging Models for Digital Cultural Heritage. Available at http://heds.herts.ac.uk.

Watkinson, J. (2001). Introduction to Digital Video. London: Focal Press.

Wright, G. (2001a). Building a Digital Video Capture System. Part I, Tom's Hardware Guide. Available at http://www.tomshardware.com/video/20010524/.

Wright, G. (2001b). Building a Digital Video Capture System. Part II, Tom's Hardware Guide. Available at http://www.tomshardware.com/video/20010801/.

Courses

School for Scanning, Northeast Document Conservation Center, <http://www.nedcc.org>.

Cornell University Library, Department of Preservation and Conservation, <http://www.library.cornell.edu/preservation/workshop/index.html>.

SEPIA (Safeguarding European Photographic Images for Access), project training courses, <http://www.knaw.nl/ecpa/sepia/training.html>.