17. Text Encoding

Allen H. Renear

Before they can be studied with the aid of machines, texts must be encoded in a machine-readable form. Methods for this transcription are called, generically, "text encoding schemes"; such schemes must provide mechanisms for representing the characters of the text and its logical and physical structure … ancillary information achieved by analysis or interpretation [may be also added] …
Michael Sperberg-McQueen, Text Encoding and Enrichment. In The Humanities Computing Yearbook 1989–90, ed. Ian Lancashire (Oxford: Oxford University Press, 1991)

Introduction

Text encoding holds a special place in humanities computing. It is not only of considerable practical importance and commonly used, but it has proven to be an exciting and theoretically productive area of analysis and research. Text encoding in the humanities has also produced a considerable amount of interesting debate – which can be taken as an index of both its practical importance and its theoretical significance.

This chapter will provide a general orientation to some of the historical and theoretical context needed for understanding both contemporary text encoding practices and the various ongoing debates that surround those practices. We will be focusing for the most part, although not exclusively, on "markup", as markup-related techniques and systems not only dominate practical encoding activity, but are also at the center of most of the theoretical debates about text encoding. This chapter provides neither a survey of markup languages nor a tutorial introduction to the practice of markup. The reader new to SGML/XML text encoding should read this chapter of the Companion concurrently with the short (21-page) second chapter of the TEI Guidelines, "A Gentle Introduction to XML", online at http://www.tei-c.org/P4X/SG.html. The justly renowned "Gentle Introduction" remains the best brief presentation of SGML/XML text encoding and it provides a necessary complement of specific description to the background and theory being presented here. For a good general introduction to text encoding in the humanities, see Electronic Texts in the Humanities: Theory and Practice, by Susan Hockey (Hockey 2001).

In accordance with the general approach of this Companion, we understand text encoding in the "digital humanities" in a wide sense. Traditional humanities computing (particularly when related to literature and language) typically emphasized either analytical procedures on encoded textual material – such as, for instance, stylometric analysis to support authorship or seriation studies – or the publishing of important traditional genres of scholarship such as critical and variorum editions, indexes, concordances, catalogues, and dictionaries. But text encoding is no less important to digital humanities broadly conceived, in the sense which includes the creation and study of new cultural products, genres, and capabilities, such as those involving hypertext, multimedia, interactivity, and networking – cultural products that are often called "new media." In order to be presented using computers, such material must be encoded in machine-readable form. Although the presentation below does not take up hypertext or "new media" topics directly, we believe the material presented nevertheless provides useful background for text encoding in general, for new media applications as well as for traditional humanities computing and publishing. For more specific treatment of encoding issues, hypertext, multimedia, and other new media topics, see chapters 10, 28, 29, and 30. For discussions of traditional humanities computing applications exploiting text encoding, see chapters 20, 21, 22, and 35.

What follows, then, is intended as background for understanding text encoding as a representation system for textually based cultural objects of all kinds.

Markup

Introduction

Markup, in the sense in which we are using the term here, may be characterized, at least provisionally, as information formally distinct from the character sequence of the digital transcription of a text, which serves to identify logical or physical features or to control later processing. In a typical unformatted view of a digital representation of a text such markup is visibly evident as the more or less unfamiliar expressions or codes that are intermixed with the familiar words of the natural-language writing system. The term markup comes, of course, from traditional publishing, where an editor marks up a manuscript by adding annotations or symbols on a paper copy of text indicating either directly (e.g., "center") or indirectly ("heading") how something is to look in print (Spring 1989; Chicago 1993).

Many markup theorists have found it instructive to conceptualize markup as a very general phenomenon of human communication. Extending the notion of markup in straightforward and natural ways, one can easily use this concept to illuminate aspects of the general nature and history of writing systems and printing, particularly in areas such as page layout, typography, and punctuation (Coombs et al. 1987).

In addition, other fields and disciplines related to communication, such as rhetoric, bibliography, textual criticism, linguistics, discourse and conversation analysis, logic, and semiotics also seem to be rich sources of related findings and concepts that generalize our understanding of markup practices and make important connections between markup practices narrowly understood and other bodies of knowledge and technique.

However, although such a broad perspective can be illuminating, the significance of markup for humanities computing is best approached initially by considering markup's origin and development in computer-based typesetting and early text processing. Although markup might arguably be considered part of any communication system, and of fundamental theoretical significance, it is with straightforward applications in digital text processing and typesetting in the 1960s, 1970s, and 1980s that the use of markup, in our sense, first becomes explicit and begins to undergo deliberate and self-conscious development (Goldfarb 1981; Spring 1989; SGML Users' Group 1990).

Emergence of descriptive markup

The use of computers to compose text for typesetting and printing was common by the mid-1960s and the general process was more or less the same regardless of the specific technology. Typically, the first step was to create and store in a computer file a representation of the text to be printed. This representation consisted of both codes for the individual characters of the textual content and codes for formatting commands, the commands and the text being distinguished from each other by special characters or sequences of characters serving as "delimiters." The file would then be processed by a software application that acted on the formatting instructions to create data which could in turn be directly read and further processed by the phototypesetter or computer printer – creating formatted text as final output (Seybold 1977; Furuta et al. 1982).

The 1960s, 1970s, and 1980s saw extensive development of these document markup systems as software designers worked to improve the efficiency and functionality of digital typesetting and text processing software (IBM 1967; Ossanna 1976; Lesk 1978; Goldfarb 1978; Reid 1978; Knuth 1979; Lamport 1985).

One natural improvement on the approach just described was to replace the long strings of complex formatting codes with simpler abbreviations that could be automatically expanded into the formatting commands being abbreviated. The compositor would enter just the abbreviation instead of the entire string of commands. In many typesetting systems in the 1960s and 1970s these abbreviations were called "macros", a term drawn from assembly language programming where it referred to higher-level symbolic instructions which would be expanded, before program execution, into sequences of lower-level primitive instructions.

In typesetting and text processing these macros had the obvious immediate advantage of easier and more reliable data entry. However, during their early use it wasn't always entirely clear whether a macro got its primary identity from the text component (e.g., a caption, extract, or heading) whose formatting it was controlling, or whether it was simply a short name for the combination of the specific formatting codes it was abbreviating, with no other meaning or identity. The distinction is subtle but important. If a macro is truly just an abbreviation of a string of formatting codes then it can be appropriately used wherever those formatting codes would be used. So if, for instance, figure captions and third-level headings happen to have the same design specifications, then the same macro could reasonably and appropriately be used for both. In such a case the macro, as a mere abbreviation, gets its entire meaning and identity from the formatting commands it abbreviates and it would be natural for the macro name to then simply indicate the appearance of the formatted text (e.g., ":SmallCenteredBold;"), or be an arbitrary expression (e.g., ":format!7;") rather than have a name that suggested an intrinsic relationship with the text component being formatted (such as, ":figurecaption") (Goldfarb 1997).
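The "mere abbreviation" reading of a macro can be made concrete with a minimal sketch. The macro names follow the examples above; the low-level formatting codes (".sz", ".ce", and so on) are invented for illustration and do not reproduce the command set of any actual typesetting system.

```python
import re

# Hypothetical macro table: each short name stands for a string of
# low-level formatting codes (the codes themselves are invented here).
MACROS = {
    "SmallCenteredBold": ".sz 9;.ce on;.bd on;",
    "format!7": ".sz 12;.in 5;",
}

# A macro call is written ":Name;", following the examples in the text.
MACRO_CALL = re.compile(r":([A-Za-z0-9!]+);")


def expand_macros(source):
    """Replace each ':Name;' with the formatting codes it abbreviates."""
    def substitute(match):
        name = match.group(1)
        # Unknown names are left untouched rather than guessed at.
        return MACROS.get(name, match.group(0))
    return MACRO_CALL.sub(substitute, source)


expanded = expand_macros(":SmallCenteredBold;Figure 1. A phototypesetter")
```

On this reading nothing in the file records that the text is a caption: once expanded, a caption and a third-level heading that share a design are indistinguishable.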

Although macros used as described above obviously provided some advantages over entering strings of formatting codes, it was a natural next step to see that many more advantages could be achieved by understanding the presence of the macro name in the file to be identifying the occurrence of a particular text component – a third-level heading, caption, stanza, extract, title, etc. – rather than just being an abbreviation for a string of formatting commands. On this new approach, figure captions, for instance, would be identified with one code (say, ":FigureCaption;"), and third-level headings would be identified with another (say, ":Heading3;"), even if, according to the page design specification currently being applied, these codes were mapped to the same set of formatting commands.

The first advantage to this new approach is that it is now possible to globally alter the formatting of figure captions (by simply updating the formatting commands associated with ":FigureCaption;") without necessarily changing the formatting of the third-level headings (identified by the macro name ":Heading3;"). In addition, authoring and even composition can now take place without the author or compositor needing to know how the different components are to be formatted. As this approach to typesetting and text processing began to be systematically applied it became quickly apparent that there were a great many other advantages. So many advantages, in fact, and such a diversity of advantages, that the descriptive markup approach began to appear to be somehow the fundamentally correct approach to organizing and processing text (Goldfarb 1981; Reid 1981; Coombs et al. 1987).
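The first of these advantages can be illustrated with a small sketch, offered only as an approximation of the idea: the component names come from the discussion above, while the formatting commands and the `format_document` helper are hypothetical.

```python
# Hypothetical style rules: descriptive names on the left, invented
# formatting commands on the right. Both components start out looking alike.
style_rules = {
    "FigureCaption": ".sz 9;.ce on;.bd on;",
    "Heading3": ".sz 9;.ce on;.bd on;",
}

# The text file records only what each component is, never how it looks.
document = [
    ("Heading3", "Emergence of descriptive markup"),
    ("FigureCaption", "Figure 1. A phototypesetter"),
]


def format_document(components, rules):
    """Prefix each component with the formatting commands currently assigned to it."""
    return [rules[name] + text for name, text in components]


before = format_document(document, style_rules)

# A design change to captions is a single edit to one rule;
# the document itself and the heading rule are untouched.
style_rules["FigureCaption"] = ".sz 8;.it on;"
after = format_document(document, style_rules)
```

Because the rules are held apart from the text, the caption design changes everywhere at once while the headings keep their original formatting.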

The promotion of descriptive markup as the fundamentally correct systematic approach in digital publishing and text processing is usually traced to three events: (i) a presentation made by William Tunnicliffe, chairman of the Graphic Communications Association's Composition Committee, at the Canadian Government Printing Office in September 1967; (ii) book designer Stanley Rice's project, also in the late 1960s, of developing a universal catalogue of "editorial structure" tags that would simplify book design and production; and (iii) early work on the text processing macro language "GML", led by Charles Goldfarb, at IBM in 1969 (SGML Users' Group 1990; Goldfarb 1997). In the late 1970s these events would lead to an effort to develop SGML, a standard for machine-readable definitions of descriptive markup languages. Other examples of early use of descriptive markup in digital text processing include the Brown University PRESS hypertext system (Carmody et al. 1969; DeRose and van Dam 1999), and, later in the 1970s, Brian Reid's SCRIBE (Reid 1978, 1981). In addition, the seminal work on text processing by Douglas Engelbart in the 1960s should probably also be seen as exhibiting some of the rudiments of this approach (Engelbart et al. 1973).

Nature and advantages of descriptive markup (adapted from DeRose et al. 1990)

Early experience with descriptive markup encouraged some text processing researchers to attempt to develop a general theoretical framework for markup and to use that framework to support the development of high-function text processing systems. Some of the research and analysis on markup systems was published in the scientific literature (Goldfarb 1981; Reid 1981; Coombs et al. 1987), but most was recorded only in the working documents and products of various standards bodies, and in the manuals and technical documentation of experimental systems.

At the heart of this effort to understand markup systems was the distinction between "descriptive" and "procedural" markup originally put forward by Goldfarb. Descriptive markup was typically said to "identify" or "describe" the "parts" of a document, whereas procedural markup was a "command" or "instruction" invoking a formatting procedure. It was also often said that descriptive markup identified the "logical" or "editorial" parts or "components" of a document, or a text's "content objects" or its "meaningful structure" – emphasizing the distinction between the intrinsic ("logical") structure of the document itself, and the varying visual, graphic features of a particular presentation of that document. (For recent arguments that descriptive markup can be further divided into the genuinely descriptive and the "performative", see Renear 2000.)

Several particular advantages of descriptive markup, such as simplified composition and systematic control over formatting, have already been alluded to, but in order to appreciate how descriptive markup motivated a new theory of the nature of text it is useful to rehearse the number, diversity, and value of the advantages of descriptive markup in overview; so we present a categorized summary below.

Advantages for authoring, composition, and transcription

• Composition is simplified. With descriptive markup, intended formatting considerations make no claim on the attention of the author, compositor, or transcriber, whereas with procedural markup one must remember both (i) the style conventions that are intended, and (ii) the specific commands required by the formatting software to get those effects. With descriptive markup one simply identifies each text component for what it is and the appropriate formatting takes place automatically. (Particularly striking is how descriptive markup allows the author to work at an appropriate "level of abstraction" – identifying something as a quotation, paragraph, or caption is a natural authorial task, while knowing whether to, and how to, format that text a certain way is not.)

• Structure-oriented editing is supported. Descriptive markup supports "structure-oriented editors" who "know" about what patterns of components can be found in a particular genre of document and who use this knowledge to assist the author or compositor. For instance, if a date component must always follow a title component then the software, upon detecting that a title component has just been entered by an author, can automatically add the required date markup and prompt the author to enter the actual date. If either date or status is allowed after a title then the author will be presented with a choice. During editing the cursor location can be used to identify and present to the author the list of components that may be added or deleted at that point. For complicated document genres this means that there is much less for the author to remember and fewer possibilities for error.

• More natural editing tools are supported. Moves and deletes, for example, can take as their targets and scope the natural meaningful parts of the text (words, sentences, paragraphs, sections, extracts, equations, etc.) rather than relying on the mediation of accidental features (such as current lineation) or arbitrarily marked regions.

• Alternative document views are facilitated. An outline view of a text, for instance, can be done automatically, by taking advantage of the descriptive markup for chapters, sections, and headings. Or a more sophisticated and specialized display of portions of documents can be effected using identified discipline-specific components: such as equations, examples, cautions, lines spoken by a particular character in a play script, and so on.
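As a rough illustration of the last point, an outline view might be derived along the following lines. This is a sketch under stated assumptions: the sample document and its element names (book, chapter, section, head, p) are invented, and XML syntax is used only because Python's standard library can parse it.

```python
import xml.etree.ElementTree as ET

# Invented sample document; the element names are assumptions.
SAMPLE = """
<book>
  <chapter><head>Markup</head>
    <section><head>Introduction</head><p>Some text.</p></section>
    <section><head>Emergence of descriptive markup</head><p>More text.</p></section>
  </chapter>
  <chapter><head>SGML and XML</head>
    <section><head>History</head><p>Still more text.</p></section>
  </chapter>
</book>
"""

# Element types treated as outline levels in this sketch.
STRUCTURAL = {"chapter", "section"}


def outline(element, depth=0):
    """Return indented heading lines for every chapter and section, in document order."""
    lines = []
    for child in element:
        if child.tag in STRUCTURAL:
            head = child.find("head")
            title = head.text if head is not None else "(untitled)"
            lines.append("  " * depth + title)
            lines.extend(outline(child, depth + 1))
    return lines


outline_lines = outline(ET.fromstring(SAMPLE))
```

The function consults only the identity and nesting of components; no formatting information is needed to decide what counts as a heading.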

Advantages for publishing

• Formatting can be generically specified and modified. When procedural markup is being used, the appearance of paragraphs can only be modified by editing the formatting commands preceding each actual occurrence of a paragraph in the source file, whereas with descriptive markup only the rule associating formatting commands with the descriptive markup for paragraph needs to be updated – and if these rules are stored separately there may be no need to even alter the files containing text in order to make formatting alterations. Obviously, controlling formatting with descriptive markup is easier, less error-prone, and ensures consistency.

• Apparatus can be automated. Descriptive markup supports the creation of indexes, appendices, and such. For instance, if stanzas and verse lines are explicitly identified, the creation of an index of first lines of stanzas (or second lines or last lines) is a matter of simple programming, and does not require that a human editor again laboriously identify verses and lines. Similarly, generating tables for equations, plates, figures, examples, is also easy, as are indexes of place names, personal names, characters, medium, authors, periods, and so on.

• Output device support is enhanced. When coding is based on logical role rather than appearance, output-device specific support for printers, typesetters, video display terminals and other output devices can be maintained separately, logically and physically, from the data with the convenient result that the data files themselves are output-device independent while their processing is efficiently output-device sensitive.

• Portability and interoperability are maximized. Files that use descriptive markup, rather than complicated lists of application-specific formatting instructions, to identify components are much easier to transfer to other text processing systems. In some cases little more is necessary than a few simple systematic changes: altering the delimiter conventions, substituting one mnemonic name for another, and translating the formatting rules into those for the new system.
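The claim above that an index of first (or last) lines of stanzas is "a matter of simple programming" can be sketched as follows. The element names (poem, stanza, l) and the placeholder verse are assumptions made for the example, not an encoding prescribed by any scheme discussed here.

```python
import xml.etree.ElementTree as ET

# Invented sample poem with placeholder lines.
POEM = """
<poem>
  <stanza><l>First line of stanza one</l><l>Second line of stanza one</l></stanza>
  <stanza><l>First line of stanza two</l><l>Second line of stanza two</l></stanza>
</poem>
"""


def index_of_lines(poem_xml, position=0):
    """Collect the line at `position` in each stanza, sorted alphabetically.

    position=0 gives first lines, position=-1 gives last lines.
    Stanzas too short to have a line at that position are skipped.
    """
    root = ET.fromstring(poem_xml)
    picked = []
    for stanza in root.iter("stanza"):
        lines = stanza.findall("l")
        if -len(lines) <= position < len(lines):
            picked.append(lines[position].text)
    return sorted(picked)


first_lines = index_of_lines(POEM)
last_lines = index_of_lines(POEM, position=-1)
```

No human editor has to identify the stanzas again; the markup already says where each one begins and ends.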

Advantages for archiving, retrieval, and analysis

• Information retrieval is supported. Descriptive markup allows documents to be treated as a database of fielded content that can be systematically accessed. One can request all equations, or all headings, or all verse extracts; or one can request all titles that contain a particular personal name, place name, chemical, disease, drug, or therapy. This can facilitate not only personal information retrieval functions, such as the generation of alternative views, but also a variety of finding aids, navigation, and data retrieval functions. It may seem that in some cases, say where equations are uniquely formatted, it is not necessary to identify them, as the computer could always be programmed to exploit the formatting codes. But in practice it is unlikely that equations will always be consistently formatted, and even more unlikely that they will be uniquely formatted. Similarly, it might be thought that a string search might retrieve all references to the city Chicago, but without markup one cannot distinguish the city, the rock band, and the artist.

• Analytical procedures are supported. Computer-based analysis (stylometrics, content analysis, statistical studies, etc.) can be carried out more easily and with better results if features such as sentences, paragraphs, stanzas, dialogue lines, stage directions, and so on have been explicitly identified so that the computer can automatically distinguish them. Consider some examples: If the style of spoken language in a play is being analyzed, the text that is not speech (stage directions, notes, etc.) must not be conflated with the dialogue. If the speech of a particular character is being studied it must be distinguished from the speech of other characters. If proximity is being studied, the word ending one paragraph should perhaps not be counted as collocated with the word beginning the next, and so on. Descriptive markup identifies these important components and their boundaries, supporting easier, more consistent, and more precise automatic processing.
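The play example can be made concrete with a short sketch. The element and attribute names (scene, sp, who, speaker, stage, l) are loosely modelled on common drama encodings but are assumptions here, as is the invented scene itself.

```python
import xml.etree.ElementTree as ET

# Invented scene; stage directions occur both between and within speeches.
SCENE = """
<scene>
  <stage>Enter two speakers.</stage>
  <sp who="A"><speaker>A</speaker><l>Who goes there?</l></sp>
  <sp who="B"><speaker>B</speaker><l>A friend.</l><stage>Aside</stage><l>Or so I say.</l></sp>
  <sp who="A"><l>Then pass, friend.</l></sp>
</scene>
"""


def spoken_words(scene_xml, who=None):
    """Return the words of dialogue lines only, optionally for one character.

    Stage directions and speaker labels are excluded because only the
    text of <l> elements inside speeches is collected.
    """
    root = ET.fromstring(scene_xml)
    words = []
    for speech in root.iter("sp"):
        if who is not None and speech.get("who") != who:
            continue
        for line in speech.findall("l"):
            words.extend(line.text.split())
    return words


all_dialogue = spoken_words(SCENE)
words_of_b = spoken_words(SCENE, who="B")
```

A plain string search over the same text could not reliably separate "Aside" or the speaker labels from the dialogue; here the separation follows directly from the markup.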

The OHCO view of "what text really is"

That the descriptive markup approach had so many advantages, and so many different kinds of advantages, seemed to some people to suggest that it was not simply a handy way of working with text, but that it was rather in some sense deeply, profoundly, correct, that "descriptive markup is not just the best approach … it is the best imaginable approach" (Coombs et al. 1987).

How could this be? One answer is that the descriptive markup approach, and only the descriptive markup approach, reflects a correct view of "what text really is" (DeRose et al. 1990). On this account, the concepts of descriptive markup entail a model of text, and that model is more or less right. The model in question postulates that text consists of objects of a certain sort, structured in a certain way. The nature of the objects is best suggested by example and contrast. They are chapters, sections, paragraphs, titles, extracts, equations, examples, acts, scenes, stage directions, stanzas, (verse) lines, and so on. But they are not things like pages, columns, (typographical) lines, font shifts, vertical spacing, horizontal spacing, and so on. The objects indicated by descriptive markup have an intrinsic direct connection with the intellectual content of the text; they are the underlying "logical" objects, components that get their identity directly from their role in carrying out and organizing communicative intention. The structural arrangement of these "content objects" seems to be hierarchical – they nest in one another without overlap. Finally, they obviously have a linear order as well: if a section contains three paragraphs, the first paragraph precedes the second, which in turn precedes the third.

On this account then text is an "Ordered Hierarchy of Content Objects" (OHCO), and descriptive markup works as well as it does because it identifies that hierarchy and makes it explicit and available for systematic processing. This account is consistent with the traditional well-understood advantages of "indirection" and "data abstraction" in information science.
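One way to picture the OHCO model is as an ordered tree. The sketch below is offered only as an illustration of the three properties just named (typed objects, nesting without overlap, linear order); the class and function names are hypothetical and are not drawn from the literature cited.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ContentObject:
    """A typed content object whose children are ordered and properly nested."""
    kind: str
    children: List[Union["ContentObject", str]] = field(default_factory=list)


def linear_text(node):
    """Recover the linear character sequence by walking the hierarchy in order."""
    parts = []
    for child in node.children:
        parts.append(child if isinstance(child, str) else linear_text(child))
    return " ".join(parts)


def kinds_in_order(node):
    """List the kinds of all content objects in document (pre-order) order."""
    found = [node.kind]
    for child in node.children:
        if isinstance(child, ContentObject):
            found.extend(kinds_in_order(child))
    return found


section = ContentObject("section", [
    ContentObject("title", ["Markup"]),
    ContentObject("paragraph", ["First paragraph."]),
    ContentObject("paragraph", ["Second paragraph."]),
    ContentObject("paragraph", ["Third paragraph."]),
])
```

Because each object has exactly one parent, overlap cannot be expressed in this structure at all, which is precisely the hierarchical assumption the model makes.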

A number of things seem to fall into place from this perspective. For one thing, different kinds of text have different kinds of content objects (compare the content objects in dramatic texts with those in legal contracts), and typically the patterns in which content objects can occur are at least partially constrained: the parts of a letter occur in a certain order, the lines of a poem occur within, not outside of a stanza, and so on. Presentational features make it easier for the reader to recognize the content objects of the text.

There are alternative models of text that could be compared with the OHCO model. For instance, one could model text as a sequence of graphic characters, as in the "plain vanilla ASCII" approach of Project Gutenberg; as a combination of procedural coding and graphic characters, as in a word processing file; as a complex of geometric shapes, as in "vector graphics" format of an image of a page on which the text is written; as a pure image, as in a raster image format (JPEG, GIF, etc.); or in a number of other ways. However, implementations based on these alternative models are all clearly inferior in functionality to implementations based on an OHCO model, and in any case can be easily and automatically generated from an OHCO format (DeRose et al. 1990; Renear et al. 1996).

The OHCO account of "what text really is" has not gone uncriticized, but many have found it a compelling view of text, one that does explain the effectiveness of descriptive markup and provides a general context for the systematization of descriptive markup languages with formal metalanguages such as SGML.

SGML and XML

As we indicated above, this chapter is not designed to introduce the reader to text encoding, but to provide in overview some useful historical and theoretical background that should help support a deeper understanding. In this section we present some of the principal themes in SGML and XML, and take up several topics that we believe will be useful, but for a complete presentation the reader is again directed to the "Gentle Introduction" and the other readings indicated in the references below.

History of SGML and XML: Part I

Norman Scharpf, director of the Graphic Communications Association, is generally credited with recognizing the significance of the work, mentioned above, of Tunnicliffe, Rice, and Goldfarb, and initiating, in the late 1960s, the GCA "GenCode" project, which had the goal of developing a standard descriptive markup language for publishing (SGML Users' Group 1990; Goldfarb 1997). Soon much of this activity shifted to the American National Standards Institute (ANSI), which selected Charles Goldfarb to lead an effort for a text description language standard based on GML and produced the first working draft of SGML in 1980. These activities were reorganized as a joint project of ANSI and the International Organization for Standardization (ISO) and in 1986 ISO published ISO 8879: Information Processing – Text and Office Systems – Standard Generalized Markup Language (SGML) (ISO 1986; Goldfarb 1991). Later events in the history of SGML, including the development of XML and the profusion of SGML/XML standards that began in the mid-1990s, are discussed below. We will later describe more precisely the relationship between SGML and XML; at this point in the chapter the reader only needs to know that XML is, roughly, just a simpler version of SGML, and as a consequence almost everything we say about SGML applies to XML as well.

The basic idea: a metalanguage for defining descriptive markup languages

Although for a variety of reasons the SGML standard itself can be difficult to read and understand, the basic idea is quite simple. SGML is a language for creating machine-readable definitions of descriptive markup languages. As such it is called a "metalanguage", a language for defining a language. The SGML standard provides a way to say things like this: these are the characters I use for markup tag delimiters; these are the markup tags I am using; these are the acceptable arrangements that the components identified by my markup tags may take in a document; these are some characteristics I may be asserting of those components, these are some abbreviations and shortcuts I'll be using, and so on. An alternative characterization of SGML is as a "metagrammar", a grammar for defining other grammars; here "grammar" is used in a technical sense common in linguistics and computer science. (Although its origin is as a metalanguage for defining document markup languages, SGML can be, and is, used to define markup languages for things other than documents, such as languages for data interchange or interprocess communication, or specific data management applications.)

An almost inevitable first misconception is that SGML is a document markup language with a specific set of document markup tags for features such as paragraphs, chapters, abstracts, and so on. Despite its name ("Standard Generalized Markup Language"), SGML is not a markup language in this sense and includes no document markup tags for describing document content objects. SGML is a metalanguage, a language for defining document markup languages. The confusion is supported by the fact that both markup metalanguages like SGML and XML and also SGML-based markup languages like HTML and XHTML (technically applications of SGML) are "markup languages" in name. In each case the "ML" stands for "markup language", but SGML and HTML are markup languages in quite different senses. The characterization of SGML as a document markup language is not entirely unreasonable of course, as it is in a sense a language for marking up documents, just one without a predefined set of markup for content objects. Whether or not SGML should be called a markup language is historically moot, but to emphasize the distinction between markup languages that are actual sets of markup tags for document content objects (such as HTML, XHTML, TEI, DocBook, etc.) and metalanguages for defining such markup languages (such as SGML and XML) we will, in what follows, reserve the term markup language for languages like HTML and TEI, excluding the metalanguages used to define them.

Standardizing a metalanguage for markup languages rather than a markup language was a key strategic decision. It was an important insight of the GCA GenCode Committee that any attempt to define a common markup vocabulary for the entire publishing industry, or even common vocabularies for portions of the industry, would be extremely difficult, at best. However, a major step towards improving the interoperability of computer applications and data could be achieved by standardizing a metalanguage for defining markup languages. Doing this would obviously be an easier task: (i) it would not require the development of a markup language vocabulary with industry-wide acceptance and (ii) it would not assume that the industry would continue to conform, even as circumstances changed and opportunities for innovations arose, to the particular markup language it had agreed on.

At the same time this approach still ensured that every markup language defined using SGML, even ones not yet invented, could be recognized by any computer software that understood the SGML metalanguage. The idea was that an SGML software application would first process the relevant SGML markup language definition, and then could go on, as a result of having processed that definition, to process the marked-up document itself – now being able to distinguish tags from content, expand abbreviations, and verify that the markup was correctly formed and used and that the document had a structure that was anticipated by the markup language definition. Because the SGML markup language definition does not include processing information such as formatting instructions, software applications would also require instructions for formatting that content; shortly after the SGML project began, work started on a standard for assigning general instructions for formatting and other processing to arbitrary SGML markup. This was the Document Style and Semantics Specification Language (DSSSL). The World Wide Web Consortium (W3C)'s Extensible Stylesheet Language (XSL) is based on DSSSL, and the W3C Cascading Style Sheet specification performs a similar, although more limited function.

This strategy supports technological innovation by allowing the independent development of better or more specialized markup, without the impediment of securing antecedent agreement from software developers that they will manufacture software that can process the new markup – all that is required is that the software understand SGML, and not a specific SGML markup language.

Elements and element types

Up until now we have been using the phrase content object for the logical parts of a document or text. SGML introduces a technical term, element, that roughly corresponds to this notion, developing it more precisely. The first bit of additional precision is that where we have used "content object" ambiguously, sometimes meaning the specific titles, paragraphs, extracts and the like that occur in actual documents, and sometimes meaning the general kind of object (title, paragraph, extract) of which those specific occurrences are instances, SGML has distinct terminology for these two senses of "content object": an actual component of a document is an "element", and the type of an element (title, paragraph, extract, etc.) is an "element type." However, apart from the standard itself, and careful commentary on the standard, it is common, and usually not a problem, for "element" to be used in the sense of "element type" as well. This kind of ambiguity is common enough in natural language and almost always easily and unconsciously resolved by context.
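The distinction between elements and element types can be seen by counting each in a small document. This is a minimal sketch: the sample document is invented, and XML syntax parsed with Python's standard library stands in for SGML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample: one article containing a title and three paragraphs.
DOC = "<article><title>T</title><p>One.</p><p>Two.</p><p>Three.</p></article>"

root = ET.fromstring(DOC)

# Every occurrence in the document is an element ...
elements = list(root.iter())

# ... while the distinct names are the element types those occurrences instantiate.
element_types = Counter(element.tag for element in elements)
```

Here there are five elements but only three element types; the three paragraphs are distinct elements of the single type p.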

The second bit of additional precision is that where we used "content object" to refer to the parts of texts understood in the usual way as abstract cultural objects, independent of, and prior to, any notion of markup languages, the most exact use of "element" in the technical SGML sense is to refer to the combination of SGML markup tags and enclosed content that is being used to represent (for instance, in a computer file) these familiar abstract textual objects. Again the distinction is subtle and "element" is routinely used in both senses (arguably even in the standard itself). Typically, no damage is done, but eventually encoding cruxes, design quandaries, or theoretical disputes may require appeal to the full range of distinctions that can be made when necessary.
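The distinction between elements and element types can be illustrated with a minimal sketch, here in Python using the standard library's xml.etree.ElementTree module. The sample document and its tag names (poem, title, line) are invented for illustration, and XML syntax is used only because a parser for it is readily available.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document; the tag names are illustrative only.
SAMPLE = """<poem>
  <title>Sample</title>
  <line>First line of verse</line>
  <line>Second line of verse</line>
  <line>Third line of verse</line>
</poem>"""

root = ET.fromstring(SAMPLE)

# Every node in the parsed tree is an element: one particular occurrence
# of tags plus enclosed content.
elements = list(root.iter())

# The element types are the distinct kinds that those elements instantiate.
type_counts = Counter(el.tag for el in elements)

print(len(elements), "elements")                    # 5 elements
print(len(type_counts), "element types:", sorted(type_counts))
```

Here the three line elements are distinct elements of a single element type; saying "the line element" of this markup language would be the looser usage described above.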

Document types and document instances

Fundamental to SGML is the notion of the document type, a class of documents with a particular set of content objects that can occur in some combinations but not in others. Some examples of document types might be: novel, poem, play, article, essay, letter, contract, proposal, receipt, catalogue, syllabus, and so on. Some examples of combination rules: in a poem, lines of verse occur within stanzas, not vice versa; in a play, stage directions may occur either within or between speeches but not within other stage directions (and not, normally, in the table of contents); in a catalogue, each product description must contain exactly one part number and one or more prices for various kinds of customers. There are, of course, no predefined document types in SGML; the identification of a document type at an appropriate level of specificity (e.g., poem, sonnet, Petrarchan sonnet, sonnet-by-Shakespeare) is up to the SGML markup language designer.

A document type definition (DTD) defines an SGML markup language for a particular document type. Part of a document type definition is given in a document type declaration that consists of a series of SGML markup declarations that formally define the vocabulary and syntax of the markup language. These markup declarations specify such things as what element types there are, what combinations elements can occur in, what characteristics may be attributed to elements, abbreviations for data, and so on.

The SGML standard is clear that there is more to a document type definition than the markup declarations of the document type declaration: "Parts of a document type definition can be specified by a SGML document type declaration, other parts, such as the semantics of elements and attributes or any application conventions … cannot be expressed formally in SGML" (ISO 1986: 4.105). That is, the document type declaration can tell us that a poem consists of exactly one title followed by one or more lines and that lines can't contain titles, but it does not have any resources for telling us what a title is. The document type definition encompasses more. However, all that the SGML standard itself says about how to provide the required additional information about "semantics … or any application conventions" is: "Comments may be used … to express them informally" (ISO 1986: 4.105). In addition, it is generally thought that properly written prose documentation for a markup language is an account of the "semantics … or any application conventions" of that markup language.
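The poem example can be sketched in code. The following is a rough illustration in Python, not a real SGML or XML validator: the markup declarations are shown in XML DTD syntax for reference only, and CONTENT_MODELS is a hand-written, hypothetical stand-in for them in which each content model becomes a regular expression over the names of an element's children.

```python
import re
import xml.etree.ElementTree as ET

# Markup declarations of the kind described above, in XML DTD syntax.
# Shown for reference only; this sketch does not parse them.
DECLARATIONS = """
<!ELEMENT poem  (title, line+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT line  (#PCDATA)>
"""

# Hand-written stand-in for those declarations: each element type maps to
# a regular expression over the space-terminated names of its children.
CONTENT_MODELS = {
    "poem": r"title (line )+",  # exactly one title, then one or more lines
    "title": r"",               # no child elements allowed
    "line": r"",                # no child elements allowed
}

def conforms(element):
    """Return True if element and all its descendants match CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:  # undeclared element type
        return False
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        return False
    return all(conforms(child) for child in element)

good = ET.fromstring("<poem><title>T</title><line>a</line><line>b</line></poem>")
no_title = ET.fromstring("<poem><line>a</line></poem>")
nested = ET.fromstring("<poem><title>T</title><line><title>x</title></line></poem>")

print(conforms(good), conforms(no_title), conforms(nested))  # True False False
```

Nothing in the check says what a title is: the name "title" could be replaced by any other string without changing the result, which is the sense in which the declarations capture only part of the definition.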

Obviously these two expressions "document type definition" and "document type declaration", which are closely related, sound similar, and have the same initials, can easily be confused. But the difference is important, if irregularly observed. We reiterate: it is the document type definition that defines a markup language; part of that definition is given in the markup declarations of the document type declaration; and an additional portion, which includes an account of the meaning of the elements and attributes, is presented informally in comments or in the prose documentation of the markup language. We note that according to the SGML standard the acronym "DTD" stands for "document type definition", which accords poorly with the common practice of referring to the markup declarations of the document type declaration as, collectively, the "DTD", and using the extension "DTD" for files containing the markup declarations – these being only part of the DTD.

A document instance is "the data and markup for a hierarchy of elements that conforms to a document type definition" (ISO 1986: 4.100, 4.160). Typically, we think of the file of text and markup, although in fact the definition, like SGML in general, is indifferent to the physical organization of information: an SGML document instance may be organized as a "file", or it may be organized some other way.

Strictly speaking (i.e., according to ISO 1986, definitions 4.100 and 4.160), a document instance by definition actually conforms to the document type definition that it is putatively an instance of. This means that (i) the requirements for element and attribute use expressed in the markup declarations are met; and (ii) any other expressed semantic constraints or application conventions are followed. However, actual usage of "document instance" deviates from this definition in three ways: (i) it is common to speak of text and markup as a "document instance" even if it in fact fails to conform to the markup declarations of the intended DTD (as in "The document instance was invalid"); (ii) text plus markup is referred to as a document instance even if it is not clear what if any DTD it conforms to; and (iii) even a terminological purist wouldn't withhold the designation "document instance" because of a failure to conform to the "semantics and application conventions" of the document type definition – partly, no doubt, because there is no established way to automatically test for semantic conformance.

History of SGML and XML: Part II

Although SGML was soon used extensively in technical documentation and other large-scale publishing throughout the world, it did not have the broad penetration into consumer publishing and text processing that some of its proponents had expected. Part of the problem was that it was during this same period that "WYSIWYG" word processing and desktop publishing emerged, and had rapid and extensive adoption. Such software gave users a sense that they were finally (or once again?) working with "the text itself", without intermediaries, and that they had no need for markup systems; they could, again, return to familiar typewriter-based practices, perhaps enhanced with "cut-and-paste." This was especially frustrating for SGML enthusiasts because there is no necessary real opposition between WYSIWYG systems and descriptive markup systems and no reason, other than current market forces and perhaps technical limitations, why a WYSIWYG system couldn't be based on the OHCO model of text, and store its files in SGML. Most of what users valued in WYSIWYG systems – immediate formatting, menus, interactive support, etc. – was consistent with the SGML/OHCO approach. Users would of course need to forgo making formatting changes "directly", but on the other hand they would get all the advantages of descriptive markup, including simpler composition, context-oriented support for selecting document components, global formatting, and so on. There would never, needless to say, be any need to remember, type, or even see, markup tags. The "WYSIWYG" classification was misleading in any case, since what was at issue was really as much the combination of editing conveniences such as interactive formatting, pull-down menus, and graphical displays, as it was a facsimile of how the document would look when printed – which would typically vary with the circumstances, and often was not up to the author in any case.

But very few SGML text processing systems were developed, and fewer still for general users. Several explanations have been put forward for this. One was that OHCO/SGML-based text processing was simply too unfamiliar, even if more efficient and more powerful, and that the unfamiliarity, and the prospect of spending even an hour or two learning new practices, posed a marketing problem for salesmen trying to make a sale "in three minutes on the floor of Radio Shack." Another was that OHCO/SGML software was just too hard to develop: the SGML standard had so many optional features, as well as several characteristics that were hard to program, that these were barriers to development of software. Several OHCO/SGML systems were created in the 1980s, but these were not widely used.

The modest use of SGML outside of selected industries and large organizations changed radically with the emergence of HTML, the HyperText Markup Language, on the World Wide Web. HTML was designed as an SGML language (and with tags specifically modeled on the "starter tag set" of IBM DCF/GML). However, as successful as HTML was, it became apparent that it had many problems, and that the remedies for these problems could not be easily developed without changes in SGML.

From the start HTML's relationship with SGML was flawed. For one thing HTML began to be used even before there was a DTD that defined the HTML language. More important, though, was the fact that HTML indiscriminately included element types not only for descriptive markup, but for procedural markup as well ("font", "center", and so on). In addition there was no general stylesheet provision for attaching formatting or other processing to HTML. Finally none of the popular web browsers ever validated the HTML they processed. Web page authors had only the vaguest idea what the syntax of HTML actually was and had little motivation to learn, as browsers were forgiving and always seemed to more or less "do the right thing." DTDs seemed irrelevant and validation unnecessary. HTML files were in fact almost never valid HTML.

But perhaps the most serious problem was that the HTML element set was impoverished. If Web publishing were going to achieve its promise it would have to accommodate more sophisticated and specialized markup languages. That meant that it would need to be easier to write the software that processed DTDs, so that arbitrary markup vocabularies could be used without prior coordination. And it was also obvious that some provision needed to be made for allowing more reliable processing of document instances without DTDs.

This was the background for the development of XML 1.0, the "Extensible Markup Language." XML was developed within the World Wide Web Consortium (W3C) with the first draft being released in 1996, the "Proposed Recommendation" in 1997, and the "Recommendation" in 1998. The committee was chaired by Jon Bosak of Sun Microsystems, and technical discussions were conducted in a working group of 100 to 150 people, with final decisions on the design made by an editorial review board with eleven members. The editors, leading a core working group of eleven people, supported by a larger group of 150 or so experts in the interest group, were Michael Sperberg-McQueen (University of Illinois at Chicago and the Text Encoding Initiative), Tim Bray (Textuality and Netscape), and Jean Paoli (Microsoft).

XML

The principal overall goal of XML was to ensure that new and specialized markup languages could be effectively used on the Web, without prior coordination between software developers and content developers. One of the intermediate objectives towards this end was to create a simpler, more constrained version of SGML, so that with fewer options to support and other simplifications, it would be easier for programmers to develop SGML/XML software, and then more SGML software would be developed and used, and would support individualized markup languages. This objective was achieved: the XML specification is about 25 pages long, compared to SGML's 155 (664 including the commentary in Charles Goldfarb's SGML Handbook), and while the Working Group did not achieve their stated goal that a graduate student in computer science should be able to develop an XML parser in a week, the first graduate student to try it reported success in a little under two weeks.

Another objective was to allow for processing of new markup languages even without a DTD. This requires several additional constraints on the document instance, one of which is illustrative and will be described here. SGML allowed markup tags to be omitted when they were implied by their markup declarations. For instance if the element type declaration for paragraph did not allow a paragraph inside of a paragraph, then the start-tag of a new paragraph would imply the closing of the preceding paragraph even without an end-tag, and so the end-tag could be omitted by the content developer as it could be inferred by the software. But without a DTD it is not possible to know what can be nested and what can't – and so in order to allow "DTD-less" processing XML does not allow tags to be omitted. It is the requirement that elements have explicit end-tags that is perhaps the best-known difference between documents in an SGML language like HTML and an XML language like XHTML.
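The paragraph example can be made concrete with a toy sketch in Python. It is not an SGML parser: it handles only tags without attributes, and the set passed as no_self_nesting is an assumed stand-in for the information that the markup declarations would supply.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)>")

def insert_implied_end_tags(text, no_self_nesting):
    """Make omitted end-tags explicit, given which element types
    may not be nested directly inside themselves."""
    out = []
    open_elements = []  # stack of currently open element types
    pos = 0
    for match in TAG.finditer(text):
        out.append(text[pos:match.start()])  # character data before the tag
        pos = match.end()
        is_end, name = match.group(1) == "/", match.group(2)
        if is_end:
            # An end-tag also closes any still-open elements inside it.
            if name in open_elements:
                while open_elements[-1] != name:
                    out.append(f"</{open_elements.pop()}>")
                open_elements.pop()
            out.append(f"</{name}>")
        else:
            # A start-tag closes an open element of the same type when
            # that type may not contain itself.
            if (open_elements and open_elements[-1] == name
                    and name in no_self_nesting):
                out.append(f"</{open_elements.pop()}>")
            open_elements.append(name)
            out.append(f"<{name}>")
    out.append(text[pos:])
    return "".join(out)

source = "<body><p>First paragraph<p>Second paragraph</body>"

# With the declaration-derived knowledge that p cannot contain p:
with_rule = insert_implied_end_tags(source, {"p"})
# Without that knowledge, the same input yields a different structure:
without_rule = insert_implied_end_tags(source, set())

print(with_rule)
print(without_rule)
```

The two results differ (sibling paragraphs in the first, a nested paragraph in the second), which is why software that has no access to the declarations cannot safely reconstruct omitted tags.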

Whereas SGML has just one characterization of conformance for document instances, XML has two: (i) all XML documents must be well-formed; (ii) an XML document may or may not be valid (vis-a-vis a DTD). To be well-formed, a document instance need not conform to a DTD, but it must meet other requirements, prominent and illustrative among them: no start-tags or end-tags may be omitted, elements must not "overlap", attribute values must be in quotation marks, and case must be consistent. These requirements ensure that software processing the document instance will be able to unambiguously determine a hierarchy (or tree) of elements and their attribute assignments, even without a DTD.
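These well-formedness requirements can be checked without any DTD. As a minimal sketch, the following Python fragment uses the standard library's xml.etree.ElementTree parser, which does not validate but does reject input that is not well-formed; the sample strings and their labels are invented for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples, one per requirement named in the text.
CASES = {
    "well-formed": '<p n="1">Some <emph>text</emph></p>',
    "omitted end-tag": '<p n="1">Some <emph>text</p>',
    "overlapping elements": '<p><a>one <b>two</a> three</b></p>',
    "unquoted attribute value": '<p n=1>Some text</p>',
    "inconsistent case": '<p>Some text</P>',
}

results = {label: is_well_formed(text) for label, text in CASES.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

Only the first sample is accepted. Whether that sample is also valid is a separate question that cannot be asked until a DTD is supplied.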

The Text Encoding Initiative

Background

The practice of creating machine-readable texts to support humanities research began early and grew rapidly. Literary text encoding is usually dated from 1949, when Father Roberto Busa began using IBM punched-card equipment for the Index Thomisticus – in other words, literary text encoding is almost coeval with digital computing itself. By the mid-1960s there were at least three academic journals focusing on humanities computing, and a list of "Literary Works in Machine-Readable Form" published in 1966 already required 25 pages (and included some projects, listed modestly as single items, that were encoding an entire authorial oeuvre) (Carlson 1967). It is tempting to speculate that efforts to standardize encoding practices must have begun as soon as there was more than one encoding project. Anxiety about the diversity of encoding systems appears early – one finds that at a 1965 conference on computers and literature, for instance, an impromptu meeting was convened to discuss "the establishment of a standard format for the encoding of text … a matter of great importance." It obviously wasn't the first time such a concern had been expressed, and it wouldn't be the last (Kay 1965).

At first, of course, the principal problem was consistent identification and encoding of the characters needed for the transcription of literary and historical texts. But soon it included encoding of structural and analytic features as well. A standard approach to literary text encoding would have a number of obvious advantages; it would make it easier for projects and researchers to share texts, possible to use the same software across textual corpora (and therefore more economical to produce such software), and it would simplify the training of encoders – note that these advantages are similar to those, described above, which motivated the effort to develop standards in commercial electronic publishing. And one might hope that without the disruptive competitive dynamic of commercial publishing where formats are sometimes aspects of competitive strategy, it would be easier to standardize. But there were still obstacles. For one thing, given how specialized text encoding schemes sometimes were, and how closely tied to the specific interests and views – even disciplines and theories – of their designers, how could a single common scheme be decided on?

Origins

The TEI had its origins in early November, 1987, at a meeting at Vassar College, convened by the Association for Computers in the Humanities and funded by the National Endowment for the Humanities. It was attended by thirty-two specialists, from many different disciplines and representing professional societies, libraries, archives, and projects in a number of countries in Europe, North America, and Asia. There was a sense of urgency, as it was felt that the proliferation of needlessly diverse and often poorly designed encoding systems threatened to block the development of the full potential of computers to support humanities research.

The resulting "Poughkeepsie Principles" defined the project of developing a set of text encoding guidelines. This work was then undertaken by three sponsoring organizations: The Association for Computers in the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics. A Steering Committee was organized and an Advisory Board of delegates from various professional societies was formed. Two editors were chosen: Michael Sperberg-McQueen of the University of Illinois at Chicago, and Lou Burnard of Oxford University. Four working committees were formed and populated. By the end of 1989 well over fifty scholars were already directly involved and the size of the effort was growing rapidly.

The first draft ("P1") of the TEI Guidelines was released in June, 1990. Another period of development followed (the effort now expanded and reorganized into 15 working groups) releasing revisions and extensions throughout 1990–93. Early on in this process a number of leading humanities textbase projects began to use the draft Guidelines, providing valuable feedback and ideas for improvement. At the same time workshops and seminars were conducted to introduce the Guidelines as widely as possible and ensure a steady source of experience and ideas. During this period, comments, corrections, and requests for additions arrived from around the world. After another round of revisions and extensions the first official version of the Guidelines ("P3") was released in May, 1994.

The TEI Guidelines have been an enormous success and today nearly every humanities textbase project anywhere in the world uses TEI. In 2002 the newly formed TEI Consortium released "P4", a revision that re-expressed the TEI Guidelines in XML, and today the Consortium actively continues to evolve the Guidelines and provide training, documentation, and other support. After HTML, the TEI is probably the most extensively used SGML/XML text encoding system in academic applications.

The history of the development of the TEI Guidelines, briefly told here, makes evident an important point. The Guidelines were, and continue to be, the product of a very large international collaboration of scholars from many disciplines and nations, with many different interests and viewpoints, and with the active participation of other specialists from many different professions and institutions. The work was carried out over several years within a disciplined framework of development, evaluation, and coordination, and with extensive testing in many diverse projects. All of this is reflected in a text encoding system of extraordinary power and subtlety.

General nature and structure

The TEI Guidelines take the form of several related SGML document type definitions, specified both in formal markup declarations and in English prose. The main DTD, for the encoding of conventional textual material such as novels, poetry, plays, dictionaries, or term-bank data, is accompanied by a number of auxiliary DTDs for specialized data: tag-set documentation, feature-system declarations used in structured linguistic or other annotation, writing-system declarations, and free-standing bibliographic descriptions of electronic texts.

The main DTD itself is partitioned into eighteen discrete tag sets or modules, which can be combined in several thousand ways to create different views of the main DTD. Some tag sets are always included in a view (unless special steps are taken to exclude them), some (the "base" tag sets) are mutually exclusive, and some (the "additional" tag sets) may be combined with any other tag sets.

In addition to the flexibility of choosing among the various tag sets, the main TEI DTD takes a number of steps to provide flexibility for the encoder and avoid a Procrustean rigidity that might interfere with the scholarly judgment of the encoder. Individual element types may be included or excluded from the DTD, additional element types may be introduced (as long as they are properly documented), and the names of element types and attributes may be translated into other languages for the convenience of the encoder or user.

It cannot be repeated enough that the apparent complexity of the TEI is, in a sense, only apparent. For the most part only as much of the TEI as is needed will be used in any particular encoding effort; the TEI vocabulary used will be exactly as complex, but no more complex, than the text being encoded. Not surprisingly, a very small TEI vocabulary, known as TEI Lite (http://www.tei.org), is widely used for simple texts.

Larger significance of the TEI

The original motivation of TEI was to develop interchange guidelines that would allow projects to share textual data (and theories about that data) and promote the development of common tools. Developing such a language for the full range of human written culture, the full range of disciplinary perspectives on those objects, and the full range of competing theories was a daunting task.

It is easy to talk about accommodating diversity, about interdisciplinarity, about multiculturalism, about communication across various intellectual gaps and divides. But few efforts along these lines are more than superficial. The experience of the TEI makes it evident why this is so. Not only do different disciplines have quite different interests and perspectives, but, it seems, different conceptual schemes: fundamentally different ways of dividing up the world. What is an object of critical contest and debate for one discipline, is theory-neutral data for another, and then completely invisible to a third. What is structured and composite for one field is atomic for another and an irrelevant conglomeration to a third. Sometimes variation occurred within a discipline, sometimes across historical periods of interest, sometimes across national or professional communities. Practices that would seem to have much in common could vary radically – and yet have enough in common for differences to be a problem! And even where agreement in substance was obtained, disagreements over nuances, of terminology for instance, could derail a tenuous agreement.
(Mylonas and Renear 1999)

The TEI made several important decisions that gave the enterprise at least the chance of success. For one thing the TEI determined that it would not attempt to establish in detail the "recognition criteria" for TEI markup; it would be the business of the encoder-scholar, not that of the TEI, to specify precisely what criteria suffice to identify a paragraph, or a section, or a technical term, in the text being transcribed. Examples of features commonly taken to indicate the presence of an object were given for illustration, but were not defined as normative. In addition the TEI, with just a few trivial exceptions, does not specify what is to be identified and encoded. Again, that is the business of the scholar or encoder. What the TEI does is provide a language to be used when the encoder (i) recognizes a particular object, and (ii) wishes to identify that object. In this sense, the TEI doesn't require antecedent agreement about what features of a text are important and how to tell whether they are present; instead, it makes it possible, when the scholar wishes, to communicate a particular theory of what the text is. One might say that the TEI is an agreement about how to express disagreement.

The principal goal of the TEI, developing an interchange language that would allow scholars to exchange information, was ambitious enough. But the TEI succeeded not only in this, but at a far more difficult project, the development of a new data description language that substantially improves our ability to describe textual features, not just our ability to exchange descriptions based on current practice. The TEI Guidelines represent

an elucidation of current practices, methods, and concepts, [that] opens the way to new methods of analysis, new understandings, and new possibilities for representation and communication. Evidence that this is indeed a language of new expressive capabilities can be found in the experience of pioneering textbase projects which draw on the heuristic nature of the TEI Guidelines to illuminate textual issues and suggest new analyses and new techniques.
(Mylonas and Renear 1999)

Finally, we note that the TEI is now itself a research community,

connecting many professions, disciplines, and institutions in many countries and defining itself with its shared interests, concepts, tools, and techniques. Its subject matter is textual communication, with the principal goal of improving our general theoretical understanding of textual representation, and the auxiliary practical goal of using that improved understanding to develop methods, tools, and techniques that will be valuable to other fields and will support practical applications in publishing, archives, and libraries. It has connections to knowledge representation systems (formal semantics and ontology, object orientation methodologies, etc.), new theorizing (non-hierarchical views of text, antirealism, etc.), and new applications and tools … providing new insights into the nature of text, and new techniques for exploiting the emerging information technologies.
(Mylonas and Renear 1999)

Concluding Remarks

Text encoding has proven an unexpectedly exciting and illuminating area of activity, and we have only been able to touch on a few aspects here. Ongoing work is of many kinds and taking place in many venues, from experimental hypermedia systems, to improved techniques for literary analysis, to philosophical debates about textuality. The references below should help the reader explore further.

For Further Information

SGML, XML, and related technologies

The best introduction to XML is the second chapter of the TEI Guidelines, "A Gentle Introduction to XML", online at <http://www.tei-c.org/P4X/SG.html>.

An excellent single-volume presentation of a number of XML-related standards, with useful practical information, is Harold and Means (2001). See also Bradley (2001).

The SGML standard is available as Goldfarb (1991), a detailed commentary that includes a cross-reference annotated version of the specification. A treatment of a number of subtleties is DeRose (1997).

A good detailed presentation of the XML specification is Graham and Quin (1999).

There are two extremely important resources on the Web for anyone interested in any aspect of contemporary SGML/XML text encoding. One is the extraordinary "Cover Pages", http://xml.coverpages.org, edited by Robin Cover and hosted by OASIS, an industry consortium. These provide an unparalleled wealth of information and news about XML and related standards. The other is the W3C site itself, http://www.w3c.org. These will provide an entry to a vast number of other sites, mailing lists, and other resources. Two other websites are also valuable: Cafe con Leche http://www.ibiblio.org/xml/, and XML.com http://www.xml.com. The most advanced work in SGML/XML markup research is presented at IDEAlliance's annual Extreme Markup Languages conference; the proceedings are available online, at http://www.idealliance.org/papers/extreme03/.

Document analysis, data modeling, DTD design, project design

Unfortunately there is a lack of intermediate and advanced material on the general issues involved in analyzing document types and developing sound DTD designs. There is just one good book-length treatment: Maler and Andaloussi (1996), although there are introductory discussions from various perspectives in various places (e.g., Bradley 2001). Other relevant discussions can be found in various places on the Web (see, for example, the debate over elements vs. attributes at http://xml.coverpages.org/elementsAndAttrs.html) and in various discussion lists and the project documentation and reports of encoding projects. (There are however very good resources for learning the basic techniques of TEI-based humanities text encoding; see below.)

Humanities text encoding and the TEI

For an excellent introduction to all aspects of contemporary humanities text encoding see Electronic Texts in the Humanities: Theory and Practice, by Susan Hockey (Hockey 2001).

The TEI Consortium, http://www.tei-c.org, periodically sponsors or endorses workshops or seminars in text encoding using the Guidelines; these are typically of high quality and a good way to rapidly get up to speed in humanities text encoding. There are also self-paced tutorials and, perhaps most importantly, a collection of project descriptions and encoding documentation from various TEI projects. See: http://www.tei-c.org/Talks/ and http://www.tei-c.org/Tutorials.index.html. Anyone involved in TEI text encoding will want to join the TEI-L discussion list.

Theoretical issues

Text encoding has spawned an enormous number and wide variety of theoretical discussions and debates, ranging from whether the OHCO approach neglects the "materiality" of text, to whether TEI markup is excessively "interpretative", to whether texts are really "hierarchical", to the political implications of markup, to whether SGML/XML is flawed by the lack of an algebraic "data model." From within the humanities computing community some of the influential researchers on these topics include Dino Buzzetti, Paul Caton, Claus Huitfeldt, Julia Flanders, Jerome McGann, Michael Sperberg-McQueen, Alois Pichler, and Wendell Piez, among others. A compendium of these topics, with references to the original discussions, may be found at http://www.isrl.uiuc.edu/eprg/markuptheoryreview.html.

One theoretical topic promises to soon have a broad impact beyond as well as within the digital humanities and deserves special mention. There has recently been a renewed concern that SGML/XML markup itself is not a "data model" or a "conceptual model", or does not have a "formal semantics." The criticism is that SGML/XML markup serializes a data structure but does not express the meaning of that data structure (the specification of the conceptual document) in a sufficiently formal way. For an introduction to this topic see Renear et al. (2002) and Robin Cover (1998, 2003). For a related discussion from the perspective of literary theory see Buzzetti (2002). Obviously these issues are closely related to those currently being taken up in the W3C Semantic Web Activity (http://www.w3.org/2001/sw/Activity).

Acknowledgments

The author is deeply indebted to Michael Sperberg-McQueen for comments and corrections. Remaining errors and omissions are the author's alone.

Bibliography

Bradley, N. (2001). The XML Companion. Reading, MA: Addison-Wesley.

Buzzetti, D. (2002). Digital Representation and the Text Model. New Literary History 33: 61–88.

Carlson, G. (1967). Literary Works in Machine-Readable Form. Computers and the Humanities 1: 75–102.

Carmody, S., W. Gross, T. H. Nelson, D. Rice, and A. van Dam (1969). A Hypertext Editing System for the 360. In M. Faiman and J. Nievergelt (eds.), Pertinent Concepts in Computer Graphics (pp. 291–330). Champaign: University of Illinois Press.

Chicago (1993). The Chicago Manual of Style: The Essential Guide for Writers, Editors, and Publishers, 14th edn. Chicago: University of Chicago Press.

Coombs, J. H., A. H. Renear, and S. J. DeRose (1987). Markup Systems and the Future of Scholarly Text Processing. Communications of the Association for Computing Machinery 30: 933–47.

Cover, R. (1998). XML and Semantic Transparency. OASIS Cover Pages. At http://www.oasis-open.org/cover/xmlAndSemantics.html.

Cover, R. (2003). Conceptual Modeling and Markup Languages. OASIS Cover Pages. At http://xml.coverpages.org/conceptualModeling.html.

DeRose, S. J. (1997). The SGML FAQ Book: Understanding the Foundation of HTML and XML. Boston: Kluwer.

DeRose, S. J., and A. van Dam (1999). Document Structure and Markup in the PRESS Hypertext System. Markup Languages: Theory and Practice 1: 7–32.

DeRose, S. J., D. Durand, E. Mylonas, and A. H. Renear (1990). What Is Text, Really? Journal of Computing in Higher Education 1: 3–26. Reprinted in the ACM/SIGDOC Journal of Computer Documentation 21,3: 1–24.

Engelbart, D. C., R. W. Watson, and J. C. Norton (1973). The Augmented Knowledge Workshop. AFIPS Conference Proceedings, vol. 42, National Computer Conference, June 4–8, 1973, pp. 9–21.

Furuta, R., J. Scofield, and A. Shaw (1982). Document Formatting Systems: Survey, Concepts, and Issues. ACM Computing Surveys 14: 417–72.

Goldfarb, C. F. (1978). Document Composition Facility: Generalized Markup Language (GML) Users Guide. IBM General Products Division.

Goldfarb, C. F. (1981). A Generalized Approach to Document Markup. Proceedings of the ACM SIGPLAN-SIGOA Symposium on Text Manipulation (pp. 68–73). New York: ACM.

Goldfarb, C. F. (1991). The SGML Handbook. Oxford: Oxford University Press.

Goldfarb, C. F. (1997). SGML: The Reason Why and the First Published Hint. Journal of the American Society for Information Science 48.

Graham, I. G. and L. Quin (1999). The XML Specification Guide. New York: John Wiley.

Harold, E. R. and W. S. Means (2001). XML in a Nutshell: A Quick Desktop Reference. Sebastopol, CA: O'Reilly & Associates.

Hockey, S. M. (2001). Electronic Texts in the Humanities: Theory and Practice. Oxford: Oxford University Press.

IBM (1967). Application Description, System/360 Document Processing System. White Plains, NY: IBM.

Ide, N. M. and C. M. Sperberg-McQueen (1997). Toward a Unified Docuverse: Standardizing Document Markup and Access without Procrustean Bargains. Proceedings of the 60th Annual Meeting of the American Society for Information Science, ed. C. Schwartz and M. Rorvig (pp. 347–60). Medford, NJ: Learned Information Inc.

ISO (1986). Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML). ISO 8879:1986 (E). Geneva: International Organization for Standardization.

Kay, M. (1965). Report on an Informal Meeting on Standard Formats for Machine-readable Text. In J. B. Bessinger and S. M. Parrish (eds.), Literary Data Processing Conference Proceedings (pp. 327–28). White Plains, NY: IBM.

Knuth, D. E. (1979). TeX and Metafont: New Directions in Typesetting. Bedford, MA: Digital Press.

Lamport, L. (1985). LaTeX: A Document Preparation System. Reading, MA: Addison-Wesley.

Lesk, M. E. (1978). Typing Documents on the UNIX System: Using the -ms Macros with Troff and Nroff. Murray Hill, NJ: Bell Laboratories.

Maler, E. and E. L. Andaloussi (1996). Developing SGML DTDs: From Text to Model to Markup. Englewood Cliffs, NJ: Prentice Hall.

Mylonas, E. and A. Renear (1999). The Text Encoding Initiative at 10: Not Just an Interchange Format Anymore – but a New Research Community. Computers and the Humanities 33: 1–9.

Ossanna, J. F. (1976). NROFF/TROFF User's Manual (Tech. Rep. 54). Murray Hill, NJ: Bell Laboratories.

Reid, B. K. (1978). Scribe Introductory User's Manual. Pittsburgh, PA: Carnegie-Mellon University, Computer Science Department.

Reid, B. K. (1981). Scribe: A Document Specification Language and its Compiler. PhD thesis. Pittsburgh, PA: Carnegie-Mellon University. Also available as Technical Report CMU-CS-81-100.

Renear, A. H. (2000). The Descriptive/Procedural Distinction is Flawed. Markup Languages: Theory and Practice 2: 411–20.

Renear, A. H., E. Mylonas, and D. G. Durand (1996). Refining Our Notion of What Text Really Is: The Problem of Overlapping Hierarchies. In S. Hockey and N. Ide (eds.), Research in Humanities Computing 4: Selected Papers from the ALLC/ACH Conference, Christ Church, Oxford, April, 1992 (pp. 263–80).

Renear, A. H., D. Dubin, C. M. Sperberg-McQueen, and C. Huitfeldt (2002). Towards a Semantics for XML Markup. In R. Furuta, J. I. Maletic, and E. Munson (eds.), Proceedings of the 2002 ACM Symposium on Document Engineering (pp. 119–26). McLean, VA, November. New York: Association for Computing Machinery.

Seybold, J. W. (1977). Fundamentals of Modern Composition. Media, PA: Seybold Publications.

SGML Users' Group (1990). A Brief History of the Development of SGML. At http://www.sgmlsource.com/history/sgmlhist.htm.

Sperberg-McQueen, C. M. (1991). Text Encoding and Enrichment. In Ian Lancashire (ed.), The Humanities Computing Yearbook 1989–90. Oxford: Oxford University Press.

Sperberg-McQueen, C. M. and L. Burnard (eds.) (1994). Guidelines for Text Encoding and Interchange (TEI P3). Chicago, Oxford: ACH/ALLC/ACL Text Encoding Initiative.

Sperberg-McQueen, C. M. and L. Burnard (eds.) (2002). Guidelines for Text Encoding and Interchange (TEI P4). Oxford, Providence, Charlottesville, Bergen: ACH/ALLC/ACL Text Encoding Initiative.

Spring, M. B. (1989). The Origin and Use of Copymarks in Electronic Publishing. Journal of Documentation 45: 110–23.