One of the first issues we must encounter in thinking about the durability of digital texts is the format in which those texts are produced and encoded. The format selected can enable long-term access to the text by its adherence to a commonly agreed-upon set of standards for the production of that text, or it can create difficulties in preservation through its use of a non-standard protocol. This is not to say that all texts must or even should conform to the same structures and formats; any text of course contains its own peculiarities, and the possibilities presented by digital publishing only expand the range of potential forms and formats. But certain kinds of standardization are helpful for ensuring that a text is at least commonly readable across as many platforms as possible, and for as long as possible.
We employ standards in this way across our lives, where they often appear wholly naturalized but in fact represent the imposition of certain kinds of socially determined regulations that provide us with a stable and reliable experience of the phenomenon in question; the electrical system that provides power to our homes and offices, for instance, does so through a set of standards for voltages and interfaces, and nearly anyone who has traveled abroad can testify to the problems that using an appliance that does not conform to the local standards can produce. Even time itself had to be standardized; the development of phenomena such as time zones didn’t take place until the spread of the railroads demanded a commonly accepted schedule. Textual standards exist for many of the same reasons, making nearly any given newspaper, journal, or book we pick up, from any publisher in any city, instantly comprehensible to us (at least in format, if not in the particulars of its content). But the phenomena that operate all but invisibly to make the pages of a book readable to us today, including spacing between words, punctuation, regularized spelling, paragraphing, page numbers and headers, tables of contents, and so forth, took centuries to develop. Digital texts, by contrast, proliferated quickly enough, and their producers were concerned enough about sharing them, that the problem of standards arose quite early in their lifespan.
Certain kinds of standards have long been available in web publishing; standards for HTML, or HyperText Markup Language, for instance, are developed by the World Wide Web Consortium, or W3C, which, under the direction of World Wide Web inventor Tim Berners-Lee, issues protocols and guidelines designed to ensure robust web interoperability. Such “vendor-neutral” interoperability ensures, among other things, that web pages are not only interpretable by any major browser, but will remain so into the future.[4.4] As Nick Montfort and Noah Wardrip-Fruin advise authors of electronic literature, “[v]alidating a page or site, using a service like the W3C Validator or the validator built into BBEdit, ensures that all browsers that comply with World Wide Web Consortium standards, now and in the future, will deal with the page correctly” (Montfort and Wardrip-Fruin). This is not to say that HTML hasn’t changed over time, or that browsers are somehow required to conform to the W3C’s recommendations; rather, the web’s general stability is the product of voluntary cooperation among a very wide range of W3C member organizations, including many hardware and software manufacturers who recognize the value of ensuring that their products comply with what the broader industry considers its “best practices,” so that those products might be adopted by as wide a range of users as possible.
One of the ways that the standardization of HTML works is through a separation between issues that relate to a web document’s structure and issues that relate to that document’s design. This separation is in part a legacy of HTML’s parent language, SGML, or Standard Generalized Markup Language. SGML relies on an interpreter-agnostic set of tags that describe the structural characteristics of a text and its component parts, ignoring entirely the way that any given browser or system will present those tags. HTML thus inherits from SGML tags such as <head> to designate a document’s header information, <body> to designate the main content of a document, <h1> to designate a top-level heading, <p> to designate paragraphs, and so forth. None of these tags specifies anything about the actual presentation of the data it contains on the computer screen; <h1> demarcates a heading, but says nothing about the font or size of that heading. Similarly, to emphasize text within a paragraph, HTML provides the <em> tag, which generally renders as (but does not specify) italics.[4.5] Also inherited from SGML is the fact that most such tags come in pairs, indicating both the beginning and the end of the data they contain; <h1>Introduction</h1> thus produces a level-one heading that reads “Introduction,” formatted in whatever way the browser’s defaults indicate, and everything after it belongs to some other part of the document’s structure.
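This structural logic can be seen in a minimal HTML document (a hypothetical sketch; the title and text here are placeholders):

```html
<!-- A minimal structural HTML document: every tag describes what
     its contents are, not how they should look when rendered. -->
<html>
  <head>
    <title>A Sample Page</title>
  </head>
  <body>
    <h1>Introduction</h1>
    <p>Standards make texts <em>durable</em> as well as readable.</p>
  </body>
</html>
```

Any standards-compliant browser, whether from 1993 or today, can interpret this document; each supplies its own defaults for fonts, sizes, and the rendering of emphasis.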
As HTML began its development, in the very early 1990s, the only existing web browsers were entirely text-based, and thus limiting HTML to controlling document structure rather than presentation made sense. With the introduction of the first web browser capable of displaying inline images, Mosaic, in 1993, things became much more complex; suddenly browsers were able to manage a much wider and more idiosyncratic range of tags and to interpret them much more loosely, resulting in web pages that would look vastly different, or potentially even be uninterpretable, on different browsers. (This period led to the introduction of the <blink> tag and other such web design abominations.) In order to rein in the chaos, Dan Connolly produced a draft specification in mid-1994 for what would come to be HTML 2, circulating it within “the Internet community” for discussion, incorporating much of the feedback he received, and finally producing a Document Type Definition for HTML 2.[4.6] Later that year, the W3C was founded in order to provide for the continued community-based management of HTML and its specifications for the broadest possible interoperability.
HTML, however, is a document type specifically meant for use in creating hypertext, and as such does not provide for the full range of kinds of documents a scholar or publisher might want to create.[4.7] HTML’s parent language, SGML, has roots in generic coding techniques for document processing developed in the late 1960s, though SGML as a formal specification wasn’t officially recognized by the International Organization for Standardization until 1986. SGML was developed in order to standardize the markup through which document processing took place, allowing documents to be shared across platforms and ensuring that the markup of digital documents would contain “not only formatting codes interpreted by computer itself, but also descriptive human-legible information about the nature and role of every element in a document” (Darnell). This human legibility, a product of the fact that SGML documents are produced in plain text, is particularly important for ensuring that documents remain accessible, as such plain-text formats “can be edited, read, and inspected on many platforms. This accessibility remains even if the program that created it, or the program that was meant to interpret it, is no longer available (or exists in a radically different and incompatible version)” (Montfort and Wardrip-Fruin).
But this accessibility is also produced through the careful use of a set of tags specified in a Document Type Definition, or DTD, which is a schema that lays out the syntax of a particular class of document; HTML is thus not an independent language, but rather a DTD, or an application of SGML, which specifies the codes that may be used to mark up hypertext web documents.[4.8] SGML, and its more recent and now far more widespread descendant XML (or eXtensible Markup Language), are thus metalanguages, or languages that provide the specifications for the creation of more particular languages, including HTML.[4.9] What makes XML so significant is precisely its extensibility; as a metalanguage, it allows users to create whatever tags or entities their particular applications require, as long as those tags are defined in the application’s schema or DTD. A range of validators for such applications are readily available, both online and in desktop clients, so that coders can confirm that the documents they produce conform to the schema they are employing, and thus that their texts adhere to the standards any interpreter of those documents will expect.
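As a sketch of this extensibility (the element names here are invented purely for illustration), a project might declare its own minimal document type and then validate its documents against that declaration:

```xml
<!-- A hypothetical vocabulary defined in an internal DTD; a
     validating parser will reject documents whose tags or
     ordering depart from the declared structure. -->
<!DOCTYPE recipe [
  <!ELEMENT recipe (title, step+)>
  <!ELEMENT title  (#PCDATA)>
  <!ELEMENT step   (#PCDATA)>
]>
<recipe>
  <title>Toast</title>
  <step>Slice the bread.</step>
  <step>Toast until golden.</step>
</recipe>
```

The tags mean nothing to XML itself; their meaning lives in the schema and in the humans and applications that interpret it, which is precisely what makes the format so portable.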
For this reason, among others, it’s important for the longevity of web-based projects to use software that adheres to open standards rather than proprietary ones.[4.11] Open standards, such as those supported by the W3C, should of course not be confused with open source software, which is a means of software distribution that allows users certain kinds of access to and interactions with its source code. One author has compared open standards with the interoperability of the telephone jack: no matter who your carrier is, and no matter who manufactured your handset, plugging one into the other will always produce the same results.[4.12] But the phone system is of course not open source; if it were, users would be able to access, tinker with, and redistribute the system’s underlying architecture, which might produce some interesting results! Nonetheless, open standards and open source software bear some important relations to one another, not least that both are, to some extent, community-supported; both draw upon development communities committed to their sustainability. For this reason, the data structures of an open-source system such as WordPress are likely to remain supported, or at least migratable, well into the future, whereas the closed data structures of proprietary systems such as Blackboard may not.[4.13] Perhaps more important than the openness of its source code, however, is the use of open standards in a system such as WordPress, which can produce XML-based “feeds” of the data it manages, feeds which are then broadly reusable and interoperable with a range of web-based systems. In this sense, the openness of open standards is arguably deeper than that of open source software, as it allows for robust data portability.
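The kind of XML-based feed mentioned above can be sketched as follows; this is a hypothetical fragment in the widely used RSS 2.0 format, with placeholder titles and URLs:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>A Scholarly Blog</title>
    <link>http://example.org/</link>
    <description>Posts exported as an open, machine-readable feed</description>
    <item>
      <title>On Standards</title>
      <link>http://example.org/on-standards</link>
      <pubDate>Mon, 05 Jan 2009 12:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
```

Because the format is openly specified, any feed reader, aggregator, or migration tool can consume this data without needing access to the software that produced it.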
Even the most open publishing systems require clear standards, however, as the chaos of late 1990s HTML might suggest, and the question of who will be responsible for setting those standards, and how those standards will achieve community buy-in, remains. In order to explore how such standards come into being, and how they might come to be commonly accepted within digital scholarship, I want to turn to the Text Encoding Initiative, or TEI. Work on TEI began in 1987, with a meeting at Vassar College; prior to this time a number of separate text digitization and encoding projects were underway at several different institutions, and the scholars involved were looking for ways to manage “the proliferation of systems for representing textual material on computers” (Mylonas and Renear 3). As Lou Burnard, one of TEI’s editors, framed their concerns, “Scholarship has always thrived on serendipity and the ability to protect and pass on our intellectual heritage for re-evaluation in a new context; many at that time suspected (and events have not yet proved them wrong) that longevity and re-usability were not high on the priority lists of software vendors and electronic publishers” (Burnard). A group of 32 scholars thus came together to explore the development of a set of standards to support the exchange and interoperability of the texts they produced. The meeting resulted in what have come to be called the “Poughkeepsie Principles,” a document that would steer the development of guidelines for future text encoding. These principles include a commitment to creating “a standard format for data interchange in humanities research,” to drawing up recommendations for syntax and usage within the format, to the production of a metalanguage for describing text encoding schemas, and to the creation of “sets of coding conventions suited for various applications” (The Preparation of Text Encoding Guidelines). 
The production of these guidelines was to be undertaken by three sponsoring organizations, the Association for Computers in the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics, which together appointed a steering committee for the project, to be led by two editors and contributed to by several working groups focused on specific issues. The first draft of the TEI guidelines (labeled “P1”) was released in June 1990; following an extensive process of revision, the first official version of the guidelines (“P3”) was released in May 1994. In all, well over 100 scholars participated in the production of the TEI Guidelines during the first ten years of the project, which marks TEI as “an exemplary achievement in collaboration, one on a scale fairly rare in the history of the humanities” (Mylonas and Renear 4).
Such a large-scale enterprise required careful and committed management, however, particularly in order to survive beyond its early stages. In 1999, two of the principal institutions involved in the TEI project, the University of Virginia and the University of Bergen in Norway, submitted a proposal to the TEI executive committee for the formation of a membership-oriented parent body, which became incorporated in late 2000 as the TEI Consortium. The goal of the Consortium was two-fold: first, “to maintain a permanent home for the TEI as a democratically constituted, academically and economically independent, self-sustaining, non-profit organization,” and second, “to foster a broad-based user community with sustained involvement in the future development and widespread use of the TEI Guidelines” (TEI: History). Since the founding of the Consortium, TEI has undergone some significant transformations. The first drafts of the guidelines were SGML-based; beginning with P4, TEI was entirely revised to be fully XML-compliant. The guidelines have since been further revised to version P5, and the consortium has also produced some TEI customizations, including TEI Lite, a streamlined version of the tagset that is sufficient to support the vast majority of users, and a number of TEI-oriented tools, including Roma, which allows users to create customized validators for their particular applications. TEI is widely in use throughout digital humanities publishing projects (see TEI: Projects Using the TEI for an extensive listing) and has generally become accepted as a community-driven standard for text encoding, included as part of the “best practices” embraced by groups including the MLA’s Committee on Scholarly Editions and the National Endowment for the Humanities. 
Even more, “techniques pioneered by the Text Encoding Initiative have been taken up into wider development of technical and engineering standards supporting networked communication” (Mylonas and Renear); in fact, methods used by the TEI were incorporated into the development of XML itself.[4.14]
Descriptive rather than procedural, demarcating logical structure rather than visual presentation, and thus both hardware- and software-independent, TEI’s “lasting achievement,” as Lou Burnard has pointed out, is “not in its DTD, but in the creation of the intellectual model underlying it, which can continue to inform scholarship as technology changes” (Burnard). That intellectual model, the fundamental understanding of markup as a descriptive act focused on the logical structure of a document rather than its physical appearance, allows TEI to be customized to nearly any use, and allows the texts marked up with TEI to be repurposed in numerous ways, not only for digital and print republication, but also for intensive text-mining and analysis. The current TEI Guidelines fill a manual of over 1,300 pages, containing an “exhaustive tag library” (Lazinger 150) and complete specifications for syntax, but, at least according to one scholar, “the apparent complexity of the TEI is, in a sense, only apparent. For the most part only as much of the TEI as is needed will be used in any particular encoding effort; the TEI vocabulary used will be exactly as complex, but no more complex, than the text being encoded” (Renear 234). This is made possible by TEI’s reliance on the DTD model; every TEI project must begin with the construction of a TEI schema that details the tags and usages available within the project. Every document in the project then becomes an instance of that document type, which it declares in a document type declaration that precedes the text; this declaration provides for the document’s proper validation. The text itself then begins with a header that serves to “describ[e] an encoded work so that the text itself, its source, its encoding, and its revisions are all thoroughly documented,” thus serving as “an electronic analogue to the title page attached to a printed work” (Sperberg-McQueen et al. 17), providing both metadata and instructions for the document’s use.
Because this header information, as well as the rest of the marked-up document, is both human- and machine-readable, and because it is platform-agnostic, capable of being parsed by any number of browsers and other applications, TEI promises a great deal of longevity for the projects encoded with it.
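A minimal TEI P5 document illustrates this header-plus-text structure (a sketch only; the titles and content are placeholders, and a real project would supply a far fuller header and schema):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A Sample Encoded Text</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished sample, for illustration only.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Born-digital; no print source.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <head>Introduction</head>
        <p>The encoded text itself begins here.</p>
      </div>
    </body>
  </text>
</TEI>
```

The <teiHeader> carries the document’s metadata (its source, encoding, and revision history), while the <text> element carries the marked-up work itself; both are plain text, readable by humans and parsable by any XML-aware application.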
That having been said, TEI is not and cannot be a singular solution to all of the preservation issues that will present themselves as digital scholarly publishing moves forward. One of its primary shortcomings has precisely to do with its grounding in text encoding: as the X-Lit Initiative of the Electronic Literature Organization points out, TEI is focused on the digitization of previously printed texts, or the digital formatting of otherwise print-like texts:
Many technical solutions are being developed by humanities computing scholars and information-science researchers to ensure that digital media will have a longer “shelf life.” However, as the shelf metaphor might indicate, these solutions (for example, the Text Encoding Initiative’s TEI schema or the library METS metadata standard) are often currently better suited for print, or print-like, static works that have been digitized than for born-digital artifacts of electronic literature with dynamic, interactive, or networked behaviors and other experimental features. (Liu et al)
Genuinely “born digital” texts, texts that take robust advantage of the multimodal potential of the network, will require other solutions. The TEI may point the way, however, in its reliance on the common, portable standards of XML; new projects like the TEI may need to be developed in order to deal with changing publishing circumstances, but the flexibility of XML and its related languages might provide the basis for such new formats. For instance, the Electronic Literature Organization’s Preservation/Archiving/Dissemination project-in-development, X-Lit, which “involves developing a rich representation for electronic literature” regardless of the original format of that literature, will be an application of the XML standard, allowing “the representation of media elements (including text, graphics, sound, and video) as well as a description of the interactive and computational workings of an e-lit piece. The standard will also provide a way to document the physical setup and material aspects of an e-lit work,” thus ensuring that such texts “will be human-readable and machine-playable long into the future” (Montfort and Wardrip-Fruin). Like the TEI, perhaps the most significant aspect of X-Lit’s potential is its community-driven basis: first, its grounding in the work of a professional organization with a common if complex set of concerns for the preservation of digital work, and second, its adherence to an open standard, one that will no doubt change in the future but that has a broad enough user base to ensure reverse compatibility for any such changes. As more publishers and publishing centers produce growing numbers and kinds of digital texts across the academy, such issues of community support for the standards they employ will become increasingly important for securing the future of those texts.
The <i> tag, by contrast, does specify italics; its survival points to the fact that the separation between structure and presentation became increasingly difficult to manage in the early days of HTML, resulting in the development, in 1996, of Cascading Style Sheets, or CSS, which allows web designers to specify how particular HTML tags should look when rendered in a browser.
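A style sheet restores the separation by attaching presentation to structural tags in one place, apart from the documents themselves; for instance (the values here are illustrative only):

```css
/* Presentation declared once, separately from document structure */
h1 {
  font-family: Georgia, serif;
  font-size: 2em;
}
em {
  font-style: italic; /* how emphasized text should render in this design */
}
```

Changing the design then means editing the style sheet, not re-marking every document.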
For example, an XHTML 1.0 Transitional page begins with the declaration <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">, which indicates the specific DTD to which the page claims adherence.