[4.1] See Nicholson Baker, Double Fold. As Baker addresses, and as I’ll go on to discuss later in this section, the primary way in which the assumed permanence of print is being challenged today is through the deaccessioning practices of libraries.
[4.2] See, however, Terry Harpold on the shortcomings of emulators, as well as the difficulties faced in their production: “Writing software that duplicates the myriad interactions of hardware and software is an exceedingly difficult task, and emulators are often buggy and incomplete in their support of the systems they reproduce. Many are hobbyist projects created by enthusiasts of programs designed for an obsolete system, most often, games; they may be less interested in reproducing the complete behavior of the OS than in supporting those features needed by their favorite programs. Emulation projects usually lack the support of – or are actively opposed by – the publishers of emulated systems, who wish to maintain control over their intellectual property even when it is no longer in use” (5).
[4.3] Though the scandal over Amazon’s removal of legally purchased copies of two of George Orwell’s novels from users’ Kindles has only recently brought the issue to widespread attention, Clifford Lynch raised questions about this very concern with respect to e-books back in 2001, asking libraries to consider whether their purchases result in ownership of “objects or access” (Lynch). This question is even more pressing in the area of digital journals, particularly considering the bundling practices and astronomically inflated subscription costs of many commercial journal publishers. In the era of print journals, when a library cancelled a subscription (or when a journal ceased publishing), the library maintained ownership of the issues released during the subscription period. Whether that will continue to be true in the digital era (whether, for instance, libraries have the right to create backup archives of digital journals to which they subscribe) remains an issue still being negotiated. I’ll return to this question later in the chapter.
[4.4] See “About W3C.” It’s of course worth noting that the W3C’s management of HTML and the standards that it focuses on are far from uncontroversial; see Baron.
[4.5] That HTML also provides the <i> tag, which does specify italics, points to the fact that the separation between structure and presentation became increasingly difficult to manage in the early days of HTML, resulting in the development, in 1996, of Cascading Style Sheets, or CSS, which allows web designers to specify how particular HTML tags should look when rendered in a browser.
[4.6] See Dave Raggett’s brief history of HTML. That there could conceivably be a thing referred to as an “Internet community” only indicates how early in the Internet’s spread these developments took place; 1994 seems recent in many ways, but in Internet time, it’s positively paleolithic.
[4.7] Problems with HTML as a coding language include, as Steven DeRose notes, a fixed, non-customizable tagset that prevents users from creating many of the kinds of documents they need; also, despite being theoretically focused on structure, as a descendant of SGML, HTML was in its first decade subject to a kind of format-creep, becoming treated as more akin to word-processing software than true document markup. Worst, perhaps, is that despite the interventions of the W3C in its attempts to establish valid HTML markup, most browsers will attempt to interpret any code a document contains, meaning that “[i]n effect, there is almost no erroneous HTML,” and therefore no impetus for users to conform to the standards meant to provide document longevity (DeRose 12-13).
[4.8] Thus, before the header of most HTML pages, you will find a tag something like <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-, which indicates the specific DTD to which the page claims adherence.
[4.9] XML is often referred to as a subset of SGML, developed in order to streamline and simplify the unwieldiness of SGML’s specification.
[4.10] Bob Sutor draws an important distinction between de facto standards and community standards; Microsoft Word’s “doc” filetype is an example of the former, and the struggles of many users to find alternate means of working with such filetypes are evidence of the ways one standard’s lock on a particular market might not reflect the best interests or practices of a community.
[4.11] Of course, not all electronic texts are produced for the web; the discussion in this chapter is admittedly limited in that regard, but as the example of StorySpace-created hypertexts might indicate, the basic issues with respect to the openness of standards are nonetheless applicable to non-web texts as well.
[4.12] See Sutor. It’s worth noting, of course, that this set of standards was only forcibly opened as a result of the breakup of the AT&T monopoly, which likewise opened the telephone lines to the transmission of non-voice data.
[4.13] Ironically, perhaps, in June 2009 Blackboard issued a promise to its customers to adhere more closely to open standards; see Young, “Blackboard.”
[4.14] See Bosak, however, on the up- and down-side of such acceptance: “this group, which has spent the last ten years urging a certain way of presenting information, suddenly is about to find itself completely successful in a movement driven by people who haven’t the faintest idea that you exist. You are in fact driving a revolution that doesn’t know that you’re here” (Bosak 199).
[4.15] Thanks to Barbara Hui and George Williams for this observation, which they shared with me via Twitter on 22 July 2009. That I don’t have an appropriate framework for citing their contributions represents a failure of metadata that’s much to the point; Twitter appears to be ephemeral, and so doesn’t provide means of preservation via persistent archiving or linking, or means of citation. (Which is to say that I could include URLs for the individual posts involved, but those URLs will cease to work in very short order.) This instability becomes a problem as the service trends away from ephemeral status updates and toward the more substantive conversations that are taking place within it, which suggests the ways that metadata requirements change over time. See below.
[4.16] See Cory Doctorow, “Metacrap,” for a discussion dating from 2001 of the reasons metadata usage often breaks down online, including that “People lie.”
[4.17] See Dublin Core Metadata Initiative and “Open Archives Initiative Protocol for Metadata Harvesting.”
[4.18] For more on the ways Google works and some of the problems it poses for the organization of knowledge, see Grimmelmann, “The Google Dilemma.”
[4.19] Thanks to Amanda French for this observation, which, as with note 4.15 above, was provided via Twitter on 22 July 2009.
[4.20] It’s shocking to remember that, not so very long ago, our library cataloging systems didn’t provide us with this crucial bit of information. Not knowing whether a text is actually available in my library before I walk over there is unthinkable to me today, suggesting the extent to which the kinds of information we consider crucial in our metadata change over time.
[4.21] See McCown et al. See also Koehler for a longitudinal study that suggests both that link degradation stabilizes after an initial, precipitous drop, and that links to different kinds of web objects degrade at different rates.
[4.22] See the seventh edition of the MLA Handbook: “Inclusion of URLs has proved to have limited value, however, for they often change, can be specific to a subscriber or a session of use, and can be so long and complex that typing them into a browser is cumbersome and prone to transcription errors. Readers are now more likely to find resources on the Web by searching for titles and authors’ names than by typing URLs” (182). Note, of course, that the assumption is that a reader wanting to find a cited resource would need to transcribe that URL rather than simply clicking on a link; the default assumption in this handbook is still that the citation itself will appear in print.
[4.23] Other forms of identifying digital objects by name rather than location exist, including URNs (or Uniform Resource Names); URLs and URNs are both subsets of the larger category of URIs, or Uniform Resource Identifiers. Technically, the W3C has deprecated the term URL in favor of URI, but popularly, the location-based term remains the norm, as it is location through which web browsers address the object.
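The distinction between naming and locating an object can be illustrated with a minimal Python sketch. It assumes, as a deliberate simplification, that the scheme at the start of an identifier is enough to tell a URN from a URL; the function name `classify_uri` and the sample identifiers are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlparse


def classify_uri(uri: str) -> str:
    """Label an identifier as a URN, a URL, or something else, by scheme alone.

    Assumption for illustration: only the "urn" scheme marks a name, and only
    a few familiar schemes (http, https, ftp) are treated as locations.
    """
    scheme = urlparse(uri).scheme.lower()
    if scheme == "urn":
        return "URN"  # identifies the object by name
    if scheme in {"http", "https", "ftp"}:
        return "URL"  # identifies the object by where it lives
    return "other"  # some other kind of URI, or not a full URI at all


if __name__ == "__main__":
    for sample in ("urn:isbn:0451450523", "http://www.w3.org/"):
        print(sample, "is a", classify_uri(sample))
```

Both samples are URIs; the sketch only separates the one that names its object from the one that points to a location.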
[4.24] See Handle System, “Quick Facts.”
[4.25] It should be noted that the International DOI Foundation has announced its plans to move toward an economic model based on fees paid by registration agencies, who may in turn charge publishers wishing to register DOIs. See The DOI Handbook 78.
[4.26] See Rosenblatt, “The Digital Object Identifier.”
[4.27] See CrossRef.org, “Fast Facts.”
[4.28] That said, the most common reason most people need backups does not originate with hard disk failure but rather with human intervention: the accidental deletion of the wrong file, the theft of a laptop, or what have you.
[4.29] The continued viability of service providers also presents a potential crisis for the locator issue discussed in the last section; a range of URL-shortening services have come into vogue in recent days, and the failure of one such service, tr.im, at least temporarily meant that links using such shortened URLs would not resolve.
[4.30] See, for instance, Manoff: “Access and preservation, two key historical functions of academic and research libraries, are more difficult to reconcile in a digital environment” (2).
[4.31] See Thibodeau: “In addition to identifying and retrieving the digital components, it is necessary to process them correctly. To access any digital document, stored bit sequences must be interpreted as logical objects and presented as conceptual objects. So digital preservation is not a simple process of preserving physical objects but one of preserving the ability to reproduce the objects. The process of digital preservation, then, is inseparable from accessing the object. You cannot prove that you have preserved the object until you have re-created it in some form that is appropriate for human use or for computer system applications” (Thibodeau). See also Don Waters: “User access in some form is needed in any case for an archive to certify that its content is viable” (“Good Archives” 87).
[4.32] Questions have been raised, for the obvious reasons, about the sustainability of a system that does not require participation in order to receive its benefits (see, for instance, Morrow et al. 17). CLOCKSS, however, believes that it will be able to reduce fees at the end of five years, once an endowment has been raised (see “CLOCKSS FAQ”).
[4.33] The JISC report mentioned in the following section describes the benefits and drawbacks of each of these philosophies as follows: “The advantages of source file preservation [as used by Portico] is that it is very complete (and likely to include more content than appears in the journal); is received directly from the publisher and is frequently delivered or converted to a few normalized formats facilitating long-term preservation. The disadvantages are that it requires a large upfront investment; there is no assurance that the archive will actually be needed; and the presentation will almost certainly differ from that of the publisher. The advantages of harvesting presentation files (rendition archiving) [the LOCKSS approach] are that it is possible to retain the look and feel of the publication and initial costs are likely to be lower. The disadvantages of this technique are that it may be more difficult to preserve the content over time (for example, a strategy for the large scale migration of presentation files from one format to another is still untested)” (Morrow et al. 9).
[4.34] See Morrow et al. 16-18.
[4.35] Portico is moving toward the preservation of e-book holdings, with hundreds of titles (primarily published by Elsevier and Walter de Gruyter) listed as “queued” on their website.
[4.36] See BRTF 8, 20.