Publishing, Technology, and the Future of the Academy

metadata

One thing that’s especially important about both TEI and the X-Lit project is that, through their markup, they aim to preserve not just the content of the texts they encode but also enough information about that content that the experience of using those texts might be recreated in the future. This metadata might include, in the case of TEI, information about the authorship, the publication history, the provenance, the structure, or the format of the text being encoded; in the case of X-Lit, it might include information about the hardware and software environment within which the text was composed and that it requires to run. In each case, it might also include the bibliographic information that allows a text to be catalogued, searched, and cited by future scholars. Given the proliferation of digital texts, it’s become increasingly clear that we need much more robust and extensible metadata than we have ever had before; not only do we need better ways of organizing and finding materials today, but we need to allow for the different means of storage and retrieval that will no doubt develop in the future. As Christine Borgman has argued, access is not simply a matter of a document being available, but it rather crucially “depends on the ability to discover and retrieve documents of interest, and then follow a trail through the scholarly record” (88). That trail is built of metadata.

As the previous paragraph suggests, there are many different kinds of metadata, some of which provide information about a text’s production context, some of which provide information about the particular form in which a text appears, and some of which provide information about what has been done with the text since its production. Metadata can thus provide a map of sorts to a large set of data, enabling a user to find patterns that make sense of the data, or to find her way to the particular pieces of data she needs. In this sense, while much of what goes into a document’s metadata is objectively verifiable information, the production of the set of metadata, like the production of any map, is always an interpretive act, indicating what the map-maker has found to be significant about the terrain.[4.15] One of the problems that metadata poses for the future of digital publishing lies precisely in the difficulty of making maps of future terrain; we never have enough information at present about what will be important in the future, and this truism is particularly applicable to technological developments. We need therefore to develop structures for organizing information, and metadata to describe those structures, that will remain flexible and extensible into the future.

In thinking through the issues surrounding our uses of metadata in digital publishing, I’m mostly concerned with the sorts of citational metadata used by scholars in order to record, maintain, and communicate findable references to the texts they use. This form of metadata falls under the category of the bibliographic, including information about the document itself, about its production, and about where the document is stored, such that searching a digital database will produce results about the document, as well as links or other information that allow the document to be retrieved. One might think that such organizational systems have been made unnecessary by the development of the search engine: now that we can search our documents for whatever information we like, why would we need to impose such systems upon them? One reason is that all search engines rely on metadata in some form; full-text searching of the vast quantity of information now available to us is unwieldy at best, and thus most search engines rely upon the existence of information about the information they’re searching. The question is rather what metadata search engines are using. This returns us to a point that I made in discussing the issues surrounding filtering systems in chapter 1: any such filtering system is only as good as its algorithm, and we know surprisingly little about the algorithms used by most search engines. And what we do know doesn’t exactly inspire confidence. Between the mid-1990s and the mid-2000s, as Christine Borgman has pointed out, most search engines tended to ignore user-created metadata such as keywords embedded in HTML-encoded webpages, “despite the massive investments of libraries and publishers in describing the contents and subject matter of scholarly books and journals” (90), because such metadata, in the early days of the web, was subject to extreme abuse. Website producers often loaded the <meta name="keywords"> tags of their HTML headers with redundant and misleading keyword information in order to drive search engines to return links to their pages regardless of the search’s actual object. This was the metadata version of spam, which often loaded search results pages with links to porn sites, and it led, by about 1997, to the tag being almost entirely deprecated.[4.16] With the advent of more trusted systems, such as the Dublin Core Metadata Initiative, which provides community-derived standards for metadata terms, and the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which allows data providers to make their metadata available to various web services, publishers are increasingly able to provide search engines with metadata worth relying upon.[4.17]
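
To make the idea of publisher-supplied descriptive metadata concrete, here is a minimal sketch in Python that builds a record from a handful of Dublin Core elements (title, creator, publisher, date, subject). The element names and the namespace URI belong to the Dublin Core element set; the wrapper element `record`, the helper name `dublin_core_record`, and every field value are hypothetical placeholders chosen for illustration. A real OAI-PMH response would wrap such a record in additional protocol-specific XML that is not shown here.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a simple XML record from a dict of Dublin Core fields.

    `fields` maps an element name (e.g. "title") to a list of values,
    since Dublin Core elements may be repeated.
    The wrapper element name "record" is an assumption for this sketch.
    """
    root = ET.Element("record")
    for name, values in fields.items():
        for value in values:
            element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            element.text = value
    return root


# Hypothetical placeholder values, not a real publication.
record = dublin_core_record({
    "title": ["An Example Monograph"],
    "creator": ["Example Author"],
    "publisher": ["Example University Press"],
    "date": ["2009"],
    "subject": ["metadata", "digital preservation"],
})

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The point of the sketch is only that the objectively determinable fields (author, title, publisher, date) and the more interpretive ones (subject terms) sit side by side in the same record, in a shared vocabulary that a harvesting service can read.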

Again, though, what metadata search engines actually rely upon remains an open question. Google, for instance, depends more heavily on the ways other pages link to a text in determining its search results than it does on the actual content of that text. Most famous is its “PageRank” system, which analyzes links to particular web pages as a means of determining the “importance” of any given page on the web; the more inbound links to a particular page, the higher its PageRank, and the more inbound links to the pages that link to that particular page, the greater weight given to their links in determining the importance of the original page. Links, in other words, are treated as votes, and some votes carry greater weight than others. The result is that Google’s algorithm is heavily determined by popularity. Given the mushiness of popularity as an arbiter of relevance, particularly within scholarly work (not to mention its potential for manipulation, as can be seen in the rash of “Googlebombing” that swept the internet during the early 2000s), we might do well to be cautious about overreliance on the search engine as our primary means of finding the texts we need.[4.18] This is true even when the subset of what’s being searched is specifically scholarly material. Google Scholar remains a problematic research resource for two reasons. The first is the uncertainty surrounding the sources that it indexes: Google does not publish a list of the journals or databases that Google Scholar crawls, though its coverage is undeniably skewed toward the hard sciences. The second is that it similarly uses citation analysis as one means of determining relevance. In other words, both Google and Google Scholar are already relying upon metadata in producing their search results; it’s just not the kind of metadata that we might be most interested in, or that might produce the best results.
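
The "links as weighted votes" idea can be illustrated with a simplified sketch in Python. This is a textbook-style power iteration and not a description of Google's production system, whose details are not public. The damping factor of 0.85, the iteration count, the handling of pages with no outbound links, and the tiny example graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate link-based importance scores by repeated redistribution.

    `links` maps each page to the list of pages it links to.
    `damping` and `iterations` are conventional illustrative values.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its "vote" among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its vote evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: "x" has one inbound link, from the well-linked "hub";
# "y" has two inbound links, from pages nobody links to.
example_links = {
    "a": ["hub"], "b": ["hub"], "c": ["hub"],
    "hub": ["x"],
    "p": ["y"], "q": ["y"],
}

ranks = pagerank(example_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph, "x" ends up ranked above "y" even though it has fewer inbound links, because its single link comes from a page that is itself heavily linked. That is the sense in which some votes carry greater weight than others, and also why the measure tracks popularity rather than relevance.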

As the archives of our scholarship are increasingly stored in digital formats, and as those archives are increasingly accessed through search engines that interact with the metadata we use to describe the texts they contain, it becomes much more important for us to develop trustworthy metadata that enables us to classify our digital texts reliably, giving us confidence that the right texts, and not just the most popular texts, will surface when we search for them. These modes of classification may not have much in common with the hierarchical, ontological systems long in use, however. As Clay Shirky has argued, traditional ontologies such as library classification systems work best when the corpus they describe is limited and the producers and users of the ontology are a coordinated group of experts; we can trust that new books entered into a library’s cataloging system will be correctly classified because of the finite nature of the data the system organizes and the expertise of those doing the organizing. Such ontologies, Shirky argues, work much less well when the corpus is large, unstable, or blurrily defined, and when the users are a dispersed group of amateurs. The latter situation defines much of the work produced on the internet, which is increasingly user-generated and -published, and it will increasingly come to define our scholarly publishing systems, as our digital networks decentralize them, moving them outside traditional institutional and disciplinary frameworks. Because we cannot define in advance the ways that users will use or want to access the texts we produce (we can neither know the future nor account for the multiplicity of user perspectives), we need to supplement our expert-produced ontologies with user-generated tagging.

I say “supplement” rather than “replace,” because, contra Shirky, certain kinds of expert knowledge will of necessity continue to govern the systems through which scholarly knowledge is organized. Some of the metadata we need to describe our texts, after all, can be objectively determined (author name, title of text, publisher, date), and some of it is less so. Certain kinds of expert classifications or subject headings will no doubt still be useful to us, even though the “keywords” that apply to a text might differ from user to user, as readers differ in their senses of a text’s important aspects. We can and should thus authoritatively produce certain kinds of metadata, but other kinds cannot be so centralized. For this reason our metadata needs to be not simply extensible but also customizable, drawing upon the best of expert production and what is in current web parlance referred to as “crowdsourced” information, so that we can account for the ways that users actually interact with texts in thinking about the classification systems of the future.[4.19] As an example, we might look at the ways that many online library catalogs are beginning to employ not just traditional modes of classification such as Library of Congress subject headings, but also some form of user tagging. My own institution’s library catalog is linked to LibraryThing, and thus draws in the tags with which actual readers of a given text have categorized it within their own virtual libraries. The current implementation of this link allows users of my library’s online catalog to browse the catalog by clicking on a user tag and finding out which texts users have applied that tag to; as of this writing, however, this tag browser does not allow users to add tags to the library’s catalog, nor does it associate tags with users. These two bits of functionality would result in a far more effective crowdsourced system of metadata generation, by enabling scholars to apply tags to texts, to use those tags in the process of filtering their search results, and to see how other scholars with whom they work have likewise tagged texts.
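
The two missing pieces of functionality, adding tags and associating tags with users, can be sketched as a small data structure in Python. This is a hypothetical illustration and not a description of how LibraryThing or any library catalog is implemented; the class name `TagIndex`, its method names, and the sample users and text identifiers are all invented for the example.

```python
from collections import defaultdict


class TagIndex:
    """A toy index of user-applied tags (hypothetical design)."""

    def __init__(self):
        # tag -> set of (user, text_id) pairs, so every tag stays
        # associated with the person who applied it.
        self._by_tag = defaultdict(set)

    def add_tag(self, user, text_id, tag):
        """Record that `user` applied `tag` to `text_id`."""
        self._by_tag[tag.lower()].add((user, text_id))

    def texts_with_tag(self, tag, users=None):
        """List texts carrying `tag`, optionally only as tagged by `users`."""
        pairs = self._by_tag.get(tag.lower(), set())
        return sorted({
            text_id for user, text_id in pairs
            if users is None or user in users
        })

    def tags_for_text(self, text_id):
        """Show which users applied which tags to `text_id`."""
        result = defaultdict(set)
        for tag, pairs in self._by_tag.items():
            for user, tagged_text in pairs:
                if tagged_text == text_id:
                    result[tag].add(user)
        return {tag: sorted(users) for tag, users in result.items()}


# Hypothetical users and texts.
index = TagIndex()
index.add_tag("reader_a", "text-1", "Metadata")
index.add_tag("reader_b", "text-1", "preservation")
index.add_tag("reader_b", "text-2", "metadata")

print(index.texts_with_tag("metadata"))
print(index.texts_with_tag("metadata", users={"reader_b"}))
print(index.tags_for_text("text-1"))
```

Keeping the user alongside each tag is the design choice that matters here: it is what lets a scholar filter results by the tags of particular colleagues rather than by the undifferentiated crowd.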

For instance, Zotero, an open-source extension for the Firefox web browser, produced by the Center for History and New Media at George Mason University, allows users to “collect, manage, and cite” their research sources, as its home page indicates. Beyond this, however, Zotero takes advantage of the social aspects of network-based research, allowing users to create profiles on the site, to synchronize their libraries between their local machines and the website, to share their libraries with other users and follow those users’ libraries in return, to join groups of scholars working on similar issues, to create collective libraries within those groups, and so on. In this fashion, Zotero users are not only able to maintain detailed metadata for their own research sources, enabling them to quickly produce bibliographies and other citation information within their writing, but they’re also able to see what other scholars are reading. Future plans for the tool include making the service more commons-oriented, which will allow users to “identify others who are working with or annotating the same content, fostering new collaboration opportunities,” as well as the development of a recommendation engine that will suggest new texts based on those the user already has in her library (Zotero Development Roadmap). Through tools such as this one, scholars will be able to help in producing and maintaining the kinds of citation-oriented metadata required in order to find important digital resources.

  • [4.15] Thanks to Barbara Hui and George Williams for this observation, which they shared with me via Twitter on 22 July 2009. That I don’t have an appropriate framework for citing their contributions represents a failure of metadata that’s much to the point; Twitter appears to be ephemeral, and so doesn’t provide means of preservation via persistent archiving or linking, or means of citation. (Which is to say that I could include URLs for the individual posts involved, but those URLs will cease to work in very short order.) This instability becomes a problem as the service trends away from ephemeral status updates and toward the more substantive conversations that are taking place within it, which suggests the ways that metadata requirements change over time. See below.
  • [4.16] See Cory Doctorow, “Metacrap,” for a discussion dating from 2001 of the reasons metadata usage often breaks down online, including that “People lie.”
  • [4.17] See Dublin Core Metadata Initiative and “Open Archives Initiative Protocol for Metadata Harvesting.”
  • [4.18] For more on the ways Google works and some of the problems it poses for the organization of knowledge, see Grimmelmann, “The Google Dilemma.”
  • [4.19] Thanks to Amanda French for this observation, which, as with note 4.15 above, was provided via Twitter on 22 July 2009.

    Source: http://mcpress.media-commons.org/plannedobsolescence/four-preservation/metadata/