This paper is the result of an ongoing collaboration: its intellectual content is shared between the authors, which has led us to alternate first author status in all of our publications. Witmore designed the experiments and generated the data, providing the initial readings of the results on his blog, http://www.winedarksea.org. Hope drafted the paper, provided extended readings of passages, and introduced the linguistic context of its discussions. Both authors are responsible for the conclusion. We want to thank Kate Fedewa for her assistance in preparing the image file for the large diagram appearing at the end of this essay.
All references to the works of Shakespeare are taken from Open Source Shakespeare, http://www.opensourceshakespeare.org, which has established line numbers for the widely available electronic Moby Shakespeare, itself based on the 1864 Globe Edition of the plays, edited by Clark and Wright. We use the electronic text files of the Moby Shakespeare, with certain editorial preparations (removing speech prefixes, act and scene labels, stage directions), for the iterative analyses described throughout the paper. References are to act and scene, followed by a slash, then through-line number.
For extended considerations of Shakespeare’s letters, see Lynne Magnusson, Shakespeare and Social Dialogue: Dramatic Language and Elizabethan Letters (Cambridge: Cambridge University Press, 1999), and Alan Stewart, Shakespeare’s Letters (Oxford: Oxford University Press, 2008).
We use the terms “apparently” and “seemingly” here because it is possible to argue that, even in Shakespeare’s other plays, Falstaff never fully succeeds in his linguistic reality-building. His linguistic fantasies are generally understood as such by at least one character (Hal throughout the Henry IV plays, for example) even as they are apparently successful. It is not so much that he effects a change in reality with his words as that the plays connive at, or patronize, his fictions. At several emotionally searing points his fantasies are laid open, and Falstaff is revealed as the one desperately trying to maintain them: Hal’s coronation procession, his account of Gad’s Hill, his lies about the killing of Hotspur. Falstaff does not convince significant characters with his rhetoric: rather, the lies are so apparent, or irrelevant, that they humor him. To this extent, Merry Wives, with Falstaff out-thought by Page and Ford, and ultimately bested by the whole of Windsor society, replays, more explicitly and more unequivocally, the patterns of the other plays, none of which offers Falstaff anything other than a temporary, and unstable, victory.
In Renaissance thought, writing is always an artificial technology (desirable and useful because it fixes man’s transient words) but, as the commonly made distinction between “words” (spoken) and “letters” (written) suggests, not conceived of as part of language itself. The Aristotelian formulation, repeated by almost all at the time, held that language (words) represented ideas (mental images). Writing, if it was mentioned at all, featured as a mere representation of words (see Hope, forthcoming, chapter 1, “Ideas about language in the Renaissance”).
We recognize Stephen Ramsay’s “algorithmic criticism” in the genealogy of our own thinking on these matters, on which see “Toward an Algorithmic Criticism.” Literary and Linguistic Computing 18: 167–74 and “Algorithmic Criticism” in The Blackwell Companion to Digital Literary Studies, eds. Susan Schreibman and Ray Siemens (Oxford: Blackwell, 2008), http://www.digitalhumanities.org/companionDLS/, accessed 2 March 2010. We like the word iterative because it links the nature of comparisons (which are arbitrary and repeated) to conditions of textuality, whose material supports always imply the possibility of circulation.
The classic statement of this view of iterability as the sine qua non of textuality is Jacques Derrida’s “Signature, Event, Context,” which can be found in Limited Inc (Evanston: Northwestern University Press, 1998), 1-24.
For Docuscope, see David Kaufer, Suguru Ishizaki, Brian Butler, Jeff Collins, The Power of Words: Unveiling the Speaker and Writer’s Hidden Craft (Lawrence Erlbaum Associates: New Jersey and London, 2004). A fascinating discussion of how the program came to be designed and an early précis of its categories can be found at: http://www.betterwriting.net/projects/fed01/dsc_fed01.html, accessed 3 March, 2010. See also the Appendix: Docuscope’s Architecture of Strings.
We discuss digitally-based research as a prosthetic more fully in Jonathan Hope and Michael Witmore, “The Very Large Textual Object: A Prosthetic Reading of Shakespeare,” Early Modern Literary Studies 9.3 (January, 2004): 6.1-36 and give an example of such research confirming the generic claims of more traditional research in Witmore and Hope, “Shakespeare by the Numbers: On the Linguistic Texture of the Late Plays” in Early Modern Tragicomedy, eds. Subha Mukherji and Raphael Lyne (London: Boydell and Brewer, 2007), 133-53. In the 2007 article we show that the plays identified by traditional criticism on chronological and thematic grounds as “late plays” or “tragicomedies” do indeed form a coherent linguistic group.
There are other techniques that could have been used to explore the variation in these data (one that has been employed recently by text analysts is Latent Dirichlet Allocation), but we have chosen PCA for two reasons. First, it is a frequently used procedure in statistics, which means its properties are well known. Second, it provides groupings of the plays that are often perfectly recognizable in terms of existing literary critical categories and discriminations: because we can work all the way from a component down to the sentence level where its significant elements can be observed, we have not felt the need to engage in more sophisticated statistical techniques that might produce “better grouping” but not be as easily tracked to ground-level language effects and strategies.
The fact that we work on 1000-word chunks of plays rather than whole plays is likely to strike readers as strange and arbitrary. The reasons for this are statistical. The most significant reason is that working with chunks of plays means that we identify features which are consistently used across the whole text of plays: features used at a very high rate at just one point of a play will affect the score for just one or two chunks, and will appear as outliers in a statistical analysis (of course, for some types of literary reading, we might be interested precisely in features which occur at a high rate at one point of a play, but this is to shift back to traditional reading rather than digitally-based research). A second reason is that the mathematics of the statistics demands that populations be made up of more items than are being counted for: since in this case we are counting for 99 linguistic categories (after dropping categories that received all zero scores), we need a population size greater than the 36 plays of the First Folio. “Chunking,” as this procedure is known, is a recognized and acceptable way to deal with this problem. We should note that the chunking method we used for these tests cut the plays into 1000-word units starting from the first word. In each case, therefore, a section at the end of the play was discarded as not being 1000 words long. Since all these discarded sections included the end of a play, we have introduced a non-random element into our analysis. In future tests, we will avoid this problem by evenly spacing our 1000-word segments throughout the body of the text, in effect distributing the “remainder” between these segments.
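The two chunking schemes described in this note can be sketched as follows. This is only an illustration in Python, not the procedure actually used for these tests: the function names, the assumption that a play arrives as a list of words, and the particular way the remainder is spread (as gaps before, between and after the segments) are all choices made for the example.

```python
def chunk_from_start(words, size=1000):
    """Cut consecutive `size`-word chunks from the first word; drop the short tail."""
    n_chunks = len(words) // size
    return [words[i * size:(i + 1) * size] for i in range(n_chunks)]


def chunk_evenly_spaced(words, size=1000):
    """Take the same number of chunks, but spread the leftover words as gaps
    around the chunks instead of discarding them all at the end of the play.
    (How the gaps are placed is an assumption of this sketch.)"""
    n_chunks = len(words) // size
    if n_chunks == 0:
        return []
    remainder = len(words) - n_chunks * size
    n_gaps = n_chunks + 1  # one gap before the first chunk, one after each chunk
    chunks = []
    for i in range(n_chunks):
        # Total number of skipped words that precede chunk i (rounded down).
        offset = (remainder * (i + 1)) // n_gaps
        start = i * size + offset
        chunks.append(words[start:start + size])
    return chunks


# Hypothetical 2500-"word" play: the tokens are just their own positions.
play = [str(i) for i in range(2500)]
print(len(chunk_from_start(play)), len(chunk_evenly_spaced(play)))
```

Both functions return the same number of full-length chunks; the second simply stops the discarded words from always falling at the end of the play.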
In order to indicate the interpretive nature of the definitions of genre in this paper, we capitalize the first letters of Comedy, History, Tragedy and Late Plays when we want to refer to those linguistic features specific to these genres as specified by the First Folio editors and the subsequent editors who called out the so-called “late plays” as their own category, which for us are The Tempest, Cymbeline, The Winter’s Tale, and Henry VIII. (We follow the Folio editors’ designations of all the plays except those so designated as Late.) The language of “Comedy,” when it is referred to in this essay, is thus not the language of all comic writing tout court, but rather “comedy” as stipulated by the Folio editors (minus The Tempest, which was subsequently classified as Late).
The names of the LATs do not contain spaces. This is a requirement imposed by programs that will not tolerate absent or non-existent characters and is thus part of the odd ontology of names in the digital domain.
Early in our work, for example, we considered revising the Docuscope string definitions and assignments, and higher-level structure, to address them explicitly to Early Modern English, since the program was developed for use on Present-day English. Although this remains an option for the future, we decided against this, largely for practical reasons (the initial construction of Docuscope took Kaufer almost a decade, with almost as much prior thinking and research: he might be justly referred to as the “Samuel Johnson of strings”). In practice, because Kaufer did much of his string-definition using the OED as a template, Docuscope deals with Early Modern English reasonably well (forms such as “thou” are included, for example). This too is an example of a difference between traditional literary research, which tends rightly to be highly punctilious about choice of text, and digitally-based research, where the volumes of data involved tend to make new preparation processes time-intensive, but also mean that low-level “errors” do not markedly affect the final results. We have obtained solid results using the Moby Shakespeare, and have begun working with files from the EEBO TCP database, with the assistance of Martin Mueller, who has modernized enough Renaissance Drama texts for us to begin studying the full corpus of digitized drama from the mid-sixteenth to the mid-seventeenth centuries (see conclusion to this paper). Some of our future techniques are sensitive to tiny variations in infrequent items, and in these cases, choice and quality of edited text may be more important. Indeed, understanding the role of “small dashes” of certain types of words in populations of digitized texts will be one of the subjects of our future research.
See, for example, P. J. Schwanenflugel and C. R. White, “The Influence of Paragraph Information on the Processing of Upcoming Words,” Reading Research Quarterly 26, 160-77.
PCA was performed on the correlation matrix, which means that results are scaled and centered. Fluctuations among measured variables in which there is a lot of activity (for example, “Description” strings) do not therefore overwhelm parallel fluctuations in variables where there are relatively fewer items being counted. If we were tracking rocking patterns among boats in a bay, we would thus be able to see waves of movement passing across both small and large vessels (variables).
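The effect of working from the correlation matrix rather than the covariance matrix can be seen in a small sketch. Everything here is illustrative: the counts are invented, the function names are our own, and power iteration is used only as a simple stand-in for the full eigendecomposition a statistics package would perform.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Center each column on its mean and scale it to unit standard deviation.
    Assumes no column is constant (all-zero categories are dropped first)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def covariance_matrix(rows):
    """Population covariance between every pair of columns."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    n, p = len(rows), len(cols)
    return [[sum((r[i] - mus[i]) * (r[j] - mus[j]) for r in rows) / n
             for j in range(p)] for i in range(p)]


def correlation_matrix(rows):
    """The correlation matrix is the covariance matrix of the standardized data."""
    return covariance_matrix(standardize(rows))


def first_component(matrix, iterations=500):
    """Approximate the leading eigenvector by power iteration.
    The fixed starting vector is a simplification that suits this toy example."""
    p = len(matrix)
    v = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Invented counts for four chunks: a busy category and a sparse one.
counts = [[120, 2], [150, 3], [90, 1], [180, 5]]
pc1_corr = first_component(correlation_matrix(counts))
pc1_cov = first_component(covariance_matrix(counts))
print(pc1_corr)  # the two variables load about equally
print(pc1_cov)   # the busy variable dominates
```

On the covariance matrix the high-count variable swamps the component; on the correlation matrix the two variables, which rise and fall together, contribute equally. This is the sense in which small and large "vessels" can both be seen to rock.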
It should be noted that we have chosen Principal Components 1 and 4 to graph out of a much larger array of components that explain variation in the Shakespearean corpus. PCs 1 and 4 can be shown to do a statistically significant job of separating out Comedies from Histories using something called the Tukey Test, which we performed on all of the components. Not all components separate out all of the groups equally well: they track different underlying patterns, only some of which correspond to critically accepted genre divisions. What else is being tracked by these components remains to be investigated.
This raises the question, what is intrinsic to the text and what is para-textual? Some things, like speech prefixes, are dead giveaways for genre, but of course perfectly legitimate to count. A Google search algorithm looking for the fastest way to identify a text of interest would exploit this kind of linguistic “tell.” But we are not interested in finding the “shortest vector” to a text of interest: we are interested in the subtle patterns that underlie critical perceptions of similarity and dissimilarity. So we take the longer route and exclude speech prefixes in our work. We are not Google and Google does not do criticism.
Examining all of the taggings that Docuscope has made in the Shakespearean Moby corpus, we find the following as the most frequently occurring strings in the LAT DenyDisclaim (in order of descending frequency): “not” “no” “nor” “never” “no more” “No” “none” “nothing” “do not” “cannot” “Not” “is not” “ne’er” “Nor” “not the” “neither” “not to” “and not” “not a” “deny” “And not” “is no” “did not” “no man” “none of”. Longer strings under this category include: “there is not” “it cannot be” “it is not so” “this is not” “cannot choose but”. Note that all variants in punctuation and capitalization must be individually identified for the purposes of counting. We find that it is best to look at “strings” or phrases in context rather than trying to think abstractly about how a given typology of words ought to be functioning, although the typologies have at times proven to be helpful.
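The point about variants can be made concrete with a small counting sketch. It is not a description of Docuscope's own matching, which this note does not specify: the tokenizer, the longest-match-first rule and the function names are assumptions made for illustration, and the dictionary is only a handful of the DenyDisclaim strings listed above.

```python
import re
from collections import Counter

# A few of the DenyDisclaim strings quoted in this note (the full list is far longer).
DENY_DISCLAIM = ["not", "Not", "no", "No", "nor", "never",
                 "no more", "is not", "it cannot be"]


def tokenize(text):
    """Split into words (keeping apostrophes, as in "ne'er") and single punctuation marks."""
    return re.findall(r"[A-Za-z']+|[^\sA-Za-z']", text)


def count_strings(text, strings):
    """Case-sensitive count of dictionary strings, trying the longest match first
    so that "is not" is tallied once rather than also as a separate "not"."""
    patterns = {tuple(tokenize(s)) for s in strings}
    longest = max(len(p) for p in patterns)
    tokens = tokenize(text)
    counts = Counter()
    i = 0
    while i < len(tokens):
        for length in range(min(longest, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + length])
            if candidate in patterns:
                counts[" ".join(candidate)] += 1
                i += length  # skip past the matched string
                break
        else:
            i += 1  # no dictionary string starts at this token
    return counts


# Invented sample sentence, not a quotation from the plays.
sample = "No more of that. It is not so, nor it cannot be. Not I."
print(count_strings(sample, DENY_DISCLAIM))
```

Because matching is case-sensitive, the capitalized “No more” in the sample is not counted as “no more”; only the listed “No” is found. A capitalized variant has to be entered in the dictionary in its own right, which is why the lists above carry both “not” and “Not”, “nor” and “Nor”.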
Examining all of the taggings that Docuscope has made in the Shakespearean Moby corpus, we find the following as the most frequently occurring strings in the LAT RefuteThat (in order of descending frequency): “but” “, but” “. But” “yet” “; but” “But” “, But” “; But” “, yet” “not so” “fight” “and yet,” “rather” “Revenge”. Longer strings include: “It is not” “I will never” “will not let” “for all that”.
The most frequent strings coded under the LAT DirectAddress (in order of descending frequency): “you” “thou” “thy” “your” “thee” “my lord” “. And” “Thou” “You” “Your” “. Come” “should” “must” “Thy” “you are” “My lord” “you have” “of your” “. Go” “, come” “I cannot” “You shall” “. Now” “I must” “that you” “in your”. Longer strings include: “how say you” “let us know” “move you to” “Where you shall”.
Common SelfDisclosure strings in the plays: “I have” “I am” “I would” “I think” “to my” “as I” “soul” “to me”. (Note the semantic origin of soul versus the more transactional nature of the other pronoun verb formulations; Docuscope uses heterogeneous criteria for including words or strings in its categories.) Uncertainty strings: “some” “things” “thing” “Some” “something” “seem” “know not” “doubt” “sometime” “kind of” “others” “know not what” “seeming” “guess” “seems” “stuff”. LangRef strings: “[single quotation mark]” “O” “words” “word” “report” “tale” “title” “subject” “a word” “argument” “subjects” “meaning”.
Common strings classified as “Motions”: “lie” “sleep” “fly” “cry” “draw” “drink” “sing” “walk” “fetch” “shake” “throw” “touch” “stir” “blows” “move” “close” “blow” “carry” “rise”. Common strings under “SenseProperties”: “sweet” “old” “young” “little” “wilt” “long” “light” “sight” “Sit” “sweet” “music” “cold” “sound” “bosom” “hot” “heavy”. Under “SenseObjects”: “well” “hand” “blood” “eyes” “heart” “the king” “bear” “tongue” “head” “eye” “sword” “house” “face” “hands” “bed” “gold”. Under “Inclusion”: “our” “us” “Our” “of our” “together” “we have” “to our” “in our” “ourselves” “that we”. Under “CommonAuthority”: “lord” “God” “Lord” “unto” “lords” “duke” “majesty” “Duke” “royal” “highness” “warrant” “gods” “command” “he that” “sovereign”. Note that “well” in SenseObjects is obviously not always a noun in Shakespeare’s plays; we choose not to alter the classification on the theory that the category works well enough. We may, however, begin to make changes to Docuscope’s dictionaries to capture historical differences in early modern English once we have worked out some ground rules that will limit our inclination to build our own critical expectations into the device. In the end, it may be only a few words that have any real impact on results.
See Susan Snyder, Shakespeare: A Wayward Journey (Newark, Del.: University of Delaware Press, 2002), 29-45 and Susan Snyder, The Comic Matrix of Shakespeare’s Tragedies (Princeton: Princeton University Press, 1979), 70-74. Other critics arguing for the presence of comic elements in the play include Stanley Cavell, Disowning Knowledge in Six Plays of Shakespeare (Cambridge: Cambridge University Press, 1987), pp. 132-33; Michael Bristol, “Charivari and the Comedy of Abjection in Othello,” Renaissance Drama, NS 21 (1990), 3-21; Peter J. Smith, “‘A Good Soft Pillow for that Good White Head’: Othello as Comedy,” Sydney Studies in English 24 (1998-99), 21-39; Robert Hornback, “Emblems of Folly in the First Othello: Renaissance Blackface, Moor’s Coat, and ‘Muckender,’” Comparative Drama 35 (2001), 69-99; and Stephen Orgel, “Othello and the End of Comedy,” Shakespeare Studies 56 (2003), 105-16.
See Barbara Mowat, The Dramaturgy of Shakespeare’s Romances (Athens: University of Georgia Press, 1976), pp. 36, 69 and “‘What’s in a Name?’: Tragicomedy, Romance, or Late Comedy” in A Companion to Shakespeare’s Works: The Poems, Problem Comedies, Late Plays, ed. Richard Dutton and Jean E. Howard, 4 vols (Oxford, 2003), 4:129-49, esp. 134. Mowat credits the adaptation of Wittgensteinian “theory of resemblance” to genre theory to Alastair Fowler’s Kinds of Literature: An Introduction to the Theory of Genres and Modes (Cambridge: Cambridge University Press, 1982).
This vertical integration has been confirmed by experiments performed by Matt Jockers at Stanford, who has simply used the most frequent words in the Shakespearean corpus – which end up being what linguists call “function words” – to produce genre groupings that are remarkably similar to the ones we have produced with Docuscope. See http://www.stanford.edu/~mjockers/cgi-bin/drupal/node/27, accessed 7 March 2010. See also the remarks on the use of function words in author attribution in Stanley Wells, Gary Taylor, John Jowett, William Shakespeare, A Textual Companion (New York: Norton, 1997), 80-89.
A different way of describing the perspective on language taken by a tagging device like Docuscope (different from saying, for example, that the tagging itself occurs in a time without tense) would be to say that Docuscope counts all instances of language as instances of “mentioning” rather than “use.” This is why it will never read for irony, which for our purposes cannot be factored into a linguistic re-description of genre.
Franco Moretti, Maps, Graphs and Trees: Abstract Models for a Literary Theory (New York: Verso Books, 2005); Robin Valenza, “How Literature Becomes Knowledge: A Case Study,” ELH 76.1 (Spring 2009), 215-45; Brad Pasanek, “Mining Millions of Metaphors” (co-authored with D. Sculley), Literary and Linguistic Computing 23.3 (2008), 345-360. See also the work of J. F. Burrows, Computation into Criticism (Oxford: Clarendon Press, 1987) and D. Biber, S. Conrad and R. Reppen, Corpus Linguistics: Investigating Language Structure and Use (Cambridge: Cambridge University Press, 1998). Witmore is engaged in a longitudinal study of Victorian novels using Docuscope with Sarah Allison, Ryan Heuser, Matt Jockers, and Franco Moretti.
See James J. Gibson, “The Theory of Affordances,” in Perceiving, Acting and Knowing, eds. Robert Shaw and John Bransford (Hillsdale, NJ: Lawrence Erlbaum, 1977), 67-82.
On deliberate accidents and early modern notions of experimentation, see Michael Witmore, Culture of Accidents: Unexpected Knowledges in Early Modern England (Stanford: Stanford University Press, 2001).
See, for example, Parker’s exemplary close readings of Othello and Hamlet, which trace a web of semantic and figurative correspondences between the plays and “larger discursive networks” that structure the language of privacy and accusation: Patricia Parker, “Othello and Hamlet: Delation, Spying, and the ‘Secret Place’ of Woman,” in Shakespeare Reread: The Texts in New Contexts, ed. Russ McDonald (Ithaca: Cornell University Press, 1994), ch. 5.
We were lucky enough to get usable text files of these plays from Martin Mueller at Northwestern, who has developed some extremely powerful modernization procedures that have resulted in texts that are just as “countable” as those we studied in the hand-modernized Moby Shakespeare corpus. Mueller has divided these plays up into generic groups using Harbage’s Annals of English Drama, title pages and his own sense of critical practice. We have begun full-scale study of this corpus with Mueller through a joint research project between the University of Wisconsin-Madison, Strathclyde University and Northwestern University.