2. Gloop and the Banality of Digital Reading: Comedy and History
We begin with an analogy based on a popular item of English cuisine: the pudding. Many English puddings feature a goopy matrix in which something more substantial is intermixed, for example a piece of fruit like a plum. In our case, gloop is a useful substance to think with because it is analogous to the linguistic gloop that binds together the more spectacular items—the fabulous turns of phrase or memorable passages—that literary critics are likely to seek out and savor. As readers, we tend to ignore the ubiquitous gloop and prospect for the fruit, which means that we remain largely unaware of a large part of our experience of reading. But if that matrix or gloop can be characterized by a machine, humans can return to the plums with a better sense of just why they taste so sweet. Just as Page and Ford move from the forensic comparison of the identical letters to plotting their revenge on Falstaff, so digitally-based research can provide a jumping-off point and even occasional guidance for traditional human reading.
Figure 1 is a plot of 767 pieces of Shakespeare plays, each one containing 1000 consecutive words from a play (we discuss the reasons for chopping plays up so arbitrarily below). Each piece of text has been subject to rhetorical analysis by Docuscope, whose operations we will discuss in more detail below. The results of this analysis, which comprise frequency counts of just under one hundred linguistic categories, have been put through a complex but very common statistical procedure known as Principal Component Analysis (PCA). The procedure makes comparisons between a large number of features within a population, allowing us to identify patterns of similarity and difference within the population based on correlating the presence and absence of features. Thus, if feature A is found in a group of the population, PCA asks whether feature B is also predictably present or absent. PCA thus attempts to relate differences and absences within a population by making associations between them. These associations are expressed by placing members of the population at value-points along a scaled Principal Component. This procedure is good at making sense of complex relationships within large, complex populations – and, as a very excited statistician told us over lunch one day, Shakespeare’s language is one of the most complex and interesting populations around.
In this instance, the statistical package is making multiple comparisons between the relative frequencies of 99 linguistic categories in the 767 1000-word chunks of Shakespeare. Once it has made these comparisons, it combines the linguistic categories into “Principal Components” of highly positively and negatively correlated features, seeking to construct components that account for as much of the variation within the population as is mathematically possible. Each component is thus an answer to the questions “Are these bits of plays similar to each other?” and “Do the bits of plays form any groups whose members all share, or lack, the same features?”
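The arithmetic behind PCA can be sketched in a few lines. The following toy example (invented counts standing in for Docuscope's frequency tables; none of these numbers come from the plays) shows how centering a feature matrix and extracting its principal axes yields a score per chunk on which like chunks cluster together:

```python
import numpy as np

# Toy frequency matrix: rows are 1000-word text chunks, columns are
# hypothetical linguistic categories (stand-ins for Docuscope's LATs).
counts = np.array([
    [8.0, 1.0, 7.0],   # comedy-like chunk: high on features A and C
    [9.0, 0.0, 8.0],
    [1.0, 7.0, 2.0],   # history-like chunk: high on feature B
    [2.0, 8.0, 1.0],
])

# Center each feature on its mean across the population, as PCA requires.
centered = counts - counts.mean(axis=0)

# The principal axes are the right singular vectors of the centered
# matrix, ordered by how much variation they account for.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Score each chunk on the first principal component.
prin1 = centered @ Vt[0]
print(prin1)
```

The sign of a component is arbitrary, but chunks with similar feature profiles land on the same side of the axis, which is all the genre plots below rely on.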
Figure 1 shows the results of running PCA on the Docuscope results from the fragments of the plays of the First Folio. The work reported in Hope and Witmore (2004) established that there is a very clear linguistic distinction between Shakespeare’s Comedies and Histories, and this figure confirms that finding on another level. In the figure we have plotted the two Principal Components which account for most of the linguistic differences between Comedy and History in Shakespeare: Principal Component 1 (Prin1) on the horizontal axis, and Principal Component 4 (Prin4) on the vertical axis (we will discuss exactly what goes into these later in this paper; for the moment, all that matters is the separation between the genres).
We begin by noting that chunks of Comedy tend to score highly on both Prin1 and Prin4: scoring highly on Prin1 pushes them to the right half of the graph, while scoring highly on Prin4 pushes them to the upper half, with the result that most of the chunks of Comedy group together in the upper right quadrant. Those readers with a traditional (or post-modern!) literary training may be tempted to focus on the outliers here — for example, one red dot at the extreme left of the graph — and these, as we will see later, can be interesting. But for the moment, remember that digitally-based research is better at the gloop than the plums — the boring conformity, rather than the spectacular maverick.
Conversely, chunks of History all score low on both Principal Components, resulting in a grouping of these in the lower left quadrant of the graph. We could draw a diagonal line across this graph from upper left to lower right, and leave most of the Comic chunks above it, and most of the History chunks below it. The statistical analysis is telling us that there are highly significant, and consistent, linguistic differences between Shakespearean Comedy and History — but we should remember that all the analytic tools can “see” are 767 individual texts. The ascription of those chunks to the genres “Comedy” and “History” was done by the editors of the First Folio. Our analytic tools (Docuscope and PCA) have identified linguistic similarities and differences in the population of text-chunks, and we have represented these visually, and overlaid the folio genre divisions. The extent to which the most significant linguistic similarities and differences in the population correlate with Renaissance genre divisions is, to our eyes, striking.
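The diagonal separator just described amounts to a simple rule on the two component scores: the line from upper left to lower right through the origin is Prin1 + Prin4 = 0, and a chunk lies above it exactly when its two scores sum to a positive number. A minimal sketch, with hypothetical scores rather than real Prin1/Prin4 values:

```python
def side_of_diagonal(prin1, prin4):
    """Return which side of the diagonal prin1 + prin4 = 0 a chunk
    falls on: 'above' (the Comedy side) or 'below' (the History side)."""
    return "above" if prin1 + prin4 > 0 else "below"

# Hypothetical component scores for two chunks.
print(side_of_diagonal(1.2, 0.8))    # high on both, comedy-like
print(side_of_diagonal(-0.9, -1.1))  # low on both, history-like
```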
So, one early claim of our work is as follows: Shakespeare’s Comedies and Histories are linguistically distinct from each other. This distinctiveness can be shown statistically, and it is consistent. Let us try to unpick this claim as a way of demonstrating our methods, offering a critical understanding of iterative techniques and revealing the linguistic “gloop” or matrix of Shakespearean Comedy.
First of all, what are Prin1 and Prin4? What is Docuscope counting, the presence or absence of which is being expressed by these scales? Docuscope is essentially a smart dictionary: it “reads” strings of characters, looking for words and collections of words it “recognizes.” When it encounters a word or phrase it knows, that string is counted. “Recognizes” here means matches: Docuscope consists of a list, or dictionary, of over 200 million possible strings of English, each assigned to one of 101 functional linguistic categories called “Language Action Types” (LATs). When Docuscope encounters a string it recognizes, the associated LAT is credited with one appearance. For example, “I” and “me” are strings which Docuscope assigns to the LAT “FirstPerson”: the occurrence of either in a text is recorded as an appearance of the LAT “FirstPerson” (with one important caveat we will explain below).
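Docuscope's actual dictionary is far too large to reproduce, but the basic tally-by-lookup idea can be sketched as follows; the dictionary entries and the sample sentence here are invented for illustration, not drawn from Docuscope itself:

```python
# Hypothetical miniature "dictionary": each string is assigned to
# exactly one Language Action Type (LAT). The real Docuscope has
# millions of strings and 101 LATs; these four entries are invented.
LAT_DICTIONARY = {
    "i": "FirstPerson",
    "me": "FirstPerson",
    "you": "DirectAddress",
    "not": "DenyDisclaim",
}

def count_lats(text):
    """Credit the matching LAT with one appearance per recognized word."""
    tallies = {}
    for word in text.lower().split():
        lat = LAT_DICTIONARY.get(word.strip(".,;:!?"))
        if lat is not None:
            tallies[lat] = tallies.get(lat, 0) + 1
    return tallies

print(count_lats("I pray you, do not mock me."))
```

Here “I” and “me” each add one appearance to “FirstPerson,” while unrecognized words (“pray,” “mock”) are simply passed over.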
Clearly, we are dealing with human interpretations and definitions based on a particular theory of how language works, which in this case is a model offered by the linguist Michael Halliday. Docuscope works in a mechanical manner in that it counts every string in every text it encounters in the same way, but the decision about what to count (what constitutes a functional string) and how to classify it (which LAT or higher category to put the string in) is not mechanical: ultimately it rests on decisions made by the architect of Docuscope, David Kaufer, and these decisions are open to challenge. Digitally-based research does not offer us the impossible dream of objective humanities research. Yet it does offer us the possibility of applying subjective, humanities-based insights in a consistent way, to test their applicability and utility across a large number of instances. Iterative criticism offers a way of being consistently subjective at a certain level of the analysis.
One aspect of the way Docuscope works is that a word can only be counted in one string, with Docuscope always seeking to include a word in the longest possible string. So not all instances of “I” are automatically included in the LAT “FirstPerson”: those which occur with a tensed verb will be counted as “SelfDisclosure,” because these strings are longer. This raises an interesting issue: Docuscope was designed to allow the investigation of rhetorical effects on the assumption that different types of string have different types of experiential effect on readers. Implicit in the way it defines functional strings (a word joins the longest possible string, and only that string) is the assumption that individual words have one and only one functional effect on readers. In fact, we know from psycholinguistic research that linguistic effects can be multiple: words and sounds can “prime” for other words, for example. So Docuscope’s definition of “string” (the longest possible string, and only that string) may be necessary from a practical point of view, but on the level of individual words or clusters of words, its heuristic classifications are an oversimplification. This is the type of caveat we need to make explicit in digitally-based research. Such a limitation does not render Docuscope’s findings meaningless: the patterns we have found so far are consistent across our work with Docuscope and make sense in terms of non-iterative work on genre. But no investigative technique is without limitations: counting things is never simple.
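The longest-match rule can be sketched in the same toy fashion. The two-word pattern and its LAT assignment below are hypothetical; the point is only that once “i know” is consumed as “SelfDisclosure,” the “i” inside it is no longer available to “FirstPerson”:

```python
# Invented patterns illustrating the longest-match rule: "i" alone is
# FirstPerson, but pronoun + tensed verb is a longer string, so when
# both match, the longer one wins and consumes both words.
PATTERNS = {
    ("i", "know"): "SelfDisclosure",   # hypothetical two-word string
    ("i",): "FirstPerson",
}

def tally(words):
    """Scan left to right, preferring the longest match at each
    position; each word is consumed by at most one string."""
    counts = {}
    i = 0
    while i < len(words):
        for length in (2, 1):           # try the longer window first
            key = tuple(words[i:i + length])
            if key in PATTERNS:
                lat = PATTERNS[key]
                counts[lat] = counts.get(lat, 0) + 1
                i += length             # consume the matched words
                break
        else:
            i += 1                      # unrecognized word: move on
    return counts

print(tally("i know what i am".split()))
```

With this toy dictionary, the first “i” is swallowed by “SelfDisclosure” and only the second registers as “FirstPerson,” which is exactly the one-word-one-string behavior the caveat above describes.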
Figure 2 is a graphical representation of the linguistic make-up of Prin1 and Prin4. We can think of it as repeating Figure 1, this time with the linguistic categories used in counting mapped onto the space, rather than the chunks of plays. Once again, Prin1 is shown along the horizontal axis, and Prin4 on the vertical axis. The dataspace is centered on the point 0,0 at the graph’s origin, which represents a value of zero on both scaled Principal Components. From this point extend arrows, each one representing a LAT. The length of each arrow indicates how far that LAT’s loading departs from neutrality on the two graphed components. For example, a feature that appeared at the mean value for the whole sample would be graphed at 0,0, indicating that it played no role in distinguishing this group from any other. A feature such as “SelfDisclosure” has a long arrow to the right because it has a high positive loading on Prin1 — play chunks high on Prin1 will have large amounts of “SelfDisclosure.” However, the arrow is horizontal because the LAT is neutral on Prin4 — it plays no positive or negative role in ordering the plays along this scale.
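In statistical terms, the arrows are the loadings of each feature on the two plotted components. A minimal sketch with an invented, already-centered matrix (not our Shakespeare data) shows how a feature that barely varies ends up with a short arrow near the origin:

```python
import numpy as np

# Toy centered feature matrix (chunks x features); values invented.
# Features 1 and 2 vary strongly and in opposition; feature 3 barely varies.
X = np.array([
    [ 2.0, -1.5,  0.1],
    [ 1.5, -2.0, -0.1],
    [-1.5,  2.0,  0.0],
    [-2.0,  1.5,  0.0],
])

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Rows of Vt are the principal axes; the loading of feature j on
# component k is Vt[k, j]. In a biplot, feature j is drawn as an
# arrow from the origin to (loading on PC1, loading on PC2).
arrows = Vt[:2].T                      # one (x, y) arrow per feature
lengths = np.linalg.norm(arrows, axis=1)

print(arrows)
print(lengths)   # the near-constant third feature gets the shortest arrow
```

A feature sitting at its mean in every chunk contributes nothing to either component, so its arrow collapses toward 0,0, just as the text describes.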
As with Figure 1, we can imagine a diagonal line drawn from top left to bottom right, through the 0,0 point. Linguistic features characteristic of Comedy have long arrows above this line; linguistic features associated with History have long arrows below it. With this in mind, we can start to pull out the linguistic features that are statistically significant in making up the matrix of Shakespearean Comedy. A key point to remember is that we are not only identifying presence: we are also identifying correlated absence. Shakespeare’s Comedies are “high” on both Prin1 and Prin4 (this is why they cluster in the upper right quadrant in Figure 1): they are characterized by those features that show positive scores on one or both of the axes here. For example: “DenyDisclaim,” “SelfDisclosure,” “DirectAddress” and “FirstPerson” are all frequent in Comedy (we will define and illustrate these in a moment). Conversely, Shakespeare’s Comedies are also characterized by a lack of those features which show strong negative scores on one or both of the axes, in this instance: “Motions,” “SenseProperty,” “SenseObject,” “Inclusive,” and “CommonAuthorities.”
And, iterative research can tell us, Shakespeare makes use of precisely those features he avoids in Comedy to constitute the matrix of History: the two variables “SelfDisclosure” and “SenseObject” are almost directly opposed. A loadings biplot like the one below tells us that the use of one type of word (or string of words) seems to preclude the use of its opposite. This holds for all the longer vector arrows in the diagram that extend from opposite sides of the origin:
For example, “LanguageReference,” “DenyDisclaim,” and “Uncertainty” strings are used in opposition to those classed under the LAT “CommonAuthorities.” If an item scores high on Prin4 (as most comedies do), it will be high in “LanguageReference,” “Uncertainty,” and “DenyDisclaim” strings, while simultaneously lacking “CommonAuthorities” strings. We can learn a lot by looking at this diagram, since — once we have decided that these components track a viable historical or critical distinction among texts — it shows us certain types of language co-occurring in the process of making this distinction (e.g. “this text is, or is not, a Comedy”). “DirectAddress” and “FirstPerson” thus tend to go together here (lower right), as do “Motions,” “SenseProperty,” and “SenseObject” (upper left).
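The opposition between two LATs can be checked directly as a negative correlation across chunks: high use of one predicts low use of the other, which the biplot draws as arrows extending from opposite sides of the origin. A sketch with invented frequencies (not real Docuscope counts):

```python
import numpy as np

# Invented per-chunk frequencies for two LAT-like features that,
# as in the biplot, tend to exclude one another.
self_disclosure = np.array([9.0, 8.0, 7.0, 1.0, 2.0, 1.0])
sense_object    = np.array([1.0, 2.0, 1.0, 8.0, 9.0, 7.0])

# A Pearson correlation close to -1 means the two features are used
# in opposition across the population of chunks.
r = np.corrcoef(self_disclosure, sense_object)[0, 1]
print(r)
```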
Put another way, what this graph illustrates is what Mistress Ford detects in Falstaff’s “disposition” and “words”: she and the statistics alike find a discrepancy among texts that do not “adhere and keep place together” any more than it is possible to set the hundredth Psalm to the tune of “Green Sleeves.” PCA shows us those things which consistently avoid each other, and those things which do “adhere and keep place together” — schooling like linguistic fish.