Open Review: "Shakespeare and New Media"

2. Gloop and the Banality of Digital Reading: Comedy and History

We begin with an analogy based on a popular item of English cuisine: the pudding. Many English puddings feature a goopy matrix in which something more substantial is intermixed, for example a piece of fruit like a plum. In our case, gloop is a useful substance to think with because it is analogous to the linguistic gloop that binds together the more spectacular items – the fabulous turns of phrase or memorable passages – that literary critics are likely to seek out and savor. As readers, we tend to ignore the ubiquitous gloop and prospect for the fruit, which means that we remain largely unaware of a large part of our experience of reading. But if that matrix or gloop can be characterized by a machine, humans can return to the plums with a better sense of just why they taste so sweet. Just as Page and Ford move from the forensic comparison of the identical letters to plotting their revenge on Falstaff, so digitally-based research can provide a jumping off point and even occasional guidance for human-based traditional reading.

Figure 1: 776 Pieces of Shakespeare’s Plays from the First Folio, each of 1000 words, rated on two scaled principal components (1 and 4). The color key for the dots: Histories (green), Comedies (red), Tragedies (orange) and Late Plays (blue). Late plays are: The Tempest, The Winter’s Tale, Cymbeline, Henry VIII.

Figure 1 is a plot of 776 pieces of Shakespeare plays – each one containing 1000 consecutive words from a play (we discuss the reasons for chopping plays up so arbitrarily below). Each piece of text has been subjected to rhetorical analysis by Docuscope, whose operations we will discuss in more detail below. The results of this analysis, which comprise frequency counts of just under one hundred linguistic categories, have been put through a complex but very common statistical procedure known as Principal Component Analysis (PCA). The procedure makes comparisons between a large number of features within a population, allowing us to identify patterns of similarity and difference within the population based on correlating the presence and absence of features. Thus, if feature A is found in a group of the population, PCA asks whether feature B is also present, or predictably absent. PCA thus attempts to relate differences and absences within a population by making associations between them.[11] These associations are expressed by placing members of the population at value-points along a scaled Principal Component. This procedure is good at making sense of complex relationships within large, complex populations – and, as a very excited statistician told us over lunch one day, Shakespeare’s language is one of the most complex and interesting populations around.
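The chunking step can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the function names are hypothetical, and a "word" is assumed to be whatever a prior tokenizing step produced. The first function follows the procedure described in note [12] (consecutive 1000-word units from the first word, with the short final piece discarded); the second is one possible reading of the evenly spaced alternative proposed there.

```python
def chunk_words(words, size=1000):
    """Cut a word list into consecutive chunks of exactly `size` words.

    Chunking starts at the first word; a final piece shorter than
    `size` is discarded.
    """
    n_full = len(words) // size
    return [words[i * size:(i + 1) * size] for i in range(n_full)]


def chunk_words_evenly(words, size=1000):
    """Hypothetical variant: same number of chunks, but the leftover
    words are spread as gaps between chunks instead of dropped from
    the end of the text."""
    n_full = len(words) // size
    if n_full == 0:
        return []
    if n_full == 1:
        return [words[:size]]
    span = len(words) - size  # start index of the last chunk
    # Integer division keeps consecutive starts at least `size` apart,
    # so the chunks never overlap.
    starts = [(i * span) // (n_full - 1) for i in range(n_full)]
    return [words[s:s + size] for s in starts]


# Toy "play" of 3500 numbered tokens standing in for words.
toy_play = list(range(3500))
plain = chunk_words(toy_play)
spread = chunk_words_evenly(toy_play)
```

With the toy input, the first function keeps three chunks and drops the last 500 tokens, while the second keeps three chunks that begin at tokens 0, 1250 and 2500, so the end of the text is represented.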

In this instance, the statistical package is making multiple comparisons between the relative frequencies of 99 linguistic categories in the 767 1000-word chunks of Shakespeare. Once it has made these comparisons, it combines the linguistic categories into “Principal Components” of highly positively and negatively correlated features, seeking to construct components that account for as much of the variation within the population as is mathematically possible.[12] Each component is thus an answer to the questions, “Are these bits of plays similar to each other?” and “Do the bits of plays form any groups with members of the group all sharing, or lacking, the same features?”
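For readers who want to see the mechanics, the following is a minimal sketch of PCA carried out on a correlation matrix (the variant described in note [18]), written with NumPy. It is not the statistical package used for the figures. The function name and the toy data are assumptions made for illustration: four invented categories measured over twenty invented chunks, two of which are built to oppose each other.

```python
import numpy as np


def pca_on_correlation(freqs):
    """PCA of a (chunks x categories) frequency table via its correlation matrix.

    Returns the chunk scores, the component vectors (one column per
    component) and the share of variation each component accounts for.
    Categories with zero variance must be dropped beforehand.
    """
    X = np.asarray(freqs, dtype=float)
    # Center and scale every category so that busy categories do not
    # swamp sparse ones.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # largest component first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                     # position of each chunk on each component
    explained = eigvals / eigvals.sum()
    return scores, eigvecs, explained


# Invented data: category B falls whenever category A rises.
rng = np.random.default_rng(0)
base = rng.normal(size=20)
toy = np.column_stack([
    10 + base + 0.1 * rng.normal(size=20),  # hypothetical category A
    10 - base + 0.1 * rng.normal(size=20),  # hypothetical category B
    5 + rng.normal(size=20),                # unrelated category
    5 + rng.normal(size=20),                # unrelated category
])
scores, vectors, explained = pca_on_correlation(toy)
```

In this toy case the first component gives A and B weights of opposite sign, which is the "correlated absence" the text describes: a chunk that scores high on the component is rich in one category and poor in the other.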

Figure 1 shows the results of running PCA on the Docuscope results from the fragments of the plays of the First Folio. The work reported in Hope and Witmore (2004) established that there is a very clear linguistic distinction between Shakespeare’s Comedies and the Histories, and this figure confirms that finding on another level.[13] In the figure we have plotted the two Principal Components which account for most of the linguistic differences between Comedy and History in Shakespeare: Principal Component 1 (Prin1) on the horizontal axis, and Principal Component 4 (Prin4) on the vertical axis (we will discuss exactly what goes into these later in this paper; for the moment, all that is important is to see the separation between genres).

We begin by noting that chunks of Comedy all tend to score highly on both Prin1 and Prin4: scoring highly on Prin1 pushes them to the right half of the graph, while scoring highly on Prin4 pushes them to the upper half of the graph, with the result that most of the chunks of Comedy group together in the upper right quadrant of the graph. Those readers with a traditional (or post-modern!) literary training may be tempted to focus on the outliers here – for example, one red dot at the extreme left of the graph – and these, as we will see later, can be interesting, but for the moment, remember that digitally-based research is better at the gloop than the plums – the boring conformity, rather than the spectacular maverick.

Conversely, chunks of History all score low on both Principal Components, resulting in a grouping of these at the lower left quadrant of the graph. We could draw a diagonal line across this graph from upper left to lower right, and leave most of the Comic chunks above it, and most of the History chunks below it. The statistical analysis is telling us that there are highly significant, and consistent, linguistic differences between Shakespearean Comedy and History – but we should remember that all the analytic tools can “see” are 767 individual texts. The ascription of those chunks to the genres “Comedy” and “History” was done by the editors of the First Folio. Our analytic tools (Docuscope and PCA) have identified linguistic similarities and differences in the population of text-chunks, and we have represented these visually, and overlaid the Folio genre divisions. The extent to which the most significant linguistic similarities and differences in the population correlate with Renaissance genre divisions is, to our eyes, striking.

So, one early claim of our work is as follows: Shakespeare’s Comedies and Histories are linguistically distinct from each other. This distinctiveness can be shown statistically, and it is consistent. Let us try to unpick this claim as a way of demonstrating our methods, offering a critical understanding of iterative techniques and revealing the linguistic “gloop” or matrix of Shakespearean Comedy.

First of all, what are Prin1 and Prin4? What is Docuscope counting, the presence or absence of which is being expressed by these scales? Docuscope is essentially a smart dictionary: it “reads” strings of characters looking for words, and collections of words, it “recognizes.” When it encounters a word or phrase it knows, this string is counted. “Recognizes” means matches: Docuscope consists of a list, or dictionary, of over 200 million possible strings of English, each assigned to one of 101 functional linguistic categories called “Language Action Types” (LATs).[14] When Docuscope encounters a string it recognizes, the associated LAT is credited with one appearance. For example, “I” and “me” are strings which Docuscope assigns to the LAT “FirstPerson”: the occurrence of any one of them in a text is recorded as an appearance of the LAT “FirstPerson” (with one important caveat we will explain below).

Clearly, we are dealing with human interpretations and definitions based on a particular theory of how language works, which in this case is a model offered by the linguist Michael Halliday.[15] Docuscope works in a mechanical manner in that it counts every string and every text it encounters in the same way, but the decision about what to count (what constitutes a functional string) and how to classify it (which LAT or higher category to put the string in) is not mechanical: ultimately this is based on decisions made by the architect of Docuscope, David Kaufer, and these decisions are open to challenge.[16] Digitally-based research does not offer us the impossible dream of objective humanities research. Yet it does offer us the possibility of applying subjective Humanities-based insights in a consistent way to test their applicability and utility across a large number of instances. Iterative criticism offers a way of being consistently subjective at a certain level of the analysis.

One aspect of the way Docuscope works is that a word can only be counted in one string, with Docuscope always seeking to include a word in the longest possible string. So not all instances of “I” are automatically included in the LAT “FirstPerson”: those which occur with a tensed verb will be counted as “SelfDisclosure” because these strings are longer. This raises an interesting issue: Docuscope was designed to allow the investigation of rhetorical effects on the assumption that different types of string have different types of experiential effects on readers. Implicit in the way it defines functional strings (a word joins the longest possible string, and only that string) is that individual words have one and only one functional effect on readers. In fact, we know from psycho-linguistic research that linguistic effects can be multiple: words and sounds can “prime” for other words, for example.[17] So Docuscope’s definition of “string” (the longest possible string, and only that string) may be necessary from a practical point of view, but on the level of individual words or clusters of words, its heuristic classifications are an oversimplification. This is the type of caveat we need to make explicit in digitally-based research. Such a limitation does not render Docuscope’s findings meaningless: the patterns we have found so far are consistent across our work with Docuscope and make sense in terms of non-iterative work on genre. But no investigative technique is without limitations: counting things is never simple.
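The "longest possible string, and only that string" rule can be illustrated with a small Python sketch. This is not Docuscope's code: the four-entry dictionary is invented for the example (only the LAT names “FirstPerson” and “SelfDisclosure” come from the discussion above), and scanning left to right while taking the longest match at each position is an assumption about how ties between overlapping strings would be resolved.

```python
from collections import Counter

# Hypothetical toy dictionary mapping word strings to LATs. The real
# lists are vastly larger; these particular entries are assumptions.
TOY_DICTIONARY = {
    ("i",): "FirstPerson",
    ("me",): "FirstPerson",
    ("i", "am"): "SelfDisclosure",
    ("i", "have"): "SelfDisclosure",
}
MAX_LEN = max(len(key) for key in TOY_DICTIONARY)


def count_lats(words):
    """Greedy longest-match counting: each word joins at most one string."""
    counts = Counter()
    i = 0
    while i < len(words):
        # Try the longest candidate string first, then shorter ones.
        for length in range(min(MAX_LEN, len(words) - i), 0, -1):
            lat = TOY_DICTIONARY.get(tuple(words[i:i + length]))
            if lat is not None:
                counts[lat] += 1
                i += length  # words in this string are now used up
                break
        else:
            i += 1  # no known string starts here; move to the next word
    return counts


counts = count_lats("i am sure he told me and i".split())
```

In the toy sentence the first “i” is absorbed into the longer string “i am” and credited to “SelfDisclosure”, so it is not also counted as “FirstPerson”; only “me” and the final “i” are. This is the single-effect simplification discussed above.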

Figure 2 is a graphical representation of the linguistic make-up of Prin1 and Prin4. We can think of it as repeating Figure 1, this time with the linguistic categories used in counting mapped onto the space rather than the chunks of plays. Once again, Prin1 is shown along the horizontal axis, and Prin4 on the vertical axis. The dataspace is centered on the point 0,0 at the graph’s origin, which represents a value of zero on both scaled Principal Components.[18] From this point extend arrows, each one representing a LAT. The length of each arrow indicates how far that LAT’s loading departs from neutral on the two graphed PCs. For example, a feature that appeared at the mean value for the whole sample would be graphed at 0,0, indicating that it played no role in distinguishing this group from any other. A feature such as “SelfDisclosure” has a long arrow to the right because it has a high positive loading on Prin1 – play chunks high on Prin1 will have large amounts of “SelfDisclosure.” However, the arrow is horizontal because the LAT is neutral on Prin4 – it plays no positive or negative role in ordering the plays as they appear along this scale.
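One common way to compute such arrows, when PCA is run on a correlation matrix, is to take the correlation between a category's frequencies and the chunk scores on each plotted component. Whether the package used for Figure 2 plots exactly this quantity is an assumption. The sketch below uses invented scores and frequencies for six chunks, chosen so that the hypothetical LAT rises with Prin1 and is unrelated to Prin4.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def arrow_for(lat_freqs, x_scores, y_scores):
    """(x, y) tip of a LAT's arrow: its correlation with each plotted component."""
    return pearson(lat_freqs, x_scores), pearson(lat_freqs, y_scores)


# Invented values for six chunks (not taken from the study).
prin1 = [-2.0, -1.0, 0.0, 0.0, 1.0, 2.0]
prin4 = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]
lat_freqs = [1.0, 2.0, 3.0, 3.0, 4.0, 5.0]  # rises with prin1

x, y = arrow_for(lat_freqs, prin1, prin4)
```

Here the arrow tip lands at approximately (1, 0): long and pointing right, because the category tracks Prin1 closely, and horizontal, because it is neutral on Prin4.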

As with Figure 1, we can imagine a diagonal line drawn from top left to bottom right, through the 0,0 point. Linguistic features characteristic of Comedy have long arrows above this line; linguistic features associated with History have long arrows below the line. With this in mind, we can start to pull out the linguistic features that are statistically significant in making up the matrix of Shakespearean Comedy. A key point to remember is that we are not only identifying presence: we are also identifying correlated absence. Shakespeare’s Comedies are “high” on both Prin1 and Prin4 (this is why they cluster in the upper right quadrant in Figure 1): they are characterized by those features that show positive scores on one or both of the axes here.[19] For example: “DenyDisclaim,” “SelfDisclosure,” “DirectAddress” and “FirstPerson” are all frequent in Comedy (and we will define and illustrate these in a moment). Conversely, Shakespeare’s Comedies are also characterized by a lack of those features which show strong negative scores on one or both of the axes here, in this instance: “Motions,” “SenseProperty,” “SenseObject,” “Inclusive,” “CommonAuthorities.”

And, iterative research can tell us, Shakespeare makes use of precisely those features he avoids in Comedy to constitute the matrix of History: the two variables “SelfDisclosure” and “SenseObject” are almost directly opposed. A loadings biplot like the one below tells us that the use of one type of word (or string of words) seems to preclude the use of its opposite. This would be true of all the longer vector arrows in the diagram that extend from opposite sides of the origin:

Figure 2: Loadings biplot for scaled Principal Components 1 and 4 used to create the scatterplot in Figure 1.

For example, “LanguageReference,” “DenyDisclaim,” and “Uncertainty” strings are used in opposition to those classed under the LAT “CommonAuthority.” If an item scores high on Prin4 (which most Comedies do), it will be high in “LanguageReference,” “Uncertainty” and “DenyDisclaim” strings, while simultaneously lacking “CommonAuthority” strings. We can learn a lot by looking at this diagram, since – once we have decided that these components track a viable historical or critical distinction among texts – it shows us certain types of language co-occurring in the process of making this distinction (e.g. “this text is, or is not, a Comedy”). “DirectAddress” and “FirstPerson” thus tend to go together here (lower right), as do “Motions,” “SenseProperties,” and “SenseObjects” (upper left).

Put another way, what this graph illustrates is what Mistress Ford detects in Falstaff’s “disposition” and “words”: both find a discrepancy among texts that do not “adhere and keep place together” any more than it is possible to set the hundredth Psalm to the tune of “Green Sleeves.” PCA shows us those things which consistently avoid each other, and those things which do “adhere and keep place together” – schooling like linguistic fish.

  • [11] There are other techniques that could have been used to explore the variation in these data — one that has been employed recently by text analysts is Latent Dirichlet Allocation — but we have chosen PCA for two reasons. First, it is a frequently-used procedure in statistics, which means its properties are well-known. Second, it provides groupings of the plays that are often perfectly recognizable in terms of existing literary critical categories and discriminations: because we can work all the way from a component down to the sentence level where its significant elements can be observed, we have not felt the need to engage in more sophisticated statistical techniques that might produce “better grouping” but not be as easily tracked to ground level language effects and strategies.
  • [12] The fact that we work on 1000-word chunks of plays rather than whole plays is likely to strike readers as strange and arbitrary. The reasons for this are statistical. The most significant reason is that working with chunks of plays means that we identify features which are consistently used across the whole text of plays: features used at a very high rate at just one point of a play will affect the score for just one or two chunks, and will appear as outliers in a statistical analysis (of course, for some types of literary reading, we might be interested precisely in features which occur at a high rate at one point of a play – but this is to shift back to traditional reading rather than digitally-based research). A second reason is that the mathematics of the statistics demands that populations be made up of more items than are being counted for: since in this case we are counting for 99 linguistic categories (after dropping categories that received all zero scores), we need a population size greater than the 36 plays of the First Folio – “chunking,” as this procedure is known, is a recognized and acceptable way to deal with this problem. We should note the chunking method we used for these tests cut the plays into 1000-word units starting from the first word. In each case therefore, a section at the end of the play was discarded as not being 1000 words long. Since all these sections included the end of each play, we have introduced a non-random element into our analysis. In future tests, we will avoid this problem by evenly spacing our 1000-word segments throughout the body of the text, in effect distributing the “remainder” between these segments.
  • [13] In order to indicate the interpretive nature of the definitions of genre in this paper, we capitalize the first letters of Comedy, History, Tragedy and Late Plays when we want to refer to those linguistic features specific to these genres as specified by the First Folio editors and the subsequent editors who called out the so-called “late plays” as their own category, which for us are The Tempest, Cymbeline, The Winter’s Tale, and Henry VIII. (We follow the Folio editors’ designations of all the plays except those so designated as Late.) The language of “Comedy,” when it is referred to in this essay, is thus not the language of all comic writing tout court, but rather “comedy” as stipulated by the Folio editors (minus The Tempest, which was subsequently classified as Late).
  • [14] The names of the LATs do not contain spaces. This is a requirement imposed by programs that will not tolerate absent or non-existent characters and is thus part of the odd ontology of names in the digital domain.
  • [15] On Halliday’s Functional Grammar, see M. A. K. Halliday, Introduction to Functional Grammar, Second Edition (London: Edward Arnold, 1994).
  • [16] Early in our work, for example, we considered revising the Docuscope string definitions and assignments, and higher-level structure, to address them explicitly to Early Modern English, since the program was developed for use on Present-day English. Although this remains an option for the future, we decided against this, largely for practical reasons (the initial construction of Docuscope took Kaufer almost a decade, with almost as much prior thinking and research: he might be justly referred to as the “Samuel Johnson of strings”). In practice, because Kaufer did much of his string-definition using the OED as a template, Docuscope deals with Early Modern English reasonably well (forms such as “thou” are included, for example). This too is an example of a difference between traditional literary research, which tends rightly to be highly punctilious about choice of text, and digitally-based research, where the volumes of data involved tend to make new preparation processes time-intensive, but also mean that low-level “errors” do not markedly affect the final results. We have obtained solid results using the Moby Shakespeare, and have begun working with files from the EEBO TCP database, with the assistance of Martin Mueller, who has modernized enough Renaissance Drama texts for us to begin studying the full corpus of digitized drama from the mid-sixteenth to the mid-seventeenth centuries (see conclusion to this paper). Some of our future techniques are sensitive to tiny variations in infrequent items, and in these cases, choice and quality of edited text may be more important. Indeed, understanding the role of “small dashes” of certain types of words in populations of digitized texts will be one of the subjects of our future research.
  • [17] See, for example, P. J. Schwanenflugel and C. R. White, “The Influence of Paragraph Information on the Processing of Upcoming Words,” Reading Research Quarterly 26, 160-77.
  • [18] PCA was performed on the correlation matrix, which means that results are scaled and centered. Fluctuations among measured variables in which there is a lot of activity (for example, “Description” strings) do not therefore overwhelm parallel fluctuations in variables where there are relatively fewer items being counted. If we were tracking rocking patterns among boats in a bay, we would thus be able to see waves of movement passing across both small and large vessels (variables).
  • [19] It should be noted that we have chosen Principal Components 1 and 4 to graph out of a much larger array of components that explain variation in the Shakespearean corpus. PCs 1 and 4 can be shown to do a statistically significant job of separating out Comedies from Histories using something called the Tukey Test, which we performed on all of the components. Not all components separate out all of the groups equally well: they track different underlying patterns, only some of which correspond to critically accepted genre divisions. What else is being tracked by these components remains to be investigated.

    Source: http://mcpress.media-commons.org/ShakespeareQuarterly_NewMedia/hope-witmore-the-hundredth-psalm/2-gloop-and-the-banality-of-digital-reading/