Natural Language Processing (NLP) has come a long way since 1982, when Anthony Kenny published The Computation of Style: An Introduction to Statistics for Students of Literature and Humanities. The methods are more sophisticated, the machines are both cheaper and more powerful, and the effort required to carry out experiments has dropped sharply. But this book by an eminent philosopher is still useful because it offers a lucid introduction to basic mathematical concepts for readers with no “mathematical competence beyond a rusty memory of junior school arithmetic and algebra”. It would be helpful if everyone who uses NLP techniques had a clear understanding of the underlying concepts. At the very least, they should have a clear idea of how to read the results with the appropriate sense of confidence.

NLP routines are often used for authorship attribution. Those who don’t care about that field typically don’t care at all. Those who do, care a lot and have often gone about it in a contentious manner: think of Homeric and Biblical scholarship in the 19th century. An NLP routine may tell you with considerable precision about patterns that are found in texts, but it is a big leap from saying that A is like (or unlike) B to asserting that A was or was not written by the author of B. The latter is a probability judgement or “prediction” that may turn out to be wrong. Some record keeper at Wimbledon probably has a list of all finals in which a player had two successive championship points to win the tournament. Those records would probably “predict” that Federer would win in 2019. He actually won more points, but he lost anyhow.

The New Oxford Shakespeare Authorship Companion, published in 2017, has led to lively arguments about the (ab)uses of statistical methods for deciding who wrote what. This may be a good time for stepping back from those arguments and focusing instead on the question of what features can be identified by algorithms and what degree of difference counts as big or small. The conventional 5% threshold of “significance” is almost certainly not a good guide.

The EarlyPrint project (https://earlyprint.org) holds about 850 play texts written between 1550 and 1700. The texts come from the Text Creation Partnership (TCP). They have been tokenized and linguistically annotated. A word token represents a lexical item in a grammatical state. The data are kept as the @lemma and @pos “attributes” of XML word “elements”. This tediously explicit form of marking up simple words allows you to represent a text at various levels of abstraction. You can anonymize it by mapping all names to placeholders. You can also strip the texts of orthographic or typographic accidentals that for some purposes may provide critical forensic evidence but are not part of a text’s essential style (I am aware that the distinction between accident and essence is fraught).
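
To make the markup concrete, here is a minimal sketch in Python of what those attributes buy you. The element name, the attribute names, and the three levels of abstraction follow the description above, but the sample fragment and the part-of-speech tags are invented for illustration; the real EarlyPrint files are far richer than this.

```python
# A toy fragment in the spirit of the markup described above; the tags and
# spellings are illustrative, not taken from an actual EarlyPrint file.
import xml.etree.ElementTree as ET

sample = """
<sp who="A">
  <l><w lemma="love" pos="vvb">loue</w> <w lemma="be" pos="vbz">is</w>
     <w lemma="blind" pos="j">blinde</w></l>
</sp>
"""

root = ET.fromstring(sample)

# Three ways of representing the same line, at increasing levels of abstraction:
surface = [w.text for w in root.iter("w")]                 # original spellings
lemmas = [w.get("lemma") for w in root.iter("w")]          # spelling-independent
lexgram = [f'{w.get("lemma")}_{w.get("pos")}' for w in root.iter("w")]  # lexical item in a grammatical state

print(surface)   # ['loue', 'is', 'blinde']
print(lemmas)    # ['love', 'be', 'blind']
print(lexgram)   # ['love_vvb', 'be_vbz', 'blind_j']
```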

Now consider the following experiment. You represent each text through seven non-contiguous and anonymized 1,500-word chunks. In one version of the experiment you keep the boundaries of individual utterances so that the stage rhythm of taking turns is preserved. You also maintain the difference between prose and verse. The length of speeches and the distinction between prose and verse are clearly essential features of a play. But in another version you drop even those and treat each chunk as just a sequence of words. Philip Stone, author of The General Inquirer, a pioneering content analysis tool from the 1960s, somewhere describes science, in a not quite tongue-in-cheek way, as a method of “systematically throwing away information” in order to get measurable results. You would expect that dropping speaker turns and the prose/verse distinction would significantly coarsen results. But if the results were similar you would have learned something.
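
The chunking step itself is mechanical. Here is a sketch under stated assumptions: the play has already been reduced to a flat token list with a parallel list of name flags, names are replaced by a placeholder, and the seven chunks are drawn at non-overlapping offsets. The chunk size, the number of chunks, and the sampling rule are parameters of the experiment, not properties of the data.

```python
# A sketch of the chunking step. The helper names and the sampling rule are
# my own; the experiment described above fixes only the idea of seven
# anonymized, non-contiguous 1,500-word chunks per play.
import random

CHUNK_SIZE = 1500
N_CHUNKS = 7

def anonymize(tokens, name_flags):
    """Replace every proper name with a placeholder token."""
    return ["NAME" if is_name else tok for tok, is_name in zip(tokens, name_flags)]

def sample_chunks(tokens, n_chunks=N_CHUNKS, size=CHUNK_SIZE, seed=0):
    """Draw n non-overlapping chunks of `size` tokens at randomly chosen,
    size-aligned offsets (usually, though not always, non-contiguous)."""
    rng = random.Random(seed)
    candidate_starts = range(0, len(tokens) - size, size)
    starts = sorted(rng.sample(list(candidate_starts), n_chunks))
    return [tokens[s:s + size] for s in starts]

# Hypothetical usage, assuming play_tokens and play_name_flags exist:
# chunks = sample_chunks(anonymize(play_tokens, play_name_flags))
```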

In either case you end up with ~6,000 chunks. Assign a random ID to each and turn them over to a team of Computer Science students who know something about Natural Language Processing (NLP), statistical routines, and visualization but have no interest in or knowledge of Early Modern drama. Their task will be to show various ways in which these chunks are (un)like each other. The point of the experiment is not to show who wrote what, but to map what is known or knowable. On average, are differences between authors in the same generation larger or smaller than differences between random samples of plays separated by thirty or sixty years? A patient mapping of answers to such questions in a spirit of “modesty and cunning” could provide a useful context for occasions where you want to move from describing differences between A and B to making judgments about authorship.
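
Both the blinding and the kind of comparison just described are easy to sketch. The sketch below assumes that each chunk has already been reduced to a numeric feature vector and carries metadata (decade, genre, author, text) kept in a separate key; the helper names in the commented-out usage are hypothetical, and none of this produces a real result.

```python
# A sketch of blinding the chunks and of one crude distance-based comparison.
# The data structures (dicts with "vector" and "metadata") are my assumption.
import uuid
import numpy as np

def blind(chunks):
    """Give each chunk a random ID; keep the metadata key away from the analysts."""
    key, blinded = {}, []
    for chunk in chunks:
        cid = uuid.uuid4().hex
        key[cid] = chunk["metadata"]          # decade, genre, author, text
        blinded.append({"id": cid, "vector": chunk["vector"]})
    return blinded, key

def mean_pairwise_distance(vectors):
    """Average Euclidean distance between all pairs of chunk vectors."""
    X = np.asarray(vectors, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    n = len(X)
    return dists.sum() / (n * (n - 1)) if n > 1 else 0.0

# Hypothetical usage, once the key is consulted:
# same_generation = mean_pairwise_distance(vectors_for_authors(["Dekker", "Heywood"]))
# sixty_years_apart = mean_pairwise_distance(vectors_for_years(1600, 1660))
```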

The simplest and most brutal technique turns every text chunk into a “bag of words”. In this BOW model all that remains of a text is a list of words and their frequencies. It works better than it should, as you can see from the popularity of wordles and other word clouds. More complex models explore the relationships between words. Some dozen years ago I listened to a talk by an intelligence officer who reversed the standard technique of throwing away the 120 or so most common “stop words” and instead threw out all the content words, leaving a text as a fabric of stop words. He said that for Arabic texts this was a sure-fire method of discovering the writer’s region of origin, which is a different thing from identifying a particular author. This approach has since been formalized as “word adjacency networks” (WAN).
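
Both representations can be sketched in a few lines. The stop-word list below is a tiny illustrative fragment, not the list of 120 or so words used in practice, and the example sentence is simply a convenient quotation.

```python
# A sketch of the two representations: a bag of words, and the inverted
# version that keeps only the stop words.
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "but", "of", "to", "in", "that",
              "it", "he", "she", "you", "i", "is", "not", "with", "for", "as"}

def bag_of_words(tokens):
    """All that survives of the text: word types and their frequencies."""
    return Counter(t.lower() for t in tokens)

def stop_word_fabric(tokens):
    """Throw out the content words instead, keeping the grammatical skeleton."""
    return [t.lower() for t in tokens if t.lower() in STOP_WORDS]

tokens = "The quality of mercy is not strained".split()
print(bag_of_words(tokens))      # every word type with its count
print(stop_word_fabric(tokens))  # ['the', 'of', 'is', 'not']
```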

Whether the technique is BOW, WAN, or something else, the point of the experiment will be to run several tests, let’s say seven, and create outputs in which each test ends up as a map with dots scattered across it. At that point the veil of ignorance is lifted, and every chunk is identified by decade, genre, author, and text. I could imagine a visualization routine that lets you select dots by type and display them in different colours. How well do the seven tests distinguish between chunks from before 1580 and after 1680? Do the tests agree among themselves? Averaging polling results has become a common practice and is a standard feature of Nate Silver’s 538 site. Does it have a role in NLP statistics?
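
Such a display is easy to imagine in code. The sketch below uses PCA as a stand-in for whatever projection a given test would actually produce, and it assumes a chunk-by-feature matrix X and a list of labels (decades, authors, and so on) drawn from the key; both are assumptions, not parts of the experiment as described.

```python
# A sketch of the kind of map imagined above: one dot per chunk, coloured by
# whatever metadata is selected once the veil of ignorance has been lifted.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_chunks(X, labels, title="One test, one map"):
    """Scatter the chunks in two dimensions, one colour per label value."""
    coords = PCA(n_components=2).fit_transform(np.asarray(X, dtype=float))
    labels = np.asarray(labels)
    for value in sorted(set(labels)):
        mask = labels == value
        plt.scatter(coords[mask, 0], coords[mask, 1], s=10, label=str(value))
    plt.legend(fontsize="small")
    plt.title(title)
    plt.show()

# Hypothetical usage: plot_chunks(X, decades) or plot_chunks(X, authors).
```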

You would expect that the chunks from before 1580 and after 1680 would cluster very heavily in different regions of the display. If they didn’t, you might abandon the experiment right away, because a method that flunks this simple test is unlikely to do better on more interesting tasks. What about the ~200 Shakespeare chunks (written ~1590 to ~1610) and the ~200 Shirley chunks (~1630 to ~1660)? You would again expect clearly separated and tight clusters, but you would not know whether the difference measures authorial or generational distance.
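
That simple test can be made concrete with a plain classifier: if cross-validated accuracy in separating pre-1580 from post-1680 chunks is not far above chance, the features behind a given test are unlikely to fare better on subtler questions. The sketch below assumes a chunk-by-feature matrix X and a list of dates; the choice of logistic regression is mine, not part of the experiment.

```python
# A sketch of the sanity check: can an ordinary classifier tell early chunks
# from late ones? X is a chunk-by-feature matrix, years the hidden dates.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def early_vs_late_check(X, years):
    """Cross-validated accuracy of separating pre-1580 from post-1680 chunks."""
    X = np.asarray(X, dtype=float)
    years = np.asarray(years)
    mask = (years < 1580) | (years > 1680)
    y = (years[mask] > 1680).astype(int)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X[mask], y, cv=5)
    return scores.mean()   # anything near 0.5 would be grounds for giving up
```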

The test results become more interesting if you stay within a generation and look for boundaries and overlaps of authorial clusters. It would not be surprising if authors clustered and a plausible shape emerged from connecting the outer dots of author X. But what if author X and author Y share some space? If the overlap is limited, the machine could still tell you something useful about distinctive features of an author. But any overlap raises the bar for making plausible claims that X is or is not the author of some disputed text or part of it, and the larger the overlap, the harder the claim becomes.
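
One rough way of putting a number on such an overlap, assuming once again that each chunk has been reduced to a feature vector: for each of X’s chunks, ask whether its nearest neighbour belongs to X or to Y. The measure and the choice of Euclidean distance are my assumptions; many alternatives would do.

```python
# A rough overlap measure: the share of author X's chunks whose nearest
# neighbour (excluding the chunk itself) belongs to author Y. A high share
# means the two clouds interpenetrate.
import numpy as np

def cross_author_neighbour_rate(x_chunks, y_chunks):
    """Fraction of X's chunks whose nearest neighbour is one of Y's chunks."""
    X = np.asarray(x_chunks, dtype=float)
    Y = np.asarray(y_chunks, dtype=float)
    both = np.vstack([X, Y])
    owners = np.array([0] * len(X) + [1] * len(Y))   # 0 = author X, 1 = author Y
    cross = 0
    for i in range(len(X)):
        d = np.linalg.norm(both - both[i], axis=1)
        d[i] = np.inf                                # do not count the chunk itself
        cross += int(owners[np.argmin(d)] == 1)
    return cross / len(X)
```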

In my days as a Homer scholar I treated the books of the Iliad and Odyssey and the long Homeric hymns as good enough samples for various quantitative inquiries. I encountered recurring patterns where the Homeric books occupied a range with considerable overlap between the Iliad and Odyssey. The Homeric Hymns were clearly outside that range, except for the Aphrodite Hymn, which typically sat within but towards the edges of the Homeric range. Karl Reinhardt, the greatest literary critic among Germany’s Hellenists and a man who would not have touched quantitative inquiries with a ten-foot pole, wrote a charming essay about the Aphrodite Hymn attributing it to the poet of the Iliad. The numbers square with that hypothesis but they hardly prove it. Quantitative differences between the Iliad and Odyssey are compatible with the hypothesis of the ancient chorizontes (the “separatists”), but they do not rule out the possibility that the two epics were the work of the same poet. The stronger argument for separate authorship is based on the deeply and pervasively different ethos of the two works.

Returning to Early Modern drama, the mapping of variance within and difference between various categories (period, genre, author, text) is a critical first task. From one perspective, it is a boring and redundant thing that just uses tedious numbers with many decimal points to tell us what we already know. From another perspective it establishes useful thresholds and limits for arguing persuasively that this or that feature of a text is barely within or clearly outside some known boundary.

How well would the tests do in re-assembling the seven chunks of a particular text? Most play texts can be accurately dated within a three-year range. You could generate random seven-chunk clusters from within a three-year period and then compare them with the text clusters. I am not a mathematician, but I imagine that there would be a way of assigning to each cluster of seven chunks a single score that measures its tightness. A random sample of such scores for randomly generated clusters would create a distribution against which the scores of the actual texts could be measured. If those scores consistently exceeded the random scores, you would have to credit the machine with non-trivial “reading” skills of a peculiar sort. Your admiration would increase if the scores for 1 Henry VI, Titus Andronicus, or Pericles were considerably below average for Shakespeare’s plays.
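
Here is a sketch of how such a scoring routine and its random baseline might look. The particular score (the reciprocal of mean pairwise distance, so that a higher number means a tighter cluster) is my assumption; any measure of dispersion would serve the argument, and the chunk vectors and three-year windows are assumed to exist already.

```python
# A sketch of the reassembly test: score each seven-chunk group for tightness
# and compare real texts against random seven-chunk groups drawn from all
# chunks dated within the same three-year window.
import random
import numpy as np

def tightness(vectors):
    """Higher means tighter: the reciprocal of mean pairwise Euclidean distance."""
    X = np.asarray(vectors, dtype=float)
    n = len(X)
    dists = [np.linalg.norm(X[i] - X[j]) for i in range(n) for j in range(i + 1, n)]
    return 1.0 / (1.0 + float(np.mean(dists)))

def random_baseline(window_vectors, n_trials=1000, group_size=7, seed=0):
    """Tightness scores of random seven-chunk groups from one three-year window."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        group = rng.sample(range(len(window_vectors)), group_size)
        scores.append(tightness([window_vectors[i] for i in group]))
    return np.array(scores)

# A text would earn the machine some credit if its tightness score landed,
# say, above the 95th percentile of the corresponding random baseline.
```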