In the ~600 plays that are the target of Shakespeare His Contemporaries (SHC) there are at least 60,000 errors that should be fixed. Natural Language Processing (NLP) folks may observe at this point that 60,000 errors amount to 0.4% of the total corpus. They may add that the errors are not worth fixing, because any statistical routine applied to a linguistic corpus has a much larger error rate of its own and no routine’s results are seriously affected by so trivial an error rate. They are right, but they ignore the “philological yuck factor.” Serious students of Early Modern texts are like obsessive housewives when it comes to dirt anywhere. Judith Siefring at Oxford ran a user survey of attitudes towards TCP texts. The one message that came through loud and clear was: Fix the mistakes.

The virtues of a clean text are very much like the virtues of a clean house (written by an author whose wife, very far from obsessive, nonetheless wishes her husband cared as much about a clean house as about a clean text). The texts need to be cleaned up, partly because at the margins less “noise” will improve the “signal” and catch the outliers that philologists like, but largely because scholarly readers do not like dirty texts and will not use them.

The US Department of Agriculture has classifications for eight grades of beef, from ‘Prime’ to ‘Utility’, recently and unaffectionately known as “pink slime.” It may be helpful to have a similar hierarchy of seals of (dis)approval for TCP texts and similar digital archives. And for reasons explained below, it might be good to make quality statements not about an archive as a whole but about each text in that archive. If there is a process of collaborative curation, a text would move from its initial quality rating to one beyond which further improvement is unnecessary or impracticable.

The textual equivalent of Prime Beef would always have to be a text that is fully proofread according to the standards one associates with a scholarly edition. The SHC Project does not aim at such a high standard, except in the case of a Young Scholar edition, where proofreading of professional quality is a sine qua non. Rather, the SHC Project aims at fixing errors that are either explicitly flagged in the texts (more about this below) or can be identified through search routines that are likely to retrieve suspect spellings. A text in which these two types of errors are fixed will not be perfect, but it will be in much better shape, and it is very likely to be close to the quality standards aimed at in TCP transcriptions.

Collaborative curation in the SHC Project is based on AnnoLex and its routines for error identification.  AnnoLex has elaborate background routines for keeping track of errors and their correction. It also allows for the backfitting of corrected errors into the source texts.

Known unknowns

How do you go about fixing 60,000 errors? It helps to know about their “what” and “where.” In Donald Rumsfeld’s parlance, they divide into “known unknowns” and “unknown unknowns.” The transcribers, working from digital scans of microfilm images of printed pages, with highly variable quality at each stage, were instructed to mark with some gap notation any letters, words, lines, paragraphs or pages that they could not decipher or that were missing in the first place. Transcribers were also instructed not to transcribe material printed in other alphabets or using musical or mathematical notation, but to mark it with appropriate gap notations. The SHC Project does not try to transcribe such materials, which require special skills not possessed by anybody on the SHC team.

These are the known unknowns. If you exclude texts that have missing pages, foreign passages, or passages with musical or mathematical notation, there are 592 texts with known unknowns that in principle could be identified and fixed by the SHC team. The known unknowns are very unevenly distributed. Some texts are quite clean, many texts are not very dirty, and a few texts are very dirty. I have developed a simple formula that computes the percentage of incomplete or missing words. According to it, half the errors are found in 10% of the plays, and half the plays have on average no more than one error per page, which is not great from a proofreading perspective, but it is not terrible either. The great 18th-century editor Edmond Malone said something like the following about the textual problems of Shakespeare: “The text of our author is not as bad as it is believed to be.” Something similar is true of the TCP texts. But humans have a regrettable tendency to judge any barrel by its worst apples, and the quality reputation of the TCP archive probably suffers from that.
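
The formula is not spelled out here. As a rough illustration only, a percentage of incomplete or missing words and the concentration of defects in the dirtiest texts could be computed along the following lines. The two gap markers and both function names are assumptions made for this sketch, not the project's actual notation or code.

```python
# Hypothetical gap markers: the real notation depends on how the texts are encoded.
LETTER_GAP = "\u2022"  # assumed marker for an undecipherable letter inside a word
WORD_GAP = "\u25ca"    # assumed marker standing in for a whole missing word


def defect_percentage(words):
    """Return the percentage of words that are incomplete or missing."""
    if not words:
        return 0.0
    defective = sum(1 for w in words if LETTER_GAP in w or w == WORD_GAP)
    return 100.0 * defective / len(words)


def share_of_defects_in_dirtiest(defect_counts, fraction=0.10):
    """Return the share of all defects found in the dirtiest `fraction` of texts."""
    total = sum(defect_counts)
    if total == 0:
        return 0.0
    ranked = sorted(defect_counts, reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    return sum(ranked[:top_n]) / total
```

Given one defect count per play, `share_of_defects_in_dirtiest(counts)` returns a value near 0.5 when half the errors sit in a tenth of the plays.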

Unknown unknowns

The unknown unknowns consist of typographical errors, sometimes by the printer, and sometimes by the transcriber. Printers were quite aware of their mistakes and sometimes apologized for them in special errata sections or pleaded with the readers to correct them, as in Harding’s Sicily and Naples:

Reader. Before thou proceed’ſt farther, mend with thy pen theſe few eſcapes of the preſſe: The delight & pleaſure I dare promiſe thee to finde in the whole, will largely make amends for thy paines in correcting ſome two or three ſyllables.

Many typographical errors in a corpus can be found by looking for spellings that occur only in one text. AnnoLex has a simple routine for retrieving such spellings and listing them in alphabetical order.  They will typically include names and special purpose vocabulary, but it does not take much time to find clear errors, as in this list from Arden of Faversham: ‘la nds’, ‘inturde’, ‘perturde’, ‘seereete’, ‘staunderous’, ‘thyspels’.
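
The idea of listing spellings that occur in only one text can be sketched in a few lines. This is not AnnoLex's routine, whose internals are not described here. It is a minimal illustration that assumes the corpus is available as a mapping from a text identifier to its word tokens. The function name and the lower-casing of tokens are choices made for the sketch.

```python
from collections import defaultdict


def spellings_unique_to_one_text(corpus):
    """Map each text id to the alphabetically sorted spellings found in no other text.

    `corpus` is assumed to be a dict of text id -> iterable of word tokens.
    """
    # Record, for every spelling, the set of texts in which it appears.
    texts_by_spelling = defaultdict(set)
    for text_id, tokens in corpus.items():
        for token in tokens:
            texts_by_spelling[token.lower()].add(text_id)

    # Keep only the spellings attested in exactly one text.
    unique = defaultdict(list)
    for spelling, text_ids in texts_by_spelling.items():
        if len(text_ids) == 1:
            (only_text,) = text_ids
            unique[only_text].append(spelling)

    return {text_id: sorted(spellings) for text_id, spellings in unique.items()}
```

The resulting lists still mix names and special-purpose vocabulary with true errors, so a human curator has to make the final call.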

The confusion of long ‘s’ with ‘f’ is another error commonly made by printers and transcribers. Some instances are hard to track.
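
One way to look for such confusions, offered only as a sketch and not as the project's method, is to take a spelling that is absent from a reference word list and test whether replacing one or more of its ‘f’s with ‘s’ yields a known word. The word list (`lexicon`) and the function name are assumptions. The sketch also shows why some cases are hard to track: when the misread form is itself a word, a lookup of this kind cannot flag it.

```python
from itertools import product


def long_s_candidates(word, lexicon):
    """Suggest known words obtained by reading some 'f' in `word` as 's'.

    Returns an empty list when `word` is itself in `lexicon`: a misreading
    that happens to be a real word cannot be detected by lookup alone.
    """
    if word in lexicon:
        return []
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    suggestions = set()
    # Try every non-empty combination of 'f' positions to swap for 's'.
    for choice in product([False, True], repeat=len(positions)):
        if not any(choice):
            continue
        chars = list(word)
        for pos, swap in zip(positions, choice):
            if swap:
                chars[pos] = "s"
        candidate = "".join(chars)
        if candidate in lexicon:
            suggestions.add(candidate)
    return sorted(suggestions)
```

The sketch checks only the direction in which ‘f’ was written where ‘s’ was meant. The reverse confusion would need the mirror-image substitution.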

The SHC Project does not distinguish between printers’ and transcribers’ errors but corrects both. The logging system of AnnoLex, however, allows for keeping them apart and dealing with them differently at a later stage.

Where to start cleaning

In a house you would probably start the cleaning in the dirtiest rooms, if only because you don’t want to track more dirt into the less dirty ones. In the SHC Project we take the opposite approach and begin by picking the low-hanging fruit. There are three reasons for this:

  1. If a text has few errors it was probably transcribed from a legible page image, and it is easier for the curators to identify the errors, especially in the early stages of their work.
  2. We want to maximize the number of plays about which by the end of the summer we can say with confidence that certain types of error are absent or very rare.
  3. We want to tackle some of the difficult texts towards the end, when the curators have a lot of collective experience. In the case of some plays the page images are so bad that perhaps they should not have been transcribed at all. It is questionable whether they can be fixed. But the number of those cases is mercifully small.