This is a progress report on the basic clean-up of the 504 plays in my current Shakespeare his Contemporaries corpus (SHC). I hope to release an updated corpus by the end of November. It will replace the current corpus at https://github.com/martinmueller39/shc

The SHC texts are partially curated versions of the TCP texts, which have “known defects”. Transcribers were instructed to provide markings for characters, words, phrases, lines, paragraphs, and pages that were missing or that they could not decipher. These markings are very accurate, and they support an initial quality assessment of the corpus text by text and indeed page by page. They are also sufficiently accurate to support the use of machine-learning techniques for the correction of incompletely transcribed words. That discovery came too late for the SHC corpus but may prove useful with other corpora.

TCP texts also have “unknown defects”, which I define as manifest errors, whether the printer’s or the transcriber’s. ‘Assliction’, ‘prefumption’, and ‘hnsband’  are manifest errors, as are many cases of words wrongly joined or split. Some scholars might be inclined to think of the first example as a printer’s playfully subversive intervention and as such a textual feature of interest, but I doubt that.

For the 504 texts in the SHC corpus I have divided textual defects into “little gaps” and “big gaps” somewhat along the lines of the division of HTML into “inline” and “block” level elements. Gaps up to and including a line are “little gaps”. Everything else is a “big gap”. As for “big gaps”, they add up to about 100 pages in 35 different texts, with almost half of them occurring in just three texts. Curation of the SHC Corpus has so far focused on the little gaps, with incremental improvement over time. Each iteration has reduced the defect rate by a factor of two or more, as is shown by the following table that lists defect rates per 10,000 words at different percentiles:

Text stage                 25th    Median    75th    90th
uncurated TCP texts        5       14        62      126
SHC texts, June 2015       1.5     6         18      41
SHC texts, October 2015    0.2     2.5       7.3     20
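
For readers who want to produce the same kind of summary for another corpus, here is a minimal sketch in Python of how per-text defect rates and their percentiles might be computed. The (defect count, word count) pairs are invented for illustration and are not SHC figures, and the interpolation rule is whatever Python's `statistics.quantiles` applies with `method="inclusive"`, which may differ from the method behind the table above.

```python
from statistics import quantiles


def defects_per_10k(defects: int, words: int) -> float:
    """Defect rate normalised to 10,000 words."""
    return defects * 10_000 / words


# Hypothetical (defect count, word count) pairs, one per text.
texts = [(0, 18_000), (3, 21_000), (12, 24_000), (40, 20_000), (150, 25_000)]
rates = sorted(defects_per_10k(d, w) for d, w in texts)

# quantiles(..., n=100) returns the 99 cut points between percentiles,
# so the p-th percentile sits at index p - 1.
cuts = quantiles(rates, n=100, method="inclusive")
summary = {p: round(cuts[p - 1], 1) for p in (25, 50, 75, 90)}
print(summary)
```

Reporting percentiles rather than a single average is a sensible choice here, because a handful of very defective texts would otherwise dominate the figure.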

Last June there were just 59 texts with no known defects. Now there are 118.

Given the work that has already been done, how much more work would it take to bring every text to a level at which

  1. all known little gaps have been filled
  2. all known big gaps have been transcribed
  3. manifest errors in the texts, whether the printer’s or the transcriber’s, have been corrected?

If you add up the work required and compare it with the potential labour pool of undergraduates willing to do the work (and learn a lot while doing it), the task of bringing all texts up to that level is entirely manageable. In 2013 and 2015 a dozen undergraduates from Northwestern and Washington University in St. Louis fixed approximately 45,000 little gaps and in the process identified and fixed about 10,000 manifest and unknown defects. The number of “little gaps” in the SHC corpus is ~9,200, of which 6,800 are unidentified punctuation marks. 9,200 defects sounds like a lot, but it involves only 0.07% of the 12.5 million tokens in the SHC corpus. The remaining defects cluster heavily: a third of them occur in just a dozen plays.
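
The arithmetic behind those figures can be checked in a few lines of Python. This is only a back-of-the-envelope sketch using the rounded numbers quoted above; the variable names are mine, and it treats tokens and words as interchangeable, which is an assumption.

```python
little_gaps = 9_200          # remaining little gaps (rounded)
punctuation_gaps = 6_800     # of which unidentified punctuation marks
corpus_tokens = 12_500_000   # approximate size of the SHC corpus

share_of_corpus = little_gaps / corpus_tokens * 100       # in percent
rate_per_10k = little_gaps * 10_000 / corpus_tokens       # corpus-wide mean
punctuation_share = punctuation_gaps / little_gaps * 100  # in percent

print(f"{share_of_corpus:.2f}% of tokens")
print(f"{rate_per_10k:.2f} defects per 10,000 tokens")
print(f"{punctuation_share:.0f}% of little gaps are punctuation")
```

The corpus-wide mean of roughly 7 defects per 10,000 tokens is well above the October 2015 median of 2.5, which is what one would expect when the remaining defects are concentrated in a small number of plays.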

It will certainly take work to complete the initial clean-up of the SHC corpus. But it is not an extraordinary amount of work, and it is work of the kind that many hands can make light. The students at Northwestern and Washington University were supported by summer research grants of a kind that is available at many institutions.  If you are a freshman or a sophomore and scholarly work in a text-centric discipline is one of the things you think about, spending a summer doing “lower criticism” of a quite humble kind is not a bad way of doing something useful while learning that the textual ground is hardly ever bedrock.

Most of the clean-up has been done with the help of the AnnoLex curation tool, which gives users access to the EEBO page images, which are digital scans of microfilms. These are rarely good and often atrocious. But even quite poor images can provide conclusive evidence when used with the quite extraordinary image manipulation powers of modern browsers. As reported earlier, Hannah Bredar, Kate Needham, and Lydia Zoells separately or together visited the Bodleian, Folger, Houghton, and Newberry, as well as the Rare Book libraries at Northwestern and the University of Chicago, where they fixed many defects for which the EEBO images did not provide good enough evidence.

Most of the remaining clean-up will need to be done by looking at the originals or at high-quality images produced from them. The Northwestern Library digitized its copy of the 1616 Ben Jonson folio as well as half a dozen James Shirley playbooks. Students proofread the TCP transcriptions of Shirley’s plays against the new images and, in some cases, the originals. They enjoyed that and discovered a few, but not many, transcription errors in the process.

The images produced at Northwestern in a routine fashion at an in-house cost of about a dollar a page are spectacularly better than the EEBO images. Looking at the originals is of course more fun, but for most purposes the surrogates are easier to use, and in some cases they may provide better evidence. Not to speak of the fact that an image can be consulted anywhere, anytime on the Web if no rights restrictions are attached to it.

Kate Needham and Lydia Zoells did a useful (though not complete) census of holdings of Early Modern plays in American libraries. From it you learn that print copies of most of the SHC plays still in need of a basic clean-up are held at a library within the daily foot traffic of undergraduates at Chicago, Columbia,  Harvard, Illinois-Urbana, Northwestern, Penn, Smith, Texas-Austin,  UCLA, Williams, and  Yale.  These are the university libraries that according to Kate and Lydia hold more than two dozen original Early Modern play books.

What if one thought of 2016, the 400th anniversary of Shakespeare’s death and of Ben Jonson’s chutzpah in offering his plays as “works”, as the occasion for an Eranos (Greek for potluck) where libraries in a loosely coordinated fashion contribute new images of some of their holdings of SHC texts? By the end of that year many of those plays could be available as “digital combos” or digital surrogates in which facsimile images are aligned with versions of the TCP transcriptions that have gone through initial rounds of curation by undergraduates. Students at Northwestern and Washington University have clearly demonstrated–if that ever needed demonstrating–that students do this well and learn much from it. Individual contributions from libraries to such a project would not have to be substantial for their aggregate to make a big difference.

The “digital combos” resulting from such a project would require more work to be certifiable documentary editions of a particular text, and they would co-exist for years in varying states of (im)perfection. But that is OK as long as there is an environment, both social and technical, that encourages collaborative, iterative, and incremental curation.