Introduction and Summary
This is a report about “Shakespeare His Contemporaries” or SHC, my project for creating an interoperable digital corpus of plays that, in addition to Shakespeare’s, includes most of the plays written within a generation before and after his active career as a playwright. Its keywords are “query potential”, “digital surrogate”, “algorithmic amenability”, and “collaborative curation”. I see it as a small part of, and demonstrator for, a much larger project: the collaborative curation and exploration of the digital surrogates of Early Modern books created by the Text Creation Partnership. That project would in turn be a substantial chapter in a Book of English that I have often defined as
- a large, growing, collaboratively curated, and public domain corpus
- of written English since its earliest modern form
- with full bibliographical data
- and light but consistent structural and linguistic annotation
I will say something about each of these larger topics before turning to Shakespeare His Contemporaries (SHC), hoping that these ‘precursions’ will frame the topic properly.
I look forward to releasing a version of all SHC plays later this spring. In the meantime, you can download a sample set of 40 texts from http://panini.northwestern.edu/mmueller/sampleshc.zip. I will be grateful for any comments about them.
Query Potential, Digital Surrogate, and Algorithmic Amenability
All objects have a query potential that can be extended in various ways. Consider Spitzweg’s Bookworm as a telling capture of scholarly habits in a predigital world. Books on shelves are easier to search than a random heap of them on the floor. If you have a lot of books you need a ladder to get at some of them. Spitzweg’s ladder is a very simple but quite effective search tool. Notice that the top of the ladder is big enough to stand on, so that simple search operations can be performed in situ, with the book retrieved and returned in a single-step operation.
A surrogate is by definition not the same as its original. It will fall short of it in some respect, but may exceed it in others. Digital surrogates of all the books in the library of Spitzweg’s scholar—and a great many more—would fit on an iPhone, where you can read them anywhere any time, send them anywhere, and add to them from anywhere in minutes rather than hours. The readability of a digital page has increased enormously over the past two decades. I know people who do not like computers very much but prefer iPads to books, especially when reading in bed.
Mobile devices should be a critical component in framing any discussion of the infrastructure of primary sources for Early Modern Studies in the decades to come. Printed books make up a high percentage of those primary sources. Their digital surrogates, overwhelmingly from the EEBO-TCP archive, will quickly establish themselves as the most common, and often the only, access to early modern books. They will circulate as epubs of one kind or another on a contemporary version of Hamlet’s table of memory. If a text is half-way readable, convenience and speed will trump quality every time. The text of choice is the text you can get to at 2:00 a.m. in your pyjamas. This may be regrettable, but it is so. Like it or not, digital surrogates will be the “face” of our primary sources much of the time. We should therefore care about their quality and our modes of access to them.
When it comes to books, putting more of them within easy reach of anybody anytime anywhere is probably the most valuable addition to the query potential of their digital surrogates. But it is not the only thing. There are things you can do with digital surrogates that are different from, but may be supportive of, reading them. Franco Moretti’s “distant reading”, Matt Jockers’ “macro-analysis”, Stephen Ramsay’s “algorithmic criticism”, or Michael Ullyot’s “Augmented Criticism Lab” are terms seeking to stake out their parameters. Ted Underwood’s The Stone and the Shell and the Wine Dark Sea blog by Jonathan Hope and Michael Witmore not only offer good demonstrations of these practices but also suggest in their titles that those practices are, or should be, proper parts of a humanist’s toolkit. I have used the term “scalable reading” because I think that quick and easy changes in the angle of vision may be their most salient feature.
Whatever you call them, all those practices rely on the application of Natural Language Processing (NLP) techniques to the documents that constitute the basic materials for scholarship in text-centric disciplines. The words in a text are segmented, classified, and counted in ways that a machine can process. From text “as” data, algorithms can extract data structures that may go through various cycles of selecting, sorting, and quantitative analysis before they are displayed to a human reader for analysis and judgement, whether as text snippets, tables, or visualizations. And once the words have been processed in the right way, they can reach out to other words, whether in the current text or in others.
Bibliographical data have been treated in this fashion for well over forty years, and even the most traditional humanities scholars are used to working with them, often without knowing just how much NLP has gone on “under the hood” in even a simple look-up. But data storage has become so cheap and processing so fast that you can now think about the text “itself” as just another data field in a bibliographical record. Those of us old enough to remember peaceful afternoons riffling through index cards in the wooden drawers of a printed card catalogue would never confuse the card with the book. But we need to remind ourselves that this distinction no longer works: it’s cataloguing or metadata all the way down. Derrida’s reflections on postcards may apply.
NLP has come to the Humanities, and we might as well take advantage of it. If you have a philological turn of mind you will welcome it because you see that those new tools and procedures offer new and more powerful tools for some very old and tedious tasks. You want to find all the stuff you need, only the stuff you need, and you want to get it quickly. Google and others spend billions of dollars and employ extraordinarily subtle and powerful NLP routines to help you find stuff. But they live in the world of Morocco’s choice in the Merchant of Venice: “Who chooses me shall gain what many men desire”.
You can make (and iteratively improve) pretty good guesses about what many men (or women) desire, and fashion search algorithms accordingly. But much scholarship is not about “what many men desire.” It needs “diggable data” and tools for digging into them that are more open-ended and more targetable than routines optimized to serve the immediate needs of “the many”, which includes all of us whenever we look for a cheap flight to Hawaii or a good Indian restaurant in some town we have never been to.
The great German classicist and literary critic Karl Reinhardt wrote in 1946 about the questionable status of “philology,” which for him was a broad term:

It should not exclude itself from the “expansion of its borders” that developments in the humanities have opened to it. But it is part of philological awareness that one deals with phenomena that transcend it. How can one even try to approach the heart of a poem with philological interpretation? And yet, philological interpretation can protect you from errors of the heart. (My translation)
He pleaded for “a methodological modesty that is aware of something that must be left unsaid and that with all perceptiveness or intuition you cannot and should not trespass on.” Reinhardt died a decade before computationally based Natural Language Processing became available, but his wonderfully double-edged analysis of a philological ethos is very applicable to NLP. You would be a fool to believe that any method can take you through the door of a full understanding (if there were such a thing), but you should take advantage of expanded borders and new developments to help you along the way.
New forms of “connecting curation” are part of the “expanded borders and new developments” since Reinhardt’s days. Curation of retrodigitized texts can be divided into the three imperatives of Connect, Complete, and Correct. The second and third are in principle finite: for any TCP transcription it is possible to imagine a state in which all errors have been fixed and all lacunae completed. Connecting curation is in principle endless: there will always be new ways of connecting or enriching a document. Think of August Boeckh’s definition of philology as an unending task of infinite approximation.
NLP technologies provide critical tools for connecting curation or “Linked Data,” a new name for a very old thing. Linked Data happens when I sit at my table with a copy of Sophocles’ Philoctetes in front of me and the Liddell-Scott-Jones Greek dictionary (LSJ) to my left. When I come to a word I do not know, I look it up, a pleasant but time-consuming procedure. If I read the Philoctetes in the Perseus environment, a click on a word takes me to the LSJ entry in a second or so, because a link to it has been built into the text. In another example, I read a Thomason tract from the English Civil War with the OED and the DNB on shelves behind me. I look things up in the former if I don’t know a word and in the latter if I want to find out more about a person. If the proper “connecting curation” has been applied to a digital surrogate of the Thomason tract, a click on a word may take me instantly and successively to
- other occurrences of the word (or its lemma) in this text
- other occurrences in other texts of the corpus
- outside bibliographical, biographical, or lexical resources
The fourth of Ranganathan’s five charming Laws of Library Science says: “Save the time of the reader.” There may in the end be nothing more to computers than their ability to save us time (leaving aside their devilish skill at making us waste it). But time is the most precious of all commodities. If the value of computers consists of “nothing more” than their ability to reduce the cost of lookups, that “nothing more” is a great deal. “Distant reading” or “macroanalysis” very largely depend on phenomenal reductions in the time cost of look-ups and in extending the range of phenomena that can be looked up very fast and very accurately. You may therefore say that the query potential of digital surrogates is largely a corpus query potential. But you should not forget that “drilling down” for new forms of close reading is a power that comes and increases with corpus-wide analysis.
“Connecting curation” is the key factor in using the digital medium to enhance as well as extend access. Extended access uses digital technology in an emulatory mode and thinks of it as bringing more books to more readers. Enhanced access thinks of digital technology as providing new tools for the analysis of available materials. With enhanced access the distinction between catalogue information “about” the book and information “in” the books becomes increasingly blurred.
A Digital Book of English
The EEBO-TCP texts provide the obvious foundation for a Book of English, as I have defined it. The first tranche of 25,000 texts has been released into the public domain to be followed by another 40,000 texts in 2020. That is not everything that was printed, not to speak of manuscript materials, but it will include at least one version of most distinct titles published before 1700.
The bibliographical capture of those texts rests on well over a century of scholarship that began when the young Alfred Pollard in 1883 joined the British Museum as an assistant in the Department of Printed Books. His work, and that of Redgrave and Wing, were incorporated into the digital English Short Title Catalogue, which is currently undergoing revisions that will make it more algorithmically amenable and open its data to collaborative curation and enrichment by scholarly communities.
The EEBO-TCP texts were encoded according to the Guidelines of the Text Encoding Initiative (TEI). Think of TEI as a “containerization” of texts, where bits of each text are put in “elements” or virtual and labeled boxes. Each box has rules about what other boxes can go into it and in what order. This does not seem to do much for the human reader, but it has two uses. First, it lets you attach to every box a set of rules about how to format its content for printing or display on a screen. Second, and perhaps more interestingly, it does in principle allow a user to say to a search engine: fetch me words that only occur in boxes with the label ‘l’, i.e. lines of verse. Or, in a more complex and nested query: “Fetch me all words from Hamlet that are spoken by Ophelia in prose.”
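The “Ophelia in prose” query can be sketched in a few lines of Python. This is a minimal illustration of element-aware searching, not the TCP encoding itself: the tiny inline document and the who="Ophelia" attribute value are invented for the example.

```python
# Sketch of element-aware searching over a TEI-style encoded play.
# The sample document and speaker labels are invented for illustration.
import xml.etree.ElementTree as ET

tei = """<TEI><text><body>
  <sp who="Ophelia"><speaker>Oph.</speaker>
    <p>I shall obey, my lord.</p></sp>
  <sp who="Hamlet"><speaker>Ham.</speaker>
    <l>To be, or not to be, that is the question:</l></sp>
</body></text></TEI>"""

root = ET.fromstring(tei)

# "Fetch me all words spoken by Ophelia in prose": keep only <p> (prose)
# descendants of <sp> elements whose who attribute names Ophelia.
words = []
for sp in root.iter("sp"):
    if sp.get("who") == "Ophelia":
        for p in sp.iter("p"):
            words.extend(p.itertext())
prose = " ".join("".join(words).split())
print(prose)  # -> I shall obey, my lord.
```

The same filter with "l" instead of "p" would return only verse, which is the point of the “boxes”: the labels make the distinction queryable.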
I say “in principle” because in practice element-aware searching of the TCP texts has been rare or nonexistent. This may change, or so I hope, with the release of TEI Simple, a project that combines a highly constrained and prescriptive subset of the TEI Guidelines with a formally defined set of processing rules that permit modern web applications to easily present and analyze the encoded texts. It is the major goal of TEI Simple to help with the encoding, management, display, and analysis of large collections of early modern printed books. The EEBO texts were encoded in an SGML version of TEI, but they can be transformed without loss of information into TEI Simple. This has in fact been done: TEI Simple versions of the EEBO-TCP texts in the public domain are available from the Oxford Text Archive. The processing rules are still under development.
“Light but consistent structural annotation” is a pretty accurate description of the TEI encoding used in the TCP archives. The added query potential of this encoding is very significant, and it is likely to be exploited more fully now that the texts are available in a predictable standard format that modern tools and search engines can work with.
Linguistic annotation is not only, or even mainly, for linguists. Wherever human language is the object of computationally assisted analysis, the machine needs to be told a few things about how language works so that it can reliably establish the boundaries of words and sentences, tell nouns from verbs, identify name phrases, etc. It is a way of explicitly introducing these rudiments of readerly and largely tacit knowledge into the data so that the dumb but fast machine can do some of the things that readers do without even knowing that they are doing it. In many scholarly and professional fields from biomedical research to political campaigns NLP routines undergird forms of analysis by users who have not the slightest interest in Linguistics per se.
In the case of the TCP corpus, lightweight linguistic annotation is also by far the most reliable strategy for dealing with orthographic variance, whether you want to level it or identify it as a topic of interest in its own right. In the summer of 2001 a group of librarians and faculty from five EEBO-TCP subscribing institutions met at Northwestern to discuss questions of interface design for the corpus. The report about the meeting states that “Without question, the taskforce members agreed that early modern spelling remains the biggest and most persistent obstacle to easy access,” and “many advocated the addition of a frontend normalizer, which would automatically look for spelling variants of submitted search terms.” The variant detector (VARD) program at Lancaster University’s UCREL Center and the Virtual Modernization Project at Northwestern have been two implementations of a frontend normalizer. The latter has been part of the EEBO search environment since 2008. A context-independent mapping of an old to a standardized spelling has its virtues, but you get much more accurate results from a context-sensitive approach in which you first identify a word token in its particular context as a combination of lemma and part of speech and then map that combination to a standardized spelling that may vary with different purposes.
Linguistic annotation of EEBO-TCP texts has been done at Lancaster University and at Northwestern, using respectively the CLAWS and MorphAdorner tool kits. Over the past two years, the Northwestern project has focused on refining the tokenization and annotation of Early Modern texts in its work on the SHC corpus.
MorphAdorner and the linguistic annotation of the TCP Corpus
MorphAdorner is a general purpose NLP toolkit designed by Philip Burns. A recent grant from the Andrew W. Mellon Foundation supported its improvement in three areas of special relevance to the TCP texts and their imminent release into the public domain:
- Tokenization and the establishment of a framework for word and sentence boundaries with stable IDs to support collaborative curation and improvement of texts over time
- Perfecting the algorithms and training data for lemmatization and POS tagging of Early Modern texts
- Making MorphAdorner functionalities available as Web services that will allow an individual to do this or that with a particular text but will also support automatic workflows.
The separation of initial tokenization from subsequent procedures is a key feature of MorphAdorner. The program creates for each text a hierarchy of addresses sufficiently robust to serve as the basis for collaborative work but sufficiently flexible to allow for the minor corrections that make up at least 95% of the editorial work necessary to make the texts fully acceptable. In addition to supporting the simple correction of manifest errors this stable but flexible tokenization routine turns every token or token range into an explicitly addressable object suitable for more sophisticated forms of exegetical annotation.
With tokenization separated from annotation, MorphAdorner can be rerun over one part of a document without overwriting corrections that may have been made to another part. In addition, different lists of abbreviations can be used with different sets of TEI elements. This is particularly helpful for “paratext”, such as notes, lists, tables, or title pages.
MorphAdorner maps each word token to a lemma, a part-of-speech tag, and a standardized spelling. It assumes that early modern spelling can be reliably mapped to a lemma as it is found in a dictionary like the OED. There are edge cases: ‘gentle’, ‘gentile’, ‘genteel’, ‘human’, ‘humane’ are distinct lemmata with clearly demarcated semantic boundaries in modern English, but the boundaries of those words in 1600 were more fluid.
Differences in spelling between modern and early modern English are partly rule bound: the change from ‘vniuersitie’ to ‘university’ can be turned into an algorithm that works with few exceptions. But the bulk of changes does not yield to rules. Syntax and lexical context will often provide enough clues for algorithmic mappings of spellings that are ambiguous in early modern English, such as ‘loose’, ‘lose’, ‘deer’, ‘boar’, ‘boor’, ‘bore’, ‘heart’, ‘hart’, but such mappings need a pair of human eyes to confirm them.
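The two mapping strategies can be contrasted in a toy sketch. The mapping tables below are invented for illustration; real normalizers such as VARD or MorphAdorner rely on far larger data and on prior tagging to resolve the ambiguous cases.

```python
# Toy sketch of context-independent vs. context-sensitive normalization.
# All mapping tables here are invented for illustration.

# Context-independent: one old spelling maps to one modern spelling.
flat_map = {"vniuersitie": "university", "loue": "love"}

# Context-sensitive: (lemma, part-of-speech) maps to a standardized
# spelling, assuming tagging has already disambiguated the token.
contextual_map = {
    ("hart", "n1"): "hart",    # the deer
    ("heart", "n1"): "heart",  # the organ
}

def normalize(spelling, lemma=None, pos=None):
    """Prefer the (lemma, pos) mapping when tagging is available,
    fall back to the flat spelling table, else leave the word alone."""
    if lemma is not None and (lemma, pos) in contextual_map:
        return contextual_map[(lemma, pos)]
    return flat_map.get(spelling, spelling)

print(normalize("vniuersitie"))                    # -> university
# The same printed form "hart" resolves differently once the tagger
# has decided it stands for the lemma "heart":
print(normalize("hart", lemma="heart", pos="n1"))  # -> heart
```

The point of the second table is exactly the one made above: the printed string alone is ambiguous, but the combination of lemma and part of speech is not.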
MorphAdorner is completely at home with XML and the latest version of TEI (P5). It can output its results in a variety of formats. A tabular output can easily be moved into a relational database environment and is particularly helpful for data review by programs like AnnoLex, discussed later in this report.
Annotated versions of ~2,000 18th-century ECCO-TCP texts can be downloaded from the Abbot site at the Center for Digital Research in the Humanities at the University of Nebraska-Lincoln. Annotated versions of ECCO-TCP texts and 5,000 Evans TCP texts from 17th- and 18th-century America can also be searched in the corpus search section of the MorphAdorner site. This site has a very rough interface but offers a good example of the query potential of a data set that combines structural with linguistic annotation.
This is a good place to mention that not only MorphAdorner, but the revision of the ESTC catalogue and TEI Simple have been generously supported by the Andrew W. Mellon Foundation.
The triple-decker structure of cataloguing a text by bibliographical, discursive, and linguistic criteria
A library catalogue is a paradigm of connecting curation. Six million books scattered in a warehouse are one thing, six million books properly catalogued and shelved quite another. Librarians refer to catalogue data as “metadata.” Such metadata are typically about the object as a whole, and they are gathered in the special genre of the catalogue record, which is distinct from the object it describes. There is, however, no reason why such cataloguing should stop at the water’s edge, so to speak, and not extend into the object itself.
Structural and linguistic annotation are de facto forms of cataloguing, although they are not usually called by that name. When the speeches in a TEI-encoded play are wrapped in separate <sp> elements or lines of verse are wrapped in <l> elements, the parts of a text are catalogued and each part can be marked with a “call number” or unique ID that adds speed and precision to subsequent and machine-based activities. Similarly, when a text is linguistically annotated, every word is identified as a separate token and can be associated with various properties, such as the dictionary entry form of the word (lemma) or its part of speech. Readers will think of a text as words in different locations. For the machine a text is a sequence of locations, each of which can be associated with an arbitrary number of “positional attributes,” including but not limited to the spelling of the word at that position. A word in a text or a book on a shelf each takes up a defined space that you can associate with arbitrary amounts of properties or metadata.
Think of a catalogued and annotated digital corpus as a triple-decker structure with metadata at the top or bibliographical level, at the mid level of discursive articulation, and at the bottom level of individual word occurrences. The catalogue of 65,000 titles is also a catalogue of two billion word tokens, each of which has a virtual record that “inherits” its bibliographical and discursive metadata. For these metadata to be useful it must be possible to use them for cross-cutting analyses. The texts may be apples and oranges, but the metadata must be interoperable.
From any readerly perspective a triple-decker structure of this kind is a bloated monster. A bibliographical catalogue record will on average be 0.1% of the size of a text, while the aggregate of structural and linguistic annotation may add up to ten times the size of the text. But monstrous as it looks to a reader, from a technical perspective there is no longer anything especially forbidding about a deeply curated corpus of 65,000 primary texts with its two billion or so distinct bibliographical, discursive, and linguistic “catalogue records”. Deep curation of this type greatly leverages the query potential of primary source materials.
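What a word location with “positional attributes” looks like to the machine can be sketched minimally; the field names and tag values below are invented for illustration, not MorphAdorner’s actual schema.

```python
# A minimal sketch of positional attributes: a text as a sequence of
# locations, each carrying several layers of metadata. Field names and
# tag values are illustrative only.
from dataclasses import dataclass

@dataclass
class Token:
    spelling: str   # the word as printed
    lemma: str      # dictionary entry form
    pos: str        # part-of-speech tag
    element: str    # enclosing TEI element ("l" = verse, "p" = prose)

# Three word locations from an imaginary passage.
text = [
    Token("loue", "love", "vvb", "l"),
    Token("vniuersitie", "university", "n1", "l"),
    Token("speaks", "speak", "vvz", "p"),
]

# Any attribute can drive a query; here, the lemmata of all verse tokens.
verse_lemmas = [t.lemma for t in text if t.element == "l"]
print(verse_lemmas)  # -> ['love', 'university']
```

Each Token here is, in effect, one of the two billion “catalogue records” of the bottom deck, inheriting its discursive context through the element field and (in a full system) its bibliographical context from the file it belongs to.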
Texts with bibliographical, structural, and linguistic metadata support robust searching, including operations such as:
- Limit a search by bibliographical criteria
- Limit a search to particular XML elements of the searched text(s)
- Perform simple and regular-expression searches for the string values of words or phrases
- Retrieve sentences
- Search for part-of-speech (POS) tags (including tags for proper names) or other positional attributes of a word location
- Combine string values with POS tags or other positional attributes in a single search
- Define a search in terms of the frequency properties of the search term(s)
- Look for collocates of a word
- Identify unknown phrases shared by two or more works (sequence alignment)
- Compare frequencies of words (or other positional attributes) in two arbitrary subcorpora by means of a log-likelihood test
- Perform supervised and unsupervised forms of text classification on arbitrary subsets of a corpus
- Display results flexibly, from single-line concordance output to sentences, lines of verse, paragraph-length context, and full text
- Group and sort search results, and export them to other software programs
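The log-likelihood comparison in the list above can be made concrete with a short sketch, following Dunning’s G2 formula as commonly used in corpus linguistics; the counts are invented for illustration.

```python
# Sketch of the log-likelihood (G2) keyness test, following Dunning's
# formula. The word counts below are invented for illustration.
import math

def log_likelihood(a, b, c1, c2):
    """a, b: occurrences of a word in corpus 1 and corpus 2;
    c1, c2: total token counts of the two corpora."""
    e1 = c1 * (a + b) / (c1 + c2)  # expected count in corpus 1
    e2 = c2 * (a + b) / (c1 + c2)  # expected count in corpus 2
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Is "revenge" overrepresented in a 1M-word tragedy subcorpus (120 hits)
# relative to a 2M-word comedy subcorpus (60 hits)?
g2 = log_likelihood(120, 60, 1_000_000, 2_000_000)
print(round(g2, 1))  # -> 83.2, far above 15.13 (p < 0.0001 threshold)
```

The same function applied word by word over two arbitrary subcorpora yields exactly the kind of comparative ranking the list envisages.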
Some of these queries can be answered only by search engines that do not yet exist in very user-friendly form. But once there are deeply and consistently curated corpora, the search engines will follow.
If your goal is to create clean and readable versions of Early Modern books, the layers of metadata envisaged in this report may strike you as excessive. Librarians may shudder at the thought of extending the creation and maintenance of metadata below the bibliographical level. Professors of English may think that such metadata come between them and the book, forgetting that there is always a lot between them and the book. But what if you think of the EEBO-TCP archive as a single multivolume “book” or the opening chapter in a Book of English? The corpus-wide metadata advocated here have for centuries been part of the envelope of metadata that are constitutive features of certain kinds of books, in particular of editions and reference books. From a quick look at the Wikipedia entry on the Weimar edition of Martin Luther’s works, you learn that its sixty volumes of texts are complemented by thirteen volumes that index places, people, citations, and topics. No Luther scholar would want to be without them.
German has the useful word Handbibliothek or “at hand library” for books that you have and need really quick access to. It is, if you will, a predigital Linked Data structure, with you as reader doing the linking every time you turn from one book to the other. You can think of a deeply tagged version of the EEBO-TCP corpus as an edition that not only includes digital equivalents of traditional indexes but has some parts of a Handbibliothek built right into it, with a lot of additional hooks on which to hang links that reach outside. It is something quite new from one perspective, and something quite old and familiar from another. If done successfully it helps with “agile data integration”, which, as Brian Athey observed, “is an engine that drives discovery.”
The research potential of a scholarly field is enabled and constrained by the complexity and consistency of its (meta)data. In the Life Sciences, GenBank, an “annotated collection of all publicly available DNA sequences,” provides the basis for much research, and the maintenance of this database is itself a research project. You can think of it as the critical edition of a very peculiar form of text written in a four-letter alphabet. Conversely, you can think of the EEBO-TCP archive as a cultural genome whose research potential will benefit from comparable forms of annotation. Metaphors of this kind take you only so far, but they take you quite a ways towards the identification of collaborative data curation as an important and shared task for humanities scholars, librarians, and information technologists.
Shakespeare His Contemporaries
I turn at last to Shakespeare His Contemporaries (SHC), a small chapter in, and demonstrator for, the collaborative curation of the EEBO-TCP as a foundation for a Book of English. The SHC corpus includes 516 non-Shakespearean plays ranging from the mid-sixteenth to the mid-seventeenth century, beginning with Ralph Roister Doister (1552) and ending with the dramatic sketches by Margaret Cavendish, many of which were probably written well before their publication in 1662. All the texts come from the EEBO-TCP archive. Where a text exists in several versions, I picked the earlier one, unless there were strong practical reasons for using the later one.
Emily Dickinson said that “we see comparatively.” My goal has been to create an environment that will support the corpus-wide comparative analysis of Shakespeare and his contemporaries. What can we learn about Shakespeare and his contemporaries from an environment that supports rapid contextualizations and recontextualizations?
Work so far done on the SHC corpus has much to do with the state of the TCP texts. EEBO-TCP is a magnificent but flawed enterprise. Not many of its transcriptions fully meet the scholarly standards one associates with a decent diplomatic edition in the print world. Judith Siefring and Eric Meyer, in their excellent study “Sustaining the EEBO-TCP Corpus in Transition”, say more than once that in user surveys “transcription accuracy” always ranks high on the list of concerns. They also say that when asked whether they reported errors, 20% of the users said “yes” and 55% said “no.” But 73% said that they would report errors if there were an appropriate mechanism, and only 6% said they would not.
There is a big difference between what people say and what they do. But given a user-friendly environment for collaboration, it may be that a third or more of EEBO-TCP users could be recruited into a five-year campaign for a rough clean-up of the corpus so that most of its texts would be good enough for most scholarly purposes. It is a social rather than technical challenge to get to a point where early modernists think of the TCP as something that they own and need to take care of themselves.
Greg Crane has argued that “Digital editing lowers barriers to entry and requires a more democratized and participatory intellectual culture.” In the context of the much more specialized community of Greek papyrologists, Joshua Sosin has successfully called for “increased vesting of data control in the user community.” If the cleanup of texts is not done by scholarly data communities themselves it will not be done at all. And it is likely to be most successful if it is done along the model of “Adopt a highway,” where scholarly neighborhoods agree to get rid of litter in texts of special interest to them.
The engineer John Kittle helped improve the Google map for his hometown of Decatur, Georgia, and was reported in the New York Times (16 November 2009) as saying:
Seeing an error on a map is the kind of thing that gnaws at me. By being able to fix it, I feel like the world is a better place in a very small but measurable way.
Compare this with the printer’s plea in the errata section of Harding’s Sicily and Naples, a mid-seventeenth century play:
Reader. Before thou proceed’st farther, mend with thy pen these few escapes of the presse: The delight & pleasure I dare promise thee to finde in the whole, will largely make amends for thy paines in correcting some two or three syllables.
For SHC, a small slice of EEBO-TCP (<1%), a very capable team of undergraduates working with me has made significant progress towards transcription accuracy. We followed a “how bad is good enough?” strategy that will rub many scholarly editors the wrong way. In the debate about data curation there is a tension between a philological and a probabilistic ethos. Hillel the Elder said that “whosoever destroys a soul, it is considered as if he destroyed an entire world. And whosoever saves a life, it is considered as if he saved an entire world.” The Hillelesque version of the philological ethos at its extreme says something like “He who fails to correct a single error destroys the entire text.” The probabilistic ethos, widely followed in the world of information retrieval and Natural Language Processing, says “the noise level does not matter as long as you get enough signal.”
The probabilists have a point. If you think of the TCP texts as fodder for algorithmic analysis, there is only a very small fraction of texts whose level of defects rises to a point at which it would seriously affect the results of algorithmic analysis. Which tells you that algorithmic analysis may be a quite powerful tool, but it is pretty blunt, and you have to be aware of its bluntness to use it well.
If, on the other hand, you think of the text as a text to be read by a human, the story changes. In an eloquent talk about the 60,000 documents in the University of Virginia Press’ American Founding Era, Penelope Kaiserlian talked about them as things to “cherish and preserve” and as objects of “perpetual stewardship.” These are useful phrases to define an appropriate attitude towards the TCP texts. They are not a quarry to be mined but treasures to be preserved. Some may be “possessions for ever” (as Thucydides claimed for his history of the Peloponnesian War), others are ephemera, not infrequently boring, hateful, or otherwise repellent in their “what” and “how.” But all of them are valuable witnesses to a critical period in the history of the English-speaking world.
Careful readers have a very low tolerance for the errors that NLP folks take in stride as “noise.” Annoyance is not triggered by the failure to understand the intended meaning of the word on the page. Rather, the failure to clean up manifest errors is seen as a form of “dissing” both the text and its readers. What kind of cherishing and preserving is it that puts up with gross error?
From this analysis emerged two imperatives for any SHC text. First, the text must be “human readable” and tell its readers that it has received the minimal degree of attention that a cultural heritage object deserves. As part of a corpus it must also be algorithmically amenable and “machine actionable”. Readers and machines respond very differently to error. Readers will feel a “yuck” response long before the machine fails to extract enough signal. The order matters. The correction of manifest defects is certainly not the most interesting thing to be done to a text or corpus, but it seems to be the thing most on readers’ minds. And readers are trumps, if only for the reason that they will never trust algorithmically constructed results from what they “see” as a messy text. So the boring and humble task of correcting obvious defects is a crucial measure for building confidence.
On the other hand, you do not have to get everything right in the first text before you move on to the second. Winnicott’s “good enough” mother puts in a useful appearance here. We adopted a corpus-wide approach to editing, with a first-phase and quite modest goal of a rough clean-up of those manifest defects in the transcriptions that could be unambiguously fixed without a need to consult the printed original. The result is a corpus in which texts coexist at different levels of (im)perfection. That seems to me a defensible strategy, especially if the texts live in an environment that enables and encourages collaborative curation.
In the summer of 2013 five Northwestern undergraduates had summer internships to work on the collaborative curation of the SHC corpus. Madeline Burg and Nayoon Ahn were rising sophomores, Hannah Bredar and Melina Yeh were rising juniors, and Nicole Sheriko was a rising senior. They worked their way through the SHC corpus, following a path suggested by distinct features of the corpus and its defects. In terms made famous by Donald Rumsfeld, defects in the SHC corpus divide into “known unknowns” and “unknown unknowns.” The transcribers of the TCP texts were instructed to transcribe what they saw on the page. If they could not decipher something, they were instructed to describe the unknown as precisely as possible. Hence the presence in the source texts of markers like <GAP DESC="illegible" EXTENT="1 letter">. You can count these “known unknowns” and from their frequency and distribution make an initial estimate of text quality. Markers of this type cluster heavily in particular pages of particular texts and are in nearly all cases a function of the quality of the digital scan of the microfilm of the original page to be transcribed. That quality is rarely excellent and often atrocious. The more you see of the page scans the more you marvel at how often the transcribers got it right.
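The “known unknowns” are countable precisely because they are explicit markers. As a minimal sketch (in Python, with an invented snippet of text standing in for a real transcription), counting GAP markers and tallying them by their DESC attribute might look like this:

```python
import re
from collections import Counter

# Match TCP-style gap markers like <GAP DESC="illegible" EXTENT="1 letter">
# and capture the DESC value. The pattern is a simplification for
# illustration; real TCP files may vary in attribute order and casing.
GAP_PATTERN = re.compile(r'<GAP\s+DESC="([^"]*)"[^>]*>', re.IGNORECASE)

def gap_census(sgml_text):
    """Return a Counter of known-unknown markers keyed by their DESC value."""
    return Counter(GAP_PATTERN.findall(sgml_text))

# Invented sample text with three markers.
sample = ('Strange news from <GAP DESC="illegible" EXTENT="1 letter">ondon, '
          'brought by <GAP DESC="illegible" EXTENT="2 letters">ssengers '
          'and one <GAP DESC="foreign" EXTENT="1 word">.')

print(gap_census(sample))
# Counter({'illegible': 2, 'foreign': 1})
```

Aggregated per page and per text, counts like these yield exactly the frequency-and-distribution estimate of text quality described above.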
The 516 plays add up to 11,750,000 word and punctuation tokens on 17,351 page images, nearly always images of a double page. The known gaps divide neatly and unevenly into a large class of short, and a small class of long, gaps. To begin with the latter, 23 plays miss a total of 92 page images, and 31 plays are missing 158 lines and 88 paragraphs. Roughly one in 10 plays has a substantial lacuna, and the total number of words (including punctuation) that need transcription is about 70,000, which is not that many words relative to the sizable and global scholarly community whose work centers on Shakespeare and his contemporaries.
There was not much the students could do about the class of long gaps. But they did make a very sizable dent in the class of short gaps, of which there are ~52,000 in the SHC corpus. I asked them to start with plays that had few known errors, mark as cruxes cases that could not be quickly settled, and ignore markers that denoted ambiguous punctuation. When they were done and I had reviewed their work, adding a little here and there, we were down to 18,100 residual defects, which neatly divide into two groups: 11,000 cases (~20 per play) of missing words or words with missing letters, and 7,000 cases of ambiguous punctuation marks.
The great 18th century Shakespeare editor Edmond Malone somewhere said something like “the text of our author is not as bad as it is said to be.” Something similar may be true of the TCP transcriptions. Take a look at the following table, which compares the defect rate per 10,000 words in the 25,000 EEBO-TCP texts now in the public domain with the SHC texts before and after curation:
Known Defects per 10,000 words
| percentile | TCP Phase 1 | SHC plays before | SHC plays after |
A quarter of the 25,000 have no known defects at all. The median rate is about one defect per double page. Even at the 75th percentile there are at most three errors every other page image, which is annoying but tolerable. Things get bad and rapidly worse at the 90th percentile and beyond. There are about 3.7 million known defects in the 25,000 texts of EEBO-TCP Phase 1, but almost half of them (1.78 million) are “owned” by 10% of the texts. Which is both good and bad. Good because you know where to target curation. Bad because people have an incurable tendency to judge any barrel by its worst apples.
If you compare the defect rates of the SHC corpus with the overall TCP figures you notice that up to the third quartile, the defect rates for play texts are worse. Things begin to level off at the 90th percentile, where “bad” is increasingly “equally bad”. This is unsurprising. No more than a quarter of all the EEBO-TCP texts but at least two thirds of the SHC texts were printed before 1630. Broadly speaking, the earlier the text, the harder the transcription.
After curation, the SHC texts look a lot better than before and somewhat better than TCP texts at large. Looking at the 90th percentile from a reader’s perspective, there is a noticeable difference between running across a defect almost every 100 words or once every 250 words. If you look at the number of remaining defects from the perspective of the number of individuals who have or ought to have a professional interest in having cleaned-up versions of the texts, you face an entirely soluble problem.
The distribution of defects in the texts suggests that some form of a quality assurance system would be relatively easy to implement and that it might be a useful thing to do. Combining the known defect rate with a percentile ranking would be the simplest way to go. It would let readers know about the difference between texts that (almost) deserve some Good Housekeeping seal and texts that need a lot of scrubbing.
While fixing known unknowns, we stumbled across roughly 10,000 ‘unknown unknowns’, often by accident and sometimes by looking for spellings unlikely to be right. No more than a handful of the plays in the project were proofread from the first to the last word. So we do not know whether these unknown unknowns are all, most, or just some of the impossible spellings in the SHC corpus. But there seems to be at least one of them for every five known defects. I had thought that there was a positive correlation between known and unknown defects, on the hypothesis that transcribers faced with hard-to-read texts would make more errors in transcribing what they thought they could read. But I was wrong. There is no clear correlation between known and unknown defects. From which I conclude that impossible spellings in the TCP transcriptions are for the most part accurate renderings of what the transcribers saw: spellings like ‘sortunate’, ‘hnsband’, ‘assliction’, ‘a biectly’, ‘I and somely’, ‘lamestallion’ or ‘suriesrend’. It was certainly the right policy to ask transcribers not to emend what they saw. On the other hand, these are the kinds of things that show up in the printers’ errata and are the occasions of their effusive and whimsical apologies about the errors of their trade. A digital surrogate should correct these cases and represent the printer’s intention, about which in the overwhelming number of cases there can be no doubt. On the bright side, these textual defects are an eloquent testimony to the conscientiousness of the transcribers.
The tools and workflow of SHC: AnnoLex
How did we do the work of collaborative curation and keep track of it? Explicit tokenization and linguistic annotation, performed by MorphAdorner, were the key steps. MorphAdorner can emit its results in different ways. One of them is a very verbose format in which each token occupies a row in a table, surrounded by left and right context, the next and previous words, lemmata, and POS tags, and its XPath or place in the hierarchy of its XML structure.
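A minimal sketch of what consuming such a verbose, one-token-per-row format might look like. The column layout below is invented for illustration; MorphAdorner’s actual output has its own columns and ordering:

```python
import csv
import io

# Hypothetical column layout for a one-token-per-row table: the real
# MorphAdorner verbose format also carries previous/next words and more.
FIELDS = ["xml_id", "left_context", "token", "right_context",
          "lemma", "pos", "xpath"]

# One invented row, tab-separated, standing in for a file on disk.
tsv = io.StringIO(
    "A07064-0420\tpearle richer then\tall\this tribe\tall\td\t"
    "/TEI/text/body/div[5]/sp[3]/l[12]/w[4]\n"
)

for row in csv.reader(tsv, delimiter="\t"):
    record = dict(zip(FIELDS, row))
    # Each token arrives with its ID, context, lemma, POS tag, and XPath.
    print(record["token"], record["lemma"], record["pos"], record["xpath"])
```

The point of the format is that every token is individually addressable: the `xml_id` and `xpath` columns let an annotation point back unambiguously into the source document.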
This format served as the basis for AnnoLex, a Django-based curation tool designed by Craig Berry. Building on data provided by MorphAdorner, AnnoLex models every curatorial act as an annotation that is kept in a separate file but is linked to its target through the explicit and unique IDs that are established by MorphAdorner and support the logging of the “who”, “what”, “when”, and “where” of every annotation.
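The stand-off annotation model described here can be sketched as a simple record type. The field names below are illustrative assumptions, not AnnoLex’s actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Sketch of a stand-off annotation: each curatorial act points at a
# token's unique ID and logs its "who, what, when, where".
@dataclass
class Annotation:
    target_id: str          # token xml:id assigned by MorphAdorner ("where")
    old_spelling: str       # what the transcription currently reads
    new_spelling: str       # the proposed correction ("what")
    annotator: str          # "who"
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))  # "when"
    status: str = "proposed"  # later "approved" or "rejected" by an editor

# An invented correction of a defective spelling.
fix = Annotation("A00456095690", "reg●rde", "regarde", "mburg")
print(fix.status)
# proposed
```

Because the annotation lives outside the text and carries the token’s ID, approving it and applying it to the source can be a purely mechanical step, as the MorphAdorner routine described below performs.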
AnnoLex derives its name from “lexical annotation,” but it may be easiest to think of it as a spell checker turned inside out. Instead of the machine correcting your spelling, you tell the machine to correct a spelling. A corrected spelling in the context of the EEBO-TCP transcriptions means an improved transcription of a word from an early modern text produced by a user who reconsiders and revises the existing transcription. A user with editorial privileges (who could be the same or a different user) can then approve or reject each correction, and finally, approved corrections are automatically applied to the source text via a MorphAdorner routine.
Clicking on the “Edit” button next to a search result populates an edit form in the lower left of the screen, where a revised transcription of the word instance may be suggested and saved. While it is often easy to infer the correct spelling from the immediate context, the edit form includes a button that, for members of subscribing institutions, will bring up the relevant double page image from EEBO. The image may be panned and zoomed such that up to two full pages of context are visible or a single character fills the entire screen (or anything in between), thus allowing a full reconsideration of the TCP transcription using all of the information available to the original transcriber. Even with quite mediocre page images the proper solution often becomes visible by zooming in on a word or line at just the right degree of magnification.
The TCP transcriptions do not record the line breaks of the printed page. Moving from a word in the transcription and finding its place on the printed page has a much higher lookup cost than with OCR-generated texts, where the lineation of the transcription will always be identical with the lineation of the original. This is a small but important point. The textual defects of the TCP transcription very rarely involve complicated or philologically exquisite choices. Was it a ‘Iudean’ or an ‘Indian’ that threw away that pearl richer than all his tribe? Mostly it is stuff like ‘●ortvnate’. The workflow of correcting such a defect divides into “find it, fix it, log it”. Deciding on the correct reading is typically quick and easy once you see it. Most of the time is taken up by finding it and recording it in a dependable manner. If you are dealing with a handful of corrections, you can ignore the time cost of “find it” and “log it”. If you have thousands or millions, reducing the time cost of finding and logging becomes the crucial design problem.
We still cite Plato by “Stephanus numbers” like “81a” or “371e”. The number refers to the page numbers of the 16th century edition by Stephanus. The letters “a” through “e” mark quintile locations of each page, a crude but effective and quite common “textual GPS” of Early Modern books. MorphAdorner does something similar. Every word token has a “facs” attribute that points to the approximate location of the word on the printed page. A facs attribute like “37b-03250” means “Look for word 325 on the right page of image 37”, which in practice means “towards the bottom of the right page.”
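A sketch of decoding such a “facs” pointer, assuming (from the example just given) that the digits before the page letter name the image, “a”/“b” mean left/right page, and the trailing digits encode the word number times ten:

```python
import re

# Decode a MorphAdorner-style "facs" value like "37b-03250".
# The word-number-times-ten reading is an inference from the example
# in the text ("37b-03250" -> word 325), not a documented rule.
FACS = re.compile(r"^(\d+)([ab])-(\d+)$")

def decode_facs(facs):
    m = FACS.match(facs)
    if m is None:
        raise ValueError(f"unrecognized facs value: {facs!r}")
    image, side, ordinal = m.groups()
    return {
        "image": int(image),
        "page": "right" if side == "b" else "left",
        "word": int(ordinal) // 10,
    }

print(decode_facs("37b-03250"))
# {'image': 37, 'page': 'right', 'word': 325}
```

Like a Stephanus number, the decoded value is only an approximate “textual GPS”, but it is enough to drop a reader onto the right region of the right page image.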
Extrapolating from SHC to all of EEBO-TCP
One professor and five students over the course of six months reduced the average rate of defects in the SHC plays from 54 to 15 per 10,000 words. The number of plays with fewer than ten defects per 10k increased from 96 to 309. For plays with a defect rate of > 30 the number fell from 229 to 69. The SHC corpus is far from perfect, but its texts definitely “suck less.” Most of the remaining defects require consultation of the printed original in a Rare Book Library. There are very few plays for which you cannot find a relevant copy in the Bodleian, Folger, Huntington, or Newberry Library. The current version of AnnoLex has the transcribed text and page image for any page with known defects. For many plays, an hour’s work (or less) with the original in any of those libraries will be enough to fix the remaining defects. If you fire up AnnoLex on the browser of your laptop and enter corrections there, they will quickly find their way into the corrected text via the MorphAdorner feature that takes input from AnnoLex and moves it into the text.
How many professors, students, and years would it take to clean up the remaining 64,000 EEBO-TCP texts? There is of course no clear mathematical answer to this “ditch-digging” story problem, but the question encourages some rough calculations. Worldwide, there are thousands of individuals whose professional lives are spent in scholarly neighbourhoods that depend on the EEBO-TCP texts as their primary sources. Because of the centrality of drama and Shakespeare to the English literary canon, the ratio of scholars to texts in that neighbourhood is very high. But it is also quite high in the social and political history of the English Civil War, for which the 22,000 Thomason tracts (an important part of EEBO) provide documentation unparalleled in its density and diversity. Most EEBO-TCP texts are quite short. Based on a census of 44,000 texts available to me, their median length is 6,300 words. The interquartile range is between 1,300 and 25,000 words, and only 10% of them approach or exceed “book length” in our sense (> 70,000 words). For the purposes of collaborative curation, the texts come in manageable chunks.
It is unlikely that a standalone tool like AnnoLex will be the solution to a broadly based clean-up of the EEBO-TCP corpus. The tool works very well for the correction of most of the defects that users complain about, but text correction is all it can do. I do not think that you can recruit enough people into a broadly-based clean-up with a tool that can only do one thing at a time. You need a platform that lets you read and explore EEBO texts but also has built into it the “find it, fix it, log it” functionalities of AnnoLex. Such a platform would support “curation en passant,” fixing defects as you come across them in your work, with minimal interruption to that work.
Consider the not entirely hypothetical case of a graduate student writing a dissertation on John Taylor, the “water poet.” The EEBO texts are important to her. More than once she has come across defects in the transcriptions. She is sure (and right) about the solutions. She could of course propose a solution in an email to Paul Schaffner at Michigan. But it takes too much time, and it is surprisingly difficult to describe that somewhere in the middle of the right page on image 45 there is this defect that should be corrected as follows. If, on the other hand, she can just click on the word and enter an emendation in a popup window, a click on the Save button of that window would log the transaction with all its “who, what, when, and where.” Depending on her user privileges, her emendation could move into the text with or without further review. The time cost of “find it” and “log it” are very close to zero in this scenario.
An environment of this kind supports the seamless weaving of collaborative curation into many scholarly and pedagogical workflows. If I were a seventeenth-century historian I would find it attractive to teach a seminar in which the students read mainly pamphlets from the Thomason tracts. Their final assignment would be a “Linked Data” exercise in which they would contextualize one or two pamphlets. It would not be hard to integrate a cleanup of the pamphlets into that assignment, and a review of their editorial work would be part of the grading I have to do anyhow. This is not unlike what Greg Crane’s second-year Greek students do. Students of Ancient Greek spend a lot of time parsing sentences. Greg Crane’s students enter their parse trees into a treebank of Greek sentences, which contributes to a “cultural genome” of Ancient Greek. Achilles hated the thought of being a “useless burden on the earth” (Iliad 18.104). Greg Crane’s students probably derive some pleasure from the fact that their work, however humble, contributes to a scholarly enterprise and is recognized by name.
Downsourcing correction to a machine
Can the correction of some textual defects be “downsourced” to a machine, using spellcheck methods of one kind or another? Modern spellcheckers have an easy time because modern spelling is very highly standardized. Not so Early Modern spelling. The spread of the printing press did indeed accelerate standardization. Orthographic variance diminished at an increasing rate almost from the beginning. The orthographic habits of a text from around 1630 differ less from modern spelling than they differ from a text of 1530. But there is no spellchecker for Early Modern English that can assume stable spellings.
That said, many incomplete spellings have only one possible completion, e.g. ‘vn●ortvnat’. ‘●ortvnat’ could be ‘fortunat’ or ‘Fortunat’, but nothing else. In the case of ‘lo●e’, a machine can with very high probability infer from the context whether the underlying lemma is ‘lobe’, ‘lone’, ‘lose’ or ‘love’. Whether the spelling is ‘love’ or ‘loue’ is a little harder, but if ‘loue’ occurs a lot in that text and ‘love’ never, the emendation ‘loue’ is almost certainly correct.
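This kind of completion lookup is easy to sketch: treat each ● as a single-character wildcard and match the defective spelling against a dictionary of attested spellings. The word list below is a toy stand-in for a corpus-wide spelling dictionary:

```python
import re

# Toy lexicon of attested spellings; a real one would be built from the
# whole corpus, with frequencies attached.
corpus_spellings = {"lobe", "lone", "lose", "loue", "love",
                    "fortvnat", "Fortvnat"}

def completions(defective, lexicon):
    """List every attested spelling that could fill the ● gaps."""
    pattern = re.compile("^" + re.escape(defective).replace("●", ".") + "$")
    return sorted(w for w in lexicon if pattern.match(w))

print(completions("lo●e", corpus_spellings))
# ['lobe', 'lone', 'lose', 'loue', 'love']
print(completions("●ortvnat", corpus_spellings))
# ['Fortvnat', 'fortvnat']
```

When the candidate list has length one, the machine can fill the gap on its own; when it is longer, context or frequency has to break the tie, and failing that the case goes to a human.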
Machines can be trained. Computer scientists speak of “mixed initiative” projects where iterations of machine processes and human review produce incrementally better results. The SHC mappings of almost 32,000 incomplete to complete spellings are one useful input for machine learning. But so are the ~6 million incomplete EEBO-TCP spellings in a context of some words before and after. It may be that between a quarter and a third of all textual defects in the EEBO-TCP texts will yield to algorithmic correction. Two million corrections would be a lot.
The philological conscience may bristle at the thought of turning the machine into a “decider”, but it may be soothed after looking at what this “decision” is really about. When the machine encounters something like
<w xml:id="A00456095690">reg●rde</w>
it looks up a dictionary of all the EEBO-TCP words that begin with ‘reg’ and after some interruption continue with ‘rd’. When it sees that the spelling ‘regarde’ occurs more than 10,000 times and the pattern ‘regard’ occurs more than 100,000 times it “decides” that ‘regarde’ is the right answer and presents its emendation in something like this form:
<w xml:id="A00456095690" type="machine100">reg●rde<reg>regarde</reg></w>
which is its way of saying “I am certain that ‘regarde’ is the right spelling.”
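The look-up-and-decide step just described can be sketched as follows. The frequency counts and the threshold are invented for illustration; only spellings that actually occur in the corpus count as candidates, and anything ambiguous or rare is left for a human curator:

```python
import re

# Invented corpus frequencies: 'reguard' never occurs, 'regaird' is rare.
spelling_counts = {"regarde": 10_312, "regaird": 214, "reguard": 0}

def decide(defective, counts, threshold=1000):
    """Return the one high-frequency spelling that fits the gap, or None."""
    pattern = re.compile("^" + re.escape(defective).replace("●", ".") + "$")
    candidates = [w for w, n in counts.items() if n > 0 and pattern.match(w)]
    if len(candidates) == 1 and counts[candidates[0]] >= threshold:
        return candidates[0]
    return None  # ambiguous or rare: leave for a human curator

word = "reg●rde"
choice = decide(word, spelling_counts)
if choice:
    # Emit the emendation in the element form shown above.
    print(f'<w xml:id="A00456095690" type="machine100">{word}'
          f'<reg>{choice}</reg></w>')
```

The “decision” is thus nothing more than a frequency-gated dictionary lookup, which is why its error rate can be driven very low for cases like this one.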
But what if the transcriber were wrong? Perhaps there were two letters missing, and the spelling could have been ‘reguard.’ Enter Keitoukeitos, the philological pedant in the Deipnosophists of Athenaeus (~200 CE), given this nickname because in his passion for the purity of Attic prose he asked of every word whether it “occurs or does not occur” (keitai ou keitai) in the Golden Age of Attic prose. Well, the sequence ‘reguard’ does not occur in EEBO-TCP. The sequence ‘regaird’ occurs in a number of texts, but never in text A00456 or in any other SHC text.
In his wonderful book On human conduct Michael Oakeshott distinguishes between “processes” and “procedures”. The former are “goings on” that do not exhibit human intelligence. The latter do. A dripping faucet or blinking eye are processes. A wink is a procedure. Was that a blink or wink? The distinction matters. Oakeshott thought that mistaking procedures for processes was the “categorial error” of much modern social science. You could think of machine learning as a way of turning procedures into processes. The benefits are clear. So are the risks. The machine can be instructed to perform what are in fact philological look-up operations and behave in certain ways if certain conditions are met. In the case of ‘reg●rde’ the odds of the machine process getting it wrong are vanishingly small. If there are a million or more cases where the odds are similarly low it is certainly worth taking advantage of the machine.
Dramatic metadata and algorithmic amenability
There is more to curation than fixing manifest errors, and I turn at last to the question of how to increase the query potential of digital surrogates by increasing their algorithmic amenability. Drama is a genre with very explicit and highly conventional metadata. Stage directions consist largely of what we would now call a “controlled vocabulary,” much of it in a “stage Latin”: moritur, manet, ambo, solus, omnes, exeunt. The practice dies out slowly in the seventeenth century, but exit has remained.
Because drama comes with its own metadata, the task of turning it into the “ordered hierarchy of content objects” that constitutes a TEI document is less problematical than with other texts. You can take your cue from the explicit metadata of the texts. There are some quite simple but powerful ways of leveraging the already existing metadata and making the texts a lot more manipulable. Words spoken by a character in a play are wrapped in <sp> elements. It is easy to count them: there are some 330,000 in the SHC corpus. It is not difficult to count the words in each speech and construct from the results a model in which a speech is represented by a bar whose length varies with the number of words. Simple as this model is, it may reveal patterns that are distinctive of subgenres or authors. You can complicate the model by adding the distinction between prose and verse. Certain kinds of stage rhythm begin to appear. If the patterns differ by author, genre, or time, that becomes a matter of interest. There are obvious pedagogical uses for such deliberately simplified models.
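The bar model is a few lines of code once the <sp> elements are explicit. A sketch using a toy TEI fragment (the markup and word counts are illustrative):

```python
import xml.etree.ElementTree as ET

# A toy TEI-style scene: speeches wrapped in <sp>, lines in <l>.
tei = """<div type="scene">
  <sp who="A"><l>Who's there?</l></sp>
  <sp who="B"><l>Nay, answer me: stand, and unfold yourself.</l></sp>
  <sp who="A"><l>Long live the king!</l></sp>
</div>"""

root = ET.fromstring(tei)
# One (speaker, word-count) pair per speech, in stage order.
bars = [(sp.get("who"), len("".join(sp.itertext()).split()))
        for sp in root.iter("sp")]
for who, n in bars:
    print(f"{who:2} {'█' * n}  ({n} words)")
# A  ██  (2 words)
# B  ███████  (7 words)
# A  ████  (4 words)
```

Run over a whole play, the sequence of bars is exactly the simplified “stage rhythm” model described above, ready to be refined with a prose/verse distinction.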
A meta cast list or census of characters and their relations
The TEI <sp> element has a “who attribute” that allows you to associate a speech with a particular speaker and tie it to a role in the cast list. You would not need such an attribute if the printed text of a play always printed the name of the speaker in the same way or in the same place. Which is notoriously not the case for Early Modern plays. In the SHC corpus I count 29,000 distinct speaker labels for some 12,500 distinct speakers, including first soldiers, second ladies, third courtiers, etc. With the help of Thomas Berger’s Index of Characters in Early Modern English Drama I succeeded in mapping all speaker labels to a unique “who attribute” at what I hope is a tolerable rate of error.
The benefits of that tedious mapping are considerable. You can now model a play in terms of who talks to whom, at what point in the play, and at what length. You still do not know what they say, but if you were in the surveillance business you might say that once you know who talks to whom, when, and at what length, you already know a lot. The pedagogical and dramaturgical uses of visualizations derived from these data are considerable. You can easily chart the different ways in which a play deploys its characters by charting the sequence in which they appear.
But things become much more interesting if you construct machine actionable cast lists in which a controlled vocabulary is used to classify a character in terms of sex, age, kinship relations, and social status. The aggregate of such interoperable cast lists can become an input for social network analysis. 550 plays from between 1550 and 1650 may not be a bad mirror of the “struggles and wishes” of that age. August Boeckh defined philology as die Erkenntnis des Erkannten or the further knowing of the already known. It is unlikely that completely new insights will emerge from algorithmically analyzed cast lists. Too many highly gifted scholars have spent lifetimes on Early Modern Drama for this to be a plausible hope. But we can look for new forms of corroborative evidence and for a more nuanced view of the history of the genre and changes in its ways of holding “the mirror up to nature”. Most importantly, the insights gained from these types of analysis will make it easier for young scholars to see the forest and decide on particular trees or groves that deserve a closer look. Many of the lesser plays in the canon of Early Modern Drama will repay a little more attention than they have received, both in their own right and in what they can tell us about their more famous siblings.
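Once speeches and scenes carry machine-actionable “who” attributes, the raw input for social network analysis is just co-occurrence counting. A sketch with an invented per-scene speaker list:

```python
from collections import Counter
from itertools import combinations

# Per-scene casts derived from "who" attributes; the play data is invented.
scenes = [
    ["barnardo", "francisco"],
    ["barnardo", "horatio", "marcellus"],
    ["claudius", "gertrude", "hamlet", "horatio"],
]

# Count how often each pair of characters shares a scene; these weighted
# edges are the standard input format for network-analysis tools.
edges = Counter()
for cast in scenes:
    for a, b in combinations(sorted(cast), 2):
        edges[(a, b)] += 1

for (a, b), n in edges.most_common(3):
    print(a, "--", b, n)
```

Combined with the controlled-vocabulary attributes (sex, age, kinship, status), edge lists like this one would let a network across plays be queried by character type rather than by name.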
I envisage the construction of this prosopography of Early Modern Drama as a crowdsourcing project, starting with undergraduates but opening it up to “citizen scholars” anywhere. The project will use a revised version of the AnnoLex curation tool. For each character there will be a template with some combination of structured and free-form data entry. Undergraduates should play a significant role in the construction of a controlled vocabulary, which should emerge from a lightly supervised version of a bottom-up “folksonomy”. A generation whose knowledge of genre is shaped by movies, TV shows, or video games may have fresh ideas about a taxonomy of 16th and 17th century plays.