The following is an abridged and lightly edited version of a blog entry that I first posted in March 2010 on my now defunct Literary Informatics blog.

Here is a small but potentially promising experiment with a group of undergraduates in a Shakespeare class that I taught in the winter of 2010. Its subtitle was “From Words to Themes and Patterns.” The course revolved around the tensions between what used to be called “Lower” and “Higher” Criticism. However grand, general, or high the ideas, patterns, or themes in a play may be, “a text is built of words, and words alone,” as one of my students put it. In one of my assignments I asked the students to write five observations about individual words. Through WordHoard they had available to them a lemmatized and morphosyntactically tagged corpus of 320 Early Modern plays, including all of Shakespeare — enough data to map the distribution of words by genre, author, or corpus and to argue from low-level distributional patterns to high-level thematic ones.

I decided to give students the opportunity to substitute an exercise of collaborative data curation for a final paper, on the theory that it would not hurt them to rub their noses in the textual dirt of Early Modern print. 18 of 26 students chose this option. One of them wrote:

I chose to do the data curation because I thought it would take less time than a paper. That was absolutely not true. However, it did engage my brain in a way in which it hasn’t been engaged in a long time, and that is in the critical-analysis-problem-solving sort of way.

More about the student responses after a description of the exercise. The texts came from the Text Creation Partnership transcriptions of the Early English Books Online archive. These were ‘double keyboarded’ — that is to say, each text was transcribed twice from the digital page image of a microfilm image of a printed original. The transcriptions were collated automatically, and only divergences were checked for errors, on the theory that experienced typists will only rarely make the same mistake in the same place.

The TCP texts are expected to conform to a standard of 99.995% accuracy. If this claim were true, there would be little point in further data curation: a typical play of 20,000 words would on average have only one transcription error. The reality is, however, quite different. Digital page images of an indifferent microfilm of a hastily typeset and ink-blotched 16th-century text are not easily transcribed, and the 99.995% accuracy rate is seriously compromised by the thousands of cases where the transcribers fell back on a “can’t read this” defence by entering symbols for unreadable gaps ranging from a single letter (by far the most common) to whole paragraphs or pages. In some 280 play texts of approximately six million words, there are 60,000 gap events: on average one in a hundred words is incompletely transcribed. These gaps cluster: for instance in the 1590 edition of Tamburlaine, the error rate is on the order of 2.5%.

The good news about this is that the gaps are explicitly marked and can be extracted for treatment. This is where Annolex comes in, a prototype curation tool developed by Craig Berry. Its input is a ‘vertical’ representation of a tokenized, lemmatized, and morphosyntactically annotated XML file. Every word token from that file becomes a data row in a MySQL database and is associated with a unique ID, a lemma, a POS tag, the spelling of the word token, and sufficient context left and right to reach a decision in most cases about the spelling, lemma, or POS tag of the word in question.
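
To make that concrete, here is a minimal sketch in Python of what such a row-per-token representation might look like. It is not Annolex’s actual schema or code; the class, field, and function names are illustrative only.

```python
# Illustrative sketch only: flattening a tokenized, tagged play into 'vertical'
# rows, one per word token, each carrying an ID, spelling, lemma, POS tag, and
# a few words of left and right context for the curator to judge by.

from dataclasses import dataclass

@dataclass
class TokenRow:
    token_id: str   # unique ID, e.g. play identifier plus running token number
    spelling: str   # spelling as transcribed (may contain gap symbols like '●')
    lemma: str      # lemma assigned by the tagger
    pos: str        # part-of-speech tag assigned by the tagger
    left: str       # a few words of left context
    right: str      # a few words of right context

def make_rows(play_id, tokens, window=5):
    """tokens: list of (spelling, lemma, pos) triples in text order."""
    rows = []
    for i, (spelling, lemma, pos) in enumerate(tokens):
        left = " ".join(t[0] for t in tokens[max(0, i - window):i])
        right = " ".join(t[0] for t in tokens[i + 1:i + 1 + window])
        rows.append(TokenRow(f"{play_id}-{i:06d}", spelling, lemma, pos, left, right))
    return rows
```

The point of the context fields is that a curator can usually reach a decision about a spelling or tag without opening the full text at all.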

The input files came from the MONK project, where close to 700 TCP-EEBO files were linguistically annotated with Phil Burns’ Morphadorner. Through Annolex, my students encountered a text that had been heavily curated: it had been manually transcribed into an XML format, lemmatized, and morphosyntactically tagged, and it was presented to them in a format that focused their attention on residual errors from these earlier curatorial stages. The students were asked to consult the EEBO site and look at the digital facsimile if there was any doubt about the nature of an incomplete spelling. In its prototype form, Annolex does not permit any automatic line-up of the transcription with page images. The students printed out PDF files from the EEBO page images, which worked well enough.

I chose eight plays that I knew had a lot of missing letters or words:

  1. The Arraignment of Paris by George Peele (1581)
  2. Tamburlaine by Christopher Marlowe (1590)
  3. Sir Thomas Wyatt by Thomas Dekker (1602)
  4. Father’s Own Son by John Fletcher (1615)
  5. The Old Law by Philip Massinger (1618)
  6. The Courageous Turk by Thomas Goffe (1619)
  7. The Beggar’s Bush by John Fletcher (1622)
  8. Love’s Mistress by Thomas Heywood (1634)

I assigned two, and in some cases three, students to each play, asking them to attend first to transcriptional errors and then to lemmatization and POS tags, in that order. How well did they do? You can answer the question in terms of what they learned or in terms of the quality and quantity of work they produced. I asked the students to write a brief essay on their work. In an earlier and technically more primitive version of this experiment some students had written very eloquently about the experience of encountering a text not in the familiar environment of a modern book, but surrounded by the uncertainties of early modern print and the fragmentations of digitally remediated text snippets.

I had hoped students would be equally thoughtful on this occasion, and I was not disappointed. If you ask for reflections as part of an assignment before the final grades are due, you don’t expect students to tell you that this was the dumbest thing they ever did. They are more likely to put a nice face on it. But not always. For one student the project was ‘horrible’ but required no higher-level thinking, and he was able to do it while watching the NCAA basketball tournament and chatting with his friends without any slowdown to his work. On the other hand, he made a lot of mistakes.

The other responses seemed thoughtful and appreciative of the experience. At my cynical worst, I must give my students credit for knowing what to say on such an occasion and for saying it with intelligence and flair, but I prefer to believe that they meant at least some of what they said. One student’s comments begin with the lapidary sentence “Doing this curatorial work has taught me, more than anything, the importance of it.” Here are some other general comments:

I felt like a digital new-age monk or something, having contact with the original texts, or the digital copies of them, and putting them in a database that would be widely accessible and searchable. It was cool to be part of a project so grand.

All in all, to edit a 17th century text was a most thought-provoking and interesting experience. Though it took exponentially longer than writing a paper, it generally makes me value the text I daily have readily available.

Trying to discover missing letters, words, and spans of words amidst the play became a kind of puzzle, and it was rather satisfying when a piece actually fit, a phrase could be completed, and clear meaning finally emerged.

Overall I thought that this project was incredibly interesting and worthwhile. Data curation and error correction are aspects of literature I have never learned about before or really given much thought to. I feel like I have a new appreciation for all the hard work transcribers and monks of the past had to endure.

I found the data curation project to be an excellent (and unorthodox) substitute for the typical final paper. Working with the text of John Fletcher’s The Beggar’s Bush gave me a great sense of productivity, and examining the text was an interesting study in the evolution of the English language. Learning some of the ins and outs of a different computer program is never a bad idea.

Overall, although to me working in AnnoLex was a long and tedious experience, it was also rewarding. I can now say I aided in restoring a 400-year-old text, which I am almost positive none of my friends have done before. I felt like I was actually a crucial part of saving and restoring an old text.

Overall, I enjoyed word corrections. I learned about the development of language, and technology in English literature. But I think the amount of enjoyment and education anyone gets out of a data correction depends on their personality; I know people who would find it tedious. Ultimately, the project was worthwhile for me, and I am glad I got the chance to use Annolex and EEBO.

So I’d say that the biggest thing that doing data curation with Annolex has offered me is some perspective. Ironically, it’s programs like this that lead us to feel like computers perform most tasks, when I’m now really struck by how much manpower goes into getting such an application right. And I have to admit that at times I found the task rather tedious. This is not to say that it wasn’t largely intriguing, or useful to see how it works – just that it certainly wasn’t what I’d call “fun.”

Another change in mindset that this task, and really this class as a whole, has led me to is that a text is built of words, and words alone. It’s not that this is a startling revelation, but it’s easy to let that fact slip from consideration, and we forget that each word has its own important place in a text, and its frequency, associations, and placement are all loaded with meaning. The class as a whole, as I said, has focused on this point, but ultimately it was data curation, and the fact that it compelled me to work through a text and scrutinize every single word as I went, that really drove this point home.

There was more than the occasional comment about the pleasure of being useful. This has been a recurring theme in comments by the Australian newspaper crowdsourcers, although in that context the interweaving of local and family histories has been a powerful and acknowledged motivator. Here are some comments along those lines from my students:

I enjoyed contributing to the collection of Early Modern English knowledge, and will probably continue to look at Tamburlaine and other plays while I’m at home over spring break.

I felt like I was contributing to the “greater good”.

However, I feel that the progress that I did make, chipping away one missing letter or word at a time, was fairly substantial, and will make the reading and analysis of The Araygnement of Paris easier and more accurate from here on out.

The process was long and tedious at times, but overall I enjoyed the edits I made; the fact that my work improved the quality and accessibility of the play is certainly a point to be proud about.

Quite a few comments testified to the insights gained when confronting the words on an Early Modern page in a task-oriented mode and learning about historical difference and change at the microlevel of orthography, grammar, and lexical meaning:

I was uncovering new meanings from the current dictionary, discovering past words, and basically learning about the English language while reading a story.

For example, I learned that the “s” at the beginning of a word often resembles an “f” in early printed versions of texts, and that the possessive is almost never distinguished with an apostrophe, but looks exactly the same as the plural form of a given noun. These distinctions are helpful to me not only in literary analysis but also as a theater major who studies the First Folio of Shakespeare often in researching a scene from one of Shakespeare’s plays.

Working with the PDF scans of early modern text led me, each time I squinted and leaned in to my computer screen in order to try to make out a hard-to-read hand-written letter, to think about the steps involved in getting to an easy-to-read electronic version.

It was surprising to see just how many errors there were in the text I was examining. Many of those mistakes could not be corrected simply through common sense reasoning, as there were often numerous potential ways in which a misspelled word could be fixed… The images proved to be invaluable in my work on The Courageous Turke.

Some pages seemed to be clearer, with better spacing and lines of words, while others were ink blotted and wrinkled. I thought that more than one person printed this document, but perhaps it is a miracle that this document even survived long enough to be photographed for microfilm.

I was struck by the number of missing letters that seemed like they could be corrected without looking at the original source. I asked myself why the original transcriber had not initially been able to figure out the missing letter. In many cases, it was the fault of the original copy. The letters were too smudged, blurry, or otherwise ink-stained to clearly discern what the missing letter was, or, in a few cases, the right letters were printed, but were upside down in their place (likely an error of the original printing press on which the letters were arranged).

Although there were some missing letters or words that I was unable to distinguish and fix, the majority of the errors that I found were fixable with some consideration, investigation and squinting.

After I got used to scanning the text, I could start to determine the spelling of even very smudged words. I began to get a sense for it. This taught me how quickly this language can change and gave me some insight about how it looked and felt before. … For me this is always a thrilling thing to think about—the origin of words and the changes they undergo—because I have been trained as a writer and I think writers ought to be keenly aware of tools they are working with.

Doing this work also acquainted me with a problem inherent in studying early English texts. While correcting, I ran into several cases in which my reading of the manuscript differed from the original transcriber’s. I corrected them, but I had to ask myself, “Who’s to say that I’m right?” … Transcribing and correcting old, ink-blotted texts requires a certain amount of human interpretation and agency. …There will be words I think of as going together. There will be a syntactic logic that feels right to me. But these “gut feelings” are not the same for everybody.

One student even waxed nostalgic about the long ‘s’: “Also, I think we should bring back the Old English use of ‘ƒ’ as ‘s’. I am starting to grow accustomed to it.” For me, the distinction between terminal and non-terminal ‘s’ has always been the height of scribal folly, whether in older English or in ancient Greek, but to judge from the TCP practice, this student is not alone in her preference, and a dear and very sensible colleague of mine detests the lunate sigma that is now the common and single form of ‘s’ in some Greek texts. So there.

From reflective comments like these I conclude that the data curation experiment was a pedagogical success. It worked at least as well as a paper assignment to challenge the students’ intelligence, imagination, and ingenuity. It probably did a better job than most assignments of making them think a little about textual integrity and why it might matter. More specifically, if it is a valuable pedagogical goal to get students to reflect on the relationship between low-level verbal phenomena and high-level interpretive or integrative activities, there is much to be said for encouraging more use of a digitally oriented “lower criticism.”

How useful was the students’ work? Many of their corrections found their way into the current version of Early Modern Drama on WordHoard. Posed in more general terms, the answer to the question depends on a lot of variables. My general conclusion is that students at Northwestern can with relatively light training and supervision be taught to perform many curatorial functions that will add value to texts for scholarly use. But course-based assignments, which live in the endemically ‘last minute’ world of midterms, quizzes, finals, and the like, are probably not the best institutional environment for producing reliable work at an affordable time cost. Summer internships, work-study jobs, or more sustained and complex projects like honors essays may be better ways of engaging students in this kind of work. More about this at the end of this blog.

For any doubtful word occurrence, the students performed an editorial act that consisted of either doing nothing or making a change in any or all of the following:

  1. the spelling
  2. the lemma
  3. the part-of-speech tag

The editorial act would be triggered by clicking on the ‘edit’ button of the Annolex screen, would consist of entering the appropriate changes in the Spelling, Lemma, and POS boxes of the Edit Word box, and would be confirmed by ‘saving’ the changes as an ‘update’. Once ‘saved’, the editorial act creates a new row in an error table containing the old and new values for the word occurrence that has been edited, as well as information about who made the change and when. This error table is subject to review by an editor with privileges for approving editorial changes and applying them to the main data table.
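
To make the workflow concrete, here is a rough sketch of that two-stage process: a proposed change is recorded with its old and new values, and an editor later approves or rejects it before it touches the main data table. The table layout and names are my own illustration, not Annolex’s actual MySQL schema.

```python
# Illustrative sketch only: a miniature 'error table' workflow in SQLite that
# mirrors the steps described above (propose a change, then have an editor
# review it). Annolex's real schema and code are different.

import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE corrections (
        token_id  TEXT,   -- unique ID of the word occurrence
        field     TEXT,   -- 'spelling', 'lemma', or 'pos'
        old_value TEXT,
        new_value TEXT,
        curator   TEXT,
        saved_at  TEXT,
        status    TEXT    -- 'pending', 'approved', or 'rejected'
    )
""")

def propose(token_id, field, old_value, new_value, curator):
    """Record a curator's proposed change; it stays 'pending' until reviewed."""
    conn.execute(
        "INSERT INTO corrections VALUES (?, ?, ?, ?, ?, ?, 'pending')",
        (token_id, field, old_value, new_value, curator,
         datetime.now(timezone.utc).isoformat()),
    )

def review(token_id, field, approve):
    """An editor approves or rejects a pending change."""
    conn.execute(
        "UPDATE corrections SET status = ? WHERE token_id = ? AND field = ?",
        ("approved" if approve else "rejected", token_id, field),
    )
```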

From error detection to final error correction, there is thus a chain of many steps, not all of them enumerated in this scenario. Some of them are algorithmic. Their cost can be calculated with some precision. The initial cost of an algorithm is likely to be high. The marginal cost of running it another thousand or million times is likely to be low or even close to zero.

Some steps involve human judgment, and their cost is much harder to calculate. In the context of curatorial acts, a human judgment may be divided into three components:

  1. Getting there
  2. Doing it
  3. Recording the judgment and moving on

While the first and third parts are amenable to algorithmic support, the second is not, but the total time cost of the human judgment is critically affected by the ways in which it is embedded in algorithmically supported versions of before and after. The critical second step needs to be subject to some review, which is itself a human judgment and as such subject to the same rules. Many kinds of decisions may be adjudicated as effectively by ‘democratic’ voting as by ‘oligarchic’ or ‘monarchical’ review. Which way to go is largely a pragmatic decision.

I looked at approximately 5,000 individual student judgments and came to the following conclusions:

  1. Most of the students who did this project were quite serious about it.
  2. Nearly all of them would have benefited from a structure that would have forced them to spread it out over time rather than do it at the end.
  3. All of the students picked up very quickly on the tasks of identifying incomplete or missing spellings accurately and distinguishing with confidence between easy and hard cases.
  4. All of the students were good at finding the modern lemma, although errors of negligence (most of them algorithmically fixable) were common.
  5. Most students had trouble with getting POS tags right and were quite explicit about it in their comments.

Some practical conclusions follow from this. Three quarters or more of the incomplete transcriptions in the Early Modern Drama texts (which in this regard are typical of TCP texts) are words with one or two missing characters. About two thirds of these, or about half of all errors, can be corrected without reference to the page image by an automatic ‘vote’ of two judges who have been instructed to make judgments only when they are confident.
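
The voting rule itself is trivial; here is a toy sketch of it, assuming each judge either supplies a confident completion or abstains. The function name and interface are my own invention, not part of any existing tool.

```python
# Toy sketch of the 'two confident judges' rule described above: a completion
# of an incomplete spelling is accepted automatically only when both judges
# offered one (i.e. felt confident) and their answers agree.

from typing import Optional

def auto_accept(judge_a: Optional[str], judge_b: Optional[str]) -> Optional[str]:
    """Each argument is a judge's proposed completion, or None if the judge abstained."""
    if judge_a is not None and judge_a == judge_b:
        return judge_a    # confident agreement: accept without checking the page image
    return None           # abstention or disagreement: send to the page image

# auto_accept("do", "do") -> "do";  auto_accept("do", "to") -> None
```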

The time cost of consulting EEBO page images is intrinsically high because the digital transcriptions do not follow the lineation of the printed source. For any given word, human readers first have to answer the questions “what line is this word on?” and “where is it on the line?” before being able to compare the transcription with the original. So while the students’ encounter with the messy details of Early Modern printed pages is of great pedagogical value, from the perspective of getting things done you want to correct as many errors as possible without looking at the source pages, and if you make one new error for every ten you fix, you are still way ahead.

If students are pretty good at recognizing what errors can be fixed without reference to the original and at fixing them, the practical value of their work may consist more in providing training data than in fixing data directly. Consider as an example the common error of a word that is transcribed as ‘●o’, a missing letter followed by ‘o’. In an English text, this can stand for ‘do’, ‘no’, ‘so’ or ‘to’. If you manually fix the ~200 occurrences of this sequence in the Early Modern plays, and keep the trigrams with the preceding and following words, you can probably provide accurate corrections for 90% of the 10,000 to 20,000 occurrences of that error in the remainder of the TCP corpus.
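
Here is a sketch of that idea: harvest the trigram contexts of the manually corrected cases and use them to propose corrections for the uncorrected remainder. The code and names are purely illustrative; I am not describing an existing pipeline.

```python
# Illustrative sketch: turn manual fixes of '●o' into a context model that can
# propose corrections elsewhere in the corpus. Not an existing tool or pipeline.

from collections import Counter, defaultdict

def learn_corrections(corrected):
    """corrected: list of (previous_word, next_word, fixed_spelling) from manual work."""
    votes = defaultdict(Counter)
    for prev, nxt, fixed in corrected:
        votes[(prev.lower(), nxt.lower())][fixed] += 1
    # keep the most frequent correction for each (previous, next) context
    return {ctx: counts.most_common(1)[0][0] for ctx, counts in votes.items()}

def suggest(model, prev, nxt):
    """Propose a completion for an unseen '●o', or None if the context is unknown."""
    return model.get((prev.lower(), nxt.lower()))

model = learn_corrections([("I", "not", "do"), ("went", "the", "to"), ("I", "not", "do")])
print(suggest(model, "I", "not"))   # -> 'do'
```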

Assigning POS tags was predictably the most challenging part of the enterprise, and I now think I went about it in the wrong way. The tag set used in the EMD corpus has ~250 tags, although only three dozen are used with any regularity. But even three dozen tags is a lot to remember and manage if you encounter POS phenomena in a more or less random order, which is the case when the order of correction is determined by incomplete spellings.

If you want to correct POS tags at all (and it is not clear whether there really is much benefit in reducing the error rate of a good automatic tagger like Morphadorner), it is better to use human judgment on sharply focused operations that target known errors and seek to improve training data. For instance, Morphadorner is likely to make mistakes when it comes to distinguishing between the superlative form of an adjective (nicest) and the 2nd person singular of a verb (thinkest). If you select all instances in which Morphadorner assigned the ‘vv2’ tag and ask curators to pick out all cases that are not of the type ‘thou thinkest’, they will on the first page of the Annolex result list encounter ‘your smoothest face’ and ‘his tendrest thoughts’. It takes no knowledge of POS tags to recognize that these are not verb forms. Now you reverse the query and look for all instances where a word tagged as ‘js’ is not a superlative form.
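
As a concrete illustration of such a sharply focused query, here is a sketch that pulls every ‘vv2’-tagged token and floats to the top the cases that do not follow ‘thou’ or ‘you’, since those are the likely mistagged superlatives. The row format and function are my own invention, not Annolex’s query interface.

```python
# Illustrative sketch of a targeted curation query: surface 'vv2'-tagged words
# that do not follow 'thou' or 'you', i.e. the likely mistagged superlatives
# ('your smoothest face'). Field names are invented for this example.

def suspicious_vv2(rows):
    """rows: iterable of dicts with 'spelling', 'pos', and 'left' (left context) keys."""
    hits = []
    for row in rows:
        if row["pos"] != "vv2":
            continue
        left_words = row["left"].split()
        prev = left_words[-1].lower() if left_words else ""
        if prev not in ("thou", "you"):
            hits.append(row)    # queue for curator review as a probable superlative
    return hits

sample = [
    {"spelling": "thinkest", "pos": "vv2", "left": "what thou"},
    {"spelling": "smoothest", "pos": "vv2", "left": "your"},
]
print([r["spelling"] for r in suspicious_vv2(sample)])   # -> ['smoothest']
```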

A more consequential or interesting error involves the possessive case. A few years ago Martin Wattenberg demonstrated a fascinating visualization in which he showed how much you learn about a text corpus if you can see the property relations implicit in phrases of the type “the king’s daughter.” Because in early modern English the possessive case is rarely marked by an apostrophe, taggers make many mistakes. But in any text corpus that supports stylistic inquiries you will want to reduce the error rate for that important distinction. Again it is quite easy to frame a question in such a way that it draws on the curator’s tacit knowledge of English rather than an explicit knowledge of POS tags.
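
One way to frame that question for curators who know English but not POS tags is simply to show them every pair of adjacent nouns where the first ends in ‘s’ without an apostrophe (‘the kings daughter’) and ask whether the first noun is an owner. Here is a sketch of such a filter, with tag conventions invented for the example.

```python
# Illustrative sketch: find candidate unmarked possessives ('the kings daughter')
# by looking for two adjacent nouns where the first ends in 's' and has no
# apostrophe. Tag conventions here are invented for the example.

def possessive_candidates(tokens):
    """tokens: list of (spelling, pos) pairs in text order; noun tags start with 'n'."""
    hits = []
    for (w1, p1), (w2, p2) in zip(tokens, tokens[1:]):
        if p1.startswith("n") and p2.startswith("n") and w1.endswith("s") and "'" not in w1:
            hits.append((w1, w2))   # e.g. ('kings', 'daughter'): ask "is the first noun an owner?"
    return hits

print(possessive_candidates([("the", "dt"), ("kings", "n2"), ("daughter", "n1")]))
# -> [('kings', 'daughter')]
```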

It is probably the case that between one and two dozen targeted distinctions of this kind can be used to involve human judgment in the improvement of training data to a point at which automatic tagging will produce results with acceptable error rates for most purposes. The trick will be to perfect collaborative curation tools like Annolex to a point at which their workflows let humans use their precious time to exercise judgments rather than ‘getting there’ or recording their decisions.

Much of this blog has been concerned with data curation of a type that employs a radical division of labour and divides tasks into atomic acts that can be performed intermittently and independently of each other, taking to new and unheard-of levels the ancient proverb “many hands make light work.” But there are also more holistic versions of collaboration with the potential for more discretion and integration. In reading through the student comments I was particularly struck by the following:

But with this project came a sense of proportion: I often told myself that making a mistake in a Dekker manuscript was considerably less offensive than making an error in Hamlet. …On a related note, looking through this play has taught me how good Shakespeare really is. This is something I knew already, that Shakespeare is good, but up till now I’ve never read work by one of Shakespeare’s contemporary playwrights. Dekker’s play seems, next to Shakespeare, to be all plot, some clownishness, and zero artistry. So correcting this manuscript has taught me, finally, a more real appreciation of the topic of this course.

This is right, but if this student were not already graduating this spring, I would be inclined to say something like this to her: “Dekker’s Wyatt is one of some 400 plays that have survived from the age of Shakespeare broadly defined. It may not be wonderful, but it tells us something. How would you like to spend part of your senior year working this digital text up into a model of a digital ‘interedition’ in the hope that other students here and elsewhere will follow and that a decade from now there will be an interoperable archive of consistently edited and annotated texts of all or most plays from the age of Shakespeare? Such an archive may well confirm our judgment of how much better Shakespeare is, but it may also give us a better understanding of how deeply his plays are embedded in the theatrical practices of his day.”

From what I know about the quality of honors work at Northwestern, I have no doubt that talented and interested seniors can create editions that meet scholarly standards, and what is true at Northwestern is true of hundreds of other colleges and universities. I note with interest that the EEBO Introductions Series is an effort to engage younger scholars in adding value to particular EEBO texts through analytical materials of various kinds. This is clearly a good thing. At the same time it seems to me that at the current moment we should not lose sight of how much can and should be added to digital texts by extending their query potential and interoperability. Much useful, interesting, and challenging work remains to be done at the level of the texts themselves, and much of that work can be done by harnessing the intelligence and enthusiasm of future scholars while they are still students.