The following is a republished blog post, originally published on October 1, 2009 on my now-defunct Literary Informatics blog. Some of the points made then are not quite true anymore (especially my lament about the lack of concern for quality in the Hathi Trust project), but many of them remain true enough.

Are the Text Creation Partnership texts good enough for research purposes? By ‘good enough’ I mean two things:

1. The texts must support traditional scholarship as practised in print environments
2. The texts must support the distinct affordances of the digital medium

My answer is that they are not yet good enough by either the first or the second criterion. In this blog entry I deal with the first criterion, but will write about the second in a later entry. The texts can and should be made good enough, but this will require some change of mind, habits, and priorities. So far the retro-digitization of texts as practised by American research libraries has been largely driven by considerations of quantity. Much less attention has been paid to quality issues or to the question of how to exploit the distinct affordances of texts in a digital medium.

The most striking evidence of a lack of concern with quality appears on the splash page of the Hathi Trust, the academic cousin of Google Books. There you find such words as ‘shared digital future’, ‘bold ideas’, ‘big plans’, ‘solution’, and a Herodotean delight in enumeration governs the side bar about the elephant in the library:

• 4,318,443 volumes
• 1,511,455,050 pages
• 161 terabytes
• 51 miles
• 3,509 tons
• 691,253 volumes (~16% of total) in the public domain

But neither on the splash page nor on the pages about Mission and Goals or Functional Objectives do you find the word ‘quality’ or any other words that speak to a sustained engagement with the question whether the texts that are being digitized are good enough for the scholarly purposes to which they might be put. ‘Good enough’ is of course a highly relative concept and varies with the nature of the texts and the scholarly inquiries associated with them.

The texts in the Text Creation Partnership are a small subset of all retro-digitized texts, and the project is based on the recognition that uncorrected optical character recognition (OCR) is typically not good enough for historically significant works that serve as the ‘primary’ texts or documentary infrastructure for scholarship in the humanities. But even in this project, where progress is measured in the hundreds or thousands rather than millions, a concern with getting lots done quickly has pushed aside questions about whether the texts are as good as they should be.

Let us take a look at the texts of 284 plays from the sixteenth and seventeenth centuries. I start with the 1587 edition of Tamburlaine. This is a milestone in the history of English drama, and if there is a class of texts deserving of special attention, this text would surely be among them.

There are three kinds of easily measurable errors in the TCP texts. The transcribers worked from microfilm images of highly variable quality. They were instructed to mark with ‘gap’ elements letters or words they could not read. The first type of error consists of words in which one or more letters were marked as illegible. The second kind of error consists of words, phrases, paragraphs or pages that could not be read or could not be transcribed for one reason or another. A third type of error consists of words that are wrongly split or joined. I counted these as I reviewed the morphosyntactic tagging of the EMD corpus for the Monk project. These counts are much less reliable and probably undercount by a factor of two or more.

In the two parts of Tamburlaine, there are 625 incomplete words, 267 occurrences of one or more missing words and 26 words that are wrongly split or joined. About 3% of the words are garbled in one way or another.
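Counts of the first two kinds are easy to obtain mechanically, because the transcribers recorded them as ‘gap’ elements. A minimal Python sketch of such a tally is given below; the filename is hypothetical, and the exact values of the reason attribute vary from file to file, so treat this as an illustration of the method rather than a faithful reproduction of my counting script.

```python
# A rough tally of transcriber-marked defects in a single TCP XML file.
# Element and attribute names follow TEI conventions; actual TCP files
# differ in namespace use and in the values of @reason and @extent.
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = '{http://www.tei-c.org/ns/1.0}'

def count_gaps(path):
    root = ET.parse(path).getroot()
    # Search with and without the TEI namespace, since both occur.
    gaps = root.findall('.//gap') + root.findall('.//' + TEI_NS + 'gap')
    return Counter(g.get('reason', 'unspecified') for g in gaps)

if __name__ == '__main__':
    counts = count_gaps('tamburlaine.xml')   # hypothetical filename
    for reason, n in counts.most_common():
        print(f'{reason:>20}  {n}')
    print(f'{"total":>20}  {sum(counts.values())}')
```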

In the work I am doing with these texts, I need to distinguish between words in verse or prose as well as between words that are spoken by the characters and ‘paratext’ or words that occur in stage directions, speaker labels and the like. These distinctions, which are part of the encoding project, are not observed with sufficient accuracy in this play. Thus there are several dozen lines of verse that are coded as prose, such as “I heare them come, shal we encounter them?” or “I must be pleasde perforce, wretched Zenocrate.” There are at least a dozen stage directions that are coded as if they were speech, such as “Manent Cosroe & Menaphon.” or “Souldan of Egipt with three or four Lords, Capolin Souldan.” I do not know many scholars of Early Modern Drama who would shrug off errors of this kind. The TCP digital version of Tamburlaine simply is not good enough for scholarly purposes.

[Added August 30, 2012: the story of Tamburlaine is both worse and better. Several pages had not been transcribed at all because the digital page images were missing, including the page with the famous line “Riding in triumph through Persepolis” (quoting from memory). This has been corrected by the indefatigable Paul Schaffner. It is a nice story of the use of texts leading to the discovery and correction of error.]

The source text for Tamburlaine does not count among the triumphs of Early Modern typesetting, and from my experience with TCP texts this particular file is way below average. But there are quite a few texts that are as bad or worse. By my computations the error rate for 284 plays goes all the way from a top of 8.8% for Wealth and Health, a text of 7,645 words from 1554, to 0.01% for 7,619 words of The Old Wives’ Tale from 1590. The median value is 0.5%.

There are good news/bad news stories you can tell about this. The good news is that there are 101 texts with an error rate of less than 0.2%. If Tamburlaine were in that group it would have ~60 rather than ~1,000 errors. On the other hand, there are 90 texts with an error rate of more than 1% and 40 texts with an error rate of 2% or more. A text in which one of every hundred words has something wrong with it is not a thing of beauty: it amounts to three or four errors per printed page. Texts with that error rate may support searching and other forms of analysis well enough, but they are simply too disfigured to be accepted by scholars as texts with any claim to reference quality.

There are things that can and should be done about this. The technical or budgetary issues are non-trivial, but they are not crippling. The key issue is a matter of will and priorities. Nothing will be done as long as libraries think that adding new texts is always more important than adding value to the texts they already have. This may have been true for the first two decades of retro-digitization, but it is becoming a more dubious proposition.

Secondly, libraries should think about involving their users in the task of making texts better over time. Greg Crane has raised the provocative question “What to do with a million books?” From one perspective this is a wonderful question and opens up all kinds of vistas. But the question has a flip side in its implication that a million books is an entity of such scale that no individual can do much with any part of it. ‘Error correction’ and ‘a million books’ do not easily coexist in the same sentence.

But think of a big university library with five million books or more as a big city of equivalent scale. Taxi drivers, couriers, and some other people criss-cross all of the city some of the time. But most people live in neighbourhoods most of the time, and the scale of the neighbourhood is measured in hundreds. Similarly, scholars live in the neighbourhoods of their disciplines, and interdisciplinary work typically consists of connecting with only a handful of neighbourhoods.

Scholarly neighbourhoods differ from the ‘real’ ones in one crucial regard. For instance, the scholars who make up the neighbourhood of Greek papyrology do not live next door to each other. Instead, they are scattered all over the globe, but they have their block parties and behave like neighbours in many important respects, not excluding their quarreling.

Replace the neighbourhood metaphor with the concept of a ‘data community’, and you have a model in which the ‘million’ books change into a network of overlapping relationships between limited sets of readers and limited sets of books. In five or seven years the full text collection of EEBO will contain some 50,000 items and will include at least one edition of most books that were published before 1700. The community of scholars who regularly make use of this collection is measured in the thousands, certainly not hundreds of thousands.

Now focus on the relation between those readers and those books. These readers will of course read other books, and the books will sometimes be used by other readers. But it still remains true that there is a privileged relationship between this clearly defined and relatively small user base and the equally well defined and relatively small data set of retro-digitized texts that will be a major part of the documentary infrastructure for scholarly work on Early Modern England.

Imagine scholars grumbling – as I am doing right now – about the quality of texts. The librarian’s answer should be a modified version of: “Fix it yourself.” The crucial modification lies in the fact that I may end up fixing some text myself, but I will do it in such a way that it will be fixed for others as well. Neighbourhood clean-up parties or ‘Adopt a highway mile’ programs are analogues from the everyday world. The community annotation projects common in genome research are analogues from science. But the basic point is the same. There are neighbourhoods with problems, whether litter on the roads, invasive species of weeds in the meadows, garbled words in texts, or missing parts of genomes. There are enough neighbours to fix the problem if they can get their act together. This is always the big ‘if’, but while it does not happen often, it is not as rare as unicorns.

The Greek papyrologists whom I mentioned earlier are a good example of a scholarly neighbourhood tackling this kind of problem. The Mellon Foundation has recently funded a second phase of the Integrating Digital Papyrology project, which is international, inter-institutional, highly collaborative, and dedicated to “increased vesting of data-control in the user community.”

“Increased vesting of data-control in the user community” is a brilliant phrase and helps us rethink the role of scholarly users in a world of digital data. Users as user-contributors can transcend the role of consumers and add value to collections, whether through simple acts of clean-up or more complex forms of annotation. The scholarly neighbourhood of Greek papyrologists is a very special case. Much of the scholarly work in that field consists of data curation, and in their daily work with data papyrologists might find it hard to say where data curation ends and data analysis begins. But while there are big differences in the ways in which Greek papyrologists and Early Modern scholars work with digital texts, some fundamentals are the same. If incremental improvement of data is not vested in the user community, it will not happen. The users are the people with the motive, the expertise, and the time. No individual has the time to fix a hundred, let alone a thousand texts, and scholars are unlikely to invest time in adding value to dozens of texts unless it bears fruit in their current project. But everybody has the time to fix one text or part of it provided there is a framework that makes it easy to flag errors and suggest corrections as you encounter them in your work with this or that text.

Does this happen? The answer is ‘yes’. The prefaces to second editions of books are full of acknowledgements of the ways in which reviews and other forms of reader response helped to root out error or add improvements. Readers have always been ‘user contributors’, but their contributions occurred in highly mediated form and in what the computer folks call ‘batch mode’. With something like the TCP archive, the libraries that hold it are keepers and publishers at once. Corrections can be suggested and incorporated immediately and incrementally. The technology for this exists, but the social model for it needs development and refining.

Last spring I taught a class on Early Modern drama in which I gave my students the option of cleaning up a text in lieu of writing a paper. Half a dozen students chose this option. The Monk Project provided me with tokenized and morphosyntactically tagged versions of Early Modern plays from the TCP collection. The students worked with a ‘verticalized’ form of their text in an Excel spreadsheet. For every word occurrence there was a data row in the spreadsheet. You read the text in a downward direction on the Y-axis. On the horizontal X-axis every word is not only surrounded by five words of context to its left and right, but there are columns for the lemma and POS tag, as well as columns for correcting the original spelling, the lemma, or the POS tag. The spreadsheet can be sorted and filtered in various ways so that, for instance, all 625 words with missing letters in Tamburlaine can be selected and listed in an order that simplifies their correction.
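A minimal sketch of how such a verticalized file might be produced is given below. The five-word context window and the correction columns follow the description above, but the column headings, the input format, and the sample POS tags are my own illustration, not the Monk project’s actual export specification.

```python
# Sketch of the 'verticalized' layout: one data row per word occurrence,
# with context, lemma, POS tag, and blank correction columns.
import csv

CONTEXT = 5  # words of context on either side

def verticalize(tokens, out_path):
    """tokens: a list of (spelling, lemma, pos) triples in text order."""
    spellings = [t[0] for t in tokens]
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['left context', 'spelling', 'right context',
                         'lemma', 'pos',
                         'corrected spelling', 'corrected lemma', 'corrected pos'])
        for i, (spelling, lemma, pos) in enumerate(tokens):
            left = ' '.join(spellings[max(0, i - CONTEXT):i])
            right = ' '.join(spellings[i + 1:i + 1 + CONTEXT])
            writer.writerow([left, spelling, right, lemma, pos, '', '', ''])

# e.g. verticalize([('shal', 'shall', 'md'), ('we', 'we', 'pns')], 'tamburlaine_rows.csv')
# (the tags here are invented for illustration)
```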

The results of my experiment were very illuminating. It became clear that competent undergraduates with an interest in the humanities can quickly pick up the skills required to do useful work. Two students wrote a perceptive account of their experience with data curation, which they charmingly described as becoming ‘fluent in Marlowe’ and which I have included as a separate blog entry.

If you can sort and filter errors in various ways, analysis becomes much easier. At least half the errors can be corrected with 100% certainty even without looking at the page image. Once you look at the page image, most of the remaining problems can be unambiguously resolved. A smallish number of cases – certainly less than 10% – raise problems that cannot be unambiguously resolved. It is more helpful to focus on the many easy and boring things that can be fixed than to be sidetracked by the interesting and difficult cases.

The same is true of tagging errors. Whether a given line is a stage direction or part of a speech by a character is a matter that can be unambiguously resolved in all cases. Whether a line is in verse or prose is not always easy to resolve. But even here, the number of cases with a clear answer is much larger than the number of ambiguous cases.

My students worked with spreadsheets in an off-line mode, and my goal in this simple experiment was to determine whether students could do this work and what they might learn from it. I am currently working with Craig Berry, a Spenser scholar with very deep programming skills, on developing a prototype of a network-based data curation tool. In this prototype, a MySQL database will hold the tabular representations of the texts, and a Django-based web interface will let authenticated users call up texts for review and enter suggestions for corrections.
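To make the shape of the prototype concrete, here is a sketch of what its two central tables might look like as Django models. The model and field names are my own illustration of the design described above, not the actual schema we are building.

```python
# Illustrative Django models for the curation prototype: one row per
# word occurrence, plus a table of user-contributed suggestions.
# Names and fields are a sketch, not the real schema.
from django.db import models
from django.contrib.auth.models import User

class Token(models.Model):
    """One row of the verticalized text: a single word occurrence."""
    text_id = models.CharField(max_length=32)       # e.g. a TCP identifier
    position = models.IntegerField()                 # running word number in the text
    spelling = models.CharField(max_length=64)
    lemma = models.CharField(max_length=64, blank=True)
    pos = models.CharField(max_length=16, blank=True)

class Suggestion(models.Model):
    """User A at time B flags location C in text D and proposes correction E."""
    token = models.ForeignKey(Token, on_delete=models.CASCADE)
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    created = models.DateTimeField(auto_now_add=True)
    field = models.CharField(max_length=16)          # 'spelling', 'lemma', or 'pos'
    proposed = models.CharField(max_length=64)
    accepted = models.BooleanField(default=False)    # set by editorial review
```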

The problem that remains to be solved is how to review such suggestions and incorporate them into the text. A user-contributed suggestion is automatically recorded in a data table as a row stating that User A at time B flagged location C in Text D as deficient and suggested correction E. The incorporation of the correction into the text requires some editorial review, which is likely to blend algorithms with human intervention. An error table in the aggregate is much more than the sum of its parts. The same suggestion made by more than one user can become part of a voting model. Users whose suggestions have a high rate of subsequent adoption can be given some editorial privileges. If an error occurs in one location, one can ask whether similar errors occur elsewhere. A corrected location can also be flagged as such, so that corrected spellings can be optionally highlighted in various ways. ‘Assliction’ (yes, I have seen this spelling in a TCP text) is either a transcriber’s or a typographer’s error, but it certainly is a spelling of the word ‘affliction’, and ‘long s’ errors in both directions are probably the most common type of orthographic error in the texts. If one cares enough (I’m not sure I do) one could distinguish between transcribing and typesetting errors through different forms of highlighting, just in case somebody wants to make an argument for the typesetter’s -ss spelling of ‘affliction’ as a subtle or subversive pun.

Not a very likely scenario, but the general point is that the most common and annoying forms of textual error lend themselves to hybrid models of algorithmic and human correction. These models rest on well-understood techniques of data manipulation in relational databases.
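As a small illustration of how the voting idea might be expressed, the sketch below aggregates the suggestion table from the earlier model sketch and surfaces corrections that several independent users have proposed. The threshold, the app name, and the decision to surface rather than auto-apply corrections are placeholders for whatever editorial policy one settles on.

```python
# Sketch of a simple voting rule over the Suggestion table sketched earlier:
# a correction becomes a candidate for adoption once enough independent
# users have proposed the identical change. The threshold is arbitrary.
from django.db.models import Count

from curation.models import Suggestion  # hypothetical app holding the models above

VOTE_THRESHOLD = 3

def candidates_for_adoption():
    """Return (token, field, proposed correction) groups with enough votes."""
    return (Suggestion.objects
            .filter(accepted=False)
            .values('token_id', 'field', 'proposed')
            .annotate(votes=Count('user', distinct=True))
            .filter(votes__gte=VOTE_THRESHOLD)
            .order_by('-votes'))
```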

In this blog entry I have sketched a plan for intermittent and incremental data curation by scholarly users of data that are fundamental to their work. Any implementation of such a project would be quite similar in its social, technical, and financial aspects to community annotation projects that are common in genome research. There is a speculative ‘build it and they will come’ aspect to this. In order to test the viability of the concept you have to build a model that is sufficiently sturdy and inviting for users to become contributors.