Introduction and Summary
This is a report about an experiment with ~4,000 texts from the Text Creation Partnership (TCP). It is more in the spirit of concept cars than production models. There may also be an aspect of changing from 5.25-inch to 3.5-inch floppy disks. The TCP texts are a critical component of the documentary cyberinfrastructure for Early Modern Studies: today’s graduate students in English or History do much of their reading on the digital equivalents of Hamlet’s “table of memory”. I have tried to write the following pages in a manner that should be intelligible to students who are users of these texts but have no special interest in or knowledge of their technical underpinnings.
The three TCP corpora (EEBO, ECCO, and Evans) add up to a diachronic corpus of printed English from 1473-1800. Coverage is very dense for England before 1700: for most purposes, EEBO-TCP can be thought of as a deduplicated library of English printed books before 1700. The ~5,000 Evans texts cover North America from 1639-1800 and represent about 15% of all imprints. The 2,000 volumes of ECCO are a cherry-picked anthology of 18th century texts, covering no more than 1% of imprints. But 2,000 books are still a lot of books, and while not a random sample, this multi-million-word corpus adds valuable evidence for many corpus-wide inquiries that look for changing uses of English words on different continents across more than three centuries.
The results of this experiment are available on GitHub at https://github.com/martinmueller39/TCP2ESTC. It is limited to EEBO-TCP texts and focuses on public domain texts from five decades at forty-year intervals: 1500’s, 1540’s, 1580’s, 1620’s, 1660’s. The numbers in each bin (35, 205, 509, 955, 2458) tell a clear story of their own if you assume (not unreasonably) that their proportions reflect the proportions of all printed texts that have survived. I recall Paul Schaffner telling me some years ago that in selecting texts for transcription the project favoured encoding a higher percentage of texts from the early decades where there were far fewer texts to begin with.
The experiment is based on explicitly tokenized and linguistically annotated versions of TCP texts and does three things:
- It moves the texts into a file structure organized by decades and creates filenames and XML id’s that explicitly assign each text and word in each text to a decade.
- It explicitly aligns each text with the English Short Title Catalogue (ESTC) by adding the ESTC call number to the filename.
- It incorporates machine-generated corrections of incompletely transcribed words into the text and explicitly flags these corrections as machine-generated and subject to human review and approval or correction.
A decade-based file structure
I think of a TCP text as a digital surrogate of a printed text for which the English Short Title Catalogue (ESTC) does or should offer the most authoritative summary description. I also think that the diachronic aspect of the corpus deserves special attention. All users, but especially novice users, benefit from an articulation of the corpus that goes out of its way to make temporal difference visible. Consider the following two sentences, chosen pretty much at random from texts printed in 1507 and 1700:
For wyte thou wel a bodely tournynge to god without the hert folowynge. is but a fygure & a lyknes of vertues.
Dear Mother I have ever been a bashfull Lad, but now the Lord hath loosened my Tongue, and now can I sing praises to his Holy name:
Modern readers will have no difficulties with the second sentence, but the first requires adjustments, and most modern readers would find it difficult to read any text of some length from that stage of Early Modern English. You do users a great favour by being very explicit about the degree to which a file is closer to the former or the latter.
The filenames for EEBO TCP texts begin with ‘A’ or ‘B’ followed by 5 digits from A00001 to B44359. There is no difference between ‘A’ and ‘B’ texts, and the sequence of numbers is meaningless, although in practice files with lower numbers were probably transcribed earlier. I understand the argument that random numbers are preferable or necessary in some environments. Machines don’t care, but humans are meaning-seeking animals, and a rough but firm grasp of spatial or temporal dimensions is a prime requisite for making sense of anything. When characters in Greek or Roman tragedies awake from madness their typical first words are “Where am I?” (ubi sum). Right in the middle of the speedometer of my car I see in big letters ‘NW’ or ‘SE’, telling me which direction I am going. That is very helpful, especially at night, and easier to read than a compass needle that points North, leaving me with the task of figuring out where I am going relative to the position of the needle. In a diachronic corpus, clear signs about “when” offer the most useful basic orientation. It is true that you can always look up the date in the document’s metadata, but the user saves a little time when a necessary unique identifier performs an immediately visible service of temporal orientation. Add up the time saved by all the avoided look-ups, and it amounts to real savings.
Library catalogues may have processing numbers that are random or based on a numerus currens, but call numbers exposed to users typically are meaningful and often include spatial or temporal coordinates. This is true of both the Library of Congress and Dewey systems.
Are the filenames of TCP texts more like processing numbers or more like call numbers? I can think of honours projects by undergraduates, doctoral projects by graduate students, and book projects by faculty, where a researcher assembles dozens, hundreds, perhaps even thousands of texts that for months or years form a working library. In such environments, it is nearly always helpful if the sequence of filenames provides a basic temporal orientation. It is true that you can use ‘aliases’ or ‘symbolic links’ to transform a random sequence of files into multiple meaningful articulations. But creating and maintaining such an order is not entirely trivial.
The vast majority of TCP files can be assigned with very high confidence to a decade of publication. I can see many upsides and no downsides to giving a file a name that assigns it to a particular decade. There are many projects for which such a system provides sufficient articulation. For projects that are organized by subject or place, temporal order will often be a useful second criterion.
For these reasons, I use decade-based filenames in this experiment, and I have chosen five decades at 40-year intervals to demonstrate the shape that emerges from this simple procedure. Dates consist of numbers, but some forms of unique identifiers, including XML id’s, cannot have a digit as their first character. In the annotated versions of TCP texts, the texts are “tokenized”. Each word is wrapped in a <w> element and assigned a hierarchical id that identifies the position of the word on a particular page of its text, e.g.
<w xml:id="A13172-117-a-2480">wa●</w>
which means “word 248 of the left part of image 117 of the microfilm from which A13172 was transcribed”. (You cannot use the page numbers of Early Modern books because they may not be unique or may not be there in the first place).
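The reading of such an id can be sketched in a few lines of Python. This is only an illustration: the regular expression is my assumption about the pattern, inferred from the single example above, and the names `WordId` and `parse_word_id` are hypothetical. The word counter is kept exactly as written, since how the final number maps to a word position is not spelled out here.

```python
import re
from typing import NamedTuple


class WordId(NamedTuple):
    text_id: str  # e.g. "A13172"
    image: int    # number of the microfilm image
    side: str     # "a" is the left part of the image
    counter: str  # word counter, kept exactly as written in the id


# Assumed pattern: text id, image number, one-letter side, numeric counter.
WORD_ID = re.compile(r"^([A-Za-z][A-Za-z0-9]*)-(\d+)-([a-z])-(\d+)$")


def parse_word_id(xml_id: str) -> WordId:
    """Split a hierarchical word id into its four parts."""
    match = WORD_ID.match(xml_id)
    if match is None:
        raise ValueError(f"unexpected word id: {xml_id!r}")
    text_id, image, side, counter = match.groups()
    return WordId(text_id, int(image), side, counter)


print(parse_word_id("A13172-117-a-2480"))
```

Because the text id is simply the first segment, the same function would work unchanged if the file were renamed.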
A reliable year-by-year order of dates would be very difficult to achieve and would not add much value beyond an identification by decade. Assigning a text to the decade of its publication is easy and can be done with very few errors. Most of them will be off by just one decade and will not compromise the basic order. Once you have a decade prefix, a simple and good enough solution will add a random three-letter suffix to the prefix, which gives you 26x26x26 or 17,576 combinations. Filenames like ‘162-dvr’ are easy on the eye, not hard to remember, and easy to type, but an XML id cannot start with ‘162’. You can use ‘a’, ‘b’, ‘c’, ‘d’ as century prefixes for the period 1473-1799. In that case ‘c2’ is equivalent to ‘162’, and ‘c2dvr’ is a valid XML id. If file and folder names beginning with a three-digit decade code are seen as more user-friendly (I think they are), they can certainly be mapped internally to letter-initial strings.
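A minimal sketch of the naming scheme just described, assuming that the century letters map as a=14xx, b=15xx, c=16xx and d=17xx. The function names `decade_codes` and `new_ids` are hypothetical, and this is not the script that produced the files in the repository.

```python
import random
import string

# Century letters for 1473-1799 (assumed mapping: a=14xx ... d=17xx).
CENTURY_LETTERS = {14: "a", 15: "b", 16: "c", 17: "d"}


def decade_codes(year: int) -> tuple[str, str]:
    """Return (digit prefix, letter prefix), e.g. 1629 -> ("162", "c2")."""
    century, decade = divmod(year // 10, 10)
    if century not in CENTURY_LETTERS:
        raise ValueError(f"year {year} is outside 1400-1799")
    return f"{century}{decade}", f"{CENTURY_LETTERS[century]}{decade}"


def new_ids(year: int, taken: set, rng=random) -> tuple[str, str]:
    """Draw an unused three-letter suffix; return (filename stem, XML id).

    `taken` holds the XML ids already in use. The loop assumes the decade
    is not yet full (there are 26**3 = 17,576 suffixes per decade).
    """
    digits, letters = decade_codes(year)
    while True:
        suffix = "".join(rng.choice(string.ascii_lowercase) for _ in range(3))
        xml_id = letters + suffix
        if xml_id not in taken:
            taken.add(xml_id)
            return f"{digits}-{suffix}", xml_id


print(decade_codes(1629))
print(new_ids(1629, set()))
```

The filename stem and the XML id share the same suffix, so one can always be derived from the other.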
Id’s like ‘c2dvr’ or filenames like ‘162-dvr’ provide ample choices for creating unique filenames across the ~60,000 TCP texts. The 4,000 texts from five decades use that ID system. The directory structure is in some ways similar to the directory structure of TCP texts, where directories with the first three letters of a filename (A01, …, B44) hold up to 1,000 files. In this experimental system, a top directory (TCP2ESTC) contains decade directories (1500, …, 1660). Each of them contains three-character code subdirectories, each of which can hold up to 676 files.
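As a rough sketch of how an id could be mapped to its place in that directory tree: I assume here that the three-character subdirectory is simply the first three characters of the letter-initial id, which would leave two free letters and hence 26x26 = 676 files per subdirectory. That reading is an inference from the numbers above, not a description of the repository, and `shelf_path` is a hypothetical name.

```python
from pathlib import PurePosixPath

CENTURIES = {"a": "14", "b": "15", "c": "16", "d": "17"}


def shelf_path(xml_id: str, root: str = "TCP2ESTC") -> PurePosixPath:
    """Map an id such as 'c2dvr' to root/decade/three-character code."""
    if len(xml_id) != 5 or xml_id[0] not in CENTURIES or not xml_id[1].isdigit():
        raise ValueError(f"unexpected id: {xml_id!r}")
    decade = f"{CENTURIES[xml_id[0]]}{xml_id[1]}0"  # 'c2' -> '1620'
    return PurePosixPath(root, decade, xml_id[:3])  # assumed layout


print(shelf_path("c2dvr"))
```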
Some of the advantages of that directory model become apparent if you imagine undergraduates with an interest in things early modern scrolling through the directories. They will learn nothing from looking at directory A01 or A73. They learn a lot from the most casual glance at the directories for 1500 and 1660. I remember a passage from David Cecil’s biography of Melbourne, according to which at some gathering the young Victoria said to Melbourne something like “I see very few viscounts.” Whereupon Melbourne answered: “There are very few viscounts.”
Speaking from personal experience, this scheme is a slightly revised version of a “divide and conquer” approach that I developed for a review of linguistic metadata. It has been enormously helpful to know what is where and have a rough idea of how much of this is here and how much of that is there. The TCP filenames are quite useless for that purpose. Chronologically articulated filenames were an essential tool for my work, and I am confident that many other users with quite different interests would share that experience.
The filenames in this experiment add the ESTC call number to the unique file identifier. This call number does not add a further specification but serves the purpose of very explicitly orienting the file towards the catalogue that not only does (or should) contain the most up-to-date basic bibliographical data about a given book but also provides the most authoritative and accessible data about its wider context.
What is special about the ESTC?
A historical text is likely to have many names or call numbers. In a TCP XML file they are all listed in the teiHeader element–a part of the file that you usually don’t see in a Web representation. It relates to the file pretty much in the way in which a catalogue card relates to the book it represents. The different names for A13172 are
<idno type="STC">STC 23467</idno>
<idno type="STC">ESTC S528</idno>
<idno type="OCLC">ocm 22582595</idno>
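For readers who want to pull the ESTC number out of a header themselves, here is a minimal sketch using Python's standard library. It assumes a header fragment without an XML namespace (files in the TEI namespace would need namespace-qualified element names), and the function name `estc_number` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Header fragment reduced to the three idno elements quoted above.
HEADER = """
<teiHeader>
  <idno type="STC">STC 23467</idno>
  <idno type="STC">ESTC S528</idno>
  <idno type="OCLC">ocm 22582595</idno>
</teiHeader>
"""


def estc_number(header_xml: str):
    """Return the ESTC number from the first idno whose text starts with 'ESTC'."""
    root = ET.fromstring(header_xml)
    for idno in root.iter("idno"):
        value = (idno.text or "").strip()
        if value.startswith("ESTC "):
            return value.split(None, 1)[1]
    return None  # no ESTC number recorded in this header


print(estc_number(HEADER))
```

The function looks at the text of the element rather than its type attribute, because in this header the ESTC number is carried by an idno of type "STC".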
What is special about the ESTC number? Three things. First, the ESTC is the only catalogue that offers relatively uniform descriptions of all English books before 1800, wherever printed. Second, it is freely available to anybody with an Internet connection. Third, the ESTC is currently undergoing a major revision (code name ESTC21), which will over time transform it into a collaborative and “two-way street” environment that will allow user-contributors to add corrections or additional data, subject to appropriate review.
The current ESTC web site was not, I believe, available when the EEBO and TCP projects started in the nineties. The EEBO site, accessible only to subscribers, has ignored ESTC numbers. The TCP has added ESTC numbers over the years. There may be some question whether all TCP file names unambiguously map to ESTC numbers, but most of them do.
The ESTC started as an electronic catalogue of 18th century books and over time incorporated the print-based catalogues of English print to 1640 (Pollard and Redgrave’s Short Title Catalogue), English print 1641-1700 (Donald Wing’s Short Title Catalogue …1641-1700) and American texts before 1800 (Charles Evans’ American Bibliography…from…1639 to…1820).
It is probably the case that each of these catalogues contains some data not in ESTC. It is certainly the case that local catalogues at the Bodleian, Huntington, Folger, or Newberry Library will have richer descriptions of this or that item. But unless you have very special needs or interests, the ESTC will offer convenient one-stop-shopping for most bibliographical needs. Moreover, the ESTC has increasingly comprehensive coverage of the libraries that hold particular copies of a book identified by an ESTC number.
A little more about the relationship between a TCP file and its ESTC call number
In what ways is A13172 or c2dvr or 162-dvr a digital surrogate of S528 (A true relation of Englands happinesse…)? I’ll answer that question via WEMI, a useful acronym from the FRBR world (Functional Requirements for Bibliographic Records). WEMI concatenates the four essential FRBR concepts of Work, Expression, Manifestation and Item (or Instance). Take TCP file A13172. It was transcribed from a digital scan of a microfilm image of a British Library copy (Instance) of the 1629 edition (Manifestation) of an Expression that was a later version of a Work, whose first Expression appeared under a different title in 1604. The borders between these four concepts are not beyond dispute.
The file A13172 is not and does not pretend to be a “documentary edition” of the Instance from which it was created. It does not attempt to capture all the physical details of a particular instance of a Manifestation. It is best described as an honest attempt to get the words right in the order that the author or publisher intended. It is faithful to the orthographic details of the Manifestation, but selective and not entirely consistent in its treatment of typographical detail and layout. It is, however, a version of the text represented by S528, and if its author came back from the dead to proofread it he would have no trouble deciding whether and where it got the words right or not.
The mournful or whimsical printers’ apologies that you often find in Early Modern books are a charming sub-genre of that period. They are particularly interesting for their implicit theory of textuality and their understanding of a text as an intentional object (Expression?) whose embodiment inescapably falls short in one way or another. I take the practice of the TCP project to be very much in the spirit of that implicit theory. Deciding that ‘aſſliction’ (yes, there are 23 occurrences of this spelling in EEBO-TCP) should be corrected to ‘affliction’ matters more than worrying about whether the spelling was represented in Antiqua, italics, or Fraktur.
The TCP texts are notorious for their many lacunae, ranging from missing letters in words, to missing words, missing lines, and whole pages. Malone is supposed to have said that “the text of Shakespeare is not as bad as it is thought to be”, and something similar is true of the TCP texts. Their reputation has suffered from the inveterate human tendency to judge a barrel by its worst apples. Given the vagaries of transmission and constraints of the transcription project, we should not complain about the badness of the texts but be gratefully surprised at how good they are. Still there are many things wrong with them. Most of the wrong things are not hard to fix, but there are many of them, and the people who complain about lacunae and mistakes typically think that somebody else should fix them.
For two thirds of the textual blemishes the “somebody else” could well be a “long short-term memory neural network algorithm” that lets a machine fix defective readings and integrate them into the texts, where they are flagged as such and are subject to human review, acceptance, or further correction. As I reported in an earlier blog, two Northwestern Computer Science students from Doug Downey’s lab used this technique to guess missing letters in incompletely transcribed words, by far the most common type of lacuna in the TCP texts. We call them “blackdot words” because the lacunae are represented on the Web by the black circle (\u25cf). The results of the students’ work were so good that I decided to move the corrections into the text, but in a manner that would make their status very clear. Take this occurrence of ‘wa●’, which appears in the XML file (slightly simplified) as
<w xml:id="c2dvr-117-a-2480">wa●</w>
The provisionally corrected version reads
<w xml:id="c2dvr-117-a-2480" type="mc" cert="0.94">wa[s]</w>
The correction is put in square brackets, a @cert attribute states the probability assigned by the algorithm, and the @type attribute with its value ‘mc’ identifies the correction as generated by a machine. In an environment that supports collaborative curation readers can approve or challenge such corrections. The EarlyPrint site offers an example of such an environment.
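The encoding convention can be illustrated with a small Python sketch. It is not the students' LSTM code: the function name `apply_machine_correction` is hypothetical, it takes the proposed letters and the probability as given, and it assumes one proposed letter per blackdot, each in its own pair of brackets.

```python
import xml.etree.ElementTree as ET

BLACKDOT = "\u25cf"  # the black circle that marks a missing letter


def apply_machine_correction(w, letters: str, cert: float):
    """Fill the blackdots of one <w> element and flag the result as machine-made."""
    text = w.text or ""
    if text.count(BLACKDOT) != len(letters):
        raise ValueError("need exactly one proposed letter per blackdot")
    for letter in letters:
        # Replace the blackdots from left to right, one per proposed letter.
        text = text.replace(BLACKDOT, f"[{letter}]", 1)
    w.text = text
    w.set("type", "mc")           # 'mc' = machine correction
    w.set("cert", f"{cert:.2f}")  # probability reported by the algorithm
    return w


w = ET.fromstring('<w xml:id="c2dvr-117-a-2480">wa\u25cf</w>')
apply_machine_correction(w, "s", 0.94)
print(ET.tostring(w, encoding="unicode"))
```

Because the brackets and the two attributes stay in the file, a later human decision can confirm the reading or overwrite it without losing track of where it came from.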
Early Modern scholars are likely to be skeptical about machine corrections, and with good reason. They may also find it hard to figure out what is meant by this or that accuracy rate. 97% is better than 90%, but is 90% good or not really good enough? Is it OK in some contexts, but not good enough in others? It is important to remember the nature and scope of the problem. If an editor confronts a single text, at least one pair of human eyes should look at every word and preferably more than once. But there are 60,000 TCP texts. If you require the same editing standards for the corpus as for a single text, the odds are that nothing will ever get done. Moreover, the approximately five million defective tokens hardly ever involve what Mike Witmore has called the “philologically exquisite.” It is not a question of just what Falstaff said at the moment of death or whether Othello mentioned an Indian or Iudean. It is a matter of many occurrences of ‘b●t’ or ‘a●d’ and scattered occurrences of ‘in●olent’, ‘in●inite’, ‘Aus●in’, and countless other words whose meaning readers modestly familiar with Early Modern texts will typically guess correctly.
The LSTM neural network procedure does a pretty good job mimicking that competence. Here are some rough figures on which to base a cost/benefit analysis, unless you are willing to argue that no amount of accurate machine-generated textual corrections is worth the cost of making some errors along the way. The LSTM algorithm proposed corrections for 4.46 million blackdot words. 946,500 of them could be ruled out as being categorically unlikely to produce results. That leaves about 3.51 million plausible corrections. Initial sampling suggests that the error rate is about 10%.
Which means that machine correction can reduce the number of defective tokens by 3.16 million, leaving 1.3 million defects unfixed and making 310,000 wrong guesses. Without machine corrections you have 4.46 million defective tokens. After machine correction without any human intervention you have 1.6 million defects. That sounds like a lot of defects, but remember that there are about 1.5 billion tokens in the EEBO-TCP corpus. Machine correction without additional human checking reduces blackdot defects from 30 per 10K words to 10 per 10K words. That is a significant advance. It may be a rough textual justice, but it is justice nonetheless.
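The per-10K rates can be checked with a few lines of arithmetic. The sketch below uses only the rounded figures quoted in the text as inputs; the variable names are mine, and the results are rough, as the inputs are.

```python
corpus_tokens = 1_500_000_000  # "about 1.5 billion tokens"
blackdots_before = 4_460_000   # blackdot words with a proposed correction
ruled_out = 946_500            # proposals ruled out as unlikely to succeed
defects_after = 1_600_000      # figure given for defects after correction
error_rate = 0.10              # "about 10%" from initial sampling


def per_10k(count: int, total: int = corpus_tokens) -> float:
    """Defects per 10,000 tokens."""
    return count / total * 10_000


plausible = blackdots_before - ruled_out
fixed = round(plausible * (1 - error_rate))

print(f"plausible corrections: {plausible:,}")
print(f"correct at a 10% error rate: {fixed:,}")
print(f"before: {per_10k(blackdots_before):.1f} per 10K")
print(f"after:  {per_10k(defects_after):.1f} per 10K")
```

The two rates come out a little under 30 and a little over 10 per 10K words, which is what the round numbers in the text express.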
I suspect that most people worry less about unfixed defects than about new defects introduced in the process of correction. I certainly do. But there are several things to remember. First, all blackdot words fall in the category of Donald Rumsfeld’s “known unknowns”. They are flagged in the text and can be retrieved in a single work or across any subset of them. Second, these new defects do not add to the total number of defects. They do not corrupt a good reading; they just correct a bad reading in a wrong way. Third, they cluster heavily. The earlier the text the higher the error rate. And some words are error prone. Tokens corrected to ‘and’ or ‘but’ are hardly ever wrong. Tokens corrected to ‘in’ have a quite high error rate, higher than ‘if’, ‘it’ or ‘is’. You can target human labor on textual regions where error hunting is likely to be productive.
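Targeting human labor in this way could look something like the following sketch, which lists machine corrections that have a low probability or that resolve to a word known to be error prone. The threshold of 0.9, the watch list, the function name `review_queue`, and the sample words with their probabilities are all made up for illustration.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample: three machine corrections and one untouched word.
SAMPLE = """<p>
<w xml:id="c2dvr-117-a-2480" type="mc" cert="0.94">wa[s]</w>
<w xml:id="c2dvr-117-a-2490" type="mc" cert="0.97">i[n]</w>
<w xml:id="c2dvr-117-a-2500" type="mc" cert="0.61">ha[t]h</w>
<w xml:id="c2dvr-117-a-2510">happinesse</w>
</p>"""


def review_queue(root, max_cert=0.9, watch_words=("in",)):
    """Return (cert, word, id) for corrections worth a human glance, least certain first."""
    queue = []
    for w in root.iter("w"):
        if w.get("type") != "mc":
            continue  # not a machine correction
        cert = float(w.get("cert", "0"))
        resolved = (w.text or "").replace("[", "").replace("]", "")
        if cert < max_cert or resolved.lower() in watch_words:
            queue.append((cert, resolved, w.get(XML_ID)))
    queue.sort(key=lambda item: item[0])
    return queue


for cert, word, word_id in review_queue(ET.fromstring(SAMPLE)):
    print(cert, word, word_id)
```

A date-based weighting could be added in the same way, since the decade is already part of every id.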
Assume Reader X who is interested in text Y, which has 50 blackdot words. 40 of them have been machine-corrected, and 35 of them are correct. Assume further that she encounters the text in the EarlyPrint environment. Without machine correction she would have to enter fifty separate corrections. With machine correction, her task is considerably simpler. She looks at the 40 machine corrections, clicks an Accept button in 35 cases, and on five occasions corrects an error. There remain ten passages where she has to start from scratch.
Some corrections can be approved or made from scratch without looking at the page image. ‘Austin’ is the only possible correction of “St. Aus●in”, and such cases are legion. But in many cases you must look at a page image. Tokens that involve numbers are an example. There is no way of making a right guess about “1●”. If the text exists as a digital combo in the EarlyPrint environment, the time cost of consulting the image is very low. If you have to go to the EEBO image, it takes more time. In a majority of cases, the EEBO image is good enough to reach a decision. But quite often the image quality is so poor that you wonder how the transcribers did as well as they did.
Now consider text Z, which also has 50 blackdot words with 40 corrections, 35 of which are correct. But nobody looks at the text–a very plausible fate of many texts. What does it matter if the machine made five mistakes if nobody uses the text?
Think of a machine correction as a proposal waiting to be confirmed and as an invitation to user communities to take care of their texts in a manner that substantially reduces the time cost of human “textkeeping”.
There is a famous line in Vergil about “comparing great things with small”. The New York Times recently had an article called “Inside Tesla’s Audacious Push to Reinvent the Way Cars Are Made”. It is partly about moving the Tesla Model 3 from a concept car to an affordable mass-produced car, but the most interesting parts are about how to decide on the most effective way of combining automation with human labour, what the computer scientists call “mixed initiatives.” In a worldwide survey rating auto plants, one author claimed that “the most efficient ones use a lot of manual labour. The most automated ones are at the bottom of the list.” Another interesting point in this article is that the optimal allocation of human and machine labour is not a constant but always subject to renegotiation.
Turning sub-par TCP transcriptions into respectable editions matters a lot less than mass-producing a car whose widespread adoption would go quite a way towards reducing CO₂ emissions. But there are lessons to be learned. We need machine-based procedures sufficiently accurate and flexible to persuade Early Modern scholars and their students that it is worth their while to invest some of their time in the humble but consequential task of getting their texts “right” (however one wants to interpret that loaded term). It seems to me that we have come a long way on the machine side of the equation. The main problem will be to motivate contributors to provide the human labour without which this “mixed initiative” will not succeed. Let us find them and prove the wisdom of John Heywood’s 1546 Dialogue conteinyng the nomber in effect of all the prouerbes in the englishe tongue with its observation that “many hands make light wark”.