The title of this blog entry is the title of a keynote address I gave at the Chicago Digital Humanities and Computer Science Colloquium, held November 18-19, 2012, at the University of Chicago. There is a PDF of the talk at http://panini.northwestern.edu/mmueller/backtothefuture.pdf

The talk was about the challenges and opportunities posed by the TCP EEBO corpus. Between 2015 and 2020 (and beginning with an initial release of ~25,000 texts), TEI-XML transcriptions of ~70,000 texts (at least one version of every title published between 1473 and 1700) will pass into the public domain. Once this resource is in the public domain it will, for most scholarly purposes, replace other surrogates of the printed originals. It will be free, it will often be the only source, and it will nearly always be the most convenient source for the many look-up activities that make up much of scholarly work.

EEBO-TCP is a magnificent but flawed enterprise, and few of its transcriptions fully meet the scholarly standards one associates with a decent diplomatic edition in the print world. Who will guarantee the integrity of this primary archive that will be the foundation for much future scholarship? In a print-based documentary infrastructure there was a simple answer to the question “Who provides quality assurance (QA in modern business parlance) for the primary sources that undergird work in your discipline?” It was “my colleagues,” and it might include “I do some of that work myself.” From the nineteenth century well into the middle of the twentieth century, “Lower Criticism” of one kind or another counted as significant scholarly labor and made up a significant, though gradually declining, share of the work of humanities departments.

Consider Theodor Mommsen. In 1853 and 1854 he published the first volume of his Roman History and started the Corpus Inscriptionum Latinarum (CIL), the systematic gathering of inscriptions from all over the Roman empire. For the next five decades he was the chief editor of and a major contributor to its sixteen volumes, which transformed the documentary infrastructure for the study of Roman history. Since the early 20th century, a student of Roman history with access to a decent research library has had “at hand” a comprehensive collection of the epigraphic evidence ordered by time and place. That has made a huge difference to the study of administrative, legal, and social history.

The CIL is a majestic instance of the century of curatorial labour that created the documentary infrastructure for modern text-centric scholarship in Western universities. In that world the integrity of primary data rested on what you might call a Delphic tripod of cultural memory, with its three legs of scholars who made editions, publishers who published them, and librarians who acquired and catalogued them and made them available to the public. During the sixties and seventies of the last century there was a growing consensus that you no longer needed to worry about data curation because a century of it had succeeded in creating a print-based data infrastructure that from now on you could take for granted. For the last forty years many disciplines in the humanities have lived off the capital of a century of editorial work while paying little attention to the progressive migration of textual data from books on shelves to files on servers or in ‘clouds’. Using some back-of-the-envelope calculations, Greg Crane argued in 2010 that classicists now allocate less than 5% of their labour to curatorial work (using the term in its broadest sense). That sounds about right for the departments of English or History that I know something about. It is possible for individuals within a field of activity to make choices that make professional and economic sense within the field but lead the field as a whole astray. The steel industry of the seventies or the current monoculture of corn in Iowa come to mind.

A decade ago Jerry McGann observed that “in the next fifty years the entirety of our inherited archive of cultural works will have to be re-edited within a network of digital storage, access, and dissemination.”  This digital migration has so far made slow progress. The integrity of an emerging cyber infrastructure for text-centric scholarship has received remarkably little attention in the discourse of disciplines that will increasingly rely on digital surrogates of their primary sources.  The current buzz about ‘Digital Humanities’ or ‘DH’ has very little to do with serious work on that front.

Back to the EEBO-TCP corpus and the ~45,000 texts (~2 billion words) that have so far been transcribed. EEBO-TCP will serve as the de facto documentary infrastructure for much Early Modern scholarship, accessed increasingly via mobile devices that provide each scholar with his or her own “table of memory.” Montaigne had a couple of thousand books in his tower library. A little more than two years from now, graduate students will be able to load 25,000 books from Montaigne’s world (and beyond) onto their Apple, Google, or Samsung tablets as epubs or raw XML files.

“How bad is good enough” when it comes to the quality of those texts? A lot of work needs to be done if you believe, as I do, that a digital surrogate with any scholarly ambitions should at least meet the standards we associate with good enough diplomatic editions in the print world (I am ignoring here the additional features required to make the digital surrogate fully machine actionable).  There are two interesting properties of the TCP corpus that affect the discussion of data curation and quality assurance. Both of these have analogues in other large collections of primary materials. In fact, the TCP archive exhibits characteristic features of the large-scale surrogates of printed originals that will increasingly be the first and most widely consulted sources.

First, the TCP is published by a library. Second, in a collection of printed books, the boundaries between one book and another or one page and another impose physical barriers that constrain what you can do within and across books or pages.  In a digital environment, these constraints are lifted for many practical purposes. You can think of and act on the current TCP archive as 45,000 discrete files, 2 billion discrete words, or a single file.  This easy concatenability is the major reason for the enhanced query potential of a full-text archive. It also has the potential for speeding up data curation within and across individual texts.
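To make the concatenability point concrete, here is a minimal sketch, in Python, of acting on the archive as if it were a single text: every transcription is reduced to plain text and searched in one pass. The directory name and search phrase are hypothetical, and nothing here describes an actual TCP tool.

```python
# A sketch of treating thousands of TEI-XML transcriptions as one corpus.
# The folder name and the phrase searched for are illustrative assumptions.
from pathlib import Path
import re
import xml.etree.ElementTree as ET

TCP_DIR = Path("tcp-texts")  # hypothetical folder of TEI-XML transcriptions
PHRASE = re.compile(r"\btable of my memory\b", re.IGNORECASE)

def plain_text(xml_path: Path) -> str:
    """Concatenate all text nodes of one TEI-XML file into a single string."""
    root = ET.parse(xml_path).getroot()
    return " ".join(root.itertext())

# Act on the archive as a single file: scan every transcription
# and report which texts contain the phrase.
hits = []
for path in sorted(TCP_DIR.glob("*.xml")):
    if PHRASE.search(plain_text(path)):
        hits.append(path.stem)

print(f"{len(hits)} texts contain the phrase:")
for tcp_id in hits:
    print(" ", tcp_id)
```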

If you come across a simple error in a book, it is usually a matter of seconds to correct it in your mind. It takes much longer to correct it for other readers of the book. You must provide the correction in a review or write to the author or publisher. The publisher must incorporate it into a second edition, and libraries must buy that second edition before the corrected passage is propagated to readers at large. That is a typical form of data curation in a world where the tripod of cultural memory rests on the actions of scholars, publishers, and librarians. In a digital world that tripod rests on the interactions of scholars, librarians, and technologists. In a well-designed digital environment scholars (and indeed lay people of all stripes) can communicate directly and immediately with the library/publisher. If I work with a text and come across a phenomenon requiring correction or completion, I can right away do the following:

1. log in (if I’m not logged in already) and identify myself as a user with specified privileges
2. select the relevant word or passage and enter the proposed correction in the appropriate form.

If I do not have editorial privileges, my proposal is held for editorial review. If I am authorized to make or approve corrections my proposal is forwarded for inclusion in the text either immediately or (the more likely scenario) the next time the system is re-indexed. The system automatically logs the details of this transaction in terms of who did what and when.
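What such a transaction might look like under the hood can be sketched in a few lines; none of this describes an actual TCP system, and every name and field below is illustrative. The point is simply that a proposal is either held for review or queued for inclusion, and that who did what, and when, is logged automatically.

```python
# A minimal sketch of the correction workflow described above.
# All class names, fields, and identifiers are hypothetical.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Correction:
    text_id: str            # e.g. a TCP identifier
    locus: str              # word or passage being corrected
    proposed: str           # the proposed reading
    proposer: str           # user who submitted the correction
    has_editorial_rights: bool
    status: str = field(init=False)
    logged_at: datetime = field(init=False)

    def __post_init__(self):
        # Proposals from users without editorial privileges are held for review;
        # authorized users' proposals are queued for the next re-indexing.
        self.status = "accepted" if self.has_editorial_rights else "pending review"
        self.logged_at = datetime.now(timezone.utc)  # who did what, and when

curation_log = []

def submit(correction: Correction) -> None:
    """Record the transaction in the cumulative curation log."""
    curation_log.append(correction)
    print(f"{correction.logged_at:%Y-%m-%d %H:%M} "
          f"{correction.proposer}: {correction.locus!r} -> "
          f"{correction.proposed!r} [{correction.status}]")

# Example: a reader without editorial privileges proposes a fix.
submit(Correction("A12345", "vvorld", "world", "reader42", has_editorial_rights=False))
```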

The obstacles to such an environment are not primarily technical or financial. They are largely social. You need substantial adjustments in the ways scholars and librarians think about their roles and relationships. Scholars often complain about the shoddiness of digital resources, but if they want better data they must recognize that they are the ones who must provide them, although they may find it rewarding in many ways to recruit a lay public for help with those tasks. And they need to ask themselves why, in the prestige economy of their disciplines, they have come to undervalue the complexity and importance of “keeping” (in the widest sense of the word) the data on which their work ultimately depends. Librarians need to rethink the value chain in which the Library ends up as a repository of static data. Instead they should put the Library at the start of a value chain whose major component is a framework in support of data curation as a continuing activity by many hands in many places, whether on an occasional or sustained basis. Such a model of collaborative data curation is the norm in genomic research, a discipline that from the perspective of an English department can be seen as a form of criticism (both higher and lower) of texts written in a four-letter alphabet.

Some of the best thinking on these issues has come from Greek papyrologists, a very special scholarly club with highly specialized data, tools, and methods, but with some good lessons for the rest of us. Papyrologists have for a century kept a Berichtigungsliste or curation log as the cumulative and authorized record of their labours. The Integrating Digital Papyrology project (IDP) is based on the principle of “investing greater data control in the user community.” Talking about the impact of the Web on his discipline, Roger Bagnall said that

these changes have affected the vision and goals of IDP in two principal ways. One is toward openness; the other is toward dynamism. These are linked. We no longer see IDP as representing at any given moment a synthesis of fixed data sources directed by a central management; rather, we see it as a constantly changing set of fully open data sources governed by the scholarly community and maintained by all active scholars who care to participate.

He faced the question: “How … will we prevent people from just putting in fanciful or idiotic proposals, thus lowering the quality of this work?” and answered that collaborative systems

are not weaker on quality control, but stronger, inasmuch as they leverage both traditional peer review and newer community-based ‘crowd-sourcing’ models. The worries, though, are the same ones that we have heard about many other Internet resources (and, if you think about it, print resources too). There’s a lot of garbage out there. There is indeed, and I am very much in favor of having quality-control measures built into web resources of the kind I am describing.

A collaboratively curated Berichtigungsliste or curation log offers an attractive model for coping with the many imperfections of the current TCP texts. The work of many hands, supported by clever programmers, quite ordinary machines, and libraries acting consortially, can over the course of a decade substantially improve the TCP texts and move them closer to the quality standards one associates with good diplomatic editions in a print world. Imagine a social and technical space where individual texts live as curatable objects, continually subject to correction, refinement, or enrichment by many hands, and coexist at different levels of (im)perfection. You could also imagine a system of certification for each text, not unlike the USDA hierarchy of grades of meat from prime to utility. But “prime” would always be reserved for texts that have undergone high-quality human copy-editing. Such a system would build trust and would counteract the human tendency to judge barrels by their worst apples.
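Purely by way of illustration, and with entirely hypothetical grade names and criteria, such a certification could be little more than a small lookup keyed to a text's curation history, with “prime” reserved for texts that have passed human copy-editing.

```python
# An illustrative sketch of certification grades for curated texts.
# The grade names and the criteria in certify() are invented for this example.
from enum import Enum

class Grade(Enum):
    PRIME = "passed high-quality human copy-editing"
    CHOICE = "collaboratively corrected, not yet fully copy-edited"
    SELECT = "machine-corrected only, no known defects"
    UTILITY = "raw transcription with known defects"

def certify(human_copyedited: bool, corrections_applied: int, known_defects: int) -> Grade:
    # "Prime" is always reserved for texts that have undergone human copy-editing.
    if human_copyedited:
        return Grade.PRIME
    if corrections_applied > 0:
        return Grade.CHOICE
    return Grade.UTILITY if known_defects > 0 else Grade.SELECT

print(certify(human_copyedited=False, corrections_applied=12, known_defects=3))  # Grade.CHOICE
```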

What I have said about collaborative curation of the TCP texts applies with minor changes to other archives. Neil Fraistat and Doug Reside in conversation coined the acronym CRIPT for “curated repository of important texts.” Not everything needs to be curated in the same fashion, but high degrees of curation are appropriate for some texts, whether for their intrinsic qualities or evidentiary value. Large consortial enterprises like the HathiTrust or the DPLA might be the proper institutional homes for special collections of this type. Somewhere in the middle distance I see the TCP collection as the foundation of a Book of English, defined as

• a large, growing, collaboratively curated and public domain corpus
• of written English since its earliest modern form
• with full bibliographical detail
• and light but consistent structural and linguistic encoding

It will take a while to get there. It is a lot of work, and like woman’s work, it is “never done.” But progress is possible. Here is the challenge of the next decade(s) for scholarly data communities and the libraries that support them: put digital surrogates of your primary sources into a shape that will

  1. rival the virtues of good diplomatic editions from an age of print, and
  2. add features that will allow scholars to explore the full query potential of the digital surrogate.