Back to the Future or Wanted: A Decade of High-tech Lower Criticism

Note: The title of this blog entry is the title of a keynote address I gave at the Chicago Digital Humanities and Computer Science Colloquium, held November 18-19, 2012, at the University of Chicago. It is lightly edited and shortened. I have added a postscript called “A Decade Later”.

This talk is about the challenges and opportunities posed by the EEBO-TCP corpus. Between 2015 and 2020, beginning with an initial release of ~25,000 texts, TEI-XML transcriptions of ~70,000 texts (at least one version of every title published between 1473 and 1700) will pass into the public domain. Once this resource is in the public domain it will for most scholarly purposes replace other surrogates of the printed originals. It will be free, it will often be the only source, and it will nearly always be the most convenient source for the many look-up activities that make up much of scholarly work.

EEBO-TCP is a magnificent but flawed enterprise, and few of its transcriptions fully meet the scholarly standards one associates with a decent edition in the print world. Who will guarantee the integrity of this primary archive that will be the foundation for much future scholarship? In a print-based documentary infrastructure there was a simple answer to the question “Who provides quality assurance (QA in modern business parlance) for the primary sources that undergird work in your discipline?” It was “my colleagues,” and it might include “I do some of that work myself.” From the nineteenth century well into the middle of the twentieth century, “Lower Criticism” of one kind or another counted as significant scholarly labor and made up a significant, though gradually declining, share of the work of humanities departments.

Consider Theodor Mommsen. In 1853 and 1854 he published the first volume of his Roman History, and he started the Corpus Inscriptionum Latinarum (CIL), the systematic gathering of inscriptions from all over the Roman empire. For the next five decades he was the chief editor and a major contributor to its sixteen volumes, which transformed the documentary infrastructure for the study of Roman history. Since the early 20th century, a student of Roman history with access to a decent research library has had “at hand” a comprehensive collection of the epigraphic evidence ordered by time and place. That has made a huge difference to the study of administrative, legal, and social history.

The CIL is a majestic instance of the century of curatorial labour that created the documentary infrastructure for modern text-centric scholarship in Western universities. In that world the integrity of primary data rested on what you might call a Delphic tripod of cultural memory with its three legs of scholars who made editions, publishers who published them, and librarians who acquired, catalogued, and made them available to the public. After World War II there was a growing consensus that you no longer needed to worry about data curation because a century of it had succeeded in creating a print-based data infrastructure that from now on you could take for granted. For the last forty years many disciplines in the humanities have lived off the capital of a century of editorial work while paying little attention to the progressive migration of textual data from books on shelves to files on servers or in ‘clouds’. Using some back-of-the-envelope calculations, Greg Crane argued in 2010 that classicists now allocate less than 5% of their labour to curatorial work (using the term in its broadest sense). That sounds about right for departments of English or History that I know something about. It is possible for individuals within fields of activity to make choices that make professional and economic sense within the field but lead the field as a whole astray. The steel industry of the seventies or the current monoculture of corn in Iowa come to mind.

A decade ago Jerry McGann observed that “in the next fifty years the entirety of our inherited archive of cultural works will have to be re-edited within a network of digital storage, access, and dissemination” (quoted from his essay “Our Textual History” in TLS, Nov. 20, 2009). This digital migration has so far made slow progress. The integrity of an emerging cyber infrastructure for text-centric scholarship has received remarkably little attention in the discourse of disciplines that will increasingly rely on digital surrogates of their primary sources. The current buzz about ‘Digital Humanities’ or ‘DH’ has very little to do with serious work on that front.

Back to the EEBO-TCP corpus and the ~45,000 texts (~2 billion words) that have so far been transcribed.  EEBO-TCP will serve as the de facto documentary infrastructure for much Early Modern scholarship, accessed increasingly via mobile devices that provide each scholar with his or her own “table of memory.”  Montaigne had a couple of thousand books in his tower library.  A little more than two years from now, graduate students will be able to load 25,000 books from Montaigne’s world (and beyond) onto their Apple, Google, or Samsung tablets as epubs or raw XML files.

“How bad is good enough” when it comes to the quality of those texts? A lot of work needs to be done if you believe, as I do, that a digital surrogate with any scholarly ambitions should at least meet the standards we associate with good enough editions in the print world (I am ignoring here the additional features required to make the digital surrogate fully machine actionable).  There are two interesting properties of the TCP corpus that affect the discussion of data curation and quality assurance. Both of these have analogues in other large collections of primary materials. In fact, the TCP archive exhibits characteristic features of the large-scale surrogates of printed originals that will increasingly be the first and most widely consulted sources.

First, the TCP is published by a library. Second, in a collection of printed books, the boundaries between one book and another or one page and another impose physical barriers that constrain what you can do within and across books or pages.  In a digital environment, these constraints are lifted for many practical purposes. You can think of and act on the current TCP archive as 45,000 discrete files, 2 billion discrete words, or a single file.  This easy concatenability is the major reason for the enhanced query potential of a full-text archive. It also has the potential for speeding up data curation within and across individual texts.

If you come across a simple error in a book it is usually a matter of seconds to correct it in your mind. It takes much longer to correct it for other readers of the book. You must provide the correction in a review or write to the author or publisher. The publisher must incorporate it into a second edition, and libraries must buy that second edition before the corrected passage is propagated to readers at large. That is a typical form of data curation in a world where the tripod of cultural memory rests on the actions of scholars, publishers, and librarians. In a digital world that tripod rests on the interactions of scholars, librarians, and technologists. In a well-designed digital environment scholars (and indeed lay people of all stripes) can directly and immediately communicate with the library/publisher. If I work with a text and come across a phenomenon requiring correction or completion, I can right away do the following:

1. Log in (if I’m not logged in already) and identify myself as a user with specified privileges.
2. Select the relevant word or passage and enter the proposed correction in the appropriate form.

If I do not have editorial privileges, my proposal is held for editorial review. If I am authorized to make or approve corrections my proposal is forwarded for inclusion in the text either immediately or (the more likely scenario) the next time the system is re-indexed. The system automatically logs the details of this transaction in terms of who did what and when.
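
This transaction pattern is simple enough to make concrete. The sketch below, in Python, is a minimal illustration of what a correction proposal and its routing might look like; it is not a description of any actual TCP or EarlyPrint component, and every name in it (Correction, submit, the status values) is hypothetical.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone
    from enum import Enum

    # Illustrative only: all names and fields here are assumptions,
    # not documented parts of the TCP or EarlyPrint infrastructure.

    class Status(Enum):
        PENDING_REVIEW = "pending review"   # proposer lacks editorial privileges
        ACCEPTED = "accepted"               # to be applied at the next re-indexing

    @dataclass
    class Correction:
        """One proposed emendation: the who, what, where, and when of the transaction."""
        text_id: str        # identifier of the text, e.g. a TCP number (hypothetical)
        locus: str          # word or passage being corrected, e.g. an XML element id
        old_reading: str
        new_reading: str
        proposer: str
        can_edit: bool      # does the proposer hold editorial privileges?
        submitted: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
        status: Status = Status.PENDING_REVIEW

    def submit(proposal: Correction, curation_log: list) -> Correction:
        """Route a proposal and log it: corrections from privileged users are accepted
        outright (to be folded in when the system is re-indexed), everyone else's are
        held for editorial review. Either way the transaction is recorded."""
        if proposal.can_edit:
            proposal.status = Status.ACCEPTED
        curation_log.append(proposal)
        return proposal

The design point worth noticing is the final append: whatever happens to the proposal, the log ends up recording who did what and when.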

The obstacles to such an environment are not primarily technical or financial. They are largely social. You need substantial adjustments in the ways scholars and librarians think about their roles and relationships. Scholars often complain about the shoddiness of digital resources, but if they want better data they must recognize that they are the ones who must provide them. And they need to ask themselves why in the prestige economy of their disciplines they have come to undervalue the complexity and importance of “keeping” (in the widest sense of the word) the data on which their work ultimately depends. Librarians need to rethink the value chain in which the Library ends up as a repository of static data. Instead they should put the Library at the start of a value chain whose major component is a framework in support of data curation as a continuing activity by many hands in many places, whether on an occasional or sustained basis. Such a model of collaborative data curation is the norm in genomic research, a discipline that from the perspective of an English department can be seen as a form of criticism (both higher and lower) of texts written in a four-letter alphabet.

Some of the best thinking on these issues has come from Greek papyrologists, a very special scholarly club with highly specialized data, tools, and methods, but with some good lessons for the rest of us. Papyrologists have for a century kept a Berichtigungsliste or curation log as the cumulative and authorized record of their labours. The Integrating Digital Papyrology project (IDP) is based on the principle of “investing greater data control in the user community.” Talking about the impact of the Web on his discipline, Roger Bagnall said that

these changes have affected the vision and goals of IDP in two principal ways. One is toward openness; the other is toward dynamism. These are linked. We no longer see IDP as representing at any given moment a synthesis of fixed data sources directed by a central management; rather, we see it as a constantly changing set of fully open data sources governed by the scholarly community and maintained by all active scholars who care to participate.

He faced head-on the question: “How … will we prevent people from just putting in fanciful or idiotic proposals, thus lowering the quality of this work?” and answered that collaborative systems

are not weaker on quality control, but stronger, inasmuch as they leverage both traditional peer review and newer community-based ‘crowd-sourcing’ models. The worries, though, are the same ones that we have heard about many other Internet resources (and, if you think about it, print resources too). There’s a lot of garbage out there. There is indeed, and I am very much in favor of having quality-control measures built into web resources of the kind I am describing.

A collaboratively curated Berichtigungsliste or curation log offers an attractive model for coping with the many imperfections of the current TCP texts. The work of many hands, supported by clever programmers, quite ordinary machines, and libraries acting consortially, can over the course of a decade substantially improve the TCP texts and move them closer to the quality standards one associates with good enough editions in a print world. Imagine a social and technical space where individual texts live as curatable objects, continually subject to correction, refinement, or enrichment by many hands, and coexist at different levels of (im)perfection. You could also imagine a system of certification for each text, not unlike the USDA hierarchy of grades of meat from prime to utility. But “prime” would always be reserved for texts that have undergone high-quality human copy-editing. Such a system would build trust and would counteract the human tendency to judge barrels by their worst apples.

What I have said about collaborative curation of the TCP texts applies with minor changes to other archives. Neil Fraistat and Doug Reside, in conversation, coined the acronym CRIPT for “curated repository of important texts.” Not everything needs to be curated in that fashion, but high degrees of curation are appropriate for some texts, whether for their intrinsic qualities or their evidentiary value. Large consortial enterprises like the HathiTrust or the DPLA might be the proper institutional homes for special collections of this type. Somewhere in the middle distance I see the TCP collection as the foundation of a Book of English defined as

• a large, growing, collaboratively curated and public domain corpus
• of written English since its earliest modern form
• with full bibliographical detail
• and light but consistent structural and linguistic encoding

It will take a while to get there. It is a lot of work, and like a woman’s work, it is “never done.” But progress is possible. Here is the challenge of the next decade(s) for scholarly data communities and the libraries that support them: put digital surrogates of your primary sources into a shape that will

  1. rival the virtues of good enough editions from an age of print
  2. add features that will allow scholars to explore the full query potential of the digital surrogate.

I use “good enough” in the sense Donald Winnicott used it when he argued against a generation of psychoanalysts who were fond of blaming the mother. He defined a quite modest level of maternal competence. Going beyond it would not add a lot, but dropping below it would get bad very fast. Much of the digital and increasingly dominant version of our textual heritage will require a fair amount of mothering before it is clearly good enough.

Postscript: A Decade Later

The 2012 talk sketched an ambitious agenda. Here are some facts about the modest progress we have made since then. Between 2013 and 2015 about 20 students from Amherst, Northwestern, and Washington University in St. Louis made over 50,000 corrections in some 510 Early Modern plays. The number of textual defects per 10,000 running words is a crude but telling measure of how close texts come to being “good enough” for many purposes. The median rate for uncorrected drama texts in the TCP corpus was 14.5 defects per 10,000 words. The work of these students reduced the defect rate by an order of magnitude to 1.4, an improvement visible to a casual reader.
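
For readers who want the arithmetic spelled out: the measure is just a normalization of raw defect counts to a common denominator. The Python lines below are a minimal sketch, with made-up counts chosen only so that they reproduce the rates quoted above.

    def defects_per_10k(defects: int, running_words: int) -> float:
        """Defects per 10,000 running words, the crude quality measure used here."""
        return 10_000 * defects / running_words

    # Hypothetical counts, chosen only to reproduce the quoted rates:
    print(defects_per_10k(29, 20_000))   # 14.5, the pre-curation median for drama
    print(defects_per_10k(3, 21_000))    # ~1.4, the post-curation rate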

In 2016 Shakespeare His Contemporaries became part of EarlyPrint, a more broadly based enterprise doubly centered at Northwestern and Washington University in St. Louis. EarlyPrint currently has close to 60,000 texts. Most of them come from EEBO-TCP, but there are also ~4,500 Early American texts (Evans TCP) and ~2,000 English 18th-century texts (ECCO TCP). The EarlyPrint versions of these texts are linguistically annotated and can be searched via a corpus query engine, and close to 650 texts are “digital combos” that offer a side-by-side display of the text and high-quality digital images. The technical infrastructure supports collaborative curation by anybody anywhere.

The acronym “FAIR” describes data that meet standards of findability, accessibility, interoperability, and reusability. It is a more elaborate version of Ranganathan’s fourth law of library science: “Save the time of the reader.” It summarizes an ethos well captured by Brian Athey, chair of Computational Medicine at Michigan, when he said at a conference about “research data life cycle management” that “agile data integration is an engine that drives discovery.” For Early Modern studies in the Anglophone world, the creation of the TCP archives has been the most monumental achievement. An important goal of EarlyPrint has been to make those texts FAIRer.

Since 2017 anybody with a computer and an Internet connection has been able to offer textual emendations via the EarlyPrint Annotation Module. It makes textual correction as easy as writing in the margin of a book. I call it “curation en passant.” As soon as a reader enters and saves a correction in a little data entry field next to the text, the who, what, when, and where of the correction are automatically recorded in a central curation log. Emendations are provisionally displayed in the text, but their final integration into the source text is subject to editorial review.

A “digital combo” and a computer with a good screen (larger is better) will provide a user with a better-than-good-enough text lab for many basic, and some not so basic, forms of philological labour. Work of this kind follows a “find it, fix it, log it” pattern, where the finding and the logging typically take more time than the fixing. The EarlyPrint environment significantly reduces the time cost of “finding” and turns the “logging” into an automated process.

Plays have continued to be a source of special interest. In the EEBO-TCP corpus as a whole, the interquartile range of defects per 10,000 words lies between 1 at the 25th and 48 at the 75th percentile, with a median of 12. Today the values for 814 plays in the EarlyPrint corpus range from 0 to 16.2, with a median value of 2.8 defects. In this important subcorpus of Early Modern texts a quarter of the texts have no defects, half of them have at most five defects per play, and three quarters of them have on average at most one defect per page. In practice, defects cluster heavily. Most of the remaining defects in EarlyPrint plays occur on the pages of texts for which there are currently no good digital surrogates on the Internet.
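
The percentile figures in this and the following paragraph can be reproduced for any list of per-text defect rates with the standard library alone. The function below is a generic sketch, not part of the EarlyPrint toolchain.

    import statistics

    def quartile_summary(rates: list) -> tuple:
        """Return the 25th percentile, median, and 75th percentile of a list of
        per-text defect rates (defects per 10,000 running words)."""
        q1, median, q3 = statistics.quantiles(rates, n=4, method="inclusive")
        return q1, median, q3

    # Applied to the (hypothetical) list of per-text rates for the whole EEBO-TCP
    # corpus, such a call would return roughly (1, 12, 48), the figures quoted above.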

Last summer, three Classics majors at Northwestern tackled a corpus of 120 medical works. These were in much worse shape than the plays. The three figures for the interquartile range before curation were 14.7, 53.2, and 156.3. The students made about 20,000 corrections and reduced those figures to 3.3, 13.3, and 30. They paid special attention to Thomas Cogan’s Haven of Health, a characteristic late 16th-century work. With the help of better images from the Countway Library of Medicine (via the Internet Archive) and help from a machine-learning experiment they corrected more than 1,000 defects. Its two dozen remaining defects (1.8 per 10,000 words) are philological “cruxes” that have so far defied solution. But the EarlyPrint version of The Haven of Health is sound enough for most purposes.


This work has also demonstrated that motivated undergraduates with an interest in Early Modern texts can be easily trained to do most of this low-level but essential philological labour. I close with reflections from two students on their work in this project. In the summer before her senior year in 2013, Nicole Sheriko was a member of the first collaborative team that worked on what was then called Shakespeare His Contemporaries. She went on to graduate school, working at the “intersection of literary criticism, cultural studies, and theatre history,” has published a handful of essays in leading journals, and is now a Junior Research Fellow at Christ’s College, Cambridge:

Having the experience of working with a professor on a project outside of the classroom–especially in digital humanities, which everyone seems to find trendy even if they have no idea what it entails–was a vital piece of my graduate school applications, I think, and other students may see a similar benefit in that.

 

In a less vulgarly practical sense, though, I would say that working on what was then Shakespeare His Contemporaries made a significant difference in how I approach studying the field of early modern drama. The typical college course can only focus on a handful of canonical texts but working across such an enormous digital corpus reoriented my sense of how wide and eclectic early modern drama is. It gave me a chance to work back and forth between close and distant reading, something I still do as I reconstruct the corpus of more marginal forms of performance from references scattered across many plays. A lot of those plays are mediocre at best, and I often remember a remark you once made to us about how mediocre plays are so valuable for illustrating what the majority of media looked like and casting into relief what exactly makes good plays good. The project was such a useful primer in the scope and aesthetics of early modern drama. It was also a valuable introduction to the archival challenges of preservation and digitization that face large-scale studies of that drama. Getting a glimpse under the hood of how messy surviving texts are–both in their printing and their digitization–raised all the right questions for me about how critical editions of the play get made and why search functions on databases like EEBO require a bit of imaginative misspelling to get right. That team of five brilliant women was also my first experience of the conviviality of scholarly work, which felt so different from my experience as an English major writing papers alone in my room. That solidified for me that applying to grad school was the right choice, a sentiment likely shared by my teammate Hannah Bredar, who–as you probably know–also went on to do a PhD. Once I got to grad school, the project also followed me around in my first year because I took a course in digital humanities and ended up talking a lot about the TCP and some of the little side projects I ended up doing for Fair Em, like recording the meter of each line to see where breakdowns occurred. I even learned some R and did a final project looking for regional markers of difference across the Chronicling America historical newspaper corpus. So, in big ways and small, the work I did at NU has stayed with me.

After her freshman year, in the summer of 2022, Lauren Kelley was part of the team that worked on medical texts. In the section about “Academic and Personal Development” in the final report to the College she wrote this:

    As a premed student, spending the summer learning about the historical tradition of Western medicine has been incredibly valuable. Reviewing the medical corpus allowed me to understand how the field of medicine has evolved throughout the early modern period, and to track the gradual development of knowledge and medical practice. Although the vast majority of knowledge in these books is outdated, the true value of this summer’s work lies in acquiring an intimate understanding of medical history from primary documents, as well as learning how to better interpret and analyze texts from this period. I also enjoyed having the opportunity to use my knowledge of Latin in a setting outside of the classroom, which reinforced the importance of studying Classics and its multitude of applications. Writing the final report for Haven of Health was an especially fulfilling experience that stimulated my academic growth; I had the opportunity to synthesize my observations throughout the 8 weeks and expand on them, as well as research a subject of interest to me and write about it.

 

Having just finished freshman year, this summer was the first experience that I have had with collegiate research. It was extremely enriching for me to spend eight weeks in a collaborative, research-oriented environment. I feel that coordinating several aspects of my work with my coworkers has vastly improved my teamwork skills. Finally, my confidence in my own academic abilities has increased, especially in my ability to apply knowledge in a real-world setting. Overall, this opportunity was a great introduction into how research is performed in humanities, and I am excited to further develop the skills I acquired this summer throughout my academic career.

 

Against the terms DH or Digital Humanities

As the date for this year’s DH conference approaches, I would like once more to express my dislike of the terms Digital Humanities and DH. I write this not from the perspective of somebody inside this particular tent, but as a faculty member in a standard humanities department who has tried (with very mixed success) to persuade his colleagues that they should think a little harder about the humanities in a digital world and the headaches as well as opportunities that new technologies bring with them. My experience has been that “Digital Humanities” is a big turn-off. People either stop listening after the adjective, or they think that there is such a thing as Digital Humanities and we should lay our hands on some cheap version of it right now.

At the 2012 MLA, Middlebury’s provost Alison Byerly argued along somewhat similar lines. She worried about the needless opposition set up by such terms as “New Media” or “Digital Humanities” and argued that humanists “by defining technology-enabled research as a separate field” have “both validated and segregated it,” with “important implications for the humanities as a whole.”

In an earlier blog entry about Stanley Fish’s New York Times triblogy on the Digital Humanities I made a similar argument. If a term is bandied about a lot, it must be about something. For me the something of Digital Humanities “is about the trouble that the humanities have had in absorbing digital technology into their habits of work and recognition. Unlike the natural and social sciences, they have so far put the digital into a ghetto–a mutually convenient practice for those inside and outside, but probably harmful in the long run.”

In an otherwise friendly response to my blog, Brett Bobley, the director of the NEH Office of Digital Humanities, challenged my assertion that there were no “self-proclaimed digital biologists, chemists, or economists” and pointed to the existence of such fields as “bioinformatics,” “computational biology,” “computational chemistry,” and “computational economics,” to which one could add computational linguistics. But these fields seem to me quite different and rather more specific. In the humanities, something like “digital philology” would be a rough equivalent (the term is quite common in Germany but rare in the English-speaking world). All such disciplines are implicitly recognized as helper disciplines where computational power and applied mathematics or statistics are brought to bear on increasingly large or complex data.

To put it differently, in all those cases, using some version of ‘digital’, ‘computer’, or ‘informatics’ as a prefix or suffix to discipline X does not strike the practitioner of discipline X as problematical in some kind of existential sense. It’s a pragmatic matter of getting stuff to work. In the humanities, there are some subdisciplines that work that way. Whether you edit the thousands of manuscript fragments of the New Testament, use chemical analysis to determine the provenance of a painting, or manage an archaeological dig, you would have to be a fool not to take advantage of the digital resources that let you store, organize, and manage your data.

But most scholars in the humanities do not work on such projects with their strong curatorial and quite practical components. For them the collision of “digital” and “humanities” is a needless existential provocation that turns them off. Provocations have their uses. If, after colliding ‘digital’ with ‘humanities’, there were some chance that we would understand either of these terms a little better, the provocation would be worth it. But the chances of that happening are very slim indeed, largely because the encounters of the many mansions in the humanities with the digital are so varied that they are not usefully bundled under one term.

Here is a very rough taxonomy of these encounters. Are the humanist’s objects of attention born digital or are they digital surrogates? If the latter, are they surrogates of texts, of material artifacts, of time-based analog media? Is the humanist’s bent of mind more of a historical, philosophical, or rhetorical kind? Do the research questions lend themselves to quantitative analysis?

The important goal is not to advance Digital Humanities or DH. The important goal is to find better ways of integrating digital tools and resources into the working world of humanities scholars, however they define themselves.  “Humanities in a digital world” is a  gentler and more promising phrase than “Digital Humanities.”  It describes the challenges better, it is less likely to be perceived as some unitary claim (whether intended as such or not), and as a tag line it sets goals that are at once firmer and broader.

 

 

Welcome to Scalable Reading

Scalable Reading is a collaborative blog that brings together four literary critics who are interested in quite different topics and approach them with quite different substantive or methodological assumptions but share the belief that digital texts and tools for their analysis have much to offer to the discipline of Literary Studies.

Martin Mueller will shortly be Professor Emeritus of English and Classics at Northwestern University. His books include a monograph on the Iliad (2009, 2nd ed.) and Children of Oedipus and other essays on the imitation of Greek tragedy, 1550-1800.  He is a co-editor of the Chicago Homer and the general editor of WordHoard.

Stephen Ramsay is Associate Professor of English at the University of Nebraska and a fellow of the Center for Digital Research in the Humanities at that university. He is the author of Reading Machines: Toward an Algorithmic Criticism (2011) and co-author (with Patrick Juola) of the forthcoming Mathematics for the Humanist (Oxford). He keeps a blog of his own at http://lenz.unl.edu/

Ted Underwood is Associate Professor of English at the University of Illinois (UIUC). He is the author of The Work of the Sun: Literature, Science, and Political Economy, 1760-1860, and has published widely on Romantic literature and culture.

Matthew Wilkens is Assistant Professor of English at Notre Dame. He works on contemporary literary and cultural production with particular emphasis on the development of the novel after World War II. His recently completed book, Revolution: The Event in Modern Fiction, combines these interests with related theoretical issues including allegory, event, and encyclopedism in the 1950s and ’60s. He also works extensively with new techniques of computational and quantitative cultural analysis, including literary text mining, geolocation extraction, and network analysis. He keeps a blog of his own at Work Product.