Back to the Future or Wanted: A Decade of High-tech Lower Criticism

Note: The title of this blog entry is the title of a keynote address I gave at the Chicago Digital Humanities and Computer Science Colloquium, held November 18–19, 2012, at the University of Chicago. It is lightly edited and shortened. I have added a postscript called “A Decade Later”.

This talk is about the challenges and opportunities posed by the EEBO-TCP corpus. Between 2015 and 2020 (and beginning with an initial release of ~25,000 texts) TEI-XML transcriptions of ~70,000 texts – at least one version of every title published between 1473 and 1700 – will pass into the public domain. Once this resource is in the public domain it will for most scholarly purposes replace other surrogates of the printed originals. It will be free, it will often be the only source, and it will nearly always be the most convenient source for the many look-up activities that make up much of scholarly work.

EEBO-TCP is a magnificent but flawed enterprise, and few of its transcriptions fully meet the scholarly standards one associates with a decent edition in the print world. Who will guarantee the integrity of this primary archive that will be the foundation for much future scholarship? In a print-based documentary infrastructure there was a simple answer to the question “Who provides quality assurance (QA in modern business parlance) for the primary sources that undergird work in your discipline?” It was “my colleagues,” and it might include “I do some of that work myself.” From the nineteenth century well into the middle of the twentieth century, “Lower Criticism” of one kind or another counted as serious scholarly labor and made up a significant, though gradually declining, share of the work of humanities departments.

Consider Theodor Mommsen. In 1853 and 1854 he published the first volume of his Roman History, and he started the Corpus Inscriptionum Latinarum (CIL), the systematic gathering of inscriptions from all over the Roman empire. For the next five decades he  was the chief editor and a major contributor to its sixteen volumes, which transformed the documentary infrastructure for the study of Roman history. Since the early 20th century, a student of Roman history with access to a decent research library has had “at hand” a comprehensive collection of the epigraphic evidence ordered by time and place. That has made a huge difference to the study of administrative, legal, and social history.

The CIL is a majestic instance of the century of curatorial labour that created the documentary infrastructure for modern text-centric scholarship in Western universities. In that world the integrity of primary data rested on what you might call a Delphic tripod of cultural memory with its three legs of scholars who made editions, publishers who published them, and librarians who acquired, catalogued, and made them available to the public.  After World War II  there was a growing consensus that you no longer needed to worry about data curation because a century of it had succeeded in creating a print-based data infrastructure that from now on you could take for granted. For the last forty years many disciplines in the humanities have lived off the capital of a century of editorial work while paying little attention to the progressive migration of textual data from books on shelves to files on servers or in ‘clouds’. Using some back-of-the-envelope calculations, Greg Crane argued in 2010 that classicists now allocate less than 5% of their labour to curatorial work (using the term in its broadest sense). That sounds about right for departments of English or History that I know something about. It is possible for individuals within fields of activity to make choices that make professional and economic sense within the field but lead the field as a whole astray. The steel industry of the seventies or the current monoculture of corn in Iowa come to mind.

A decade ago Jerry McGann observed that “in the next fifty years the entirety of our inherited archive of cultural works will have to be re-edited within a network of digital storage, access, and dissemination” (quoted from his essay “Our Textual History” in TLS, Nov. 20, 2009). This digital migration has so far made slow progress. The integrity of an emerging cyberinfrastructure for text-centric scholarship has received remarkably little attention in the discourse of disciplines that will increasingly rely on digital surrogates of their primary sources. The current buzz about ‘Digital Humanities’ or ‘DH’ has very little to do with serious work on that front.

Back to the EEBO-TCP corpus and the ~45,000 texts (~2 billion words) that have so far been transcribed.  EEBO-TCP will serve as the de facto documentary infrastructure for much Early Modern scholarship, accessed increasingly via mobile devices that provide each scholar with his or her own “table of memory.”  Montaigne had a couple of thousand books in his tower library.  A little more than two years from now, graduate students will be able to load 25,000 books from Montaigne’s world (and beyond) onto their Apple, Google, or Samsung tablets as epubs or raw XML files.

“How bad is good enough” when it comes to the quality of those texts? A lot of work needs to be done if you believe, as I do, that a digital surrogate with any scholarly ambitions should at least meet the standards we associate with good enough editions in the print world (I am ignoring here the additional features required to make the digital surrogate fully machine actionable).  There are two interesting properties of the TCP corpus that affect the discussion of data curation and quality assurance. Both of these have analogues in other large collections of primary materials. In fact, the TCP archive exhibits characteristic features of the large-scale surrogates of printed originals that will increasingly be the first and most widely consulted sources.

First, the TCP is published by a library. Second, in a collection of printed books, the boundaries between one book and another or one page and another impose physical barriers that constrain what you can do within and across books or pages.  In a digital environment, these constraints are lifted for many practical purposes. You can think of and act on the current TCP archive as 45,000 discrete files, 2 billion discrete words, or a single file.  This easy concatenability is the major reason for the enhanced query potential of a full-text archive. It also has the potential for speeding up data curation within and across individual texts.

If you come across a simple error in a book it is usually a matter of seconds to correct it in your mind. It takes much longer to correct it for other readers of the book. You must provide the correction in a review or write to the author/publisher. The publisher must incorporate it into a second edition, and libraries must buy the second editions before the corrected passage is propagated to readers at large. That is a typical form of data curation in a world where the tripod of cultural memory rests on the actions of scholars, publishers, and librarians. In a digital world that tripod rests on the interactions of scholars, librarians, and technologists. In a well-designed digital environment scholars (and indeed lay people of all stripes) can directly and immediately communicate with the library/publisher. If I work with a text and come across a phenomenon requiring correction or completion I can right away do the following:

1. Log in (if I’m not logged in already) and identify myself as a user with specified privileges.
2. Select the relevant word or passage and enter the proposed correction in the appropriate form.

If I do not have editorial privileges, my proposal is held for editorial review. If I am authorized to make or approve corrections my proposal is forwarded for inclusion in the text either immediately or (the more likely scenario) the next time the system is re-indexed. The system automatically logs the details of this transaction in terms of who did what and when.
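To make the mechanics concrete, here is a minimal sketch, in Python, of what such a correction transaction might look like. Everything in it (the function name, the role labels, the shape of the log entry) is a hypothetical illustration of the workflow just described, not the interface of any existing system.

```python
from datetime import datetime, timezone

# Hypothetical role model: proposals from ordinary users are held for
# editorial review; proposals from editors are accepted outright
# (subject, as above, to the next re-indexing of the system).
EDITORIAL_ROLES = {"editor", "reviewer"}

curation_log = []  # stands in for a central database table


def submit_correction(user, role, text_id, locus, old_reading, new_reading):
    """Log a proposed correction and route it by privilege level."""
    status = "accepted" if role in EDITORIAL_ROLES else "pending review"
    entry = {
        "who": user,
        "when": datetime.now(timezone.utc).isoformat(),
        "text": text_id,        # e.g. a TCP identifier
        "where": locus,         # e.g. a page/line reference or an XML id
        "from": old_reading,
        "to": new_reading,
        "status": status,
    }
    curation_log.append(entry)  # the automatic who/what/when record
    return entry


# A reader without editorial privileges proposes a correction:
submit_correction("areader", "user", "A12345", "p. 37, l. 12", "beart", "heart")
```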

The obstacles to such an environment are not primarily technical or financial. They are largely social. You need substantial adjustments in the ways scholars and librarians think about their roles and relationships. Scholars often complain about the shoddiness of digital resources, but if  they want better data they must recognize that they are the ones who must provide them.  And they need to ask themselves why in the prestige economy of their disciplines they have come to undervalue the complexity and importance of “keeping” (in the widest sense of the word) the data on which their work ultimately depends. Librarians need to rethink the value chain in which the Library ends up as a repository of static data. Instead they should put the Library at the start of a value chain whose major component is a framework  in support of data curation as a continuing  activity by many hands in many places, whether on an occasional or sustained basis. Such a model of collaborative data curation is the norm in genomic research, a discipline that from the perspective of an English department can be seen as a form of criticism (both higher and lower) of texts written in a four-letter alphabet.

Some of the best thinking on these issues has come from Greek papyrologists,  a very special scholarly club with highly specialized data, tools, and methods, but with some good lessons for the rest of us.  Papyrologists have for a century kept a Berichtigungsliste or curation log as the cumulative and authorized record of their labours. The Integrating Digital Papyrology project (IDP) is based on the principle of  “investing greater data control in the user community.”  Talking about the impact of the Web on his discipline, Roger Bagnall said that

these changes have affected the vision and goals of IDP in two principal ways. One is toward openness; the other is toward dynamism. These are linked. We no longer see IDP as representing at any given moment a synthesis of fixed data sources directed by a central management; rather, we see it as a constantly changing set of fully open data sources governed by the scholarly community and maintained by all active scholars who care to participate.

He faced head-on the question: “How … will we prevent people from just putting in fanciful or idiotic proposals, thus lowering the quality of this work?” and answered that collaborative systems

are not weaker on quality control, but stronger, inasmuch as they leverage both traditional peer review and newer community-based ‘crowd-sourcing’ models. The worries, though, are the same ones that we have heard about many other Internet resources (and, if you think about it, print resources too). There’s a lot of garbage out there. There is indeed, and I am very much in favor of having quality-control measures built into web resources of the kind I am describing.

A collaboratively curated Berichtigungsliste or curation log offers an attractive model for coping with the many imperfections of the current TCP texts.  The work of many hands, supported by clever programmers, quite ordinary machines, and libraries acting consortially, can over the course of a decade substantially improve the TCP texts and move them closer to the quality standards one associates with good enough editions in a print world.  Imagine a social and technical space where individual texts live as curatable objects continually subject to correction, refinement, or enrichment by many hands and coexist at different levels of (im)perfection.  You could also imagine a system of certification for each text — not unlike the USDA hierarchy of grades of meat from prime to utility.   But “prime” would always be reserved for texts that have undergone high-quality human copy-editing.  Such a system would build trust and would counteract the human tendency to judge  barrels by their worst apples.

What I have said about collaborative curation of the TCP texts applies with minor changes to other archives. Neil Fraistat and Doug Reside in conversation coined the acronym CRIPT for “curated repository of important texts”. Not everything needs to be curated in that fashion, but high degrees of curation are appropriate for some texts, whether for their intrinsic qualities or evidentiary value. Large consortial enterprises like the HathiTrust or the DPLA might be the proper institutional homes for special collections of this type. Somewhere in the middle distance I see the TCP collection as the foundation of a Book of English defined as

• a large, growing, collaboratively curated and public domain corpus
• of written English since its earliest modern form
• with full bibliographical detail
• and light but consistent structural and linguistic encoding

It will take a while to get there. It is a lot of work, and like woman’s work, it is  “never done.” But progress is possible.  Here is the challenge of the next decade(s)  for scholarly data communities and the libraries that support them: put digital surrogates of your primary sources into a shape that will

  1. rival the virtues of good enough editions from an age of print, and
  2. add features that will allow scholars to explore the full query potential of the digital surrogate.

I use “good enough” in the sense Donald Winnicott used it when he argued against a generation of psychoanalysts who were fond of blaming the mother. He defined a quite modest level of maternal competence: going beyond it would not add a lot, but dropping below it would get bad very fast. Much of the digital and increasingly dominant version of our textual heritage will require a fair amount of mothering before it is clearly good enough.

Postscript: A Decade Later

The 2012 talk sketched an ambitious agenda. Here are some facts about the modest progress we have made since then. Between 2013 and 2015 about 20 students from Amherst, Northwestern, and Washington University in St. Louis made over 50,000 corrections in some 510 Early Modern plays. The number of textual defects per 10,000 running words is a crude but telling measure of how close texts come to being “good enough” for many purposes. The median rate of uncorrected drama texts in the TCP corpus was 14.5 defects per 10,000 words. The work of these students reduced the defect rate by an order of magnitude to 1.4, an improvement visible to a casual reader.
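The measure itself is simple arithmetic: defects divided by running words, scaled to 10,000. A minimal sketch in Python (the play length of 20,000 words is an invented illustration, not a corpus figure):

```python
def defect_rate(defects: int, words: int) -> float:
    """Textual defects per 10,000 running words."""
    return defects / words * 10_000

# A hypothetical 20,000-word play with 29 unresolved defects sits at
# the pre-curation median rate cited above:
print(defect_rate(29, 20_000))       # 14.5

# At the post-curation rate of 1.4, the same play would retain only
# about three defects:
print(round(1.4 * 20_000 / 10_000))  # 3
```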

In 2016 Shakespeare His Contemporaries became part of EarlyPrint, a more broadly based enterprise doubly centered at Northwestern and Washington University in St. Louis. EarlyPrint currently has close to 60,000 texts. Most of them come from EEBO-TCP, but there are also ~4,500 Early American texts (Evans TCP) and ~2,000 English 18th-century texts (ECCO TCP). The EarlyPrint versions of these texts are linguistically annotated and can be searched via a corpus query engine, and close to 650 texts are “digital combos” that offer a side-by-side display of the text and high-quality digital images. The technical infrastructure supports collaborative curation by anybody anywhere.

The acronym “FAIR” describes data that meet standards of findability, accessibility, interoperability, and reusability. It is a more elaborate version of Ranganathan’s fourth law of library science: “Save the time of the reader.” It summarizes an ethos well captured by Brian Athey, chair of Computational Medicine at Michigan, when he said at a conference about “research data life cycle management” that “agile data integration is an engine that drives discovery.” For Early Modern studies in the Anglophone world, the creation of the TCP archives has been the most monumental achievement. An important goal of EarlyPrint has been to make those texts FAIRer.

Since 2017 anybody with a computer and an Internet connection has been able to offer textual emendations via the EarlyPrint Annotation Module. It makes textual correction as easy as writing in the margin of a book. I call it “curation en passant”. As soon as a reader enters and saves a correction in a little data entry field next to the text, the who, what, when, and where of the correction are automatically recorded in a central curation log. Emendations are provisionally displayed in the text, but their final integration into the source text is subject to editorial review.

A “digital combo” and a computer with a good screen (larger is better) will provide a user with a better than good enough text lab for many basic – and some not so basic – forms of philological labour. Work of this kind follows a “find it, fix it, log it” pattern, where the finding and the logging typically take more time than the fixing. The EarlyPrint environment significantly reduces the time cost of “finding” and turns the “logging” into an automated process.

Plays have continued to be a source of special interest. In the EEBO-TCP corpus as a whole the interquartile range of defects per 10,000 words lies between 1 at the 25th percentile and 48 at the 75th, with a median of 12. Today the values for the 814 plays in the EarlyPrint corpus run from 0 at the 25th percentile to 16.2 at the 75th, with a median of 2.8 defects. In this important subcorpus of Early Modern texts a quarter of the texts have no defects, half of them have at most five defects per play, and three quarters of them have on average at most one defect per page. In practice, defects cluster heavily. Most of the remaining defects in EarlyPrint plays occur on the pages of texts for which there are currently no good digital surrogates on the Internet.
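The per-page gloss follows from these percentile figures under an assumption about page length. A sketch (the ~600 words per page is my own illustrative figure, not a measured corpus average):

```python
def defects_per_page(rate_per_10k: float, words_per_page: int = 600) -> float:
    """Convert a defect rate per 10,000 words into defects per page."""
    return rate_per_10k * words_per_page / 10_000

print(defects_per_page(16.2))  # ~0.97: about one defect per page at the 75th percentile
print(defects_per_page(2.8))   # ~0.17 of a defect per page at the median
```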

Last summer, three Classics majors at Northwestern tackled a corpus of 120 medical works. These were in much worse shape than the plays: the three figures for the interquartile range before curation were 14.7, 53.2, and 156.3. The students made about 20,000 corrections and reduced the interquartile range to 3.3, 13.3, and 30. They paid special attention to Thomas Cogan’s Haven of Health, a characteristic late 16th-century work. With the help of better images from the Countway Library of Medicine (via the Internet Archive) and help from a machine-learning experiment, they corrected more than 1,000 defects. Its two dozen remaining defects (1.8 per 10,000 words) are philological “cruxes” that have so far defied solution. But the EarlyPrint version of The Haven of Health is sound enough for most purposes.


This work has also demonstrated that motivated undergraduates with an interest in Early Modern texts can be easily trained to do most of this low-level but essential philological labour. I close with the reflections of two students on their work in this project. In the summer before her senior year in 2013 Nicole Sheriko was a member of the first collaborative team that worked on what was then called Shakespeare His Contemporaries. She went on to graduate school, working at the “intersection of literary criticism, cultural studies, and theater history”, has published a handful of essays in leading journals, and is now a Junior Research Fellow at Christ’s College, Cambridge:

Having the experience of working with a professor on a project outside of the classroom–especially in digital humanities, which everyone seems to find trendy even if they have no idea what it entails–was a vital piece of my graduate school applications, I think, and other students may see a similar benefit in that.

 

In a less vulgarly practical sense, though, I would say that working on what was then Shakespeare His Contemporaries made a significant difference in how I approach studying the field of early modern drama. The typical college course can only focus on a handful of canonical texts but working across such an enormous digital corpus reoriented my sense of how wide and eclectic early modern drama is. It gave me a chance to work back and forth between close and distant reading, something I still do as I reconstruct the corpus of more marginal forms of performance from references scattered across many plays. A lot of those plays are mediocre at best, and I often remember a remark you once made to us about how mediocre plays are so valuable for illustrating what the majority of media looked like and casting into relief what exactly makes good plays good. The project was such a useful primer in the scope and aesthetics of early modern drama. It was also a valuable introduction to the archival challenges of preservation and digitization that face large-scale studies of that drama. Getting a glimpse under the hood of how messy surviving texts are–both in their printing and their digitization–raised all the right questions for me about how critical editions of the play get made and why search functions on databases like EEBO require a bit of imaginative misspelling to get right. That team of five brilliant women was also my first experience of the conviviality of scholarly work, which felt so different from my experience as an English major writing papers alone in my room. That solidified for me that applying to grad school was the right choice, a sentiment likely shared by my teammate Hannah Bredar, who–as you probably know–also went on to do a PhD. Once I got to grad school, the project also followed me around in my first year because I took a course in digital humanities and ended up talking a lot about the TCP and some of the little side projects I ended up doing for Fair Em, like recording the meter of each line to see where breakdowns occurred. I even learned some R and did a final project looking for regional markers of difference across the Chronicling America historical newspaper corpus. So, in big ways and small, the work I did at NU has stayed with me.

After her freshman year, in the summer of 2022, Lauren Kelley was part of the team that worked on medical texts. In the section about “Academic and Personal Development”  in the final report to the College she wrote this:

    As a premed student, spending the summer learning about the historical tradition of Western medicine has been incredibly valuable. Reviewing the medical corpus allowed me to understand how the field of medicine has evolved throughout the early modern period, and to track the gradual development of knowledge and medical practice. Although the vast majority of knowledge in these books is outdated, the true value of this summer’s work lies in acquiring an intimate understanding of medical history from primary documents, as well as learning how to better interpret and analyze texts from this period. I also enjoyed having the opportunity to use my knowledge of Latin in a setting outside of the classroom, which reinforced the importance of studying Classics and its multitude of applications. Writing the final report for Haven of Health was an especially fulfilling experience that stimulated my academic growth; I had the opportunity to synthesize my observations throughout the 8 weeks and expand on them, as well as research a subject of interest to me and write about it.

 

Having just finished freshman year, this summer was the first experience that I have had with collegiate research. It was extremely enriching for me to spend eight weeks in a collaborative, research-oriented environment. I feel that coordinating several aspects of my work with my coworkers has vastly improved my teamwork skills. Finally, my confidence in my own academic abilities has increased, especially in my ability to apply knowledge in a real-world setting. Overall, this opportunity was a great introduction into how research is performed in humanities, and I am excited to further develop the skills I acquired this summer throughout my academic career.

 

Thou com’st in such a questionable shape: Data Janitoring the SHC corpus from the perspectives of Hannah, Kate, and Lydia

 

Below are the reflections of Hannah Bredar, Kate Needham, and Lydia Zoells about their adventures in the mundane world of Lower Criticism, about which I wrote in an earlier blog and of which the digital surrogates of our cultural heritage will need a lot in the decades to come. Racine observes in his preface to Bérénice that toute l’invention consiste à faire quelque chose de rien (all invention consists of making something from nothing). These three “inventors”, after spending much time with commas and stray printers’ marks, came up with excellent insights into the business of criticism and the (un)certainties of making sense of texts, especially old ones.

Kate and Shakespeare’s scepticism

Is this a comma that I see before me,
Its tail hanging down? …Or art thou but
A comma of the mind, false punctuation
Proceeding from the text-oppressed brain?

Correcting transcriptions can sometimes feel like banging one’s head against a massive, impermeable wall. As often as I made a definitive correction, it seemed, I came across something that appeared irresolvable. Is this a period or a badly printed comma? A misaligned end-stop or the remnant of an intended colon? How long can I stare at it before I realize I will never know? We (undergraduates like me, new to the instability of early modern texts) arrive with a conception of textual clarity and authenticity that in many cases is simply not there. Some cases might be answered by looking at more books, more witnesses, visiting more libraries over more hours, but this isn’t conducive to curating an entire literary corpus for digital publication. And even were we to fully collate every text in the database, some of these questions might never be resolved. This means making peace with the unsatisfactory text and setting our aims somewhere less idealistic: closer to “good enough.” We turn towards clarity, functionality, and truthfulness to the text without forcing on it a definitiveness it does not have in every instance.

In a previous post on this blog (“How to fix 60,000 errors,” June 22, 2013), Prof. Mueller noted that the original 60,000 known errors in the SHC transcriptions constituted just 0.4% of the data in the database. That number is statistically insignificant for computer analysis of the texts, but even a cursory look at the transcriptions themselves confirms that the presence of so many errors is prohibitive to human readers, whatever the statistical significance. Making corrections at this level was our aim this summer, to help propel the transcriptions from masses of computer data to texts readable (and enjoyable) for people. For those who hail (with trepidation) the digital humanities as the end of reading and human response, our work is a reminder that digital texts and projects are ultimately designed with human readers in mind. Our sense of “good enough” is governed not by statistical significance but by the demands of human persnickety-ness, of the desire to immerse oneself in a text that at least appears to be “complete.”

Out damn’d ink blot! Out I say

Why should Macbeth be the play that lends itself most easily to (admittedly quite silly) comparisons with this work? When the instability of the source texts themselves obstructs our own desire for authority, how do we respond? What degree of alteration would be considered “murthering” the text, and how do we square our conscience with these, arguably inescapable, choices about what to transcribe, what to make more legible, and what to leave as crux? This might feel oddly dramatic as written here, but the experience of sitting face to face with a 16th century book, of making choices about how that text is transmitted and transcribed, feels something akin to tragedy for the conscientious and affectionate reader. And while this must be old-hat for those who work with these texts every day, it was entirely new to me. The cruxes I’ve described did not represent a majority of the errors we examined, but they are the ones that stick out in my memory, that solicited a sense of deep frustration strangely at odds with the silent stillness of the reading room. Yet more powerful than this frustration was the feeling of awe at these texts that had, somehow, survived—survived fire and flood and most of all indifference to sit before me, open and ready to survive once more.

Hannah’s Folger Reflections

Washington waxed feverish outside the walls of the Folger Shakespeare Library, but a different atmosphere persisted within. The rooms were chill, verging on icy; the wool-clad scholars were, wittingly or otherwise, alert. I sat with Lydia Zoells and Kate Needham in what Folger regulars call the New Room (ca. 1980), attracted to its abundant natural light. We were there to perform a task: in the course of two weeks, we intended to correct the maximum number of the remaining 20,000 errors in the database of early modern plays transcribed by Annolex. This was Phase Two of the Shakespeare His Contemporaries project, in collaboration with the greater Text Creation Partnership initiative. Previously, Professor Mueller had enlisted a handful of undergraduates, including myself,   to check Annolex’s translation of texts with coinciding EEBO images. Unfortunately, due to the fact that these images were microfilm photographs of other pictures, the quality was often too poor to ascertain whether a mark on a page was an exclamation point or an erroneous blot of ink.

Lydia, Kate, and I convened at the Folger in order to determine if the original manuscripts housed there could illuminate any of the troubling instances that the digital tools could not. Previously, we had employed this brass-tacks method of cross referencing on an individual basis, adding the Bodleian Library, the University of Chicago Library, the Northwestern University Special Collections, and the Newberry Library to our list of visited sanctums. The Folger, however, seemed to hold the key to our transcription puzzle. We placed orders for over 50 texts, all of which the Library had in its vaults. Our work did not cease: we did not halt our editing when a tourist set off a fire alarm, and we only glanced up when specialists hung Henry Fuseli’s life-sized painting of the Macbeth witches on the facing wall.

As the days progressed we saw that most of the errors that we were correcting were ambiguous punctuation marks. In former phases of the project it was far easier to discern the meaning of a single word from its context clues than it was to determine whether a faint mark was a semicolon or a comma from its context alone, so the punctuation remained uncorrected. Even at the Folger it was often too difficult to identify such a mark with total certainty. Thus, we faced a recurring dilemma: do we leave the error uncorrected and the play incomplete, or correct the error to the best of our thinking and risk changing the text? This conflict inspired a number of conversations about the ethics of guessing at such a correction and the chance of accidentally transforming a text from its original form. Occasionally, Folger staff and scholars would join our conversations. They cited the movement in the 18th century to “improve” manuscripts such as these, when scholar-editors would add apostrophes and commas with prodigal liberalism in the hopes of clarifying an author’s “intended” rhythms and cadence. Inferring authorial intent of sixteenth century punctuation, when standard punctuation did not exist, was not only impossible but also a time sink, which we could not afford. One wise Folger staff member suggested that at some point an editing effort could be “good enough” and a text set aside.

These conversations were tea time discussions. Each afternoon and with charming inconsistency, a bell would ring: scholars would file out of the reading rooms, descend the stairs to the cafeteria, and revive as they nibbled biscuits and sipped steaming mugs. After witnessing a few days of the animation and conversation that arose during these mid-afternoon gatherings, I realized that tea time was crucial to intellectual life at the Folger. It was here that readers shared with one another their findings, their theories, and their academic mirth. Based on a mutual interest in English breakfast tea and early modern books, a community of scholars took shape. The Shakespeare His Contemporaries project strives to broaden this community. With free access to a cleaned-up database of early modern texts, a greater public can in turn discuss “moral editing,” the risk of drawing a text away from its original form, and the concept of work that is “good enough.” By adding more voices to these conversations, the worlds of both early modern literature and digital humanities will have the opportunity to complicate, broaden, and flourish.

Lydia on the Materiality of the Text

As my undergraduate career has progressed, I have become increasingly aware of, and fascinated by, the material nature of books. This has been facilitated by the fact that my studies tend toward literature that was written before 1700. For a long time, like most people, I took textual stability for granted and never thought about where, or rather what, books came from. But slowly, I became acquainted with EEBO, started reading textual introductions, and began to seek out classes that considered the materiality of texts. In my junior year, I joined Professor Joseph Loewenstein’s Spenser Lab and his project to produce an edition of the collected works of Edmund Spenser, diving headfirst into that rich area of interaction between the digital humanities and book history. When Professor Loewenstein suggested that I become involved with Professor Mueller’s project, Shakespeare His Contemporaries (SHC), I agreed because I was excited by the opportunity to work with early modern books in person as well as to contribute to early modern scholarship in a meaningful way.

Between April and July 2015, sometimes with Kate and Hannah, sometimes alone, I corrected transcriptions using the first edition play texts at the University of Chicago Special Collections Resource Center, the Newberry Library, the Folger Shakespeare Library, and the Houghton Library at Harvard University. Becoming comfortable working in these libraries and handling the delicate books was certainly one of the most valuable parts of my experience. The librarians were very accommodating and patient when it came to instructing a novice in the delicacies of handling the texts, and soon I was at ease with the books and with my surroundings. Each library I visited has its own atmosphere, and each one was a pleasure to get to know (though they were all kept at arctic temperatures). The books themselves offered their own special pleasures. I enjoyed finding the classified advertisements pinned inside front covers, engravings of stiff-looking authors, and the odd annotations left by early readers.

While the work of tracking down and entering punctuation marks, letters, and words was in large part tedious, it would sometimes bring me in contact with interesting passages. One of the great pleasures of working with colleagues who have a similar enthusiasm for early modern theater is that we often shared these moments with one another. This kind of work does not lend itself to a depth of understanding of the body of literature with which we were working, but I do believe that splashing in the pool that is the SHC corpus is valuable at this point in our undergraduate careers. We gained a kind of broad familiarity with the early modern dramatic corpus, and often found plays that interested us that we did not know existed before.

It is important to me that our project this summer will contribute to the dissemination of quality transcriptions of early modern plays, especially of little known works. It was exciting when a correction I made felt meaningful: when it made a significant semantic difference in the text, or when it brought up an interesting question. It is my hope that these transcriptions will continue to be questioned and checked, but also that they will make the plays easier to read and more transparent for scholars and students. I have often been frustrated by the difficulty of finding good copies of less canonical plays, and making good transcriptions publicly available is a good start.

The Great Digital Migration

The following was first published on an earlier version of this blog in the spring of 2010. It is republished here with light revisions.

I spent some time with the papers of a recent conference at Virginia: Online Humanities Scholarship: The Shape of Things to Come. Here are some comments on them, with quotations from the papers, which you can download from http://shapeofthings.org/papers/.

I am not trying to cover the whole conference but let my attention be guided by my current interest in the quality and interoperability of the digital surrogates of Early Modern English texts — the approximately half million bibliographical items catalogued in the English Short Title Catalog, of which about 30,000 exist now in full-text XML transcriptions and some 70,000 will exist in that format in 2015, when they will all pass into the public domain. What forms of collaboration could make these digital surrogates as good as they ought to be if they are to serve as the basis for future scholarly inquiry? How did The Shape of Things To Come help me think about that question?

By ‘quality’ I mean the readerly properties of a digital surrogate. If in the context of scholarly work you look at a digital version of Holinshed’s Chronicles, Hobbes’ Leviathan, or Milton’s Paradise Lost you want that version to be as accurate and readable as a standard print edition. By ‘interoperability’ I mean the algorithmic amenability of the digital surrogate: its capacity for being variously divided or manipulated, combined with other texts for the purposes of cross-corpus analyses, having data derivatives extracted from it, or having levels of metadata added to it. Greg Crane envisages a future in which “digital surrogates for human cultural heritage … flow freely and instantaneously back and forth between humans and machines.” In a Utopian moment at the end of his opening talk Jerry McGann envisages an “online World Library” and lists types of resources that would not fit because

  1. “they meet traditional scholarly standards but are designed in digital formats – typically HTML – that are … unsustainable [and] cannot exploit the integrating functions that [make] web technology such a powerful social network”
  2. they are “internally well-designed but … by choice or circumstance … do not participate in … second-order integration”
  3. they “lack any online presence at all: university press backlists … or the current and/or back issues of many scholarly journals”
  4. they are “materials being hurled on the Internet in corrupt forms by Google and other commercial agents: materials that are badly scanned, careless or merely randomly chosen, poorly if at all structured.”

This is as good a list of hurdles as the proofs that Lysander and Hermia advance for the claim that “the course of true love never did run smooth.” But like true love, full interoperability remains a worthy goal: by always keeping it in mind you may sometimes fall a little less short of it.

Should we despair or hope? Almost a decade ago, McGann observed that “In the next fifty years the entirety of our inherited archive of cultural works will have to be re-edited within a network of digital storage, access, and dissemination. This system, which is already under development, is transnational and transcultural.” (Cited from his article “Our Textual History” in TLS, November 20, 2009.) In his opening remarks to the conference (“Sustainability: The Elephant in the Room”) McGann does not dwell with satisfaction on progress made. Instead he analyzes the institutional and political obstacles that block progress towards a realization in the digital realm of a goal shared by all scholars: “We all want our cultural record to be comprehensive, stable, and accessible. And we all want to be able to augment that record with our own contributions.”

McGann sees digital technology as a decisive factor in disrupting a value chain of scholarly work in which scholars, publishers, librarians, and patrons had long-established and clearly understood roles. The Great Digital Migration that has been underway for two decades has been a “hotch-potch” and “the community of scholars has played only a minor role in shaping these events. We have been like marginal, third-world presences in these momentous changes – agents who have actually chosen an adjunct and subaltern position.” Elsewhere in this talk McGann speaks even more bluntly:

It’s a fact that most colleges and universities have not formulated comprehensive or policy-based approaches to online humanities scholarship. Resources for the use of media in the classroom, including electronic and web media, are fairly common. But a commitment of institutional resources to encourage digital scholarship is very rare. … But it’s clear that the universities are responding to facts on the ground: i.e., to the scholars themselves and their professional agents. Most scholars and virtually all scholarly organizations have stood aside to let others develop an online presence for our cultural heritage: libraries, museums, profit and non-profit commercial vendors. Funding agents like NEH, SSHRC, and Mellon have thrown support to individual scholars and small groups of scholars, and they have encouraged new institutional agents like Ithaka, Hastac, SCI, and Bamboo. But while these developments have increased during the past seventeen years – i.e., since the public emergence of the Internet – the scholarly community at large remains shockingly passive.

McGann thinks of this gloomy vision as a call to action under the sign of “not what but who” and ends his talk by asking both the participants at this conference and humanities scholars at large: “What are you prepared to do?”

The most promising answers point to scholarly communities taking charge of the texts about which they care. Paolo D’Iorio, the general editor of Nietzsche Source, looks back to the “tradition of the academic societies of the XVII century” as the model for “open scholarly communities on the Web” but argues that these “do not yet exist and … will be difficult to create.” He offers a subtle “Scholarly Ontology” about the ways in which scholarly communities and text corpora mutually constitute each other:

If a scholarly community intends to conduct research on a certain topic, it first needs to define which documents or objects to consider as its primary sources. When a research line is about to be developed and consolidated, a catalogue of primary sources is compiled, usually by archivists or librarians. The catalogue of primary sources lists the relevant classes of objects and often includes the complete list of their instances. … Catalogues of secondary sources come later, and are written by scholars or librarians… The distinction between primary and secondary has a fundamental epistemic value. According to Karl Popper, what distinguishes science from other human conversation is the capacity to indicate the conditions of its own falsification. In scholarship, the conditions of falsification normally include the verification of hypotheses on the basis of a collection of documents recognized by a scholarly community as relevant primary sources. Thus we can affirm that the distinction between primary and secondary sources exhibits the conditions for falsifying a theory in the humanities.

With textual documents, however, the distinction between ‘primary’ and ‘secondary’ varies with the scholar’s inquiry: “an article written by Nietzsche on Plato is a primary source to Nietzsche scholars, but it is a secondary source to Plato scholars.” Primary texts — or more accurately, texts treated as primary — are objects of special care. We “cherish and preserve” them, as Penelope Kaiserlian, the director of the Rotunda Press, says of the 60,000 distinct documents in the “cross-searchable collection of American Founding Era documentary editions.” If you are not given to a rhetoric of pietas you would still care to get those texts right, because they provide your only ground for verification.

Does it matter how good or bad the digital texts are if somewhere in the library there is a printed copy of a critical edition with all the variants in a state-of-the-art apparatus criticus? The answer is ‘yes’. McGann may well be right that the ultimate backup of the “permanent core” in the “scholarly materials” of the Rossetti Archive will be a print-out that will “fill two dozen or more large volumes.” But the sheer convenience of the Web means that texts will be increasingly cited and quoted from digital sources, and the quality of those sources will determine the quality of texts in use. Whose job is it to get them right?

Roger Bagnall’s description of Integrating Digital Papyrology (IDP) is a very instructive account of a digital scholarly community rallying around its data:

Of course, the world of the Web has changed dramatically since 1992, and the possibilities today are much richer than they were then. I would say that these changes have affected the vision and goals of IDP in two principal ways. One is toward openness; the other is toward dynamism. These are linked. We no longer see IDP as representing at any given moment a synthesis of fixed data sources directed by a central management; rather, we see it as a constantly changing set of fully open data sources governed by the scholarly community and maintained by all active scholars who care to participate. One might go so far as to say that we see this nexus of papyrological resources as ceasing to be “projects” and turning instead into a community.

Can one generalize from the experiences of so ‘nichy’ a sub-specialty as Greek papyrology? By virtue of the fragile and fragmentary nature of the sources that constitute their discipline, papyrologists are rarely more than two steps away from the material base of their data. In this regard they are quite untypical of humanities scholars, especially of scholars in English departments. If we are worried about the apathy of humanists when it comes to the transcription of a cultural heritage, what can we learn from the papyrologists? Would we not be better off looking at the crowdsourcing of historical Australian newspapers — a topic about which Rose Holley has written a splendid report with the attractive title Many Hands Make Light Work?

No doubt, collaborative curation of digital surrogates of printed texts will more often be like working on newspapers than working on papyri. On the other hand, Bagnall’s discussion highlights with exemplary clarity the problems of ownership, recognition, and quality control that are central to scholarly digital projects. If I think of colleagues with a philological or editorial conscience, they pretty much operate on the principle that a text cannot be trusted if it is not printed. In practice, there is much to be said for this view. In theory it is wrong, and Bagnall tells you why. You learn from him about

the Berichtigungsliste, a remarkable research tool in papyrology that collects periodically—there have been twelve volumes since its inception in 1915—all corrections proposed to the texts of papyrus documents (the universe of DDbDP and HGV), new datings or provenances suggested, and a fair amount of bibliography about the documents. It has for two generations been a joint project of Leuven and Marburg, now Leuven and Heidelberg. Before corrections are registered now, the editors of the BL do their best to check them to see if they think they are correct; if not, they are reported but with disapproval attached. How, my friend asked, will we prevent people from just putting in fanciful or idiotic proposals, thus lowering the quality of this work?

Bagnall argues persuasively that you can do as well or better in a digital and collaborative environment:

These systems are not weaker on quality control, but stronger, inasmuch as they leverage both traditional peer review and newer community-based ‘crowd-sourcing’ models. The worries, though, are the same ones that we have heard about many other Internet resources (and, if you think about it, print resources too). There’s a lot of garbage out there. There is indeed, and I am very much in favor of having quality-control measures built into web resources of the kind I am describing.

But for Bagnall, the concern with quality control often masks a concern for control:

“Ist mein, ist mein!” People who have created or curated projects are possessive. This possessiveness has its good side; it leads to personal investment. But in the end we possess nothing, because we are mortal; and our institutions, even if undying, do not tend to steer straight courses with unvarying purposes and priorities. They abandon our beloved projects when something new comes along. We could all cite examples. Control is the enemy of sustainability; it reduces other people’s incentive to invest in something. The same thing could be said of our books; it’s just easier to rework and reuse digital content.

In his response to Roger Bagnall, the ever practical Greg Crane develops a line of argument that picks up McGann’s demonstration of the mismatch between the priorities of most humanities scholars and the attention they ought to pay to the Great Digital Migration. Looking at some 700 reviews in the 2009 Bryn Mawr Classical Review and at about 100 CVs, he found 28 reviews of commentaries or editions and three candidates with an interest in those genres of scholarship:

In effect, classicists as a group have made a cost/benefit decision to allocate less than c. 5% of their labor to the production of editions and commentaries. Improving the print infrastructure for the 50 million words of Greek and Latin that survive in manuscript transmission through c. 500CE was not a high priority – the benefits were no longer great enough to justify much scholarly labor. We invested our energy rather in interpretive articles and monographs.

This is probably an accurate statement about English departments as well. It is possible for individuals within fields of activity to make choices that make professional and economic sense within the field but lead the field as a whole astray. The steel and automobile industries come to mind. Would it be a good thing if the balance of what used to be called Lower and Higher Criticism shifted from 5:95 to 15:85 or even 20:80?

Crane glances at the sciences while describing useful scholarly contributions that undergraduates can make even early in their career. Here too the example of Classics may transfer readily to other humanities disciplines. While “the intellectual culture of Classical Studies assumes a long apprenticeship model”:

In a culture of digital editing, our students can begin contributing in tangible ways as soon as they can read Greek – first year Greek students are already able to distinguish text from commentary in the digitized Venetus A manuscript of Homer. Intermediate students of Greek and Latin offer their own analyses of individual sentences for the Greek and Latin Treebanks – contributions that are then compared against each other and then added to public database, with the names of each contributing student attached to each sentence. These contributions can develop seamlessly into undergraduate and MA theses of real value and immediate use. When our students publish unpublished material or contribute to knowledge bases, we find ourselves in a participatory culture of active learning. Pale clichés about citizenship and democratization suddenly become tangible.

There is nothing innovative in having undergraduates contribute to and then conduct research within a field – promising students in the sciences, for example, regularly begin working in laboratories, taking measurements or conducting technical procedures, and then develop experiments of their own. Classics is – or should be – a demanding field but no more so than the sciences.

Scholarly communities taking care of their data in the Great Digital Migration — how does this scenario play itself out in the institutional settings that add up to Scholarly Communications? McGann’s emphasis on the priority of institutional and political decisions has led him to devote much energy to NINES as a framework for scholarly neighborhoods in the nineteenth century. Laura Mandell and others are leading 18thConnect as an extension into the previous century. But if these enterprises fully take off and become the 21st century children of the learned societies of the seventeenth century, one would want them to be much more closely allied and perhaps ultimately merge with their older and non-digital siblings. “Digital Humanities” is to some extent the marker of a ‘not yet’: there is no digital Economics, Chemistry, or Biology, although there is the subdiscipline of Bioinformatics. With regard to information technology, these disciplines are much more mature and have simply absorbed the challenges and affordances of new technologies into the everyday lives of their practitioners.

Finally, there is the Library. McGann observes accurately: “When digital scholarship in the humanities thrives at a university these days, the library is almost always a key player, and often the center and driving force.” I cannot think of a counterexample. The conference papers include exceptionally interesting reflections on “perpetual stewardship,” a natural term in a conference that had “sustainability” as one of its major themes. Penelope Kaiserlian talked about confronting the “implications of perpetual stewardship as we look to Rotunda’s future.”

Paul Courant, Michigan’s University Librarian, observed that digital technology has reversed the traditional roles of library and publisher. In the old days, librarians were the keepers of the books they bought. In the new world of digital publications, libraries are subscribers to data kept by publishers, and both libraries and publishers find themselves in unfamiliar roles.

Courant points to a paradox with regard to the ‘keeping’ of digital and printed materials. A book on the shelves of a library takes up 0.15 cubic feet forever and year after year incurs the costs of taking up that space and the services associated with it. Libraries are familiar with these costs, and at some level they don’t require any special activity. Digital materials require more active and less familiar forms of upkeep. These activities may cost less but they are not yet well integrated into library budgets.

Courant argues that libraries make for better perpetual stewards:

The separation of stewardship from direct provision of access adds an unnecessary and complicating layer to the ecosystem of scholarly publishing. As an alternative, the Press could be a quasi-independent element of an academic library, responsible for its own editorial functions, but relying on the library for provision of the perpetual stewardship that electronic publication requires, and using the library “brand” to advertise its ability to provide such stewardship. Such an arrangement would be efficient in the simplest sense: neither the library nor the publisher would be required to learn how to do things that are not already part of its natural compass of activity.

From this reflection on perpetual stewardship you might look again at Bagnall’s argument about collaborative data curation:

We no longer see IDP as representing at any given moment a synthesis of fixed data sources directed by a central management; rather, we see it as a constantly changing set of fully open data sources governed by the scholarly community and maintained by all active scholars who care to participate.

IDP here can be a placeholder for any project that involves collaborative data curation by a scholarly community. Where do you house or provide the technical infrastructure for such collaboration? Should it be seen as part of the library’s perpetual stewardship? That is an attractive idea, and it is difficult to see what other institution in the humanities could play that role. But it will take much talking, thinking, and planning to get there.