“Fluent in Marlowe”: A decade of undergraduates as collaborative curators of Early Modern texts

In a course on Early Modern Drama that I taught in 2009, I gave my students the option of doing editorial work for some of their assignments. Two of them wrote a perceptive essay on the work that they found both tedious and engrossing. They concluded by saying that they had become “fluent in Marlowe”, a charming testimony to the value of exercises from which students learn while doing work that is useful to others. In particular, the essay clearly shows how bright undergraduates move very quickly from humble editorial tasks to thinking about fundamental philological problems. The practical work has a strong reflective payoff.

The students worked with spreadsheets that were populated with a verticalized output of TCP texts (more about them below) that had been linguistically annotated with MorphAdorner, a Natural Language Processing (NLP) tool suite developed by Phil Burns in Northwestern’s IT Research division. In such a table or “dataframe” each word is the keyword in a row that includes left and right context as well as data about particular properties of the word. The reading order of the text is maintained by a numerical column, but the reading order is only one of several ways of ordering the data for this or that analytical or editorial purpose.
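To make the layout concrete, here is a minimal sketch of such a “verticalized” table in pandas. The column names, part-of-speech tags, and sample rows are illustrative assumptions, not MorphAdorner’s actual output format.

```python
# A minimal sketch of a "verticalized" text table using pandas.
# Column names, tags, and the sample rows are illustrative, not MorphAdorner's actual output.
import pandas as pd

rows = [
    # order, left context, keyword, right context, lemma, part-of-speech tag, standard spelling
    (101, "the sweet", "fruition", "of an earthly crown", "fruition", "n1", "fruition"),
    (102, "sweet fruition", "of", "an earthly crown", "of", "pp", "of"),
    (103, "fruition of", "an", "earthly crown", "a", "dt", "an"),
]
df = pd.DataFrame(rows, columns=["order", "left", "word", "right", "lemma", "pos", "standard"])

# The "order" column preserves reading order, but for editorial review you might
# sort by spelling instead, so that all occurrences of a problem form line up.
print(df.sort_values(["word", "order"]))
```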

Shakespeare His Contemporaries

A few years later a grant from Northwestern’s IT group enabled Craig Berry to design Annolex, a Web application with a relational database backend. Annolex could easily keep some 50,000 records of corrupt transcriptions from some 500 plays written between roughly thirty years before Shakespeare’s birth and thirty years after his death. Craig earned his PhD at Northwestern with a dissertation on Chaucer and Spenser. His Doktorvater was Leonard Barkan. He has lived a double life as a Spenser scholar and a programmer with responsibilities for the accounting software of a kidney dialysis clinic.

Before taking on Annolex Craig had made two significant contributions to computationally based projects in the Humanities at Northwestern. In the mid-nineties he wrote a program that identified the approximately 250,000 occurrences of some 30,000 repeated phrases in Early Greek epic. This inventory has been the basis of the Chicago Homer, which for the past twenty years has helped readers with or without Greek to get a sense of bardic memory by making visible  the network of phrasal repetition that is so distinctive a feature of Homeric poetry. Craig also added the Spenser corpus to Wordhoard, an application for the close reading and scholarly analysis of deeply tagged texts, which includes Early Greek epic, Chaucer, Spenser, and Shakespeare.

Annolex was operational between 2013 and 2015 in a project we called “Shakespeare His Contemporaries.” During that period a dozen students from Amherst, Northwestern, and Washington U. in St. Louis corrected some 50,000 textual defects in some 500 plays and reduced the median rate of textual defects per 10,000 words from 14.5 to 1.4. The modal Early Modern play runs to about 20,000 words, plus or minus 4,000. In reading through a play you may not notice three defects. You will notice thirty.

Working in structured environments with light supervision, these students fixed over 90% of textual defects in 511 plays. The distribution of remaining defects looks as follows:

Remaining defects    Number of plays
0                    284
1                     63
2-4                   39
5-16                  59
17-64                 23
> 64                  19

All in all, the students did very good work, and the remaining tasks are quite manageable, but most of them require access to better images, not to speak of the 23 plays whose digital scans were missing 67 pages.

If you do NLP work you may say that the original median defect rate of 14.5 per 10,000 (0.145%) would in most cases make no difference to any quantitatively based inquiry. Which is true, but beside the point: Early Modern scholars like their texts clean. In a survey of TCP users 88% ranked “accuracy of transcription” as their first or second criterion, and 70% put it first.

Three students from that project deserve special recognition: Hannah Bredar (BA, Northwestern 2015), Kate Needham (BA, Wash. U. 2016), and Lydia Zoells (BA, Wash. U. 2016). Between April and July of 2015 the three of them, separately or together, visited the Bodleian, Folger, Houghton, and Newberry Libraries as well as the special collections of Northwestern and the University of Chicago. They fixed about 12,000 incompletely or incorrectly transcribed words. Hannah and Kate are now PhD students in English at Michigan and Yale. Lydia, the valedictorian of her class at Wash. U., went straight into New York’s publishing world and is currently an editorial assistant at Farrar, Straus and Giroux.

The Text Creation Partnership (TCP)

The texts for Shakespeare His Contemporaries came from the Text Creation Partnership (TCP). This is a good moment to give a brief account of what has arguably been the most important infrastructure project in Anglophone Early Modern Studies over the past thirty years. The English Short Title Catalog (ESTC), which aims at being a complete record of all imprints before 1800 from the English-speaking world, lists ~137,000 imprints before 1701. An imprint may be a single-sheet broadside or it may contain the 3.1 million words of Du Pin’s 1694 New History of Ecclesiastical Writers, the longest text in the TCP archive. Just about all these imprints were microfilmed between the late thirties and the end of the 20th century. For many years these microfilms were owned by University Microfilms, a corporation with close ties to the University of Michigan. Name and ownership have changed repeatedly in the past decades. ProQuest, the current owner, is a subsidiary of the Cambridge Information Group.

In the early nineties ProQuest digitized the microfilms. The digital scans became available as EEBO or Early English Books Online. I once asked colleagues what difference digital tools made to their work. Before I even finished my question one of them answered “EEBO changed everything”. And so it did. Ranganathan’s Fourth Law of Library Science says “Save the time of the reader”. If you can get across the paywall (a non-trivial if) and barring an Internet outage, you can get to just about every book before 1700 right away and anytime, including at 2 am in your pyjamas.

In the late nineties ProQuest and a consortium of universities led by Michigan and Oxford formed the Text Creation Partnership and struck an agreement to create SGML transcriptions of ~60,000 ESTC titles: approximately two billion words and, for many practical uses, a deduplicated library of Early Modern English books. The work was done in two phases on the understanding that after an initial five-year period the texts of each phase would move into the public domain. Some 25,000 Phase I texts moved into the public domain in 2015; Phase II texts will follow in 2021.

The transcriptions were for the most part done by off-shore transcription services like Apex Typing from EEBO scans of microfilms of copies printed before 1700 and subject to the vagaries of the intervening centuries. Lots of things could and did go wrong in the long journey from the author’s manuscript to the screen in front of the modern copyist. The contract called for an accuracy level of no more than 1 error per 20,000 keystrokes, but transcribers were not penalized for illegible characters if they marked the nature and extent of the resulting gap, whether “2 characters” or “1 paragraph”. Missing letters are typically displayed by a placeholding character. If you use a dot as the placeholder, the resulting string is already a “regular expression” that can be matched against a list of known words.
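A minimal sketch of what the dot placeholder buys you: the defective token can be matched directly against a list of known spellings. The word list here is a toy assumption; real matching would run against corpus-wide frequency lists.

```python
# A dot-filled transcription gap is already a regular expression that can be
# matched against a list of known spellings. The word list here is a toy.
import re

known_words = ["husband", "hatband", "huswife", "band", "brand"]

def candidates(defective: str, wordlist) -> list:
    pattern = re.compile(rf"^{defective}$")   # anchor so only whole words match
    return [w for w in wordlist if pattern.match(w)]

print(candidates("hu.band", known_words))   # ['husband']
print(candidates("h..band", known_words))   # ['husband', 'hatband']
```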

In Donald Rumsfeld’s parlance, the TCP texts include ~10 million “known unknowns” or corrupt words where the position and extent of damage are reported with high accuracy. The texts probably include a roughly equal number of “unknown unknowns” in the form of misprints or transcription errors. Some of them are reported in the errata sections found in some 6,000 texts. Some of them, especially the notorious confusions of long ‘s’ and ‘f’ or ‘u’ and ‘n’, can be flushed out by targeted searches. ‘Hnsband’ and ‘assliction’ are real examples. But most cases are hidden among 4.5 million spellings that occur fewer than five times in the corpus.

I doubt whether more than 10% of Early Modern texts have ever received the attention required for meeting minimal editorial standards for scholarly work. A reasonable person could wonder whether the editorial attention lavished on Shakespeare has strayed beyond the point of diminishing returns. Could that attention be more profitably spent on the thousands of texts that would benefit greatly from basic forms of “textkeeping”? For many purposes, including simple lookups and citations, EEBO images are good enough, and their image numbers have the advantage of a global and stable citation system. But images cannot easily be searched, and for texts before 1700 OCR remains far too “dirty” to produce reliable results.

The TCP texts were originally encoded in SGML, but also exist as XML versions. They go a long way towards creating searchable texts, but none of them fully qualifies as a scholarly text, and most of them have only gone through very limited proofreading. On the other hand, the coarse but consistent XML encoding across a corpus of 60,000 texts in principle lets users formulate queries that look for (or exclude) text in verse or prose, lists, tables, notes, and prefaces, dedications, or other forms of “paratext”. There is currently no interface that makes these affordances available in a user-friendly manner to “non-geeky” Early Modernists, which is most of them. Linguistic annotation extends the query potential of the texts to the micro-level of phrasal structure by supporting queries for patterns like “handsome, clever, and rich” or adjectives preceding ‘liberty’ or ‘freedom’.
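As a sketch of the structural side, the snippet below runs two such queries with XPath over a single TEI-encoded TCP file. The file name is a placeholder, and the element names assume TEI-style encoding; actual tag usage varies across the corpus.

```python
# Two structural queries against a single TEI-encoded TCP file, via XPath.
# The file name is a placeholder and the element names assume TEI-style encoding;
# actual tag usage varies across the corpus.
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
tree = etree.parse("A19070.xml")  # hypothetical local copy of one TCP text

# Verse lines, excluding any that sit inside notes
verse_lines = tree.xpath("//tei:l[not(ancestor::tei:note)]", namespaces=TEI_NS)

# Paragraphs in the front matter: prefaces, dedications, and other paratext
paratext = tree.xpath("//tei:front//tei:p", namespaces=TEI_NS)

print(len(verse_lines), "verse lines;", len(paratext), "front-matter paragraphs")
```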

The reputation of TCP texts has suffered from the universal tendency to judge a barrel by its worst apples. Defects cluster heavily in a minority of texts: 15% of them account for 60% of all defects, and two thirds of the texts have defect rates that are low or tolerable. But there is a lot of basic editorial work that can and should be done. It may be the case that as many as three million defects can be fixed algorithmically with an acceptable error rate. If a million defects can be fixed at an error rate of 3%, 970,000 words would be corrected and 30,000 would be no worse off: they would just be wrong in a different way. That is not a bad bargain, especially if all algorithmically corrected words are flagged appropriately. Philological casualties are easier to bear than military ones.

Towards a cultural genome of Early Modern English

Since 2016 undergraduate work on collaborative curation has extended beyond the scope of Early Modern Drama and tackled the entire EEBO-TCP corpus. Time will tell whether this will prove to have been a wise or foolish step, but the extended project (which has involved Notre Dame and now involves Northwestern and Washington U. in St. Louis) has received significant support from the Mellon Foundation and the ACLS. Its roots go back to the multi-institutional 2007-09 MONK project (Metadata Offer New Knowledge), which took some steps towards a multi-genre, diachronic, and consistently tagged and annotated corpus in the spirit of a remark by Brian Athey, chair of Computational Medicine at Michigan, that “agile data integration is an engine that drives discovery.” MONK led me to formulate the idea of a Book of English defined as

  • a large, growing,  collaboratively curated,  and public domain corpus
  • of written English since its earliest modern form
  • with full bibliographical detail
  • and light but consistent structural and linguistic annotation

The parallel with collaborative genomic annotation runs deep.  Early Modern printed English (from 1473 to 1700) would be the first chapter in such a book, and the one with the most realistic chance of being completed. You could call the result a “cultural genome” or “book” of Early Modern English, just as the “book of life” metaphor is often used for the human genome.

“Agile data integration” for a Book of Early Modern English would be a good thing to have, but one must be clear about what it is or is not. It is a good enough record of what has been printed and survived. It does not include what was written by hand and never made it into print. After ‘Augustine’, ‘Luther’ is the most common word in EEBO-TCP that unambiguously refers to a historical person. The 60 volume Weimar edition of Luther’s works has an additional dozen German and Latin index volumes of names, places, subjects, and citations. No Luther scholar would want to be without it. Indexes are a very early device of print culture to make books more “agile”, witness the “diligent, and necessary Index, or Table of the most notable thynges , matters, and woordes contained in these workes of Master William Tyndall” in a 1570 edition of Tyndale’s works.

In a Book of Early Modern English each “chapter” (or separate TCP text) should contain complete, clean, and readable text, and this book should be complemented and surrounded by digital indexes that let users treat it as if it were  a single and well-indexed book.  Getting there will take a lot of work and involve different and mutually reinforcing tasks ranging from basic copy-editing to complex NLP routines. Not all of it needs to be done before parts of it become useful, and as long as duplication is avoided it does not matter in what order things get done.

It is a big step to go from 500 to 60,000 texts. Think of the simple “textkeeping” tasks in terms of a classic ditch-digging story problem. A dozen students working full-time in two eight-week summer internships cleaned up 90% of defective tokens in 510 texts, or 0.85% of the corpus. How many students (or other contributors) would it take to complete the task in seven or ten years? Measured in words rather than titles (plays are short texts), they did roughly half a percent of the work.
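A back-of-the-envelope version of that story problem follows. Every figure is a rough assumption taken from this essay, not a measurement.

```python
# A back-of-the-envelope version of the story problem. Every figure is a rough
# assumption taken from this essay, not a measurement.
corpus_words = 2_000_000_000        # ~2 billion words in EEBO-TCP
play_words = 20_000                 # a modal play
plays_cleaned = 510
student_summers = 12 * 2            # a dozen students, two eight-week internships

words_cleaned = plays_cleaned * play_words
share = words_cleaned / corpus_words                  # ~0.5% of the corpus by volume
words_per_summer = words_cleaned / student_summers

years = 10
summers_needed = corpus_words / words_per_summer
students_per_year = summers_needed / years

print(f"{share:.2%} of the corpus cleaned so far; "
      f"~{students_per_year:.0f} student-summers per year for {years} years to finish")
```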

It is easy to be discouraged by those numbers, but there is also a cheerful way of looking at it. A few students working together can significantly improve some cluster of Early Modern texts, whether plays, books about science, gardening, law, witchcraft or whatever, and that work cleans up some textual neighbourhood for all its future readers.

Over the past three years, fixing textual defects has taken a backseat to improving the tools and environment for doing collaborative editorial work. The EarlyPrint Library is built on an eXist XML database that adds the following features to a readable text:

  1. For each page of transcribed text it provides immediate access to the corresponding EEBO image
  2. For a growing number of texts it provides access to high-quality and public domain images on IIIF servers at the Internet Archive and elsewhere
  3. It includes an Annotation Module that supports “curation en passant” and allows registered users to offer emendations for corrupt readings. These emendations are flagged and immediately displayed in the text, but their integration into the source texts is subject to editorial review

It has taken two years to make this environment reasonably stable and fast enough for most purposes. It is still a work in progress, but we have a much clearer sense of what it takes in software refinements and more powerful hardware to make it faster and more reliable.

The standard search functionalities of the eXist database were not designed to meet the requirements of editorial work in the EarlyPrint Library. The current plan is to combine the EarlyPrint environment with a BlackLab search engine, which implements a corpus query language on top of a Lucene index and also supports XML-aware searching.
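A sketch of what such a search might look like once a BlackLab index exists: the server URL, corpus name, and tag values below are placeholders, and the query asks for adjectives immediately preceding ‘liberty’ or ‘freedom’.

```python
# A sketch of a Corpus Query Language search against a BlackLab server, once an
# index exists. The server URL, corpus name, and tag values are placeholders.
import requests

BLS = "http://localhost:8080/blacklab-server"   # hypothetical BlackLab Server
CORPUS = "earlyprint"                           # hypothetical corpus name

# Adjectives immediately preceding 'liberty' or 'freedom'
cql = '[pos="j.*"] [lemma="liberty|freedom"]'

resp = requests.get(f"{BLS}/{CORPUS}/hits", params={"patt": cql, "outputformat": "json"})
for hit in resp.json().get("hits", [])[:5]:
    print(hit)   # each hit records the match and its left and right context
```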

 

Eng² or Engineering English

Engineering and English are close alphabetical relatives, but the people in those disciplines tend not to think of each other as blood brothers. That said, whatever else a book may be, it is certainly an engineering product. Images like the Ramelli wheel testify to an Early Modern fascination with mechanical engineering. In retrospect, one may even see in that image a foreshadowing of Franco Moretti’s “distant reading”. At least it shows a recognition of reading as fundamentally a “many books activity”. A modern book is one of many possible ways of representing a digital file, and all stages of the editorial process have been deeply affected by digital technologies. What applies to the making of books also applies to their reading and analysis. The business world with its understandable interest in profits has been eager to use all manner of NLP techniques to get to some bottom line as quickly as possible. Humanists are leery of bottom lines. Leaving aside self-proclaimed “digital humanists”, scholarly readers or editors remain reluctant to explore ways in which technology could help them with anything beyond the mundane tasks of typing, printing, copying, etc. This reluctance is not very helpful, but it is very powerful. The thoughtful Andrew Piper in his recent Enumerations wistfully looks ahead to “imagin[ing] an alternative future where students are not dutifully apportioned into silos of numeracy and literacy but are placed in a setting where these worldviews mix more fluidly and interchangeably” (p. x). It will be a while before that becomes an everyday reality in humanities departments, but it is worth hoping for and working towards.

Curating and exploring the Early Modern corpus offers many opportunities for breaking the “silos of numeracy and literacy” and joining them in the increasingly useful skill of “telling stories with numbers.” Some of those opportunities are very practical, but (remember “fluent in Marlowe”) practice and reflection can be close neighbours. The humble task of correcting corrupt readings is at some level a spellchecking problem, but the pattern matching skills that it calls on are just as important for higher-level operations.

The dismal prospect of manually fixing millions of frequently obvious typographical errors or gaps led me to ask whether a machine could help. I talked with Doug Downey in Northwestern’s Computer Science Department. One of his students took a first stab at a solution in the context of the limited drama corpus. “Dirty Words: Engineering a Literary Cleanup” is a lively report about it. Two years later, two other students of his, Larry Wang and Sangrin Lee, did a more ambitious experiment that targeted the entire corpus and used Long Short-Term Memory (LSTM) routines. The results were so promising that machine-generated corrections were imported into the EarlyPrint Library but flagged with a colour that marked their algorithmic status. A closer look has shown the need for a more granular case logic that excludes certain types of defects and clusters subsets of texts for special treatment. But there is no question in my mind that at least half of the defective tokens can have algorithmically based solutions.
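The snippet below is not the Wang/Lee experiment itself; it only sketches one common way training data for such a corrector might be manufactured: take clean tokens from well-transcribed texts and corrupt them the way the TCP transcriptions are corrupted.

```python
# Not the Wang/Lee experiment itself: just a sketch of how training pairs for a
# neural corrector might be manufactured, by corrupting clean tokens the way the
# TCP transcriptions are corrupted (black dots for unreadable letters).
import random

random.seed(17)

def corrupt(token: str, max_gaps: int = 2) -> str:
    """Replace up to max_gaps characters with the dot placeholder."""
    chars = list(token)
    n_gaps = random.randint(1, min(max_gaps, len(chars)))
    for i in random.sample(range(len(chars)), n_gaps):
        chars[i] = "●"
    return "".join(chars)

clean_tokens = ["husband", "affliction", "libertie", "freedome", "physicke"]
training_pairs = [(corrupt(t), t) for t in clean_tokens]

for corrupted, clean in training_pairs:
    print(corrupted, "->", clean)
```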

This is a case where engineering students can add substantial value to a humanities project by using sophisticated and familiar techniques of pattern matching. But there are also things for them to learn. The increasingly powerful NLP routines developed largely for the uses of business and industry make substantial and tacit assumptions about what English is like. These routines therefore require much tweaking of training data and algorithms to work with data from earlier centuries, and that tweaking requires deep conversations with domain experts to figure out what is or is not within algorithmic reach. Those conversations are a bridging exercise, and there is much to be learned on both sides.

How many different words are there in the Early Modern corpus? This is a question of some interest to a lexicographer. With a modern corpus you can get a pretty good answer by stripping off some suffixes and grouping the results. Not so with a corpus that spans 230 years of considerable orthographic variance and fluctuation. The EarlyPrint corpus has about 4.4 million distinct spellings. 3.5 million occur fewer than five times, and 2.4 million occur only once. Programs that map incomplete words to known words should also be able to identify very rare spellings as variants or misspellings of more common words.
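A minimal sketch of that idea uses the standard library’s approximate string matching; the frequency list below is invented, and serious variant detection would also need period-specific rules (u/v and i/j swaps, long s confusions, and so on).

```python
# Flagging a very rare spelling as a likely variant of a common word with the
# standard library's approximate matching. The frequency list is invented, and
# serious variant detection would add period-specific rules (u/v, i/j, long s).
import difflib

frequent = {"husband": 41253, "affliction": 9870, "liberty": 30122, "freedom": 12045}

def likely_variant(rare_spelling: str, lexicon: dict, cutoff: float = 0.8):
    matches = difflib.get_close_matches(rare_spelling, lexicon.keys(), n=1, cutoff=cutoff)
    return matches[0] if matches else None

for rare in ["hnsband", "assliction", "libertye"]:
    print(rare, "->", likely_variant(rare, frequent))
```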

Named Entity Recognition

From a technical perspective, Named Entity Recognition (NER) is very close to the spellchecking problems discussed above: instead of matching a string to a standard spelling or lemma, you seek to match it to an entity that exists outside that text in some real or imagined space. Is ‘John’ the apostle, the Baptist, the name of one of the gospels, the name of the letter by John, the English king, or the name of some fictional character? There are well over a million distinct character strings that are names or abbreviations of names. Not all of them are as polysemous as ‘John’, but quite a few are.
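A toy illustration of the problem: a hand-made gazetteer of candidate entities with context keywords. Real NER over a million distinct name strings needs far more than keyword overlap, but the shape of the task is the same.

```python
# A toy illustration of name disambiguation: a hand-made gazetteer of candidate
# entities with context keywords. Real NER over a million distinct name strings
# needs far more than keyword overlap, but the shape of the task is the same.
GAZETTEER = {
    "John the Baptist": {"baptist", "jordan", "herod", "wilderness"},
    "John the Apostle": {"apostle", "gospel", "revelation", "patmos"},
    "King John":        {"king", "england", "magna", "charta", "runnymede"},
}

def disambiguate(name: str, context: str):
    words = set(context.lower().split())
    candidates = {e: cues for e, cues in GAZETTEER.items() if name.lower() in e.lower()}
    scores = {entity: len(words & cues) for entity, cues in candidates.items()}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(disambiguate("John", "he preached in the wilderness and baptized in Jordan"))
```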

From an end user’s perspective, clarity about names may be the greatest navigational help that a corpus can provide. A recent Northwestern English major, who is now at the UIUC School of Information Sciences, worked with Phil Burns and did valuable NER work on Purchas His Pilgrimage, a very large early 17th-century compilation that probably contains a high percentage of the names found in texts published before it. Getting names roughly right will be a high priority of the project, and it will call on a clever combination of algorithmic analysis and shoe-leather journalism to get it done. Collaboration between computer science and humanities students can do a lot of good in this field and be a very valuable experience for the students engaged in it.


Collaborative curation of 126 medical texts in the EarlyPrint corpus

Three Northwestern  Classics undergraduates, Ace Chisholm,  Grace DeAngeles, and Lauren Kelley, had summer research grants to work on the curation of medical texts in the EarlyPrint corpus. The following is their report, very lightly edited and supplemented by a few hyperlinks.  For more information about collaborative curation, Early English Books Online, and EarlyPrint see https://sites.northwestern.edu/scalablereading/2022/09/25/collaborative-curation-of-tcp-texts-in-the-earlyprint-environment/ 

The three editors also wrote a detailed summary of the text, accompanied by stylistic and thematic analysis. You can find it at

 

EarlyPrint 2022: What we did and what we learned

Our goal

This summer, we worked on medical texts from the English early modern period. Our goal was to get this medical corpus in polished shape for future researchers to use. We also researched a text that was particularly interesting: Thomas Cogan’s Haven of Health, first published in 1584 (though we worked with the fourth edition, published in 1636).

Our contributions to EarlyPrint

In order to make texts in the EarlyPrint corpus more easily read, searched, and analyzed, a plethora of mistakes in the original transcriptions, both known and unknown, had to be corrected, and gaps in the text needed to be filled.  We took four different approaches to meet these needs: correcting “blackdot” words, proofreading, correcting metadata, and transcribing.  To address the “known unknowns,” we worked through nineteen spreadsheets covering 126 texts and corrected almost 20,000 blackdot words.  Blackdot words are those in which the original transcriber was unable to identify one or more letters and indicated how many letters were missing and where they occurred.  Those missing letters appear as black dots (●) in the digital transcription.  In many cases, we were able to determine the word based on context clues, but where we could not, we consulted the EEBO page image for its appearance on the page.  In the cases where the word was Latin, Greek, or even abbreviated Latin, our knowledge of Latin (and Grace’s knowledge of Greek) was particularly useful.  If a letter or two were completely obscured, we could often nonetheless identify the word from our knowledge of grammar and its orthographic realization.

The “unknown unknowns” that we interacted with took the form of either incorrectly transcribed words or incorrectly tagged words.  Because incorrectly transcribed words were unmarked (hence an unknown unknown), we rooted these out by proofreading.  We proofread Haven of Health (A19070), simply correcting as we read, as well as seventeen pages of The vertuose boke of distyllacyon, the earliest English “herbal” or book about plants, recording the types of errors we discovered in our focused sample in a spreadsheet for analysis. Within the seventeen pages, we identified and corrected 450 transcription errors: hapax spellings, erroneously added or omitted letters, or letter substitutions.  We reported the tendencies of each type of error with the intention of informing a future parser about possible mistakes and their remedies.

We also used “known unknowns” to identify the “unknown unknowns” in word metadata.  MorphAdorner works in the EarlyPrint corpus to adorn the text with its appropriate metadata, such as a word’s part-of-speech tag.  While this searchable data is incredibly useful to researchers, MorphAdorner is not always accurate.  Using Aqua Data Studio, a user-friendly frontend for relational databases, we reviewed the metadata attached to words within medical texts that had been tagged as a name, along with every other instance of that word.  We corrected the lemma, standardized form, and token (the word itself) of these 5532 words and changed the part-of-speech tag to “fla” or “n1/2” or “nn1/2” as appropriate.
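For readers who want to picture the kind of review query involved, here is a hypothetical sketch in SQL (via Python's sqlite3); the table and column names are invented and do not reflect the actual EarlyPrint or MorphAdorner schema.

```python
# A hypothetical sketch of the kind of review query described above, written in
# SQL via Python's sqlite3. The table and column names are invented and do not
# reflect the actual EarlyPrint or MorphAdorner schema.
import sqlite3

con = sqlite3.connect("tokens.db")   # hypothetical extract of the token table
cur = con.execute(
    """
    SELECT text_id, token, lemma, standard, pos
    FROM tokens
    WHERE token IN (SELECT DISTINCT token FROM tokens WHERE pos LIKE 'n%' AND is_name = 1)
    ORDER BY token, text_id
    """
)
for row in cur.fetchmany(20):
    print(row)
```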

A number of texts in the EarlyPrint corpus have entire pages missing, a result of the corresponding EEBO images being either illegible or incomplete.  After Grace led a brief training session in XML, the markup language used for TCP texts, we set to work in the Oxygen XML Editor to fill those gaps.  We transcribed 107 missing pages for fifteen texts in both English and Latin, resulting in the completion of eleven of the texts and moving the remaining four closer to that end.

In addition to correcting and developing texts, we also researched and produced a report for Haven of Health (A19070), containing a brief biography of the author, Thomas Cogan, summaries of the six major sections of the text and their subsections, and a collection of analyses considering the topical, linguistic, and theoretical aspects of the work.  The research that went into these analyses expanded our understanding of early modern medical beliefs, practices, and their underlying rationale.

The value of our pursuit

It has been a worthwhile endeavor to work with the medical texts of early modern England because, in a time when disease dominates global discourse, it is essential to consider the roots of modern medicine. There is no value in understanding where we are if we don’t know where we came from, as history informs the decisions that will be made in the future.

Many aspects of early-modern medicine contrast with the present day. We no longer maintain the theory of four humors; pills have replaced herbal remedies; surgical patients are likely to survive their procedures; women’s bodies are no longer considered ‘cold and moist’. Regardless of what issues we may have with modern medicine, we can rest assured that any given affliction is less painful and deadly today than it was in the past.

However, a comparison of early-modern and present-day medicine reveals that there is an essential similarity between the two, in that medicine is always subjective to some degree. In the twentieth century, medical research began in earnest to be as objective as possible, using double-blind procedures, randomization, control groups, placeboes, and statistical analysis, inter alia.[1] This trend has continued and accelerated into the twenty-first century, simultaneously drawing from and contributing to the growing importance that society places on scientific advancement.[2] While the research itself may be objective, objectivity itself is a subjectively defined concept that is shaped by social discourse. Additionally, the supremacy of objectivity, rather than subjectivity, in medicine is an arbitrary choice that society has made.

Studying early-modern medical texts can give us perspective on the state of our current social discourses surrounding medicine. Religion no longer has any part in medical science, but the two were more closely intertwined in several of the texts we studied. For example, the 1636 work Lord have mercy upon us the world, a sea, a pest-house, the one full of stormes, and dangers, the other full of soares and diseases reports that “the principall cause of all the Diseases of the body, are those of the Soule, which is Sinne.”[3] Many of the authors that we read wove together objectivity and subjectivity. For example, an author might first discuss the potential procedures for setting a broken bone, then state his opinion that a particular procedure is best because he himself saw that it healed a man’s arm in three days.

Religion and one’s own observations are both subjective, for they present personal perceptions and interpretations which cannot be verified and may not necessarily be shared by others. Such subjectivity was commonplace and acceptable in early-modern medicine, as the authors’ contemporary society did not require them to be objective and methodical. Upon reading early-modern medical texts, we now have a much clearer perspective on how cultural values and social discourse shape the human experience of, and opinion on, medicine.

Moreover, given our status as Classicists, we were able to read these texts with an awareness of ancient medicine. This allowed us to consider the impact of ancient medicine on early-modern and present-day medicine. The authors that we read often paraphrased and quoted directly from writers of ancient Greece and Rome, such as Hippocrates, Galen, and Cornelius Celsus. They also usually made sure to cite their sources, demonstrating that the teachings of ancient medicine were esteemed in the medical culture of early modern England. Many pieces of information about diseases, humors, injuries, and body parts (inter alia) were transmitted directly from the ancient sources into the early modern texts that we worked on. Ancient medical knowledge was generally unquestioned, which is comprehensible from the Humanist lens that prioritized learning from classical antiquity. On the other hand, modern-day medical science prefers research-based advancement that constantly overrides and disproves ideas that were credible in the past.

Our work this summer has demonstrated that the culture surrounding modern-day medicine is a relatively new phenomenon and marks a break from the history of medicine. Gaining perspective on how medical information has operated and currently operates in diverse societies and time periods helps us understand where the future of medicine might lead, as the world renegotiates its relationship to medicine in the wake of pandemic.

What we have learned about early-modern medicine

Through proofreading the medical text corpus, we were exposed to a wide breadth of texts written throughout the early modern period. Upon reading these texts, we had an immediate and almost staggering realization of the monumental developments within the field of medicine in just a few centuries. Even the texts written near the end of this period put forward an outdated view of health based on Humorism and the 2,000-year-old teachings of Hippocrates. In a relatively short time frame, the advent of the Scientific Revolution and subsequent developments in science have transformed the way that disease and health are understood. Although the information in these texts is outdated, they have the invaluable ability to remind us of the often-slow progression of knowledge, as well as stand as a testament to the never-ceasing desire of humanity to learn more about itself and the world.

The medical text corpus contained texts that spanned just over a century, between the early 16th century and the mid-17th century. Although medical theory remained largely unchanged in this timeframe, there was still a notable difference in individual works from the early end of the corpus to the latter end. Earlier texts tended to stick primarily to English, while later texts tended to quote more frequently from Latin and Greek authors. This would suggest a pattern of increasing exclusion within medicine, since extensive schooling would have been required to understand those languages. In earlier centuries, folk medicine seemed to have been more dominant; medical knowledge was passed around within communities, and a formal degree would not have necessarily been required to gain authority or to practice medicine within a community. However, later authors seem to derive their authority from their education. As discussed in the formal report for The Haven of Health, Cogan quotes from ancient authors largely to establish his own expertise in medicine. The ability to read these authors in their original language was a skill that only educated men had, and if medical authority was restricted to those who had a firsthand intimacy with the ancients, then the pool of people who could derive medical authority was limited.

Additionally, the reliance on ancient authors suggests that there was an increased importance of having a source for medical information. Earlier texts tended to cite where they got recipes and information from less often, while later texts had more consistent citation about precisely where they got information from. While this is additional proof for the gatekeeping of medical knowledge, since being able to cite sources requires an education, it also suggests a movement towards analysis and antiquarian research. Authors of later medical texts had to have an impressive grasp on a vast body of medical knowledge, and selectively choose which information they considered most accurate and relevant. There are several examples in Haven of Health where Cogan would present conflicting information from two different sources, and then reconcile them with his own opinion and thoughts. This behavior demonstrates a remarkable ability to analyze and synthesize information from a massive corpus of medical information, as well as build upon established knowledge and tweak it if necessary. This behavior became much more prevalent as time went on, and built the foundation for a modern culture of peer-reviewed research and collaboration in medical settings.

One aspect of the medical corpus that remained fairly consistent throughout time was its treatment of women. With the exception of two or three texts concerning midwifery, the vast majority of texts contained little or no information about medicine that was specific to women. As a pertinent example, John Banister’s 1578 The Historie of Man explicitly refuses to discuss the reproductive anatomy of women (at the end of a chapter dedicated to male reproductive anatomy) since he would “commit more indecencie agaynst the office of Decorum, then yeld needefull instruction to the profite of the common sort.”[4] There was a pervasive sentiment throughout the corpus that women’s issues were secondary to men’s issues, or that they weren’t worth mentioning. Of course, at this point in history, the formal medical profession was essentially restricted to men, and the vast majority of these texts were aimed at an educated, male audience. Therefore, it is no surprise that this patriarchal institution would dedicate little time or effort to understanding issues that affected only women. Additionally, the strong presence of the Church in society undoubtedly rendered discussions about female anatomy and reproduction taboo and possibly offensive. The lack of mention of these topics in the medical corpus is further proof of the importance of alternate medical authorities, such as midwives, in the lives of women.

The role of religion in the medical corpus demonstrates another tradition that is largely gone from the modern world of medicine. A considerable number of texts make some mention of God in the title, or begin with a preface that explains the author’s dedication to God. This can be explained with the central Christian ideology that the body is a temple to God, and that maintaining health is a way to remain faithful and preserve oneself so as to better serve the Lord. Additionally, the act of helping the poor and needy is another central teaching of Christianity that is fulfilled by the profession of physicians. This is very consistent with the prevalence and authority of the Church in early modern society, and explains the close intertwining of religion and medicine. An interesting observation that is somewhat more surprising to the modern reader is the role of sin in medicine. Since medicine was not secular, health encompassed both physical and spiritual health; therefore, if someone were out of favor with God, they were not considered truly healthy since they were in a state of deep sin. Depending on the source, spiritual sickness could even lead to physical ailment. The large role that religion used to play in diagnosis and prognosis further demonstrates the cultural impact of Christianity on all aspects of early modern life.

Becoming familiar with these medical texts has certainly reinforced the vast progress that has been made in medicine from the early modern period to the current era. In many ways, the medical landscape of this time is unrecognizable; these works were written before germ theory, when the prevailing medical philosophy was essentially unchanged for 2000 years. The reliance of physicians on ancient authors instead of empirical evidence is almost unimaginable to modern readers. Perhaps of most importance to half the world population, modern medicine has expanded its bounds to include women and their specific health needs. However, there is some continuity in the actual hierarchical structure of medicine; in the current era, there is an increasing barrier to entry in the medical profession, with higher levels of education and experience needed to become certified. Similarly to the 17th century, medical authority tends to rest in the hands of relatively few experts who bestow treatment upon the larger population. Therefore, while the content and understanding of medicine has changed over time, the man-made structures and hierarchies within the field remain constant.

Conclusion

Before this summer, the three of us had only a baseline knowledge of ancient medicine and no knowledge of early modern medicine. Nonetheless, we managed both to learn from and to contribute to the medical corpus of EarlyPrint. We hope that our efforts inspire others to continue to research the medical corpus, or to work on other topical corpora that spark their interest.

[1] Bhatt, Arun. “Evolution of clinical research: a history before and beyond James Lind.” Perspectives in Clinical Research vol. 1,1 (2010): 6-10.

[2] “Society” here refers to Western society—we do not have enough familiarity with the history and current practices of Eastern medicine to analyze it here.

[3] https://texts.earlyprint.org/works/A68989.xml?page=010-b.

[4] The Historie of Man, 89. https://texts.earlyprint.org/works/A03467.xml?page=098-a

A list of texts for which missing pages were transcribed from digital facsimiles of the printed originals on the Internet Archive.

 

S121369 1533 Fabyan, Robert, Fabyans cronycle newly prynted, wyth the cronycle, actes, and dedes done in the tyme .. Henry the vii.
S106976 1596 Le Sylvain, The orator: handling a hundred seuerall discourses, in forme of declamations:
S102357 1598 Florio, John, A vvorlde of wordes, or Most copious, and exact dictionarie in Italian and English
S106753 1599 Hakluyt, Richard, The principal nauigations, voyages, traffiques and discoueries of the English nation
S103824 1599 Harsnett, Samuel, A discouery of the fraudulent practises of Iohn Darrel Bacheler of Artes
S117760 1624 Camden, William The historie of the life and death of Mary Stuart Queene of Scotland.
R15125 1652 Selden, John, Of the dominion, or, ownership of the sea two books.
R9064 1660 Bartoli, Daniello The learned man defended and reform’d. A discourse of singular politeness, and elocution
R19153 1661 [no entry] Mathematical collections and translations… Galileus his system of the world.
R42174 1673 Milton, John, Poems, &c. upon several occasions. By Mr. John Milton: both English and Latin
R5715 1673 Ray, John, Observations topographical, moral, & physiological; made in a journey through .. the Low-Countries…
R27278 1678 Cudworth, Ralph, The true intellectual system of the universe
R16918 1699 Haudicquer de Blancourt, Jean, The art of glass. Shewing how to make all sorts of glass, crystal and enamel.
R31983 1700 Dryden, John, Fables ancient and modern; translated into verse,

Stanley Fish and the Digital Humanities

This blog reprints my response to three essays about the Coming of the Digital Humanities that Stanley Fish published in the New York Times not quite a decade ago:

  1. The Old Order Changeth
  2. The Digital Humanities and the Transcending of Mortality.
  3. Mind your P’s and B’s: The Digital Humanities and Interpretation

My response appeared originally on a blog of the Northwestern Library that was discontinued some years ago.

Is there or should there be a Digital Humanities? My very short answer to both questions is “no” and “no.” In a slightly longer answer I concede that a phrase must be about something if it is gaining currency. For me the something of the term is about the trouble that the humanities have had in absorbing digital technology into their habits of work and recognition. Unlike the natural and social sciences, they have so far put the digital into a ghetto–a mutually convenient practice for those inside and outside, but probably harmful in the long run.

Finally I wrestle with the term by engaging Stanley Fish’s recent tri(bl)logy about the Digital Humanities in the New York Times. Fish actually says very little about the use of digital technology in other humanities fields but focuses on literature departments. He is an eminent Miltonist and was a major force in the world of English departments during the turbulent quarter century from the late sixties into the early nineties. On that account alone, he is worth reading.

Digital Insurgents?

English departments for Fish are a story of embattled regimes, insurgencies with a martyr’s and a prophet’s face, the domestication of triumphant insurgencies into a new orthodoxy, and the repetition of the cycle with the emergence of a new insurgency. He remembers “with no little nostalgia” the era of “postmodernism in all its versions.” Now he writes with a benevolent serenity spiced with dashes of cynicism.

The title of his first blog, “The Old Order Changeth”, might as well be plus ça change. The new insurgents are the Digital Humanists, who all of a sudden are all over the annual convention of the MLA. Whereas in the previous seven years, the sessions dedicated to things digital fluctuated between six and fifteen, with a barely discernible trend line, in 2012 there were 27. Something is going on.

In the second blog with the ironic title “The Digital Humanities and the Transcending of Mortality” Fish describes the new insurgence and its major promise (or threat) as the transformation of a “hitherto linear experience — a lone reader facing a stable text provided by an author who dictates the shape of reading by doling out information in a sequence he controls — into a multi-directional experience in which voices (and images) enter, interact and proliferate in ways that decenter the authority of the author who becomes just another participant.” He quotes Kathleen Fitzpatrick, author of Planned Obsolescence: Publishing, Technology, and the Future of the Academy and the first director of the MLA’s recently established Office of Scholarly Communication:

we need to think less about completed products and more about text in process; less about individual authorship and more about collaboration; less about originality and more about remix; less about ownership and more about sharing.

Fish the Miltonist gleefully points out the theological resonances of such “All in All” talk (Paradise Lost 3.341). He doubts whether the digital prophets would like that but is sure they will agree with it as “a left agenda (although the digital has no inherent political valence) that self-identifies with civil liberties, the elimination of boundaries, a strong First Amendment, the weakening or end of copyright and the Facebook/YouTube revolutions that have swept across the Arab world.”

As a program director and department chair during the eighties I interviewed hundreds of candidates for positions in English and Comparative Literature. To the extent that they shared a collective sensibility, I don’t remember it as being very different from the values and voices Fish imputes to today’s digital insurgents. I remember that some candidates in those days picked up the non-trivial text processing skills that it took to babysit a dissertation through a mainframe computer. They did so because the machine would automatically and accurately renumber their footnotes. For this they would do anything. This shows that Fish is right when he says that the digital has “no inherent political valence”, or any other valence for that matter. But it also shows that Fish probably is not right when he sees the problem of the Digital Humanities as a “we/plural/text and author detesting” ethos challenging an “I/singular/text and author fetishizing” ethos. English departments are full of folks who love plurals in titles and have doubts about the identity of texts or authors but for good and bad reasons want nothing to do with the digital.

What does the digital do?

In his final blog, Fish asks how “the technologies wielded by digital humanities practitioners either facilitate the work of the humanities, as it has been traditionally understood, or bring about an entirely new conception of what work in the humanities can and should be.” He takes a single sentence from Milton’s Areopagitica: “Bishops and Presbyters are the same to us both name and thing,” a prose version of the famous Miltonic line “New Presbyter is but old Priest writ large.” Fish points out that in the surrounding sentences “b’s” and “p’s” proliferate in a “veritable orgy of alliteration and consonance.” A brilliant and entirely manual little exercise in stylometry, drawing inferences from a perceived discrepancy between expected and observed occurrences of bilabial plosives.

Fish sees this exercise as an example of hypothesis-testing criticism. He begins with a “substantive interpretive proposition”—Milton believes that the former martyrs have become oppressors. Guided by that proposition he notices formal patterns and elaborates their correlation with the proposition. In his final paragraph he speaks approvingly of “a criticism that narrows meaning to the significances designed by an author, a criticism that generalizes from a text as small as half a line, a criticism that insists on the distinction between the true and the false, between what is relevant and what is noise, between what is serious and what is mere play.”

From the perspective of such a criticism Fish argues that there is not much to love in two quite different avenues of digital criticism. There is the ludic criticism whose most eloquent advocate is Stephen Ramsay. Far from seeing the critic’s duty in narrowing meaning, Ramsay celebrates the power of algorithms to proliferate meaning through playful de- and transformations of texts. And then there is text mining, where “first you run the numbers, and then you see if they prompt an interpretive hypothesis.” There is no QED or conclusion in either method.

I share Fish’s admiration for Stephen Ramsay’s playful imagination. I also share his skepticism about how far to push a ludic element in the business of interpretation, although Fish surely underestimates the power of play, at least in the severe stance he adopts in this blog. As for text mining, Fish is not quite fair to its claims and methods. To stay with the theological language that he seems to both like and dislike, proper understanding is a form of Anselm’s fides quaerens intellectum. You start with some belief and seek to support it with argument and evidence. Without such “faith”, inquiry is just a boat aimlessly drifting at sea. The larger the ocean of data, the more aimless the drift. Have I seen text mining that answers to this description? Yes. Is it a fair account of text mining done competently? No. Take the example of Matthew Wilkens’ analysis of place names in American novels of 1851. I have not read this essay but heard the author give a talk on a different version of the same project. Fish describes searches that are not “interpretively directed” as follows: “You don’t know what you’re looking for or why you’re looking for it. How then do you proceed? … The answer is, proceed randomly or on a whim, and see what turns up.”

But that is not how Wilkens proceeded. Instead he asked the quite precise question: “What can we learn about a group of related novels by looking at the distribution of place names in them?” This question rests on the well-tested hypothesis that the distribution of proper nouns in a document will tell you quite a bit about it. In the digital realm “named entity extraction” is an important subfield of Natural Language Processing, but it has a venerable manual equivalent in the genre of the Index Nominum, which is almost as old as the printed book. Many a book has been read on the principle of “Tell me whom you quote, and I tell you what you wrote.”

Extracting place names from a set of novels is a form of “distant reading.” The term, which is Franco Moretti’s, is clearly a polemical challenge to “close reading.” Pierre Bayard’s amusing How to talk about books you haven’t read provides ample evidence that “not-reading” is an ancient and inescapable practice. Fish is quite comfortable with it himself when he bases his analysis of the Advent of Digital Humanities on a reading of the titles of MLA sessions and papers. To vary Fish, “Don’t you have to actually read the papers, before saying what the patterns discovered in them mean?”

“Yes and no,” the answer might be. Fish looks at “distant reading” and says “no thank you.” Wilkens makes a more nuanced and modest case. He presents a scenario in which the members of the profession either practice close reading on the same few dozen novels over and over again or develop new practices in which you use methods developed in Natural Language Processing to perform rough mapping operations that are then followed by a targeted examination of selected examples. I have called this technique “scalable reading.”

How these practices will shape literary analysis remains to be seen. We are very much at the beginning of an era. Speaking for myself and as a former Miltonist, the uncertainty of methods, tools, goals, and outcomes in the enterprise of digitally assisted literary analysis is captured in the comparison of Satan’s shield to the moon as seen by Galileo through his telescope, a wonderfully prophetic image of the power of search tools:

his ponderous shield
Ethereal temper, massy, large and round
Behind him cast; the broad circumference
Hung on his shoulders like the Moon, whose Orb
Through Optic Glass the Tuscan Artist views
At Ev’ning from the top of Fesole,
Or in Valdarno, to descry new Lands,
Rivers or Mountains in her spotty Globe. (Paradise Lost 1.284-91)

 

Useful tools for mapping

There are two additional points. First, when it comes to the analysis of canonical texts by highly skilled readers with decades of experience, it is not likely that machines will add much insight, although they may help in producing new forms of confirming evidence — digital helpers in August Boeckh’s definition of the philological enterprise as “the further knowing of the already known.” I have been reading Kahneman’s Thinking, Fast and Slow about the System 1 and System 2 of our minds, how some skills become second nature and move from System 2 to System 1 where they are practiced automatically. Thus an attendant in an indoor garage will drive my car at speeds that make my hair stand up. So it is with Stanley Fish, a superbly gifted reader who draws on decades of his own experience and that of the Milton guild when he reads a sentence in Areopagitica. His System 1 just “sees” the pattern of a sentence and its expanding context.

If the computer is not likely to be of much use in the kind of situation that is exemplified by a Stanley Fish turning to a page of Areopagitica, it may nonetheless be an increasingly useful tool in helping with the mapping operations that lay the groundwork for deeper understanding. The German classicist Karl Reinhardt, torn all his life between Wilamowitz’s Altertumswissenschaft and Nietzschean hermeneutics, wrote that

it is part of philological awareness that one deals with phenomena that transcend it. How can one even try to approach the heart of a poem with philological interpretation? And yet, philological interpretation can protect one from errors of the heart.

This is the best statement I know of the necessary modesty that is so important an element of good literary criticism. Philological tools and techniques, whether digital or not, operate within the limits of their domain. But if you use them well and with an acute awareness of their limits, they offer some protection against error and may help you look beyond those limits. Like other tools, computers may open doors, but walking through them will always remain your task. The “last mile” of Boeckhian understanding is forever receding and will always need to be walked.

Diggable and re-diggable data

My second point is a quibble with the sentence “Digitize the entire corpus and you can put questions to it and get answers in a matter of seconds.” An algorithm that takes seconds or minutes to execute may depend on data that it took weeks to prepare, and it may spit out results that it takes days to analyze. Computers may save time, but they also create a lot of new work. “Digitize the entire corpus” is easily said, but quite hard to do. Several years ago I served on a review panel for the NEH competition “Digging into Data.” There were very few “diggable data” then, and there are still very few diggable data now if you think of the range of questions literary scholars are likely to address to textual data of various kinds.

Digitization projects must make some assumptions, tacit or explicit, about the uses to which the data will be put. The default assumption in most digitization projects is that the texts will be served up as surrogates for human reading. Such texts will support simple keyword searching, but they do not add up to machine-actionable data sets that support complex forms of manipulation or analysis.

In 2001 Jerry McGann wrote: “In the next fifty years the entirety of our inherited archive of cultural works will have to be re-edited within a network of digital storage, access, and dissemination. This system, which is already under development, is transnational and transcultural.” In such a system, you would hope for a high degree of “interoperability” in the sense that machines can perform at least a few of the things that human readers do when they pick up one book from one shelf, another book from another shelf, and put things together in the serendipitous and messy manner that Stephen Ramsay calls “Hermeneutically Screwing Around.” In the classic American research library of the 20th century the Library of Congress classification guaranteed the degree of interoperability that made it easier to find books on shelves. Interoperability beyond that point was left to the remarkable skills and caprice of the all-terrain vehicle known as ‘human reader.’

A decade into the half-century of digital editing, it is, alas, not possible to say that we have come 20% of the way. A reading (whether close or distant) of the MLA sessions on things digital is likely to lead to the melancholy conclusion that the profession has not yet focused on the challenges of rebuilding the documentary infrastructure of primary data in ways that will let scholars do new things with old data in digital form.

In projects of all kinds, digital or not, you must often do a lot “to” your stuff before you can do much “with” it. Scale or “Big Data” is a common challenge to maximizing the power of the computer in any domain. The Economist, in a piece about the “data deluge”, reported that it took the Nestle corporation an entire decade to get their disparate data into a shape that allowed analysts to do useful things with them. In the life sciences, enterprises like GenBank, an “annotated collection of all publicly available DNA sequences,” speak to the commitment of an entire discipline to the collaborative construction of sharable data sets that provide the framework for the development and testing of new hypotheses.

When Theodor Mommsen in 1854 published the first volume of his famous Roman History he had already begun the massively collaborative project Corpus Inscriptionum Latinarum, which by the beginning of World War I had created the highly systematic and “interoperable” edition of Roman inscriptions that fundamentally changed the documentary infrastructure for the study of Roman legal and administrative practices. Within the scope of existing technologies this project was the work of many hands and minds, doing things “to” data in such ways that other hands and minds could do different and unforeseen things “with” them. It made Latin inscriptions “diggable” and “re-diggable” in ways that they had not been.

In a similar way, creating digitally rediggable data will be a big challenge for humanities disciplines. It is, if you will, a Falstaffian task, in which the individual hands and minds can each say of themselves: “I am not only witty in myself, but the cause that wit is in other men” (2 Henry IV 1.2.9). So far, truly re-diggable and multiply recombinable data in the humanities remain few and far between. There is a chicken-and-egg problem here: what comes first, the insights that organize the data or the data in a format that prompts questions and creates the hope of answering them within a time frame that makes their pursuit quite literally “worthwhile”?

No Messiahs, please

As I said earlier, Fish talks about a very small slice of the domain poorly encompassed by the phrase “Digital Humanities.” It happens to be a slice I am interested in, but it is worth repeating that archaeologists, art historians, epigraphers, historians, linguists, musicologists, or papyrologists would find little in his blog entries that speaks to the many ways in which they find the ‘digital’ helpful or indispensable to their projects.

Like Fish, I worked my way through the MLA Program. I was most taken with the abstract of a talk by Alison Byerly, the Provost at Middlebury College. She observes the internal conflict built into the two most commonly used terms, “Digital Humanities” and “New Media.” The implicit stance of such rhetoric is “Marcionite” (my term). Media that are “new” and humanities that are “digital” have a New Testament that makes the old one superfluous. I am not sure how many “DH folks” actually think that way. But some do, and the rhetoric has its own dynamic, with mostly unhelpful consequences.

It is different in most other disciplines. There are no self-proclaimed digital biologists, chemists, or economists, but for many practitioners in those disciplines digital tools and methods have become essential parts of their engagement with the primary data in their fields — leaving aside the matter of writing and publishing research results, which is going digital in all fields, including the humanities, albeit at different rates.

Byerly and Fish seem to be at one in their distrust of the Messianic, but Byerly, if I extrapolate correctly from the abstract of her paper, may argue for a patient, practical, and incremental engagement of ‘old’ and ‘new’, ‘digital’ and ‘analog’ with a view to a future in which those distinctions fade away. Messianic impulses are hard to curb. Some years ago the historian Dan Cohen gave a talk in which he asked whether you could think of a digital project that could compare with Jenner’s discovery of the smallpox vaccine. Implicit in the question is the idea that a new technology must legitimate itself with some spectacular breakthrough. But that may not be the best way of measuring the impact of technology over time.

If you need a prophet, the anti-Messianic Douglas Engelbart may be the better guide. In his famous essay about Augmenting Human Intellect he said:

You’re probably waiting for something impressive. What I’m trying to prime you for, though, is the realization that the impressive new tricks all are based upon lots of changes in the little things you do. This computerized system is used over and over again to help me do little things – where my methods and ways of handling little things are changed until, lo, they’ve added up and suddenly I can do impressive new things.

“Lots of changes in the little things you do.” If “you” are a scholar in some humanities discipline, there will be a lot of little things that stand in the way of getting on with your project. Overcoming them one by one, singly or collaboratively, may at some point add up to Hippolyta’s vision:

But all the story of the night told over,
And all their minds transfigured so together,
More witnesseth than fancy’s images
And grows to something of great constancy;
But, howsoever, strange and admirable.
(A Midsummer Night’s Dream 5.1.23–27)
But it will take a while.

Re-mediating the Documentary Infrastructure of Early Modern Studies in a Collaborative Fashion

The following is a hypothetical and introductory lecture to students who have shown an interest in the Early Modern world, whether its art, history, literature, music, politics, religion, or science. I wrote it in 2016. It has been lightly edited since.

My goal in this talk is to tell you a little about the documentary infrastructure of Early Modern Studies in the Anglophone world and about the changes that digital technology is making to it. Jerome McGann, one of the most distinguished editors of his generation, observed in 2001 that “in the next fifty years the entirety of our inherited archive of cultural works will have to be re-edited within a network of digital storage, access, and dissemination.” That “re-editing” or “re-mediation” is a big enterprise. Its tasks range from mundane chores to operations drawing on highly specialized knowledge. Undergraduates have done a lot of useful work on the simple — and sometimes not so simple — side of that spectrum, and wherever they have laid their hands on particular texts they have noticeably improved them. I would like to persuade you to become engaged in an enterprise of collaborative curation where you as the future users of the documentary infrastructure for Early Modern Studies participate in its production.

The allographic journey of texts

Early Modern Studies used to be known as The Renaissance, but that name has fallen out of favour, perhaps because it smacks too much of an attitude that David Bromwich somewhere characterized as “we are so smart now because they were so dumb then”. Early Modern is less judgemental and more in keeping with Ranke’s view that “every generation is equidistant from God”. In German usage “Early Modern” marks a period that begins somewhere around 1400 and ends somewhere around 1800. In the English-speaking world, those four centuries tend to be divided into two parts, known as “Early Modern” and “The Long 18th Century.” 1660, the end of the English Civil War and the restoration of the monarchy, serves as a convenient divider. The “Long 18th Century” is roughly coterminous with what Americanists call “Early American”. From a digital perspective, the documentary infrastructure problems of Early Modern and Early American are quite comparable.

Many changes over time turned Chaucer’s ‘medieval’ into Shakespeare’s ‘early modern’ England. The invention and rapid adoption of printing are of particular significance in the context of this discussion. Nelson Goodman in his Languages of Art distinguished between autographic and allographic objects. The former, e.g. Michelangelo’s David, are uniquely embodied, but there is no privileged system of writing down a Shakespeare sonnet or a Bach fugue.

The history of texts is an allographic journey with stages of re-mediation where texts are written down (‘graphic’) in a different manner (‘allo’). Consider the history of the Iliad. Rooted in oral poetry, it was probably first written down around 700 BCE in an alphabet the Greeks had quite recently adapted from a Semitic alphabet used by Phoenician traders. Around 400 BCE this alphabet was modified to provide a more nuanced representation of Greek vowels. Alexandrian and Byzantine scribes added breathing marks and accents to help with pronunciation and disambiguation. If your first encounter with a Greek Iliad was via an Oxford Classical Text you would have seen a typeface derived from the handwriting of Richard Porson, an 18th-century English scholar. A page of that text would have been ‘Greek’ to Plato because it looked nothing like what he was used to. A page from Venetus A, the 10th century Byzantine manuscript and most important source of the text, would have been just as Greek to him. He would have found it a little easier to make sense of the opening line of the Iliad in ‘betacode’, a workaround for representing Greek letters with Roman capital letters on an IBM terminal keyboard. Here is the first line in Greek letters and in betacode:

μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος
MHNINAEIDEQEAPHLHIADEWAXILHOS

I say ‘a little easier’ because in an Early Greek vase painting of a pretty girl, the legend “hē pais kalē” (written vertically) looks much like H PAIS KALH.

I dwell on this in such detail because readers through the ages have had difficulty with the allographic nature of texts. Readers and writers are deep “tool conservatives”: they are apt to think of change as a loss of authenticity and to feel that the original (or most familiar) encoding is an essential part of the text. What Joel Mokyr in The Gifts of Athena celebrates as lowering the access costs to knowledge, they have considered a profanation of sacred knowledge. Thus an Italian 16th-century writer said that “the pen is a virgin, the press a whore.” In this regard digital texts are to books what books were to manuscripts, and low-status codices to high-status scrolls. In the end, however, the low upstart has always won. For you, a laptop or mobile device will usually provide the most convenient and often the only access to an Early Modern book, especially if you stray beyond the safe limits of the canonical.

The English Short Title Catalogue

Traditional London taxi drivers have to pass an examination in which they demonstrate “The Knowledge”, their command of some 25,000 streets over a 100 square mile area, including points of interest and good ways of getting from here to there. You can’t have The Knowledge unless somebody has named the streets, given numbers to the houses on them, and kept a public register of them. A street without a registered name might as well not exist. Ditto for books. The Cartesian cogito of books might read “I am catalogued, therefore I am.” No cataloguing, no scholarship. Cataloguing is itself a scholarly activity of considerable complexity and has a non-trivial effect on scholarly field boundaries.

In the early 1880s Alfred Pollard, a young man excluded from a teaching career by a very bad stutter, found a job in the British Museum’s Department of Manuscripts. Some forty years later he published, together with G. R. Redgrave, A short-title catalogue of books printed in England, Scotland, & Ireland and of English books printed abroad, 1475–1640, the first systematic census of English print before 1640. That is why every imprint before 1640 has an STC number. Twenty years later Donald Wing at Yale extended this work, and books between 1640 and 1700 got “Wing” numbers. Fast forward another generation — we are now in the early days of personal computers — and there is a new and digital project to create a census of all 18th century texts, the Eighteenth-Century Short Title Catalogue. Then the editors of that catalogue decided to combine Pollard and Redgrave, Wing, and the ESTC in a new and digital animal called the English Short Title Catalogue, which aims at being the authoritative description of the roughly half million books published in the English-speaking world before 1800. About a quarter of them are Early Modern; three quarters belong to the Long 18th Century. The ESTC is built on a foundation of more than 120 years of bibliographical labour. If you have read a little Virgil in the original Latin, you might remember tantae molis erat Romanam condere gentem.

Early English Books Online (EEBO) and the Text Creation Partnership (TCP)

In the late 1930s libraries began to create microfilm copies of books. For many years, University Microfilms, an offshoot of the University of Michigan, was the leader in this effort. By the sixties many early modern texts were available in that effective but unloved format. Instead of combining business with pleasure in an expensive trip to the British Museum you could now go into the basement of a provincial university library and spend hours reading some 16th-century text while operating the microfilm reader in a dusty and windowless room. Variable in quality, with pages missing here or duplicated there, microfilm was nonetheless a powerful gift of Athena and broadened access to rare books.

University Microfilms changed owners several times before ending up in the hands of the Proquest corporation, which around 2000 digitized the microfilms of English books before 1700 and made them available over the Web, where they were “free” for the privileged members of institutions that could afford the expensive subscription. When some years ago I asked a colleague “What difference have digital texts made to your work?” I barely had time to finish my question before he shot back: “EEBO has changed everything.” The digital scans are no better and sometimes worse than the microfilm images, but you can get at them at 2 am in your pyjamas. Access is king and often trumps quality.

The Text Creation Partnership (TCP) is a close contemporary of EEBO. Broadly speaking, the project, completed in 2015, aimed at creating a deduplicated library of books before 1700 in transcriptions that were faithful to the orthographic practices of the printers and used protocols of the Text Encoding Initiative (TEI) to articulate structural metadata. A raw page of TCP text is full of the markup in angle brackets you are familiar with from HTML. It “containerizes” text in a way that lets a machine identify chunks of text as lines of verse, paragraphs, list items, tables, titles, signatures, epigraphs, trailers, postscripts, quotations, etc.

The TCP corpus of not quite two billion words in ~60,000 titles is not a very useful resource if you are interested in the intratextual variance of different editions of the same work, but it is an unmatched resource when it comes to pursuing the intertextual filiations of the first two centuries of print culture. Its structural markup lets you extract text chunks across thousands of documents. The tools for doing so have become more user-friendly, but it still takes a little while to acquire moderately sophisticated text processing skills.
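
To give a concrete, if purely illustrative, sense of what such extraction looks like, here is a minimal Python sketch that walks a folder of TEI-XML transcriptions and pulls out the plain text of every epigraph. The folder name and the choice of element are assumptions made for the example; this is a sketch of the general technique, not the TCP’s own tooling.

# A minimal sketch: harvest one kind of structural container (here, epigraphs)
# from a folder of TEI-XML transcriptions. "tcp-texts" and the element name
# are illustrative assumptions, not actual resources.
from pathlib import Path
from lxml import etree

def extract_chunks(folder, element="epigraph"):
    for path in sorted(Path(folder).glob("*.xml")):
        tree = etree.parse(str(path))
        # "{*}" matches the element whether or not the file declares a TEI namespace
        for node in tree.iter("{*}" + element):
            # itertext() flattens the chunk back into plain reading text
            text = " ".join("".join(node.itertext()).split())
            yield path.name, text

for filename, chunk in extract_chunks("tcp-texts"):
    print(filename, chunk[:80])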

The Query Potential of the Digital Surrogate: The Digital Combo

The form in which you read a text is always a surrogate, an arbitrary embodiment of an intentional object. The wistful and whimsical prefaces with which early printers apologize for their errors and encourage readers to correct “slips of the pen or press” (quae praelo aut penna lapsa vidisses) in odd ways reflect their understanding of texts as objects always a little beyond our grasp.

If a text is always encountered through a surrogate that is in some respects arbitrary, what is an appropriate re-mediation of an Early Modern text in an increasingly digital world? I am not talking about the end of the printed book, which is likely to remain unchallenged as the best way of reflective engagement with a single text. But even in the print world the modal form of scholarly reading involves constant movement among many books. A lot of mechanical ingenuity over centuries has gone into cutting the time cost of moving from one book to another, from the often aggressively technological book wheels of the Early Modern world to Jefferson’s elegant Lazy Susan.

Any medium has ‘affordances’ or things that you can do with it more readily than in another medium. Ranganathan’s fourth law of library science says “Save the time of the reader.” Many new affordances are about reducing the time cost of some activity. Compared with a scroll, the codex is a random access device that allows rapid movement across the pages of a book. Finding devices such as tables of contents or indexes are impracticable in scrolls. The digital medium is even more agile than the codex. It supports rapid and precise alignment of text and image, and it supports rapid “search and sort” operations across millions of pages of texts as if they were pages from one very long book.

A few years ago there was a Princeton conference about “research data lifecycle management”. I read about it in a blog that had this quotation from a talk by Brian Athey, the chair of computational medicine at Michigan: “Agile data integration is an engine that drives discovery.” That’s not the way humanists talk, but we’re familiar with the idea. The Weimar edition of Luther’s works runs to sixty volumes, not counting a dozen German and Latin index volumes of people, places, things, and citations. Those index volumes have been engines of discovery for generations of scholars.

Theodor Mommsen’s Corpus Inscriptionum Latinarum (CIL) offers an even more striking example from a predigital world. Before 1850 many Latin inscriptions were known, but knowledge of them was scattered. Around the time that Mommsen wrote the first volume of his Roman History (which won him the second Nobel Prize in Literature), he started an edition of Latin inscriptions based on new ‘autoptic’ transcriptions. He directed the project for half a century and edited many of the volumes himself. By the early twentieth century, a good Latinist in a then quite provincial Midwestern university (Indiana or Northwestern) had access to its sixteen volumes on eight feet of library shelving, with the corpus of inscriptions clearly organized in a time/space continuum. That was a transformative event for the study of Roman administrative and legal history. A lot of agility was built into those heavy tomes.

“Research data life cycle management” and “agile data integration” are helpful terms in thinking about a complex digital surrogate that I call the “digital combo”. This surrogate combines three different aspects of a text. First, the digital facsimile of a page in an Early Modern book gives you access to a privileged (and often the only) witness to the words in a work. The look and feel of the page also provide a lot of information about the milieu from which the text originates or the manner in which it addresses its audience. Much can be learned from such “paratextual” features.

Secondly, a careful digital transcription is much more agile than the printed text or its digital facsimile. You can cut and paste from it, and you can search within it. Thirdly, if the transcription is part of a corpus and if the transcriptions of each text have been done in a reasonably consistent manner, the agility of each text becomes an agility of the corpus. You can read and search the text within the context of the others. You can also read the corpus as if it were one very large book. This digital Book of Early Modern English is much bigger than the Luther Book or the Book of Roman Inscriptions. It adds up to a re-mediation of the Early Modern print heritage with affordances that we have only begun to explore.

Digital combos are nothing new. On the Internet Archive and in the Hathi Trust library there are now millions of digital surrogates that combine page images with automatic transcriptions produced via “optical character recognition” (OCR). Searching uncorrected or ‘dirty’ OCR across vast corpora is a crude but often successful way of finding stuff in books printed since the 1800s. But while OCR has made giant strides in the past two decades, for texts before 1700 it is still a pretty hopeless enterprise because the printed lines too often resemble crooked teeth.

The Text Creation Partnership offers digital combos of a much higher quality. They combine TEI-XML transcriptions with EEBO images. But these digital combos have several problems. The image quality is often poor and well below the standards of contemporary images. Secondly, poor image quality has led to many errors and lacunae in the transcriptions. Completion and correction of the transcription has been a frequent user request. Finally, the images are behind a paywall, and in North America there is not much access to them outside the research universities, whose number is counted in the low hundreds.

Can you have digital combos that combine high-quality and public domain images with TCP transcriptions in an environment that allows for the collaborative curation and exploration of the texts? Take a look at https://texts.earlyprint.org. The site includes ~ 52,500 texts, including some 800 plays and a large portion of the Thomason tracts, a famous collection of books and pamphlets from the period of the English Civil War (1640–60). There are currently 630 digital combos. Not quite half of them are plays.

Over the past few years many Rare Book Libraries have begun to make some digital surrogates of their holdings publicly available. Not all of their Early Modern holdings are mapped to ESTC numbers, and very few of them are mapped to TCP texts. But if a catalogue record lists an ESTC, STC, or Wing number, the mapping to a TCP text is trivial. One could imagine a loosely coordinated enterprise in which libraries give priority to digitizing image sets that map to TCP texts and avoid overlap with already existing digital combos. Do this steadily for five years, and the results will be significant.

Special Features of Early Modern Studies

Remember “The Knowledge” of the London taxi driver, 25,000 streets across a hundred square miles. What would it mean to have “The Knowledge” of Greek tragedy or Arthurian legend? Here the equivalents of the street names and numbers belong to the past. A mapping of that knowledge depends on answers to three questions. How much has survived? How much of the surviving materials have been mapped? How carefully and consistently have they been mapped?

From the perspective of those questions the systematic digital re-mediation of the Early Modern print heritage is especially promising. First, much of what was printed has survived. We know the titles of more than 1,000 Greek tragedies, but only three dozen have survived. Martin Wiggins in his magisterial census of British Drama 1533–1642 (Oxford, 2014-) counts 543 plays that have survived and were printed between 1567 and 1642. This list is not radically different from a 1656 “Exact and perfect CATALOGUE of all the PLAIES that were ever printed; together, with all the Authors names.”

One must be cautious in extrapolating from drama to other genres. One should also remember that the print heritage is a quite different and more formal animal than the written heritage. But it appears that a high percentage of what was printed has survived and a high percentage of what survived has now been digitally transcribed in a remarkably consistent format. There may not be another epoch of comparable scope and significance where the printed record has been transcribed so completely and into a digital format that offers rich opportunities for agile data integration.

What about the quality and consistency of ‘mapping’ or transcription? This is a particularly important question for digitally encoded materials. An IBM executive once observed that humans are smart but slow, while computers are fast but dumb. Human readers adjust easily and tacitly to a wide variety of textual conditions. Computers will travel at lightning speed across vast stretches of text, but they will fail as soon as they encounter a textual condition about which they have not been told in advance. Given the variety (and plain inconsistency) of early modern print practices, the TCP archive has achieved considerable success in maintaining a level of consistency that supports complex analytical operations across the corpus as a whole or subsections of it.

Finally, by current computing standards, this corpus of less than two billion words is no longer particularly large. The textual data fit comfortably on a smart phone that may cost less than a replica of Jefferson’s revolving bookstand. Computationally assisted operations of scholarly interest on a corpus of this size no longer require an expensive infrastructure but can be done on quite ordinary laptops.

Natural Language Processing (NLP) and Augmented Ways of Reading

When I reviewed the writing samples of job candidates in the eighties, candidates from Yale could be spotted right away because their samples were printed on Yale’s mainframe in a quite distinctive style. Those were the days of Deconstruction, and in interviews we would joke about ghosts in the margin if the machine put notes in odd places. The candidates had no special interest in computers, but the mainframe would automatically renumber their footnotes. For this they would do anything, and they acquired the non-trivial text processing skills that it took to babysit a complex document like a dissertation on a mainframe computer of the eighties. The GUI interface of Microsoft Word relieved users of the need for such knowledge. Today the text processing skills of the modal humanities student or scholar do not extend beyond very simple wild card and right-truncated searches.

This is a pity. A modest amount of Natural Language Processing (NLP) goes a long way towards increasing the speed and accuracy of basic operations involved in text-centric work. The word ‘computer’ is misleading in its suggestion that the machine is mainly about numbers. The name of the famous early computer language ‘lisp’ is an abbreviation for ‘list processor’ — a much better name for a machine that spends much of its time making, sorting, and comparing lists or extracting items from them. These are basic operations in scholarly work. There is nothing particularly digital about them, and you need not sell your humanistic soul to some technological devil to take advantage of the fact that the machine can do some simple things much faster and with fewer errors than you can.

The digital re-mediation of the Early Print heritage will benefit greatly from the systematic application of Natural Language Processing technologies (NLP) that are widely used in Linguistics and many social sciences. There are persistent misunderstandings about the role of such technologies in the humanities. In the second part of Shakespeare’s Henry VI the peasant rebel Jack Cade indicts the Lord Say with these words:

It will be proved to thy face that thou hast men about thee that usually talk of a noun and a verb, and such abominable words as no Christian ear can endure to hear. (2 Henry VI, 4.7.35ff.)

This is an excellent example of the deep resistance that people have to the ‘explicitation’ of the tacit knowledge that humans bring to the task of “making sense” of language. A text written in the Roman alphabet is a very sparse notation that depends heavily on the skills that the reader brings to the task of making sense of it. Computers cannot make sense of anything. They can only follow processing instructions. If you want the machine to simulate some forms of human understanding, you must introduce the rudiments of readerly knowledge in a manner that the machine can process. These rudiments are very crude, they are added through machine processes, and they have error rates of up to 4%, but they significantly increase the query potential of documents, especially when done “at scale”. Typically you give every word a unique identifier and add its ‘lemma’ or dictionary entry form and a “part of speech” tag. Such annotation increases the size of the text by a whole order of magnitude. It produces what a witty colleague has called a ‘Frankenfile’ that is human-readable in principle, but not in practice. On the other hand, it can be processed very fast by a machine. Here is a sample of a few words from Jack Cade’s indictment:

<w xml:id="sha-2h640703704" lemma="talk" pos="vvb">talk</w>
<w xml:id="sha-2h640703705" lemma="of" pos="acp-p">of</w>
<w xml:id="sha-2h640703706" lemma="a" pos="d">a</w>
<w xml:id="sha-2h640703707" lemma="noun" pos="n1">noun</w>
<w xml:id="sha-2h640703708" lemma="and" pos="cc">and</w>
<w xml:id="sha-2h640703709" lemma="a" pos="d">a</w>
<w xml:id="sha-2h640703710" lemma="verb" pos="n1">verb</w>
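
A machine, on the other hand, digests such a file without complaint. As a minimal sketch (the filename is a hypothetical stand-in), a few lines of Python will turn the <w> elements into a word-per-row table of identifier, spelling, lemma, and part of speech:

# A minimal sketch: read <w> elements from an annotated file into rows of
# (xml:id, spelling, lemma, pos). The filename is a hypothetical placeholder.
from lxml import etree

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

rows = []
for _, w in etree.iterparse("2H6-annotated.xml", tag="{*}w"):
    rows.append((w.get(XML_ID), (w.text or "").strip(), w.get("lemma"), w.get("pos")))
    w.clear()  # keep memory flat when the file is large

for row in rows[:7]:
    print(row)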

Most readers are familiar with searches where you put in a string of characters and retrieve matches that contain the string. But quite often you may be interested in unknown strings that meet some criteria. An annotated corpus supports such queries. It can retrieve a list of all nouns (or proper names), sentences that begin with a conjunction or end with a preposition, or phrases that match the pattern “handsome, clever, and rich” in the opening sentence of Jane Austen’s Emma. Run against a corpus of Early Modern drama, such a search yields results like “The Scottish king grows dull, frosty, and wayward.”
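
A toy sketch of such a pattern query, not the interface of any actual corpus tool: it scans a stream of (word, part-of-speech) pairs for the ‘adjective, adjective and adjective’ shape. Treating tags that begin with ‘j’ as adjectives is an assumption about the tagset.

# A minimal sketch: find "ADJ , ADJ and ADJ" sequences in a POS-tagged stream.
# Tokens are (word, pos) pairs; pos tags starting with "j" are assumed adjectives.
def is_adj(token):
    word, pos = token
    return pos is not None and pos.startswith("j")

def triads(tokens):
    for i in range(len(tokens) - 4):
        w = tokens[i:i + 5]
        if (is_adj(w[0]) and w[1][0] == "," and is_adj(w[2])
                and w[3][0].lower() == "and" and is_adj(w[4])):
            yield " ".join(t[0] for t in w)

sample = [("grows", "vvz"), ("dull", "j"), (",", "pun"), ("frosty", "j"),
          ("and", "cc"), ("wayward", "j"), (".", "pun")]
print(list(triads(sample)))  # ['dull , frosty and wayward']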

If you have data of this type for an individual text, there are some primitive but powerful ways of aggregating them across a corpus such as the 55,000 TCP texts. You count the total occurrences of a word in a corpus, which gives you its “collection frequency.” The number of texts in which it occurs, the “document frequency,” often provides more useful information. If you know which words in a text occur in only one or two other texts, it may be a good idea to look at them. Procedures of this type do not replace reading, but they augment it. They also provide ways of making visible aspects of a text that reading does not easily reveal. Some of these count data are best thought of as lexical metadata that are added as appendices to a corpus somewhat in the same manner in which indexes are added to books. Aristotle writes about relative frequency as an important property of words (Poetics 21).
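
Both counts are almost trivially simple to compute once the lemmata are in hand. A minimal sketch, with a two-text toy corpus standing in for the real one:

# A minimal sketch: collection frequency (total occurrences in the corpus) and
# document frequency (number of texts a word occurs in). The toy corpus maps a
# text's name to its list of lemmata.
from collections import Counter

corpus = {
    "textA": ["man", "ship", "god", "man"],
    "textB": ["ship", "sea"],
}

collection_freq = Counter()
document_freq = Counter()
for name, lemmata in corpus.items():
    collection_freq.update(lemmata)
    document_freq.update(set(lemmata))  # each text counts a word only once

print(collection_freq["man"], document_freq["man"])    # 2 1
print(collection_freq["ship"], document_freq["ship"])  # 2 2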

A Digression about Counting

Humanities scholars often distrust quantitative arguments. Or so they say, but in practice their discourse relies heavily on such concepts as ‘none’, ‘few’, ‘some’, ‘many’, ‘most’, ‘all’, ‘much more’, or ‘a little less’. ‘Hapax legomena’ is one of the oldest critical concepts and originally referred to words that are found only once in the Homeric corpus. Athenaeus in the Deipnosophists, a third-century compilation of literary and culinary gossip, pokes fun at a pedant who asked of every word whether it occurred (keitai) or did not occur (ou keitai) in Attic Greek of the classic period. Hence his nickname Keituoukeitos. The great classical scholar Wilamowitz turned a German children’s phrase einmal ist keinmal (once doesn’t count) into Einmal ist keinmal, zweimal ist immer (once is never, twice is forever). The English linguist J. R. Firth is famous for his dictum that “you shall know a word by the company it keeps.” My mother liked to say that “three hairs on your head are relatively few, three hairs in your soup are relatively many.”

Humans in fact are inveterate, skillful, but informal statisticians, forever calculating the odds. The French mathematician Laplace observed that probability theory is just common sense reduced to a calculus (le bon sens réduit au calcul). It makes you appreciate more exactly ce que les esprits justes sentent par une sorte d’instinct, sans qu’ils puissent souvent s’en rendre compte. It is very hard to capture the meaning and tone of esprits justes. A colloquial rendering might be “what savvy minds just ‘get’ without being able to say why.”

A sense of a word’s frequency or currency is an important component of your knowledge of it. Money and philology mix wittily in Love’s Labour’s Lost when Costard reflects on the tips that pretentious superiors give him as ‘remuneration’ and ‘guerdon’, words well outside his base vocabulary (LLL 3.1.145ff.). If you have a properly annotated corpus it is not difficult to quickly extract quite detailed information about the ‘currency’ of words in texts or groups of them. Visualizations of such data have become quite popular. Think of the Google Ngram browser or the word clouds that are likely to accompany a New York Times report about a State of the Union address.

Close, Distant, and Scalable Reading

August Boeckh, another philological giant from the 19th century, said that “philology is, like every other science, an unending task of approximation.” You never quite get there, but you have to start somewhere. As a first-year graduate student you are like the would-be London cabbie and have to get “the” or at least “enough” knowledge of the Early Modern world to find your way around. Your dissertation may be about the equivalent of Highgate, but you need to have some idea where that is in relation to Elephant & Castle or Ladbroke Grove. For these crude mapping operations, it is extraordinarily helpful to combine bibliographical data from the ESTC with quite primitive frequency data about the distribution of words in works of interest to you.

It is important to be very clear about what such computationally assisted routines can and cannot do. They will not tell you much about canonical authors that you cannot get more quickly from other sources. There is a simple reason for this. The works of canonical authors have been crawled over for generations by thousands of the slow but smart computers also known as human brains. That is why findings of statistical studies about famous authors often tempt you to say with Horatio “There needs no ghost, my lord, come from the grave / To tell us this” (Ham. 1.5.125).

In some cases, such inquiries generate fascinating second-order results. Take J. F. Burrows’ 1987 classic Computation into Criticism: A Study of Jane Austen’s Novels. From one perspective, Burrows does not change your view of any of the novels or characters in them. From another perspective, he tells you how much of your view of this or that character is shaped by the relative frequency of the thirty most common words in that character’s speech. You don’t learn much about the ‘what’, but you learn a lot about the ‘how’. That said, Jane Austen probably was not planning to increase the percentage of first-person pronouns when putting words in the mouth of Sir Walter Elliot.

Most authors are not canonical, and much scholarly work turns on knowing enough about texts that you would rather not read very closely or at all. Over the two decades of the English Civil War the bookseller George Thomason collected all pamphlets and books published during that period. The 22,000 texts miraculously stayed together and ended up in the British Museum, providing extraordinarily dense eyewitness coverage of a transformative period of English history. Thomas Carlyle called them

the most valuable set of documents connected with English history; greatly preferable to all the sheepskins in the Tower and other places, for informing the English what the English were in former times; I believe the whole secret of the seventeenth century is involved in that hideous mass of rubbish there.

There are valid and rewarding ways of cherry-picking your way through this “hideous mass” looking for salient detail, breathtaking in its brilliance or idiocy. But there are equally valid ways of looking for geographical or temporal differences in the distribution of quite ordinary phenomena that will not draw attention to themselves when encountered separately here or there. “Hideous mass of rubbish” could be a way of describing an average week of Twitter or Google queries. I read somewhere about a dissertation whose author was interested in what you could learn from Twitter about regional differences in attitudes towards gay people. Unsurprisingly, the author found that openly gay Twitter stuff was much more common in California than in Southern states. Also unsurprising, but striking and haunting, was the fact that in Georgia (if I remember correctly) the most frequent completion of the Google question “is my husband” was “gay.”

A finding of this kind is a good example of what Franco Moretti has called “distant reading”, a term that challenges the “close reading” introduced by the New Critics, whose ethos is epitomized in the title of Cleanth Brooks’ The Well Wrought Urn. Moretti’s “distant reading” can be seen as a digitally enhanced return to an earlier mode that surveyed an entire era from a distance. In thinking about the distinctive affordances of the digital medium I prefer the term “scalable reading.” A properly annotated digital corpus lets you abstract lexical, grammatical, or rhetorical patterns from a distance, but it also allows you to “drill down” almost instantly to any particular passage. If the time cost of zooming in and out is very low you can afford to follow up many leads in the hope that not all of them are wild goose chases.

It is true that reading from a distance offers you a shallow understanding of a text or corpus. But so does reading about it in an introductory survey that keeps you at a safe distance from any detail. In scalable reading you are at least potentially close to real words in real books, and “drilling down” is a second’s work. There will never be a substitute for looking very closely at the words and context that matter to your argument. But scalable reading is much better than a survey at helping you “look for” what you should “look at” closely. It is also true that some patterns are more readily seen from a distance. I know of no better introduction to the Iliad than the list of the three most common words ordered by descending frequency: man, ship, god.

Cultural Analytics and Data Janitoring

‘Augmented’, ‘distant’, or ‘scalable’ reading are terms for techniques of analysis characteristic of an emerging discipline called ‘Cultural Analytics’. The term is a portmanteau of ‘Cultural Studies’ and ‘Analytics’, which is computer science jargon for stuff you do with data. The tool kit of analytics has become more powerful over the years, and the size of an annotated TCP corpus (~150 GB) is no longer an obstacle for running iterative ‘analytics’ across the entire set within minutes or hours rather than days or weeks. In terms of subject matter, a high percentage of the TCP corpus belongs to History, Law, Political Science, Religion, and Sociology. Much of that corpus lends itself to forms of analysis that have been practiced in the Social Sciences. Prominent is the use of lexical analysis to “predict” the ideological formations and likely choices of individuals or groups. I put the word “predict” in quotation marks because it has very little to do with a real future. It is the technical term for the results that the machine produces when it applies some statistical routine to a data set.

The application of such techniques to the Early Modern corpus is clearly promising. But computers are very fussy about the format in which they are fed their data. You typically must do a lot ‘to’ the data before you can do anything ‘with’ them. Curation (doing to) and exploration (doing with) are two sides of the coin of working with digital data. A 2014 article in the New York Times gives an eloquent account of ‘data janitoring’ and reports a data scientist as saying “Data wrangling is a huge — and surprisingly so — part of the job… It’s something that is not appreciated by data civilians. At times, it feels like everything we do.” “Welcome to the Club of Invisible Work”, a scholarly editor might say.

The TCP texts still need a lot of “data janitoring” — remedial work in the common use of that term. Some of it has to do with their appeal to readers. “Plain Old Reading” will remain the most important form of interaction with the texts. Human readers have a low tolerance threshold for typographical errors. They are annoyed by them even or especially if the error does not obscure the meaning of the word. The texts will be our main windows to the past for a long time, and we should keep those windows as clean as possible out of a sense of respect both to the texts and their readers.

Other forms of data curation have more to do with making it possible for the machine to process the data in a manner that will give scholars trustworthy results in response to their queries. The goal of collaborative curation in that regard is to give the data an ‘interface’ that different types of analytics can use with as little need as possible for additional data janitoring. Given the variance of Early Modern orthography a data layer of standard spellings is of equal value both to readers and machines. The benefits to modern readers are obvious. For the machine, standardized spelling is a way of both erasing and articulating orthographic variance. Most of the time a user looking for ‘lieutenant’ will not be interested in the 145 different ways in which that word is spelled in Early Modern texts before 1640. But if the variant spellings are dependably mapped to ‘lieutenant’ there are straightforward analytics that count, group, and sort the variants by time or frequency. Standardized spellings also make it significantly easier for an algorithm to find shared phrases of two or more words, a critical source of evidence for intertextual research.
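
What ‘erasing and articulating’ variance means in practice can be sketched in a few lines. The four-entry mapping below is a hypothetical stand-in for the real standardization layer:

# A minimal sketch: group and count variant spellings behind one standardized
# form. The mapping is a hypothetical stand-in for an actual standardization layer.
from collections import Counter, defaultdict

standard = {"lieutenant": "lieutenant", "liuetenant": "lieutenant",
            "leiutenant": "lieutenant", "lieftenant": "lieutenant"}

tokens = ["liuetenant", "lieutenant", "lieftenant", "liuetenant"]

variants = defaultdict(Counter)
for spelling in tokens:
    variants[standard.get(spelling, spelling)][spelling] += 1

# A search for "lieutenant" can now either ignore the variance or inspect it:
print(sum(variants["lieutenant"].values()))  # total occurrences of the word
print(variants["lieutenant"])                # the spellings behind them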

Algorithms can do a pretty good job of creating standardized spellings, but if you want perfection you need some human intervention, and the earlier the text the greater the need. If this happens through “dispersed annotation” of a central resource by many hands on the basis of a shared protocol, consistency is more readily achieved, and corrections offered in one location may be propagated algorithmically to other locations. “Dispersed annotation” is a technical term from the life sciences where the curation of genomes and their subsequent placement in a shared repository is a standard practice. The Early Modern corpus as a cultural genome offers a useful metaphor for some purposes. At least it draws attention to practices in the sciences, widely approved if not always followed, of iteratively curating and sharing research data so that they can be used by others for other purposes.

Engineering English or “Only Connect”

In a 2004 paper Philip Lord defined curation as

the activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose, and available for discovery and re-use. For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose. Higher levels of curation will also involve maintaining links with annotation and other published materials.

Like the proverbial woman’s work, curation is never done. The digital corpus of Early Modern texts is stable if you think of it as a fixed number of texts, with occasional additions or replacements. It is ‘dynamic’ if you think of the layers of metadata that are in practice associated with digital data. It is in Boeckh’s phrase an “unending task of approximation” to ensure that the data and metadata are “fit for contemporary purpose, and available for discovery and re-use.” Contemporary purpose changes, and so do the tools available to scholars. New tools change the calculus of the possible, but taking full advantage of a new potential requires making the data accessible to the tools. Tools and data are not always easy to tell apart. Are metadata tools or data or both?

The responsibility for keeping scholarly data “fit for purpose” ultimately rests with the scholarly communities that base their work on them. Not so long ago the tripod of cultural memory rested on the work of scholars, publishers, and librarians. Increasingly it rests on collaboration among scholars, librarians, and IT professionals. The technical changes have been dizzying. What has not changed is the special responsibility that scholars have for the ‘fitness’ of their data.

“Had we but world enough and time”: the fame of this counterfactual from Marvell’s To His Coy Mistress depends on the fact that we don’t. Time is the most precious of all commodities. Our decisions are governed by our sense of what is quite literally ‘worthwhile’. “Save the time of the reader” is a domain specific application of the art of ‘engineering’, the exercise of ingenium (the Latin root of ‘ingenuity’) to create ‘engines’ that will reduce the time cost of getting from some ‘here’ to some ‘there’.

The maintenance of a collaborative environment for curation is in a very literal sense an engineering problem whose solution can take advantage of the phenomenal decrease in the time cost of many activities. Computer scientists use the term “mixed initiative” for tasks that divide work between a human and a computer. These divisions can take many forms. Consider the remedial task of fixing simple typographical errors. Computer science majors at Northwestern have used neural network techniques to make context-sensitive decisions about the correct spelling. There is a good chance that between half and two thirds of the five million textual defects in the TCP corpus can be fixed by a machine with a high degree of confidence. That leaves a lot of remedial work to humans, but a “mixed initiative” environment makes that work less tedious. Textual correction has a “fix it, find it, log it” workflow, where the finding and logging take up much more time than the fixing, which may be the work of seconds. The EarlyPrint sites currently house about a quarter of the publicly available TCP texts in an environment that supports “curation en passant”. If you come across a defect in a text, fixing it may take no more time than writing a word in the margin of a book. The software takes over the logging. It will reduce to seconds the time cost of looking for other defects. If you work with a text in that environment, corrections you make for yourself become available to others without any additional work on your part.

A similar logic is at work if you move beyond remedial tasks to the construction of higher-order metadata that are the digital successors of the indexes of citations, people, places, and things in the Luther edition. Such indexes are created by machines running a script, but it takes a lot of iteration, data checking, and tweaking by humans to make the indexes “fit for purpose.”

Up-to-snuff and free images are the most appealing part of a re-mediated corpus of the Early Modern print heritage. As said before, side-by-side display of text and image provides visible proof for the trustworthiness of the transcribed text, while layout and typography have their own story to tell. Matching a transcribed page of text with a new image from the same edition is a non-trivial task that benefits from a mixed-initiative approach. The identifiers of the digital texts are based on the IDs of the EEBO images rather than the often unreliable or non-existent page numbers of the source texts. The identifier of a newly made image will be based on the practices of the institution that created the image and will also have nothing to do with the page number of its source. You cannot automatically align text and image, but you can create an environment in which it will rarely take more than fifteen minutes for a reader with a laptop, an Internet connection, and an interest in a particular text to align text and image and create a digital combo.

A Hortatory Conclusion

To return once more to the concept of “research data life cycle management”, Early Modern printed books have entered a phase in their life cycle in which the affordances of digital media create many opportunities, but it will be the work of many individuals to turn those opportunities into real improvements and enrichments of a textual heritage. Joshua Sosin, a papyrologist at Duke who played a major role in the Integrating Digital Papyrology project, has argued forcefully for “investing greater data control in the user community”. Clay Shirky has written eloquently about “cognitive surplus” in his book with that title. In a digital world lots of people have lots of hours that they can spend (or waste) in different ways. Many of the 55,000 texts will benefit from some attention of a housekeeping kind, and much of that attention can be given to textual problems that can be solved in minutes or hours at a time rather than days or weeks and whose solution calls on patience and attention to detail rather than highly specialized professional competence.

The Renaissance Society of America and the Shakespeare Association of America have between them several thousand members, and their students are probably counted in the tens of thousands. Together they have a lot of cognitive surplus. If a little of it is spent every year on improving the corpus of Early Modern texts, the cumulative effect of five years’ work will be considerable. If work with old books is anywhere on the horizon of your career expectations, you could think of contributing to this effort as a form of service in which useful work is done and useful lessons are learned.

Scalable Reading

 

‘Scalable reading’ is my term for Digitally Assisted Text Analysis or, if you like acronyms, DATA. I owe the term partly to Franco Moretti, who some years ago coined the term ‘distant reading’ as a way of challenging the hallowed practice of ‘close reading’ and drawing attention to distinctive affordances of texts in digital form. I also owe a debt to the engaging How to Talk About Books You Haven’t Read, in which Pierre Bayard tells you about the very ancient art of somehow gathering just enough knowledge about a book to say something clever about it even if you have not read all or even most of it. This has always been an important skill, since as long as there have been books there have been more than anybody could read. For a while I talked about The Importance of Not-Reading, and I remembered a poem by Christian Morgenstern, an early twentieth century German poet famous for his nonsense poems, many of which bear witness to his philosophical and mystical leanings:

Die Brille / The Spectacles
Korf liest gerne schnell und viel;
darum widert ihn das Spiel
all des zwölfmal unerbetnen
Ausgewalzten, Breitgetretnen.
Korf reads avidly and fast.
Therefore he detests the vast
bombast of the repetitious,
twelvefold needless, injudicious.
Meistens ist in sechs bis acht
Wörtern völlig abgemacht,
und in ebensoviel Sätzen
läßt sich Bandwurmweisheit schwätzen.
Most affairs are settled straight
just in seven words or eight;
in as many tapeworm phrases
one can prattle on like blazes.
Es erfindet drum sein Geist
etwas, was ihn dem entreißt:
Brillen, deren Energieen
ihm den Text – zusammenziehen!
Hence he lets his mind invent
a corrective instrument:
Spectacles whose focal strength
shortens texts of any length.
Beispielsweise dies Gedicht
läse, so bebrillt, man – nicht!
Dreiunddreißig seinesgleichen
gäben erst – Ein – – Fragezeichen!!
Thus, a poem such as this,
so beglassed one would just — miss.
Thirty-three of them will spark
nothing but a question mark.

The poem anticipates modern technologies of  text summarization and shrewdly points to what might get lost in such an endeavour.  But in the end neither ‘distant reading’ nor ‘not-reading’ seemed to express adequately the powers that new technologies bring to the old business of reading.  And both terms implicitly set the ‘digital’ into an unwelcome opposition to some other — a trend explicitly supported by the term  “Digital Humanities” or its short form DH, which puts phenomena into the ghetto of an acronym that makes its practitioners feel good about themselves but allows the rest of the humanities to ignore them.

The charms of Google Earth led me to the term Scalable Reading as a happy synthesis of ‘close’ and ‘distant’ reading. With Google Earth you can zoom in and out of things and discover that different properties of phenomena are revealed by looking at them from different distances. If you stand at the corner of Halsted and Division you cannot see the North-South oblong of Chicago’s street grid or the fact that the city is located at the southern tip of a very large lake. Both of these important facts about Chicago become visible as you zoom out. Tara McPherson has drawn my attention to Powers of Ten, a 1968 documentary by Charles and Ray Eames. They are better known today for the Eames chair (1956), but for generations of middle school kids in the seventies and eighties Powers of Ten offered a first glimpse into the mysteries of the universe when contemplated “at scale” or rather at different scales.

Scalable reading, then, does not promise the transcendence of reading, close or otherwise, by bigger or better things. Rather it draws attention to the fact that texts in digital form enable new and powerful ways of shuttling between ‘text’ and ‘context.’ Who could complain about tools that let you rapidly expand or contract your angle of vision?

 

 

…like the Moon, whose Orb
Through Optic Glass the Tuscan Artist views
At Ev’ning from the top of Fesole,
Or in Valdarno, to descry new Lands,
Rivers or Mountains in her spotty Globe.
(Paradise Lost, 1.287-91)

Milton’s famous comparison of Satan’s shield to the moon speaks to the thrills of discovery at the dawn of modern science, but its association with Satan is not accidental. Anxiety about technological change is an old thing, and nowhere are such changes more pronounced than in language technologies (think of Plato and writing). In the early sixteenth century the Abbot of Sponheim fulminated in print against the evils of print, and Filippo di Strata observed that “the pen is a virgin, the printing press a whore.” “L’ordinateur est un instrument de déshumanisation de la recherche et de la désincarnation du vivant” (the computer is an instrument for the dehumanization of research and the disembodiment of the living), a dissertation supervisor is alleged to have written only a few years ago to a French doctoral student.

He could have cited Goethe’s Mephistopheles, who gave this advice to a freshman:

Wer will was Lebendigs erkennen und beschreiben,
Sucht erst den Geist heraus zu treiben,
Dann hat er die Teile in seiner Hand,
Fehlt leider! nur das geistige Band.
Whoever wants to know and describe something living
will first seek to expel its spirit,
then he will have the parts in his hand,
Alas! the spiritual link will be missing.

 

Such rhetoric invariably transforms Paul’s “the letter killeth, but the spirit giveth life” (2 Cor 3:6) into some version of “once the spirit gave life, but now the letter killeth.” How does this square with the fact that in this enterprise of replacing a living body with the dead inventory of its parts the original sin was committed by the medieval monks who invented what we still call a ‘concordance’? Faced with the task of understanding the complexity and infinite harmony of the Word of God but keenly aware of the limitations of their memory, they hit on the divide-and-conquer strategy of turning the Bible into an alphabetically sorted list of its words and their locations. A very mechanical procedure, but a great help in going from a difficult word here to other occurrences of it there, pondering the connections that had eluded fallible memory, and constructing from them the “concordance” of God’s words with each other and with charity, the axiom of Augustinian hermeneutics. The monks “killed” the text by dividing it into its letters, but this was part of a strategy to bring back rather than drive out the “spirit.” Not all the monks succeeded all the time. But abusus non tollit usum.
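
The monks’ mechanical procedure is easy to lay bare. Here is a minimal sketch of the concordance idea in its modern keyword-in-context (KWIC) form: every occurrence of a word, with a small window of context rather than a bare location.

# A minimal KWIC sketch: list every occurrence of a keyword with a window of context.
def kwic(tokens, keyword, window=4):
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            yield f"{left:>30} [{token}] {right}"

text = "In the beginning God created the heaven and the earth".split()
for line in kwic(text, "the"):
    print(line)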

It is the same with the large digital corpora that in principle support scalable reading (although the practice still lags far behind the possible). Strip a fancy text retrieval system to its basic operations, and you find a concordance on steroids, a complex machine for transforming sequential texts into inventories of their parts that can be retrieved and manipulated very fast. But when it comes to finding “das geistige Band” or, in modern parlance, “connecting the dots” modern readers are pretty much in the same situation as medieval monks, even (or especially) when the machine uses algorithms to construct statistically based patterns. No machine can tell you whether  a pattern “makes sense.”  Call this the “last mile problem” of human understanding.

Or remember the anecdote about Dr. Johnson on his deathbed.  He quoted Macbeth to his doctor:

Canst thou not minister to a mind diseased,
Pluck from the memory a rooted sorrow,
Raze out the written troubles of the brain
And with some sweet oblivious antidote
Cleanse the stuffed bosom of that perilous stuff
Which weighs upon the heart?

Dr. Johnson was much relieved when the doctor responded with the words of Macbeth’s doctor:

     Therein the patient
Must minister to himself.