Scalable Reading

dedicated to DATA: digitally assisted text analysis

...the broad circumference
Hung on his shoulders like the Moon, whose Orb
Through Optic Glass the Tuscan Artist views
At Ev’ning from the top of Fesole,
Or in Valdarno, to descry new Lands,
Rivers or Mountains in her spotty Globe.
(Paradise Lost, 1. 286-91)

crowdsourcing

Complete the CIC TCP initiative with a BTAA initiative for creating matching images that are free and high-quality

The libraries of the Big Ten Academic Alliance (BTAA) are looking forward to an “interdependent networked future” and to managing their separate collections “as if they were a single, shared one”. Here are some ideas about how this might work in Early Modern Studies, a field to whose documentary infrastructure those libraries have made a...

Collaborative Curation of TCP texts

This is a report about the current state of the collaborative curation of TCP texts. While I have written about this topic many times on this blog, this report is written for newcomers who have an interest in what was printed before 1800 but may or may not know anything about TCP texts. TCP stands...

Fixing the Blackdot Words in the TCP corpus: a “mixed initiative” in Engineering English

This is a report on a “mixed initiative” (a term of art in computer science) that combines old-fashioned philological elbow grease with new-fangled long short-term memory neural network processing (LSTM). The goal is to fix as many as possible of the approximately five million incompletely transcribed words in the 1.7 billion word TCP corpus of English printed...

New release of Shakespeare His Contemporaries

I have put a new version of Shakespeare His Contemporaries on Google Drive, where you may view or download the plays. In this version I have grouped the plays by decades and put them in directories with names like 155, 156 … 165. The plays have been encoded in TEI Simple. The texts are in...

Hannah, Kate, and Lydia at work

While reviewing the work of Hannah, Kate, and Lydia, I enjoyed the precision and concision of their annotations. A sample of them appears below. While a full documentation would require snippets of the image and the transcription as well as the annotation, the annotations themselves clearly show their minds at work, combining clear description with...

Thou com’st in such a questionable shape: Data Janitoring the SHC corpus from the perspectives of Hannah, Kate, and Lydia

Below are the reflections of Hannah Bredar, Kate Needham, and Lydia Zoells about their adventures in the mundane world of Lower Criticism, about which I wrote in an earlier blog and of which the digital surrogates of our cultural heritage will need a lot in the decades to come. Racine observes in his preface...

Shakespeare His Contemporaries (SHC): The next release

This is a progress report on the basic clean-up of the 504 plays in my current Shakespeare His Contemporaries corpus (SHC). I hope to release an updated corpus by the end of November. It will replace the current corpus at https://github.com/martinmueller39/shc. The SHC texts are partially curated versions of the TCP texts, which have “known...

Engineering English: Machine-assisted curation of TCP texts

There are somewhere in the neighbourhood of five million incompletely transcribed words in the roughly two billion words of English books before 1700 transcribed by the Text Creation Partnership. Depending on how you look at it, that is either a lot or not very much at all. Less than half a percent of words are...

Shakespeare His Contemporaries (SHC) Released

In my earlier post “From Shakespeare His Contemporaries to the Book of English” I promised to release all SHC plays “later this spring.” I have now done so, and you may download all 504 of them from https://github.com/martinmueller39/shc. Most of the texts come from Phase I of the TCP project and have been in the...

“Fluent in Marlowe”: Emily’s and Sasha’s successful adventures in data curation

The following is a reposting of excerpts from a 2009 report by two undergraduate students of mine, Emily Anderson and Sasha Puchalla. As part of a course assignment, they checked the TCP EEBO transcription of Marlowe’s Tamburlaine. They worked from a spreadsheet with a ‘verticalized’ representation of the text in which every word was a...

From transcription to scholarship

Today’s New York Times carried a touching obituary of Claude Anne Lopez, author of Mon Cher Papa: Franklin and the Ladies of Paris and other biographical studies of Franklin. A Jewish refugee from Nazi-occupied Belgium, she arrived in America in 1941. She married an historian who moved to Yale, where the only employment available to...

Back to the Future or Wanted: A Decade of High-tech Lower Criticism

The title of this blog entry is the title of a keynote address I gave at the Chicago Digital Humanities and Computer Science Colloquium, held November 18-19, 2012 at the University of Chicago. There is a pdf of the talk at http://panini.northwestern.edu/mmueller/backtothefuture.pdf. The talk was about the challenges and opportunities posed by the TCP...