Scalable Reading

dedicated to DATA: digitally assisted text analysis

...the broad circumference
Hung on his shoulders like the Moon, whose Orb
Through Optic Glass the Tuscan Artist views
At Ev’ning from the top of Fesole,
Or in Valdarno, to descry new Lands,
Rivers or Mountains in her spotty Globe.
(Paradise Lost, 1. 286-91)

data curation

New release of Shakespeare His Contemporaries

I have put a new version of Shakespeare His Contemporaries on Google Drive, where you may view or download the plays. In this version I have grouped the plays by decades and put them in directories with names like 155, 156 …165. The plays have been encoded in TEI Simple. The texts are in...

Hannah, Kate, and Lydia at work

While reviewing the work of Hannah, Kate, and Lydia, I enjoyed the precision and concision of their annotations. A sample of them appears below. While a full documentation would require snippets of the image and the transcription as well as the annotation, the annotations themselves clearly show their minds at work, combining clear description with...

Thou com’st in such a questionable shape: Data Janitoring the SHC corpus from the perspectives of Hannah, Kate, and Lydia

Below are the reflections of Hannah Bredar, Kate Needham, and Lydia Zoells about their adventures in the mundane world of Lower Criticism, about which I wrote in an earlier blog and of which the digital surrogates of our cultural heritage will need a lot in the decades to come. Racine observes in his preface...

Shakespeare His Contemporaries (SHC): The next release

This is a progress report on the basic clean-up of the 504 plays in my current Shakespeare His Contemporaries corpus (SHC). I hope to release an updated corpus by the end of November. It will replace the current corpus at https://github.com/martinmueller39/shc. The SHC texts are partially curated versions of the TCP texts, which have “known...

Engineering English: Machine-assisted curation of TCP texts

There are somewhere in the neighbourhood of five million incompletely transcribed words in the roughly two billion words of English books before 1700 transcribed by the Text Creation Partnership. Depending on how you look at it, that is either a lot or not very much at all. Less than half a percent of words are...

Hannah, Kate, Lydia, and Shakespeare His Contemporaries (SHC)

In an earlier blog entry I reported on the ways in which undergraduates at Northwestern and Washington University in St. Louis have contributed to the collaborative curation of TCP transcriptions of Early Modern plays. Their work was released on GitHub as the SHC corpus, short for Shakespeare His Contemporaries. Hannah Bredar just graduated from Northwestern...

Shakespeare His Contemporaries: a half-time report

Hannah Bredar, Madeline Burg, Melina Yeh, and Nayoon Ahn have been at work for four weeks in their clean-up operation of the Early Modern plays in the TCP archive. Nicole Sheriko helped them in the first week and has since then focused on preparing a Young Scholar Edition of Fair Em. The clean-up operation proceeds...

“Fluent in Marlowe”: Emily’s and Sasha’s successful adventures in data curation

The following is a reposting of excerpts from a 2009 report by two undergraduate students of mine, Emily Anderson and Sasha Puchalla. As part of a course assignment, they checked the TCP EEBO transcription of Marlowe’s Tamburlaine. They worked from a spreadsheet with a ‘verticalized’ representation of the text in which every word was a...

From transcription to scholarship

Today’s New York Times carried a touching obituary of Claude Anne Lopez, author of Mon Cher Papa: Franklin and the Ladies of Paris and other biographical studies of Franklin. A Jewish refugee from Nazi-occupied Belgium, she arrived in America in 1941. She married an historian who moved to Yale, where the only employment available to...

Back to the Future or Wanted: A Decade of High-tech Lower Criticism

The title of this blog entry is the title of a keynote address I gave at the Chicago Digital Humanities and Computer Science Colloquium, held November 18-19, 2012 at the University of Chicago. There is a pdf of the talk at http://panini.northwestern.edu/mmueller/backtothefuture.pdf. The talk was about the challenges and opportunities posed by the TCP...

EEBO-TCP 2012: The future of the TCP as a public domain and collaboratively curated corpus of Early Modern English

“Revolutionizing Early Modern Studies?” was the question that governed the recent EEBO-TCP 2012 conference sponsored by the Bodleian Library. I gave a talk there about “Towards a Book of English: A linguistically annotated corpus of the EEBO-TCP texts.” In another blog I will write about the ways in which this project will keep Phil Burns and...

Are the Text Creation Partnership texts good enough for research purposes?

The following is a republished blog post originally published on October 1, 2009 on my now defunct Literary Informatics blog. Some of the points made then are not quite true anymore (especially my lament about the lack of concern for quality in the Hathi Trust project), but many of them remain true enough. Are the...