Scalable Reading

dedicated to DATA: digitally assisted text analysis

...the broad circumference
Hung on his shoulders like the Moon, whose Orb
Through Optic Glass the Tuscan Artist views
At Ev’ning from the top of Fesole,
Or in Valdarno, to descry new Lands,
Rivers or Mountains in her spotty Globe.
(Paradise Lost, 1. 286-91)



Introduction and Summary: This is a report about an experiment with about 4,000 texts from the Text Creation Partnership (TCP). It is more in the spirit of concept cars than production models. There may also be an aspect of changing from 5.25-inch to 3.5-inch floppy disks. The TCP texts are a critical component of...

Engineering English: Machine-corrected TCP texts

Engineering and English are alphabetical neighbours in a university list of disciplines, but the members of each discipline tend to think of the other as sitting at the opposite end of the disciplinary spectrum. Yet work in English departments has for centuries depended on the engineering work that created and refined printing. Future work will depend...

What is a digital combo?

How should an old book live in the digital environment of the 21st century? My answer is “as a digital combo that brings together three data streams, each a surrogate that represents and contextualizes aspects of the original object. Call them the bibliographical, material, and textual streams. This scrawny diagram illustrates their interaction in the...

Hannah, Kate, and Lydia at work

While reviewing the work of Hannah, Kate, and Lydia, I enjoyed the precision and concision of their annotations. A sample of them appears below. While a full documentation would require snippets of the image and the transcription as well as the annotation, the annotations themselves clearly show their minds at work, combining clear description with...

Engineering English: Machine-assisted curation of TCP texts

There are somewhere in the neighbourhood of five million incompletely transcribed words in the roughly two billion words of English books before 1700 transcribed by the Text Creation Partnership. Depending on how you look at it, that is either a lot or not very much at all. Less than half a percent of words are...
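The proportion implied by these round numbers can be checked with a line of arithmetic. The sketch below takes the two approximate figures from the paragraph above at face value; the variable names are mine, and the result is only as precise as those estimates.

```python
# Approximate figures quoted in the excerpt above (both are rough estimates).
incomplete_words = 5_000_000        # incompletely transcribed words
total_words = 2_000_000_000         # words in the TCP transcriptions before 1700

# Share of incompletely transcribed words, as a percentage of all words.
share_percent = 100 * incomplete_words / total_words

# Average spacing: one incompletely transcribed word per this many words.
one_in = total_words // incomplete_words

print(f"{share_percent}% of words, or about one word in {one_in}")
```

On these figures the share comes to a quarter of a percent, which is consistent with the "less than half a percent" in the text.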

Best Buy and Curation en passant

I went to Best Buy to reduce the clutter of remote controls in my living room and simplify my life. Logitech’s Harmony may be the answer. Cheap it isn’t, but then ‘cheap’ and ‘simple’ are hardly synonyms–witness the very simple and very expensive white KPM china of the Königliche Porzellan-Manufaktur Berlin. I paused at the...

From Shakespeare His Contemporaries to the Book of English

Introduction and Summary: This is a report about “Shakespeare His Contemporaries” or SHC, my project for creating an interoperable digital corpus of plays that in addition to Shakespeare’s include most of the plays written within a generation before and after his active career as a playwright. Its keywords are “query potential”, “digital surrogate”, “algorithmic amenability”, and...

Repeated n-grams in Shakespeare His Contemporaries (SHC)

This is a blog post about the distribution of  a special kind of “dislegomena,” tetragrams and longer n-grams whose “collection frequency” is 2 and whose “document frequency” is also 2. My purpose is to figure out how many swallows make a summer.  If you are interested in the intertextual relationship between one play and another,...
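For readers who want the definition in operational terms, here is a minimal sketch of how n-grams with a collection frequency of 2 and a document frequency of 2 could be picked out, that is, sequences that occur exactly twice in the whole corpus, once in each of two different texts. It is not the procedure used for SHC: the function names, the whitespace tokenization, and the three toy "plays" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shared_dislegomena(docs, n=4):
    """Find n-grams occurring exactly twice overall, in exactly two documents.

    docs maps a document id to its list of tokens (hypothetical input format).
    Returns a dict from n-gram to the sorted ids of the two documents.
    """
    collection_freq = Counter()      # total occurrences across the corpus
    doc_ids = defaultdict(set)       # documents in which each n-gram occurs
    for doc_id, tokens in docs.items():
        for gram in ngrams(tokens, n):
            collection_freq[gram] += 1
            doc_ids[gram].add(doc_id)
    return {
        gram: sorted(doc_ids[gram])
        for gram, freq in collection_freq.items()
        if freq == 2 and len(doc_ids[gram]) == 2
    }


# Invented toy texts, not drawn from the SHC corpus.
toy_docs = {
    "play_a": "the quality of mercy is not strained".split(),
    "play_b": "they say the quality of mercy is rare".split(),
    "play_c": "to be or not to be or not".split(),
}

result = shared_dislegomena(toy_docs, n=4)
for gram, ids in sorted(result.items()):
    print(" ".join(gram), "->", ids)
```

The tetragram "to be or not" occurs twice in play_c alone, so its collection frequency is 2 but its document frequency is 1, and it is excluded. Calling the function with a larger `n` covers the "longer n-grams" case.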

Getting undergraduates and amateurs into the business of re-editing our cultural heritage for a digital world

The following reprints an earlier post of an entry that I first published on January 7, 2011 on my now defunct “Literary Informatics” site.   The Chicago section of today’s New York Times has an article with the title “Volunteers at Planetarium Excel where machines lag.” The gist of the article is in these paragraphs:...

Visit my new site

I am abandoning this site in favour of a new site

The mdash

Have you ever thought about the mdash, the long dash, \u2014 in Unicode parlance or paraphrased as &mdash; in the parlance of character entities? The odds are that you have not. I certainly have not thought much about it, but it tripped me up this morning in the EEBO-MorphAdorner project that Phil Burns and I...
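The two notations name the same character, which a few lines of Python's standard library can confirm. This is only an illustration of the equivalence, with variable names of my own; it does not reproduce whatever went wrong in the EEBO-MorphAdorner work.

```python
import html
import unicodedata

em_dash = "\u2014"  # the character written as a Unicode escape

# Its official Unicode name.
name = unicodedata.name(em_dash)

# The named and numeric character entities all decode to the same character.
from_named = html.unescape("&mdash;")
from_decimal = html.unescape("&#8212;")
from_hex = html.unescape("&#x2014;")

# In UTF-8 the single character occupies three bytes.
utf8_bytes = em_dash.encode("utf-8")

print(name, from_named == em_dash, len(utf8_bytes))
```

One character can therefore show up in a text as an escape, a named entity, a numeric entity, or three raw bytes, and software that expects one form may stumble over another.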

Google Maps and crowdsourcing

David Pogue writes on his New York Times blog about what makes Google Maps so good. It’s a story of incremental and iterative improvement over years, combining sophisticated algorithms with a lot of manual work. There are definite lessons here for the incremental and iterative improvement over time of the TCP texts and similar corpora.