Scalable Reading

dedicated to DATA: digitally assisted text analysis

...the broad circumference
Hung on his shoulders like the Moon, whose Orb
Through Optic Glass the Tuscan Artist views
At Ev’ning from the top of Fesole,
Or in Valdarno, to descry new Lands,
Rivers or Mountains in her spotty Globe.
(Paradise Lost, 1. 286-91)

Latest entries

Looking up stuff in an Early Modern corpus

The following is a discussion of a set of “search and sort” operations that could be useful in exploring the EEBO-TCP corpus of English books before 1700. It also includes some paragraphs about making texts more computationally tractable so that search operations can more quickly answer more complex queries. A good search environment depends as...

Collaborative Curation of TCP texts

This is a report about the current state of the collaborative curation of TCP texts. While I have written about this topic many times on this blog, this report is written for newcomers who have an interest in what was printed before 1800 but may or may not know anything about TCP texts. TCP stands...

About Metadata in the Early Print corpus

The EarlyPrint site contains linguistically annotated and partly corrected TCP texts in an environment that supports collaborative curation. For many practical purposes the approximately 60,000 TCP texts from 1473 to 1700 add up to a deduplicated library of that period. EarlyPrint looks forward to a time when for (almost) every book in that corpus there...

Hobbes and Maggie Haberman about Twitter

Some years ago I read Joel Spolsky’s very funny description of Twitter in which he said: Although I appreciate that many people find Twitter to be valuable, I find it a truly awful way to exchange thoughts and ideas. It creates a mentally stunted world in which the most complicated thought you can think is one sentence long....

TCP2ESTC

Introduction and Summary: This is a report about an experiment with ~4,000 texts from the Text Creation Partnership (TCP). It is more in the spirit of concept cars than production models. There may also be an aspect of changing from 5.25 to 3.5 floppy disks. The TCP texts are a critical component of...

Fixing the Blackdot Words in the TCP corpus: a “mixed initiative” in Engineering English

This is a report on a “mixed initiative”–a term of art in computer science–that combines old-fashioned philological elbow grease with new-fangled long short-term memory neural network processing (LSTM). The goal is to fix as many as possible of the approximately five million incompletely transcribed words in the 1.7 billion word TCP corpus of English printed...

Machine Learning in the Enterprise and in English Departments

Fifty years ago resistance to theory was a common thing in English Departments. Today there is a lot of resistance to things digital if they go beyond using a word processor, dressing up text with pretty pictures, or doing “Media”, as if text by itself were not a medium, and a very challenging one at...

Engineering English: Machine-corrected TCP texts

Engineering and English are alphabetical neighbours in a university list of disciplines, but the members of those disciplines tend to think of the other as on the other end of the disciplinary spectrum. But work in English departments has for centuries depended on the engineering work that created and refined printing. Future work will depend...

What is a digital combo?

How should an old book live in the digital environment of the 21st century? My answer is “as a digital combo that brings together three data streams, each a surrogate that represents and contextualizes aspects of the original object”. Call them the bibliographical, material, and textual streams. This scrawny diagram illustrates their interaction in the...

Whither TEI? The Next Thirty Years

In the next fifty years the entirety of our inherited archive of cultural works will have to be re-edited within a network of digital storage, access, and dissemination (Jerome McGann, 2001) You have to put the corn where the hogs can get at it (Bill Clinton) Only the paranoid survive (Andrew Grove)   Introduction At...

Freebo, Free Lunch, and Crowdfunding New EEBO Images

Here is a prefixed postscript (April 18, 2016) to my December 2015 blog post about creating new EEBO images: in a recent conversation with Thomas Stäcker, the deputy director of the Herzog-August-Bibliothek in Wolfenbüttel (HAB), I learned that their average cost for creating a digital image good enough for most scholarly purposes is about a dollar...

New release of Shakespeare His Contemporaries

I have put a new version of Shakespeare His Contemporaries on Google Drive, where you may view or download the plays. In this version I have grouped the plays by decades and put them in directories with names like 155, 156 …165. The plays have been encoded in TEI Simple. The texts are in...