Fifty years ago resistance to theory was a common thing in English departments. Today there is a lot of resistance to things digital if they go beyond using a word processor, dressing up text with pretty pictures, or doing “Media”, as if text by itself were not a medium, and a very challenging one at that. Literature and the digital have not been good friends, and being skeptical about the digital is a sign that you are a good humanist. Phrases like “machine learning” or “text mining” produce strong allergic reactions. If you are a Miltonist (as I once was), “rifled the bowels of their mother earth” may come to mind.

But with regard to “the digital and its discontents” the humanities and business may not be that far apart. This morning I came across a slideshow in eWeek with the title “Eight Reasons Machine Learning Isn’t Mainstream in the Enterprise.” It attracted my attention because for several years I have wrestled with the problems of making the large corpus of Early Modern TCP texts more computationally tractable. Reasons 5–8 turned out to be as good a description of the problems as I’ve come across anywhere. Here they are:

5 The Challenge of Data Preparation

Machine learning isn’t as easy as simply collecting data and running it through some algorithm. Once you collect the data, you have to aggregate it, determine whether there are any problems with it, and make sure it’s able to adapt to missing data, outlying data, garbage data and data that’s out of sequence.

6 The Lack of Publicly Labeled Datasets

The availability of publicly labeled data sets would make it much easier for companies to get started with machine learning. Unfortunately, these do not yet exist, and without them, most companies are looking at a “cold start”.

7 The Need for Domain Knowledge

At its best, machine learning represents the perfect marriage between an algorithm and a problem. This means domain knowledge is a prerequisite for effective machine learning, but there is no off-the-shelf way to obtain domain knowledge. It is built up in organizations over time and includes not just the inner workings of specific companies and industries, but the IT systems they use and the data that is generated by them.

8 Hiring Brilliant Data Scientists Is Not a Panacea

Most data scientists are mathematicians. Depending on their previous job experience, they may have zero domain knowledge that is relevant to their employer’s business. They need to be paired up with analysts and domain experts, which increases the cost of any machine learning project.

My hunch is that “machine learning” or any form of “distant reading” is not likely to tell us new things about first-level canonical authors. These texts have been crawled over by generations of the “slow but smart” computers otherwise known as readers. Occasionally computational approaches will provide comprehensive evidence about the “how” rather than the “what” of texts, as in J. F. Burrows’ classic Computation into Criticism: A Study of Jane Austen’s Novels. But computationally assisted approaches shine when it comes to second-, third-, or fourth-level stuff: texts like the 17th-century Thomason tracts, which Carlyle called “that hideous mass of rubbish” but “greatly preferable to all the sheepskins in the Tower… for informing the English what the English were in former times.” If they are lucky enough to survive, the ephemera and garbage of one age become treasures for later generations.

How to make sense of lots of stuff that you don’t really want to read in the first place is a problem shared equally by the Enterprise and by undergraduates in English or History. Reading closely is a virtue, but so is the ability to extract salient data from a lot of unreadable stuff. High-level literacy will increasingly require some skill in using digital tools for what I once called “not-reading”. Enterprises may look with favour on English or History undergraduates who have learned a little about “data science” and have made good use of it in an honors thesis on a topic for which the messy data came from the Thomason tracts or their equally unreadable Colonial American cousins.
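What “not-reading” looks like in practice can be sketched very simply: surface each document’s most salient words without reading any of them. A minimal illustration in Python, using tf-idf over toy one-line “tracts” (the function name, the scoring choice, and the sample data are my own assumptions, not anything the post prescribes):

```python
import math
from collections import Counter

def salient_terms(docs, top_n=3):
    """For each document, return the top_n words that are frequent in it
    but rare across the collection (tf-idf). A crude stand-in for
    'not-reading' a pile of texts."""
    # Tokenize naively on whitespace; real corpora need real tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for toks in tokenized for w in set(toks))
    results = []
    for toks in tokenized:
        tf = Counter(toks)
        # Words that occur in every document score log(n/n) = 0.
        scores = {w: c * math.log(n / df[w]) for w, c in tf.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        results.append(ranked[:top_n])
    return results

tracts = [
    "the king raised an army",
    "the king dissolved parliament",
    "the press printed pamphlets",
]
print(salient_terms(tracts))
```

Even this toy version shows the point: the ubiquitous “the” drops to the bottom of every list, while each tract’s distinctive vocabulary floats to the top, with no reading required.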

The “challenge of data preparation” and the “lack of publicly labeled data sets” point to social problems that need to be addressed in a collaborative and inter-institutional manner. Undergraduates with an interest in that kind of work often face a “cold start” scenario. The promising data are not sufficiently clean or agile to submit to algorithmic analysis, but the time cost of getting them “fit for purpose” is far beyond the days or (few) weeks within which undergraduate assignments or projects typically unfold. There is much to be learned from practices in the Life Sciences about constructing “cultural genomes” that would turn the “hideous rubbish” of the Thomason tracts or similar texts into clean and algorithmically amenable data sets. Biologists have gene banks, which are libraries of genomes, or text corpora in a four-letter alphabet. A non-trivial amount of many researchers’ time is taken up by ‘editorial’ activities: you segment and annotate a genome and deposit it in the library, or you take a genome from the library and add annotations to it. The result is publicly and consistently labeled data sets. Think of that as the Lower Criticism of the Life Sciences: the care and feeding of the basic data on which all forms of higher analysis depend.
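Part of making early modern text “algorithmically amenable” is normalizing its orthography. A toy sketch, assuming just three of the many rules a real curation effort would need (the rules and the function are my own illustration, not the TCP’s actual editorial conventions, which handle abbreviations, ligatures, and printer-specific habits, and record each change as an annotation):

```python
import re

def normalize(word):
    """Apply a few illustrative early modern spelling normalizations."""
    w = word.replace("vv", "w")                       # "vvhich" -> "which"
    w = re.sub(r"^v(?=[^aeiou])", "u", w)             # "vnto"   -> "unto"
    w = re.sub(r"(?<=[aeiou])u(?=[aeiou])", "v", w)   # "loue"   -> "love"
    return w

print([normalize(w) for w in "vvhich vnto loue euer".split()])
# -> ['which', 'unto', 'love', 'ever']
```

The labor is in accumulating and vetting hundreds of such rules against real pages, and in depositing the results, consistently labeled, where the next researcher can find them. That is the editorial, gene-bank-style work the paragraph above describes.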

“Data” is Latin for “givens.” What is “given” constrains what you can do with it. Often you must do a lot “to” stuff before you can do anything “with” it. “Doing with” often seems a nobler thing than “doing to.” Nietzsche impatiently exclaimed: “Was hilft mir der echte Text, wenn ich ihn nicht verstehe” (what use is the authentic text if I don’t understand it?). True, but “dirty” or unmanageable data hamper understanding. In this context, the extraordinary interaction of computational work and plain old sleuthing that led to the arrest of the Golden State Killer may be a model for future scholarly work in the humanities. You need to be savvy about some quantitative routines, and you need clean and agile data. There is much to be done on both fronts.