The following is a discussion of a set of “search and sort” operations that could be useful in exploring the EEBO-TCP corpus of English books before 1700. It also includes some paragraphs about making texts more computationally tractable so that search operations can more quickly answer more complex queries. A good search environment depends as much on the tractability of the data as on the power of the search engine. I have written this for scholarly readers to whom you need not explain why the 60 volumes of the Weimar edition of Luther’s works have 13 German and Latin index volumes of persons, citations, and subjects, but who may not be as well informed about the search tools or data structures that are in principle available for exploring two billion words in 60,000 texts from more than two centuries of Early Modern English in the British Isles and North America.

Some 12,000 Early Modern English and American texts are currently available at EarlyPrint: A site for the collaborative curation and exploration of Early Modern texts. In early 2019 we will add the remaining 13,000 EEBO-TCP texts that are already in the public domain. We will add approximately 30,000 EEBO-TCP texts as soon as they move into the public domain in 2020.

The current site has quite primitive exploration tools. Most of the work has gone into creating a stable environment for collaborative curation. One purpose of this blog post is to define the scope of a more powerful and complex search environment.

Save the time of the reader

Ranganathan’s fourth law of library science is a good point of departure. The point of any index is to reduce the look-up time. This is as true of the thirteen index volumes of Luther’s works as of the “diligent and necessary Index, or Table of the most notable thynges, matters, and woordes contained in these workes of Master William Tyndall”, which you find at the end of the 1573 edition of Tyndall’s works. Many goals of digital finding aids are captured in the quaint eloquence of that heading.

It is equally useful to remember that searching in a scholarly environment is quite different from searching on the Web. Google and similar search engines use very powerful algorithms both to simplify the entry of search terms and to guess what you are likely to be interested in. They also know where you are coming from. If you live in Evanston and just put down “Wine Goddess”, the first page of the return list gives you the address and telephone number of a very nice wine bar on Main Street.

This very powerful method of saving the reader’s time is of great help in countless daily tasks, but it is not very useful for scholarly queries, where you typically end up with some list that you work your way through. You don’t want the search engine to make decisions about your interests, but you do want to be able to sort results by a variety of criteria.

Working through search results

The time cost of a search is the sum of the time it takes to:

  1.  define the query
  2. execute the search
  3. work through the search results

To begin with the last, search results divide into short/simple and long/complex. A simple and short list can be “eyeballed” by the reader and requires no formal processing. But if a list goes much beyond Miller’s famous number seven and can’t be held at once in the reader’s memory, it needs some post-processing. If the search result is a table with several columns, even a table with just seven rows may require post-processing because very few users will be able to work through the different sort options in their heads.

Web sites that try to sell you something will often have quite sophisticated sort options that let you define and sort a search by brand, price, or other properties of the item you are looking for. If you know about “web scraping” you can extract the underlying data and turn them into tables that can be loaded into a spreadsheet. Most scholars, myself included, don’t know about web scraping and don’t want to learn about it either. They do, however, know about spreadsheets. It is therefore very useful (and still surprisingly rare) for search environments to have an “export” function that will create a file that the user can load into a spreadsheet and manipulate in various ways. Adding an export function to a search page is a relatively simple procedure. Replicating the functionality of Excel or Google spreadsheets on a web site for early modern texts is well beyond the financial resources of a digital project in the humanities. Moreover, the designers of a web site for something on sale can make quite firm assumptions about what they would like their users to see and do. It is much harder to anticipate what scholars want to do with their search results. It is simpler and more appropriate to give them results in a standard format and let them determine what they want to do with them in software programs that they know or should know about. Spreadsheets are remarkably flexible and user-friendly tools for many philological tasks. Geeky scholars will use Python and other computer languages to do clever stuff with search results. Those are very powerful tools, but they are not everybody’s cup of tea, and while their learning curve has dropped over the years, it still takes considerable time to acquire and maintain fluency in their use.
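Such an export need not be elaborate. Here is a minimal sketch in Python, with a couple of invented placeholder hits standing in for whatever a real search engine would return; the tab-separated file it writes opens directly in Excel or Google Sheets:

import csv

# Invented placeholder hits standing in for real search results.
hits = [
    {"author": "Author, A.", "date": "1600", "title": "A Title",
     "left": "left context", "hit": "loue", "right": "right context"},
    {"author": "Author, B.", "date": "1640", "title": "Another Title",
     "left": "left context", "hit": "love", "right": "right context"},
]

# Write a tab-separated file that spreadsheets open without further fuss.
with open("search_results.tsv", "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(
        out, fieldnames=["author", "date", "title", "left", "hit", "right"], delimiter="\t")
    writer.writeheader()
    writer.writerows(hits)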

Defining the Search

A search in a text corpus is typically defined as a search for “data” (words in a text) in a group of texts that are selected by “metadata” or data about those texts. The distinction between “data” and “metadata” becomes blurrier the more closely you look at it. Ultimately it may be “metadata all the way down”. But there is some utility in thinking about textual data as literal spellings, while defining all other “features” of a text as “metadata”. Texts can be classified by author, date, publisher, genre, etc. These “classifiers”, “features”, or “facets” support what is often called “faceted” searching or browsing. A particular combination of “facets” or metadata defines the subsets of a corpus in which you look for “real data”.

Looking for ‘love’

In the simplest of all searches you look for a string of literal characters, ‘love’. Most users are familiar with the most primitive of ‘wild card’ searches, where you look for any string that begins in a certain way. In the search term “lov*”, an example of what is often called “right truncation”, the asterisk stands for “any number of alphanumerical characters including none”. You could argue that in this example the initial “lov” is a piece of metadata about the strings you are looking for.

Looking for ‘jealous’: “regular expressions” as metadata about spellings

“Regular expressions” are a computer language that lets you define patterns and then look for sequences of letters that match them. They are a very powerful tool and should appeal to anybody who likes crossword puzzles. Regular expressions range from the very simple to the impossibly arcane. Their underlying logic is straightforward, but its symbolic representation via a keyboard makes for difficult reading. Consider the many spellings of ‘jealous’ and ‘jealousy’. They will begin with ‘g’, ‘i’, or ‘j’, followed by ‘e’, ‘ea’, or ‘a’, followed by at least one ‘l’, then ‘o’ or ‘ou’, at least one ‘s’, and either nothing, ‘y’, or ‘ie’. In crafting a regular expression for this pattern the symbols on the keyboard either stand for themselves or are metacharacters that state a condition or instruction. Here we are definitely in the world of metadata, and the readability problems arise from the fact that data and metadata are jumbled together in the same field of vision. Some non-alphabetical characters usually serve as metacharacters. But sometimes they do not, in which case they have to be “escaped”, with ‘\’ being the escape character that marks the next character as a literal. Thus in ‘\\’ the first backslash marks the second as standing for itself. You see why this makes for awkward reading. You also see why “path dependence” on the earlier technology of the typewriter constrains choices.

Sometimes numbers do not stand for themselves but serve as “occurrence indicators”, in which case they are enclosed in braces ({3}). Textual critics are familiar with such notation “kludges”; ordinary readers have a low pain threshold for them.

The utility of a pattern is measured by its ‘precision’ and ‘recall’.  Precision focuses on avoiding false positives at the risk of false negatives. Recall focuses on capturing all true hits but runs the risk of false positives.
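In the standard formulation both measures are simple ratios:

precision = (true hits returned) / (all items returned)
recall = (true hits returned) / (all true hits in the corpus)

A pattern tuned for precision keeps the first ratio close to 1 at the risk of missing genuine hits; a pattern tuned for recall keeps the second close to 1 at the risk of letting junk through.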

Here is a regular expression for ‘jealous’, as knotty a thing as the meaning of the spellings it seeks to capture:

^ [gij] [ae]+ l+ [ou]+ s$

For the purpose of demonstration I have separated its components by a blank space. The metacharacters appear in red. Such colour coding is now a common feature of text editors and is of great help in keeping track of the ‘real’ and the ‘meta’. ‘^’ and ‘$’  are metacharacters that specify an initial or terminal condition. Square brackets contain a list of alternatives. Thus ^[gij] means that the hit must begin with one of the three letters. The symbol ‘+’  is an occurrence indicator meaning “one or more”.  The asterisk is an occurrence indicator meaning  “any number of occurrences including none”. Thus the entire regular expression says “look for a string that

  • begins with ‘g’, ‘i’ or ‘j’
  • is followed  by at least one occurrence of ‘a’ or ‘e’ separately or in combination
  • is followed by at least one ‘l’
  • is followed by at least one occurrence of ‘o’ or ‘u’ separately or in combination
  • ends with ‘s’

If you run the regular expression ‘^[gij][ae]+l+[ou]+s$’ across the word list of a corpus of some 50,000 texts, it will find the following 42 spellings:

gaalus, gaellos, gaellus, gaelus, gallos, gallous, gallouss, gallus, galluus, galos, galous, galus, galuus, gealous, gellous, gellus, gelos, gelous, gelus, ialoous, ialous, ialus, ieallous, iealoous, iealous, iellous, iellus, ieloous, ielous, ielus, jaelous, jalous, jalus, jeallous, jealoous, jealous, jealouss, jealus, jeelus, jellous, jelos, jelous, jelus

Note that it returns some false positives beginning with ‘gal’, but these are hard to avoid if you do not want to miss genuine spellings like ‘ialous’. In this case the time cost of throwing away the false positives is probably less than the time cost of crafting a more sophisticated regular expression.
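For readers who want to try this at home, here is a minimal Python sketch; it assumes a hypothetical file wordlist.txt with one spelling per line, extracted from the corpus by whatever means are at hand:

import re

# One of g/i/j, one or more a/e, one or more l, one or more o/u, a final s.
jealous = re.compile(r"^[gij][ae]+l+[ou]+s$")

with open("wordlist.txt", encoding="utf-8") as f:   # hypothetical one-word-per-line file
    spellings = (line.strip().lower() for line in f)
    hits = sorted({w for w in spellings if jealous.match(w)})

print(len(hits), "matching spellings")
print(", ".join(hits))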

On my not particularly special iMac this regular expression, run with BBEdit, took 90 seconds to work through 60,000 TCP texts in their Michigan XML version and retrieve 20,000 hits in 7,600 files.

BlackLab is a powerful corpus query engine designed by the Institute for Dutch Lexicology. An experimental implementation on the EarlyPrint site almost instantly finds 4,382 matches. It takes three minutes to generate tab-delimited concordance output, and it is no more than a minute’s work to cut and paste that file into a Google spreadsheet that you can sort by author, date, title, or the context that precedes or follows the hit.

A few months ago I corresponded with a young Spenser scholar who wanted some advice about how to work with TCP texts. I wrote back and said: “spend a day or even three days to learn about regular expressions. A month later you will wonder how you ever got on without them.” A few days later he replied: “You weren’t lying about regular expressions–I’ve found great things with them already.”

Abstract search criteria

Looking for words by their surface form has been for centuries the most common look-up. In a digital corpus you can also look for unknown strings that meet certain criteria, such as frequency, co-occurrence, or similarity. Frequency is an important property of words. Aristotle in the Poetics talks about common and rare words. There are printed frequency dictionaries, but the digital medium supports much more flexible and powerful inquiries that start with counting. Some of these inquiries have been overconfident or downright stupid, but abusus non tollit usum.

In principle, and given sufficiently powerful machines, all frequency-based inquiries could be computed and answered “on the fly” by performing all necessary computation from scratch in response to a particular user query. For instance, a query about all words that occur only once in Hamlet can be computed by a machine so quickly that the cost of computing it each time may be less than the cost of keeping a record.
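A minimal sketch of such an on-the-fly computation, assuming a plain-text Hamlet in a hypothetical file hamlet.txt and the crudest possible tokenization:

import re
from collections import Counter

with open("hamlet.txt", encoding="utf-8") as f:   # hypothetical plain-text file
    # Crude tokenization: lowercased alphabetic strings only.
    tokens = re.findall(r"[a-z]+", f.read().lower())

counts = Counter(tokens)
hapaxes = sorted(word for word, n in counts.items() if n == 1)
print(len(hapaxes), "words occur only once")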

Other queries are computationally more expensive because they involve comparisons of one set of data with another.  My mother liked to say that “three hairs on your head is relatively few. Three hairs in the soup–not so”. Most abstract and complex queries stay within the scope of that observation. Emily Dickinson’s “we see comparatively” is a more dignified expression of it.

Raw counts of textual phenomena or frequencies per 10,000 or a million words are rarely interesting in themselves. Dunning’s log likelihood ratio is a well-established statistical procedure that lets you determine what phenomena in a given “analysis set” of texts are relatively more or less common than in a “reference set” of texts. Instead of looking for the occurrences of known words, you look for unknown words in one set of texts that are overused or underused relative to some other set of texts. In the WordHoard program raw frequency data for such comparisons are precomputed so that a query about differences between Shakespeare’s comedies and tragedies almost instantly tells you that the “overuse” of ‘she’ in comedies is the most striking discriminator. The corresponding WordHoard word cloud uses letter size and black (+) and grey (-) to transform a complex table with many decimals into a visualization that can be taken in at one glance.
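The arithmetic behind Dunning’s measure can be sketched in a few lines of Python; the counts in the example call are invented and stand only for the shape of the inputs:

from math import log

def log_likelihood(a, b, c, d):
    """Dunning's log likelihood (G2) for a word that occurs a times in an
    analysis set of c words and b times in a reference set of d words."""
    e1 = c * (a + b) / (c + d)   # expected count in the analysis set
    e2 = d * (a + b) / (c + d)   # expected count in the reference set
    g2 = 0.0
    if a:
        g2 += a * log(a / e1)
    if b:
        g2 += b * log(b / e2)
    return 2 * g2

# Invented counts for illustration: 'she' in comedies (analysis) vs. tragedies (reference).
print(log_likelihood(a=1200, b=700, c=400_000, d=500_000))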

It is no surprise that the result is not surprising. The human animal is an inveterate and shrewd but tacit statistician constantly figuring the odds. Quantitative analysis will very rarely produce new insights, especially with highly canonical texts that have been crawled over by generations of scholars. Laplace observed that “probability theory is nothing more than common sense reduced to calculation”. The formal calculation of the odds often gives us a more precise and nuanced view of what we already know and may draw our attention to types of evidence that we tend to neglect–in this case the “little” words. Burrows’ Computation into criticism: a study of Jane Austen is the classic of that approach.

Other forms of verbal profiling

The most famous verbal profiler is the Google Ngram Viewer, which made a very prominent debut in 2010. Anupam Basu at Washington University in St. Louis used the software to build an EEBO N-gram browser based on the EEBO-TCP corpus linguistically annotated with the first release of MorphAdorner, a Natural Language Processing tool suite created by Phil Burns at Northwestern University.

In the EEBO N-gram browser the writings of each decade may be seen as a chapter in a Book of English from 1470 to 1700. Each chapter is an “analysis set”, while the book as a whole is the reference set. The frequencies (in words per million) are normalized by the total word counts for each year or decade.

This is very much a bird’s eye view, and it can be complemented by closer looks. It would not be hard to construct reference sets for each decade by defining each set as a list of words from the beginning of the previous to the end of the next decade. Such a list would include the document and collection frequencies for each lexical item over a thirty-year span. Document frequency refers to the number of texts in which the item appears. Collection frequency is the sum of all occurrences.

Such an approach would leave you with 23 reference sets of words with their frequencies, each with entries in the low millions, which no longer counts as Big Data. Leaving aside margins of error and significant discrepancies between the date of publication and the date of creation, this sequence of reference sets would let you put each individual text roughly in the middle of a 30-year span and trace its ‘figure’ against the ‘carpet’ of its generation.
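A sketch of what building one such reference set might look like, assuming (hypothetically) that every text has already been reduced to its ID, its decade, and a list of its lemmata:

from collections import Counter, defaultdict

# Hypothetical input: text_id -> (decade, lemmata). A real version would read this
# from the annotated corpus.
texts = {
    "A00001": (1590, ["love", "jealous", "love"]),
    "A00002": (1600, ["liberty", "love"]),
}

def reference_set(texts, centre_decade):
    """Document and collection frequency for every lemma over a thirty-year span
    running from the previous decade through the next one."""
    span = range(centre_decade - 10, centre_decade + 20, 10)
    collection = Counter()          # total occurrences of each lemma
    documents = defaultdict(set)    # texts in which each lemma occurs
    for text_id, (decade, lemmata) in texts.items():
        if decade in span:
            collection.update(lemmata)
            for lemma in set(lemmata):
                documents[lemma].add(text_id)
    return {lemma: (len(documents[lemma]), collection[lemma]) for lemma in collection}

# lemma -> (document frequency, collection frequency) for the span centred on the 1590s
print(reference_set(texts, 1590))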

You can complement a diachronic with a genre-based approach. Textsorte is a term of art in German discussions of text classification. There is a German project about which I only remember that it had 23 Textsorten and one of them was the Leichenpredigt. Like the State of the Union address, the funeral sermon is  a fixed genre in which many variables are held constant. That makes it an excellent guinea pig for diachronic inquiries. Right now it would not be easy to find all the funeral sermons in the EEBO corpus. In a thoroughly curated corpus it should be a matter of seconds. A “discovery engine” with that goal in mind is currently under construction at Washington University in St. Louis.

A reference corpus of a generation is not an assembly of all the texts but a data summary extracted from them. Its size is less than 0.5% of the aggregate size of the texts, and the percentage actually declines with the size of the corpus. Creating, storing, and processing those data creates few technical or financial problems. The upper constraint on their number is a function of the human attention it takes to keep order among them once they exist.

With the four data points of text_id, lexical item, document frequency, and collection frequency you can create a variety of ‘data frames’ that can become the inputs for different forms of comparing texts with one another. Statistical routines typically state their results to the fourth or sixth decimal. This is quite misleading. Different NLP routines yield somewhat different results, just like political polls. They have margins of error, and biases in what or how they measure. You need to be aware of them. If you follow Nate Silver’s FiveThirtyEight you will quickly find out that good statisticians are very aware of them. Some variant of Silver’s method of averaging polls probably should have a place in quantitatively driven inquiries into the structure and meaning of texts and their words. Silver is very good at telling stories with numbers. Good stories with numbers may be harder to tell about literature than about politics or sports, but humanities scholars have lessons to learn from him. Current arguments about authorship attribution in Early Modern plays come to mind.

Collocation analysis

The linguist J. R. Firth famously observed that “you shall know a word by the company it keeps.” Collocation is the term of art for a variety of stochastic routines that identify a stronger association between words than you would expect from chance alone. Given the frequency of term A and term B you can compute the likelihood of their occurrence within some distance of each other. If their co-occurrence exceeds that likelihood by a significant margin something may be going on. Note however that the conventional threshold of “significance” (1 in 20) is worthless when it comes to written language. You need much more striking odds to claim that something is going on.

WordHoard uses four different formulas for determining the significance of collocation. In practice the “Specific Mutual Information” formula produces the best results in the sense that the results square with the expectations readers bring to texts they know something about. For instance, the collocates of ‘honour’ in Chaucer, Spenser, and Shakespeare generated by Specific Mutual Information make sense to me as a reader who has spent much time with their texts. Why bother with this if it only tells me what I know already? Good question, but if a machine passes a test with texts you know something about, you may trust it with texts you know nothing or little about.
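Specific mutual information, as I understand the WordHoard formula, is essentially what is now usually called pointwise mutual information, and its core fits into a few lines of Python; the counts in the example are invented:

from math import log2

def pmi(count_a, count_b, count_ab, n):
    """Pointwise mutual information: how much more often a and b co-occur within
    the chosen span than their separate frequencies would lead you to expect.
    count_a, count_b -- corpus counts of the two words
    count_ab         -- count of their co-occurrences within the span
    n                -- total number of tokens counted"""
    return log2((count_ab / n) / ((count_a / n) * (count_b / n)))

# Invented counts for illustration: 'honour' near 'knight'.
print(pmi(count_a=5000, count_b=2000, count_ab=300, n=2_000_000))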

The PhiloLogic search engine has a much cruder, but quite effective, collocation finder. If you look for words that most frequently collocate with ‘liberty’, ‘Christian’ is first by a wide margin. This may surprise modern readers, but a moment’s reflection on Romans 8:2 (or listening to Bach’s setting of it in Jesu meine Freude) will make sense of it.

Other forms of textual abstraction

Every method discussed so far depends on the “bag of words” model in which words are just treated like marbles of different colours without attention to their order. This is a horribly reductive model, but for many purposes it works better than it should. Philip Stone, author of the General Inquirer, remarked that “science consists in the systematic throwing away of evidence.”

In “bag of words” inquiries you often throw away “stop words”, the 150 or so function words that on the surface contribute little to the meaning of a text. This practice was a necessity thirty or more years ago when processing a million words was a big deal. Now you can afford to keep them, and some people say that “stop words are a bug, not a feature.”
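A minimal sketch of the bag-of-words reduction, with and without a stop-word list; the list here is a tiny illustrative fragment, not the canonical 150:

import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "not"}   # fragment only

def bag_of_words(text, drop_stop_words=False):
    """Reduce a text to unordered word counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)

sentence = "The quality of mercy is not strained"
print(bag_of_words(sentence))
print(bag_of_words(sentence, drop_stop_words=True))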

I once went to a talk where the author–a professional in the Intelligence business–described an experiment that took the opposite approach, replaced all the content words with place holders, and produced a meaningless syntactic skeleton. The author argued that for Arabic texts this was a surefire way of locating the geographic origin of the text or its author with considerable precision.

The TCP texts are all encoded in XML, which means that discursive units–paragraphs, verse lines, speeches, etc– are wrapped in container elements or tags delimited by the angle brackets familiar from HTML. The TCP texts use about three dozen of those elements. Anupam Basu is currently playing with an even more reductive model, in which all the content is thrown away, and the text is reduced to an empty skeleton of the element hierarchy together with a count of tokens inside each element that contains word tokens.  Thus

<lg>
<l>Mary had a little lamb</l>
<l>Its fleece was white as snow</l>
</lg>

turns into

<lg>
<l>5</l>
<l>6</l>
</lg>

This does not do very much for the reader of this or any other poem in isolation. But when confronted with the task of discovering resemblances among 60,000 texts, the throwing away of all “real” text turns up surprisingly strong clues for identifying text clusters.

This is most obviously apparent in the case of plays, where the frequency of <sp> and <speaker> tags identifies the genre with high precision. Moreover, explicit metadata (originally in Latin) are a constitutive part of the genre. But the tag distribution goes well beyond saying “I am a play or play-like thing”. It measures the division into verse and prose as well as the frequency of turn-taking. It gives you the “rhythm section” of a play and may even reveal distinct authorial habits.
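A sketch of how such a skeleton might be computed with Python’s standard XML tools; it counts whitespace-separated tokens rather than <w> elements, which is close enough for illustration:

import xml.etree.ElementTree as ET

def skeleton(element):
    """Replace the text inside each element with a token count,
    keeping the element hierarchy itself intact."""
    copy = ET.Element(element.tag)
    n_tokens = len((element.text or "").split())
    if n_tokens:
        copy.text = str(n_tokens)
    for child in element:
        copy.append(skeleton(child))
    return copy

source = ET.fromstring(
    "<lg><l>Mary had a little lamb</l><l>Its fleece was white as snow</l></lg>")
print(ET.tostring(skeleton(source), encoding="unicode"))
# prints <lg><l>5</l><l>6</l></lg>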

Linguistic Annotation

In the universe of Google searches you have to take texts as they come in hour by hour. Google does not  aim at creating, in Thucydides’ words, “a possession forever” (ktêma eis aei). Instead Google throws an extraordinary amount of human and machine resources at the task of making it simple for today’s users to find what they need right now. Tomorrow is another day with new data. But because yesterday’s data are not thrown away, the accumulated  data are like a gigantic attic with lots of treasures that you can rummage around in.

A data set like the two billion words of the 60,000 TCP texts is a very different thing. By Google standards it is tiny. It fits comfortably on an iPhone, leaving ample room for baby and vacation pictures. It is largely static even if hundreds of texts are added here or there. It will serve as the major documentary source for generations of scholars. It will usually provide the most convenient, and often the only access to a given text. And it will provide by far the most powerful and flexible environment for “text and context” inquiries.

The very different time frame changes the relationship between search tools and data. Whereas in the Google world you need very clever search tools to extract good enough ‘signals’ from very noisy and rapidly changing data, in an environment for long-term use of mostly stable data you can focus on reducing noise by refining and enriching the data. Your search engine should not make assumptions about what you are likely to want, but it should be able to assume that data are kept in a consistent and easily recoverable manner. The metaphor of a “cultural genome” is useful in this context. GenBank is “the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences”. ‘Annotation’ in the world of the life sciences involves steps that identify, segment, transcribe, and describe assemblies of nucleotides in a manner that creates interoperable data sets that a machine can process and from which humans can extract this or that for the purposes of analysis.

Linguistic annotation is a simpler version of these procedures. It turns each ‘word’ (not an obvious concept) into a discrete object with a formal description that lets users view it at different levels of abstraction or reduction. Consider this item at a molecular level of a machine-readable annotated text:

<w xml:id="a8bet-008-a-0170" lemma="love" pos="vvz" reg="loveth">louythe</w>

This describes a lexical item with a corpus-wide unique ID (the verb ‘love’) in a particular grammatical state (3rd person singular) in the surface form of its specific occurrence (‘louythe’). The description adds the most common surface form of this lemma in this state. This does not tell human readers of this word anything they do not already know. But a corpus query engine like BlackLab would within seconds provide results for queries like “show me all the instances where the lemma ‘love’ has the part-of-speech tag ‘vvz’” or “all the instances of the verb ‘love’ regardless of their grammatical state”. It can also answer more complex queries, e.g. adjectives preceding ‘liberty’, or syntactic patterns of the type ‘handsome, clever, and rich’.
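Even without a corpus query engine the annotation can be put to work directly. A minimal sketch with Python’s standard XML tools, assuming a hypothetical annotated file in which every word is wrapped in a <w> element like the one above:

import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

tree = ET.parse("A08bet.xml")   # hypothetical annotated TCP file

# All instances where the lemma 'love' carries the part-of-speech tag 'vvz'.
# The tag test tolerates both namespaced (TEI) and plain element names.
hits = [el for el in tree.iter()
        if el.tag.rpartition("}")[2] == "w"
        and el.get("lemma") == "love" and el.get("pos") == "vvz"]

for w in hits:
    print(w.get(XML_ID), w.text, w.get("reg"))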

Linguistic annotation is a procedure with substantial upstream costs but long-term downstream benefits for users. It is very often (nearly always?) the case that high-level formal or semantic structures of interest to a scholar leave traces in the distribution of low-level phenomena and that a firm sense of those distributions enriches our understanding of how the higher-level structures work. Pursuing low-level patterns by hand through many texts is a tedious and error-prone business. A corpus query engine and a consistently annotated corpus greatly reduce the time cost and increase the accuracy and complexity of such inquiries.

To return for a moment to the metaphor of the cultural genome, plagiarism detection software is a close relative of sequence alignment tools. Both depend on the ability to match fuzzy or incomplete versions of shared patterns. The power and relative ease of use of such tools has revived an interest in authorship attribution, but they also serve the broader purpose of pursuing many textual filiations that are not driven by a curiosity about who wrote what. The EEBO corpora are shot through with explicit citations whose network is an interesting object of study. The network of implicit citations or tacit echoes may be even denser, and its exploration more rewarding. David Smith at Northeastern University has discovered many such filiations with software that makes you wonder how a machine can capture passages that to a human reader are clearly alike although they differ in many small and not so small details.

Semantic Annotation

The Historical Thesaurus of the Oxford English Dictionary is based on an ontology developed over several decades by Michael Samuels and his colleagues at the University of Glasgow. It divides the world into three parts, external, mental, and social. The subcategories of each add up to 354, not to speak of their further subdivisions. Linguistic DNA, a project at the University of Sheffield, used this ontology in connection with the EEBO-TCP corpus to “model the semantic and conceptual changes which occurred in English printed discourse between c.1500 and c.1800”. The project relied in part on the MorphAdorned EEBO-TCP corpus.

Some words can be mapped unambiguously to one semantic category. Many cannot. The automatic mapping of polysemic words to a particular category in a given context is a very problematical business. But partial success may be valuable for some purposes. The potential of a historically oriented ontology for the analysis of an Early Modern English corpus is an opportunity that should be explored further.

Digital versions of Plain Old Indexes

There is still much room for old-fashioned but digital indexes that from the user’s perspective are very similar to the indexes in the Weimar edition of Luther and treat the corpus of 60,000 texts as if it were a single book.

Names

An Index Nominum has been a standard feature of learned books for centuries. In the EarlyPrint corpus names are tagged as such. There are about 1.3 million different spellings of names, and they add up to 43 million occurrences. God (1) and Christ (2) account for 15% of all naming events. England (5), Rome (6), Israel (15), London (16), and France (19) are the most common ‘place’ names. Among the top 100 names, Augustine (47), Luther (82), and Aristotle (98) are the three most common names that refer unambiguously to one historical individual. One can make something of that list.

I don’t know how many distinct names there are in the EEBO TCP corpus  (as opposed to their spellings).  Probably somewhere between 25,000 and 100,000.

To classify something as a name is a useful beginning, but it is not enough. Does John refer to the Baptist, the Apostle, or the English king? When is  ‘Worcester’ a place  and when is it a person?  MorphAdorner has a supplementary program that assigns names to places, organizations, and individuals, with possible sub-classifications as historical, Biblical, legendary, mythological, or literary. Purchas His Pilgrimage, a million-word ethnographic compilation from the early 17th century, probably contains a non-trivial percentage of names that occurred in Early Modern texts up to that date. It was processed by that program, and the results were reviewed by Katie Poland, a remarkable Northwestern undergraduate (BA 2017). Modern Named Entity Recognition programs (NER) are not much help with older names. The most promising route to getting to something like “two thirds of a loaf” in a short time will be to extrapolate from Purchas His Pilgrimage  as a kind of “training data” and mine  a variety of onomastic texts or gazetteers from the period. Unlike a printed index, a digital version can make heavy use of Linked Data and point directly to various modern authority lists.

Citations

Since half of the top 40 names in the EEBO corpus refer to Biblical characters, it is not surprising that Biblical citations dominate the citation network of the EEBO corpus. The modern division into chapter and verse was not completed until the 1550s, but it spread very rapidly. From the 1580s on Biblical verses were cited in reasonably consistent versions of the format we still use today. A complex regular expression can capture most of them. As you move back in time, there is more variance in the patterns, but there are also far fewer texts.
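Such an expression is easier to sketch than to perfect. A deliberately simplified version in Python, with only a handful of book abbreviations standing in for the full list:

import re

# A small fragment of the book abbreviations a real pattern would need.
BOOKS = r"(Gen|Exod|Deut|Psal|Esa|Mat|Math|Mark|Luke|Iohn|John|Act|Rom|Cor|Gal|Heb|Reu|Rev)"

citation = re.compile(BOOKS + r"\.?\s+(\d{1,3})\s*[.:,]\s*(\d{1,3})")

sample = "as S. Paul saith, Rom. 8. 2, the law of the spirit of life"
for book, chapter, verse in citation.findall(sample):
    print(book, chapter, verse)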

If you add the Apocrypha, the Bible has about six dozen “books”, 1,200 “chapters”, and 32,000 verses. A few verses are cited a lot, some verses not at all. If you think of the Bible as a Web site and of each citation as a ‘hit’ on a particular verse, a Google Analytics type approach to Bible visitors and their visits to different parts of the Bible over more than two centuries would have its interest. Neglected verses could be as interesting as the often cited ones.

Early Modern authors did not have the benefit of the Chicago Manual of Style, but from the late middle of the 16th century onward they used quite similar abbreviations and conventions of referring to titles, chapters, and pages. A combination of largely algorithmic procedures with some manual review could in a relatively short time create a digital index of citations that would be quite rough but would serve some purposes immediately and be a strong foundation for collaborative refinement and enrichment over time. In books before 1650 most references are to books in English or Latin. Almost all of the English and many of the Latin books have entries in the English Short Title Catalogue. If you get the reference to the title right, the ESTC number provides much useful information for network analyses of many types. A “Six Degrees of Francis Bacon” version of books rather than people has its appeals.

The construction of indexes of this type requires an initial phase in which a small team, relying as much as possible on algorithmic work, builds a prototype that is robust and contains enough information to have some use for many users. Once that is done, it is best to follow a “release early and often” policy and engage users in the correction and enrichment of data. Think of collaborative curation on the model of a potluck or eranos.