Not quite two years ago I wrote an open letter about the TEI in which I wondered about its successes and failures.  I wrote about  “a thought experiment where you ask the chairs of history, literature, linguistics, philosophy, and religion departments of the world’s 100 top universities to write a sentence or short paragraph about the TEI. These would be very short sentences or paragraphs. The one message you would not get from them is the recognition that the TEI offers an important enabling technology for work in their disciplines.”   Encoders get a lot of pleasure and satisfaction from solving the many problems involved in encoding texts of any complexity.

But what about the added value of TEI specific encoding for the historian, linguist, philosopher, literary critic etc.? How can they decode or get at it, and what does it do for them? The answer is that for the most part they cannot get at it at all. I remember a conversation with a librarian who said something like “Oh yes, those TEI texts. We put them through the Lucene indexer and that’s pretty much it.” In principle, TEI encoding increases the query potential of the digital surrogate that is created by it. In practice, most of that query potential is ignored by the indexing and search software through which the encoded texts are mediated. Or if it is not ignored, it is used as instructions for XSLT style sheets to render the XML in HTML. As a result, the scholarly end users who encounter TEI-encoded texts almost never encounter them in an environment where they can take advantage of the distinct affordances of that encoding.

You can spend a lot of time explaining to your colleagues in an English department that it is a wonderful thing for texts to be encoded in TEI because it offers a much more robust, granular, and flexible way of storing textual data in digital form. But if what they see is browser-rendered HTML and if what they search might as well be a plain text file, it is not easy to persuade them of the value of this robust, granular, and flexible encoding. It does nothing to help them with their current project. Thus it is not much of an exaggeration to say that for ordinary scholarly users, TEI-encoded texts right now offer no advantage over plain-text, HTML, or EPUB texts. Nietzsche once exclaimed in exasperation: “Was hilft mir der echte Text wenn ich ihn nicht verstehe?” or “What use is the true text if I don’t understand it?” One might vary this into “What use is the encoded text if I cannot decode it?”

We are still a long way from a situation in which large text archives, encoded in TEI, allow non-technical users to explore the full query potential created by that encoding and use it for searches that combine bibliographical, linguistic, and structural criteria in a flexible and comprehensive manner. But BlackLab, a project of the Institute of Dutch Lexicology, takes some significant steps in that direction.

Phil Burns, the developer of MorphAdorner, has put up a proof-of-concept site that uses BlackLab to search a TEI-P5 encoded and linguistically annotated version of the ECCO texts as well as of the Wordhoard Shakespeare. This corpussearch site is still in an embryonic state, especially with regard to the incorporation of structural features. The documentation is rudimentary, though very clear if you know your way around regular expressions. But even this early implementation shows that with BlackLab you can build a site that lets non-technical users with a very modest grasp of regular expressions execute quite complex queries across large data sets and download the result sets for further manual or algorithmic treatment. (You download results by clicking on the inconspicuous TSV link at bottom right.)

Some years ago I listened to a presentation in which Martin Wattenberg showed visualizations of phrases like “the king’s daughter.”  You learn quite a bit about the structure of a world when you know who owns what.  You can look for phrases of this type in the ECCO corpus with the command

[pos="ng1"] [pos="n.*"]

This will within seconds retrieve a list of 100,000 hits, which you can download into a file (which may take a while). But you can restrict the search to verse by giving the command in this form

[pos="ng1" & verse="y"] [pos="n.*"]

and get only 28,000 hits. You can constrain it further by bibliographical criteria, such as author, title, or date of publication.

The command language may not be very intuitive, but neither is it very difficult. It is the kind of thing that paralegals in law firms do all the time. You probably will not pick it up in five minutes, but 80% of what you need can be learned in a few hours, and it will become second nature after a few days. If you write a dissertation, an article, or a book, and your research depends on finding words and phrases of various kinds, the required upfront investment is quite low, and the downstream benefits can be very high, both in lowering the time cost of searches and in letting you find stuff that you might otherwise not find at all.

If you work with plays you may want to focus only on the words that characters speak and that audiences were expected to hear. A search like

["king" & spoken="y"]

will pick up all occurrences in which the word occurs in an <sp> element and none of the many cases in which “King” appears in a speaker label.
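By way of illustration, and sticking to attributes that have already appeared above, you can combine that restriction with a part-of-speech constraint, for instance to ask which adjectives immediately precede a spoken “king”:

[pos="j" & spoken="y"] ["king" & spoken="y"]

Both tokens carry the spoken constraint so that the adjective is not picked up from a neighbouring speaker label or stage direction.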

If you are a history of the book person, you may sometimes be less interested in the “text itself” (whatever that means) than in the stuff that appears on title pages, in prefaces, notes, indexes, etc. This version of corpussearch includes a distinction (inherited from the MONK project) that divides text into “main text” and “paratext.” The distinction is crude, but for many purposes useful. “Main text” includes the words that readers think of as making up the “book” they read, typically the <p> and <l> children of the <div> elements that are children of the <body> element of a TEI text. Paratext is the rest.

If you do not like that crude division, you can look for stuff that appears only in the <front> or the <body> or the <back> elements of texts in the corpus. You will need to know something about the underlying encoding of the text. If you Google for nice restaurants in Copenhagen, you trust that Google will know what you want. That works a lot of the time, but when you do scholarship there is no way around knowing your data.
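As an illustration, and assuming that these elements are indexed as searchable spans (which depends on how the corpus was built), BlackLab’s query language has a within operator that restricts a pattern to a given element, along the lines of

[lem="reader"] within <front/>

which would find forms of “reader” only where they occur in the front matter of a text.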

The structural features that are currently implemented in BlackLab are a first stab. It may not be possible in BlackLab to search by the combination of all structural “elements” that are used in the TEI encoding of the TCP texts. But a 65/35, perhaps even 80/20, implementation seems eminently doable. Getting at 65% of the value added by TEI encoding is a lot better than getting at nothing (not to speak of 80%). And nothing is pretty much the going percentage now.

If you want to understand what search engines like BlackLab can do, you need to move beyond the “readerly” notion that a text is an ordered sequence of words. Instead you want to think of it as an ordered sequence of locations or addresses, each of which has  “attributes.”  The spelling at a particular address is just one “positional attribute”. Its part-of-speech may be another, and its place in a structural hierarchy  (verse or <l> rather than prose or <p>) may be a third.  Searching then becomes a hunt for addresses with some desired set of positional attributes.
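To make that concrete, and reusing the attribute names from the queries above, a single pair of square brackets describes one address, and you can stack constraints on several of its attributes at once:

[lem="king" & pos="n.*" & verse="y"]

asks for an address whose lemma is “king,” whose part of speech is some kind of noun, and which sits in verse rather than prose.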

The most interesting searches often combine values from different types of attributes. What kinds of adjectives are used to describe rulers in Shakespeare? You can try

[pos="j"][lem="king|queen|prince|tyrant|sovereign|duke"]

and get a list of 471 hits.

What about “love” near “death”?  Try this:

[lem="love"] [] {1,5} [lem="death"]

Not very pretty, but quite clear and not rocket science. The square brackets contain the search instructions for the positional attribute(s) of a token. Square brackets with nothing in them mean “anything”. A quite modest knowledge of regular expressions includes knowing that curly braces mark “occurrence indicators,” where

{1} = exactly 1

{1,5} = between 1 and 5

So [lem="love"] [] {1,5} [lem="death"] translates into

a phrase that begins with any form of the lemma ‘love’ followed by at least one and no more than five words and terminated by any form of the lemma ‘death’
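Note that the query is directional: it finds “love” before “death,” not the reverse. To catch the other order as well you would also run its mirror image

[lem="death"] [] {1,5} [lem="love"]

and combine the two result sets.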

After a while you become attached to that form of shorthand.

My favourite search is looking for phrases that meet the three-adjective rule, such as “handsome, clever, and rich.” I love to learn that “the Scottish king grows dull, frostie, and wayward.” You learn such things  with the command

[pos="j"]{2} [pos="cc"] [pos="j"]

There are 69 hits in Shakespeare, on average twice per play, although there are eight plays in which it does not appear at all. Nine hits don’t fit the pattern for one reason or another. I thought at first that negative chains predominated, but the actual ratio of bad and good phrases is 33:27. Perhaps we respond more strongly to ‘foggy, raw, and dull’ than ‘bold, just, and impartial.’ There are not quite 800 hits in the 630 plays of Early Modern drama, for an average frequency of 1.3 per play. So Shakespeare may use this type of phrase more often than his contemporaries. But on closer inspection this may have to do more with genre than author.