A Young Scholar edition is a project that fits into the scale of an undergraduate honors project. It will normally take as its point of departure a TEI-P5 version of a TCP text that has been linguistically annotated with MorphAdorner. Like ice skating, it consists of compulsory figures and a free form routine. The two crucial parts of the compulsory figures involve steps that will

  1. check whether the TCP transcription is an accurate version of the print original
  2. review the linguistic annotation for accuracy.

The free form routines are just that but will normally include some elements described below in greater detail. They should take advantage of the affordances that the digital medium offers to edition making.  The final version of a Young Scholar edition is subject to review by a scholar in the field. If approved, it will be published digitally with a Creative Commons license in an environment whose details remain to be determined.

In what follows I limit the discussion of Young Scholar editions to editions of Early Modern plays from the TCP collection in the context of Shakespeare His Contemporaries. But there are thousands of other Early Modern TCP texts that in terms of their length and content would make excellent choices for digital editorial projects by undergraduates.

Since no Young Scholar edition has been produced yet, the following guidelines should be taken as very provisional. Remember what is said about battle plans: they rarely survive the first encounter with the enemy.

The compulsory figures of a Young Scholar edition

Proofread the TCP transcription

The editor’s first duty is to review the TCP transcription word by word and make sure that it is a faithful transcription of the printed original, which in many cases will be the only source for the play. This review must be based  on AnnoLex, which will log editorial changes much more rigorously than a human hand could. It is also the only reliable way of managing recommendations for corrections to be included in the source texts of the TCP archive.

The acceptable rate of error in this task is zero. Careful proofreading will in nearly all cases lead to the discovery of words or passages that are doubtful or make no sense at all. At a minimum you will describe the problem as best you can. In some cases you may suggest ‘emendations’ with varying degrees of confidence.

Check the lemmatization and part-of-speech tagging

The text you work with has been linguistically annotated with MorphAdorner. That means that the text has been segmented into tokens, each of them representing a word or punctuation mark. Each token has three “positional attributes,” metadata that make some assertion about the token. The first is an ID that explicitly asserts the place of the token in the token stream. The second associates the token with a “lemma” or the form of the word in which you would look for it in a dictionary. “Love” is the lemma of “loves”, “loved”, “loving”, and “lovingly.” The third associates the token with a “part of speech” tag or POS: the POS tag of “lovingly” is “av-vvg,” which says that it is the adverbial form of the present participle of the verb ‘love’.
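A minimal sketch of what such a token looks like to a program may help. It assumes that each word sits in a TEI w element carrying xml:id, lemma, and pos attributes; those attribute names, the ID values, and the sample fragment are assumptions made for illustration and should be checked against the file you actually work with.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. The attribute names (xml:id, lemma, pos) and the
# ID values are assumptions, not a guaranteed description of the real files.
SAMPLE = """<l xmlns="http://www.tei-c.org/ns/1.0">
  <w xml:id="w0001" lemma="soft" pos="j">soft</w>
  <w xml:id="w0002" lemma="and" pos="cc">and</w>
  <w xml:id="w0003" lemma="love" pos="av-vvg">lovingly</w>
  <pc xml:id="w0004">.</pc>
</l>"""

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def read_tokens(xml_text):
    """Return one dict per word or punctuation token, in document order."""
    root = ET.fromstring(xml_text)
    tokens = []
    for el in root.iter():
        if el.tag in (TEI + "w", TEI + "pc"):
            tokens.append({
                "id": el.get(XML_ID),      # place in the token stream
                "form": el.text or "",     # spelling as transcribed
                "lemma": el.get("lemma"),  # dictionary form (None for punctuation here)
                "pos": el.get("pos"),      # part-of-speech tag
            })
    return tokens


tokens = read_tokens(SAMPLE)
for t in tokens:
    print(t["id"], t["form"], t["lemma"], t["pos"])
```

Each printed row shows the spelling together with the three assertions made about the token: its ID, its lemma, and its POS tag.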

The purpose of linguistic annotation is to make a text “machine-actionable.” Readers do not need to be told that the POS tag of “lovingly” is “av-vvg.” They have a practical knowledge so tacit that they do not even know they have it. Linguistic annotation is  an “explicitation” of tacit readerly knowledge for the benefit of the machine. Or rather for the benefit of human users who can use the speed of the machine to retrieve patterns that it would take readers a long time to find. If you come across phrases like “handsome, clever, and rich” or “soft, gentle, and low” you might be interested in finding other examples of this stylistic formula.  But where would you find the time to look for them in 600 different plays? But if you know how to talk to a machine that has a linguistically annotated text, you can give it a command like

[pos="j"][pos="j"][pos="cc"][pos="j"]

which means “look for a sequence of four tokens, where the first two are adjectives (‘j’), the third token is a conjunction, and the last token is an adjective.” It takes the machine less than a second to retrieve almost 800 occurrences of this pattern, including things like the description of a “dainty Gentlewoman” as “young, sweet, and modest,” a “rascall woman” as “lewd, abominable, and plain,” and the “Scottish king” as “dull, frostie, and wayward.” If you are moderately handy with some quite basic text processing routines you can download such a list and turn it into a three_adjectives spreadsheet that is sortable by pattern, author, play, or date. There is enough material there for an interesting essay about thematic or formal aspects of this pattern and the ways in which it does (or does not) circle around the trinity of the true, the good, and the beautiful.
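The logic of such a query can be sketched in a few lines of Python. This is not the query engine itself, only an illustration of the matching step. It assumes the text is already available as a list of (word, tag) pairs with punctuation tokens dropped; the function name find_pattern and the placeholder tag "x" for words of no interest are inventions for this example.

```python
def find_pattern(tokens, pattern):
    """Return every run of tokens whose POS tags match `pattern` in order."""
    n = len(pattern)
    hits = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(pos == want for (_, pos), want in zip(window, pattern)):
            hits.append(" ".join(word for word, _ in window))
    return hits


# Hypothetical word tokens; "x" is a placeholder tag, not a real POS tag.
tagged = [
    ("Gentlewoman", "x"),
    ("young", "j"),
    ("sweet", "j"),
    ("and", "cc"),
    ("modest", "j"),
    ("indeed", "x"),
]

hits = find_pattern(tagged, ["j", "j", "cc", "j"])
print(hits)  # ['young sweet and modest']
```

Collecting such hits together with author, play, and date for each one is all it takes to build a sortable table.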

It takes a little while to learn how to talk to the machine, but it is not rocket science.

Automatically produced POS tagging has an error of about 3%, and the errors tend to cluster in particular areas. The manually corrected annotation of a single play is a good thing, but the real benefit comes from doing enough of them over time to build what the NLP folks would call a “gold standard corpus.”

In proofreading a text you must do it word by word and make sure that the words are in the correct order with nothing missing or in the wrong place. When you check the POS tagging, it is more efficient to review the text by type of POS tag. If, for instance, you go through the list of words that are tagged ‘n2’ you only need to ask yourself whether the current word is a plural noun or not.
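Review by tag amounts to grouping the tokens by their POS tag and then reading down one list at a time. The following is a minimal sketch under the assumption that the tokens are available as (word, tag) pairs; the sample data and the function name forms_by_tag are hypothetical.

```python
from collections import Counter, defaultdict


def forms_by_tag(tokens):
    """Group word forms by POS tag and count how often each form occurs."""
    groups = defaultdict(Counter)
    for form, pos in tokens:
        groups[pos][form.lower()] += 1
    return groups


# Hypothetical sample; "loves" tagged n2 is the kind of item a reviewer
# would stop at to ask whether it is really a plural noun.
tokens = [
    ("kings", "n2"),
    ("ladies", "n2"),
    ("loves", "n2"),
    ("Kings", "n2"),
    ("rich", "j"),
]

groups = forms_by_tag(tokens)
for form, count in sorted(groups["n2"].items()):
    print(form, count)
```

Because identical forms collapse into one row with a count, the list to be read is much shorter than the text itself.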

Reviewing the linguistic annotation is a tedious task, but doing it for a single play is quite manageable, and its time cost is measured in hours.  Plays are relatively short documents.  Two thirds of the word tokens consist of very common words like ‘the’, ‘a’, ‘in’, ‘of’, ‘king’, ‘lady’.  There are few errors in such words, and you can scroll through them relatively quickly.

“Lemmatization” is the other part of linguistic annotation. The manual review of lemmatization is also tedious, but it is a lot easier because you can rely on your tacit knowledge of what the right lemma should be. If in doubt about the spelling, consult the Oxford English Dictionary.  At some time in the fall of 2013 we will have a lemma dictionary of the drama corpus.  The mapping of variant spellings of a name to its lemma remains a problem to be solved.

Correcting or creating a cast list

Cast lists are an important part of the metadata that are a constitutive feature of drama. Not all plays have them, and where they exist they are not necessarily complete. In the machine-actionable version of a text, it is important to map all the speaker labels to a “role ID” that corresponds to the name of the character in the cast list. This mapping may be created algorithmically, but you need to check it carefully and correct it where necessary.
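One way such an algorithmic mapping might work is to match each abbreviated speaker label against the beginning of the names in the cast list and to leave anything ambiguous for the editor. The sketch below assumes exactly that; the cast list, the role IDs, and the function names are hypothetical, and real speaker labels will be messier than this.

```python
import re

# Hypothetical cast list: role ID -> name as given in the cast list.
CAST = {
    "role-king": "King",
    "role-lady": "Lady",
    "role-lord": "Lord",
    "role-messenger": "Messenger",
}


def normalize(label):
    """Lower-case a label and strip everything except letters."""
    return re.sub(r"[^a-z]", "", label.lower())


def map_speakers(labels, cast):
    """Map each speaker label to a role ID, or to None if unclear."""
    mapping = {}
    for label in labels:
        key = normalize(label)
        candidates = [
            role_id for role_id, name in cast.items()
            if key and normalize(name).startswith(key)
        ]
        # Accept only unambiguous matches; the rest need a human decision.
        mapping[label] = candidates[0] if len(candidates) == 1 else None
    return mapping


mapping = map_speakers(["King.", "Kin.", "La.", "L.", "Mess."], CAST)
for label, role_id in mapping.items():
    print(label, "->", role_id)
```

The label "L." could be either the Lady or the Lord, so it is mapped to nothing and left for manual correction.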

Review and correct the division into acts and scenes

The division of a play into acts and scenes is also part of the genre’s conventional metadata. Some texts have errors in the numbering of acts and scenes. Some plays have acts, but no scenes. Some play texts lack any division into acts or scenes (The White Devil, the old Leir play). The absence of such divisions almost certainly does not mean that the author thought of the play as a single scene. Early Modern plays were not high-prestige objects, and the editorial attention they received from early publisher/printers was limited. So the reason for the absence of act/scene division is most probably found in the printer’s failure to get around to it.

If a play exists in later editions, you may want to adopt its way with acts and scenes, but you should look into the question how and when that division came about and what authority it has.  If there is no guidance from that source you should create a division that follows the stage directions. By convention in English plays, a scene changes when all characters leave the stage and a new set enters. But this is not universally true, and the stage directions for entries and exits may not be complete or unambiguous.
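The cleared-stage convention can be sketched as a simple rule: start a new scene whenever somebody enters an empty stage after earlier action. The sketch assumes that entries and exits have already been extracted from the stage directions as (kind, names) pairs, which is itself an editorial task; the event list and the function name split_scenes are hypothetical. Incomplete stage directions will produce wrong boundaries, so the output is a proposal to be checked.

```python
def split_scenes(events):
    """Split a list of ("enter" | "exit", [names]) events into scenes.

    A new scene begins when characters enter a stage that has been
    cleared by earlier exits.
    """
    on_stage = set()
    scenes = [[]]
    for kind, names in events:
        if kind == "enter" and not on_stage and scenes[-1]:
            scenes.append([])  # stage was empty: a new set enters
        scenes[-1].append((kind, names))
        if kind == "enter":
            on_stage.update(names)
        elif kind == "exit":
            on_stage.difference_update(names)
    return scenes


# Hypothetical sequence of entries and exits.
events = [
    ("enter", ["King", "Lady"]),
    ("exit", ["King"]),
    ("enter", ["Messenger"]),          # Lady is still on stage: same scene
    ("exit", ["Lady", "Messenger"]),   # stage is now clear
    ("enter", ["Lord"]),               # new scene
    ("exit", ["Lord"]),
]

scenes = split_scenes(events)
print(len(scenes))  # 2
```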

If a play has neither acts nor scenes, you can probably divide it into scenes with reasonable confidence. Division into acts is a much more arbitrary business, and you may be better off not trying it.

From a technical perspective, fiddling with the play’s acts and scenes will require some competence with handling XML documents with an XML editor such as oXygen. Becoming comfortable with basic forms of XML editing is a matter of days rather than hours. But it is not a matter of weeks.

Create a modern spelling edition of the text

From the combination of a lemma and a POS tag it is possible to derive an algorithmically created version of the text in modern spelling. This algorithmic version needs to be proofread, which is a relatively straightforward task. Apart from checking for gross errors, the biggest task will be to standardize capitalization. It is an open question whether punctuation needs changing.
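In its simplest form the derivation is a table lookup: the pair of lemma and POS tag points to a standard modern form, and any token without an entry keeps its old spelling and is flagged for the proofreader. The sketch below assumes such a table exists; its three entries, the sample tokens, and the function name modernize are made up for illustration and say nothing about how the actual routine is built.

```python
# Hypothetical lookup table: (lemma, POS tag) -> modern standard form.
STANDARD_FORMS = {
    ("love", "vvg"): "loving",
    ("love", "av-vvg"): "lovingly",
    ("king", "n2"): "kings",
}


def modernize(tokens, table):
    """Replace each original spelling by its modern form where one is known.

    Returns the modernized text and the list of spellings left unchanged.
    """
    out, unresolved = [], []
    for form, lemma, pos in tokens:
        modern = table.get((lemma, pos))
        if modern is None:
            modern = form            # keep the original spelling
            unresolved.append(form)  # and flag it for proofreading
        out.append(modern)
    return " ".join(out), unresolved


# Hypothetical tokens: (original spelling, lemma, POS tag).
tokens = [
    ("louing", "love", "vvg"),
    ("kinges", "king", "n2"),
    ("vvayward", "wayward", "j"),
]

text, unresolved = modernize(tokens, STANDARD_FORMS)
print(text)
print(unresolved)
```

The list of unresolved spellings tells the proofreader where to look first.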

The free-form features of the Young Scholar edition

The nature of the play and the interests of the editor will shape the free-form features of a Young Scholar edition. We are in a world of “may” or “should” rather than “must.” There are several features, however, whose absence will raise the eyebrows of a scholarly reader if the play or its history presents evidence for discussion.

The reception history of a play

Many plays in the SHC corpus were printed once and have slumbered in oblivion until they were woken up by their TCP transcription. Others were reprinted in their own day or in later centuries. Where there is documentary evidence for a reception history a Young Scholar edition should pay attention to it. Tracking down the bibliographical history of a play and what was said about it by whom, when, and where is an excellent way of learning something about scholarly method.

For the history of a play before 1800 there are three major sources:

  1. The English Short Title Catalogue (ESTC) will tell you whether and in what kinds of books a play was reprinted.
  2. The current EEBO-TCP corpus may contain texts that refer to particular authors and their plays. Don’t expect to find a lot there: discussions or reviews of individual works are rare before the late 17th century, and your play never made a Top 40 list.
  3. The ECCO collection lets you search the “dirty” OCR of ~200,000 books published in the 18th century. If there is more than an occasional 18th century reference to your play or its author, you should find something there, though it may take sleuthing to track it down.

For the nineteenth century and beyond your two major sources are JSTOR and the catalogues of major libraries. You should pay particular attention to the work published between ~1870 and ~1930, a period that saw the creation of a modern documentary infrastructure for the study of Early Modern drama.

Review of the scholarly literature

The review of the scholarly literature about your play is a subset of its reception history. If the play you choose has a substantial scholarly literature, you should reconsider your choice. You are unlikely to add much value to what has already been done, and a big point of the Young Scholar edition is in any case to take a fresh look at things that have been neglected.

Read what scholarly stuff there is and write a report of it that gives your readers a sense of what kinds of things interested the scholars, what questions they asked, and how they answered them. You may well conclude that what they said was mere drivel and has no bearing on a modern understanding of the play. But the errors and follies of an earlier age have their own interest. Remember, however, that the great historian Leopold von Ranke said that “all generations are equidistant from God” and avoid the attitude that David Bromwich (I believe) once characterized as “we are so smart now because they were so dumb then.”

Different versions of a play

If a play exists in more than one version you are potentially in the world of a “critical edition,” a project that meticulously assembles the different sources of a text, compares them, and analyzes their relationship. A critical edition will in most cases be beyond the scope of a Young Scholar edition, both in terms of editorial expertise and the time required. But different versions of a text will require some attention on your part. If you have access to facsimile images in EEBO or ECCO, spending some time with the images of each version will allow you to make some useful statements about them. Is the second edition just a reprint of the first? A play will typically consist of three dozen page images, so it does not take much time to compare one text with the other.

If the pages do not look the same, do some spot checking for differences in spelling and make some estimate about the type and extent of editorial intervention. If the inspection of a later version leads you to think that there are substantive differences it will almost certainly be worth your while paying attention to them.
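Where you have keyed or transcribed a sample passage from each version, the spot check can be helped along by a word-level comparison. The sketch below uses Python's standard difflib module and assumes two short plain-text transcriptions of the same passage; the second reading is invented for the example, and the function name word_differences is hypothetical.

```python
import difflib


def word_differences(first, second):
    """List the word-level differences between two versions of a passage."""
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


# The second reading is made up to show a spelling change.
first_version = "dull frostie and wayward king"
second_version = "dull frosty and wayward king"

diffs = word_differences(first_version, second_version)
print(diffs)  # [('replace', 'frostie', 'frosty')]
```

A list dominated by changes of this kind points to modernized spelling rather than substantive revision, though only a reading of the passages in context can settle that.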

An analytical summary of the play

Readers of your edition who have not read the play will be grateful for an analytical summary that tells them about its ‘what’, ‘why’ and ‘how’. This is quite a difficult thing to do. You cannot do it well without having a view of the play, and in doing it you may well develop or refine your view of it. But in this exercise you do not want to tell the reader what you think or how you feel about the play. You want to be as detached as possible. There are plenty of opportunities elsewhere in your project to express your opinions.

Explanatory notes

Every text of any length will pose some difficulties that require explanation. It will also contain passages that draw attention to themselves, whether they are very striking, very characteristic or, for that matter, quite uncharacteristic. Notes attached to particular places are the genre for this kind of explanation. Think of the writing of such notes as a way of delivering “just in time” knowledge to the reader.

Media based annotation

It has become technically trivial to attach audio or video clips to part of a text. Editors who are also actors or have friends in the theater world may want to include readings or stagings of particular scenes.  This may be of particular significance where a scene contains ambiguities and the actor’s gestures or vocal inflection underscore or undermine the surface meaning of a text.

Verbal profiling

As an editor of a Young Scholar edition you will have at your disposal very comprehensive resources that allow you to explore the verbal texture of a play and analyze the ways in which that texture is shaped by words or phrases that are used a lot, rarely, or not at all when compared with the corpus as a whole or subsets of it defined by author, genre, or period. These resources support close reading or stylistic analysis of a traditional kind or quantitatively based stylometric procedures that employ statistical analysis and visualization. This extensive and very fast access to linguistic data gives you a competitive advantage over scholars of an older generation. If you pursue it with discipline and imagination, it may be the area where you can make the biggest difference to the understanding of Early Modern drama.
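One common way of asking whether a word is used “a lot” or “rarely” in a play is the log-likelihood statistic, which compares the word's count in the play with its count in the rest of the corpus. It is offered here as one possible measure, not as the method built into any particular tool; the counts in the example are invented, and the function name log_likelihood is hypothetical.

```python
import math


def log_likelihood(count_play, size_play, count_rest, size_rest):
    """Log-likelihood (G2) of a word's frequency in a play vs. the rest.

    A value near 0 means the word is used at about the same rate in both;
    larger values mean a bigger difference in either direction.
    """
    total = count_play + count_rest
    expected_play = size_play * total / (size_play + size_rest)
    expected_rest = size_rest * total / (size_play + size_rest)
    g2 = 0.0
    for observed, expected in (
        (count_play, expected_play),
        (count_rest, expected_rest),
    ):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Invented counts: a word occurring 50 times in a 1,000-word play
# and 100 times in the remaining 10,000 words of a toy corpus.
score = log_likelihood(50, 1000, 100, 10000)
same_rate = log_likelihood(10, 1000, 100, 10000)
print(round(score, 2), round(same_rate, 2))
```

The statistic does not say whether the word is overused or underused; for that, compare the two relative frequencies directly.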