Corpora in literary and stylistic studies

Corpora in literary and stylistic studies Corpus Linguistics Richard Xiao lancsxiaoz@googlemail.com

Aims of this session • Lecture • An overview of applications of corpora in literary and stylistic studies • Case study: Culpeper’s (2002) keyword analysis of six characters in Romeo and Juliet • Lab session • To duplicate Culpeper’s (2002) study

Corpora vs. literary stylistics • Stylistic shifts in usage may be observed with reference to features associated with either particular situations of use or particular groups of speakers (cf. Schilling-Estes 2002: 375) • In this sense, similar to registers and genres or dialects and language varieties • …but stylisticians are typically more interested in individual works by individual authors rather than language or language variety as such • The use of corpora in stylistics and literary studies is presently very limited

Potential uses of corpora • Study of prose style • Study of individual authorial styles • Authorship attribution • Literary appreciation and criticism • Teaching of stylistics • Study of literariness in discourses other than literary texts (e.g. Carter 1999)

Study of prose style • In stylistics, there is a long tradition of focusing on the representation of speech and thought in fiction • Leech and Short’s (1981) influential model of speech and thought presentation • Style in Fiction, Longman, 1981 • Further refined in Short, Semino and Culpeper (1996), and Semino, Short and Culpeper (1997)

S&TP: Lancaster Speech, Thought and Writing Presentation Corpus • Developed during 1994-2003 • Written: 260,000 words in size, three narrative genres: prose fiction, newspaper reportage and (auto)biography, which are further divided into ‘serious’ and ‘popular’ sections • Spoken: created with the express aim of comparing S&TP in spoken and written language systematically, 260,000 words, 60 samples from BNCdemo, and 60 samples from oral history archives in the Centre for North West Regional Studies at Lancaster • Download: http://ota.ahds.ac.uk/headers/2464.xml

S&TP categories • Direct category, e.g. • direct speech, direct thought and direct writing • Free direct category, e.g. • free direct speech, free direct thought, free direct writing • Indirect category, e.g. • indirect speech, indirect thought, indirect writing • Free indirect category • free indirect speech, free indirect thought, free indirect writing • Representation of speech/thought/writing act category • Representation of voice/internal state/writing category • Report category, e.g. • report of speech, report of thought, report of writing

Authorial styles of individual authors • Typically specialized corpora of the works of individual authors, e.g. • A corpus composed of their early and later works to track any stylistic shift over time • A corpus composed of their works belonging to different genres (e.g. plays and essays) to compare their styles across genres • A corpus composed of works by different authors to compare their different authorial styles • Large general corpora can provide ‘a means of establishing a norm for comparison when discussing features of literary style’ (Hunston 2002: 128)

Techniques of studying authorial styles • Corpus stylistics goes well beyond simple counting and relies heavily on sophisticated statistical approaches • MDA (e.g. Watson 1994) • Principal Component Analysis (e.g. Binongo and Smith 1999) • Multivariate analysis (or more specifically, cluster analysis, e.g. Watson 1999; Hoover 2003) • Stylistics + computation + statistics • stylometry, stylometrics, computational stylistics, statistical stylistics, corpus stylistics

Authorship attribution • Is the work by Shakespeare or Marlowe? • Cluster analysis of frequent words, frequent word sequences, and frequent collocations provides an accurate and robust method for authorship attribution (Hoover 2001, 2002, 2003a, 2003b) • Corpus-based authorship attribution has been used as linguistic evidence in court (“forensic linguistics”) • Confession/witness statements (e.g. Coulthard 1993) • Blackmail/ransom/suicide notes (Baldauf 1999) • Plagiarism detection in academic and education settings (e.g. Turnitin UK)

The Derek Bentley case • Derek Bentley was hanged in the UK in 1953 for allegedly encouraging his young companion Chris Craig (a minor) to shoot a policeman • The evidence that weighed against him was a confession statement that he signed in police custody; at the trial he claimed that the police had ‘helped’ him to produce it • The case was re-opened in 1993, 40 years after Derek was hanged • Malcolm Coulthard, a forensic linguist, was commissioned by Bentley’s family to examine the confession as part of an appeal to get a posthumous pardon for Derek

The Derek Bentley case • The appeal was initially rejected by the Home Secretary • In 1998, another court of appeal overturned the original conviction and found Derek Bentley innocent • In 1999 the Home Secretary awarded compensation to the Bentley family

The Derek Bentley case • In Bentley’s confession, the word then was unusually frequent • It occurred 10 times in his 582-word confession statement, ranking as the 8th most frequent word in the statement • It ranked 58th in a corpus of spoken English, and 83rd in the Bank of English (on average once every 500 words) • Six witness statements • 3 made by other witnesses: then occurs just once in 980 words • 3 by police officers, including two involved in the Bentley case: then occurs 29 times – once in every 78 words!

The Derek Bentley case • The position of then • Subject + then (e.g. I then, Chris then) was unusually frequent in Bentley’s confession • I then occurs three times (once every 190 words) • In a 1.5-million-word corpus of spoken English, the sequence occurs just nine times (once every 165,000 words) • No instance of I then was found in ordinary witness statements • Nine occurrences were found in the police statement • In the spoken BoE, then I was 10 times as frequent as I then
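The "once every N words" figures on the last two slides are simple rate calculations. A minimal Python sketch, using only the counts quoted above (the helper name `words_per_occurrence` is ours, not Coulthard's):

```python
def words_per_occurrence(occurrences: int, total_words: int) -> float:
    """Average number of running words per occurrence of an item."""
    if occurrences <= 0:
        raise ValueError("occurrences must be positive")
    return total_words / occurrences


# Counts as quoted on the slides
then_in_confession = words_per_occurrence(10, 582)       # 'then' in Bentley's statement
then_in_witnesses = words_per_occurrence(1, 980)         # 'then' in ordinary witness statements
i_then_in_confession = words_per_occurrence(3, 582)      # 'I then' in Bentley's statement
i_then_in_spoken = words_per_occurrence(9, 1_500_000)    # 'I then' in the spoken corpus

for label, rate in [
    ("'then', Bentley's statement", then_in_confession),
    ("'then', other witnesses", then_in_witnesses),
    ("'I then', Bentley's statement", i_then_in_confession),
    ("'I then', spoken corpus", i_then_in_spoken),
]:
    print(f"{label}: once every {rate:,.0f} words")
```

The exact values for "I then" come out at about 194 and about 166,667 words, which the slide rounds to 190 and 165,000.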

The Derek Bentley case • The sequence subject + then was characteristic of the police statement • Although the police denied Bentley’s claim and said that the statement was a verbatim record of what Bentley had actually said, the unusual frequency of then and its abnormal position could be taken to be indicative of some intrusion of the policemen’s register in the statement

Culpeper (2002) • Culpeper, Jonathan (2002) Computers, language and characterisation: An analysis of six characters in Romeo and Juliet. In U. Melander-Marttala, C. Ostman and Merja Kyto (eds.), Conversation in Life and in Literature. Uppsala: Universitetstryckeriet, pp.11-30. • www.lexically.net/wordsmith/corpus_linguistics_links/Keywords-Culpeper.pdf

Aim of Culpeper (2002) • ‘The broad aim of this paper is to show how the study of an important area within “stylistics”, namely characterisation, can benefit from an empirical approach, specifically, a methodology for identifying what might be the “key” words of a text … Such an approach can reveal significant lexical and grammatical patterns without reliance on speculations about what the relevant dimensions are’ (Culpeper 2002: 12)

Keywords vs. style-markers • Enkvist (1964: 29) • ‘Style is concerned with frequencies of linguistic items in a given context, and thus with contextual probabilities.’ • ‘To measure the style of a passage, the frequencies of its linguistic items […] must be compared with the corresponding features in another text or corpus which is regarded as a norm and which has a definite relationship with this passage.’ • Style as a matter of ‘frequencies’, ‘probabilities’ and ‘norms’ • ‘We may […] define style markers as those linguistic items that only appear, or are most or least frequent in, one group of contexts. In other words, style markers are contextually bound linguistic elements…’ (ibid. 34-5) • ‘Elements that are not style markers are stylistically neutral.’ (ibid. 35) • ‘Style-markers…are words whose frequencies differ significantly from their frequencies in a norm’ (Culpeper 2002: 13) • Keywords (positive and negative)
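To make "frequencies differ significantly from their frequencies in a norm" concrete, here is a sketch of one common keyness statistic, the two-term log-likelihood (G2). This is an assumption for illustration: the slides do not say which statistic Culpeper selected, and the function name and the counts below are hypothetical.

```python
import math


def log_likelihood_keyness(freq_target, size_target, freq_ref, size_ref):
    """Signed log-likelihood (G2) score for one word.

    Positive score: the word is relatively more frequent in the target text
    (a positive keyword candidate). Negative score: relatively less frequent
    (a negative keyword candidate).
    """
    total = freq_target + freq_ref
    if total == 0:
        return 0.0
    # Expected frequencies if the word were spread evenly across both texts
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    g2 = 0.0
    for observed, expected in ((freq_target, expected_target), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    g2 *= 2
    overused = freq_target / size_target >= freq_ref / size_ref
    return g2 if overused else -g2


# Hypothetical counts: a word occurring 30 times in a 5,000-word target text
# and 20 times in a 20,000-word reference text.
print(log_likelihood_keyness(30, 5000, 20, 20000))
```

A word used at the same relative rate in both texts scores 0; the larger the absolute score, the stronger the evidence that the word is a style-marker in Enkvist's sense.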

Preparing the text • Problem 1: Which text to use … original version or modern version? • Culpeper opted for a modern edition (to get round the problem of spelling variation: sweet vs. sweete, etc.) • Problem 2: Shakespeare plays are full of dialogue • How can we get the tool to distinguish between different characters? • Culpeper used a simple tagging scheme, e.g. <ROM>…<\ROM> <JUL>…<\JUL>

Who is worth concentrating on …? • Culpeper chose his characters based on the number of words that they “spoke”

Choosing a reference corpus • Culpeper opted to make 6 reference corpora – one for each character, e.g. • RC for Romeo = whole play minus Romeo’s contributions • RC for Juliet = whole play minus Juliet’s contributions • RC for Nurse = whole play minus Nurse’s contributions • … • Why use a reference corpus of the same play? • ‘Characters are partly shaped by their context. Thus, it makes little sense to compare, say, the characters of Romeo and Juliet with the characters of Macbeth or Anthony and Cleopatra, since the fictional worlds of Italy, Scotland and Egypt provide very different contextual influences. Furthermore, characters, like people, are partly perceived in terms of whom they interact with …’ (Culpeper 2002: 16)
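The "whole play minus the target character" design can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure Culpeper ran: the function name, the (speaker, text) input format and the crude tokenizer are all assumptions.

```python
import re
from collections import Counter


def target_and_reference(speeches, target):
    """Split (speaker, text) pairs into word counts for the target character
    and word counts for everybody else (the reference corpus)."""
    target_counts, reference_counts = Counter(), Counter()
    for speaker, text in speeches:
        # Crude tokenizer (assumption): lower-case runs of letters/apostrophes
        words = re.findall(r"[a-z']+", text.lower())
        if speaker == target:
            target_counts.update(words)
        else:
            reference_counts.update(words)
    return target_counts, reference_counts


# A few lines of dialogue between Benvolio and Romeo
speeches = [
    ("Benvolio", "Good morrow, cousin."),
    ("Romeo", "Is the day so young?"),
    ("Benvolio", "But new struck nine."),
    ("Romeo", "Ay me! sad hours seem long."),
]

romeo_tc, romeo_rc = target_and_reference(speeches, "Romeo")
print(sum(romeo_tc.values()), "target words;", sum(romeo_rc.values()), "reference words")
```

Calling the function once per character gives the six target/reference pairs; no word is ever counted in both halves of a pair.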

Alternative reference corpora …? • Scott and Tribble (2006) have compared Romeo and Juliet against • The Complete Works of Shakespeare • Plays only • Tragedies only • The BNC • Interestingly … they found that • A ‘robust core’ of keywords occurs whichever reference corpus is used. These include personal and place names like “Benvolio”, “Romeo”, “Juliet” and “Mantua” but also terms like “banished”, “county”, “love” and “night” • In contrast to Scott and Tribble (2006), Culpeper (2002) found that his results were more meaningful - in terms of characterisation - when using the other Romeo and Juliet characters (minus the target character) as a reference corpus

Making wordlists for each character • Making the characters’ word lists • Involves telling Wordsmith to only include <…> … <\…> • Procedure … • Wordlist – Settings – Wordlist specific – Tags – Only part of file – Sections to keep – [specifying start/end tags] • Making the reference corpora • Involves telling Wordsmith to exclude anything between <…> … <\…> • Procedure … • Wordlist – Settings – Wordlist specific – Tags – Only part of file – Sections to cut out – [specifying start/end tags]

Top 10 on wordlists (frequency) Q: Do they tell us anything interesting/worthwhile and, if so, what?

Positive keywords for the six characters What differences can you spot between the results here and the results on the previous table?

What key words can tell us about characterisation … • Romeo’s top three key words – ‘beauty’, ‘blessed’, ‘love’ • Expected? Surprising? … the lover of the play • Other keywords related to ‘love talk’ = ‘dear’, ‘stars’, ‘fair’ • Keywords relating to body parts – ‘eyes’, ‘lips’, ‘hand’ – obsessed with the physical? • Juliet’s top key words – ‘if’, ‘or’, ‘be’, ‘yet’, ‘would’ (conditional + modals) • Reflecting her state of mind – anxiety and uncertainty? • Capulet’s most ‘key’ key word – ‘go’ • Context reveals that it is mostly used as an imperative … Capulet as head of the household directing other people (see also ‘make’ and ‘haste’), e.g. • Go wake Juliet, go and trim her up… • Nurse’s keywords are surge features (i.e. reflecting outbursts of emotion) – ‘god’, ‘warrant’, ‘woeful’, ‘faith’, ‘marry’, ‘ah’

Negative key words for the six characters IMPORTANT These represent words that are used unusually infrequently (statistically speaking) by these characters. Do you notice anything interesting?

Use of Pronouns within Romeo and Juliet • Romeo and Juliet use first and second person pronouns • Expected? - “at the heart of the social interaction in the play” • But compare Romeo’s use of ‘me/mine’ with Juliet’s use of ‘I’ … • Culpeper’s (2002) conclusion: ‘Juliet spends much time in the play bearing her soul … whereas Romeo is much more conscious of his own role as a lover and of the effect of the circumstances upon him’ (ibid: 24) • What about Capulet? – “you”, “we”, “our”, why? • Thou-forms vs. you-forms to be covered

Culpeper’s Conclusion (2002: 27) • “In some cases, my analysis provided solid evidence for what one might have guessed (e.g. Romeo’s keywords ‘beauty’ and ‘love’) …” • “… in others, it revealed what I think would be very difficult to guess but fits well a possible interpretation (e.g. Juliet’s keywords ‘if’ and ‘yet’).” • “… keywords analysis also offers a way into analysing function words, such as pronouns, and accounting for their contribution to style and meaning”

What should we take note of …? • How he was able to come to his conclusions • The importance of having the right reference corpus • The need to use mark-up (as a means of identifying the different characters) • Knowing how to use Wordsmith … • To make the different wordlists • To make the keyword lists

Any potential weaknesses … • It did not attempt to lemmatize the word forms, so inflected forms were counted separately (with lemmatization, for example, ‘loves’ would form part of the word count of ‘love’) (Culpeper 2002: 27) • Contractions (e.g. I’ll) were treated as single words, and so counted separately from their component words • Key word analysis … • makes us focus on ‘statistical deviations from a relative norm, and ignores the significance of relatively infrequent deviations from absolute norms’ (i.e. what your given texts may have in common) • ignores one-off occurrences of words

Now it’s your turn… Duplicating Culpeper (2002)

The Romeo text • Download the “Oxford Shakespeare” version of Romeo and Juliet • http://www.bartleby.com/70/index38.html • Local copy available • Using tags to separate stage directions from dialogues • Did Culpeper do this? • Tag words spoken by each character • Alternatively, you can use a local version I have prepared

Sample of tagged text • <Exeunt MONTAGUE and LADY. ROMEO. > • <Ben.> Good morrow, cousin. <\Ben.> • <Rom.> Is the day so young? <\Rom.> • <Ben.> But new struck nine. <\Ben.> • <Rom.> Ay me! sad hours seem long. Was that my father that went hence so fast? <\Rom.> • <Ben.> It was. What sadness lengthens Romeo’s hours? <\Ben.> • <Rom.> Not having that, which having, makes them short. <\Rom.> • <Ben.> In love? <\Ben.> • <Rom.> Out— <\Rom.> • <Ben.> Of love? <\Ben.> • <Rom.> Out of her favour, where I am in love. <\Rom.>
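Outside WordSmith, the same tag format can be read with a regular expression. The sketch below is an illustration under our own assumptions (the function name and regex are not part of the lab materials); it collects each speaker's lines and skips stage directions, which have no closing tag.

```python
import re
from collections import defaultdict

# Matches <Name> ... <\Name>; the backreference \1 forces the closing tag to
# repeat the opening tag's name, so untagged stage directions never match.
SPEECH = re.compile(r"<([^<>\\]+)>(.*?)<\\\1>", re.DOTALL)


def speeches_by_speaker(tagged_text):
    """Return a dict mapping each speaker tag to a list of that speaker's lines."""
    speeches = defaultdict(list)
    for speaker, text in SPEECH.findall(tagged_text):
        speeches[speaker].append(text.strip())
    return dict(speeches)


SAMPLE = r"""<Exeunt MONTAGUE and LADY. ROMEO. >
<Ben.> Good morrow, cousin. <\Ben.>
<Rom.> Is the day so young? <\Rom.>
<Ben.> But new struck nine. <\Ben.>
<Rom.> Ay me! sad hours seem long. <\Rom.>"""

result = speeches_by_speaker(SAMPLE)
for speaker, lines in result.items():
    print(speaker, lines)
```

Joining the lines under one tag gives that character's target text; joining everything under the other tags gives the matching reference text.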

Separating words by apostrophes: clear ‘ from this box and press OK

Making a wordlist for each character • Start wordlist function • Load the text • Setting – Tags – Only part of File - “Sections to keep” – type in the start/end tags given below • Ignore <*> is default setting – ignore stage directions • Make a wordlist for • Romeo_TC (<Rom.>…<\Rom.>) • Juliet_TC (<Jul.>…<\Jul.>) • Capulet_TC (<Cap.>…<\Cap.>) • Nurse_TC (<Nurse.>…<\Nurse.>) • Mercutio_TC (<Mer.>…<\Mer.>) • Friar_L_TC (<Fri._L.>…<\Fri._L.>)

Tag and markup Only Part of file

Making a reference list for each character • Setting – Tags – Only part of File - “Sections to cut out” – type in the start/end tags given below • Excluding what is said by the target character • Make a wordlist for • Romeo_RC (<Rom.>…<\Rom.>) • Juliet_RC (<Jul.>…<\Jul.>) • Capulet_RC (<Cap.>…<\Cap.>) • Nurse_RC (<Nurse.>…<\Nurse.>) • Mercutio_RC (<Mer.>…<\Mer.>) • Friar_L_RC (<Fri._L.>…<\Fri._L.>)

Running words

Discrepancies: Some explanations • Different tagging • We ignored stage directions • We tried what Culpeper (2002) suggested at the end of his paper, treating contracted words such as “I’ll” as two words • A potential problem of this approach with Shakespearean texts • danc’d, disturb’d, rais’d, etc. all became two words! • Is there a need to annotate the text? • Not done here or in Culpeper (2002), but worth the effort • the city’s side • let’s away • Where’s this girl? • Want to have a try? • http://ucrel.lancs.ac.uk/claws/trial.html
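The effect of the apostrophe setting is easy to reproduce. The sketch below is a simplified stand-in for WordSmith's tokenizer (an assumption, not its actual rules): it tokenizes the same phrases with and without the apostrophe as a word separator.

```python
import re


def tokenize(text, split_on_apostrophe):
    """Lower-case word tokens; optionally treat the apostrophe as a separator."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    if split_on_apostrophe:
        pattern = r"[a-z]+"                 # the apostrophe ends a word
    else:
        pattern = r"[a-z]+(?:'[a-z]+)*"     # the apostrophe stays inside the word
    return re.findall(pattern, text)


examples = "I'll danc'd let's away"
print(tokenize(examples, split_on_apostrophe=True))
print(tokenize(examples, split_on_apostrophe=False))
```

With splitting on, "danc'd" yields the fragments "danc" and "d", which is how a stray "d" can end up on a wordlist or keyword list.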

Top 10 on wordlists • Romeo Juliet Capulet • Nurse Mercutio Friar L whole play

Keyword settings • Cutoff p value • Selected statistic formula • Min. frequency

Making a keyword list per character • Romeo_kw • Romeo_TC + Romeo_RC • Juliet_kw • Juliet_TC + Juliet_RC • Capulet_kw • Capulet_TC + Capulet_RC • Nurse_kw • Nurse_TC + Nurse_RC • Mercutio_kw • Mercutio_TC + Mercutio_RC • Friar_L_kw • Friar_L_TC + Friar_L_RC

Romeo’s keywords by keyness • Positive keywords • Aboutness: beauty, love, blessed, dream, joy, sin, kiss, death, poison, soul … • Love talk: dear, farewell, stars • Body parts: eyes, lips, hand • Pronouns: mine, me, thine, thee, my • Negative keywords • Himself: Romeo, he, him • Both: you, we • Movement: come, go, up

Juliet’s keywords by keyness • Positive keywords • People in interaction: nurse, Romeo, sweet, husband, mother, father • State of mind: if, or, be, yet, would • Pronouns: my, I, thou • Aboutness: news, words, night, swear, send, tongue, speak • Negative keywords • Herself: her • Both: we, you • Movement: here, go

Why “nurse” and “husband”? (vocative function)

You-forms vs. thou-forms • Plural: ye, you, your, yours, yourself • Singular: thou, thee, thy, thine, thyself • Socio-pragmatic implications • Romeo and Juliet prefer thou-forms (positive keywords) and avoid you-forms (negative keywords) • High-status social equals use you-forms • You-forms are dispassionate and emotionally unmarked • Thou-forms are strongly expressive: positive (affection and love) or negative (anger and contempt) – intimacy, love talk • Friar Laurence prefers thou-forms: he is engaged in intimate and emotionally charged discourse • Capulet and the Nurse prefer you-forms, which are used among social superiors or by individuals of low status talking to people of high social status
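A quick way to check a character's preference is to count the two sets of forms listed above in that character's text. A minimal sketch (the function name is ours, and the input line is only an illustrative example; a real check would use the whole target corpus):

```python
import re
from collections import Counter

# Form lists taken from the slide above
YOU_FORMS = {"ye", "you", "your", "yours", "yourself"}
THOU_FORMS = {"thou", "thee", "thy", "thine", "thyself"}


def pronoun_profile(text):
    """Count you-forms and thou-forms in a text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        "you-forms": sum(counts[w] for w in YOU_FORMS),
        "thou-forms": sum(counts[w] for w in THOU_FORMS),
    }


line = "O Romeo, Romeo! wherefore art thou Romeo? Deny thy father and refuse thy name"
profile = pronoun_profile(line)
print(profile)
```

Raw counts like these only describe one text; whether a preference is "key" still depends on comparison with the reference corpus.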

Capulet’s keywords by keyness • Positive keywords • Directions: go, haste, make, now, look (imperatives) • Pronouns: you, we, her, our (directing and speaking on behalf of the household) etc… [you vs. thou: imperative; less emotional] • Negative keywords • Pron: thy, thou • Others: the, of, that, etc. [full of actions, not a ‘nouny’ style]

Nurse’s keywords by keyness • Positive keywords • Emotional: ay, ah, O, God, woeful, warrant, faith • Pronouns: you, your, he, I • Address terms: lady, madam, lord, sir • Why “day”? - “O day! O day! O day! O hateful day!” • Negative keywords • Pron: thou • Why ’d?

Why “d”? Perhaps Culpeper made the right decision in treating contractions as one word?
