Corpus Stylistics
Outline:
• Background and introduction to current work
• Methodology in Corpus Stylistics
• Applications of Corpus Stylistics
• References

Corpus Stylistics: Background
What is Corpus Stylistics?
• The statistical study of style, i.e. the study of the relative frequency of elements in a text
• Augustus de Morgan, 1851: suggested that disputes about the authenticity of some of the writings of St Paul might be settled by measuring the length of the words used in the various Epistles
• T.C. Mendenhall, 1887: analysis of several authors’ frequency distributions of word length

Corpus Stylistics
• Corpus: a body or collection of linguistic data for use in research
• Since the early 1960s: interest in computer corpora, i.e. machine-readable corpora
• Computer corpora allow much more accurate statements about the relative frequency of various linguistic items

Corpus Stylistics
• Some uses of statistical analysis of style through corpora:
  • Education, e.g. EFL textbook writing
  • Establishment of authorship, e.g. of unascribed manuscripts
  • Interpretive stylistics, e.g. study of the writer’s ideology and point of view

Corpus Stylistics: Methodology
• Simple things may characterise different styles
  • average sentence length
  • average word length
  • type:token ratio (vocabulary richness)
    • number of types = number of different words
    • number of tokens = total number of words
  • vocabulary growth (homogeneity of text)
    • number of new types in 1st, 2nd, …, nth 1000 words
    • in rich, varied text, the total number of types will keep climbing steadily
• Especially when used comparatively

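The simple measures listed above can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function names, the crude regex tokenizer and the sentence splitter are all assumptions, and real studies would use more careful tokenization.

```python
import re

def tokenize(text):
    """Lowercase word tokens (assumed, deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def split_sentences(text):
    """Split on runs of . ! ? (an assumption; ignores abbreviations etc.)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def style_measures(text):
    """Average sentence length, average word length and type:token ratio."""
    tokens = tokenize(text)
    sentences = split_sentences(text)
    types = set(tokens)  # types = different words
    return {
        "avg_sentence_length": len(tokens) / len(sentences),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "types": len(types),
        "tokens": len(tokens),  # tokens = total number of words
        "type_token_ratio": len(types) / len(tokens),
    }

def vocabulary_growth(tokens, chunk_size=1000):
    """Number of new types in the 1st, 2nd, ..., nth chunk of tokens."""
    seen = set()
    growth = []
    for start in range(0, len(tokens), chunk_size):
        chunk = set(tokens[start:start + chunk_size])
        growth.append(len(chunk - seen))  # types not met in earlier chunks
        seen |= chunk
    return growth

# Toy text, invented for illustration only
sample = "The cat sat on the mat. The dog sat on the log."
measures = style_measures(sample)
growth = vocabulary_growth(tokenize(sample), chunk_size=6)
```

Because the type:token ratio depends on text length, these figures are most useful when compared across samples of similar size.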
Corpus Stylistics: Methodology (cont’d)
• More complex analyses can give a more interesting picture
  • specific syntactic structures
  • degree of modification in NPs
  • types of verbs (e.g. verbs of persuasion, speech verbs, action verbs, descriptive verbs)
  • distribution of pronouns (1st/2nd/3rd person)
  • etc. … (anything you can think of!)
• Quite sophisticated mathematical techniques can give an overall picture
  • e.g. factor analysis: identifies from a (big) range of variables which ones best identify/characterise differences

Corpus Stylistics: Methodology (cont’d)
Multidimensional analysis
• Collect a wide range of measures of many different kinds
  • some simple word counts
  • syntactic features
  • classes and subclasses of N, V, Adj, Adv
• Factor analysis
  • choose a range of features to measure, see which ones are correlated

Corpus Stylistics: Methodology (cont’d)
• Example: corpus-based work that tries to quantify and characterise genre and register differences
• Work pioneered by Douglas Biber*
• Biber used statistical measures to identify stylistic features that co-occurred, and could therefore be definitional of text types and genres
• E.g. conjuncts like therefore, nevertheless and use of the passive together indicate a more formal style
* D. Biber, S. Conrad & R. Reppen, Corpus Linguistics: Investigating Language Structure and Use, Ch. 5: The study of discourse characteristics

Corpus Stylistics: Methodology (cont’d)
• Corpora are useful not only for counting frequencies of features, but also for:
• Concordancing
  • Lists occurrences of a word in context
  • Identify syntactic use of the word
  • Identify range of meanings
  • Identify relative frequency of different uses/meanings
• Collocation
  • What words occur together?
  • Compare distribution of close synonyms

Corpus Stylistics: Methodology (cont’d)
Vocabulary in context
• “Concordance”, also known as KWIC list (key word in context)
• Allows us to see the (immediate) environment in which a word appears
• Listings can be customised to show what you want more clearly, e.g.
  • sorted according to next or previous word
  • showing more or less context

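A KWIC listing of the kind described above can be sketched as follows. This is an illustrative sketch only: `kwic` and `format_kwic` are hypothetical names, the `width` and `sort_by` parameters are assumptions standing in for "more or less context" and "sorted according to next or previous word", and the tokenizer is deliberately simple.

```python
import re

def kwic(text, keyword, width=3, sort_by=None):
    """Return (left context, key word, right context) for every occurrence.

    width   -- number of words of context on each side
    sort_by -- None (text order), "next" or "previous"
    """
    tokens = re.findall(r"\w+", text.lower())
    keyword = keyword.lower()
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            rows.append((left, tok, right))
    if sort_by == "next":
        rows.sort(key=lambda r: r[2][:1])   # first word to the right
    elif sort_by == "previous":
        rows.sort(key=lambda r: r[0][-1:])  # last word to the left
    return rows

def format_kwic(rows):
    """Align the key word in a column, as concordancers usually do."""
    return [f"{' '.join(l):>30}  [{k}]  {' '.join(r)}" for l, k, r in rows]

# Toy text, invented for illustration only
text = "We had time to spare. Save time now. Time flies."
rows = kwic(text, "time", width=2, sort_by="next")
for line in format_kwic(rows):
    print(line)
```

Sorting by the next or previous word groups recurring patterns together, which makes different uses of the word easier to spot by eye.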
Corpus Stylistics: Methodology (cont’d)
Collocation
• Term coined by J R Firth (1957) to characterise (part of) his theory of meaning
  • “You shall know a word by the company it keeps”
• “The occurrence of two or more words within a short space of each other in a text” (Sinclair 1991)
• “The relationship a lexical item has with items that appear with greater than random probability in its (textual) context” (Hoey 1991)

Style and Corpora: Methodology (cont’d)
Collocation, text type and style: an example
• Distinguish general, more usual collocations from technical or more personal ones
  • e.g. in a general corpus, time collocates with save, spend, waste, fritter away, …
  • but in a corpus of sports reports, time collocates with half, full, extra, injury, first, second, third, …

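The contrast above can be explored with a simple window-based collocate count, in the spirit of Sinclair's "short space" definition. The sketch below is an assumption-laden illustration: `collocates` is a hypothetical helper, the `span` of words on each side is arbitrary, and the two "corpora" are invented one-line texts rather than real data. It counts raw co-occurrence only and does not test for "greater than random probability".

```python
import re
from collections import Counter

def collocates(text, node, span=2):
    """Count words occurring within `span` words either side of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented toy texts standing in for a general corpus and a sports corpus
general = "spend time wisely and do not waste time"
sports = "goal in injury time and again in extra time"

general_collocates = collocates(general, "time", span=1)
sports_collocates = collocates(sports, "time", span=1)
print(general_collocates.most_common())
print(sports_collocates.most_common())
```

With real corpora the counts would normally be compared against each word's overall frequency, so that very common words such as "the" do not dominate the list.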
Style and Corpora: Applications
Stylometry
• An attempt to capture the essence of the style of a particular author by reference to a variety of quantitative criteria, usually lexical, called discriminators
• Study of frequently occurring features (word/sentence length; choice and frequency of words; vocabulary richness)
• The ideal situation for authorship studies is
  • when there are large amounts of undisputed text, or
  • few contenders for the authorship of the disputed text(s)

Style and Corpora: Applications (cont’d)
Author attribution
Establishing the author of an unascribed manuscript:
• Build corpora
  • A - works definitely by author A
  • B - works definitely by author B
  • C - works of disputed authorship, but probably written by A or B
• Then select discriminants and associated measures
• When the technique has been shown to discriminate effectively between A and B, then try it on C
(M. Oakes: ‘Computational Stylometry’, in Handbook of Corpus Linguistics)

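The A/B/C procedure above can be sketched as follows. Everything specific here is an assumption made for illustration: the discriminants are relative frequencies of a hand-picked word list, the measure is a simple sum of absolute differences, and the texts are invented. The slides do not prescribe any of these choices, and real studies use many more discriminants and far larger corpora.

```python
import re

def profile(text, discriminants):
    """Relative frequency of each discriminant word (text must be non-empty)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    return [tokens.count(w) / total for w in discriminants]

def distance(p, q):
    """Sum of absolute differences between two profiles (assumed measure)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def attribute(disputed, corpus_a, corpus_b, discriminants):
    """Assign the disputed text C to whichever of A or B it is closer to."""
    pc = profile(disputed, discriminants)
    da = distance(pc, profile(corpus_a, discriminants))
    db = distance(pc, profile(corpus_b, discriminants))
    return ("A" if da < db else "B"), da, db

# Invented toy corpora, for illustration only
corpus_a = "upon the hill and upon the shore we stood upon the rock"
corpus_b = "while the rain fell and while the wind blew we sat while the fire burned"
corpus_c = "upon the bridge we waited upon the hour"

label, dist_a, dist_b = attribute(corpus_c, corpus_a, corpus_b, ["upon", "while"])
print(label, dist_a, dist_b)
```

Before trusting the result for C, the same function would first be run on held-back texts known to be by A and by B, to check that the chosen discriminants really separate the two authors.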
Style and Corpora: Applications (cont’d)
Language Learning
• Frequency, in particular word frequency, had a role in language learning even before electronic corpora existed
• The “corpus revolution” made frequency information about language use available on an unprecedented scale
• Frequency dictionaries and frequency-based grammatical information are becoming more and more available, and new sources of frequency information from the Web are being tapped
• Various kinds of knowledge found in present-day language textbooks (grammatical, collocational, semantic) are increasingly frequency-based
• In general, corpora represent real usage of language
• In addition, “more frequent” can equal “more important” in many aspects of language learning

Style and Corpora: Applications (cont’d)
Interpretive stylistics
• Programs like WordSmith Tools and other Windows-based applications allow researchers to derive a list of keywords (words which occur significantly more often than expected in a text when compared to a reference corpus)
• Keywords are a powerful and quick means of analysis, and they have been used to examine discourses relating to specific social and cultural issues, and the ideology behind authors / texts
• See e.g. work by P. Baker on gender and sexual identity

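Keyword extraction of the kind described above can be sketched with the log-likelihood statistic, one commonly used keyness measure. The slides do not say which statistic is meant, so this choice is an assumption, as are the function names, the tokenizer and the toy texts. The default threshold of 3.84 is the chi-square critical value for p < 0.05 with one degree of freedom.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens (assumed, deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def log_likelihood(a, b, c, d):
    """Keyness of a word occurring a times in a study corpus of c tokens
    and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_text, reference_text, threshold=3.84):
    """Words relatively more frequent in the study text, ranked by keyness."""
    study = Counter(tokenize(study_text))
    ref = Counter(tokenize(reference_text))
    c, d = sum(study.values()), sum(ref.values())
    result = []
    for word, a in study.items():
        b = ref[word]
        if a / c > b / d:  # keep only over-represented words
            score = log_likelihood(a, b, c, d)
            if score >= threshold:
                result.append((word, score))
    return sorted(result, key=lambda pair: pair[1], reverse=True)

# Invented toy texts, for illustration only
study = "goal goal goal goal goal match"
reference = "match day match day match day"
key_list = keywords(study, reference)
print(key_list)
```

In practice the reference corpus should be much larger than the study text, and the resulting keywords are then read in context (e.g. via a concordance) before any interpretive claim is made.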
Reading
Leech, G. Language in Literature: Style and Foregrounding (Longman, 2008), ch. 11
Leech, G. and Short, M. Style in Fiction (Routledge, 2007), ch. 2 and 3
Semino, E. and Short, M. Corpus Stylistics: Speech, Writing and Thought Presentation in a Corpus of English Writing (Routledge, 2004)
Short, M. Exploring the Language of Poems, Plays and Prose (Longman, 1996), ch. 11