LELA 30922 English Corpus Linguistics. Harold Somers Professor of Language Engineering Office: Lamb 1.15. Syllabus. Assessment. A practical project in which students will use the BNC (or other approved corpus material) to investigate some question of English language usage.
LELA 30922 English Corpus Linguistics • Harold Somers • Professor of Language Engineering • Office: Lamb 1.15
Assessment • A practical project in which students will use the BNC (or other approved corpus material) to investigate some question of English language usage. • Suggestion: base your project (more or less closely) on some existing study. • Project write-up will include relevant background material, plus the results and discussion of a corpus-based analysis. • In other words: summarize (and criticize) the chosen study, then do your own version, and compare the results.
Reading matter • Main recommendations: • Kennedy, G.D. (1998) An introduction to corpus linguistics. London: Longman. • McEnery, T. & A. Wilson (2001, 2nd ed) Corpus linguistics. Edinburgh: Edinburgh University Press. • Meyer, C. (2002) English corpus linguistics: An introduction. Cambridge: Cambridge University Press. • Lots of other books, focussing on particular aspects • Do not ignore journals (e.g. International Journal of Corpus Linguistics) and specialist conferences, especially when planning the practical assignment. • http://tinyurl.com/32abhb for list of resources available at UoM
What is a corpus? • Corpus (pl. corpora) = ‘body’ • Collection of written text or transcribed speech • Usually but not necessarily purposefully collected • Usually but not necessarily structured • Usually but not necessarily annotated • (Usually stored on and accessible via computer) • Corpus ~ text archive
Computers and corpus linguistics • Historically, manual analysis of large bodies of text (esp. in literary and biblical studies) • Error-prone, time-consuming, not verifiable • Computers have introduced • Reliability, accuracy and replicability • increased speed and capacity means you can do more on a grander scale • new tools mean you can do things you might not have thought of doing
What is corpus linguistics? • Not a branch of linguistics, like socio~, psycho~, … • Not a theory of linguistics • A set of tools and methods (and a philosophy) to support linguistic investigation across all branches of the subject
Evidence in linguistics • Real attested usage as linguistic evidence • Contrasts with introspective approach previously typical • Relates to the competence~performance (langue~parole) distinction • Corpus linguists often more interested in trends than rules (probabilities rather than certainties) • Famous stories of corpus evidence contradicting widely-held assumptions about language use.
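The "trends rather than rules" point can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the toy sentence, the simple regex tokenizer, and the helper names (`tokenize`, `relative_frequencies`, `per_million`) are all hypothetical and not part of the course materials. It estimates how probable a word is in a text and normalizes the count per million words, which is one common way to compare corpora of different sizes.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def relative_frequencies(tokens):
    """Map each word type to its share of all tokens (an estimated probability)."""
    total = len(tokens)
    return {word: count / total for word, count in Counter(tokens).items()}


def per_million(word, tokens):
    """Frequency of `word` normalized to occurrences per million tokens."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens) * 1_000_000


# Toy data only; a real study would read text from a corpus such as the BNC.
toy_text = "The cat sat on the mat. The end."
tokens = tokenize(toy_text)

print(relative_frequencies(tokens)["the"])  # share of tokens that are "the"
print(per_million("the", tokens))           # the same figure, scaled per million words
```

With a sample this small the figures mean little; the normalization only becomes informative on corpora of realistic size.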
Activities in corpus linguistics • Design and compilation of corpora • Development of tools for corpus analysis • Descriptive linguists using corpora to analyze lexical and grammatical behaviour of language, eg for lexicography • Exploiting corpora in applied linguistics – language teaching, translation.
History of Corpus Linguistics • www.essex.ac.uk/linguistics/clmt/w3c/corpus_ling/content/history.html • Textual study has always included an element of counting and cataloguing, despite impracticalities – notably concordances of Shakespeare, the Bible, etc. • Arrival of computers in 1950s of course changed everything
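A concordance lists every occurrence of a word together with its immediate context. As a rough sketch of what a computer makes trivial here, the Python below builds a keyword-in-context (KWIC) listing. The function name `kwic`, the `width` parameter, the whitespace tokenization, and the sample line are assumptions chosen for illustration, not a description of any particular concordancing tool.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, node word, right context) for each hit of `keyword`.

    `width` is the number of tokens shown on each side of the node word.
    Matching is case-insensitive.
    """
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Toy input; naive whitespace splitting stands in for proper tokenization.
sample = "to be or not to be that is the question"
sample_tokens = sample.split()

# Print the hits with the node word aligned in a central column.
for left, node, right in kwic(sample_tokens, "be", width=2):
    print(f"{left:>12}  [{node}]  {right}")
```

Aligning the node word in one column is what lets recurring patterns to its left and right stand out when scanning many hits.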
Brown corpus • First modern computer-readable corpus • W.N. Francis and H. Kucera, Brown University, Providence, RI • one million words of American English texts printed in 1961 • sampled from 15 different text categories • used as model for other corpora, including …
LOB corpus • compiled by researchers in Lancaster, Oslo and Bergen • one million words of British English texts printed in 1961 • sampled from same 15 text categories as Brown corpus • All texts ≤ 2,000 words long • Kolhapur corpus of Indian English compiled in 1978 to same specification
Chomsky’s criticisms • Chomsky’s ideas drove linguists away from empiricism (data) towards rationalism (introspection) • Chomsky switched focus onto abstract models of language competence • He was especially scathing about corpus-based approaches • Based on mistaken view that corpus linguists confused finiteness of data with finiteness of language • See McEnery & Wilson, chapter 1
The London-Lund Corpus of Spoken English (LLC) • First corpus of transcribed spoken language • Part of Survey of Spoken English at Lund University under the direction of J. Svartvik • 500,000 words of spoken British English recorded from 1953 to 1987 • different categories, such as spontaneous conversation, spontaneous commentary, spontaneous and prepared oration
COBUILD • 1m-word corpus too small for many applications • 1980: Collins instigated collection of 20m-word corpus to support lexicographers writing a new learners' dictionary (John Sinclair) • COBUILD = Collins Birmingham University International Language Database • Now expanded to Bank of English corpus, 320m words and growing • www.collins.co.uk/Corpus/CorpusSearch.aspx • www.collins.co.uk/books.aspx?group=153
BNC (1995) • http://www.natcorp.ox.ac.uk/ • 100m word collection of written and spoken text from 1975-93 (already dated in some respects!) • Carefully designed and balanced • Corpus is closed (finite, synchronic) • All text tagged to high quality • Lots of tools available for exploration
etc. • Many other corpus projects now underway, sometimes modelled on BNC or other well-known corpora • Various national projects • Specialized corpora • Historical texts • Learner English • International English • Translated English • Spoken dialogues for certain domains • When widely used, they become a kind of benchmark, eg Wall Street Journal corpus (treebank) • This can have pros and cons