
LELA 30922 English Corpus Linguistics

LELA 30922 English Corpus Linguistics. Harold Somers Professor of Language Engineering Office: Lamb 1.15. Syllabus. Assessment. A practical project in which students will use the BNC (or other approved corpus material) to investigate some question of English language usage.


Presentation Transcript


  1. LELA 30922 English Corpus Linguistics • Harold Somers, Professor of Language Engineering • Office: Lamb 1.15

  2. Syllabus

  3. Assessment • A practical project in which students will use the BNC (or other approved corpus material) to investigate some question of English language usage. • Suggestion: base your project (more or less closely) on some existing study. • Project write-up will include relevant background material, plus results and discussion of a corpus-based analysis. • In other words: summarize (and criticize) the chosen study, then do your own version, and compare the results.

  4. Reading matter • Main recommendations: • Kennedy, G.D. (1998) An introduction to corpus linguistics. London: Longman. • McEnery, T. & A. Wilson (2001, 2nd ed) Corpus linguistics. Edinburgh: Edinburgh University Press. • Meyer, C. (2002) English corpus linguistics: An introduction. Cambridge: Cambridge University Press. • Lots of other books, focussing on particular aspects • Do not ignore journals (International Journal of Corpus Linguistics) and specialist conferences, especially when considering the practical assignment. • http://tinyurl.com/32abhb for list of resources available at UoM

  5. What is a corpus? • Corpus (pl. corpora) = ‘body’ • Collection of written text or transcribed speech • Usually but not necessarily purposefully collected • Usually but not necessarily structured • Usually but not necessarily annotated • (Usually stored on and accessible via computer) • Corpus ~ text archive

  6. Computers and corpus linguistics • Historically, manual analysis of large bodies of text (esp. in literary and biblical studies) • Error-prone, time-consuming, not verifiable • Computers have introduced • Reliability, accuracy and replicability • increased speed and capacity means you can do more on a grander scale • new tools mean you can do things you might not have thought of doing

  7. What is corpus linguistics? • Not a branch of linguistics, like socio~, psycho~, … • Not a theory of linguistics • A set of tools and methods (and a philosophy) to support linguistic investigation across all branches of the subject

  8. Evidence in linguistics • Real attested usage as linguistic evidence • Contrasts with introspective approach previously typical • Relates to the competence~performance (langue~parole) distinction • Corpus linguists often more interested in trends than rules (probabilities rather than certainties) • Famous stories of corpus evidence contradicting widely-held assumptions about language use.
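The point about trends and probabilities can be made concrete with a small sketch. The Python below is only an illustration, not part of the course materials: the sample sentence is invented, the tokenization rule is deliberately crude, and the function names `tokenize` and `relative_frequencies` are hypothetical. It turns raw counts into relative frequencies, which is the simplest kind of probability estimate a corpus can offer.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def relative_frequencies(tokens):
    """Return each word's share of all tokens, i.e. an estimated probability."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}


# Invented sample text, standing in for a real corpus such as the BNC.
sample = "The cat sat on the mat. The dog sat too."
tokens = tokenize(sample)
freqs = relative_frequencies(tokens)

# "the" accounts for 3 of the 10 tokens in this toy sample.
print(len(tokens), freqs["the"])
```

On a real corpus the same shares are usually reported per million words, so that corpora of different sizes can be compared.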

  9. Activities in corpus linguistics • Design and compilation of corpora • Development of tools for corpus analysis • Descriptive linguists using corpora to analyze lexical and grammatical behaviour of language, eg for lexicography • Exploiting corpora in applied linguistics – language teaching, translation.

  10. History of Corpus Linguistics • www.essex.ac.uk/linguistics/clmt/w3c/corpus_ling/content/history.html • Textual study has always included an element of counting and cataloguing, despite impracticalities – notably concordances of Shakespeare, the Bible, etc. • Arrival of computers in 1950s of course changed everything
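For readers who have not met a concordance, here is a minimal keyword-in-context (KWIC) sketch in Python. It is an assumption-laden illustration rather than any particular tool's method: the function name `kwic`, the `width` parameter and the simple `\w+` tokenization are all choices made for this example.

```python
import re


def kwic(text, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens either side of each hit."""
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Brackets mark the node word; strip() tidies hits at the text edges.
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


sample = "To be, or not to be, that is the question."
for line in kwic(sample, "be", width=2):
    print(line)
```

Compiling such a listing by hand for every word of a large text is the impracticality the slide refers to; with a computer it is a single pass over the tokens.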

  11. Brown corpus • First modern computer-readable corpus • W.N. Francis and H. Kucera, Brown University, Providence, RI • one million words of American English texts printed in 1961 • sampled from 15 different text categories • used as model for other corpora, including …

  12. LOB corpus • compiled by researchers in Lancaster, Oslo and Bergen • one million words of British English texts printed in 1961 • sampled from same 15 text categories as Brown corpus • All texts ≤ 2,000 words long • Kolhapur corpus of Indian English compiled in 1978 to same specification

  13. Chomsky’s criticisms • Chomsky’s ideas drove linguists away from empiricism (data) towards rationalism (introspection) • Chomsky switched focus onto abstract models of language competence • He was especially scathing about corpus-based approaches • His criticism rested on the mistaken view that corpus linguists confused finiteness of data with finiteness of language • See McEnery & Wilson, chapter 1

  14. The London-Lund Corpus of Spoken English (LLC) • First corpus of transcribed spoken language • Part of Survey of Spoken English at Lund University under the direction of J. Svartvik • 500,000 words of spoken British English recorded from 1953 to 1987 • different categories, such as spontaneous conversation, spontaneous commentary, spontaneous and prepared oration

  15. COBUILD • 1m-word corpus too small for many applications • 1980: Collins instigated collection of 20m-word corpus to support lexicographers writing the new Collins COBUILD dictionary (COBUILD = Collins Birmingham University International Language Database; John Sinclair) • Now expanded to Bank of English corpus, 320m words and growing • www.collins.co.uk/Corpus/CorpusSearch.aspx • www.collins.co.uk/books.aspx?group=153

  16. BNC (1995) • http://www.natcorp.ox.ac.uk/ • 100m-word collection of written and spoken text from 1975-93 (already dated in some respects!) • Carefully designed and balanced • Corpus is closed (finite, synchronic) • All text tagged to high quality • Lots of tools available for exploration

  17. etc. • Many other corpus projects now underway, sometimes modelled on BNC or other well-known corpora • Various national projects • Specialized corpora • Historical texts • Learner English • International English • Translated English • Spoken dialogues for certain domains • When widely used, they become a kind of benchmark, eg Wall Street Journal corpus (treebank) • This can have pros and cons
