Introduction to Corpus Linguistics Xiaofei Lu APLNG 482Y November 11, 2008
Overview • What is a corpus • Corpus design and compilation • Corpus annotation • Corpus querying and analysis • Resources
What is a corpus? • Leech (1992): • an unexciting phenomenon, a helluva lot of text, stored on a computer • Francis (1982): • a collection of texts assumed to be representative of a given language, dialect, or other subset of a language to be used for linguistic analysis • Sinclair (1991): • a collection of naturally-occurring language text, chosen to characterise a state or a variety of language
Types of corpora • General-purpose vs. specialized corpora • The British National Corpus • Michigan Corpus of Academic Spoken English • Native vs. learner corpora • International Corpus of Learner English • Monolingual vs. parallel & comparable corpora • The JRC-Acquis Multilingual Parallel Corpus • The English-Chinese Parallel Concordancer • Corpora representing one or diverse language varieties • International Corpus of English • Synchronic vs. diachronic corpora • Spoken vs. written corpora
Corpus design • Purpose, type • Content: mode, interaction, domain, medium • Structure, size: comparability, proportions • Data sources, sampling • Design of the BNC
Corpus annotation • Why annotate • Levels of corpus annotation • Difficulties for corpus annotation
Why annotate • For linguistic research • Allow more effective corpus searches • For natural language processing • Spelling and grammar checking • Machine translation • Question answering
Levels of corpus annotation • Sentence and word segmentation • Part-of-speech (POS) tagging • Syntactic parsing • Semantic, pragmatic and discourse annotation • Learner corpora: error annotation
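The first level above, sentence and word segmentation, can be sketched with two regular expressions. This is only a rough illustration, not the method of any particular tool: the function names `split_sentences` and `split_words` and both splitting rules are assumptions made for the example, and the rules will fail on abbreviations such as "Dr." and similar cases.

```python
import re

def split_sentences(text):
    """Naive rule (an assumption): a sentence ends at . ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    """Return word tokens (keeping internal apostrophes/hyphens) and punctuation marks."""
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)

text = "I saw a pig. It didn't see me!"
sentences = split_sentences(text)
words = [split_words(s) for s in sentences]
```

Higher levels (POS tagging, parsing) are normally produced by trained tools rather than hand-written rules like these.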
Difficulties for corpus annotation • Ambiguity • I saw a pig with binoculars. • Problems for tagging, parsing, & word sense disambiguation (WSD) • Unknown words • Identification • POS tagging • Semantic annotation
Corpus querying and analysis • Using Windows- or web-based software • Good for processing raw corpora • Word frequency, concordances, lexical bundles, and keyword lists • Examples: AntConc and GOLD • Using natural language processing tools • Good for processing annotated corpora • Extracting occurrences of grammatical patterns • Examples: Stanford parser and Tregex
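Two of the raw-corpus analyses listed above, word frequency lists and concordances, are simple enough to sketch in a few lines. The code below is a minimal illustration under stated assumptions, not how AntConc or any other tool works: the helper names (`tokenize`, `word_frequencies`, `concordance`), the crude tokenization rule, and the sample text are all made up for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract alphabetic word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def word_frequencies(tokens):
    """Map each word type to the number of times it occurs."""
    return Counter(tokens)

def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with `width` tokens on either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Hypothetical two-sentence "corpus" for illustration only.
sample = "The pig saw the farmer. The farmer saw the pig with binoculars."
tokens = tokenize(sample)
freqs = word_frequencies(tokens)
kwic = concordance(tokens, "saw", width=2)

for line in kwic:
    print(line)
print(freqs.most_common(3))
```

Real tools add sorting of context words, regular-expression queries, and handling of much larger files.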
Interpreting corpus data • Are frequency differences statistically significant? • w appears x times in an n-word corpus, and y times in an m-word corpus • Chi-square test • Fisher’s Exact Test
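The comparison above (w appears x times in an n-word corpus and y times in an m-word corpus) amounts to testing a 2x2 table of "w" versus "all other words" in the two corpora. The sketch below uses only the Python standard library; the function names and the example counts are assumptions for illustration, the chi-square version omits any continuity correction, and the Fisher version uses one common two-sided convention (summing all tables no more probable than the observed one).

```python
from math import comb, erfc, sqrt

def chi_square_2x2(x, n, y, m):
    """Pearson chi-square (1 df, no continuity correction) and its p-value.

    Table rows: corpus 1 = (x, n - x), corpus 2 = (y, m - y).
    Assumes w occurs at least once overall and is not the only word.
    """
    a, b, c, d = x, n - x, y, m - y
    total = n + m
    chi2 = total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))

def fisher_exact_2x2(x, n, y, m):
    """Two-sided Fisher's exact p-value for the same table (hypergeometric sum)."""
    k = x + y                      # total occurrences of w
    denom = comb(n + m, k)

    def prob(i):
        # Probability that i of the k occurrences fall in corpus 1.
        return comb(n, i) * comb(m, k - i) / denom

    p_obs = prob(x)
    lo, hi = max(0, k - m), min(k, n)
    # Small tolerance so ties with the observed table are counted.
    return min(1.0, sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9)))

# Hypothetical counts: w occurs 30 times in 1,000 words vs. 10 times in 1,000 words.
chi2, p_chi = chi_square_2x2(30, 1000, 10, 1000)
p_fisher = fisher_exact_2x2(30, 1000, 10, 1000)
print(round(chi2, 2), p_chi, p_fisher)
```

Fisher's Exact Test is usually preferred when expected counts are small; the exact sum here becomes slow for very large corpora, where a statistics package is the practical choice.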
Interpreting corpus data (cont.) • Collocation analysis • How strongly are x and y associated? • Mutual information: compares the observed co-occurrence frequency of (x, y) with the frequency expected by chance • T-test: measures the confidence with which a strong association between x and y can be claimed
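Both measures can be computed from four counts: the frequency of x, the frequency of y, their co-occurrence frequency, and the corpus size. The slide gives no formulas, so the sketch below assumes the formulations commonly used in corpus linguistics, MI = log2(O / E) and t = (O - E) / sqrt(O), with expected frequency E = f(x) f(y) / N. The function names and the counts are hypothetical.

```python
from math import log2, sqrt

def expected_cooccurrence(fx, fy, n):
    """Co-occurrences expected if x and y were independent in an n-word corpus."""
    return fx * fy / n

def mutual_information(fxy, fx, fy, n):
    """MI score: log2 of observed over expected co-occurrence frequency."""
    return log2(fxy / expected_cooccurrence(fx, fy, n))

def t_score(fxy, fx, fy, n):
    """t-score: (observed - expected) scaled by sqrt(observed)."""
    return (fxy - expected_cooccurrence(fx, fy, n)) / sqrt(fxy)

# Hypothetical counts: x and y each occur 100 times in a 1,280,000-word
# corpus and co-occur 8 times.
mi = mutual_information(8, 100, 100, 1_280_000)
t = t_score(8, 100, 100, 1_280_000)
print(mi, round(t, 3))
```

MI tends to favor rare word pairs, while the t-score favors frequent ones, which is one reason the two are often reported together.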
Resources • Books • Hunston (2002): Corpora in Applied Linguistics • McEnery, Xiao & Tono (2006): Corpus-Based Language Studies • Journals • International Journal of Corpus Linguistics • Corpora • Websites and mailing lists • Bookmarks for corpus-based linguists • Linguistic Data Consortium • The Corpora List
Resources (cont.) • Corpus annotation and analysis tools • Stanford Natural Language Processing Group • Places for exploration • MICASE • BNC Online • Courses on corpus linguistics • Computational and Statistical Methods for Corpus Analysis (Summer 2009) • Seminar on Applied Corpus Linguistics (Fall 2009)
Note on research project design • Purpose of project • Corpus compilation and annotation • Corpus analysis • Bottom-up: from observations of recurring patterns to hypotheses and generalizations • Top-down: start with given categories and search for evidence of use and variance • Caution on generalizability