Research methods in corpus linguistics Xiaofei Lu
Overview • What is a corpus? • Types of corpora • Corpus design • Where to obtain corpora • Corpus annotation • Corpus analysis • Note on research project design • Exercises and demos in between • Future courses on corpus linguistics
What is a corpus? • Leech (1992): • an unexciting phenomenon, a helluva lot of text, stored on a computer • Francis (1982): • a collection of texts assumed to be representative of a given language, dialect, or other subset of a language to be used for linguistic analysis • Sinclair (1991): • a collection of naturally-occurring language text, chosen to characterise a state or a variety of language
Types of corpora • General-purpose monolingual corpora • The British National Corpus • Specialized corpora • Lancaster Corpus of Academic Written English • Learner corpora • International Corpus of Learner English • Parallel & comparable corpora • The JRC-Acquis Multilingual Parallel Corpus • The English-Chinese Parallel Concordancer • Corpora and varieties • International Corpus of English • Synchronic and diachronic corpora
Corpus design • Purpose • Comparability • Type • Content: mode, interaction, domain, medium • Structure: proportions • Size • Sampling? • Design of the BNC
Where to obtain corpora • Linguistic Data Consortium • Bookmarks for corpus-based linguists • Ask on the Corpora List • Compile your own corpora • Design your corpus • Get permission • File format, metadata, and data markup • Text capture • Scanning, typing, electronic files, web crawlers, e.g., WebSPHINX • Transcription tools, e.g., Transcriber • A Guide to Good Practice
Corpus annotation • Why annotate • Levels of corpus annotation • Difficulties for corpus annotation • Tools for corpus annotation
Why annotate • For linguistic research • Allow more effective corpus searches • For natural language processing • Spelling and grammar checking • Text summarization • Machine translation • Question answering
Levels of corpus annotation • Sentence segmentation • Word segmentation/tokenization • Part-of-speech (POS) tagging • Chunking/shallow parsing • Syntactic parsing • Semantic annotation • Pragmatic annotation • Parallel corpora: sentence alignment • Learner corpora: error annotation
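The first two levels above, sentence segmentation and tokenization, can be sketched with regular expressions. This is a deliberately naive illustration rather than the method used by any particular tool: the function names and the splitting rules are assumptions made for this example, and the sentence rule breaks on abbreviations such as "Dr. Smith".

```python
import re

def split_sentences(text):
    """Naive segmentation: break after ., ! or ? when followed by
    whitespace and an uppercase letter or an opening quote."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z"])', text)
    return [p.strip() for p in parts if p.strip()]

def tokenize(sentence):
    """Naive tokenization: words (keeping internal apostrophes and
    hyphens) or single punctuation marks."""
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)

text = "I saw a pig with binoculars. It didn't see me! Did it?"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sent, toks in zip(sentences, tokens):
    print(sent, "->", toks)
```

Real corpora need more careful rules (abbreviations, numbers, URLs, languages without whitespace), which is one reason dedicated annotation tools exist.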
Difficulties for corpus annotation • Ambiguity • I saw a pig with binoculars. • Problems for tagging, parsing, and word sense disambiguation (WSD) • Unknown words • Identification • POS tagging • Semantic annotation
Tools for corpus annotation • Bookmarks for corpus-based linguists • Corpora and Corpus Annotation Tools on the WWW • POS tagger demonstration • Sentence segmentation • POS tagging • Extracting NPs of the form DT NN NN • Dexter: Tools for analyzing language data
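Once a text has been POS-tagged, extracting NPs of the form DT NN NN amounts to scanning for that tag sequence. The sketch below assumes the tagger output is already available as (word, tag) pairs with Penn Treebank-style tags; the example sentence is hand-tagged for illustration and the function name is hypothetical.

```python
def extract_dt_nn_nn(tagged):
    """Return every three-word sequence whose tags are DT NN NN."""
    pattern = ("DT", "NN", "NN")
    matches = []
    for i in range(len(tagged) - 2):
        window = tagged[i:i + 3]
        if tuple(tag for _, tag in window) == pattern:
            matches.append(" ".join(word for word, _ in window))
    return matches

# Hand-tagged example (an assumption, not output from a real tagger).
tagged_sentence = [
    ("The", "DT"), ("corpus", "NN"), ("annotation", "NN"),
    ("improves", "VBZ"), ("a", "DT"), ("search", "NN"),
    ("tool", "NN"), (".", "."),
]

noun_phrases = extract_dt_nn_nn(tagged_sentence)
print(noun_phrases)  # ['The corpus annotation', 'a search tool']
```

The same sliding-window idea extends to other tag patterns by changing `pattern`.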
Corpus analysis • Levels of corpus analysis • Tools for corpus analysis • Interpreting corpus data
Levels of corpus analysis • Word frequency lists • Concordances • Collocation (lexical patterning) • Colligation (syntactic patterning) • Keyword lists
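The first two levels, word frequency lists and concordances, are simple enough to sketch in a short script. This is a minimal illustration on a toy text, not a substitute for a concordancer: the tokenizer is a crude lowercase word matcher, and the function names and the `span` parameter are assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic words (with one optional apostrophe part)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

def kwic(tokens, target, span=3):
    """Key word in context: show `span` tokens on each side of every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines

text = ("The cat sat on the mat. The dog sat on the log. "
        "A cat and a dog sat together.")
tokens = tokenize(text)
freqs = frequency_list(tokens)
concordance = kwic(tokens, "sat")

print(freqs[:3])
print("\n".join(concordance))
```

Scanning the left and right contexts in the concordance lines is the usual starting point for spotting collocational and colligational patterns.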
Tools for corpus analysis • Bookmarks for corpus-based linguists • Recommendations: • WordSmith Tools (not free) • AntConc (free) • TextStat (free) • Unix tools • Write your own scripts
Exercise (part 1) • Download and install AntConc • Download some text for processing • Project Gutenberg • Generate a word frequency list for your mini-corpus
Interpreting corpus data • Are frequency differences statistically significant? • w appears x times in an n-word corpus, and y times in an m-word corpus • Chi-square test (unreliable when expected frequencies are small) • Fisher’s Exact Test (reliable for small counts; usually applied only to 2×2 tables)
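Both tests operate on a 2×2 table: occurrences of w versus all other words, in corpus 1 versus corpus 2. The sketch below implements them with the standard library only, as an illustration of the arithmetic rather than a replacement for a statistics package. It assumes the Pearson chi-square statistic without continuity correction, a two-sided Fisher test that sums all tables no more probable than the observed one, and x + y > 0; the function names are hypothetical.

```python
import math

def chi_square_2x2(x, n, y, m):
    """Pearson chi-square (no continuity correction) for w occurring
    x times in an n-word corpus and y times in an m-word corpus."""
    a, b = x, y              # occurrences of w in corpus 1 / corpus 2
    c, d = n - x, m - y      # all other words in corpus 1 / corpus 2
    total = n + m
    chi2 = total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))   # exact p-value for 1 degree of freedom
    return chi2, p

def log_comb(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_exact_2x2(x, n, y, m):
    """Two-sided Fisher exact p-value for the same 2x2 table."""
    occurrences = x + y      # total occurrences of w in both corpora
    total = n + m

    def log_p(k):
        # Hypergeometric probability that k of the occurrences fall in corpus 1.
        return log_comb(n, k) + log_comb(m, occurrences - k) - log_comb(total, occurrences)

    lo, hi = max(0, occurrences - m), min(occurrences, n)
    observed = log_p(x)
    p = sum(math.exp(log_p(k)) for k in range(lo, hi + 1)
            if log_p(k) <= observed + 1e-9)
    return min(1.0, p)

# Made-up counts: w occurs 30 times in 1,000 words and 10 times in another 1,000.
chi2, p_chi2 = chi_square_2x2(30, 1000, 10, 1000)
p_fisher = fisher_exact_2x2(30, 1000, 10, 1000)
print(f"chi-square = {chi2:.2f}, p = {p_chi2:.4f}; Fisher p = {p_fisher:.4f}")
```

The Fisher loop runs over the possible counts of w, so it stays cheap even for very large corpora as long as w itself is not extremely frequent.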
Exercise (part 2) • Compare your word frequency list with that of the BNC • Anything interesting? • Run the chi-square test and Fisher’s Exact Test on some interesting words
Interpreting corpus data (cont.) • Collocational analysis: How strongly are X and Y associated? • Mutual information (MI) • Compares the observed co-occurrence frequency of (X, Y) with the frequency expected if X and Y were independent • Higher MI, stronger association • Overestimates association for low-frequency pairs • T-test • Measures the confidence with which we can claim an association between X and Y • Higher t-score, higher confidence • Online calculations
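Both measures can be computed from four counts: the co-occurrence frequency of (X, Y), the individual frequencies of X and Y, and the corpus size. The sketch below uses one commonly cited formulation, MI = log2(O / E) and t = (O − E) / √O with E = f(X) · f(Y) / N; it assumes no adjustment for collocation window size, and the counts and function names are made up for illustration.

```python
import math

def expected_cooccurrence(f_x, f_y, n):
    """Co-occurrences expected if X and Y were independent."""
    return f_x * f_y / n

def mutual_information(f_xy, f_x, f_y, n):
    """MI score: log2 of observed over expected co-occurrence frequency."""
    return math.log2(f_xy / expected_cooccurrence(f_x, f_y, n))

def t_score(f_xy, f_x, f_y, n):
    """t-score: (observed - expected) scaled by the square root of observed."""
    return (f_xy - expected_cooccurrence(f_x, f_y, n)) / math.sqrt(f_xy)

N = 1_000_000  # corpus size in words (made-up figure)

# A reasonably frequent pair: X occurs 100 times, Y 200 times, together 20 times.
mi = mutual_information(20, 100, 200, N)
t = t_score(20, 100, 200, N)

# A pair seen only once, where each word also occurs only once.
rare_mi = mutual_information(1, 1, 1, N)
rare_t = t_score(1, 1, 1, N)

print(f"frequent pair: MI = {mi:.2f}, t = {t:.2f}")
print(f"rare pair:     MI = {rare_mi:.2f}, t = {rare_t:.2f}")
```

The rare pair gets a higher MI than the frequent pair despite resting on a single observation, while its t-score stays low, which is why the two measures are usually read together.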
Exercise (part 3) • Generate a concordance for a target word • Find a word that co-occurs frequently with the target word • Test if the word is strongly associated with the target word
Note on research project design • Purpose of project • Corpus compilation and annotation • Corpus analysis • Bottom-up: from observations of recurring patterns to hypotheses and generalizations • Top-down: start with given categories and search for evidence of use and variance • Caution on generalizability
Future courses on corpus linguistics • Spring 2007 • APLING 597E: Introduction to Corpus Linguistics • Hands-on course on principles and tools for corpus compilation, annotation, processing, and analysis • Spring 2008 • APLING 597: Seminar on Corpus Linguistics • Advanced seminar on using corpora for serious research projects