This article provides an overview of corpus linguistics, including the definition of a corpus, its history, types of corpora, and their applications in computational linguistics. It also discusses the construction and annotation of corpora, as well as the differences between corpora and test suites. The article aims to help readers understand and analyze linguistic data using corpus linguistics.
Corpus Linguistics with and for Computational Linguistics. Martin Volk, Universität Zürich and Eurospider Information Technology AG
Sources for linguistic information • Introspection (own usage and judgement) • Usage and judgement by others • Questioning (goal-driven) • interview • questionnaire • Observation ('involuntary' utterances) • spoken utterances (→ corpora) • written utterances (→ corpora)
What is a corpus? • a text collection • a representative text collection • a representative and structured text collection • a representative, structured and annotated text collection • ...
Example: Is 'ob' used as a preposition in German? • Introspection • Rothenburg ob der Tauber • Dictionary (Wahrig. Deutsches Wörterbuch. 1996): preposition with dative; archaic; 'ob dem Wasserfall' • Web: Google search for 'ob dem' • Legend: Der Wilde Jäger ob dem Neuenburgersee • Corpus
Corpus Examples • CZ94: ... fiel schier vom Stuhl ob der Äusserung eines Ozeanologen ... • CZ94: Bei manchem Ölgiganten kam ob der Ergebnisse gar Euphorie auf. • CZ94: ... rieben sich vergnügt die Hände ob des zu erwartenden Schlagabtauschs. • 'ob' is a preposition that takes the genitive! • In the CZ corpus, 'ob' is tagged as a preposition 21 times (some of these tags are obviously incorrect)
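A corpus query of this kind can be sketched as a small keyword-in-context (KWIC) search. The sketch below uses only the three CZ94 snippets quoted above as data. The pattern (the word 'ob' followed by a definite article form) and the function names are assumptions made for illustration, not part of the original slides. The pattern would also match the conjunction 'ob' when an article happens to follow, so hits on a real corpus still need manual inspection.

```python
import re
from collections import Counter

# The three CZ94 snippets quoted on the slide; a real corpus would be read from disk.
SENTENCES = [
    "... fiel schier vom Stuhl ob der Äusserung eines Ozeanologen ...",
    "Bei manchem Ölgiganten kam ob der Ergebnisse gar Euphorie auf.",
    "... rieben sich vergnügt die Hände ob des zu erwartenden Schlagabtauschs.",
]

# ASSUMPTION: 'ob' directly followed by a definite article is used as a crude
# cue for prepositional use. This is a heuristic, not a linguistic test.
PATTERN = re.compile(r"\bob\s+(der|des|dem|den)\b")


def kwic(sentences, pattern, width=25):
    """Return (left context, match, right context) triples for every hit."""
    hits = []
    for sentence in sentences:
        for m in pattern.finditer(sentence):
            left = sentence[max(0, m.start() - width):m.start()]
            right = sentence[m.end():m.end() + width]
            hits.append((left, m.group(0), right))
    return hits


def article_counts(sentences, pattern):
    """Count which article form follows 'ob' in the matched hits."""
    counts = Counter()
    for sentence in sentences:
        for m in pattern.finditer(sentence):
            counts[m.group(1)] += 1
    return counts


if __name__ == "__main__":
    for left, match, right in kwic(SENTENCES, PATTERN):
        print(f"{left:>25} [{match}] {right}")
    print(article_counts(SENTENCES, PATTERN))
```

On these three snippets the search finds 'ob der' twice and 'ob des' once. The form 'der' is ambiguous between genitive and dative for feminine nouns, so only a manual look at the noun settles the case.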
History of Corpus Linguistics • collections of text were widely used in the 19th century and in the first half of the 20th century • language acquisition • orthography (letter frequency) • field linguistics
American Structuralism (influential until 1960)
History of Corpus Linguistics • Chomsky's criticism: Speakers produce and understand infinitely many new sentences/words. • therefore the new research goal is: to describe the underlying language faculty of a speaker (= universal grammar), competence rather than performance
History of Corpus Linguistics • Chomsky's criticism: every collection of texts is a collection of performance data, and so many factors contribute to it that it cannot be used to model competence. • A corpus is necessarily skewed. Some sentences won't occur because they are obvious, false or impolite.
History of Corpus Linguistics
Theoretical linguistics: • competence (what is grammatical?) • introspection • indefinitely many types, productivity • grammatical vs. ungrammatical
Corpus linguistics: • performance (what is attested?) • instances • finite number of types • degrees of grammaticality
Corpus research in Linguistics • Lexicography (Dictionaries) • Grammaticography (Reference grammars) • Learner corpora: Language acquisition • Parallel corpora: Translation
Construction of Corpora • Written text is easier to obtain than spoken text. Some examples: • Newspapers • Fiction (e.g. fairy tales) • Technical Literature (e.g. manuals, medicine) • Personal letters: Email • Advertising (incl. political propaganda) • Belief and Thought (e.g. the Bible)
Corpora of spoken language • Spontaneous spoken language • recordings of dialogues (e.g. telephone conversations) • Prepared spoken language • Public speeches (e.g. in parliament) • Radio or TV news • Spoken utterances must be transcribed for linguistic research.
Size of corpora • Brown Corpus for English (1964, 1 million words) • LIMAS-Corpus for German (1970, 1 million words) • British National Corpus (1995, 100 million words) • Cosmas corpus (2002, > 100 million words)
Brown Corpus (1964) • 500 texts • drawn from 15 different text types • with about 2,000 words each
British National Corpus • 90% written English, 10% spoken English • 3209 texts • drawn from 10 written text types and • 6 spoken text types • with fewer than 40,000 words each • a multi-purpose corpus
Other considerations • Time frame of the corpus • Native and non-native speakers • Sociolinguistic variables • Gender • Age • Education • Dialect • Social context and relationships
Types of corpora • Raw texts • Automatically annotated corpora • Texts with Part-of-Speech tags • Partially parsed texts • Manually annotated corpora • Treebank • FrameNet
Types of Corpora • Balanced Corpora vs. special corpora • Spoken vs. written language • Monolingual vs. Multilingual Corpora • Parallel vs. comparable corpora
Corpora in Computational Linguistics [Diagram with the labels: corpora, annotation, learning, facts, rules, preferences]
My Motivation for Corpus Linguistics • Attempt to build a parser for German • But: problems with ambiguities! • Therefore: learn attachment preferences from a corpus!
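The idea of learning attachment preferences from a corpus can be illustrated with a deliberately simple relative-frequency sketch. It is not the method used in the author's work. It only shows the general principle of counting, for each preposition, how often a prepositional phrase attaches to the verb ("V") or to the noun ("N") in annotated data, and then preferring the more frequent option. The training pairs below are invented for illustration, and all names are hypothetical.

```python
from collections import Counter

# HYPOTHETICAL toy data, invented for illustration only:
# (preposition, observed attachment), with "V" = verb and "N" = noun attachment.
# Real training data would come from a treebank or similar annotated corpus.
TRAINING = [
    ("mit", "V"),
    ("mit", "V"),
    ("mit", "N"),
    ("von", "N"),
    ("von", "N"),
    ("von", "V"),
]


def train(examples):
    """Count how often each preposition was seen with each attachment site."""
    counts = Counter()
    for prep, attachment in examples:
        counts[(prep, attachment)] += 1
    return counts


def prefer(counts, prep, default="N"):
    """Return the more frequent attachment for prep, or default on a tie or unseen prep."""
    verb_count = counts[(prep, "V")]
    noun_count = counts[(prep, "N")]
    if verb_count == noun_count:
        return default
    return "V" if verb_count > noun_count else "N"


if __name__ == "__main__":
    model = train(TRAINING)
    for prep in ("mit", "von", "ob"):
        print(prep, prefer(model, prep))
```

The default for unseen prepositions is an arbitrary choice in this sketch. A realistic system would also condition on the verb and the nouns involved, not on the preposition alone.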
Corpora vs. Test suites. A test suite • is a collection of manually constructed and selected sentences. • is used for testing computational grammars and parsers. • reduces the amount of testing. • points to specific problems of the NLP system.
Basic problems in CL • Knowledge is missing (too little information) • e.g. unknown words • Ambiguities (too much information) • e.g. in syntax: attachment preferences
Corpora in Computational Linguistics • Widespread use of (manually) annotated material for measuring progress! • Some examples from COLING 2002: • Treebanks to train and test probabilistic grammars • Enriching treebanks with dependency information • Automatic error detection in PoS-tagged corpora • SENSEVAL data to train and test word sense disambiguation programs
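Measuring progress against manually annotated material usually means comparing system output with a gold standard. A minimal sketch for PoS tagging is given below. The tag names follow the STTS tagset (APPR = preposition, KOUS = subordinating conjunction), which is an assumption since the slides do not name a tagset. The "predicted" tags are invented to illustrate a tagger mistaking the preposition 'ob' for the conjunction.

```python
def tagging_accuracy(gold, predicted):
    """Return the share of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / len(gold)


def disagreements(gold, predicted):
    """List (word, gold tag, predicted tag) for every token tagged differently."""
    return [(word, g, p) for (word, g), (_, p) in zip(gold, predicted) if g != p]


# Gold tags for a fragment of one CZ94 example (STTS tags assumed).
GOLD = [("kam", "VVFIN"), ("ob", "APPR"), ("der", "ART"), ("Ergebnisse", "NN")]

# HYPOTHETICAL tagger output, invented for illustration only.
PREDICTED = [("kam", "VVFIN"), ("ob", "KOUS"), ("der", "ART"), ("Ergebnisse", "NN")]

if __name__ == "__main__":
    print(tagging_accuracy(GOLD, PREDICTED))
    print(disagreements(GOLD, PREDICTED))
```

Listing the disagreements alongside the accuracy figure is also a simple starting point for finding candidate errors in a tagged corpus.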
Possible Student Tasks • Which German prepositions take a noun without a determiner? (e.g. 'pro', 'via') • When is 'mit' used as an adverb? (e.g. ) • What is the distribution of separable verb prefixes in German? • How often are relative clauses introduced with 'welche(r)'? • How often are present participle forms used in German? • What kind of foreign language material is in the corpus?
Possible Student Tasks • Create a small parallel corpus (e.g. with various versions of 'Alice in Wonderland' or National Geographic). • Create a small corpus of spoken language (e.g. by transcribing one episode of 'Big Brother'). • Create a small treebank with the ANNOTATE tool.
What corpora do we have for German? Raw text • ComputerZeitung 1993-97 (about 1.3 million words per year) • ComputerZeitung iX • Tages-Anzeiger 2000
Information in Tages-Anzeiger • Date • Category (Sport, Politics, Culture, Economics etc.) • Author • Title vs. Text
What corpora do we have for German? Syntactically annotated text (treebanks) • NEGRA treebank (20,000 sentences) • ComputerZeitung treebank (3,000 sentences) Text with manually corrected PoS tags • 50,000 sentences from University speeches • others
The goal: If you can walk, you can dance. If you can talk, you can sing. If you can parse, you can understand. (Hans Uszkoreit, COLING 2002)
Acknowledgement: Some slides were highly influenced by or even copied from Anke Lüdeling's course "Introduction to Corpus Linguistics" at http://www.cl-ki.uni-osnabrueck.de/~aluedeli/Corpuslinguistik.html