School of Computing FACULTY OF ENGINEERING What’s in a Corpus? Eric Atwell, Language Research Group (with thanks to Katja Markert, Marti Hearst, and other contributors)
Reminder • Why NLP is difficult: language is a complex system • How to solve it? Corpus-based machine-learning approaches • Motivation: applications of “The Language Machine” • BACKGROUND READING: (Atwell 99) The Language Machine • Intro to NLTK • Visit the website: http://www.nltk.org
Today • The main areas of linguistics • Rationalism: language models based on expert introspection • Empiricism: models via machine-learning from a corpus • Corpus: text selected by language, genre, domain, … • Brown, LOB, BNC, Penn Treebank, MapTask, CCA, … • Corpus Annotation: text headers, PoS, parses, … • Corpus size is no. of words – depends on tokenisation • We can count word tokens, word types, type-token distribution • Lexeme/lemma is “root form”, v inflections (be v am/is/was…)
The main sub-areas of linguistics • ◮ Phonetics and Phonology: The study of linguistic sounds or speech. • ◮ Morphology: The study of the meaningful components of words. • ◮ Syntax (grammar): The study of the order and links between words. • ◮ Semantics: The study of meanings of words, phrases, sentences. • ◮ Discourse: The study of linguistic units larger than a single utterance. • ◮ Pragmatics: The study of how language is used to accomplish goals.
Why is NLP hard? • Main reason: Ambiguity in all areas and on all levels, e.g.: • ◮ Phonetic Ambiguity: 1 expression being pronounced in several ways • ◮ POS Ambiguity: 1 word having several different Parts of Speech (adjective/noun...) • ◮ Lexical Ambiguity: 1 word having several different meanings • ◮ Syntactic/Structural Ambiguity: 1 phrase or sentence having several different possible structures • ◮ Pragmatic Ambiguity: 1 sentence communicating several different intentions • ◮ Referential Ambiguity: 1 expression having several different possible references • Key Task in NLP: Disambiguation in context!
Rationalism v Empiricism • Rationalism: the doctrine that knowledge is acquired by reason without regard to experience (Collins English Dictionary) • Noam Chomsky, 1957 Syntactic Structures • Argued that we should build models through introspection: • A language model is a set of rules thought up by an expert • Like “Expert Systems”… • Chomsky thought data was full of errors, better to rely on linguists’ intuitions…
Empiricism v Rationalism • Empiricism: the doctrine that all knowledge derives from experience (Collins English Dictionary) • The field was stuck for quite some time: rationalist linguistic models built for specific examples did not generalise • A new approach started around 1990: Corpus Linguistics • Well, not really new, but from the 1950s to the 1980s researchers didn't have the text, disk space, or GHz • Main idea: machine learning from CORPUS data • How to do corpus linguistics: • Get a large text collection (a corpus; plural: corpora) • Compute statistical models over the words/PoS/parses/… in the corpus • Surprisingly effective
What is a corpus? • A corpus is a finite machine-readable body of naturally occurring text, selected according to specified criteria, eg: • ◮ Language and type: English/German/Arabic/…, dialects v. “standard”, edited text v. spontaneous speech, … • ◮ Genre and Domain: 18th century novels, newspaper text, software manuals, train enquiry dialogue... • ◮ Web as Corpus: URL “domain” = country: .uk .ar • ◮ Media: “Written” Text, Audio, Transcriptions, Video. • ◮ Size: 1000 words, 50K words, 1M words, 100M words, ???
Brown and LOB • ◮ Brown: Famous first corpus! (well, first widely-used corpus) • ◮ by Nelson Francis and Henry Kučera, Brown University, USA • ◮ A balanced corpus: representative of a whole language • ◮ Brown: balanced corpus of written, published American English from the 1960s (newspapers, books, … NOT handwritten) • ◮ 1 million words, Part-of-Speech tagged • ◮ LOB: Lancaster-Oslo/Bergen corpus: British English version • ◮ published British English text from equivalent 1960s sources • ◮ FROWN, FLOB: US and UK text from equivalent 1990s sources
Some recent corpora • Corpus features: Size, Domain, Language • British National Corpus: 100M words, balanced British English • Newswire Corpus: 600M words, newswire, American English • UN or EU proceedings: 20M+ words, legal, 10 language pairs • Penn Treebank: 2M words, newswire, American English • MapTask: 128 dialogues, British English • Corpus of Contemporary Arabic: 1M words, balanced Arabic • Web: 8 billion(?) words, many domains and languages • Web-as-Corpus: harvest your own corpus from the WWW: “seed terms” → search-engine (Google) API → web-pages → Corpus! • Marco Baroni: BootCaT, Adam Kilgarriff: SketchEngine, …
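The first step of the BootCaT approach can be sketched in a few lines: search queries are built from random tuples of seed terms, and the pages the queries return are then downloaded to form a corpus. Below is a minimal Python sketch of the query-building step only; the seed terms and the function name are invented for illustration, and the search/download steps are omitted.

```python
import itertools
import random

def seed_tuples(seeds, tuple_len=3, n_queries=5, rng=None):
    """Build search queries as random combinations of seed terms,
    in the style of BootCaT's first step (a sketch, not the real tool)."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    combos = list(itertools.combinations(seeds, tuple_len))
    rng.shuffle(combos)
    return [" ".join(c) for c in combos[:n_queries]]

# Hypothetical seed terms for a "train enquiry" domain corpus:
seeds = ["timetable", "platform", "ticket", "departure", "railway"]
queries = seed_tuples(seeds)
for q in queries:
    print(q)   # each line is one query to send to a search-engine API
```

Each query pulls in pages about the target domain; repeating the cycle with terms extracted from the harvested pages grows the corpus iteratively.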
Corpus Annotation • Annotation is a process in which linguistic experts add (linguistic) information to the corpus that is not explicitly there, increasing the corpus's utility, e.g.: • ◮ Text Headers: meta-data for each text: author, date, type, … • ◮ Part-of-speech tag for each word (very common!) • ◮ Syntactic structure: parse-tree for each sentence • ◮ Word Sense label for each word • ◮ Prosodic information: pauses, rise and fall in pitch, etc.
Annotation example: POS tagging • ◮ Some texts are annotated with Part-of-speech (POS) tags. • ◮ POS tags encode simple grammatical functions. • <s><w pos=RN> Here </w> <w pos=BEZ> is </w> <w pos=AT> a </w> • <w pos=NN> sentence </w>.</s> • ◮ Several tag sets: • ◮ Brown tag set (87 tags) in Brown corpus • ◮ CLAWS / LOB tag set (132 tags) in LOB corpus • ◮ Penn tag set (45 tags) in Penn Treebank • ◮ CLAWS c5 tag set (62 tags) in BNC (British National Corpus) • ◮ Tagging is usually done automatically (then proofread and corrected)
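The tagged sentence above can be read back into (word, tag) pairs with a small regular expression. A sketch, assuming exactly the simple `<w pos=...>` markup shown on the slide (real corpus formats such as the BNC's XML need a proper parser):

```python
import re

tagged = ('<s><w pos=RN> Here </w> <w pos=BEZ> is </w> <w pos=AT> a </w> '
          '<w pos=NN> sentence </w>.</s>')

# Each <w pos=TAG> word </w> element yields one (word, tag) pair.
pairs = [(w.strip(), t)
         for t, w in re.findall(r'<w pos=(\w+)>([^<]*)</w>', tagged)]
print(pairs)   # → [('Here', 'RN'), ('is', 'BEZ'), ('a', 'AT'), ('sentence', 'NN')]
```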
What’s a word? • How many words do you find in the following short text? • What is the biggest/smallest plausible answer to this question? • What problems do you encounter? • It’s a shame that our data-base is not up-to-date. It is a shame that um, data base A costs $2300.50 and that database B costs $5000. All databases cost far too much. • Time: 1 minute
Counting words: tokenization • Tokenisation is a processing step in which the input text is automatically divided into units called tokens, each of which is a word, a number, or a punctuation mark… • So, a word count can ignore numbers and punctuation marks (?) • Word: continuous alphanumeric characters delimited by whitespace • Whitespace: space, tab, newline • BUT dividing at spaces is too simple: It's, data base
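The difference shows up immediately on the exercise text. A sketch comparing a naive whitespace split with a slightly smarter regex tokenizer (the regex is illustrative only, not any standard tokeniser):

```python
import re

text = "It's a shame that our data-base is not up-to-date."

# Naive: split on whitespace only - punctuation sticks to the words.
naive = text.split()

# Smarter: peel off punctuation, but keep internal hyphens and apostrophes.
tokens = re.findall(r"[\w'-]+|[^\w\s]", text)

print(naive)    # 'up-to-date.' keeps its full stop
print(tokens)   # 'up-to-date' and '.' become separate tokens
```

Note that both tokenisers still treat "It's" and "data-base" as single tokens; deciding whether to split those is a further design choice.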
Counting words: types v tokens • ◮ Word token: individual occurrence of a word • ◮ Q: How big is the corpus (N)? • = how many word tokens are there? (LOB: 1M; BNC: 100M) • ◮ Word type: the “word itself”, regardless of context • ◮ Q: How many “different words” (word types) are there? • = size of the corpus vocabulary (LOB: 50K; BNC: 650K) • ◮ Q: What is the frequency of each word type? • = type-token distribution • A few word-types (the, of, a, …) are very frequent, but most are rare, and about half of all word-types occur only once! Zipf's Law
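Token and type counts for the exercise text can be computed with a frequency table. A sketch using Python's `collections.Counter`; the tokenisation choice here (lowercase, whitespace split, strip surrounding punctuation) is only one of many, and other choices give other counts:

```python
from collections import Counter

text = ("It's a shame that our data-base is not up-to-date. It is a shame "
        "that um, data base A costs $2300.50 and that database B costs "
        "$5000. All databases cost far too much.")

# One simple tokenisation: lowercase, split on whitespace, strip '.' and ','.
tokens = [w.strip('.,').lower() for w in text.split()]
freq = Counter(tokens)

n_tokens = len(tokens)   # corpus size N
n_types = len(freq)      # vocabulary size
hapaxes = [w for w, c in freq.items() if c == 1]   # types occurring once

# With this tokenisation: 32 tokens, 25 types, 20 hapaxes.
print(n_tokens, n_types, len(hapaxes))
print(freq.most_common(3))
```

Even in this tiny corpus the Zipf pattern is visible: most types are hapaxes, and a few function words (a, that) carry most of the repetition.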
Other sorts of “words” • ◮ Lemma/Lexeme: dictionary form of a word. cost and costs are derived from the same lexeme “cost” • data-base, data base, database, databases – same lexeme • Can include spaces: data base, New York • Ambiguous tokenization: as well (= also), as well as (= and) • Inflection: grammatical variant, e.g. cost v costs • ◮ Morpheme: basic “atomic” indivisible unit of meaning or grammar, e.g. data, base, s • ◮ For languages other than English, morphological analysis can be hard: root/stem, affixes (prefix, suffix, infix) • morph ologi cal or morpho logic al ?
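The lemma/inflection distinction can be illustrated with a toy lemmatiser. This is a deliberately naive suffix-stripping sketch (it would wrongly strip the "s" of "is" if the length guard were removed); real lemmatisers, e.g. NLTK's WordNet lemmatiser, use a lexicon and POS information:

```python
def toy_lemma(token):
    """Map a token to a crude lemma - a naive illustrative sketch only."""
    # Normalise spelling variants: "data-base", "data base" -> "database"
    t = token.lower().replace('-', '').replace(' ', '')
    # Strip a plural/3rd-singular "s" (naive: fails on "bus", "glass", ...)
    if t.endswith('s') and len(t) > 3:
        t = t[:-1]
    return t

print(toy_lemma('costs'))      # → cost
print(toy_lemma('data base'))  # → database
print(toy_lemma('databases'))  # → database
```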
Arabic Morphology • Templatic Morphology: Root ك ت ب /k t b/ = three consonants; Pattern = a vowel-and-affix template with slots for the root consonants (e.g. ma__ū_, _ā_i_) • Root + Pattern → Lexeme: مكتوب /maktūb/ “written”, كاتب /kātib/ “writer” • Lexeme.Meaning = (Root.Meaning + Pattern.Meaning) * Idiosyncrasy.Random
Arabic Morphology: Root meaning + Pattern meaning + ?? • كتب KTB = notion of “writing” • كتاب /kitāb/ book • كتب /katab/ write • مكتوب /maktūb/ written • مكتبة /maktaba/ library • مكتوب /maktūb/ letter • مكتب /maktab/ office • كاتب /kātib/ writer
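The root-plus-pattern combination above can be sketched as slot-filling: the pattern supplies vowels and affixes, with slots for the root consonants. A simplified, transliterated model (real Arabic morphology also involves assimilation, weak radicals, and other complications this sketch ignores):

```python
def apply_pattern(root, pattern):
    """Interleave root consonants into pattern slots marked '_'.
    A simplified model of Arabic templatic morphology (transliterated)."""
    consonants = iter(root)
    return ''.join(next(consonants) if ch == '_' else ch for ch in pattern)

root = 'ktb'                           # k-t-b: the notion of "writing"
print(apply_pattern(root, 'ma__ū_'))   # → maktūb  "written"
print(apply_pattern(root, '_ā_i_'))    # → kātib   "writer"
print(apply_pattern(root, 'ma__a_'))   # → maktab  "office"
```

Swapping in a different root (e.g. d-r-s, "studying") with the same patterns yields the parallel derived forms, which is exactly why the pattern carries meaning of its own.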
Reminder • Rationalism: language models based on expert introspection • Empiricism: models via machine-learning from a corpus • Corpus: text selected by language, genre, domain, … • Brown, LOB, BNC, Penn Treebank, MapTask, CCA, … • Corpus Annotation: text headers, PoS, parses, … • Corpus size is no. of words – depends on tokenisation • We can count word tokens, word types, type-token distribution • Morpheme: basic lexical unit, “root form”, plus affixes • Lexeme: dictionary entry, can be multi-word: New York