Corpus-based computational linguistics or computational corpus linguistics? Joakim Nivre, Uppsala University, Department of Linguistics and Philology
Outline
• Different worlds?
  • Corpus-based computational linguistics
  • Computational corpus linguistics
  • Similarities and differences
  • Opportunities for collaboration
• Computational linguistics – an example
  • Dependency-based syntactic analysis
  • Machine learning
Corpora and computers
• The empirical revolution in (computational) linguistics:
  • Increased use of empirical data
  • Development of large corpora
  • Annotation of corpus data (syntactic, semantic)
• Underlying causes:
  • Technical development:
    • Availability of machine-readable text (and digitized speech)
    • Computational capacity: storage and processing
  • Scientific shift:
    • Criticism of armchair linguistics
    • Development of statistical language models
Computational corpus linguistics
• Goal: knowledge of language
  • Descriptive studies
  • Theoretical hypothesis testing
• Means:
  • Corpus data as a source of knowledge of language:
    • Descriptive statistics
    • Statistical inference for hypothesis testing
  • Computer programs for processing corpus data:
    • Corpus development and annotation
    • Search and visualization (for humans)
    • Statistical analysis (descriptive and inferential)
Corpus-based computational linguistics
• Goal: computer programs that process natural language
  • Practical applications (translation, summarization, …)
  • Models of language learning and use
• Means:
  • Corpus data as a source of knowledge of language:
    • Statistical inference for model parameters (estimation)
  • Computer programs for processing corpus data:
    • Corpus development and annotation
    • Search and information extraction (for computers)
    • Statistical analysis (estimation/machine learning)
Corpus processing 1
• Corpus development:
  • Tokenization (minimal units, words, etc.)
  • Segmentation (on several levels)
  • Normalization (e.g., abbreviations, orthography, multi-word units, graphical elements, metadata)
• Annotation:
  • Part-of-speech tagging (word → word class)
  • Lemmatization (word → base form/lemma)
  • Syntactic analysis (sentence → syntactic representation)
  • Semantic analysis (word → sense, sentence → proposition)
• Standard methodology:
  • Automatic analysis (often based on other corpus data)
  • Manual validation (and correction)
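The first two development steps above, tokenization and sentence segmentation, can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the regex tokenizer and the end-of-sentence heuristic are naive on purpose, which is exactly why the normalization step (abbreviations, multi-word units) matters.

```python
import re

def tokenize(text):
    """Split raw text into minimal units: word-like strings and
    single punctuation marks. A naive regex tokenizer; real corpus
    pipelines also handle abbreviations and multi-word units."""
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

def segment(tokens):
    """Group tokens into sentences at final punctuation.
    A crude heuristic: 'Mr.' would wrongly end a sentence here,
    which is why abbreviation normalization precedes segmentation."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

For example, `segment(tokenize("John eats. Mary sleeps."))` yields two token lists, one per sentence.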
Corpus processing 2
• Searching and sorting:
  • Search methods:
    • String matching
    • Regular expressions
    • Dedicated query languages
    • Special-purpose programs
  • Results:
    • Concordances
    • Frequency lists
• Visualization:
  • Textual: concordances, etc.
  • Graphical: diagrams, syntax trees, etc.
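The two result types above, frequency lists and concordances, are easy to prototype over a tokenized corpus. A minimal sketch (the `window` parameter and KWIC-style bracketing are illustrative choices, not part of any standard tool):

```python
from collections import Counter

def frequency_list(tokens):
    """Token frequencies, sorted by descending count."""
    return Counter(tokens).most_common()

def concordance(tokens, keyword, window=3):
    """KWIC (keyword-in-context) lines: up to `window` tokens of
    left and right context around each occurrence of `keyword`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines
```

Dedicated query languages add what this sketch lacks: matching on annotation layers (part of speech, lemma) rather than raw strings.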
Corpus processing 3
• Statistical analysis:
  • Descriptive statistics:
    • Frequency tables and diagrams
  • Statistical inference:
    • Hypothesis testing (t-test, χ², Mann-Whitney, etc.)
  • Machine learning:
    • Probabilistic: estimate probability distributions
    • Discriminative: approximate the input–output mapping
    • Induction of lexical and grammatical resources (e.g., collocations, valency frames)
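The χ² statistic mentioned above doubles as a collocation-induction measure: for each adjacent word pair, compare observed co-occurrence counts against what independence would predict in a 2×2 contingency table. A minimal sketch (the contingency counts are approximated from unigram and bigram frequencies, which is slightly off at corpus edges):

```python
from collections import Counter

def chi_square_collocations(tokens):
    """Score adjacent word pairs with the chi-square statistic over a
    2x2 contingency table, a common collocation-induction measure."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    n = len(tokens) - 1                   # number of bigram positions
    scores = {}
    for (w1, w2), o11 in bigrams.items():
        o12 = unigrams[w1] - o11          # w1 followed by something else
        o21 = unigrams[w2] - o11          # w2 preceded by something else
        o22 = n - o11 - o12 - o21         # neither w1 nor w2
        num = n * (o11 * o22 - o12 * o21) ** 2
        den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        scores[(w1, w2)] = num / den if den else 0.0
    return scores
```

Pairs that always co-occur score highest; frequent but independent pairs score near zero.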
User requirements
• Corpus linguists:
  • Software: accessible, easy to use, general
  • Output: suitable for humans; perspicuous (graphical visualization)
  • Functions: specific search, descriptive statistics
• Computational linguists:
  • Software: efficient, modifiable, specific
  • Output: suitable for computers; well-defined format (annotated text)
  • Functions: exhaustive search, statistical learning
Summary
• Different goals:
  • Study language
  • Create computer programs
• … give (partly) different requirements:
  • Accessible and usable (for humans)
  • Efficient and standardized (for computers)
• … but (partly) the same needs:
  • Corpus development and annotation
  • Searching, sorting, and statistical analysis
Symbiosis?
• What can computational linguists do for corpus linguists?
  • Technical and general linguistic competence
  • Software for automatic analysis (annotation)
• What can corpus linguists do for computational linguists?
  • Linguistic and language-specific competence
  • Manual validation of automatic analyses
• What can they achieve together?
  • Automatic annotation improves precision in corpus linguistics
  • Manual validation improves precision in computational linguistics
  • A virtuous circle?
Dependency analysis
• [Figure: example dependency graph for a sentence, with arcs labeled ROOT, SBJ, OBJ, NMOD, PMOD, and P]
Inductive dependency parsing
• Deterministic syntactic analysis (parsing):
  • Algorithm for deriving dependency structures
  • Requires a decision function in choice situations
  • All decisions are final (deterministic)
• Inductive machine learning:
  • Decision function based on previous experience
  • Generalizes from examples (successive refinement)
  • Examples = annotated sentences (treebank)
  • No grammar – just analogy
Algorithm
• Data structures:
  • Queue of unanalyzed words (next = first in queue)
  • Stack of partially analyzed words (top = on top of stack)
• Start state:
  • Empty stack
  • All words in queue
• Algorithm steps:
  • Shift: put next on top of stack (push)
  • Reduce: remove top from stack (pop)
  • Right: put next on top of stack (push); link top → next
  • Left: remove top from stack (pop); link next → top
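The four transitions can be sketched directly as a parsing loop. This is a bare skeleton under simplifying assumptions: words are represented by their indices, arcs are unlabeled, and the `oracle` argument stands in for the learned decision function discussed on the following slides.

```python
def parse(n_words, oracle):
    """Deterministic dependency parsing with the four transitions:
    Shift, Reduce, Right (link top -> next), Left (link next -> top).
    `oracle` maps the current state to one of the four action names.
    Returns the arcs as {dependent_index: head_index}."""
    stack, queue, arcs = [], list(range(n_words)), {}
    while queue:
        action = oracle(stack, queue, arcs)
        if action == "SHIFT":
            stack.append(queue.pop(0))
        elif action == "REDUCE":
            stack.pop()
        elif action == "RIGHT":          # link top -> next, then push next
            arcs[queue[0]] = stack[-1]
            stack.append(queue.pop(0))
        elif action == "LEFT":           # link next -> top, then pop top
            arcs[stack[-1]] = queue[0]
            stack.pop()
    return arcs
```

With a scripted oracle replaying SHIFT, LEFT, SHIFT, LEFT, SHIFT on a three-word sentence like "the cat sleeps", the parser attaches word 0 to word 1 and word 1 to word 2, i.e. determiner → noun → verb.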
Algorithm example
• [Figure: step-by-step derivation of the dependency graph above, using the transitions SHIFT, REDUCE, LA(label) = Left, and RA(label) = Right, with the labels SBJ, OBJ, NMOD, PMOD, and P]
Decision function
• Non-determinism:
  • [Figure: choice point after "… eats pizza with …" – attach the preposition with RA(ATT) or apply REDUCE?]
  • Decision function: (Queue, Stack, Graph) → Step
• Possible approaches:
  • Grammar?
  • Inductive generalization!
Machine learning
• Decision function: (Queue, Stack, Graph) → Step
• Model: (Queue, Stack, Graph) → (f1, …, fn)
• Classifier: (f1, …, fn) → Step
• Learning: { ((f1, …, fn), Step) } → Classifier
Model
• [Figure: feature model over the stack (…, t1, top) and queue (next, n1, n2, n3), with dependency links hd, ld, rd]
• Parts of speech: t1, top, next, n1, n2, n3
• Dependency types: t.hd, t.ld, t.rd, n.ld
• Word forms: top, next, top.hd, n1
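The feature model listed above can be sketched as a function from parser state to a flat feature vector. The state representation here (`arcs` mapping a dependent index to a (head, label) pair, the `"NONE"` placeholder for absent positions) is an illustrative assumption, not the actual implementation.

```python
def extract_features(stack, queue, tags, forms, arcs):
    """Feature vector: POS of t1, top, next, n1..n3; dependency types
    of top (t.hd), top's leftmost/rightmost dependents (t.ld, t.rd),
    and next's leftmost dependent (n.ld); word forms of top, next,
    top's head, and n1. `arcs` maps dependent -> (head, label)."""
    def tag(i):   return tags[i] if i is not None else "NONE"
    def form(i):  return forms[i] if i is not None else "NONE"
    def label(i): return arcs[i][1] if i is not None and i in arcs else "NONE"
    def head(i):  return arcs[i][0] if i is not None and i in arcs else None
    def dep(i, side):
        if i is None:
            return None
        deps = [d for d, (h, _) in arcs.items() if h == i]
        left = [d for d in deps if d < i]
        right = [d for d in deps if d > i]
        if side == "ld" and left:
            return min(left)
        if side == "rd" and right:
            return max(right)
        return None

    top = stack[-1] if stack else None
    t1 = stack[-2] if len(stack) > 1 else None
    nxt, n1, n2, n3 = (list(queue) + [None] * 4)[:4]
    return (
        tag(t1), tag(top), tag(nxt), tag(n1), tag(n2), tag(n3),   # POS
        label(top), label(dep(top, "ld")), label(dep(top, "rd")),  # dep types
        label(dep(nxt, "ld")),
        form(top), form(nxt), form(head(top)), form(n1),           # word forms
    )
```

This packaging of the state into symbolic attribute values is exactly what a memory-based classifier consumes.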
Memory-based learning
• Memory-based learning and classification:
  • Learning is storing experiences in memory
  • Problem solving is achieved by reusing solutions of similar problems experienced in the past
• TiMBL (Tilburg Memory-Based Learner):
  • Basic method: k-nearest neighbor
  • Parameters:
    • Number of neighbors (k)
    • Distance metrics
    • Weighting of attributes, values, and instances
Learning example
• Instance base:
  • 1: (a, b, a, c) → A
  • 2: (a, b, c, a) → B
  • 3: (b, a, c, c) → C
  • 4: (c, a, b, c) → A
• New instance:
  • 5: (a, b, b, a)
• Distances:
  • D(1, 5) = 2
  • D(2, 5) = 1
  • D(3, 5) = 4
  • D(4, 5) = 3
• k-NN:
  • 1-NN(5) = B
  • 2-NN(5) = A/B
  • 3-NN(5) = A
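The example above can be replayed in code: count mismatching attribute positions (the overlap distance, the simplest of TiMBL's metrics, without the weighting options), then take a majority vote among the k nearest stored instances. A minimal unweighted sketch:

```python
from collections import Counter

def overlap_distance(a, b):
    """Number of mismatching attribute positions."""
    return sum(x != y for x, y in zip(a, b))

def knn_classify(instances, query, k):
    """Vote counts among the k nearest stored (features, label) pairs,
    the basic method behind memory-based learners such as TiMBL."""
    ranked = sorted(instances, key=lambda inst: overlap_distance(inst[0], query))
    return Counter(label for _, label in ranked[:k]).most_common()

instance_base = [
    (("a", "b", "a", "c"), "A"),   # instance 1
    (("a", "b", "c", "a"), "B"),   # instance 2
    (("b", "a", "c", "c"), "C"),   # instance 3
    (("c", "a", "b", "c"), "A"),   # instance 4
]
query = ("a", "b", "b", "a")       # instance 5
```

Running this reproduces the slide: distances 2, 1, 4, 3; the single nearest neighbor votes B, while with k = 3 the two A-labeled neighbors outvote it.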
Experimental evaluation
• Inductive dependency parsing:
  • Deterministic algorithm
  • Memory-based decision function
• Data:
  • English: Penn Treebank, WSJ (1M words), converted to dependency structures
  • Swedish: Talbanken, professional prose (100k words), dependency structures based on the MAMBA annotation
Results
• English:
  • 87.3% of all words got the correct head
  • 85.6% of all words got the correct head and label
• Swedish:
  • 85.9% of all words got the correct head
  • 81.6% of all words got the correct head and label
Dependency types: English
• High precision (F ≥ 86%):
  • VC (auxiliary verb → main verb) 95.0%
  • NMOD (noun modifier) 91.0%
  • SBJ (verb → subject) 89.3%
  • PMOD (complement of preposition) 88.6%
  • SBAR (complementizer → verb) 86.1%
• Medium precision (73% ≤ F ≤ 83%):
  • ROOT 82.4%
  • OBJ (verb → object) 81.1%
  • VMOD (adverbial) 76.8%
  • AMOD (adj/adv modifier) 76.7%
  • PRD (predicative complement) 73.8%
• Low precision (F ≤ 70%):
  • DEP (other)
Dependency types: Swedish
• High precision (F ≥ 84%):
  • IM (infinitive marker → infinitive) 98.5%
  • PR (preposition → noun) 90.6%
  • UK (complementizer → verb) 86.4%
  • VC (auxiliary verb → main verb) 86.1%
  • DET (noun → determiner) 89.5%
  • ROOT 87.8%
  • SUB (verb → subject) 84.5%
• Medium precision (76% ≤ F ≤ 80%):
  • ATT (noun modifier) 79.2%
  • CC (coordination) 78.9%
  • OBJ (verb → object) 77.7%
  • PRD (verb → predicative) 76.8%
  • ADV (adverbial) 76.3%
• Low precision (F ≤ 70%):
  • INF, APP, XX, ID
Corpus annotation
• How good is 85%?
  • Good enough to save time for manual annotators
  • Good enough to improve search precision
  • Recent release: SUC with syntactic annotation
• How can accuracy be improved further?
  • By annotating more data, which facilitates machine learning
  • By refined linguistic analysis of the structures to be annotated and of the errors made
MaltParser
• Software for inductive dependency parsing:
  • Freely available (open source): http://maltparser.org
  • Evaluated on close to 30 different languages
  • Used for annotating corpora at Uppsala University