Corpus-based computational linguistics or computational corpus linguistics? Joakim Nivre, Uppsala University, Department of Linguistics and Philology
Outline
• Different worlds?
  • Corpus-based computational linguistics
  • Computational corpus linguistics
  • Similarities and differences
  • Opportunities for collaboration
• Computational linguistics – an example
  • Dependency-based syntactic analysis
  • Machine learning
Corpora and computers
• The empirical revolution in (computational) linguistics:
  • Increased use of empirical data
  • Development of large corpora
  • Annotation of corpus data (syntactic, semantic)
• Underlying causes:
  • Technical development:
    • Availability of machine-readable text (and digitized speech)
    • Computational capacity: storage and processing
  • Scientific shift:
    • Criticism of armchair linguistics
    • Development of statistical language models
Computational corpus linguistics
• Goal: knowledge of language
  • Descriptive studies
  • Theoretical hypothesis testing
• Means:
  • Corpus data as a source of knowledge of language:
    • Descriptive statistics
    • Statistical inference for hypothesis testing
  • Computer programs for processing corpus data:
    • Corpus development and annotation
    • Search and visualization (for humans)
    • Statistical analysis (descriptive and inferential)
Corpus-based computational linguistics
• Goal: computer programs that process natural language
  • Practical applications (translation, summarization, …)
  • Models of language learning and use
• Means:
  • Corpus data as a source of knowledge of language:
    • Statistical inference for model parameters (estimation)
  • Computer programs for processing corpus data:
    • Corpus development and annotation
    • Search and information extraction (for computers)
    • Statistical analysis (estimation/machine learning)
Corpus processing 1
• Corpus development:
  • Tokenization (minimal units, words, etc.)
  • Segmentation (on several levels)
  • Normalization (e.g., abbreviations, orthography, multi-word units, graphical elements, metadata)
• Annotation:
  • Part-of-speech tagging (word → word class)
  • Lemmatization (word → base form/lemma)
  • Syntactic analysis (sentence → syntactic representation)
  • Semantic analysis (word → sense, sentence → proposition)
• Standard methodology:
  • Automatic analysis (often based on other corpus data)
  • Manual validation (and correction)
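The first two development steps above, tokenization and sentence segmentation, can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the regex tokenizer and the end-of-sentence heuristic are naive on purpose, which is exactly why the normalization step (abbreviations, multi-word units) matters.

```python
import re

def tokenize(text):
    """Split raw text into minimal units: word-like strings and
    single punctuation marks. A naive regex tokenizer; real corpus
    pipelines also handle abbreviations and multi-word units."""
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

def segment(tokens):
    """Group tokens into sentences at final punctuation.
    A crude heuristic: 'Mr.' would wrongly end a sentence here,
    which is why abbreviation normalization precedes segmentation."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

For example, `segment(tokenize("John eats. Mary sleeps."))` yields two token lists, one per sentence.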
Corpus processing 2
• Searching and sorting:
  • Search methods:
    • String matching
    • Regular expressions
    • Dedicated query languages
    • Special-purpose programs
  • Results:
    • Concordances
    • Frequency lists
• Visualization:
  • Textual: concordances, etc.
  • Graphical: diagrams, syntax trees, etc.
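The two result types above, frequency lists and concordances, are easy to prototype over a tokenized corpus. A minimal sketch (the `window` parameter and KWIC-style bracketing are illustrative choices, not part of any standard tool):

```python
from collections import Counter

def frequency_list(tokens):
    """Token frequencies, sorted by descending count."""
    return Counter(tokens).most_common()

def concordance(tokens, keyword, window=3):
    """KWIC (keyword-in-context) lines: up to `window` tokens of
    left and right context around each occurrence of `keyword`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines
```

Dedicated query languages add what this sketch lacks: matching on annotation layers (part of speech, lemma) rather than raw strings.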
Corpus processing 3
• Statistical analysis:
  • Descriptive statistics:
    • Frequency tables and diagrams
  • Statistical inference:
    • Hypothesis testing (t-test, χ², Mann-Whitney, etc.)
  • Machine learning:
    • Probabilistic: estimate probability distributions
    • Discriminative: approximate the input–output mapping
    • Induction of lexical and grammatical resources (e.g., collocations, valency frames)
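The χ² statistic mentioned above doubles as a collocation-induction measure: for each adjacent word pair, compare observed co-occurrence counts against what independence would predict in a 2×2 contingency table. A minimal sketch (the contingency counts are approximated from unigram and bigram frequencies, which is slightly off at corpus edges):

```python
from collections import Counter

def chi_square_collocations(tokens):
    """Score adjacent word pairs with the chi-square statistic over a
    2x2 contingency table, a common collocation-induction measure."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    n = len(tokens) - 1                   # number of bigram positions
    scores = {}
    for (w1, w2), o11 in bigrams.items():
        o12 = unigrams[w1] - o11          # w1 followed by something else
        o21 = unigrams[w2] - o11          # w2 preceded by something else
        o22 = n - o11 - o12 - o21         # neither w1 nor w2
        num = n * (o11 * o22 - o12 * o21) ** 2
        den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        scores[(w1, w2)] = num / den if den else 0.0
    return scores
```

Pairs that always co-occur score highest; frequent but independent pairs score near zero.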
User requirements
• Corpus linguists:
  • Software: accessible, easy to use, general
  • Output: suitable for humans; perspicuous (graphical visualization)
  • Functions: specific search, descriptive statistics
• Computational linguists:
  • Software: efficient, modifiable, specific
  • Output: suitable for computers; well-defined format (annotated text)
  • Functions: exhaustive search, statistical learning
Summary
• Different goals:
  • Study language
  • Create computer programs
• … give (partly) different requirements:
  • Accessible and usable (for humans)
  • Efficient and standardized (for computers)
• … but (partly) the same needs:
  • Corpus development and annotation
  • Searching, sorting, and statistical analysis
Symbiosis?
• What can computational linguists do for corpus linguists?
  • Technical and general linguistic competence
  • Software for automatic analysis (annotation)
• What can corpus linguists do for computational linguists?
  • Linguistic and language-specific competence
  • Manual validation of automatic analyses
• What can they achieve together?
  • Automatic annotation improves precision in corpus linguistics
  • Manual validation improves precision in computational linguistics
  • A virtuous circle?
Dependency analysis
• [Figure: example dependency graph for a sentence, with arcs labeled ROOT, SBJ, OBJ, NMOD, PMOD, and P]
Inductive dependency parsing
• Deterministic syntactic analysis (parsing):
  • Algorithm for deriving dependency structures
  • Requires a decision function in choice situations
  • All decisions are final (deterministic)
• Inductive machine learning:
  • Decision function based on previous experience
  • Generalizes from examples (successive refinement)
  • Examples = annotated sentences (treebank)
  • No grammar – just analogy
Algorithm
• Data structures:
  • Queue of unanalyzed words (next = first in queue)
  • Stack of partially analyzed words (top = on top of stack)
• Start state:
  • Empty stack
  • All words in queue
• Algorithm steps:
  • Shift: put next on top of stack (push)
  • Reduce: remove top from stack (pop)
  • Right: put next on top of stack (push); link top → next
  • Left: remove top from stack (pop); link next → top
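The four transitions can be sketched directly as a parsing loop. This is a bare skeleton under simplifying assumptions: words are represented by their indices, arcs are unlabeled, and the `oracle` argument stands in for the learned decision function discussed on the following slides.

```python
def parse(n_words, oracle):
    """Deterministic dependency parsing with the four transitions:
    Shift, Reduce, Right (link top -> next), Left (link next -> top).
    `oracle` maps the current state to one of the four action names.
    Returns the arcs as {dependent_index: head_index}."""
    stack, queue, arcs = [], list(range(n_words)), {}
    while queue:
        action = oracle(stack, queue, arcs)
        if action == "SHIFT":
            stack.append(queue.pop(0))
        elif action == "REDUCE":
            stack.pop()
        elif action == "RIGHT":          # link top -> next, then push next
            arcs[queue[0]] = stack[-1]
            stack.append(queue.pop(0))
        elif action == "LEFT":           # link next -> top, then pop top
            arcs[stack[-1]] = queue[0]
            stack.pop()
    return arcs
```

With a scripted oracle replaying SHIFT, LEFT, SHIFT, LEFT, SHIFT on a three-word sentence like "the cat sleeps", the parser attaches word 0 to word 1 and word 1 to word 2, i.e. determiner → noun → verb.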
Algorithm example
• [Figure: step-by-step derivation of the dependency graph above, using the transitions SHIFT, REDUCE, LA(label) = Left, and RA(label) = Right, with the labels SBJ, OBJ, NMOD, PMOD, and P]
Decision function
• Non-determinism:
  • [Figure: choice point after "… eats pizza with …" – attach the preposition with RA(ATT) or apply REDUCE?]
  • Decision function: (Queue, Stack, Graph) → Step
• Possible approaches:
  • Grammar?
  • Inductive generalization!
Machine learning
• Decision function: (Queue, Stack, Graph) → Step
• Model: (Queue, Stack, Graph) → (f1, …, fn)
• Classifier: (f1, …, fn) → Step
• Learning: { ((f1, …, fn), Step) } → Classifier
Model
• [Figure: feature model over the stack (…, t1, top) and queue (next, n1, n2, n3), with dependency links hd, ld, rd]
• Parts of speech: t1, top, next, n1, n2, n3
• Dependency types: t.hd, t.ld, t.rd, n.ld
• Word forms: top, next, top.hd, n1
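The feature model listed above can be sketched as a function from parser state to a flat feature vector. The state representation here (`arcs` mapping a dependent index to a (head, label) pair, the `"NONE"` placeholder for absent positions) is an illustrative assumption, not the actual implementation.

```python
def extract_features(stack, queue, tags, forms, arcs):
    """Feature vector: POS of t1, top, next, n1..n3; dependency types
    of top (t.hd), top's leftmost/rightmost dependents (t.ld, t.rd),
    and next's leftmost dependent (n.ld); word forms of top, next,
    top's head, and n1. `arcs` maps dependent -> (head, label)."""
    def tag(i):   return tags[i] if i is not None else "NONE"
    def form(i):  return forms[i] if i is not None else "NONE"
    def label(i): return arcs[i][1] if i is not None and i in arcs else "NONE"
    def head(i):  return arcs[i][0] if i is not None and i in arcs else None
    def dep(i, side):
        if i is None:
            return None
        deps = [d for d, (h, _) in arcs.items() if h == i]
        left = [d for d in deps if d < i]
        right = [d for d in deps if d > i]
        if side == "ld" and left:
            return min(left)
        if side == "rd" and right:
            return max(right)
        return None

    top = stack[-1] if stack else None
    t1 = stack[-2] if len(stack) > 1 else None
    nxt, n1, n2, n3 = (list(queue) + [None] * 4)[:4]
    return (
        tag(t1), tag(top), tag(nxt), tag(n1), tag(n2), tag(n3),   # POS
        label(top), label(dep(top, "ld")), label(dep(top, "rd")),  # dep types
        label(dep(nxt, "ld")),
        form(top), form(nxt), form(head(top)), form(n1),           # word forms
    )
```

This packaging of the state into symbolic attribute values is exactly what a memory-based classifier consumes.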
Memory-based learning
• Memory-based learning and classification:
  • Learning is storing experiences in memory
  • Problem solving is achieved by reusing solutions of similar problems experienced in the past
• TiMBL (Tilburg Memory-Based Learner):
  • Basic method: k-nearest neighbor
  • Parameters:
    • Number of neighbors (k)
    • Distance metrics
    • Weighting of attributes, values, and instances
Learning example
• Instance base:
  • 1: (a, b, a, c) → A
  • 2: (a, b, c, a) → B
  • 3: (b, a, c, c) → C
  • 4: (c, a, b, c) → A
• New instance:
  • 5: (a, b, b, a)
• Distances:
  • D(1, 5) = 2
  • D(2, 5) = 1
  • D(3, 5) = 4
  • D(4, 5) = 3
• k-NN:
  • 1-NN(5) = B
  • 2-NN(5) = A/B
  • 3-NN(5) = A
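The example above can be replayed in code: count mismatching attribute positions (the overlap distance, the simplest of TiMBL's metrics, without the weighting options), then take a majority vote among the k nearest stored instances. A minimal unweighted sketch:

```python
from collections import Counter

def overlap_distance(a, b):
    """Number of mismatching attribute positions."""
    return sum(x != y for x, y in zip(a, b))

def knn_classify(instances, query, k):
    """Vote counts among the k nearest stored (features, label) pairs,
    the basic method behind memory-based learners such as TiMBL."""
    ranked = sorted(instances, key=lambda inst: overlap_distance(inst[0], query))
    return Counter(label for _, label in ranked[:k]).most_common()

instance_base = [
    (("a", "b", "a", "c"), "A"),   # instance 1
    (("a", "b", "c", "a"), "B"),   # instance 2
    (("b", "a", "c", "c"), "C"),   # instance 3
    (("c", "a", "b", "c"), "A"),   # instance 4
]
query = ("a", "b", "b", "a")       # instance 5
```

Running this reproduces the slide: distances 2, 1, 4, 3; the single nearest neighbor votes B, while with k = 3 the two A-labeled neighbors outvote it.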
Experimental evaluation
• Inductive dependency parsing:
  • Deterministic algorithm
  • Memory-based decision function
• Data:
  • English: Penn Treebank, WSJ (1M words), converted to dependency structures
  • Swedish: Talbanken, professional prose (100k words), dependency structures based on the MAMBA annotation
Results
• English:
  • 87.3% of all words got the correct head
  • 85.6% of all words got the correct head and label
• Swedish:
  • 85.9% of all words got the correct head
  • 81.6% of all words got the correct head and label
Dependency types: English
• High precision (F ≥ 86%):
  • VC (auxiliary verb → main verb) 95.0%
  • NMOD (noun modifier) 91.0%
  • SBJ (verb → subject) 89.3%
  • PMOD (complement of preposition) 88.6%
  • SBAR (complementizer → verb) 86.1%
• Medium precision (73% ≤ F ≤ 83%):
  • ROOT 82.4%
  • OBJ (verb → object) 81.1%
  • VMOD (adverbial) 76.8%
  • AMOD (adj/adv modifier) 76.7%
  • PRD (predicative complement) 73.8%
• Low precision (F ≤ 70%):
  • DEP (other)
Dependency types: Swedish
• High precision (F ≥ 84%):
  • IM (infinitive marker → infinitive) 98.5%
  • PR (preposition → noun) 90.6%
  • UK (complementizer → verb) 86.4%
  • VC (auxiliary verb → main verb) 86.1%
  • DET (noun → determiner) 89.5%
  • ROOT 87.8%
  • SUB (verb → subject) 84.5%
• Medium precision (76% ≤ F ≤ 80%):
  • ATT (noun modifier) 79.2%
  • CC (coordination) 78.9%
  • OBJ (verb → object) 77.7%
  • PRD (verb → predicative) 76.8%
  • ADV (adverbial) 76.3%
• Low precision (F ≤ 70%):
  • INF, APP, XX, ID
Corpus annotation
• How good is 85%?
  • Good enough to save time for manual annotators
  • Good enough to improve search precision
  • Recent release: SUC with syntactic annotation
• How can accuracy be improved further?
  • By annotating more data, which facilitates machine learning
  • By refined linguistic analysis of the structures to be annotated and of the errors made
MaltParser
• Software for inductive dependency parsing:
  • Freely available (open source): http://maltparser.org
  • Evaluated on close to 30 different languages
  • Used for annotating corpora at Uppsala University