Constraint satisfaction inference for discrete sequence processing in NLP Antal van den Bosch ILK / CL and AI, Tilburg University DCU Dublin April 19, 2006 (work with Sander Canisius and Walter Daelemans)
Constraint satisfaction inference for discrete sequence processing in NLP Talk overview • How to map sequences to sequences, not output tokens? • Case studies: syntactic and semantic chunking • Discrete versus probabilistic classifiers • Constraint satisfaction inference • Discussion
How to map sequences to sequences? • Machine learning’s pet solution: • Local-context windowing (NETtalk) • One-shot prediction of single output tokens. • Concatenation of predicted tokens.
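A rough sketch of this windowing setup (illustrative only; the function and variable names are made up, not the NETtalk or ILK code):

```python
# Local-context windowing: each token becomes one instance whose features
# are the token itself plus its left and right neighbours (padded at the edges).

def window_instances(tokens, labels, left=3, right=3, pad="_"):
    """Turn one labelled token sequence into fixed-width windowed instances."""
    padded = [pad] * left + list(tokens) + [pad] * right
    return [(padded[i:i + left + 1 + right], labels[i]) for i in range(len(tokens))]

# 1-1-1 window over the start of the sentence used later in the talk:
tokens = ["Once", "he", "was", "held"]
labels = ["I-ADVP", "I-NP", "I-VP", "I-VP"]
for features, label in window_instances(tokens, labels, left=1, right=1):
    print(features, "->", label)      # e.g. ['_', 'Once', 'he'] -> I-ADVP
```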
The near-sightedness problem • A local window never captures long-distance information. • No coordination of individual output tokens. • Long-distance information does exist; holistic coordination is needed.
Holistic information • “Counting” constraints: • Certain entities occur only once in a clause/sentence. • “Syntactic validity” constraints: • On discontinuity and overlap; chunks have a beginning and an end. • “Cooccurrence” constraints: • Some entities must occur with others, or cannot co-exist with others.
Solution 1: Feedback • Recurrent networks in ANN (Elman, 1991; Sun & Giles, 2001), e.g. word prediction. • Memory-based tagger (Daelemans, Zavrel, Berck, and Gillis, 1996). • Maximum-entropy tagger (Ratnaparkhi, 1996).
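The common thread of these feedback taggers, as a minimal sketch (the `classify` function stands in for any trained point-wise classifier and is hypothetical):

```python
# Feedback decoding: the previously predicted label is appended to the feature
# window of the next token, so earlier decisions steer later ones.

def feedback_decode(tokens, classify, left=3, right=3, pad="_"):
    padded = [pad] * left + list(tokens) + [pad] * right
    previous, output = "O", []
    for i in range(len(tokens)):
        features = padded[i:i + left + 1 + right] + [previous]
        previous = classify(features)   # an error here is fed into later decisions
        output.append(previous)
    return output
```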
Feedback disadvantage • Label bias problem (Lafferty, McCallum, and Pereira, 2001). • Previous prediction is an important source of information. • Classifier is compelled to take its own prediction as correct. • Cascading errors result.
Solution 2: Stacking • Wolpert (1992) for ANNs. • Veenstra (1998) for NP chunking: • Stage-1 classifier, near-sighted, predicts sequences. • Stage-2 classifier learns to correct stage-1 errors by taking stage-1 output as windowed input.
Stacking disadvantages • Practical issues: • Ideally, train stage-2 on cross-validated output of stage-1, not “perfect” output. • Costly procedure. • Total architecture: two full classifiers. • Local, not global error correction.
What exactly is the problem with mapping to sequences? • Example: Born in Made , The Netherlands → a single class label O_O_B-LOC_O_B-LOC_I-LOC for the whole sequence • Multi-class classification with 100s or 1000s of classes • Lack of generalization • Some ML algorithms cannot cope very well: • SVMs • Rule learners, decision trees • However, others can: • Naïve Bayes, Maximum-entropy • Memory-based learning
Solution 3: n-gram subsequences • Retain windowing approach, but • Predict overlapping n-grams of output tokens.
Resolving overlapping n-grams • Probabilities available: Viterbi • Other options: voting
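A small sketch of the voting option (the trigram representation and tie handling here are assumptions; with class probabilities, Viterbi decoding would be used instead):

```python
# Each position i predicts a trigram of labels for positions i-1, i, i+1;
# the up to three overlapping predictions per position then vote on its label.

from collections import Counter

def resolve_by_voting(trigram_predictions):
    """trigram_predictions[i] is a (left, focus, right) label triple for position i."""
    votes = [Counter() for _ in trigram_predictions]
    for i, (left, focus, right) in enumerate(trigram_predictions):
        if i > 0:
            votes[i - 1][left] += 1
        votes[i][focus] += 1
        if i + 1 < len(votes):
            votes[i + 1][right] += 1
    return [v.most_common(1)[0][0] for v in votes]

predictions = [("_", "O", "B-LOC"), ("O", "I-LOC", "I-LOC"), ("B-LOC", "I-LOC", "_")]
print(resolve_by_voting(predictions))   # majority label per position
```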
N-gram+voting disadvantages • Classifier predicts syntactically valid trigrams, but • After resolving overlap, only local error correction. • End result is still a concatenation of local uncoordinated decisions. • Number of classes increases (problematic for some ML).
Learning linguistic sequences Talk overview • How to map sequences to sequences, not output tokens? • Case studies: syntactic and semantic chunking • Discrete versus probabilistic classifiers • Constraint satisfaction inference • Discussion
Four “chunking” tasks • English base-phrase chunking • CoNLL-2000, WSJ • English named-entity recognition • CoNLL-2003, Reuters • Dutch medical concept chunking • IMIX/Rolaquad, medical encyclopedia • English protein-related entity chunking • Genia, Medline abstracts
Treated the same way • IOB-tagging. • Windowing: • 3-1-3 words • 3-1-3 predicted PoS tags (WSJ / Wotan) • No seedlists, suffix/prefix, capitalization, … • Memory-based learning and maximum-entropy modeling • MBL: automatic parameter optimization (paramsearch, Van den Bosch, 2004)
IOB-codes for chunks: step 1, PTB-II WSJ ((S (ADVP-TMP Once) (NP-SBJ-1 he) (VP was (VP held (NP *-1) (PP-TMP for (NP three months)) (PP without (S-NOM (NP-SBJ *-1) (VP being (VP charged) ))))) .))
IOB codes for chunks: flatten tree [Once]ADVP [he]NP [was held]VP [for]PP [three months]NP [without]PP [being charged]VP
Example: Instances

  feature 1    feature 2    feature 3    class
  (word -1)    (word 0)     (word +1)
  _            Once         he           I-ADVP
  Once         he           was          I-NP
  he           was          held         I-VP
  was          held         for          I-VP
  held         for          three        I-PP
  for          three        months       I-NP
  three        months       without      I-NP
  months       without      being        I-PP
  without      being        charged      I-VP
  being        charged      .            I-VP
  charged      .            _            O
MBL • Memory-based learning • k-NN classifier (Fix and Hodges, 1951; Cover and Hart, 1967; Aha et al., 1991); applied to NLP by Daelemans et al. • Discrete point-wise classifier • Implementation used: TiMBL (Tilburg Memory-Based Learner)
Memory-based learning and classification • Learning: • Store instances in memory • Classification: • Given new test instance X, • Compare it to all memory instances • Compute a distance between X and memory instance Y • Update the top k of closest instances (nearest neighbors) • When done, take the majority class of the k nearest neighbors as the class of X
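The same procedure as a toy sketch (not TiMBL itself; plain overlap distance and unweighted majority voting only):

```python
# Memory-based learning: store all training instances; to classify, rank them
# by distance to the test instance and take the majority class of the top k.

from collections import Counter

def overlap_distance(x, y):
    """Count the mismatching features between two instances."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def mbl_classify(memory, x, k=1):
    """memory is a list of (features, label) pairs stored at learning time."""
    nearest = sorted(memory, key=lambda m: overlap_distance(x, m[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

memory = [(["_", "Once", "he"], "I-ADVP"),
          (["Once", "he", "was"], "I-NP"),
          (["he", "was", "held"], "I-VP")]
print(mbl_classify(memory, ["she", "was", "held"], k=1))   # -> I-VP
```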
Similarity / distance • A nearest neighbor has the smallest distance, or the largest similarity • Computed with a distance function • TiMBL offers two basic distance functions: • Overlap • MVDM (Stanfill & Waltz, 1986; Cost & Salzberg, 1989) • Feature weighting • Exemplar weighting • Distance-weighted class voting
The Overlap distance function • “Count the number of mismatching features”
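In formula form (the standard overlap distance; notation assumed), for instances X and Y with n features:

\Delta(X, Y) = \sum_{i=1}^{n} \delta(x_i, y_i), \qquad
\delta(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases}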
The MVDM distance function • Estimate a numeric “distance” between pairs of values • “e” is more like “i” than like “p” in a phonetic task • “book” is more like “document” than like “the” in a parsing task
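In the Cost & Salzberg formulation (notation assumed), the distance between two values v_1 and v_2 of the same feature compares their class-conditional distributions:

\delta(v_1, v_2) = \sum_{i=1}^{|C|} \left| P(C_i \mid v_1) - P(C_i \mid v_2) \right|

so values that predict the classes in similar proportions end up close together.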
Feature weighting • Some features are more important than others • TiMBL metrics: Information Gain, Gain Ratio, Chi Square, Shared Variance • Example, IG: • Compute the entropy of the full database • For each feature: • partition the database on the values of that feature • compute the entropy of each sub-database • take the average entropy over the sub-databases, weighted by their size • The difference between this “partitioned” entropy and the overall entropy is the feature’s Information Gain
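The same recipe as formulas (standard definitions; D is the training database, V_f the value set of feature f, and si(f) the split info that Gain Ratio divides by):

H(D) = -\sum_{c} P(c) \log_2 P(c)

IG(f) = H(D) - \sum_{v \in V_f} \frac{|D_v|}{|D|} \, H(D_v), \qquad
GR(f) = \frac{IG(f)}{si(f)}, \quad si(f) = -\sum_{v \in V_f} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}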
Feature weighting in the distance function • Mismatching on a more important feature gives a larger distance • Factor in the distance function:
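A sketch of that weighted variant (the usual weighted-overlap form; notation assumed), with w_i the weight of feature i, e.g. its Gain Ratio:

\Delta(X, Y) = \sum_{i=1}^{n} w_i \, \delta(x_i, y_i)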
Distance weighting • Relation between larger k and smoothing • Subtle extension: making more distant neighbors count less in the class vote • Linear inverse of distance (w.r.t. max) • Inverse of distance • Exponential decay
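The three schemes correspond roughly to Dudani-style neighbour weights (exact constants assumed), where d_j is the distance of the j-th nearest neighbour and d_1, d_k those of the closest and furthest of the k neighbours:

w_j = \frac{d_k - d_j}{d_k - d_1} \;\text{(inverse-linear)}, \qquad
w_j = \frac{1}{d_j} \;\text{(inverse distance)}, \qquad
w_j = e^{-\alpha d_j^{\beta}} \;\text{(exponential decay)}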
Current practice • Default TiMBL settings: • k=1, Overlap, GR, no distance weighting • Work well for some morpho-phonological tasks • Rules of thumb: • Combine MVDM with bigger k • Combine distance weighting with bigger k • Very good bet: higher k, MVDM, GR, distance weighting • Especially for sentence and text level tasks
Base phrase chunking • 211,727 training, 47,377 test examples • 22 classes • [He]NP [reckons]VP [the current account deficit]NP [will narrow]VP [to]PP [only $ 1.8 billion]NP [in]PP [September]NP .
Named entity recognition • 203,621 training, 46,435 test examples • 8 classes • [U.N.]organization official [Ekeus]person heads for [Baghdad]location
Medical concept chunking • 428,502 training, 47,430 test examples • 24 classes • Bij [infantiel botulisme]disease kunnen in extreme gevallen [ademhalingsproblemen]symptom en [algehele lusteloosheid]symptom optreden. (“In extreme cases of infantile botulism, breathing problems and general listlessness can occur.”)
Protein-related concept chunking • 458,593 training, 50,916 test examples • 51 classes • Most hybrids express both [KBF1]protein and [NF-kappa B]protein in their nuclei , but one hybrid expresses only [KBF1]protein .
Learning linguistic sequences Talk overview • How to map sequences to sequences, not output tokens? • Case studies: syntactic and semantic chunking • Discrete versus probabilistic classifiers • Constraint satisfaction inference • Discussion
Comparative study • Base discrete classifier: Maximum-entropy model (Zhang Le, maxent) • Extended with feedback, stacking, trigrams, combinations • Compared against • Conditional Markov Models (Ratnaparkhi, 1996) • Maximum-entropy Markov Models (McCallum, Freitag, and Pereira, 2000) • Conditional Random Fields (Lafferty, McCallum, and Pereira, 2001) • On Medical & Protein chunking
Maximum entropy • Probabilistic model: conditional distribution p(C|x) (a probability matrix between classes and feature values) with maximal entropy H(p) • Given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible • Entropy is maximized through an iterative process: • IIS, GIS (Improved/Generalized Iterative Scaling) • L-BFGS • Used here as a discrete classifier: the single most probable class is taken
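For reference, the conditional maximum-entropy model takes the standard exponential form, with binary feature functions f_i over (context, class) pairs and weights \lambda_i fitted by GIS/IIS or L-BFGS:

p(c \mid x) = \frac{1}{Z(x)} \exp\!\Big( \sum_i \lambda_i f_i(x, c) \Big), \qquad
Z(x) = \sum_{c'} \exp\!\Big( \sum_i \lambda_i f_i(x, c') \Big)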
Conditional Markov Models • Probabilistic analogue of Feedback • Processes from left to right • Produces conditional class probabilities, conditioned on previous classifications; the search over label sequences is limited by a beam • With beam = 1, equivalent to Feedback • Local models can be trained with maximum entropy • E.g. MXPOST, Ratnaparkhi (1996)
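A sketch of the CMM factorisation (the history length k and the beam width are model choices):

P(c_1 \ldots c_n \mid x_1 \ldots x_n) \approx \prod_{i=1}^{n} p(c_i \mid x_i, c_{i-1}, \ldots, c_{i-k})

where x_i is the windowed context at position i; beam search keeps only the best partial label sequences at each step, and a beam of 1 reduces to the Feedback architecture.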