Explore how probabilistic features improve NLP pipeline models by considering all possible annotations in subsequent stages, with examples of feature instances and computational complexities explained.
Learning with Probabilistic Features for Improved Pipeline Models
Razvan C. Bunescu
Electrical Engineering and Computer Science, Ohio University, Athens, OH
bunescu@ohio.edu
EMNLP, October 2008
Introduction
• NLP systems often depend on the output of other NLP systems.
[Diagram of dependent NLP tasks: POS Tagging, Syntactic Parsing, Question Answering, Named Entity Recognition, Semantic Role Labeling]
Traditional Pipeline Model: M1
• The best annotation from one stage is used in subsequent stages.
[Diagram: x → POS Tagging → Syntactic Parsing]
• Problem: Errors propagate between pipeline stages!
Probabilistic Pipeline Model: M2
• All possible annotations from one stage are used in subsequent stages.
[Diagram: x → POS Tagging → Syntactic Parsing, connected by probabilistic features]
• Problem: Z(x), the set of possible annotations of x, has exponential cardinality!
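The decision rules drawn on the M1 and M2 slides are not present in the text. A formulation consistent with the bullets might read as follows; the weight vector $\mathbf{w}$, the feature map $\Phi$, and the candidate output set $\mathcal{Y}(x)$ are notational assumptions, not symbols recovered from the slides.

```latex
% M1: commit to the single best upstream annotation
\hat{z}(x) = \arg\max_{z \in \mathcal{Z}(x)} p(z \mid x), \qquad
\hat{y}(x) = \arg\max_{y \in \mathcal{Y}(x)} \mathbf{w} \cdot \Phi\bigl(x, y, \hat{z}(x)\bigr)

% M2: take the expectation of the features over all upstream annotations
\hat{y}(x) = \arg\max_{y \in \mathcal{Y}(x)} \mathbf{w} \cdot \Phi(x, y), \qquad
\Phi(x, y) = \sum_{z \in \mathcal{Z}(x)} p(z \mid x)\, \Phi(x, y, z)
```

Written this way, the sum in M2 has one term per annotation in Z(x), which is why evaluating it directly is exponential in the length of x.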
Probabilistic Pipeline Model: M2
• Feature-wise formulation:
• When the original φi's are count features, it can be shown that:
• An instance of feature φi is the actual evidence used from example (x, y, z).
• Fi is the set of all instances of feature φi in (x, y, z), across all annotations z ∈ Z(x).
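The count-feature result can be checked numerically on a toy problem. The sketch below assumes the identity takes the form "expected count over all annotations = sum of the marginal probabilities of the feature instances", that is, Σ_z p(z|x)·φi(x,y,z) = Σ_{f ∈ Fi} p(f|x). The tag set, the posterior, the arcs of y, and all function names are made up for illustration.

```python
from itertools import product

TAGS = ("RB", "VBD", "NN")        # toy tag set (assumption)
N = 4                             # toy sentence length (assumption)
ARCS = [(2, 3), (0, 3), (1, 0)]   # hypothetical arcs of a fixed parse y, as (u, v) token pairs


def toy_posterior():
    """Return a made-up, non-factorized distribution p(z | x) over all tag sequences."""
    weights = {}
    for z in product(TAGS, repeat=N):
        w = 1.0
        for pos, tag in enumerate(z):
            w *= 1.0 + pos + TAGS.index(tag)      # arbitrary positive score per token
            if pos > 0 and z[pos - 1] == tag:
                w *= 0.5                          # couple neighbouring tags
        weights[z] = w
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}


def count_feature(z, arcs):
    """Count feature: number of arcs (u, v) whose endpoints are tagged RB and VBD."""
    return sum(1 for u, v in arcs if z[u] == "RB" and z[v] == "VBD")


def expectation_over_annotations(posterior, arcs):
    """Left-hand side: sum over every z in Z(x) of p(z | x) * count."""
    return sum(p * count_feature(z, arcs) for z, p in posterior.items())


def sum_over_instances(posterior, arcs):
    """Right-hand side: sum over feature instances of their marginal probability."""
    total = 0.0
    for u, v in arcs:
        total += sum(p for z, p in posterior.items() if z[u] == "RB" and z[v] == "VBD")
    return total


posterior = toy_posterior()
lhs = expectation_over_annotations(posterior, ARCS)
rhs = sum_over_instances(posterior, ARCS)
print(f"{lhs:.6f} {rhs:.6f}")
```

Here both sides are computed by brute-force enumeration, so the toy only illustrates why the two quantities agree. The point of the right-hand side is that each instance marginal can be obtained by dynamic programming without enumerating Z(x).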
Example: POS → Dependency Parsing
• Feature φi: RB VBD
• The set of feature instances Fi is:
[Figure: candidate RB VBD instances over pairs of tokens in the sentence, labeled with probabilities 0.91, 0.01, 0.1, 0.001, 0.001, …, 0.002]
The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11
• N(N-1) feature instances in Fi.
Example: POS → Dependency Parsing
• Feature φi: RB VBD uses a limited amount of evidence:
  • the set of feature instances Fi has cardinality N(N-1), where N is the number of tokens.
  • computing the probability of one feature instance takes O(N|P|²) time, where P is the set of POS tags, using a constrained version of the forward-backward algorithm.
• Therefore, computing φi takes O(N³|P|²) time.
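The complexity argument can be made concrete with a small sketch of a constrained forward pass over a linear-chain model. This is not the author's implementation: the potentials are toy numbers, the function names are hypothetical, and only the forward half of forward-backward is used, which is enough to obtain the probability of a single pair of tags.

```python
from itertools import product


def forward_sum(node, trans, constraints=None):
    """Sum the scores of all tag paths, optionally forcing some positions to a fixed tag.

    node[k][t]  : positive potential of tag t at position k
    trans[s][t] : positive potential of moving from tag s to tag t
    constraints : dict {position: tag index}; paths violating it contribute zero
    Runs in O(N * |P|^2) time.
    """
    constraints = constraints or {}
    num_tags = len(node[0])

    def allowed(k, t):
        return constraints.get(k, t) == t

    alpha = [node[0][t] if allowed(0, t) else 0.0 for t in range(num_tags)]
    for k in range(1, len(node)):
        alpha = [
            node[k][t] * sum(alpha[s] * trans[s][t] for s in range(num_tags))
            if allowed(k, t) else 0.0
            for t in range(num_tags)
        ]
    return sum(alpha)


def pair_probability(node, trans, i, a, j, b):
    """p(t_i = a, t_j = b | x): constrained path mass divided by total path mass."""
    return forward_sum(node, trans, {i: a, j: b}) / forward_sum(node, trans)


def brute_force(node, trans, i, a, j, b):
    """Same quantity by enumerating every tag sequence (exponential; for checking only)."""
    total = hit = 0.0
    for z in product(range(len(node[0])), repeat=len(node)):
        score = node[0][z[0]]
        for k in range(1, len(node)):
            score *= trans[z[k - 1]][z[k]] * node[k][z[k]]
        total += score
        if z[i] == a and z[j] == b:
            hit += score
    return hit / total


# Toy potentials for a 4-token sentence and 3 tags (made-up numbers).
NODE = [[1.0, 2.0, 0.5], [0.3, 1.5, 1.0], [2.0, 0.1, 1.0], [1.0, 1.0, 3.0]]
TRANS = [[1.0, 0.5, 2.0], [0.2, 1.0, 1.0], [1.5, 0.7, 0.3]]

print(pair_probability(NODE, TRANS, 2, 0, 3, 2))
```

One call costs O(N|P|²), and a feature such as the RB VBD tag pair has N(N-1) instances, which gives the O(N³|P|²) total stated on the slide.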
Probabilistic Pipeline Model: M2
• All possible annotations from one stage are used in subsequent stages.
[Diagram: x → POS Tagging → Syntactic Parsing, computed in polynomial time]
• In general, the time complexity of computing φi depends on the complexity of the evidence used by feature φi.
Probabilistic Pipeline Model: M3
• The best annotation from one stage is used in subsequent stages, together with its probabilistic confidence:
[Diagram: x → POS Tagging → Syntactic Parsing]
• The feature instances are restricted to the set of instances of feature φi that use only the best annotation.
Probabilistic Pipeline Model: M3
• Like the traditional pipeline model M1, except that it uses the probabilistic confidence values associated with annotation features.
• More efficient than M2, but less accurate.
• Example: POS → Dependency Parsing
  • The figure shows features generated by template ti tj and their probabilities.
[Figure: y, with feature probabilities 0.81, 0.92, 0.85, 0.97, 0.95, 0.98, 0.91, 0.97, 0.90, 0.98]
POS tags: DT1 NNS2 RB3 VBD4 EX5 MD6 VB7 NNS8 IN9 DT10 NN11
x: The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11
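The difference between M1 and M3 can be sketched in a few lines: both look only at the best tagging, but M3 replaces each unit count with the confidence of the instance that fired. The arcs, the confidence values, the `head->modifier` feature naming, and both function names below are illustrative assumptions rather than details taken from the slides.

```python
from collections import defaultdict


def m1_features(best_tags, arcs):
    """Traditional pipeline: each tag-pair feature on an arc counts as 1."""
    feats = defaultdict(float)
    for u, v in arcs:
        feats[f"{best_tags[u]}->{best_tags[v]}"] += 1.0
    return dict(feats)


def m3_features(best_tags, arcs, confidence):
    """M3-style features: same instances as M1, weighted by their probability."""
    feats = defaultdict(float)
    for u, v in arcs:
        feats[f"{best_tags[u]}->{best_tags[v]}"] += confidence[(u, v)]
    return dict(feats)


# Best tagging of "The sailors mistakenly thought" (0-based positions).
best_tags = ["DT", "NNS", "RB", "VBD"]
# Hypothetical arcs of a candidate tree, as (u, v) token pairs.
arcs = [(1, 0), (3, 1), (3, 2)]
# Made-up confidences p(t_u, t_v | x) for the tag pairs on those arcs.
confidence = {(1, 0): 0.9, (3, 1): 0.8, (3, 2): 0.7}

print(m1_features(best_tags, arcs))
print(m3_features(best_tags, arcs, confidence))
```

Because M3 never looks beyond the best annotation, it needs one confidence per instance of that annotation instead of a sum over all N(N-1) candidate instances.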
Probabilistic Pipeline Models
[Side-by-side summary: Model M2 | Model M3]
Two Applications
• Dependency Parsing
[Diagram: x → POS Tagging → Syntactic Parsing]
• Named Entity Recognition
[Diagram: x → POS Tagging → Syntactic Parsing → Named Entity Recognition]
1) Dependency Parsing
• Use MSTParser [McDonald et al. 2005]:
  • The score of a dependency tree is the sum of the edge scores.
  • Feature templates use words and POS tags at positions u and v and their neighbors u ± 1 and v ± 1.
• Use CRF [Lafferty et al. 2001] POS tagger:
  • Compute probabilistic features using a constrained forward-backward procedure.
  • Example: feature ti tj has probability p(ti, tj):
    • constrain the state transitions to pass through tags ti and tj.
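The edge-factored score used by MSTParser can be written out explicitly. The symbols $s$, $\mathbf{w}$ and $\mathbf{f}$ below are notational assumptions for this sketch; the slide's own equation is not present in the text.

```latex
% Score of a dependency tree y for sentence x: a sum over its edges u -> v,
% each edge scored by a linear model over edge features.
s(x, y) = \sum_{(u \to v) \in y} s(u, v)
        = \sum_{(u \to v) \in y} \mathbf{w} \cdot \mathbf{f}(u, v)
```

Since every edge feature looks only at tags near positions u and v, replacing it with its expected value requires marginals over a handful of tags rather than over whole tag sequences.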
1) Dependency Parsing
• Two approximations of model M2:
• Model M2’:
  • Consider POS tags independent:
    p(ti = RB, tj = VBD | x) ≈ p(ti = RB | x) · p(tj = VBD | x)
  • Ignore tags with low marginal probability:
    p(ti) ≤ 1/(τ|P|)
• Model M2”:
  • Like M2’, but use constrained forward-backward to compute marginal probabilities when the tag chunks are less than 4 tokens apart.
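The two ingredients of M2’ are simple enough to sketch directly. The code below assumes per-token tag marginals are already available (for example from a CRF tagger), and treats the pruning threshold as 1/(tau·|P|) with a free parameter `tau`; the parameter name, the function names, and the numbers are assumptions made for illustration.

```python
def prune_marginals(marginals, num_tags, tau):
    """Drop tags whose marginal probability is at most 1 / (tau * num_tags)."""
    threshold = 1.0 / (tau * num_tags)
    return [{tag: p for tag, p in dist.items() if p > threshold} for dist in marginals]


def independent_pair_probability(marginals, i, a, j, b):
    """Approximate p(t_i = a, t_j = b | x) by the product of the two marginals."""
    return marginals[i].get(a, 0.0) * marginals[j].get(b, 0.0)


# Made-up marginals for two tokens, e.g. "mistakenly" and "thought".
marginals = [
    {"RB": 0.95, "NN": 0.04, "JJ": 0.01},
    {"VBD": 0.96, "VBN": 0.04},
]

# With 45 tags and tau = 1 the threshold is 1/45, so JJ (0.01) is dropped.
pruned = prune_marginals(marginals, num_tags=45, tau=1.0)
print(pruned)
print(independent_pair_probability(pruned, 0, "RB", 1, "VBD"))
```

Pruning keeps the number of tag pairs considered per arc small, and the product replaces a constrained forward-backward run with two table lookups. M2” reinstates the exact computation only for nearby positions, where the independence assumption is weakest.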
1) Dependency Parsing: Results
• Train MSTParser on sections 2-21 of the Penn WSJ Treebank using gold POS tagging.
• Test MSTParser on section 23, using POS tags from the CRF tagger.
• Absolute error reduction of “only” 0.19%:
  • But the POS tagger has a very high accuracy of 96.25%.
  • Expect more substantial improvement when upstream stages in the pipeline are less accurate.
2) Named Entity Recognition
• Model NER as a sequence tagging problem using CRFs:
[Figure: sentence x with POS tags z1, dependency tree z2, and NER labels y]
z1: DT1 NNS2 RB3 VBD4 EX5 MD6 VB7 NNS8 IN9 DT10 NN11
y: O I O O O O O O O O O
x: The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11
• Flat features: unigram, bigram and trigram that extend either left or right:
  • sailors, the sailors, sailors RB, sailors RB thought, …
• Tree features: unigram, bigram and trigram that extend in any direction in the undirected dependency tree:
  • sailors thought, sailors thought RB, NNS thought RB, …
Named Entity Recognition: Model M2
[Diagram: x → POS Tagging → Syntactic Parsing → Named Entity Recognition]
• Probabilistic features:
• Example feature NNS2 thought4 RB3:
Named Entity Recognition: Model M3’
• M3’ is an approximation of M3 in which confidence scores are computed as follows:
  • Consider POS tagging and dependency parsing independent.
  • Consider POS tags independent.
  • Consider dependency arcs independent.
• Example feature NNS2 thought4 RB3:
• Need to compute marginals p(u → v | x).
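Under the three independence assumptions, the confidence of the example feature factorizes into a product of tag marginals and arc marginals. The expression below is a sketch of that factorization, not the slide's own equation: $f$ names the feature instance, and $a_{u,v}$ is an assumed shorthand for the dependency arc linking tokens u and v, whose marginal is the p(u → v | x) mentioned above.

```latex
% Feature instance f: NNS_2, thought_4, RB_3 in the dependency tree
p(f \mid x) \approx
    p(t_2 = \mathrm{NNS} \mid x)\,
    p(t_3 = \mathrm{RB} \mid x)\,
    p(a_{4,2} \mid x)\,
    p(a_{4,3} \mid x)
```

The tag marginals come from the POS tagger, so the only new quantities needed are the arc marginals from the parser.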
Probabilistic Dependency Features
• To compute probabilistic POS features, we use a constrained version of the forward-backward algorithm.
• To compute probabilistic dependency features, we use a constrained version of Eisner’s algorithm:
  • Compute normalized scores n(u → v | x) using the softmax function.
  • Transform scores n(u → v | x) into probabilities p(u → v | x) using isotonic regression [Zadrozny & Elkan, 2002].
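The two calibration steps can be sketched without any library. The softmax below normalizes a list of raw scores, and the pool-adjacent-violators routine is one standard way to fit an isotonic (non-decreasing) map from normalized scores to probabilities. Which scores are normalized together, the held-out data, and all function names are assumptions of this sketch.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive values that sum to 1."""
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_adjacent_violators(values):
    """Least-squares non-decreasing fit to a sequence of values."""
    blocks = []                             # each block is [sum of values, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def fit_isotonic(scores, labels):
    """Sort held-out (score, correct?) pairs by score and fit a monotone step function."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    xs = [scores[k] for k in order]
    ys = pool_adjacent_violators([float(labels[k]) for k in order])
    return xs, ys


# Made-up normalized arc scores and whether each arc was correct on held-out data.
xs, ys = fit_isotonic([0.9, 0.1, 0.4, 0.6, 0.3, 0.8], [1, 0, 1, 0, 0, 1])
print(list(zip(xs, ys)))
```

A new normalized score would then be mapped to the fitted value of the nearest calibration point at or below it, giving a probability that respects the ordering of the parser's scores.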
Named Entity Recognition: Results
• Implemented the CRF models in MALLET [McCallum, 2002].
• Trained and tested on the standard split from the ACE 2002 + 2003 corpus (674 training, 97 testing).
• POS tagger and MSTParser were trained on sections 2-21 of the WSJ Treebank.
• Isotonic regression for MSTParser was performed on section 23.
[Results table: area under the PR curve]
Named Entity Recognition: Results
• M3’ (probabilistic) vs. M1 (traditional) using tree features:
Conclusions & Related Work
• A general method for improving the communication between consecutive stages in pipeline models:
  • based on computing expectations for count features.
  • an effective method for associating probabilities with output substructures.
  • adds polynomial time complexity to the pipeline whenever the inference step at each stage is done in polynomial time.
• Can be seen as complementary to the sampling approach of [Finkel et al. 2006]:
  • approximate vs. exact in polynomial time.
  • used in testing vs. used in training and testing.
Future Work
• Try full model M2 / its approximation M2’ on NER.
• Extend model to pipeline graphs containing cycles.