Helmut Schmid: Probabilistic Part-of-Speech Tagging Using Decision Trees
Editing: Nicolai Kilian, 30.11.2004
Overview
• What are part-of-speech taggers?
• What different POS taggers are there?
• Common properties between n-gram taggers and tree taggers
• n-gram tagger: Hidden Markov Model
• Classification via the Viterbi algorithm
• Tree tagger
  • Functioning
  • Construction of the decision tree
  • Lexicon structure
• Results
1. What are part-of-speech taggers?
• Automatic assignment of a label (tag) to every word form in a corpus
• Each tag marks a part of speech
• Tags come from a predefined tagset T → CLASSIFICATION
• Main problem with tagging: words are often ambiguous in their POS, e.g. „store“ can be a noun, a finite verb, or an infinitive
Disambiguation of the POS? By context!
„The 1977 PCs could store two pages of data.“
In this context, „store“ can only be an infinitive.
→ strong dependence on the POS of the preceding context:
„[The]DET [big]ADJ [store]NN“ vs. „[want]Vfin [to]TO [store]VINF“
2. What different POS taggers are there?
• I. Stochastic/statistical taggers: use transition probabilities between word sequences
• II. Rule-based taggers (Brill taggers)
Both kinds of tagger learn the required information from a manually tagged training corpus → supervised learning
3. Common properties between n-gram taggers and tree taggers
Both taggers
• are probabilistic
• model the probability of a tagged word sequence recursively:
  p(w1 … wn, t1 … tn) = p(tn | tn-2 tn-1) * p(wn | tn) * p(w1 … wn-1, t1 … tn-1)
  current transition probability: p(tn | tn-2 tn-1); current emission probability: p(wn | tn); recursive step: p(w1 … wn-1, t1 … tn-1)
  This probability has to be maximized.
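As a rough illustration of this shared factorization, here is a minimal Python sketch (not Schmid's code); the probability tables `trans` and `emit` and the boundary tag are assumed inputs, and how `trans` is estimated is exactly where the two tagger families differ.

```python
# Minimal sketch of the shared recursive model (not Schmid's code).
# `trans` maps (tn-2, tn-1, tn) -> p(tn | tn-2 tn-1),
# `emit` maps (tn, wn) -> p(wn | tn); both are assumed inputs.
BOUNDARY = "<s>"  # hypothetical padding tag for the start of the sentence

def sequence_probability(words, tags, trans, emit):
    """p(w1..wn, t1..tn) = p(tn | tn-2 tn-1) * p(wn | tn) * p(w1..wn-1, t1..tn-1)"""
    if not words:
        return 1.0                                   # base case of the recursion
    context = ([BOUNDARY, BOUNDARY] + list(tags))[-3:]
    p_trans = trans.get(tuple(context), 0.0)         # current transition probability
    p_emit = emit.get((tags[-1], words[-1]), 0.0)    # current emission probability
    return p_trans * p_emit * sequence_probability(words[:-1], tags[:-1], trans, emit)
```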
3. Common properties between n-gram taggers and tree taggers
Difference: different ways of estimating the transition probability p(tn | tn-2 tn-1):
• n-gram taggers: Hidden Markov Model (HMM)
• tree taggers: decision tree
4. n-gram tagger: Training the HMM (simplified: first order / bigram)
Example: „Wir werden geschickt.“ („We are sent.“)
p(tn | tn-1) = F(tn-1 tn) / F(tn-1)
p(wn | tn) = F(tn wn) / F(tn)
In the second-order case (trigram model), the context consists of POS pairs (Adj Adj / Adj AuxV …).
Problem of zero frequencies: is the trigram ungrammatical or just rare?
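A small sketch of these relative-frequency estimates, assuming the training data comes as a list of (word, tag) sentences; the tag names in the toy corpus are illustrative.

```python
from collections import Counter

# Relative-frequency training for the simplified first-order (bigram) case:
# p(tn | tn-1) = F(tn-1 tn) / F(tn-1) and p(wn | tn) = F(tn wn) / F(tn).
def train_bigram_hmm(sentences):
    tag_freq, pair_freq, emit_freq = Counter(), Counter(), Counter()
    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        for word, tag in sentence:
            tag_freq[tag] += 1
            emit_freq[(tag, word)] += 1
        for prev, cur in zip(tags, tags[1:]):
            pair_freq[(prev, cur)] += 1
    trans = {pair: f / tag_freq[pair[0]] for pair, f in pair_freq.items()}
    emit = {pair: f / tag_freq[pair[0]] for pair, f in emit_freq.items()}
    return trans, emit

# Toy corpus containing the slide's example sentence (tags are illustrative).
corpus = [[("Wir", "PRO"), ("werden", "VAUX"), ("geschickt", "VPP"), (".", "SENT")]]
trans, emit = train_bigram_hmm(corpus)
```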
5. Classification for both n-gram and tree tagger: search for the best path via the Viterbi algorithm
Maximization of p((wir werden geschickt), (t1 t2 … tn))
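A compact Viterbi sketch for the bigram case; `tagset`, `trans` and `emit` are assumed to come from a training step like the one sketched above, and the small floor probability merely stands in for proper treatment of unseen events.

```python
import math

# Sketch of Viterbi decoding for the bigram case: find the tag sequence that
# maximizes p(w1..wn, t1..tn).
def viterbi(words, tagset, trans, emit, start="<s>", floor=1e-10):
    delta = {start: 0.0}                 # best log-prob of a path ending in each tag
    backpointers = []
    for word in words:
        new_delta, pointers = {}, {}
        for tag in tagset:
            log_emit = math.log(emit.get((tag, word), floor))
            prev, score = max(
                ((p, s + math.log(trans.get((p, tag), floor)) + log_emit)
                 for p, s in delta.items()),
                key=lambda item: item[1])
            new_delta[tag], pointers[tag] = score, prev
        delta = new_delta
        backpointers.append(pointers)
    best = max(delta, key=delta.get)     # best final tag
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])  # follow the backpointers
    path.reverse()
    return path[1:]                      # drop the start symbol
```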
6. TreeTagger --- Functioning
• Estimation of transition probabilities via decision trees (DTs)
Last time we learnt about DTs:
• Every node represents a set of classes, every daughter node a subset of classes
• Every class has one or more attributes
• Classes and attributes are given
• Every attribute can be converted into binary attributes
• Classes are split by asking yes/no questions about attributes
• The best order of questions is determined by minimizing an impurity measure for the sister nodes
6. TreeTagger --- Functioning
Example decision tree (compare the shape tree from last time with its questions "round?" and "equilateral?"):
• Samples at the root: 4x <Det Adj NN>, 1x <Det Adj Adj>, 1x <Adv To Vinf>, …
• Test at the root: tag-1 = Adj? --- the "yes" branch keeps 4x <Det Adj NN>, 1x <Det Adj Adj>
• Next test: tag-2 = Det?
• At the resulting leaf: <Det Adj NN>: 80%, <Det Adj Adj>: 20%
• This transition probability is multiplied by the a priori probability of the word → LEXICON!
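A possible way to query such a tree at tagging time, as a sketch; the `Node` class, its field names and the toy tree mirror the slide's example but are illustrative assumptions, not Schmid's data structures.

```python
# Illustrative decision-tree node: internal nodes ask a yes/no question about
# a context tag, leaves store the probability distribution over the third tag.
class Node:
    def __init__(self, question=None, yes=None, no=None, leaf_probs=None):
        self.question = question        # e.g. ("tag-1", "Adj") = "is tag-1 = Adj?"
        self.yes, self.no = yes, no
        self.leaf_probs = leaf_probs    # {tag: p(tag | context)} at a leaf

def transition_prob(root, tag_minus2, tag_minus1, tag):
    context = {"tag-1": tag_minus1, "tag-2": tag_minus2}
    node = root
    while node.leaf_probs is None:      # descend until a leaf is reached
        position, value = node.question
        node = node.yes if context[position] == value else node.no
    return node.leaf_probs.get(tag, 0.0)

# The slide's example tree: tag-1 = Adj?  ->  tag-2 = Det?  ->  leaf
leaf = Node(leaf_probs={"NN": 0.8, "Adj": 0.2})
root = Node(question=("tag-1", "Adj"),
            yes=Node(question=("tag-2", "Det"), yes=leaf, no=Node(leaf_probs={})),
            no=Node(leaf_probs={}))
p = transition_prob(root, "Det", "Adj", "NN")   # 0.8; during tagging this is
# multiplied by the word's a priori (lexicon) probability.
```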
6. TreeTagger --- Functioning
So far, we have used only trigram POS as attributes. Other questions are imaginable:
• Is the suffix of the following word „X“?
• Was the word „Y“ disambiguated before? (if YES → higher probability for the already assigned class; problem: this can't be learned, and the domain can change)
• Is the previous word „A“? Is word-2 „B“?
• … ?
6. TreeTagger --- Construction
Given:
• a tagged training corpus, e.g. [The]DET [man]NN [is]VAUX [sleeping]VPart1
• at each node: all trigram samples from the training data
To find out: the best test / question order → maximize the information gain at each node!
6. TreeTagger --- Construction
How to maximize the information gain at each node (= minimize Iq)?
Δi(N) = i(N) – (PL * i(NL) + (1 – PL) * i(NR))
Iq(N) = PL * i(NL) + PR * i(NR)
      = [ –p(C+|C) * ∑t∈T p(t|C+) * log2 p(t|C+) ] + [ –p(C-|C) * ∑t∈T p(t|C-) * log2 p(t|C-) ]
where
• Δi(N) = information gain at node N
• Iq(N) = amount of information still needed to identify the third tag after the result of test q is known = OVERALL IMPURITY after q
• p(C+|C) = f(C+) / f(C)
• f(C) = number of trigrams at the current node
• f(C+) = number of trigrams at the current node which passed the test
• i(NL) = impurity(NL): shows how many different classes are at node NL; here: ENTROPY IMPURITY
6. TreeTagger --- Construction
Entropy impurity: i(N) = –∑t∈T p(t|C) * log2 p(t|C)
Entropy is a measure of uncertainty: the lower the entropy value, the fewer different classes occur at the current node.
We measure the impurity with respect to the third tag t that we have to disambiguate.
The probabilities are estimated by relative frequencies, e.g. p(t|C) = f(t,C) / f(C) and analogously p(t|C+) = f(t,C+) / f(C+).
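A short sketch of these quantities; `trigrams` is a hypothetical list of (tag-2, tag-1, tag) samples at the current node, and `test` stands for an arbitrary yes/no question over such a triple.

```python
import math
from collections import Counter

# Sketch of the node statistics used during construction: the entropy
# impurity i(N) over the third tag of the trigram samples, the remaining
# impurity Iq(N) after a yes/no test q, and the information gain Δi(N).
def entropy_impurity(trigrams):
    # i(N) = -sum_t p(t|C) * log2 p(t|C)
    counts = Counter(t3 for _, _, t3 in trigrams)
    total = len(trigrams)
    return -sum((f / total) * math.log2(f / total) for f in counts.values())

def information_gain(trigrams, test):
    passed = [tri for tri in trigrams if test(tri)]       # the subset C+
    failed = [tri for tri in trigrams if not test(tri)]   # the subset C-
    p_plus = len(passed) / len(trigrams)                  # p(C+|C) = f(C+)/f(C)
    iq = (p_plus * entropy_impurity(passed)
          + (1 - p_plus) * entropy_impurity(failed))      # Iq(N), impurity after q
    return entropy_impurity(trigrams) - iq                # Δi(N) = i(N) - Iq(N)

# Example: samples at a node and the test "is tag-1 = Adj?"
samples = [("Det", "Adj", "NN")] * 4 + [("Det", "Adj", "Adj")]
gain = information_gain(samples, lambda tri: tri[1] == "Adj")
```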
6. TreeTagger --- Construction
Stop criterion: if a test would generate at least one subset of trigrams whose size is below a predefined threshold:
• make the node a leaf
• store the tag probabilities p(ti|C), i = 1…n, at the leaf
6. TreeTagger --- Construction
Pruning
• Build the whole tree first, without pruning
• Conditions for pruning: if
  • both subnodes of a node are leaves, and
  • the weighted information gain at the node is below some threshold
• then:
  • remove the subnodes
  • make the node a leaf
6. TreeTagger --- Construction
Pruning (2): weighted information gain G
G = f(C) * (I0 – Iq)
I0 = –∑t∈T p(t|C) * log2 p(t|C) = impurity before testing q
Iq = overall impurity after testing q
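A minimal sketch of this post-pruning step, assuming a simple binary node structure; `PrunedNode` and its fields are illustrative, not Schmid's implementation.

```python
import math
from collections import Counter
from dataclasses import dataclass

# Illustrative binary node: each node keeps the trigram samples that reached
# it; a node is a leaf when it has no children.
@dataclass
class PrunedNode:
    trigrams: list                 # (tag-2, tag-1, tag) samples at this node
    yes: "PrunedNode" = None
    no: "PrunedNode" = None

    @property
    def is_leaf(self):
        return self.yes is None and self.no is None

def entropy(trigrams):
    counts = Counter(t3 for _, _, t3 in trigrams)
    n = len(trigrams)
    return -sum((f / n) * math.log2(f / n) for f in counts.values())

def prune(node, threshold):
    if node is None or node.is_leaf:
        return
    prune(node.yes, threshold)                      # prune bottom-up
    prune(node.no, threshold)
    if node.yes.is_leaf and node.no.is_leaf:
        i0 = entropy(node.trigrams)                 # impurity before the test q
        p_l = len(node.yes.trigrams) / len(node.trigrams)
        iq = (p_l * entropy(node.yes.trigrams)
              + (1 - p_l) * entropy(node.no.trigrams))    # overall impurity after q
        if len(node.trigrams) * (i0 - iq) < threshold:    # G = f(C) * (I0 - Iq)
            node.yes = node.no = None               # collapse: make the node a leaf
```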
6. TreeTagger --- Construction
Pruning (3): Why no pruning during tree construction?
The weighted information gain G could be below the threshold at a node even though G is above the threshold further down in its subtree; with pruning during construction, that subtree would never be built.
→ G is no good stop criterion during construction; it is only applied afterwards.
6. TreeTagger --- Lexicon
The lexicon
• contains the a priori tag probabilities for each word: p(t|w)
• has three parts:
  • fullform lexicon: 1st lookup --- fail?
  • suffix lexicon: 2nd lookup --- fail?
  • default entry
• suffix lexicon for open-class POS words
• fullform and suffix lexicon created from the Penn Treebank corpus (2 million tagged words)
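A sketch of this three-stage lookup, assuming plain dictionaries for all three parts and a longest-suffix backoff of at most five characters; the table contents are toy values.

```python
# Sketch of the three-stage lexicon lookup: fullform lexicon first, then the
# suffix lexicon (longest matching suffix, up to length 5), then the default
# entry. All tables below are toy placeholders, not the real lexicon.
def apriori_tag_probs(word, fullform, suffix_lexicon, default_entry, max_suffix=5):
    if word in fullform:                              # 1st lookup: fullform lexicon
        return fullform[word]
    for length in range(min(max_suffix, len(word)), 0, -1):
        suffix = word[-length:]                       # 2nd lookup: suffix lexicon
        if suffix in suffix_lexicon:
            return suffix_lexicon[suffix]
    return default_entry                              # fallback: default entry

fullform = {"store": {"NN": 0.6, "VB": 0.3, "VBP": 0.1}}
suffixes = {"ing": {"VBG": 0.7, "NN": 0.3}}
default = {"NN": 1.0}
print(apriori_tag_probs("tagging", fullform, suffixes, default))   # uses the "ing" entry
```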
6. TreeTagger --- Suffix Lexicon
• organized as a tree
• suffix tree constructed from the suffixes of length 5 of all open-class POS words in the training corpus
• tag frequencies stored at the corresponding nodes
[Figure: sample suffix tree of length 3]
6. TreeTagger --- Suffix Lexicon
Functioning: calculating the a priori probability of „tagging“, which is not in the fullform lexicon
• follow the path <g, n, i>
• stored tag frequencies at <g, n, i>: e.g. 100x NN, 50x PresPart
• p(NN|ing) ≈ 0.67, p(PresPart|ing) ≈ 0.33
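A sketch of this lookup on a tree-structured suffix lexicon; the nested-dictionary node layout is an assumption made for illustration.

```python
# Sketch of the suffix-tree lookup for "tagging": walk the tree character by
# character from the end of the word, keep the tag frequencies stored at the
# deepest reachable node, and normalize them into a priori probabilities.
def suffix_probs(word, suffix_tree):
    node, freqs = suffix_tree, suffix_tree.get("freqs", {})
    for ch in reversed(word):                     # "tagging" -> g, n, i, ...
        if ch not in node.get("children", {}):
            break
        node = node["children"][ch]
        freqs = node.get("freqs", freqs)          # remember the deepest stored counts
    total = sum(freqs.values())
    return {tag: f / total for tag, f in freqs.items()} if total else {}

# The slide's example: the node reached via <g, n, i> stores 100x NN, 50x PresPart.
tree = {"children": {
    "g": {"children": {
        "n": {"children": {
            "i": {"freqs": {"NN": 100, "PresPart": 50}}}}}}}}
print(suffix_probs("tagging", tree))   # {'NN': 0.666..., 'PresPart': 0.333...}
```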
6. TreeTagger --- Suffix Lexicon
Pruning:
• for each node: calculate the information measure
  I(S) = –∑pos∈T p(pos|S) * log2 p(pos|S)   (pos = tag, S = suffix)
• for each leaf: calculate the weighted information gain
  G(aS) = F(aS) * (I(S) – I(aS))   (aS = current leaf, S = its parent node, i.e. the suffix shortened by one character)
• prune the leaf if G(aS) is below the given threshold (→ example on the blackboard)
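A sketch of this pruning criterion under the reading given above (aS = leaf, S = its parent); the frequency dictionaries and the threshold are toy values.

```python
import math

# Sketch of the suffix-lexicon pruning check: a leaf aS (suffix S extended by
# one character a) is kept only if G(aS) = F(aS) * (I(S) - I(aS)) reaches the
# threshold.
def info(freqs):
    # I(S) = -sum_pos p(pos|S) * log2 p(pos|S)
    total = sum(freqs.values())
    return -sum((f / total) * math.log2(f / total) for f in freqs.values())

def keep_leaf(parent_freqs, leaf_freqs, threshold):
    gain = sum(leaf_freqs.values()) * (info(parent_freqs) - info(leaf_freqs))
    return gain >= threshold

# The "ing" leaf under its parent suffix "ng": keep it only if the extra
# character tells us enough about the tag distribution.
print(keep_leaf({"NN": 300, "PresPart": 60, "ADJ": 40},
                {"NN": 100, "PresPart": 50}, threshold=10))
```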
7. Tests
Comparison of the following taggers:
• TreeTagger (0.1): zero frequencies replaced by 0.1
• TreeTagger (10^-10): zero frequencies replaced by 10^-10
7. Tests Influence of the training corpus size
7. Tests Influence of different pruning thresholds