
Helmut Schmid: Probabilistic Part-of-Speech-Tagging Using Decision Trees

Editing: Nicolai Kilian, 30.11.2004



  1. Helmut Schmid: Probabilistic Part-of-Speech-Tagging Using Decision Trees. Editing: Nicolai Kilian, 30.11.2004

  2. Overview
     • What are part-of-speech taggers?
     • What different POS taggers are there?
     • Common properties of n-gram taggers and tree taggers
     • N-gram tagger: Hidden Markov Model
     • Classification via the Viterbi algorithm
     • TreeTagger
       • Functioning
       • Construction of the decision tree
       • Lexicon structure
     • Results

  3. 1. What are part-of-speech taggers?
     • Automatic labeling of every word form in a corpus with a tag
     • Each tag marks a part of speech
     • Tags come from a predefined tagset T → classification
     • Main problem with tagging: words are often ambiguous in their POS, e.g. "store" → noun, finite verb, or infinitive

  4. Disambiguation of the POS? By context!
     "The 1977 PCs could store two pages of data." In this context, "store" can only be an infinitive.
     Strong dependence on the POS of the preceding context: "[The]DET [big]ADJ [store]NN" vs. "[want]Vfin [to]TO [store]VINF"

  5. 2. What different POS taggers are there?
     • I. Stochastic/statistical taggers: use transition probabilities between the tags of a word sequence
     • II. Rule-based taggers (Brill taggers)
     Both kinds of tagger learn the required information from a manually tagged training corpus → supervised learning

  6. 3. Common properties of n-gram taggers and tree taggers
     Both taggers
     • are probabilistic
     • model the probability of a tagged word sequence recursively:
       $p(w_1 w_2 \dots w_n, t_1 t_2 \dots t_n) = p(t_n \mid t_{n-2} t_{n-1}) \cdot p(w_n \mid t_n) \cdot p(w_1 w_2 \dots w_{n-1}, t_1 t_2 \dots t_{n-1})$
       The left-hand side has to be maximized; $p(t_n \mid t_{n-2} t_{n-1})$ is the current transition probability, $p(w_n \mid t_n)$ is the current emission probability, and the last factor is the recursive step.
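The recursion above can be sketched in a few lines of Python. This is only an illustration: the function name `joint_probability`, the `<s>` start padding, and the toy probability tables are assumptions, not part of the slides.

```python
def joint_probability(words, tags, p_trans, p_emit, start="<s>"):
    """Return p(w1..wn, t1..tn) using the trigram recursion."""
    if not words:
        return 1.0  # base case: the empty sequence has probability 1
    n = len(words)
    # Pad with two start symbols so t1 and t2 also have a two-tag context.
    padded = [start, start] + list(tags)
    t_prev2, t_prev1, t_n = padded[n - 1], padded[n], padded[n + 1]
    transition = p_trans.get((t_prev2, t_prev1, t_n), 0.0)  # p(tn | tn-2 tn-1)
    emission = p_emit.get((words[-1], t_n), 0.0)            # p(wn | tn)
    rest = joint_probability(words[:-1], tags[:-1], p_trans, p_emit, start)
    return transition * emission * rest


# Toy probabilities, invented for illustration only.
p_trans = {("<s>", "<s>", "DET"): 0.5, ("<s>", "DET", "NN"): 0.75}
p_emit = {("the", "DET"): 0.5, ("store", "NN"): 0.25}

result = joint_probability(["the", "store"], ["DET", "NN"], p_trans, p_emit)
print(result)
```

A tagger would not score one tag sequence at a time like this; it searches over all tag sequences for the one that maximizes this product.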

  7. 3. Common properties of n-gram taggers and tree taggers
     Difference: the two taggers estimate the transition probability $p(t_n \mid t_{n-2} t_{n-1})$ in different ways:
     • N-gram taggers: Hidden Markov Model (HMM)
     • Tree taggers: decision tree

  8. 4. N-gram tagger: training the HMM (simplified: first order / bigram)
     Example: "Wir werden geschickt." ("We are sent.")
     $p(t_n \mid t_{n-1}) = \frac{F(t_{n-1} t_n)}{F(t_{n-1})}$, $p(w_n \mid t_n) = \frac{F(t_n w_n)}{F(t_n)}$, where F is the frequency in the training corpus
     In the second-order case (trigram model), the conditioning context is a POS pair (Adj Adj / Adj AuxV ...).
     Problem: zero frequencies. Is a trigram ungrammatical or just rare?
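As a rough sketch, the two relative-frequency estimates for the bigram case can be computed from a tagged corpus as follows. The two-sentence corpus, the tag names (PRON, AUXV, VFIN, PART, ADJ), and the `<s>` start symbol are hypothetical and chosen only to mirror the "Wir werden geschickt" example.

```python
from collections import Counter


def train_bigram_hmm(tagged_sentences, start="<s>"):
    """Estimate p(tn | tn-1) and p(wn | tn) by relative frequencies."""
    tag_count = Counter()     # F(t)
    bigram_count = Counter()  # F(tn-1 tn)
    emit_count = Counter()    # F(tn wn)
    for sentence in tagged_sentences:
        prev = start
        tag_count[start] += 1
        for word, tag in sentence:
            tag_count[tag] += 1
            bigram_count[(prev, tag)] += 1
            emit_count[(tag, word)] += 1
            prev = tag
    p_trans = {(a, b): c / tag_count[a] for (a, b), c in bigram_count.items()}
    p_emit = {(t, w): c / tag_count[t] for (t, w), c in emit_count.items()}
    return p_trans, p_emit


# Hypothetical training data: "geschickt" as participle and as adjective.
corpus = [
    [("wir", "PRON"), ("werden", "AUXV"), ("geschickt", "PART")],
    [("wir", "PRON"), ("werden", "VFIN"), ("geschickt", "ADJ")],
]
p_trans, p_emit = train_bigram_hmm(corpus)
print(p_trans[("PRON", "AUXV")], p_emit[("PRON", "wir")])
```

Tag pairs that never occur in the corpus are simply missing from `p_trans`, which is the zero-frequency problem in miniature: the estimate cannot tell an ungrammatical sequence from a rare one.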

  9. 5. Classification for both n-gram tagger and tree tagger: search for the best path via the Viterbi algorithm
     Maximization of $p(\text{wir werden geschickt}, t_1 t_2 \dots t_n)$
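A minimal Viterbi sketch for the simplified bigram model may help here. The probability tables are invented toy values, and the tag names and the function name `viterbi` are assumptions; a trigram tagger would keep a pair of tags as its state instead of a single tag.

```python
def viterbi(words, tags, p_trans, p_emit, start="<s>"):
    """Return (probability, tag path) of the best bigram-HMM tagging."""
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {start: (1.0, [])}
    for word in words:
        new_best = {}
        for tag in tags:
            emission = p_emit.get((tag, word), 0.0)
            if emission == 0.0:
                continue  # this tag cannot emit the word
            candidates = [
                (prob * p_trans.get((prev, tag), 0.0) * emission, path + [tag])
                for prev, (prob, path) in best.items()
            ]
            new_best[tag] = max(candidates, key=lambda c: c[0])
        if not new_best:
            return 0.0, []  # no tag can emit this word
        best = new_best
    return max(best.values(), key=lambda c: c[0])


# Toy probabilities, invented for illustration only.
tags = ["PRON", "AUXV", "VFIN", "PART", "ADJ"]
p_trans = {
    ("<s>", "PRON"): 1.0,
    ("PRON", "AUXV"): 0.75, ("PRON", "VFIN"): 0.25,
    ("AUXV", "PART"): 0.5, ("AUXV", "ADJ"): 0.25, ("VFIN", "ADJ"): 0.5,
}
p_emit = {
    ("PRON", "wir"): 1.0,
    ("AUXV", "werden"): 0.5, ("VFIN", "werden"): 0.5,
    ("PART", "geschickt"): 0.5, ("ADJ", "geschickt"): 0.5,
}

prob, path = viterbi(["wir", "werden", "geschickt"], tags, p_trans, p_emit)
print(prob, path)
```

Only the best path into each tag is kept at every word, so the search is linear in sentence length rather than exponential.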

  10. 6. TreeTagger: Functioning
      • Estimation of transition probabilities via decision trees (DTs)
      Last time we learned about DTs:
      • Every node represents a set of classes, every daughter node a subset of classes
      • Every class has one or more attributes
      • Classes and attributes are given
      • Every attribute can be converted into binary attributes
      • Classes are split by asking yes-no questions about attributes
      • The best order of questions is determined by minimizing an impurity measure for the sister nodes

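  11. 6. TreeTagger: Functioning
      Example decision tree (shown for comparison: the tests "round?" and "equilateral?"):
      • Root: trigram samples 4x <Det Adj NN>, 1x <Det Adj Adj>, 1x <Adv To Vinf>, ...
      • Test at the root: tag-1 = Adj? (compare: round?)
        • yes: 4x <Det Adj NN>, 1x <Det Adj Adj>; next test: tag-2 = Det? (compare: equilateral?)
          • yes: leaf with <Det Adj NN>: 80%, <Det Adj Adj>: 20%
          • no: ...
        • no: ...
      • * a priori probability → lexicon!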
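Looking up a transition probability in such a tree amounts to answering the yes-no tests until a leaf is reached. The sketch below assumes a nested-dict representation; only the 80% / 20% leaf comes from the slide, and the other two leaves are placeholder values.

```python
# Each inner node holds a test (context position, expected tag) and two
# subtrees; each leaf holds tag probabilities for the third tag.
tree = {
    "test": ("tag-1", "Adj"),
    "yes": {
        "test": ("tag-2", "Det"),
        "yes": {"leaf": {"NN": 0.8, "Adj": 0.2}},  # from the slide
        "no": {"leaf": {"NN": 1.0}},               # placeholder
    },
    "no": {"leaf": {"Vinf": 1.0}},                 # placeholder
}


def lookup(node, context):
    """Follow the yes/no tests for the given context down to a leaf."""
    while "leaf" not in node:
        position, expected = node["test"]
        node = node["yes"] if context[position] == expected else node["no"]
    return node["leaf"]


probs = lookup(tree, {"tag-2": "Det", "tag-1": "Adj"})
print(probs)
```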

  12. 6. TreeTagger: Functioning
      So far, we have used only POS trigrams as attributes. Other questions are conceivable:
      • Is the suffix of the following word "X"?
      • Was the word "Y" disambiguated before? (Yes → higher probability for the already assigned class; problem: this cannot be learned, and the domain can change)
      • Is the previous word "A"? Is word-2 "B"?
      • ...?

  13. 6. TreeTagger: Construction
      Given:
      • tagged training corpus, e.g. [The]DET [man]NN [is]VAUX [sleeping]VPart1.
      • at the node: all trigram samples from the training data
      To find out: the best test order / question order → maximize the information gain at each node!

  14. 6. TreeTagger: Construction
      How to maximize the information gain at each node? (= minimize $I_q$)
      $\Delta i(N) = i(N) - (P_L \cdot i(N_L) + (1 - P_L) \cdot i(N_R))$
      → $I_q(N) = P_L \cdot i(N_L) + P_R \cdot i(N_R)$, with $P_R = 1 - P_L$
      → $I_q(N) = -p(C_+ \mid C) \sum_{t \in T} p(t \mid C_+) \log_2 p(t \mid C_+) - p(C_- \mid C) \sum_{t \in T} p(t \mid C_-) \log_2 p(t \mid C_-)$
      • $\Delta i(N)$ = information gain at node N
      • $I_q(N)$ = amount of information still needed to identify the third tag after the result of test q is known
      • $p(C_+ \mid C) = f(C_+) / f(C)$
      • $f(C)$ = number of trigrams at the current node
      • $f(C_+)$ = number of trigrams at the current node that passed the test
      • $i(N_L)$ = impurity($N_L$): shows how many different classes are at node $N_L$; here: entropy impurity

  15. 6. TreeTagger: Construction
      How to maximize the information gain at each node? (= minimize $I_q$)
      $\Delta i(N) = i(N) - (P_L \cdot i(N_L) + (1 - P_L) \cdot i(N_R))$
      → $I_q(N) = P_L \cdot i(N_L) + P_R \cdot i(N_R)$, with $P_R = 1 - P_L$
      → $I_q(N) = -p(C_+ \mid C) \sum_{t \in T} p(t \mid C_+) \log_2 p(t \mid C_+) - p(C_- \mid C) \sum_{t \in T} p(t \mid C_-) \log_2 p(t \mid C_-)$
      • $\Delta i(N)$ = information gain at node N
      • $I_q(N)$ = amount of information still needed to identify the third tag after the result of test q is known = overall impurity after q
      • $p(C_+ \mid C) = f(C_+) / f(C)$
      • $f(C)$ = number of trigrams at the current node
      • $f(C_+)$ = number of trigrams at the current node that passed the test
      • $i(N_L)$ = impurity($N_L$): shows how many different classes are at node $N_L$; here: entropy impurity

  16. 6. TreeTagger: Construction
      Entropy impurity
      $i(N) = -\sum_{t \in T} p(t \mid C) \log_2 p(t \mid C)$
      Entropy is a measure of uncertainty. The lower the entropy value, the fewer different classes there are at the current node. We measure the impurity with respect to the third tag t, which we have to disambiguate.
      $p(t \mid C) = f(t, C) / f(C)$, and analogously $p(t \mid C_+) = f(t, C_+) / f(C_+)$
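To make the impurity formulas concrete, the following sketch computes the entropy impurity and the overall impurity after a test for the trigram samples used earlier in the slides. The function names and the representation of trigrams as 3-tuples are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(trigrams):
    """Entropy impurity over the third tag of the given trigrams."""
    counts = Counter(third for _, _, third in trigrams)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def impurity_after_test(trigrams, test):
    """I_q: size-weighted entropy of the two subsets produced by the test."""
    passed = [t for t in trigrams if test(t)]      # C+
    failed = [t for t in trigrams if not test(t)]  # C-
    total = len(trigrams)
    return sum(len(part) / total * entropy(part)
               for part in (passed, failed) if part)


samples = ([("Det", "Adj", "NN")] * 4
           + [("Det", "Adj", "Adj")]
           + [("Adv", "To", "Vinf")])

i_0 = entropy(samples)                                         # before the test
i_q = impurity_after_test(samples, lambda t: t[1] == "Adj")    # tag-1 = Adj?
print(i_0, i_q, i_0 - i_q)  # the last value is the information gain
```

Choosing the test with the smallest `i_q` is the same as choosing the one with the largest information gain, because `i_0` does not depend on the test.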

  17. 6. TreeTagger: Construction
      Stop criterion
      If the test would generate at least one subset of trigrams whose size is below some predefined threshold:
      • Make the node a leaf
      • Store the tag probabilities $p(t_i \mid C)$, i = 1...n, at the leaf
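Putting test selection and the stop criterion together, a recursive construction could look roughly like the sketch below. The dict-based tree, the named candidate tests, the sample counts, and the threshold `min_size=2` are all assumptions for illustration; `min_size` must be at least 1 so that the recursion terminates.

```python
import math
from collections import Counter


def tag_probs(trigrams):
    """Relative frequencies p(t | C) of the third tag."""
    counts = Counter(third for _, _, third in trigrams)
    total = sum(counts.values())
    return {tag: c / total for tag, c in counts.items()}


def entropy(trigrams):
    return -sum(p * math.log2(p) for p in tag_probs(trigrams).values())


def build_tree(trigrams, tests, min_size=2):
    """Split on the test with minimal I_q; stop if a subset gets too small."""
    def split(test):
        yes = [t for t in trigrams if test(t)]
        no = [t for t in trigrams if not test(t)]
        i_q = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(trigrams)
        return i_q, yes, no

    name, (_, yes, no) = min(((n, split(t)) for n, t in tests.items()),
                             key=lambda item: item[1][0])
    if len(yes) < min_size or len(no) < min_size:
        return {"leaf": tag_probs(trigrams)}  # stop criterion reached
    return {"test": name,
            "yes": build_tree(yes, tests, min_size),
            "no": build_tree(no, tests, min_size)}


tests = {
    "tag-1 = Adj?": lambda t: t[1] == "Adj",
    "tag-2 = Det?": lambda t: t[0] == "Det",
}
samples = ([("Det", "Adj", "NN")] * 4 + [("Det", "Adj", "Adj")]
           + [("Adv", "To", "Vinf")] * 3)
tree = build_tree(samples, tests)
print(tree)
```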

  18. 6. TreeTagger: Construction
      Pruning
      • Build the whole tree without pruning
      • Conditions for pruning: if
        • both subnodes of a node are leaves
        • the weighted information gain at the node is below some threshold
      • then:
        • remove the subnodes
        • make the node a leaf

  19. 6. TreeTagger: Construction
      Pruning (2): weighted information gain G
      $G = f(C) \cdot (I_0 - I_q)$
      • $I_0 = -\sum_{t \in T} p(t \mid C) \log_2 p(t \mid C)$ = impurity before testing q
      • $I_q$ = overall impurity after testing q
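The pruning rule with the weighted information gain G can be sketched as a bottom-up pass over a finished tree. The node layout (a `counts` dict of third-tag frequencies at every node, plus `yes`/`no` children at inner nodes), the toy counts, and the thresholds are assumptions for this example.

```python
import math


def entropy(counts):
    """Entropy impurity of a tag-frequency dict."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)


def prune(node, threshold):
    """Turn a node into a leaf if both children are leaves and G is too small."""
    if "yes" not in node:
        return node  # already a leaf
    node["yes"] = prune(node["yes"], threshold)
    node["no"] = prune(node["no"], threshold)
    children = (node["yes"], node["no"])
    if any("yes" in child for child in children):
        return node  # only nodes with two leaf children are candidates
    f_c = sum(node["counts"].values())
    i_0 = entropy(node["counts"])
    i_q = sum(sum(ch["counts"].values()) / f_c * entropy(ch["counts"])
              for ch in children)
    gain = f_c * (i_0 - i_q)  # G = f(C) * (I_0 - I_q)
    if gain < threshold:
        return {"counts": node["counts"]}  # remove subnodes, make node a leaf
    return node


def toy_tree():
    # Hypothetical counts of the third tag at each node.
    return {
        "counts": {"NN": 4, "Adj": 1, "Vinf": 3},
        "yes": {"counts": {"NN": 4, "Adj": 1}},
        "no": {"counts": {"Vinf": 3}},
    }


kept = prune(toy_tree(), threshold=1.0)
pruned = prune(toy_tree(), threshold=10.0)
print("yes" in kept, "yes" in pruned)
```

Because children are pruned before their parent is examined, a node whose subtrees collapse into leaves becomes a pruning candidate in the same pass.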

  20. 6. TreeTagger: Construction
      Pruning (3): Why no pruning during tree construction?
      The weighted information gain G could be below the threshold at one node although G is above the threshold at a node further down. If construction stopped at the first node, that subtree would never be built → G is not a good stop criterion.

  21. 6. TreeTagger: Lexicon
      The lexicon
      • contains the a priori tag probabilities for each word: p(t|w)
      • has three parts, consulted in order:
        • fullform lexicon: 1st lookup; if it fails:
        • suffix lexicon: 2nd lookup; if it fails:
        • default entry
      • the suffix lexicon covers open-class POS words
      • fullform and suffix lexicon were created from the Penn Treebank corpus (2 million tagged words)
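The three-stage lookup can be sketched as a simple fallback chain. Here the suffix lexicon is simplified to a flat dict keyed by suffix string, and all entries are invented; the function name `lookup_tag_probs` and its parameters are likewise assumptions.

```python
def lookup_tag_probs(word, fullform, suffix_lexicon, default, max_suffix_len=5):
    """Return a priori tag probabilities p(t | w) for a word."""
    if word in fullform:  # 1st lookup: fullform lexicon
        return fullform[word]
    # 2nd lookup: suffix lexicon, longest matching suffix first
    for length in range(min(max_suffix_len, len(word)), 0, -1):
        suffix = word[-length:]
        if suffix in suffix_lexicon:
            return suffix_lexicon[suffix]
    return default  # 3rd: default entry


# Hypothetical lexicon entries for illustration only.
fullform = {"the": {"DET": 1.0}}
suffix_lexicon = {"ing": {"NN": 0.67, "PresPart": 0.33}}
default = {"NN": 1.0}

print(lookup_tag_probs("the", fullform, suffix_lexicon, default))
print(lookup_tag_probs("tagging", fullform, suffix_lexicon, default))
print(lookup_tag_probs("xyz", fullform, suffix_lexicon, default))
```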

  22. 6. TreeTagger: Suffix Lexicon
      • organized as a tree (figure: sample suffix tree of length 3)
      • suffix tree constructed from the suffixes of length 5 of all open-class POS words in the training corpus
      • tag frequencies stored at the corresponding nodes

  23. 6. TreeTagger: Suffix Lexicon
      • Functioning: calculating the a priori probability of "tagging", which is not in the fullform lexicon
      • Follow the path <g,n,i>; stored tag frequencies at <g,n,i>: e.g. 100x NN, 50x PresPart
      • p(NN|ing) = 100/150 ≈ 0.67
      • p(PresPart|ing) = 50/150 ≈ 0.33
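A small sketch of a suffix tree with tag frequencies at its nodes reproduces the numbers above. The training words ("building", "running") and their counts are invented so that the node <g,n,i> holds 100x NN and 50x PresPart; the suffix length of 3 follows the sample tree rather than the length 5 used for the real lexicon.

```python
from collections import Counter


def new_node():
    return {"counts": Counter(), "children": {}}


def build_suffix_tree(tagged_words, max_len=3):
    """Insert each word's suffix, last character first, counting tags per node."""
    root = new_node()
    for word, tag in tagged_words:
        node = root
        node["counts"][tag] += 1
        for ch in reversed(word[-max_len:]):
            node = node["children"].setdefault(ch, new_node())
            node["counts"][tag] += 1
    return root


def suffix_tag_probs(root, word, max_len=3):
    """Follow the word's suffix as deep as possible and normalize the counts."""
    node = root
    for ch in reversed(word[-max_len:]):
        child = node["children"].get(ch)
        if child is None:
            break
        node = child
    total = sum(node["counts"].values())
    return {tag: c / total for tag, c in node["counts"].items()}


# Hypothetical training data: 100 nouns and 50 present participles in "-ing".
training = [("building", "NN")] * 100 + [("running", "PresPart")] * 50
suffix_tree = build_suffix_tree(training)

probs = suffix_tag_probs(suffix_tree, "tagging")  # follows the path <g,n,i>
print(probs)
```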

  24. 6. TreeTagger: Suffix Lexicon
      • Pruning:
        • for each node: calculate the information measure
          $I(S) = -\sum_{pos \in T} p(pos \mid S) \log_2 p(pos \mid S)$, where pos = tag, S = suffix
        • for each leaf: calculate the weighted information gain
          $G(aS) = F(aS) \cdot (I(S) - I(aS))$, where aS = suffix of the current node, S = suffix of its parent node
        • G(aS) below the given threshold? → blackboard

  25. 7. Tests
      • Comparison of the following taggers:
        • TreeTagger (0.1): zero frequencies replaced by 0.1
        • TreeTagger ($10^{-10}$): zero frequencies replaced by $10^{-10}$

  26. 7. Tests: Influence of the training corpus size

  27. 7. Tests: Influence of different pruning thresholds
