ML: Classical Methods from AI
• Decision-Tree Induction
• Exemplar-Based Learning
• Rule Induction
• TBEDL
Decision Trees
• Decision trees are a way to represent the rules underlying training data, as hierarchical, sequential structures that recursively partition the data.
• They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration, with purposes including description, classification, and generalization.
• From a machine-learning perspective: decision trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes.
• Acquisition: Top-Down Induction of Decision Trees (TDIDT).
• Systems: CART (Breiman et al. 84); ID3, C4.5, C5.0 (Quinlan 86, 93, 98); ASSISTANT, ASSISTANT-R (Cestnik et al. 87; Kononenko et al. 95).
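As an illustration only (scikit-learn's CART-style learner is not one of the systems listed above), a minimal sketch of inducing and applying a decision tree, assuming a toy SIZE/SHAPE/COLOR dataset encoded as integers:

# A minimal sketch (not one of the systems above): inducing and applying a
# decision tree with scikit-learn on an assumed toy dataset.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training data: each row is [SIZE, SHAPE, COLOR] encoded as integers.
X = [[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]]
y = ["neg", "pos", "pos", "neg"]          # mutually exclusive classes

clf = DecisionTreeClassifier(criterion="entropy")   # ID3/C4.5-style split criterion
clf.fit(X, y)

print(export_text(clf, feature_names=["SIZE", "SHAPE", "COLOR"]))  # tree as rules
print(clf.predict([[1, 0, 0]]))           # classify a new object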
An Example
• [figure: a decision tree branching on attributes such as SIZE (small/big), SHAPE (circle/triangle) and COLOR (red/blue), with leaves labelled with the classes pos/neg]
Learning Decision Trees
• [figure: training phase: Training Set + TDIDT → DT; test phase: Example + DT → Class]
DTs: General Induction Algorithm

function TDIDT (X: set-of-examples; A: set-of-features)
  var tree1, tree2: decision-tree;
      X': set-of-examples;
      A': set-of-features
  end-var
  if (stopping_criterion(X)) then
    tree1 := create_leaf_tree(X)
  else
    amax  := feature_selection(X, A);
    tree1 := create_tree(X, amax);
    for-all val in values(amax) do
      X'    := select_examples(X, amax, val);
      A'    := A \ {amax};
      tree2 := TDIDT(X', A');
      tree1 := add_branch(tree1, tree2, val)
    end-for
  end-if
  return(tree1)
end-function
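A minimal runnable Python sketch of the same TDIDT recursion, assuming examples are (features-dict, label) pairs with categorical values, `features` is a set of feature names, and `score` is any of the feature-selection criteria discussed next:

# A minimal Python sketch of TDIDT (not the lecture's implementation).
from collections import Counter

def tdidt(examples, features, score):
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not features:          # stopping criterion
        return Counter(labels).most_common(1)[0][0]    # leaf: majority class
    best = max(features, key=lambda a: score(examples, a))  # feature selection
    node = {"feature": best, "branches": {}}
    for val in {feats[best] for feats, _ in examples}: # one branch per value
        subset = [(f, l) for f, l in examples if f[best] == val]
        node["branches"][val] = tdidt(subset, features - {best}, score)
    return node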
Feature Selection Criteria
• Functions derived from Information Theory:
  • Information Gain, Gain Ratio (Quinlan 86)
• Functions derived from distance measures:
  • Gini Diversity Index (Breiman et al. 84)
  • RLM (López de Mántaras 91)
• Statistically based:
  • Chi-square test (Sestito & Dillon 94)
  • Symmetrical Tau (Zhou & Dillon 91)
• RELIEFF-IG: a variant of RELIEFF (Kononenko 94)
Information Gain (Quinlan, 1979)

Information Gain (2) (Quinlan, 1979)

Gain Ratio (Quinlan, 1986)
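For reference, a minimal sketch of the standard entropy, information-gain, and gain-ratio computations (the lecture's exact formulation may differ); examples are (features, label) pairs as in the TDIDT sketch above:

# Standard textbook definitions; usable as the `score` argument of tdidt().
import math
from collections import Counter

def entropy(examples):
    counts = Counter(label for _, label in examples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, a):
    # H(X) - sum over values v of feature a of |X_v|/|X| * H(X_v)
    total = len(examples)
    gain = entropy(examples)
    for val in {feats[a] for feats, _ in examples}:
        subset = [(f, l) for f, l in examples if f[a] == val]
        gain -= len(subset) / total * entropy(subset)
    return gain

def gain_ratio(examples, a):
    # Normalize the gain by the split information of feature a (Quinlan, 1986)
    total = len(examples)
    counts = Counter(feats[a] for feats, _ in examples)
    split_info = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return information_gain(examples, a) / split_info if split_info else 0.0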
RELIEF (Kira & Rendell, 1992)

RELIEFF (Kononenko, 1994)
RELIEFF-IG (Màrquez, 1999)
• RELIEFF, except that the distance measure used to find the nearest hits/misses does not treat all attributes equally: attributes are weighted according to the IG (Information Gain) measure.
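A much-simplified sketch of the RELIEF weight-update idea (two classes, a single nearest hit/miss, nominal attributes, and every class represented by at least two examples); RELIEFF generalizes this to k neighbours, several classes and missing values, and the optional `ig_weights` argument illustrates the RELIEFF-IG modification of weighting attributes by Information Gain inside the distance:

# Simplified RELIEF-style attribute weighting; names are illustrative.
import random

def diff(a, x1, x2):
    return 0.0 if x1[a] == x2[a] else 1.0            # nominal attributes only

def distance(attrs, x1, x2, ig_weights=None):
    w = ig_weights or {a: 1.0 for a in attrs}        # RELIEFF-IG: IG-weighted distance
    return sum(w[a] * diff(a, x1, x2) for a in attrs)

def relief(examples, attrs, m=100, ig_weights=None):
    weights = {a: 0.0 for a in attrs}
    for _ in range(m):
        feats, label = random.choice(examples)
        hits   = [(f, l) for f, l in examples if l == label and f is not feats]
        misses = [(f, l) for f, l in examples if l != label]
        near_hit  = min(hits,   key=lambda e: distance(attrs, feats, e[0], ig_weights))[0]
        near_miss = min(misses, key=lambda e: distance(attrs, feats, e[0], ig_weights))[0]
        for a in attrs:                              # reward separating misses, punish separating hits
            weights[a] += (diff(a, feats, near_miss) - diff(a, feats, near_hit)) / m
    return weights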
Extensions of DTs (Murthy 95)
• (Pre-/post-) pruning
• Minimizing the effect of the greedy approach: lookahead
• Non-linear splits
• Combination of multiple models
• etc.
Decision Trees and NLP
• Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
• POS tagging (Cardie 93; Schmid 94b; Magerman 95; Màrquez & Rodríguez 95, 97; Màrquez et al. 00)
• Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96)
• Parsing (Magerman 95, 96; Haruno et al. 98, 99)
• Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
• Text summarization (Mani & Bloedorn 98)
• Dialogue act tagging (Samuel et al. 98)
Decision Trees and NLP
• Noun phrase coreference (Aone & Bennett 95; McCarthy & Lehnert 95)
• Discourse analysis in information extraction (Soderland & Lehnert 94)
• Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94)
• Verb classification in Machine Translation (Tanaka 96; Siegel 97)
• More recent applications of DTs to NLP combine them in a boosting framework (we will see this in the following sessions).
Example: POS Tagging using DTs
• "He was shot in the hand as he chased the robbers in the back street" (The Wall Street Journal corpus)
• [The slide marks several words as ambiguous, with alternative tags such as NN/VB and JJ/VB]
POS Tagging using Decision Trees (Màrquez, PhD 1999)
• [figure: tagging pipeline — raw text → morphological analysis → POS tagging → tagged text]
• The POS-tagging step combines a disambiguation algorithm with a language model; here the language model is a base of decision trees and the disambiguation algorithms are RTT, STT, and RELAX.
DT-based Language Modelling: the "preposition-adverb" tree
• [figure: decision tree for the IN/RB ambiguity class]
  • root: P(IN)=0.81, P(RB)=0.19
  • word form ∈ {"As", "as"}: P(IN)=0.83, P(RB)=0.17
  • … and tag(+1) = RB: P(IN)=0.13, P(RB)=0.87
  • … and tag(+2) = IN (leaf): P(IN)=0.013, P(RB)=0.987
• Statistical interpretation:
  • P̂(RB | word ∈ {"As","as"} & tag(+1)=RB & tag(+2)=IN) = 0.987
  • P̂(IN | word ∈ {"As","as"} & tag(+1)=RB & tag(+2)=IN) = 0.013
• Collocations captured by this path: "as_RB much_RB as_IN", "as_RB soon_RB as_IN", "as_RB well_RB as_IN"
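A toy sketch of how such a tree yields the conditional tag distribution for one context; the probabilities are the ones on the slide, the function name is illustrative, and branches not shown on the slide fall back here to the parent node's distribution:

# Toy sketch: querying the "preposition-adverb" tree above for one context.
def prep_adverb_tree(word, tag_plus1, tag_plus2):
    if word not in ("As", "as"):
        return {"IN": 0.81, "RB": 0.19}          # root distribution (other branches omitted)
    if tag_plus1 != "RB":
        return {"IN": 0.83, "RB": 0.17}
    if tag_plus2 != "IN":
        return {"IN": 0.13, "RB": 0.87}
    return {"IN": 0.013, "RB": 0.987}            # leaf: the "as ... as" collocations

print(prep_adverb_tree("as", "RB", "IN"))        # {'IN': 0.013, 'RB': 0.987}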
Language Modelling using DTs
• Granularity? The ambiguity-class level
  • adjective-noun, adjective-noun-verb, etc.
• Algorithm: Top-Down Induction of Decision Trees (TDIDT); supervised learning
  • CART (Breiman et al. 84), C4.5 (Quinlan 95), etc.
• Attributes: local context, the (-3, +2) window of tokens (see the sketch below)
• Particular implementation (minimizing the effect of over-fitting, data fragmentation and sparseness):
  • Branch merging
  • CART post-pruning
  • Smoothing
  • Attributes with many values
  • Several functions for attribute selection
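A minimal sketch of extracting the local-context attributes for one word position, assuming a (-3, +2) window of word forms and tags; the attribute names and out-of-sentence marker are illustrative, not the thesis's:

# Context-window feature extraction for the word at position i.
def context_features(words, tags, i):
    feats = {}
    for offset in range(-3, 3):                      # the (-3, +2) window
        j = i + offset
        if offset == 0:
            feats["word(0)"] = words[i]
            continue
        inside = 0 <= j < len(words)
        feats[f"word({offset:+d})"] = words[j] if inside else "_OUT_"
        feats[f"tag({offset:+d})"]  = tags[j]  if inside else "_OUT_"
    return feats

words = "He was shot in the hand".split()
tags  = ["PRP", "VBD", None, "IN", "DT", "NN"]       # 'shot' is being disambiguated
print(context_features(words, tags, 2))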
Model Evaluation
The Wall Street Journal (WSJ) annotated corpus:
• 1,170,000 words
• Tagset size: 45 tags
• Noise: 2-3% of mistagged words
• 49,000-entry word-form frequency lexicon
  • Manual filtering of the 200 most frequent entries
• 36.4% ambiguous words
• 2.44 (1.52) average tags per word
• 243 ambiguity classes
Model Evaluation
The Wall Street Journal (WSJ) annotated corpus — [figures: number of ambiguity classes that cover x% of the training corpus; arity of the classification problems]
12 Ambiguity Classes
• They cover 57.90% of the ambiguous occurrences!
• Experimental setting: 10-fold cross-validation
N-fold Cross-Validation
Divide the training set S into a partition of N equal-size disjoint subsets: s1, s2, …, sN
for i := 1 to N do
  learn and test a classifier using:
    training_set   := ∪ sj for all j ≠ i
    validation_set := si
end-for
return: the average accuracy over the N experiments
• Which is a good value for N? (2, 10, …)
• Extreme case (N = training-set size): leave-one-out
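A minimal Python sketch of the procedure, assuming hypothetical `train` and `accuracy` callables:

# N-fold cross-validation: train on N-1 folds, validate on the remaining one.
def cross_validate(examples, n, train, accuracy):
    folds = [examples[i::n] for i in range(n)]       # N disjoint subsets
    scores = []
    for i in range(n):
        validation_set = folds[i]
        training_set = [e for j, fold in enumerate(folds) if j != i for e in fold]
        clf = train(training_set)
        scores.append(accuracy(clf, validation_set))
    return sum(scores) / n                           # average over the N runs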
Size: Number of Nodes
• [figure: tree sizes] Average size reduction: 51.7% and 46.5% (74.1% in total)
Accuracy
• [figure: accuracy comparison] (At least) no loss in accuracy
Feature Selection Criteria
• [figure: comparison of feature selection criteria] The criteria are statistically equivalent
DT-based POS Taggers
• Tree base = statistical component:
  • RTT: Reductionistic Tree-based Tagger (Màrquez & Rodríguez 97)
  • STT: Statistical Tree-based Tagger (Màrquez & Rodríguez 99)
• Tree base = compatibility constraints:
  • RELAX: Relaxation-Labelling-based tagger (Màrquez & Padró 97)
RTT (Màrquez & Rodríguez 97)
• [figure: raw text → morphological analysis → disambiguation → tagged text]
• Disambiguation is an iterative loop over the language model (the tree base): classify → filter → update, repeated until a stop condition holds (see the sketch below).
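A schematic sketch of the classify/filter/update loop in the figure; all function names are illustrative and the actual RTT algorithm is more elaborate:

# Reductionistic loop: repeatedly discard unlikely tags until a stop condition holds.
def rtt_disambiguate(sentence, classify, filter_tags, stop):
    # `sentence` is a list of (word, set-of-candidate-tags) pairs.
    while not stop(sentence):
        distributions = [classify(sentence, i)        # tree-based P(tag | context)
                         for i in range(len(sentence))]
        sentence = [(word, filter_tags(tags, dist))   # drop low-probability tags
                    for (word, tags), dist in zip(sentence, distributions)]
    return sentence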
STT (Màrquez & Rodríguez 99)
• N-gram (trigram) model in which the contextual probabilities are estimated using the decision trees.
• Language model: lexical probabilities + contextual probabilities.
• [figure: raw text → morphological analysis → disambiguation (Viterbi algorithm over the language model) → tagged text]
STT+ (Màrquez & Rodríguez 99)
• Language model: N-grams + lexical probabilities + contextual probabilities (estimated with the decision trees).
• [figure: raw text → morphological analysis → disambiguation (Viterbi algorithm) → tagged text]
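A minimal sketch of Viterbi decoding over a lexical + contextual model; for brevity the context is a single previous tag, whereas STT/STT+ condition on richer, tree-estimated contexts, and `p_lex`/`p_ctx` are hypothetical callables:

# Viterbi decoding: best tag sequence under lexical and contextual probabilities.
def viterbi(words, tagset, p_lex, p_ctx):
    # p_lex(word, tag) ~ P(word | tag); p_ctx(tag, prev_tag) ~ P(tag | prev_tag)
    delta = {"<s>": 1.0}                  # best score of a path ending in each tag
    paths = {"<s>": []}                   # best tag sequence ending in each tag
    for word in words:
        new_delta, new_paths = {}, {}
        for tag in tagset:
            prev = max(delta, key=lambda t: delta[t] * p_ctx(tag, t))
            new_delta[tag] = delta[prev] * p_ctx(tag, prev) * p_lex(word, tag)
            new_paths[tag] = paths[prev] + [tag]
        delta, paths = new_delta, new_paths
    best = max(delta, key=delta.get)
    return paths[best]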
RELAX (Màrquez & Padró 97)
• Language model: a set of constraints built from N-grams and linguistic rules (and, as shown next, constraints translated from the trees).
• [figure: raw text → morphological analysis → disambiguation (relaxation labelling, Padró 96) → tagged text]
RELAX (Màrquez & Padró 97): Translating Trees into Constraints
• Each branch of a tree (e.g. the "preposition-adverb" tree above) is translated into positive and negative constraints:
  • Positive constraint:  2.37  (RB) (0 "as" "As") (1 RB) (2 IN)
  • Negative constraint: -5.81  (IN) (0 "as" "As") (1 RB) (2 IN)
• Compatibility values: estimated using Mutual Information
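A toy sketch of serializing one tree path into constraints of the form above; the compatibility values are taken as given (on the slide they are estimated with Mutual Information) and the exact RELAX constraint syntax may differ:

# Serializing a tree branch (a sequence of feature tests) into constraints.
def path_to_constraints(path, compatibilities):
    # path: list of (position, values) tests along the branch
    context = " ".join(f"({pos} {' '.join(vals)})" for pos, vals in path)
    return [f"{value:+.2f} ({tag}) {context}"
            for tag, value in compatibilities.items()]

path = [(0, ('"as"', '"As"')), (1, ("RB",)), (2, ("IN",))]
for c in path_to_constraints(path, {"RB": 2.37, "IN": -5.81}):
    print(c)
# +2.37 (RB) (0 "as" "As") (1 RB) (2 IN)
# -5.81 (IN) (0 "as" "As") (1 RB) (2 IN)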
Experimental Evaluation
Using the WSJ annotated corpus:
• Training set: 1,121,776 words
• Test set: 51,990 words
• Closed vocabulary assumption
• Base of 194 trees
  • Covering 99.5% of the ambiguous occurrences
  • Storage requirement: 565 KB
  • Acquisition time: 12 CPU-hours (Common LISP / Sparc10 workstation)
Experimental Evaluation: RTT results
• 67.52% error reduction with respect to the MFT baseline
• Accuracy = 94.45% (ambiguous words), 97.29% (overall)
  • Comparable to the best state-of-the-art automatic POS taggers
• Recall = 98.22%, Precision = 95.73% (1.08 tags/word)
  • RTT allows a trade-off between precision and recall to be set
Experimental Evaluation: STT and STT+ results
• STT results: comparable to those of RTT
• STT allows the incorporation of N-gram information, so some problems of sparseness and of coherence of the resulting tag sequence can be alleviated
• STT+ results: better than those of RTT and STT
Experimental Evaluation: including trees into RELAX
• Translation of 44 representative trees, covering 84% of the examples, into 8,473 constraints
• Addition of:
  • Bigrams (2,808 binary constraints)
  • Trigrams (52,161 ternary constraints)
  • Linguistically motivated manual constraints (20)
Accuracy of RELAX
• [figure: accuracies of the different constraint combinations, in the 91.35-92.82% range]
• MFT = baseline, B = bigrams, T = trigrams, C = "tree constraints", H = set of 20 hand-written linguistic rules
Decision Trees: Summary
• Advantages:
  • Acquire symbolic knowledge in an understandable way
  • Very well studied ML algorithms, with many variants
  • Can be easily translated into rules
  • Availability of off-the-shelf software: C4.5, C5.0, etc.
  • Can be easily integrated into an ensemble
Decision Trees: Summary
• Drawbacks:
  • Computationally expensive when scaling to large natural-language domains (many training examples, features, etc.)
  • Data sparseness and data fragmentation: the problem of small disjuncts (=> probability estimation)
  • DTs are a high-variance (unstable) model
  • Tendency to overfit the training data: pruning is necessary
  • Require quite a lot of effort in tuning the model