ML: Classical Methods from AI
• Decision-Tree Induction
• Exemplar-Based Learning
• Rule Induction
• TBEDL
Decision Trees
• Decision trees are a way to represent the rules underlying training data, as hierarchical, sequential structures that recursively partition the data.
• They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration, with purposes including description, classification, and generalization.
• From a machine-learning perspective: decision trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes.
• Acquisition: Top-Down Induction of Decision Trees (TDIDT).
• Systems: CART (Breiman et al. 84); ID3, C4.5, C5.0 (Quinlan 86, 93, 98); ASSISTANT, ASSISTANT-R (Cestnik et al. 87; Kononenko et al. 95).
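As an illustration only (scikit-learn's CART-style learner is not one of the systems listed above), a minimal sketch of inducing and applying a decision tree, assuming a toy SIZE/SHAPE/COLOR dataset encoded as integers:

# A minimal sketch (not one of the systems above): inducing and applying a
# decision tree with scikit-learn on an assumed toy dataset.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training data: each row is [SIZE, SHAPE, COLOR] encoded as integers.
X = [[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]]
y = ["neg", "pos", "pos", "neg"]          # mutually exclusive classes

clf = DecisionTreeClassifier(criterion="entropy")   # ID3/C4.5-style split criterion
clf.fit(X, y)

print(export_text(clf, feature_names=["SIZE", "SHAPE", "COLOR"]))  # tree as rules
print(clf.predict([[1, 0, 0]]))           # classify a new object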
An Example
• [figure: a decision tree branching on attributes such as SIZE (small/big), SHAPE (circle/triangle) and COLOR (red/blue), with leaves labelled with the classes pos/neg]
Learning Decision Trees
• [figure: training phase: Training Set + TDIDT → DT; test phase: Example + DT → Class]
DTs: General Induction Algorithm

function TDIDT (X: set-of-examples; A: set-of-features)
  var tree1, tree2: decision-tree;
      X': set-of-examples;
      A': set-of-features
  end-var
  if (stopping_criterion(X)) then
    tree1 := create_leaf_tree(X)
  else
    amax  := feature_selection(X, A);
    tree1 := create_tree(X, amax);
    for-all val in values(amax) do
      X'    := select_examples(X, amax, val);
      A'    := A \ {amax};
      tree2 := TDIDT(X', A');
      tree1 := add_branch(tree1, tree2, val)
    end-for
  end-if
  return(tree1)
end-function
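A minimal runnable Python sketch of the same TDIDT recursion, assuming examples are (features-dict, label) pairs with categorical values, `features` is a set of feature names, and `score` is any of the feature-selection criteria discussed next:

# A minimal Python sketch of TDIDT (not the lecture's implementation).
from collections import Counter

def tdidt(examples, features, score):
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not features:          # stopping criterion
        return Counter(labels).most_common(1)[0][0]    # leaf: majority class
    best = max(features, key=lambda a: score(examples, a))  # feature selection
    node = {"feature": best, "branches": {}}
    for val in {feats[best] for feats, _ in examples}: # one branch per value
        subset = [(f, l) for f, l in examples if f[best] == val]
        node["branches"][val] = tdidt(subset, features - {best}, score)
    return node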
Feature Selection Criteria
• Functions derived from Information Theory:
  • Information Gain, Gain Ratio (Quinlan 86)
• Functions derived from distance measures:
  • Gini Diversity Index (Breiman et al. 84)
  • RLM (López de Mántaras 91)
• Statistically based:
  • Chi-square test (Sestito & Dillon 94)
  • Symmetrical Tau (Zhou & Dillon 91)
• RELIEFF-IG: a variant of RELIEFF (Kononenko 94)
Information Gain (Quinlan, 1979)

Information Gain (2) (Quinlan, 1979)

Gain Ratio (Quinlan, 1986)
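For reference, a minimal sketch of the standard entropy, information-gain, and gain-ratio computations (the lecture's exact formulation may differ); examples are (features, label) pairs as in the TDIDT sketch above:

# Standard textbook definitions; usable as the `score` argument of tdidt().
import math
from collections import Counter

def entropy(examples):
    counts = Counter(label for _, label in examples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, a):
    # H(X) - sum over values v of feature a of |X_v|/|X| * H(X_v)
    total = len(examples)
    gain = entropy(examples)
    for val in {feats[a] for feats, _ in examples}:
        subset = [(f, l) for f, l in examples if f[a] == val]
        gain -= len(subset) / total * entropy(subset)
    return gain

def gain_ratio(examples, a):
    # Normalize the gain by the split information of feature a (Quinlan, 1986)
    total = len(examples)
    counts = Counter(feats[a] for feats, _ in examples)
    split_info = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return information_gain(examples, a) / split_info if split_info else 0.0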
RELIEF (Kira & Rendell, 1992)

RELIEFF (Kononenko, 1994)
RELIEFF-IG (Màrquez, 1999)
• RELIEFF, except that the distance measure used to find the nearest hits/misses does not treat all attributes equally: attributes are weighted according to the IG (Information Gain) measure.
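A much-simplified sketch of the RELIEF weight-update idea (two classes, a single nearest hit/miss, nominal attributes, and every class represented by at least two examples); RELIEFF generalizes this to k neighbours, several classes and missing values, and the optional `ig_weights` argument illustrates the RELIEFF-IG modification of weighting attributes by Information Gain inside the distance:

# Simplified RELIEF-style attribute weighting; names are illustrative.
import random

def diff(a, x1, x2):
    return 0.0 if x1[a] == x2[a] else 1.0            # nominal attributes only

def distance(attrs, x1, x2, ig_weights=None):
    w = ig_weights or {a: 1.0 for a in attrs}        # RELIEFF-IG: IG-weighted distance
    return sum(w[a] * diff(a, x1, x2) for a in attrs)

def relief(examples, attrs, m=100, ig_weights=None):
    weights = {a: 0.0 for a in attrs}
    for _ in range(m):
        feats, label = random.choice(examples)
        hits   = [(f, l) for f, l in examples if l == label and f is not feats]
        misses = [(f, l) for f, l in examples if l != label]
        near_hit  = min(hits,   key=lambda e: distance(attrs, feats, e[0], ig_weights))[0]
        near_miss = min(misses, key=lambda e: distance(attrs, feats, e[0], ig_weights))[0]
        for a in attrs:                              # reward separating misses, punish separating hits
            weights[a] += (diff(a, feats, near_miss) - diff(a, feats, near_hit)) / m
    return weights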
Extensions of DTs (Murthy 95)
• (Pre-/post-) pruning
• Minimizing the effect of the greedy approach: lookahead
• Non-linear splits
• Combination of multiple models
• etc.
Decision Trees and NLP
• Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
• POS tagging (Cardie 93; Schmid 94b; Magerman 95; Màrquez & Rodríguez 95, 97; Màrquez et al. 00)
• Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96)
• Parsing (Magerman 95, 96; Haruno et al. 98, 99)
• Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
• Text summarization (Mani & Bloedorn 98)
• Dialogue act tagging (Samuel et al. 98)
Decision Trees and NLP
• Noun phrase coreference (Aone & Bennett 95; McCarthy & Lehnert 95)
• Discourse analysis in information extraction (Soderland & Lehnert 94)
• Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94)
• Verb classification in Machine Translation (Tanaka 96; Siegel 97)
• More recent applications of DTs to NLP combine them in a boosting framework (we will see this in the following sessions).
Example: POS Tagging using DTs
• "He was shot in the hand as he chased the robbers in the back street" (The Wall Street Journal corpus)
• [The slide marks several words as ambiguous, with alternative tags such as NN/VB and JJ/VB]
POS Tagging using Decision Trees (Màrquez, PhD 1999)
• [figure: tagging pipeline — raw text → morphological analysis → POS tagging → tagged text]
• The POS-tagging step combines a disambiguation algorithm with a language model; here the language model is a base of decision trees and the disambiguation algorithms are RTT, STT, and RELAX.
DT-based Language Modelling: the "preposition-adverb" tree
• [figure: decision tree for the IN/RB ambiguity class]
  • root: P(IN)=0.81, P(RB)=0.19
  • word form ∈ {"As", "as"}: P(IN)=0.83, P(RB)=0.17
  • … and tag(+1) = RB: P(IN)=0.13, P(RB)=0.87
  • … and tag(+2) = IN (leaf): P(IN)=0.013, P(RB)=0.987
• Statistical interpretation:
  • P̂(RB | word ∈ {"As","as"} & tag(+1)=RB & tag(+2)=IN) = 0.987
  • P̂(IN | word ∈ {"As","as"} & tag(+1)=RB & tag(+2)=IN) = 0.013
• Collocations captured by this path: "as_RB much_RB as_IN", "as_RB soon_RB as_IN", "as_RB well_RB as_IN"
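A toy sketch of how such a tree yields the conditional tag distribution for one context; the probabilities are the ones on the slide, the function name is illustrative, and branches not shown on the slide fall back here to the parent node's distribution:

# Toy sketch: querying the "preposition-adverb" tree above for one context.
def prep_adverb_tree(word, tag_plus1, tag_plus2):
    if word not in ("As", "as"):
        return {"IN": 0.81, "RB": 0.19}          # root distribution (other branches omitted)
    if tag_plus1 != "RB":
        return {"IN": 0.83, "RB": 0.17}
    if tag_plus2 != "IN":
        return {"IN": 0.13, "RB": 0.87}
    return {"IN": 0.013, "RB": 0.987}            # leaf: the "as ... as" collocations

print(prep_adverb_tree("as", "RB", "IN"))        # {'IN': 0.013, 'RB': 0.987}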
Language Modelling using DTs
• Granularity? The ambiguity-class level
  • adjective-noun, adjective-noun-verb, etc.
• Algorithm: Top-Down Induction of Decision Trees (TDIDT); supervised learning
  • CART (Breiman et al. 84), C4.5 (Quinlan 95), etc.
• Attributes: local context, the (-3, +2) window of tokens (see the sketch below)
• Particular implementation (minimizing the effect of over-fitting, data fragmentation and sparseness):
  • Branch merging
  • CART post-pruning
  • Smoothing
  • Attributes with many values
  • Several functions for attribute selection
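A minimal sketch of extracting the local-context attributes for one word position, assuming a (-3, +2) window of word forms and tags; the attribute names and out-of-sentence marker are illustrative, not the thesis's:

# Context-window feature extraction for the word at position i.
def context_features(words, tags, i):
    feats = {}
    for offset in range(-3, 3):                      # the (-3, +2) window
        j = i + offset
        if offset == 0:
            feats["word(0)"] = words[i]
            continue
        inside = 0 <= j < len(words)
        feats[f"word({offset:+d})"] = words[j] if inside else "_OUT_"
        feats[f"tag({offset:+d})"]  = tags[j]  if inside else "_OUT_"
    return feats

words = "He was shot in the hand".split()
tags  = ["PRP", "VBD", None, "IN", "DT", "NN"]       # 'shot' is being disambiguated
print(context_features(words, tags, 2))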
Model Evaluation
The Wall Street Journal (WSJ) annotated corpus:
• 1,170,000 words
• Tagset size: 45 tags
• Noise: 2-3% of mistagged words
• 49,000-entry word-form frequency lexicon
  • Manual filtering of the 200 most frequent entries
• 36.4% ambiguous words
• 2.44 (1.52) average tags per word
• 243 ambiguity classes
Model Evaluation
The Wall Street Journal (WSJ) annotated corpus — [figures: number of ambiguity classes that cover x% of the training corpus; arity of the classification problems]
12 Ambiguity Classes
• They cover 57.90% of the ambiguous occurrences!
• Experimental setting: 10-fold cross-validation
N-fold Cross-Validation
Divide the training set S into a partition of N equal-size disjoint subsets: s1, s2, …, sN
for i := 1 to N do
  learn and test a classifier using:
    training_set   := ∪ sj for all j ≠ i
    validation_set := si
end-for
return: the average accuracy over the N experiments
• Which is a good value for N? (2, 10, …)
• Extreme case (N = training-set size): leave-one-out
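A minimal Python sketch of the procedure, assuming hypothetical `train` and `accuracy` callables:

# N-fold cross-validation: train on N-1 folds, validate on the remaining one.
def cross_validate(examples, n, train, accuracy):
    folds = [examples[i::n] for i in range(n)]       # N disjoint subsets
    scores = []
    for i in range(n):
        validation_set = folds[i]
        training_set = [e for j, fold in enumerate(folds) if j != i for e in fold]
        clf = train(training_set)
        scores.append(accuracy(clf, validation_set))
    return sum(scores) / n                           # average over the N runs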
Size: Number of Nodes
• [figure: tree sizes] Average size reduction: 51.7% and 46.5% (74.1% in total)
Accuracy
• [figure: accuracy comparison] (At least) no loss in accuracy
Feature Selection Criteria
• [figure: comparison of feature selection criteria] The criteria are statistically equivalent
DT-based POS Taggers
• Tree base = statistical component:
  • RTT: Reductionistic Tree-based Tagger (Màrquez & Rodríguez 97)
  • STT: Statistical Tree-based Tagger (Màrquez & Rodríguez 99)
• Tree base = compatibility constraints:
  • RELAX: Relaxation-Labelling-based tagger (Màrquez & Padró 97)
RTT (Màrquez & Rodríguez 97)
• [figure: raw text → morphological analysis → disambiguation → tagged text]
• Disambiguation is an iterative loop over the language model (the tree base): classify → filter → update, repeated until a stop condition holds (see the sketch below).
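A schematic sketch of the classify/filter/update loop in the figure; all function names are illustrative and the actual RTT algorithm is more elaborate:

# Reductionistic loop: repeatedly discard unlikely tags until a stop condition holds.
def rtt_disambiguate(sentence, classify, filter_tags, stop):
    # `sentence` is a list of (word, set-of-candidate-tags) pairs.
    while not stop(sentence):
        distributions = [classify(sentence, i)        # tree-based P(tag | context)
                         for i in range(len(sentence))]
        sentence = [(word, filter_tags(tags, dist))   # drop low-probability tags
                    for (word, tags), dist in zip(sentence, distributions)]
    return sentence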
STT (Màrquez & Rodríguez 99)
• N-gram (trigram) model in which the contextual probabilities are estimated using the decision trees.
• Language model: lexical probabilities + contextual probabilities.
• [figure: raw text → morphological analysis → disambiguation (Viterbi algorithm over the language model) → tagged text]
STT+ (Màrquez & Rodríguez 99)
• Language model: N-grams + lexical probabilities + contextual probabilities (estimated with the decision trees).
• [figure: raw text → morphological analysis → disambiguation (Viterbi algorithm) → tagged text]
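A minimal sketch of Viterbi decoding over a lexical + contextual model; for brevity the context is a single previous tag, whereas STT/STT+ condition on richer, tree-estimated contexts, and `p_lex`/`p_ctx` are hypothetical callables:

# Viterbi decoding: best tag sequence under lexical and contextual probabilities.
def viterbi(words, tagset, p_lex, p_ctx):
    # p_lex(word, tag) ~ P(word | tag); p_ctx(tag, prev_tag) ~ P(tag | prev_tag)
    delta = {"<s>": 1.0}                  # best score of a path ending in each tag
    paths = {"<s>": []}                   # best tag sequence ending in each tag
    for word in words:
        new_delta, new_paths = {}, {}
        for tag in tagset:
            prev = max(delta, key=lambda t: delta[t] * p_ctx(tag, t))
            new_delta[tag] = delta[prev] * p_ctx(tag, prev) * p_lex(word, tag)
            new_paths[tag] = paths[prev] + [tag]
        delta, paths = new_delta, new_paths
    best = max(delta, key=delta.get)
    return paths[best]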
RELAX (Màrquez & Padró 97)
• Language model: a set of constraints built from N-grams and linguistic rules (and, as shown next, constraints translated from the trees).
• [figure: raw text → morphological analysis → disambiguation (relaxation labelling, Padró 96) → tagged text]
RELAX (Màrquez & Padró 97): Translating Trees into Constraints
• Each branch of a tree (e.g. the "preposition-adverb" tree above) is translated into positive and negative constraints:
  • Positive constraint:  2.37  (RB) (0 "as" "As") (1 RB) (2 IN)
  • Negative constraint: -5.81  (IN) (0 "as" "As") (1 RB) (2 IN)
• Compatibility values: estimated using Mutual Information
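A toy sketch of serializing one tree path into constraints of the form above; the compatibility values are taken as given (on the slide they are estimated with Mutual Information) and the exact RELAX constraint syntax may differ:

# Serializing a tree branch (a sequence of feature tests) into constraints.
def path_to_constraints(path, compatibilities):
    # path: list of (position, values) tests along the branch
    context = " ".join(f"({pos} {' '.join(vals)})" for pos, vals in path)
    return [f"{value:+.2f} ({tag}) {context}"
            for tag, value in compatibilities.items()]

path = [(0, ('"as"', '"As"')), (1, ("RB",)), (2, ("IN",))]
for c in path_to_constraints(path, {"RB": 2.37, "IN": -5.81}):
    print(c)
# +2.37 (RB) (0 "as" "As") (1 RB) (2 IN)
# -5.81 (IN) (0 "as" "As") (1 RB) (2 IN)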
Experimental Evaluation
Using the WSJ annotated corpus:
• Training set: 1,121,776 words
• Test set: 51,990 words
• Closed vocabulary assumption
• Base of 194 trees
  • Covering 99.5% of the ambiguous occurrences
  • Storage requirement: 565 KB
  • Acquisition time: 12 CPU-hours (Common LISP / Sparc10 workstation)
Experimental Evaluation: RTT results
• 67.52% error reduction with respect to the MFT baseline
• Accuracy = 94.45% (ambiguous words), 97.29% (overall)
  • Comparable to the best state-of-the-art automatic POS taggers
• Recall = 98.22%, Precision = 95.73% (1.08 tags/word)
  • RTT allows a trade-off between precision and recall to be set
Experimental Evaluation: STT and STT+ results
• STT results: comparable to those of RTT
• STT allows the incorporation of N-gram information, so some problems of sparseness and of coherence of the resulting tag sequence can be alleviated
• STT+ results: better than those of RTT and STT
Experimental Evaluation: including trees into RELAX
• Translation of 44 representative trees, covering 84% of the examples, into 8,473 constraints
• Addition of:
  • Bigrams (2,808 binary constraints)
  • Trigrams (52,161 ternary constraints)
  • Linguistically motivated manual constraints (20)
Accuracy of RELAX
• [figure: accuracies of the different constraint combinations, in the 91.35-92.82% range]
• MFT = baseline, B = bigrams, T = trigrams, C = "tree constraints", H = set of 20 hand-written linguistic rules
Decision Trees: Summary
• Advantages:
  • Acquire symbolic knowledge in an understandable way
  • Very well studied ML algorithms, with many variants
  • Can be easily translated into rules
  • Availability of off-the-shelf software: C4.5, C5.0, etc.
  • Can be easily integrated into an ensemble
Decision Trees: Summary
• Drawbacks:
  • Computationally expensive when scaling to large natural-language domains (many training examples, features, etc.)
  • Data sparseness and data fragmentation: the problem of small disjuncts (=> probability estimation)
  • DTs are a high-variance (unstable) model
  • Tendency to overfit the training data: pruning is necessary
  • Require quite a lot of effort in tuning the model