Decision Trees Ruy Luiz Milidiú
Summary • Objective: examine the concept of Decision Trees and their applications • Outline • Surprise and Information • Entropy • Mutual Information • Divergence and Cross-Entropy
Information Gain • H(X,Y) = H(X) + H(Y|X) • H(X,Y) = H(Y) + H(X|Y) • H(X) + H(Y|X) = H(Y) + H(X|Y) • H(X) – H(X|Y) = H(Y) – H(Y|X) • I(X,Y) = H(Y) – H(Y|X) • IG(X) = H(Y) – H(Y|X)
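As a quick numerical check of these identities, here is a minimal Python sketch (not from the slides; the joint distribution p_xy is an arbitrary illustrative choice) verifying H(X,Y) = H(X) + H(Y|X) and I(X,Y) = H(Y) – H(Y|X) on a small 2x2 table.

from math import log2

# p(x, y) for binary X and Y; the numbers are an assumed example, summing to 1
p_xy = {
    (0, 0): 0.30, (0, 1): 0.20,
    (1, 0): 0.10, (1, 1): 0.40,
}

def H(dist):
    # entropy in bits of a dict of probabilities
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# marginals p(x) and p(y)
p_x = {x: sum(p for (xi, _), p in p_xy.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yi), p in p_xy.items() if yi == y) for y in (0, 1)}

# H(Y|X) = sum over x of p(x) . H(Y | X=x)
H_Y_given_X = sum(
    p_x[x] * H({y: p_xy[(x, y)] / p_x[x] for y in (0, 1)})
    for x in (0, 1)
)

print(H(p_xy), H(p_x) + H_Y_given_X)   # chain rule: both print ~1.846
print(H(p_y) - H_Y_given_X)            # I(X,Y) = H(Y) - H(Y|X), here ~0.125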
Attribute Selection (X1,…,Xk,Y) • IG(Xi) = H(Y) – H(Y|Xi), i=1,…,k • best attribute: IG(Xj) = max i=1,…,k {H(Y) – H(Y|Xi)} • IG(Xj) = H(Y) – min i=1,…,k {H(Y|Xi)}
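A sketch of this selection rule in Python, assuming each example is a dict from attribute name to value (the helper names entropy, cond_entropy and best_attribute are mine, not from the slides): since H(Y) is the same for every attribute, maximizing IG(Xi) is the same as minimizing H(Y|Xi).

from collections import Counter
from math import log2

def entropy(labels):
    # H(Y) from a list of class labels
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def cond_entropy(rows, labels, attr):
    # H(Y|X_attr): entropy of the label within each attribute value, weighted by frequency
    n = len(rows)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    return sum(len(ys) / n * entropy(ys) for ys in by_value.values())

def best_attribute(rows, labels, attrs):
    # max IG(X_i)  ==  min H(Y|X_i)
    return min(attrs, key=lambda a: cond_entropy(rows, labels, a))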
…wait for a table? • Alternate: alternative restaurant nearby? • Bar: comfortable bar area to wait in? • Fri/Sat: Friday or Saturday? • Hungry: are we hungry? • Patrons: people in (None, Some, Full) • Price: ($, $$, $$$) • Raining: raining outside? • Reservation: have we made a reservation? • Type: (French, Italian, Thai, Burger) • WaitEstimate: (0-10, 10-30, 30-60, >60)
Learning data • Examples described by attribute values (Boolean, discrete, continuous) • E.g., situations where I will/won't wait for a table: • Classification of examples is positive (T) or negative (F)
Learning data • NO attribute is known • 6 positive (T) and 6 negative (F) • H(Y) = H(6/12) = H(1/2) = 1
Learning data • PATRONS attribute is known • 6 positive (T) and 6 negative (F) • H(Y|Patrons) = ?
Learning data • PATRONS = none • 0 positive (T) and 2 negative (F) • H(Y|none) = H(0/2) = H(0) = 0
Learning data • PATRONS = some • 4 positive (T) and 0 negative (F) • H(Y|some) = H(4/4) = H(1) = 0
Learning data • PATRONS = full • 2 positive (T) and 4 negative (F) • H(Y|full) = H(1/3) = (1/3).lg(3) + (2/3).lg(3/2) = lg(3) – 2/3 = .92
Learning data • PATRONS attribute is known • H(Y|Patrons) = (2/12).H(Y|none) + (4/12).H(Y|some) + (6/12).H(Y|full)
Learning data • PATRONS attribute is known • H(Y|Patrons) = (2/12).(0) + (4/12).(0) + (6/12).(.92) = .46
Choosing an attribute • A good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative" • IG(Patrons) = 1 – .46 = .54 • IG(Type) = 1 – 1 = 0
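Both gains can be reproduced directly from the counts (a small sketch; the Type counts, 2 French / 2 Italian / 4 Thai / 4 Burger, each split evenly between T and F, are taken as given from the standard 12-example restaurant data):

from math import log2

def H(p):
    # binary entropy in bits
    return 0.0 if p in (0.0, 1.0) else -(p*log2(p) + (1-p)*log2(1-p))

H_Y = H(6/12)                                          # 1.0
H_Y_patrons = (2/12)*H(0/2) + (4/12)*H(4/4) + (6/12)*H(2/6)
H_Y_type    = (2/12)*H(1/2) + (2/12)*H(1/2) + (4/12)*H(2/4) + (4/12)*H(2/4)

print(round(H_Y - H_Y_patrons, 2))   # IG(Patrons) = 0.54
print(round(H_Y - H_Y_type, 2))      # IG(Type)    = 0.0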
Learned tree • Decision tree learned from the 12 examples • Substantially simpler than the "true" tree; a more complex hypothesis isn't justified by the small amount of data
Decision tree learning • Aim: find a small tree consistent with the training examples • Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree (divide-and-conquer)
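A minimal recursive learner in this spirit (a greedy, ID3-style sketch under the same dict-per-example assumption as before; the name learn_tree is mine, not the lecture's exact algorithm):

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def learn_tree(rows, labels, attrs):
    # rows: list of dicts {attribute: value}; labels: parallel list of classes
    if len(set(labels)) == 1:                  # pure node: return the class
        return labels[0]
    if not attrs:                              # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]

    def h_cond(a):                             # H(Y | X_a) on this subset
        n, groups = len(rows), {}
        for r, y in zip(rows, labels):
            groups.setdefault(r[a], []).append(y)
        return sum(len(ys) / n * entropy(ys) for ys in groups.values())

    best = min(attrs, key=h_cond)              # the "most significant" attribute
    tree = {}
    for value in set(r[best] for r in rows):   # one branch per observed value
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*sub)
        tree[value] = learn_tree(list(sub_rows), list(sub_labels),
                                 [a for a in attrs if a != best])
    return (best, tree)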
Worst Case Complexity • n examples, k attributes • complete tree of height k • n.(k–l) total work per tree level l • n.[k + (k–1) + … + 1] total work • O(n.k²)
Learning subset • PATRONS = full • 2 positive (T) and 4 negative (F) • H(Y|full) = H(1/3) = (1/3).lg(3) + (2/3).lg(3/2) = lg(3) – 2/3 = .92
Learning subset • PATRONS = full, TYPE is known • 2 positive (T) and 4 negative (F) • H(Y|full,Type) = (2/6).H(1/2) + (1/6).H(0) + (2/6).H(1/2) + (1/6).H(0)
Learning subset • PATRONS = full, TYPE is known • 2 positive (T) and 4 negative (F) • H(Y|full,Type) = (2/6).(1) + (1/6).(0) + (2/6).(1) + (1/6).(0) = .66
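A one-line check of this substitution (H here is the binary entropy helper assumed in the earlier sketch):

from math import log2
H = lambda p: 0.0 if p in (0.0, 1.0) else -(p*log2(p) + (1-p)*log2(1-p))
print((2/6)*H(1/2) + (1/6)*H(0.0) + (2/6)*H(1/2) + (1/6)*H(0.0))   # 0.666..., i.e. 2/3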
Naive-DT • H(Y|full,thai) = –P(T|full,thai).lg( P(T|full,thai) ) – P(F|full,thai).lg( P(F|full,thai) ) • P(T|full,thai) = P(thai|full,T).P(T|full) / P(thai|full) • P(F|full,thai) = P(thai|full,F).P(F|full) / P(thai|full) • P(T|full,thai) ∝ P(thai|full,T).P(T|full) • P(F|full,thai) ∝ P(thai|full,F).P(F|full) • P(thai|full) = P(thai|full,T).P(T|full) + P(thai|full,F).P(F|full)
Naive-DT • Initial Set Counts • Naive Bayes Hypothesis • P(thai|full,T) = P(thai|T) = P(thai,T) / P(T) • P(thai|full,F) = P(thai|F) = P(thai,F) / P(F) • P(T|full,thai) ∝ P(thai|T).P(T|full) = (2/6).(2/6) = 4/36 • P(F|full,thai) ∝ P(thai|F).P(F|full) = (2/6).(4/6) = 8/36 • P(thai|full) = (4/36) + (8/36) = 12/36 = 1/3 • H(Y|full,thai) = H(4/12) = H(1/3) = .92 • H(Y|full,Type) = P(thai|full).H(Y|full,thai) + ... • H(Y|full,Type) = (1/3).(.92) + ... • Subset Counts
Naive-DT • Initial Set Counts • Naive Bayes Hypothesis • P(french|full,T) = P(french|T) = P(french,T) / P(T) • P(french|full,F) = P(french|F) = P(french,F) / P(F) • P(T|full,french) ∝ P(french|T).P(T|full) = (1/6).(2/6) = 2/36 • P(F|full,french) ∝ P(french|F).P(F|full) = (1/6).(4/6) = 4/36 • P(french|full) = (2/36) + (4/36) = 6/36 = 1/6 • H(Y|full,french) = H(2/6) = H(1/3) = .92 • H(Y|full,Type) = P(thai|full).H(Y|full,thai) + P(french|full).H(Y|full,french) + ... • H(Y|full,Type) = (1/3).(.92) + (1/6).(.92) + ... • Subset Counts • O(m.k)
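Putting the Naive-DT estimate for the thai value into a short sketch (counts as given above; the helper H and variable names are mine):

from math import log2

def H(p):
    return 0.0 if p in (0.0, 1.0) else -(p*log2(p) + (1-p)*log2(1-p))

# counts from the slides: thai occurs in 2 of 6 positive and 2 of 6 negative
# examples overall; within Patrons = full there are 2 T and 4 F
p_thai_T, p_thai_F = 2/6, 2/6
p_T_full, p_F_full = 2/6, 4/6

num_T = p_thai_T * p_T_full          # proportional to P(T | full, thai)
num_F = p_thai_F * p_F_full          # proportional to P(F | full, thai)
p_thai_full = num_T + num_F          # the normalizer P(thai | full) = 1/3

p_T = num_T / p_thai_full            # estimated P(T | full, thai) = 1/3
print(round(H(p_T), 2))              # 0.92, the per-value entropy estimate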
Initial Counting

for classe = 1..c
  total[classe] = 0
  for feature = 1..k
    for valor = 1..v[feature]
      count[feature, valor, classe] = 0

for i = 1..n
  total[y[i]] ++
  for feature = 1..k
    count[feature, x[i,feature], y[i]] ++
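A direct Python rendering of this counting pass (the data layout, x as an n×k table of value indices and y as a list of class ids, is an assumption for the sketch):

def initial_counting(x, y, num_classes, num_values):
    # total[c]       = number of examples with class c
    # count[f][v][c] = number of examples with feature f equal to value v and class c
    k = len(num_values)
    total = [0] * num_classes
    count = [[[0] * num_classes for _ in range(num_values[f])] for f in range(k)]
    for xi, yi in zip(x, y):
        total[yi] += 1
        for f in range(k):
            count[f][xi[f]][yi] += 1
    return total, count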
Decision trees • the UNKNOWN “true” tree
Expressiveness • Decision trees can express any function of the input attributes. • E.g., for Boolean functions, truth table row → path to leaf: • Trivially, there is a consistent decision tree for any training set, with one path to leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples • Prefer to find more compact decision trees
Hypothesis spaces • How many distinct decision trees with n Boolean attributes? = number of Boolean functions = number of distinct truth tables with 2^n rows = 2^(2^n) • E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees • How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)? • Each attribute can be in (positive), in (negative), or out ⇒ 3^n distinct conjunctive hypotheses • More expressive hypothesis space • increases chance that target function can be expressed • increases number of hypotheses consistent with training set ⇒ may get worse predictions
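Both counts are easy to check numerically (a trivial sketch):

n = 6
print(2 ** (2 ** n))   # 18446744073709551616 Boolean functions / distinct trees
print(3 ** n)          # 729 purely conjunctive hypotheses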