Decision Trees and more!
Learning OR with few attributes • Target function: OR of k literals • Goal: learn in time polynomial in k and log n, with ε and δ constants • ELIM makes "slow" progress: it might disqualify only one literal per round • Might remain with O(n) candidate literals
ELIM: Algorithm for learning OR • Keep a list of all candidate literals • For every example whose classification is 0: • Erase all the literals that evaluate to 1 on it. • Correctness: • Our hypothesis h: an OR of our set of literals. • Our set of literals includes the target OR's literals. • Every time h predicts zero: we are correct. • Sample size: • m > (1/ε) ln(3^n/δ) = O(n/ε + (1/ε) ln(1/δ))
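A minimal Python sketch of ELIM under an assumed encoding (examples as 0/1 tuples, literals as (index, sign) pairs); the function names are illustrative, not from the lecture.

```python
# ELIM sketch: learn an OR of literals from labeled examples.
# An example is (x, y): x is a tuple in {0,1}^n, y is the label in {0,1}.
# A literal is (i, sign): sign=1 means x_i, sign=0 means NOT x_i.

def literal_value(lit, x):
    i, sign = lit
    return x[i] if sign == 1 else 1 - x[i]

def elim(examples, n):
    # Start with all 2n candidate literals.
    literals = {(i, s) for i in range(n) for s in (0, 1)}
    for x, y in examples:
        if y == 0:
            # Negative example: any literal that is 1 on x cannot be in the target OR.
            literals = {lit for lit in literals if literal_value(lit, x) == 0}
    return literals

def predict(literals, x):
    # Hypothesis h: OR of the surviving literals.
    return 1 if any(literal_value(lit, x) == 1 for lit in literals) else 0
```

Since the surviving set always contains the target's literals, h can only err by predicting 1 on a negative example; whenever it predicts 0 it is correct, matching the correctness argument above.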
Set Cover - Definition • Input: S1, …, St with Si ⊆ U • Output: Si1, …, Sik such that ∪j Sij = U • Question: Are there k sets that cover U? • NP-complete
Set Cover: Greedy algorithm • j=0; Uj=U; C=∅ • While Uj ≠ ∅ • Let Si be arg max |Si ∩ Uj| • Add Si to C • Let Uj+1 = Uj − Si • j = j+1
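The same greedy loop as a short Python sketch (names and the input representation are illustrative):

```python
def greedy_set_cover(universe, sets):
    """universe: iterable of elements; sets: dict mapping a set id to a set of elements."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        # Pick the set covering the most still-uncovered elements.
        best = max(sets, key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the given sets do not cover the universe")
        cover.append(best)
        uncovered -= sets[best]
    return cover

# Example usage: covers {1,...,6} with two sets.
print(greedy_set_cover({1, 2, 3, 4, 5, 6},
                       {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}))
```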
Set Cover: Greedy Analysis • At termination, C is a cover. • Assume there is a cover C* of size k. • C* is a cover for every Uj • Some S in C* covers at least |Uj|/k elements of Uj • Analysis of Uj: |Uj+1| ≤ |Uj| − |Uj|/k • Solving the recursion. • Number of sets j ≤ k ln(|U|+1)
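Solving the recursion, as a short derivation (a standard argument, not verbatim from the slides):

```latex
% |U_{j+1}| \le |U_j| - |U_j|/k, so
\[
|U_j| \;\le\; \Bigl(1 - \tfrac{1}{k}\Bigr)^{j} |U| \;\le\; e^{-j/k}\,|U| .
\]
% Taking j = k \ln(|U|+1) gives |U_j| \le |U|/(|U|+1) < 1, i.e. U_j = \emptyset,
% so the greedy algorithm uses at most k \ln(|U|+1) sets.
```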
Building an Occam algorithm • Given a sample T of size m • Run ELIM on T • Let LIT be the set of remaining literals • Assume there exist k literals in LIT that classify all of the sample T correctly • Negative examples T-: • any subset of LIT classifies T- correctly
Building an Occam algorithm • Positive examples T+ • Search for a small subset of LIT which classifies T+ correctly • For a literal z build Sz = {x ∈ T+ | x satisfies z} • Our assumption: there are k sets that cover T+ • Greedy finds ≤ k ln m sets that cover T+ • Output h = OR of the chosen k ln m literals • size(h) ≤ k ln m · log(2n) bits • Sample size m = O(k log n · log(k log n))
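A sketch of the combined Occam algorithm, reusing the elim, literal_value and greedy_set_cover sketches above (all illustrative names, under the same assumed encoding):

```python
def occam_or(examples, n):
    # Step 1: ELIM leaves only literals consistent with the negative examples.
    candidates = elim(examples, n)
    positives = [x for x, y in examples if y == 1]
    # Step 2: for each surviving literal z, S_z = indices of positives that satisfy z.
    sets = {z: {j for j, x in enumerate(positives) if literal_value(z, x) == 1}
            for z in candidates}
    # Step 3: greedily cover the positive examples; roughly k*ln(m) literals suffice.
    chosen = greedy_set_cover(set(range(len(positives))), sets)
    return chosen  # hypothesis: OR of the chosen literals
```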
k-DNF • Definition: a disjunction of terms, each of at most k literals • Example: term T = x3 ∧ x1 ∧ x5 • DNF: T1 ∨ T2 ∨ T3 ∨ T4
Learning k-DNF • Extended input: • For each AND of k literals define a "new" input T • Example: T = x3 ∧ x1 ∧ x5 • Number of new inputs: at most (2n)^k • Can compute the new inputs easily in time k(2n)^k • The k-DNF is an OR over the new inputs. • Run the ELIM algorithm over the new inputs. • Sample size: O((2n)^k/ε + (1/ε) ln(1/δ)) • Running time: same.
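A sketch of the input expansion: every AND of at most k literals becomes one new 0/1 coordinate, and ELIM can then be run on the expanded examples (illustrative code, not the lecture's):

```python
from itertools import combinations, product

def all_terms(n, k):
    """Enumerate all ANDs of at most k literals over n variables."""
    for size in range(1, k + 1):
        for idxs in combinations(range(n), size):
            for signs in product((0, 1), repeat=size):
                yield tuple(zip(idxs, signs))  # a term = tuple of (index, sign) literals

def expand(x, terms):
    """Map x in {0,1}^n to the truth values of all the terms."""
    return tuple(int(all((x[i] if s == 1 else 1 - x[i]) == 1 for i, s in t))
                 for t in terms)

# Usage: terms = list(all_terms(n, k)); run ELIM on [(expand(x, terms), y) for x, y in sample].
```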
Learning Decision Lists • Definition: an ordered list of rules, each testing a single literal; the first rule whose literal is satisfied determines the output, and a default value ends the list. • [Figure: example decision list testing literals x4, x7, x1 in order, with branch labels 0/1 and output values +1/−1]
Learning Decision Lists • Similar to ELIM. • Input: a sample S of size m. • While S not empty: • For each literal z build Tz = {x ∈ S | x satisfies z} • Find a nonempty Tz whose examples all have the same classification • Add z (with that classification) to the decision list • Update S = S − Tz
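A Python sketch of this greedy decision-list learner; the representation and names are assumptions, with labels in {−1, +1} as in the figure.

```python
def lit_val(z, x):
    i, sign = z
    return x[i] if sign == 1 else 1 - x[i]

def learn_decision_list(sample, n):
    """sample: list of (x, y) with x in {0,1}^n and y in {-1, +1}."""
    literals = [(i, s) for i in range(n) for s in (0, 1)]
    remaining = list(sample)
    rules = []
    while remaining:
        for z in literals:
            T_z = [(x, y) for x, y in remaining if lit_val(z, x) == 1]
            labels = {y for _, y in T_z}
            if T_z and len(labels) == 1:
                # All examples satisfying z agree: emit the rule "if z then label".
                rules.append((z, labels.pop()))
                remaining = [(x, y) for x, y in remaining if lit_val(z, x) == 0]
                break
        else:
            raise ValueError("sample is not consistent with any decision list")
    return rules

def dl_predict(rules, x, default=-1):
    for z, label in rules:
        if lit_val(z, x) == 1:
            return label
    return default
```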
DL algorithm: correctness • The output decision list is consistent. • Number of decision lists: • Length ≤ n+1 • Node: 2n literals • Leaf: 2 values • Total bound: (2·2n)^(n+1) • Sample size: • m = O(n log n/ε + (1/ε) ln(1/δ))
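Plugging the counting bound into the standard sample bound for consistent hypotheses (a routine step, shown for completeness):

```latex
\[
|H| \le (2\cdot 2n)^{\,n+1}
\quad\Longrightarrow\quad
m \;\ge\; \frac{1}{\varepsilon}\Bigl(\ln|H| + \ln\tfrac{1}{\delta}\Bigr)
      \;=\; O\!\Bigl(\frac{n\log n}{\varepsilon} + \frac{1}{\varepsilon}\ln\tfrac{1}{\delta}\Bigr).
\]
```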
k-DL • [Figure: example 2-DL whose nodes test conjunctions such as x4 ∧ x2, x3 ∧ x1 and x5 ∧ x7, with branch labels 0/1 and leaf values +1/−1] • Each node is a conjunction of k literals • Includes k-DNF (and k-CNF)
Learning k-DL • Extended input: • For each AND of k literals define a "new" input • Example: T = x3 ∧ x1 ∧ x5 • Number of new inputs: at most (2n)^k • Can compute the new inputs easily in time k(2n)^k • The k-DL is a DL over the new inputs. • Run the DL algorithm over the new inputs. • Sample size • Running time
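A two-step sketch combining the earlier sketches (all names illustrative): expand each example to its term coordinates, then run the plain DL learner on the new inputs.

```python
def learn_k_dl(sample, n, k):
    terms = list(all_terms(n, k))                        # from the k-DNF sketch
    extended = [(expand(x, terms), y) for x, y in sample]
    # A k-DL over x is an ordinary DL over the extended coordinates.
    # (The learner may also use negated term coordinates; harmless for a sketch.)
    return terms, learn_decision_list(extended, len(terms))
```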
Open Problems • Attribute Efficient: • Decision list: very limited results • Parity functions: negative? • k-DNF and k-DL
Decision Trees • [Figure: example decision tree with internal nodes x1 and x6, branches labeled 0/1, and leaves +1/−1]
Learning Decision Trees Using DL • Consider a decision tree T of size r. • Theorem: there exists a log(r+1)-DL L that computes T. • Claim: there exists a leaf in T of depth at most log(r+1). • Learn a Decision Tree using a Decision List • Running time: n^(log s) • n = number of attributes • s = tree size.
Decision Trees • [Figure: example decision tree with threshold tests x1 > 5 and x6 > 2 at internal nodes and leaves +1/−1]
Decision Trees: Basic Setup • Basic class of hypotheses H. • Input: sample of examples • Output: decision tree • Each internal node: a predicate from H • Each leaf: a classification value • Goal (Occam's Razor): • Small decision tree • Classifies all (most) examples correctly.
Decision Tree: Why? • Efficient algorithms: • Construction. • Classification • Performance: Comparable to other methods • Software packages: • CART • C4.5 and C5
Decision Trees: This Lecture • Algorithms for constructing DT • A theoretical justification • Using boosting • Future lecture: • DT pruning.
Decision Trees Algorithm: Outline • A natural recursive procedure. • Decide a predicate h at the root. • Split the data using h • Build right subtree (for h(x)=1) • Build left subtree (for h(x)=0) • Running time: • T(s) = O(s) + T(s+) + T(s-) = O(s log s) • s = tree size
DT: Selecting a Predicate • Basic setting: a predicate h splits node v into children v1 (h=0) and v2 (h=1) • Notation: q = Pr[f=1], u = Pr[h=0], 1−u = Pr[h=1], p = Pr[f=1 | h=0], r = Pr[f=1 | h=1] • Clearly: q = u·p + (1−u)·r • [Figure: node v with predicate h and children v1, v2 labeled with these probabilities]
Potential function: setting • Compare predicates using a potential function. • Inputs: q, u, p, r • Output: a value • Node dependent: • For each node and predicate assign a value. • Given a split: u·val(v1) + (1−u)·val(v2) • For a tree: weighted sum over the leaves.
PF: classification error • Let val(q) = min{q, 1−q} • This is the classification error. • The average potential only drops • Termination: • When the average is zero • Perfect classification
PF: classification error • [Figure: node v split by h; q = Pr[f=1] = 0.8, u = Pr[h=0] = 0.5, 1−u = Pr[h=1] = 0.5, p = Pr[f=1 | h=0] = 0.6, r = Pr[f=1 | h=1] = 1] • Is this a good split? • Initial error: min{0.8, 0.2} = 0.2 • After the split: 0.4·(1/2) + 0·(1/2) = 0.2 • No drop in the potential, even though the split is informative
Potential Function: requirements • When zero: perfect classification. • Strictly concave.
Potential Function: requirements • Every change is an improvement: by strict concavity, val(q) > u·val(p) + (1−u)·val(r) whenever p ≠ r • [Figure: the chord from (p, val(p)) to (r, val(r)) lies below the curve at q = u·p + (1−u)·r]
Potential Functions: Candidates • Potential functions: • val(q) = Gini(q) = 2q(1−q) (CART) • val(q) = entropy(q) = −q log q − (1−q) log(1−q) (C4.5) • val(q) = sqrt{2q(1−q)} • Assumptions: • Symmetric: val(q) = val(1−q) • Concave • val(0) = val(1) = 0 and val(1/2) = 1
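The three candidates as code, following the formulas exactly as written on the slide (in particular, the square-root variant's normalization is taken from the slide):

```python
import math

def gini(q):                      # CART
    return 2 * q * (1 - q)

def entropy(q):                   # C4.5; convention 0*log(0) = 0
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def sqrt_criterion(q):            # the square-root criterion from the slide
    return math.sqrt(2 * q * (1 - q))

# All three are symmetric around q = 1/2 and vanish at q = 0 and q = 1.
for q in (0.0, 0.1, 0.3, 0.5):
    print(q, gini(q), entropy(q), sqrt_criterion(q))
```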
DT: Construction Algorithm • Procedure DT(S), where S is the sample: • If all the examples in S have the same classification b: • Create a leaf of value b and return • For each h compute val(h,S) • val(h,S) = u_h·val(p_h) + (1−u_h)·val(r_h) • Let h' = arg min_h val(h,S) • Split S using h' into S0 and S1 • Recursively invoke DT(S0) and DT(S1)
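A sketch of Procedure DT in Python, using one of the potential functions above as val; the representation of examples and predicates is an assumption for illustration.

```python
def dt(sample, predicates, val):
    """sample: list of (x, y) with y in {0, 1};
    predicates: list of functions h(x) -> 0/1."""
    labels = {y for _, y in sample}
    if len(labels) == 1:                     # all examples share classification b
        return ("leaf", labels.pop())
    best_h, best_score = None, None
    for h in predicates:
        s0 = [(x, y) for x, y in sample if h(x) == 0]
        s1 = [(x, y) for x, y in sample if h(x) == 1]
        if not s0 or not s1:
            continue                         # h does not split the sample
        u = len(s0) / len(sample)            # empirical Pr[h=0]
        p = sum(y for _, y in s0) / len(s0)  # empirical Pr[f=1 | h=0]
        r = sum(y for _, y in s1) / len(s1)  # empirical Pr[f=1 | h=1]
        score = u * val(p) + (1 - u) * val(r)
        if best_score is None or score < best_score:
            best_h, best_score = h, score
    if best_h is None:                       # no predicate splits: majority leaf
        return ("leaf", int(sum(y for _, y in sample) * 2 >= len(sample)))
    s0 = [(x, y) for x, y in sample if best_h(x) == 0]
    s1 = [(x, y) for x, y in sample if best_h(x) == 1]
    return ("node", best_h, dt(s0, predicates, val), dt(s1, predicates, val))
```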
DT: Analysis • Potential function: • val(T) = Σ_{v leaf of T} Pr[v]·val(q_v) • For simplicity: use the true probabilities • Bounding the classification error: • error(T) ≤ val(T) • Study how fast val(T) drops • Given a tree T, define T(l,h) as the tree obtained from T by splitting leaf l with predicate h
Top-Down algorithm • Input: s = size; H = predicates; val() • T_0 = single-leaf tree • For t from 0 to s−1 do: • Let (l,h) = arg max_{(l,h)} {val(T_t) − val(T_t(l,h))} • T_{t+1} = T_t(l,h)
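The same idea in the iterative form the analysis uses, as a sketch: keep the current leaves and, for s steps, perform the single (leaf, predicate) split that lowers val(T) the most (names and representation are assumptions).

```python
def top_down(sample, predicates, val, s):
    m = len(sample)
    def weighted_val(leaf):                # Pr[v] * val(q_v) for a leaf's examples
        q = sum(y for _, y in leaf) / len(leaf)
        return (len(leaf) / m) * val(q)

    leaves = [list(sample)]                # T_0: a single leaf holding all examples
    for _ in range(s):
        best = None                        # (drop, leaf index, split halves)
        for i, leaf in enumerate(leaves):
            for h in predicates:
                s0 = [(x, y) for x, y in leaf if h(x) == 0]
                s1 = [(x, y) for x, y in leaf if h(x) == 1]
                if not s0 or not s1:
                    continue
                drop = weighted_val(leaf) - (weighted_val(s0) + weighted_val(s1))
                if best is None or drop > best[0]:
                    best = (drop, i, s0, s1)
        if best is None:
            break                          # no leaf can be split any further
        _, i, s0, s1 = best
        leaves[i:i + 1] = [s0, s1]         # replace leaf l by its two children
    return leaves
```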
Theoretical Analysis • Assume H satisfies the weak learning hypothesis: • For each D there is an h s.t. error(h) < 1/2 − γ • Show that every step gives a significant drop in val(T) • Results are weaker than AdaBoost • But the algorithm was never intended to do it! • Use weak learning to show a large drop in val(T) at each step • Modify the initial distribution to be unbiased.
Theoretical Analysis • Let val(q) = 2q(1−q) • Local drop at a node: at least 16γ²[q(1−q)]² • Claim: at every step t there is a leaf l s.t.: • Pr[l] ≥ ε_t/2t • error(l) = min{q_l, 1−q_l} ≥ ε_t/2 • where ε_t is the error at stage t • Proof!
Theoretical Analysis • Drop at time t at least: • Pr[l]·γ²·[q_l(1−q_l)]² ≥ γ²·ε_t³/t (up to constant factors) • For the Gini index val(q) = 2q(1−q): • min{q, 1−q} ≥ q(1−q) = val(q)/2 • Drop at least Ω(γ²·[val(T_t)]³/t)
Theoretical Analysis • Need to solve when val(T_k) < ε. • Bound k. • Time: exp{O(1/γ² · 1/ε²)}
Something to think about • AdaBoost: very good bounds • DT with the Gini index: exponential • Comparable results in practice • How can it be?