
Agnostically Learning Decision Trees

This presentation discusses agnostically learning decision trees using l1 regression and the gradient-projection method, giving a polynomial-time algorithm for sparse l1 regression.



Presentation Transcript


  1. Agnostically Learning Decision Trees. Parikshit Gopalan (MSR Silicon Valley, IITB '00), Adam Tauman Kalai (MSR New England), Adam R. Klivans (UT Austin). [Title slide shows a small decision tree with internal nodes X1, X2, X3 and 0/1 leaves.]

  2. Computational Learning

  3. Computational Learning

  4. Computational Learning. f: {0,1}^n → {0,1}. Examples: (x, f(x)). Learning: predict f from examples.

  5. Valiant's Model. f: {0,1}^n → {0,1}. Examples: (x, f(x)). Halfspaces: [figure of + and − labeled points separated by a halfspace]. Assumption: f comes from a nice concept class.

  6. Valiant's Model. f: {0,1}^n → {0,1}. Examples: (x, f(x)). Decision trees: [figure of a decision tree with internal nodes X1, X2, X3 and 0/1 leaves]. Assumption: f comes from a nice concept class.

  7. The Agnostic Model [Kearns-Schapire-Sellie'94]. f: {0,1}^n → {0,1}. Examples: (x, f(x)). Decision trees: [decision-tree figure]. No assumptions about f. The learner should do as well as the best decision tree.

  8. The Agnostic Model [Kearns-Schapire-Sellie'94]. Examples: (x, f(x)). Decision trees: [decision-tree figure]. No assumptions about f. The learner should do as well as the best decision tree.

  9. Agnostic Model = Noisy Learning. f: {0,1}^n → {0,1}. [figure: decision tree + noise = received function f.] • Concept: message. • Truth table: encoding. • Function f: received word. • Coding: recover the message. • Learning: predict f.

  10. Uniform Distribution Learning for Decision Trees. • Noiseless setting: • No queries: n^{log n} [Ehrenfeucht-Haussler'89]. • With queries: poly(n) [Kushilevitz-Mansour'91]. • Agnostic setting: polynomial time, uses queries [G.-Kalai-Klivans'08]. Reconstruction for sparse real polynomials in the l1 norm.

  11. The Fourier Transform Method. • Powerful tool for uniform-distribution learning. • Introduced by Linial-Mansour-Nisan. • Small-depth circuits [Linial-Mansour-Nisan'89]. • DNFs [Jackson'95]. • Decision trees [Kushilevitz-Mansour'94, O'Donnell-Servedio'06, G.-Kalai-Klivans'08]. • Halfspaces, intersections [Klivans-O'Donnell-Servedio'03, Kalai-Klivans-Mansour-Servedio'05]. • Juntas [Mossel-O'Donnell-Servedio'03]. • Parities [Feldman-G.-Khot-Ponnuswami'06].

  12. The Fourier Polynomial. • Let f: {-1,1}^n → {-1,1}. • Write f as a polynomial. • AND: ½ + ½X1 + ½X2 − ½X1X2. • Parity: X1X2. • Parity of α ⊆ [n]: χ_α(x) = ∏_{i∈α} X_i. • Write f(x) = Σ_α c(α)·χ_α(x), where Σ_α c(α)² = 1. [figure: f written in the standard basis and in the basis of parities.]

  13. The Fourier Polynomial. • Let f: {-1,1}^n → {-1,1}. • Write f as a polynomial. • AND: ½ + ½X1 + ½X2 − ½X1X2. • Parity: X1X2. • Parity of α ⊆ [n]: χ_α(x) = ∏_{i∈α} X_i. • Write f(x) = Σ_α c(α)·χ_α(x), where Σ_α c(α)² = 1. • c(α)²: the weight of α.
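
As a concrete illustration of this expansion, here is a brute-force computation of the coefficients c(α) for the slide's AND example (a sketch with illustrative names; it enumerates all 2^n inputs, so it is only for tiny functions):

```python
import itertools
from math import prod

def fourier_coefficients(f, n):
    """All coefficients c(alpha) = E_x[f(x) * chi_alpha(x)] of f: {-1,1}^n -> {-1,1},
    computed by brute force over all 2^n inputs."""
    points = list(itertools.product([-1, 1], repeat=n))
    subsets = itertools.chain.from_iterable(
        itertools.combinations(range(n), k) for k in range(n + 1))
    return {alpha: sum(f(x) * prod(x[i] for i in alpha) for x in points) / len(points)
            for alpha in subsets}

# The slide's AND example (reading -1 as TRUE): f(x) = -1 iff x1 = x2 = -1.
AND = lambda x: -1 if x[0] == -1 and x[1] == -1 else 1
print(fourier_coefficients(AND, 2))
# {(): 0.5, (0,): 0.5, (1,): 0.5, (0, 1): -0.5}, i.e. 1/2 + 1/2 X1 + 1/2 X2 - 1/2 X1X2
```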

  14. Low-Degree Functions. • Low-degree functions: most of the weight lies on small subsets. • Examples: halfspaces, small-depth circuits. • Low-degree algorithm [Linial-Mansour-Nisan]: finds the low-degree Fourier coefficients. Least-squares regression: find a low-degree P minimizing E_x[ |P(x) − f(x)|² ].
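
A minimal sketch of the low-degree algorithm, assuming uniformly random labeled examples; the function and variable names are illustrative:

```python
import itertools
import random
from math import prod

def low_degree_algorithm(examples, n, d):
    """Estimate every Fourier coefficient of degree <= d from uniform examples (x, f(x)):
    c(alpha) ~ (1/m) * sum_j f(x_j) * chi_alpha(x_j).  There are only O(n^d) of them,
    and the degree-d polynomial with these coefficients is the least-squares fit."""
    m = len(examples)
    return {alpha: sum(y * prod(x[i] for i in alpha) for x, y in examples) / m
            for k in range(d + 1) for alpha in itertools.combinations(range(n), k)}

# Usage sketch: majority of 3 bits; each singleton coefficient comes out near 1/2.
random.seed(0)
maj = lambda x: 1 if sum(x) > 0 else -1
xs = [tuple(random.choice([-1, 1]) for _ in range(3)) for _ in range(4000)]
print(low_degree_algorithm([(x, maj(x)) for x in xs], n=3, d=1))
```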

  15. Sparse Functions. • Sparse functions: most of the weight lies on a few subsets. • Example: decision trees; t leaves ⇒ O(t) subsets. • Sparse algorithm [Kushilevitz-Mansour'91]. Sparse l2 regression: find a t-sparse P minimizing E_x[ |P(x) − f(x)|² ].

  16. Sparse l2 Regression. • Sparse functions: most of the weight lies on a few subsets. • Example: decision trees; t leaves ⇒ O(t) subsets. • Sparse algorithm [Kushilevitz-Mansour'91]. Sparse l2 regression: find a t-sparse P minimizing E_x[ |P(x) − f(x)|² ]. Finding the large coefficients: Hadamard decoding [Kushilevitz-Mansour'91, Goldreich-Levin'89].
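
Once the large coefficients are (approximately) in hand, the sparse l2 regression step itself is immediate, since by Parseval the l2 error of a t-sparse P is exactly the weight it leaves out. A small sketch with an illustrative name:

```python
def best_t_sparse(coeffs, t):
    """Best t-sparse l2 approximation, given coefficients {alpha: c(alpha)}:
    keep the t largest in magnitude.  By Parseval, E_x[(P(x) - f(x))^2] equals
    the total weight sum of c(alpha)^2 over the coefficients that get dropped."""
    top = sorted(coeffs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:t]
    return dict(top)
```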

  17. Agnostic Learning via l2 Regression? f: {-1,1}^n → {-1,1}. [figure: the target f plotted with values +1/−1.]

  18. Agnostic Learning via l2 Regression? [figure: a decision tree with internal nodes X1, X2, X3, plotted against the target with values +1/−1.]

  19. Agnostic Learning via l2 Regression? [figure: the target f and the best tree, plotted with values +1/−1.] • l2 regression: loss |P(x) − f(x)|². Pay 1 for indecision; pay 4 for a mistake. • l1 regression [KKMS'05]: loss |P(x) − f(x)|. Pay 1 for indecision; pay 2 for a mistake.

  20. Agnostic Learning via l1 Regression? • l2 regression: loss |P(x) − f(x)|². Pay 1 for indecision; pay 4 for a mistake. • l1 regression [KKMS'05]: loss |P(x) − f(x)|. Pay 1 for indecision; pay 2 for a mistake.

  21. Agnostic Learning via l1 Regression. [figure: the target f and the best tree, plotted with values +1/−1.] Thm [KKMS'05]: l1 regression always gives a good predictor. l1 regression for low-degree polynomials can be done via linear programming.
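
A sketch of how this l1 regression can be posed as a linear program, here over all monomials of degree at most d and solved with scipy's linprog; the setup and names are illustrative, not the paper's exact formulation:

```python
import itertools
import numpy as np
from math import prod
from scipy.optimize import linprog

def l1_poly_regression(examples, n, d):
    """Empirical l1 regression over polynomials of degree <= d, as a linear program:
        minimize (1/m) * sum_j s_j   subject to   -s_j <= P(x_j) - y_j <= s_j,
    with variables c(alpha) (the coefficients of P) and slacks s_j >= 0."""
    monomials = [a for k in range(d + 1) for a in itertools.combinations(range(n), k)]
    X = np.array([[prod(x[i] for i in a) for a in monomials] for x, _ in examples])
    y = np.array([label for _, label in examples], dtype=float)
    m, p = X.shape
    obj = np.concatenate([np.zeros(p), np.ones(m) / m])    # cost only on the slacks
    A_ub = np.block([[X, -np.eye(m)], [-X, -np.eye(m)]])   # both sides of |P(x_j) - y_j|
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * p + [(0, None)] * m          # coefficients free, slacks >= 0
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return dict(zip(monomials, res.x[:p]))
```

To turn the real-valued P into a predictor, one can then threshold P(x), for example at a value chosen to minimize empirical error.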

  22. Agnostically Learning Decision Trees. Sparse l1 regression: find a t-sparse polynomial P minimizing E_x[ |P(x) − f(x)| ]. • Why this is harder: • l2 is basis independent, l1 is not. • We don't know the support of P. • [G.-Kalai-Klivans]: polynomial-time algorithm for sparse l1 regression.

  23. The Gradient-Projection Method. For P(x) = Σ_α c(α)·χ_α(x) and Q(x) = Σ_α d(α)·χ_α(x): L1(P,Q) = Σ_α |c(α) − d(α)|, L2(P,Q) = [ Σ_α (c(α) − d(α))² ]^{1/2}. Variables: the c(α)'s. Constraint: Σ_α |c(α)| ≤ t. Minimize: E_x|P(x) − f(x)|.

  24. The Gradient-Projection Method. [figure: a gradient step.] Variables: the c(α)'s. Constraint: Σ_α |c(α)| ≤ t. Minimize: E_x|P(x) − f(x)|.

  25. The Gradient-Projection Method. [figure: a gradient step followed by a projection.] Variables: the c(α)'s. Constraint: Σ_α |c(α)| ≤ t. Minimize: E_x|P(x) − f(x)|.

  26. The Gradient. • g(x) = sgn[f(x) − P(x)]. • Update: P(x) := P(x) + γ·g(x). Increase P(x) where it is low; decrease P(x) where it is high. [figure: f and P plotted on the +1/−1 scale.]
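
A minimal sketch of this pointwise update on an explicit truth table (only feasible for tiny n; the algorithm itself works with the Fourier coefficients, as the following slides explain):

```python
def gradient_step(P_table, f_table, gamma):
    """One step of P(x) := P(x) + gamma * sgn(f(x) - P(x)), applied pointwise:
    raise P where it is below f, lower it where it is above."""
    return [p + gamma * (1.0 if f > p else -1.0 if f < p else 0.0)
            for p, f in zip(P_table, f_table)]
```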

  27. The Gradient-Projection Method. [figure: a gradient step.] Variables: the c(α)'s. Constraint: Σ_α |c(α)| ≤ t. Minimize: E_x|P(x) − f(x)|.

  28. The Gradient-Projection Method. [figure: a gradient step followed by a projection.] Variables: the c(α)'s. Constraint: Σ_α |c(α)| ≤ t. Minimize: E_x|P(x) − f(x)|.

  29. Projection onto the L1 Ball. Currently: Σ_α |c(α)| > t. Want: Σ_α |c(α)| ≤ t.

  30. Projection onto the L1 Ball. Currently: Σ_α |c(α)| > t. Want: Σ_α |c(α)| ≤ t.

  31. Projection onto the L1 Ball. • Below the cutoff: set to 0. • Above the cutoff: subtract the cutoff.

  32. Projection onto the L1 Ball. • Below the cutoff: set to 0. • Above the cutoff: subtract the cutoff.
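
The cutoff rule above is exactly Euclidean projection onto the L1 ball. A sketch on a coefficient dictionary, with illustrative names:

```python
def project_onto_l1_ball(coeffs, t):
    """Project a coefficient dictionary {alpha: c(alpha)} onto { sum |c(alpha)| <= t },
    as on the slides: find a cutoff, zero the coefficients below it, and subtract
    the cutoff from the magnitude of those above it."""
    mags = sorted((abs(v) for v in coeffs.values()), reverse=True)
    if sum(mags) <= t:
        return dict(coeffs)                      # already inside the ball
    cum, cutoff = 0.0, 0.0
    for k, m in enumerate(mags, start=1):        # find the uniform cutoff
        cum += m
        candidate = (cum - t) / k
        if m > candidate:
            cutoff = candidate
        else:
            break
    return {a: (abs(v) - cutoff) * (1 if v > 0 else -1)
            for a, v in coeffs.items() if abs(v) > cutoff}
```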

  33. Analysis of Gradient-Projection [Zinkevich'03]. • Progress measure: squared L2 distance from the optimum P*. Key equation: |P_t − P*|² − |P_{t+1} − P*|² ≥ 2γ·(L(P_t) − L(P*)) − γ². The left side is the progress made in this step; L(P_t) − L(P*) is how suboptimal the current solution is. • Within ε of optimal in 1/ε² iterations. • A good L2 approximation to P_t suffices.
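
Putting the pieces together, here is a self-contained sketch of the exact-gradient version of the method on an explicit truth table, using a Walsh-Hadamard transform to move between point values and coefficients (illustrative names and parameters; the talk's algorithm replaces the exact gradient with a KM-sparsified one, as the next slides discuss):

```python
import numpy as np

def fwht(v):
    """Unnormalized Walsh-Hadamard transform of a length-2^n vector."""
    v = np.array(v, dtype=float)
    h = 1
    while h < len(v):
        v = v.reshape(-1, 2 * h)
        a, b = v[:, :h].copy(), v[:, h:].copy()
        v[:, :h], v[:, h:] = a + b, a - b
        v = v.reshape(-1)
        h *= 2
    return v

def project_l1(c, t):
    """Euclidean projection of the coefficient vector c onto { ||c||_1 <= t }."""
    u = np.sort(np.abs(c))[::-1]
    if u.sum() <= t:
        return c
    css = np.cumsum(u)
    k = np.nonzero(u - (css - t) / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = (css[k] - t) / (k + 1)
    return np.sign(c) * np.maximum(np.abs(c) - theta, 0.0)

def gradient_projection(f_table, t, gamma, iters):
    """Projected subgradient descent for min E_x|P(x) - f(x)| over the L1 ball of
    radius t, run on an explicit truth table of length 2^n (exact-gradient version).
    Per the Zinkevich-style analysis, roughly 1/eps^2 iterations with step size
    gamma ~ eps get within eps of the best P in the ball."""
    f_table = np.asarray(f_table, dtype=float)
    N = len(f_table)
    P = np.zeros(N)
    for _ in range(iters):
        g = np.sign(f_table - P)      # pointwise subgradient, g(x) = sgn(f(x) - P(x))
        P = P + gamma * g             # gradient step
        c = fwht(P) / N               # move to Fourier coefficients
        c = project_l1(c, t)          # project onto the L1 ball, sum |c(alpha)| <= t
        P = fwht(c)                   # back to point values
    return P
```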

  34. Gradient. • g(x) = sgn[f(x) − P(x)]. [figure: f and P on the +1/−1 scale, with the gradient step and the projection.]

  35. The Gradient. • g(x) = sgn[f(x) − P(x)]. • Compute a sparse approximation g' = KM(g). • Is g' a good L2 approximation to g? • No: initially g = f, and L2(g, g') can be as large as 1.

  36. Sparse l1 Regression. [figure: an approximate gradient step.] Variables: the c(α)'s. Constraint: Σ_α |c(α)| ≤ t. Minimize: E_x|P(x) − f(x)|.

  37. Sparse l1 Regression. [figure: the projection compensates for the approximate gradient.] Variables: the c(α)'s. Constraint: Σ_α |c(α)| ≤ t. Minimize: E_x|P(x) − f(x)|.

  38. KM as l2 Approximation. The KM algorithm: Input: g: {-1,1}^n → {-1,1}, and t. Output: a t-sparse polynomial g' minimizing E_x[ |g(x) − g'(x)|² ]. Running time: poly(n, t).
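
For illustration, here is the bucket-splitting search behind KM / Goldreich-Levin, with the bucket weights computed exactly (which takes 2^n time and reads the whole truth table); the real algorithm estimates these weights from random queries, which is what gives poly(n) time. Names are illustrative:

```python
import itertools

def bucket_weight(g, prefix, n):
    """Total Fourier weight of the coefficients whose restriction to the first
    k = len(prefix) coordinates equals `prefix` (a 0/1 tuple):
        sum_{alpha in bucket} hat{g}(alpha)^2
          = E_z[ ( E_y[ g(y, z) * chi_prefix(y) ] )^2 ],
    computed exactly here over all y in {-1,1}^k and z in {-1,1}^(n-k)."""
    k = len(prefix)
    total = 0.0
    for z in itertools.product([-1, 1], repeat=n - k):
        inner = 0.0
        for y in itertools.product([-1, 1], repeat=k):
            chi = 1
            for i, bit in enumerate(prefix):
                if bit:
                    chi *= y[i]
            inner += g(y + z) * chi
        total += (inner / 2 ** k) ** 2
    return total / 2 ** (n - k)

def km_large_coefficients(g, n, theta):
    """Return every alpha (as a tuple of coordinates) with |hat{g}(alpha)| >= theta.
    Grow prefixes one bit at a time, keeping only buckets of weight >= theta^2;
    for Boolean g the weights sum to 1, so at most 1/theta^2 buckets survive per level."""
    prefixes = [()]
    for _ in range(n):
        prefixes = [p + (b,) for p in prefixes for b in (0, 1)
                    if bucket_weight(g, p + (b,), n) >= theta ** 2]
    return [tuple(i for i, bit in enumerate(p) if bit) for p in prefixes]
```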

  39. KM as L1 Approximation. The KM algorithm: Input: a Boolean function g = Σ_α c(α)·χ_α(x) and an error bound ε. Output: an approximation g' = Σ_α c'(α)·χ_α(x) such that |c(α) − c'(α)| ≤ ε for all α ⊆ [n]. Running time: poly(n, 1/ε).

  40. KM as L1 Approximation. • Identify the coefficients larger than ε; since Σ_α c(α)² = 1, there are at most 1/ε² of them. • Estimate those via sampling; set the rest to 0.

  41. KM as L1 Approximation. • Identify the coefficients larger than ε. • Estimate them via sampling; set the rest to 0.
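
A small sketch of this estimate-and-threshold step, assuming the candidate large coefficients have already been located (for example by the bucket search sketched earlier); the names and the sampling budget are illustrative:

```python
import random
from math import prod

def estimate_coefficient(g, alpha, n, samples=10000, rng=random):
    """Estimate hat{g}(alpha) = E_x[g(x) * chi_alpha(x)] by sampling;
    the error is roughly 1/sqrt(samples)."""
    total = 0.0
    for _ in range(samples):
        x = tuple(rng.choice([-1, 1]) for _ in range(n))
        total += g(x) * prod(x[i] for i in alpha)
    return total / samples

def km_sparse_approximation(g, n, candidates):
    """Build g' by estimating only the candidate large coefficients and setting
    every other coefficient to 0.  Since sum_alpha hat{g}(alpha)^2 = 1, at most
    1/eps^2 coefficients can exceed eps, so the candidate list stays small."""
    return {alpha: estimate_coefficient(g, alpha, n) for alpha in candidates}
```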

  42. Projection Preserves L1 Distance. L1 distance is at most 2ε after projection. Both lines stop within ε of each other.

  43. Projection Preserves L1 Distance. L1 distance is at most 2ε after projection. Both lines stop within ε of each other; else, blue dominates red.

  44. Projection Preserves L1 Distance. L1 distance is at most 2ε after projection. Projecting onto the L1 ball does not increase L1 distance.

  45. Sparse l1 Regression. [figure: the true iterate P and the approximate iterate P'.] • Coefficient-wise, P and P' differ by at most 2ε. • L1(P, P') ≤ 2t. • Hence L2(P, P')² ≤ 4εt. • Can take ε = 1/t². Variables: the c(α)'s. Constraint: Σ_α |c(α)| ≤ t. Minimize: E_x|P(x) − f(x)|.

  46. Agnostically Learning Decision Trees. Sparse l1 regression: find a sparse polynomial P minimizing E_x[ |P(x) − f(x)| ]. • [G.-Kalai-Klivans'08]: • Can get within ε of the optimum in poly(t, 1/ε) iterations. • An algorithm for sparse l1 regression. • The first polynomial-time algorithm for agnostically learning sparse polynomials.

  47. l1 Regression from l2 Regression. Function f: D → [-1,1], orthonormal basis B. Sparse l2 regression: find a t-sparse polynomial P minimizing E_x[ |P(x) − f(x)|² ]. Sparse l1 regression: find a t-sparse polynomial P minimizing E_x[ |P(x) − f(x)| ]. • [G.-Kalai-Klivans'08]: given a solution to l2 regression, one can solve l1 regression.

  48. Agnostically Learning DNFs? • Problem: can we agnostically learn DNFs in polynomial time? (uniform distribution, with queries) • Noiseless setting: Jackson's Harmonic Sieve. • Implies a weak learner for depth-3 circuits. • Beyond current Fourier techniques. Thank you!
