Learning and smoothed analysis
Alex Samorodnitsky*, Hebrew University, Jerusalem
Adam Kalai, Microsoft Research, Cambridge, MA
Shang-Hua Teng*, University of Southern California
*while visiting Microsoft
In this talk…
• Revisit classic learning problems, e.g. learning DNFs from random examples (drawn from product distributions)
• Barrier = worst-case complexity
• Solve in a new model!
• Smoothed analysis sheds light on the structure of hard problem instances
• Also show: a DNF can be recovered from its heavy “Fourier coefficients”
[Valiant84] P.A.C. learning AND’s!?
X = {0,1}ⁿ, f: X → {–1,+1}
PAC assumption: target is an AND, e.g. f(x) = x2˄x4˄x7
Input: noiseless training data (xj drawn from D, f(xj)) for j ≤ m
Output: h: X → {–1,+1} with err(h) = Prx←D[h(x)≠f(x)] ≤ ε
1. Succeed with prob. ≥ 0.99
2. m = # examples = poly(n/ε)
3. Polytime learning algorithm
*OPTIONAL* “Proper” learning: h is an AND
[Kearns Schapire Sellie92] Agnostic P.A.C. learning of AND’s!?
X = {0,1}ⁿ, f: X → {–1,+1}
PAC assumption (target is an AND, e.g. f(x) = x2˄x4˄x7) is dropped: f may be arbitrary
Input: training data (xj drawn from D, f(xj)) for j ≤ m
Output: h: X → {–1,+1} with err(h) = Prx←D[h(x)≠f(x)] ≤ opt + ε, where opt = minAND g err(g)
1. Succeed with prob. ≥ 0.99
2. m = # examples = poly(n/ε)
3. Polytime learning algorithm
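For the noiseless PAC setting on the previous slide, the textbook way to learn an AND is the elimination algorithm: start with every literal and delete any literal contradicted by a positive example. A minimal sketch (plain Python; the variable names and the toy target are mine, and this is the classical algorithm rather than anything specific to this talk):

```python
import random

def learn_conjunction(examples):
    """Classical elimination algorithm for PAC-learning a conjunction (AND of
    literals) from noiseless data.  examples: list of (x, y) with x a tuple of
    0/1 bits and y in {-1, +1} (+1 means the target AND is satisfied)."""
    n = len(examples[0][0])
    # Start with all 2n literals: (i, 1) means "x_i must be 1", (i, 0) means "x_i must be 0".
    literals = {(i, b) for i in range(n) for b in (0, 1)}
    for x, y in examples:
        if y == +1:
            # A positive example eliminates every literal it violates.
            literals -= {(i, 1 - x[i]) for i in range(n)}
    return lambda x: +1 if all(x[i] == b for i, b in literals) else -1

# Toy check with target x2 AND x4 AND x7 (1-indexed on the slide; 0-indexed here).
target = lambda x: +1 if x[1] and x[3] and x[6] else -1
data = []
for _ in range(500):
    x = tuple(random.randint(0, 1) for _ in range(10))
    data.append((x, target(x)))
h = learn_conjunction(data)
assert all(h(x) == y for x, y in data)   # hypothesis is consistent with the sample
```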
Some related work (the slide illustrates the classes with an example conjunction x2˄x4˄x7˄x9, a small decision tree over x1, x2, x7, x9, and the DNF (x1˄x4)˅(x2˄x4˄x7˄x9)):
• [Kushilevitz-Mansour’91; Goldreich-Levin’89]: uniform/product D + membership queries
• [Gopalan-K-Klivans’08]: uniform/product D + membership queries
• [Bshouty’94]: membership queries
• [Jackson’94]: uniform/product D + membership queries
• This work [KST’09]: product D only, no membership queries, via smoothed analysis
Outline
• PAC learn decision trees over smoothed (constant-bounded) product distributions
  • Describe practical heuristic
  • Define smoothed product distribution setting
  • Structure of Fourier coefficients over random product distributions
• PAC learn DNFs over smoothed (constant-bounded) product distributions
  • Why a DNF can be recovered from heavy coefficients (information-theoretically)
• Agnostically learn decision trees over smoothed (constant-bounded) product distributions
  • Rough idea of algorithm
Feature Construction “Heuristic” ≈ [Sutton-Matheus’91]
Approach: greedily learn a sparse polynomial, bottom-up, using least-squares regression
1. Normalize the input (x1,y1),(x2,y2),…,(xm,ym) so that each attribute xi has mean 0 and variance 1
2. F := {1, x1, x2, …, xn}
3. Repeat m¼ times: F := F ∪ {t·xi} for the t ϵ F and attribute xi of minimum least-squares regression error
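A rough, runnable rendering of that heuristic follows. It is a sketch under my own assumptions: the exact stopping rule, tie-breaking, and handling of the constant feature are guesses, and re-fitting the regression for every candidate product is done by brute force for clarity rather than speed.

```python
import numpy as np

def feature_construction(X, y, rounds=None):
    """Greedy bottom-up feature construction with least-squares regression,
    in the spirit of the slide's heuristic.  X: m-by-n 0/1 matrix, y: labels
    in {-1, +1}.  Returns the constructed feature columns and the final
    regression coefficients."""
    m, n = X.shape
    # 1. Normalize each attribute to mean 0 and variance 1.
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    # 2. F := {1, x_1, ..., x_n}
    feats = [np.ones(m)] + [Z[:, i] for i in range(n)]
    rounds = int(m ** 0.25) if rounds is None else rounds

    def fit(cols):
        A = np.column_stack(cols)
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return float(np.mean((A @ coef - y) ** 2)), coef

    # 3. Repeatedly add the product t * x_i of minimum regression error.
    for _ in range(rounds):
        best_err, best_feat = np.inf, None
        for t in feats:
            for i in range(n):
                err, _ = fit(feats + [t * Z[:, i]])
                if err < best_err:
                    best_err, best_feat = err, t * Z[:, i]
        feats.append(best_feat)

    return feats, fit(feats)[1]
```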
Guarantee for that Heuristic
For μ ϵ [0,1]ⁿ, let πμ be the product distribution where Ex←πμ[xᵢ] = μᵢ.
Theorem 1. For any size-s decision tree f: {0,1}ⁿ → {–1,+1}, with probability ≥ 0.99 over uniformly random μ ϵ [0.49,0.51]ⁿ and m = poly(ns/ε) training examples (xj, f(xj))j≤m with xj iid from πμ, the heuristic outputs h with Prx←πμ[sgn(h(x))≠f(x)] ≤ ε.
Guarantee for that Heuristic (general version)
For μ ϵ [0,1]ⁿ, let πμ be the product distribution where Ex←πμ[xᵢ] = μᵢ.
Theorem 1. For any size-s decision tree f: {0,1}ⁿ → {–1,+1} and any ν ϵ [.02,.98]ⁿ, with probability ≥ 0.99 over uniformly random μ ϵ ν+[–.01,.01]ⁿ and m = poly(ns/ε) training examples (xj, f(xj))j≤m with xj iid from πμ, the heuristic outputs h with Prx←πμ[sgn(h(x))≠f(x)] ≤ ε.
*The same statement holds for the DNF algorithm.
Smoothed analysis assumption (figure: a small decision tree over x1, x2, x7, x9 and a cube of product distributions)
• Adversary picks the target f: {0,1}ⁿ → {–1,+1} (e.g. a decision tree) and a cube ν+[–.01,.01]ⁿ of product distributions
• Nature picks μ from the cube; examples (x⁽¹⁾, f(x⁽¹⁾)),…,(x⁽ᵐ⁾, f(x⁽ᵐ⁾)) are drawn iid from the product distribution πμ
• The learning algorithm outputs h with Prx←πμ[h(x)≠f(x)] ≤ ε
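To make the diagram concrete, here is how one could simulate the two sources of randomness in this model (function and variable names are mine, and this is only an illustration of the setting, not part of the talk):

```python
import numpy as np

def smoothed_training_set(nu, f, m, radius=0.01, rng=np.random.default_rng(0)):
    """Simulate the smoothed product-distribution model: the adversary supplies
    nu in [.02,.98]^n and a target f; nature draws mu uniformly from the cube
    nu + [-radius, radius]^n; the learner then sees m iid examples from the
    product distribution pi_mu with Pr[x_i = 1] = mu_i."""
    nu = np.asarray(nu, dtype=float)
    mu = nu + rng.uniform(-radius, radius, size=nu.shape)
    X = (rng.random((m, nu.size)) < mu).astype(int)
    y = np.array([f(x) for x in X])
    return mu, X, y

# Example usage: a decision tree like the one in the figure would play the role of f.
mu, X, y = smoothed_training_set([0.5] * 10, lambda x: +1 if x[1] and x[3] else -1, 1000)
```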
“Hard” instance picture (figure: the space of product distributions μ ϵ [0,1]ⁿ, μi = Pr[xi=1], for a fixed tree f: {0,1}ⁿ → {–1,+1}; red regions mark the μ where the heuristic fails)
• The red “hard” region cannot cover any small cube of product distributions
• Theorem 1: “hard” instances are few and far between, for any tree
Fourier analysis over product distributions
• x ϵ {0,1}ⁿ, μ ϵ [0,1]ⁿ, x drawn from πμ with Prx←πμ[xᵢ=1] = μᵢ
• Coordinates normalized to mean 0, variance 1: zᵢ = (xᵢ – μᵢ)/√(μᵢ(1–μᵢ))
• Orthonormal basis χS(x) = ΠiϵS zᵢ, coefficients f̂(S) = Ex←πμ[f(x)·χS(x)], so f = ΣS f̂(S)·χS
The Heuristic over product distributions
(μᵢ can easily be estimated from the data, and any individual coefficient f̂(S) is easy to approximate from examples)
1) F := {∅, {1}, …, {n}}
2) Repeat m¼ times: add to F a new set (an existing set extended by one coordinate), where S is chosen to maximize the estimated coefficient magnitude |f̂(S)| (equivalently, the drop in squared regression error)
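The two parenthetical remarks (μᵢ and any individual coefficient are easy to estimate) amount to the following empirical estimator. This is my rendering of the standard normalized product-distribution basis from the previous slide, not code from the talk:

```python
import numpy as np

def fourier_coefficient(X, y, S, mu=None):
    """Estimate the product-distribution Fourier coefficient
        f_hat(S) = E_{x ~ pi_mu}[ f(x) * prod_{i in S} z_i ],
    where z_i = (x_i - mu_i) / sqrt(mu_i (1 - mu_i)) is x_i normalized to
    mean 0 and variance 1.  X: m-by-n 0/1 sample from pi_mu, y: labels in
    {-1, +1}, S: iterable of coordinates.  If mu is not given it is estimated
    from the sample, as the slide suggests."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mu = X.mean(axis=0) if mu is None else np.asarray(mu, dtype=float)
    Z = (X - mu) / np.sqrt(mu * (1 - mu))
    S = list(S)
    chi_S = np.prod(Z[:, S], axis=1) if S else np.ones(len(X))
    return float(np.mean(y * chi_S))
```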
Example (the slide also draws the corresponding small decision tree over x2, x4, x9)
• f(x) = x2x4x9
• For uniform μ = (.5,.5,.5), xᵢ ϵ {–1,+1}: f(x) = x2x4x9
• For μ = (.4,.6,.55):† f(x) = .9x2x4x9 + .1x2x4 + .3x4x9 + .2x2x9 + .2x2 – .2x4 + .1x9
†figures not to scale
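Since the footnote says those coefficients are not to scale, here is how the exact expansion could be computed by brute-force enumeration over the 2³ relevant inputs. This is my own illustration; it treats μᵢ as Pr[xᵢ = +1] for ±1-valued bits, matching the slide's convention:

```python
import itertools
import numpy as np

mu = np.array([0.4, 0.6, 0.55])                  # biases of x2, x4, x9
mean = 2 * mu - 1                                # E[x_i] for x_i in {-1, +1}
std = 2 * np.sqrt(mu * (1 - mu))                 # stddev of x_i

coeffs = {}
for r in range(4):
    for S in itertools.combinations(range(3), r):
        total = 0.0
        for bits in itertools.product([-1, +1], repeat=3):
            prob = np.prod([mu[i] if bits[i] == +1 else 1 - mu[i] for i in range(3)])
            f_val = bits[0] * bits[1] * bits[2]                       # f = x2*x4*x9
            chi = np.prod([(bits[i] - mean[i]) / std[i] for i in S])  # orthonormal basis function
            total += prob * f_val * chi
        coeffs[S] = round(float(total), 4)
print(coeffs)        # coeffs[(0, 1, 2)] is the "heavy" degree-3 coefficient
```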
Fourier structure over random product distributions
Lemma. For any f: {0,1}ⁿ → {–1,1}, α, β > 0, and d ≥ 1, …
Lemma. Let p: Rⁿ → R be a degree-d multilinear polynomial with a leading coefficient of 1. Then, for any ε > 0, …
e.g., p(x) = x1x2x9 + .3x7 – 0.2
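The second lemma's conclusion is not legible in this transcript, but its flavor, that a degree-d multilinear polynomial with a leading coefficient of 1 is rarely close to zero at a random point, can be eyeballed numerically with the slide's example polynomial. This is only an illustrative experiment of mine, not the lemma or its proof:

```python
import numpy as np

rng = np.random.default_rng(1)

def p(v):
    # The slide's example: p(x) = x1*x2*x9 + 0.3*x7 - 0.2  (0-indexed below).
    return v[0] * v[1] * v[8] + 0.3 * v[6] - 0.2

eps, trials = 0.01, 100_000
mus = rng.uniform(0.0, 1.0, size=(trials, 10))     # random points in [0,1]^10
frac_small = np.mean(np.abs(p(mus.T)) <= eps)
print(f"Fraction of trials with |p(mu)| <= {eps}: {frac_small:.4f}")
```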
An older perspective
• [Kushilevitz-Mansour’91] and [Goldreich-Levin’89] find the heavy Fourier coefficients
• Really use the fact that ΣS f̂(S)² = 1 (Parseval)
• Every decision tree is well approximated by its heavy coefficients, because a size-s tree has ΣS |f̂(S)| ≤ s, so coefficients of magnitude below ε/s contribute at most ε to the squared error
• In the smoothed product distribution setting, the Heuristic finds the heavy (log-degree) coefficients
Outline (recap): next, PAC learning DNFs over smoothed (constant-bounded) product distributions, and why a DNF can be recovered from its heavy coefficients (information-theoretically).
Learning DNF
• Adversary picks a DNF f(x) = C1(x)˅C2(x)˅…˅Cs(x) (and ν ϵ [.02,.98]ⁿ)
• Step 1: find f≥ε, the part of f supported on its heavy Fourier coefficients
• [BFJKMR’94, Jackson’95]: “KM gives a weak learner,” combined with careful boosting
• But boosting cannot be used in the smoothed setting
• Solution: learn the DNF from f≥ε alone!
• Design a robust membership-query DNF learning algorithm, and give it query access to f≥ε
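For readability later on, f≥ε just denotes f truncated to its heavy Fourier coefficients; given a table of estimated coefficients, computing it is a one-liner (my notation, not the talk's):

```python
def heavy_part(coeffs, eps):
    """f_geq_eps: keep only the Fourier coefficients of magnitude at least eps.
    coeffs: dict mapping a tuple S of coordinates to the estimate of f_hat(S)."""
    return {S: c for S, c in coeffs.items() if abs(c) >= eps}
```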
DNF learning algorithm
f(x) = C1(x)˅C2(x)˅…˅Cs(x), e.g., (x1˄x4)˅(x2˄x4˄x7˄x9)
Each Ci is a “linear threshold function,” e.g. sgn(x1+x4–1.5)
Approach of [K-Kanade-Mansour’09] + other ingredients
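The remark that each term Cᵢ is a linear threshold function is easy to check exhaustively; a throwaway verification of mine for the term x1˄x4 of the slide's example (0-indexed in the code):

```python
import itertools

def term(x):                               # C(x) = x1 AND x4
    return bool(x[0] and x[3])

def threshold(x):                          # sgn(x1 + x4 - 1.5) > 0
    return (x[0] + x[3] - 1.5) > 0

assert all(term(x) == threshold(x) for x in itertools.product([0, 1], repeat=5))
print("x1 AND x4 agrees with sgn(x1 + x4 - 1.5) on every 0/1 input")
```

In general, an AND of k literals is the threshold of the sum of the corresponding inputs (or their complements) against k – 0.5.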
I’m a burier (of details) burier noun, pl. –s, One that buries.
DNF recoverable from heavy coefficients
Information-theoretic lemma (uniform distribution): for any s-term DNF f and any g: {0,1}ⁿ → {–1,1}, …
(Thanks, Madhu! Maybe similar to Bazzi/Braverman/Razborov?)
Proof: write f(x) = C1(x)˅…˅Cs(x), where f(x) ϵ {–1,1} but each Cᵢ(x) ϵ {0,1}. …
Outline (recap): finally, agnostically learning decision trees over smoothed (constant-bounded) product distributions, with a rough idea of the algorithm.
Agnostically learning decision trees
• Adversary picks an arbitrary f: {0,1}ⁿ → {–1,+1} and ν ϵ [.02,.98]ⁿ
• Nature picks μ ϵ ν + [–.01,.01]ⁿ
• These determine the best size-s decision tree f*, with opt = err(f*)
• Guarantee: get err(h) ≤ opt + ε
Agnostically learning decision trees (cont.)
• Design a robust membership-query learning algorithm that works as long as the queries are answered by some g where …
• Solve: …  Robustness: …
• (Approximately) solved using the [Gopalan-K-Klivans’08] approach
The gradient-project descent algorithm (closely following [Gopalan-K-Klivans’08])
1. Find f≥ε: {0,1}ⁿ → R using the heuristic
2. h¹ = 0
3. For t = 1,…,T: ht+1 = projs( KM( … ) )
4. Output h(x) = sgn(ht(x) – θ) for the t ≤ T and θ ϵ [–1,1] that minimize error on a held-out data set
(A toy, self-contained sketch of this loop together with the projection step appears after the next slide.)
Projection
• projs(h) = …
• From [Gopalan-K-Klivans’08]
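The definition of projs is not legible in this transcript. One natural reading, since a size-s decision tree has Fourier ℓ1 norm at most s, is Euclidean projection of the coefficient vector onto the ℓ1 ball of radius s. The sketch below combines that assumed projection with a toy version of the gradient-project loop from the previous slide; it works directly with empirical low-degree Fourier features of the sample instead of KM queries to f≥ε, and the squared loss, step size, and iteration count are all my choices rather than the talk's:

```python
import itertools
import numpy as np

def project_l1(v, s):
    """Euclidean projection of v onto the l1 ball of radius s
    (the standard sort-and-threshold algorithm)."""
    if np.abs(v).sum() <= s:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - s))[0][-1]
    theta = (css[rho] - s) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def gradient_project(X, y, s, degree=2, T=100, eta=0.5):
    """Toy stand-in for the gradient-project descent loop: projected gradient
    descent on squared loss over low-degree product-distribution Fourier
    features, keeping the coefficient vector in the l1 ball of radius s."""
    m, n = X.shape
    mu = X.mean(axis=0)
    Z = (X - mu) / np.sqrt(mu * (1 - mu) + 1e-12)
    subsets = [S for r in range(degree + 1)
               for S in itertools.combinations(range(n), r)]
    Phi = np.column_stack([np.prod(Z[:, list(S)], axis=1) if S else np.ones(m)
                           for S in subsets])
    h = np.zeros(len(subsets))                      # h^1 = 0
    for _ in range(T):
        grad = Phi.T @ (Phi @ h - y) / m            # gradient of mean squared loss
        h = project_l1(h - eta * grad, s)           # h^{t+1} = proj_s( ... )
    return subsets, h, np.sign(Phi @ h)             # sign of the final predictor
```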
Conclusions
• Smoothed complexity [Spielman-Teng’01]: a compromise between worst-case and average-case analysis
• Novel application: learning over product distributions
• Assumption: the relationship between the target f and the distribution D is not completely adversarial
• Weaker than “margin” assumptions
• Future work: non-product distributions; other smoothed-analysis applications
Average-case complexity [Jackson-Servedio’05]
• [JS05] give a polytime algorithm that learns most decision trees under the uniform distribution on {0,1}ⁿ
• But random decision trees are sometimes easier than real ones: “Random is not typical” (courtesy of Dan Spielman)