On Agnostic Boosting and Parity Learning
Adam Tauman Kalai (Georgia Tech), Yishay Mansour (Google and Tel-Aviv University), Elad Verbin (Tsinghua)
Defs
• Agnostic Learning = learning with adversarial noise
• Boosting = turning a weak learner into a strong learner
• Parities = parities of subsets of the bits
  • f:{0,1}^n → {0,1}, e.g. f(x) = x1 ⊕ x3 ⊕ x7 (sketch below)
Outline
• Agnostic Boosting: turning a weak agnostic learner into a strong agnostic learner
• A 2^{O(n/log n)}-time algorithm for agnostically learning parities over any distribution
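As a concrete illustration of the parity definition above, here is a minimal sketch in Python; the subset {1, 3, 7} matches the slide's example, and the helper name is an illustrative choice.

```python
def parity(x, subset):
    """Parity (XOR) of the bits of x indexed by `subset` (1-based, as on the slide)."""
    return sum(x[i - 1] for i in subset) % 2

# Example: f(x) = x1 xor x3 xor x7 on n = 8 bits.
x = [1, 0, 1, 1, 0, 0, 1, 0]
print(parity(x, {1, 3, 7}))  # x1=1, x3=1, x7=1, so f(x) = 1
```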
Agnostic boosting
• Weak learner: for any noise rate < ½, produces a better-than-trivial hypothesis
• Agnostic Booster: runs the weak learner as a black box
• Strong learner: produces an almost-optimal hypothesis
Learning with Noise
It's, like, a really hard model!!!*
* up to well-studied open problems (i.e. we know where we're stuck)
Agnostic Learning: some known results
Due to hardness, or lack of tools???
Agnostic boosting: a strong tool that makes it easier to design algorithms.
Why care about agnostic learning? • More relevant in practice • Impossibility results might be useful for building cryptosystems
Noisy learning
f:{0,1}^n → {0,1} from class F. The algorithm gets samples <x, f(x)> where x is drawn from distribution D.
• No noise: the learning algorithm should approximate f up to error ε
• Random noise: each label is flipped with probability η; the learning algorithm should approximate f up to error ε
• Adversarial (≈ agnostic) noise: an adversary is allowed to corrupt an η-fraction, turning f into g; the learning algorithm should approximate g up to error η + ε
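To make the three rows of this picture concrete, here is a hedged sketch of the example oracles in Python; the function names, the uniform choice of x, and the comments are illustrative assumptions rather than the paper's definitions.

```python
import random

def noiseless_example(f, n):
    """No noise: return <x, f(x)> with x drawn (here: uniformly) from {0,1}^n."""
    x = [random.randint(0, 1) for _ in range(n)]
    return x, f(x)

def random_noise_example(f, n, eta):
    """Random noise: the label f(x) is flipped independently with probability eta."""
    x, y = noiseless_example(f, n)
    if random.random() < eta:
        y ^= 1
    return x, y

def agnostic_example(g, n):
    """Agnostic (adversarial) noise: samples are labeled by an arbitrary g, which may
    disagree with every f in F; the learner should output h with err(h, g) <= opt + eps,
    where opt is the error of the best f in F against g."""
    x = [random.randint(0, 1) for _ in range(n)]
    return x, g(x)
```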
Agnostic learning (geometric view)
[diagram: the class F drawn as a region of hypotheses; the target g lies outside it; a ball of radius opt around g touches the nearest f ∈ F, and a larger ball of radius opt + ε is the goal]
PROPER LEARNING
• Parameters: F, a metric
• Input: oracle for g
• Goal: return some element of the blue (radius opt + ε) ball
Agnostic boosting: definition
Weak learner: given samples labeled by g under distribution D with opt ≤ ½ − ε, w.h.p. outputs h with err_D(g, h) ≤ ½ − ε^100.
Agnostic boosting
Weak learner: given samples labeled by g under distribution D with opt ≤ ½ − ε, w.h.p. outputs h with err_D(g, h) ≤ ½ − ε^100.
Agnostic Booster: given samples from g, w.h.p. outputs h' with err_D(g, h') ≤ opt + ε.
Runs the weak learner poly(1/ε^100) times.
Agnostic boosting
(α, γ)-weak learner: given samples labeled by g under distribution D with opt ≤ ½ − α, w.h.p. outputs h with err_D(g, h) ≤ ½ − γ.
Agnostic Booster: given samples from g, w.h.p. outputs h' with err_D(g, h') ≤ opt + α + ε.
Runs the weak learner poly(1/γ, 1/ε) times.
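One way to read this contract as code, as a hedged sketch: the parameter names follow the reconstruction above (the Greek letters are lost in the slide text), the error is estimated on the given sample, and the function names are mine.

```python
def satisfies_weak_contract(weak_learner, sample, opt, alpha, gamma):
    """Sketch of the (alpha, gamma)-weak learner contract: whenever opt <= 1/2 - alpha
    on the current distribution D, the learner must return h with
    err_D(g, h) <= 1/2 - gamma.  Here err is estimated on `sample`, a list of (x, y)
    pairs drawn from D and labeled by g."""
    if opt > 0.5 - alpha:
        return True  # the contract only binds when opt <= 1/2 - alpha
    h = weak_learner(sample)
    err = sum(1 for x, y in sample if h(x) != y) / len(sample)
    return err <= 0.5 - gamma
```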
Agnostic boosting
• Weak learner: for any noise rate < ½, produces a better-than-trivial hypothesis
• Agnostic Booster
• Strong learner: produces an almost-optimal hypothesis
“Approximation Booster” Analogy
Given: a poly-time MAX-3-SAT algorithm that, when opt = 7/8 + ε, produces a solution of value ≥ 7/8 + ε^100.
Result: an algorithm for MAX-3-SAT that produces a solution of value ≥ opt − ε, with running time poly(n, 1/ε).
Gap
[diagram: a number line from 0 to 1 with ½ marked]
No hardness gap close to ½ ⟹ (via the booster) no gap anywhere (additive PTAS)
Agnostic boosting
• New analysis for the Mansour-McAllester booster
  • uses branching programs; nodes are weak hypotheses
• Previous agnostic boosting:
  • Ben-David, Long and Mansour, and Gavinsky, defined agnostic boosting differently
  • Their results cannot be used for our application
Booster
[diagram: a branching program with a single node h1 reading input x; the edges h1(x)=0 and h1(x)=1 lead to leaves labeled 0 and 1]
Booster: Split step
[diagram: each leaf of the program sees a different (conditional) distribution; candidate splits attach new weak hypotheses h2 and h2', trained on those distributions, with edges to leaves labeled 0 and 1; choose the “better” option]
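A rough sketch of what one split might look like on data: collect the examples that reach a leaf (this is the "different distribution" in the picture), run the weak learner on them, and label the two new leaves by majority vote. The function names and the labeling rule are illustrative assumptions, not the paper's pseudocode.

```python
def split_leaf(leaf_examples, weak_learner):
    """Sketch of a split step.  `leaf_examples` holds the (x, y) pairs that reach
    this leaf, i.e. a sample from the conditional distribution there."""
    h = weak_learner(leaf_examples)  # new weak hypothesis stored at this node
    branch0 = [(x, y) for x, y in leaf_examples if h(x) == 0]
    branch1 = [(x, y) for x, y in leaf_examples if h(x) == 1]

    def majority_label(examples):
        # Label each new leaf by the majority label of the examples reaching it
        # (one way to read "choose the better option" on the slide).
        ones = sum(y for _, y in examples)
        return 1 if 2 * ones >= max(len(examples), 1) else 0

    return h, majority_label(branch0), majority_label(branch1)
```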
Booster: Split step
[diagram: the branching program after further splits; root h1, with h2 and h3 at deeper nodes and leaves labeled 0 and 1]
Booster: Split step
[diagram: another split adds h4; the program keeps growing as leaves are split by new weak hypotheses]
Booster: Merge step
[diagram: the current branching program with nodes h1, h2, h3, h4; two nodes whose behavior is “similar” are marked to be merged]
Booster: Merge step
[diagram: the branching program after merging the similar nodes; the width stays bounded]
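A correspondingly rough sketch of a merge test: two nodes are merged when the label statistics of the examples reaching them are close, which is one way to read "merge if similar". The statistic and the threshold below are illustrative assumptions.

```python
def should_merge(examples_u, examples_v, tolerance=0.05):
    """Sketch of a merge test: merge two nodes when the fraction of positive labels
    among the examples reaching them is similar, keeping the width bounded."""
    def positive_rate(examples):
        return sum(y for _, y in examples) / max(len(examples), 1)
    return abs(positive_rate(examples_u) - positive_rate(examples_v)) <= tolerance
```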
Booster: Another split step
[diagram: after the merge, a new split introduces h5 and the alternation of splits and merges continues]
Booster: final result
[diagram: the final bounded-width branching program on input x; every internal node is a weak hypothesis, and the two sinks are labeled 0 and 1]
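The final hypothesis is evaluated by walking the branching program from the root: at each node, apply its weak hypothesis to x and follow the matching edge until a sink labeled 0 or 1 is reached. A minimal sketch, in which the node representation (nested dicts) is my own illustrative choice:

```python
def evaluate_branching_program(root, x):
    """Walk the branching program from the root to a sink.  Internal nodes are
    represented here as dicts {'h': weak_hypothesis, 0: child, 1: child};
    sinks are simply the labels 0 or 1."""
    node = root
    while isinstance(node, dict):
        node = node[node['h'](x)]
    return node
```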
Application: Parity with Noise
• Theorem: for every ε, there is a weak learner* that, for noise rate ½ − ε, produces a hypothesis which is wrong on a ½ − (2ε)^{n^{0.001}}/2 fraction of the space. Running time 2^{O(n/log n)}.
* non-proper learner: the hypothesis is a circuit with 2^{O(n/log n)} gates
Feldman et al. give a black-box reduction to the random-noise case; we give a direct result.
Corollary: Learners for many classes (without noise)
• Can learn, without noise, any class with a “guaranteed correlated parity” in time 2^{O(n/log n)}
  • e.g. DNF; any others?
• A weak parity learner that runs in 2^{O(n^{0.32})} time would beat the best algorithm known for learning DNF
• Good evidence that parity with noise is hard ⟹ efficient cryptosystems? [Hopper-Blum, Blum-Furst et al., and many others]
Idea of weak agnostic parity learner
“Between two evils, I pick the one I haven’t tried before” – Mae West
“Between two evils, I pick uniformly at random” – CS folklore
Main idea:
1. Take a learner that resists random noise (BKW)
2. Add randomness to its behavior until you get a weak agnostic learner
Summary
Problem: It is difficult, but perhaps possible, to design agnostic learning algorithms.
Proposed solution: Agnostic Boosting.
Contributions:
• Right(er) definition for a weak agnostic learner
• Agnostic boosting
• Learning parity with noise in the hardest noise model
• Entertaining STOC ’08 participants
Open Problems
• Find other applications for Agnostic Boosting
• Improve PwN algorithms
  • Get a proper learner for parity with noise
  • Reduce PwN with agnostic noise to PwN with random noise
• Get evidence that PwN is hard
  • Prove that if parity with noise is easy then FACTORING is easy. $128 reward!
May the parity be with you! The end.
Weak parity learner
• Sample labeled points from the distribution; sample an unlabeled x; let’s guess f(x)
• Bucket according to the last 2n/log n bits
[diagram: vectors within a bucket are added (⊕) so those bits cancel; the sums go to the next round]
Weak parity learner
LAST ROUND:
• √n vectors with sum = 0 give a guess for f(x)
[diagram: columns of vectors whose sums are 0]
Weak parity learner
LAST ROUND:
• √n vectors with sum = 0 give a guess for f(x)
• by symmetry, prob. of mistake = % of mistakes
• Claim: % of mistakes ≤ … (Cauchy-Schwarz)
[diagram: columns of vectors whose sums are 0]
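For concreteness, here is a very rough sketch of one round of the BKW-style bucketing these slides build on. This is the standard Blum-Kalai-Wasserman reduction step, simplified for illustration; the block size, data layout, and function name are assumptions, not the paper's exact procedure.

```python
from collections import defaultdict

def bkw_round(samples, block_bits):
    """One round of BKW-style bucketing: group the labeled vectors by their last
    `block_bits` bits, then XOR each vector in a bucket with a fixed pivot from
    that bucket so those bits become zero.  Labels are XORed too, which is
    consistent with the labels of any fixed parity function."""
    buckets = defaultdict(list)
    for x, y in samples:
        buckets[tuple(x[-block_bits:])].append((x, y))
    next_round = []
    for group in buckets.values():
        pivot_x, pivot_y = group[0]
        for x, y in group[1:]:
            next_round.append(([a ^ b for a, b in zip(x, pivot_x)], y ^ pivot_y))
    return next_round
```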
Intuition behind Boosting
[diagram: examples the current hypothesis classifies correctly have their weight decreased; examples it gets wrong have their weight increased]
Intuition behind Boosting
• Run, reweight, run, reweight, … Take the majority of the hypotheses.
• An algorithmic & efficient Yao-von Neumann Minimax Principle
[diagram: weights decreased on correctly classified examples, increased on mistakes]
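As a concrete (and hedged) illustration of this reweighting intuition, here is a sketch of the classic multiplicative-weights scheme in Python. It is the textbook realization of "run, reweight, take majority", not the branching-program booster of this paper; all names and the choice of beta are illustrative.

```python
import random

def boost_by_reweighting(examples, weak_learner, rounds, beta=0.5):
    """Sketch of boosting by reweighting: shrink the weight of examples the current
    hypothesis gets right, rerun the weak learner on the reweighted data, and
    finish with a majority vote over all weak hypotheses."""
    weights = [1.0] * len(examples)
    hypotheses = []
    for _ in range(rounds):
        # Resample according to the current weights and run the weak learner.
        sample = random.choices(examples, weights=weights, k=len(examples))
        h = weak_learner(sample)
        hypotheses.append(h)
        # Decrease weight where h is correct; keep weight where it errs.
        weights = [w * (beta if h(x) == y else 1.0)
                   for w, (x, y) in zip(weights, examples)]

    def majority_vote(x):
        votes = sum(h(x) for h in hypotheses)
        return 1 if 2 * votes >= len(hypotheses) else 0

    return majority_vote
```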