Hardness of Learning Halfspaces with Noise
Prasad Raghavendra
Advisor: Venkatesan Guruswami
Spam Problem
• Example perceptron computation: 2×1 + 3×1 + 3×0 + 1×1 + 7×0 = 6.
• Since 6 > 3 (the threshold), output SPAM.
(Figure: a PERCEPTRON with weights 2, 3, 3, 1, 7 applied to the binary features 1, 1, 0, 1, 0.)
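The following is a minimal sketch of the perceptron decision rule illustrated above; the weights, features, and threshold are the ones shown on the slide.

```python
# Perceptron decision rule from the spam example above.
weights  = [2, 3, 3, 1, 7]   # one weight per feature
features = [1, 1, 0, 1, 0]   # binary features of the incoming message
threshold = 3

score = sum(w * x for w, x in zip(weights, features))  # 2 + 3 + 0 + 1 + 0 = 6
label = "SPAM" if score > threshold else "NOT SPAM"
print(score, label)  # 6 SPAM
```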
Halfspace Learning Problem
Input: training samples
• Vectors: W1, W2, …, Wm ∈ {-1,1}^n
• Labels: l1, l2, …, lm ∈ {-1,1}
Output: separating halfspace (A, θ), where θ is the threshold:
• A ∙ Wi < θ if li = -1
• A ∙ Wi ≥ θ if li = 1
(Figure: SPAM (+) and NOT SPAM (-) examples in the plane, separated by a line.)
Perspective
• Perceptron classifiers are the simplest neural networks – widely used for classification.
• Perceptron learning algorithms can learn if the data is perfectly separable.
(Figure: SPAM (+) and NOT SPAM (-) examples in the plane.)
Inseparability
• Who said halfspaces can classify SPAM vs NOT SPAM? The data may be inherently inseparable → Agnostic Learning.
• Even if the data is separable, what about noise? Noise is inherent in many forms of data → PAC learning with noise.
In Presence of Noise
• Agreement: the fraction of the examples classified correctly.
• Example: a halfspace that classifies 16 of the 20 examples correctly has agreement 0.8, i.e., 80%.
• Halfspace Maximum Agreement (HSMA) Problem: "Find the hyperplane that maximizes the agreement with the training examples."
(Figure: + and - examples with a separating line that misclassifies a few of them.)
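A small sketch of the agreement measure just defined: it scores a candidate halfspace (A, θ) against labeled examples, using the classification convention from the problem definition. The toy data below is purely illustrative.

```python
def agreement(A, theta, samples, labels):
    """Fraction of labeled examples the halfspace (A, theta) classifies correctly.
    Convention from the problem definition: predict +1 iff A . W >= theta."""
    correct = 0
    for W, l in zip(samples, labels):
        pred = 1 if sum(a * w for a, w in zip(A, W)) >= theta else -1
        correct += (pred == l)
    return correct / len(samples)

# toy data, purely illustrative
samples = [(-1, 1, -1, 1), (1, 1, 1, -1), (-1, -1, 1, 1)]
labels  = [1, -1, 1]
print(agreement((0.5, 1.0, -0.5, 1.0), 0.0, samples, labels))
```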
Related Work: Positive Results
Random Classification Noise (each label flipped with probability less than 1/2):
• [Blum-Frieze-Kannan-Vempala 96]: a PAC learning algorithm that outputs a decision list of halfspaces.
• [Cohen 97]: a proper learning algorithm (outputs a halfspace) for learning halfspaces.
Assumptions on the distribution of examples:
• [Kalai-Klivans-Mansour-Servedio 05]: an algorithm that finds a close-to-optimal halfspace when the examples come from the uniform or any log-concave distribution.
Related Work: Negative Results
• [Amaldi-Kann 98, Ben-David-Eiron-Long 92]: HSMA is NP-hard to approximate within some constant factor (261/262 and 415/418).
• [Bshouty-Burroughs 02]: HSMA is NP-hard to approximate better than 84/85.
• [Arora-Babai-Stern-Sweedyk 97, Amaldi-Kann 98]: NP-hard to minimize disagreements within a factor of 2^{O(log n)}.
Open Problem
Given that 99.9% of the examples are correctly labeled:
• No algorithm is known that finds a halfspace with agreement 51%.
• No hardness result rules out getting an agreement of 99%.
• Closing this gap was stated as an open problem by [Blum-Frieze-Kannan-Vempala 96].
• Highlighted in recent work by [Feldman 06] on (1-ε, 1/2+δ) tight hardness of learning monomials.
Our Result
For any ε, δ > 0, given a set of training examples, it is NP-hard to distinguish between the following two cases:
• There is a halfspace with agreement 1 - ε.
• No halfspace has agreement greater than 1/2 + δ.
Even with 99.9% of the examples non-noisy, the best we can do is output a random/trivial halfspace!
Remarks
• [Feldman-Gopalan-Khot-Ponnuswami 06] independently showed a similar result.
• Our hardness result holds even for boolean examples in {-1,1}^n (their result holds for R^n).
• [Feldman et al.]'s hardness result gives stronger hardness in the sub-constant regime.
• We also show: given a system of linear equations over the integers that is 1-ε satisfiable, it is NP-hard to find an assignment that satisfies more than a δ fraction of the equations.
Linear Inequalities
• Unknowns: A = (a1, a2, a3, a4) and the threshold θ. Let the halfspace be a1x1 + a2x2 + … + anxn ≥ θ.
• Suppose W1 = (-1, 1, -1, 1) with label l1 = 1. The constraint is a1(-1) + a2(1) + a3(-1) + a4(1) ≥ θ.
• Each labeled example yields one such inequality, e.g. a1 + a2 + a3 + a4 ≥ θ, a1 + a2 + a3 - a4 < θ, a1 - a2 + a3 - a4 ≥ θ, …
• Solving a system of linear inequalities ⟷ Learning a halfspace.
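The sketch below spells out this equivalence: each labeled example is turned into the linear inequality it induces on the unknowns (a1, …, an, θ). The representation (coefficient list plus relation string) is just an illustrative choice.

```python
def constraints(samples, labels):
    """Turn each labeled example into the linear inequality it induces on the
    unknown halfspace (a_1, ..., a_n, theta):
        label +1  ->  a . W >= theta
        label -1  ->  a . W <  theta
    Each constraint is returned as (coefficients of a_1..a_n, relation)."""
    out = []
    for W, l in zip(samples, labels):
        relation = ">= theta" if l == 1 else "< theta"
        out.append((list(W), relation))
        # e.g. W = (-1, 1, -1, 1), l = 1 gives  -a1 + a2 - a3 + a4 >= theta
    return out

print(constraints([(-1, 1, -1, 1)], [1]))
```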
Label Cover Problem
• U, V: sets of vertices; E: set of edges; {1, 2, …, R}: set of labels; πe: constraint (projection) on edge e.
• An assignment A satisfies an edge e = (u, v) ∈ E if πe(A(u)) = A(v).
• Goal: find an assignment A that satisfies the maximum number of edges.
(Figure: a bipartite graph between U and V; on the edge e = (u, v) shown, πe(3) = 2.)
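A minimal representation of a Label Cover instance and of the fraction of edges an assignment satisfies, matching the definition above; the tiny instance and variable names are illustrative only.

```python
def satisfied_fraction(edges, pi, assignment):
    """edges: list of (u, v) pairs; pi[(u, v)] maps labels of u to labels of v;
    assignment maps every vertex to a label in {1, ..., R}.
    An edge (u, v) is satisfied when pi_e(A(u)) = A(v)."""
    ok = sum(1 for (u, v) in edges
             if pi[(u, v)].get(assignment[u]) == assignment[v])
    return ok / len(edges)

# tiny illustrative instance with R = 3 labels
edges = [("u", "v")]
pi = {("u", "v"): {1: 2, 2: 2, 3: 2}}   # in particular pi_e(3) = 2, as on the slide
print(satisfied_fraction(edges, pi, {"u": 3, "v": 2}))  # 1.0
```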
Hardness of Label Cover [Raz 98]
There exists γ > 0 such that, given a label cover instance Γ = (U, V, E, R, π), it is NP-hard to distinguish between:
• Γ is completely satisfiable.
• No assignment satisfies more than a 1/R^γ fraction of the edges.
Aim
Translate the Label Cover gap (SATISFIABLE vs. at most 1/R^γ SATISFIABLE) into a system of homogeneous linear inequalities with +1, -1 coefficients over the variables a1, a2, a3, a4, θ, i.e., into a halfspace-learning instance of the kind shown on the previous slides.
Variables
• For each vertex u, introduce R variables: u1, u2, …, uR.
• Intended assignment: if u is assigned label k, then uk = 1 and uj = 0 for all j ≠ k.
Equation Tuples
An equation tuple consists of:
• For all u: u1 + u2 + … + uR = 1, and for all u, v: (u1 + u2 + … + uR) - (v1 + v2 + … + vR) = 0 — all vertices are assigned exactly one label.
• For every constraint πe and all 1 ≤ k ≤ R: ∑ ui = vk, the summation over all i with πe(i) = k (e.g. u1 - v1 = 0, u2 + u3 - v2 = 0).
• Pick t variables ui at random and set ui = 0 — over all random choices, most of the variables are zero.
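A sketch of the intended (completeness) assignment and of the projection equations in a tuple: the indicator encoding of a label satisfies ∑_{i: πe(i)=k} ui = vk exactly. The projection and labels below are the slide's small example; the function names are my own.

```python
def indicator(label, R):
    """Intended assignment: u_k = 1 for the chosen label, all other u_j = 0."""
    return [1 if k == label else 0 for k in range(1, R + 1)]

def check_projection(u_vars, v_vars, pi_e, R):
    """For every label k of v, check   sum_{i : pi_e(i) = k} u_i  =  v_k."""
    for k in range(1, R + 1):
        lhs = sum(u_vars[i - 1] for i in range(1, R + 1) if pi_e[i] == k)
        if lhs != v_vars[k - 1]:
            return False
    return True

R = 3
pi_e = {1: 1, 2: 2, 3: 2}          # the slide's projection: u1 -> v1, u2, u3 -> v2
u_vars = indicator(3, R)           # u gets label 3
v_vars = indicator(pi_e[3], R)     # v gets the projected label 2
print(check_projection(u_vars, v_vars, pi_e, R))  # True: u1 - v1 = 0 and u2 + u3 - v2 = 0
```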
Equation Tuples
• If Γ is SATISFIABLE, there is an assignment that satisfies most of the equation tuples.
• Approximate satisfaction is measured relative to the scaling factor u1 + u2 + … + uR: an equation such as u2 + u3 - v2 = 0 is not even approximately satisfied if |u2 + u3 - v2| > ε(u1 + u2 + … + uR).
Next Step
• Goal: each variable appears exactly once in a tuple, with coefficient +1 or -1. Example tuple: u1 - v1 = 0, u2 + u3 - v2 = 0, u1 + u2 + u3 - v1 - v2 - v3 = 0, u1 = 0, u3 + v1 - v2 = 0.
• Introduce several copies of the variables, and add consistency checks between the different copies of the same variable (a sketch follows this list).
• This amplifies soundness: from one unsatisfied equation per tuple to most tuples having C equations that are not even approximately satisfied.
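The sketch below only illustrates the bookkeeping: each occurrence of a variable gets its own copy, and copies of the same variable are tied together by consistency equations. It is purely illustrative; the actual reduction places the consistency checks and chooses the copies more carefully than this.

```python
def split_into_copies(equations):
    """equations: list of dicts {variable: +1 or -1}.  Replace each occurrence
    of a variable by a fresh copy (var, j), so every copy appears in exactly one
    of the original equations, then tie copies of the same variable together with
    consistency equations  copy_j - copy_{j+1} = 0.  Illustrative sketch only."""
    seen = {}                 # variable -> number of copies created so far
    new_eqs = []
    for eq in equations:
        new_eq = {}
        for var, coef in eq.items():
            j = seen.get(var, 0)
            seen[var] = j + 1
            new_eq[(var, j)] = coef
        new_eqs.append(new_eq)
    consistency = [{(var, j): +1, (var, j + 1): -1}
                   for var, count in seen.items() for j in range(count - 1)]
    return new_eqs, consistency

eqs = [{"u1": +1, "v1": -1},
       {"u1": +1, "u2": +1, "v1": -1, "v2": -1}]
print(split_into_copies(eqs))
```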
Recap
• Each variable appears exactly once in a tuple, with coefficient +1 or -1.
• If Γ is SATISFIABLE: most tuples are completely satisfied.
• If Γ is at most 1/R^γ SATISFIABLE: most tuples have C equations that are not even approximately satisfied.
• Remaining task: using linear inequalities, distinguish between a tuple that is completely satisfied and one in which at least C of its equations are not even approximately satisfied.
Observation
• For B > 0: |A| < B ⟺ A - B < 0 and A + B ≥ 0.
• Pick one of the equation tuples at random and combine its equations with random ±1 signs; the scaling factor is u1 + u2 + … + uR. For example:
  (+1)(u1 - v1) + (+1)(u4 + u5 - v2) + (-1)(u6 + u2 + u7 - v4 - v5 - v6) + (+1)(u3) + (-1)(u8 + v3 - v7)
  = u1 - u2 + u3 + u4 + u5 - u6 - u7 - u8 - v1 - v2 - v3 + v4 + v5 + v6 + v7
• Emit the two inequalities:
  (u1 - u2 + u3 + u4 + u5 - u6 - u7 - u8 - v1 - v2 - v3 + v4 + v5 + v6 + v7) - (u1 + u2 + … + uR) < 0
  (u1 - u2 + u3 + u4 + u5 - u6 - u7 - u8 - v1 - v2 - v3 + v4 + v5 + v6 + v7) + (u1 + u2 + … + uR) ≥ 0
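A small sketch of this observation: encoding "|A| is small relative to the scaling factor B" as the pair of inequalities above, and forming a random ±1 combination of a tuple's left-hand sides. The values are illustrative, not part of the reduction.

```python
import random

def two_inequalities(A, B):
    """Encode the check |A| < B (with B > 0) as the pair used on the slide:
    A - B < 0   and   A + B >= 0."""
    return (A - B < 0, A + B >= 0)

def random_combination(equation_lhs_values):
    """Combine the left-hand sides of a tuple's equations with random +1/-1 signs.
    Every variable keeps coefficient +1 or -1 because no variable occurs in two
    equations of the same tuple."""
    return sum(random.choice((+1, -1)) * val for val in equation_lhs_values)

# illustrative: all equations exactly satisfied -> combination is 0, both inequalities hold
lhs_values = [0, 0, 0, 0, 0]
A = random_combination(lhs_values)
B = 1.0   # scaling factor u1 + ... + uR under the intended assignment
print(two_inequalities(A, B))   # (True, True)
```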
Good Case
With high probability over the choice of tuple, the assignment satisfies every equation exactly:
  u1 - v1 = 0, u2 + u3 - v2 = 0, u1 + u2 + u3 - v1 - v2 - v3 = 0, u1 = 0, u3 + v1 - v2 = 0,
so the ±1 combination also evaluates to 0:
  u1 - u2 + u3 + u4 + u5 - u6 - u7 - u8 - v1 - v2 - v3 + v4 + v5 + v6 + v7 = 0.
The assignment also satisfies u1 + u2 + … + uR = 1, so
  (combination) - (u1 + u2 + … + uR) < 0   and   (combination) + (u1 + u2 + … + uR) ≥ 0.
BOTH INEQUALITIES ARE SATISFIED.
Bad Case
• With high probability over the choice of equation tuple, many of its equations are badly violated: |u1 - v1|, |u2 + u3 - v2|, |u1 + u2 + u3 - v1 - v2 - v3|, … each exceed ε(u1 + u2 + … + uR).
• For large enough C, with high probability over the choice of the ±1 combination,
  |u1 - u2 + u3 + u4 + u5 - u6 - u7 - u8 - v1 - v2 - v3 + v4 + v5 + v6 + v7| > u1 + u2 + … + uR,
so AT MOST ONE of the two inequalities
  (combination) - (u1 + u2 + … + uR) < 0   and   (combination) + (u1 + u2 + … + uR) ≥ 0
is satisfied.
Interesting Set of Vectors
• The set of all possible {-1,1} combinations is exponentially large.
• Instead, construct a polynomial-size subset S of {-1,1}^n such that for any vector v = (v1, v2, …, vn) with sufficiently many large coordinates (> ε), at least a 1-δ fraction of the vectors u ∈ S satisfy |u∙v| > 1.
• Construction: a 4-wise independent family combined with a random grouping of the coordinates.
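An empirical check of the property S must have. Note the hedge: this sketch uses fully independent random sign vectors as a stand-in for S; it does not implement the 4-wise independent family plus random grouping used in the actual construction, it only illustrates the |u∙v| > 1 property being measured.

```python
import random

def fraction_large_inner_product(S, v):
    """Fraction of sign vectors u in S with |u . v| > 1."""
    big = sum(1 for u in S if abs(sum(ui * vi for ui, vi in zip(u, v))) > 1)
    return big / len(S)

n, eps = 200, 0.1
# v has many coordinates of magnitude greater than eps
v = [random.choice((-1, 1)) * random.uniform(eps, 1.0) for _ in range(n)]
# stand-in for S: fully independent random sign vectors (NOT the talk's
# 4-wise independent construction, just an illustration of the property)
S = [[random.choice((-1, 1)) for _ in range(n)] for _ in range(500)]
print(fraction_large_inner_product(S, v))   # typically close to 1
```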
Construction
(Figure: a ±1 sign vector dotted with v = (V1, …, V7), where several coordinates have |Vi| > ε, giving -V1 + V2 - V3 + V4 - V5 + V6 + V7 > 1.)
• Using a 4-wise independent family of sign vectors: |u∙v| > 1 with some constant probability.
• Using all 2^n combinations: probability close to 1.
4-wise Independent Set Construction
(Figure: the coordinates V1, …, Vn are randomly partitioned into groups S1, S2, …; within each group the signs come from a 4-wise independent family, and choices across groups combine as in the all-combinations case. By independence of the grouping and Chernoff bounds, the constant per-group success probability is boosted to probability close to 1.)
Conclusion
• Either an assumption on the distribution of examples or on the noise is necessary for efficient halfspace learning algorithms.
• [Raghavendra-Venkatesan] A similar hardness result holds for learning support vector machines in the presence of adversarial noise.
Details
• The set of all possible {-1,1} combinations is exponentially large → construction using a 4-wise independent family and a random grouping of coordinates.
• No variable should occur more than once in an equation tuple, to ensure that ultimately the inequalities all have coefficients in {-1,1} → use different copies of the variables for different equations, with a careful choice of consistency checks between the copies.
Interesting Set of Vectors
• The set of all possible {-1,1} combinations is exponentially large.
• Construct a polynomial-size subset S of {-1,1}^n such that for any vector v = (v1, v2, …, vn) with sufficiently many large coordinates (> ε), at most a δ fraction of the vectors u ∈ S satisfy |u∙v| < 1.
• Construction: a 4-wise independent family combined with a random grouping of coordinates.
Equation Tuple ε-Satisfaction
An assignment A is said to ε-satisfy an equation tuple if it satisfies every equation in the tuple approximately, up to ε times the scaling factor. For the example tuple
  u1 - v1 = 0, u2 + u3 - v2 = 0, u1 + u2 + u3 - v1 - v2 - v3 = 0, u1 = 0, u3 + v1 - v2 = 0,
this requires, e.g., |u2 + u3 - v2| < ε(u1 + u2 + u3).
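A small check of the ε-satisfaction condition as defined above; the left-hand-side values and scaling factor below are illustrative.

```python
def eps_satisfies(tuple_lhs, scale, eps):
    """An assignment eps-satisfies a tuple if every equation's left-hand side is
    small relative to the scaling factor:  |LHS| < eps * scale."""
    return all(abs(lhs) < eps * scale for lhs in tuple_lhs)

# intended assignment from earlier (u gets label 3, v gets label 2):
# every equation of the example tuple evaluates to 0, so it is eps-satisfied
print(eps_satisfies([0, 0, 0, 0, 0], scale=1.0, eps=0.01))  # True
```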