Hardness of Learning Halfspaces with Noise
Prasad Raghavendra
Advisor: Venkatesan Guruswami
Spam Problem
• Example perceptron computation: 2×1 + 3×1 + 3×0 + 1×1 + 7×0 = 6.
• Since 6 > 3 (the threshold), output SPAM.
(Figure: a PERCEPTRON with weights 2, 3, 3, 1, 7 applied to the binary features 1, 1, 0, 1, 0.)
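The following is a minimal sketch of the perceptron decision rule illustrated above; the weights, features, and threshold are the ones shown on the slide.

```python
# Perceptron decision rule from the spam example above.
weights  = [2, 3, 3, 1, 7]   # one weight per feature
features = [1, 1, 0, 1, 0]   # binary features of the incoming message
threshold = 3

score = sum(w * x for w, x in zip(weights, features))  # 2 + 3 + 0 + 1 + 0 = 6
label = "SPAM" if score > threshold else "NOT SPAM"
print(score, label)  # 6 SPAM
```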
Halfspace Learning Problem
Input: training samples
• Vectors: W1, W2, …, Wm ∈ {-1,1}^n
• Labels: l1, l2, …, lm ∈ {-1,1}
Output: separating halfspace (A, θ), where θ is the threshold:
• A ∙ Wi < θ if li = -1
• A ∙ Wi ≥ θ if li = 1
(Figure: SPAM (+) and NOT SPAM (-) examples in the plane, separated by a line.)
Perspective
• Perceptron classifiers are the simplest neural networks – widely used for classification.
• Perceptron learning algorithms can learn if the data is perfectly separable.
(Figure: SPAM (+) and NOT SPAM (-) examples in the plane.)
Inseparability
• Who said halfspaces can classify SPAM vs NOT SPAM? The data may be inherently inseparable → Agnostic Learning.
• Even if the data is separable, what about noise? Noise is inherent in many forms of data → PAC learning with noise.
In Presence of Noise
• Agreement: the fraction of the examples classified correctly.
• Example: a halfspace that classifies 16 of the 20 examples correctly has agreement 0.8, i.e., 80%.
• Halfspace Maximum Agreement (HSMA) Problem: "Find the hyperplane that maximizes the agreement with the training examples."
(Figure: + and - examples with a separating line that misclassifies a few of them.)
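A small sketch of the agreement measure just defined: it scores a candidate halfspace (A, θ) against labeled examples, using the classification convention from the problem definition. The toy data below is purely illustrative.

```python
def agreement(A, theta, samples, labels):
    """Fraction of labeled examples the halfspace (A, theta) classifies correctly.
    Convention from the problem definition: predict +1 iff A . W >= theta."""
    correct = 0
    for W, l in zip(samples, labels):
        pred = 1 if sum(a * w for a, w in zip(A, W)) >= theta else -1
        correct += (pred == l)
    return correct / len(samples)

# toy data, purely illustrative
samples = [(-1, 1, -1, 1), (1, 1, 1, -1), (-1, -1, 1, 1)]
labels  = [1, -1, 1]
print(agreement((0.5, 1.0, -0.5, 1.0), 0.0, samples, labels))
```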
Related Work: Positive Results
Random Classification Noise (each label flipped with probability less than 1/2):
• [Blum-Frieze-Kannan-Vempala 96]: a PAC learning algorithm that outputs a decision list of halfspaces.
• [Cohen 97]: a proper learning algorithm (outputs a halfspace) for learning halfspaces.
Assumptions on the distribution of examples:
• [Kalai-Klivans-Mansour-Servedio 05]: an algorithm that finds a close-to-optimal halfspace when the examples come from the uniform or any log-concave distribution.
Related Work: Negative Results
• [Amaldi-Kann 98, Ben-David-Eiron-Long 92]: HSMA is NP-hard to approximate within some constant factor (261/262 and 415/418).
• [Bshouty-Burroughs 02]: HSMA is NP-hard to approximate better than 84/85.
• [Arora-Babai-Stern-Sweedyk 97, Amaldi-Kann 98]: NP-hard to minimize disagreements within a factor of 2^{O(log n)}.
Open Problem
Given that 99.9% of the examples are correctly labeled:
• No algorithm is known that finds a halfspace with agreement 51%.
• No hardness result rules out getting an agreement of 99%.
• Closing this gap was stated as an open problem by [Blum-Frieze-Kannan-Vempala 96].
• Highlighted in recent work by [Feldman 06] on (1-ε, 1/2+δ) tight hardness of learning monomials.
Our Result
For any ε, δ > 0, given a set of training examples, it is NP-hard to distinguish between the following two cases:
• There is a halfspace with agreement 1 - ε.
• No halfspace has agreement greater than 1/2 + δ.
Even with 99.9% of the examples non-noisy, the best we can do is output a random/trivial halfspace!
Remarks
• [Feldman-Gopalan-Khot-Ponnuswami 06] independently showed a similar result.
• Our hardness result holds even for boolean examples in {-1,1}^n (their result holds for R^n).
• [Feldman et al.]'s hardness result gives stronger hardness in the sub-constant regime.
• We also show: given a system of linear equations over the integers that is 1-ε satisfiable, it is NP-hard to find an assignment that satisfies more than a δ fraction of the equations.
Linear Inequalities
• Unknowns: A = (a1, a2, a3, a4) and the threshold θ. Let the halfspace be a1x1 + a2x2 + … + anxn ≥ θ.
• Suppose W1 = (-1, 1, -1, 1) with label l1 = 1. The constraint is a1(-1) + a2(1) + a3(-1) + a4(1) ≥ θ.
• Each labeled example yields one such inequality, e.g. a1 + a2 + a3 + a4 ≥ θ, a1 + a2 + a3 - a4 < θ, a1 - a2 + a3 - a4 ≥ θ, …
• Solving a system of linear inequalities ⟷ Learning a halfspace.
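The sketch below spells out this equivalence: each labeled example is turned into the linear inequality it induces on the unknowns (a1, …, an, θ). The representation (coefficient list plus relation string) is just an illustrative choice.

```python
def constraints(samples, labels):
    """Turn each labeled example into the linear inequality it induces on the
    unknown halfspace (a_1, ..., a_n, theta):
        label +1  ->  a . W >= theta
        label -1  ->  a . W <  theta
    Each constraint is returned as (coefficients of a_1..a_n, relation)."""
    out = []
    for W, l in zip(samples, labels):
        relation = ">= theta" if l == 1 else "< theta"
        out.append((list(W), relation))
        # e.g. W = (-1, 1, -1, 1), l = 1 gives  -a1 + a2 - a3 + a4 >= theta
    return out

print(constraints([(-1, 1, -1, 1)], [1]))
```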
Label Cover Problem
• U, V: sets of vertices; E: set of edges; {1, 2, …, R}: set of labels; πe: constraint (projection) on edge e.
• An assignment A satisfies an edge e = (u, v) ∈ E if πe(A(u)) = A(v).
• Goal: find an assignment A that satisfies the maximum number of edges.
(Figure: a bipartite graph between U and V; on the edge e = (u, v) shown, πe(3) = 2.)
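A minimal representation of a Label Cover instance and of the fraction of edges an assignment satisfies, matching the definition above; the tiny instance and variable names are illustrative only.

```python
def satisfied_fraction(edges, pi, assignment):
    """edges: list of (u, v) pairs; pi[(u, v)] maps labels of u to labels of v;
    assignment maps every vertex to a label in {1, ..., R}.
    An edge (u, v) is satisfied when pi_e(A(u)) = A(v)."""
    ok = sum(1 for (u, v) in edges
             if pi[(u, v)].get(assignment[u]) == assignment[v])
    return ok / len(edges)

# tiny illustrative instance with R = 3 labels
edges = [("u", "v")]
pi = {("u", "v"): {1: 2, 2: 2, 3: 2}}   # in particular pi_e(3) = 2, as on the slide
print(satisfied_fraction(edges, pi, {"u": 3, "v": 2}))  # 1.0
```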
Hardness of Label Cover [Raz 98]
There exists γ > 0 such that, given a label cover instance Γ = (U, V, E, R, π), it is NP-hard to distinguish between:
• Γ is completely satisfiable.
• No assignment satisfies more than a 1/R^γ fraction of the edges.
Aim
Translate the Label Cover gap (SATISFIABLE vs. at most 1/R^γ SATISFIABLE) into a system of homogeneous linear inequalities with +1, -1 coefficients over the variables a1, a2, a3, a4, θ, i.e., into a halfspace-learning instance of the kind shown on the previous slides.
Variables
• For each vertex u, introduce R variables: u1, u2, …, uR.
• Intended assignment: if u is assigned label k, then uk = 1 and uj = 0 for all j ≠ k.
Equation Tuples
An equation tuple consists of:
• For all u: u1 + u2 + … + uR = 1, and for all u, v: (u1 + u2 + … + uR) - (v1 + v2 + … + vR) = 0 — all vertices are assigned exactly one label.
• For every constraint πe and all 1 ≤ k ≤ R: ∑ ui = vk, the summation over all i with πe(i) = k (e.g. u1 - v1 = 0, u2 + u3 - v2 = 0).
• Pick t variables ui at random and set ui = 0 — over all random choices, most of the variables are zero.
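A sketch of the intended (completeness) assignment and of the projection equations in a tuple: the indicator encoding of a label satisfies ∑_{i: πe(i)=k} ui = vk exactly. The projection and labels below are the slide's small example; the function names are my own.

```python
def indicator(label, R):
    """Intended assignment: u_k = 1 for the chosen label, all other u_j = 0."""
    return [1 if k == label else 0 for k in range(1, R + 1)]

def check_projection(u_vars, v_vars, pi_e, R):
    """For every label k of v, check   sum_{i : pi_e(i) = k} u_i  =  v_k."""
    for k in range(1, R + 1):
        lhs = sum(u_vars[i - 1] for i in range(1, R + 1) if pi_e[i] == k)
        if lhs != v_vars[k - 1]:
            return False
    return True

R = 3
pi_e = {1: 1, 2: 2, 3: 2}          # the slide's projection: u1 -> v1, u2, u3 -> v2
u_vars = indicator(3, R)           # u gets label 3
v_vars = indicator(pi_e[3], R)     # v gets the projected label 2
print(check_projection(u_vars, v_vars, pi_e, R))  # True: u1 - v1 = 0 and u2 + u3 - v2 = 0
```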
Equation Tuples
• If Γ is SATISFIABLE, there is an assignment that satisfies most of the equation tuples.
• Approximate satisfaction is measured relative to the scaling factor u1 + u2 + … + uR: an equation such as u2 + u3 - v2 = 0 is not even approximately satisfied if |u2 + u3 - v2| > ε(u1 + u2 + … + uR).
Next Step
• Goal: each variable appears exactly once in a tuple, with coefficient +1 or -1. Example tuple: u1 - v1 = 0, u2 + u3 - v2 = 0, u1 + u2 + u3 - v1 - v2 - v3 = 0, u1 = 0, u3 + v1 - v2 = 0.
• Introduce several copies of the variables, and add consistency checks between the different copies of the same variable (a sketch follows this list).
• This amplifies soundness: from one unsatisfied equation per tuple to most tuples having C equations that are not even approximately satisfied.
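The sketch below only illustrates the bookkeeping: each occurrence of a variable gets its own copy, and copies of the same variable are tied together by consistency equations. It is purely illustrative; the actual reduction places the consistency checks and chooses the copies more carefully than this.

```python
def split_into_copies(equations):
    """equations: list of dicts {variable: +1 or -1}.  Replace each occurrence
    of a variable by a fresh copy (var, j), so every copy appears in exactly one
    of the original equations, then tie copies of the same variable together with
    consistency equations  copy_j - copy_{j+1} = 0.  Illustrative sketch only."""
    seen = {}                 # variable -> number of copies created so far
    new_eqs = []
    for eq in equations:
        new_eq = {}
        for var, coef in eq.items():
            j = seen.get(var, 0)
            seen[var] = j + 1
            new_eq[(var, j)] = coef
        new_eqs.append(new_eq)
    consistency = [{(var, j): +1, (var, j + 1): -1}
                   for var, count in seen.items() for j in range(count - 1)]
    return new_eqs, consistency

eqs = [{"u1": +1, "v1": -1},
       {"u1": +1, "u2": +1, "v1": -1, "v2": -1}]
print(split_into_copies(eqs))
```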
Recap
• Each variable appears exactly once in a tuple, with coefficient +1 or -1.
• If Γ is SATISFIABLE: most tuples are completely satisfied.
• If Γ is at most 1/R^γ SATISFIABLE: most tuples have C equations that are not even approximately satisfied.
• Remaining task: using linear inequalities, distinguish between a tuple that is completely satisfied and one in which at least C of its equations are not even approximately satisfied.
Observation
• For B > 0: |A| < B ⟺ A - B < 0 and A + B ≥ 0.
• Pick one of the equation tuples at random and combine its equations with random ±1 signs; the scaling factor is u1 + u2 + … + uR. For example:
  (+1)(u1 - v1) + (+1)(u4 + u5 - v2) + (-1)(u6 + u2 + u7 - v4 - v5 - v6) + (+1)(u3) + (-1)(u8 + v3 - v7)
  = u1 - u2 + u3 + u4 + u5 - u6 - u7 - u8 - v1 - v2 - v3 + v4 + v5 + v6 + v7
• Emit the two inequalities:
  (u1 - u2 + u3 + u4 + u5 - u6 - u7 - u8 - v1 - v2 - v3 + v4 + v5 + v6 + v7) - (u1 + u2 + … + uR) < 0
  (u1 - u2 + u3 + u4 + u5 - u6 - u7 - u8 - v1 - v2 - v3 + v4 + v5 + v6 + v7) + (u1 + u2 + … + uR) ≥ 0
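A small sketch of this observation: encoding "|A| is small relative to the scaling factor B" as the pair of inequalities above, and forming a random ±1 combination of a tuple's left-hand sides. The values are illustrative, not part of the reduction.

```python
import random

def two_inequalities(A, B):
    """Encode the check |A| < B (with B > 0) as the pair used on the slide:
    A - B < 0   and   A + B >= 0."""
    return (A - B < 0, A + B >= 0)

def random_combination(equation_lhs_values):
    """Combine the left-hand sides of a tuple's equations with random +1/-1 signs.
    Every variable keeps coefficient +1 or -1 because no variable occurs in two
    equations of the same tuple."""
    return sum(random.choice((+1, -1)) * val for val in equation_lhs_values)

# illustrative: all equations exactly satisfied -> combination is 0, both inequalities hold
lhs_values = [0, 0, 0, 0, 0]
A = random_combination(lhs_values)
B = 1.0   # scaling factor u1 + ... + uR under the intended assignment
print(two_inequalities(A, B))   # (True, True)
```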
Good Case
With high probability over the choice of tuple, the assignment satisfies every equation exactly:
  u1 - v1 = 0, u2 + u3 - v2 = 0, u1 + u2 + u3 - v1 - v2 - v3 = 0, u1 = 0, u3 + v1 - v2 = 0,
so the ±1 combination also evaluates to 0:
  u1 - u2 + u3 + u4 + u5 - u6 - u7 - u8 - v1 - v2 - v3 + v4 + v5 + v6 + v7 = 0.
The assignment also satisfies u1 + u2 + … + uR = 1, so
  (combination) - (u1 + u2 + … + uR) < 0   and   (combination) + (u1 + u2 + … + uR) ≥ 0.
BOTH INEQUALITIES ARE SATISFIED.
Bad Case
• With high probability over the choice of equation tuple, many of its equations are badly violated: |u1 - v1|, |u2 + u3 - v2|, |u1 + u2 + u3 - v1 - v2 - v3|, … each exceed ε(u1 + u2 + … + uR).
• For large enough C, with high probability over the choice of the ±1 combination,
  |u1 - u2 + u3 + u4 + u5 - u6 - u7 - u8 - v1 - v2 - v3 + v4 + v5 + v6 + v7| > u1 + u2 + … + uR,
so AT MOST ONE of the two inequalities
  (combination) - (u1 + u2 + … + uR) < 0   and   (combination) + (u1 + u2 + … + uR) ≥ 0
is satisfied.
Interesting Set of Vectors
• The set of all possible {-1,1} combinations is exponentially large.
• Instead, construct a polynomial-size subset S of {-1,1}^n such that for any vector v = (v1, v2, …, vn) with sufficiently many large coordinates (> ε), at least a 1-δ fraction of the vectors u ∈ S satisfy |u∙v| > 1.
• Construction: a 4-wise independent family combined with a random grouping of the coordinates.
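An empirical check of the property S must have. Note the hedge: this sketch uses fully independent random sign vectors as a stand-in for S; it does not implement the 4-wise independent family plus random grouping used in the actual construction, it only illustrates the |u∙v| > 1 property being measured.

```python
import random

def fraction_large_inner_product(S, v):
    """Fraction of sign vectors u in S with |u . v| > 1."""
    big = sum(1 for u in S if abs(sum(ui * vi for ui, vi in zip(u, v))) > 1)
    return big / len(S)

n, eps = 200, 0.1
# v has many coordinates of magnitude greater than eps
v = [random.choice((-1, 1)) * random.uniform(eps, 1.0) for _ in range(n)]
# stand-in for S: fully independent random sign vectors (NOT the talk's
# 4-wise independent construction, just an illustration of the property)
S = [[random.choice((-1, 1)) for _ in range(n)] for _ in range(500)]
print(fraction_large_inner_product(S, v))   # typically close to 1
```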
Construction
(Figure: a ±1 sign vector dotted with v = (V1, …, V7), where several coordinates have |Vi| > ε, giving -V1 + V2 - V3 + V4 - V5 + V6 + V7 > 1.)
• Using a 4-wise independent family of sign vectors: |u∙v| > 1 with some constant probability.
• Using all 2^n combinations: probability close to 1.
4-wise Independent Set Construction
(Figure: the coordinates V1, …, Vn are randomly partitioned into groups S1, S2, …; within each group the signs come from a 4-wise independent family, and choices across groups combine as in the all-combinations case. By independence of the grouping and Chernoff bounds, the constant per-group success probability is boosted to probability close to 1.)
Conclusion
• Either an assumption on the distribution of examples or on the noise is necessary for efficient halfspace learning algorithms.
• [Raghavendra-Venkatesan] A similar hardness result holds for learning support vector machines in the presence of adversarial noise.
Details
• The set of all possible {-1,1} combinations is exponentially large → construction using a 4-wise independent family and a random grouping of coordinates.
• No variable should occur more than once in an equation tuple, to ensure that ultimately the inequalities all have coefficients in {-1,1} → use different copies of the variables for different equations, with a careful choice of consistency checks between the copies.
Interesting Set of Vectors
• The set of all possible {-1,1} combinations is exponentially large.
• Construct a polynomial-size subset S of {-1,1}^n such that for any vector v = (v1, v2, …, vn) with sufficiently many large coordinates (> ε), at most a δ fraction of the vectors u ∈ S satisfy |u∙v| < 1.
• Construction: a 4-wise independent family combined with a random grouping of coordinates.
Equation Tuple ε-Satisfaction
An assignment A is said to ε-satisfy an equation tuple if it satisfies every equation in the tuple approximately, up to ε times the scaling factor. For the example tuple
  u1 - v1 = 0, u2 + u3 - v2 = 0, u1 + u2 + u3 - v1 - v2 - v3 = 0, u1 = 0, u3 + v1 - v2 = 0,
this requires, e.g., |u2 + u3 - v2| < ε(u1 + u2 + u3).
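A small check of the ε-satisfaction condition as defined above; the left-hand-side values and scaling factor below are illustrative.

```python
def eps_satisfies(tuple_lhs, scale, eps):
    """An assignment eps-satisfies a tuple if every equation's left-hand side is
    small relative to the scaling factor:  |LHS| < eps * scale."""
    return all(abs(lhs) < eps * scale for lhs in tuple_lhs)

# intended assignment from earlier (u gets label 3, v gets label 2):
# every equation of the example tuple evaluates to 0, so it is eps-satisfied
print(eps_satisfies([0, 0, 0, 0, 0], scale=1.0, eps=0.01))  # True
```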