Sketching for M-Estimators: A Unified Approach to Robust Regression
Kenneth Clarkson, David Woodruff
IBM Almaden
Regression
Linear Regression
• Statistical method to study linear dependencies between variables in the presence of noise.
Example
• Ohm's law: V = R ∙ I
• Find the linear function that best fits the data
Regression
Standard Setting
• One measured variable b
• A set of predictor variables a_1, …, a_d
• Assumption: b = x_0 + a_1 x_1 + … + a_d x_d + e
  e is assumed to be noise and the x_i are model parameters we want to learn
• Can assume x_0 = 0
• Now consider n observations of b
Regression
Matrix form
Input: n × d matrix A and a vector b = (b_1, …, b_n)
n is the number of observations; d is the number of predictor variables
Output: x* so that Ax* and b are close
• Consider the over-constrained case, when n ≫ d
Fitness Measures
Least Squares Method
• Find x* that minimizes |Ax-b|_2^2
• Ax* is the projection of b onto the column span of A
• Certain desirable statistical properties
• Closed form solution: x* = (A^T A)^{-1} A^T b
Method of least absolute deviation (l_1-regression)
• Find x* that minimizes |Ax-b|_1 = Σ_i |b_i - ⟨A_{i*}, x⟩|
• Cost is less sensitive to outliers than least squares
• Can solve via linear programming
What about the many other fitness measures used in practice?
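As a concrete companion to the two measures above, here is a minimal Python sketch (ours, not from the talk) that solves least squares via the normal equations and l_1-regression via a linear program; the toy data and the use of numpy/scipy are illustrative assumptions.

```python
# Illustrative sketch: least squares via the normal equations and
# l1-regression via an LP on hypothetical toy data.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 100, 3
A = rng.standard_normal((n, d))
b = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

# Least squares: x* = (A^T A)^{-1} A^T b (np.linalg.lstsq is preferable in practice).
x_ls = np.linalg.solve(A.T @ A, A.T @ b)

# l1-regression: minimize sum_i t_i subject to -t <= Ax - b <= t,
# over the stacked variable vector (x, t).
c = np.concatenate([np.zeros(d), np.ones(n)])
A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])
b_ub = np.concatenate([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + n))
x_l1 = res.x[:d]
print(x_ls, x_l1)
```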
M-Estimators
• Measure function G: R → R_{≥0}
• G(x) = G(-x), G(0) = 0
• G is non-decreasing in |x|
• |y|_M = Σ_{i=1}^n G(y_i)
• Solve min_x |Ax-b|_M
• Least squares and l_1-regression are special cases
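To make the definition concrete, here is a tiny sketch (our code) of the cost |y|_M = Σ_i G(y_i) as a function of an arbitrary measure function G; the function name is ours.

```python
# Minimal sketch of the M-estimator cost |y|_M = sum_i G(y_i).
import numpy as np

def m_cost(y, G):
    """Apply the measure function G entrywise and sum: |y|_M = sum_i G(y_i)."""
    return float(np.sum(G(np.asarray(y))))

# Least squares and l1-regression are the special cases G(x) = x^2 and G(x) = |x|.
y = np.array([0.5, -2.0, 3.0])
print(m_cost(y, lambda x: x**2), m_cost(y, np.abs))
```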
Huber Loss Function
• G(x) = x^2/(2c) for |x| ≤ c
• G(x) = |x| - c/2 for |x| > c
• Enjoys smoothness properties of l_2^2 and robustness properties of l_1
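A direct transcription of the Huber measure function into code (the threshold c is a free parameter; the default value below is illustrative):

```python
# Huber measure function: quadratic near zero, linear in the tails.
import numpy as np

def huber_G(x, c=1.0):
    """G(x) = x^2/(2c) for |x| <= c, and |x| - c/2 for |x| > c (continuous at |x| = c)."""
    a = np.abs(x)
    return np.where(a <= c, a**2 / (2 * c), a - c / 2)
```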
Other Examples
• L1-L2 estimator: G(x) = 2((1 + x^2/2)^{1/2} - 1)
• Fair estimator: G(x) = c^2 [ |x|/c - log(1 + |x|/c) ]
• Tukey estimator: G(x) = (c^2/6)(1 - [1 - (x/c)^2]^3) if |x| ≤ c
  G(x) = c^2/6 if |x| > c
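Likewise, direct transcriptions of the three measure functions listed above (again with an illustrative tuning constant c):

```python
# L1-L2, Fair, and Tukey measure functions from the slide above.
import numpy as np

def l1_l2_G(x):
    return 2.0 * (np.sqrt(1.0 + x**2 / 2.0) - 1.0)

def fair_G(x, c=1.0):
    a = np.abs(x) / c
    return c**2 * (a - np.log1p(a))

def tukey_G(x, c=1.0):
    # Constant c^2/6 outside |x| <= c, so the loss saturates (not convex).
    return np.where(np.abs(x) <= c,
                    (c**2 / 6.0) * (1.0 - (1.0 - (x / c)**2) ** 3),
                    c**2 / 6.0)
```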
Nice M-Estimators
• An M-Estimator is nice if it has at least linear growth and at most quadratic growth
• There is C_G > 0 so that for all a, a' with |a| ≥ |a'| > 0, |a/a'|^2 ≥ G(a)/G(a') ≥ C_G |a/a'|
• Any convex G satisfies the linear lower bound
• Any sketchable G satisfies the quadratic upper bound
• Sketchable: there is a distribution on t × n matrices S for which |Sx|_M = Θ(|x|_M) with probability 2/3 and t is independent of n
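As a quick sanity check (ours, not part of the talk), one can verify the growth bounds numerically for the Huber function; the empirical minimum of G(a)/G(a') divided by |a/a'| gives a witness for the constant C_G.

```python
# Numerical check of the "nice" growth bounds for Huber:
# |a/a'|^2 >= G(a)/G(a') >= C_G |a/a'| whenever |a| >= |a'| > 0.
import numpy as np

def huber_G(x, c=1.0):
    a = np.abs(x)
    return np.where(a <= c, a**2 / (2 * c), a - c / 2)

rng = np.random.default_rng(1)
a_small = rng.uniform(0.01, 1.0, size=10000)
a_big = a_small * rng.uniform(1.0, 100.0, size=10000)
ratio = huber_G(a_big) / huber_G(a_small)
growth = a_big / a_small
print(np.max(ratio / growth**2))  # at most ~1: the quadratic upper bound
print(np.min(ratio / growth))     # an empirical witness for C_G
```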
Our Results
Let nnz(A) denote the number of non-zero entries of an n × d matrix A
• [Huber] O(nnz(A) log n) + poly(d log n / ε) time algorithm to output x' so that w.h.p. |Ax'-b|_H ≤ (1+ε) min_x |Ax-b|_H
• [Nice M-Estimators] O(nnz(A)) + poly(d log n) time algorithm to output x' so that for any constant C > 1, w.h.p. |Ax'-b|_M ≤ C · min_x |Ax-b|_M
Remarks:
• For convex nice M-estimators one can solve via convex programming, but this is slow: poly(nd) time
• Our algorithm for nice M-estimators is universal
Talk Outline • Huber result • Nice M-Estimators result
Naive Sampling Algorithm
• min_x |Ax - b|_M
• x' = argmin_x |SAx - Sb|_M, where S uniformly samples poly(d/ε) rows
• This is a terrible algorithm
Leverage Score Sampling [Dasgupta et al.]
• For l_p-norms, there are probabilities q_1, …, q_n with Σ_i q_i = poly(d/ε) so that sampling works
• For l_2, the q_i are the squared row norms in an orthonormal basis of A
• For l_p, the q_i are p-th powers of the p-norms of rows in a "well-conditioned basis"
• x' = argmin_x |SAx - Sb|_M
• All q_i can be found in O(nnz(A) log n) + poly(d) time
• S is diagonal; S_{i,i} = 1/q_i if row i is sampled, 0 otherwise
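For the l_2 case, the scheme above can be sketched in a few lines. This is our simplified illustration, not the paper's exact construction: scores from an SVD, independent Bernoulli sampling, and 1/sqrt(p_i) rescaling so the sampled squared l_2 cost is unbiased; the oversampling factor is an illustrative parameter.

```python
# Simplified l2 leverage-score sampling sketch.
import numpy as np

def leverage_score_sample(A, b, oversample=10.0, seed=0):
    rng = np.random.default_rng(seed)
    U, _, _ = np.linalg.svd(A, full_matrices=False)  # orthonormal basis for the column span of A
    scores = np.sum(U**2, axis=1)                    # q_i = squared row norms of U; they sum to d
    p = np.minimum(1.0, oversample * scores)         # keep row i with probability p_i
    keep = rng.random(A.shape[0]) < p
    w = 1.0 / np.sqrt(p[keep])                       # rescale so the sampled squared l2 cost is unbiased
    return w[:, None] * A[keep], w * b[keep]

# x' = argmin over the much smaller sampled problem, e.g. with np.linalg.lstsq.
```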
Huber Regression Algorithm
• [Huber inequality]: For z ∈ R^n, Θ(n^{-1/2}) min(|z|_1, |z|_2^2/(2c)) ≤ |z|_H ≤ |z|_1
• Proof by case analysis
• Sample from a mixture of l_1-leverage scores and l_2-leverage scores: p_i = n^{1/2} · (q_i^(1) + q_i^(2))
• Our O(nnz(A) log n) + poly(d/ε) algorithm:
• After one step, the number of rows is < n^{1/2} poly(d/ε)
• Recursively solve a weighted Huber problem
• Weights do not grow quickly
• Once the size is < n^{0.01} poly(d/ε), solve by convex programming
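A hedged illustration (ours) of how the mixed sampling probabilities are formed: q^(2) are the l_2 leverage scores, and as a crude stand-in for the l_1 scores q^(1) we use normalized row l_1-norms of the same orthonormal basis. The real algorithm uses a well-conditioned basis for l_1; this sketch only shows the mixture p_i = n^{1/2}(q_i^(1) + q_i^(2)).

```python
# Illustrative mixed Huber sampling probabilities (the l1-score stand-in is an
# assumption of this sketch, not the paper's construction).
import numpy as np

def mixed_huber_probs(A, d_poly):
    n = A.shape[0]
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    q2 = np.sum(U**2, axis=1)                      # l2 leverage scores, summing to d
    q1 = np.sum(np.abs(U), axis=1)
    q1 *= d_poly / q1.sum()                        # crude stand-in for l1 scores, summing to d_poly
    return np.minimum(1.0, np.sqrt(n) * (q1 + q2)) # p_i = n^{1/2}(q_i^(1) + q_i^(2)), capped at 1
```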
Talk Outline • Huber result • Nice M-Estimators result
CountSketch
• For l_2 regression, CountSketch with poly(d) rows works [Clarkson, W]:
• Compute S·A in nnz(A) time
• Compute x' = argmin_x |SAx - Sb|_2 in poly(d) time
• S = [example t × n matrix with a single random ±1 entry in each column]
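A compact sketch of CountSketch applied to the rows of A (our code): each row is hashed to one of t buckets with a random sign, so S·A is formed in a single pass over the nonzeros; t is poly(d) in the setting above, but here it is just a parameter.

```python
# CountSketch applied to rows: S is t x n with one random +/-1 per column.
import numpy as np

def countsketch(A, t, seed=0):
    """Return S @ A for a CountSketch matrix S, computed in O(nnz(A)) time."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    bucket = rng.integers(0, t, size=n)       # h(i): which row of S gets row i of A
    sign = rng.choice([-1.0, 1.0], size=n)    # sigma(i): the random sign
    SA = np.zeros((t, A.shape[1]))
    np.add.at(SA, bucket, sign[:, None] * A)  # one pass over the rows
    return SA

# For l2 regression: sketch [A, b], then solve the small problem
# x' = argmin_x |SAx - Sb|_2 with np.linalg.lstsq.
```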
M-Sketch
• T stacks the blocks S_1·R_0, S_2·R_1, S_3·R_2, …, S_{log n}·R_{log n}
• S_i are independent CountSketch matrices with poly(d) rows
• R_i is n × n diagonal and uniformly samples a 1/b^i fraction of [n]
• Sketch used for estimating frequency moments [Indyk, W] and earthmover distance [Verbin, Zhang]
• Note: many uses of this data structure do not work here since they involve a median operation
• The same M-Sketch works for all nice M-estimators!
• x' = argmin_x |TAx - Tb|_{w,M}
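The structure above can be sketched as follows (our simplified construction: an independent CountSketch per level, uniform subsampling at rate b^{-i}, and per-level weights b^i; the branching factor b and bucket count t are illustrative parameters).

```python
# Simplified M-Sketch construction: stacked CountSketch blocks on geometrically
# subsampled rows, with weights compensating the subsampling rate.
import numpy as np

def m_sketch_apply(A, t, b=2.0, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    levels = int(np.ceil(np.log(max(n, 2)) / np.log(b))) + 1
    blocks, weights = [], []
    for i in range(levels):
        rows = np.nonzero(rng.random(n) < b ** (-i))[0]  # R_i: keep a ~b^{-i} fraction of [n]
        bucket = rng.integers(0, t, size=n)              # S_{i+1}: CountSketch hash ...
        sign = rng.choice([-1.0, 1.0], size=n)           # ... and random signs
        block = np.zeros((t, A.shape[1]))
        np.add.at(block, bucket[rows], sign[rows, None] * A[rows])
        blocks.append(block)
        weights.append(np.full(t, b ** i))               # weight b^i for level i
    return np.vstack(blocks), np.concatenate(weights)    # (T @ A, per-row weights w)
```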
M-Sketch Intuition
• Consider a fixed y = Ax - b
• For M-Sketch T, the output is |Ty|_{w,M} = Σ_i w_i G((Ty)_i)
• [Contraction] |Ty|_{w,M} ≥ ½ |y|_M with probability 1 - exp(-d log d)
• [Dilation] |Ty|_{w,M} ≤ 2 |y|_M with probability 9/10
• Contraction allows for a net argument (no scale-invariance!)
• Dilation implies the optimal y* does not dilate much
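The sketched cost in the second bullet is then a one-liner (our code; the example values are made up):

```python
# |Ty|_{w,M} = sum_i w_i * G((Ty)_i), for per-bucket weights w and measure function G.
import numpy as np

def weighted_m_cost(Ty, w, G):
    return float(np.sum(w * G(Ty)))

# Example with made-up bucket values and weights, and G(x) = |x| (the l1 case).
print(weighted_m_cost(np.array([1.0, -2.0]), np.array([1.0, 4.0]), np.abs))
```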
M-Sketch Analysis
• Partition into weight classes: S_i = {j | G(y_j) ∈ (|y|_M/b^i, |y|_M/b^{i-1}]}
• If |S_i| > d log d, there is a "sampling level" containing about d log d elements of S_i (gives exp(-d log d) failure probability)
• Elements from S_j for j ≤ i do not collide
• Elements from S_j for j > i cancel within a bucket (concentrate to the 2-norm)
• If |S_i| is small, all of its elements are found in the top level, or S_i is not important (relate M to l_2)
• If G is close to quadratic growth, need to "clip" the top buckets (Ky-Fan norm)
Conclusions
Summary:
• [Huber] O(nnz(A) log n) + poly(d log n / ε) time algorithm
• [Nice M-Estimators] O(nnz(A)) + poly(d) time algorithm
Questions:
1. Is there a sketch-based estimator for (1+ε)-approximation?
2. (Meta-question) Apply streaming techniques to linear algebra:
• CountSketch -> l_2-regression
• p-stable random variables -> l_p regression for p in [1,2]
• CountSketch + heavy hitters -> nice M-estimators
• Pagh's TensorSketch -> polynomial kernel regression
…