Sparse Recovery (Using Sparse Matrices)
Piotr Indyk, MIT
Heavy Hitters
• Also called frequent elements and elephants
• Define HH^p_φ(x) = { i : |x_i| ≥ φ ||x||_p }
• L_p Heavy Hitter Problem:
  • Parameters: φ and φ' (often φ' = φ − ε)
  • Goal: return a set S of coordinates s.t.
    • S contains HH^p_φ(x)
    • S is included in HH^p_φ'(x)
• L_p Point Query Problem:
  • Parameter: ε
  • Goal: at the end of the stream, given i, report x*_i = x_i ± ε ||x||_p
• Can solve the L2 point query problem, with parameter ε and failure probability P, by storing O(1/ε² log(1/P)) numbers
  • Sketches: [A-Gibbons-M-S'99], [Gilbert-Kotidis-Muthukrishnan-Strauss'01]
Sparse Recovery
(approximation theory, learning Fourier coefficients, linear sketching, finite rate of innovation, compressed sensing...)
• Setup:
  • Data/signal in n-dimensional space: x. E.g., x is a 256×256 image, n = 65536
  • Goal: compress x into a "sketch" Ax, where A is an m×n "sketch matrix", m << n
• Requirements:
  • Plan A: want to recover x from Ax
    • Impossible: underdetermined system of equations
  • Plan B: want to recover an "approximation" x* of x
    • Sparsity parameter k
    • Informally: want to recover the largest k coordinates of x
    • Formally: want x* such that ||x* − x||_p ≤ C(k) min_{x'} ||x' − x||_q over all x' that are k-sparse (at most k non-zero entries)
• Want:
  • Good compression (small m = m(k,n))
  • Efficient algorithms for encoding and recovery
• Why linear compression?
Application I: Monitoring Network Traffic Data Streams
• Router routes packets
  • Where do they come from?
  • Where do they go to?
• Ideally, would like to maintain a traffic matrix x[.,.]
  • Easy to update: given a (src, dst) packet, increment x_{src,dst}
  • Requires way too much space! (2^32 × 2^32 entries)
  • Need to compress x, increment easily
• Using linear compression we can (see the sketch below):
  • Maintain the sketch Ax under increments to x, since A(x+Δ) = Ax + AΔ
  • Recover x* from Ax
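A minimal numpy illustration of why linearity matters (the sizes, matrix, and update stream are my own illustrative choices): an increment to one coordinate of x is applied directly to Ax, so the huge vector x is never stored.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10_000, 200               # illustrative sizes: n coordinates, m-row sketch
A = rng.standard_normal((m, n))

sketch = np.zeros(m)             # maintains Ax; x itself is never stored

def increment(i, delta=1.0):
    """Apply x[i] += delta to the sketch: A(x + delta*e_i) = Ax + delta*A[:, i]."""
    sketch[:] += delta * A[:, i]

# Stream of (coordinate, count) updates, e.g. per-packet traffic increments
for i, d in [(3, 1.0), (3, 1.0), (7, 5.0)]:
    increment(i, d)
```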
Applications, ctd.
• Single pixel camera [Wakin, Laska, Duarte, Baron, Sarvotham, Takhar, Kelly, Baraniuk'06]
• Pooling experiments [Kainkaryam, Woolf'08], [Hassibi et al.'07], [Dai-Sheikh, Milenkovic, Baraniuk], [Shental-Amir-Zuk'09], [Erlich-Shental-Amir-Zuk'09]
Constructing matrix A
• "Most" matrices A work
• Sparse matrices:
  • Data stream algorithms
  • Coding theory (LDPCs)
• Dense matrices:
  • Compressed sensing
  • Complexity/learning theory (Fourier matrices)
• "Traditional" tradeoffs:
  • Sparse: computationally more efficient, explicit
  • Dense: shorter sketches
• Recent results: the "best of both worlds"
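For concreteness, here is how the two styles of matrix are typically instantiated (an illustrative sketch; the sizes and the per-column density d are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 1000, 100, 8      # illustrative sizes; d = ones per column

# Sparse style: 0-1 matrix with d ones placed uniformly in each column
# (the adjacency matrix of a random d-left-regular bipartite graph)
A_sparse = np.zeros((m, n))
for col in range(n):
    rows = rng.choice(m, size=d, replace=False)
    A_sparse[rows, col] = 1.0

# Dense style: i.i.d. Gaussian entries, as in compressed sensing
A_dense = rng.standard_normal((m, n)) / np.sqrt(m)
```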
Results
[Summary table of known sparse-recovery results, rated on the scale: Excellent / Very Good / Good / Fair]
Caveats: (1) most "dominated" results not shown; (2) only results for general vectors x are displayed; (3) sometimes the matrix type matters (Fourier, etc.)
Part I
Theorem: There is a distribution over m×n matrices A, m = O(k log n), such that for any x, given Ax, we can recover x* such that ||x − x*||_1 ≤ C·Err1, where Err1 = min_{k-sparse x'} ||x − x'||_1, with probability 1 − 1/n. The recovery algorithm runs in O(n log n) time.
This talk:
• Assume x ≥ 0 – this simplifies the algorithm and analysis; see the original paper for the general case
• Prove the following l∞/l1 guarantee: ||x − x*||_∞ ≤ C·Err1/k
  • This is actually stronger than the l1/l1 guarantee (cf. [CM'06], see also the Appendix)
• Note: [CM'04] originally proved a weaker statement where ||x − x*||_∞ ≤ C·||x||_1/k. The stronger guarantee follows from the analysis of [CCF'02] (cf. [GGIKMS'02]), who applied it to Err2
First attempt
• Matrix view:
  • A 0-1 w×n matrix A, with one 1 per column
  • The i-th column has its 1 at position h(i), where h(i) is chosen uniformly at random from {1…w}
• Hashing view:
  • Z = Ax
  • h hashes coordinates into "buckets" Z_1…Z_w, with w = (4/α)k (see the code sketch below)
  • Estimator: x*_i = Z_{h(i)}
• Closely related: [Estan-Varghese'03], "counting" Bloom filters
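A few lines of numpy implementing this single hash table (the sizes, the value of α, and the test signal are illustrative; note the estimator can only overestimate, since x ≥ 0):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, alpha = 10_000, 10, 0.5
w = int((4 / alpha) * k)             # number of buckets

h = rng.integers(0, w, size=n)       # random hash h: {0..n-1} -> {0..w-1}

x = np.zeros(n)                      # test signal: k elephants plus a light tail
x[rng.choice(n, k, replace=False)] = 100.0
x += rng.random(n) * 0.01

Z = np.bincount(h, weights=x, minlength=w)  # Z = Ax: bucket j sums x_i with h(i)=j

x_est = Z[h]                         # estimator: x*_i = Z_{h(i)}
```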
Analysis
• We show x*_i ≤ x_i + α·Err/k with probability > 1/2, where Err = min_{k-sparse x'} ||x − x'||_1
• Assume |x_{i1}| ≥ … ≥ |x_{in}| and let S = {i1…ik} ("elephants")
• When is x*_i > x_i + α·Err/k ?
  • Event 1: S and i collide, i.e., h(i) in h(S − {i})
    Probability: at most k/((4/α)k) = α/4 < 1/4 (if α < 1)
  • Event 2: many "mice" collide with i, i.e., ∑_{t∉S∪{i}, h(t)=h(i)} x_t > α·Err/k
    Probability: at most 1/4 (Markov inequality; see the calculation below)
• Total probability of "bad" events < 1/2
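The Markov step, written out (a short calculation the slide leaves implicit): the expected mice mass landing in bucket h(i) is the total mice mass divided by the number of buckets w = (4/α)k.

```latex
\mathbb{E}\!\left[\sum_{t \notin S\cup\{i\},\; h(t)=h(i)} x_t\right]
  \;=\; \frac{1}{w}\sum_{t \notin S\cup\{i\}} x_t
  \;\le\; \frac{\mathrm{Err}}{(4/\alpha)k}
  \;=\; \frac{\alpha}{4}\cdot\frac{\mathrm{Err}}{k},
\qquad\text{so}\qquad
\Pr\!\left[\sum_t x_t > \alpha\,\frac{\mathrm{Err}}{k}\right]
  \;\le\; \frac{\mathbb{E}\!\left[\sum_t x_t\right]}{\alpha\,\mathrm{Err}/k}
  \;\le\; \frac{1}{4}.
```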
Second try
• Algorithm:
  • Maintain d functions h_1…h_d and vectors Z^1…Z^d
  • Estimator: x*_i = min_t Z^t_{h_t(i)}
• Analysis:
  • Pr[ |x*_i − x_i| ≥ α·Err/k ] ≤ 1/2^d
  • Setting d = O(log n) (and thus m = O(k log n)) ensures that w.h.p. |x*_i − x_i| < α·Err/k
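Combining the two pieces gives the Count-Min sketch of [CM'04] in a few lines (a hedged sketch: I use fully random hash functions and an illustrative test signal, whereas the construction needs only pairwise independence):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, alpha, d = 10_000, 10, 0.5, 12
w = int((4 / alpha) * k)

H = rng.integers(0, w, size=(d, n))          # d independent hash functions
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = 100.0   # elephants
x += rng.random(n) * 0.01                    # mice

# Sketch: d arrays of w counters, m = d*w = O(k log n) numbers in total
Z = np.stack([np.bincount(H[t], weights=x, minlength=w) for t in range(d)])

# Estimator (valid since x >= 0): x*_i = min over rows t of Z[t, h_t(i)]
x_est = Z[np.arange(d)[:, None], H].min(axis=0)

err1 = np.sort(x)[:-k].sum()                 # Err = mass outside the top k
assert np.all(x_est >= x - 1e-9)             # min of overestimates never undershoots
print("max overestimate:", (x_est - x).max(), " alpha*Err/k:", alpha * err1 / k)
```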
Compressed Sensing
• A.k.a. compressive sensing or compressive sampling [Candes-Romberg-Tao'04; Donoho'04]
• Signal acquisition/processing framework:
  • Want to acquire a signal x = [x_1…x_n]
  • Acquisition proceeds by computing Ax + noise, of dimension m << n
    • We ignore noise in this lecture
  • From Ax we want to recover an approximation x* of x
    • Note: x* does not have to be k-sparse in general
  • Method: solve the following program:
      minimize ||x*||_1 subject to Ax* = Ax
  • Guarantee: for some C > 1
    (1) ||x − x*||_1 ≤ C · min_{k-sparse x''} ||x − x''||_1
    (2) ||x − x*||_2 ≤ C · min_{k-sparse x''} ||x − x''||_1 / k^{1/2}
Linear Programming
• Recovery:
  • minimize ||x*||_1
  • subject to Ax* = Ax
• This is a linear program:
  • minimize ∑_i t_i
  • subject to
    • −t_i ≤ x*_i ≤ t_i
    • Ax* = Ax
• Can solve in n^{Θ(1)} time
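The same LP written out with scipy (a sketch with illustrative sizes and test signal): the variable vector stacks x* and t, the objective is ∑ t_i, and −t_i ≤ x*_i ≤ t_i becomes two blocks of inequality rows.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, k = 60, 30, 3
A = rng.standard_normal((m, n)) / np.sqrt(m)
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k) * 5
b = A @ x                                              # the sketch Ax

I = np.eye(n)
c = np.concatenate([np.zeros(n), np.ones(n)])          # minimize sum(t)
A_ub = np.block([[ I, -I],                             #  x* - t <= 0
                 [-I, -I]])                            # -x* - t <= 0
b_ub = np.zeros(2 * n)
A_eq = np.hstack([A, np.zeros((m, n))])                # A x* = Ax

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
              bounds=[(None, None)] * n + [(0, None)] * n)
x_star = res.x[:n]
print("recovery error:", np.linalg.norm(x_star - x))   # ~0 for exactly sparse x
```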
Intuition
• LP:
  • minimize ||x*||_1
  • subject to Ax* = Ax
• At first sight, somewhat mysterious
• But the approach has a long history in signal processing, statistics, etc.
• Intuition:
  • The actual goal is to minimize ||x*||_0
  • But this comes down to the same thing (if A is "nice")
  • The choice of L1 is crucial (L2 does not work); see the worked example below
[Figure: the constraint set {x* : Ax* = Ax} for n=2, m=1, k=1, touching the L1 ball at a sparse point]
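A tiny worked case of this picture (my own numbers, not from the slides): take n=2, m=1, A = [1 2], and x = (0,1), so Ax = 2. Over the feasible line x*_1 + 2x*_2 = 2, the L2 and L1 minimizers differ:

```latex
\min \|x^*\|_2 \ \text{s.t.}\ x^*_1 + 2x^*_2 = 2
  \;\Rightarrow\; x^* = A^\top (AA^\top)^{-1}\,2 = (0.4,\ 0.8)
  \quad\text{(dense, wrong)},
\\[4pt]
\min \|x^*\|_1 \ \text{s.t.}\ x^*_1 + 2x^*_2 = 2
  \;\Rightarrow\; x^* = (0,\ 1) = x
  \quad\text{(1-sparse, exact recovery)}.
```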
Restricted Isometry Property
• A matrix A satisfies RIP of order k with constant δ if for any k-sparse vector x we have
    (1−δ) ||x||_2 ≤ ||Ax||_2 ≤ (1+δ) ||x||_2
• Theorem 1: If each entry of A is i.i.d. N(0,1) (scaled by 1/√m), and m = Θ(k log(n/k)), then A satisfies (k,1/3)-RIP w.h.p.
• Theorem 2: (O(k),1/3)-RIP implies the L2/L1 guarantee
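Verifying RIP exactly is computationally hard, but it can be probed empirically on random sparse vectors (an illustrative spot check with assumed sizes, not a certificate; sampling only lower-bounds the true constant):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 512, 120, 10
A = rng.standard_normal((m, n)) / np.sqrt(m)   # scaling makes E||Ax||^2 = ||x||^2

ratios = []
for _ in range(2000):
    x = np.zeros(n)
    x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
    ratios.append(np.linalg.norm(A @ x) / np.linalg.norm(x))

# (k, delta)-RIP requires every ratio to lie in [1-delta, 1+delta]
print("empirical range:", min(ratios), max(ratios))
```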
Dense vs. sparse
• Restricted Isometry Property (RIP) [Candes-Tao'04]: x is k-sparse ⇒ ||x||_2 ≤ ||Ax||_2 ≤ C ||x||_2
• Consider m×n 0-1 matrices with d ones per column
• Do they satisfy RIP?
  • No, unless m = Ω(k²) [Chandar'07]
• However, they can satisfy the following RIP-1 property [Berinde-Gilbert-Indyk-Karloff-Strauss'08]:
    x is k-sparse ⇒ d(1−ε) ||x||_1 ≤ ||Ax||_1 ≤ d ||x||_1
• Sufficient (and necessary) condition: the underlying graph is a (k, d(1−ε/2))-expander
Expanders
• A bipartite graph (n left nodes, m right nodes, left degree d) is a (k, d(1−ε))-expander if for any left set S, |S| ≤ k, we have |N(S)| ≥ (1−ε) d |S|
• Objects well-studied in theoretical computer science and coding theory
• Constructions:
  • Probabilistic: m = O(k log(n/k))
  • Explicit: m = k · quasi-polylog(n)
• High expansion implies RIP-1 [Berinde-Gilbert-Indyk-Karloff-Strauss'08]:
    x is k-sparse ⇒ d(1−ε) ||x||_1 ≤ ||Ax||_1 ≤ d ||x||_1
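An empirical look at RIP-1 for the probabilistic construction (assumed sizes; random d-regular columns expand only with high probability, so this is a spot check rather than a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, k = 2000, 200, 8, 10

# Adjacency matrix of a random bipartite graph with left degree d
A = np.zeros((m, n))
for col in range(n):
    A[rng.choice(m, d, replace=False), col] = 1.0

lo = np.inf
for _ in range(2000):
    x = np.zeros(n)
    x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
    lo = min(lo, np.linalg.norm(A @ x, 1) / np.linalg.norm(x, 1))

# RIP-1 predicts all ratios lie in [d(1-eps), d]; the upper bound d holds for
# every x by the triangle inequality, since each column has exactly d ones.
print(f"observed ratio range: [{lo:.2f}, {d}]")
```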
Proof: d(1−ε/2)-expansion ⇒ RIP-1
• Want to show that for any k-sparse x we have d(1−ε) ||x||_1 ≤ ||Ax||_1 ≤ d ||x||_1
• RHS inequality holds for any x (triangle inequality: each column has exactly d ones)
• LHS inequality:
  • W.l.o.g. assume |x_1| ≥ … ≥ |x_k| ≥ |x_{k+1}| = … = |x_n| = 0
  • Consider the edges e = (i,j) in lexicographic order
  • For each edge e = (i,j) define r(e) s.t.
    • r(e) = −1 if there exists an edge (i',j) < (i,j)
    • r(e) = +1 if there is no such edge
  • Claim 1: ||Ax||_1 ≥ ∑_{e=(i,j)} |x_i| r(e)
  • Claim 2: ∑_{e=(i,j)} |x_i| r(e) ≥ (1−ε) d ||x||_1
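A sketch of why the two claims hold and how they combine (my paraphrase of the argument from [Berinde-Gilbert-Indyk-Karloff-Strauss'08]):

```latex
% Claim 1: coordinates are sorted, so the first edge into a right node j carries
% the largest |x_i| among j's neighbors; by the triangle inequality,
% |(Ax)_j| >= |x_first| - sum_{later edges (i,j)} |x_i| = sum_{e=(i,j)} |x_i| r(e),
% and summing over j gives Claim 1.
%
% Claim 2: edges with r(e) = -1 are collisions. Expansion of S_i = {1,...,i}
% bounds the collisions among the first i columns by (eps/2) d i; since the |x_i|
% are non-increasing, a weighted (Abel-summation) count gives
\sum_{e:\,r(e)=-1} |x_i| \;\le\; \frac{\varepsilon}{2}\, d\, \|x\|_1,
\quad\text{hence}\quad
\sum_{e} |x_i|\, r(e) \;=\; d\|x\|_1 - 2\!\!\sum_{e:\,r(e)=-1}\!\!|x_i|
  \;\ge\; (1-\varepsilon)\, d\, \|x\|_1 .
```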
L1 minimization
• Algorithm:
  • minimize ||x*||_1
  • subject to Ax* = Ax
• Theoretical performance:
  • RIP-1 replaces RIP-2
  • L1/L1 guarantee
• Experimental performance:
  • [Plot: probability of exact recovery, k/m vs. m/n. Thick line: Gaussian matrices; thin lines: probability levels for sparse matrices (d=10)]
  • Same performance!
Matching Pursuit(s)
• Iterative algorithm: given current approximation x*:
  • Find (possibly several) i s.t. A_i "correlates" with Ax − Ax*. This yields i and z s.t. ||x* + z·e_i − x||_p << ||x* − x||_p
  • Update x*
  • Sparsify x* (keep only the k largest entries)
  • Repeat
• Norms:
  • p=2: CoSaMP, SP, IHT etc. (RIP)
  • p=1: SMP, SSMP (RIP-1)
  • p=0: LDPC bit flipping (sparse matrices)
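A compact member of the p=2 family, iterative hard thresholding (IHT), shown because it fits in a few lines (a sketch with assumed sizes and a conservatively chosen step size, not presented as the talk's preferred method):

```python
import numpy as np

def iht(A, y, k, iters=200):
    """Iterative hard thresholding: gradient step toward y = Ax, then keep top k."""
    mu = 1.0 / np.linalg.norm(A, 2) ** 2      # conservative step size for stability
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = x + mu * (A.T @ (y - A @ x))      # correlate residual with columns of A
        small = np.argsort(np.abs(x))[:-k]    # sparsify: zero all but k largest
        x[small] = 0.0
    return x

rng = np.random.default_rng(0)
n, m, k = 400, 120, 8
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
x_rec = iht(A, A @ x_true, k)
print("relative error:", np.linalg.norm(x_rec - x_true) / np.linalg.norm(x_true))
```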
Sequential Sparse Matching Pursuit [Berinde-Indyk'09]
• Algorithm:
  • x* = 0
  • Repeat T times:
    • Repeat S = O(k) times:
      • Find i and z that minimize* ||A(x* + z·e_i) − Ax||_1
      • x* = x* + z·e_i
    • Sparsify x* (set all but the k largest entries of x* to 0)
• Similar to SMP, but updates are done sequentially
* For a given i, the minimizing z is z = median[(Ax − Ax*)_{N(i)}]. Instead, one could first optimize i (by gradient) and then z [Fuchs'09]
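A direct, unoptimized translation of one inner update into numpy (my own sketch for a 0-1 matrix A; the exhaustive loop over i is for clarity and does not achieve the paper's stated running time):

```python
import numpy as np

def ssmp_step(A, x_star, y):
    """One inner SSMP update: pick (i, z) greedily to shrink ||A x* - y||_1."""
    r = y - A @ x_star                       # residual Ax - Ax*
    base = np.abs(r).sum()
    best = (0.0, None, 0.0)                  # (improvement, i, z)
    for i in range(A.shape[1]):
        N_i = A[:, i] > 0                    # neighbors N(i) of column i
        z = np.median(r[N_i])                # optimal z: median of residual on N(i)
        new = base - np.abs(r[N_i]).sum() + np.abs(r[N_i] - z).sum()
        if base - new > best[0]:
            best = (base - new, i, z)
    _, i, z = best
    if i is not None:
        x_star[i] += z
    return x_star
```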
Experiments
• [Recovery experiments on a 256×256 image]
• SSMP is run with S=10000, T=20. SMP is run for 100 iterations. Matrix sparsity is d=8.
Conclusions
• Sparse approximation using sparse matrices (this talk)
• State of the art: deterministically can do 2 out of 3:
  • Near-linear encoding/decoding
  • O(k log(n/k)) measurements
  • Approximation guarantee with respect to the L2/L1 norm
• Open problems:
  • 3 out of 3?
  • Explicit constructions?
References
• Survey: A. Gilbert, P. Indyk, "Sparse recovery using sparse matrices", Proceedings of the IEEE, June 2010.
• Courses:
  • "Streaming, sketching, and sub-linear space algorithms", Fall'07
  • "Sub-linear algorithms" (with Ronitt Rubinfeld), Fall'10
• Blogs:
  • Nuit Blanche: nuit-blanche.blogspot.com/