Fast Regression Algorithms Using Spectral Graph Theory Richard Peng
Outline • Regression: why and how • Spectra: fast solvers • Graphs: tree embeddings
Learning / Inference • Find a (hidden) pattern in (noisy) data • Input: signal s • Output: recovered pattern
Regression • p ≥ 1: convex • Convex constraints e.g. linear equalities minimize Mininimize: |x|pSubject to: constraints on x
Application 0: LASSO [Tibshirani `96]: min |x|_1 s.t. Ax = s • Widely used in practice: • Structured output • Robust to noise
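The |x|_1 objective with linear equality constraints can be posed as a linear program by introducing auxiliary variables t with |x_i| ≤ t_i. Below is a minimal SciPy sketch on a made-up random instance; the instance and variable names are illustrative, not from the talk.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative instance (made up): recover a sparse x from Ax = s.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
x_true = np.zeros(50)
x_true[[3, 17, 42]] = [1.0, -2.0, 0.5]
s = A @ x_true

n = A.shape[1]
# Variables [x, t]: minimize sum(t) with Ax = s and |x_i| <= t_i,
# encoded as the two linear constraints x - t <= 0 and -x - t <= 0.
c = np.concatenate([np.zeros(n), np.ones(n)])
A_eq = np.hstack([A, np.zeros((A.shape[0], n))])
A_ub = np.vstack([np.hstack([np.eye(n), -np.eye(n)]),
                  np.hstack([-np.eye(n), -np.eye(n)])])
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * n), A_eq=A_eq, b_eq=s,
              bounds=[(None, None)] * n + [(0, None)] * n)
x = res.x[:n]  # sparse solution; typically the three spikes of x_true reappear
```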
Application 1: Images • Min Σ_{i~j∈E} (x_i - x_j - s_ij)^2 • Poisson image processing • No bears were harmed in the making of these slides
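This quadratic objective leads to the normal equations (D^T D)x = D^T s, where D is the edge-vertex incidence matrix, and D^T D is exactly the graph Laplacian. A minimal 1-D sketch (path graph, unit weights; the instance is made up):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

# Minimal 1-D sketch of min sum_{i~j} (x_i - x_j - s_ij)^2 on a path graph.
n = 100
rows = np.repeat(np.arange(n - 1), 2)
cols = np.ravel(np.column_stack([np.arange(n - 1), np.arange(1, n)]))
vals = np.tile([1.0, -1.0], n - 1)
D = sp.csr_matrix((vals, (rows, cols)), shape=(n - 1, n))  # incidence matrix

rng = np.random.default_rng(1)
x_true = np.sin(np.linspace(0, 3, n))                 # hidden signal
s = D @ x_true + 0.01 * rng.standard_normal(n - 1)    # noisy differences

L = (D.T @ D).tocsr()   # this is the graph Laplacian of the path
b = D.T @ s
# L is singular (constant nullspace): pin x_0 = 0 to get a unique solution.
x = np.concatenate([[0.0], spsolve(L[1:, 1:], b[1:])])
```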
Application 2: Min cut • Min Σ_{ij∈E} |x_i - x_j| s.t. x_s = 0, x_t = 1 • Fractional solution = integral solution • Remove fewest edges to separate vertices s and t
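The |x_i - x_j| terms yield to the same auxiliary-variable trick as the LASSO sketch above. A minimal sketch on a made-up 4-vertex graph (instance and names are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

# Min cut on a toy unit-weight graph as an L1 problem.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
n, m, s, t = 4, len(edges), 0, 3
c = np.concatenate([np.zeros(n), np.ones(m)])          # minimize sum t_e
A_ub = np.zeros((2 * m, n + m))
for k, (i, j) in enumerate(edges):
    A_ub[2 * k, [i, j, n + k]] = [1, -1, -1]           #  x_i - x_j <= t_e
    A_ub[2 * k + 1, [i, j, n + k]] = [-1, 1, -1]       # -(x_i - x_j) <= t_e
A_eq = np.zeros((2, n + m)); A_eq[0, s] = 1; A_eq[1, t] = 1
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * m), A_eq=A_eq,
              b_eq=[0.0, 1.0], bounds=[(None, None)] * n + [(0, None)] * m)
print(res.fun)  # cut value: 2.0, i.e. removing edges (0,1) and (0,2)
```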
Regression Algorithms Convex optimization: • 1940~1960: simplex, tractable • 1960~1980: ellipsoid, poly time • 1980~2000: interior point, efficient • Õ(m^{1/2}) interior point steps, where m = # non-zeros and Õ hides log factors
Efficiency Matters • m > 10^6 for most images • Even bigger (10^9): • Videos • 3D medical data
Key Subroutine • Each of the Õ(m^{1/2}) steps of an interior point algorithm finds a step direction by solving a linear system Ax = b
More Reasons for Fast Solvers [Boyd-Vandenberghe `04], Figure 11.20: “The growth in the average number of Newton iterations (on randomly generated SDPs)… is very small”
Linear System Solvers • [1st century CE] Gaussian Elimination: O(m^3) • [Strassen `69] O(m^2.8) • [Coppersmith-Winograd `90] O(m^2.3755) • [Stothers `10] O(m^2.3737) • [Vassilevska Williams `11] O(m^2.3727) • Total: > m^2
Not fast, not used: • Preferred in practice: coordinate descent, subgradient methods • Solution quality traded for time
Fast Graph-Based L2 Regression [Spielman-Teng `04] (more in 12 slides) Input: linear system Ax = b where A is related to a graph Output: solution to Ax = b Runtime: nearly linear, Õ(m)
Graphs Using Algebra Fast convergence + low cost per step = state-of-the-art algorithms
Laplacian Paradigm • [Daitch-Spielman `08]: min-cost flow • [Christiano-Kelner-Mądry-Spielman-Teng `11]: approx maximum flow / min cut
Extension 1 [Chin-Mądry-Miller-P `12]: regression, image processing, grouped L2
Extension 2 [Kelner-Miller-P `12]: k-commodity flow • Dual: k-variate labeling of graphs
Extension 3 [Miller-P `13]: faster for structured images / separable graphs
Need: Fast Linear System Solvers Implications of fast solvers: • Fast regression routines • Parallel, work-efficient graph algorithms
Other Applications • [Tutte `66]: planar embedding • [Boman-Hendrickson-Vavasis `04]: PDEs • [Orecchia-Sachdeva-Vishnoi `12]: balanced cut / graph separator
Outline • Regression: why and how • Spectra: Linear system solvers • Graphs: tree embeddings
Problem Given: matrix A, vector b; solve Ax = b • Size of A: n-by-n, m non-zeros
Special Structure of A • A = Deg − Adj • Deg: diag(degree) • Adj: adjacency matrix • A_ij = deg(i) if i = j, −w(i,j) otherwise • [Gremban-Miller `96]: extensions to SDD matrices
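A minimal sketch of building A = Deg − Adj from a weighted edge list; the function name is mine, not from the talk.

```python
import numpy as np
import scipy.sparse as sp

def graph_laplacian(n, edges):
    """Build A = Deg - Adj from a weighted edge list [(u, v, w), ...]."""
    u, v, w = (np.array(t) for t in zip(*edges))
    adj = sp.csr_matrix((w, (u, v)), shape=(n, n))
    adj = adj + adj.T                                   # symmetric adjacency
    deg = sp.diags(np.asarray(adj.sum(axis=1)).ravel()) # diag(degree)
    return (deg - adj).tocsr()

# Triangle with unit weights: diagonal = degrees, off-diagonal = -w(i,j).
L = graph_laplacian(3, [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)])
```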
Unstructured Graphs • Social network • Intermediate systems of other algorithms are almost adversarial
Nearly Linear Time Solvers [Spielman-Teng `04] Input: n-by-n graph Laplacian A with m non-zeros, vector b where b = Ax for some x Output: approximate solution x' s.t. |x − x'|_A < ε|x|_A Runtime: nearly linear, O(m log^c n log(1/ε)) expected • Runtime is cost per bit of accuracy • Error in the A-norm: |y|_A = √(y^T A y)
How Many Logs? Runtime: O(m log^c n log(1/ε)) Value of c: I don’t know • [Spielman]: c ≤ 70 • [Miller]: c ≤ 32 • [Koutis]: c ≤ 15 • [Teng]: c ≤ 12 • [Orecchia]: c ≤ 6 When n = 10^6, log^6 n > 10^6
Practical Nearly Linear Time Solvers [Koutis-Miller-P `10] Input: n-by-n graph Laplacian A with m non-zeros, vector b where b = Ax for some x Output: approximate solution x' s.t. |x − x'|_A < ε|x|_A Runtime: O(m log^2 n log(1/ε)) • Runtime is cost per bit of accuracy • Error in the A-norm: |y|_A = √(y^T A y)
Practical Nearly Linear Time Solvers [Koutis-Miller-P `11] Input: n-by-n graph Laplacian A with m non-zeros, vector b where b = Ax for some x Output: approximate solution x' s.t. |x − x'|_A < ε|x|_A Runtime: O(m log n log(1/ε)) • Runtime is cost per bit of accuracy • Error in the A-norm: |y|_A = √(y^T A y)
Stages of The Solver • Iterative Methods • Spectral Sparsifiers • Low Stretch Spanning Trees
Iterative Methods Numerical analysis: Can solve systems in A by iteratively solving spectrally similar, but easier, B
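One concrete instance of this idea is preconditioned Richardson iteration: repeatedly correct x by B^{-1} applied to the residual. This is only a toy sketch (the actual solvers use accelerated variants and recursion); the diagonal preconditioner below is an assumption for illustration, not the B used in these solvers.

```python
import numpy as np

def preconditioned_richardson(A, solve_B, b, iters=100):
    """Solve Ax = b given a routine solve_B(r) ~ B^{-1} r for a
    spectrally similar B: each step corrects x by B^{-1}(b - Ax)."""
    x = np.zeros_like(b)
    for _ in range(iters):
        x = x + solve_B(b - A @ x)
    return x

# Toy check: precondition a diagonally dominant A by its diagonal B.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
d = np.diag(A)
x = preconditioned_richardson(A, lambda r: r / d, np.array([1.0, 2.0]))
```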
What is Spectrally Similar? A ≺ B ≺ kA for some small k • Ideas from scalars hold! • A ≺ B: for any vector x, |x|_A^2 ≤ |x|_B^2 • [Vaidya `91]: since the matrix A is a graph G, the preconditioner B should be a graph H too!
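To make the definition concrete: A ≺ B ≺ kA says the generalized eigenvalues of (B, A) lie in [1, k]. A dense-matrix sketch for checking this numerically; it assumes both Laplacians are connected (shared all-ones nullspace), and the function name is mine.

```python
import numpy as np

def similarity_bounds(A, B, tol=1e-8):
    """Return (lo, hi) with lo*A ≺ B ≺ hi*A, from the generalized
    eigenvalues of (B, A) restricted to A's range."""
    lam = np.linalg.eigvals(np.linalg.pinv(A) @ B).real
    lam = np.sort(lam[lam > tol])   # drop the shared nullspace
    return lam[0], lam[-1]

# Triangle G vs its spanning path H: H ≺ G ≺ 3H.
G = np.array([[2., -1, -1], [-1, 2, -1], [-1, -1, 2]])
H = np.array([[1., -1, 0], [-1, 2, -1], [0, -1, 1]])
print(similarity_bounds(H, G))  # ~(1.0, 3.0)
```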
‘Easier’ H • Two ways to be easier: fewer vertices, fewer edges • Can reduce vertex count if edge count is small • Goal: H with fewer edges that’s similar to G
Graph Sparsifiers Sparse equivalents of graphs that preserve something • Spanners: distance, diameter. • Cut sparsifier: all cuts. • What we need: spectrum
What We Need: Ultrasparsifiers [Spielman-Teng `04]: ultrasparsifiers with n − 1 + O(m log^p n / k) edges imply solvers with O(m log^p n) running time • Given: G with n vertices, m edges, parameter k • Output: H with n vertices, n − 1 + O(m log^p n / k) edges • Goal: G ≺ H ≺ kG
Example: Complete Graph O(n log n) random edges (with scaling) suffice w.h.p.
General Graph Sampling Mechanism • For edge e, flip a coin: Pr(keep) = P(e) • Rescale kept edges to maintain expectation (see the sketch below) • Number of edges kept: ∑_e P(e) • Also need to prove concentration
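A minimal sketch of the mechanism (function name is mine). Concentration is the part that needs proof; the mechanics are just this:

```python
import numpy as np

def sample_edges(edges, prob, seed=0):
    """Keep edge e with probability prob[e] and rescale its weight by
    1/prob[e], so the sampled graph equals the original in expectation."""
    rng = np.random.default_rng(seed)
    kept = []
    for (u, v, w), p in zip(edges, prob):
        p = min(p, 1.0)
        if rng.random() < p:
            kept.append((u, v, w / p))
    return kept
```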
Effective Resistance • View the graph as an electrical circuit • R(u,v): pass 1 unit of current from u to v, measure the resistance of the circuit
EE101 • Effective resistance in general: solve Gx = e_uv, where e_uv is the indicator vector with +1 at u and −1 at v; then R(u,v) = x_u − x_v
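A dense sketch of exactly this recipe, with the pseudoinverse standing in where a fast Laplacian solver would go; the example instance is made up.

```python
import numpy as np

def effective_resistance(L, u, v):
    """Solve L x = e_uv (with +1 at u, -1 at v) and return x_u - x_v."""
    e = np.zeros(L.shape[0])
    e[u], e[v] = 1.0, -1.0
    x = np.linalg.pinv(L) @ e   # stand-in for a nearly-linear-time solve
    return x[u] - x[v]

# Unit-weight path u - w - v: R(u, v) = 1 + 1 = 2 (series rule, next slide).
L = np.array([[1., -1, 0], [-1, 2, -1], [0, -1, 1]])
print(effective_resistance(L, 0, 2))  # ~2.0
```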
(Remedial?) EE101 • Single edge of weight w1: R(u,v) = 1/w1 • Two edges in series, weights w1 and w2: R(u,v) = 1/w1 + 1/w2 • In general: single edge R(e) = 1/w(e); series R(u,v) = R(e1) + … + R(el)
Spectral Sparsification by Effective Resistance [Spielman-Srivastava `08]: setting P(e) to w(e)·R(e)·O(log n), where R(e) is the effective resistance between e’s endpoints, gives G ≺ H ≺ 2G* • [Foster `49]: ∑_e w(e)R(e) = n − 1 • So: spectral sparsifier with O(n log n) edges • Ultrasparsifier? Solver??? • *Ignoring probabilistic issues
The Chicken and Egg Problem How to find effective resistance? • [Spielman-Srivastava `08]: use solver • [Spielman-Teng `04]: need sparsifier
Our Workaround • Use upper bounds on effective resistance, R’(u,v) • Modify the problem
Rayleigh’s Monotonicity Law • R(u,v) can only increase when edges are removed • So: calculate effective resistance w.r.t. a spanning tree T, giving upper bounds R’(u,v)
Sampling Probabilities According to Tree • Sampling probability: edge weight times effective resistance of the tree path between its endpoints; this quantity is the stretch of the edge (a sketch of the computation follows) • Goal: small total stretch
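A small sketch of computing one edge’s stretch against a rooted spanning tree; the parent-array representation and names are mine. Walk both endpoints toward the root, accumulating tree-edge resistances 1/w, and meet at the lowest common ancestor.

```python
def tree_stretch(parent, pweight, u, v, w):
    """stretch(e) = w(e) * R_T(u, v), the tree-path resistance between
    u and v. parent[x]/pweight[x] give x's tree parent and the weight
    of that tree edge (the root has parent -1)."""
    def path_to_root(x):
        dist = {x: 0.0}
        while parent[x] != -1:
            dist[parent[x]] = dist[x] + 1.0 / pweight[x]
            x = parent[x]
        return dist
    du, dv = path_to_root(u), path_to_root(v)
    # The lowest common ancestor minimizes the through-vertex distance.
    return w * min(du[a] + dv[a] for a in du if a in dv)

# Unit-weight path 0-1-2 as the tree; off-tree edge (0, 2) has stretch 2.
print(tree_stretch([-1, 0, 1], [0.0, 1.0, 1.0], 0, 2, 1.0))  # 2.0
```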
Good Trees Exist (more in 12 slides, again!) Every graph has a spanning tree with total stretch O(m log n)* • *Hiding loglog n factors • So ∑_e w(e)R’(e) = O(m log n), giving O(m log^2 n) sampled edges. Too many!
‘Good’ Tree??? • Unit weight case: stretch ≥ 1 for all edges • Example: an off-tree edge whose tree path has 2 hops already has stretch = 1 + 1 = 2
What Are We Missing? We haven’t used k! • Need: G ≺ H ≺ kG with n − 1 + O(m log^p n / k) edges • Generated: G ≺ H ≺ 2G with n − 1 + O(m log^2 n) edges
Use k, Somehow • Increase weights of tree edges by a factor of k, giving G’ • Then G ≺ G’ ≺ kG • And the tree is good!
Result • Tree is heavier by a factor of k, so tree effective resistances decrease by a factor of k • In the example above: stretch = 1/k + 1/k = 2/k