Mining Discrete Patterns via Binary Matrix Factorization Jieping Ye Arizona State University Joint work with Baohong Shen and Shuiwang Ji
Rank-One Binary Matrix Factorization
Uses: compression, clustering, pattern discovery.
[Figure: a samples-by-features binary matrix approximated by a dominant pattern and an indicator vector.]
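To make the quantity being minimized concrete, here is a minimal sketch that counts how many entries of a binary matrix disagree with a rank-one approximation. The names `rank_one_error`, `A`, `x`, and `y` are illustrative assumptions, as is the convention that `x` is the indicator vector over samples and `y` the dominant pattern over features. For 0/1 data this mismatch count equals the squared Frobenius norm of the residual.

```python
def rank_one_error(A, x, y):
    """Count the entries where A differs from the outer product x y^T."""
    return sum(
        A[i][j] != x[i] * y[j]
        for i in range(len(A))
        for j in range(len(A[0]))
    )


# Toy data (hypothetical): four samples, four features.
A = [
    [1, 0, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
]
y = [1, 0, 1, 1]  # candidate dominant pattern
x = [1, 1, 0, 1]  # which samples are assigned the pattern

error = rank_one_error(A, x, y)
print(error)  # 2: one mismatch in row 2, one in row 3
```

Rows with `x[i] = 0` are approximated by all zeros, so every 1 in such a row counts as an error.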
Application I: Image Compression
[Figure: image data represented as a binary matrix.]
Application II: Hierarchy Construction
[Figure: an example tree for 45 images from stage range 4-6, built by our algorithm.]
Application III: Pattern Discovery
[Figure: an example tree for 45 images from stage range 4-6, built by our algorithm.]
Reference: M. Koyuturk, A. Grama, and N. Ramakrishnan. Compression, clustering and pattern discovery in very high dimensional discrete-attribute datasets. IEEE TKDE, 2005.
Binary Rank-One Approximation: Challenges
• Can we compute an approximate solution with a guaranteed error bound?
• Can we compute it efficiently?
• The problem is conjectured to be NP-hard.
• The existing approach is based on iterative updating:
  • Koyutürk, M. & Grama, A. PROXIMUS: A framework for analyzing very high dimensional discrete-attributed datasets. KDD'03.
  • It is a heuristic, without known guarantees on approximation errors.
  • It very often results in undesirable rank-one approximations.
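The iterative updating idea can be sketched as alternating between the two vectors: with one fixed, each entry of the other is set to whichever value gives fewer mismatches. This is a minimal illustration under assumed names (`update_indicator`, `alternating_rank_one`, `mismatches`), not the PROXIMUS implementation. It also shows why the outcome depends on the starting pattern.

```python
def update_indicator(A, y):
    """With y fixed, set x_i = 1 iff row i overlaps more than half of y's ones."""
    ones = sum(y)
    return [1 if 2 * sum(a * b for a, b in zip(row, y)) > ones else 0 for row in A]


def mismatches(A, x, y):
    """Number of entries where A differs from the outer product x y^T."""
    return sum(A[i][j] != x[i] * y[j] for i in range(len(A)) for j in range(len(y)))


def alternating_rank_one(A, y0, max_iter=100):
    """Alternate row and column updates until neither vector changes."""
    At = [list(col) for col in zip(*A)]  # transpose, so columns can be updated as rows
    y = list(y0)
    x = update_indicator(A, y)
    for _ in range(max_iter):
        new_y = update_indicator(At, x)
        new_x = update_indicator(A, new_y)
        if new_x == x and new_y == y:
            break
        x, y = new_x, new_y
    return x, y


A = [
    [1, 0, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
]
good_x, good_y = alternating_rank_one(A, A[0])  # start from the first row
poor_x, poor_y = alternating_rank_one(A, A[2])  # start from the third row
good_error = mismatches(A, good_x, good_y)
poor_error = mismatches(A, poor_x, poor_y)
```

On this toy matrix the first start converges to a fit with 2 mismatches, while the second stalls at a fixed point with 8, which is the kind of undesirable approximation a heuristic without guarantees can return.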
Equivalent Reformulation Maximum Weight Problem (MWP):
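A short derivation shows why rank-one approximation of a binary matrix can be restated as a weight maximization. This is a sketch under assumed notation: A is the m-by-n binary matrix, x and y are binary vectors, and U is defined here as 2A minus the all-ones matrix. The slide's own symbols may differ.

```latex
\begin{aligned}
\|A - xy^{T}\|_F^{2}
  &= \sum_{i,j} \left(A_{ij} - x_i y_j\right)^{2} \\
  &= \sum_{i,j} A_{ij} \;-\; 2\sum_{i,j} A_{ij}\, x_i y_j \;+\; \sum_{i,j} x_i y_j
     && \text{(since } A_{ij}^{2} = A_{ij},\ x_i^{2} = x_i,\ y_j^{2} = y_j\text{)} \\
  &= \|A\|_F^{2} \;-\; \sum_{i,j} \left(2A_{ij} - 1\right) x_i y_j \\
  &= \|A\|_F^{2} \;-\; x^{T} U y,
  \qquad U = 2A - \mathbf{1}\mathbf{1}^{T} \in \{-1, +1\}^{m \times n}.
\end{aligned}
```

Because the first term does not depend on x or y, minimizing the approximation error is the same as maximizing x^T U y over binary x and y: each selected entry contributes +1 where A has a 1 and -1 where A has a 0.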
Our Main Contributions
• An exact formulation for MWP, using integer linear programming.
• A formulation for an error-bounded approximation, using integer linear programming.
• The proof of an error bound.
• Efficient algorithms to solve the error-bounded approximation.
Overview
• This is the first polynomial time algorithm that computes an approximate solution with a guaranteed error bound.
• This is the first work that explicitly connects binary matrix factorization and minimum s-t cut.
[Diagram: Binary Rank-one Matrix Approximation → (reformulation) → Maximum Weight Problem (MWP) → (reformulation) → Integer Linear Programming (ILP1) → (error-bounded approximation) → Integer Linear Programming (ILP2) → (LP relaxation) → Linear Programming Relaxation of ILP2 → (reformulation) → minimum s-t cut problem]
Formulation for Exact Solutions
[The notation, the integer linear programming formulation, and the original formulation appear as equations on the slide.]
The two formulations are equivalent:
• If x1_i = x2_j = 1, then z_{i,j} ≤ 1.
  • U_{i,j} > 0 ⇒ z_{i,j} = 1.
• If one of x1_i and x2_j is 0, then z_{i,j} ≤ 0.5.
  • z_{i,j} is an integer ⇒ z_{i,j} = 0.
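The case analysis above can be checked by enumeration. The sketch below assumes the linking constraint has the form z_{i,j} ≤ (x1_i + x2_j) / 2, which is inferred from the bounds 1 and 0.5 in the bullets rather than copied from the slide. The function name `best_integer_z` is hypothetical.

```python
from itertools import product


def best_integer_z(x1_i, x2_j):
    """Largest integer z with 0 <= z <= (x1_i + x2_j) / 2 (assumed constraint form)."""
    return max(z for z in (0, 1) if 2 * z <= x1_i + x2_j)


# For a positive weight U_{i,j}, maximization pushes z to its largest feasible
# integer value, which should coincide with the product x1_i * x2_j.
linearization_ok = all(
    best_integer_z(a, b) == a * b for a, b in product((0, 1), repeat=2)
)
```

Under that assumption, the integer variable z_{i,j} behaves exactly like the product of the two binary variables, which is what lets a bilinear objective be written as a linear one.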
Formulation for Approximate Solutions II Proposition: The objective value of ILP2 is no less than that of ILP1 for the same problem instance.
Approximation Error
ILP2 achieves an error-bounded approximation.
[Formula labels: approximate objective, approximate bound, optimal objective.]
Linear Programming Relaxation of ILP2
• Proposition: The coefficient matrix of the constraints in ILP2 is totally unimodular.
  • I. Heller and C. B. Tompkins. An extension of a theorem of Dantzig's. Ann. of Math. Stud., no. 38, pages 247-254, 1956.
• We can obtain an exact solution of ILP2 by solving its LP relaxation.
• LP is still computationally expensive for a large matrix A.
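Total unimodularity means every square submatrix has determinant -1, 0, or 1, which is why the LP relaxation has integral optimal vertices. The constraint matrix of ILP2 is not reproduced here, so the sketch below uses a stand-in: the edge-vertex incidence matrix of a small bipartite graph, a classic totally unimodular example, contrasted with a triangle, which is not. The brute-force check is only practical for tiny matrices, and all names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def determinant(M):
    """Exact determinant by Gaussian elimination over the rationals."""
    M = [[Fraction(v) for v in row] for row in M]
    n, det = len(M), Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if M[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            M[c], M[pivot] = M[pivot], M[c]
            det = -det
        det *= M[c][c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            M[r] = [a - factor * b for a, b in zip(M[r], M[c])]
    return det


def is_totally_unimodular(M):
    """Check every square submatrix; exponential cost, so small inputs only."""
    rows, cols = len(M), len(M[0])
    for k in range(1, min(rows, cols) + 1):
        for rs in combinations(range(rows), k):
            for cs in combinations(range(cols), k):
                sub = [[M[r][c] for c in cs] for r in rs]
                if determinant(sub) not in (-1, 0, 1):
                    return False
    return True


# Rows are edges, columns are vertices u1, u2, v1, v2 of a complete bipartite graph.
bipartite_incidence = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
# Incidence matrix of a triangle (an odd cycle); its determinant is 2.
triangle_incidence = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
]
bipartite_is_tu = is_totally_unimodular(bipartite_incidence)
triangle_is_tu = is_totally_unimodular(triangle_incidence)
```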
Overview
[Diagram: Binary Rank-one Matrix Approximation → (reformulation) → Maximum Weight Problem (MWP) → (reformulation) → Integer Linear Programming (ILP1) → (error-bounded approximation) → Integer Linear Programming (ILP2) → (LP relaxation) → Linear Programming Relaxation of ILP2 → (reformulation) → minimum s-t cut problem]
Generalized Independent Set Problem (GIS)
Given:
• An undirected graph G = (V, E).
• A nonnegative weight w(v) for each vertex v in V.
• A nonnegative penalty p(e) for each edge e in E.
GIS Problem: find a vertex subset S in V
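The objective is cut off in the text above. The sketch below assumes the usual GIS objective: maximize the total weight of the chosen vertices minus the penalties of edges whose two endpoints are both chosen. It solves a tiny instance by enumeration, and the function names and the example graph are hypothetical.

```python
from itertools import combinations


def gis_value(S, weight, penalty):
    """Total weight of S minus penalties of edges with both endpoints in S."""
    chosen = set(S)
    gain = sum(weight[v] for v in chosen)
    loss = sum(p for (u, v), p in penalty.items() if u in chosen and v in chosen)
    return gain - loss


def gis_brute_force(weight, penalty):
    """Try every nonempty vertex subset; feasible only for very small graphs."""
    vertices = sorted(weight)
    best_value, best_set = 0, ()
    for k in range(1, len(vertices) + 1):
        for S in combinations(vertices, k):
            value = gis_value(S, weight, penalty)
            if value > best_value:
                best_value, best_set = value, S
    return best_value, set(best_set)


weight = {"a": 3, "b": 2, "c": 4}
penalty = {("a", "c"): 5, ("b", "c"): 1}
best_value, best_set = gis_brute_force(weight, penalty)
print(best_value, sorted(best_set))  # 5 ['a', 'b']
```

Unlike the ordinary independent set problem, adjacent vertices may both be chosen when their combined weight outweighs the edge penalty: here {b, c} also scores 5.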
Transform ILP2 into a GIS Problem • ILP2 defines an instance of GIS, and the corresponding graph is bipartite.
Efficient Approximation
• GIS is NP-hard for general graphs. However, it can be solved in polynomial time for bipartite graphs.
• GIS for bipartite graphs can be solved by solving minimum s-t cuts / maximum flows.
Reference: Hochbaum, D. S. & Pathria, A. Forest harvesting and minimum cuts: a new approach to handling spatial constraints. Forest Science, 1997, 43, 544-554.
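One way to see the connection to minimum cuts is the following sketch, which assumes a standard construction rather than the exact network used in the talk. The source feeds each left vertex with capacity equal to its weight, each right vertex drains to the sink with capacity equal to its weight, and each edge carries its penalty from left to right. A cut then pays for every unselected vertex and every penalized pair, so the GIS optimum is the total weight minus the minimum cut. The pure-Python Edmonds-Karp routine and all names are illustrative.

```python
from collections import deque


def max_flow(capacity, s, t):
    """Edmonds-Karp on a dict-of-dicts capacity map; returns the max-flow value."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)  # reverse arcs start at 0
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:  # breadth-first search for a shortest path
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck


def bipartite_gis_via_min_cut(left_w, right_w, penalty):
    """GIS optimum on a bipartite graph as total weight minus the minimum s-t cut."""
    capacity = {"s": {}}
    for u, w in left_w.items():
        capacity["s"][("L", u)] = w
    for v, w in right_w.items():
        capacity.setdefault(("R", v), {})["t"] = w
    for (u, v), p in penalty.items():
        capacity.setdefault(("L", u), {})[("R", v)] = p
    total = sum(left_w.values()) + sum(right_w.values())
    return total - max_flow(capacity, "s", "t")


gis_optimum = bipartite_gis_via_min_cut(
    left_w={"a": 3, "b": 2},
    right_w={"c": 4},
    penalty={("a", "c"): 5, ("b", "c"): 1},
)
print(gis_optimum)  # 5
```

By max-flow/min-cut duality the flow value equals the cut value, so any maximum-flow routine can replace the one sketched here.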
Experimental Evaluation: Error Bound
We present the results obtained by the minimum s-t cut (P1), the improvement from iterative updating (P2), and the theoretical upper bounds.
Experimental Evaluation: Running Time One dimension is fixed at 1000.
Conclusion
[Diagram: Binary Rank-one Matrix Approximation → (reformulation) → Maximum Weight Problem (MWP) → (reformulation) → Integer Linear Programming (ILP1) → (error-bounded approximation) → Integer Linear Programming (ILP2) → (LP relaxation) → Linear Programming Relaxation of ILP2 → (reformulation) → minimum s-t cut problem]