130 likes | 225 Views
Algebraic Techniques for Analysis of Large Discrete-Valued Datasets. Mehmet Koyuturk 1 , Ananth Grama 1 , and Naren Ramakrishnan 2 Dept. of Computer Sciences, Purdue University {koyuturk, ayg} @cs.purdue.edu 2. Dept. of Computer Sciences, Virginia Tech naren@cs.vt.edu. Motivation.
E N D
Algebraic Techniques for Analysis of Large Discrete-Valued Datasets Mehmet Koyuturk1, Ananth Grama1, and Naren Ramakrishnan2 Dept. of Computer Sciences, Purdue University {koyuturk, ayg} @cs.purdue.edu 2. Dept. of Computer Sciences, Virginia Tech naren@cs.vt.edu
Motivation • Handling large discrete-valued datasets • Extracting relations between data items • Summarizing data in an error-bounded fashion • Clustering of data items • Finding coinsize representations for clustered data
Background • Singular Value Decomposition (SVD) [Berry et.al., 1995] • Decompose matrix into A=USVT • U and V orthogonal matrices, Sdiagonal with singular values • Used for Latent Semantic Indexing in Information Retrieval • Truncate decomposition to compress data
Background • Semi-Discrete Decomposition (SDD) [Kolda and O’Leary, 1998] • Restrict entries of U and V to {-1,0,1} • Requires very small amount of storage • Can perform as well as SVD in LSI using less than one-tenth the storage • Effective in finding outlier clusters • works well for datasets containing a large number of small clusters
Rank-1 Approximation x : presence vector y : pattern vector
Problem:Given discrete matrix Amxn , find discrete vectors xmx1 and ynx1 to Minimize = number of non-zeros in the error matrix solve for x to Maximize Heuristic: Fix y, set Rank-1 Approximation Iteratively solve for x and y until no improvement possible
Initialization of pattern vector • Crucial to escape from local optima • Must require at most (nz(A)) time, not to • Some possible schemes • AllOnes: Set all entries to 1, poor. • Threshold: Set only the entries that have corresponding columns with # of non-zeros more than a threshold. Can lead to bad local optima. • Maximum: Set only the entry that corresponds to the column with max. # of non-zeros. Risky, that column may be shared by lots of patterns. • Partition: Partition the rows of matrix based on a column, than apply threshold scheme taking into account only one of the parts. Best among these.
- At any step, given rank-one approximation AxyT, split A to A1and A0 based on rows: - if x(i)=0 row i goes to A0 - Stop when • hamming radius of A1, maximum of the hamming distances • of A1pattern vector, is less then some threshold • all rows of A are present in A1 • (if A1does not satisfy hamming radius condition, can split A1 • based on hamming distances) Recursive Algorithm - if x(i)=1 row i goes to A1
runtime vs # columns runtime vs # rows runtime vs # nonzeros Run-time Scalability • Rank-1 approximation requires O(nz(A)) time • Total run-time at each level in the recursive tree cannot exceed • this since total # of nonzeros at each level is at most nz(A) • Run-time is linear in nz(A)
Conclusions and Ongoing Work • Proposed algorithm is • Scalable to exteremely high-dimensions • Effective in discovering dominant patterns • Hierarchical in nature, allowing multi-resolution analysis • Currently working on • Real-world applications of proposed method • Effective initialization schemes • Parallel implementation