
Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Mehmet Koyuturk (1), Ananth Grama (1), and Naren Ramakrishnan (2). (1) Dept. of Computer Sciences, Purdue University, {koyuturk, ayg}@cs.purdue.edu. (2) Dept. of Computer Sciences, Virginia Tech, naren@cs.vt.edu.


Presentation Transcript


  1. Algebraic Techniques for Analysis of Large Discrete-Valued Datasets • Mehmet Koyuturk (1), Ananth Grama (1), and Naren Ramakrishnan (2) • (1) Dept. of Computer Sciences, Purdue University, {koyuturk, ayg}@cs.purdue.edu • (2) Dept. of Computer Sciences, Virginia Tech, naren@cs.vt.edu

  2. Motivation • Handling large discrete-valued datasets • Extracting relations between data items • Summarizing data in an error-bounded fashion • Clustering of data items • Finding concise representations for clustered data

  3. Background • Singular Value Decomposition (SVD) [Berry et al., 1995] • Decompose the matrix into A = U S V^T • U and V are orthogonal matrices; S is diagonal and holds the singular values • Used for Latent Semantic Indexing (LSI) in Information Retrieval • Truncate the decomposition to compress data
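
The truncation step can be illustrated with NumPy. This is a minimal sketch, not the setup used in the cited work: the matrix `A`, the rank `k`, and the helper name `truncated_svd` are all made up for illustration.

```python
import numpy as np

def truncated_svd(A, k):
    """Keep only the k largest singular triplets of A (hypothetical helper)."""
    # full_matrices=False returns the thin factorization: U is m x r, Vt is r x n
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Illustrative 5 x 4 binary matrix; the last row is the sum of rows 0 and 2,
# so the matrix has rank 2.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1],
              [1, 1, 1, 1]], dtype=float)

U, s, Vt = truncated_svd(A, 2)
A2 = (U * s) @ Vt              # rank-2 reconstruction U S V^T

m, n = A.shape
stored = 2 * (m + n + 1)       # numbers kept by the rank-2 factors
print(np.allclose(A2, A), stored, "values instead of", m * n)
```

Note that the factors are dense floating-point arrays even when `A` is binary, which is part of what motivates discrete alternatives.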

  4. Background • Semi-Discrete Decomposition (SDD) [Kolda and O'Leary, 1998] • Restrict entries of U and V to {-1, 0, 1} • Requires a very small amount of storage • Can perform as well as SVD in LSI using less than one-tenth the storage • Effective in finding outlier clusters • Works well for datasets containing a large number of small clusters

  5. Rank-1 Approximation • x: presence vector • y: pattern vector

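  6. Rank-1 Approximation • Problem: Given a discrete matrix A (m x n), find discrete vectors x (m x 1) and y (n x 1) to minimize ||A - x y^T||_F^2 = number of non-zeros in the error matrix • Heuristic: Fix y, set [expression missing from the transcript], and solve for x to maximize [expression missing from the transcript] • Iteratively solve for x and y until no improvement is possible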
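
The alternating heuristic can be sketched in Python. The update rule here is an assumption, since the transcript does not preserve the slide's formulas: for binary A, x, and y, expanding ||A - x y^T||_F^2 = ||A||_F^2 - 2 x^T A y + ||x||^2 ||y||^2 shows that minimizing the error is the same as maximizing 2 x^T A y - ||x||^2 ||y||^2, so with y fixed and s = A y it is optimal to set x(i) = 1 exactly when 2 s(i) >= ||y||^2 (and symmetrically for y). Rows are stored as sets of column indices, and all function names are hypothetical.

```python
def solve_x(rows, y):
    # With y fixed, s(i) = |row_i & y|; set x(i) = 1 when 2*s(i) >= |y|.
    return {i for i, row in enumerate(rows) if 2 * len(row & y) >= len(y)}

def solve_y(rows, x):
    # With x fixed, count how many selected rows contain each column.
    counts = {}
    for i in x:
        for j in rows[i]:
            counts[j] = counts.get(j, 0) + 1
    return {j for j, c in counts.items() if 2 * c >= len(x)}

def objective(rows, x, y):
    # 2 x^T A y - ||x||^2 ||y||^2 for binary vectors stored as index sets.
    return 2 * sum(len(rows[i] & y) for i in x) - len(x) * len(y)

def rank_one(rows, y0):
    """Alternate between x and y until the objective stops improving."""
    y = set(y0)
    x = solve_x(rows, y)
    best = objective(rows, x, y)
    while True:
        y_new = solve_y(rows, x)
        x_new = solve_x(rows, y_new)
        value = objective(rows, x_new, y_new)
        if value <= best:          # no improvement possible: stop
            return x, y
        x, y, best = x_new, y_new, value

# Illustrative data: three rows share a pattern on columns 0-2, two on columns 3-4.
rows = [{0, 1, 2}, {0, 1, 2}, {0, 1}, {3, 4}, {3, 4}]
x, y = rank_one(rows, {0})
nonzeros = sum(len(r) for r in rows)
error = nonzeros - objective(rows, x, y)   # non-zeros in A - x y^T
print(sorted(x), sorted(y), error)         # [0, 1, 2] [0, 1, 2] 5
```

Each pass touches every non-zero a constant number of times, and the integer objective strictly increases between accepted passes, so the loop terminates at a local optimum that depends on the starting vector `y0`.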

  7. Initialization of pattern vector • Crucial to escape from local optima • Must require at most O(nz(A)) time, not to ... • Some possible schemes • AllOnes: Set all entries to 1. Poor. • Threshold: Set only the entries whose corresponding columns have more non-zeros than a threshold. Can lead to bad local optima. • Maximum: Set only the entry that corresponds to the column with the maximum # of non-zeros. Risky: that column may be shared by many patterns. • Partition: Partition the rows of the matrix based on a column, then apply the threshold scheme taking into account only one of the parts. Best among these.
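
The four schemes can be sketched as follows. The slide leaves several details open, so the following are assumptions rather than the authors' choices: the threshold is an absolute non-zero count, the Partition scheme keeps the rows that contain the splitting column, and the splitting column is passed in by the caller. Rows are sets of column indices and all names are hypothetical.

```python
def column_counts(rows):
    # One pass over the non-zeros: O(nz(A)) time.
    counts = {}
    for row in rows:
        for j in row:
            counts[j] = counts.get(j, 0) + 1
    return counts

def init_all_ones(n):
    # AllOnes: every entry of the pattern vector is 1.
    return set(range(n))

def init_threshold(rows, min_count):
    # Threshold: columns with more than min_count non-zeros.
    return {j for j, c in column_counts(rows).items() if c > min_count}

def init_maximum(rows):
    # Maximum: the single column with the most non-zeros (ties broken arbitrarily).
    # Assumes the matrix has at least one non-zero.
    counts = column_counts(rows)
    return {max(counts, key=counts.get)}

def init_partition(rows, split_col, min_count):
    # Partition: keep only rows containing split_col (assumed choice of part),
    # then apply the threshold scheme to that part.
    part = [row for row in rows if split_col in row]
    return init_threshold(part, min_count)

rows = [{0, 1, 2}, {0, 1, 2}, {0, 1}, {3, 4}, {3, 4}]
print(sorted(init_threshold(rows, 2)))     # [0, 1]
print(sorted(init_partition(rows, 3, 1)))  # [3, 4]
```

In this toy matrix the global threshold only sees the larger pattern, while partitioning on column 3 first exposes the smaller one, which is consistent with the slide's ranking of the schemes.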

  8. Recursive Algorithm - At any step, given the rank-one approximation A ≈ x y^T, split A into A1 and A0 based on rows: - if x(i)=1, row i goes to A1 - if x(i)=0, row i goes to A0 - Stop when • the Hamming radius of A1, i.e. the maximum of the Hamming distances of the rows of A1 to the pattern vector, is less than some threshold • all rows of A are present in A1 • (if A1 does not satisfy the Hamming radius condition, A1 can be split based on Hamming distances)
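
The recursion can be sketched end to end in Python. Several pieces are assumptions, not taken from the slides: the rank-one step uses the update x(i) = 1 when 2 s(i) >= ||y||^2 with s = A y, the starting pattern is the most frequent column, recursion stops only when both stop conditions hold, and the Hamming-distance split separates the farthest rows from the rest. Rows are sets of column indices and all names are hypothetical.

```python
def rank_one(rows, y):
    """Alternating heuristic for a binary rank-one approximation (assumed update rule)."""
    def solve_x(y):
        return {i for i, r in enumerate(rows) if 2 * len(r & y) >= len(y)}
    def solve_y(x):
        counts = {}
        for i in x:
            for j in rows[i]:
                counts[j] = counts.get(j, 0) + 1
        return {j for j, c in counts.items() if 2 * c >= len(x)}
    def score(x, y):
        return 2 * sum(len(rows[i] & y) for i in x) - len(x) * len(y)
    x = solve_x(y)
    best = score(x, y)
    while True:
        y_new = solve_y(x)
        x_new = solve_x(y_new)
        value = score(x_new, y_new)
        if value <= best:
            return x, y
        x, y, best = x_new, y_new, value

def most_frequent_column(rows):
    counts = {}
    for r in rows:
        for j in r:
            counts[j] = counts.get(j, 0) + 1
    return max(counts, key=counts.get) if counts else None

def decompose(rows, threshold, out=None):
    """Return a list of (pattern, member_rows) pairs; threshold must be >= 1."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    out = [] if out is None else out
    if not rows:
        return out
    start = most_frequent_column(rows)
    if start is None:                       # only all-zero rows remain
        out.append((set(), rows))
        return out
    x, y = rank_one(rows, {start})
    a1 = [r for i, r in enumerate(rows) if i in x]       # x(i) = 1
    a0 = [r for i, r in enumerate(rows) if i not in x]   # x(i) = 0
    if a0:                                  # not all rows present in A1: recurse
        decompose(a1, threshold, out)
        decompose(a0, threshold, out)
        return out
    dists = [len(r ^ y) for r in rows]      # Hamming distance to the pattern
    if max(dists) < threshold:              # Hamming radius small enough: stop
        out.append((y, rows))
        return out
    far_d = max(dists)                      # assumed split rule: farthest rows vs rest
    close = [r for r, d in zip(rows, dists) if d < far_d]
    far = [r for r, d in zip(rows, dists) if d == far_d]
    if not close:                           # all rows equally far: peel one off
        close, far = far[:1], far[1:]
    decompose(close, threshold, out)
    decompose(far, threshold, out)
    return out

rows = [{0, 1, 2}, {0, 1, 2}, {0, 1}, {3, 4}, {3, 4}]
result = decompose(rows, threshold=2)
for pattern, members in result:
    print(sorted(pattern), len(members))
```

On this toy input the sketch reports two patterns, columns 0 to 2 covering three rows and columns 3 to 4 covering two rows, with the third row tolerated at Hamming distance 1 from its pattern.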

  9. Recursive Algorithm

  10. Effectiveness of Analysis

  11. Effectiveness of Analysis

  12. Run-time Scalability • Plots: run-time vs. # columns, run-time vs. # rows, run-time vs. # nonzeros • Rank-1 approximation requires O(nz(A)) time • Total run-time at each level in the recursive tree cannot exceed this, since the total # of nonzeros at each level is at most nz(A) • Run-time is linear in nz(A)

  13. Conclusions and Ongoing Work • Proposed algorithm is • Scalable to extremely high dimensions • Effective in discovering dominant patterns • Hierarchical in nature, allowing multi-resolution analysis • Currently working on • Real-world applications of the proposed method • Effective initialization schemes • Parallel implementation
