Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Mehmet Koyuturk (1), Ananth Grama (1), and Naren Ramakrishnan (2)
1. Dept. of Computer Sciences, Purdue University, {koyuturk, ayg}@cs.purdue.edu
2. Dept. of Computer Sciences, Virginia Tech, naren@cs.vt.edu

Motivation
• Handling large discrete-valued datasets
• Extracting relations between data items
• Summarizing data in an error-bounded fashion
• Clustering of data items
• Finding concise representations for clustered data

Background
• Singular Value Decomposition (SVD) [Berry et al., 1995]
  • Decompose matrix into $A = U S V^T$
  • $U$ and $V$ orthogonal matrices, $S$ diagonal with singular values
  • Used for Latent Semantic Indexing (LSI) in Information Retrieval
  • Truncate decomposition to compress data
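
The truncation step can be written out as follows. This is the standard textbook statement of the truncated SVD, not a formula taken from the slides; the rank $k$ and the symbols $\sigma_i$, $u_i$, $v_i$ (singular values and the columns of $U$ and $V$) are notation introduced here for illustration.

```latex
% Keep only the k largest singular values and their singular vectors.
A_k \;=\; U_k S_k V_k^T \;=\; \sum_{i=1}^{k} \sigma_i \, u_i v_i^T ,
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge 0 .

% A_k is a best rank-k approximation of A in the Frobenius norm,
% and the discarded singular values give the squared error:
\| A - A_k \|_F^2 \;=\; \sum_{i > k} \sigma_i^2 .
```

Storing $A_k$ takes $k(m + n + 1)$ real numbers for an $m \times n$ matrix, which is where the compression comes from.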

Background
• Semi-Discrete Decomposition (SDD) [Kolda and O’Leary, 1998]
  • Restrict entries of U and V to {-1,0,1}
  • Requires a very small amount of storage
  • Can perform as well as SVD in LSI using less than one-tenth the storage
  • Effective in finding outlier clusters
  • Works well for datasets containing a large number of small clusters
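
To give a feel for the storage argument, here is a rough back-of-the-envelope sketch. It assumes 64-bit floating-point values for SVD factors and for the SDD diagonal, and 2 bits per {-1,0,1} entry; the function names and the example sizes are hypothetical, and the comparison is at equal rank k (an SDD may need a larger k than an SVD for comparable quality).

```python
FLOAT_BITS = 64    # assumption: double-precision values
TERNARY_BITS = 2   # assumption: 2 bits are enough to encode -1, 0, or 1


def svd_storage_bits(m, n, k):
    """Bits for a rank-k truncated SVD: k columns of U and V plus k singular values."""
    return k * (m + n + 1) * FLOAT_BITS


def sdd_storage_bits(m, n, k):
    """Bits for a k-term SDD: ternary factor entries plus k floating-point weights."""
    return k * (m + n) * TERNARY_BITS + k * FLOAT_BITS


# Hypothetical sizes, chosen only for illustration.
m, n, k = 1000, 500, 10
ratio = svd_storage_bits(m, n, k) / sdd_storage_bits(m, n, k)
print(f"SVD needs about {ratio:.0f}x the storage of an SDD of the same rank")
```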

Rank-1 Approximation
x: presence vector
y: pattern vector

Rank-1 Approximation
Problem: Given a discrete matrix $A_{m \times n}$, find discrete vectors $x_{m \times 1}$ and $y_{n \times 1}$ to
Minimize $\|A - xy^T\|_F^2$ = number of non-zeros in the error matrix
Heuristic: Fix $y$, set $s = Ay$, and solve for $x$ to
Maximize $2x^T s - \|x\|_2^2 \|y\|_2^2$
Iteratively solve for $x$ and $y$ until no improvement is possible
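
The alternating heuristic can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes a binary matrix stored as a list of sets of column indices (one set per row), and the function names are hypothetical. Expanding the error for binary A, x, y gives error = nz(A) - (2 x^T A y - |x||y|), so minimizing the error is the same as maximizing the second term; with y fixed, that term is maximized by setting x(i) = 1 exactly when 2 s(i) >= |y|, and symmetrically for y with x fixed.

```python
def objective(rows, x, y):
    """2 x^T A y - |x||y| for binary x (set of row ids) and y (set of column ids)."""
    return 2 * sum(len(rows[i] & y) for i in x) - len(x) * len(y)


def error_nonzeros(rows, x, y):
    """Number of non-zeros in A - x y^T."""
    return sum(len(r ^ y) if i in x else len(r) for i, r in enumerate(rows))


def rank_one(rows, y0):
    """Alternate between solving for x and y until the objective stops improving."""
    x, y, best = set(), set(y0), float("-inf")
    while True:
        # Fix y: s(i) = |row_i & y|; choose x(i) = 1 iff 2 s(i) >= |y|.
        x_new = {i for i, r in enumerate(rows) if 2 * len(r & y) >= len(y)}
        # Fix x: count selected rows per column; choose y(j) = 1 iff 2 count >= |x|.
        counts = {}
        for i in x_new:
            for j in rows[i]:
                counts[j] = counts.get(j, 0) + 1
        y_new = {j for j, c in counts.items() if 2 * c >= len(x_new)}
        score = objective(rows, x_new, y_new)
        if score <= best:          # no improvement possible: stop
            return x, y
        x, y, best = x_new, y_new, score


# Toy matrix: three rows close to pattern {0, 1, 2}, two rows equal to {3, 4}.
rows = [{0, 1, 2}, {0, 1, 2}, {0, 1}, {3, 4}, {3, 4}]
x, y = rank_one(rows, {0})
print(sorted(x), sorted(y), error_nonzeros(rows, x, y))
```

Each pass touches every non-zero a constant number of times, which is consistent with the O(nz(A)) cost per iteration claimed for the method.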

Initialization of pattern vector
• Crucial to escape from local optima
• Must require at most O(nz(A)) time, so as not to dominate the run time
• Some possible schemes
  • AllOnes: Set all entries to 1. Poor.
  • Threshold: Set only the entries whose corresponding columns have more non-zeros than a threshold. Can lead to bad local optima.
  • Maximum: Set only the entry that corresponds to the column with the maximum number of non-zeros. Risky: that column may be shared by many patterns.
  • Partition: Partition the rows of the matrix based on a column, then apply the Threshold scheme taking into account only one of the parts. Best among these.
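
The four schemes listed above might look like this in plain Python. This is only a sketch under assumptions: rows are stored as sets of column indices, the function and parameter names are hypothetical, and the slides do not say how the threshold or the partitioning column (called `pivot` here) is chosen, so both are left as caller-supplied arguments. Each scheme makes a single pass over the non-zeros.

```python
def column_counts(rows):
    """Number of non-zeros in each column that has at least one."""
    counts = {}
    for r in rows:
        for j in r:
            counts[j] = counts.get(j, 0) + 1
    return counts


def init_all_ones(n):
    """AllOnes: every entry of the pattern vector is 1."""
    return set(range(n))


def init_threshold(rows, threshold):
    """Threshold: columns with more than `threshold` non-zeros."""
    return {j for j, c in column_counts(rows).items() if c > threshold}


def init_maximum(rows):
    """Maximum: only the densest column (ties broken by smallest index)."""
    counts = column_counts(rows)
    return {min(counts, key=lambda j: (-counts[j], j))}


def init_partition(rows, pivot, threshold):
    """Partition: keep rows with a 1 in column `pivot`, then apply Threshold."""
    part = [r for r in rows if pivot in r]
    return init_threshold(part, threshold)


rows = [{0, 1, 2}, {0, 1, 2}, {0, 1}, {3, 4}, {3, 4}]
print(init_threshold(rows, 2), init_maximum(rows), init_partition(rows, 3, 1))
```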

Recursive Algorithm
- At any step, given rank-one approximation $A \approx xy^T$, split A into A1 and A0 based on rows:
  - if x(i)=1, row i goes to A1
  - if x(i)=0, row i goes to A0
- Stop when
  • the Hamming radius of A1, i.e., the maximum of the Hamming distances of the rows of A1 to the pattern vector, is less than some threshold
  • all rows of A are present in A1
  (if A1 does not satisfy the Hamming radius condition, A1 can be split based on Hamming distances)
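
The recursion can be sketched as below. This is a hedged illustration rather than the authors' code: rows are sets of column indices, the names `decompose` and `max_radius` are hypothetical, the pattern vector is seeded with the densest column (the "Maximum" scheme) purely for brevity, the radius test uses "at most" rather than "less than" so that a threshold of 0 is usable, and the rule for splitting A1 by Hamming distance (separate the farthest rows) is one possible choice that the slides do not specify.

```python
def rank_one(rows, y0):
    """Alternating rank-1 heuristic on rows stored as sets of column indices."""
    x, y, best = set(), set(y0), float("-inf")
    while True:
        x_new = {i for i, r in enumerate(rows) if 2 * len(r & y) >= len(y)}
        counts = {}
        for i in x_new:
            for j in rows[i]:
                counts[j] = counts.get(j, 0) + 1
        y_new = {j for j, c in counts.items() if 2 * c >= len(x_new)}
        score = 2 * sum(len(rows[i] & y_new) for i in x_new) - len(x_new) * len(y_new)
        if score <= best:
            return x, y
        x, y, best = x_new, y_new, score


def decompose(rows, ids, max_radius):
    """Return a list of (pattern, row ids) leaves for the rows listed in ids."""
    sub = [rows[i] for i in ids]
    counts = {}
    for r in sub:
        for j in r:
            counts[j] = counts.get(j, 0) + 1
    if not counts:                      # only all-zero rows remain
        return [(set(), list(ids))]
    seed = min(counts, key=lambda j: (-counts[j], j))   # densest column
    x, y = rank_one(sub, {seed})
    a1 = [ids[k] for k in sorted(x)]                    # rows with x(i) = 1
    a0 = [i for k, i in enumerate(ids) if k not in x]   # rows with x(i) = 0
    if not a0:                          # all rows are present in A1
        dist = [len(rows[i] ^ y) for i in ids]          # Hamming distances to y
        radius = max(dist)
        if radius <= max_radius:
            return [(y, a1)]            # both stopping conditions hold: leaf
        # Otherwise split A1 by Hamming distance: peel off the farthest rows.
        a0 = [i for i, d in zip(ids, dist) if d == radius]
        if len(a0) == len(ids):         # all rows equally far: peel off one row
            a0 = a0[:1]
        a1 = [i for i in ids if i not in a0]
    return decompose(rows, a1, max_radius) + decompose(rows, a0, max_radius)


rows = [{0, 1, 2}, {0, 1, 2}, {0, 1}, {3, 4}, {3, 4}]
coarse = decompose(rows, list(range(len(rows))), max_radius=1)
exact = decompose(rows, list(range(len(rows))), max_radius=0)
print(coarse)
print(exact)
```

A larger `max_radius` gives fewer, coarser patterns; a radius of 0 keeps splitting until every group of rows matches its pattern exactly.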

Recursive Algorithm

Effectiveness of Analysis

Run-time Scalability
• Rank-1 approximation requires O(nz(A)) time
• Total run time at each level of the recursion tree cannot exceed this, since the total # of non-zeros at each level is at most nz(A)
• Run time is linear in nz(A)
[Figure: run time vs. # columns, run time vs. # rows, run time vs. # non-zeros]
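
The per-level argument can be spelled out as a short bound. The notation is introduced here for illustration and is not from the slides: $A_{\ell,1}, \dots, A_{\ell,k_\ell}$ are the submatrices at level $\ell$ of the recursion tree, $d$ is the depth of the tree, and each rank-1 approximation is assumed to take a bounded number of iterations.

```latex
% Submatrices at one level hold disjoint sets of rows of A, so
\sum_{i=1}^{k_\ell} \mathrm{nz}(A_{\ell,i}) \;\le\; \mathrm{nz}(A) .

% Each rank-1 approximation costs O(nz) of its submatrix, hence per level
T_\ell \;=\; \sum_{i=1}^{k_\ell} O\big(\mathrm{nz}(A_{\ell,i})\big) \;=\; O\big(\mathrm{nz}(A)\big) ,

% and over a recursion tree of depth d
T \;=\; \sum_{\ell=1}^{d} T_\ell \;=\; O\big(d \cdot \mathrm{nz}(A)\big) .
```

For a fixed depth, the total is linear in nz(A), matching the last bullet.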

Conclusions and Ongoing Work
• Proposed algorithm is
  • Scalable to extremely high dimensions
  • Effective in discovering dominant patterns
  • Hierarchical in nature, allowing multi-resolution analysis
• Currently working on
  • Real-world applications of the proposed method
  • Effective initialization schemes
  • Parallel implementation
