Multilinear Algebra for Analyzing Data with Multiple Linkages. Tamara G. Kolda, plus: Brett Bader, Danny Dunlavy, Philip Kegelmeyer. Sandia National Labs. TRICAP 2006, Chania, Greece, June 4-9, 2006.
Linear Algebra for Data with Linkages
[Figure panels: Circle-Square Matrix; Circle-Circle Co-Link Matrix; Square-Square Co-Link Matrix; SVD Rank-k Approximation (k=2)]
Latent Semantic Indexing (LSI) for Text Retrieval
[Figure: Term-Document Matrix with terms car, service, military, repair and documents d1, d2, d3; Query “Car Service”]
• SMART Retrieval System: G. Salton (1971)
• LSI: S. Dumais et al. (1988)
• S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deerwester, and R. Harshman. Using latent semantic analysis to improve access to textual information. In CHI '88, pp. 281–285, 1988.
• S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. J. Am. Soc. Inform. Sci., 41(6):391–407, 1990.
• M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Rev., 37(4):573–595, 1995.
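As a minimal sketch of the rank-k idea behind LSI, the NumPy snippet below truncates the SVD of a small term-document matrix to k=2 and scores a "car service" query in the reduced space. The matrix entries and the helper names `truncated_svd` and `lsi_scores` are illustrative assumptions, not the values or code from the slide.

```python
import numpy as np

terms = ["car", "service", "military", "repair"]
# Hypothetical term-document counts (rows: terms, columns: d1, d2, d3).
A = np.array([
    [1.0, 0.0, 1.0],   # car
    [1.0, 1.0, 0.0],   # service
    [0.0, 1.0, 0.0],   # military
    [0.0, 0.0, 1.0],   # repair
])

def truncated_svd(A, k):
    """Return U_k, s_k, V_k such that A_k = U_k @ diag(s_k) @ V_k.T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T

def lsi_scores(A, query, k):
    """Cosine similarity between the query and each document in rank-k space."""
    U_k, s_k, V_k = truncated_svd(A, k)
    q_hat = U_k.T @ query      # query projected onto the k latent directions
    docs = V_k * s_k           # row j holds document j in the same coordinates
    denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat)
    return docs @ q_hat / np.where(denom == 0, 1.0, denom)

query = np.array([1.0, 1.0, 0.0, 0.0])    # "car service"
U2, s2, V2 = truncated_svd(A, 2)
A2 = U2 @ np.diag(s2) @ V2.T              # rank-2 approximation of A
scores = lsi_scores(A, query, 2)          # one score per document
```

By the Eckart-Young theorem, the Frobenius error of `A2` equals the third singular value of `A`, which gives a quick sanity check on the truncation.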
Applications of LSI
• Graph the Results using U2 and V2
[Figure panels: Term-Document Similarities; Term-Term; Document-Document. Terms: car, service, military, repair. Documents: d1, d2, d3]
Caveats for LSI
• How to use ?
• Term-document matrix weighting is critical!
  • Local Weight: Log (f_ij = frequency)
  • Global Term Weight: Inverse Document Frequency (N = total docs, n_i = # docs with term i)
  • Normalization Factor: “Cosine”
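The weighting scheme named above (log local weight, inverse document frequency, cosine normalization) can be sketched as follows. The exact formulas are assumptions, since several variants are in common use: here the local weight is log(1 + f_ij) and the global weight is log(N / n_i). The function name `weight_term_document` is hypothetical.

```python
import numpy as np

def weight_term_document(F):
    """Apply log local weight, IDF global weight, and cosine normalization.

    F[i, j] is the raw frequency of term i in document j.
    Assumed formulas (common variants, not confirmed by the slide):
      local  = log(1 + f_ij)
      global = log(N / n_i)
    """
    F = np.asarray(F, dtype=float)
    n_docs = F.shape[1]                             # N = total docs
    docs_with_term = np.count_nonzero(F, axis=1)    # n_i = # docs with term i
    local = np.log1p(F)
    # Terms that appear in no document get a global weight of zero.
    safe_counts = np.where(docs_with_term == 0, n_docs, docs_with_term)
    global_w = np.log(n_docs / safe_counts)
    W = local * global_w[:, None]
    # Cosine normalization: scale each document column to unit 2-norm.
    norms = np.linalg.norm(W, axis=0)
    return W / np.where(norms == 0, 1.0, norms)

# Hypothetical raw counts (rows: terms, columns: documents).
F = np.array([
    [2, 0, 1],
    [1, 1, 1],   # appears in every document, so its IDF weight is zero
    [0, 3, 0],
    [0, 0, 4],
])
X = weight_term_document(F)
```

With this choice of global weight, a term that occurs in every document contributes nothing, which is one reason the weighting changes LSI results so strongly.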
Citation/Link Analysis (Same Nodes)
• Examples: Citation data, Web links
[Figure: directed graph on nodes 1, 2, 3, 4. Panels: Link Matrix; Co-Citation Matrix; Co-Reference Matrix; Hub Scores; Authority Scores]
• Doc 1 is the most important hub!
• Doc 3 is the most important authority!
• J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, 1999.
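A small sketch of hub and authority scoring in the sense of Kleinberg's HITS: hub scores are the principal left singular vector of the link matrix (equivalently, the principal eigenvector of the co-reference matrix L Lᵀ), and authority scores are the principal right singular vector (principal eigenvector of the co-citation matrix Lᵀ L). The four-node graph below is a hypothetical example chosen so that doc 1 is the top hub and doc 3 the top authority; it is not necessarily the graph drawn on the slide.

```python
import numpy as np

# Hypothetical link matrix: L[i, j] = 1 if doc i+1 links to doc j+1.
L = np.array([
    [0, 1, 1, 1],   # doc 1 -> docs 2, 3, 4
    [0, 0, 1, 0],   # doc 2 -> doc 3
    [0, 0, 0, 0],   # doc 3 has no outgoing links
    [0, 0, 1, 0],   # doc 4 -> doc 3
], dtype=float)

def hits_scores(L):
    """Return (hub, authority) scores, each normalized to sum to 1."""
    U, s, Vt = np.linalg.svd(L)
    # The principal singular vectors of a nonnegative matrix can be taken
    # nonnegative; abs() removes the arbitrary sign returned by the SVD.
    hubs = np.abs(U[:, 0])
    authorities = np.abs(Vt[0, :])
    return hubs / hubs.sum(), authorities / authorities.sum()

co_reference = L @ L.T    # entry (i, j): number of docs that both i and j link to
co_citation = L.T @ L     # entry (i, j): number of docs that link to both i and j
hubs, authorities = hits_scores(L)
```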
Multiple Links? Suppose the connections between nodes are “labeled” in some fashion. In other words, we have meta-data on the connections. Can we somehow use multilinear algebra for link analysis?
PARAFAC
• PARAFAC = Parallel Factors
• a.k.a. CANDECOMP = Canonical Decomposition
• Higher-order analogue of the SVD
• Columns of A, B, and C are not orthonormal
• If R is minimal, then R is called the rank of the tensor (Kruskal 1977)
• Can have rank(X) > min{I,J,K}
• Often guaranteed to be a unique rank decomposition!
[Figure: the I x J x K tensor X written with factor matrices A (I x R), B (J x R), C (K x R) and an R x R x R identity core I, equivalently as a sum of R rank-one terms]
• R. A. Harshman. Foundations of the PARAFAC procedure: models and conditions for an “explanatory” multi-modal factor analysis. UCLA working papers in phonetics, 16:1–84, 1970.
• J. D. Carroll and J. J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of `Eckart-Young' decomposition. Psychometrika, 35:283–319, 1970.
Many ways to write PARAFAC
• “Tucker Operator”
• “Kruskal Operator”
• Easy to write the N-way case
• J. B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra Appl., 18(2):95–138, 1977.
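For orientation, the two operators named above are commonly written as follows. This is a sketch in standard notation and may differ in detail from the formulas on the slide; it assumes a_r, b_r, c_r denote the r-th columns of A, B, C, that ∘ is the vector outer product, that ×_n is the mode-n tensor-matrix product, and that 𝓘 is the R x R x R identity tensor.

```latex
% Tucker operator: a core tensor G multiplied by a matrix in every mode
[\![ \mathcal{G};\, A, B, C ]\!] = \mathcal{G} \times_1 A \times_2 B \times_3 C

% Kruskal operator: the special case with an identity core (PARAFAC)
[\![ A, B, C ]\!] = [\![ \mathcal{I};\, A, B, C ]\!]
                  = \sum_{r=1}^{R} a_r \circ b_r \circ c_r

% N-way case
[\![ A^{(1)}, \dots, A^{(N)} ]\!] = \sum_{r=1}^{R} a_r^{(1)} \circ \cdots \circ a_r^{(N)}
```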
Properties of the Kruskal Operator
• PARAFAC core for a Tucker decomposition
• Matricize (arbitrary map of indices to rows and columns)
• Mode-n matricize
• Norm of a PARAFAC decomposition
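Two of these properties are easy to check numerically. The sketch below assumes the common conventions that the mode-1 unfolding orders columns with the second index varying fastest, so that X_(1) = A (C ⊙ B)ᵀ with ⊙ the Khatri-Rao product, and that the squared norm equals the sum of all entries of (AᵀA) * (BᵀB) * (CᵀC) with * the elementwise product. The function names are hypothetical.

```python
import numpy as np

def kruskal_to_tensor(A, B, C):
    """Form the dense tensor sum_r a_r o b_r o c_r."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def khatri_rao(U, V):
    """Column-wise Kronecker product; row (u * V.shape[0] + v) holds U[u] * V[v]."""
    return np.einsum('ur,vr->uvr', U, V).reshape(-1, U.shape[1])

def unfold_mode1(X):
    """Mode-1 matricization with the second index varying fastest along columns."""
    return X.reshape(X.shape[0], -1, order='F')

def kruskal_norm(A, B, C):
    """Frobenius norm of the Kruskal tensor without forming the tensor itself."""
    gram = (A.T @ A) * (B.T @ B) * (C.T @ C)
    return np.sqrt(gram.sum())

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((n, 3)) for n in (4, 5, 6))
X = kruskal_to_tensor(A, B, C)

same_unfolding = np.allclose(unfold_mode1(X), A @ khatri_rao(C, B).T)
same_norm = np.isclose(kruskal_norm(A, B, C), np.linalg.norm(X))
```

The norm formula only touches R x R Gram matrices, which is what makes it attractive when the full tensor is too large to form.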
PARAFAC for sparse data & approximations
• Our interest in the mathematical operations is motivated on two fronts
  • (1) Sparse computations
  • (2) Using tensor decompositions for approximation
• Ex: Considering how to efficiently implement PARAFAC-ALS for sparse data
• Can PARAFAC be used for the best rank-k approximation, rather than finding an exact decomposition (excepting noise)?
  • What does it even mean in this case??
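For reference, here is a minimal dense sketch of PARAFAC-ALS for a three-way array. It is only meant to show the alternating least-squares updates; it does not address the sparse implementation question raised above, and the function names, random initialization, and fixed iteration count are assumptions rather than the authors' method.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product; row (u * V.shape[0] + v) holds U[u] * V[v]."""
    return np.einsum('ur,vr->uvr', U, V).reshape(-1, U.shape[1])

def unfold(X, mode):
    """Mode-n matricization; remaining indices vary with the earliest one fastest."""
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

def parafac_als(X, rank, n_iter=100, seed=0):
    """Fit a rank-`rank` PARAFAC model to a dense 3-way array by ALS."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((n, rank)) for n in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            # The two factors held fixed, in increasing mode order.
            first, second = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(second, first)
            gram = (first.T @ first) * (second.T @ second)
            # Least-squares update for the factor of the current mode.
            factors[mode] = unfold(X, mode) @ kr @ np.linalg.pinv(gram)
    return factors

def relative_error(X, factors):
    A, B, C = factors
    X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# Synthetic tensor with an exact rank-2 structure.
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.standard_normal((n, 2)) for n in (4, 5, 6))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
factors = parafac_als(X, rank=2)
err = relative_error(X, factors)
```

Each update costs a dense matricized-tensor-times-Khatri-Rao product, which is the step a sparse implementation would need to restructure.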
Multilink Analysis using PARAFAC
• Quick Review: Tensors for Web Link Analysis
  • page x page x anchor text (TOPHITS)
• New work: Tensors for Publication Data Analysis
  • Case 1: doc x doc x similarity
  • Case 2: term x doc x author (HO-LSA??)
TOPHITS: PARAFAC for Web Link Analysis
[Figure: A set of four hyperlinked web pages]
• Graph representation shows basic connectivity
• Labeled edges capture context
Analyzing Publication Data: Doc x Doc x Similarity Representation
Computing Different Doc-Doc Similarities
• Computing term-based similarities (k=1,2,3)
• 5022 papers
• 16617 unique terms (ignoring stop words, words with length less than 3 or greater than 30 characters, and words that appear fewer than 2 times)
  • Titles: 5164
  • Abstracts: 15752
  • Keywords: 5248
• 6891 authors
• 2659 citations
• Enforces sparseness!
• Computing author similarities (k=4)
PARAFAC for Doc x Doc x Similarity
• H = “hubs”
• A = “authorities”
• C = “connections”
• Rank-30 decomposition
• Central idea: Each triplet provides a core “grouping” of the data, i.e., a specific topic.
Applications of the [H,A,C] Decomposition
• Latent document similarities
  • Calculate S = ½ HH^T + ½ AA^T
• Analyzing a body of work
  • c_h = hub centroid, c_a = authority centroid
  • s = ½ H c_h + ½ A c_a
• Disambiguation (EXAMPLE)
  • Calculate centroids using A (could also use H or A+H)
  • Calculate similarities of centroids
• Journal prediction
  • Use matrix A as features for input to a decision tree ensemble classifier
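The first two applications reduce to a few matrix products. The sketch below uses random stand-ins for H and A, and it assumes that each centroid is the mean of the factor rows belonging to the documents in the body of work; the slide does not spell out how the centroids are formed, so treat that choice and the function names as illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, rank = 8, 3
# Stand-ins for the hub and authority factor matrices of a PARAFAC model.
H = rng.random((n_docs, rank))
A = rng.random((n_docs, rank))

def latent_similarity(H, A):
    """S = 1/2 H H^T + 1/2 A A^T: latent document-document similarities."""
    return 0.5 * H @ H.T + 0.5 * A @ A.T

def body_of_work_scores(H, A, doc_ids):
    """s = 1/2 H c_h + 1/2 A c_a, with centroids taken as row means (assumption)."""
    c_h = H[doc_ids].mean(axis=0)   # hub centroid
    c_a = A[doc_ids].mean(axis=0)   # authority centroid
    return 0.5 * H @ c_h + 0.5 * A @ c_a

S = latent_similarity(H, A)
author_docs = [0, 2, 5]             # hypothetical documents forming a body of work
s = body_of_work_scores(H, A, author_docs)
```

Under the row-mean assumption, `s` is simply the average of the columns of `S` that belong to the body of work, so both quantities rank documents by the same latent similarity.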
Example of Disambiguation Results
• Two authors with missing middle initials
• 3 possible matches
[Figure: Matrix of Similarities]
Analyzing Publication Data: Term x Doc x Author Representation
[Figure: tensor X with axes term, doc, author]
• Form tensor X: element (i,j,k) is nonzero only if author k wrote document j using term i.
• Terms must appear in at least 3 documents and in no more than 10% of all documents. Moreover, each term must have at least 2 characters and no more than 30.
• 767 documents
• 2251 terms
• 1072 authors
• 59738 nonzeros
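To make the nonzero pattern concrete, the sketch below builds a term x doc x author tensor in coordinate form from a few made-up records. The stored value of 1.0 is an assumption (the slide only states where entries are nonzero), and the record layout and function names are hypothetical.

```python
# Hypothetical records: each document lists its terms and its authors.
documents = [
    {"terms": ["tensor", "rank"], "authors": ["author_a", "author_b"]},
    {"terms": ["tensor", "graph"], "authors": ["author_a"]},
    {"terms": ["graph"], "authors": ["author_c", "author_d"]},
]

def build_index(items):
    """Map each distinct item to a row/column index, in sorted order."""
    return {item: idx for idx, item in enumerate(sorted(set(items)))}

def build_term_doc_author_tensor(documents):
    """Return (entries, shape, term_index, author_index) in coordinate form.

    entries[(i, j, k)] is set only if author k wrote document j using term i.
    The value 1.0 is a placeholder weight (assumption).
    """
    term_index = build_index(t for d in documents for t in d["terms"])
    author_index = build_index(a for d in documents for a in d["authors"])
    entries = {}
    for j, doc in enumerate(documents):
        for term in set(doc["terms"]):
            for author in set(doc["authors"]):
                entries[(term_index[term], j, author_index[author])] = 1.0
    shape = (len(term_index), len(documents), len(author_index))
    return entries, shape, term_index, author_index

entries, shape, term_index, author_index = build_term_doc_author_tensor(documents)
density = len(entries) / (shape[0] * shape[1] * shape[2])
```

Because each document has only a handful of authors, every document slice of such a tensor has very few nonzero author columns.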
Different Graph Interpretations for Term x Doc x Author
• term-doc with author links
• term-author with doc links
• author-doc with term links
• term-doc-author with links
[Figure: Term and Doc nodes; different author links represented by different colors]
Author Data is Too Sparse
[Figure: tensor with axes term, doc, author]
• Result: the tensor has just a few nonzero columns in each lateral slice.
• Experimentally, PARAFAC seems to overfit such data and does not do a good job of “mixing” different authors.
Idea: Use Tucker Transformation to Compress
• We transform the tensor to a smaller tensor.
[Equation: the Tucker transformation in two equivalent forms, with factors labeled (rank 75) and (rank 50)]
• This transformation forces the authors to be mixed and produces a dense result.
• Main problem: How to transform a sparse tensor without creating dense intermediate results?
• Compute rank-25 PARAFAC on the compressed tensor and transform.
Tucker & PARAFAC
• Want PARAFAC for X in term x doc x author space
• First, apply dimensionality reduction to X to obtain Y
  • Y in “conceptual” space
• Next, compute PARAFAC on Y
• Finally, reassemble results to yield PARAFAC for X
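The compress-then-reassemble pipeline above rests on one identity: if Y has PARAFAC factors (A_y, B_y, C_y) and the compression matrices are U, V, W, then expanding Y back gives a PARAFAC model with factors (U A_y, V B_y, W C_y). The dense sketch below checks that identity. It assumes orthonormal compression matrices applied as Y = X ×₁ Uᵀ ×₂ Vᵀ ×₃ Wᵀ, which may not match the exact transformation on the slide, and all names and sizes are hypothetical.

```python
import numpy as np

def mode_n_product(X, M, mode):
    """Multiply tensor X by matrix M along the given mode (X x_mode M)."""
    return np.moveaxis(np.tensordot(M, X, axes=(1, mode)), 0, mode)

def tucker_transform(X, U, V, W):
    """Compute X x_1 U^T x_2 V^T x_3 W^T."""
    Y = mode_n_product(X, U.T, 0)
    Y = mode_n_product(Y, V.T, 1)
    return mode_n_product(Y, W.T, 2)

def kruskal_to_tensor(A, B, C):
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def orthonormal_basis(rng, n, k):
    """Random n x k matrix with orthonormal columns (stand-in for a real basis)."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    return Q

rng = np.random.default_rng(0)
U, V, W = (orthonormal_basis(rng, n, k) for n, k in ((10, 4), (9, 3), (8, 2)))

# Pretend these are PARAFAC factors computed on the compressed tensor Y.
Ay, By, Cy = (rng.standard_normal((k, 2)) for k in (4, 3, 2))
Y = kruskal_to_tensor(Ay, By, Cy)

# Reassemble: factors in the original term x doc x author space.
X_hat = kruskal_to_tensor(U @ Ay, V @ By, W @ Cy)

# Compressing the reassembled model recovers Y because U, V, W are orthonormal.
Y_back = tucker_transform(X_hat, U, V, W)
```

This version forms dense intermediates, so it illustrates the algebra only, not a way around the sparse-to-dense problem noted above.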
Three-Way Fingerprints
• Each of the Terms, Docs, and Authors has a rank-k (k=25) fingerprint from the PARAFAC approximation
• All items can be directly compared in “concept space”
• Thus, we can compare any of the following
  • Term-Term
  • Doc-Doc
  • Term-Doc
  • Author-Author
  • Author-Term
  • Author-Doc
• The fingerprints can be used as inputs for clustering, classification, etc.
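One simple way to compare fingerprints across modes is cosine similarity between rows of the factor matrices. The sketch below uses random stand-ins for the term, doc, and author fingerprints; the choice of cosine similarity and the function names are assumptions, since the slide does not name a specific similarity measure.

```python
import numpy as np

def normalize_rows(F):
    """Scale each fingerprint (row) to unit 2-norm, leaving zero rows unchanged."""
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    return F / np.where(norms == 0, 1.0, norms)

def cross_similarity(F1, F2):
    """Cosine similarity between every row of F1 and every row of F2."""
    return normalize_rows(F1) @ normalize_rows(F2).T

rng = np.random.default_rng(0)
k = 25
# Random stand-ins for rank-k fingerprints (rows of the PARAFAC factor matrices).
term_fp = rng.standard_normal((6, k))
doc_fp = rng.standard_normal((4, k))
author_fp = rng.standard_normal((3, k))

author_term = cross_similarity(author_fp, term_fp)   # Author-Term comparison
term_term = cross_similarity(term_fp, term_fp)       # Term-Term comparison
best_term_per_author = author_term.argmax(axis=1)
```

Because all three factor matrices share the same k columns, the same function serves every pairing listed above.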
Summary & Future Work
• PARAFAC provides a technique for analyzing semantic graphs
  • Third dimension captures different connection types
  • Or may consider it as the interconnection of 3 different node types
• Analyzed journal articles using different tensor representations
  • Doc x Doc x Connection
    • Need to make a definitive case for why 3D is better than 2D
  • Term x Doc x Author
    • Too sparse?
• Still working towards large-scale, sparse problems
  • Need implicit compression for PARAFAC
  • ~5M nonzeros
• Other decompositions?
  • Other hybrids
  • Symmetry
Acknowledgments & More Information
Thank You!
• Thanks to…
  • Brett Bader, Danny Dunlavy, Philip Kegelmeyer
  • Web data: Joe Kenny, Travis Bauer et al., Ken Kolda
  • Journal data: Kevin Boyack
  • Graph viz: Ann Yoshimura
• Related papers
  • Algorithm xxx: MATLAB Tensor Classes for Fast Algorithm Prototyping (with B. W. Bader), ACM TOMS, to appear.
  • Multilinear algebra for analyzing data with multiple linkages (with D. Dunlavy and W. P. Kegelmeyer), Technical Report SAND2006-2079, Apr. 2006.
  • Temporal analysis of social networks using three-way DEDICOM (with B. W. Bader and R. Harshman), Technical Report SAND2006-2161, Apr. 2006.
  • Multilinear operators for higher-order decompositions. Technical Report SAND2006-2081, Apr. 2006.
  • The TOPHITS model for higher-order web link analysis (with B. Bader), in Proc. Workshop on Link Analysis, Counterterrorism and Security, SDM06, Apr. 2006.
  • Higher-order web link analysis using multilinear algebra (with B. W. Bader), ICDM 2005, pp. 242–249, Nov. 2005.
• Contact Info:
  • tgkolda@sandia.gov
  • http://csmr.ca.sandia.gov/~tgkolda/