Tools and Algorithms for Querying and Mining Large Graphs. Hanghang Tong Machine Learning Department Carnegie Mellon University htong@cs.cmu.edu http://www.cs.cmu.edu/~htong. Thesis Committee. Christos Faloutsos William Cohen Jeff Schneider Philip S. Yu. Graphs are everywhere!.
Tools and Algorithms for Querying and Mining Large Graphs Hanghang Tong Machine Learning Department Carnegie Mellon University htong@cs.cmu.edu http://www.cs.cmu.edu/~htong
Thesis Committee • Christos Faloutsos • William Cohen • Jeff Schneider • Philip S. Yu
Motivating Questions (high level) • Given a large graph, we want to support: • Task A: Querying, e.g., CePS on DBLP [Tong+ KDD 06] • Task B: Mining, e.g., T3 on CIKM [Tong+ CIKM 08] • Will return to this later…
Motivating Questions (in detail) • Querying [Goal: query complex relationships] • Q.1. Find complex user-specific patterns; • Q.2. Link Prediction & Proximity Tracking; • Q.3. Answer all the above questions quickly. • Mining [Goal: find interesting patterns] • M.1. Spot Anomalies; • M.2. Mine time & space; • M.3. Detect communities.
Thesis Overview [Figure: overview of questions Q1, Q2, Q3, M1, M2, M3]
Questions That We Ask Thesis Overview Completed Proposed Q1 CePS, G-Ray, ProSIN (KDD06, KDD07 a, ICDM08) pTrack/cTrack (SDM08, SAM08) Q2 Q2 DAP (KDD07 b) Q3 FastProx (ICDM06, KAIS07, KDD07 b, ICDM08) P3 Q3 FastProx (SDM08, SAM08) M1 M1 Colibri-S (KDD08 b) P3 Colibri-D (KDD08 b) M2 M2 P2 P3 T3/MT3 (CIKM08) M3 P1 P3 P1 M3
Thesis Overview: Impact Querying Mining Footnote: Our work for Q1 has been transferred into IBM product (Cyano)
Roadmap • Preliminary • Q1 • Q2 • Q3 • Introduction • Completed Work • Querying • Mining • Proposed Work
Preliminary: Proximity Measurement a.k.a Relevance, Closeness, ‘Similarity’…
Completed work on Q1 • Goal: Find complex user-specific patterns • Q1.1. Center-Piece Subgraph Discovery • e.g., who is the mastermind criminal, given some suspects X, Y and Z? • Q1.2. Best-Effort Pattern Match • e.g., a money-laundering ring • Q1.3. Interactive Querying (e.g., Negation) • e.g., find the conferences most similar to KDD, but not like ICML
Q1.1 Center-Piece Subgraph Discovery [Tong+ KDD 06] Input Output CePS Node CePS Original Graph Q: How to find hub for the black nodes? Red: Max (Prox(A, Red) x Prox(B, Red) x Prox(C, Red))
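The scoring rule on this slide (pick the node that maximizes the product of its proximities to all query nodes) can be sketched as follows. This is a minimal illustration, not the CePS implementation: it assumes random walk with restart as the proximity measure, and the toy graph, the function names `rwr` and `center_piece`, and the value `c=0.9` are all hypothetical.

```python
def rwr(adj, seed, c=0.9, iters=200):
    """Random-walk-with-restart scores from `seed`, by power iteration.

    adj maps each node to its list of neighbours (undirected, unweighted).
    c is the damping factor; the walker restarts at `seed` with prob. 1 - c.
    """
    r = dict.fromkeys(adj, 0.0)
    r[seed] = 1.0
    for _ in range(iters):
        nxt = dict.fromkeys(adj, 0.0)
        for u, neighbours in adj.items():
            share = c * r[u] / len(neighbours)
            for v in neighbours:
                nxt[v] += share
        nxt[seed] += 1.0 - c
        r = nxt
    return r


def center_piece(adj, queries, c=0.9):
    """Return the non-query node with the largest product of proximities
    (an AND query over all query nodes), together with that product."""
    per_query = [rwr(adj, q, c) for q in queries]
    best, best_score = None, -1.0
    for v in adj:
        if v in queries:
            continue
        score = 1.0
        for scores in per_query:
            score *= scores[v]
        if score > best_score:
            best, best_score = v, score
    return best, best_score


# Hypothetical toy graph: H links the three query nodes, X hangs off A only.
toy = {
    "A": ["H", "X"],
    "B": ["H"],
    "C": ["H"],
    "H": ["A", "B", "C"],
    "X": ["A"],
}
hub, hub_score = center_piece(toy, ["A", "B", "C"])
print(hub, hub_score)
```

Because the score is a product, a node that is close to only one query node (like `X` here) is penalized by its small proximity to the others, which is the AND semantics of the query.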
CePS: Example (AND Query) • DBLP co-authorship network: • - 400,000 authors, 2,000,000 edges
K_SoftAND: Relaxation of AND Asking AND query? No Answer! Disconnected Communities Noise
CePS: 2 SoftAND DB Stat.
Q1.2. Best-Effort Pattern Match [Tong+ KDD 2007 b] Query Graph Interception Data Graph Matching Subgraph Input Output Q: How to find matching subgraph?
G-Ray: How to? (details) • Goodness = Prox(12, 4) × Prox(4, 12) × Prox(7, 4) × Prox(4, 7) × Prox(11, 7) × Prox(7, 11) × Prox(12, 11) × Prox(11, 12) • [Figure: matching nodes in the data graph] Observation: , etc.
Effectiveness: star-query Databases Intelligent Agent Bio-medical Query Result
Effectiveness: line-query Theory Databases Learning Bio-medical Query Result
Q1.3: Interactive Querying [Figure: query results refined by user feedback]
Q1.3 ProSIN for Interactive Querying [Tong+ ICDM 08] What are the most related conferences w.r.t. KDD? (DBLP author-conference bipartite graph)
Q2.1 Link Prediction: direction [Tong+ KDD 07 a] • Q: Given the existence of the link, what is the direction of the link? • A: (DAP) Compare Prox(i → j) and Prox(j → i) • Web Link: 4,000 nodes, 10,000 edges • [Figure: density of Prox(i → j) − Prox(j → i); annotation: >70%]
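The comparison of Prox(i → j) against Prox(j → i) can be illustrated with a small sketch. Two assumptions are made here and are not taken from the slide: random walk with restart on the directed graph stands in for the direction-aware proximity (DAP itself is defined in the cited paper), and the link is guessed to point in the direction of the higher proximity. The graph `web` and the function names are hypothetical.

```python
def rwr(out_links, seed, c=0.9, iters=300):
    """Random walk with restart on a directed graph, by power iteration.

    out_links maps each node to the list of nodes it points to.
    """
    r = dict.fromkeys(out_links, 0.0)
    r[seed] = 1.0
    for _ in range(iters):
        nxt = dict.fromkeys(out_links, 0.0)
        for u, targets in out_links.items():
            for v in targets:
                nxt[v] += c * r[u] / len(targets)
        nxt[seed] += 1.0 - c
        r = nxt
    return r


def guess_direction(out_links, i, j, c=0.9):
    """Guess the direction of a link between i and j.

    Assumption: the link points in the direction of higher proximity.
    """
    prox_ij = rwr(out_links, i, c)[j]  # proximity from i to j
    prox_ji = rwr(out_links, j, c)[i]  # proximity from j to i
    return (i, j) if prox_ij >= prox_ji else (j, i)


# Hypothetical toy web graph: several pages point to the popular page "h".
web = {
    "a": ["h", "b"],
    "b": ["h"],
    "c": ["h"],
    "h": ["a"],
}
predicted = guess_direction(web, "h", "b")
print(predicted)
```

Proximity is asymmetric on a directed graph, which is what makes the comparison informative: here the walker starting at `b` reaches `h` in one step, while the walker starting at `h` reaches `b` only through `a`.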
Q2.2 pTrack/cTrack: Challenge [Tong+ SDM 08] • Observations (CePS, G-Ray, ProSIN…) • All for static graphs • Proximity: main tool • Graphs are evolving over time! • New nodes/edges show up; • Existing nodes/edges die out; • Edge weights change… • Q: How to make everything incremental? • A: Track Proximity!
pTrack/cTrack: Trend analysis on graph level [Figure: rank of influence vs. year for T. Sejnowski, G. Hinton, C. Koch, M. Jordan]
pTrack: Problem Definitions • [Given] • (1) a large, skewed, time-evolving bipartite graph, • (2) the query nodes of interest • [Track] • (1) top-k most related nodes for each query node at each time step t; • (2) the proximity score (or rank of proximity) between any two query nodes at each time step t
pTrack: Philip S. Yu’s Top-5 conferences up to each year DBLP: (Au. x Conf.) - 400k aus, - 3.5k confs - 20 yrs Databases Performance Distributed Sys. Databases Data Mining
KDD’s Rank wrt. VLDB over years Prox. Rank (Closer) Data Mining and Databases are getting closer & closer Year
cTrack: 10 most influential authors in NIPS community up to each year • Author-paper bipartite graph from NIPS 1987-1999: 1740 papers, 2037 authors, spanning 13 years • [Figure labels: T. Sejnowski, M. Jordan]
Proximity is the main tool • Q.1: CePS, G-Ray, ProSIN • Q.2: DAP, pTrack/cTrack • a.k.a. Relevance, Closeness, 'Similarity'… • Q: What is a 'good' Score?
Random walk with restart [Pan+ KDD 2004] • Nearby nodes, higher scores • More red, more relevant • [Figure: 12-node example graph and its ranking vector, with scores from 0.02 to 0.13]
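Why is RWR a good score? • $Q = (I - c\tilde{W})^{-1} = I + c\tilde{W} + c^2\tilde{W}^2 + c^3\tilde{W}^3 + \dots$ • $\tilde{W}$: adjacency matrix. c: damping factor • $c\tilde{W}$: all paths from i to j with length 1 • $c^2\tilde{W}^2$: all paths from i to j with length 2 • $c^3\tilde{W}^3$: all paths from i to j with length 3 • RWR summarizes all the weighted paths from i to j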
Computing RWR • On the Fly • No Pre-Computation; • Light Storage Cost ($\tilde{W}$) • Slow On-Line Response: O(mE) • Pre-Compute • Fast On-Line Response • Prohibitive Pre-Compute Cost: O(n^3) • Prohibitive Storage Cost: O(n^2)
Q: How to Balance On-line and Off-line Cost? • Goal: Efficiently get (elements of) Q
B_Lin: Basic Idea [Tong+ ICDM 2006] • Find Community • Fix the remaining • Combine • [Figure: 12-node example graph partitioned into communities, with the resulting ranking vector]
B_Lin: details • $\tilde{W} = \tilde{W}_1 + \tilde{W}_2$ • $\tilde{W}_1$: within community • $\tilde{W}_2$: cross community
B_Lin: details • $(I - c\tilde{W})^{-1} \approx (I - c\tilde{W}_1 - cUSV)^{-1}$ • $I - c\tilde{W}_1$: easy to invert • $USV$: LRA of the difference $\tilde{W} - \tilde{W}_1$ • Sherman–Morrison Lemma! If $Q_1^{-1} = (I - c\tilde{W}_1)^{-1}$ and $\Lambda = (S^{-1} - cVQ_1^{-1}U)^{-1}$, then $(I - c\tilde{W}_1 - cUSV)^{-1} = Q_1^{-1} + cQ_1^{-1}U\Lambda VQ_1^{-1}$
B_Lin: summary • Pre-Compute Stage • Q: Efficiently compute and store Q • A: A few small, instead of ONE BIG, matrix inversions • On-Line Stage • Q: Efficiently recover one column of Q • A: A few, instead of MANY, matrix-vector multiplications
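The two stages can be sketched with NumPy under stated assumptions. The communities are given by hand rather than found by a partitioning algorithm, truncated SVD supplies the low-rank factors U, S, V of the cross-community part, and the inverse is assembled with the Sherman–Morrison–Woodbury identity $(I - c\tilde{W}_1 - cUSV)^{-1} = Q_1^{-1} + cQ_1^{-1}U\Lambda VQ_1^{-1}$ with $\Lambda = (S^{-1} - cVQ_1^{-1}U)^{-1}$. The function names, the toy graph, and c = 0.9 are hypothetical; this is not the authors' code.

```python
import numpy as np


def b_lin_precompute(W, blocks, rank, c=0.9):
    """Pre-compute stage: a few small inversions instead of one big one."""
    n = W.shape[0]
    # Within-community part: keep only the diagonal blocks of W.
    W1 = np.zeros_like(W)
    for idx in blocks:
        W1[np.ix_(idx, idx)] = W[np.ix_(idx, idx)]
    # Cross-community part, replaced by a rank-`rank` approximation U S V.
    U, s, Vt = np.linalg.svd(W - W1)
    U, S, V = U[:, :rank], np.diag(s[:rank]), Vt[:rank]
    # Invert I - c*W1 one block at a time.
    Q1_inv = np.zeros((n, n))
    for idx in blocks:
        block = np.eye(len(idx)) - c * W1[np.ix_(idx, idx)]
        Q1_inv[np.ix_(idx, idx)] = np.linalg.inv(block)
    # Small (rank x rank) core matrix from the Woodbury identity.
    Lam = np.linalg.inv(np.linalg.inv(S) - c * V @ Q1_inv @ U)
    return Q1_inv, U, Lam, V


def b_lin_query(pre, i, c=0.9):
    """On-line stage: recover column i with a few matrix-vector products."""
    Q1_inv, U, Lam, V = pre
    x = Q1_inv[:, i]
    return x + c * (Q1_inv @ (U @ (Lam @ (V @ x))))


# Hypothetical graph: two triangles {0,1,2} and {3,4,5} joined by edge 2-3.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
W = A / A.sum(axis=0, keepdims=True)  # column-normalized

pre = b_lin_precompute(W, blocks=[[0, 1, 2], [3, 4, 5]], rank=2)
approx_col = b_lin_query(pre, 0)
exact_col = np.linalg.inv(np.eye(6) - 0.9 * W)[:, 0]
print(np.round(approx_col, 4))
```

In this toy case the cross-community part has rank 2, so the result is exact; on a real graph the rank is chosen much smaller than the true rank and the result is an approximation.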
Query Time vs. Pre-Compute Time • Quality: 90%+ • On-line: up to 150x speedup • Pre-computation: two orders of magnitude saving • [Figure: log query time vs. log pre-compute time, with "Our Results" marked]
More on Scalability Issues for Querying (the spectrum of "FastProx") • B_Lin: one large linear system • [Tong+ ICDM06, KAIS08] • BB_Lin: the intrinsic complexity is small • [Tong+ KAIS08] • FastUpdate: time-evolving linear system • [Tong+ SDM08, SAM08] • FastAllDAP: multiple linear systems • [Tong+ KDD07 a] • Fast-ProSIN: dealing w/ on-line feedback • [Tong+ ICDM 2008]
Roadmap • M1: Spotting Anomalies • M2: Mining Time • Introduction • Completed Work • Querying • Mining • Proposed Work
Motivation [Tong+ KDD 08 b] • Q: How to find patterns? • e.g., communities, anomalies, etc. • A: Low-Rank Approximation (LRA) for Adjacency Matrix of the Graph: A ≈ L × M × R
LRA for Graph Mining: Example • Adj. matrix (Author × Conf.): A ≈ L × M × R • Authors: John, Tom, Bob, Carl, Van, Roy • Conferences: ICDM, KDD, ISMB, RECOMB • L: Au. clusters • M: Interaction • R: Conf. Cluster • Recon. error is high → 'Carl' is abnormal
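The anomaly test on this slide (a row that the low-rank model reconstructs poorly is abnormal) can be sketched with a truncated SVD. Both the choice of SVD as the low-rank approximation and the paper counts in the matrix are assumptions for illustration; they are not data from the talk, and Colibri itself is not SVD.

```python
import numpy as np

authors = ["John", "Tom", "Bob", "Carl", "Van", "Roy"]
confs = ["ICDM", "KDD", "ISMB", "RECOMB"]

# Hypothetical author-by-conference paper counts.
A = np.array([
    [3, 3, 0, 0],  # John: data mining venues
    [2, 2, 0, 0],  # Tom
    [4, 4, 0, 0],  # Bob
    [1, 0, 0, 1],  # Carl: fits neither cluster
    [0, 0, 3, 3],  # Van: bioinformatics venues
    [0, 0, 2, 2],  # Roy
], dtype=float)

# Rank-2 approximation A ~ L M R from the top two singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
L, M, R = U[:, :k], np.diag(s[:k]), Vt[:k]
A_hat = L @ M @ R

# Per-author reconstruction error: norm of each residual row.
errors = np.linalg.norm(A - A_hat, axis=1)
most_abnormal = authors[int(errors.argmax())]
for name, err in zip(authors, errors):
    print(f"{name:5s} {err:.3f}")
```

The two dominant singular directions capture the two conference clusters, so authors who publish inside a single cluster are reconstructed almost perfectly while the row for `Carl` is not.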
Challenges: How to get (L, M, R)? • Efficiently • both time and space • Intuitively • easy for interpretation • Dynamically • track patterns over time None of Existing Methods Fully Meets Our Wish List!
Why Not SVD and CUR/CX? • SVD: Optimal in $L_2$ and $L_F$ • Efficiency • Time: • Space: (L, R) are dense • Interpretation • Linear Combination of many columns • Dynamic: Not Easy • CUR: Example-based • Efficiency • Better than SVD • Redundancy in L • Interpretation • Actual Columns from A • Dynamic: Not Easy