Mining Billion Node Graphs Christos Faloutsos CMU
CONGRATULATIONS! Welcome to CMU! C. Faloutsos
Outline • Q+A • Problem definition / Motivation • Graphs and power laws • Streams, environment, data center monitoring • Conclusions C. Faloutsos
Q+A • Are you recruiting? How many? • How many do you have? • How frequently do you meet them? • What is your advising style? • How do you feel about summer internships? C. Faloutsos
Q+A • Are you recruiting? How many? A: Yes, 1-2 • How many do you have? A: 5+2 • How frequently do you meet them? A: 1/week • What is your advising style? • How do you feel about summer internships? A: Yes/Maybe (Y!, G, MSR, IBM, ++) C. Faloutsos
Outline • Problem definition / Motivation • Graphs and power laws • Patterns and anomalies • Scalability and ‘hadoop’ • Influence/ virus propagation • Streams, environment, data center monitoring • Conclusions C. Faloutsos
Motivation • Data mining: ~ find patterns (rules, outliers) • What do real graphs look like? Anomalies? • Virus/influence propagation • Time series / env. monitoring [Figure: temperature in datacenter] C. Faloutsos
Graphs - why should we care? C. Faloutsos
Graphs - why should we care? Friendship Network [Moody ’01] C. Faloutsos
Graphs - why should we care? Food Web [Martinez ’91] Friendship Network [Moody ’01] Internet Map [lumeta.com] C. Faloutsos
Problem #1 - network and graph mining • What does the Internet look like? • What does FaceBook look like? • What is ‘normal’/‘abnormal’? • which patterns/laws hold? • To spot anomalies (rarities), we have to discover patterns • Large datasets reveal patterns/anomalies that may be invisible otherwise… C. Faloutsos
Graph mining • Are real graphs random? C. Faloutsos
Laws and patterns NO!! • Diameter • in- and out- degree distributions • other (surprising) patterns C. Faloutsos
Outline • Problem definition / Motivation • Graphs and power laws • Patterns and anomalies • Scalability and ‘hadoop’ • Influence/ virus propagation • Streams, environment, data center monitoring • Conclusions C. Faloutsos
S1 – degree distributions • Q: avg degree is ~3 - what is the most probable degree? [Plot: count vs. degree, with a guessed peak (??) at degree 3] C. Faloutsos
S1 – degree distributions • Q: avg degree is ~3 - what is the most probable degree? [Plots: count vs. degree; the guess of a peak at degree 3 is marked WRONG!] C. Faloutsos
Solution: The plot is linear in log-log scale [FFF’99] • freq = degree^(-2.15) • Exponent = slope: O = -2.15 [Plot: frequency vs. outdegree, log-log, Nov’97] C. Faloutsos
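A minimal sketch of how such an exponent could be read off a degree list: count how often each degree occurs, then fit a straight line to log(count) against log(degree). The function name `fit_loglog_slope` and the toy data are illustrative assumptions, and ordinary least squares is just one simple way to estimate the slope; the slide does not say how the -2.15 was fitted.

```python
import math
from collections import Counter


def fit_loglog_slope(degrees):
    """Least-squares slope of log(count) against log(degree).

    Hypothetical helper for illustration only.
    """
    # Degree 0 has no logarithm, so it is left out of the fit.
    counts = Counter(d for d in degrees if d > 0)
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    xs = [math.log(d) for d in counts]
    ys = [math.log(c) for c in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy data (made up): counts follow count = 64 * degree^(-2) exactly.
toy_degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope = fit_loglog_slope(toy_degrees)
```

On real, noisy degree data a plain least-squares fit can be skewed by the sparse tail, so the result is best treated as a rough estimate.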
Solution# S.2: Triangle ‘Laws’ Real social networks have a lot of triangles C. Faloutsos
Solution# S.2: Triangle ‘Laws’ Real social networks have a lot of triangles Friends of friends are friends Any patterns? C. Faloutsos
Triangle Law: #S.2 [Tsourakakis ICDM 2008] Reuters X-axis: degree Y-axis: mean # triangles n friends -> ???? triangles C. Faloutsos
Triangle Law: #S.2 [Tsourakakis ICDM 2008] • X-axis: degree • Y-axis: mean # triangles • n friends -> ~n^1.6 triangles [Plots: Reuters, SN, Epinions] C. Faloutsos
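The quantities on the two axes can be computed directly on a small graph. The sketch below counts, for each node, the triangles it takes part in, then averages that count per degree. It is a brute-force, in-memory illustration; the function names are hypothetical, and this is not the counting method of the cited papers, which target far larger graphs.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets; self-loops are dropped."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def triangles_per_node(edges):
    """Number of triangles each node participates in."""
    adj = build_adjacency(edges)
    result = {}
    for u, nbrs in adj.items():
        # Each triangle through u is seen twice, once from each other corner.
        shared = sum(len(nbrs & adj[v]) for v in nbrs)
        result[u] = shared // 2
    return result


def mean_triangles_by_degree(edges):
    """Map each degree to the mean triangle count of nodes with that degree."""
    adj = build_adjacency(edges)
    tri = triangles_per_node(edges)
    totals = defaultdict(list)
    for u, nbrs in adj.items():
        totals[len(nbrs)].append(tri[u])
    return {deg: sum(ts) / len(ts) for deg, ts in totals.items()}


# Toy example: a 4-clique, where every node has 3 friends and 3 triangles.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
per_node = triangles_per_node(k4)
by_degree = mean_triangles_by_degree(k4)
```

Plotting `by_degree` on log-log axes and fitting a slope would give the exponent that the slide reports as roughly 1.6 for real social networks.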
Triangle counting for large graphs? Anomalous nodes in Twitter (~3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11] C. Faloutsos
But: • Q1: How about graphs from other domains? • Q2: How about temporal evolution? C. Faloutsos
Time evolution • with Jure Leskovec (CMU -> Stanford) • and Jon Kleinberg (Cornell) (‘best paper’ KDD05) C. Faloutsos
T1 - Evolution of the Diameter • Prior work on Power Law graphs hints at slowly growing diameter: • diameter ~ O(log N) • diameter ~ O(log log N) • What is happening in real data? C. Faloutsos
T1 - Evolution of the Diameter • Prior work on Power Law graphs hints at slowly growing diameter: • diameter ~ O(log N) • diameter ~ O(log log N) • What is happening in real data? • Diameter shrinks over time • As the network grows the distances between nodes slowly decrease C. Faloutsos
Diameter – ArXiv citation graph • Citations among physics papers • 1992–2003 • One graph per year [Plot: diameter vs. time in years] C. Faloutsos
Diameter – “Patents” • Patent citation network • 25 years of data [Plot: diameter vs. time in years] C. Faloutsos
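To make "one graph per year" concrete, here is a small sketch that grows a graph from timestamped edges and records the diameter of each cumulative snapshot. It assumes undirected edges and uses the exact diameter (the longest finite shortest path, found by breadth-first search from every node). The slides do not say which diameter definition the plots use, so treat that choice, and the name `diameter_over_time`, as assumptions.

```python
from collections import defaultdict, deque


def bfs_eccentricity(adj, source):
    """Largest hop distance from source to any node it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def diameter_over_time(timed_edges):
    """Return [(time, diameter)] for the cumulative graph at each time step.

    timed_edges is an iterable of (time, u, v) tuples.
    Unreachable pairs are ignored, so only finite distances count.
    """
    by_time = defaultdict(list)
    for t, u, v in timed_edges:
        by_time[t].append((u, v))
    adj = defaultdict(set)
    series = []
    for t in sorted(by_time):
        for u, v in by_time[t]:
            adj[u].add(v)
            adj[v].add(u)
        diameter = max(bfs_eccentricity(adj, s) for s in list(adj))
        series.append((t, diameter))
    return series


# Toy example: a path a-b-c-d at time 1, closed into a cycle at time 2.
toy = [(1, "a", "b"), (1, "b", "c"), (1, "c", "d"), (2, "a", "d")]
series = diameter_over_time(toy)
```

In the toy data the diameter drops from 3 to 2 once the extra edge arrives, a miniature of the shrinking pattern. All-pairs BFS is quadratic, so this only suits small graphs.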
And many more patterns… • #nodes vs #edges (power law(!)) • # conn. Components (power law, too) • Contact/phone-call duration (log-logistic) • Total node weight vs # edges (super-linear/power law) • …. C. Faloutsos
Outline • Problem definition / Motivation • Graphs and power laws • Patterns and anomalies • Scalability and ‘hadoop’ • Influence/ virus propagation • Streams, environment, data center monitoring • Conclusions C. Faloutsos
E-bay Fraud detection w/ Polo Chau & Shashank Pandit, CMU [WWW’07] C. Faloutsos
E-bay Fraud detection C. Faloutsos
E-bay Fraud detection - NetProbe C. Faloutsos
Popular press And less desirable attention: • E-mail from ‘Belgium police’ (‘copy of your code?’) C. Faloutsos
Outline • Problem definition / Motivation • Graphs and power laws • Patterns and anomalies • Scalability and ‘hadoop’ • Influence/ virus propagation • Streams, environment, data center monitoring • Conclusions C. Faloutsos
Scalability • Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture” IEEE Micro 2003] • Yahoo: 5Pb of data [Fayyad, KDD’07] • Problem: machine failures, on a daily basis • How to parallelize data mining tasks, then? • A: map/reduce – hadoop (open-source clone) http://hadoop.apache.org/ C. Faloutsos
details [Diagram: the User Program forks the Master and the workers; the Master assigns map and reduce tasks; Mappers read Split0, Split1, Split2 of the Input Data (on HDFS) and write locally; Reducers do a remote read, sort, and write Output File0 and Output File1] • By default: 3-way replication • Late/dead machines: ignored, transparently (!) C. Faloutsos
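The map, shuffle, and reduce phases can be imitated in a few lines on one machine. The sketch below is an in-memory stand-in, not Hadoop code: `map_reduce`, `degree_mapper`, and `sum_reducer` are hypothetical names, and there is no replication, HDFS, or failure handling. It computes node degrees from an edge list, the raw input for a degree-distribution plot.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy single-process map/reduce.

    mapper(record) yields (key, value) pairs;
    reducer(key, values) returns one result per key.
    """
    # Map phase plus shuffle: group every emitted value by its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: one call per key, in sorted key order.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def degree_mapper(edge):
    """Emit a count of 1 for each endpoint of an undirected edge."""
    u, v = edge
    yield u, 1
    yield v, 1


def sum_reducer(node, ones):
    """Add up the per-edge counts to get the node's degree."""
    return sum(ones)


edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
degrees = map_reduce(edges, degree_mapper, sum_reducer)
```

Because each mapper call sees one record and each reducer call sees one key, the same two functions could in principle be spread over many machines, which is the point of the framework.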
HADI for diameter estimation • Radius Plots for Mining Tera-byte Scale Graphs U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10 • Naively: diameter needs O(N**2) space and up to O(N**3) time – prohibitive (N~1B) C. Faloutsos
HADI for diameter estimation • Radius Plots for Mining Tera-byte Scale Graphs U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10 • Naively: diameter needs O(N**2) space and up to O(N**3) time – prohibitive (N~1B) • Our HADI: linear on E (~10B) • Near-linear scalability wrt # machines • Several optimizations -> 5x faster C. Faloutsos
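One way to avoid the all-pairs computation is to expand every node's neighborhood one hop per pass over the edges, and to record the hop at which each neighborhood stops growing. The sketch below illustrates that idea on a single machine with exact Python sets. As far as I understand, the published HADI replaces the sets with compact probabilistic counters and runs each pass as a map/reduce job; the exact sets and the name `radii_by_expansion` are simplifying assumptions here.

```python
from collections import defaultdict


def radii_by_expansion(edges):
    """Radius (eccentricity) of every node via hop-by-hop set expansion.

    Each pass touches every edge once, so one pass costs O(E) set unions.
    Exact sets are used for clarity; they do not scale to billions of nodes.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    reach = {u: {u} for u in adj}   # nodes reachable within `hop` steps
    radius = {u: 0 for u in adj}    # last hop at which reach[u] grew
    hop = 0
    changed = True
    while changed:
        changed = False
        hop += 1
        new_reach = {}
        for u in adj:
            merged = set(reach[u])
            for v in adj[u]:
                merged |= reach[v]  # absorb what each neighbor reached so far
            if len(merged) > len(reach[u]):
                radius[u] = hop
                changed = True
            new_reach[u] = merged
        reach = new_reach
    return radius


# Toy example: the path 0-1-2-3.
radii = radii_by_expansion([(0, 1), (1, 2), (2, 3)])
diameter = max(radii.values())
```

A histogram of the returned radii is the kind of count-vs-radius plot shown on the following slides, and the largest radius is the diameter.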
[Plot: count vs. radius; annotations: ????, 19+ [Barabasi+] ~1999, ~1M nodes] C. Faloutsos
[Plot: count vs. radius; annotations: ????, ??, 19+ [Barabasi+] ~1999, ~1M nodes] • YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges) • Largest publicly available graph ever studied. C. Faloutsos
[Plot: count vs. radius; annotations: ????, 14 (dir.), ~7 (undir.), 19+? [Barabasi+]] • YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges) • Largest publicly available graph ever studied. C. Faloutsos
[Plot: count vs. radius; annotations: ????, 14 (dir.), ~7 (undir.), 19+? [Barabasi+]] • YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges) • 7 degrees of separation (!) • Diameter: shrunk C. Faloutsos
[Plot: count vs. radius; annotations: ????, ~7 (undir.)] • YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges) • Q: Shape? C. Faloutsos
YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) • effective diameter: surprisingly small. • Multi-modality (?!) C. Faloutsos
Radius Plot of GCC of YahooWeb. C. Faloutsos
YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges) • effective diameter: surprisingly small. • Multi-modality: probably a mixture of cores. C. Faloutsos