Topic 13 Network Models
Credits: C. Faloutsos and J. Leskovec tutorial; E. Kolaczyk notes
Data Mining - Volinsky - 2011 - Columbia University
Social Networks
• Network: a collection of inter-connected things
  • Analysis of network data is also called “graph mining”
  • Data consisting of nodes and edges
  • Note: different from “graphical models” (graphical representation of dependence among random variables)
• Edges represent:
  • Relationship between nodes
  • Behavior observed between nodes
  • High similarity between nodes
• Edges are typically weighted
• Nodes and edges can both have attributes
• Can be directed or undirected
  • Directed: phone calls, emails
  • Undirected: collaboration, physical networks, friendship
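
The node/edge vocabulary above can be made concrete with a small adjacency map. This is a minimal sketch, not something from the slides: the call records and the helper names `build_directed` and `to_undirected` are hypothetical, and summing weights is only one of several reasonable ways to collapse direction.

```python
from collections import defaultdict

# Hypothetical call records: (caller, callee, number_of_calls).
calls = [("ann", "bob", 3), ("bob", "ann", 1), ("ann", "cat", 2)]

def build_directed(edges):
    """Weighted directed graph as an adjacency map: adj[u][v] = weight."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = adj[u].get(v, 0) + w
        adj.setdefault(v, {})  # nodes with no outgoing edges still appear
    return dict(adj)

def to_undirected(adj):
    """Drop direction by summing the weights seen in either direction."""
    und = {u: {} for u in adj}
    for u, nbrs in adj.items():
        for v, w in nbrs.items():
            und[u][v] = und[u].get(v, 0) + w
            und[v][u] = und[v].get(u, 0) + w
    return und
```

Node or edge attributes could be stored the same way, for example as a separate map keyed by node or by `(u, v)` pair.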
Examples
Networks are everywhere!
Layout
• Layout matters!
  • Especially with directed graphs
Facebook “Friend Wheel”
LinkedIn
• LinkedIn community map, from LinkedIn Labs
Networks: A Matter of Scale
Measurements on networks: Nodes and Edges
• Node degree (node)
  • The number of edges coming in and out of a node is its degree
  • If directed, in-degree and out-degree can differ
• Centrality (node)
  • How “central” is a given node?
  • Degree centrality: how many edges does it have?
  • Betweenness centrality: how many shortest paths does it appear on?
  • Centrality = importance
• Centrality (edge)
  • How central is an edge?
  • Similar “shortest path” definition
  • Does removing it create more clusters?
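
The two node measurements above can be sketched as follows. This is an illustrative sketch that assumes an unweighted, undirected graph stored as `{node: set(neighbors)}`; the shortest-path count uses Brandes' algorithm, which the slides do not name, and the function names are made up for this example.

```python
from collections import deque

def degrees(adj):
    """Degree of every node in an undirected graph {node: set(neighbors)}."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def betweenness(adj):
    """Node betweenness (unweighted, undirected) via Brandes' algorithm."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was counted once from each endpoint.
    return {v: b / 2 for v, b in bc.items()}
```

On a three-node path a-b-c, only the middle node lies on a shortest path between two other nodes, so it is the only node with non-zero betweenness.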
Measurements on networks (graph)
• Degree distribution
  • The distribution of all node degrees characterizes the graph
  • Normal or highly skewed?
• Density (graph)
  • How “dense” is the graph?
  • Given n nodes, how many possible edges? (n(n-1)/2 if undirected)
  • Density = #Edges / #Possible edges
• Clustering coefficient (graph)
  • How likely is it that your friends are friends with each other?
  • Count: how many triangles
• Diameter (graph)
  • The largest shortest path
• Shortest paths (graph)
  • Histogram of shortest-path lengths
• Connectivity (graph)
  • Is the graph connected?
  • Connected components
  • For directed graphs: strongly connected components
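
A rough sketch of these graph-level measurements, assuming an undirected, unweighted graph stored as `{node: set(neighbors)}` with at least two nodes. The clustering coefficient here is the global version (closed triples over all connected triples), which is one of several definitions in use; all function names are illustrative.

```python
from collections import deque
from itertools import combinations

def density(adj):
    """#Edges divided by the n(n-1)/2 possible undirected edges."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return m / (n * (n - 1) / 2)

def transitivity(adj):
    """Fraction of connected triples (two friends of one node) that are closed."""
    closed = triples = 0
    for nbrs in adj.values():
        for a, b in combinations(nbrs, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0

def bfs_distances(adj, source):
    """Shortest-path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for v in adj:
        if v not in seen:
            comp = set(bfs_distances(adj, v))
            seen |= comp
            comps.append(comp)
    return comps

def diameter(adj):
    """Largest shortest path; only meaningful if the graph is connected."""
    return max(max(bfs_distances(adj, v).values()) for v in adj)
```

The all-pairs BFS in `diameter` costs on the order of nodes times edges, so for massive graphs it would normally be replaced by sampling or approximation.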
Models on networks
• Random (Erdos-Renyi)
  • Each possible edge occurs independently with probability p
  • Degree distribution is binomial, approximately Poisson for large, sparse graphs
• Exponential random graph (p*) models
  • Statistical model: an extension of Erdos-Renyi
  • Defines a probability distribution over graphs in terms of graph properties
• Preferential attachment
  • Generative model: each new node creates m links (m based on a Poisson distribution)
  • Links attach to existing nodes with probability proportional to the degree of that node
  • “Rich get richer”
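
The two generative models above can be sketched in a few lines. This is a simplified illustration, not the slides' exact procedure: `preferential_attachment` uses a fixed m for every new node (rather than a Poisson-distributed m) and starts from a small clique, both of which are assumptions made to keep the example short.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, rng):
    """G(n, p): each of the n(n-1)/2 possible edges appears with probability p."""
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def preferential_attachment(n, m, rng):
    """Grow a graph node by node; new nodes prefer high-degree targets."""
    # Assumption: start from a clique on m + 1 nodes.
    adj = {v: set(range(m + 1)) - {v} for v in range(m + 1)}
    # Each node is listed once per incident edge, so a uniform draw from
    # this list is a draw proportional to degree.
    endpoints = [v for v in adj for _ in adj[v]]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        adj[new] = set(targets)
        for t in targets:
            adj[t].add(new)
            endpoints += [t, new]
    return adj
```

Comparing the maximum degree of the two graphs at the same size and edge count is a quick way to see the "rich get richer" effect.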
Real-world networks
• Degree distributions in real-world networks are heavily skewed to the right
  • Preferential attachment reproduces this skew
  • Long tail of values above the mean
  • Large mean, small median, small diameter
• Leads to a “power law”
  • Let k = degree and p_k = the number of nodes that have that degree
  • Under a power law, p_k is proportional to k^(-γ) for some exponent γ > 0, so a plot of log k vs. log p_k should be linear
• Many real-world data sets follow a power law:
  • Online sales
  • Word frequency distributions
  • Number of friends on Facebook!
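
The log-log check described above can be sketched as a straight-line fit. This is a rough diagnostic under stated assumptions: the function names are illustrative, natural logs are used, and an ordinary least-squares slope on a log-log plot is only a quick visual-style check, not a rigorous estimate of the exponent.

```python
import math
from collections import Counter

def degree_counts(degrees):
    """p_k as defined above: the number of nodes with each degree k > 0."""
    return Counter(k for k in degrees if k > 0)

def loglog_slope(counts):
    """Least-squares slope of log(p_k) against log(k)."""
    xs = [math.log(k) for k in counts]
    ys = [math.log(c) for c in counts.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx
```

If the counts follow a power law exactly, the slope equals the negative of the exponent; on real data the tail is noisy, so the fit is usually restricted to a range of k or replaced by a likelihood-based method.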
More Power Law
Erdos-Renyi vs. Power-law
• From Leskovec & Faloutsos
Small World
• Real-world data sets tend to have power-law degree distributions
• They also tend to have a “small world” property
  • Everyone is reachable via a small number of edges
  • Small diameters
• Stanley Milgram experiment, 1967
  • People were given a letter and asked to forward it to one friend
  • Source: random residents of Omaha
  • Target: a stockbroker in Boston
  • Completed chains averaged about 6 hops
  • Hence, “six degrees of separation”
Small World Networks
• Watts and Strogatz [1998] introduced the small-world network model
• Navigable Social Networks [Kleinberg 2000]
  • Showed how small-world networks can be constructed:
    • Put n people on a k-dimensional grid
    • Connect each person to their immediate neighbors
    • Add one long-range link per person
  • Everyone will be connected via a short path
  • This is the way the real world works!
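
The grid construction above can be sketched for the two-dimensional case. Several details here are assumptions rather than anything stated on the slide: the long-range target is drawn with probability proportional to (grid distance)^(-r), all links are treated as undirected, and the names `kleinberg_grid` and `hops_from` are made up for this example.

```python
import random
from collections import deque

def kleinberg_grid(side, r, rng):
    """side x side grid with local links plus one long-range link per node."""
    nodes = [(x, y) for x in range(side) for y in range(side)]
    adj = {v: set() for v in nodes}
    # Local links: connect each node to its immediate grid neighbors.
    for x, y in nodes:
        for dx, dy in ((1, 0), (0, 1)):
            w = (x + dx, y + dy)
            if w in adj:
                adj[(x, y)].add(w)
                adj[w].add((x, y))
    # Long-range links: one per node, favoring closer targets when r > 0.
    for v in nodes:
        others = [w for w in nodes if w != v]
        weights = [(abs(v[0] - w[0]) + abs(v[1] - w[1])) ** -r for w in others]
        target = rng.choices(others, weights=weights, k=1)[0]
        adj[v].add(target)
        adj[target].add(v)
    return adj

def hops_from(adj, source):
    """Number of hops from source to every reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist
```

Because the long-range links only add shortcuts, no pair of nodes can end up farther apart than on the plain grid; comparing `hops_from` results with and without the shortcuts shows how much they shrink typical distances.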
Small World Networks
• Another look
Sampling Networks
• How do you sample from a massive network?
• Simplest method: induced subgraph
  • Randomly sampled nodes and the edges between them
  • Not so great!
• Figure: the yellow nodes are randomly sampled but don’t have the same graph properties!
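
A minimal sketch of induced-subgraph sampling, assuming an undirected graph stored as `{node: set(neighbors)}`; the function name is illustrative.

```python
import random

def induced_subgraph_sample(adj, k, rng):
    """Sample k nodes uniformly and keep only the edges among them."""
    keep = set(rng.sample(list(adj), k))
    return {v: adj[v] & keep for v in keep}
```

On a star graph the weakness is easy to see: unless the hub happens to be sampled, the sample contains no edges at all, so its degree distribution looks nothing like the original.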
Sampling Networks
• Snowball sampling:
  • Pick a random sample of nodes and then follow their “tree” for a set number of “hops”
• Still not perfect, but better
• Other ideas abound, but there is little agreement
• Great area for research!
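
Snowball sampling as described above might look like the following sketch, again assuming an undirected graph stored as `{node: set(neighbors)}`. The parameter names `n_seeds` and `hops` are illustrative.

```python
import random

def snowball_sample(adj, n_seeds, hops, rng):
    """Start from random seeds and add everything within `hops` steps."""
    seeds = rng.sample(list(adj), n_seeds)
    reached = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        # Neighbors of the current frontier that have not been seen yet.
        frontier = {w for v in frontier for w in adj[v]} - reached
        reached |= frontier
    return {v: adj[v] & reached for v in reached}
```

The sample keeps local neighborhoods intact, which helps with degree and clustering, but it is biased toward whatever region the seeds land in.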
Network Problems of Interest
• Link prediction:
  • Can we use existing network data to infer links where they don’t yet exist?
    • Links in the future?
    • Missing data
  • Simple methods
    • Look for many common neighbors
  • Complex methods
    • Stochastic blockmodels
    • Similar to using SVD to “fill in” a matrix
    • Agarwal and Pregibon ’04
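
The "many common neighbors" heuristic can be sketched as a simple ranking of unconnected pairs. This assumes an undirected graph stored as `{node: set(neighbors)}` with sortable node labels; the function name is illustrative.

```python
from itertools import combinations

def common_neighbor_scores(adj):
    """Rank unconnected node pairs by how many neighbors they share."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            shared = len(adj[u] & adj[v])
            if shared:
                scores[(u, v)] = shared
    # Highest score first; ties broken by node labels for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Checking every pair is quadratic in the number of nodes, so on a large network candidates would normally be limited to pairs two hops apart.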
Network Problems of Interest
• Graph matching / similarity
  • Fraud (“repetitive debtors”)
  • Citation de-noising
  • Need a metric to define the difference between graphs
• Collective inference
  • What can you learn about someone from their network?
  • Fraud (“guilt by association”)
  • Viral marketing
  • The following example is courtesy of Sofus Macskassy
A Relational Neighbor Classifier (wvRN)
Collective wvRN
• Classify all entities in the network simultaneously, because (if done well) inferences about neighbors can reduce statistical bias (cf. Jensen et al., KDD-04)
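
One way to picture the collective step is as repeated neighbor voting. The sketch below is a simplified illustration and not the exact procedure from the slides: it assumes two classes, an edge-weighted graph stored as `{node: {neighbor: weight}}`, known labels that stay fixed, and plain synchronous updates for a fixed number of rounds. The function name and parameters are hypothetical.

```python
def wvrn(adj, known, iterations=50):
    """Estimate P(class 1) for every node by weighted neighbor voting.

    adj:   {node: {neighbor: weight}}
    known: {node: 0.0 or 1.0} for nodes whose class is already known
    """
    # Unknown nodes start at the class prior among the known labels.
    prior = sum(known.values()) / len(known)
    prob = {v: known.get(v, prior) for v in adj}
    for _ in range(iterations):
        new = {}
        for v in adj:
            if v in known:
                new[v] = known[v]  # known labels are never overwritten
                continue
            total = sum(adj[v].values())
            # Weighted average of the neighbors' current estimates.
            new[v] = (sum(w * prob[u] for u, w in adj[v].items()) / total
                      if total else prior)
        prob = new
    return prob
```

For a chain with a known positive at one end and a known negative at the other, the estimates settle so that nodes closer to the positive end get higher scores.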
Network Problems of Interest
• Diffusion
  • Information or virus diffusion
• Community detection
  • Subgroups have a higher density within the subgroup
  • Can remove edges with high centrality to try to find communities
• Understanding of social networks
  • Facebook
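
The idea of removing high-centrality edges to expose communities can be sketched as below. This follows the general shape of the Girvan-Newman approach, which the slides do not name, so treat that label and the function names as assumptions. It expects an undirected, unweighted graph stored as `{node: set(neighbors)}`.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path betweenness of every edge (Brandes-style accumulation)."""
    eb = {}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                edge = frozenset((v, w))
                eb[edge] = eb.get(edge, 0.0) + share
                delta[v] += share
    return eb

def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for v in adj:
        if v in seen:
            continue
        comp, stack = {v}, [v]
        while stack:
            for w in adj[stack.pop()]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps

def split_once(adj):
    """Remove the most central edge repeatedly until a new component appears."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        eb = edge_betweenness(adj)
        if not eb:
            break  # no edges left to remove
        u, v = max(eb, key=eb.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)
```

Betweenness is recomputed after every removal, which is what makes this approach expensive on large graphs; calling `split_once` on each resulting component gives a hierarchy of communities.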
References
• Leskovec and Faloutsos tutorial (mostly part 1)
• Eric Kolaczyk, notes and book
• Watts and Strogatz, “Collective dynamics of ‘small-world’ networks,” Nature 393, pp. 440-442
• M. E. J. Newman, Networks (book)
• Albert-László Barabási, Linked: How Everything Is Connected to Everything Else and What It Means
• Enron data
• Tools
  • Graphviz.org for visualization
  • igraph (R package)