Topic 13 Network Models

Topic 13: Network Models
Credits: C. Faloutsos and J. Leskovec (tutorial); E. Kolaczyk (notes)
Data Mining - Volinsky - 2011 - Columbia University

Social Networks
• Network: a collection of interconnected things
• Analyzing such data is also called “graph mining”
• Data consist of nodes and edges
• Note: different from “graphical models” (graphical representations of dependence among random variables)
• Edges represent:
  • A relationship between nodes
  • Behavior observed between nodes
  • High similarity between nodes
• Edges are typically weighted
• Nodes and edges can both have attributes
• Edges can be directed or undirected
  • Directed: phone calls, emails
  • Undirected: collaboration, physical networks, friendship

Examples

Networks are everywhere!

Layout
• Layout matters!
• Especially with directed graphs

Facebook “Friend Wheel”

LinkedIn
• LinkedIn community from LinkedIn Labs

Networks: A Matter of Scale

Measurements on networks: Nodes and Edges
• Node degree (node)
  • The number of edges attached to a node is its degree
  • If directed, in-degree (edges coming in) and out-degree (edges going out) are counted separately
• Betweenness centrality (node)
  • How ‘central’ is a given node?
  • How many shortest paths between other nodes pass through it?
  • Centrality = importance
• Centrality (edge)
  • How central is an edge?
  • Similar ‘shortest path’ definition
  • Does removing it create more clusters?
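The node measurements above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function names, the call records, and the star graph are all hypothetical, and the shortest-path centrality is computed with Brandes' algorithm for unweighted, undirected graphs.

```python
from collections import Counter, deque

def degrees(edges):
    """Return (in_degree, out_degree) Counters for a directed edge list."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return in_deg, out_deg

def betweenness(adj):
    """Shortest-path (betweenness) centrality via Brandes' algorithm.

    adj maps each node to the set of its neighbors (undirected, unweighted).
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        stack, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate from the farthest nodes back
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in score.items()}

# Hypothetical directed call records: (caller, callee)
calls = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "ann")]
in_deg, out_deg = degrees(calls)

# Hypothetical undirected star: "hub" linked to three leaves
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
centrality = betweenness(star)
```

In the star, every shortest path between two leaves passes through the hub, so the hub carries all of the centrality and the leaves carry none.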

Measurements on networks (graph)
• Degree distribution
  • The distribution of all node degrees characterizes the graph
  • Normal or highly skewed?
• Density (graph)
  • How “dense” is the graph?
  • Given n nodes, how many possible edges? n(n-1)/2 if undirected
  • Density = #edges / #possible edges
• Clustering coefficient (graph)
  • How likely is it that your friends are friends with each other?
  • Count: how many triangles
• Diameter (graph)
  • Longest shortest path between any two nodes
• Shortest paths (graph)
  • Histogram of shortest-path lengths
• Connectivity (graph)
  • Fully connected?
  • Connected components
  • For directed graphs: strongly connected components
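As a rough sketch of the graph-level measurements above, the Python below computes density, a global clustering coefficient (the fraction of connected triples that close into triangles), the diameter, and the connected components of a small undirected graph. The function names and the toy graph are assumptions made for illustration. The diameter here is taken within components, since disconnected nodes have no finite shortest path.

```python
from collections import deque
from itertools import combinations

def density(adj):
    """Edges present divided by the n(n-1)/2 possible undirected edges."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return m / (n * (n - 1) / 2)

def global_clustering(adj):
    """Fraction of 'friend pairs' of a node that are themselves friends."""
    closed = triples = 0
    for nbrs in adj.values():
        for a, b in combinations(nbrs, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0

def bfs_distances(adj, source):
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def connected_components(adj):
    seen, components = set(), []
    for v in adj:
        if v not in seen:
            comp = set(bfs_distances(adj, v))
            seen |= comp
            components.append(comp)
    return components

def diameter(adj):
    """Longest shortest path, considering only reachable pairs."""
    return max(max(bfs_distances(adj, v).values()) for v in adj)

# Hypothetical graph: triangle a-b-c, a tail c-d, and a separate pair e-f
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c"}, "e": {"f"}, "f": {"e"},
}
graph_density = density(graph)
graph_clustering = global_clustering(graph)
graph_diameter = diameter(graph)
graph_components = connected_components(graph)
```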

Models on networks
• Random (Erdős-Rényi)
  • Each possible edge occurs independently with probability p
  • Degree distribution is binomial, approximately Poisson for large sparse graphs
• Exponential random graph (p*) models
  • Statistical model: extension of Erdős-Rényi
  • Defines a probability distribution over graphs in terms of graph properties
• Preferential attachment
  • Generative model: each new node creates m links (m drawn from a Poisson distribution)
  • Links attach to existing nodes with probability proportional to each node's degree
  • Rich get richer
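The two generative models can be sketched as follows. This is an illustrative simplification: the function names are hypothetical, and the preferential-attachment sketch uses a fixed m for every new node rather than drawing m from a Poisson distribution as the slide describes.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, rng):
    """Include each of the n(n-1)/2 possible edges independently with prob. p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def preferential_attachment(n, m, rng):
    """Grow a graph where each new node links to m existing nodes,
    chosen with probability proportional to their current degree.

    Assumption: m is fixed here (the slide draws it from a Poisson).
    """
    # Seed graph: a complete graph on m + 1 nodes.
    edges = list(combinations(range(m + 1), 2))
    # Each node appears in 'stubs' once per incident edge, so a uniform
    # draw from 'stubs' is a degree-proportional draw over nodes.
    stubs = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges

rng = random.Random(42)  # fixed seed so the sketch is reproducible
er_edges = erdos_renyi(200, 0.02, rng)
pa_edges = preferential_attachment(200, 2, rng)
```

Tabulating node degrees from `er_edges` and `pa_edges` is one way to see the contrast: the first should cluster near its mean, while the second should show a few very high-degree nodes.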

Real-world networks
• Degree distributions in real-world networks are heavily skewed to the right
  • Preferential attachment reproduces this skew
  • Long tail of values above the mean
  • Mean well above the median; small diameter
• Leads to a “power law”
  • Let k = degree and p_k = the number of nodes that have that degree
  • A plot of log k vs. log p_k should be linear
• Many real-world data sets follow a power law:
  • Online sales
  • Word frequency distributions
  • Number of friends on Facebook!
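The log-log linearity check can be sketched with an ordinary least-squares slope. The helper name and the synthetic counts below are assumptions for illustration; a straight-line fit on log-log axes is only a rough diagnostic, not a rigorous test for a power law.

```python
import math

def loglog_slope(degree_counts):
    """Least-squares slope of log(p_k) against log(k).

    degree_counts maps degree k (> 0) to the count p_k (> 0).
    """
    xs = [math.log(k) for k in degree_counts]
    ys = [math.log(c) for c in degree_counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic counts that follow p_k = 1000 * k^(-2.5) exactly (made-up data)
synthetic = {k: 1000 * k ** -2.5 for k in range(1, 51)}
slope = loglog_slope(synthetic)
```

For data that follow p_k proportional to k^(-γ), the slope recovers -γ; on real degree counts the points scatter, especially in the sparse tail.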

More Power Law

Erdős-Rényi vs. Power Law
• Figure from Leskovec & Faloutsos

Small World
• Real-world data sets tend to have power-law degree distributions
• They also tend to have a “small world” property
  • Everyone is reachable via a small number of edges
  • Small diameters
• Stanley Milgram experiment, 1967
  • People were given a letter and asked to forward it to one friend
  • Source: random residents of Omaha
  • Target: a stockbroker in Boston
  • Completed chains averaged 6 hops
  • Hence, “six degrees of separation”

Small World Networks
• Watts and Strogatz [1998] introduced the small-world network model
• Navigable Social Networks [Kleinberg 2000]
  • Showed how small-world networks can be created
  • Put n people on a k-dimensional grid
  • Connect each person to their immediate neighbors
  • Add one long-range link per person
  • Everyone will be connected via a short path
• This is the way the real world works!!!
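The grid construction can be sketched as below for a 2-dimensional grid. This is a simplified illustration with hypothetical function names: each long-range contact is drawn uniformly at random, whereas Kleinberg's model draws it with a probability that decays with grid distance.

```python
import random
from collections import deque

def grid_graph(side):
    """side x side grid; each person is linked to their immediate neighbors."""
    adj = {(r, c): set() for r in range(side) for c in range(side)}
    for r, c in adj:
        for dr, dc in ((1, 0), (0, 1)):
            nb = (r + dr, c + dc)
            if nb in adj:
                adj[(r, c)].add(nb)
                adj[nb].add((r, c))
    return adj

def add_long_range_links(adj, rng):
    """Give every person one extra link to a uniformly random person.

    Assumption: uniform choice, a simplification of Kleinberg's
    distance-dependent rule.
    """
    nodes = list(adj)
    for v in nodes:
        w = rng.choice(nodes)
        if w != v:
            adj[v].add(w)
            adj[w].add(v)

def eccentricity(adj, source):
    """Largest number of hops needed to reach anyone from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return max(dist.values())

rng = random.Random(0)  # fixed seed for reproducibility
side = 20
plain = grid_graph(side)
small_world = grid_graph(side)
add_long_range_links(small_world, rng)

hops_plain = eccentricity(plain, (0, 0))        # corner to opposite corner
hops_small_world = eccentricity(small_world, (0, 0))
```

On the plain grid the farthest person from a corner is 2 * (side - 1) hops away; the extra links can only shorten that, and in practice they tend to shorten it a great deal.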

Small World Networks
• Another look

Sampling Networks
• How do you sample from a massive network?
• Simplest method: induced subgraph
  • Randomly sample nodes and keep the edges between them
  • Not so great!
• In the figure, the yellow nodes are randomly sampled, but the subgraph they induce does not have the same graph properties as the full network
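Induced-subgraph sampling can be sketched as follows; the function names and the ring network are hypothetical. The example hints at why the method is "not so great": most sampled nodes lose nearly all of their edges.

```python
import random

def induced_subgraph(edges, sampled_nodes):
    """Keep only the edges whose two endpoints were both sampled."""
    keep = set(sampled_nodes)
    return [(u, v) for u, v in edges if u in keep and v in keep]

def sample_induced(edges, k, rng):
    """Sample k nodes uniformly at random and return their induced edges."""
    nodes = sorted({v for edge in edges for v in edge})
    return induced_subgraph(edges, rng.sample(nodes, k))

# Hypothetical network: a ring of 1000 nodes, every node has degree 2
ring = [(i, (i + 1) % 1000) for i in range(1000)]
rng = random.Random(7)
sampled_edges = sample_induced(ring, 100, rng)
```

In the full ring every node has degree 2, but a 10% node sample keeps only the edges whose neighbors happened to be sampled too, so the sampled graph is mostly isolated nodes.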

Sampling Networks
• Snowball sampling:
  • Pick a random sample of nodes, then follow their ‘tree’ of connections for a set number of ‘hops’
• Still not perfect, but better
• Other ideas abound, but there is little agreement
• Great area for research!
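A minimal sketch of snowball sampling, assuming the network is stored as an adjacency dictionary; the function name and the toy path graph are hypothetical.

```python
def snowball_sample(adj, seeds, hops):
    """Start from seed nodes and add all neighbors, one hop at a time."""
    sampled = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        # Nodes one hop beyond the current frontier that are not yet sampled
        frontier = {w for v in frontier for w in adj[v]} - sampled
        sampled |= frontier
    return sampled

# Hypothetical path graph 0 - 1 - 2 - 3 - 4
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
two_hop_sample = snowball_sample(path, seeds=[0], hops=2)
```

In practice the seeds would be drawn at random; they are fixed here so the result is easy to check by hand.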

Network Problems of Interest
• Link prediction:
  • Can we use existing network data to infer links where none are observed?
  • Links in the future?
  • Missing data
• Simple methods
  • Look for pairs with many common neighbors
• Complex methods
  • Stochastic blockmodels
  • Similar to using SVD to ‘fill in’ a matrix
  • Agarwal and Pregibon ’04
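The simple common-neighbors method can be sketched as below. The function name and the toy friendship graph are assumptions for illustration; the score is just the raw count of shared neighbors for each pair that is not yet linked.

```python
from itertools import combinations

def common_neighbor_scores(adj):
    """Rank unlinked pairs by how many neighbors they share."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            shared = len(adj[u] & adj[v])
            if shared:
                scores[(u, v)] = shared
    # Highest score first; ties broken alphabetically for a stable order
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical undirected friendship graph
friends = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}
ranked = common_neighbor_scores(friends)
```

Here "a" and "b" are not linked but share two friends, so they come out at the top of the candidate list.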

Network Problems of Interest
• Graph matching / similarity
  • Fraud (‘repetitive debtors’)
  • Citation de-noising
  • Need a metric to define the difference between graphs
• Collective inference
  • What can you learn about someone from their network?
  • Fraud (‘guilt by association’)
  • Viral marketing
• The following example is courtesy of Sofus Macskassy

A Relational Neighbor Classifier (wvRN)

Collective wvRN
Classify all entities in the network simultaneously, because (if done well) inferences about neighbors can reduce statistical bias (cf. Jensen et al. KDD-04)
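The slides show wvRN only as pictures, so the following is a hedged sketch of the general idea rather than the exact procedure from the example. It assumes the weighted-vote relational neighbor rule estimates a node's class probability as the weighted average of its neighbors' probabilities, and it applies that rule to all unlabeled nodes at once, repeating until the estimates settle. The function name, the fixed iteration count, and the toy fraud network are all hypothetical.

```python
def collective_wvrn(adj, known, iterations=50):
    """Iteratively estimate P(positive class) for every unlabeled node.

    adj:   node -> {neighbor: edge weight}
    known: node -> 1.0 (positive) or 0.0 (negative) for labeled nodes
    Assumption: simple synchronous updates with a fixed iteration count.
    """
    prior = sum(known.values()) / len(known)
    prob = {v: known.get(v, prior) for v in adj}
    for _ in range(iterations):
        updated = dict(prob)
        for v in adj:
            if v in known:
                continue  # labeled nodes keep their labels
            total = sum(adj[v].values())
            if total > 0:
                # Weighted vote of the neighbors' current estimates
                updated[v] = sum(w * prob[u] for u, w in adj[v].items()) / total
        prob = updated
    return prob

# Hypothetical chain: known fraudster - x - y - known legitimate account
network = {
    "fraud": {"x": 1.0},
    "x": {"fraud": 1.0, "y": 1.0},
    "y": {"x": 1.0, "legit": 1.0},
    "legit": {"y": 1.0},
}
labels = {"fraud": 1.0, "legit": 0.0}
estimates = collective_wvrn(network, labels)
```

In this chain the unlabeled node next to the fraudster ends up with the higher fraud score, which is the "guilt by association" effect in miniature.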

Network Problems of Interest
• Diffusion
  • Information or virus diffusion
• Community detection
  • Subgroups have a higher edge density within the subgroup than with the rest of the network
  • Can remove edges with high centrality to try to find communities
• Understanding of social networks
  • Facebook
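Removing high-centrality edges to expose communities can be sketched as below. This illustration uses shortest-path (betweenness) centrality of edges, in the spirit of the Girvan-Newman approach; the function names and the two-triangle graph are hypothetical, and the sketch stops after the first split rather than building a full hierarchy.

```python
from collections import deque

def edge_betweenness(adj):
    """Count, for each edge, the shortest paths that run through it."""
    score = {}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):     # farthest nodes first
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                edge = tuple(sorted((v, w)))
                score[edge] = score.get(edge, 0.0) + share
                delta[v] += share
    return {e: c / 2 for e, c in score.items()}  # pairs were counted twice

def components(adj):
    seen, found = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in comp:
                    comp.add(w)
                    queue.append(w)
        seen |= comp
        found.append(comp)
    return found

def split_once(adj):
    """Remove the most central edge repeatedly until the graph splits."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)
        if not scores:
            break
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Hypothetical graph: two triangles joined by the single bridge c-d
two_triangles = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
communities = split_once(two_triangles)
```

Every shortest path between the two triangles uses the bridge, so it has the highest edge centrality and is removed first, leaving the two dense subgroups.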

References
• Leskovec / Faloutsos tutorial (mostly part 1)
• Eric Kolaczyk, notes and book
• Watts and Strogatz, “Collective dynamics of ‘small-world’ networks”, Nature 393, pp. 440-442
• M. E. J. Newman, Networks (book)
• Albert-László Barabási, Linked: How Everything Is Connected to Everything Else and What It Means
• Enron data
• Tools
  • Graphviz.org for visualization
  • igraph (R package)
