EDA with Graphs

EDA with Graphs Chris Volinsky Shannon Laboratory AT&T Labs-Research Workshop on Statistical Inference, Computing and Visualization for Graphs Stanford University August 2, 2003

Introduction • Some suggestions about looking at graphs • Our way of analyzing graphs: COI • Two motivating examples • Challenges for the room • Main point – sometimes EDA is all you need!

Preaching to the choir… • Visualize, even when you can’t • Speech example • Learn a little graph theory, even if you don’t want to • Expand your toolbox with: • bridges • cutpoints • centroids • pseudo cliques • strongly connected components • Etc. • Look at node and edge variables, even if they are not there • Variables induced by the graph itself are often useful (in-out degree, centrality, boundary)
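As a small illustration of variables induced by the graph itself, here is a minimal Python sketch that derives in-degree and out-degree for every node from a directed edge list. The function name `degree_variables` and the toy edges are hypothetical, chosen only for this example.

```python
from collections import Counter

def degree_variables(edges):
    """Return {node: (in_degree, out_degree)} for directed (src, dst) edges."""
    in_deg = Counter(dst for _, dst in edges)
    out_deg = Counter(src for src, _ in edges)
    nodes = set(in_deg) | set(out_deg)
    # Counter returns 0 for missing keys, so nodes with no inbound
    # (or no outbound) edges still get a value.
    return {n: (in_deg[n], out_deg[n]) for n in nodes}

# Toy directed call graph (made-up data).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "a")]
degrees = degree_variables(edges)
```

The resulting pairs can be attached to nodes like any other covariate, even when the raw data carry no node attributes at all.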

Our data • Huge! Hundreds of millions of nodes and edges, mostly connected • Modelling, or even EDA, on the entire graph may not be possible • COI – Communities of Interest are one way of analyzing these data • Storage - Break it down • Analysis – Build up from signatures • Updating - Through time via exponential smoothing

Storage - Break it down • Consider the atomic units of the graph, which we call a COI signature: • For every node in the graph, store • Top k numbers inbound • Top k numbers outbound • Weights on each edge • overflow bin • In short, we are storing a huge graph as many little graphs, which are easily accessible (via indexed storage) for analysis.
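The signature described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(src, dst, weight)`, the function name `build_signatures`, and the dictionary layout of a signature are all hypothetical, and the production system stores signatures differently.

```python
from collections import defaultdict

def build_signatures(calls, k=2):
    """Build per-node signatures from (src, dst, weight) records.

    Each signature keeps the top-k outbound and top-k inbound neighbours
    by total weight; the weight of everything else goes to an overflow bin.
    """
    out_w = defaultdict(lambda: defaultdict(float))
    in_w = defaultdict(lambda: defaultdict(float))
    for src, dst, w in calls:
        out_w[src][dst] += w
        in_w[dst][src] += w

    def top_k(weights):
        # Sort by descending weight, breaking ties by neighbour name.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        return {"top": dict(ranked[:k]),
                "overflow": sum(w for _, w in ranked[k:])}

    nodes = set(out_w) | set(in_w)
    return {n: {"out": top_k(out_w[n]), "in": top_k(in_w[n])} for n in nodes}

# Made-up call records: (caller, callee, weight).
calls = [("a", "b", 5.0), ("a", "c", 3.0), ("a", "d", 1.0),
         ("b", "a", 2.0), ("a", "b", 1.0)]
sigs = build_signatures(calls, k=2)
```

Each value in `sigs` is one "little graph": a node, its strongest neighbours in each direction, and a single number summarizing the rest.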

Analysis – Build up from signatures • Fraud – we build signatures • When, how long, but not to whom • We use the COI signature to build a Community of Interest for everyone, and then use that for analysis • Example • Communities are everywhere (e.g. Amazon), but representing (and visualizing) as a graph gives a lot of insight.

Updating through time • Our graph is dynamic • 3M new/old numbers per week! • We use an exponentially weighted moving average as a way to smoothly update through time…
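A minimal sketch of such an update, assuming the standard exponentially weighted form in which the new weight is `theta * old + (1 - theta) * new`. The parameter name `theta`, its value, and the function name `ewma_update` are assumptions made for illustration; the slide does not give the exact formula.

```python
def ewma_update(old, new, theta=0.85):
    """Blend stored edge weights with this period's activity.

    old, new: dicts mapping neighbour -> weight.
    Returns theta*old + (1 - theta)*new over the union of neighbours,
    so new numbers enter with a small weight and silent ones decay.
    """
    keys = set(old) | set(new)
    return {k: theta * old.get(k, 0.0) + (1 - theta) * new.get(k, 0.0)
            for k in keys}

# Made-up weights: "c" was not called this week, "d" is a new number.
sig = {"b": 10.0, "c": 4.0}
this_week = {"b": 2.0, "d": 6.0}
sig = ewma_update(sig, this_week, theta=0.5)
```

Combined with a top-k cut, a number that stops appearing eventually decays out of the signature, while a newly active one works its way in.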

Two motivating examples • Two examples where looking at local network behavior via COI helped answer the questions of interest, without modeling • Viral Marketing • Fraud

Viral Marketing plans • Viral Marketing – let your customers sell for you • COI was the perfect tool to throw at this…by capturing the local neighborhood of the enrollees, we can test the viral hypothesis • We can also track through time • What did we do? • For the enrollees, find the induced subgraph from their COI • Look at a control group
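The "induced subgraph" step can be sketched as follows: keep only edges whose endpoints are both enrollees, then read off connected components as candidate clusters. The function name `induced_components`, the neighbour-set representation of a COI, and the undirected treatment of edges are assumptions for this example.

```python
def induced_components(enrollees, coi):
    """Connected components of the subgraph induced on `enrollees`.

    coi: dict mapping node -> set of neighbours (treated as undirected).
    Only edges with both endpoints in `enrollees` are kept.
    """
    enrollees = set(enrollees)
    adj = {n: set() for n in enrollees}
    for n in enrollees:
        for m in coi.get(n, ()):
            if m in enrollees and m != n:
                adj[n].add(m)
                adj[m].add(n)

    seen, components = set(), []
    for start in sorted(enrollees):
        if start in seen:
            continue
        stack, comp = [start], set()
        seen.add(start)
        while stack:  # iterative depth-first search
            node = stack.pop()
            comp.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(comp)
    return sorted(components, key=len, reverse=True)

# Made-up COI data; "x" and "y" are not enrollees.
coi = {"a": {"b", "x"}, "b": {"c"}, "d": {"y"}, "e": {"d"}}
comps = induced_components(["a", "b", "c", "d", "e", "f"], coi)
```

Running the same function on a control group of the same size gives a baseline for how much clustering to expect if enrollment did not spread through the network.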

Cluster results… Let’s look at some…

What’s up with the big cluster?

RDD: Repetitive Debtors Database • Lots of people can’t pay their bill, but they want phone service anyway:

RDD Process [Figure labels: Connect pool (30 Days), T restricts] • A big matching problem… • Every day • we get restricted TNs, 4K / day • we get connected TNs, 40K / day • Look over a 30 day period (possible 4B comparisons!) • Compare the COI graphs of the disconnected number and the new number… • We need a metric for graph distance

Matching Strategy [Figure labels: TN-1 to TN-5, with Connect and Restrict events] • We use a combination of: • Intersection > 2 (to pare down) • Name/address overlap (to weed out) • $$ owed (to prioritize) • Here’s where modeling could help…or maybe not
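The first filter (intersection > 2) can be sketched as a set-overlap count between the COI of each restricted number and the COI of each newly connected number. Everything named here is hypothetical: `candidate_matches`, the set representation of a COI, and in particular the inverted index, which is one way to avoid comparing all pairs and is not described in the slides. The name/address and amount-owed steps are not shown.

```python
def candidate_matches(restricted, connected, min_overlap=3):
    """Pair restricted and connected numbers whose COIs share many members.

    restricted, connected: dicts mapping a number -> set of COI members.
    Keeps pairs sharing at least `min_overlap` members (i.e. intersection
    > 2 by default), sorted with the largest overlap first.
    """
    # Inverted index: COI member -> connected numbers whose COI contains it.
    index = {}
    for tn, coi in connected.items():
        for member in coi:
            index.setdefault(member, set()).add(tn)

    matches = []
    for old_tn, coi in restricted.items():
        counts = {}
        for member in coi:
            for new_tn in index.get(member, ()):
                counts[new_tn] = counts.get(new_tn, 0) + 1
        for new_tn, shared in counts.items():
            if shared >= min_overlap:
                matches.append((old_tn, new_tn, shared))
    return sorted(matches, key=lambda m: (-m[2], m[0], m[1]))

# Made-up COIs for one restricted and two newly connected numbers.
restricted = {"TN-2": {"p", "q", "r", "s"}}
connected = {"TN-1": {"p", "q", "r"}, "TN-4": {"p", "z"}}
matches = candidate_matches(restricted, connected)
```

Only pairs that share at least one COI member are ever touched, which matters when the naive pairing runs into the billions.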

Wrap up • Viral Marketing • Used connected components of reduced data as ‘clusters’ • Looked for ‘centers’ of clusters for retention • Visualized clusters for understanding • Used boundary to predict new customers • COI was the best predictive variable in a marketing study • Fraud • Attacked massive matching via simple measures of distance • Fraud reps use visualized clusters to work cases • We detected RDD with an 80% success rate • Is this EDA?

Challenges • Viewing graphs through time • What if I don’t know what is coming next? • Graph distance metrics • What does “distance between graphs” mean? • Tools for looking at many graphs • what do union and intersection mean? • Modelling and EDA go hand in hand • Viral marketing models define network value, feed this into graph to do EDA….

An answer for Duncan… • What do I want and who is going to do it? • Tools that combine: • Interactive capability • Graph operations • Statistical analysis • It’s happening • It’s great!! • It’s a little confusing • This model works for me… do you agree?

What I want… • Powerful ways to do union/intersection • It is actually unclear what that means • Statistical measures of distance between graphs: what is the metric of interest, really? • Use variables on nodes and edges to easily define new graphs, and automatically point me towards the interesting ones (largest, densest) • Standard tools for finding graph-theoretic concepts like cliques, pseudo cliques, density, bridge edges, boundary • Ability to visualize the temporal component of graphs: is there a paradigm other than plotting the ubergraph?

Points to make • If each TN is a graph, and we are looking for similar graphs, we could be doing millions or billions of these comparisons… SNA stuff is great, but it doesn’t really work at this scale! • Sometimes EDA is the answer: it is the best we can do, or perhaps it is sufficient for the user. • Think graphs, and plot them! Even if you can’t plot the whole thing, plot some of it (do speech example…) • “Network value” might be important. This might not be the same as density: it may be a sunburst, which is not a high-density subgraph, or the highest value; it may depend on time • Modelling can be great: find pseudo edges, use latent space models, etc…

Visualize, even when you can’t • There is always a way to subset or threshold, or something • Speech example • Learn some graph theory • Bridge nodes/edges • Density, definitions of cliques and pseudo cliques • DFS/BFS, minimum spanning trees… • Strongly connected components • Subset
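As one example from this toolbox, bridge edges (edges whose removal disconnects a component) can be found with a single depth-first search using the textbook low-link method. The sketch below is a generic implementation with a hypothetical function name `find_bridges`. It is recursive, so it suits small subgraphs such as a single COI rather than the full graph.

```python
def find_bridges(adj):
    """Return the bridge edges of an undirected simple graph.

    adj: dict mapping node -> set of neighbours (must be symmetric).
    """
    order, low, bridges = {}, {}, []

    def dfs(node, parent):
        # Discovery index; low[] tracks the earliest index reachable
        # from the subtree via at most one back edge.
        order[node] = low[node] = len(order)
        for nxt in adj[node]:
            if nxt == parent:
                continue
            if nxt in order:
                low[node] = min(low[node], order[nxt])
            else:
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                if low[nxt] > order[node]:
                    # Nothing below nxt reaches node or above: a bridge.
                    bridges.append(tuple(sorted((node, nxt))))

    for node in adj:
        if node not in order:
            dfs(node, None)
    return sorted(bridges)

# Made-up graph: a triangle a-b-c with a tail c-d-e.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e"}, "e": {"d"}}
bridges = find_bridges(adj)
```

In the toy graph the triangle edges survive any single removal, while the two tail edges are the only connections to `d` and `e`.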

Storing COI Signatures • COI sigs are stored in Hancock, a C-based domain-specific language designed for large amounts of signature-type data (Rogers, Fisher, et al.) • Indexed by TN, so it is easy and fast to get the COI for large lists of TNs, and to use spiders for recursion. • E.g. cycling over all TNs to learn something about our customer base takes minutes. We could never do this before!

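Informative overlap score [Figure: nodes A, B, O and Z, with edges labeled $w_{ao}$, $w_{ob}$ and $w_o$] • Calculate the “informative overlap” score [equation not preserved in this transcript], where: $w_{ao}$ = weight of edge from a to o; $w_{ob}$ = weight of edge from o to b; $w_o$ = sum of the weights of edges to o; $d_{ao}$, $d_{ob}$ are the graph distances from a and b to o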

Selecting θ • Calls fade out over time; the larger θ is, the longer a call has non-negligible weight
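To make the fade-out concrete, assume the usual exponentially weighted update, in which the stored weight is multiplied by the smoothing parameter (written $\theta$ here, with $0 < \theta < 1$) at each step and new activity enters with factor $1 - \theta$. This form is an assumption; the slide does not spell it out. A single call of weight $w$ then has weight $\theta^{k}(1-\theta)w$ after $k$ further updates with no repeat call, and it stays above a cutoff $\varepsilon$ as long as:

```latex
\theta^{k}(1-\theta)\,w \;\ge\; \varepsilon
\quad\Longleftrightarrow\quad
k \;\le\; \frac{\ln\!\big(\varepsilon / ((1-\theta)\,w)\big)}{\ln\theta}
```

The inequality flips because $\ln\theta < 0$. For $w = 1$ and $\varepsilon = 0.01$, this gives roughly 21 updates at $\theta = 0.9$ but only about 5 at $\theta = 0.5$.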
