EDA with Graphs. Chris Volinsky Shannon Laboratory AT&T Labs-Research Workshop on Statistical Inference, Computing and Visualization for Graphs Stanford University August 2, 2003. Introduction. Some suggestions about looking at graphs Our way of analyzing graphs: COI
EDA with Graphs Chris Volinsky Shannon Laboratory AT&T Labs-Research Workshop on Statistical Inference, Computing and Visualization for Graphs Stanford University August 2, 2003
Introduction • Some suggestions about looking at graphs • Our way of analyzing graphs: COI • Two motivating examples • Challenges for the room Main point – sometimes EDA is all you need!
Preaching to the choir… • Visualize, even when you can’t • Speech example • Learn a little graph theory, even if you don’t want to • Expand your toolbox with: • bridges • cutpoints • centroids • pseudo cliques • strongly connected components • Etc. • Look at node and edge variables, even if they are not there • Variables induced by the graph itself are often useful (in-out degree, centrality, boundary)
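As a small illustration of "variables induced by the graph itself", in- and out-degree can be computed directly from an edge list. This is only a sketch: the function name `degree_variables` and the toy edge list are made up for illustration, and it assumes each directed edge appears once.

```python
from collections import Counter


def degree_variables(edges):
    """Return {node: {"in": in-degree, "out": out-degree}} for a directed edge list.

    Assumption: `edges` holds (source, destination) pairs with no duplicates.
    """
    out_deg, in_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    nodes = set(out_deg) | set(in_deg)
    # Counter returns 0 for nodes that never appear on one side of an edge.
    return {n: {"in": in_deg[n], "out": out_deg[n]} for n in nodes}


# Hypothetical toy graph: a -> b, a -> c, b -> c
degrees = degree_variables([("a", "b"), ("a", "c"), ("b", "c")])
```

These per-node values can then be treated like any other node variable, even though nothing of the kind was recorded in the raw data.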
Our data • Huge! Hundreds of millions of nodes and edges, mostly connected • Modelling, or even EDA, on the entire graph may not be possible • COI – Communities of Interest are one way of analyzing these data • Storage - Break it down • Analysis – Build up from signatures • Updating - Through time via exponential smoothing
Storage - Break it down • Consider the atomic units of the graph, which we call a COI signature: • For every node in the graph, store • Top k numbers inbound • Top k numbers outbound • Weights on each edge • overflow bin • In short, we are storing a huge graph as many little graphs, which are easily accessible (via indexed storage) for analysis.
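The signature described above can be sketched as a small record per node. This is a minimal illustration, not the production layout: the class name `COISignature`, the helper names, and the choice to store the overflow bin as a single summed weight are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class COISignature:
    """Top-k inbound and outbound neighbours of one node, plus overflow bins."""
    k: int
    inbound: dict = field(default_factory=dict)    # neighbour -> edge weight
    outbound: dict = field(default_factory=dict)   # neighbour -> edge weight
    overflow_in: float = 0.0    # summed weight of inbound edges outside the top k
    overflow_out: float = 0.0   # summed weight of outbound edges outside the top k


def top_k_with_overflow(weights, k):
    """Keep the k heaviest edges and sum the rest into one overflow value."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = dict(ranked[:k])
    overflow = sum(w for _, w in ranked[k:])
    return kept, overflow


def build_signatures(edges, k):
    """Build one signature per node from (source, destination, weight) triples."""
    out_w, in_w = {}, {}
    for src, dst, w in edges:
        out_w.setdefault(src, Counter())[dst] += w
        in_w.setdefault(dst, Counter())[src] += w
    signatures = {}
    for node in set(out_w) | set(in_w):
        inbound, o_in = top_k_with_overflow(in_w.get(node, {}), k)
        outbound, o_out = top_k_with_overflow(out_w.get(node, {}), k)
        signatures[node] = COISignature(k, inbound, outbound, o_in, o_out)
    return signatures


# Hypothetical toy data with k = 2
edges = [("a", "b", 5.0), ("a", "c", 3.0), ("a", "d", 1.0), ("b", "a", 2.0)]
sigs = build_signatures(edges, k=2)
```

Keying the result by node mirrors the idea of many little graphs that can be fetched individually rather than scanning the whole graph.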
Analysis – Build up from signatures • Fraud – we build signatures • When, how long, but not to whom • We use the COI signature to build a Community of Interest for everyone, and then use that for analysis • Example • Communities are everywhere (e.g. Amazon), but representing (and visualizing) as a graph gives a lot of insight.
Updating through time • Our graph is dynamic • 3M new/old numbers per week! • We use an exponentially weighted moving average as a way to smoothly update through time…
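An exponentially weighted moving average over edge weights might look like the sketch below. The exact update rule is not given on the slide, so the form `q * old + (1 - q) * new`, the parameter name `q`, and the pruning threshold `eps` are assumptions.

```python
def ewma_update(old, new, q):
    """Blend last period's edge weights with this period's activity.

    old, new: dicts mapping an edge (any hashable) to a weight.
    q: weight given to history, assumed to lie in [0, 1).
    """
    if not 0.0 <= q < 1.0:
        raise ValueError("q must be in [0, 1)")
    updated = {}
    for edge in set(old) | set(new):
        # Edges absent from one side contribute zero for that side.
        updated[edge] = q * old.get(edge, 0.0) + (1.0 - q) * new.get(edge, 0.0)
    return updated


def prune(weights, eps):
    """Drop edges whose smoothed weight has decayed below eps."""
    return {edge: w for edge, w in weights.items() if w >= eps}


# Hypothetical example: one old edge goes quiet, one new edge appears.
last_week = {("a", "b"): 10.0}
this_week = {("a", "c"): 4.0}
smoothed = ewma_update(last_week, this_week, q=0.5)
kept = prune(smoothed, eps=3.0)
```

Under this form an edge that goes quiet is not deleted outright; its weight shrinks by a factor of `q` each period until pruning removes it.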
Two motivating examples • Two examples where looking at local network behavior via COI helped answer the questions of interest, without modeling • Viral Marketing • Fraud
Viral Marketing plans • Viral Marketing – let your customers sell for you • COI was the perfect tool to throw at this…by capturing the local neighborhood of the enrollees, we can test the viral hypothesis • We can also track through time • What did we do? • For the enrollees, find the induced subgraph from their COI • Look at a control group
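The "induced subgraph" step can be sketched as follows. It is one reading of the slide, stated as an assumption: keep only COI edges whose two endpoints are both enrollees, then treat connected components of what remains as clusters. The function names and toy identifiers are hypothetical.

```python
from collections import defaultdict, deque


def induced_subgraph(edges, nodes):
    """Keep only the edges whose two endpoints both belong to `nodes`."""
    nodes = set(nodes)
    return [(u, v) for u, v in edges if u in nodes and v in nodes]


def connected_components(edges, nodes):
    """Connected components of an undirected graph, found by breadth-first search."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical toy data: "x" is in someone's COI but is not an enrollee.
coi_edges = [("e1", "e2"), ("e2", "x"), ("e3", "x"), ("e4", "e5")]
enrollees = {"e1", "e2", "e3", "e4", "e5"}
sub_edges = induced_subgraph(coi_edges, enrollees)
clusters = connected_components(sub_edges, enrollees)
```

Running the same two functions on a control group of non-enrollees gives a baseline for how much clustering to expect by chance.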
Cluster results… Let’s look at some…
RDD: Repetitive Debtors Database • Lots of people can’t pay their bill, but they want phone service anyway
RDD Process [Diagram labels: Connect pool (30 Days), T, restricts] • A big matching problem… • Every day • we get restricted TNs, 4K / day • we get connected TNs, 40K / day • Look over a 30 day period (possible 4B comparisons!) • Compare the COI graphs of the disconnected number and the new number… • We need a metric for graph distance
Matching Strategy [Diagram labels: TN-1, Connect, TN-2, Restrict, TN-3, TN-4, Connect, TN-5] • We use a combination of: • Intersection > 2 (to pare down) • Name/address overlap (to weed out) • $$ owed (to prioritize) • Here’s where modeling could help…or maybe not
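The paring-down and prioritizing steps can be sketched as below. This is an illustration under stated assumptions: each TN's COI is reduced to a plain set of neighbours, "intersection > 2" is read as at least three shared neighbours, the name/address check is left out, and all names and data are hypothetical.

```python
def candidate_matches(restricted, connected, owed, min_overlap=3):
    """Pair restricted TNs with newly connected TNs that share COI neighbours.

    restricted, connected: dicts mapping a TN to the set of its COI neighbours.
    owed: dict mapping a restricted TN to the amount owed.
    min_overlap=3 encodes "intersection > 2".
    """
    matches = []
    for r_tn, r_coi in restricted.items():
        for c_tn, c_coi in connected.items():
            shared = len(r_coi & c_coi)
            if shared >= min_overlap:
                matches.append((r_tn, c_tn, shared, owed.get(r_tn, 0.0)))
    # Largest debt first, then largest overlap, then TN order for stable output.
    matches.sort(key=lambda m: (-m[3], -m[2], m[0], m[1]))
    return matches


# Hypothetical toy data
restricted = {"R1": {"a", "b", "c", "d"}, "R2": {"a", "x", "y", "z"}}
connected = {"C1": {"a", "b", "c", "q"}, "C2": {"x", "y", "z", "a"}, "C3": {"m", "n"}}
owed = {"R1": 120.0, "R2": 450.0}
ranked = candidate_matches(restricted, connected, owed)
```

The nested loop compares every pair, which is only workable for a toy example; at the daily volumes quoted on the previous slide, an index from neighbour to TN would be needed so that only pairs sharing at least one neighbour are ever examined.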
Wrap up • Viral Marketing • Used connected components of reduced data as ‘clusters’ • Looked for ‘centers’ of clusters for retention • Visualized clusters for understanding • Used boundary to predict new customers • COI was the best predictive variable in a marketing study • Fraud • Attacked massive matching via simple measures of distance • Fraud reps use visualized clusters to work cases • We detected RDD with an 80% success rate • Is this EDA?
Challenges • Viewing graphs through time • What if I don’t know what is coming next? • Graph distance metrics • What does “distance between graphs” mean? • Tools for looking at many graphs • what do union and intersection mean? • Modelling and EDA go hand in hand • Viral marketing models define network value, feed this into graph to do EDA….
An answer for Duncan… • What do I want and who is going to do it? • Tools that combine: • Interactive capability • Graph operations • Statistical analysis • It’s happening • It’s great!! • It’s a little confusing This model works for me….do you agree?
What I want…. • powerful ways to do union/intersection • though it is unclear what those actually mean for graphs • statistical measures of distance between graphs: what is the metric of interest, really? • use variables on nodes and edges to easily define new graphs, and automatically point me towards the interesting ones (largest, densest) • standard tools for finding graph-theoretic concepts like cliques, pseudo cliques, density, bridge edges, boundary • ability to visualize the temporal component of graphs: is there a paradigm other than plotting the ubergraph?
Points to make • if each TN is a graph, and we are looking for similar graphs, we could be doing millions or billions of these comparisons…SNA stuff is great, but it doesn’t really work at this scale! • sometimes EDA is the answer: it is the best we can do, or perhaps it is sufficient for the user. • think graphs, and plot them! Even if you can’t plot the whole thing, plot some of it (do speech example…) • “network value” might be important, and it might not be the same as density: the highest-value subgraph may be a sunburst, which is not a high-density subgraph, and it may depend on time • Modelling can be great: find pseudo edges, use latent space models, etc…
Visualize, even when you can’t • always a way to subset or threshold, or something • Speech example • learn some graph theory • bridge nodes/edges • Density, definitions of cliques and pseudo cliques • DFS/BFS, minimal spanning trees… • Strongly connected components • subset
Storing COI Signatures • COI sigs are stored in Hancock, a C-based domain-specific language designed for large amounts of signature-type data (Rogers, Fisher, et al) • Indexed by TN, so it is easy and fast to get COI for large lists of TN, and use spiders for recursion. • e.g. cycling over all TNs to learn something about our customer base takes minutes. We could never do this before!
Informative overlap score [Figure: graph with nodes A, B, O, Z] • Calculate the “informative overlap” score [formula not recovered from the slide; surviving terms: w_ao, w_ob, w_o] • Where: • w_ao = weight of edge from a to o • w_ob = weight of edge from o to b • w_o = sum of the weights of edges to o • d_ao, d_ob = graph distances from a and b to o
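The formula itself does not survive in the slide text, so the sketch below is only one plausible score built from the symbols the slide defines (wao, wob, wo, dao, dob). The specific form, a sum over shared nodes o of (wao × wob / wo) / (dao × dob), is an assumption, as are the function and argument names.

```python
def informative_overlap(shared, w_a, w_b, w_total, d_a, d_b):
    """One hypothetical informative-overlap score between nodes a and b.

    shared:  iterable of nodes o that appear in both COIs
    w_a[o]:  weight of the edge from a to o        (w_ao)
    w_b[o]:  weight of the edge from o to b        (w_ob)
    w_total[o]: sum of the weights of edges to o   (w_o)
    d_a[o], d_b[o]: graph distances from a and b to o (d_ao, d_ob)
    """
    score = 0.0
    for o in shared:
        # Assumed form: popular nodes (large w_o) and distant nodes count less.
        score += (w_a[o] * w_b[o] / w_total[o]) / (d_a[o] * d_b[o])
    return score


# Hypothetical toy data: o1 is a quiet shared contact, o2 a very busy one.
score = informative_overlap(
    shared=["o1", "o2"],
    w_a={"o1": 2.0, "o2": 1.0},
    w_b={"o1": 4.0, "o2": 1.0},
    w_total={"o1": 8.0, "o2": 100.0},
    d_a={"o1": 1, "o2": 1},
    d_b={"o1": 1, "o2": 2},
)
```

Under this assumed form, sharing a node that almost everyone contacts adds very little to the score, while sharing a rarely contacted node adds a lot, which is one way to read the word "informative".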
Selecting q • Calls fade out over time • The larger q is, the longer a call keeps a non-negligible weight
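To make the effect of q concrete, the sketch below counts how many update periods a single call stays above a cutoff. It assumes the smoothing has the form `q * old + (1 - q) * new`, so a call enters with weight (1 - q) and is multiplied by q each later period; the cutoff `eps` and the function names are hypothetical.

```python
def weight_after(q, periods):
    """Weight of a single call `periods` updates after it was first recorded."""
    return (1.0 - q) * q ** periods


def periods_until_negligible(q, eps):
    """Smallest number of periods n with (1 - q) * q**n below eps."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    n = 0
    while weight_after(q, n) >= eps:
        n += 1
    return n


# Compare a fast-fading and a slow-fading choice of q at the same cutoff.
fast = periods_until_negligible(0.5, 0.01)
slow = periods_until_negligible(0.9, 0.01)
```

The trade-off under this assumption is that a larger q remembers calls longer but also gives each new call a smaller initial weight, so new behaviour shows up more slowly.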