1 / 101

Mining Large Graphs Part 3: Case studies

Mining Large Graphs Part 3: Case studies. Jure Leskovec and Christos Faloutsos Machine Learning Department.

maya
Download Presentation

Mining Large Graphs Part 3: Case studies

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Mining Large GraphsPart 3: Case studies Jure Leskovec and Christos Faloutsos Machine Learning Department Joint work with: Lada Adamic, DeepayChakrabarti, Natalie Glance, Carlos Guestrin, Bernardo Huberman, Jon Kleinberg, Andreas Krause, Mary McGlohon, Ajit Singh, and Jeanne VanBriesen.

  2. Tutorial outline • Part 1: Structure and models for networks • What are properties of large graphs? • How do we model them? • Part 2: Dynamics of networks • Diffusion and cascading behavior • How do viruses and information propagate? • Part 3: Case studies • 240 million MSN instant messenger network • Graph projections: how does the web look like Leskovec&Faloutsos ECML/PKDD 2007

  3. Part 3: Outline Case studies • Microsoft Instant Messenger Communication network • How does the world communicate • Web projections • How to do learning from contextual subgraphs • Finding fraudsters on eBay • Center piece subgraphs • How to find best path between the query nodes Leskovec&Faloutsos ECML/PKDD 2007

  4. Microsoft Instant Messenger Communication Network How does the whole world communicate? Leskovec and Horvitz: Worldwide Buzz: Planetary-Scale Views on an Instant-Messaging Network, 2007

  5. The Largest Social Network • What is the largest social network in the world (that one can relatively easily obtain)?  For the first time we had a chance to look at complete (anonymized) communication of the whole planet (using Microsoft MSN instant messenger network) Leskovec&Faloutsos ECML/PKDD 2007

  6. Instant Messaging • Contact (buddy) list • Messaging window Leskovec&Faloutsos ECML/PKDD 2007

  7. IM – Phenomena at planetary scale Observe social phenomena at planetary scale: • How does communication change with user demographics (distance, age, sex)? • How does geography affect communication? • What is the structure of the communication network? Leskovec&Faloutsos ECML/PKDD 2007

  8. Communication data The record of communication • Presence data • user status events (login, status change) • Communication data • who talks to whom • Demographics data • user age, sex, location Leskovec&Faloutsos ECML/PKDD 2007

  9. Data description: Presence • Events: • Login, Logout • Is this first ever login • Add/Remove/Block buddy • Add unregistered buddy (invite new user) • Change of status (busy, away, BRB, Idle,…) • For each event: • User Id • Time Leskovec&Faloutsos ECML/PKDD 2007

  10. Data description: Communication • For every conversation (session) we have a list of users who participated in the conversation • There can be multiple people per conversation • For each conversation and each user: • User Id • Time Joined • Time Left • Number of Messages Sent • Number of Messages Received Leskovec&Faloutsos ECML/PKDD 2007

  11. Data description: Demographics • For every user (self reported): • Age • Gender • Location (Country, ZIP) • Language • IP address (we can do reverse geo IP lookup) Leskovec&Faloutsos ECML/PKDD 2007

  12. Data collection • Log size: 150Gb/day • Just copying over the network takes 8 to 10h • Parsing and processing takes another 4 to 6h • After parsing and compressing ~ 45 Gb/day • Collected data for 30 days of June 2006: • Total: 1.3Tb of compressed data Leskovec&Faloutsos ECML/PKDD 2007

  13. Data statistics Activity over June 2006 (30 days) • 245 million users logged in • 180 million users engaged in conversations • 17,5 million new accounts activated • More than 30 billion conversations Leskovec&Faloutsos ECML/PKDD 2007

  14. Data statistics per day Activity on June 1 2006 • 1 billion conversations • 93 million users login • 65 million different users talk (exchange messages) • 1.5 million invitations for new accounts sent Leskovec&Faloutsos ECML/PKDD 2007

  15. User characteristics: age Leskovec&Faloutsos ECML/PKDD 2007

  16. Age piramid: MSN vs. the world Leskovec&Faloutsos ECML/PKDD 2007

  17. Conversation: Who talks to whom? • Cross gender edges: • 300 male-male and 235 female-female edges • 640 million female-male edges

  18. Number of people per conversation • Max number of people simultaneously talking is 20, but conversation can have more people Leskovec&Faloutsos ECML/PKDD 2007

  19. Conversation duration • Most conversations are short Leskovec&Faloutsos ECML/PKDD 2007

  20. Conversations: number of messages • Sessions between fewer people run out of steam Leskovec&Faloutsos ECML/PKDD 2007

  21. Time between conversations • Individuals are highly diverse • What is probability to login into the system after tminutes? • Power-law with exponent 1.5 • Task queuing model [Barabasi ’05] Leskovec&Faloutsos ECML/PKDD 2007

  22. Age: Number of conversations High User self reported age Low Leskovec&Faloutsos ECML/PKDD 2007

  23. Age: Total conversation duration High User self reported age Low Leskovec&Faloutsos ECML/PKDD 2007

  24. Age: Messages per conversation High User self reported age Low Leskovec&Faloutsos ECML/PKDD 2007

  25. Age: Messages per unit time High User self reported age Low Leskovec&Faloutsos ECML/PKDD 2007

  26. Who talks to whom: Number of conversations Leskovec&Faloutsos ECML/PKDD 2007

  27. Who talks to whom: Conversation duration Leskovec&Faloutsos ECML/PKDD 2007

  28. Geography and communication • Count the number of users logging in from particular location on the earth Leskovec&Faloutsos ECML/PKDD 2007

  29. How is Europe talking • Logins from Europe Leskovec&Faloutsos ECML/PKDD 2007

  30. Users per geo location Blue circles have more than 1 million logins. Leskovec&Faloutsos ECML/PKDD 2007

  31. Users per capita • Fraction of population using MSN: • Iceland: 35% • Spain: 28% • Netherlands, Canada, Sweden, Norway: 26% • France, UK: 18% • USA, Brazil: 8% Leskovec&Faloutsos ECML/PKDD 2007

  32. Communication heat map • For each conversation between geo points (A,B) we increase the intensity on the line between A and B Leskovec&Faloutsos ECML/PKDD 2007

  33. Correlation: • Probability: Age vs. Age

  34. IM Communication Network • Buddy graph: • 240 million people (people that login in June ’06) • 9.1 billion edges (friendship links) • Communication graph: • There is an edge if the users exchanged at least one message in June 2006 • 180 million people • 1.3 billion edges • 30 billion conversations Leskovec&Faloutsos ECML/PKDD 2007

  35. Buddy network: Number of buddies • Buddy graph: 240 million nodes, 9.1 billion edges (~40 buddies per user) Leskovec&Faloutsos ECML/PKDD 2007

  36. Communication Network: Degree • Number of people a users talks to in a month Leskovec&Faloutsos ECML/PKDD 2007

  37. Communication Network: Small-world • 6 degrees of separation [Milgram ’60s] • Average distance 5.5 • 90% of nodes can be reached in < 8 hops

  38. Communication network: Clustering • How many triangles are closed? • Clustering normally decays as k-1 • Communication network is highly clustered: k-0.37 High clustering Low clustering Leskovec&Faloutsos ECML/PKDD 2007

  39. Communication Network Connectivity Leskovec&Faloutsos ECML/PKDD 2007

  40. k-Cores decomposition • What is the structure of the core of the network? Leskovec&Faloutsos ECML/PKDD 2007

  41. k-Cores: core of the network • People with k<20 are the periphery • Core is composed of 79 people, each having 68 edges among them Leskovec&Faloutsos ECML/PKDD 2007

  42. Node deletion: Nodes vs. Edges Leskovec&Faloutsos ECML/PKDD 2007

  43. Node deletion: Connectivity Leskovec&Faloutsos ECML/PKDD 2007

  44. Web ProjectionsLearning from contextual graphs of the web How to predict user intention from the web graph?

  45. Motivation • Information retrieval traditionally considered documents as independent • Web retrieval incorporates global hyperlink relationships to enhance ranking (e.g., PageRank, HITS) • Operates on the entire graph • Uses just one feature (principal eigenvector) of the graph • Our work on Web projections focuses on • contextual subsets of the web graph; in-between the independent and global consideration of the documents • a rich set of graph theoretic properties Leskovec&Faloutsos ECML/PKDD 2007

  46. Web projections • Web projections: How they work? • Project a set of web pages of interest onto the web graph • This creates a subgraph of the web called projection graph • Use the graph-theoretic properties of the subgraph for tasks of interest • Query projections • Query results give the context (set of web pages) • Use characteristics of the resulting graphs for predictions about search quality and user behavior Leskovec&Faloutsos ECML/PKDD 2007

  47. Predictions Query projections Query Results Projection on the web graph • -- -- ---- • --- --- ---- • ------ --- • ----- --- -- • ------ ----- • ------ ----- Q Query connection graph Query projection graph Generate graphical features Construct case library

  48. Questions we explore • How do query search results project onto the underlying web graph? • Can we predict the quality of search results from the projection on the web graph? • Can we predict users’ behaviors with issuing and reformulating queries? Leskovec&Faloutsos ECML/PKDD 2007

  49. Is this a good set of search results? Leskovec&Faloutsos ECML/PKDD 2007

  50. Will the user reformulate the query? Leskovec&Faloutsos ECML/PKDD 2007

More Related