Chapter 10 Link Analysis
Introduction • Airline Route Maps are useful • Information can tell you about both history and politics • Call Detail Records tell us about relationships between people • The Web is based on (hyper)links between documents • Claim: there are no more than 6 degrees of separation between any two people • Link Analysis is the data mining technique that addresses relationships and connections • Link Analysis is based on Graph Theory
Introduction • Effective in many situations • Identifying authoritative sources of information on the WWW by analyzing page links • Understanding physician referral patterns • Analyzing telephone call patterns • MCI Friends and Family • Could give out private info • "You know Mary Smith, also on MCI, so join MCI" • But your wife does not know Mary Smith • Far-fetched? Facebook does it all the time! • Can identify fraud: a calling-card thief tends to call the same people • Can you think of other applications? Links?
Basic Graph Theory • Graphs are an abstraction used to represent relationships • Graphs consist of • Nodes (vertices) which are the things in the graph that have relationships • Edges are pairs of nodes connected by a relationship • Visualization is a key characteristic of a graph
Basic Graph Theory • A path is an ordered sequence of nodes connected by edges • Flight Segments (legs) such as LA – Denver – Boston • A weighted graph is one in which the edges have weights associated with them • Example: Weights support the association between two products being purchased together
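As a rough illustration of paths and weighted edges, the sketch below stores an undirected weighted graph as a dictionary keyed by node pairs and sums the weights along a path such as LA – Denver – Boston. The weights and the helper names `edge_weight` and `path_weight` are made up for this example.

```python
# Illustrative weighted graph: the weights are invented numbers, not real data.
flights = {
    ("LA", "Denver"): 2,
    ("Denver", "Boston"): 4,
    ("LA", "Chicago"): 3,
}

def edge_weight(graph, a, b):
    """Return the weight of the undirected edge a-b, or None if it is absent."""
    if (a, b) in graph:
        return graph[(a, b)]
    return graph.get((b, a))

def path_weight(graph, path):
    """Sum edge weights along an ordered node sequence; None if a leg is missing."""
    total = 0
    for a, b in zip(path, path[1:]):
        w = edge_weight(graph, a, b)
        if w is None:
            return None
        total += w
    return total

la_to_boston = path_weight(flights, ["LA", "Denver", "Boston"])
no_direct_leg = path_weight(flights, ["LA", "Boston"])
```

Here `la_to_boston` adds the two legs, while `no_direct_leg` is `None` because the graph has no LA – Boston edge.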
Graph Theory Classic Problems • Finding an Euler path in the graph that traverses every edge exactly one time (Seven Bridges of Königsberg: edges are bridges and nodes are land). Simple rule: the graph must be connected and have at most 2 nodes with odd degree.* • Finding the shortest path that visits every node in the graph exactly one time (Traveling Salesman) • A completely connected graph with n nodes has n! paths, and no known algorithm solves the problem in time that is not exponential in n; there is no simple rule. *By contrast, there is no simple algorithm to determine a Hamiltonian path, which visits each vertex exactly once
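The odd-degree rule for Euler paths is easy to check in code. The sketch below is one way to do it, assuming an undirected graph given as a list of edges (parallel edges allowed); the node labels A–D for the Seven Bridges layout and the function name `has_euler_path` are chosen for this example.

```python
from collections import defaultdict

def has_euler_path(edges):
    """True if a connected undirected multigraph has 0 or 2 odd-degree nodes."""
    degree = defaultdict(int)
    adj = defaultdict(set)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
        adj[a].add(b)
        adj[b].add(a)
    if not degree:
        return True  # an empty graph trivially qualifies

    # Depth-first search to confirm every node with an edge is reachable.
    start = next(iter(degree))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    if len(seen) != len(degree):
        return False

    odd = sum(1 for d in degree.values() if d % 2 == 1)
    return odd in (0, 2)

# Seven Bridges: four land masses (A-D) joined by seven bridges.
seven_bridges = [
    ("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]
bridges_ok = has_euler_path(seven_bridges)
chain_ok = has_euler_path([("A", "B"), ("B", "C")])
```

All four land masses have odd degree, so `bridges_ok` is `False`; the simple chain A – B – C has exactly two odd-degree nodes, so `chain_ok` is `True`.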
Directed vs Undirected Graphs • Undirected graphs – edges between nodes go in both directions (A to B; B to A) • Directed graphs – edges between nodes only go in one direction (A to B is different from B to A) • Ex: WWW
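The difference shows up directly in how an adjacency structure is built. This is a minimal sketch, with `build_adjacency` as a hypothetical helper: the same edge list yields one-way neighbors when treated as directed and two-way neighbors when treated as undirected.

```python
def build_adjacency(edges, directed):
    """Map each node to the set of nodes reachable over one edge."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set())      # make sure b appears even with no out-edges
        if not directed:
            adj[b].add(a)             # undirected: the edge also runs b -> a
    return adj

links = [("A", "B"), ("B", "C")]
as_directed = build_adjacency(links, directed=True)
as_undirected = build_adjacency(links, directed=False)
```

In `as_directed`, B points to C but not back to A, much as a web page can link to a page that does not link back.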
Google – Directed Graph Example • Web pages = nodes • Hyperlinks = edges • Spiders & Web crawlers updating • Kleinberg’s Algorithm • Hub – a page that links to many authorities • Authority – a page that is linked to by many hubs
Google – example continued • Authority versus mere popularity • Ranking by the number of unrelated sites linking to a site yields popularity • Ranking by the number of subject-related hubs that point to a site yields authority • Helps to overcome the situation that often arises with popularity, where the real authority (e.g., the home page) is ranked lower because few links point to it
Kleinberg Algorithm • Search process: • Begin with text-based keyword matching that returns a root set of hundreds of good matches • Identify candidate pages: add all pages linked to by the root set and a subset of the pages that link to the root set. This leads to 1K-5K pages • Rank hubs and authorities (see definitions): an iterative algorithm rewards hubs that are associated with strong authorities and authorities that are associated with strong hubs • Pages start with a weight of 1, and hubs (authorities) are rewarded based on the weights of their associated authorities (hubs) • (see the pagerank example linked to on our schedule)
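The ranking step can be sketched in a few lines. The version below is only an approximation of the idea, assuming the candidate pages are already collected into a dictionary mapping each page to the pages it links to; the page names, the iteration count, and the rescaling of scores each round (which the slide does not mention) are assumptions of this example.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length so they do not grow without bound."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {k: v / norm for k, v in scores.items()}

def hubs_and_authorities(links, iterations=50):
    """links: dict mapping a page to the set of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages |= set(targets)
    hub = {p: 1.0 for p in pages}    # every page starts with weight 1
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages that link in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        new_hub = {
            p: sum(new_auth[d] for d in links.get(p, ())) for p in pages
        }
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth

# Hypothetical candidate set: two hub-like pages and two authority-like pages.
links = {"h1": {"a1", "a2"}, "h2": {"a1"}, "a1": set(), "a2": set()}
hub, auth = hubs_and_authorities(links)
```

In this toy graph `a1` ends up with the higher authority score because both hubs point to it, and `h1` ends up with the higher hub score because it points to both authorities.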
Link Analysis Applications • Can use link analysis to identify fax machines • Fax machines generally call other fax machines • Can run iterative algorithms to propagate information on how each phone number is used • At AT&T I used a non-link approach to identify voice vs. data vs. fax lines • Used call detail records to describe each phone number and used an autodialer to generate training data
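One possible form of such an iterative propagation is sketched below; it is not the method described on the slide, just a simple averaging scheme. It assumes a few numbers are already labeled (1.0 for fax, 0.0 for voice), starts every other number at 0.5, and repeatedly sets each unlabeled number's score to the average of its call partners' scores. The phone-number names and the function `propagate_fax_scores` are hypothetical.

```python
def propagate_fax_scores(calls, known, rounds=20):
    """calls: list of (caller, callee); known: number -> 1.0 (fax) or 0.0 (voice)."""
    neighbors = {}
    for a, b in calls:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Unlabeled numbers start at 0.5, i.e. no evidence either way.
    scores = {n: known.get(n, 0.5) for n in neighbors}
    for _ in range(rounds):
        updated = {}
        for n, nbrs in neighbors.items():
            if n in known:
                updated[n] = known[n]          # labeled numbers stay fixed
            else:
                updated[n] = sum(scores[m] for m in nbrs) / len(nbrs)
        scores = updated
    return scores

calls = [("fax1", "x"), ("fax2", "x"), ("voice1", "y"), ("voice2", "y")]
known = {"fax1": 1.0, "fax2": 1.0, "voice1": 0.0, "voice2": 0.0}
scores = propagate_fax_scores(calls, known)
```

The unlabeled number `x`, which only exchanges calls with known fax lines, drifts toward a fax score, while `y` drifts toward voice.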
Using Links to Generate Recommendations • A grad student and I built a graph with nodes to represent movies and people • A link from a person to a movie indicated the review rating • The graph is very sparse: most people do not see most movies • Matched people to those with similar movie preferences and then filled in edges • Once more edges are filled in, it is easier to compute similarity between users, and the process iterates
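A single round of "match similar people, then fill in an edge" might look like the sketch below. This is not the system described on the slide; it assumes ratings on a 1 to 5 scale, measures similarity by the average rating gap on shared movies, and predicts a missing rating as a similarity-weighted average. All names and ratings are invented for illustration.

```python
def similarity(r1, r2):
    """Similarity in [0, 1] from the mean rating gap on shared movies (1-5 scale assumed)."""
    shared = set(r1) & set(r2)
    if not shared:
        return 0.0
    mean_gap = sum(abs(r1[m] - r2[m]) for m in shared) / len(shared)
    return 1.0 - mean_gap / 4.0   # 4 is the largest possible gap on a 1-5 scale

def predict(ratings, person, movie):
    """Fill in a missing person-movie edge from similar people's ratings."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == person or movie not in theirs:
            continue
        s = similarity(ratings[person], theirs)
        num += s * theirs[movie]
        den += s
    return num / den if den > 0 else None

# Hypothetical sparse person -> movie ratings.
ratings = {
    "ann": {"m1": 5, "m2": 1},
    "bob": {"m1": 5, "m2": 1, "m3": 4},
    "cat": {"m1": 1, "m2": 5, "m3": 2},
}
ann_m3 = predict(ratings, "ann", "m3")
```

Because `ann` agrees completely with `bob` and not at all with `cat`, the filled-in rating for `m3` follows `bob`. Writing the prediction back into `ratings` and repeating would give the iteration the slide describes.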