Chapter 10 Link Analysis


biana

Presentation Transcript


  1. Chapter 10 Link Analysis

  2. Introduction
     • Airline route maps are useful
     • Information can tell you about both history and politics
     • Call detail records tell us about relationships between people
     • The Web is based on (hyper)links between documents
     • It is claimed that there are no more than 6 degrees of separation between any two people
     • Link analysis is the data mining technique that addresses relationships and connections
     • Link analysis is based on graph theory
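The "degrees of separation" claim is a statement about path lengths in an acquaintance graph. As a minimal sketch (the names in `acquaintances` and the function `degrees_of_separation` are made up for illustration), a breadth-first search counts the fewest links between two people:

```python
from collections import deque

def degrees_of_separation(graph, start, goal):
    """Return the fewest edges between start and goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

# Hypothetical acquaintance graph: person -> set of people they know.
acquaintances = {
    "ann": {"bob"},
    "bob": {"ann", "cat"},
    "cat": {"bob", "dan"},
    "dan": {"cat"},
    "eve": set(),  # knows nobody in this toy graph
}

print(degrees_of_separation(acquaintances, "ann", "dan"))
print(degrees_of_separation(acquaintances, "ann", "eve"))
```

Breadth-first search explores people one "degree" at a time, so the first time it reaches the goal it has found the shortest chain.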

  3. Introduction
     • Effective in many situations
       • Identifying authoritative sources of information on the WWW by analyzing page links
       • Understanding physician referral patterns
       • Analyzing telephone call patterns
     • MCI Friends and Family
       • Could give out private info
       • "You know Mary Smith, who is also on MCI, so join MCI"
       • But your wife does not know Mary Smith
       • Far-fetched? Facebook does it all of the time!
     • Can identify fraud: calling-card thieves call the same people
     • Can you think of other applications? Links?
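The fraud bullet can be made concrete with a small overlap measure. This is only a sketch under assumptions: the phone numbers are invented, and Jaccard similarity is one possible way to compare the sets of numbers two calling cards have dialed, not a method stated on the slide.

```python
def jaccard(a, b):
    """Share of called numbers two accounts have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical called-number sets.
known_stolen_card = {"555-0101", "555-0102", "555-0103"}
suspect_card = {"555-0101", "555-0102", "555-0199"}
ordinary_card = {"555-0150"}

suspect_overlap = jaccard(known_stolen_card, suspect_card)
ordinary_overlap = jaccard(known_stolen_card, ordinary_card)
print(suspect_overlap, ordinary_overlap)
```

A card whose calls overlap heavily with those of a known stolen card would be a candidate for closer review.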

  4. Basic Graph Theory
     • Graphs are an abstraction used to represent relationships
     • Graphs consist of
       • Nodes (vertices), which are the things in the graph that have relationships
       • Edges, which are pairs of nodes connected by a relationship
     • Visualization is a key characteristic of a graph
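One common way to hold nodes and edges in code is an adjacency list. The sketch below assumes an undirected graph and uses made-up airline routes; `build_graph` is an illustrative helper, not something from the slides.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected adjacency list: node -> set of neighboring nodes."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

# Hypothetical airline routes; each pair of cities is one edge.
routes = [("LA", "Denver"), ("Denver", "Boston"), ("LA", "Chicago")]
graph = build_graph(routes)

nodes = sorted(graph)  # the things that have relationships
degree = {node: len(neighbors) for node, neighbors in graph.items()}
print(nodes)
print(degree)
```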

  5. Basic Graph Theory
     • A path is an ordered sequence of nodes connected by edges
       • Flight segments (legs) such as LA – Denver – Boston
     • A weighted graph is one in which the edges have weights associated with them
       • Example: Weights support the association between two products being purchased together
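Both ideas can be sketched in a few lines. The code assumes that an edge weight is simply the number of shopping baskets containing both products; the baskets and the helper names `is_path` and `co_purchase_weights` are illustrative.

```python
from collections import Counter
from itertools import combinations

def is_path(edges, nodes):
    """True if every consecutive pair of nodes is joined by an undirected edge."""
    edge_set = {frozenset(edge) for edge in edges}
    return all(frozenset((a, b)) in edge_set for a, b in zip(nodes, nodes[1:]))

def co_purchase_weights(baskets):
    """Weight each product pair by the number of baskets containing both."""
    weights = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            weights[(a, b)] += 1
    return weights

flights = [("LA", "Denver"), ("Denver", "Boston")]
print(is_path(flights, ["LA", "Denver", "Boston"]))  # legs exist for each hop
print(is_path(flights, ["LA", "Boston"]))            # no direct leg

# Hypothetical market baskets.
baskets = [["bread", "milk"], ["bread", "milk", "eggs"], ["eggs", "milk"]]
weights = co_purchase_weights(baskets)
print(weights)
```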

  6. Graph Theory Classic Problems
     • Finding an Euler path, a path that traverses every edge exactly once (Seven Bridges: edges are bridges and nodes are land masses). Simple rule: a connected graph has an Euler path exactly when at most 2 nodes have odd degree.*
     • Finding the shortest path that visits every node in the graph exactly once (Traveling Salesman)
       • A completely connected graph with n nodes has on the order of n! paths, and no known algorithm solves the problem in time that is not exponential in n. No simple rule.
     * There is no simple algorithm for finding a Hamiltonian path, one that visits each vertex exactly once.
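The contrast between the two problems shows up clearly in code. The sketch below applies the odd-degree rule to the Seven Bridges layout (the node labels N, S, A, B are my own) and then solves a tiny Traveling Salesman instance by trying every ordering, which is only feasible because the made-up distance table has four nodes.

```python
from collections import Counter
from itertools import permutations

def odd_degree_nodes(edges):
    """Return the nodes touched by an odd number of edges (parallel edges allowed)."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(node for node, d in degree.items() if d % 2 == 1)

def has_euler_path(edges):
    """Degree test only; assumes the graph is connected."""
    return len(odd_degree_nodes(edges)) <= 2

def shortest_path_brute_force(dist):
    """Try all n! node orderings; return (length, order) of the shortest one."""
    best = None
    for order in permutations(dist):
        length = sum(dist[a][b] for a, b in zip(order, order[1:]))
        if best is None or length < best[0]:
            best = (length, order)
    return best

# Seven Bridges of Königsberg: two river banks (N, S) and two islands (A, B).
bridges = [("N", "A"), ("N", "A"), ("S", "A"), ("S", "A"),
           ("N", "B"), ("S", "B"), ("A", "B")]
print(odd_degree_nodes(bridges), has_euler_path(bridges))

# Hypothetical symmetric distances between four cities.
dist = {
    "A": {"B": 1, "C": 4, "D": 3},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"A": 3, "B": 5, "C": 1},
}
best = shortest_path_brute_force(dist)
print(best)
```

The Euler test is a single pass over the edges, while the brute-force search already examines 24 orderings for four nodes and grows factorially from there.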

  7. Directed vs Undirected Graphs
     • Undirected graphs – edges between nodes go in both directions (A to B; B to A)
     • Directed graphs – edges between nodes go in only one direction (A to B is different from B to A)
       • Ex: WWW
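In a directed graph each edge is an ordered pair, so incoming and outgoing links are counted separately. A minimal sketch with invented page names:

```python
def in_and_out_degree(links):
    """Count incoming and outgoing links per page for (source, target) pairs."""
    in_deg, out_deg = {}, {}
    for src, dst in links:
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1
        # Make sure every page appears in both tables, even with a zero count.
        out_deg.setdefault(dst, 0)
        in_deg.setdefault(src, 0)
    return in_deg, out_deg

# Hypothetical hyperlinks: (linking page, linked-to page).
links = [("a.html", "b.html"), ("c.html", "b.html"), ("b.html", "c.html")]
in_deg, out_deg = in_and_out_degree(links)

print(in_deg)
print(out_deg)
print(("b.html", "a.html") in set(links))  # the reverse link is a different edge
```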

  8. Google – Directed Graph Example
     • Web pages = nodes
     • Hyperlinks = edges
     • Spiders & web crawlers keep the graph updated
     • Kleinberg's Algorithm
       • Hub – a page that links to many authorities
       • Authority – a page that is linked to by many hubs

  9. Google – example continued
     • Authority versus mere popularity
       • Ranking by the number of unrelated sites linking to a site yields popularity
       • Ranking by the number of subject-related hubs that point to a site yields authority
     • Helps to overcome a situation that often arises with popularity, where the real authority (e.g., a home page) is ranked lower because of the lack of popularity of the links to it

  10. Kleinberg Algorithm
      • Search process:
        • Begin with text-based keyword matching that returns a root set of hundreds of good matches.
        • Identify candidate pages: add all pages linked to by the root set and a subset of the pages that link to the root set. This leads to 1K-5K pages.
        • Rank hubs and authorities (see definitions): an iterative algorithm rewards hubs that are associated with strong authorities and authorities that are associated with strong hubs.
        • Pages start with a weight of 1, and hubs (authorities) are rewarded based on the weights of their associated authorities (hubs).
      • (See the PageRank example linked to on our schedule.)
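The ranking step can be sketched as follows. This covers only the iterative hub/authority update, not the keyword search or candidate expansion; the link data is invented, and the normalization after each round (scaling scores so their squares sum to 1) is an assumption the slide does not spell out.

```python
import math

def normalize(scores):
    """Scale scores so the sum of their squares is 1 (all zeros stay zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: (v / norm if norm else 0.0) for k, v in scores.items()}

def hits(links, iterations=50):
    """Iteratively score hubs and authorities for (source, target) link pairs."""
    pages = sorted({page for link in links for page in link})
    hub = {page: 1.0 for page in pages}   # every page starts with weight 1
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        new_auth = {page: 0.0 for page in pages}
        for src, dst in links:
            new_auth[dst] += hub[src]
        # A page's hub score is the sum of the authority scores it links to.
        new_hub = {page: 0.0 for page in pages}
        for src, dst in links:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth

# Hypothetical candidate set: three hub-like pages and two authority-like pages.
links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
hub, auth = hits(links)
print(sorted(auth, key=auth.get, reverse=True))
print(sorted(hub, key=hub.get, reverse=True))
```

In this toy graph `a1` ends up with the highest authority score because all three hubs point to it, and `h1` and `h2` outrank `h3` as hubs because they point to both authorities.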

  11. Link Analysis Applications
      • Can use link analysis to identify fax machines
        • Fax machines generally call other fax machines
        • Can run iterative algorithms to propagate information on how each phone number is used
      • At AT&T I used a non-link approach to identify voice vs. data vs. fax lines
        • Used call detail records to describe each phone number and used an autodialer to generate training data
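One way such an iterative propagation could look is sketched below. It is an assumption, not the method described on the slide: numbers with a known label keep a fixed "fax score" (1.0 for fax, 0.0 for voice), and every other number repeatedly takes the average score of the numbers it exchanges calls with. The phone labels and the function name `propagate_fax_scores` are hypothetical.

```python
def propagate_fax_scores(calls, seeds, rounds=20):
    """Spread known fax/voice labels over the call graph by neighbor averaging."""
    neighbors = {}
    for a, b in calls:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    # Unknown numbers start at 0.5 (no evidence either way).
    scores = {number: seeds.get(number, 0.5) for number in neighbors}
    for _ in range(rounds):
        updated = {}
        for number, nbrs in neighbors.items():
            if number in seeds:
                updated[number] = seeds[number]  # known labels stay fixed
            else:
                updated[number] = sum(scores[m] for m in nbrs) / len(nbrs)
        scores = updated
    return scores

# Hypothetical call records (caller, callee) and known labels.
calls = [("fax1", "x"), ("fax2", "x"), ("voice1", "y"), ("voice2", "y")]
seeds = {"fax1": 1.0, "fax2": 1.0, "voice1": 0.0, "voice2": 0.0}
scores = propagate_fax_scores(calls, seeds)
print(scores["x"], scores["y"])
```

Number `x` only talks to known fax machines, so its score moves toward 1.0, while `y` only talks to voice lines and moves toward 0.0.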

  12. Using Links to Generate Recommendations
      • A grad student and I built a graph with nodes to represent movies and people
      • A link from a person to a movie indicated the review rating
      • The graph is very sparse: most people do not see most movies
      • Matched people to those with similar movie preferences and then filled in edges
      • Once more edges are filled in, it is easier to compute similarity between users, and the process iterates
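A single round of "match similar people, then fill in edges" might look like the sketch below. The similarity measure (based on the mean rating difference over movies both people rated) and the similarity-weighted average used to fill in a missing rating are my assumptions; the slide does not say which measures were used, and the ratings are invented.

```python
def similarity(r1, r2):
    """Similarity in (0, 1] from the mean rating gap on shared movies; 0 if none shared."""
    shared = set(r1) & set(r2)
    if not shared:
        return 0.0
    mean_diff = sum(abs(r1[m] - r2[m]) for m in shared) / len(shared)
    return 1.0 / (1.0 + mean_diff)

def fill_in_edges(ratings):
    """Estimate each missing person-movie rating from similar people's ratings."""
    movies = {movie for rated in ratings.values() for movie in rated}
    filled = {person: dict(rated) for person, rated in ratings.items()}
    for person, rated in ratings.items():
        for movie in movies - set(rated):
            num = den = 0.0
            for other, other_rated in ratings.items():
                if other != person and movie in other_rated:
                    weight = similarity(rated, other_rated)
                    num += weight * other_rated[movie]
                    den += weight
            if den > 0:
                filled[person][movie] = num / den
    return filled

# Hypothetical sparse ratings: person -> {movie: rating on a 1-5 scale}.
ratings = {
    "ann": {"Alien": 5, "Heat": 4},
    "bob": {"Alien": 5, "Heat": 4, "Up": 2},
    "cat": {"Alien": 1, "Up": 5},
}
filled = fill_in_edges(ratings)
print(filled["ann"]["Up"], filled["cat"]["Heat"])
```

Calling `fill_in_edges` again on its own output would be one way to iterate, since the denser graph gives `similarity` more shared movies to work with.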

  13. End of Chapter 10
