Outline
Anchor text
Background on networks
Bibliometric (citation) networks
Social networks
Link analysis for ranking: PageRank, HITS
Search Engine Optimization
1. CS 124/LINGUIST 180: From Languages to Information
Lecture 18: Networks part I: Link Analysis, PageRank
Dan Jurafsky
2. Slide from Chris Manning
3. The Web as a Directed Graph Assumption 1: A hyperlink between pages denotes author-perceived relevance (a quality signal). Slide from Chris Manning
4. Anchor Text WWW Worm - McBryan [Mcbr94]. For the query ibm, how do we distinguish between:
IBM’s home page (mostly graphical)
IBM’s copyright page (high term freq. for ‘ibm’)
Rival’s spam page (arbitrarily high term freq.) Slide from Chris Manning
5. Indexing anchor text When indexing a document D, include anchor text from links pointing to D. Slide from Chris Manning
6. Indexing anchor text Can sometimes have unexpected side effects –
like what?
Can score anchor text with weight depending on the authority of the anchor page’s website
E.g., if we were to assume that content from cnn.com or yahoo.com is authoritative, then trust the anchor text from them. Slide from Chris Manning
7. Anchor Text Other applications
Weighting/filtering links in the graph
Generating page descriptions from anchor text Slide from Chris Manning
8. Roots of Web Link Analysis Bibliometrics
Social network analysis Slide from Chris Manning
9. Citation Analysis: Impact Factor Developed by Garfield in 1972 to measure the importance (quality, influence) of scientific journals.
Measure of how often papers in the journal are cited by other scientists.
Computed and published annually by the Institute for Scientific Information (ISI).
The impact factor of a journal J in year Y is the average number of citations (from indexed documents published in year Y) to papers published in J in year Y-1 or Y-2.
Does not account for the quality of the citing article. Slide from Ray Mooney
10. Citations vs. Links Web links are a bit different than citations:
Many links are navigational.
Many pages with high in-degree are portals not content providers.
Not all links are endorsements.
Company websites don’t point to their competitors.
Citation of relevant literature is enforced by peer review. Slide from Ray Mooney
11. Social network analysis Social network analysis is the study of social entities (actors, e.g. people in an organization) and their interactions and relationships.
The interactions and relationships can be represented with a network or graph,
each vertex (or node) represents an actor and
each link represents a relationship. CS583, Bing Liu, UIC
12. Centrality Important or prominent actors are those that are linked or involved with other actors extensively.
A person with extensive contacts (links) or communications with many other people in the organization is considered more important than a person with relatively fewer contacts.
The links can also be called ties. A central actor is one involved in many ties. CS583, Bing Liu, UIC
13. Prestige Prestige is a more refined measure of prominence of an actor than centrality.
Distinguish: ties sent (out-links) and ties received (in-links).
A prestigious actor is one who is the object of extensive ties as a recipient.
To compute prestige, we use only in-links.
Difference between centrality and prestige:
centrality focuses on out-links
prestige focuses on in-links.
PageRank is based on prestige. CS583, Bing Liu, UIC
14. Drawing on the citation work First attempt to do link analysis Slide from Chris Manning
15. Query-independent ordering First generation: using link counts as simple measures of popularity.
Two basic suggestions:
Undirected popularity:
Each page gets a score = the number of in-links plus the number of out-links (e.g., 3 + 2 = 5 for a page with 3 in-links and 2 out-links).
Directed popularity:
Score of a page = number of its in-links (3 in the same example). Slide from Chris Manning
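Both first-generation scores are just degree counts. A minimal sketch, using a hypothetical six-page graph chosen so the scored page has 3 in-links and 2 out-links as in the slide's example:

```python
# Hypothetical toy web graph: page -> pages it links to.
links = {
    "A": ["X"],
    "B": ["X"],
    "C": ["X"],
    "X": ["D", "E"],
    "D": [],
    "E": [],
}

def in_degree(page):
    """Number of in-links: pages whose out-link lists contain `page`."""
    return sum(page in outs for outs in links.values())

def out_degree(page):
    """Number of out-links from `page`."""
    return len(links[page])

# Undirected popularity: in-links plus out-links.
undirected_score = in_degree("X") + out_degree("X")   # 3 + 2 = 5
# Directed popularity: in-links only.
directed_score = in_degree("X")                       # 3
```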
16. Query processing First retrieve all pages meeting the text query (say venture capital).
Order these by their link popularity (either variant on the previous slide).
More nuanced – use link counts as a measure of static goodness, combined with text match score
Slide from Chris Manning
17. Spamming simple popularity Exercise: How do you spam each of the following heuristics so your page gets a high score?
Each page gets a static score = the number of in-links plus the number of out-links.
Static score of a page = number of its in-links.
Slide from Chris Manning
18. Intuition of PageRank
19. Pagerank scoring Imagine a browser doing a random walk on web pages:
Start at a random page
At each step, go out of the current page along one of the links on that page, equiprobably
“In the steady state” each page has a long-term visit rate - use this as the page’s score. Slide from Chris Manning
20. Not quite enough The web is full of dead-ends.
Random walk can get stuck in dead-ends.
Makes no sense to talk about long-term visit rates. Slide from Chris Manning
21. Teleporting At a dead end, jump to a random web page.
At any non-dead end, with probability 10%, jump to a random web page.
With remaining probability (90%), go out on a random link.
10% is a parameter. Slide from Chris Manning
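The teleporting rule can be sketched as a transition-matrix construction. This is a minimal sketch assuming a 0/1 adjacency matrix and the 10% teleport parameter from the slide; the three-page graph is made up:

```python
import numpy as np

def transition_matrix(adj, teleport=0.10):
    """Build the teleporting random-walk matrix from a 0/1 adjacency matrix.

    Dead ends (rows with no out-links) jump to a uniformly random page;
    other rows follow a random out-link with probability 1 - teleport
    and jump to a uniformly random page with probability teleport.
    """
    n = adj.shape[0]
    P = np.zeros((n, n))
    for i in range(n):
        out = adj[i].sum()
        if out == 0:                                  # dead end: always teleport
            P[i] = np.ones(n) / n
        else:
            P[i] = (1 - teleport) * adj[i] / out + teleport / n
    return P

# Toy 3-page web: page 2 is a dead end.
adj = np.array([[0, 1, 1],
                [1, 0, 1],
                [0, 0, 0]])
P = transition_matrix(adj)
```

Every row of P sums to 1, so P is a valid transition probability matrix; this is the matrix used in the Markov-chain slides that follow.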
22. Result of teleporting Now cannot get stuck locally.
There is a long-term rate at which any page is visited (not obvious, will show this).
How do we compute this visit rate? Slide from Chris Manning
23. Markov chains A Markov chain consists of n states, plus an n×n transition probability matrix P.
At each step, we are in exactly one of the states.
For 1 ≤ i, j ≤ n, the matrix entry Pij tells us the probability of j being the next state, given we are currently in state i. Slide from Chris Manning
24. Markov chains Clearly, for all i, Σj Pij = 1 (each row of P sums to one).
Markov chains are abstractions of random walks.
Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain. Slide from Chris Manning
25. Ergodic Markov chains A Markov chain is ergodic if
you have a path from any state to any other
For any start state, after a finite transient time T0, the probability of being in any state at a fixed time T>T0 is nonzero. Slide from Chris Manning
26. Ergodic Markov chains For any ergodic Markov chain, there is a unique long-term visit rate for each state.
Steady-state probability distribution.
Over a long time-period, we visit each state in proportion to this rate.
It doesn’t matter where we start. Slide from Chris Manning
27. Probability vectors A probability (row) vector x= (x1, … xn) tells us where the walk is at any point.
E.g., (000…1…000) means we’re in state i. Slide from Chris Manning
28. Change in probability vector If the probability vector is x= (x1, … xn) at this step, what is it at the next step?
Recall that row i of the transition prob. matrix P tells us where we go next from state i.
So from x, our next state is distributed as xP. Slide from Chris Manning
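The one-step update x → xP can be checked directly; the matrix and starting vector below are toy values:

```python
import numpy as np

# Toy 2-state transition matrix (rows sum to 1).
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])

x = np.array([1.0, 0.0])   # probability vector: certainly in state 0
x_next = x @ P             # next-step distribution; here it picks out row 0 of P
```

Because x puts all its mass on state 0, xP is exactly row 0 of P, matching the slide's point that row i says where we go next from state i.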
29. Steady state example The steady state looks like a vector of probabilities a= (a1, … an):
ai is the probability that we are in state i. Slide from Chris Manning
30. How do we compute this vector? Let a= (a1, … an) denote the row vector of steady-state probabilities.
If our current position is described by a, then the next step is distributed as aP.
But a is the steady state, so a=aP.
Solving this matrix equation gives us a.
So a is the (left) eigenvector for P.
(Corresponds to the “principal” eigenvector of P with the largest eigenvalue.)
Transition probability matrices always have largest eigenvalue 1. Slide from Chris Manning
31. One way of computing a Recall, regardless of where we start, we eventually reach the steady state a.
Start with any distribution, say x = (1 0 … 0).
After one step, we’re at xP;
after two steps at xP2, then xP3 and so on.
“Eventually” means: for “large” k, xPk ≈ a.
Algorithm: multiply x by increasing powers of P until the product looks stable. Slide from Chris Manning
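The algorithm above is power iteration. A minimal sketch on a toy 2-state matrix (the tolerance and iteration cap are arbitrary choices):

```python
import numpy as np

def steady_state(P, tol=1e-10, max_iter=1000):
    """Power iteration: multiply a start distribution by P until it stops changing."""
    n = P.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                      # start distribution x = (1 0 ... 0)
    for _ in range(max_iter):
        x_next = x @ P
        if np.abs(x_next - x).sum() < tol:   # "looks stable"
            break
        x = x_next
    return x_next

# Toy ergodic 2-state chain.
P = np.array([[0.1, 0.9],
              [0.5, 0.5]])
a = steady_state(P)
```

For this P, solving a = aP by hand gives a = (5/14, 9/14), and the iteration converges to the same vector regardless of the start distribution.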
32. Pagerank summary Preprocessing:
Given graph of links, build matrix P.
From it compute a.
The entry ai is a number between 0 and 1: the pagerank of page i.
Query processing:
Retrieve pages meeting query.
Rank them by their pagerank.
Order is query-independent. Slide from Chris Manning
33. The reality Pagerank is used in Google, but is hardly the full story of ranking
Many sophisticated features are used
Some address specific query classes
Machine learned ranking heavily used Slide from Chris Manning
34. Pagerank: Issues and Variants How realistic is the random surfer model?
What if we modeled the back button?
Surfer behavior sharply skewed towards short paths
Search engines, bookmarks & directories make jumps non-random.
Biased Surfer Models
Weight edge traversal probabilities based on match with topic/query (non-uniform edge selection)
Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest) Slide from Chris Manning
35. Topic Specific Pagerank Goal – pagerank values that depend on query topic
Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule:
Selects a category (say, one of the 16 top-level ODP categories) based on a query- and user-specific distribution over the categories
Teleport to a page uniformly at random within the chosen category
Sounds hard to implement: can’t compute PageRank at query time! Slide from Chris Manning
36. Topic Specific Pagerank Offline: Compute pagerank for individual categories
Query independent as before
Each page has multiple pagerank scores – one for each ODP category, with teleportation only to that category
Online: Distribution of weights over categories computed by query context classification
Generate a dynamic pagerank score for each page - weighted sum of category-specific pageranks Slide from Chris Manning
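The online weighted sum can be sketched as follows; the per-category PageRank vectors and query weights below are made-up placeholders for the precomputed offline scores and the classifier output:

```python
# Hypothetical precomputed PageRank vectors, one per category,
# indexed by page (3 pages here).
category_pagerank = {
    "Sports": [0.50, 0.30, 0.20],
    "Health": [0.10, 0.60, 0.30],
}

# Hypothetical query-context classifier output:
# distribution of weights over categories.
weights = {"Sports": 0.7, "Health": 0.3}

n_pages = 3
# Dynamic score of each page = weighted sum of its category-specific pageranks.
dynamic_score = [
    sum(weights[c] * category_pagerank[c][p] for c in weights)
    for p in range(n_pages)
]
```

Since the weights and each category vector sum to 1, the dynamic scores again form a distribution over pages.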
37. Influencing PageRank (“Personalization”) Input:
Web graph W
Influence vector v over topics
v: (page → degree of influence)
Output:
Rank vector r: (page → page importance wrt v)
r = PR(W, v) Slide from Chris Manning
38. Non-uniform Teleportation Slide from Chris Manning
39. Interpretation of Composite Score Given a set of personalization vectors {vj}
Σj [wj · PR(W, vj)] = PR(W, Σj [wj · vj])
Given a user’s preferences over topics, express as a combination of the “basis” vectors vj Slide from Chris Manning
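This linearity can be checked numerically. The sketch below assumes a personalized-PageRank routine in which teleports follow the distribution v rather than the uniform distribution; the link matrix and weights are toy values:

```python
import numpy as np

def personalized_pagerank(P_link, v, teleport=0.10, iters=200):
    """PageRank where teleports go to distribution v instead of uniform."""
    n = P_link.shape[0]
    # Full transition matrix: follow a link w.p. 1 - teleport, else jump per v.
    P = (1 - teleport) * P_link + teleport * np.outer(np.ones(n), v)
    x = v.copy()
    for _ in range(iters):
        x = x @ P
    return x

# Toy 3-page row-stochastic link-following matrix (an assumption).
P_link = np.array([[0.0, 0.5, 0.5],
                   [1.0, 0.0, 0.0],
                   [0.5, 0.5, 0.0]])

v1 = np.array([1.0, 0.0, 0.0])   # basis vector: teleport only to page 0
v2 = np.array([0.0, 1.0, 0.0])   # basis vector: teleport only to page 1
w1, w2 = 0.6, 0.4                # user's preference weights over the two "topics"

lhs = w1 * personalized_pagerank(P_link, v1) + w2 * personalized_pagerank(P_link, v2)
rhs = personalized_pagerank(P_link, w1 * v1 + w2 * v2)
# By linearity, lhs and rhs agree: the weighted sum of basis pageranks
# equals the pagerank of the weighted basis vector.
```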
40. Interpretation Slide from Chris Manning
41. Interpretation Slide from Chris Manning
42. Interpretation Slide from Chris Manning
43. Hyperlink-Induced Topic Search (HITS) In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
Hub pages are good lists of links on a subject.
e.g., “Bob’s list of cancer-related links.”
Authority pages occur recurrently on good hubs for the subject.
Best suited for “broad topic” queries rather than for page-finding queries.
Gets at a broader slice of common opinion.
Slide from Chris Manning
44. Hubs and Authorities Thus, a good hub page for a topic points to many authoritative pages for that topic.
A good authority page for a topic is pointed to by many good hubs for that topic.
Circular definition - will turn this into an iterative computation. Slide from Chris Manning
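Turning the circular definition into an iterative computation can be sketched as a standard HITS-style power iteration on a toy base set (the L2 normalization here is one common convention):

```python
import numpy as np

def hits(adj, iters=50):
    """Iterative HITS on an adjacency matrix adj (adj[i, j] = 1 if i links to j).

    authority(p) = sum of hub scores of pages linking to p
    hub(p)       = sum of authority scores of pages p links to
    with L2 normalization after each update.
    """
    n = adj.shape[0]
    hubs = np.ones(n)
    auths = np.ones(n)
    for _ in range(iters):
        auths = adj.T @ hubs
        auths /= np.linalg.norm(auths)
        hubs = adj @ auths
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

# Toy base set: pages 0 and 1 are hubs pointing at authorities 2 and 3.
adj = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 0, 0],
                [0, 0, 0, 0]])
hubs, auths = hits(adj)
```

On this graph the iteration settles immediately: pages 0 and 1 get all the hub mass and pages 2 and 3 all the authority mass, matching the mutual reinforcement the slide describes.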
45. The hope Slide from Chris Manning
46. High-level scheme Extract from the web a base set of pages that could be good hubs or authorities.
From these, identify a small set of top hub and authority pages;
iterative algorithm. Slide from Chris Manning
47. Spam in Search Or, “Search Engine Optimization”
48. The trouble with paid search ads … It costs money. What’s the alternative?
Search Engine Optimization:
“Tuning” your web page to rank highly in the algorithmic search results for select keywords
Alternative to paying for placement
Thus, intrinsically a marketing function
Performed by companies, webmasters and consultants (“Search engine optimizers”) for their clients
Some perfectly legitimate, some very shady Slide from Chris Manning
49. Search engine optimization (Spam) Motives
Commercial, political, religious, lobbies
Promotion funded by advertising budget
Operators
Contractors (Search Engine Optimizers) for lobbies, companies
Web masters
Hosting services
Forums
E.g., Web master world ( www.webmasterworld.com )
Search engine specific tricks
Discussions about academic papers Slide from Chris Manning
50. Simplest forms First generation engines relied heavily on tf/idf
The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s
SEOs responded with dense repetitions of chosen terms
e.g., maui resort maui resort maui resort
Often, the repetitions would be in the same color as the background of the web page
Repeated terms got indexed by crawlers
But not visible to humans on browsers Slide from Chris Manning
51. Variants of keyword stuffing Misleading meta-tags, excessive repetition
Hidden text with colors, style sheet tricks, etc. Slide from Chris Manning
52. Cloaking Serve fake content to search engine spider
DNS cloaking: Switch IP address. Impersonate Slide from Chris Manning
53. More spam techniques Doorway pages
Pages optimized for a single keyword that re-direct to the real target page
Link spamming
Mutual admiration societies, hidden links, awards
Domain flooding: numerous domains that point or re-direct to a target page
Robots
Millions of submissions via Add-Url Slide from Chris Manning
54. The war against spam Quality signals - Prefer authoritative pages based on:
Votes from authors (linkage signals)
Votes from users (usage signals)
Robust link analysis
Ignore statistically implausible linkage (or text)
Use link analysis to detect spammers (guilt by association) Spam recognition by machine learning
Training set based on known spam
Family friendly filters
Linguistic analysis, general classification techniques, etc.
For images: flesh tone detectors, source text analysis, etc.
Editorial intervention
Blacklists
Top queries audited
Complaints addressed
Suspect pattern detection Slide from Chris Manning
55. More on spam Web search engines have policies on SEO practices they tolerate/block
http://help.yahoo.com/help/us/ysearch/index.html
http://www.google.com/intl/en/webmasters/
Adversarial IR: the unending (technical) battle between SEO’s and web search engines
Research http://airweb.cse.lehigh.edu/ Slide from Chris Manning
56. NY Times article on JC Penney link spam http://www.nytimes.com/2011/02/13/business/13search.html
“in the last several months, JCPenney.com [was] in the #1 spot for:
“dresses”, “bedding”, “area rugs”
“Someone paid to have thousands of links placed on hundreds of sites scattered around the Web
“2,015 pages with phrases like “casual dresses,” “evening dresses,” “little black dress” or “cocktail dress”.
The NY Times informed Google
At 7 p.m., J. C. Penney was the #1 result for “Samsonite carry on luggage.”
Two hours later, it was at No. 71
57. New Problem: Content Farms Demand Media: eHow, etc
http://www.wired.com/magazine/2009/10/ff_demandmedia/all/1
Demand Media’s “legion of low-paid writers” “pump out 4,000 videoclips and articles a day. It starts with an algorithm” based on:
Search terms (popular terms from more than 100 sources comprising 2 billion searches a day),
The ad market (a snapshot of which keywords are sought after and how much they are fetching),
The competition (what’s online already and where a term ranks in search results).
Wired on Google’s change:
http://www.wired.com/epicenter/2011/02/google-clamp-down-content-factories/
“Google updated its core ranking algorithm…to decrease the prevalence of…content farms in top search results.”
58. How to address content farms? From Google blog: “We’ve been exploring different algorithms to detect content farms, which are sites with shallow or low-quality content. One of the signals we're exploring is explicit feedback from users. To that end, today we’re launching an early, experimental Chrome extension so people can block sites from their web search results. If installed, the extension also sends blocked site information to Google, and we will study the resulting feedback and explore using it as a potential ranking signal for our search results.”