Outline
Anchor text
Background on networks
Bibliometric (citation) networks
Social networks
Link analysis for ranking: PageRank, HITS
Search Engine Optimization
1. CS 124/LINGUIST 180: From Languages to Information
Lecture 18: Networks part I: Link Analysis, PageRank
Dan Jurafsky
2. Slide from Chris Manning
3. The Web as a Directed Graph Assumption 1: A hyperlink between pages denotes author-perceived relevance (a quality signal). Slide from Chris Manning
4. Anchor Text WWW Worm - McBryan [Mcbr94]. For the query ibm, how do we distinguish between:
IBM’s home page (mostly graphical)
IBM’s copyright page (high term freq. for ‘ibm’)
Rival’s spam page (arbitrarily high term freq.) Slide from Chris Manning
5. Indexing anchor text When indexing a document D, include anchor text from links pointing to D. Slide from Chris Manning
6. Indexing anchor text Can sometimes have unexpected side effects –
like what?
Can score anchor text with weight depending on the authority of the anchor page’s website
E.g., if we were to assume that content from cnn.com or yahoo.com is authoritative, then trust the anchor text from them. Slide from Chris Manning
7. Anchor Text Other applications
Weighting/filtering links in the graph
Generating page descriptions from anchor text Slide from Chris Manning
8. Roots of Web Link Analysis Bibliometrics
Social network analysis Slide from Chris Manning
9. Citation Analysis: Impact Factor Developed by Garfield in 1972 to measure the importance (quality, influence) of scientific journals.
Measure of how often papers in the journal are cited by other scientists.
Computed and published annually by the Institute for Scientific Information (ISI).
The impact factor of a journal J in year Y is the average number of citations (from indexed documents published in year Y) to papers published in J in year Y-1 or Y-2.
Does not account for the quality of the citing article. Slide from Ray Mooney
10. Citations vs. Links Web links are a bit different than citations:
Many links are navigational.
Many pages with high in-degree are portals not content providers.
Not all links are endorsements.
Company websites don’t point to their competitors.
Citation of relevant literature is enforced by peer review. Slide from Ray Mooney
11. Social network analysis Social network analysis is the study of social entities (actors, e.g. people in an organization) and their interactions and relationships.
The interactions and relationships can be represented with a network or graph,
each vertex (or node) represents an actor and
each link represents a relationship. CS583, Bing Liu, UIC
12. Centrality Important or prominent actors are those that are linked or involved with other actors extensively.
A person with extensive contacts (links) or communications with many other people in the organization is considered more important than a person with relatively fewer contacts.
The links can also be called ties. A central actor is one involved in many ties. CS583, Bing Liu, UIC
13. Prestige Prestige is a more refined measure of prominence of an actor than centrality.
Distinguish: ties sent (out-links) and ties received (in-links).
A prestigious actor is one who is the object of extensive ties as a recipient.
To compute prestige, we use only in-links.
Difference between centrality and prestige:
centrality focuses on out-links
prestige focuses on in-links.
PageRank is based on prestige. CS583, Bing Liu, UIC
14. Drawing on the citation work First attempt to do link analysis Slide from Chris Manning
15. Query-independent ordering First generation: using link counts as simple measures of popularity.
Two basic suggestions:
Undirected popularity:
Each page gets a score = the number of in-links plus the number of out-links (e.g., 3 + 2 = 5 for a page with 3 in-links and 2 out-links).
Directed popularity:
Score of a page = number of its in-links (3 in the same example). Slide from Chris Manning
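Both first-generation scores are just degree counts. A minimal sketch, using a hypothetical six-page graph chosen so the scored page has 3 in-links and 2 out-links as in the slide's example:

```python
# Hypothetical toy web graph: page -> pages it links to.
links = {
    "A": ["X"],
    "B": ["X"],
    "C": ["X"],
    "X": ["D", "E"],
    "D": [],
    "E": [],
}

def in_degree(page):
    """Number of in-links: pages whose out-link lists contain `page`."""
    return sum(page in outs for outs in links.values())

def out_degree(page):
    """Number of out-links from `page`."""
    return len(links[page])

# Undirected popularity: in-links plus out-links.
undirected_score = in_degree("X") + out_degree("X")   # 3 + 2 = 5
# Directed popularity: in-links only.
directed_score = in_degree("X")                       # 3
```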
16. Query processing First retrieve all pages meeting the text query (say venture capital).
Order these by their link popularity (either variant on the previous slide).
More nuanced – use link counts as a measure of static goodness, combined with text match score
Slide from Chris Manning
17. Spamming simple popularity Exercise: How do you spam each of the following heuristics so your page gets a high score?
Each page gets a static score = the number of in-links plus the number of out-links.
Static score of a page = number of its in-links.
Slide from Chris Manning
18. Intuition of PageRank
19. Pagerank scoring Imagine a browser doing a random walk on web pages:
Start at a random page
At each step, go out of the current page along one of the links on that page, equiprobably
“In the steady state” each page has a long-term visit rate - use this as the page’s score. Slide from Chris Manning
20. Not quite enough The web is full of dead-ends.
Random walk can get stuck in dead-ends.
Makes no sense to talk about long-term visit rates. Slide from Chris Manning
21. Teleporting At a dead end, jump to a random web page.
At any non-dead end, with probability 10%, jump to a random web page.
With remaining probability (90%), go out on a random link.
10% is a parameter. Slide from Chris Manning
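The teleporting rule can be sketched as a transition-matrix construction. This is a minimal sketch assuming a 0/1 adjacency matrix and the 10% teleport parameter from the slide; the three-page graph is made up:

```python
import numpy as np

def transition_matrix(adj, teleport=0.10):
    """Build the teleporting random-walk matrix from a 0/1 adjacency matrix.

    Dead ends (rows with no out-links) jump to a uniformly random page;
    other rows follow a random out-link with probability 1 - teleport
    and jump to a uniformly random page with probability teleport.
    """
    n = adj.shape[0]
    P = np.zeros((n, n))
    for i in range(n):
        out = adj[i].sum()
        if out == 0:                                  # dead end: always teleport
            P[i] = np.ones(n) / n
        else:
            P[i] = (1 - teleport) * adj[i] / out + teleport / n
    return P

# Toy 3-page web: page 2 is a dead end.
adj = np.array([[0, 1, 1],
                [1, 0, 1],
                [0, 0, 0]])
P = transition_matrix(adj)
```

Every row of P sums to 1, so P is a valid transition probability matrix; this is the matrix used in the Markov-chain slides that follow.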
22. Result of teleporting Now cannot get stuck locally.
There is a long-term rate at which any page is visited (not obvious, will show this).
How do we compute this visit rate? Slide from Chris Manning
23. Markov chains A Markov chain consists of n states, plus an n×n transition probability matrix P.
At each step, we are in exactly one of the states.
For 1 ≤ i, j ≤ n, the matrix entry Pij tells us the probability of j being the next state, given we are currently in state i. Slide from Chris Manning
24. Markov chains Clearly, for all i, Σj Pij = 1 (each row of P sums to one).
Markov chains are abstractions of random walks.
Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain. Slide from Chris Manning
25. Ergodic Markov chains A Markov chain is ergodic if
you have a path from any state to any other
For any start state, after a finite transient time T0, the probability of being in any state at a fixed time T>T0 is nonzero. Slide from Chris Manning
26. Ergodic Markov chains For any ergodic Markov chain, there is a unique long-term visit rate for each state.
Steady-state probability distribution.
Over a long time-period, we visit each state in proportion to this rate.
It doesn’t matter where we start. Slide from Chris Manning
27. Probability vectors A probability (row) vector x= (x1, … xn) tells us where the walk is at any point.
E.g., (000…1…000) means we’re in state i. Slide from Chris Manning
28. Change in probability vector If the probability vector is x= (x1, … xn) at this step, what is it at the next step?
Recall that row i of the transition prob. matrix P tells us where we go next from state i.
So from x, our next state is distributed as xP. Slide from Chris Manning
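The one-step update x → xP can be checked directly; the matrix and starting vector below are toy values:

```python
import numpy as np

# Toy 2-state transition matrix (rows sum to 1).
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])

x = np.array([1.0, 0.0])   # probability vector: certainly in state 0
x_next = x @ P             # next-step distribution; here it picks out row 0 of P
```

Because x puts all its mass on state 0, xP is exactly row 0 of P, matching the slide's point that row i says where we go next from state i.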
29. Steady state example The steady state looks like a vector of probabilities a= (a1, … an):
ai is the probability that we are in state i. Slide from Chris Manning
30. How do we compute this vector? Let a= (a1, … an) denote the row vector of steady-state probabilities.
If our current position is described by a, then the next step is distributed as aP.
But a is the steady state, so a=aP.
Solving this matrix equation gives us a.
So a is the (left) eigenvector for P.
(Corresponds to the “principal” eigenvector of P with the largest eigenvalue.)
Transition probability matrices always have largest eigenvalue 1. Slide from Chris Manning
31. One way of computing a Recall, regardless of where we start, we eventually reach the steady state a.
Start with any distribution, say x = (1 0 … 0).
After one step, we’re at xP;
after two steps at xP2, then xP3 and so on.
“Eventually” means: for “large” k, xPk ≈ a.
Algorithm: multiply x by increasing powers of P until the product looks stable. Slide from Chris Manning
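The algorithm above is power iteration. A minimal sketch on a toy 2-state matrix (the tolerance and iteration cap are arbitrary choices):

```python
import numpy as np

def steady_state(P, tol=1e-10, max_iter=1000):
    """Power iteration: multiply a start distribution by P until it stops changing."""
    n = P.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                      # start distribution x = (1 0 ... 0)
    for _ in range(max_iter):
        x_next = x @ P
        if np.abs(x_next - x).sum() < tol:   # "looks stable"
            break
        x = x_next
    return x_next

# Toy ergodic 2-state chain.
P = np.array([[0.1, 0.9],
              [0.5, 0.5]])
a = steady_state(P)
```

For this P, solving a = aP by hand gives a = (5/14, 9/14), and the iteration converges to the same vector regardless of the start distribution.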
32. Pagerank summary Preprocessing:
Given graph of links, build matrix P.
From it compute a.
The entry ai is a number between 0 and 1: the pagerank of page i.
Query processing:
Retrieve pages meeting query.
Rank them by their pagerank.
Order is query-independent. Slide from Chris Manning
33. The reality Pagerank is used in Google, but is hardly the full story of ranking
Many sophisticated features are used
Some address specific query classes
Machine learned ranking heavily used Slide from Chris Manning
34. Pagerank: Issues and Variants How realistic is the random surfer model?
What if we modeled the back button?
Surfer behavior sharply skewed towards short paths
Search engines, bookmarks & directories make jumps non-random.
Biased Surfer Models
Weight edge traversal probabilities based on match with topic/query (non-uniform edge selection)
Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest) Slide from Chris Manning
35. Topic Specific Pagerank Goal – pagerank values that depend on query topic
Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule:
Selects a category (say, one of the 16 top-level ODP categories) based on a query- and user-specific distribution over the categories
Teleport to a page uniformly at random within the chosen category
Sounds hard to implement: can’t compute PageRank at query time! Slide from Chris Manning
36. Topic Specific Pagerank Offline: Compute pagerank for individual categories
Query independent as before
Each page has multiple pagerank scores – one for each ODP category, with teleportation only to that category
Online: Distribution of weights over categories computed by query context classification
Generate a dynamic pagerank score for each page - weighted sum of category-specific pageranks Slide from Chris Manning
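The online weighted sum can be sketched as follows; the per-category PageRank vectors and query weights below are made-up placeholders for the precomputed offline scores and the classifier output:

```python
# Hypothetical precomputed PageRank vectors, one per category,
# indexed by page (3 pages here).
category_pagerank = {
    "Sports": [0.50, 0.30, 0.20],
    "Health": [0.10, 0.60, 0.30],
}

# Hypothetical query-context classifier output:
# distribution of weights over categories.
weights = {"Sports": 0.7, "Health": 0.3}

n_pages = 3
# Dynamic score of each page = weighted sum of its category-specific pageranks.
dynamic_score = [
    sum(weights[c] * category_pagerank[c][p] for c in weights)
    for p in range(n_pages)
]
```

Since the weights and each category vector sum to 1, the dynamic scores again form a distribution over pages.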
37. Influencing PageRank (“Personalization”) Input:
Web graph W
Influence vector v over topics
v: (page → degree of influence)
Output:
Rank vector r: (page → page importance wrt v)
r = PR(W, v) Slide from Chris Manning
38. Non-uniform Teleportation Slide from Chris Manning
39. Interpretation of Composite Score Given a set of personalization vectors {vj}
Σj [wj · PR(W, vj)] = PR(W, Σj [wj · vj])
Given a user’s preferences over topics, express as a combination of the “basis” vectors vj Slide from Chris Manning
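This linearity can be checked numerically. The sketch below assumes a personalized-PageRank routine in which teleports follow the distribution v rather than the uniform distribution; the link matrix and weights are toy values:

```python
import numpy as np

def personalized_pagerank(P_link, v, teleport=0.10, iters=200):
    """PageRank where teleports go to distribution v instead of uniform."""
    n = P_link.shape[0]
    # Full transition matrix: follow a link w.p. 1 - teleport, else jump per v.
    P = (1 - teleport) * P_link + teleport * np.outer(np.ones(n), v)
    x = v.copy()
    for _ in range(iters):
        x = x @ P
    return x

# Toy 3-page row-stochastic link-following matrix (an assumption).
P_link = np.array([[0.0, 0.5, 0.5],
                   [1.0, 0.0, 0.0],
                   [0.5, 0.5, 0.0]])

v1 = np.array([1.0, 0.0, 0.0])   # basis vector: teleport only to page 0
v2 = np.array([0.0, 1.0, 0.0])   # basis vector: teleport only to page 1
w1, w2 = 0.6, 0.4                # user's preference weights over the two "topics"

lhs = w1 * personalized_pagerank(P_link, v1) + w2 * personalized_pagerank(P_link, v2)
rhs = personalized_pagerank(P_link, w1 * v1 + w2 * v2)
# By linearity, lhs and rhs agree: the weighted sum of basis pageranks
# equals the pagerank of the weighted basis vector.
```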
40. Interpretation Slide from Chris Manning
41. Interpretation Slide from Chris Manning
42. Interpretation Slide from Chris Manning
43. Hyperlink-Induced Topic Search (HITS) In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
Hub pages are good lists of links on a subject.
e.g., “Bob’s list of cancer-related links.”
Authority pages occur recurrently on good hubs for the subject.
Best suited for “broad topic” queries rather than for page-finding queries.
Gets at a broader slice of common opinion.
Slide from Chris Manning
44. Hubs and Authorities Thus, a good hub page for a topic points to many authoritative pages for that topic.
A good authority page for a topic is pointed to by many good hubs for that topic.
Circular definition - will turn this into an iterative computation. Slide from Chris Manning
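Turning the circular definition into an iterative computation can be sketched as a standard HITS-style power iteration on a toy base set (the L2 normalization here is one common convention):

```python
import numpy as np

def hits(adj, iters=50):
    """Iterative HITS on an adjacency matrix adj (adj[i, j] = 1 if i links to j).

    authority(p) = sum of hub scores of pages linking to p
    hub(p)       = sum of authority scores of pages p links to
    with L2 normalization after each update.
    """
    n = adj.shape[0]
    hubs = np.ones(n)
    auths = np.ones(n)
    for _ in range(iters):
        auths = adj.T @ hubs
        auths /= np.linalg.norm(auths)
        hubs = adj @ auths
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

# Toy base set: pages 0 and 1 are hubs pointing at authorities 2 and 3.
adj = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 0, 0],
                [0, 0, 0, 0]])
hubs, auths = hits(adj)
```

On this graph the iteration settles immediately: pages 0 and 1 get all the hub mass and pages 2 and 3 all the authority mass, matching the mutual reinforcement the slide describes.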
45. The hope Slide from Chris Manning
46. High-level scheme Extract from the web a base set of pages that could be good hubs or authorities.
From these, identify a small set of top hub and authority pages;
iterative algorithm. Slide from Chris Manning
47. Spam in Search Or, “Search Engine Optimization”
48. The trouble with paid search ads … It costs money. What’s the alternative?
Search Engine Optimization:
“Tuning” your web page to rank highly in the algorithmic search results for select keywords
Alternative to paying for placement
Thus, intrinsically a marketing function
Performed by companies, webmasters and consultants (“Search engine optimizers”) for their clients
Some perfectly legitimate, some very shady Slide from Chris Manning
49. Search engine optimization (Spam) Motives
Commercial, political, religious, lobbies
Promotion funded by advertising budget
Operators
Contractors (Search Engine Optimizers) for lobbies, companies
Web masters
Hosting services
Forums
E.g., Web master world ( www.webmasterworld.com )
Search engine specific tricks
Discussions about academic papers Slide from Chris Manning
50. Simplest forms First generation engines relied heavily on tf/idf
The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s
SEOs responded with dense repetitions of chosen terms
e.g., maui resort maui resort maui resort
Often, the repetitions would be in the same color as the background of the web page
Repeated terms got indexed by crawlers
But not visible to humans on browsers Slide from Chris Manning
51. Variants of keyword stuffing Misleading meta-tags, excessive repetition
Hidden text with colors, style sheet tricks, etc. Slide from Chris Manning
52. Cloaking Serve fake content to search engine spider
DNS cloaking: Switch IP address. Impersonate Slide from Chris Manning
53. More spam techniques Doorway pages
Pages optimized for a single keyword that re-direct to the real target page
Link spamming
Mutual admiration societies, hidden links, awards
Domain flooding: numerous domains that point or re-direct to a target page
Robots
Millions of submissions via Add-Url Slide from Chris Manning
54. The war against spam Quality signals - Prefer authoritative pages based on:
Votes from authors (linkage signals)
Votes from users (usage signals)
Robust link analysis
Ignore statistically implausible linkage (or text)
Use link analysis to detect spammers (guilt by association) Spam recognition by machine learning
Training set based on known spam
Family friendly filters
Linguistic analysis, general classification techniques, etc.
For images: flesh tone detectors, source text analysis, etc.
Editorial intervention
Blacklists
Top queries audited
Complaints addressed
Suspect pattern detection Slide from Chris Manning
55. More on spam Web search engines have policies on SEO practices they tolerate/block
http://help.yahoo.com/help/us/ysearch/index.html
http://www.google.com/intl/en/webmasters/
Adversarial IR: the unending (technical) battle between SEO’s and web search engines
Research http://airweb.cse.lehigh.edu/ Slide from Chris Manning
56. NY Times article on JC Penney link spam http://www.nytimes.com/2011/02/13/business/13search.html
“in the last several months, JCPenney.com [was] in the #1 spot for:
“dresses”, “bedding”, “area rugs”
“Someone paid to have thousands of links placed on hundreds of sites scattered around the Web
“2,015 pages with phrases like “casual dresses,” “evening dresses,” “little black dress” or “cocktail dress”.
The NY Times informed Google
At 7 p.m., J. C. Penney was the #1 result for “Samsonite carry on luggage.”
Two hours later, it was at No. 71
57. New Problem: Content Farms Demand Media: eHow, etc
http://www.wired.com/magazine/2009/10/ff_demandmedia/all/1
Demand Media’s “legion of low-paid writers” “pump out 4,000 videoclips and articles a day. It starts with an algorithm” based on:
Search terms (popular terms from more than 100 sources comprising 2 billion searches a day),
The ad market (a snapshot of which keywords are sought after and how much they are fetching),
The competition (what’s online already and where a term ranks in search results).
Wired on Google’s change:
http://www.wired.com/epicenter/2011/02/google-clamp-down-content-factories/
“Google updated its core ranking algorithm…to decrease the prevalence of…content farms in top search results.”
58. How to address content farms? From Google blog: “We’ve been exploring different algorithms to detect content farms, which are sites with shallow or low-quality content. One of the signals we're exploring is explicit feedback from users. To that end, today we’re launching an early, experimental Chrome extension so people can block sites from their web search results. If installed, the extension also sends blocked site information to Google, and we will study the resulting feedback and explore using it as a potential ranking signal for our search results.”