9 Algorithms: PageRank
Ranking • After matching, have to rank:
Index Based Ranking • Strategies we could (do) use: • Frequency • Position • Metadata
Missing Ingredient • Index lacks inter-page information (how pages link to each other)
Link Quality • Ranking by number of links is easy to abuse (figure: spam link pages)
Link Quality • Not all links are equal • Who do you trust? • CS Prof • World Famous Chef
Identifying Authority • Links into a page give it authority • Page value = sum of authorities of pages linking to it
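The rule on this slide can be sketched in a few lines of Python. This is only an illustration: the link graph, the page names, and the starting authority numbers are made-up assumptions, not values from the slides.

```python
# Hypothetical link graph: each key links to the pages in its list.
links = {
    "cs_prof": ["recipe_a"],
    "chef": ["recipe_a", "recipe_b"],
    "recipe_a": [],
    "recipe_b": [],
}

# Assumed starting authority for the two linking pages (illustrative numbers).
authority = {"cs_prof": 1.0, "chef": 5.0}


def page_value(page, links, authority):
    """Sum the authority of every page that links to `page`."""
    return sum(
        authority.get(source, 0.0)
        for source, targets in links.items()
        if page in targets
    )


values = {p: page_value(p, links, authority) for p in ("recipe_a", "recipe_b")}
print(values)  # {'recipe_a': 6.0, 'recipe_b': 5.0}
```

Here recipe_a is worth more because both the chef and the CS prof link to it, and most of that value comes from the chef's higher assumed authority.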
Issues • Spam Links • Discourage with negative weight (figure: spam link pages, each link weighted -1)
Issues • Spam Links • Discourage with negative weight • Cycles:
Random Surfer • Simulating a web surfing session • Start at random page • At each page have a chance to • Pick a random link to go to • Jump to a completely random page
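A minimal sketch of such a session in Python, under stated assumptions: the function name `random_surfer`, the four-page graph, and the 15% jump chance are illustrative choices (the slides do not give a jump probability), and a page with no outgoing links is treated as a forced random jump.

```python
import random


def random_surfer(links, steps=100_000, jump_prob=0.15, seed=0):
    """Simulate one long surfing session and return each page's share of visits."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pages = list(links)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)  # start at a random page
    for _ in range(steps):
        visits[page] += 1
        outgoing = links[page]
        if not outgoing or rng.random() < jump_prob:
            page = rng.choice(pages)  # jump to a completely random page
        else:
            page = rng.choice(outgoing)  # follow a random link on this page
    return {p: visits[p] / steps for p in pages}


# Hypothetical graph: a, b, c form a cycle; d links in but nothing links to d.
links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
shares = random_surfer(links)
for page, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{page}: {share:.1%}")
```

The cycle between a, b, and c does not trap the surfer, because the random jump always offers a way out, and page d is reached only through random jumps, so it ends up with a small share.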
Results • Results of many random sessions:
Results • Expressed as percentages, results stabilize • Law of large numbers
Cycle Buster • Random surfer not fazed by cycles:
Random Surfer In Use • The recipe pages visited by random surfers:
Simulator • PageRank Simulator: http://caccio.blogdns.net/software/pagerank-simulator
The Real Math • Markov Chains • Set of states • Each state has probability of leading to other states • Represent as matrix
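The matrix view can be sketched as follows. This is an illustration only: the three-page graph, the helper names `transition_matrix` and `stationary`, the 15% jump chance, and the fixed iteration count are all assumptions, not the deck's own numbers. Each row holds the probabilities of moving from one state (page) to every other state, and repeatedly multiplying a distribution by the matrix approximates the long-run share of time spent on each page.

```python
def transition_matrix(links, pages, jump_prob=0.15):
    """Row i holds the probabilities of moving from pages[i] to each page."""
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    matrix = []
    for page in pages:
        outgoing = links[page]
        if not outgoing:
            # Assumption: a page with no links always jumps to a random page.
            row = [1 / n] * n
        else:
            row = [jump_prob / n] * n  # chance of a random jump to any page
            for target in outgoing:
                row[index[target]] += (1 - jump_prob) / len(outgoing)
        matrix.append(row)
    return matrix


def stationary(matrix, iterations=100):
    """Repeatedly apply the matrix to a uniform starting distribution."""
    n = len(matrix)
    dist = [1 / n] * n
    for _ in range(iterations):
        dist = [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return dist


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
matrix = transition_matrix(links, pages)
ranks = dict(zip(pages, stationary(matrix)))
for page in pages:
    print(f"{page}: {ranks[page]:.3f}")
```

In this sketch B ends up with the lowest value, since its only incoming link is one of A's two outgoing links.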
Excel Simulation • Three pages:
Limitations • Still have issues/room for growth • Link Spam • Context of link • Where link is on page • "Bob's recipe is terrible" vs "Bob's recipe is great" • Lack of semantic knowledge • Page's Authority should not be the same for all domains
Power • Controlling search is power: http://www.bitsbook.com/ "If you're not paying for the product, you are the product."