BrowseRank: Letting the Web Users Vote for Page Importance
Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li. SIGIR 2008.
Summarized & presented by Babar Tareen, IDS Lab., Seoul National University (2009.04.10)
Introduction • Page importance is a key factor in web search • Currently, page importance is measured using the link graph • HITS • PageRank • If many important pages link to a page, then that page is also likely to be important
PageRank Drawbacks • The link graph is not reliable • Links can easily be created and deleted on the web • It can easily be manipulated by web spammers using link farms • PageRank does not consider the length of time a web surfer spends on a web page
BrowseRank • Utilizes the user browsing graph • Generated from user behavior data • Behavior data can be recorded by Internet browsers at web clients and collected at web servers • Behavior data includes • URL • Time • Method of visiting (URL input or hyperlink click)
BrowseRank (2) • More visits to a page and longer time spent on it indicate that the page is important • Uses a continuous-time Markov process as the model on the user browsing graph • A Markov process is a process in which the likelihood of a given future state, at any given moment, depends only on its present state, not on any past states
Originality • Proposes the use of the user browsing graph for computing page importance • Proposes the use of a continuous-time Markov process to model a random walk on the user browsing graph
User Behavior Data • When a user surfs on the web, they can • Input a URL directly • Choose to click on a hyperlink • Behavior data can be stored as triples • <URL, TIME, TYPE>
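Such a triple maps directly onto a small record type. A minimal Python sketch (field and class names are illustrative, not from the paper):

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class VisitType(Enum):
    INPUT = "INPUT"  # user typed the URL into the address bar
    CLICK = "CLICK"  # user followed a hyperlink

@dataclass
class BehaviorRecord:
    url: str
    time: datetime
    type: VisitType

# one <URL, TIME, TYPE> triple
record = BehaviorRecord("http://example.com/", datetime(2008, 7, 20, 9, 30), VisitType.INPUT)
```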
User Behavior Data (2) • Session segmentation • Time rule: if the time of the current record is more than 30 minutes later than that of the previous record, the current record is treated as the start of a new session • Type rule: if the type of the record is 'INPUT', it is treated as the start of a new session • URL pair construction • Within a session, the URLs of adjacent records are put into pairs • A pair indicates that the user transited from the first page to the second page (a sketch of both steps follows below)
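A hedged sketch of both rules, reusing the BehaviorRecord and VisitType definitions from the previous sketch:

```python
from datetime import timedelta

SESSION_GAP = timedelta(minutes=30)

def segment_sessions(records):
    """Split one user's time-ordered BehaviorRecords into sessions,
    using the 30-minute time rule and the INPUT type rule."""
    sessions = []
    for rec in records:
        starts_new = (
            not sessions
            or rec.type == VisitType.INPUT                     # type rule
            or rec.time - sessions[-1][-1].time > SESSION_GAP  # time rule
        )
        if starts_new:
            sessions.append([rec])
        else:
            sessions[-1].append(rec)
    return sessions

def url_pairs(session):
    """Adjacent records within a session yield <from, to> transitions."""
    return [(a.url, b.url) for a, b in zip(session, session[1:])]
```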
User Behavior Data (3) • Reset probability estimation • For sessions segmented by the type rule, the first URL was input by the user directly • Reset probabilities are assigned to those URLs • Staying time extraction • For each URL pair, the staying time on the first page is the time difference between the second and the first record • For the last page of a session, either add a random time [time rule] or use the time difference to the first record of the next session [type rule], as sketched below
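Continuing the same sketch, a possible implementation of both extraction steps; the uniform draw is a placeholder assumption for whatever distribution the random staying time is actually sampled from:

```python
import random

def staying_times(sessions):
    """Staying time on a page = time to the next record in the session.
    For the last page of a session: if the next session begins with an
    INPUT record (type-rule boundary), use the gap to that record;
    otherwise (time-rule boundary) add a random time -- a uniform draw
    here stands in for the real sampling distribution."""
    times = {}  # url -> list of staying times (seconds)
    for k, session in enumerate(sessions):
        for a, b in zip(session, session[1:]):
            times.setdefault(a.url, []).append((b.time - a.time).total_seconds())
        last = session[-1]
        nxt = sessions[k + 1][0] if k + 1 < len(sessions) else None
        if nxt is not None and nxt.type == VisitType.INPUT:
            dt = (nxt.time - last.time).total_seconds()
        else:
            dt = random.uniform(1, 600)  # placeholder random staying time
        times.setdefault(last.url, []).append(dt)
    return times

def reset_probabilities(sessions):
    """Sessions starting with a typed-in URL contribute to the reset vector."""
    counts = {}
    for s in sessions:
        if s[0].type == VisitType.INPUT:
            counts[s[0].url] = counts.get(s[0].url, 0) + 1
    total = sum(counts.values()) or 1
    return {u: c / total for u, c in counts.items()}
```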
User Browsing Graph • [Figure: example user browsing graph; the numbers on the edges are transition counts] • Vertex: represents a URL • Metadata: reset probabilities, staying times • Directed edge: represents a transition between pages • Edge weight: number of transitions
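Given segmented sessions, the weighted graph follows directly from the URL pairs; a minimal sketch:

```python
from collections import defaultdict

def build_browsing_graph(sessions):
    """Vertices are URLs; the weight of directed edge (u, v) is the
    number of observed transitions from u to v across all sessions."""
    edge_weight = defaultdict(int)
    for session in sessions:
        for a, b in zip(session, session[1:]):
            edge_weight[(a.url, b.url)] += 1
    return edge_weight
```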
Model • Continuous-time, time-homogeneous Markov process model • Assumptions • Independence of users and sessions • Markov property • Time homogeneity
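The last two assumptions can be written formally as follows; these are the standard definitions, stated here in the notation P_ij(t) used on the next slide rather than quoted from the paper:

```latex
% Markov property: the future depends only on the current page
P(X_{t+s} = j \mid X_s = i,\ X_u \text{ for } u < s) = P(X_{t+s} = j \mid X_s = i)

% Time homogeneity: this probability depends only on the interval t, not on s
P_{ij}(t) := P(X_{t+s} = j \mid X_s = i) \quad \text{for all } s > 0
```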
Continuous-time Markov Model • X_s represents the page the surfer is visiting at time s, s > 0 • {X_s} is modeled as a continuous-time, time-homogeneous Markov process • P_ij(t) denotes the transition probability from page i to page j over a time interval of length t • The stationary probability distribution π is unique and independent of t • Computing the matrix P(t) directly is difficult because it is hard to get information for all time intervals • The algorithm is instead based on the embedded Markov chain of the process
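The workaround is to estimate the embedded (jump) chain P̃ from the edge weights and combine its stationary distribution with the mean staying times. In the standard decomposition for such processes, with π̃ the stationary distribution of P̃ and E[T_i] the mean staying time on page i:

```latex
\pi_i = \frac{\tilde{\pi}_i \, \mathbb{E}[T_i]}{\sum_j \tilde{\pi}_j \, \mathbb{E}[T_j]},
\qquad \text{where } \tilde{\pi} = \tilde{\pi}\tilde{P}
```

A minimal power-iteration sketch of this computation (P_tilde is assumed to be a row-stochastic numpy matrix; names are illustrative):

```python
import numpy as np

def browse_rank(P_tilde, mean_stay, iters=100):
    """Stationary distribution of the embedded chain via power iteration,
    reweighted by mean staying times and renormalized."""
    n = P_tilde.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = pi @ P_tilde
    score = pi * np.asarray(mean_stay)
    return score / score.sum()

# Example: 3-page toy graph
P = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.4, 0.6, 0.0]])
print(browse_rank(P, mean_stay=[10.0, 40.0, 5.0]))
```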
Experiments • Website-level BrowseRank • Finding important websites and depressing spam sites • Page-level BrowseRank • Improving relevance ranking • Dataset • 3 billion records • 950 million unique URLs • Website-level graph • 5.6 million vertices • 53 million edges • 40 million websites
Spam Fighting • 2,714 websites were labeled as spam by human experts
Page-Level Testing • Adopted three measures to evaluate performance • MAP (Mean Average Precision) • Precision (P@n) • Normalized Discounted Cumulative Gain (NDCG@n)
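For reference, the standard definition of NDCG at position n, where r(j) is the relevance grade of the result at rank j and Z_n normalizes so that the ideal ranking scores 1; this is the usual IR formulation, assumed here rather than quoted from the paper:

```latex
\mathrm{NDCG}@n = Z_n \sum_{j=1}^{n} \frac{2^{r(j)} - 1}{\log_2(1 + j)}
```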
Technical Issues • User behavior data tends to be sparse • User behavior data can lead to reliable importance calculation for head web pages, but not for tail web pages • The time-homogeneity assumption is mainly for technical convenience • Content information and metadata were not used in BrowseRank
Discussion • A better approach to finding page importance • The paper already highlights its technical issues • Spammers can alter BrowseRank by sending fake user behavior data; this may even be easy, since the behavior data is collected at the client