210 likes | 322 Views
Yuting Liu, Bin Gao , Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li SIGIR 2008 2009. 04.10. Summarized & presented by Babar Tareen , IDS Lab., Seoul National University. BrowseRank : letting the web users vote for page importance. Introduction.
E N D
Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li SIGIR 2008 2009. 04.10. Summarized & presented by Babar Tareen, IDS Lab., Seoul National University BrowseRank: letting the web users vote for page importance
Introduction • Page importance is a key factor for web search • Currently page importance is measured by using the link graph • HITS • PageRank • If many important pages link to a page then the page is also likely to be important
PageRank Drawbacks • Link graph is not reliable • Links can easily be created and deleted on the web • Can easily be manipulated by web spammers using link farms • PageRank does not considers the length of time which a web surfer spends on the web page
BrowseRank • Utilize user browsing graph • Generated from user behavior data • Behavior data can be recorded by Internet browsers at web clients and collected at web servers • Behavior data includes • URL • Time • Method of visiting (URL input or hyperlink click)
BrowseRank (2) More visits of the page and longer time spent on a page indicates that the page is important Uses continuous-time Markov process as model on user browsing graph Markov process is a process in which the likelihood of a given future state, at any given moment, depends only on its present state, and not on any past states
Originality Propose the use of browsing graph for computing page importance Propose the use of continuous-time Markov process to model a random walk on the user browsing graph
User Behavior Data • When user surfs on the web • Can input the URL • Choose to click on a hyperlink • Behavior data can be stored as triples • <URL, TIME, TYPE>
User Behavior Data (2) • Session Segmentation • Time Rule: If time of current record is 30 minutes behind that of previous record, then current record is considered as new session • Type Rule: If the type of the record is ‘INPUT’ we will consider it as new session • URL Pair construction • Within session, URL’s are placed in adjacent records • Indicates that the user transits from the first page to the second page
User Behavior Data (3) • Reset probability estimation • For sessions segmented by type rule, the first URL is input by the user • Assign reset probabilities to those URL’s • Staying time extraction • For each URL pair, use the time difference of second and first page as staying time • For last session either use random time [for time rule] or time difference from next session [for type rule]
User Browsing Graph 3 18 25 15 30 7 6 45 14 17 • Vertex: Represent a URL • Metadata: Reset Probabilities, Staying Time • Directed Edge: Represents Transition between pages • Edge Weight: Number of transitions
Model • Continuous-time time-homogeneous Markov Process model • Assumptions • Independence of users ad sessions • Markov Property • Time-homogeneity
Continuous-time Markov Model Xs represents page which the surfer is visiting at time s, s > 0 Continuous-time time-homogenous Markov Process Pij(t) denotes the transition probability from page i to page j for time interval t Stationary probability distribution Π unique and independent of t Computing matrix P is difficult because it is hard to get information for all time intervals Algorithm is based on
Experiments • Website-Level BrowseRank • Finding important websites and depressing spam sites • Page-Level BrowseRank • Improving relevance ranking • Dataset • 3 billion records • 950 million unique URL’s • Website Level Graph • 5.6 million vertices • 53 million edges • 40 million websites
Spam fighting 2714 websites labeled spam by human experts
Page Level Testing • Adopted 3 measures to evaluate performance • MAP • Precission (P@n) • Normalized Discounted Cummulative Gain (NDCG@n)
Technical Issues User behavior data tends to be sparse User behavior data can lead to reliable importance calculation for the head web pages, but not for the tail web pages Time homogeneity assumption is mainly for technical convenience Content information and metadata was not used in BrowseRank
Discussion Better approach to find page importance Already highlights technical issues Spammers can alter BrowseRank by sending fake user behavior data. This will be easy too as behavior data is collected from client.