
BrowseRank: letting the web users vote for page importance

Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li. SIGIR 2008. Summarized & presented by Babar Tareen, IDS Lab., Seoul National University, 2009.04.10.


Presentation Transcript


  1. BrowseRank: letting the web users vote for page importance • Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li • SIGIR 2008 • 2009.04.10 • Summarized & presented by Babar Tareen, IDS Lab., Seoul National University

  2. Introduction • Page importance is a key factor for web search • Currently page importance is measured by using the link graph • HITS • PageRank • If many important pages link to a page then the page is also likely to be important
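
The link-based intuition in the last bullet can be sketched as a power iteration. This is a minimal PageRank-style sketch, not the exact formulation of any search engine; the toy graph `links`, the damping factor 0.85, and the function name `pagerank` are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a dict mapping each page to the pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its score on in equal parts to its link targets.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page (no out-links): spread its score over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical toy graph: A and B both link to C, and C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
scores = pagerank(links)
```

In this toy graph, C is linked to by two pages and ends up with the highest score, while B, which nobody links to, ends up with the lowest.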

  3. PageRank Drawbacks • Link graph is not reliable • Links can easily be created and deleted on the web • Can easily be manipulated by web spammers using link farms • PageRank does not consider the length of time that a web surfer spends on a web page

  4. BrowseRank • Utilize user browsing graph • Generated from user behavior data • Behavior data can be recorded by Internet browsers at web clients and collected at web servers • Behavior data includes • URL • Time • Method of visiting (URL input or hyperlink click)

  5. BrowseRank (2) • More visits to a page and longer time spent on it indicate that the page is important • Uses a continuous-time Markov process as the model on the user browsing graph • A Markov process is a process in which the likelihood of a given future state, at any given moment, depends only on its present state, and not on any past states
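
In symbols, the Markov property described above can be written as follows. The notation is chosen here for illustration: X_s denotes the page being visited at time s, and i, j are pages.

```latex
P\bigl(X_{s+t} = j \,\big|\, X_s = i,\ X_u = x_u \text{ for all } u < s\bigr)
  = P\bigl(X_{s+t} = j \,\big|\, X_s = i\bigr), \qquad s, t \ge 0
```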

  6. Originality • Proposes the use of the user browsing graph for computing page importance • Proposes the use of a continuous-time Markov process to model a random walk on the user browsing graph

  7. User Behavior Data • When surfing the web, a user can • Input a URL • Click on a hyperlink • Behavior data can be stored as triples • <URL, TIME, TYPE>

  8. User Behavior Data (2) • Session Segmentation • Time Rule: If the time of the current record is 30 minutes or more after that of the previous record, the current record starts a new session • Type Rule: If the type of the record is ‘INPUT’, the record starts a new session • URL Pair Construction • Within each session, the URLs of adjacent records are paired • Each pair indicates that the user transits from the first page to the second page
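
The two segmentation rules and the URL pair construction can be sketched as below. This is a minimal sketch under stated assumptions: records belong to a single user and are already sorted by time, the type label `"CLICK"` and the function names are hypothetical, and a gap of exactly 30 minutes is treated as starting a new session.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # threshold taken from the time rule


def segment_sessions(records):
    """Split one user's time-ordered (url, time, type) records into sessions."""
    sessions = []
    for url, time, visit_type in records:
        starts_new = (
            not sessions                                    # very first record
            or visit_type == "INPUT"                        # type rule
            or time - sessions[-1][-1][1] >= SESSION_GAP    # time rule
        )
        if starts_new:
            sessions.append([])
        sessions[-1].append((url, time, visit_type))
    return sessions


def url_pairs(session):
    """Pair the URLs of adjacent records as (from_page, to_page)."""
    urls = [url for url, _, _ in session]
    return list(zip(urls, urls[1:]))


# Hypothetical records for one user.
t0 = datetime(2009, 4, 10, 9, 0)
records = [
    ("a.com", t0, "INPUT"),
    ("a.com/news", t0 + timedelta(minutes=2), "CLICK"),
    ("b.com", t0 + timedelta(minutes=5), "INPUT"),
    ("b.com/x", t0 + timedelta(minutes=50), "CLICK"),
]
sessions = segment_sessions(records)
pairs = url_pairs(sessions[0])
```

Here the third record starts a session by the type rule and the fourth by the time rule, so only the first session yields a URL pair.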

  9. User Behavior Data (3) • Reset probability estimation • For sessions segmented by the type rule, the first URL is input by the user • Reset probabilities are assigned to those URLs • Staying time extraction • For each URL pair, the time difference between the second and the first page is used as the staying time of the first page • For the last page of a session, use either a random time [for time rule] or the time difference from the next session [for type rule]
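
A rough sketch of both estimates follows. It assumes sessions are lists of `(url, time_in_seconds, type)` records, and that reset probabilities are obtained by normalizing how often each URL opens an `INPUT`-started session; the slide only says that probabilities are assigned, so the normalization is an assumption. The heuristics for the last page of a session are left out, and all names are hypothetical.

```python
from collections import Counter


def reset_probabilities(sessions):
    """Normalize how often each URL is the first page of an INPUT-started session."""
    starts = Counter(s[0][0] for s in sessions if s[0][2] == "INPUT")
    total = sum(starts.values())
    return {url: count / total for url, count in starts.items()}


def staying_times(session):
    """Staying time of every page except the last, from adjacent timestamps."""
    return [
        (url, later - earlier)
        for (url, earlier, _), (_, later, _) in zip(session, session[1:])
    ]


# Hypothetical sessions; times are seconds since an arbitrary origin.
sessions = [
    [("a.com", 0, "INPUT"), ("a.com/news", 40, "CLICK"), ("a.com/sport", 100, "CLICK")],
    [("b.com", 500, "INPUT")],
    [("a.com", 900, "INPUT"), ("c.com", 930, "CLICK")],
]
reset = reset_probabilities(sessions)
stays = staying_times(sessions[0])
```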

  10. User Browsing Graph [Figure: example user browsing graph with numeric edge weights] • Vertex: Represents a URL • Metadata: Reset Probabilities, Staying Time • Directed Edge: Represents a transition between pages • Edge Weight: Number of transitions
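
One possible in-memory representation of this graph is sketched below: edge weights as a counter over `(source, destination)` pairs and observed staying times per vertex. The record layout `(url, time_in_seconds, type)` and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_browsing_graph(sessions):
    """Aggregate sessions into edge weights and per-page staying times."""
    edge_weight = Counter()      # (src, dst) -> number of observed transitions
    stay = defaultdict(list)     # src -> list of observed staying times
    for session in sessions:
        for (src, t_src, _), (dst, t_dst, _) in zip(session, session[1:]):
            edge_weight[(src, dst)] += 1
            stay[src].append(t_dst - t_src)
    return edge_weight, dict(stay)


# Hypothetical sessions; times are seconds since an arbitrary origin.
sessions = [
    [("a.com", 0, "INPUT"), ("b.com", 30, "CLICK"), ("c.com", 45, "CLICK")],
    [("a.com", 1000, "INPUT"), ("b.com", 1010, "CLICK")],
]
edge_weight, stay = build_browsing_graph(sessions)
```

In this example the edge from `a.com` to `b.com` gets weight 2, because the transition was observed in both sessions.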

  11. Model • Continuous-time time-homogeneous Markov Process model • Assumptions • Independence of users and sessions • Markov Property • Time-homogeneity

  12. Continuous-time Markov Model • X_s represents the page which the surfer is visiting at time s, s > 0 • Continuous-time time-homogeneous Markov process • P_ij(t) denotes the transition probability from page i to page j for time interval t • The stationary probability distribution Π is unique and independent of t • Computing the matrix P is difficult because it is hard to get information for all time intervals • Algorithm is based on
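
One way to avoid computing P(t) for every interval is to use a standard property of continuous-time Markov chains: the long-run fraction of time spent on a page is proportional to how often the page is visited times the mean time spent per visit. The sketch below follows that idea and is not necessarily the formula the slide refers to. The mixing weight `alpha`, the use of the plain sample mean as the staying time, and all names are assumptions; `reset` is expected to sum to 1 and `mean_stay` to cover every page.

```python
def browserank_sketch(edge_weight, reset, mean_stay, alpha=0.85, iterations=100):
    """Visit frequencies of the jump chain, reweighted by mean staying time."""
    pages = sorted(set(reset) | set(mean_stay) | {p for e in edge_weight for p in e})
    out_total = {p: 0 for p in pages}
    for (src, _), w in edge_weight.items():
        out_total[src] += w

    visit = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Probability mass that jumps according to the reset distribution;
        # pages without outgoing transitions always reset.
        reset_mass = sum(
            visit[p] * ((1.0 - alpha) if out_total[p] else 1.0) for p in pages
        )
        nxt = {p: reset_mass * reset.get(p, 0.0) for p in pages}
        # Remaining mass follows observed transitions in proportion to edge weight.
        for (src, dst), w in edge_weight.items():
            nxt[dst] += visit[src] * alpha * w / out_total[src]
        visit = nxt

    # Time share is proportional to visit frequency times mean staying time.
    scores = {p: visit[p] * mean_stay[p] for p in pages}
    total = sum(scores.values())
    return {p: s / total for p, s in scores.items()}


# Hypothetical toy input.
edge_weight = {("a", "b"): 3, ("b", "a"): 1, ("b", "c"): 1, ("c", "a"): 2}
reset = {"a": 1.0}
mean_stay = {"a": 10.0, "b": 30.0, "c": 5.0}
scores = browserank_sketch(edge_weight, reset, mean_stay)
```

In this toy input, page `a` is visited most often, but `b` receives the highest score because users stay on it three times as long.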

  13. Algorithm

  14. Experiments • Website-Level BrowseRank • Finding important websites and demoting spam sites • Page-Level BrowseRank • Improving relevance ranking • Dataset • 3 billion records • 950 million unique URLs • Website Level Graph • 5.6 million vertices • 53 million edges • 40 million websites

  15. Top-20 Websites

  16. Spam fighting 2714 websites labeled spam by human experts

  17. Page Level Testing • Adopted 3 measures to evaluate performance • Mean Average Precision (MAP) • Precision (P@n) • Normalized Discounted Cumulative Gain (NDCG@n)
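
The three measures can be sketched for a single query as below; MAP is then the mean of the average precision over all queries. The input is a list of graded relevance labels in ranked order. The gain function `2**r - 1` with a `log2` discount is one common NDCG variant and is an assumption here, as is the simplification that every relevant document appears somewhere in the ranked list.

```python
import math


def precision_at_n(relevance, n):
    """Fraction of the top-n results that are relevant (label > 0)."""
    return sum(1 for r in relevance[:n] if r > 0) / n


def average_precision(relevance):
    """Mean of the precision values taken at the rank of each relevant result."""
    hits, total = 0, 0.0
    for rank, r in enumerate(relevance, start=1):
        if r > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def ndcg_at_n(relevance, n):
    """DCG of the top-n results divided by the DCG of the ideal ordering."""
    def dcg(labels):
        return sum(
            (2 ** r - 1) / math.log2(i + 1) for i, r in enumerate(labels, start=1)
        )
    ideal = dcg(sorted(relevance, reverse=True)[:n])
    return dcg(relevance[:n]) / ideal if ideal else 0.0


# Hypothetical relevance labels of one ranked result list (2 = highly relevant).
relevance = [2, 0, 1, 0]
p_at_2 = precision_at_n(relevance, 2)
ap = average_precision(relevance)
ndcg_at_3 = ndcg_at_n(relevance, 3)
```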

  18. Results (1)

  19. Results (2)

  20. Technical Issues • User behavior data tends to be sparse • User behavior data can lead to reliable importance calculation for the head web pages, but not for the tail web pages • Time-homogeneity assumption is mainly for technical convenience • Content information and metadata were not used in BrowseRank

  21. Discussion • A better approach to finding page importance • The paper itself already highlights its technical issues • Spammers can alter BrowseRank by sending fake user behavior data; this would be easy, since behavior data is collected from clients
