VIPAS: Virtual Link Powered Authority Search in the Web Chi-Chun Lin and Ming-Syan Chen Network Database Laboratory National Taiwan University
Outline • Motivation and Goal • Preliminaries and Related work • Introduction to Link-analysis • Defects of Traditional Link-analysis and Ideas for Improvement • System Framework and Algorithms • Implementation and Experimental Results • Conclusions NTU
Motivation and Goal • To find the most relevant pages satisfying the user’s information need on the Web • Traditional means for this task • Keyword-based search engines • Problems • Some relevant pages do not contain the keywords in the page text • An alternative method • Analyze the links contained in Web pages instead of ranking by keywords
HITS (1/3) • Authority pages • A page pointed to by many other pages • Hub pages • A page pointing to many other pages • Mutual reinforcement • An authority pointed to by many hub pages is an even better authority • A hub pointing to many authority pages is an even better hub • Based on this argument, the goal of HITS is to find the set of best authority pages
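HITS (2/3) • Let $x_p$ and $y_p$ denote the authority and hub scores of page p, respectively • $x_p := \sum_{q \to p} y_q$ (sum of $y_q$ over all pages q that link to p) • $y_p := \sum_{p \to q} x_q$ (sum of $x_q$ over all pages q that p links to) • [Figure: pages q1, q2, q3 pointing to page p; page p pointing to pages q1, q2, q3]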
HITS (3/3) • Iterative algorithm • 1. Obtain a set of Web pages using a keyword-based query and expand it to form a base set • 2. Assign each page of the base set an initial authority score and hub score of 1 • 3. According to its links, update the scores of each page • 4. Normalize the scores so that $\sum_p x_p^2 = 1$ and $\sum_p y_p^2 = 1$, summing over all p in the base set • 5. Repeat steps 3 and 4 until the scores converge
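The iterative procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `hits`, the `links` dictionary format, the iteration cap, and the convergence tolerance are all assumptions, and the sketch assumes that hub scores are updated from the freshly computed authority scores, which the slides do not specify.

```python
import math


def hits(links, iterations=50, tol=1e-9):
    """Run HITS on `links`, a dict mapping each page to the pages it points to.

    Returns (authority, hub) dicts. All names here are illustrative.
    """
    out = {page: set(targets) for page, targets in links.items()}
    pages = set(out)
    for targets in out.values():
        pages.update(targets)

    # Step 2: every page starts with authority and hub score 1.
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Step 3a: x_p := sum of y_q over all q that link to p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in out.items():
            for p in targets:
                new_auth[p] += hub[q]
        # Step 3b: y_p := sum of x_q over all q that p links to
        # (assumption: uses the authority scores just computed).
        new_hub = {p: sum(new_auth[q] for q in out.get(p, ())) for p in pages}

        # Step 4: normalize so the squared scores sum to 1.
        for scores in (new_auth, new_hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm

        # Step 5: stop once the scores no longer change noticeably.
        delta = max(
            max(abs(new_auth[p] - auth[p]) for p in pages),
            max(abs(new_hub[p] - hub[p]) for p in pages),
        )
        auth, hub = new_auth, new_hub
        if delta < tol:
            break
    return auth, hub
```

For example, `hits({"h1": ["a", "b"], "h2": ["a"]})` ranks `a` above `b` as an authority, because both hubs point to `a`, and ranks `h1` above `h2` as a hub.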
The Problem with HITS • Links in Web pages only reflect page creators’ judgment • Sometimes a link will not be put in a page even though its destination is very relevant • e.g., a company’s homepage will not link to its competitors in the same industry • We argue: page readers’ consideration should be of equal importance
The Notion of Virtual Links • The basic idea • Identify pages that are heavily accessed within a period, and form a “hot set” from these pages • Create “virtual links” for pages in the hot set and incorporate them into the computation of authority scores • Design a Web warehouse for this task and utilize it to identify authoritative Web pages
System Framework • [Figure: system architecture. Components: Web Pages, Page Archive, Query Interface, Keyword & Ranking Database, Virtual Link Creator, Authority Evaluator, Clickstream Database, Clicking Observer. Data-flow labels: page content & links, keywords, virtual links, scores, query results]
Creating Virtual Links • Scenario: A user interested in Java-related Web pages comes to our system • She submits a query with the keyword “java” • Assume that the query result contains 100 URLs • She clicks the top 10 of the 100 URLs, except the 6th • The hot set consists of the 9 URLs clicked
Creating Virtual Links (cont’d) • 2 criteria • [Figure: two panels over URL 1, URL 2, URL 5, URL 6, URL 7, URL 10: virtual links from a single Virtual Hub, and virtual links from Hub 1, Hub 2, …, Hub n]
Algorithm VIPAS (Virtual Link Powered Authority Search) • Initialization Phase • For a query term, perform the regular HITS analysis • Collect a base set of pages with computed authority and hub scores and store them in the database • Virtual Link Collection Phase • Monitor user behavior to see whether each URL in the result list is clicked • After a period of observation, put URLs that are often accessed into the “hot set” • Create virtual links for pages in the hot set
Algorithm VIPAS (cont’d) • Refinement Phase • For each page in the hot set, compute its new authority and hub scores • Run several iterations of score updating for pages in the base set • 2 flavors • VIPAS-VH (VIPAS with virtual links from a Virtual Hub) • VIPAS-TH (VIPAS with virtual links from Top Hubs)
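The two flavors differ only in where the virtual links originate. The sketch below shows one way to augment a link graph for each flavor. It is an illustration under assumptions: the function name `add_virtual_links`, the `top_k` parameter, and the `__virtual_hub__` node name are hypothetical, and the virtual links are left unweighted here, whereas the slides assign weights to them.

```python
def add_virtual_links(links, hot_set, hub_scores=None, mode="VH", top_k=3,
                      virtual_hub="__virtual_hub__"):
    """Return a copy of `links` with virtual links to every hot-set page.

    mode="VH": one artificial hub node points to all hot pages.
    mode="TH": the `top_k` pages with the highest hub scores point to them.
    The input graph is not modified.
    """
    augmented = {page: set(targets) for page, targets in links.items()}
    if mode == "VH":
        augmented[virtual_hub] = set(hot_set)
    elif mode == "TH":
        if hub_scores is None:
            raise ValueError("mode 'TH' needs hub scores from the HITS run")
        top_hubs = sorted(hub_scores, key=hub_scores.get, reverse=True)[:top_k]
        for h in top_hubs:
            # Skip self-links when a top hub is itself in the hot set.
            augmented.setdefault(h, set()).update(t for t in hot_set if t != h)
    else:
        raise ValueError("mode must be 'VH' or 'TH'")
    return augmented
```

The augmented graph can then be fed back into a few more rounds of score updating, as the Refinement Phase describes.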
Finding Hot Sets • In an observing period, pay attention to clicks on consecutive URLs in the result list • When a user clicks several consecutive URLs and then skips some of the URLs that follow, we mark those that have been skipped • Exclude pages that are marked with a frequency greater than a threshold from the forming of hot sets • Among the remaining pages, those accessed by at least a threshold percentage of users are put into the hot set • Note: some relevant URLs will be skipped simply because the user has already browsed them
Finding Hot Sets (cont’d) • Example 1: http://java.sun.com/ (clicked), http://www.sun.com/java/ (clicked), http://www.javaworld.com/ (clicked), http://java.oreilly.com/ (skipped), http://www.jars.com/ (clicked), ………….. • URL 4 is marked • Example 2: http://java.sun.com/ (skipped), http://www.sun.com/java/ (clicked), http://www.javaworld.com/ (clicked), http://java.oreilly.com/ (skipped), http://www.jars.com/ (clicked), ………….. • URL 4 is marked, but URL 1 is not
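The marking rule and the two clickstream examples above can be reproduced with a short sketch. Several details are assumptions, since the slides do not pin them down: "several" consecutive clicks is taken to mean at least `min_run` (default 2), a skipped URL is marked only if the user clicks some later URL, and both thresholds are expressed as fractions of users. The names `marked_skips`, `hot_set`, `max_mark_freq`, and `min_click_share` are hypothetical.

```python
def marked_skips(clicked, min_run=2):
    """Return 0-based positions skipped right after a run of consecutive clicks.

    `clicked` is a list of booleans, one per result in list order.
    """
    marked = []
    run = 0        # length of the current run of consecutive clicks
    pending = []   # skipped positions waiting for a later click
    for i, was_clicked in enumerate(clicked):
        if was_clicked:
            marked.extend(pending)  # the skip is confirmed by a later click
            pending = []
            run += 1
        else:
            if run >= min_run or pending:
                pending.append(i)
            run = 0
    return marked


def hot_set(sessions, max_mark_freq, min_click_share):
    """Return positions that are rarely marked and clicked by enough users."""
    users = len(sessions)
    size = max(len(s) for s in sessions)
    marks = [0] * size
    clicks = [0] * size
    for session in sessions:
        for i in marked_skips(session):
            marks[i] += 1
        for i, was_clicked in enumerate(session):
            if was_clicked:
                clicks[i] += 1
    return [
        i for i in range(size)
        if marks[i] / users <= max_mark_freq
        and clicks[i] / users >= min_click_share
    ]
```

With the two example clickstreams, position 3 (URL 4) is marked in both sessions, while position 0 (URL 1) is never marked because no run of clicks precedes it.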
Assigning Weights to Virtual Links • n pages in the hot set: $t_1, t_2, \ldots, t_n$ • Clickstream 1: $(t_1, t_2, t_3, t_4, x_1, x_2)$ • Clickstream 2: $(t_3, x_1, t_1)$
Assigning Weights to Virtual Links (cont’d) • Final weight: • For period $T_i$ where $i \ge 2$ (1/3 is the degeneration factor)
Computing the New Scores • Let $x_p$ and $y_p$ denote the authority and hub scores of page p, respectively • For each page p, we update p’s authority score by • Similarly, we update p’s hub score by
User-behavior Observation • Use an ASP script • In the query result page for keyword “Java”, each plain URL such as http://java.sun.com/ is replaced by wrapper.asp?URL=http://java.sun.com/ • Example result entry: The Source of Java(TM) Technology, http://java.sun.com/ • When the user clicks the wrapped link, the script: • Increments the click count of http://java.sun.com/ • Records the time • Redirects the user to http://java.sun.com/
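The three actions of the wrapper script can be sketched as a plain function. The original system uses an ASP script; this Python version only illustrates the same logic. The function name `handle_click`, the in-memory `click_counts` and `click_log` stores, and the returned status/header pair are assumptions for illustration, and only the `URL` query parameter comes from the slide.

```python
import time
from urllib.parse import parse_qs


def handle_click(query_string, click_counts, click_log, now=None):
    """Process a request such as wrapper.asp?URL=http://java.sun.com/.

    Returns an (HTTP status, headers) pair; names are illustrative.
    """
    targets = parse_qs(query_string).get("URL")
    if not targets:
        return 400, {}  # no target URL supplied
    target = targets[0]
    timestamp = time.time() if now is None else now

    # 1. Increment the click count of the target URL.
    click_counts[target] = click_counts.get(target, 0) + 1
    # 2. Record the time of the click.
    click_log.append((target, timestamp))
    # 3. Redirect the user to the target URL.
    return 302, {"Location": target}
```

A deployed version would presumably also check that the target URL actually appears in the result list, so the wrapper cannot be abused as an open redirect.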
Implementation and Experiments • Experimental testbed • NTUEE website (http://www.ee.ntu.edu.tw/) • Data collection • 03/28/2002 to 05/31/2002 • Parameters
Evaluation Method • For a keyword, we manually select a list of authority pages and compare it with the output of each algorithm • Discrepancy coefficient
Discrepancy Coefficient – Regular HITS • R1 = 1 (SN 5633), R2 = 5 (SN 8682), R3 = 41 (SN 7228)
Discrepancy Coefficient – VIPAS-VH • R1 = 1 (SN 5633), R2 = 5 (SN 8682), R3 = 6 (SN 7228)
Evaluation Method • Grouping coefficient • Stability • The standard deviation of each algorithm’s discrepancy coefficients for all of the keywords
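The stability measure is a standard deviation per algorithm, which can be computed as in the sketch below. The function name `stability` and the input format are hypothetical, and the sketch assumes the population standard deviation; the slides do not say whether the population or the sample form was used.

```python
from statistics import pstdev


def stability(coefficients_by_algorithm):
    """Map each algorithm name to the standard deviation of its
    discrepancy coefficients across all keywords (lower = more stable)."""
    return {
        name: pstdev(values)
        for name, values in coefficients_by_algorithm.items()
    }
```

An algorithm whose discrepancy coefficient is the same for every keyword gets a stability value of 0.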
Grouping Coefficient – Regular HITS • R1 = 1 (SN 5633), R2 = 5 (SN 8682), R3 = 41 (SN 7228)
Grouping Coefficient – VIPAS-VH • R1 = 1 (SN 5633), R2 = 5 (SN 8682), R3 = 6 (SN 7228)
Conclusions • Link-analysis algorithms are popular in Web information retrieval • But they need further improvement • In our work, we built a Web warehouse • It incorporates user feedback into the identification of authoritative resources (Algorithm VIPAS) • Experimental results show that VIPAS is effective and that the warehouse retrieves more valuable information for users than regular HITS