Answering Relationship Queries on the Web • Gang Luo, Chunqiang Tang and Ying-li Tian • IBM T.J. Watson Research Center • WWW 2007
Motivation • Relationship between people • Dr. John Robert Schrieffer: • A professor at Florida State University • Nobel Prize laureate in physics • He is invited to a party. • Mr. Glenn Klausman: • Florida attorney practicing personal injury law • Glenn plans to attend the party. • He would like to chat with Dr. Schrieffer. • A relationship query (RQ) asks for the relationships between two or more entities.
Challenge • Web pages are unstructured documents • Large amount of "noise" (i.e., irrelevant information) in the Web pages • How to capture potential connecting terms between E1 and E2 • How to compute term weights based on the characteristics of the two Web page sets
Answering Relationship Query • Searchers may not be able to find desired relationships between E1 and E2: • (1) Retrieved pages do not contain any desired relationship • (2) No Web page mentions both E1 and E2 and their relationship • (3) Web pages may either (a) mention some relationships, or (b) just happen to incidentally mention both E1 and E2 • (4) No desired relationship exists between E1 and E2
Answering RQ – User Interface (1/2)
Answering RQ - Step 1 • Step 1: Obtaining Web Pages • Each entity Ei (i = 1, 2) is used as a query keyword to retrieve the URLs of the top Mi Web pages. • For each URL, the corresponding Web page is retrieved from the Web. • M1 = M2 = 50
Answering RQ - Step 2 • Step 2: Document Pre-processing • Operation 1: All HTML comments, JavaScript code, tags, and non-alphabetic characters are removed. • Operation 2: Words are stemmed with the Porter stemmer. • Operation 3: Stopwords are removed using the SMART stopword list. • Operation 4: Windowing is applied around each query keyword in Ki to obtain potential connecting terms, where Ki is the set of keywords of entity Ei. • W: window size (25~35) • They assume that the most useful information is typically centered around query keywords and use windowing to obtain this information. • Operation 4 attempts to strike a balance between noise reduction and omission of useful information.
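The pre-processing operations above can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the function names, the tiny stopword set (a stand-in for the SMART list), and the reading of W as "W token positions on each side of a keyword" are all assumptions. Stemming (Operation 2) is left out to keep the sketch dependency-free.

```python
import re

# Tiny stand-in stopword set (assumption); the slides use the much larger SMART list.
STOPWORDS = {"a", "an", "the", "of", "in", "at", "is", "and", "to"}

def preprocess(html):
    """Operations 1 and 3: strip markup and non-alphabetic characters, drop stopwords."""
    text = re.sub(r"<!--.*?-->", " ", html, flags=re.S)                     # HTML comments
    text = re.sub(r"<script\b.*?</script>", " ", text, flags=re.S | re.I)   # JavaScript code
    text = re.sub(r"<[^>]+>", " ", text)                                    # remaining tags
    text = re.sub(r"[^A-Za-z]+", " ", text).lower()                         # non-alphabetic characters
    return [word for word in text.split() if word not in STOPWORDS]

def window_terms(tokens, keywords, w):
    """Operation 4 (assumed form): keep tokens within w positions of any query keyword."""
    keep = set()
    for i, token in enumerate(tokens):
        if token in keywords:
            keep.update(range(max(0, i - w), min(len(tokens), i + w + 1)))
    return [tokens[i] for i in sorted(keep)]

page = ("<html><!-- c --><script>var x=1;</script><p>Dr. Schrieffer is a professor "
        "at Florida State University, 1972 Nobel laureate.</p></html>")
tokens = preprocess(page)
near_keyword = window_terms(tokens, {"schrieffer"}, 2)
```

A small window discards more noise but may also drop useful terms, which is the trade-off Operation 4 is described as balancing.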
Answering RQ - Step 3 (1/3) • Step 3: Computing Similarity Values • For each pair of Web pages (P1, P2), where P1 ∈ S1 and P2 ∈ S2, they compute a similarity value. • For each connecting term t that appears in both P1 and P2, they compute a term weight. • The weight reflects the likelihood that t captures the relationship between E1 and E2. • The Okapi formula is used to compute both the term weights and the similarity values of Web page pairs.
Answering RQ - Step 3 (2/3) • Okapi formula - Q: query, S: document set • For each term t in the vocabulary and each document D ∈ S, Okapi uses the following formulas: • qtf: t's frequency in Q • N: total number of documents in S • df: the number of documents in S that contain t • dl: length of D in bytes • avdl: the average document length in bytes
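For orientation, a commonly used form of the Okapi (BM25) weighting can be sketched in Python using the variables listed above. This is a sketch under assumptions: the slide's own equations are not reproduced in this transcript, so the exact formulas, the constants k1, b and k3, and the extra variable tf (t's frequency in D) are taken from the standard formulation rather than from the slides.

```python
import math

def okapi_term_weight(tf, qtf, n_docs, df, dl, avdl, k1=1.2, b=0.75, k3=1000.0):
    """Weight of one term for one (document, query) pair in the standard Okapi form."""
    w_tf = (k1 + 1) * tf / (k1 * ((1 - b) + b * dl / avdl) + tf)   # length-normalized term frequency
    w_idf = math.log((n_docs - df + 0.5) / (df + 0.5))             # inverse document frequency
    w_qtf = (k3 + 1) * qtf / (k3 + qtf)                            # query term frequency
    return w_tf * w_idf * w_qtf

def okapi_score(doc_tf, query_tf, n_docs, df, dl, avdl):
    """Score of document D for query Q: sum of term weights over terms shared by D and Q."""
    shared = doc_tf.keys() & query_tf.keys()
    return sum(
        okapi_term_weight(doc_tf[t], query_tf[t], n_docs, df[t], dl, avdl)
        for t in shared
    )

# Toy numbers (hypothetical): one shared term, document of average length.
example = okapi_score({"nobel": 1}, {"nobel": 1}, n_docs=10, df={"nobel": 1}, dl=100, avdl=100)
```

Note that this idf form becomes negative when a term occurs in more than half of the documents in S.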
Answering RQ - Step 3 (3/3) • For RQ, there are two Web page sets rather than one document set and one query. • The idea is to replace (D, Q) with (P1, P2), where P1 ∈ S1 and P2 ∈ S2. • The method reuses equation f1, drops f3, and changes f2, f4, and f5 into f2', f4', and f5'. • C: number of top potential connecting terms considered (20~30)
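One possible reading of this adaptation is sketched below in Python. The modified equations f2', f4' and f5' are not given in this transcript, so the concrete choices here are assumptions: a smoothed idf that stays positive, a term weight formed as the product of the two pages' tf weights and the larger of the two idf values, page length measured in tokens rather than bytes, and all function names.

```python
import math
from collections import Counter

def tf_weight(tf, dl, avdl, k1=1.2, b=0.75):
    """Length-normalized term frequency weight (standard Okapi form)."""
    return (k1 + 1) * tf / (k1 * ((1 - b) + b * dl / avdl) + tf)

def idf_weight(n_docs, df):
    """Assumed smoothed idf that stays positive even for very common terms."""
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

def set_stats(pages):
    """Global statistics (N, avdl, df) for one Web page set; pages are token lists."""
    n_docs = len(pages)
    avdl = sum(len(p) for p in pages) / n_docs
    df = Counter(t for p in pages for t in set(p))
    return n_docs, avdl, df

def pair_similarity(p1, p2, stats1, stats2, top_c=25):
    """Similarity of (P1, P2): sum of the top C weights of terms shared by both pages."""
    n1, avdl1, df1 = stats1
    n2, avdl2, df2 = stats2
    tf1, tf2 = Counter(p1), Counter(p2)
    weights = [
        tf_weight(tf1[t], len(p1), avdl1)
        * tf_weight(tf2[t], len(p2), avdl2)
        * max(idf_weight(n1, df1[t]), idf_weight(n2, df2[t]))
        for t in tf1.keys() & tf2.keys()
    ]
    return sum(sorted(weights, reverse=True)[:top_c])

# Toy page sets (hypothetical tokens).
s1 = [["nobel", "physics", "professor"], ["florida", "professor"]]
s2 = [["attorney", "florida", "injury"], ["nobel", "party"]]
st1, st2 = set_stats(s1), set_stats(s2)
sim_unrelated = pair_similarity(s1[0], s2[0], st1, st2)
sim_related = pair_similarity(s1[0], s2[1], st1, st2)
```

Keeping only the top C weights limits how much a long tail of weakly shared terms can contribute to a pair's score.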
Answering RQ - Step 4 • Step 4: Sorting Web Page Pairs • All the Web page pairs are sorted in descending order of their similarity values. • The top ten Web page pairs are returned to the searcher in the first result page.
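The sorting step can be sketched in Python as below. The function name `rank_pairs` and the toy similarity function (a simple count of shared tokens) are illustrative assumptions, not part of the original method. With M1 = M2 = 50 there would be 2,500 pairs to score.

```python
from itertools import product

def rank_pairs(s1, s2, similarity, top_k=10):
    """Score every (P1, P2) pair and return the top_k pairs, highest similarity first."""
    scored = [
        ((i, j), similarity(p1, p2))
        for (i, p1), (j, p2) in product(enumerate(s1), enumerate(s2))
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

def shared_tokens(p1, p2):
    """Toy similarity (assumption): number of distinct tokens the two pages share."""
    return len(set(p1) & set(p2))

pages1 = [["a", "b"], ["c"]]
pages2 = [["a", "b", "c"], ["d"]]
top_two = rank_pairs(pages1, pages2, shared_tokens, top_k=2)
```

Each result carries the indices of the two pages, which matches how the result slides identify a pair by its position in each set.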
Experimental Results - Example 1 • Scenario I: Relationship between People • Example 1 (Nobel Example) • (P1,1, P2,28)
Experimental Results - Example 2 • Example 2 (Lomet Example) • Suppose Arthur will attend a conference and he notices that David will attend the same conference. • Assume that Arthur does not know David and would like to chat with him. • (P1,48, P2,5)
Experimental Results - Example 3 (1/3) • Scenario II: Relationship between Places • Example 3 (Yorktown Example) • The first Web page pair: (P1,31, P2,21)
Experimental Results - Example 3 (2/3) • Example 3 (Yorktown Example) • The second Web page pair: (P1,46, P2,19)
Experimental Results - Example 3 (3/3) • Example 3 (Yorktown Example)
Experimental Results - Example 4 • Example 4 (Hartlepool Example) • E1: Hartlepool • E2: Three Gorges • The fourth Web page pair: (P1,17, P2,49)
Experimental Results - Example 5 • Scenario III: Relationship between Companies • Example 5 (Bank Example) • E1: St. Petersburg Real Estate Holding Co • E2: Union Bank of Switzerland scandal • The first Web page pair: (P1,15, P2,45)
Experimental Results - Example 6 • Scenario IV: Relationship between Institutes • Example 6 (CMU Example) • Anton graduated from the Computer Science Department of CMU. • He is currently a researcher at Microsoft Research (MSR). • He will go back to CMU to recruit new employees for MSR. • E1: Microsoft Research • E2: Carnegie Mellon University computer science • (P1,39, P2,11)
Experimental Results - Example 7 • Scenario V: Relationship between Document Sets • Example 7 (Paper Example) • Cathy is a manager at a research lab.
Experimental Results - Sensitivity Analysis of Parameter Values (1/3) • The score is defined as the sum of the reciprocal ranks of the relevant Web page pairs among the returned top ten Web page pairs. • For example, if the first, second, and eighth of the returned top ten Web page pairs are relevant, the score is 1 + 1/2 + 1/8 = 1.625. • Relevant Web page pairs contain desired relationships between the two entities and are manually identified. • 30 examples
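The score defined above is simple to compute; a minimal Python sketch follows. The function name and the choice to ignore ranks beyond the top ten are assumptions made for illustration.

```python
def reciprocal_rank_score(relevant_ranks, top_k=10):
    """Sum of 1/rank over the relevant pairs that appear within the top_k results."""
    return sum(1.0 / rank for rank in relevant_ranks if 1 <= rank <= top_k)

# The example from the slide: relevant pairs at ranks 1, 2 and 8.
score = reciprocal_rank_score([1, 2, 8])   # 1 + 1/2 + 1/8 = 1.625
```

Because each relevant pair contributes 1/rank, a relevant pair near the top of the list counts far more than one near the bottom.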
Experimental Results - Sensitivity Analysis of Parameter Values (2/3)
Experimental Results - Sensitivity Analysis of Parameter Values (3/3) • Tech1: Use windowing in document pre-processing (Operation 4 in Step 2). • Tech2: Use max(W'idf,1, W'idf,2) in the term weighting formula (Step 3). • Tech3: Only consider the top C potential connecting terms in computing the similarity value of a Web page pair (Step 3). • Tech4: For each i (i = 1, 2), compute a set of global statistics (Ni, avdli, dfi) on the Web page set Si (Step 3).
Conclusion • They believe that they are among the first to study the problem of answering relationship queries on the Web. • To effectively filter out the large amount of noise in the Web pages without losing much useful information, they: • Do windowing around query keywords • Compute term weights based on the characteristics of the two Web page sets • Only use the top potential connecting terms to compute the similarity values of Web page pairs