Answering Relationship Queries on the Web

Answering Relationship Queries on the Web • Gang Luo, Chunqiang Tang, and Ying-li Tian • IBM T.J. Watson Research Center • WWW 2007

Motivation • Relationship between people • Dr. John Robert Schrieffer: • A professor at Florida State University • Nobel Prize laureate in physics • He is invited to a party. • Mr. Glenn Klausman: • Florida attorney practicing personal injury law • Glenn plans to attend the party. • He would like to chat with Dr. Schrieffer. • A relationship query (RQ) asks for the relationships between two or more entities.

Challenge • Web pages are unstructured documents • Large amount of “noise” (i.e., irrelevant information) in the Web pages • How to capture potential connecting terms between E1 and E2 • How to compute term weights based on the characteristics of the two Web page sets

Answering Relationship Query • Searchers may not be able to find desired relationships between E1 and E2 • (1) Retrieved pages do not contain any desired relationship • (2) No Web page mentions both E1 and E2 and their relationship • (3) Web pages may either (a) mention some relationships, or (b) just happen to mention both E1 and E2 incidentally • (4) No desired relationship exists between E1 and E2

Answering RQ – User Interface (1/2)

Answering RQ – User Interface (2/2)

Answering RQ - Step 1 • Step 1: Obtaining Web pages • Each entity Ei (i = 1, 2) is used as a query keyword to retrieve the URLs of the top Mi Web pages. • For each URL, the corresponding Web page is retrieved from the Web; the retrieved pages form the Web page set Si. • M1 = M2 = 50

Answering RQ - Step 2 • Step 2: Document pre-processing • Operation 1: All HTML comments, JavaScript code, tags, and non-alphabetic characters are removed. • Operation 2: Words are stemmed with the Porter stemmer. • Operation 3: Stopwords are removed using the SMART stopword list. • Operation 4: Windowing to extract potential connecting terms: only the text within a window of size W around each query keyword in Ki is kept (Ki: the set of keywords of entity Ei; W: window size, 25~35). • They assume that the most useful information is typically centered around query keywords and use windowing to obtain this information. • Operation 4 attempts to strike a balance between noise reduction and omission of useful information.
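The windowing in Operation 4 can be illustrated with a minimal Python sketch. The function name `window_terms`, the choice of counting W in words on each side of a keyword, and keeping the keyword itself are assumptions made for illustration; the slides only state that a window of size W around query keywords is used.

```python
def window_terms(words, keywords, w=30):
    """Keep only the words lying within w positions of any query keyword.

    Assumption: the window spans w words on each side of a keyword and
    the keyword itself is kept. The slides do not spell this out.
    """
    keyword_set = {k.lower() for k in keywords}
    keep = [False] * len(words)
    for i, word in enumerate(words):
        if word.lower() in keyword_set:
            start = max(0, i - w)
            end = min(len(words), i + w + 1)
            for j in range(start, end):
                keep[j] = True
    # Preserve the original word order of the surviving words.
    return [word for word, kept in zip(words, keep) if kept]


# Toy usage with a tiny window so the effect is visible.
page = "a b c nobel d e f g".split()
kept = window_terms(page, {"nobel"}, w=2)
```

A larger W keeps more context but also more noise, which is the balance the slide describes.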

Answering RQ - Step 3 (1/3) • Step 3: Computing similarity values • For each pair of Web pages (P1, P2), where P1 ∈ S1 and P2 ∈ S2, they compute a similarity value. • For each connecting term t that appears in both P1 and P2, they compute a term weight. • The weight reflects the likelihood that t captures the relationship between E1 and E2. • The Okapi formula is adapted to compute both the term weights and the similarity values of Web page pairs.

Answering RQ - Step 3 (2/3) • Okapi formula, for a query Q and a document set S • For each term t in the vocabulary and each document D ∈ S, Okapi's formulas use the following quantities: • qtf: t’s frequency in Q • N: total number of documents in S • df: the number of documents in S that contain t • dl: length of D in bytes • avdl: the average document length in bytes
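For orientation, a commonly cited form of the Okapi weighting is given below, written with the symbols listed on this slide. It is a reconstruction, not the slide's own figure: the term frequency tf of t in D and the tuning constants k1, b, and k3 do not appear in the transcript, and the assignment of the labels f1 to f5 is an assumption inferred from how the next slide refers to them.

```latex
\begin{align}
w_{tf}  &= \frac{(k_1 + 1)\, tf}{k_1\left((1 - b) + b\,\frac{dl}{avdl}\right) + tf} \tag{f1}\\
w_{idf} &= \ln\frac{N - df + 0.5}{df + 0.5} \tag{f2}\\
w_{qtf} &= \frac{(k_3 + 1)\, qtf}{k_3 + qtf} \tag{f3}\\
w_t     &= w_{tf}\, w_{idf}\, w_{qtf} \tag{f4}\\
s_{D,Q} &= \sum_{t \in D \cap Q} w_t \tag{f5}
\end{align}
```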

Answering RQ - Step 3 (3/3) • For an RQ, there are two Web page sets rather than one document set and one query. • The idea is to replace (D, Q) with (P1 ∈ S1, P2 ∈ S2). • The method reuses equation f1, drops f3, and changes f2, f4, and f5 into f2’, f4’, and f5’. • Only the top C potential connecting terms are used (C: 20~30).
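A rough Python sketch of how such a pair similarity could be computed follows. It is not the authors' exact f2’, f4’, and f5’, which are not reproduced in this transcript. The assumptions are: the standard Okapi tf and idf weights, per-set statistics (Ni, avdli, dfi), the maximum of the two idf weights, the product of the two tf weights, a sum over the top C term weights, document length measured in words rather than bytes, and the constants `K1` and `B`. All function names are hypothetical.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # common Okapi defaults; assumed, not taken from the slides


def tf_weight(tf, dl, avdl):
    """Okapi term-frequency weight with document-length normalization."""
    return (K1 + 1) * tf / (K1 * ((1 - B) + B * dl / avdl) + tf)


def idf_weight(n, df):
    """Standard Okapi idf; the slides' modified idf may differ."""
    return math.log((n - df + 0.5) / (df + 0.5))


def set_stats(pages):
    """Global statistics (N_i, avdl_i, df_i) for one Web page set S_i."""
    n = len(pages)
    avdl = sum(len(p) for p in pages) / n  # length in words (assumption)
    df = Counter(t for p in pages for t in set(p))
    return n, avdl, df


def pair_similarity(p1, p2, stats1, stats2, top_c=25):
    """Sum the top_c largest weights of terms shared by pages p1 and p2."""
    n1, avdl1, df1 = stats1
    n2, avdl2, df2 = stats2
    c1, c2 = Counter(p1), Counter(p2)
    weights = []
    for t in c1.keys() & c2.keys():
        idf = max(idf_weight(n1, df1[t]), idf_weight(n2, df2[t]))
        tf1 = tf_weight(c1[t], len(p1), avdl1)
        tf2 = tf_weight(c2[t], len(p2), avdl2)
        weights.append(tf1 * tf2 * idf)
    return sum(sorted(weights, reverse=True)[:top_c])


# Toy page sets, already tokenized and pre-processed.
s1 = [["nobel", "physics", "florida"], ["nobel", "party"], ["professor", "florida"]]
s2 = [["attorney", "florida", "party"], ["injury", "law"], ["attorney", "law"]]
stats1, stats2 = set_stats(s1), set_stats(s2)
sim_shared = pair_similarity(s1[0], s2[0], stats1, stats2)
sim_disjoint = pair_similarity(s1[0], s2[1], stats1, stats2)
```

Taking the maximum idf means a term that is common in one set can still score well if it is rare in the other, which is one plausible reading of Tech2 later in the deck.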

Answering RQ - Step 4 • Step 4: Sorting Web page pairs • All the Web page pairs are sorted in descending order of their similarity values. • The top ten Web page pairs are returned to the searcher in the first result page.
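The sorting step can be sketched in a few lines of Python. The function `rank_pairs` and the toy similarity function are hypothetical; any function that scores a page pair can be passed in. Page indices here are 0-based, whereas the slides number pages from 1.

```python
from itertools import product


def rank_pairs(s1, s2, similarity, top_k=10):
    """Score every (P1, P2) pair and return the top_k by similarity."""
    scored = [
        ((i, j), similarity(p1, p2))
        for (i, p1), (j, p2) in product(enumerate(s1), enumerate(s2))
    ]
    # Descending order of similarity; ties keep their enumeration order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def overlap(p1, p2):
    """Toy similarity: the number of distinct shared terms."""
    return len(set(p1) & set(p2))


pages1 = [["a", "b"], ["c"]]
pages2 = [["a", "b", "c"], ["d"]]
top_two = rank_pairs(pages1, pages2, overlap, top_k=2)
```

With M1 = M2 = 50 there are 2,500 pairs to score, so a full sort is inexpensive.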

Experimental Results - Example 1 • Scenario I: Relationship between People • Example 1 (Nobel Example) • (P1,1, P2,28)

Experimental Results - Example 2 • Example 2 (Lomet Example) • Suppose Arthur will attend a conference and he notices that David will attend the same conference. • Assume that Arthur does not know David and would like to chat with him. • (P1,48, P2,5)

Experimental Results - Example 3 (1/3) • Scenario II: Relationship between Places • Example 3 (Yorktown Example) • The first Web page pair: (P1,31, P2,21)

Experimental Results - Example 3 (2/3) • Example 3 (Yorktown Example) • The second Web page pair: (P1,46, P2,19)

Experimental Results - Example 3 (3/3) • Example 3 (Yorktown Example)

Experimental Results - Example 4 • Example 4 (Hartlepool Example) • E1: Hartlepool • E2: Three Gorges • The fourth Web page pair: (P1,17, P2,49)

Experimental Results - Example 5 • Scenario III: Relationship between Companies • Example 5 (Bank Example) • E1: St. Petersburg Real Estate Holding Co • E2: Union Bank of Switzerland scandal • The first Web page pair: (P1,15, P2,45)

Experimental Results - Example 6 • Scenario IV: Relationship between Institutes • Example 6 (CMU Example) • Anton graduated from the Computer Science Department of CMU. • He is currently a researcher at Microsoft Research (MSR). • He will go back to CMU to recruit new employees for MSR. • E1: Microsoft Research • E2: Carnegie Mellon University computer science • (P1,39, P2,11)

Experimental Results - Example 7 • Scenario V: Relationship between Document Sets • Example 7 (Paper Example) • Cathy is a manager at a research lab.

Experimental Results - Sensitivity Analysis of Parameter Values (1/3) • The score is defined as the sum of the reciprocal ranks of the relevant Web page pairs among the returned top ten Web page pairs. • For example, if the first, second, and eighth of the returned top ten Web page pairs are relevant, the score is 1 + 1/2 + 1/8 = 1.625. • Relevant Web page pairs contain desired relationships between the two entities and are identified manually. • 30 examples
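The score definition translates directly into code. A minimal Python sketch, with the hypothetical helper name `reciprocal_rank_score` and ranks counted from 1:

```python
def reciprocal_rank_score(relevant_ranks, top_k=10):
    """Sum of 1/rank over relevant pairs that appear within the top_k results."""
    return sum(1.0 / rank for rank in relevant_ranks if 1 <= rank <= top_k)


# The slide's example: relevant pairs at ranks 1, 2, and 8.
score = reciprocal_rank_score([1, 2, 8])
print(score)  # 1.625
```

Under this measure a relevant pair at rank 1 contributes as much as eight relevant pairs at rank 8, so the score rewards placing relevant pairs near the top.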

Experimental Results - Sensitivity Analysis of Parameter Values (2/3)

Experimental Results - Sensitivity Analysis of Parameter Values (3/3) • Tech1: Use windowing in document pre-processing (Operation 4 in Step 2). • Tech2: Use max(w’idf,1, w’idf,2) in the term weighting formula (Step 3). • Tech3: Only consider the top C potential connecting terms in computing the similarity value of a Web page pair (Step 3). • Tech4: For each i (i = 1, 2), compute a set of global statistics (Ni, avdli, dfi) on the Web page set Si (Step 3).

Conclusion • They believe that they are among the first to study the problem of answering relationship queries on the Web. • To effectively filter out the large amount of noise in the Web pages without losing much useful information, they: • apply windowing around query keywords • compute term weights based on the characteristics of the two Web page sets • use only the top potential connecting terms to compute the similarity values of Web page pairs
