CS533 Information Retrieval Dr. Weiyi Meng Lecture #17 April 4, 2000
Metasearch Engine
Two observations about search engines:
• The web pages a user needs are frequently scattered across multiple search engines.
• The coverage of each individual search engine is limited.
Combining multiple search engines may therefore increase the coverage, and a metasearch engine is a good mechanism for doing this.
Metasearch Engine Solution
[Architecture diagram: the user submits a query through the user interface; a query dispatcher forwards it to search engines 1 through n, each built on its own text source; a result merger combines the returned results and presents them to the user.]
Some Observations
• When n is small (say < 10), we can afford to send each query to all local search engines.
• When n is large (imagine n in the 1000s), then
  • most sources are not useful for a given query
  • sending a query to a useless source would
    • incur unnecessary network traffic
    • waste local resources evaluating the query
    • increase the cost of merging the results
A More Efficient Metasearch Engine
[Architecture diagram: as before, but the query now passes through a database selector and a document selector before the query dispatcher sends it to search engines 1 through n over their text sources; the result merger combines the returned results.]
Introduction to Metasearch Engine (1) Database Selection Problem • Select potentially useful databases for a given query • essential if the number of local databases is large • reduce network traffic • avoid wasting local resources
Introduction to Metasearch Engine (2)
• Potentially useful database: a database that contains potentially useful documents.
• Potentially useful document: either
  • its global similarity with the query is above a threshold, or
  • its global similarity with the query is among the m highest for some m.
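To make the two selection criteria concrete, here is a minimal sketch; the function names and the (doc_id, similarity) representation are illustrative, not from the lecture:

```python
def useful_by_threshold(doc_sims, threshold):
    """Documents whose global similarity with the query exceeds a threshold."""
    return [(d, s) for d, s in doc_sims if s > threshold]

def useful_top_m(doc_sims, m):
    """Documents whose global similarity is among the m highest."""
    return sorted(doc_sims, key=lambda ds: ds[1], reverse=True)[:m]

docs = [("d1", 0.8), ("d2", 0.5), ("d3", 0.2)]
print(useful_by_threshold(docs, 0.4))   # [('d1', 0.8), ('d2', 0.5)]
print(useful_top_m(docs, 2))            # [('d1', 0.8), ('d2', 0.5)]
```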
Introduction to Metasearch Engine (3) • Need some knowledge about each database in advance in order to perform database selection • Database Representative
Introduction to Metasearch Engine (4) Document Selection Problem Select potentially useful documents from each selected local database efficiently • Retrieve all potentially useful documents while minimizing the retrieval of useless documents • from global similarity threshold to tightest local similarity threshold
Introduction to Metasearch Engine (5)
Result Merging Problem
Objective: merge the documents returned from multiple sources into a single ranked list.
[Diagram: DB1 returns d11, d12, ...; DBN returns dN1, dN2, ...; the merger combines them into one ranked list, e.g. d12, d54, ...]
Introduction to Metasearch Engine (6) An “Ideal” Metasearch Engine: • Retrieval effectiveness: the same as if all documents were stored in a single collection. • Efficiency: optimize the retrieval process.
Introduction to Metasearch Engine (7) Implications of an ideal metasearch engine: it should aim at • selecting only useful search engines • retrieving and transmitting only useful documents • ranking documents according to their degrees of relevance
Database Selection: Basic Idea Goal: identify potentially useful databases for each user query. General approach: • use a representative to indicate, approximately, the content of each database • use these representatives to select databases for each query
Solution Classification • Naive Approach: Select all databases (e.g. MetaCrawler, NCSTRL) • Qualitative Approaches: estimate the quality of each local database • based on rough representatives • based on detailed representatives
Solution Classification (cont.) • Quantitative Approaches: estimate quantities that measure the quality of each local database more directly and explicitly • Learning-based Approaches: database representatives are obtained through training or learning
Qualitative Approaches Using Rough Representatives (1) • typical representative: a few words or a few paragraphs in a certain format; manual construction is often needed General remarks: • may work well for special-purpose local search engines • selection can be inaccurate
Qualitative Approaches Using Rough Representatives (2)
Example: ALIWEB (Koster 94)
Template-Type: DOCUMENT
Title: Perl
Description: Information on the Perl Programming Language. Includes a local Hypertext Perl Manual, and the latest FAQ in Hypertext.
Keywords: perl, perl-faq, language
Qualitative Approaches Using Detailed Representatives (1) • Use detailed statistics for each term. • Estimate the usefulness or quality of each search engine for each query. • The usefulness measures are less direct/explicit compared to those used in quantitative approaches. • Scalability starts to become an issue.
Qualitative Approaches Using Detailed Representatives (2)
Example: gGlOSS (generalized Glossary-Of-Servers Server, Gravano 95)
• representative: a pair (dfi, Wi) for each term ti
  dfi -- document frequency of ti
  Wi -- the sum of the weights of ti in all documents
gGlOSS (continued)
• database usefulness: the sum of the high similarities
  usefulness(q, D, T) = Σ { sim(q, d) : d ∈ D and sim(q, d) ≥ T }
  where D is a database and T is a threshold.
gGlOSS (continued)
Suppose for query q the document similarities are:
D1: d11: 0.6, d12: 0.5
D2: d21: 0.3, d22: 0.3, d23: 0.2
D3: d31: 0.7, d32: 0.1, d33: 0.1
usefulness(q, D1, 0.3) = 1.1
usefulness(q, D2, 0.3) = 0.6
usefulness(q, D3, 0.3) = 0.7
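As a quick check of these numbers, a minimal sketch of the usefulness measure (using sim ≥ T, which is the convention the example follows):

```python
def usefulness(similarities, threshold):
    """gGlOSS-style usefulness: sum of the similarities at or above the threshold."""
    return sum(s for s in similarities if s >= threshold)

databases = {
    "D1": [0.6, 0.5],
    "D2": [0.3, 0.3, 0.2],
    "D3": [0.7, 0.1, 0.1],
}
for name, sims in databases.items():
    print(name, usefulness(sims, 0.3))   # D1 1.1, D2 0.6, D3 0.7
```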
gGlOSS (continued)
Usefulness is estimated based on two cases:
• high-correlation case: if dfi ≤ dfj, then every document having ti also has tj.
• disjoint case: for any two query terms ti and tj, no document contains both ti and tj.
gGlOSS (continued)
Example (high-correlation case): consider q = (1, 1, 1) with df1 = 2, df2 = 3, df3 = 4, W1 = 0.6, W2 = 0.6 and W3 = 1.2.

Actual weights:              Estimated weights (high-correlation):
      t1    t2    t3               t1    t2    t3
d1    0.2   0.1   0.3              0.3   0.2   0.3
d2    0.4   0.3   0.2              0.3   0.2   0.3
d3    0     0.2   0.4              0     0.2   0.3
d4    0     0     0.3              0     0     0.3

• usefulness(q, D, 0.5) = 2.1
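Below is a minimal sketch of the high-correlation estimate as illustrated on these slides: terms are processed from rarest to most frequent, each document in a group is assigned the average weight Wi/dfi for every term it is assumed to contain, and the estimated similarities at or above the threshold are summed. Names are illustrative, and the ≥ T convention matches the example above.

```python
def high_correlation_usefulness(query_weights, dfs, Ws, threshold):
    """Estimate usefulness(q, D, T) under the high-correlation assumption.

    query_weights[i], dfs[i], Ws[i] describe query term i: its query weight,
    its document frequency in D, and the sum of its weights over D.
    The dfs[i] documents containing term i are assumed to also contain every
    more frequent term, each term with its average weight Ws[i] / dfs[i].
    """
    avg = [W / df if df > 0 else 0.0 for W, df in zip(Ws, dfs)]
    order = sorted(range(len(dfs)), key=lambda i: dfs[i])   # rarest term first

    total, prev_df = 0.0, 0
    for k, i in enumerate(order):
        group_size = dfs[i] - prev_df   # docs containing term order[k] but none of the rarer terms
        sim = sum(query_weights[j] * avg[j] for j in order[k:])
        if group_size > 0 and sim >= threshold:
            total += group_size * sim
        prev_df = dfs[i]
    return total

# Example above: q = (1, 1, 1), df = (2, 3, 4), W = (0.6, 0.6, 1.2), T = 0.5
print(high_correlation_usefulness([1, 1, 1], [2, 3, 4], [0.6, 0.6, 1.2], 0.5))  # ~2.1
```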
Quantitative Approaches
Two types of quantities may be estimated:
1. the number of documents in a database D with similarities higher than a threshold T:
   NoDoc(q, D, T) = |{ d : d ∈ D and sim(q, d) > T }|
2. the global similarity of the most similar document in D:
   msim(q, D) = max { sim(q, d) : d ∈ D }
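When the exact global similarities are known, both quantities are trivial to compute; a minimal sketch follows (the interesting problem, addressed next, is estimating them from a compact database representative):

```python
def NoDoc(similarities, threshold):
    """Number of documents in D whose similarity with q exceeds the threshold."""
    return sum(1 for s in similarities if s > threshold)

def msim(similarities):
    """Global similarity of the most similar document in D."""
    return max(similarities, default=0.0)

sims = [0.6, 0.5, 0.2]       # hypothetical exact similarities of D's documents with q
print(NoDoc(sims, 0.3))      # 2
print(msim(sims))            # 0.6
```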
Quantitative Approaches Qualitative approaches versus quantitative approaches: • Usefulness measures in quantitative approaches are easier to understand and easier to use. • Quantitative measures are usually more difficult to estimate and require more information to estimate.
Estimating NoDoc(q, D, T) (1)
Basic Approach (Meng 98)
• representative: a pair (pi, wi) for each term ti
  pi : probability that ti appears in a document
  wi : average weight of ti among the documents having ti
Ex: the normalized weights of ti in 10 docs are (0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6).
  pi = 0.6, wi = 0.4
Estimating NoDoc(q, D, T) (2)
Example: consider query q = (1, 1). Suppose p1 = 0.2, w1 = 2, p2 = 0.4, w2 = 1.
A generating function:
  (0.2 X^2 + 0.8)(0.4 X + 0.6) = 0.08 X^3 + 0.12 X^2 + 0.32 X + 0.48
In a term a X^b, a is the probability that a document in D has similarity b with q.
NoDoc(q, D, 1) = 10 * (0.08 + 0.12) = 2   (D contains n = 10 documents)
Estimating NoDoc(q, D, T) (3)
Example: consider query q = (1, 1, 1) and documents (0, 2, 2), (1, 0, 1), (0, 2, 0), (0, 0, 3), (0, 0, 0).
  (p1, w1) = (0.2, 1), (p2, w2) = (0.4, 2), (p3, w3) = (0.6, 2)
The generating function for this query:
  (0.2 X + 0.8)(0.4 X^2 + 0.6)(0.6 X^2 + 0.4)
  = 0.048 X^5 + 0.192 X^4 + 0.104 X^3 + 0.416 X^2 + 0.048 X + 0.192
The accurate function for this query:
  0 X^5 + 0.2 X^4 + 0.2 X^3 + 0.4 X^2 + 0 X + 0.2
Estimating NoDoc(q, D, T) (4)
Consider query q = (q1, ..., qr).
Proposition. If the terms are independent and the weight of term ti, whenever present in a document, is wi, then the coefficient of X^s in the following generating function is the probability that a document in D has similarity s with q:
  (p1 X^(q1 w1) + (1 - p1)) (p2 X^(q2 w2) + (1 - p2)) ... (pr X^(qr wr) + (1 - pr))
Estimating NoDoc(q, D, T) (5)
Suppose the expanded generating function is:
  a1 X^b1 + a2 X^b2 + … + ac X^bc ,  with b1 > … > bc
For a given threshold T, let v be the largest integer satisfying bv > T. Then NoDoc(q, D, T) can be estimated by
  n (a1 + a2 + … + av)
where n is the number of documents in D.
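A minimal sketch of the whole estimation procedure: build the generating function one query term at a time as a map from similarity value to probability, then sum the probabilities above the threshold and scale by n. Names are illustrative; the representative (pi, wi) and the query weights qi are as defined above.

```python
from collections import defaultdict

def estimate_NoDoc(query_weights, ps, ws, threshold, n):
    """Estimate NoDoc(q, D, T) from database D's representative (p_i, w_i).

    The generating function is kept as a map {similarity value: probability}.
    Each query term contributes one factor p_i*X^(q_i*w_i) + (1 - p_i):
    with probability p_i the term is present and adds q_i*w_i to the
    similarity, with probability 1 - p_i it is absent and adds nothing.
    """
    poly = {0.0: 1.0}                               # the constant polynomial 1
    for q_i, p_i, w_i in zip(query_weights, ps, ws):
        nxt = defaultdict(float)
        for sim, prob in poly.items():
            nxt[sim + q_i * w_i] += prob * p_i      # term present in the document
            nxt[sim] += prob * (1.0 - p_i)          # term absent from the document
        poly = dict(nxt)
    # n * (sum of the coefficients whose exponent exceeds the threshold)
    return n * sum(prob for sim, prob in poly.items() if sim > threshold)

# First example above: q = (1, 1), p = (0.2, 0.4), w = (2, 1), T = 1, n = 10
print(estimate_NoDoc([1, 1], [0.2, 0.4], [2, 1], threshold=1, n=10))   # ~2.0
```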
Database Selection Using msim(q, D)
Optimal Ranking of Databases (Yu 99)
User: for query q, find the m most similar documents.
Definition: Databases [D1, D2, …, Dp] are optimally ranked with respect to q if there exists a k such that each of the databases D1, …, Dk contains one of the m most similar documents, and all of these m documents are contained in these k databases.
Database Selection Using msim(q, D)
Optimal Ranking of Databases: Example
For a given query q:
D1: d1: 0.8, d2: 0.5, d3: 0.2, ...
D2: d9: 0.7, d2: 0.6, d10: 0.4, ...
D3: d8: 0.9, d12: 0.3, …
The other databases contain only documents with small similarities.
When m = 5: pick D1, D2, D3
Database Selection Using msim(q, D)
Proposition: Databases [D1, D2, …, Dp] are optimally ranked with respect to a query q if and only if msim(q, Di) ≥ msim(q, Dj) for all i < j.
Example:
D1: d1: 0.8, …
D2: d9: 0.7, …
D3: d8: 0.9, …
Optimal rank: [D3, D1, D2, …]
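A minimal sketch of the ranking this proposition prescribes, using the example's top similarities as the msim values:

```python
def rank_databases(msim_values):
    """Order databases by their (estimated) msim value, largest first,
    following the optimal-ranking proposition."""
    return sorted(msim_values, key=msim_values.get, reverse=True)

values = {"D1": 0.8, "D2": 0.7, "D3": 0.9}
print(rank_databases(values))   # ['D3', 'D1', 'D2']
```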
Estimating msim(q, D)
• global database representative: the global dfi of each term ti
• local database representative:
  anwi : average normalized weight of ti
  mnwi : maximum normalized weight of ti
Ex: the normalized weights of term ti in d1, d2, d3, d4 are 0.3, 0.4, 0, 0.7
  anwi = (0.3 + 0.4 + 0 + 0.7)/4 = 0.35
  mnwi = 0.7
Estimating msim(q, D)
Term weighting scheme: query term: tf*gidf; document term: tf
query q = (q1, q2, …, qk); modified query: q' = (q1*gidf1, …, qk*gidfk)
  msim(q, D) = max { qi*gidfi*mnwi + Σ_{j≠i} qj*gidfj*anwj : 1 ≤ i ≤ k } / |q'|
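A minimal sketch of this estimate, assuming gidfi, mnwi, and anwi are available for every query term and that |q'| denotes the Euclidean length of the modified query (an assumption of this sketch):

```python
import math

def estimate_msim(q, gidf, mnw, anw):
    """Estimate msim(q, D): assume the most similar document carries the
    maximum normalized weight (mnw) for one query term and the average
    normalized weight (anw) for every other query term, and take the best
    choice of that term."""
    k = len(q)
    q_mod = [q[i] * gidf[i] for i in range(k)]        # modified query q'
    norm = math.sqrt(sum(w * w for w in q_mod))       # |q'| (assumed Euclidean)
    if norm == 0:
        return 0.0
    best = 0.0
    for i in range(k):
        score = q_mod[i] * mnw[i] + sum(q_mod[j] * anw[j] for j in range(k) if j != i)
        best = max(best, score)
    return best / norm
```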
Learning-based Approaches Basic idea: Use past retrieval experiences to predict future database usefulness. Different types of learning methods • Static learning : learning based on static training queries before the system is used by real users.
Learning-based Approaches Different types of learning methods (cont.) • Dynamic learning : learning based on real evaluated user queries. • Combined learning: learned knowledge based on training queries will be adjusted based on real user queries.
Dynamic Learning (1)
Example: SavvySearch (Dreilinger 97)
• database representative of database D:
  wi : indicates how well D responds to query term ti
  cfi : number of databases containing ti
  ph : penalty due to low return
  pr : penalty due to long response time
• Initially, wi = ph = pr = 0
Dynamic Learning: SavvySearch
• Learning the value of wi for database D. After a k-term query containing term ti is processed:
  • if no document is retrieved: wi = wi - 1/k
  • if some returned document is clicked: wi = wi + 1/k
  • otherwise, no change to wi
• Over time, a large positive wi indicates that database D responds well to the term ti.
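A minimal sketch of this update rule, assuming each database keeps its term weights in a dictionary keyed by term (names and bookkeeping are illustrative):

```python
def update_weights(weights, query_terms, retrieved_any, clicked_any):
    """SavvySearch-style weight update for one database after one query.

    weights       : dict mapping term -> wi for this database
    query_terms   : the k terms of the query
    retrieved_any : whether the database returned any documents
    clicked_any   : whether the user clicked any returned document
    """
    k = len(query_terms)
    if not retrieved_any:
        delta = -1.0 / k           # penalize: nothing came back
    elif clicked_any:
        delta = 1.0 / k            # reward: a returned document was used
    else:
        return                     # returned but unused: no change
    for t in query_terms:
        weights[t] = weights.get(t, 0.0) + delta
```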
Dynamic Learning: SavvySearch
• Compute ph and pr for each database D.
  • if the average number of hits h returned for the most recent 5 queries is below Th (default: Th = 1):
    ph = (Th - h)^2 / Th^2
  • if the average response time r for the most recent 5 queries exceeds Tr (default: Tr = 15 seconds):
    pr = (r - Tr)^2 / (45 - Tr)^2
Dynamic Learning: SavvySearch
Compute the ranking score r(q, D) of database D for query q = (t1, ..., tk); the score is based on the learned weights wi, the cfi values, N (the number of local databases), and the penalties ph and pr.
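The scoring formula itself is not reproduced on this slide. The sketch below shows one plausible form consistent with the quantities defined above: each learned weight wi is scaled by an idf-like factor log(N/cfi), and the two penalties are subtracted. It is an illustration under that assumption, not the published SavvySearch formula.

```python
import math

def ranking_score(query_terms, weights, cf, N, p_h, p_r):
    """Hypothetical ranking score for one database: sum the learned term
    weights, scaled by an idf-like factor log(N / cf_i), then subtract the
    low-return and slow-response penalties. Illustrative only; the published
    SavvySearch formula may combine or normalize these quantities differently."""
    score = 0.0
    for t in query_terms:
        if cf.get(t, 0) > 0:
            score += weights.get(t, 0.0) * math.log(N / cf[t])
    return score - p_h - p_r
```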