420 likes | 555 Views
Web Search – Summer Term 2006 VII. Selected Topics - Metasearch Engines [1]. (c) Wolfgang Hürst, Albert-Ludwigs-University. Introduction. So far: General-purpose search engines (index and search the whole web)
E N D
Web Search – Summer Term 2006VII. Selected Topics -Metasearch Engines [1] (c) Wolfgang Hürst, Albert-Ludwigs-University
Introduction So far: General-purpose search engines (index and search the whole web) In addition: Special-purpose search engines exist (index and search documents from a particular domain) Problem: - Both only cover only a fraction of the web- Users sometimes have to address different search engines Possible solution: Metasearch engines
Motivation Why should we build metasearch engines? - To increase the coverage of the web - To solve the scalability of searching the web - To facilitate the invocation of multiple search engines - To improve the retrieval effectiveness
A typical metasearch engine session 1. User sends query to metasearch engine 2. Metasearch engine processes query and sends it to component search engines 3. Component search engines return results (based on local similarity) to metasearch engine 4. Metasearch engine merges and ranks returned results (based on global similarities) and presents them to the user
A typical architecture USER INTERFACE Database selection problem DATABASE SELECTOR Result merging problem Document selection problem DOCUMENT SELECTOR RESULT MERGER QUERY DISPATCHER SEARCH ENGINE . . . SEARCH ENGINE
Challenges for metasearch engines Why is it hard to create metasearch engines? Because of the heterogeneity among the involved autonomous component search engines:- Various indexing methods- Different document term weighting schemes- Different query term weighting schemes- Varying similarity functions- Different document databases- Unequal document versions- Different result representation
A typical architecture USER INTERFACE Database selection problem DATABASE SELECTOR Result merging problem Document selection problem DOCUMENT SELECTOR RESULT MERGER QUERY DISPATCHER SEARCH ENGINE . . . SEARCH ENGINE
The database selection problem Selection is important if the number of component search engines is large Goal: Identifying as many potentially useful databases as possible while minimizing wrongly identifying useless ones Decision is usually based on a representative of the database Most important questions:- What are good representatives?- How do we get them?
Approaches for database selection Three types of techniques: - Rough representative approaches e.g. several key words or paragraphs - Statistical representative approaches statistical information (e.g. document frequency of a term) - Learning-based approaches learned from past experience (training queries or real (previous) user queries)
Approaches for database selection Three types of techniques: - Rough representative approaches e.g. several key words or paragraphs - Statistical representative approaches statistical information (e.g. document frequency of a term) - Learning-based approaches learned from past experience (training queries or real (previous) user queries)
Rough representative approaches Typically several keywords and paragraphs Only give a rather general idea about the database Often manually generated Automatic approaches exist (e.g. taking the text from the interface page or anchor text from pages pointing to the search engine) Alternatively: Involve the user, e.g. by explicit search engine selection or need to specify a subject area Advantages: Easy and little storage space requirements, works well for highly specialized component search engines with diversified topics Disadvantage: Does not work well for databases with documents of diverse interests
Statistical representative approaches Examples: - D-WISE approach - CORI Net approach - gGIOSS approach - Estimating the number of potentially useful documents - Estimating the similarity of the most similar documents
The D-WISE approach The representative of a component search engine contains- the document frequency of each term in the component search engine (n values dij)- the number of documents in the database (1 value ni) Ranking the component search engines based on a given query q: 1. Calculate the cue validity Cij of each search term tj
The D-WISE approach (cont.) 2. Calculate the variance CVVj of the CVij's of each query term tj for all component databases (ACVj = average of all CVij's for all comp. databases) 3. Compute the ranking score ri for each component database i with respect to q
The D-WISE approach (cont.) 4. Select component databases with the highest ranking score (Intuitive interpretation of the ranking score: Indicates where useful query terms are concentrated) Advantages of this approach:- easy scalable- easy to compute Disadvantages:- ranking scores are relative scores (no absolute quality measures)- does not distinguish between multiple term appearances within single documents
The D-WISE approach (cont.) 4. Select component databases with the highest ranking score (Intuitive interpretation of the ranking score: Indicates where useful query terms are concentrated) Advantages of this approach:- easy scalable- easy to compute Disadvantages:- ranking scores are relative scores (no absolute quality measures)- does not distinguish between multiple term appearances within single documents
Learning-based approaches Predict usefulness of a database for new queries based on the experiences with the database from past ones How to obtain these retrieval experiences?- Usage of training queries (static learning)- Usage of real user queries and accumulate retrieval knowledge gradually and update continuously (dynamic learning)- Use a combined learning approach, i.e. start with training queries and update continuously based on real user queries Examples:- The MRDD approach- The SavvySearch approach- The ProFusion approach
The MRDD approach Static learning approach, i.e. a set of training queries is given and all relevant documents for each training query have to be identified (manually) Stores a vector reflecting the distribution of the relevant documents: <r1, r2, ..., rs> with ri = number of top-ranked documents that must be retrieved to obtain i relevant documents Example: Retrieval result: (d1, ..., d100) Relevant documents: d1, d4, d10, d17, d30 Distribution vector: <r1, r2, r3, r4, r5> = <1, 4, 10, 17, 30>
The MRDD approach (cont.) Application: Input: query q from the user 1. Compare q to the training queries and select the k most similar ones (e.g. k=8) 2. Calculate the average distribution vector over these k queries for each database 3. Select the databases Di's which maximize precision Example: D1: <1, 4, 6, 7, 10, 12, 17> D2: <3, 5, 7, 9, 15, 20> D3: <2, 3, 6, 9, 11, 16>
A typical architecture USER INTERFACE Database selection problem DATABASE SELECTOR Result merging problem Document selection problem DOCUMENT SELECTOR RESULT MERGER QUERY DISPATCHER SEARCH ENGINE . . . SEARCH ENGINE
The document selection problem The naive approach (i.e. return all results) won't work because it delivers too many documents Goal: Retrieve as many potentially useful documents from a component database as possible while minimizing the retrieval of useless documents Different categories of approaches: - User determination - Weighted allocation - Learning-based approaches - Guaranteed retrieval
User determination Let the user select a number for each component database (or use a default number, if none is given by the user) Advantage:- Works well with small numbers of component databases and if the user is reasonably familiar with them Disadvantage:- Does not work for larger sets of component databases
Weighted allocation Use the rank (or ranking score) calculated by the database selection algorithm to specify the number of documents that should be selected from the respective database Example: If ri is the ranking score of database Di (i = 1, ..., N) and m documents should be returned to the user, then select documents from database Di
Learning-based approaches Idea: Learn how many documents to retrieve for a given query based on past retrieval experiences for similar queries Example: The MRDD approach (see before) is a learning-based approach that already includes document selection!
Guaranteed retrieval Idea: Take global similarities into account (not only local ones, e.g. component database dependent document frequencies, etc.) Example approaches: - Query modification - Computing the tightest local threshold
A typical architecture USER INTERFACE Database selection problem DATABASE SELECTOR Result merging problem Document selection problem DOCUMENT SELECTOR RESULT MERGER QUERY DISPATCHER SEARCH ENGINE . . . SEARCH ENGINE
Result merging approaches Goal: Combine all returned results (sorted by local similarity) into one single result (sorted by global similarity) Why is this a hard problem? - Local similarities are not provided by all search engines- The heterogeneity of the component search engines make them hard (if not impossible) to compare- Local similarities might differ significantly from global ones- Should documents returned from multiple search engines considered differently?
Result merging approaches (cont.) Two types of approaches exist: Local similarity adjustment, e.g.- Adjust local similarities using additional information (e.g. quality of the component database)- Convert local document ranks to similarity Global similarity estimationAttempts to estimate the true global similarities
Local similarity adjustment Distinguish three cases: 1. The selected databases (or returned results) are pairwise disjoint (or nearly disjoint) 2. The selected databases (or returned results) overlap but are not identical 3. The selected databases are identical The latter case is known as data fusion and normally not considered for metasearch engines In the following: Assume case 1 (disjoint results)
Local similarity adjustment 1st: assume local similarities exist 1. Normalize the returned similarities 2. Use database scores to adjust the local similarities (i.e. give higher preference to documents from highly rated databases), e.g. If s is the ranking score of database D and s is the average of all these scores, then is the weight for database D (N = no. of databases) Define the adjusted similarity as w*x (x = original, returned local similarity) 3. Sort results by adjusted similarity
Local similarity adjustment 2nd: assume local similarities do NOT exist Possible approaches: - Use the local document rank information directly to perform the merge, e.g. 1. Arrange databases based on their scores 2. Select documents with the round-robin method (Alternatively: randomized version where the documents are selected based on the probability of the database being relevant) - Convert the local document ranks to similarities
Local similarity adjustment Distinguish three cases: 1. The selected databases (or returned results) are pairwise disjoint (or nearly disjoint) 2. The selected databases (or returned results) overlap but are not identical How to deal with results returned by multiple search engines? Calculate adjusted similarities as before and use (e.g.) methods from data fusion to combine them. However, this might not work well because of the different coverage of the component databases (-> active research field).
Global similarity estimation Approaches to estimate global similarities: 1. Document fetching: Idea: Download returned documents to get (e.g.) term frequencies, estimate global document frequency Disadvantages: expensive (but remedies exist, e.g. download docs in parallel, keep downloading and analyzing docs while initial results are presented, only download the beginning portion) Advantages: - Identifies obsolete URLs - Ranking based on current (up-to-date) content - Better result representation
Global similarity estimation Approaches to estimate global similarities (cont.): 2. Use of discovered knowledge: Basic idea: Try to figure out the specific document indexing and similarity computation methods used by the different component search engines Use this information to - better compare local similarities - better adjust local similarities to make them comparable with each other - better derive global from local similarities
Global similarity estimation 2. Use of discovered knowledge: Examples Assume all component search engines use same indexing and local similarity estimation methods that do not include collection-dependent statistics -> local similarities are comparable and can be used directly Assume the only difference is the usage of (different) lists of stopwords -> modify the query to make results comparable Assume idf information is also used -> either adjust local similarities or compute global similarities directly (cont. on next slide)
Global similarity estimation 2. Use of discovered knowledge: Examples Assume idf information is also used -> either adjust local similarities or compute global similarities directly Case 1: Query q contains just one single term t Similarity in the component database: Global similarity: Multiply local sim. with Case 2: Query q contains multiple terms: see lit.
Other challenges for metasearch engines 1. Integrate local systems employing different indexing techniques 2. Integrate local systems supporting different types of queries (e.g. Boolean vs. vector space queries) 3. Discover knowledge about component search engines 4. Develop more effective result merging methods 5. Study the appropriate cooperation between a metasearch engine and the local systems 6. Incorporate new indexing and weighting techniques to build better metasearch engines
Other challenges for metasearch engines 7. Improve the effectiveness of metasearch 8. Decide where to place the software components of a metasearch engine 9. Create a standard testbed to evaluate the proposed techniques for database selection, document selection, and result merging 10. Extend metasearch techniques to different types of data sources
References [1] MENG, YU, LIU: BUILDING EFFICIENT AND EFFECTIVE METASEARCH ENGINES. ACM COMPUTING SURVEYS, VOL. 34, NO. 1, MARCH 2002.
DOCS. RESULTS RESULT REPRESENTATION RANKING SEARCHING Recap: IR System & Tasks Involved INFORMATION NEED User Interface DOCUMENTS QUERY SELECT DATA FOR INDEXING QUERY PROCESSING (PARSING & TERM PROCESSING) PARSING & TERM PROCESSING INDEX LOGICAL VIEW OF THE INFORM. NEED PERFORMANCE EVALUATION
General web search engine architecture CLIENT WWW PAGE REPOSITORY QUERIES RESULTS QUERY ENGINE RANKING CRAWLER(S) COLLECTION ANALYSIS MOD. INDEXER MODULE CRAWL CONTROL INDEXES UTILITY STRUCTURE TEXT USAGE FEEDBACK (CF. FIG. 1 IN ARASU ET AL. "SEARCHING THE WEB", ACM TRANSACTIONS ON INTERNET TECHNOLOGY, VOL 1/1, 8-2001, PAGE 4)
Next week 1. Final grading of the programming exercises 2. Exercises with the PageRank simulation tool (participation mandatory, but no grading) Exam dates August 28th and 29th OR September 18th and 19th OR by personal arrangement