Metasearching
CS 502 – 20020312
Carl Lagoze – Cornell University
Acknowledgements: Luis Gravano, Andreas Paepcke
20020307
Web Search Strategies – Crawling
[Diagram labels: “central” index; query (?)]
Web Search Strategies – Metadata Harvesting
[Diagram labels: metadata (Author, Title, Abstract, Identifier); query (?)]
Web Search Strategies – Metasearching
[Diagram labels: Metasearch Engine; query (?)]
What is “Metasearching”?
• Given many document sources and a query, a metasearcher:
  • Finds the good sources for the query
  • Evaluates the query at these sources
  • Merges the results from these sources
[Diagram: the Metasearcher sits in front of an Existing Web Application, Unindexed Documents, and a Legacy Database / WAIS / etc.]
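The three steps can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: each source is modeled as a function returning (document, score) pairs, each source has a set of known words as its summary, and scores are merged by raw value (a simplification that the rank-merging slides later argue against). All names here are hypothetical.

```python
def metasearch(query, sources, summaries, max_sources=2):
    """Toy metasearcher: select sources, evaluate the query, merge results.

    sources:   dict name -> function(query) returning [(doc_id, score), ...]
    summaries: dict name -> set of words the source is known to contain
    Both structures are assumptions made for this sketch.
    """
    words = query.lower().split()

    def coverage(name):
        # Number of query words that appear in the source's summary.
        return sum(1 for w in words if w in summaries[name])

    # 1. Find the good sources for the query.
    ranked = sorted(sources, key=coverage, reverse=True)
    chosen = [name for name in ranked if coverage(name) > 0][:max_sources]

    # 2. Evaluate the query at these sources.
    merged = []
    for name in chosen:
        for doc_id, score in sources[name](query):
            merged.append((doc_id, score, name))

    # 3. Merge the results (naively, by raw score).
    merged.sort(key=lambda row: row[1], reverse=True)
    return merged
```

Sorting by raw score is the weakest part of this sketch: it only makes sense if every source scores documents on the same scale.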
Metasearching Issues
• How to query different types of sources?
• How to combine results and rankings from multiple data sources?
[Diagram: the Metasearcher sends the same query in three source-specific forms]
  http://…/getTitle? title=‘biomedical’&…
  SELECT title FROM articles . . .
  grep ‘biomedical’ *.txt
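One common way to handle different source types is a small wrapper per source that translates a single query term into that source's native form. The sketch below is hypothetical: the endpoint `http://example.org/getTitle`, the `WHERE` clause, and the function names are placeholders chosen for illustration, since the slide elides the real URL and SQL.

```python
from urllib.parse import urlencode


def to_url(term, base="http://example.org/getTitle"):
    # 'base' is a placeholder endpoint, not the one elided on the slide.
    return base + "?" + urlencode({"title": term})


def to_sql(term):
    # Returns a parameterized statement and its parameters; the WHERE clause
    # is an assumption about what the elided query might look like.
    return "SELECT title FROM articles WHERE title LIKE ?", ("%" + term + "%",)


def to_grep(term, files):
    # Returns an argument list suitable for subprocess.run; the caller
    # supplies the file names (for example from glob.glob("*.txt")).
    return ["grep", "-e", term, *files]
```

For example, `to_url("biomedical")` builds the web form of the query, while `to_sql("biomedical")` and `to_grep("biomedical", ["a.txt"])` build the database and flat-file forms of the same request.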
Metasearching Issues . . . Cont’d
• How to choose among multiple data sources?
• How to get metadata about multiple data sources?
[Diagram: ways the Metasearcher can obtain source metadata]
  Best: http://….?getMetaData
  Worst: “Hi. What do you have?”
  cat *.txt
  SELECT SCHEMA …….
Function versus cost of acceptance
[Figure: plot with axes Function and Cost of acceptance; points labeled Z39.50, SDLIP/STARTS, Metadata Harvesting, google]
Z39.50
http://www.loc.gov/z3950/agency/
Aims of Z39.50
• Permits one computer, the client, to search and retrieve information on another, the database server
• Important both technically and for its wide use in library systems
• Most development has concentrated on bibliographic data
• Most implementations emphasize searches that use a bibliographic set of attributes to search databases of MARC records
Technical history
• Z39.50
  • Developed for X.25 networks (connection-oriented); support for running over TCP was fitted later
  • Original concept dates from about 1980, when repeating a search was computationally expensive
• WAIS is a stateless derivative of an early version of Z39.50
Z39.50 principles
• Abstract view of database searching
• Server stores a set of databases with searchable indexes
• Interactions are based on a session
  • The client opens a connection with the server, carries out a sequence of interactions, and then closes the connection
  • During the course of the session, both the server and the client remember the state of their interaction
State
• Z39.50
  • The server carries out the search and builds a result set
  • The server saves the result set
  • Subsequent messages from the client can reference the result set
  • Thus the client can refine a large set with increasingly precise requests, or can request a presentation of any record in the set, without searching the entire database again
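The idea of server-side result sets can be illustrated with a toy in-memory session. This is not Z39.50 itself: the class, its method names, the substring matching, and the 0-based `start` offset are all assumptions made for the sketch.

```python
class ToySession:
    """One client session; the server keeps named result sets between requests."""

    def __init__(self, records):
        self.records = records        # list of dicts, e.g. {"title": ..., "author": ...}
        self.result_sets = {}         # result-set name -> list of record positions
        self.full_scans = 0           # how often the whole database was searched

    def search(self, name, field, value, within=None):
        """Store matching records under 'name'; optionally search inside an earlier set."""
        if within is None:
            candidates = range(len(self.records))
            self.full_scans += 1
        else:
            candidates = self.result_sets[within]
        hits = [i for i in candidates
                if value.lower() in self.records[i].get(field, "").lower()]
        self.result_sets[name] = hits
        return len(hits)

    def present(self, name, start, count):
        """Return 'count' records from a saved set, starting at 0-based 'start'."""
        positions = self.result_sets[name][start:start + count]
        return [self.records[i] for i in positions]
```

A client might first search on author, then refine with `within="r1"`; the second request touches only the saved set, which is the saving the slide describes.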
Z39.50 services
• init -- client connects to the server and exchanges initial information, e.g., preferred message size
• explain -- client inquires of the server what databases are available for searching, the fields that are available, the syntax and formats supported, and other options
• search -- client presents a query to a database
  • choices of syntax for specifying searches
  • only Boolean queries widely implemented
  • one or more records may be returned to the client
Z39.50 services
• manipulation of result sets -- e.g., sort or delete
• present -- requests the server to send specified records from the result set to the client in a specified format
  • options: for controlling content and formats; for managing large records or large result sets
Sample query
• In the database named "Books", find all records for which the access point title contains the value "evangeline" and the access point author contains the value "longfellow".
• Z39.50 defines a rich variety of search access points that can be extended by implementers
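The sample query is a Boolean AND of two "contains" conditions. A minimal sketch of that structure, evaluated against in-memory records, might look as follows. The tuple-based query tree and the helper names are assumptions for illustration and do not reproduce any Z39.50 query syntax.

```python
def contains(access_point, value):
    # Leaf node: the named access point must contain the value.
    return ("contains", access_point, value.lower())


def and_(left, right):
    # Boolean AND of two sub-queries.
    return ("and", left, right)


def matches(query, record):
    op = query[0]
    if op == "and":
        return matches(query[1], record) and matches(query[2], record)
    if op == "contains":
        _, access_point, value = query
        return value in record.get(access_point, "").lower()
    raise ValueError("unknown operator: " + op)


def search(database, query):
    return [record for record in database if matches(query, record)]


# The query from the slide, against a hypothetical "Books" database.
sample_query = and_(contains("title", "evangeline"),
                    contains("author", "longfellow"))
```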
Problems with Z39.50
• Very difficult to implement
  • There are freely available implementations, but they are complex
• Outdated assumptions
  • Searching is expensive computationally
  • Bandwidth is limited (ASN.1 compression)
  • Originally designed for bibliographic record retrieval, and not full documents or other objects
• “Overspecified”
  • (Almost) Nobody Implements Explain!
• Assumes questionable user model (stateful)
Simple Digital Library Interoperability Protocol
http://www-diglib.stanford.edu/~testbed/doc2/SDLIP/
SDLIP
• Compromise between a full-scale, all-encompassing search middleware design such as Z39.50 and the “anything goes” approach typical of ad hoc search interface design on the web
• Support for stateful and stateless operation by the server
• Support for thin clients, such as handheld devices
• Developed jointly by Stanford, Berkeley, and UC Santa Barbara
• Heavily influenced by DASL from IETF
SDLIP – search middleware
SDLIP Interfaces
• Search Interface – defines a simple query language; the protocol can also carry other query languages
• Result Interface – parking meter metaphor supports varying notions of result sets
• Source Metadata Interface – provides an extension mechanism through discovery of server capabilities
Result Access Interface
• This interface allows client applications to access the set of result documents, wherever that set is maintained
• Four services:
  • getSessionInfo
  • getDocs
  • extendStateTimeout
  • cancelRequest
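The parking meter idea (results are kept only while time remains, and the client can add time) can be sketched as a small class. Only the four operation names come from the slide; the arguments, return values, the `park` helper, and the expiry behavior are assumptions for illustration, not the SDLIP specification.

```python
import time


class ResultStore:
    """Toy server-side store of result sets that expire like a parking meter."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock           # injectable so the sketch can be tested
        self._sessions = {}           # session id -> [docs, expiry time]

    def park(self, session_id, docs, timeout):
        # Hypothetical helper: the search side deposits results here.
        self._sessions[session_id] = [list(docs), self._clock() + timeout]

    def _live(self, session_id):
        entry = self._sessions.get(session_id)
        if entry is None or entry[1] <= self._clock():
            self._sessions.pop(session_id, None)   # the meter has run out
            raise KeyError(session_id)
        return entry

    def getSessionInfo(self, session_id):
        docs, expiry = self._live(session_id)
        return {"docs": len(docs), "seconds_left": expiry - self._clock()}

    def getDocs(self, session_id, start, count):
        return self._live(session_id)[0][start:start + count]

    def extendStateTimeout(self, session_id, extra_seconds):
        self._live(session_id)[1] += extra_seconds

    def cancelRequest(self, session_id):
        self._sessions.pop(session_id, None)
```

A stateless server could implement the same interface by returning all documents with the search response, which is one way to read "wherever that set is maintained".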
Source Metadata Interface
• Provides information about the service and server itself, such as
  • Collections served
  • Collection metadata/content information
  • Searchable properties
• Three operations
  • getInterface
  • getSubcollectionInfo
  • getPropertyInfo
STARTS/SDARTS
http://www-db.stanford.edu/~gravano/starts_home.html
http://sdarts.cs.columbia.edu/default.html
STARTS
• Stanford Protocol Proposal for Internet Retrieval and Search
• Joint work of Stanford Digital Library Project and Cornell Digital Library Research Group
• SDARTS – current work at Columbia to integrate with SDLIP and metadata harvesting (OAI-PMH)
Different text search engines are largely incompatible
• Different query languages (the query-language problem)
• Different ranking algorithms (the rank-merging problem)
• No exported information about sources (the metadata problem)
Rank Merging
• Return information in query result to allow rank merging:
  • unnormalized score of the document
  • statistics about each query term
We cannot merge document ranks from different sources directly
• Search engines use different ranking algorithms:
  DB1: (doc1, 0.7), (doc2, 0.3)
  DB2: (doc3, 1000), (doc4, 400)
  Merged rank?
• Some algorithms depend on the source characteristics
Extra information helps merge document ranks meaningfully
Sources return query results and statistics:
  Query: "distributed databases"
  DB1: (doc1, 0.7)
    "distributed" appears 3 times in doc1
    "databases" appears 5 times in doc1
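With per-term statistics in hand, a metasearcher can ignore the incomparable native scores and recompute one score for every document with a single formula. The sketch below uses the plain sum of query-term frequencies as that formula, which is an assumption chosen for brevity; the record layout and function name are also hypothetical.

```python
def merge_by_term_stats(query_terms, results_by_source):
    """Re-score every result with one common formula, then merge.

    results_by_source: dict source -> list of
        {"doc": id, "score": native score, "tf": {term: occurrences}}
    The common score is the sum of query-term frequencies (a simplification).
    """
    merged = []
    for source, results in results_by_source.items():
        for result in results:
            common = sum(result["tf"].get(term, 0) for term in query_terms)
            merged.append((result["doc"], source, common))
    # Native scores (0.7 versus 1000) never enter the comparison.
    merged.sort(key=lambda row: row[2], reverse=True)
    return merged
```

With the slide's statistics for doc1 (3 and 5 occurrences), doc1 gets a common score of 8 regardless of whether its source reported 0.7 or 1000.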
Motivating Source Metadata
Routing Problem - Disjoint Search Sources
Query: author=Hopcroft?
Indexes:
  I1: Hopcroft doc8; Tarjan doc9
  I2: Tarjan doc6; Wilensky doc7
  I3: Hopcroft doc1, doc2; Hartmanis doc3, doc4
Content Summary:
  Hopcroft: I1, I3
  Hartmanis: I3
  Tarjan: I1, I2
  Wilensky: I2
Query routed to: I1, I3
Results: doc1, doc2, doc8
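The routing example can be reproduced with two dictionaries: the indexes themselves and a content summary derived from them. The data below is taken from the slide; the function names and the dictionary layout are assumptions for this sketch.

```python
INDEXES = {
    "I1": {"Hopcroft": ["doc8"], "Tarjan": ["doc9"]},
    "I2": {"Tarjan": ["doc6"], "Wilensky": ["doc7"]},
    "I3": {"Hopcroft": ["doc1", "doc2"], "Hartmanis": ["doc3", "doc4"]},
}


def build_content_summary(indexes):
    """Map each author to the list of indexes that hold documents by that author."""
    summary = {}
    for name in sorted(indexes):
        for author in indexes[name]:
            summary.setdefault(author, []).append(name)
    return summary


def route_and_search(author, indexes, summary):
    """Query only the indexes the summary names, then combine their answers."""
    targets = summary.get(author, [])
    docs = []
    for name in targets:
        docs.extend(indexes[name][author])
    return targets, sorted(docs)
```

For `author="Hopcroft"` the summary sends the query to I1 and I3 only, so I2 is never contacted.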
Source Metadata
• Data to help select the right sources for a query
  • source metadata attributes - what the source engine can do
  • source content summary - what the source engine can search
• Simplified form of Z39.50 “explain” service
Source metadata attributes
• Fields Supported
• Modifiers Supported
• Score Range
• Ranking Algorithm ID
Source Content Summary
For each source:
• Vocabulary
• Document frequency for each word
• Total number of postings for each word
• Number of documents
• Implementation of GlOSS work:
  • GlOSS: Text-Source Discovery over the Internet, L. Gravano, H. Garcia-Molina, A. Tomasic, in ACM Transactions on Database Systems, vol. 24, no. 2, Jun. 1999
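A content summary with document frequencies and a document count is enough to estimate how many documents at a source match a conjunctive query, and so to rank sources before querying them. The sketch below assumes that query words occur independently of each other, in the spirit of the GlOSS work cited above; it is an illustration, not necessarily the estimator that STARTS or GlOSS prescribes, and the field names are hypothetical.

```python
def estimated_matches(summary, query_words):
    """Estimate documents containing all query words, assuming independence.

    summary: {"num_docs": N, "df": {word: document frequency}}
    Estimate = N * product over words of (df(word) / N).
    """
    n = summary["num_docs"]
    if n == 0:
        return 0.0
    estimate = float(n)
    for word in query_words:
        estimate *= summary["df"].get(word, 0) / n
    return estimate


def rank_sources(summaries, query_words):
    """Return source names with a nonzero estimate, most promising first."""
    scored = [(estimated_matches(s, query_words), name)
              for name, s in summaries.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for estimate, name in scored if estimate > 0]
```

A source whose summary lacks any one query word gets an estimate of zero and is skipped entirely.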
Distributed Searching Issues
Query Routing to Replicated Sources
Routing Problem
Replicated Distributed Indexes
Query: author=Hopcroft?
[Diagram: two replicas of the index holding Hopcroft doc8 and Tarjan doc9; two replicas of the index holding Tarjan doc6 and Wilensky doc7]
Routing Issues
• Choice of primary?, secondary?, etc.
• Fault-tolerance
• Routing Factors
  • Performance-based
  • Freshness-based
  • Cost-based
  • weighted mix based on user preference
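A weighted mix of routing factors can be sketched as a single penalty per replica, with the resulting order doubling as the primary, secondary, and so on for fault tolerance. The field names, units, and the linear combination are assumptions made for this illustration.

```python
def order_replicas(replicas, weights):
    """Order replicas from most to least preferred (lowest penalty first).

    replicas: list of {"name", "response_time_s", "index_age_h", "price"}
    weights:  {"performance", "freshness", "cost"} chosen by the user
    """
    def penalty(replica):
        return (weights["performance"] * replica["response_time_s"]
                + weights["freshness"] * replica["index_age_h"]
                + weights["cost"] * replica["price"])

    return [r["name"] for r in sorted(replicas, key=penalty)]


def query_with_fallback(order, send):
    """Try the primary first, then each backup, until one replica answers."""
    for name in order:
        try:
            return name, send(name)
        except ConnectionError:
            continue
    raise ConnectionError("no replica answered")
```

Here `send` stands for whatever function actually contacts a replica; it is a placeholder supplied by the caller.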
Components of Replicated Routing Problem
• Metadata Issue: metadata made available by indexer to aid in routing
• Metadata Distribution Issue: topology of metadata repositories
• Decision Issue: routing decision algorithms
• Fault-tolerance: use of backup indexers
Distributed Metadata for Query Routing
[Diagram label: central metadata store]
Performance-based Routing - present
[Diagram labels: 8 T; Timed low pass filter; Average response time; Predicted response time]
New = low pass filter(T, actual response time, old)