SDARTS is a protocol and toolkit that integrates SDLIP and STARTS for metasearching: evaluating source relevance, exporting metadata, querying different kinds of sources, and merging results. It builds on two existing interoperability efforts that involved multiple universities, and keeps compatibility with both.

SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching
Noah Green, Panagiotis G. Ipeirotis, Luis Gravano
Computer Science Dept., Columbia University

Web vs. “Hidden” Web
• Web
  • Link structure
  • Crawlable
• Individual collections (or “Hidden” Web)
  • No link structure
  • Documents “hidden” behind search forms
Columbia University Computer Science Dept.

Metasearching
Given many document sources and a query, a metasearcher:
• Finds the good sources for the query.
• Evaluates the query at these sources.
• Merges the results from these sources.
[Figure labels: Metasearcher; Existing Web Application; Non-indexed Documents; Legacy Database / WAIS / etc.]
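
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Source` class, its `relevance` and `search` callables, and the `max_sources` cutoff are hypothetical names, not part of SDARTS, and the merge step naively assumes that scores from different sources are directly comparable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Source:
    """Hypothetical stand-in for one searchable collection."""
    name: str
    relevance: Callable[[str], float]                 # estimated usefulness for a query
    search: Callable[[str], List[Tuple[str, float]]]  # returns (doc_id, score) pairs


def metasearch(query, sources, max_sources=2, max_results=10):
    # Step 1: find the good sources for the query.
    ranked = sorted(sources, key=lambda s: s.relevance(query), reverse=True)
    chosen = [s for s in ranked[:max_sources] if s.relevance(query) > 0]

    # Step 2: evaluate the query at these sources.
    hits = []
    for source in chosen:
        for doc_id, score in source.search(query):
            hits.append((source.name, doc_id, score))

    # Step 3: merge the results (assumption: scores are comparable across sources).
    hits.sort(key=lambda hit: hit[2], reverse=True)
    return hits[:max_results]


if __name__ == "__main__":
    medline = Source("medline",
                     lambda q: 0.9 if "biomedical" in q else 0.0,
                     lambda q: [("m1", 0.8), ("m2", 0.3)])
    news = Source("news", lambda q: 0.2, lambda q: [("n1", 0.5)])
    print(metasearch("biomedical", [news, medline]))
```

In practice the merge step is the hard part, since sources rank with different scoring functions; that is one reason the sources need to export metadata.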

Metasearching Issues
• How to evaluate the relevance of different sources?
• How to get metadata?
• How to query different types of sources?
• How to merge the results?
[Figure labels: Metasearcher; http://…/getTitle?title=‘biomedical’&…; SELECT title FROM articles . . .; grep ‘biomedical’ *.txt]
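
To make the "different types of sources" problem concrete, the sketch below translates one search term into three native forms similar to those on the slide. Everything here is an assumption for illustration: the base URL `http://example.org/getTitle`, the `title` parameter, the `WHERE ... LIKE` clause, and the `-l` flag for grep are not taken from SDARTS or from the original examples, whose details are elided.

```python
import shlex
from urllib.parse import urlencode


def to_cgi_url(base_url, term):
    """Build a GET URL for a (hypothetical) CGI search form."""
    return base_url + "?" + urlencode({"title": term})


def to_sql(term):
    """Return a parameterized SQL statement and its arguments.

    The WHERE clause is an assumed completion of the slide's elided query.
    """
    return "SELECT title FROM articles WHERE title LIKE ?", ("%" + term + "%",)


def to_grep(term):
    """Build a shell command that lists the .txt files containing the term."""
    return "grep -l " + shlex.quote(term) + " *.txt"


if __name__ == "__main__":
    term = "biomedical"
    print(to_cgi_url("http://example.org/getTitle", term))
    print(to_sql(term))
    print(to_grep(term))
```

A metasearcher that had to maintain one such translator per source would not scale, which motivates a common protocol.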

Solution: A Common Protocol
[Figure labels: Metasearcher; S = Search; M = Metadata; one S/M pair in front of each source; grep; cat; select; http://….]

Why “SDARTS = SDLIP+STARTS”?
• NOT yet another protocol
• We combined existing efforts, keeping compatibility
• SDLIP defines a common interface for interacting with the sources
• STARTS defines expressive metadata that sources should export

SDARTS: Outline
• Description of SDLIP.
• Description of STARTS.
• Integration of SDLIP and STARTS into SDARTS.
• Implementation and configuration of SDARTS wrappers.

Developed during the DLI2 project by:
• Stanford University
• UC Berkeley
• UC San Diego
• UC Santa Barbara
• San Diego Supercomputer Center
• California Digital Library

SDLIP: An Interoperability Protocol
• Basic interfaces:
  • Search
  • Metadata
• A wrapper implements these interfaces
• Interface parameter and return types are XML
• Transport layer implementations (HTTP, CORBA)
• Flexible and adaptable
• Optimized for clients that know the source to query (i.e., simple requirements for metadata)
[Figure labels: Common SDLIP interface (S, M); DB-specific interfaces]

STARTS: Informal Standard for Search Engine Interoperability
• Coordinated by Stanford in 1996
• Both search engine vendors and “users” participated:
  • Netscape
  • Microsoft Network
  • GILS
  • Infoseek
  • Harvest
  • Hewlett-Packard
  • Fulcrum
  • Verity
  • Wais
  • PLS
  • Excite

STARTS: A Metasearching Protocol
• Defines:
  • Query language
  • Results format
  • Metadata for the collection
• No specified transport layer or implementation
• Naturally complements SDLIP for metasearching purposes
Example of metadata:
  Stemming = no
  # of docs = 20,000
  …
  Diabetes  TF: 12, DF: 4
  XML  TF: 1200, DF: 750
  …
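
Metadata of this kind lets a metasearcher estimate how useful a source is before querying it. The following is a minimal sketch, assuming a deliberately simple scoring rule (the product of each query term's document-frequency fraction, treating terms as independent). It is not the scoring used by SDARTS or by any particular metasearcher, and `ContentSummary` and `score` are hypothetical names; the numbers are the ones from the example above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ContentSummary:
    """Hypothetical in-memory form of a source's exported metadata."""
    num_docs: int
    doc_freq: Dict[str, int]  # term -> number of documents containing it (DF)


def score(summary: ContentSummary, query_terms: List[str]) -> float:
    """Estimate the fraction of documents matching all query terms.

    Assumes terms occur independently, which is a simplification.
    """
    if not query_terms or summary.num_docs == 0:
        return 0.0
    estimate = 1.0
    for term in query_terms:
        df = summary.doc_freq.get(term.lower(), 0)
        estimate *= df / summary.num_docs
    return estimate


if __name__ == "__main__":
    summary = ContentSummary(num_docs=20000, doc_freq={"diabetes": 4, "xml": 750})
    for query in (["XML"], ["diabetes"], ["xml", "diabetes"]):
        print(query, score(summary, query))
```

With these statistics, the source looks far more promising for "XML" (750 of 20,000 documents) than for "diabetes" (4 of 20,000).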

SDARTS = SDLIP + STARTS
• Extends SDLIP with a richer metadata interface from STARTS
• Keeps compatibility with SDLIP (same DTDs)
• Can easily support similar protocols (transforming XML is easy)
• Makes wrapping collections easy through a toolkit

SDARTS: Implementation Details
• Defined STARTS using XML; new version named “STARTS XML.”
• Used getPropertyInfo() from SDLIP to extend SDLIP with STARTS metadata.
• Term frequency information is available through a different URL (faster download for metasearchers that do not use it).

Example of STARTS Metadata: “Content Summary”
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE starts:scontent-summary SYSTEM "http://www.cs.columbia.edu/~dli2test/dtd/starts.dtd">
  <starts:scontent-summary
      xmlns:starts="http://www.cs.columbia.edu/~dli2test/STARTS/"
      version="Starts 1.0" stemming="false" stopwords="false"
      case-sensitive="true" fields="false" numdocs="19997">
    <starts:field-freq-info>
      …
      <starts:field type-set="basic1" name="body-of-text"/>
      <starts:term>
        <starts:value>algorithm</starts:value>
      </starts:term>
      <starts:term-freq>75</starts:term-freq>
      <starts:doc-freq>34</starts:doc-freq>
      …
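
A client could read such a content summary with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a completed version of the fragment above. The closing tags and the assumption that the term, term-freq, and doc-freq elements are direct children of `starts:field-freq-info` are guesses made so the sample is well-formed; the elided parts of the real document may differ. The XML declaration and DOCTYPE are omitted from the sample for simplicity.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the example; the prefix mapping is for lookups only.
NS = {"starts": "http://www.cs.columbia.edu/~dli2test/STARTS/"}

# Assumed completion of the truncated example (closing tags added).
SAMPLE = """\
<starts:scontent-summary
    xmlns:starts="http://www.cs.columbia.edu/~dli2test/STARTS/"
    version="Starts 1.0" stemming="false" stopwords="false"
    case-sensitive="true" fields="false" numdocs="19997">
  <starts:field-freq-info>
    <starts:field type-set="basic1" name="body-of-text"/>
    <starts:term><starts:value>algorithm</starts:value></starts:term>
    <starts:term-freq>75</starts:term-freq>
    <starts:doc-freq>34</starts:doc-freq>
  </starts:field-freq-info>
</starts:scontent-summary>
"""


def parse_content_summary(xml_text):
    """Return (number of documents, {term: (term_freq, doc_freq)})."""
    root = ET.fromstring(xml_text)
    num_docs = int(root.get("numdocs"))
    stats = {}
    for info in root.findall("starts:field-freq-info", NS):
        term = info.findtext("starts:term/starts:value", namespaces=NS)
        term_freq = int(info.findtext("starts:term-freq", namespaces=NS))
        doc_freq = int(info.findtext("starts:doc-freq", namespaces=NS))
        stats[term] = (term_freq, doc_freq)
    return num_docs, stats


if __name__ == "__main__":
    print(parse_content_summary(SAMPLE))
```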

SDARTS Wrapper Design Rationale
• Goal: Isolate the developer from parsing and generating STARTS XML requests and responses
• Goal: Reusability and simplicity
• SDARTS toolkits and reference implementations
  • Wrapping local text document collections
  • Wrapping XML collections
  • Wrapping HTTP/CGI interfaces

SDARTS Wrapping Architecture
[Figure labels: SDLIP LSP Client Program; Existing SDLIP Client; Internet; STARTS XML over HTTP/DASL; SDARTS Bean; FrontEnd LSP (S, M); LSPObjects; BackEndLSP; STARTS XML; Native Protocol / Search Engine]

SDARTS: Wrapper Implementation
• Standardize on STARTS as the XML protocol for SDLIP
• Create a standard wrapper architecture
• “Front-End”:
  • Implements SDLIP interfaces
  • Communicates with client using STARTS XML nested inside SDLIP method calls
• “Back-End”:
  • Communicates with front-end using simple container objects
  • Talks to underlying collection using native protocol
[Figure labels: STARTS XML; FrontEnd LSP (S, M); LSPObjects; BackEnd LSP; Native Protocol / Search Engine]
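
The front-end/back-end split can be illustrated with a small Python sketch. This is not the toolkit's Java API: the class names `Query`, `Result`, `BackEnd`, `InMemoryBackEnd`, and `FrontEnd` are hypothetical, and the `<query>`/`<results>` XML is a toy format standing in for STARTS XML. The point is only that the front end owns all XML handling, while a back end sees plain container objects.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Query:
    """Simple container passed from the front end to the back end."""
    terms: List[str]


@dataclass
class Result:
    """Simple container passed from the back end to the front end."""
    doc_id: str
    score: float


class BackEnd(ABC):
    """A back end talks to one collection in its native protocol."""

    @abstractmethod
    def search(self, query: Query) -> List[Result]:
        ...


class InMemoryBackEnd(BackEnd):
    """Toy back end: counts term occurrences in an in-memory dict of documents."""

    def __init__(self, docs: Dict[str, str]):
        self.docs = docs

    def search(self, query: Query) -> List[Result]:
        results = []
        for doc_id, text in self.docs.items():
            words = text.lower().split()
            hits = sum(words.count(term.lower()) for term in query.terms)
            if hits:
                results.append(Result(doc_id, float(hits)))
        return sorted(results, key=lambda r: r.score, reverse=True)


class FrontEnd:
    """Parses request XML, delegates to a back end, and generates response XML."""

    def __init__(self, back_end: BackEnd):
        self.back_end = back_end

    def handle(self, request_xml: str) -> str:
        root = ET.fromstring(request_xml)
        query = Query([el.text for el in root.findall("term")])
        out = ET.Element("results")
        for result in self.back_end.search(query):
            ET.SubElement(out, "doc", id=result.doc_id, score=str(result.score))
        return ET.tostring(out, encoding="unicode")


if __name__ == "__main__":
    docs = {"d1": "biomedical imaging of biomedical samples", "d2": "cooking"}
    front_end = FrontEnd(InMemoryBackEnd(docs))
    print(front_end.handle("<query><term>biomedical</term></query>"))
```

Writing a wrapper for a new collection then means subclassing only the back end, which never touches XML.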

Adding a Local Text Collection
• Write standard doc_config.xml file
  • Regular expressions to describe where to find fields
• No coding or compilation needed!
[Figure labels: doc_config.xml; index; meta_attributes.xml; content_summary.xml; TextBackEndLSP; Lucene Search Engine; Non-indexed Text Documents]

Sample doc_config.xml
  <doc-config re-index="true">
    <path>/home/dli2test/collections/doc1/20groups</path>
    <linkage-prefix>http://localhost/20groups</linkage-prefix>
    . . . . . . . .
    <stop-words><word>the</word><word>a</word></stop-words>
    . . . . . . . .
    <field-descriptor name="author">
      <start><regexp>^From: </regexp></start>
      <end><regexp>$</regexp></end>
    </field-descriptor>
    . . . . . . . .
  </doc-config>
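
The "author" field descriptor above pairs a start pattern with an end pattern. A minimal Python sketch of how such a pair could pull a field out of a document follows. The function `extract_field` is hypothetical, and applying the patterns in multi-line mode (so that `^` and `$` match at line boundaries) is an assumption about how the toolkit interprets them.

```python
import re


def extract_field(text, start_pattern, end_pattern):
    """Return the text between the first start match and the next end match.

    Returns None if either pattern does not match.
    """
    start = re.search(start_pattern, text, re.MULTILINE)
    if start is None:
        return None
    # Look for the end pattern only after the start match.
    end = re.compile(end_pattern, re.MULTILINE).search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()]


if __name__ == "__main__":
    message = "From: alice@example.org\nSubject: hello\n\nBody text here.\n"
    print(extract_field(message, r"^From: ", r"$"))
```

The same function works for any other field whose boundaries can be described by two regular expressions, which is what keeps the configuration free of compiled code.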

Adding a Local XML Collection
• Write standard doc_config.xml file
• Write an XSL stylesheet to find fields in documents
• No coding or compilation needed!
[Figure labels: doc_config.xml; doc_style.xsl; index; meta_attributes.xml; content_summary.xml; XMLBackEndLSP; Apache Xalan XSL Processor; Lucene Search Engine; Non-indexed XML Documents]

Adding an External Web Collection
• The wrapper must send the correct CGI parameters and parse the returned HTML
• No new code needed; uses XSLT for parsing the results
• Usually no metadata or content summary available
• Possible to automate metadata extraction:
  • [Callan et al., SIGMOD’99]: Automatic extraction of vocabulary statistics
  • [Ipeirotis et al., SIGMOD’01]: Automatic categorization of databases
  • [Raghavan and Garcia-Molina, VLDB 2001]: Automatic interaction with forms
[Figure labels: meta_attributes.xml; Web BackEnd LSP; HTTP/CGI Collection]

Conclusions
• SDARTS uses SDLIP interfaces and code (compatible with it).
• SDARTS enhances SDLIP and STARTS.
• Reference wrappers available for common collection types.
• Any text or XML document collection can be easily wrapped without new compiled code.
• Automatic metadata extraction for local collections
• Using XSLT for web wrappers
• Possible to automate the extraction of rich metadata for web-accessible collections
• New wrappers can be written without having to parse or generate STARTS XML.
• SDARTS is in Java and can run on multiple platforms.

We are on the Web :)
• Available for downloading:
  • SDARTS DTDs and documentation
  • Java code and search engine (Lucene) included
  • Source code documentation
  • Web client source code
  • Reference wrappers (text, XML, web)
  • Wrapped collections
• The web client is web-accessible for the public to test and query our SDARTS server
http://sdarts.cs.columbia.edu/

Related Work
• Metadata:
  • Open Archives
  • Dublin Core
  • MARC
  • …
• Interoperability Protocols:
  • Z39.50
  • GILS