Explore the challenges and solutions in metasearching the Hidden Web, extracting content summaries using SDARTS, and interfacing with the Open Archives Initiative at Columbia University's Computer Science Dept.
Extending SDARTS: Extracting Metadata from Web Databases and Interfacing with Open Archives Initiative
Panagiotis G. Ipeirotis, Tom Barry, Luis Gravano
Computer Science Dept., Columbia University
Metasearching? Why? “Surface” Web vs. “Hidden” Web
• “Surface” Web
  • Link structure
  • Crawlable
• “Hidden” Web
  • Documents “hidden” in databases
  • No link structure
  • Search engines do not index them
  • Need to query each collection individually
Columbia University Computer Science Dept.
Metasearching Challenges
• Select good databases for a given query
• Evaluate the query at these databases
• Merge the results from these databases
[Figure: a Hidden Web metasearcher reaches its databases (existing web database, non-indexed documents, relational database / library / etc.) through uniform interfaces and “content summaries” of databases (frequencies of words). Example summaries: wireless: 2,000, network: 8,000, ...; wireless: 5, network: 40, ...; wireless: 0, network: 10, ...]
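The first challenge, database selection, can be sketched in a few lines. This is only a minimal illustration, not the scoring used by SDARTS or by any published selection algorithm: it ranks databases by the smallest document frequency among the query words. The database names are hypothetical; the frequencies are the ones shown in the figure.

```python
# Content summaries: word -> number of documents containing the word.
# Database names are hypothetical; frequencies follow the slide's figure.
summaries = {
    "db_a": {"wireless": 2000, "network": 8000},
    "db_b": {"wireless": 5, "network": 40},
    "db_c": {"wireless": 0, "network": 10},
}

def rank_databases(query_words, summaries):
    """Rank databases by the smallest document frequency among the query words.

    A naive stand-in for a real selection algorithm: a database that lacks
    any query word scores 0 and is dropped from the ranking.
    """
    scores = {}
    for name, summary in summaries.items():
        score = min(summary.get(word, 0) for word in query_words)
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)

ranking = rank_databases(["wireless", "network"], summaries)
```

Under this scoring the third database is skipped for the query "wireless network" because it contains no document with "wireless", so the query never has to be sent there.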
Outline
• Background: SDARTS, SDLIP, STARTS
• Extracting content summaries from remote web databases
• Interfacing with Open Archives Initiative
SDARTS: SDLIP + STARTS
NOT yet another protocol
[Figure: the metasearcher talks to each wrapped source through SDLIP interfaces and STARTS metadata. Every wrapper exposes S (Search) and M (Metadata) components. Example underlying sources: grep, cat, select, http://….]
STARTS: A Metasearching Protocol
• Defines:
  • Query language
  • Results format
  • Metadata for the collection
• Complements SDLIP for metasearching purposes
• Provides metadata for individual documents
• Provides content summaries for databases
PubMed content summary
number of documents = 3,868,552
…
cancer 1,398,178
heart 281,506
hepatitis 23,481
basketball 907
SDARTS: The Toolkit
• SDARTS architecture makes new-wrapper implementation easy
• SDARTS toolkit includes reference implementations for common types of text databases:
  • Local text databases
  • Local XML databases
  • Remote web databases
Customization requires just editing configuration files, no programming
SDARTS Content Summaries
• Detailed content summaries easily extracted from locally available (plain-text or XML) databases
• Detailed content summaries so far not available for remote web databases
  • No access to full contents
Extracting Content Summaries from Remote Web Databases
• No direct access to remote documents
• Resort to document sampling:
  • Send queries to the database
  • Retrieve a representative document sample
  • Use the sample to create an approximation of the content summary
• Database selection algorithms work well even with approximate content summaries (VLDB 2002)
Topic-based Sampling: Training
• Start with a predefined hierarchy and associated, pre-classified documents
• Train rule-based document classifiers for each node
• The output is a set of rules like:
  Root node:
  • ibm AND computers → Computers
  • lung AND cancer → Health
  • …
  Health node:
  • hepatitis AND liver → Hepatitis
  • angina → Heart
  • …
Topic-based Sampling: Probing
Sampling proceeds in rounds: in each round, the rules associated with each node are turned into queries to the database
• Transform each rule into a query
• For each query:
  • Send query to database
  • Record number of matches
  • Retrieve top-k documents for query
• At the end of the round:
  • Analyze matches for each category
  • Choose category to focus on
The result is a representative document sample
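One probing round can be sketched as below. This is a toy illustration under stated assumptions, not the SDARTS implementation: the in-memory document list and the `search` function stand in for a remote database's query interface, and "choose category to focus on" is simplified to "pick the category with the most matches". The rules are the examples from the training slide.

```python
from collections import Counter

# Hypothetical toy "remote database": it can be queried but not listed.
DOCS = [
    "hepatitis damages the liver",
    "liver transplant after hepatitis",
    "angina and heart disease",
    "lung cancer screening",
    "ibm builds computers",
]

def search(query_words, k):
    """Simulated search interface: returns (number of matches, top-k documents)."""
    hits = [d for d in DOCS if all(w in d.split() for w in query_words)]
    return len(hits), hits[:k]

# Rules per hierarchy node: (query words, category).
RULES = {
    "Root": [(("ibm", "computers"), "Computers"), (("lung", "cancer"), "Health")],
    "Health": [(("hepatitis", "liver"), "Hepatitis"), (("angina",), "Heart")],
}

def probe(node, k=4):
    """One sampling round: send each rule of `node` as a query, record the
    match counts per category, and add the top-k documents to the sample."""
    matches = Counter()
    sample = set()
    for words, category in RULES[node]:
        count, top_docs = search(words, k)
        matches[category] += count
        sample.update(top_docs)
    # Simplification: focus next on the category with the most matches.
    focus = matches.most_common(1)[0][0]
    return matches, sample, focus

matches, sample, focus = probe("Health")
```

The match counts serve two purposes: they decide which category to probe in the next round, and they are kept as evidence of absolute document frequencies in the database.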
Sample Contains “Relative” Word Frequencies
• “Liver” appears in 200 out of 300 documents in sample
• “Kidney” appears in 100 out of 300 documents in sample
• “Hepatitis” appears in 30 out of 300 documents in sample
Document frequencies in actual database?
• Query “liver” returned 140,000 matches
• Query “hepatitis” returned 20,000 matches
• “kidney” was not a query probe…
Can exploit number of matches from one-word queries
Adjusting Document Frequencies
• We know the absolute document frequency f of words from one-word queries
• We know the ranking r of words according to document frequency in the sample
• Mandelbrot’s formula connects word frequency f and ranking r
• We use curve-fitting to estimate the absolute frequency of all words in the sample
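The curve-fitting step can be illustrated with a small sketch. Mandelbrot's formula has the form f = P (r + p)^(-B); the code below assumes the simplified case p = 0, so that log f is linear in log r, and fits P and B by least squares. The ranks 1, 2, 3 for "liver", "kidney", "hepatitis" are a toy assumption based only on the three sample frequencies quoted earlier (200, 100, 30); a real sample ranks thousands of words.

```python
import math

def fit_power_law(known):
    """Least-squares fit of log f = log P - B * log r over (rank, frequency) pairs."""
    xs = [math.log(r) for r, _ in known]
    ys = [math.log(f) for _, f in known]
    n = len(known)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    B = -slope
    P = math.exp(mean_y - slope * mean_x)
    return P, B

def estimate(rank, P, B):
    """Estimated absolute document frequency of the word at the given rank."""
    return P * rank ** (-B)

# Words with known absolute frequencies (from one-word query probes):
# "liver" at rank 1 with 140,000 matches, "hepatitis" at rank 3 with 20,000.
P, B = fit_power_law([(1, 140_000), (3, 20_000)])

# "kidney" (rank 2) was never sent as a query; estimate it from the curve.
kidney_estimate = estimate(2, P, B)
```

With only two known points the fitted curve passes through both of them exactly, and the estimate for "kidney" falls between the two known frequencies, roughly 41,000 documents.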
Implementing Content-Summary Extraction in SDARTS Toolkit
• Implemented content-summary extraction module as a J2EE-compliant servlet
  • First, build an SDARTS wrapper for the remote web database
  • Then, trigger the extraction process to generate the content summary automatically
• Module customizable with any classification scheme
  • Toolkit provides a 72-node hierarchical scheme and associated classifiers
  • To add a new scheme, define the hierarchy and provide classifiers for the internal nodes
Fraction of PubMed Content Summary
PubMed content summary
number of documents = 3,868,552
…
cancer 1,398,178
aids 106,512
heart 281,506
angina 26,775
hepatitis 23,481
…
basketball 907
cpu 487
• Extracted automatically
• ~ 27,500 words in the extracted content summary
• Less than 200 queries sent
• Retrieved 4 documents per query
The extracted content summary accurately represents size and contents of the database
Topic-based Sampling: Conclusions
• SDARTS now supports extraction of detailed content summaries from any database, local or remote
• Sophisticated database selection algorithms can now be implemented on top of SDARTS
Implemented and available for download:
• Database Selection Module
• SDARTS Client with Database Selection
Interfacing with Open Archives Initiative (OAI)
“No man is an island, entire of itself; every man is a piece of the continent, a part of the main...” (John Donne)
• Export SDARTS metadata under OAI
• Access transparently any OAI collection through SDARTS
[Figure: OAI Service Provider, SDARTS/SDLIP Server, OAI Data Provider, SDARTS Client]
Exporting SDARTS Metadata under OAI
• SDARTS supports detailed, record-level metadata for each document, for XML and plain-text collections
  • Easy mapping to Dublin Core
• SDARTS also exports content summaries under OAI
  • Each SDARTS collection is mapped to an OAI set
  • We export the content summaries under OAI, as metadata about the set
Example record:
<PAPER>
  <TITLE>The threat of vancomycin resistance</TITLE>
  <AUTHORS>Trish M. Perl MD, MSc</AUTHORS>
  <FILENO>ajm_106_05_0489</FILENO>
  <APPEARED>
    <JRNL>American Journal of Medicine</JRNL>
    <VOL>106</VOL><ISS>5</ISS>
    <DATE>3 May </DATE><YEAR>1999</YEAR>
  </APPEARED>
  <ABSTRACT>… </ABSTRACT>
  <BODY> … </BODY>
</PAPER>
COLUMBIA SDARTS Server
• PubMed Publications
• Aides Medical Collection
• NOAH: New York Online Access to Health
• Cardiovascular Institute of the South
• Columbia's DLI2 Medical Corpus
• Harrisons Online
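A mapping from such a record to Dublin Core might look like the sketch below. The field correspondences (for example AUTHORS to creator, JRNL to source) are assumptions made for illustration; the slides do not say which mapping SDARTS uses. The elided ABSTRACT and BODY elements are left out.

```python
import xml.etree.ElementTree as ET

PAPER_XML = """<PAPER>
  <TITLE>The threat of vancomycin resistance</TITLE>
  <AUTHORS>Trish M. Perl MD, MSc</AUTHORS>
  <FILENO>ajm_106_05_0489</FILENO>
  <APPEARED>
    <JRNL>American Journal of Medicine</JRNL>
    <VOL>106</VOL><ISS>5</ISS>
    <DATE>3 May </DATE><YEAR>1999</YEAR>
  </APPEARED>
</PAPER>"""

# Assumed mapping from element paths in the record to Dublin Core element names.
FIELD_MAP = {
    "TITLE": "title",
    "AUTHORS": "creator",
    "FILENO": "identifier",
    "APPEARED/JRNL": "source",
    "APPEARED/YEAR": "date",
}

def to_dublin_core(xml_text):
    """Return a dict of Dublin Core fields extracted from one PAPER record."""
    record = ET.fromstring(xml_text)
    dc = {}
    for path, dc_name in FIELD_MAP.items():
        node = record.find(path)
        if node is not None and node.text:
            dc[dc_name] = node.text.strip()
    return dc

dc_record = to_dublin_core(PAPER_XML)
```

Keeping the mapping in a table rather than in code matches the toolkit's stated goal of customization by configuration: a different collection schema would need only a different table.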
SDARTS OAI Server: Details
• Uses OCLC OAI Server
• Uses MySQL (via JDBC) to store OAI records
• Records materialized after first request for space efficiency
• Distributed as WAR file
• Simple configuration: specify SDARTS/MySQL address
[Figure: OAI Service Provider → SDARTS OAI Interface → SDARTS Server, with the interface connected via JDBC to a MySQL RDBMS]
Searching OAI Collections
• OAI is not designed for searching
  • Possible to restrict only “Date” and “Set”
• Need to search OAI collections
  • Users want to specify “Title”, “Author”, etc.
[Figure: a user sends Author = “F. Douglass” to an OAI Service Provider, which has no way (“?”) to pass Author = “F. Douglass” on to the OAI Data Provider (e.g., Library of Congress)]
Harvesting and Searching OAI within SDARTS
• OAI exports metadata records in XML
• SDARTS can index and search XML collections
Solution:
• Harvest OAI records (by “Date”, “Set”)
• Store records locally as XML documents
• Use SDARTS XML wrapper to index them
[Figure: the SDARTS/SDLIP Server harvests OAI/XML records from the OAI Data Provider (e.g., Library of Congress) and indexes them]
The OAI collection is searchable as an SDARTS XML database
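The harvesting step restricts records by date and set, which correspond to the `from` and `set` arguments of the OAI `ListRecords` verb. The sketch below only builds the request URL and does not contact any server. The base URL and date are the example values that appear elsewhere in this deck; using `loc` as the set name and `oai_dc` as the metadata prefix are assumptions for illustration.

```python
from urllib.parse import urlencode

def list_records_url(base_url, from_date=None, set_spec=None,
                     metadata_prefix="oai_dc"):
    """Build an OAI ListRecords request URL restricted by date and set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

# Set name "loc" is an assumption; the base URL and date come from the deck.
url = list_records_url(
    "http://memory.loc.gov/cgi-bin/oai",
    from_date="2002-01-01",
    set_spec="loc",
)
```

The XML response to such a request is what would be stored locally and handed to the SDARTS XML wrapper for indexing.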
Adding an OAI Collection in SDARTS
[Screenshot: configuration form filled in with the values http://memory.loc.gov/cgi-bin/oai, loc, and 2002-01-01]
Distributed Search over OAI
• SDARTS treats OAI collections as simple, local XML databases
• Exact content summaries are exported for OAI collections
• Possible to build sophisticated distributed search over OAI using SDARTS
SDARTS Content Summary for an OAI collection:
VT Electronic Thesis & Dissertation
number of documents = 2,948
…
study 1,479
thesis 493
…
cancer 13
basketball 2
…
Conclusions
• SDARTS can now extract rich content summaries from:
  • Local text and XML databases
  • Remote web databases
  • OAI-compliant collections
• SDARTS is now OAI-compliant
• SDARTS allows easy integration of any OAI collection into SDARTS
• SDARTS supports searching transparently over a wide range of heterogeneous collections
No programming required for any of the tasks
We are on the Web :-)
• SDARTS executables and documentation
• SDARTS source code with documentation
• SDARTS web client
• SDARTS database selection module
• SDARTS-OAI interface tools
• Sample SDARTS-compliant databases
http://sdarts.cs.columbia.edu/