Extending SDARTS: Extracting Metadata from Web Databases and Interfacing with Open Archives Initiative
Panagiotis G. Ipeirotis, Tom Barry, Luis Gravano
Computer Science Dept., Columbia University
Metasearching? Why? “Surface” Web vs. “Hidden” Web
• “Surface” Web
  • Link structure
  • Crawlable
• “Hidden” Web
  • Documents “hidden” in databases
  • No link structure
  • Search engines do not index them
  • Need to query each collection individually
Columbia University Computer Science Dept.
Metasearching Challenges
• Select good databases for a given query
• Evaluate the query at these databases
• Merge the results from these databases
[Diagram: a Hidden Web metasearcher reaches three kinds of sources (an existing web database, non-indexed documents, a relational database / library / etc.) through uniform interfaces. Each source has a “content summary” of word frequencies: “wireless: 2,000, network: 8,000, ...”; “wireless: 5, network: 40, ...”; “wireless: 0, network: 10, ...”.]
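The first challenge, selecting databases from their content summaries, can be sketched as follows. This is only an illustration under stated assumptions: the database names are hypothetical, and the scoring rule (the smallest document frequency among the query words, which bounds how many documents can contain all of them) is a simple stand-in, not the selection algorithm used in SDARTS.

```python
from typing import Dict, List, Tuple

# Content summaries: word -> number of documents containing the word.
# The frequencies come from the slide; the database names are hypothetical.
SUMMARIES: Dict[str, Dict[str, int]] = {
    "db_a": {"wireless": 2000, "network": 8000},
    "db_b": {"wireless": 5, "network": 40},
    "db_c": {"wireless": 0, "network": 10},
}


def score(summary: Dict[str, int], query: List[str]) -> int:
    """Upper bound on documents matching every query word (assumed scoring rule)."""
    return min(summary.get(word, 0) for word in query)


def select_databases(query: List[str], k: int = 2) -> List[Tuple[str, int]]:
    """Return up to k databases with the highest scores, dropping zero scores."""
    ranked = sorted(
        ((name, score(summary, query)) for name, summary in SUMMARIES.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(name, s) for name, s in ranked if s > 0][:k]
```

For the query "wireless network", this sketch would skip the third database entirely, since its summary shows no document containing "wireless".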
Outline
• Background: SDARTS, SDLIP, STARTS
• Extracting content summaries from remote web databases
• Interfacing with Open Archives Initiative
SDARTS: SDLIP + STARTS
NOT yet another protocol
[Diagram: a metasearcher talks to three wrapped sources, labeled “grep cat”, “select”, and “http://….”. Each wrapper exposes SDLIP interfaces for search (S) and STARTS metadata (M).]
STARTS: A Metasearching Protocol
• Defines:
  • Query language
  • Results format
  • Metadata for the collection
• Complements SDLIP for metasearching purposes
• Provides metadata for individual documents
• Provides content summaries for databases
PubMed content summary: number of documents = 3,868,552; …; cancer 1,398,178; heart 281,506; hepatitis 23,481; basketball 907
SDARTS: The Toolkit
• SDARTS architecture makes new-wrapper implementation easy
• SDARTS toolkit includes reference implementations for common types of text databases:
  • Local text databases
  • Local XML databases
  • Remote web databases
Customization requires just editing configuration files, no programming
SDARTS Content Summaries
• Detailed content summaries easily extracted from locally available (plain-text or XML) databases
• Detailed content summaries so far not available for remote web databases
  • No access to full contents
Extracting Content Summaries from Remote Web Databases
• No direct access to remote documents
• Resort to document sampling:
  • Send queries to the database
  • Retrieve a representative document sample
  • Use the sample to create an approximation of the content summary
• Database selection algorithms work well even with approximate content summaries (VLDB 2002)
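The sampling steps above can be sketched as follows. This is a minimal sketch, not the SDARTS module: the `search` callable and the in-memory toy database are hypothetical stand-ins for a remote search interface, and the probe queries are simply supplied by the caller.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical search interface: (query, k) -> (number of matches, top-k documents).
SearchFn = Callable[[str, int], Tuple[int, List[str]]]


def make_toy_database(docs: List[str]) -> SearchFn:
    """Build an in-memory stand-in for a remote web database (assumption)."""
    def search(query: str, k: int) -> Tuple[int, List[str]]:
        hits = [d for d in docs if query in d.lower().split()]
        return len(hits), hits[:k]
    return search


def sample_summary(search: SearchFn, probes: Iterable[str], k: int = 4) -> Tuple[int, Counter]:
    """Send each probe, pool the distinct retrieved documents, and count in how
    many sampled documents each word appears (sample document frequency)."""
    sample: Set[str] = set()
    for probe in probes:
        _, top_docs = search(probe, k)
        sample.update(top_docs)
    doc_freq: Counter = Counter()
    for doc in sample:
        doc_freq.update(set(doc.lower().split()))
    return len(sample), doc_freq
```

The resulting counts describe the sample only; they approximate relative word frequencies in the database, not absolute ones.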
Topic-based Sampling: Training
• Start with a predefined hierarchy and associated, pre-classified documents
• Train rule-based document classifiers for each node
• The output is a set of rules like:
  • Root node: ibm AND computers → Computers; lung AND cancer → Health; …
  • Health node: hepatitis AND liver → Hepatitis; angina → Heart; …
Topic-based Sampling: Probing
Sampling proceeds in rounds: in each round, the rules associated with each node are turned into queries to the database
• Transform each rule into a query
• For each query:
  • Send query to database
  • Record number of matches
  • Retrieve top-k documents for query
• At the end of the round:
  • Analyze matches for each category
  • Choose category to focus on
The result is a representative document sample
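One probing round can be sketched as follows. The rules are the Root-node examples from the training slide; the `count_matches` callable is a hypothetical stand-in for the database's search interface, and choosing the category with the most matches is a simplifying assumption (the slide only says that matches are analyzed per category). Document retrieval is left out to keep the sketch short.

```python
from collections import Counter
from typing import Callable, List, Tuple

# A rule is (words combined with AND, category it points to).
Rule = Tuple[Tuple[str, ...], str]

# Example rules for the Root node, as listed on the training slide.
ROOT_RULES: List[Rule] = [
    (("ibm", "computers"), "Computers"),
    (("lung", "cancer"), "Health"),
]


def probe_round(rules: List[Rule], count_matches: Callable[[str], int]) -> Counter:
    """Turn each rule into an AND query and total the reported matches per category."""
    matches: Counter = Counter()
    for words, category in rules:
        query = " AND ".join(words)
        matches[category] += count_matches(query)
    return matches


def choose_focus(matches: Counter) -> str:
    """Pick the category with the most matches (assumed, simplified criterion)."""
    return max(matches, key=matches.get)
```

In the next round, the same two functions would be applied to the rules of the chosen node, for example the Health node.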
Sample Contains “Relative” Word Frequencies
• “Liver” appears in 200 out of 300 documents in sample
• “Kidney” appears in 100 out of 300 documents in sample
• “Hepatitis” appears in 30 out of 300 documents in sample
Document frequencies in actual database?
• Query “liver” returned 140,000 matches
• Query “hepatitis” returned 20,000 matches
• “kidney” was not a query probe…
Can exploit number of matches from one-word queries
Adjusting Document Frequencies
• We know the absolute document frequency f of words from one-word queries
• We know the ranking r of words according to document frequency in the sample
• Mandelbrot’s formula connects word frequency f and ranking r
• We use curve-fitting to estimate the absolute frequency of all words in the sample
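Mandelbrot's formula is usually written f = P (r + p)^(-B), with P, p, and B fitted to the data. The sketch below illustrates the curve-fitting step under a simplifying assumption: it fixes p = 0, so the formula reduces to a straight line in log-log space that ordinary least squares can fit. The function names are hypothetical, and the example figures are the "liver", "kidney", and "hepatitis" numbers from the preceding slide.

```python
import math
from typing import List, Tuple


def fit_power_law(points: List[Tuple[int, float]]) -> Tuple[float, float]:
    """Fit log f = log P - B * log r by least squares; return (P, B).

    points holds (sample rank, known absolute frequency) pairs and needs
    at least two distinct ranks. Assumes p = 0 in Mandelbrot's formula.
    """
    xs = [math.log(r) for r, _ in points]
    ys = [math.log(f) for _, f in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def estimate_frequency(rank: int, P: float, B: float) -> float:
    """Estimated absolute document frequency for a word at the given sample rank."""
    return P * rank ** (-B)


# "liver" is rank 1 in the sample (140,000 matches); "hepatitis" is rank 3 (20,000).
P, B = fit_power_law([(1, 140_000), (3, 20_000)])
kidney_estimate = estimate_frequency(2, P, B)  # "kidney" is rank 2, never probed
```

With only two known points the fit passes through both of them, so the estimate for "kidney" necessarily falls between the two known frequencies.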
Implementing Content-Summary Extraction in SDARTS Toolkit
• Implemented content-summary extraction module as a J2EE-compliant servlet
  • First, build an SDARTS wrapper for the remote web database
  • Then, trigger the extraction process to generate the content summary automatically
• Module customizable with any classification scheme
  • Toolkit provides a 72-node hierarchical scheme and associated classifiers
  • To add a new scheme, define the hierarchy and provide classifiers for the internal nodes
Fraction of PubMed Content Summary
PubMed content summary: number of documents = 3,868,552; …; cancer 1,398,178; aids 106,512; heart 281,506; angina 26,775; hepatitis 23,481; …; basketball 907; cpu 487
• Extracted automatically
• ~27,500 words in the extracted content summary
• Fewer than 200 queries sent
• Retrieved 4 documents per query
The extracted content summary accurately represents size and contents of the database
Topic-based Sampling: Conclusions
• SDARTS now supports extraction of detailed content summaries from any database, local or remote
• Sophisticated database selection algorithms can now be implemented on top of SDARTS
Implemented and available for download:
• Database Selection Module
• SDARTS Client with Database Selection
Interfacing with Open Archives Initiative (OAI)
“No man is an island, entire of itself; every man is a piece of the continent, a part of the main…” (John Donne)
• Export SDARTS metadata under OAI
• Access transparently any OAI collection through SDARTS
[Diagram: OAI Service Provider, SDARTS/SDLIP Server, OAI Data Provider, SDARTS Client.]
Exporting SDARTS Metadata under OAI
• SDARTS supports detailed, record-level metadata for each document, for XML and plain-text collections
  • Easy mapping to Dublin Core
• SDARTS also exports content summaries under OAI
  • Each SDARTS collection is mapped to an OAI set
  • We export the content summaries under OAI, as metadata about the set
Example record:
  <PAPER>
    <TITLE>The threat of vancomycin resistance</TITLE>
    <AUTHORS>Trish M. Perl MD, MSc</AUTHORS>
    <FILENO>ajm_106_05_0489</FILENO>
    <APPEARED>
      <JRNL>American Journal of Medicine</JRNL>
      <VOL>106</VOL><ISS>5</ISS>
      <DATE>3 May </DATE><YEAR>1999</YEAR>
    </APPEARED>
    <ABSTRACT>… </ABSTRACT>
    <BODY> … </BODY>
  </PAPER>
COLUMBIA SDARTS Server
• PubMed Publications
• Aides Medical Collection
• NOAH: New York Online Access to Health
• Cardiovascular Institute of the South
• Columbia's DLI2 Medical Corpus
• Harrisons Online
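The mapping to Dublin Core can be sketched as follows for records shaped like the PAPER example. The field mapping itself is an assumption made for illustration (the slide does not say which SDARTS field goes to which Dublin Core element), and `to_dublin_core` is a hypothetical helper, not part of the toolkit.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical mapping from PAPER element paths to Dublin Core element names.
FIELD_MAP: Dict[str, str] = {
    "TITLE": "dc:title",
    "AUTHORS": "dc:creator",
    "FILENO": "dc:identifier",
    "APPEARED/JRNL": "dc:source",
    "APPEARED/YEAR": "dc:date",
}


def to_dublin_core(paper_xml: str) -> Dict[str, str]:
    """Extract the mapped fields from one PAPER record, skipping empty ones."""
    root = ET.fromstring(paper_xml)
    record: Dict[str, str] = {}
    for path, dc_name in FIELD_MAP.items():
        text = root.findtext(path)
        if text and text.strip():
            record[dc_name] = text.strip()
    return record
```

Because the mapping is plain data, a different collection schema would only need a different `FIELD_MAP`, which is in the spirit of the configuration-only customization the toolkit aims for.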
SDARTS OAI Server: Details
• Uses OCLC OAI Server
• Uses MySQL (via JDBC) to store OAI records
• Records materialized after first request for space efficiency
• Distributed as WAR file
• Simple configuration: specify SDARTS/MySQL address
[Diagram: OAI Service Provider, SDARTS OAI Interface, SDARTS Server, and MySQL RDBMS (reached via JDBC).]
Searching OAI Collections
• OAI is not designed for searching
  • Requests can be restricted only by “Date” and “Set”
• Need to search OAI collections
  • Users want to specify “Title”, “Author”, etc.
[Diagram: a user asks an OAI Service Provider for Author = “F. Douglass”; how the provider can pass Author = “F. Douglass” on to an OAI Data Provider (e.g., Library of Congress) is marked with a “?”.]
Harvesting and Searching OAI within SDARTS
• OAI exports metadata records in XML
• SDARTS can index and search XML collections
Solution:
• Harvest OAI records (by “Date”, “Set”)
• Store records locally as XML documents
• Use SDARTS XML wrapper to index them
The OAI collection is searchable as an SDARTS XML database
[Diagram: the SDARTS/SDLIP Server harvests OAI/XML records from an OAI Data Provider (e.g., Library of Congress) and indexes them.]
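The harvesting step can be sketched as building a selective-harvest request. This sketch assumes the OAI-PMH 2.0 request parameters (`verb`, `metadataPrefix`, `from`, `set`), which may differ from the protocol version the toolkit targeted; `list_records_url` is a hypothetical helper and performs no network access.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    from_date: Optional[str] = None,
    set_spec: Optional[str] = None,
    metadata_prefix: str = "oai_dc",
) -> str:
    """Build a ListRecords request restricted by date and/or set.

    Parameter names follow OAI-PMH 2.0 (an assumption about the target
    protocol version); oai_dc is the Dublin Core metadata format.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)
```

For example, `list_records_url("http://memory.loc.gov/cgi-bin/oai", from_date="2002-01-01")` yields a URL asking for records added or changed since that date. The XML returned by such a request would then be stored locally and handed to the SDARTS XML wrapper for indexing.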
Adding an OAI Collection in SDARTS
[Screenshot: a form filled in with the values “http://memory.loc.gov/cgi-bin/oai”, “loc”, and “2002-01-01”.]
Distributed Search over OAI
• SDARTS treats OAI collections as simple, local XML databases
• Exact content summaries are exported for OAI collections
• Possible to build sophisticated distributed search over OAI using SDARTS
SDARTS content summary for an OAI collection (VT Electronic Thesis & Dissertation): number of documents = 2,948; …; study 1,479; thesis 493; …; cancer 13; basketball 2; …
Conclusions
• SDARTS can now extract rich content summaries from:
  • Local text and XML databases
  • Remote web databases
  • OAI-compliant collections
• SDARTS is now OAI-compliant
• SDARTS allows easy integration of any OAI collection into SDARTS
• SDARTS supports searching transparently over a wide range of heterogeneous collections
No programming required for any of the tasks
We are on the Web :-)
• SDARTS executables and documentation
• SDARTS source code with documentation
• SDARTS web client
• SDARTS database selection module
• SDARTS-OAI interface tools
• Sample SDARTS-compliant databases
http://sdarts.cs.columbia.edu/