Divide and Conquer: Challenges in Scaling Federated Search. Presented by Abe Lederman, President and CTO Deep Web Technologies, LLC. SearchEngine Meeting 24 April 2006 Boston, MA.
Finding the Gold Hidden in the World Wide Web • “Google-type” search engines “pan” the surface web for gold • “Deep Web” search engines go mining for gold
Challenges Overview • Managing a large number of sources • Searching a large number of sources in parallel • Organizing and ranking the results returned
Challenges of Managing Thousands of Data Sources • Locate Reliable Sources • Categorize Sources by Content • Configure Sources for Searching • Maintain Sources
Challenges in Searching Thousands of Sources • Automatically Select Sources to Search • Retrieve Results from Cache • Perform Many Searches in Parallel • Bring Back Best Results
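The "Perform Many Searches in Parallel" item can be illustrated with a minimal Python sketch. It is not taken from the presentation: `search_source` is a hypothetical stand-in for a real source connector (it only sleeps to mimic network latency), and the deadline value is an assumption. The idea shown is to fan a query out to all sources at once and return whatever has arrived by a deadline, so one slow source does not hold up the rest.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


def search_source(name, query, delay):
    """Hypothetical connector: sleeps to mimic latency, then returns one hit."""
    time.sleep(delay)
    return [f"{name}: hit for {query!r}"]


def federated_search(query, sources, timeout=0.3, max_workers=8):
    """Query all sources concurrently and return (results, timed_out_sources).

    `sources` maps a source name to its simulated latency in seconds.
    """
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = {
        pool.submit(search_source, name, query, delay): name
        for name, delay in sources.items()
    }
    results = []
    try:
        for future in as_completed(futures, timeout=timeout):
            try:
                results.extend(future.result())
            except Exception:
                pass  # a failing source should not sink the whole search
    except FuturesTimeout:
        pass  # deadline reached: keep what has arrived so far
    timed_out = sorted(name for f, name in futures.items() if not f.done())
    pool.shutdown(wait=False)  # do not block on the slow sources
    return results, timed_out


results, timed_out = federated_search(
    "deep web", {"fast_a": 0.01, "fast_b": 0.02, "slow_c": 1.0}
)
```

Threads are used here because waiting on remote sources is I/O-bound; a production system would also need per-source rate limits and retries, which this sketch leaves out.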
Search Conductor: Source Selection Optimizer • Source Descriptions • Previous Results
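One way to picture source selection is the sketch below. It assumes that Source Descriptions and Previous Results are the two inputs to the optimizer, and the scoring rule (query-term overlap with the description plus a past hit rate) is purely hypothetical; the slide does not say how the actual optimizer scores sources.

```python
def select_sources(query, descriptions, previous_hit_rate, top_k=2):
    """Pick up to `top_k` sources for a query.

    descriptions:      source name -> free-text description (assumed input)
    previous_hit_rate: source name -> fraction of past searches that returned
                       useful results (assumed input, 0.0 if unknown)
    """
    terms = set(query.lower().split())
    scored = []
    for name, text in descriptions.items():
        overlap = len(terms & set(text.lower().split()))
        history = previous_hit_rate.get(name, 0.0)
        scored.append((overlap + history, name))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:top_k] if score > 0]


descriptions = {
    "physics_db": "physics research papers preprints",
    "bio_db": "biology genomics papers",
    "math_db": "mathematics research papers",
}
previous_hit_rate = {"physics_db": 0.9, "math_db": 0.2}
chosen = select_sources("physics papers", descriptions, previous_hit_rate)
```

With these made-up inputs, `physics_db` matches both query terms and has a strong history, so it is chosen first; `math_db` edges out `bio_db` only because of its recorded hit rate.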
Caching of Search Results: reduces the load (cost) of accessing sources. Challenges: • Requires a large database • Need to determine how often to update the cache • Works best with lots of users doing similar searches
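The caching trade-offs can be made concrete with a small in-memory sketch. It is an illustration only: the class name `ResultCache`, the time-to-live policy, and the query normalization are assumptions, and a real deployment would use a database as the slide notes. The time-to-live stands in for the "how often to update" decision, and normalizing queries is one way to benefit from users doing similar searches.

```python
import time


class ResultCache:
    """In-memory cache of search results with a fixed time-to-live (TTL)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested
        self._entries = {}      # (source, normalized query) -> (stored_at, results)

    @staticmethod
    def _key(source, query):
        # Normalize case and whitespace so near-identical queries share an entry.
        return (source, " ".join(query.lower().split()))

    def get(self, source, query):
        key = self._key(source, query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, results = entry
        if self.clock() - stored_at > self.ttl:
            del self._entries[key]  # stale: force a fresh search
            return None
        return results

    def put(self, source, query, results):
        self._entries[self._key(source, query)] = (self.clock(), results)


# Usage with a fake clock, so expiry can be shown without waiting.
now = [0.0]
cache = ResultCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("physics_db", "Dark  Matter", ["result 1", "result 2"])
fresh = cache.get("physics_db", "dark matter")  # same entry after normalization
now[0] = 61.0
stale = cache.get("physics_db", "dark matter")  # expired, so None
```

A shorter TTL keeps results fresher at the price of more load on the sources; a longer one does the reverse.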
We Address Scalability Through a Grid-Based Solution • Uses open standards (Web Services, WSDL, SOAP, XML) • Runs on distributed nodes • Is platform independent (Java based) • Is very flexible, providing a framework for integrating various filtering and analysis tools
Search Conductor (flowchart): Select sources to search → Perform Search → Enough good results? • YES: Deliver results to user • NO: Can I get more results from “good” sources? (YES: Get Next Results; NO: select sources to search)
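The conductor's decision loop might be sketched as follows. This is a reading of the flowchart, not the presenter's implementation: it assumes that when no "good" source has more results, control returns to selecting further sources, and the names `conduct_search`, `fetch`, and `is_good` are hypothetical.

```python
def conduct_search(batches, fetch, is_good, wanted, max_pages=5):
    """Collect `wanted` good results, preferring sources already proven good.

    batches: lists of source names, in the order they should be tried
    fetch:   callable (source, page) -> list of results (empty when exhausted)
    is_good: callable result -> bool
    """
    good_results = []
    for batch in batches:                          # "Select sources to search"
        active = {source: 0 for source in batch}   # source -> next page to fetch
        while active:
            still_good = {}
            for source, page in active.items():    # "Perform Search" / "Get Next Results"
                hits = [r for r in fetch(source, page) if is_good(r)]
                good_results.extend(hits)
                if hits and page + 1 < max_pages:  # only "good" sources are asked again
                    still_good[source] = page + 1
            if len(good_results) >= wanted:        # "Enough good results?"
                return good_results[:wanted]       # "Deliver results to user"
            active = still_good                    # empty: move on to the next batch
    return good_results                            # all sources exhausted


# Made-up paged data for three sources.
data = {
    "a": [["a1 good", "a2 bad"], ["a3 good"], []],
    "b": [["b1 bad"], ["b2 good"]],
    "c": [["c1 good", "c2 good"]],
}
fetch = lambda source, page: data[source][page] if page < len(data[source]) else []
is_good = lambda result: result.endswith("good")
delivered = conduct_search([["a", "b"], ["c"]], fetch, is_good, wanted=3)
```

In this run source `b` returns nothing good on its first page and is dropped, source `a` is asked for more until it runs dry, and only then is source `c` brought in.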
Searching a large number of sources can lead to a flood of results
Challenges in Organizing and Ranking Results • Multi-tier Relevance Ranking • User-driven Ranking • Clustering of Results
Multi-tier Relevance Ranking • QuickRank – Ranks results based on occurrence of search terms in title, author, and snippet • MetaRank – Ranks results utilizing custom algorithms applied to meta-data • DeepRank – Downloads and indexes full-text documents HEAVY LIFTING REQUIRED!
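As an illustration of the first tier, here is a minimal sketch of ranking by occurrence of search terms in title, author, and snippet. The function names and the per-field weights are assumptions chosen for the example; the slide does not give QuickRank's actual formula.

```python
# Hypothetical weights: a match in the title counts more than one in the snippet.
FIELD_WEIGHTS = {"title": 3.0, "author": 2.0, "snippet": 1.0}


def quick_score(result, query):
    """Weighted count of query-term occurrences across the three fields."""
    terms = query.lower().split()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = result.get(field, "").lower().split()
        score += weight * sum(words.count(term) for term in terms)
    return score


def quick_rank(results, query):
    """Return results sorted from highest to lowest score."""
    return sorted(results, key=lambda r: quick_score(r, query), reverse=True)


docs = [
    {"title": "Cooking basics", "author": "Jones", "snippet": "a federated approach"},
    {"title": "Federated search at scale", "author": "Smith", "snippet": "search across sources"},
]
ranked = quick_rank(docs, "federated search")
```

This tier needs only the metadata already present in a result list, which is why it is cheap compared with downloading and indexing full text as DeepRank does.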
User-driven Ranking • Desired: blending (weighting) of the above criteria
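Blending could be as simple as a user-weighted average of per-criterion scores. The sketch below assumes each criterion has already been scored on a 0 to 1 scale; the criterion names `relevance` and `recency` are placeholders, since the slide's list of criteria is not part of the extracted text.

```python
def blend(scores, weights):
    """Weighted average of criterion scores using user-chosen weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(w * scores.get(name, 0.0) for name, w in weights.items()) / total


def rank_by_preference(results, weights):
    """Sort results by blended score, best first."""
    return sorted(results, key=lambda r: blend(r["scores"], weights), reverse=True)


results = [
    {"id": "A", "scores": {"relevance": 0.9, "recency": 0.1}},
    {"id": "B", "scores": {"relevance": 0.5, "recency": 0.9}},
]
prefers_relevance = rank_by_preference(results, {"relevance": 3, "recency": 1})
prefers_recency = rank_by_preference(results, {"relevance": 1, "recency": 3})
```

The same two results swap places depending on the weights, which is the behavior a user-driven ranking control would expose.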
A Grand Challenge for Federated Search Source: Walter Warnick, Ph.D., DOE OSTI. Global Discovery: Increasing the Pace of Knowledge Diffusion to Increase the Pace of Science. Presented at the Annual Meeting of the American Association for the Advancement of Science, February 16-20, 2006.
Knowledge Diffusion in Action • Global Discovery Search Portal with the Math Community, Biology Community, and Physics Community • Math Databases (Research Papers, Correspondence, Conferences): Mathematician’s Scientific Discovery • Biology Databases (Research Papers, Correspondence, Conferences): Biology Researcher’s Scientific Discovery • Physics Databases (Research Papers, Correspondence, Conferences): Physics Scientific Discovery
Grid of Grids: Scaling to the Next Level • Each circle = a portal with 10-100 sources • End result is thousands of sources in 2 hops
Thank You! Abe Lederman, 122 Longview Drive, Los Alamos, NM 87544 • abe@deepwebtech.com • www.deepwebtech.com