Explore mediator-based integration techniques for improved access to heterogeneous databases. Develop intelligent wrappers and efficient caching strategies to optimize search performance. Experiment with new ranking methods for a more relevant user experience.
Information Integration for Digital Libraries August 10, 2000 Prof. Sang Ho Lee Soongsil University Seoul, Korea shlee@computing.soongsil.ac.kr
Information integration • Provision of integrated access to multiple, distributed, heterogeneous databases and other information sources • Mediator approach • More up-to-date data • No need to copy data • Query needs can be unknown • Data warehouse approach • High query performance • Can operate when sources unavailable • Extra information at warehouse • Modify, summarize (store aggregates), add historical information
Mediator Approach (figure): Clients → Mediator → Wrappers (one per source) → Sources
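The mediator architecture can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record`, `Wrapper`, and `Mediator` names, the two in-memory catalogs, and their schemas are all hypothetical and do not come from the slides. Each wrapper hides one source's native format, and the mediator forwards a query to every wrapper at query time without copying any data.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Common result format that the mediator exposes to clients."""
    title: str
    source: str


class Wrapper:
    """Hides one source: runs its native search and maps raw rows to Records."""

    def __init__(self, name, native_search, to_title):
        self.name = name
        self._native_search = native_search  # callable(keyword) -> raw rows
        self._to_title = to_title            # callable(raw row) -> title string

    def search(self, keyword):
        return [Record(self._to_title(row), self.name)
                for row in self._native_search(keyword)]


class Mediator:
    """Forwards each query to every wrapper and concatenates the answers."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, keyword):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.search(keyword))
        return results


# Two hypothetical sources with different schemas (assumed for illustration).
CATALOG_A = [{"name": "Digital Libraries Overview"}, {"name": "Query Optimization"}]
CATALOG_B = [(1, "Digital Preservation"), (2, "Metadata Standards")]


def search_a(keyword):
    return [row for row in CATALOG_A if keyword.lower() in row["name"].lower()]


def search_b(keyword):
    return [row for row in CATALOG_B if keyword.lower() in row[1].lower()]


mediator = Mediator([
    Wrapper("A", search_a, lambda row: row["name"]),
    Wrapper("B", search_b, lambda row: row[1]),
])
results = mediator.query("digital")
for record in results:
    print(record.source, record.title)
```

Because the sources are consulted at query time, the answers reflect their current contents, which is the "more up-to-date data" advantage listed for the mediator approach.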
Data Warehouse Approach (figure): Sources → Integration → Warehouse and Metadata → Query & Analysis → Clients
Web Searching Practice • Approx. 800 million indexable Web pages (Feb. 1999) • Low coverage of the Web • No engine indexes more than 16% of indexable Web pages • Out of date • New pages take months to be indexed • Low metadata use • 34% use “keywords” or “description” metatags • 0.3% use the Dublin Core metadata standard • Simple queries • Most queries use 1-3 search words • Poor relevancy ranking and precision
Meta Search engines • USA • SavvySearch (www.savvysearch.com) • MetaCrawler (www.go2net.com/search.html) • Ask Jeeves (www.askjeeves.com) • ProFusion (www.profusion.com) • Mamma (www.mamma.com) • Ixquick (www.ixquick.com) • Korea • Wakano (www.wakano.co.kr) • Ms. DaChanni (www.mochanni.com) • Over 3000 metasearch engines around the world
Operation Flow and Technical Issues • User query → Decompose and format queries → Send queries and get results → Post-processing (ranking, clustering, etc.) → Output result
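The flow above can be sketched as a small Python pipeline. Everything concrete here is an assumption made for illustration: the two engines are hypothetical stand-ins that return canned ranked lists, the `"plus"` and `"and"` syntax labels are invented, and the post-processing step merges results by summing reciprocal ranks, which is one simple fusion choice and not a method taken from the slides.

```python
from collections import defaultdict


def format_query(terms, syntax):
    """Rewrite the decomposed query terms into one engine's query syntax."""
    if syntax == "plus":
        return " ".join("+" + term for term in terms)
    if syntax == "and":
        return " AND ".join(terms)
    raise ValueError("unknown syntax: " + syntax)


# Hypothetical engines: (query syntax, search function returning ranked URLs).
ENGINES = {
    "engine_x": ("plus", lambda q: ["http://a.example", "http://b.example",
                                    "http://c.example"]),
    "engine_y": ("and", lambda q: ["http://b.example", "http://d.example"]),
}


def metasearch(user_query):
    # 1. Decompose the user query into terms.
    terms = user_query.lower().split()
    scores = defaultdict(float)
    for syntax, search in ENGINES.values():
        # 2. Format the query for this engine, send it, and get results.
        ranked_urls = search(format_query(terms, syntax))
        # 3. Post-process: a URL earns 1/rank from each engine that returns it,
        #    so duplicates are merged and agreement between engines is rewarded.
        for rank, url in enumerate(ranked_urls, start=1):
            scores[url] += 1.0 / rank
    # 4. Output: highest combined score first, ties broken by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))


merged = metasearch("Digital Library")
print(merged)
```

A page returned by both engines (`http://b.example`) outranks a page that only one engine placed first, which is the kind of behavior a post-processing step is meant to provide.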
Current Practice of Metasearch Engines • Tend toward a least-common-denominator interface • Do not fully utilize the functions of individual sources • Cover general areas rather than a specific area • Make little use of domain knowledge • Give little consideration to personal profiles
Proposed Research Topics (1) • Theme: mediator-based integration techniques (in particular, metasearch engines) • Intelligent wrapper techniques • To extract, combine, and reconcile information from external sources • Exploit user profiles and utilize the functions of each source as much as possible • Should be flexible and adaptable as external sources change • Several approaches • Formal language based, machine learning based, heuristic based, extended CFG based, …
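As a rough illustration of the heuristic-based end of that spectrum, the sketch below extracts result entries from a search engine's HTML page using Python's standard `html.parser` module. The heuristic (absolute `http` links are results, relative links are site navigation) and the sample page are assumptions made for this example; a real wrapper would need source-specific rules and would have to be revised whenever the source changes its page layout.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (url, anchor text) pairs for links that look like results."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None   # URL of the result link being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Heuristic (assumed): only absolute links are search results.
            if href and href.startswith("http"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.results.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_results(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical result page, invented for illustration.
SAMPLE_PAGE = """
<html><body>
<a href="/help">Help</a>
<ol>
<li><a href="http://a.example/">Digital <b>Libraries</b></a></li>
<li><a href="http://b.example/">Metadata</a></li>
</ol>
</body></html>
"""

extracted = extract_results(SAMPLE_PAGE)
print(extracted)
```

The fragility of such hand-written rules is one motivation for the machine learning and grammar-based approaches listed on the slide.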
Proposed Research Topics (2) • Efficiency issues • How to cache results and queries to provide a fast response to users • How to parallelize access to external sources
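Both efficiency issues can be sketched together in Python. This is a minimal illustration, not a proposed design: the `CachingMetasearcher` class, its `source_calls` counter, and the two stand-in sources are hypothetical, queries are treated as equal when they match after lowercasing and whitespace collapsing (an assumption), and the cache never expires entries.

```python
from concurrent.futures import ThreadPoolExecutor


class CachingMetasearcher:
    """Queries all sources in parallel and caches results per normalized query."""

    def __init__(self, sources, max_workers=4):
        self.sources = sources          # name -> callable(query) -> list of hits
        self.max_workers = max_workers
        self._cache = {}
        self.source_calls = 0           # number of source requests actually sent

    @staticmethod
    def _normalize(query):
        # Assumed equivalence: case and extra whitespace do not matter.
        return " ".join(query.lower().split())

    def search(self, query):
        key = self._normalize(query)
        if key in self._cache:
            return self._cache[key]     # cache hit: no source is contacted
        # Submit one request per source, then wait for all of them.
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            futures = {name: pool.submit(search_fn, query)
                       for name, search_fn in self.sources.items()}
            results = {name: future.result() for name, future in futures.items()}
        self.source_calls += len(self.sources)
        self._cache[key] = results
        return results


# Hypothetical sources, invented for illustration.
searcher = CachingMetasearcher({
    "x": lambda q: ["x-hit for " + q],
    "y": lambda q: ["y-hit for " + q],
})
first = searcher.search("Digital  Libraries")
second = searcher.search("digital libraries")   # served from the cache
print(searcher.source_calls)
```

With threads, the response time is bounded by the slowest source instead of the sum of all sources. A practical cache would also need an expiry policy, since stale entries undercut the up-to-date data that motivates the mediator approach.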
Research/Development Strategies • Categorize objects and develop a specialized search mechanism for each category • Build a working system to test theories experimentally • Experiment with new ranking methods • Google, Goto, …