A Self-Healing Approach for A Domain-Specific Deep Web Search Tool Fan Wang and Gagan Agrawal The Ohio State University Presented by: Tantan Liu
The Deep Web • The definition of "Deep Web" from Wikipedia: The Deep Web refers to World Wide Web content that is not part of the surface web, which is indexed by standard search engines.
Deep Web in Biological Domain • The deep web is 500 times larger than the surface web • There are nearly 800 deep web data sources in the bio-domain • 95 percent of the deep web is publicly accessible
Search the Deep Web: Solved and Unsolved Issues • Data Source Integration • Schema matching and schema mining • Query Planning and Answering • Keyword search and structured query answering • Fault Tolerance • Data access over wide-area networks • Unpredictable data source inaccessibility/unavailability • Network contention • Users still expect an uncompromised search experience despite these failures
Our Solution: A Redundancy-Based Self-Healing Approach • Identify data redundancy across independent data sources • Find the minimal "has to be replaced" sub-plan caused by data source unavailability/inaccessibility • Find the sub-query corresponding to this sub-plan • Generate a new replacing sub-plan that exploits redundancy among the remaining data sources
Roadmap • Introduction and Motivation • Problem Formulation in Detail • Our Self-Healing Approach • Evaluation • Conclusion
Data Redundancy Model • A data source is represented by a three-tuple (IN, O, Con) • IN: input attributes • O: output attributes • Con: attribute conditions imposed on the data source • Data redundancy condition between data sources A and B (see the sketch below) • They have the same input attributes • They have overlapping output attributes • They have non-conflicting attribute conditions
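A minimal Python sketch (not the authors' implementation) of the three-tuple representation and the redundancy test above; the attribute names, the dictionary form of Con, and the helper names are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class DataSource:
    name: str
    IN: frozenset                              # input attributes
    O: frozenset                               # output attributes
    Con: dict = field(default_factory=dict)    # attribute -> imposed condition

def non_conflicting(con_a: dict, con_b: dict) -> bool:
    # Conditions conflict only if both constrain a shared attribute differently.
    shared = con_a.keys() & con_b.keys()
    return all(con_a[k] == con_b[k] for k in shared)

def redundant(a: DataSource, b: DataSource) -> bool:
    # Same inputs, overlapping outputs, non-conflicting attribute conditions.
    return a.IN == b.IN and bool(a.O & b.O) and non_conflicting(a.Con, b.Con)

# Hypothetical example: two sources keyed on the same input attribute.
A = DataSource("SourceA", frozenset({"gene_id"}), frozenset({"protein", "sequence"}))
B = DataSource("SourceB", frozenset({"gene_id"}), frozenset({"protein", "function"}))
print(redundant(A, B))   # True: they share the output attribute "protein"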
Query and Query Plan • Query: expressed in an SQL-like format — select t1, t2, …, tn (the search term set ST) from the deep web where in1 = e1 and in2 = e2 and … and inm = em (the input term set INT) • Query Plan: a DAG of data source nodes that "covers" the user query • A starting node's input attributes are input terms of the query • A node's output attributes may be user-requested search terms • Edges capture data source dependencies (one possible representation is sketched below)
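Purely for illustration, one way to hold the two term sets and a plan DAG; the attribute and node names are hypothetical and do not come from the paper.

query = {
    "ST":  {"t1", "t2", "tn"},                 # select t1, t2, ..., tn
    "INT": {"in1": "e1", "in2": "e2"},         # where in1 = e1 and in2 = e2 ...
}

# Plan DAG as an adjacency list: an edge X -> Y means Y consumes output of X.
# Starting nodes (here "A" and "B") take their inputs directly from INT.
plan = {"A": ["C"], "B": ["C", "E"], "C": ["D"], "D": [], "E": []}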
Algorithm Overview (1) • Find the part of the query plan that needs to be replaced • Impacted sub-plan • The sub-graph reachable from the unavailable data source nodes (a reachability sketch follows) • Minimal impacted sub-plan • The impacted sub-plan with the data source nodes that remain usable under the given data redundancy removed
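A small reachability sketch for the impacted sub-plan, assuming the plan is stored as an adjacency list with edges pointing from a data source to the sources that depend on it (an assumption, not the paper's data structure).

from collections import deque

def impacted_subplan(plan: dict, unavailable: set) -> set:
    # Breadth-first search from every unavailable data source node.
    reachable = set(unavailable)
    queue = deque(unavailable)
    while queue:
        node = queue.popleft()
        for succ in plan.get(node, []):
            if succ not in reachable:
                reachable.add(succ)
                queue.append(succ)
    return reachable

# Hypothetical plan with unavailable sources B and I.
plan = {"A": ["C"], "B": ["C", "E"], "C": ["D"], "D": [], "E": [], "I": ["E"]}
print(impacted_subplan(plan, {"B", "I"}))   # B, I, C, E, D (set order may vary)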
Algorithm Overview (2) • Find Maximal Fixable Sub-Query • The sub-query corresponding to the minimal impacted sub-plan • New Sub-Plan Generation • Use our existing query planning algorithm • Example sub-query: select t3, t4 where input = t1
Minimal Impacted Sub-Plan Algorithm 1. Identify unavailable data sources, e.g., {B, I} 2. Find the sub-graph reachable from them (the impacted sub-plan) 3. Apply the cascading-crash check to each data source X that depends on a crashed data source D; X avoids a cascading crash if A. At least one data source sharing redundant data with D is not crashed, and B. At least one such data source has the same usage as D (a sketch of the check follows)
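A hedged sketch of step 3: pruning the impacted sub-plan down to the minimal impacted sub-plan. The dictionaries and the same_usage predicate are assumptions standing in for the paper's redundancy and usage checks.

def minimal_impacted(impacted, crashed, depends_on, redundant_with, same_usage):
    # impacted:       nodes reachable from the crashed sources
    # crashed:        the unavailable sources themselves (self-crashed nodes)
    # depends_on:     node -> set of data sources it depends on
    # redundant_with: node -> set of sources sharing redundant data with it
    # same_usage:     predicate(replacement, original) -> bool
    minimal = set(crashed)
    changed = True
    while changed:                              # propagate cascading crashes to a fixpoint
        changed = False
        for x in impacted - minimal:
            for d in depends_on.get(x, set()) & minimal:
                usable = [r for r in redundant_with.get(d, set())
                          if r not in minimal and same_usage(r, d)]
                if not usable:                  # conditions A/B fail: X cascade-crashes
                    minimal.add(x)
                    changed = True
                    break
    return minimal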
Minimal Impacted Sub-Plan Fixability • How much of the minimal impacted sub-plan can be fixed using other data sources, taking advantage of data redundancy • Dead attribute • No un-crashed data source can provide the attribute as an output attribute • Plan fixability categorization (see the sketch below) • Fully fixable: only self-crashed nodes, no dead attribute • Partially fixable: only self-crashed nodes, some dead attribute • Cascading fully fixable: cascading crashed nodes, no dead attribute • Cascading partially fixable: cascading crashed nodes, some dead attribute
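The four categories reduce to two boolean facts about the minimal impacted sub-plan; a trivial illustrative mapping (function and argument names are mine, not the paper's):

def fixability(has_cascading_crashed_node: bool, has_dead_attribute: bool) -> str:
    if not has_cascading_crashed_node:
        return "partially fixable" if has_dead_attribute else "fully fixable"
    return "cascading partially fixable" if has_dead_attribute else "cascading fully fixable"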
Maximal Fixable Sub-Query Generation • For each source in the minimal impacted sub-plan, we compute • Input set IN • Requested output set RO • Linking set L • Maximal Fixable Sub-Query • Input term set: input attributes of all data sources in the minimal impacted sub-plan without incoming edges (the self-crashed data sources) • Search term set • User-requested search terms that were supposed to be covered by the minimal impacted sub-plan • Terms in the linking set of the nodes in the minimal impacted sub-plan that have outgoing edges to data sources outside of it • Example (from the plan above): IN = {t1}, L = {t3, t4}
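A hedged sketch of assembling the maximal fixable sub-query from a minimal impacted sub-plan M; the data structures are assumptions, and the linking set is simplified to all outputs of a node that feeds a data source outside M.

def maximal_fixable_subquery(M, plan_edges, inputs, outputs, requested_terms):
    # M:               nodes of the minimal impacted sub-plan
    # plan_edges:      node -> set of successor nodes in the full plan
    # inputs, outputs: node -> set of attributes
    # requested_terms: user-requested search terms of the original query
    fed_from_inside = {s for n in M for s in plan_edges.get(n, set()) if s in M}
    roots = M - fed_from_inside                 # self-crashed nodes (no incoming edge in M)
    input_terms = set().union(*(inputs[n] for n in roots))

    search_terms = set()
    for n in M:
        search_terms |= outputs[n] & requested_terms      # requested terms M should cover
        if plan_edges.get(n, set()) - M:                  # edge leaving M
            search_terms |= outputs[n]                    # simplified linking set
    return input_terms, search_terms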
Roadmap • Introduction and Motivation • Problem Formulation in Detail • Our Self-Healing Approach • Evaluation • Conclusion
Evaluation • 12 biological deep web data sources • 20 queries in 4 groups • Each group corresponds to one fixability category • Methods compared • Baseline: start from scratch • Our method
Query Answering Time Comparison • Our method is more efficient at fixing failed query plans than the baseline method • Our method is at least 20% faster for all queries shown
Query Result Quality Comparison • For 18 out of 20 cases, the recall from our method is exactly the same as the ideal recall from the baseline method
Conclusion • Propose a self-healing approach to support fault tolerance for deep web searches • Find the minimal impacted sub-plan caused by unavailable/inaccessible data sources • Find a new plan to replace the minimal impacted sub-plan • Our method outperforms a baseline method in terms of both efficiency and result quality
Questions? Contact us: Fan Wang wangfa@cse.ohio-state.edu Gagan Agrawal agrawal@cse.ohio-state.edu