A Self-Healing Approach for A Domain-Specific Deep Web Search Tool Fan Wang and Gagan Agrawal The Ohio State University Presented by : Tantan Liu
The Deep Web • Definition of “Deep Web” from Wikipedia: The Deep Web refers to World Wide Web content that is not part of the surface web, which is indexed by standard search engines.
Deep Web in the Biological Domain • The deep web is 500 times larger than the surface web • Nearly 800 deep web data sources in the bio-domain • 95 percent of the deep web is publicly accessible
Searching the Deep Web: Solved and Unsolved Issues • Data Source Integration: schema matching and schema mining • Query Planning and Answering: keyword search and structured query answering • Fault Tolerance: data access over wide-area networks; unpredictable data source inaccessibility/unavailability; network contention • Yet users expect an uncompromised search experience
Our Solution: A Redundancy-Based Self-Healing Approach • Identify data redundancy across independent data sources • Find the minimal “have to be replaced” sub-plan caused by data source unavailability/inaccessibility • Find the sub-query corresponding to the “have to be replaced” sub-plan • Generate a new replacement sub-plan from other data sources, based on redundancy
Roadmap • Introduction and Motivation • Problem Formulation in Detail • Our Self-Healing Approach • Evaluation • Conclusion
Data Redundancy Model • A data source is represented by a three-tuple (IN, O, Con) • IN: input attributes • O: output attributes • Con: attribute conditions imposed on the data source • Data redundancy condition between data sources A and B • They have the same input attributes • They have overlapping output attributes • They have non-conflicting attribute conditions
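The three redundancy conditions above can be sketched as a small predicate. This is only an illustration under stated assumptions: the `DataSource` class, the `is_redundant` function, and the choice to model Con as a mapping from attribute name to a set of allowed values are all hypothetical, not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """Hypothetical model of the (IN, O, Con) three-tuple."""
    name: str
    inputs: frozenset                                  # IN
    outputs: frozenset                                 # O
    conditions: dict = field(default_factory=dict)     # Con: attribute -> allowed values (assumed encoding)


def is_redundant(a: DataSource, b: DataSource) -> bool:
    """Check the three redundancy conditions between data sources a and b."""
    same_inputs = a.inputs == b.inputs
    overlapping_outputs = bool(a.outputs & b.outputs)
    # Assumption: two conditions conflict when both sources constrain the
    # same attribute and their allowed value sets do not intersect.
    shared = a.conditions.keys() & b.conditions.keys()
    non_conflicting = all(a.conditions[k] & b.conditions[k] for k in shared)
    return same_inputs and overlapping_outputs and non_conflicting
```

Under this encoding, two sources that both take a gene name and both return protein data are redundant as long as their conditions on any shared attribute (for example an organism filter) admit at least one common value.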
Query and Query Plan • Query • SQL-like format: select t1, t2, …, tn from the deep web where in1=e1 and in2=e2, …, inm=em • t1, …, tn form the search term set ST; in1, …, inm form the input term set INT • Query Plan • A DAG of data source nodes that “covers” the user query • Plan nodes are data sources; edges are data source dependencies • A starting node takes its input attributes from the input terms of the query • Output attributes of a node may be user-requested search terms
Algorithm Overview (1) • Find the part of the query plan that needs to be replaced • Impacted sub-plan: the sub-graph reachable from the unavailable data source nodes • Minimal impacted sub-plan: the impacted sub-plan without the data source nodes that remain usable given the data redundancy
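The impacted sub-plan is a plain reachability computation on the plan DAG. A minimal sketch, assuming the plan is stored as an adjacency mapping from each data source to the sources that depend on it (the function name, the representation, and the example plan are illustrative assumptions):

```python
from collections import deque


def impacted_subplan(plan, unavailable):
    """Return every node reachable from the unavailable data sources.

    plan: dict mapping a node to the list of nodes that depend on it.
    unavailable: iterable of nodes that cannot be reached.
    """
    impacted = set(unavailable)
    queue = deque(impacted)
    while queue:
        node = queue.popleft()
        for successor in plan.get(node, ()):
            if successor not in impacted:
                impacted.add(successor)
                queue.append(successor)
    return impacted


# Hypothetical plan: A feeds B and C, B feeds D, C feeds E, D feeds F.
example_plan = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": ["F"]}
print(sorted(impacted_subplan(example_plan, {"B"})))  # ['B', 'D', 'F']
```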
Algorithm Overview (2) • Find Maximal Fixable Sub-Query • The sub-query corresponding to the minimal impacted sub-plan • Example sub-query: select t3, t4 where input=t1 • New Sub-Plan Generation • Use our existing query planning algorithm
Minimal Impacted Sub-Plan Algorithm 1. Identify unavailable data sources, e.g. {B, I} 2. Find the sub-graph reachable from them (the impacted sub-plan) 3. Check the cascading-crash conditions for each data source X that depends on data source D: A. At least one data source sharing redundant data with D is not crashed B. At least one such data source has the same usage as D
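One possible reading of the three steps above, offered as a sketch rather than the authors' algorithm: a data source X that depends on a crashed source D stays usable when conditions A and B both hold for D, and cascading-crashes otherwise. The slides do not define “same usage”, so the sketch folds both conditions into a hypothetical `substitutes` mapping that lists, for each D, the sources assumed to be redundant with D and to have the same usage as D.

```python
def minimal_impacted_subplan(plan, unavailable, substitutes):
    """Return the self-crashed nodes plus the cascading-crashed nodes.

    plan: dict mapping a node to the nodes that depend on it.
    unavailable: the self-crashed data sources.
    substitutes: dict mapping D to sources assumed redundant with D and
                 with the same usage as D (hypothetical encoding of A and B).
    """
    crashed = set(unavailable)
    changed = True
    while changed:                      # repeat until no new node crashes
        changed = False
        for d in list(crashed):
            # Conditions A and B (as assumed here): some substitute is alive.
            if any(s not in crashed for s in substitutes.get(d, ())):
                continue                # dependents of d remain usable
            for x in plan.get(d, ()):
                if x not in crashed:
                    crashed.add(x)      # x cascading-crashes
                    changed = True
    return crashed
```

With this reading, the result is a subset of the reachable impacted sub-plan: dependents of a crashed source drop out whenever a live substitute for that source exists.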
Minimal Impacted Sub-Plan Fixability • How much of the minimal impacted sub-plan can be fixed using other data sources, taking advantage of data redundancy • Dead Attribute • No un-crashed data source can provide the attribute as its output attribute • Plan Fixability Categorization • Fully fixable: only self-crashed nodes, no dead attribute • Partially fixable: only self-crashed nodes, dead attribute • Cascading fully fixable: cascading-crashed nodes, no dead attribute • Cascading partially fixable: cascading-crashed nodes, dead attribute
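The four categories follow from two yes/no questions: does the sub-plan contain cascading-crashed nodes, and is any needed attribute dead? A minimal sketch, assuming each data source is described only by its set of output attributes and that every node of the minimal impacted sub-plan counts as crashed (function and parameter names are hypothetical):

```python
def dead_attributes(needed, source_outputs, crashed):
    """Attributes in `needed` that no un-crashed data source can output."""
    alive_outputs = set()
    for name, outputs in source_outputs.items():
        if name not in crashed:
            alive_outputs |= set(outputs)
    return set(needed) - alive_outputs


def fixability(subplan, self_crashed, needed, source_outputs):
    """Classify a minimal impacted sub-plan into one of the four categories."""
    cascading = bool(set(subplan) - set(self_crashed))
    has_dead = bool(dead_attributes(needed, source_outputs, subplan))
    prefix = "cascading " if cascading else ""
    return prefix + ("partially fixable" if has_dead else "fully fixable")
```

For example, if only source B crashed and another live source still outputs everything B was needed for, the sub-plan {B} is classified as fully fixable.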
Maximal Fixable Sub-Query Generation • For each source in the minimal impacted sub-plan, we compute • Input set IN • Requested output set RO • Linking set L • Maximal Fixable Sub-Query • Input term set: input attributes of all data sources in the minimal impacted sub-plan without incoming edges (self-crashed data sources) • Search term set • User-requested search terms that are supposed to be covered by the minimal impacted sub-plan • Terms in the linking sets of the nodes in the minimal impacted sub-plan that have outgoing edges to data sources outside the minimal impacted sub-plan • Example: IN={t1}, L={t3,t4}
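The construction above can be sketched as follows. The per-node dictionary with keys `"IN"`, `"RO"`, and `"L"`, the adjacency-mapping plan, and the reading of “without incoming edges” as “no incoming edge from another node of the sub-plan” are assumptions made for illustration.

```python
def maximal_fixable_subquery(subplan, plan, info):
    """Build (input term set, search term set) for the sub-plan.

    subplan: set of nodes in the minimal impacted sub-plan.
    plan: dict mapping a node to the nodes that depend on it.
    info: dict mapping a node to {"IN": set, "RO": set, "L": set}.
    """
    # Nodes that receive an edge from another node of the sub-plan.
    has_incoming = {s for n in subplan for s in plan.get(n, ()) if s in subplan}
    input_terms, search_terms = set(), set()
    for n in subplan:
        if n not in has_incoming:
            input_terms |= info[n]["IN"]        # entry points of the sub-plan
        search_terms |= info[n]["RO"]           # user-requested terms it covered
        if any(s not in subplan for s in plan.get(n, ())):
            search_terms |= info[n]["L"]        # terms needed by outside nodes
    return input_terms, search_terms
```

The resulting pair can then be handed to the query planner as a new query of the form select (search terms) where (input terms), which is how a sub-query such as select t3, t4 where input=t1 would arise.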
Roadmap • Introduction and Motivation • Problem Formulation in Detail • Our Self-Healing Approach • Evaluation • Conclusion
Evaluation • 12 biological deep web data sources • 20 queries in 4 groups • Each group corresponds to one fixability category • Methods compared • Baseline: start from scratch • Our method
Query Answering Time Comparison 1. Our method is more efficient in fixing failed query plans than the baseline method 2. Our method is at least 20% faster for all queries in this figure
Query Result Quality Comparison • For 18 out of 20 cases, the recall from our method is exactly the same as the ideal recall from the baseline method
Conclusion • Propose a self-healing approach to support fault tolerance for deep web searches • Find the minimal impacted sub-plan caused by unavailable/inaccessible data sources • Find a new plan to replace the minimal impacted sub-plan • Our method outperforms a baseline method in terms of both efficiency and result quality
Questions? Contact us: Fan Wang wangfa@cse.ohio-state.edu Gagan Agrawal agrawal@cse.ohio-state.edu