10 likes | 121 Views
Applying Data Mining Techniques for Schema Matching across Biological Deep Web Data Sources. Tantan Liu, Fan Wang, Gagan Agrawal The Ohio State University. Scientific deep web data sources Querying Interface: submitting query Input schema: describing input attributes
E N D
Applying Data Mining Techniques for Schema Matching across Biological Deep Web Data Sources Tantan Liu, Fan Wang, Gagan Agrawal The Ohio State University • Scientific deep web data sources • Querying Interface: submitting query • Input schema: describing input attributes • Output Web page: querying results • output schema: describing output attributes • Inter-dependence • A data source provides input for another data source • Input-output attributes matching across multiple data sources • Exploring inter-dependence between data • sources • Automatically generating query plans given users’ queries • Integrating multiple data sources Motivation Model for Schemas • Input Schema • Input Attributes -- label and instances • Output Schema • Hierarchical model • Siblings: Related attributes in a table or a separate block • Output Attributes -- label, instances, parent and siblings • Similarity Function • Similarity between corresponding properties • Linguistic similarity is utilized to compute the similarity between strings • System Design • Aim: identifying semantic matching between input attributes and output attributes across multiple data sources • Approaches • Discovering instances for input attributes • Help web pages -- querying interfaces and their linked web pages • Output web pages of other data sources • Schema matching via clustering Schema Matching Discovering Instances for Input Attributes • Discover semantic correspondence between attributes • Mapping attributes are grouped together • Hierarchical clustering • Similarity between attributes are calculated • At each step, groups with largest similarity are merged into one group • Input and output attributes in the same group: inter-dependence between their data sources • Bridge effect -- two attributes are similar if they are both similar to a third attribute. • From output web pages • Discovering instances from output web pages • Iteratively borrowing instances from related output attributes • More output attributes and instances provide more instances for input attributes • From help web pages • Potential web pages linked from query interface • Identifying help web pages by anchor text of links, e.g. ‘search hint’ • Locating potential instances by meaningful keywords, e.g. ‘for instance’ • Discovering potentialdomain-specific instances, less frequently used in other domains • Validating potential instances through querying interface • We applied a series of data mining techniques for schema matching. • We show that the instances for deep web data sources can be discovered from the query interfaces themselves. • We also show the instances are obtained from output pages of related data sources • Our approach has been effective on a number of biological data sources. Conclusion • Experiment Results • Data Sources • 11 data sources with 24 query interfaces • Datasources provide SNP, Gene, Protein and related information • Instances discovered from Interface • Accuracy on different subsets • Impact of size of input instance sets • Accuracy of all types of schema matching Tantan Liu liut@cse.ohio-state.edu Fan Wang wangfa@cse.ohio-state.edu Gagan Agrawal agrawal@cse.ohio-state.edu