Differential Analysis on Deep Web Data Sources

Tantan Liu, Fan Wang, Jiedan Zhu, Gagan Agrawal {liut,wangfa,zhujie,agrawal}@cse.ohio-state.edu December 14, 2010

Outline • Introduction • Problem Definition • Differential Analysis and Approaches • Experiment Results • Conclusion

Introduction • Deep web • Query forms vs. backend databases • Similar information from multiple data sources • How do they differ? • Application: guiding users’ search process • Higher-level knowledge summary • Patterns of values with respect to the same entity

Problem Definition • Goal • Difference between multiple data sources in the same domain • Patterns of values of the same entity • Different values for the same data entity • For example: prices of commodities • How different is the data, under what conditions? • Differential Rules • Capturing the difference of values

Differential Analysis and Approaches • Summarizing difference between two data sources • Data queried from the deep web • A relational table • Attributes • Assumption: data sources have same attributes • Identical attributes • Same values for the same data object • Differential attributes • Different values for the same data object • Quantitative attributes • Differences in values of quantitative attributes

Differential Analysis and Approaches-Useful Identifiers • Two data sources, A and B • Identical attributes • Differential attributes • :attribute in data source • Combining the relational tables of A and B • Differential rule where • Profile X: the left-hand side of the rule
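As a rough illustration of combining the relational tables of A and B, the Python sketch below joins two record lists on their identical attributes and keeps each differential attribute once per source. The attribute names (hotel_id, city, star, price), the "_A"/"_B" suffixes, and the sample values are hypothetical and are not taken from the slides.

```python
def combine(table_a, table_b, identical, differential):
    """Join rows of A and B that agree on every identical attribute.

    Each combined row keeps the identical attributes once and stores every
    differential attribute twice, suffixed with the source it came from.
    """
    # Index B by the tuple of identical-attribute values for O(1) lookup.
    index_b = {tuple(row[a] for a in identical): row for row in table_b}
    combined = []
    for row_a in table_a:
        key = tuple(row_a[a] for a in identical)
        row_b = index_b.get(key)
        if row_b is None:
            continue  # the data object does not appear in both sources
        merged = {a: row_a[a] for a in identical}
        for d in differential:
            merged[d + "_A"] = row_a[d]
            merged[d + "_B"] = row_b[d]
        combined.append(merged)
    return combined


# Hypothetical sample data, for illustration only.
table_a = [
    {"hotel_id": 1, "city": "Columbus", "star": 3, "price": 120.0},
    {"hotel_id": 2, "city": "Columbus", "star": 4, "price": 180.0},
]
table_b = [
    {"hotel_id": 1, "city": "Columbus", "star": 3, "price": 110.0},
    {"hotel_id": 3, "city": "Boston", "star": 5, "price": 300.0},
]
rows = combine(table_a, table_b, ["hotel_id", "city", "star"], ["price"])
```

Only hotel 1 appears in both hypothetical tables, so it is the only object for which a price difference can be computed.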

Differential Analysis and Approaches-Differential Rule Mining • Frequent Item Set Mining • Apriori algorithm • A concept hierarchy • Identifying patterns for target attributes • For each frequent itemset X • Decide • Paired Z-test • : difference between two random variables • Hypothesis test vs. • if > , then • if >0, then
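The paired Z-test step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' exact procedure: it assumes a one-sided test of whether the mean paired difference exceeds a threshold, uses the sample standard deviation of the differences, and relies on the normal approximation, which is reasonable only for large samples. The function name, the threshold and alpha parameters, and the sample prices are hypothetical.

```python
import math
from statistics import NormalDist


def paired_z_test(values_a, values_b, threshold=0.0, alpha=0.05):
    """One-sided paired Z-test (assumed formulation).

    H0: mean(A - B) <= threshold   vs.   H1: mean(A - B) > threshold
    Returns (z statistic, p-value, whether H0 is rejected at level alpha).
    """
    if len(values_a) != len(values_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(values_a, values_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    std_err = math.sqrt(variance / n)
    if std_err == 0:
        raise ValueError("differences have zero variance")
    z = (mean - threshold) / std_err
    p_value = 1.0 - NormalDist().cdf(z)  # upper-tail probability
    return z, p_value, p_value < alpha


# Hypothetical prices of the same eight hotels in sources A and B.
prices_a = [120, 180, 95, 210, 150, 130, 175, 160]
prices_b = [110, 170, 98, 195, 140, 128, 160, 150]
z, p_value, a_is_higher = paired_z_test(prices_a, prices_b)
```

In a mining loop, a test like this would be run on the records matching each frequent itemset X. Swapping the two arguments tests the opposite direction.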

Differential Analysis and Approaches-Pruning Rules • Pruning rules • A large number of rules are generated • Essential rules predict unessential rules • Identifying essential rules • Direction of rules

Differential Analysis and Approaches-Ancestors of Rules • Rules R1, R2 are complementary ancestors of rule R • R1: Y->d, R2: Z->d • R: X->d, and • Rule R is predicted by complementary ancestors R1 and R2

Differential Analysis and Approaches-Profile Representation • Identifying essential rules • Rules are processed level by level • For rule R in level k, all the rules from level 1 to k-1 are visited • Computation cost is high • Profile Representation • Uniquely describes the items contained in the profile X of a rule R • For profile , define • would be extremely large when profile X is large • Thus, we modify

Differential Analysis and Approaches-Process of Pruning • A hash table is used to store differential rules • Each level corresponds to a hash table • For each rule R in the k-th level • The ancestor rules from level 1 to k/2 are visited • Identifying complementary rules by profile representation • R is an unessential rule if it is predicted by a pair of complementary ancestor rules • Process the next rule
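The pruning process can be sketched in Python as below. Several points are assumptions rather than the slides' definitions: two ancestors Y->d and Z->d are treated as complementary for X->d when Y and Z are disjoint and together make up X; a frozenset stands in for the numeric profile representation as the hash key; and all mined rules, not only essential ones, may serve as ancestors. The direction labels "A>B" and "A<B" and the sample rules are hypothetical.

```python
def is_predicted(profile, direction, levels):
    """Return True if a pair of complementary ancestors predicts the rule.

    Only ancestors at levels 1..k//2 are scanned; the complement of each
    candidate is then found by a hash lookup in the table for level k - j.
    """
    k = len(profile)
    for j in range(1, k // 2 + 1):
        for y, d_y in levels.get(j, {}).items():
            # Ancestor must be a proper subset with the same direction.
            if d_y != direction or not y < profile:
                continue
            z = profile - y  # assumed complement of Y within X
            if levels.get(k - j, {}).get(z) == direction:
                return True
    return False


def prune(rules):
    """rules maps a profile (frozenset of items) to a direction label."""
    # One hash table per level, where level = number of items in the profile.
    levels = {}
    for profile, direction in rules.items():
        levels.setdefault(len(profile), {})[profile] = direction
    essential = {}
    for k in sorted(levels):
        for profile, direction in levels[k].items():
            if not is_predicted(profile, direction, levels):
                essential[profile] = direction
    return essential


# Hypothetical rules, for illustration only.
rules = {
    frozenset({"city=Columbus"}): "A>B",
    frozenset({"star=4"}): "A>B",
    frozenset({"city=Columbus", "star=4"}): "A>B",
    frozenset({"city=Boston", "star=4"}): "A<B",
}
essential = prune(rules)
```

Under these assumptions the Columbus four-star rule is dropped because its two single-item ancestors already predict it, while the Boston rule is kept because no pair of ancestors with the same direction covers it.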

Experiment Results • Data Set: four of the most popular travel sites. • 120 randomly selected cities all over the world • Attributes • Hotel ID, City, Star, Customer Rating, Cleanness Rating, Price, Service Rating • Concept Hierarchy for attribute: city

Experiment Results - Effectiveness

Experiment Results - Pruning Effectiveness

Experiment Results - Efficiency

Experiment Results - Utility of the Approach

Conclusion • A method to extract a high-level summary of the differences between multiple data sources • Differential rule mining: a new data mining problem • Statistical test for discovering differential rules • A method to prune unessential rules • A hash table is used to speed up the process • Experiments on four travel-related deep web data sources show the approach to be effective and efficient

Questions?
