Explore the use of stratified sampling in data mining on the deep web, focusing on association and differential rule mining. Learn about a tree-based stratification approach that balances estimation accuracy and sampling cost.
Stratified Sampling for Data Mining on the Deep Web • Tantan Liu, Fan Wang, Gagan Agrawal • {liut, wangfa, agrawal}@cse.ohio-state.edu • Dec. 16, 2010
Outline • Introduction • Background Knowledge • Association Rule Mining • Differential Rule Mining • Basic Formulation • Main Technical Approach • A Greedy Stratification Method • Experiment Result • Conclusion
Introduction • Deep Web • Query interface vs. backend database • Input attribute vs. output attribute • Data mining on the deep web • High-level summary of the data • Challenge • Databases cannot be accessed directly • Sampling • Deep web querying is time-consuming • Efficient sampling method
Background Knowledge-Association Rule Mining • Aim: co-occurrence patterns for items • Frequent Itemset: Support of the itemset is larger than a threshold • Rule: $X \Rightarrow Y$ • $X \cup Y$ is a frequent itemset • Confidence is larger than a threshold
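The two thresholds on this slide can be illustrated with a minimal sketch. It assumes transactions are plain sets of item labels; the function names `support` and `confidence` and the toy data are illustrative and do not come from the slides.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               lhs: Iterable[str], rhs: Iterable[str]) -> float:
    """Support of (lhs and rhs together) divided by support of lhs."""
    lhs_items = frozenset(lhs)
    lhs_support = support(transactions, lhs_items)
    if lhs_support == 0:
        return 0.0  # rule never applies in this data
    return support(transactions, lhs_items | frozenset(rhs)) / lhs_support


# Toy data (hypothetical): four transactions over items a, b, c.
transactions = [frozenset(t) for t in
                [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]]

ab_support = support(transactions, {"a", "b"})          # 2 of 4 transactions
a_to_b_conf = confidence(transactions, {"a"}, {"b"})    # 2 of the 3 with "a"
```

A rule would be kept only if both values exceed their respective thresholds.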
Background Knowledge-Differential Rule Mining • Aim: differences between two deep web data sources • E.g. Price of the same hotels on two web sites • Identical attributes vs. Differential attributes • Same vs. different values • Rule: • X: Frequent itemset composed of identical attributes • t: differential or target attribute • D1, D2: data sources
Basic Formulation-Problem Formulation • Two-step sampling procedure • A pilot sample • Randomly drawn from the deep web • Interesting rules are identified • Additional sample • Verify identified rules • Association rules and differential rules • Sampling more data records satisfying X • X only contains input attributes: easy • X contains output attributes • Random sampling: not efficient! • How?
Basic Formulation-Problem Formulation in Detail • Considering rules with • A single output attribute A on the left-hand side • Association Rule • Estimate support or confidence • Differential Rule • Estimate the mean of the target attribute t given A = a • Goal of sampling • High estimation accuracy • Low sampling cost
Basic Formulation-Stratified Sampling • Sampling separately from strata • Heterogeneous across strata & homogeneous within each stratum • Estimating the mean value of $y$: $\hat{\mu} = \sum_i \frac{N_i}{N} \bar{y}_i$ • $N_i$, $\bar{y}_i$: size and sampled mean value of stratum $i$; $N = \sum_i N_i$ • Association Rule Mining • $y$: whether an itemset is contained in a transaction • If an itemset is contained in a transaction, $y = 1$; otherwise $y = 0$ • Differential Rule Mining • $y$: the value of the target attribute
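As a rough illustration of the estimator on this slide, the sketch below computes the standard stratified mean: each stratum's sample mean weighted by that stratum's share of the population. The function name and the numbers are made up for the example; for association rules the sampled values would be 0/1 indicators.

```python
from typing import List, Sequence, Tuple


def stratified_mean(strata: List[Tuple[int, Sequence[float]]]) -> float:
    """Weighted mean over strata.

    Each entry is (stratum_size, sampled_values). The stratum's sample mean
    is weighted by stratum_size / total_size.
    """
    total_size = sum(size for size, _ in strata)
    estimate = 0.0
    for size, samples in strata:
        sample_mean = sum(samples) / len(samples)
        estimate += (size / total_size) * sample_mean
    return estimate


# Hypothetical example: 0/1 indicators of "itemset contained in transaction".
strata = [
    (100, [1, 0, 1, 1]),  # small stratum, itemset common (sample mean 0.75)
    (300, [0, 0, 1, 0]),  # large stratum, itemset rare (sample mean 0.25)
]
estimated_support = stratified_mean(strata)
```

Here the estimate is 0.25 * 0.75 + 0.75 * 0.25, so the large, low-support stratum pulls the overall value down relative to an unweighted average of the eight samples.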
Background-Neyman Allocation • Sample Allocation • Determining the sample size for each stratum • Fixed total sample size • Neyman Allocation • Minimizing the variance of stratified sampling • Problem of application in the Deep Web • The probability of A = a in each stratum is not considered • Possibly large sampling cost • Sampling cost: number of queries submitted to the deep web
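For reference, textbook Neyman allocation assigns each stratum a share of the fixed sample budget proportional to its size times its standard deviation. The sketch below shows that textbook rule only; it returns real-valued allocations (rounding is left out), and the function name and inputs are illustrative rather than taken from the slides.

```python
from typing import List, Sequence


def neyman_allocation(sizes: Sequence[int], stds: Sequence[float],
                      total_samples: int) -> List[float]:
    """Allocate total_samples so that n_i is proportional to N_i * sigma_i."""
    weights = [size * std for size, std in zip(sizes, stds)]
    weight_sum = sum(weights)
    if weight_sum == 0:
        # All strata have zero spread: fall back to proportional allocation.
        total_size = sum(sizes)
        return [total_samples * size / total_size for size in sizes]
    return [total_samples * w / weight_sum for w in weights]


# Hypothetical strata: the smaller stratum is twice as variable.
allocation = neyman_allocation(sizes=[100, 300], stds=[2.0, 1.0],
                               total_samples=50)
```

Note that nothing in this rule looks at how likely a stratum is to yield records with A = a, which is the limitation the slide points out for the deep web setting.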
Sampling Cost • Sampling Cost on the Deep Web • Aim: obtain data records with $A = a$ • Sampling Cost: $n / p$ • $n$: number of data records with $A = a$ • $p$: probability of finding a data record with $A = a$ • Integrated Cost • Combining sampling cost and estimation variance • Two adjustable weights
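A small sketch of how the two quantities on this slide might be computed. Two assumptions are made here that the slide does not spell out: the sampling cost is taken as the expected number of records that must be fetched to see the required number of matches (needed matches divided by match probability), and the integrated cost is taken as a plain weighted sum. The slide only says that two adjustable weights are used, so the exact form, any normalization, and all names below are hypothetical.

```python
def expected_sampling_cost(n_needed: int, p_match: float) -> float:
    """Expected number of fetched records to obtain n_needed records with A = a,
    assuming each fetched record matches independently with probability p_match."""
    if not 0.0 < p_match <= 1.0:
        raise ValueError("p_match must be in (0, 1]")
    return n_needed / p_match


def integrated_cost(variance: float, sampling_cost: float,
                    w_var: float, w_cost: float) -> float:
    """Hypothetical integrated cost: weighted sum of variance and sampling cost."""
    return w_var * variance + w_cost * sampling_cost


# Hypothetical numbers: 20 matching records needed, 1 in 4 records matches.
cost = expected_sampling_cost(n_needed=20, p_match=0.25)
combined = integrated_cost(variance=0.5, sampling_cost=cost,
                           w_var=0.5, w_cost=0.5)
```

In practice the two terms live on very different scales, so some normalization would probably be needed before the weights are meaningful.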
Main Technical Approach – Stratification Process • Stratification by a tree on the query space • Top-down construction • Best split to create child nodes • Input attribute with the smallest integrated cost • The splitting process stops when • The integrated cost at each leaf node is small • Leaf nodes: final strata for sampling
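The top-down process on this slide can be sketched roughly as follows. This is not the authors' algorithm, only a minimal greedy illustration under stated assumptions: pilot records are dicts holding input-attribute values plus a `match` flag (whether A = a) and a value `y`; a node's cost is a weighted sum of the variance of `y` among matches and the inverse match probability; a split is scored by the size-weighted average of its children's costs. All names, the cost form, and the stopping threshold are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List


def node_cost(records: List[dict], w_var: float, w_cost: float) -> float:
    """Hypothetical integrated cost of a node, estimated from pilot records."""
    matched = [r["y"] for r in records if r["match"]]
    if not matched:
        return float("inf")  # no match observed: treat the node as worst case
    p_match = len(matched) / len(records)
    return w_var * pvariance(matched) + w_cost * (1.0 / p_match)


def partition(records: List[dict], attr: str) -> Dict[object, List[dict]]:
    """Group records by their value of one input attribute."""
    groups: Dict[object, List[dict]] = {}
    for r in records:
        groups.setdefault(r[attr], []).append(r)
    return groups


def build_strata(records: List[dict], attrs: List[str], w_var: float,
                 w_cost: float, threshold: float) -> List[List[dict]]:
    """Greedy top-down splitting; returns the leaf nodes (the final strata)."""
    cost = node_cost(records, w_var, w_cost)
    if cost <= threshold or not attrs:
        return [records]

    def split_cost(attr: str) -> float:
        groups = partition(records, attr).values()
        return sum(len(g) / len(records) * node_cost(g, w_var, w_cost)
                   for g in groups)

    best = min(attrs, key=split_cost)
    if split_cost(best) >= cost:
        return [records]  # no split improves on the current node
    remaining = [a for a in attrs if a != best]
    strata: List[List[dict]] = []
    for group in partition(records, best).values():
        strata.extend(build_strata(group, remaining, w_var, w_cost, threshold))
    return strata


# Hypothetical pilot sample with two input attributes.
pilot = [
    {"region": "east", "type": "x", "match": True, "y": 10.0},
    {"region": "east", "type": "y", "match": True, "y": 10.0},
    {"region": "east", "type": "x", "match": False, "y": 0.0},
    {"region": "west", "type": "x", "match": True, "y": 20.0},
    {"region": "west", "type": "y", "match": True, "y": 20.0},
]
strata = build_strata(pilot, ["region", "type"],
                      w_var=0.5, w_cost=0.5, threshold=1.0)
```

On this toy data, splitting on `region` removes all within-node variance, so it is chosen over `type`, and both children fall under the threshold and become leaves.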
Experiment Result • Data Set: US census • The income of US households from the 2008 US Census • 40,000 data records • 7 categorical and 2 numerical attributes • Two Metrics • Variance of Estimation • Sampling Cost
Experiment Result-Settings • Five sampling procedures • Four different weights for variance and sampling cost • Full_Var: • Var7 : • Var5 : • Var3 : • Rand : simple random sampling
Experiment Result – Variance of Estimation • Association Rule Mining • Variance of estimation increases as the weight on variance decreases • Random sampling has a higher variance of estimation
Experiment Result – Sampling Cost • Association Rule Mining • Sampling cost decreases as the weight on variance decreases • Random sampling has a higher sampling cost
Conclusion • Stratified sampling for data mining on the deep web • Considering estimation accuracy and sampling cost • A tree model for the relation between input attributes and output attributes • A greedy stratification to maximally reduce an integrated cost metric • Our experiments show • Higher estimation accuracy and lower sampling cost compared with simple random sampling • Sampling cost can be reduced by trading off a fraction of estimation accuracy