Stratified Sampling for Data Mining on the Deep Web

Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa, agrawal}@cse.ohio-state.edu Dec. 16, 2010

Outline • Introduction • Background Knowledge • Association Rule Mining • Differential Rule Mining • Basic Formulation • Main Technical Approach • A Greedy Stratification Method • Experiment Result • Conclusion

Introduction • Deep Web • Query interface vs. backend database • Input attribute vs. Output attribute • Data mining on the deep web • High level summary of the data • Challenge • Databases cannot be accessed directly • Sampling • Deep web querying is time consuming • Efficient Sampling Method

Background Knowledge-Association Rule Mining • Aim: co-occurrence patterns for items • Frequent Itemset: Support of the itemset is larger than a threshold • Rule: X → Y • X ∪ Y is a frequent itemset • Confidence is larger than a threshold
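As a minimal sketch of the two thresholds on this slide, the snippet below computes support and confidence over a small in-memory list of transactions. The function names and the toy grocery data are illustrative assumptions, not part of the original talk.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """Support of lhs and rhs together, divided by the support of lhs."""
    base = support(lhs, transactions)
    if base == 0:
        return 0.0
    return support(frozenset(lhs) | frozenset(rhs), transactions) / base


# Toy data (hypothetical): each transaction is a set of items.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

s = support({"bread", "milk"}, transactions)       # 2 of 4 transactions
c = confidence({"bread"}, {"milk"}, transactions)  # 2 of the 3 bread transactions
```

A rule is reported only when both values exceed their respective thresholds.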

Background Knowledge-Differential Rule Mining • Aim: differences between two deep web data sources • E.g. Price of the same hotels on two web sites • Identical attributes vs. Differential attributes • Same vs. different values • Rule: • X: Frequent itemset composed of identical attributes • t: differential or target attribute • D1, D2: data sources

Basic Formulation-Problem Formulation • Two-step sampling procedure • A pilot sample • Randomly drawn from the deep web • Interesting rules are identified • Additional sample • Verify identified rules • Association rules and differential rules • Sampling more data records satisfying X • X only contains input attributes – easy • X contains output attributes • Random sampling is not efficient! • How?

Basic Formulation-Problem Formulation in Detail • Considering rules with • A single output attribute on the left-hand side • Association Rule • Estimate or, • Differential Rule • Estimate the mean of the target attribute given A=a • Goal – sampling • High estimation accuracy • Low sampling cost

Basic Formulation-Stratified Sampling • Sampling separately from strata • Heterogeneous across strata & homogeneous within stratum • Estimating mean value of : • : size, and sampled mean value • Association Rule Mining • : whether an itemset is contained in a transaction • If an itemset is contained in a transaction, • Differential Rule Mining • : the value of target attribute
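The slide's own symbols did not survive extraction, so the sketch below uses the textbook stratified estimator instead: each stratum's sample mean is weighted by that stratum's share of the population. The function name and the toy numbers are assumptions made for illustration.

```python
def stratified_mean(strata):
    """Estimate a population mean from per-stratum samples.

    `strata` is a list of (stratum_size, sampled_values) pairs. Each stratum's
    sample mean is weighted by stratum_size / total population size.
    """
    total = sum(size for size, _ in strata)
    return sum(size / total * (sum(vals) / len(vals)) for size, vals in strata)


# Association-rule case: values are 1 if the itemset is contained in the
# sampled transaction and 0 otherwise, so the estimate is a support value.
# For differential rules the values would be the target attribute instead.
strata = [
    (100, [1, 0, 1, 1]),  # stratum of 100 records, sample mean 0.75
    (300, [0, 0, 1, 0]),  # stratum of 300 records, sample mean 0.25
]
estimate = stratified_mean(strata)
```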

Background-Neyman Allocation • Sample Allocation • Determining sample size for each stratum • Fixed sum of sample size • Neyman Allocation • Minimizing variance of the stratified sampling • Problem when applied to the deep web • The probability of A = a in each stratum is not considered • Possible large sampling cost • Sampling cost: number of queries submitted to the deep web
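For reference, the standard Neyman allocation gives each stratum a share of the fixed sample budget proportional to its size times its standard deviation. The sketch below implements that textbook rule. The function name, the proportional fallback for zero spread, and the example numbers are assumptions. Results are left as real numbers, so rounding to whole records is up to the caller.

```python
def neyman_allocation(sizes, stds, total_samples):
    """Split `total_samples` across strata in proportion to size * std."""
    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        # Assumed fallback: with no spread anywhere, allocate by size alone.
        weights = list(sizes)
        total = sum(weights)
    return [total_samples * w / total for w in weights]


# Two strata: the smaller one is noisier, so it gets more than its
# proportional share of the 50 samples.
allocation = neyman_allocation(sizes=[100, 300], stds=[2.0, 1.0], total_samples=50)
```

Note that nothing in this rule accounts for how often A = a occurs in a stratum, which is the limitation the slide points out.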

Sampling Cost • Sampling Cost on the Deep Web • Aim: obtain data records with A = a • Sampling Cost: • : number of data records with A = a • : probability of finding a data record with A = a • Integrated Cost • Combining sampling cost and estimation variance • Two adjustable weights
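The exact cost formulas are missing from the transcript, so the following is only a sketch under stated assumptions. It assumes each query independently yields a matching record with a fixed probability, in which case the expected number of queries needed to collect m matches is m / p. It also assumes the integrated cost is a plain weighted sum of variance and sampling cost. Both function names are hypothetical.

```python
def expected_queries(m, p):
    """Expected number of queries needed to collect m matching records,
    assuming each query independently matches with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return m / p


def integrated_cost(variance, cost, w_var, w_cost):
    """Assumed form: weighted sum of estimation variance and sampling cost."""
    return w_var * variance + w_cost * cost


queries = expected_queries(m=10, p=0.25)  # rare matches make sampling costly
combined = integrated_cost(variance=2.0, cost=queries, w_var=0.5, w_cost=0.5)
```

Raising the weight on variance favours accuracy, while raising the weight on cost favours strata where matching records are easy to find.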

Main Technical Approach – Stratification Process • Stratification by a tree on the query space • A top-down construction manner • Best split to create child nodes • Input attribute with the smallest integrated cost • The splitting process stops • Integrated cost at each leaf node is small • Leaf nodes: final strata for sampling
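The top-down construction above can be sketched roughly as a recursive greedy split over pilot-sample records. This is an illustration rather than the authors' implementation. The cost function is passed in as a parameter because its exact form is not given in the transcript. The helper names, the stopping threshold, and the toy records with their attribute names are all assumptions.

```python
from collections import defaultdict
from statistics import pvariance


def partition(records, attr):
    """Group records by their value of input attribute `attr`."""
    groups = defaultdict(list)
    for r in records:
        groups[r[attr]].append(r)
    return dict(groups)


def best_split(records, attrs, cost_fn):
    """Return (cost, attr, groups) for the attribute whose children have the
    smallest size-weighted cost, or None if no attribute splits the node."""
    best = None
    for attr in attrs:
        groups = partition(records, attr)
        if len(groups) < 2:
            continue
        cost = sum(len(g) / len(records) * cost_fn(g) for g in groups.values())
        if best is None or cost < best[0]:
            best = (cost, attr, groups)
    return best


def build_strata(records, attrs, cost_fn, threshold):
    """Split top-down until each leaf's cost is at most `threshold`.
    The returned leaves are the final strata."""
    if cost_fn(records) <= threshold:
        return [records]
    split = best_split(records, attrs, cost_fn)
    if split is None:
        return [records]
    _, attr, groups = split
    remaining = [a for a in attrs if a != attr]
    strata = []
    for group in groups.values():
        strata.extend(build_strata(group, remaining, cost_fn, threshold))
    return strata


def toy_cost(group):
    """Stand-in for the integrated cost: variance of a target attribute."""
    return pvariance([r["income"] for r in group])


pilot = [
    {"state": "OH", "type": "own", "income": 10},
    {"state": "OH", "type": "rent", "income": 12},
    {"state": "CA", "type": "own", "income": 50},
    {"state": "CA", "type": "rent", "income": 52},
]
strata = build_strata(pilot, ["state", "type"], toy_cost, threshold=2.0)
```

In this toy run, splitting on the hypothetical `state` attribute gives much more homogeneous children than splitting on `type`, so it is chosen first and both children become leaves.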

Experiment Result • Data Set: US census • The income of US households from 2008 US Census • 40,000 data records • 7 categorical and 2 numerical attributes • Two Metrics • Variance of Estimation • Sampling Cost

Experiment Result-Settings • Five sampling procedures • Four different weights for variance and sampling cost • Full_Var: • Var7 : • Var5 : • Var3 : • Rand : simple random sampling

Experiment Result – Variance of Estimation • Association Rule Mining • Variance of estimation increases as the weight on variance decreases • Random sampling has higher variance of estimation

Experiment Result – Sampling Cost • Association Rule Mining • Sampling cost decreases as the weight on variance decreases • Random sampling has higher sampling cost

Conclusion • Stratified sampling for data mining on the deep web • Considering estimation accuracy and sampling cost • A tree model for the relation between input attributes and output attributes • A greedy stratification to maximally reduce an integrated cost metric • Our experiments show • Higher sampling accuracy and lower sampling cost compared with simple random sampling • Sampling cost can be reduced by trading off a fraction of estimation accuracy

Questions & Comments?
