Stratified K-means Clustering Over A Deep Web Data Source Tantan Liu, Gagan Agrawal Dept. of Computer Science & Engineering Ohio State University Aug. 14, 2012
Outline • Introduction • Deep Web • Clustering on the deep web • Stratified K-means Clustering • Stratification • Sample Allocation • Conclusion
Deep Web • Data sources hidden from the Internet • Online query interface vs. Database • Database accessible through online Interface • Input attribute vs. Output attribute • An example of Deep Web
Data Mining over the Deep Web • High level summary of data • Scenario 1: a user wants to relocate to the county. • Summary of the residences of the county? • Age, Price, Square Footage • County property assessor’s web-site only allows simple queries
Challenges • Databases cannot be accessed directly • Sampling method for Deep web mining • Obtaining data is time consuming • Efficient sampling method • High accuracy with low sampling cost
k-means clustering over a deep web data source • Goal: Estimating k centers for the underlying clusters, so that the k centers estimated from the sample are close to the k true centers of the whole population.
Overview of Method Sample Allocation Stratification
Stratification on the deep web • Partitioning the entire population into strata • Stratifies on the query space of input attributes • Goal: Homogeneous query subspaces • Radius of query subspace: • Rule: Choosing the input attribute that most decreases the radius of a node • For an input attribute, decrease of radius:
Sampling Allocation Methods • We have created c*k partitions and c*k subspaces • A pilot sample • c*k-means clustering generates c*k partitions • Representative sampling • Good estimation of the statistics of the c*k subspaces • Centers • Proportions
Representative Sampling-Centers • Center of a subspace • Mean vector of all data points belonging to the subspace • Let sample S={DR1, DR2, …, DRn} • For i-th subspace, center :
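The center definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `subspace_centers`, the `labels` array (which subspace each sampled record falls in), and the use of an unweighted mean are all assumptions made for the example.

```python
import numpy as np

def subspace_centers(records, labels, num_subspaces):
    """Estimate each subspace center as the mean vector of its sampled records.

    records: (n, d) array of sampled data records (hypothetical input).
    labels:  length-n array, labels[r] = index of the subspace of record r.
    Subspaces with no sampled record keep a NaN center.
    """
    records = np.asarray(records, dtype=float)
    labels = np.asarray(labels)
    centers = np.full((num_subspaces, records.shape[1]), np.nan)
    for i in range(num_subspaces):
        members = records[labels == i]
        if len(members) > 0:
            centers[i] = members.mean(axis=0)
    return centers
```

A subspace that the sample never hits has no estimate at all, which is one reason the allocation of samples across strata matters.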
Distance Function • For c*k estimated centers with true centers • Using Euclidean Distance • Integrated variance • Computed based on pilot sample • Depends on the number of samples drawn from the j-th stratum
Optimized Sample Allocation • Goal: • Using Lagrange multipliers: • More samples are drawn from strata with large variance • Their data are spread over a wide area, so more records are needed to represent the population
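As a rough illustration of why large-variance strata receive more samples, the sketch below assumes the objective has the common form sum_j V_j / n_j, minimized subject to a fixed total sample size n. Under that assumption, the Lagrange condition gives n_j proportional to sqrt(V_j). The exact objective used in the slides is not shown here, so the form of the objective, the function name `allocate_samples`, and the per-stratum variance terms `stratum_variances` are assumptions.

```python
import numpy as np

def allocate_samples(stratum_variances, total_samples):
    """Split a sampling budget across strata in proportion to sqrt(variance).

    Assumed objective: minimize sum_j V_j / n_j subject to sum_j n_j = n.
    Setting the derivative of the Lagrangian to zero gives n_j ~ sqrt(V_j).
    Returns real-valued allocations; rounding is left to the caller.
    """
    v = np.asarray(stratum_variances, dtype=float)
    root = np.sqrt(v)
    if root.sum() == 0:
        # No variance information: fall back to an even split.
        return np.full(len(v), total_samples / len(v))
    return total_samples * root / root.sum()
```

With variance terms 1, 4 and 9 and a budget of 60, this allocation gives 10, 20 and 30 samples respectively.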
Active Learning based sampling Method • In machine learning • Passive learning: data are randomly chosen • Active Learning • Certain data are selected, to help build a better model • Obtaining data is costly and/or time-consuming • Choosing stratum i, the estimated decrease of distance function is • Iterative Sampling Process • At each iteration, stratum with largest decrease of distance function is selected for sampling • Integrated variance is updated
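The iterative process can be sketched as a greedy loop. This is a simplified illustration under stated assumptions: the distance function is taken to be sum_j V_j / n_j, so the estimated decrease from one more sample in stratum j is V_j / n_j - V_j / (n_j + 1), and the variance terms are held fixed rather than re-estimated after each draw as the slides describe. The name `greedy_allocation` and its parameters are hypothetical.

```python
import numpy as np

def greedy_allocation(stratum_variances, pilot_counts, budget):
    """Assign `budget` extra samples one at a time to the most useful stratum.

    pilot_counts must be at least 1 for every stratum (from the pilot sample).
    In a full implementation the variance terms would be updated from the
    newly drawn records after each iteration; here they stay fixed.
    """
    v = np.asarray(stratum_variances, dtype=float)
    n = np.asarray(pilot_counts, dtype=float).copy()
    for _ in range(budget):
        # Estimated decrease of the assumed objective per stratum.
        decrease = v / n - v / (n + 1)
        j = int(np.argmax(decrease))
        n[j] += 1
    return n.astype(int)
```

Because the gain from each additional sample shrinks as n_j grows, the loop naturally moves on to other strata once a high-variance stratum is well covered.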
Representative Sampling-Proportion • Proportion of a sub-space: • Fraction of data records belonging to the sub-space • Depends on proportion of the sub-space in each stratum • In j-th stratum, • Risk function • Distance between estimated fractions and their true values • Iterative Sampling Process • At each iteration, stratum with largest decrease of risk function is chosen for sampling • Parameters are updated
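One way to combine per-stratum proportions into an overall proportion is the standard stratified estimator, which weights each stratum by its share of the population. The sketch below assumes stratum population sizes are known; the function name `subspace_proportions` and the `counts` layout are hypothetical and not taken from the slides.

```python
import numpy as np

def subspace_proportions(stratum_sizes, counts):
    """Estimate the overall proportion of each sub-space from a stratified sample.

    stratum_sizes: population size N_j of each stratum (assumed known).
    counts[j][i]:  number of sampled records from stratum j in sub-space i.
    Every stratum must contribute at least one sampled record.
    """
    sizes = np.asarray(stratum_sizes, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Proportion of each sub-space within each stratum's sample.
    per_stratum = counts / counts.sum(axis=1, keepdims=True)
    # Weight each stratum by its share of the whole population.
    weights = sizes / sizes.sum()
    return weights @ per_stratum
```

The estimated proportions sum to 1 across sub-spaces, since each stratum's within-sample proportions do.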
Stratified K-means Clustering • Weight for data records in i-th stratum • Defined in terms of the population size and the sample size of the stratum • Similar to k-means clustering • Center for i-th cluster
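A weighted variant of Lloyd's k-means gives a feel for how the stratum weights enter the center update. The sketch assumes each record's weight is its stratum's population size divided by the number of records sampled from that stratum, which is the usual stratified-sampling weight; the slides' exact formula is not reproduced here. The function name `stratified_kmeans` and its parameters are hypothetical, and `iterations` must be at least 1.

```python
import numpy as np

def stratified_kmeans(records, strata, stratum_sizes, init_centers, iterations=20):
    """Weighted k-means over a stratified sample.

    records:       (n, d) sampled records.
    strata:        length-n integer array, stratum index of each record.
    stratum_sizes: population size of each stratum (assumed known).
    init_centers:  (k, d) starting centers.
    """
    x = np.asarray(records, dtype=float)
    strata = np.asarray(strata)
    sizes = np.asarray(stratum_sizes, dtype=float)
    sample_counts = np.bincount(strata, minlength=len(sizes))
    # Assumed weight: each record stands for N_j / n_j population records.
    weights = sizes[strata] / sample_counts[strata]
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(iterations):
        # Assignment step: nearest center by Euclidean distance.
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: weighted mean of the records in each cluster.
        for c in range(len(centers)):
            mask = assign == c
            if mask.any():
                centers[c] = np.average(x[mask], axis=0, weights=weights[mask])
    return centers, assign
```

Compared with plain k-means, only the update step changes: records from sparsely sampled strata pull the center harder because each one represents more of the population.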
Experiment Result • Data Set: • Yahoo! data set: • Data on used cars • 8,000 data records • Average Distance
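The slide does not spell out how Average Distance is computed. One plausible reading, shown here purely as an assumption, is the mean Euclidean distance from each true center to its nearest estimated center. The function name `avg_dist` and the nearest-center matching rule are hypothetical.

```python
import numpy as np

def avg_dist(estimated_centers, true_centers):
    """Mean Euclidean distance from each true center to its nearest estimate.

    The matching rule (nearest estimated center) is an assumption made for
    this sketch; the metric used in the experiments may differ.
    """
    est = np.asarray(estimated_centers, dtype=float)
    true = np.asarray(true_centers, dtype=float)
    # Pairwise distances: rows are true centers, columns are estimates.
    d = np.linalg.norm(true[:, None, :] - est[None, :, :], axis=2)
    return float(d.min(axis=1).mean())
```

A lower value means the centers estimated from the sample sit closer to the centers computed over the whole population.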
Representative Sampling-Yahoo! Data set • Benefit of Stratification • Compared with rand, AvgDist decreases by 7.2%, 13.2%, 15.0% and 16.8% • Benefit of Representative Sampling • Compared with rand_st, AvgDist decreases by 6.6%, 8.5% and 10.5% • Center based sampling methods have better performance • Optimized sampling method has better performance in the long run
Conclusion • Clustering over a deep web data source is challenging • A Stratified k-means clustering method over the deep web • Representative Sampling • Centers • Proportions • The experiment results show the efficiency of our work