Stratified K-means Clustering Over A Deep Web Data Source

Stratified K-means Clustering Over A Deep Web Data Source Tantan Liu, Gagan Agrawal Dept. of Computer Science & Engineering Ohio State University Aug. 14, 2012

Outline • Introduction • Deep Web • Clustering on the deep web • Stratified K-means Clustering • Stratification • Sample Allocation • Conclusion

Deep Web • Data sources hidden from the Internet • Online query interface vs. Database • Database accessible through online Interface • Input attribute vs. Output attribute • An example of Deep Web

Data Mining over the Deep Web • High level summary of data • Scenario 1: a user wants to relocate to the county. • Summary of the residences of the county? • Age, Price, Square Footage • County property assessor’s web-site only allows simple queries

Challenges • Databases cannot be accessed directly • Sampling method for Deep web mining • Obtaining data is time consuming • Efficient sampling method • High accuracy with low sampling cost

An Example of Deep Web for Real-Estate

k-means clustering over a deep web data source • Goal: Estimating k centers for the underlying clusters, so that the k centers estimated from the sample are close to the k true centers of the whole population.

Overview of Method • Stratification • Sample Allocation

Stratification on the deep web • Partitioning the entire population into strata • Stratifies on the query space of input attributes • Goal: Homogeneous query subspaces • Radius of query subspace: • Rule: Choosing the input attribute that most decreases the radius of a node • For an input attribute, decrease of radius:
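The split rule can be sketched in a few lines. The slide's radius formula did not survive in the transcript, so this sketch assumes a stand-in: the root-mean-square distance of a node's output vectors to their mean. The names `radius` and `best_split_attribute` are hypothetical and the code is an illustration of the rule, not the authors' implementation.

```python
import numpy as np

def radius(outputs):
    """Assumed radius: RMS distance of output vectors to their mean."""
    outputs = np.asarray(outputs, dtype=float)
    if len(outputs) == 0:
        return 0.0
    center = outputs.mean(axis=0)
    return float(np.sqrt(((outputs - center) ** 2).sum(axis=1).mean()))

def best_split_attribute(inputs, outputs):
    """Pick the input attribute whose split most decreases the node radius.

    inputs:  (n, d_in) array of categorical input-attribute values
    outputs: (n, d_out) array of numeric output attributes
    Returns (attribute index, decrease in radius).
    """
    inputs = np.asarray(inputs)
    outputs = np.asarray(outputs, dtype=float)
    parent = radius(outputs)
    best_attr, best_decrease = None, -np.inf
    for a in range(inputs.shape[1]):
        # Size-weighted average radius of the child nodes for attribute a
        child = 0.0
        for value in np.unique(inputs[:, a]):
            mask = inputs[:, a] == value
            child += mask.mean() * radius(outputs[mask])
        decrease = parent - child
        if decrease > best_decrease:
            best_attr, best_decrease = a, decrease
    return best_attr, best_decrease
```

On a deep web source the records would come from a pilot sample, since the full database cannot be scanned.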

Partition on Space of Output Attributes

Sampling Allocation Methods • We have created c*k partitions and c*k subspaces • A pilot sample • c*k-means clustering generates c*k partitions • Representative sampling • Good estimation of the statistics of the c*k subspaces • Centers • Proportions

Representative Sampling-Centers • Center of a subspace • Mean vector of all data points belonging to the subspace • Let sample S={DR1, DR2, …, DRn} • For i-th subspace, center :

Distance Function • Between the c*k estimated centers and the true centers • Using Euclidean distance • Integrated variance • Computed based on pilot sample • Number of samples drawn from the j-th stratum

Optimized Sample Allocation • Goal: • Using Lagrange multipliers: • More samples are drawn from strata with large variance • Data is spread over a wide area, and more data are needed to represent the population
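The slide's objective and its closed-form solution are not preserved in the transcript. As a textbook analogue, classical Neyman allocation minimizes the variance of a stratified mean under a fixed total sample size, and the Lagrange-multiplier solution gives each stratum a share proportional to its size times its standard deviation. The sketch below shows that analogue only; `neyman_allocation` and its parameters are hypothetical names, and the standard deviations would be estimated from the pilot sample.

```python
import numpy as np

def neyman_allocation(strata_sizes, strata_std, total_samples):
    """Textbook Neyman allocation: n_j proportional to N_j * sigma_j.

    Returns fractional sample counts; rounding is left to the caller.
    """
    sizes = np.asarray(strata_sizes, dtype=float)
    std = np.asarray(strata_std, dtype=float)
    score = sizes * std
    if score.sum() == 0:
        # All strata look constant: fall back to proportional allocation
        share = sizes / sizes.sum()
    else:
        share = score / score.sum()
    return share * total_samples
```

With two equally sized strata whose standard deviations are 1 and 3, the second stratum receives three times as many samples, which matches the intuition on the slide.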

Active Learning based sampling Method • In machine learning • Passive learning: data are randomly chosen • Active Learning • Certain data are selected, to help build a better model • Obtaining data is costly and/or time-consuming • Choosing stratum i, the estimated decrease of distance function is • Iterative Sampling Process • At each iteration, stratum with largest decrease of distance function is selected for sampling • Integrated variance is updated
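The iterative process can be sketched as a greedy loop. The slide's expression for the estimated decrease is missing from the transcript, so this sketch assumes a simple stand-in objective, the sum over strata of w_j^2 * var_j / n_j, and scores each stratum by how much one extra sample would reduce it. The function name and parameters are hypothetical.

```python
import numpy as np

def greedy_allocation(weights, variances, pilot_counts, budget):
    """Greedily assign `budget` extra samples, one per iteration.

    weights:      stratum weights (e.g. population share of each stratum)
    variances:    per-stratum variance estimates from the pilot sample
    pilot_counts: samples already drawn per stratum (each must be >= 1)
    """
    w = np.asarray(weights, dtype=float)
    var = np.asarray(variances, dtype=float)
    n = np.asarray(pilot_counts, dtype=float).copy()
    for _ in range(budget):
        # Assumed decrease of sum_j w_j^2 * var_j / n_j from one more sample
        gain = w**2 * var * (1.0 / n - 1.0 / (n + 1.0))
        j = int(np.argmax(gain))
        n[j] += 1
        # With a live data source, stratum j would be queried here and
        # var[j] re-estimated before the next iteration.
    return n.astype(int)
```

Unlike a one-shot allocation, this loop lets the variance estimates be refreshed as new records arrive, which is the point of the update step on the slide.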

Representative Sampling-Proportion • Proportion of a sub-space: • Fraction of data records belonging to the sub-space • Depends on proportion of the sub-space in each stratum • In j-th stratum, • Risk function • Distance between estimated fractions and their true values • Iterative Sampling Process • At each iteration, stratum with largest decrease of risk function is chosen for sampling • Parameters are updated
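One way to combine per-stratum proportions into an overall proportion is the standard stratified estimator: weight each stratum's within-sample fraction by the stratum's share of the population. The slide's own formula is not preserved, so the sketch below is that standard estimator under the assumption that stratum sizes are known; `estimate_proportions` is a hypothetical name.

```python
import numpy as np

def estimate_proportions(strata_sizes, counts):
    """Stratified estimate of each sub-space's overall proportion.

    strata_sizes: population size of each stratum (assumed known)
    counts[j][i]: number of sampled records from stratum j that fall
                  in sub-space i (every stratum needs >= 1 sample)
    """
    sizes = np.asarray(strata_sizes, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Fraction of each stratum's sample that lands in each sub-space
    within = counts / counts.sum(axis=1, keepdims=True)
    # Weight strata by their share of the whole population
    weights = sizes / sizes.sum()
    return weights @ within
```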

Stratified K-means Clustering • Weight for data records in the i-th stratum • Based on the size of the population and the size of the sample • Similar to k-means clustering • Center for i-th cluster
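A minimal sketch of k-means with per-record weights follows. The weight formula on the slide is lost; a common choice, assumed here, is the stratum's population size divided by its sample size, so each sampled record stands in for that many records. Each center is then a weighted mean rather than a plain mean. The function name, the random initialization, and the stopping rule are illustrative choices, not taken from the talk.

```python
import numpy as np

def weighted_kmeans(points, weights, k, n_iter=50, seed=0):
    """Lloyd-style k-means where each record carries a weight.

    points:  (n, d) sampled records (output attributes)
    weights: (n,) positive per-record weights, e.g. N_i / n_i of the
             record's stratum (an assumption, see the text above)
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Start from k distinct sampled records
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every record to its nearest center (squared Euclidean)
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for c in range(k):
            mask = labels == c
            if mask.any():
                # Weighted mean of the records assigned to cluster c
                new_centers[c] = np.average(x[mask], axis=0, weights=w[mask])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

For example, with one-dimensional records 0, 2, 10, 12 and weights 1, 3, 1, 1, the left center is pulled toward 2 because that record represents three times as many population records.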

Experiment Result • Data Set: • Yahoo! data set: • Data on used cars • 8,000 data records • Average Distance

Representative Sampling-Yahoo! Data set • Benefit of Stratification • Compared with rand, the decreases in AvgDist are 7.2%, 13.2%, 15.0% and 16.8% • Benefit of Representative Sampling • Compared with rand_st, the decreases in AvgDist are 6.6%, 8.5%, 10.5% • Center-based sampling methods perform better • The optimized sampling method performs better in the long run

Conclusion • Clustering over a deep web data source is challenging • A stratified k-means clustering method over the deep web • Representative Sampling • Centers • Proportions • The experimental results show the efficiency of our method
