A System for Outlier Detection and Cluster Repair • Ying Liu • Dr. Sprague • Oct 21, 2005
Clustering algorithms can generate bad clusters • hMETIS (k=6)
Clustering algorithms can generate bad clusters • hMETIS (k=20)
Clustering algorithms can generate bad clusters • BIRCH (k=20)
Factors affecting clustering results • Outliers • Inappropriate parameter values • Drawbacks of the clustering algorithms themselves
Factors affecting outlier detection results • Distributions • Boundary between outlier group and microcluster • Nested outliers
Two steps of cluster repair • Step 1: outlier/outlier-group detection for each cluster (separate points that are not supposed to be together) • Step 2: merge density-connected points (merge points that should be together) • Figure: clusters generated by a clustering algorithm, then outlier detection within each cluster, then merging of similar points from different clusters.
Step 1: Cluster Repair • Outlier Detection and Evaluation by Network Flow
Network Flow: Maximum Flow/Minimum Cut • Ford-Fulkerson (1962) • The maximum flow problem is to find a flow f whose total value is maximum. • The total flow can be measured at the sink, or at any cut separating the source from the sink.
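The statement that the flow can be measured at any cut can be written out as follows. This is the standard max-flow/min-cut relation, not something specific to the slides; the symbols c (capacity), S and T (the two sides of a cut, with s in S and t in T), and |f| (the value of the flow) are introduced here for illustration.

```latex
\begin{aligned}
|f| &= \sum_{v} f(s,v) - \sum_{v} f(v,s)
     = \sum_{v} f(v,t) - \sum_{v} f(t,v) \\
    &= \sum_{u \in S,\, v \in T} f(u,v) - \sum_{u \in S,\, v \in T} f(v,u)
     \;\le\; \sum_{u \in S,\, v \in T} c(u,v) = c(S,T)
\end{aligned}
```

Equality holds exactly when f is a maximum flow and (S, T) is a minimum cut, which is why the value of the maximum flow equals the capacity of the minimum cut.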
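Outlier detection: Maximum flow/Minimum cut • Figure: example network with vertices s, a, b, c, d, t; edge labels (flow/capacity): 12/12, 19/19, 28/30, 7/10, 9/9, 7/7, 12/13, 3/3, 10/11 • Augmenting paths: s->a->b->t: 12; s->c->d->t: 3; s->c->b->t: 9; s->a->c->d->b->t: 7 • maximum flow = minimum cut = 12 + 3 + 9 + 7 = 31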
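The example can be checked with a small sketch. The code below uses Edmonds-Karp (the breadth-first variant of Ford-Fulkerson), which is a choice made here and not necessarily the implementation behind the slides. The assignment of capacities to edges is an assumption inferred from the four augmenting paths and the flow/capacity labels, since the figure itself is not available; the function name `max_flow_min_cut` is hypothetical.

```python
from collections import deque, defaultdict

def max_flow_min_cut(capacity, s, t):
    """Edmonds-Karp. Returns (max flow value, source side of a minimum cut)."""
    residual = defaultdict(lambda: defaultdict(int))
    for (u, v), c in capacity.items():
        residual[u][v] += c
        residual[v][u] += 0          # make sure the reverse edge exists
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path left: vertices still reachable from s
            # form the source side of a minimum cut.
            return total, set(parent)
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        total += bottleneck

# Capacities inferred from the slide's labels (an assumption, see above).
capacity = {
    ("s", "a"): 19, ("s", "c"): 13, ("a", "b"): 12,
    ("a", "c"): 10, ("c", "b"): 9,  ("c", "d"): 11,
    ("d", "b"): 7,  ("d", "t"): 3,  ("b", "t"): 30,
}
flow, source_side = max_flow_min_cut(capacity, "s", "t")
print(flow, sorted(source_side))   # 31 ['a', 'c', 'd', 's']
```

Under these assumed capacities the minimum cut separates {s, a, c, d} from {b, t}, and the cut edges a->b, c->b, d->b, d->t have total capacity 12 + 9 + 7 + 3 = 31.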
Outlier detection by network flow • Step 1: compute the k nearest neighbors of each point in a cluster. • Step 2: set up the network for the data of the cluster. • Step 3: pick a random vertex as the source (or sink) s and choose the vertex farthest from it as the sink (or source) t. • Step 4: run the maximum-flow/minimum-cut algorithm from source to sink, take the cut separating s and t, and use the smaller side as the candidate outlier or outlier group. • Step 5: remove the candidate outlier or outlier group from the graph. • Step 6: select the next source and go back to step 3 until the stop criterion is met. • Step 7 (adjusting): coarsen the graph and adjust the maximum flow.
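The loop over source selection, minimum cut, and removal can be sketched as below. This is a minimal illustration under several assumptions: the network is already built as a symmetric capacity map, the stop criterion is a user-supplied number of groups, the "random" source is replaced by the first remaining vertex to keep the run deterministic, and the adjusting (coarsening) step is left out. The names `min_cut_side` and `detect_outlier_groups` are hypothetical.

```python
import math
from collections import deque

def min_cut_side(cap, s, t):
    """Source side of a minimum s-t cut on an undirected capacity map {u: {v: c}}."""
    res = {u: dict(nbrs) for u, nbrs in cap.items()}
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= b
            res[v][u] += b

def detect_outlier_groups(points, cap, n_groups):
    """points: {vertex: coords}; cap: symmetric {u: {v: capacity}}."""
    cap = {u: dict(nbrs) for u, nbrs in cap.items()}
    groups = []
    while len(groups) < n_groups and len(cap) > 2:
        s = next(iter(cap))                      # stand-in for a random vertex
        t = max(cap, key=lambda v: math.dist(points[s], points[v]))
        if t == s:
            break
        side = min_cut_side(cap, s, t)
        other = set(cap) - side
        small = side if len(side) <= len(other) else other
        groups.append(small)
        for u in small:                          # remove the candidate group
            del cap[u]
        for nbrs in cap.values():
            for u in small:
                nbrs.pop(u, None)
    return groups

# Hypothetical toy data: a tight group A-D and one loosely attached point O.
points = {"A": (0, 0), "B": (1, 0), "C": (0, 1), "D": (1, 1), "O": (10, 10)}
core = ["A", "B", "C", "D"]
cap = {v: {} for v in points}
for u in core:
    for v in core:
        if u != v:
            cap[u][v] = 10
cap["D"]["O"] = cap["O"]["D"] = 1
groups = detect_outlier_groups(points, cap, 1)
print(groups)   # [{'O'}]
```

The weak edge D-O is the cheapest cut between A and its farthest vertex O, so the smaller side {O} is reported as the candidate outlier.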
Loosely connected clusters • Figure: clusters No. 20, 1, 19, 2, and 10
Experiment (setting up the network) • Cluster No. 20: 591 points • Network built from 7 nearest neighbors: 591 points, 5028 edges
Setting up the network • Compute the k nearest neighbors of each point and make sure all vertices are connected. • Compute the capacity of each edge from the distance between its two vertices.
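A minimal sketch of this setup is shown below. The slides do not give the capacity formula, so the inverse-distance capacity used here is an assumption (closer points get a stronger connection). The way disconnected components are joined, by linking the closest pair across components, is also an assumption. The function names are hypothetical and the neighbor search is brute force for brevity.

```python
import math

def reachable(cap, start):
    """Vertices reachable from start in the capacity map."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in cap[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def build_knn_network(points, k):
    """points: list of coordinate tuples. Returns symmetric {i: {j: capacity}}."""
    n = len(points)
    cap = {i: {} for i in range(n)}
    if n == 0:
        return cap

    def link(i, j):
        d = math.dist(points[i], points[j])
        cap[i][j] = cap[j][i] = 1.0 / (d + 1e-12)   # assumed capacity formula

    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            link(i, j)

    # Make sure all vertices are connected: join the component of vertex 0
    # to the rest through the closest cross-component pair, and repeat.
    while True:
        comp = reachable(cap, 0)
        if len(comp) == n:
            return cap
        i, j = min(
            ((i, j) for i in comp for j in range(n) if j not in comp),
            key=lambda p: math.dist(points[p[0]], points[p[1]]),
        )
        link(i, j)

# Hypothetical data: two well-separated groups of three points.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
net = build_knn_network(pts, k=2)
print(len(reachable(net, 0)))   # 6
```

With k = 2 the nearest-neighbor edges stay inside each group, so one bridging edge is added; its capacity is much lower than the within-group capacities, which is what lets a minimum cut separate the groups.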
Experiment (adjusting) • 18 vertices, 66 edges
Stop criteria • Users input the number of outliers or outlier groups they want. • Use the maximum flow as the stop condition. • Stop based on a comparison of Dflow with Davg, where Davg = average distance of the remaining data.
Experiment (20 clusters) • Figure: clusters labeled No. 1 to No. 20
Step 2: Cluster Repair • Merge Density Connected Points
Merge density-connected microclusters with flexible DBSCAN parameters • Figure: clusters labeled No. 1 to No. 20
Flexible parameters of DBSCAN • Compute the average distance d of every microcluster from each point's k nearest neighbors • Figure panels: cluster No. 10, cluster No. 19, cluster No. 20
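One way to compute such an average distance is sketched below: for each point, take the distances to its k nearest neighbors within the same microcluster, then average over all of them. Whether the slides average per point first or over all neighbor distances at once is not stated, so this pooling is an assumption, and `average_knn_distance` is a hypothetical name.

```python
import math

def average_knn_distance(points, k):
    """Mean distance from each point to its k nearest neighbors (brute force)."""
    total = 0.0
    count = 0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        total += sum(nearest)
        count += len(nearest)
    return total / count if count else 0.0

# Three points on a line, one unit apart: every nearest neighbor is at distance 1.
print(average_knn_distance([(0, 0), (1, 0), (2, 0)], k=1))   # 1.0
```

A dense microcluster yields a small d and a sparse one a large d, which is what makes the value usable as a per-microcluster Eps.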
DBSCAN with flexible Eps • The original DBSCAN uses the least dense Eps-neighborhood to set a global Eps and sets MinPts = 4. • We use the average distance of each microcluster as its Eps. • When running DBSCAN, points in different microclusters use different Eps values.
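The idea of a per-point Eps can be sketched with a small DBSCAN variant. This is an illustration, not the authors' implementation: it assumes each point carries the Eps of its own microcluster, that a point's neighborhood is measured with that point's own Eps, and that MinPts counts the point itself. The name `dbscan_flexible_eps` and the toy data are hypothetical, and the neighbor search is brute force.

```python
import math

def dbscan_flexible_eps(points, eps_of_point, min_pts=4):
    """DBSCAN where point i uses its own radius eps_of_point[i].
    Returns one label per point: cluster index, or -1 for noise."""
    n = len(points)

    def neighbors(i):
        # Includes i itself, so min_pts counts the point plus its neighbors.
        return [j for j in range(n)
                if math.dist(points[i], points[j]) <= eps_of_point[i]]

    labels = [None] * n              # None means "not visited yet"
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1           # noise for now; may become a border point
            continue
        labels[i] = cluster
        stack = list(nbrs)
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                stack.extend(j_nbrs) # j is a core point: keep expanding
        cluster += 1
    return labels

dense = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
sparse = [(100, 100), (110, 100), (100, 110), (110, 110), (105, 105)]
points = dense + sparse + [(50, 50)]
eps = [1.5] * 5 + [15.0] * 5 + [1.5]
labels = dbscan_flexible_eps(points, eps, min_pts=4)
print(labels)   # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, -1]
```

With a single global Eps of 1.5 the sparse group would be labeled noise; giving it its own larger Eps lets it form a cluster while the isolated point stays noise.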
Kd-tree • Use a kd-tree to find buckets that contain more than two microclusters from different original clusters.
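A rough sketch of the bucket search follows. It assumes a simple kd-tree that splits at the median along alternating axes until a bucket holds at most `leaf_size` points, and it flags a bucket when its points come from at least two different original clusters. The slide's exact threshold ("more than two microclusters") is ambiguous, so this criterion is an assumption, as are the names `kd_buckets` and `mixed_buckets` and the toy data.

```python
def kd_buckets(items, leaf_size, depth=0):
    """items: list of (coords, microcluster_id, original_cluster_id).
    Returns the leaf buckets of a median-split kd-tree."""
    if len(items) <= leaf_size:
        return [items]
    axis = depth % len(items[0][0])
    items = sorted(items, key=lambda it: it[0][axis])
    mid = len(items) // 2
    return (kd_buckets(items[:mid], leaf_size, depth + 1)
            + kd_buckets(items[mid:], leaf_size, depth + 1))

def mixed_buckets(buckets):
    """Buckets whose points come from more than one original cluster."""
    return [b for b in buckets if len({orig for _, _, orig in b}) >= 2]

# Hypothetical points: microclusters m1 and m2 sit side by side
# but were produced from different original clusters (1 and 2).
items = [
    ((0, 0), "m1", 1), ((1, 0), "m2", 2),
    ((10, 0), "m3", 1), ((11, 0), "m3", 1),
]
buckets = kd_buckets(items, leaf_size=2)
candidates = mixed_buckets(buckets)
print([sorted({m for _, m, _ in b}) for b in candidates])   # [['m1', 'm2']]
```

Only the flagged buckets need to be examined for possible merges, which keeps the density check local.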
MinPts = 4 for dim = 2 • Search the rectangle (x+Eps, y+Eps, x-Eps, y-Eps) around point p with an R*-tree. • When Eps equals the average distance between points, the neighborhood of p very likely contains 3 extra points besides p itself.
No. 125 bucket (a) MinPts = 5 (b) MinPts = 5
Other controversial buckets • Bucket No. 119 • Bucket No. 113 • Bucket No. 114 • If x% of the points of a microcluster are merged into another microcluster, then merge the two microclusters. • Since the proportion of merged points of these microclusters in these buckets exceeds 90%, microclusters 24 and 28 are merged.
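The merge rule can be expressed as a small check. The sketch below assumes the per-pair counts of merged points are already available from the DBSCAN step; the function name `microclusters_to_merge`, the input layout, and all the numbers in the example are hypothetical and are not results from the slides.

```python
def microclusters_to_merge(sizes, moved, x=0.9):
    """sizes: {microcluster: number of points}.
    moved: {(a, b): number of points of a that were merged into b}.
    Returns the pairs (a, b) where at least a fraction x of a moved into b."""
    pairs = []
    for (a, b), count in moved.items():
        if sizes[a] > 0 and count / sizes[a] >= x:
            pairs.append((a, b))
    return pairs

# Made-up counts for illustration only.
sizes = {24: 50, 28: 40, 30: 100}
moved = {(24, 28): 47, (30, 28): 10}
print(microclusters_to_merge(sizes, moved, x=0.9))   # [(24, 28)]
```

Here 47 of 50 points (94%) of microcluster 24 moved into 28, which passes the threshold, while 10% of microcluster 30 does not.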
Conclusion • Repair clusters from two aspects: • removing points that are loosely connected to the clusters by outlier/outlier-group detection; • merging points that are density connected by DBSCAN with flexible Eps. • Analyze microclusters of interest. • Found the relationships among outliers, outlier groups, and main clusters.
Questions • MinPts in high-dimensional data • For 3-d, MinPts = 5; for 4-d, MinPts = 6? • For some outlier-group microclusters, MinPts could be very high, because border points include points of neighboring dense microclusters within their Eps. How can each microcluster's MinPts be used as a reference?