A System for Outlier Detection and Cluster Repair

Ying Liu • Dr. Sprague • Oct 21, 2005

A data set

Clustering algorithms can generate bad clusters • hMETIS (k=6)

Clustering algorithms can generate bad clusters • hMETIS (k=20)

BIRCH

Clustering algorithms can generate bad clusters • BIRCH (k=20)

Factors affecting clustering results • Outliers • Inappropriate parameter values • Drawbacks of the clustering algorithms themselves

Factors affecting outlier detection results • Distributions • Boundary between outlier group and microcluster • Nested outliers

Two steps of cluster repair • Step 1: Outlier/outlier group detection for each cluster (separate points which are not supposed to be together) • Step 2: Merge density-connected points (merge points which should be together) [Figure: clusters generated by a clustering algorithm; outlier detection within the different clusters; merging of similar points from different clusters]

Step 1: Cluster Repair Outlier Detection and Evaluation by Network Flow

Network Flow: Maximum Flow/Minimum Cut • Ford-Fulkerson (1962) • The maximum flow problem is to find a flow f for which the total flow is maximum. • The total flow can be measured at the sink, or at any cut separating the source from the sink.
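To make the maximum-flow/minimum-cut relation concrete, here is a minimal sketch in Python. It uses the Edmonds-Karp variant (Ford-Fulkerson with breadth-first augmenting paths), which is a choice made for this illustration and not necessarily what the system uses. The function name, the dictionary-based graph format and the small example network are all assumptions.

```python
from collections import deque

def max_flow_min_cut(capacity, s, t):
    """Edmonds-Karp sketch. capacity: {(u, v): c} for directed edges.
    Returns (max flow value, vertices on the source side of a minimum cut)."""
    residual, adj = {}, {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)          # reverse edge for undoing flow
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    total = 0
    while True:
        # Breadth-first search over edges that still have residual capacity.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path left: the reached vertices form the
            # source side of a minimum cut.
            return total, set(parent)
        # Collect the path from t back to s and push the bottleneck along it.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[e] for e in path)
        for u, v in path:
            residual[(u, v)] -= push
            residual[(v, u)] += push
        total += push

# Hypothetical example network (not the one from the slides).
example = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1,
           ("a", "t"): 2, ("b", "t"): 3}
flow, source_side = max_flow_min_cut(example, "s", "t")
```

In this example both edges leaving s are saturated, so the flow value equals the capacity of the cut that isolates s, which illustrates that the flow can be measured at any cut separating source and sink.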

Outlier detection: Maximum flow/Minimum cut [Figure: example flow network on vertices s, a, b, c, d, t; edges labeled flow/capacity] • s->a->b->t: 12 • s->c->d->t: 3 • s->c->b->t: 9 • s->a->c->d->b->t: 7 • maximum flow = minimum cut = 12+3+9+7 = 31

Outlier detection by network flow
1. Compute the k nearest neighbors of each point in a cluster of data.
2. For the data of a cluster, set up the network.
3. Begin at a random vertex as source (or sink) s, and choose the vertex farthest from it as sink (or source) t.
4. Use the Maximum-Flow/Minimum-Cut algorithm to find the flow from source to sink, get the cut separating s and t, and take the smaller side as the candidate outlier or outlier group.
5. Remove the candidate outlier or outlier group from the graph.
6. Select the next source and go back to step 3 until the stop criterion is met.
7. Adjusting: coarsen the graph and adjust the maximum flow.
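The loop above (random source, farthest vertex as sink, minimum cut, smaller side as candidate) could be sketched as follows. This is only an illustration under several assumptions: the network is undirected with each edge stored once, the stop criterion is a user-supplied number of rounds, the adjusting/coarsening step is left out, and the names `min_cut_side` and `detect_outlier_groups` are hypothetical.

```python
from collections import deque
import math
import random

def min_cut_side(cap, s, t):
    """Source side of a minimum s-t cut in an undirected graph.
    cap: {(u, v): capacity}, each undirected edge stored once (assumption)."""
    res, adj = {}, {}
    for (u, v), c in cap.items():
        res[(u, v)] = res.get((u, v), 0) + c
        res[(v, u)] = res.get((v, u), 0) + c
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and res[(u, v)] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[e] for e in path)
        for u, v in path:
            res[(u, v)] -= push
            res[(v, u)] += push

def detect_outlier_groups(points, cap, rounds, seed=0):
    """points: {vertex: coordinates}; cap: undirected kNN network.
    Runs `rounds` cut-and-remove iterations and returns the candidate groups."""
    rng = random.Random(seed)
    remaining = set(points)
    groups = []
    for _ in range(rounds):
        if len(remaining) < 2:
            break
        s = rng.choice(sorted(remaining))                      # random source
        t = max(remaining - {s},
                key=lambda v: math.dist(points[s], points[v]))  # farthest vertex
        sub = {e: c for e, c in cap.items()
               if e[0] in remaining and e[1] in remaining}
        side = min_cut_side(sub, s, t)
        other = remaining - side
        group = side if len(side) <= len(other) else other     # smaller side
        groups.append(group)
        remaining -= group                                     # remove candidate
    return groups
```

For example, four tightly connected points plus one distant point attached by a single weak edge yield that distant point as the first candidate, whichever vertex is drawn as the source.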

Loosely connected clusters [Figure: clusters labeled 20, 1, 19, 2 and 10]

Experiments (setting up the network) • Cluster No. 20: 591 points • 7 nearest neighbors • Resulting network: 591 points, 5028 edges

Setting up the network • Compute the k nearest neighbors of each point and make sure all vertices are connected. • Compute the capacity of each edge from the distance between its two vertices.
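A small sketch of how such a network might be built is given below. The slides do not state the capacity formula or how connectivity is enforced, so two assumptions are made here: capacity is taken as the inverse of the distance (closer points are tied more strongly), and k is increased until the graph is connected. The function names are hypothetical.

```python
import math

def knn_network(points, k):
    """Undirected kNN graph as {(i, j): capacity} with i < j.
    Capacity = 1 / distance is an assumption, not taken from the slides."""
    edges = {}
    n = len(points)
    for i in range(n):
        by_dist = sorted((math.dist(points[i], points[j]), j)
                         for j in range(n) if j != i)
        for d, j in by_dist[:k]:
            edges[(min(i, j), max(i, j))] = 1.0 / (d + 1e-12)
    return edges

def is_connected(n, edges):
    """Depth-first search check that all n vertices are reachable from vertex 0."""
    if n == 0:
        return True
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n

def connected_knn_network(points, k):
    """Increase k until the kNN graph is connected (one possible strategy).
    Returns (edges, k actually used)."""
    n = len(points)
    if n < 2:
        return {}, k
    while True:
        edges = knn_network(points, k)
        if is_connected(n, edges) or k >= n - 1:
            return edges, k
        k += 1
```

With k = n - 1 the graph is complete, so the loop always terminates.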

Experiment result

Experiment (adjusting) 18 vertices, 66 edges

Stop criteria • Users input the number of outliers or outlier groups they want. • Use the maximum flow as the stop condition. • Stop when Dflow Davg • Davg = average distance of the remaining data

Outlier Degree

Experiment (20 clusters) [Figure: the 20 clusters, labeled 1 to 20]

Step 2: Cluster Repair Merge Density Connected Points

Merge density-connected microclusters using flexible DBSCAN parameters [Figure: the 20 clusters, labeled 1 to 20]

Flexible parameters of DBSCAN • Compute the average distance d of every microcluster from each point's k nearest neighbors [Figures: cluster No. 10, cluster No. 19, cluster No. 20]
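One way to read the average distance d is as the mean, over all points of a microcluster, of the distances to their k nearest neighbors. The sketch below follows that reading and assumes the neighbors are searched inside the same microcluster only; the function name is hypothetical.

```python
import math

def average_knn_distance(points, k):
    """Mean distance from each point to its k nearest neighbors,
    with neighbors restricted to the same microcluster (assumption)."""
    total, count = 0.0, 0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        total += sum(nearest)
        count += len(nearest)
    return total / count if count else 0.0

# Three collinear points spaced one unit apart.
d = average_knn_distance([(0, 0), (1, 0), (2, 0)], k=1)
```

A dense microcluster gets a small d and a sparse one gets a large d, which is what lets d serve as a per-microcluster Eps.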

DBSCAN

DBSCAN with flexible Eps • The original DBSCAN uses the least dense Eps-neighborhood to set a global Eps and sets MinPts = 4. • We use the average distance of each microcluster as its Eps. • When running DBSCAN, points in different microclusters use different Eps values.
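The idea of a per-microcluster Eps can be illustrated with a brute-force DBSCAN in which each point queries its neighborhood with the Eps of its own microcluster. This is a sketch under that assumption, not the system's implementation; the names `dbscan_flexible_eps`, `micro` and `eps_of` are hypothetical, and a real system would replace the linear scan with a spatial index.

```python
import math

def dbscan_flexible_eps(points, micro, eps_of, min_pts=4):
    """points: list of coordinate tuples; micro[i]: microcluster id of point i;
    eps_of[m]: Eps used by points of microcluster m.
    Returns one label per point, with -1 meaning noise."""
    def neighbors(i):
        eps = eps_of[micro[i]]           # radius depends on the query point
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)             # includes point i itself
        if len(seeds) < min_pts:
            labels[i] = -1               # provisional noise, may become border
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # noise point turns into a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:     # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels

# A dense and a sparse 1-d microcluster; MinPts = 3 for this toy example.
pts = [(0,), (1,), (2,), (3,), (10,), (13,), (16,), (19,)]
micro = [0, 0, 0, 0, 1, 1, 1, 1]
flexible = dbscan_flexible_eps(pts, micro, {0: 1.0, 1: 3.0}, min_pts=3)
single = dbscan_flexible_eps(pts, micro, {0: 1.0, 1: 1.0}, min_pts=3)
```

With one global Eps tuned to the dense group, the sparse group is reported as noise; with its own larger Eps it forms a cluster.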

Kd tree • Use a kd tree to find buckets containing more than two microclusters that come from different original clusters.
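The bucket search can be sketched with a simple median-split kd tree whose leaves are the buckets. The bucket size, the input format and the exact test for a "controversial" bucket are assumptions here: the sketch flags a bucket when its microclusters originate from at least two different original clusters, and the threshold can be adjusted to match the criterion above.

```python
def kd_buckets(items, leaf_size, depth=0):
    """items: list of (point, microcluster_id), leaf_size >= 1.
    Splits at the median on alternating axes until every bucket
    holds at most leaf_size items; returns the list of buckets."""
    if len(items) <= leaf_size:
        return [items]
    axis = depth % len(items[0][0])
    ordered = sorted(items, key=lambda it: it[0][axis])
    mid = len(ordered) // 2
    return (kd_buckets(ordered[:mid], leaf_size, depth + 1)
            + kd_buckets(ordered[mid:], leaf_size, depth + 1))

def mixed_buckets(buckets, cluster_of, min_origins=2):
    """Indices of buckets whose microclusters come from at least
    min_origins different original clusters.
    cluster_of[m]: original cluster of microcluster m (hypothetical mapping)."""
    flagged = []
    for idx, bucket in enumerate(buckets):
        origins = {cluster_of[m] for _, m in bucket}
        if len(origins) >= min_origins:
            flagged.append(idx)
    return flagged
```

Only the flagged buckets need to be passed on to DBSCAN, which keeps the merge step local to regions where different clusters actually meet.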

No. 125 bucket

MinPts = 4 for dim = 2 [Figure: point p with its Eps-neighborhood] • Search the rectangle (x+Eps, y+Eps, x-Eps, y-Eps) with an R*-tree. • When Eps equals the average distance between points, the neighborhood of p very likely contains 3 other points besides p itself.

No. 125 bucket (a) MinPts = 5 (b) MinPts = 5

Other controversial buckets [Figures: bucket No. 119, bucket No. 113, bucket No. 114] • If x% of the points of a microcluster are merged into another microcluster, then merge these two microclusters. • Since the proportion of merged points of these microclusters in these buckets exceeds 90%, 24 and 28 microclusters are merged.
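The x% rule can be expressed as a small check on the DBSCAN output. The sketch below is one possible interpretation: a point of microcluster a counts as "merged into" microcluster b when DBSCAN places it in the cluster that holds most of b's points. The function names and the default threshold are assumptions.

```python
from collections import Counter

def merge_fraction(micro, labels, a, b):
    """Fraction of microcluster a's points that DBSCAN put into the
    cluster containing most of microcluster b's points.
    micro[i]: microcluster id of point i; labels[i]: DBSCAN label (-1 = noise)."""
    b_labels = Counter(l for m, l in zip(micro, labels) if m == b and l != -1)
    if not b_labels:
        return 0.0
    target = b_labels.most_common(1)[0][0]
    a_labels = [l for m, l in zip(micro, labels) if m == a]
    if not a_labels:
        return 0.0
    return sum(l == target for l in a_labels) / len(a_labels)

def should_merge(micro, labels, a, b, x_percent=90.0):
    """Merge rule: at least x_percent of a's points ended up with b."""
    return 100.0 * merge_fraction(micro, labels, a, b) >= x_percent
```

The rule is asymmetric, so a small microcluster can be absorbed by a large one even when only a small share of the large one's points lies near the boundary.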

Repair of clusters No. 20, 19 and 10

After repair 20 clusters

Conclusion • Repair clusters from two aspects: • removing points which are loosely connected to the clusters by outlier/outlier group detection; • merging points which are density-connected by DBSCAN with flexible Eps. • Analyze microclusters of interest. • Found the relationship among outliers, outlier groups and main clusters.

Questions • MinPts in high-dimensional data • For 3-d, MinPts=5; 4-d, MinPts=6? • For some outlier group microclusters, MinPts could be very high, because their border points include points of neighboring dense microclusters within their Eps. How can each microcluster's MinPts be used as a reference?
