
A System for Outlier Detection and Cluster Repair

Ying Liu, Dr. Sprague. Oct 21, 2005.


Presentation Transcript


  1. A System for Outlier Detection and Cluster Repair • Ying Liu • Dr. Sprague • Oct 21, 2005

  2. A data set

  3. Clustering algorithms could generate bad clusters • hMETIS (k=6)

  4. Clustering algorithms could generate bad clusters • hMETIS (k=20)

  5. BIRCH

  6. BIRCH

  7. Clustering algorithms could generate bad clusters • BIRCH (k=20)

  8. Factors affecting clustering results • Outliers • Inappropriate parameter values • Drawbacks of the clustering algorithms themselves

  9. Factors affecting outlier detection results • Distributions • Boundary between outlier group and microcluster • Nested outliers

  10. Two steps of cluster repair • Outlier/outlier group detection for each cluster: separate points which are not supposed to be together • Merge density connected points: merge points which should be together • Diagram: clusters generated by a clustering algorithm -> outlier detection of different clusters -> merge similar points from different clusters

  11. Step 1: Cluster Repair • Outlier Detection and Evaluation by Network Flow

  12. Network Flow: Maximum Flow/Minimum Cut • Ford-Fulkerson (1962) • The maximum flow problem is to find a flow f for which the total flow is maximum. The total flow can be measured at the sink, or it can be measured at any cut separating the source from the sink.

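  13. Outlier detection: Maximum flow/Minimum cut • Example network on vertices s, a, b, c, d, t, with edges labeled flow/capacity: 12/12, 19/19, 28/30, 7/10, 9/9, 7/7, 12/13, 3/3, 10/11 • Augmenting paths: s->a->b->t: 12; s->c->d->t: 3; s->c->b->t: 9; s->a->c->d->b->t: 7 • maximum-flow = minimum-cut = 12+3+9+7 = 31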
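
The slide 13 example can be checked with a short max-flow routine. The sketch below uses Edmonds-Karp (breadth-first augmenting paths), which is one standard way to run the Ford-Fulkerson method; the slides do not say which variant was used. The assignment of capacities to edges is an inference from the four path flows and the flow/capacity labels in the figure, not something the transcript states, and `max_flow_min_cut` is a hypothetical helper name.

```python
from collections import deque

def max_flow_min_cut(capacity, s, t):
    """Edmonds-Karp. Returns (max flow value, source side of a minimum cut)."""
    # Residual capacities, with zero-capacity reverse edges added.
    residual = {}
    for (u, v), c in capacity.items():
        residual.setdefault(u, {}).setdefault(v, 0)
        residual.setdefault(v, {}).setdefault(u, 0)
        residual[u][v] += c

    def bfs():
        # Shortest augmenting path search; returns the BFS parent map.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, r in residual[u].items():
                if r > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs()
        if t not in parent:
            # No augmenting path left: the vertices still reachable from s
            # form the source side of a minimum cut.
            return flow, set(parent)
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

# Assumed edge capacities, inferred from the figure labels on slide 13.
capacity = {
    ("s", "a"): 19, ("s", "c"): 13, ("a", "b"): 12, ("a", "c"): 10,
    ("c", "b"): 9, ("c", "d"): 11, ("d", "b"): 7, ("d", "t"): 3, ("b", "t"): 30,
}
flow, source_side = max_flow_min_cut(capacity, "s", "t")
cut_capacity = sum(c for (u, v), c in capacity.items()
                   if u in source_side and v not in source_side)
print(flow, cut_capacity, sorted(source_side))  # 31 31 ['a', 'c', 'd', 's']
```

Under these assumed capacities the cut edges are a->b, c->b, d->b and d->t, whose capacities 12 + 9 + 7 + 3 also sum to 31, matching the slide's total.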

  14. Outlier detection by network flow • (1) Compute the k nearest neighbors of each point in a cluster of data. • (2) For the data of a cluster, set up the network. • (3) Begin at a random vertex as the source/sink s, and choose its farthest vertex as the sink/source t. • (4) Use the Maximum-Flow/Minimum-Cut algorithm to find the flow from source to sink, get the cut separating s and t, and use the smaller side as the candidate outlier or outlier group. • (5) Remove the candidate outlier or outlier group from the graph. • (6) Select the next source and go back to step 3 until the stop criterion is met. • (7) Adjusting: coarsen the graph and adjust the maximum flow.
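
The loop on slide 14 can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the network is passed in already built as a dictionary of undirected edge capacities, the stop criterion is simply a fixed number of rounds, the coarsening/adjusting step is left out, and `min_cut_side` and `detect_outlier_groups` are hypothetical names.

```python
import math
import random
from collections import deque

def min_cut_side(vertices, cap, s, t):
    """Source side of a minimum s-t cut in an undirected graph (Edmonds-Karp)."""
    res = {v: {} for v in vertices}
    for (u, v), c in cap.items():
        if u in res and v in res:          # ignore edges to removed vertices
            res[u][v] = res[u].get(v, 0) + c
            res[v][u] = res[v].get(u, 0) + c
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, r in res[u].items():
                if r > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)             # vertices still reachable from s
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push

def detect_outlier_groups(points, cap, rounds, seed=0):
    """Pick a source, use its farthest vertex as sink, cut, peel off the smaller side."""
    rng = random.Random(seed)
    remaining = set(range(len(points)))
    groups = []
    for _ in range(rounds):                # assumed stop criterion: fixed round count
        if len(remaining) < 2:
            break
        s = rng.choice(sorted(remaining))
        t = max(remaining - {s}, key=lambda v: math.dist(points[s], points[v]))
        side = min_cut_side(remaining, cap, s, t)
        other = remaining - side
        small = side if len(side) <= len(other) else other
        groups.append(small)               # candidate outlier or outlier group
        remaining -= small
    return groups

# Made-up data: five tightly linked points plus two points attached by one weak edge.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0.5), (10, 0), (10, 1)]
cap = {(i, j): 10 for i in range(5) for j in range(i + 1, 5)}
cap[(5, 6)] = 10
cap[(4, 5)] = 1
groups = detect_outlier_groups(points, cap, rounds=1)
print([sorted(g) for g in groups])  # [[5, 6]]
```

Whichever vertex is drawn as the source here, the cheapest cut is the single weak edge, so the two distant points come out as the candidate outlier group.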

  15. Loosely connected clusters (figure labels: clusters 20, 1, 19, 2, and 10)

  16. Experiments (setting up the network) • The No. 20 cluster: 591 points • Setting up the network with 7 nearest neighbors: 591 points, 5028 edges

  17. Setting up the network • Compute the k nearest neighbors and make sure all vertices are connected. • Compute the capacity between two vertices from their distance.
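
A minimal sketch of the network setup on slide 17 follows. The slides do not give the capacity formula or say how disconnected vertices are joined, so two assumptions are made here and marked in the code: capacity is taken as the inverse of the Euclidean distance (so long edges are cheap to cut), and disconnected components are bridged by their shortest connecting edge. `build_knn_network` is a hypothetical name, and the neighbor search is brute force for brevity.

```python
import math

def build_knn_network(points, k):
    """Return {(i, j): capacity} for a connected, undirected kNN graph (i < j)."""
    n = len(points)
    edges = {}
    for i in range(n):
        # Brute-force neighbour search: O(n^2) overall, fine for a sketch.
        by_dist = sorted((math.dist(points[i], points[j]), j)
                         for j in range(n) if j != i)
        for d, j in by_dist[:k]:
            edges[(min(i, j), max(i, j))] = d

    # Union-find to track connected components.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    # Assumption: bridge components with the shortest edge between them.
    while True:
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                if find(i) != find(j):
                    d = math.dist(points[i], points[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
        if best is None:
            break                      # everything is connected
        d, i, j = best
        edges[(i, j)] = d
        parent[find(i)] = find(j)

    # Assumption: capacity = 1 / distance.
    return {e: 1.0 / max(d, 1e-12) for e, d in edges.items()}

net = build_knn_network([(0, 0), (0, 1), (10, 0), (10, 1)], k=1)
print(sorted(net))  # [(0, 1), (0, 2), (2, 3)]
```

With k = 1 the two pairs are initially disconnected, so one bridging edge (0, 2) is added; its capacity of about 0.1 is much lower than the 1.0 of the short edges.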

  18. Experiment result

  19. Experiment (adjusting) • 18 vertices, 66 edges

  20. Stop criteria • Users input the number of outliers or outlier groups they want. • Use the maximum flow as the stop condition. • Stop when Dflow Davg • Davg = average distance of the remaining data

  21. Outlier Degree

  22. Experiment (20 clusters) (figure: clusters labeled 1 to 20)

  23. Step 2: Cluster Repair • Merge Density Connected Points

  24. Merge density connected microclusters by flexible parameters of DBSCAN (figure: clusters labeled 1 to 20)

  25. Flexible parameters of DBSCAN • Get the average distance d of every microcluster from each point's k nearest neighbors. • Figure panels: No. 10 cluster, No. 19 cluster, No. 20 cluster
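
The per-microcluster average distance d on slide 25 could be computed along these lines. This is a sketch that assumes Euclidean distance and averages over every point's k nearest neighbors inside the same microcluster; `average_knn_distance` is a hypothetical name.

```python
import math

def average_knn_distance(points, k):
    """Mean distance from each point to its k nearest neighbours in one microcluster."""
    total, count = 0.0, 0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        total += sum(nearest)
        count += len(nearest)
    return total / count if count else 0.0

dense = [(0, 0), (1, 0), (2, 0)]      # made-up microcluster with spacing 1
sparse = [(0, 0), (5, 0), (10, 0)]    # made-up microcluster with spacing 5
print(average_knn_distance(dense, k=1), average_knn_distance(sparse, k=1))  # 1.0 5.0
```

A sparse microcluster gets a proportionally larger d, which is what lets it use a larger Eps than a dense one.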

  26. DBSCAN

  27. DBSCAN

  28. DBSCAN with flexible Eps • The original DBSCAN uses the least dense ε-neighborhood as a global Eps and sets MinPts=4. • We use the average distance of every microcluster as its Eps. • When running DBSCAN, points in different microclusters use different Eps.
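
One possible reading of "points in different microclusters use different Eps" is that each point runs its neighborhood query with the Eps of its own microcluster. The sketch below implements that reading with a brute-force neighbor search; it is an assumption about the method, not a confirmed description, and `dbscan_flexible_eps` is a hypothetical name. The data and Eps values are made up.

```python
import math

def dbscan_flexible_eps(points, micro, eps_of, min_pts=4):
    """DBSCAN where point i searches within eps_of[micro[i]]. Label -1 means noise."""
    n = len(points)

    def neighbours(i):
        eps = eps_of[micro[i]]
        # The neighbourhood includes the point itself, as in the original DBSCAN.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1             # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:     # j is a core point under its own Eps
                queue.extend(nb)
    return labels

# Dense microcluster "A" next to a sparse microcluster "B", plus one far-away point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (4, 0), (6, 0), (4, 2), (20, 20)]
micro = ["A"] * 4 + ["B"] * 5
flexible = dbscan_flexible_eps(points, micro, {"A": 1.5, "B": 2.5})
global_eps = dbscan_flexible_eps(points, micro, {"A": 1.5, "B": 1.5})
print(flexible)    # [0, 0, 0, 0, 0, 0, 0, 0, -1]
print(global_eps)  # [0, 0, 0, 0, 0, -1, -1, -1, -1]
```

With a single global Eps of 1.5 most of the sparse microcluster is labeled noise, while the per-microcluster Eps lets it be density-connected to the dense one.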

  29. Kd tree • Use a kd tree to find buckets with more than two microclusters from different original cluster results.
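
The bucket search on slide 29 might look like the sketch below. It assumes a simple median-split kd tree whose leaves are the buckets, and it flags a bucket when it holds at least `min_micro` microclusters that trace back to at least two different original clusters. The slide's wording ("more than two") leaves the exact threshold unclear, so `min_micro` is a parameter; the default of 2 and the names `kd_buckets` and `mixed_buckets` are assumptions.

```python
def kd_buckets(indices, points, bucket_size, depth=0):
    """Split point indices into kd-tree leaves (buckets) of at most bucket_size points."""
    if len(indices) <= bucket_size:
        return [indices]
    axis = depth % len(points[0])              # cycle through the dimensions
    ordered = sorted(indices, key=lambda i: points[i][axis])
    mid = len(ordered) // 2                    # median split
    return (kd_buckets(ordered[:mid], points, bucket_size, depth + 1)
            + kd_buckets(ordered[mid:], points, bucket_size, depth + 1))

def mixed_buckets(points, micro, origin, bucket_size, min_micro=2):
    """Buckets whose microclusters come from more than one original cluster."""
    found = []
    for bucket in kd_buckets(list(range(len(points))), points, bucket_size):
        micros = {micro[i] for i in bucket}
        origins = {origin[m] for m in micros}
        if len(micros) >= min_micro and len(origins) >= 2:
            found.append((bucket, sorted(micros)))
    return found

# Made-up data: microclusters m1 and m2 sit side by side but come from clusters c1 and c2.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0), (12, 0), (13, 0)]
micro = ["m1", "m1", "m2", "m2", "m3", "m3", "m3", "m3"]
origin = {"m1": "c1", "m2": "c2", "m3": "c3"}
print(mixed_buckets(points, micro, origin, bucket_size=4))
# [([0, 1, 2, 3], ['m1', 'm2'])]
```

Only the flagged buckets then need the more expensive density-connectivity check.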

  30. No. 125 bucket

  31. MinPts = 4 for dim = 2 • Search the rectangle (x+Eps, y+Eps, x-Eps, y-Eps) around point p with an R* tree. • When Eps = avg_dist between points, the rectangle around p very likely includes 3 extra points besides p itself.

  32. No. 125 bucket (a) MinPts = 5 (b) MinPts = 5

  33. Other controversial buckets • Figure panels: No. 119 bucket, No. 113 bucket, No. 114 bucket • If x% of the points of a microcluster are merged into another microcluster, then merge these two microclusters. • Since the proportion of points of these microclusters in these buckets that are merged exceeds 90%, microclusters 24 and 28 are merged.
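
The x% merge rule on slide 33 reduces to a fraction test per pair of microclusters. The sketch below assumes that, after DBSCAN, each point carries the id of the microcluster it ended up with (`assigned`); that bookkeeping and the name `merge_candidates` are assumptions. The microcluster ids 24 and 28 and the 90% threshold come from the slide, while the point counts are made up.

```python
from collections import Counter

def merge_candidates(micro, assigned, threshold=0.9):
    """Return (a, b, fraction) where at least `threshold` of a's points ended up in b."""
    size = Counter(micro)
    moved = Counter((a, b) for a, b in zip(micro, assigned) if a != b)
    return [(a, b, n / size[a]) for (a, b), n in sorted(moved.items())
            if n / size[a] >= threshold]

# Made-up counts: 19 of the 20 points of microcluster 24 were absorbed by microcluster 28.
micro = [24] * 20 + [28] * 20
assigned = [28] * 19 + [24] + [28] * 20
print(merge_candidates(micro, assigned))  # [(24, 28, 0.95)]
```

Raising `threshold` makes merging more conservative; a pair reported here would be merged as a whole rather than point by point.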

  34. No. 20, 19 and 10 cluster repair

  35. After repair 20 clusters

  36. Conclusion • Repair clusters from two aspects: • Remove points which are loosely connected to the clusters by outlier/outlier group detection; • Merge points which are density connected by DBSCAN with flexible Eps. • Analyze microclusters of interest. • Found the relationship among outliers, outlier groups, and main clusters.

  37. Questions • MinPts in high dimensional data • For 3-d, MinPts=5; 4-d, MinPts=6? • For some outlier group microclusters, MinPts could be very high, because border points include points of neighboring dense microclusters within their Eps. How can each microcluster's MinPts be used as a reference?
