Stable Clustering Vladyslav Kolbasin
Clustering data • Clustering is part of the exploratory process • Standard definition: • Clustering is grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups • There is no "true" solution for clustering • We don't have any "true Y values" • Usually we want to do some data exploration or simplification, or even find a taxonomy of the data • Usually we don't have a precise mathematical definition of the clustering task • Usually we iterate through different methods, each optimizing a different mathematical target function, then pick the one that works best
Usual clustering methods • Methods: • K-means • Hierarchical clustering • Spectral clustering • DBSCAN • BIRCH • … • Issues: • Need to estimate the number of clusters • Non-determinism • Instability
Audience data • A lot of attributes: 9000-30000-... • All attributes are binary • There are several data providers • No single attribute is especially important on its own
Stability importance • Data comes from different providers and is very noisy • Results should not change from run to run • An audience usually doesn't change much over a short period • Many algorithms "explode" when we increase the data size: • Non-linear complexity of clustering • The optimal number of clusters grows for bigger data
Stable clustering. Requirements • Let's add an additional requirement to clustering: • the clustering result should be a structure on the data set that is "stable" • So there should be similar results when: • We change a small portion of the data • We apply clustering to several datasets drawn from the same underlying model • We apply clustering to several subsets of the initial dataset • We don't want to process gigabytes or terabytes just to get several stable clusters that are independent of the randomness in sampling • One simple way to quantify this stability is sketched below
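One way to make the "similar results on subsets" requirement measurable is to cluster two overlapping subsamples and compare the labels on the shared points, e.g. with the adjusted Rand index. A minimal sketch, assuming a placeholder dataset and k-means as the base method (neither is fixed by the slides):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                 # placeholder dataset

# Draw two overlapping subsamples and cluster each one independently.
idx_a = rng.choice(len(X), size=600, replace=False)
idx_b = rng.choice(len(X), size=600, replace=False)
common = np.intersect1d(idx_a, idx_b)           # points present in both samples

labels_a = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X[idx_a])
labels_b = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(X[idx_b])

# Map labels back to the shared points and compare the two clusterings.
pos_a = {i: p for p, i in enumerate(idx_a)}
pos_b = {i: p for p, i in enumerate(idx_b)}
ari = adjusted_rand_score([labels_a[pos_a[i]] for i in common],
                          [labels_b[pos_b[i]] for i in common])
print(f"stability (ARI on shared points): {ari:.2f}")   # ~1.0 means stable
```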
Stable clustering. Requirements • Natural restrictions: • We don't want too many clusters • We don't want too small or too big clusters • Too small clusters are usually useless for further processing • Too big clusters do not bring significantly new information • Some points can be noise points, so let's try to find only the significant tendencies • It would be a big benefit if we could easily scale the results: • To be able to look at the inner structure of a selected cluster without a full rerun • Any additional instruments for manual analysis of the clustering are welcome
Stable clustering ideas • Do not use the whole dataset; use many small subsamples instead • Use several samplings to mine as much information from the data as possible • Average all the clusterings over the samples to get a stable result
Stable clustering algorithm • Select N samples of the whole dataset • Do clustering for each sample • So for each sample we have a set of clusters (possibly very different across samples) • Do some preprocessing of the clusters • Associate clusters from different samples with each other: • Build a relationship structure, the clusters graph • Set a relationship measure, i.e. a distance measure • Do clustering on the relationship structure: • Do communities search • A skeleton of the whole pipeline is sketched below
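A compact sketch of the two-stage pipeline as listed above, not the author's actual implementation: step 3 (preprocessing) is skipped, ward-linkage hierarchical clustering stands in for the base method, and `k_per_sample`, `edge_threshold` and the greedy-modularity community search are illustrative choices.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def stable_clustering(X, n_samples=20, sample_size=500,
                      k_per_sample=8, edge_threshold=1.0, seed=0):
    """Two-stage sketch: (1) cluster many random subsamples,
    (2) cluster the per-sample clusters via a graph of centroids."""
    rng = np.random.default_rng(seed)
    centroids = []
    for _ in range(n_samples):                          # 1. select N samples
        idx = rng.choice(len(X), size=min(sample_size, len(X)),
                         replace=False)
        Z = linkage(X[idx], method="ward")              # 2. cluster a sample
        labels = fcluster(Z, t=k_per_sample, criterion="maxclust")
        for c in np.unique(labels):                     # one centroid per cluster
            centroids.append(X[idx][labels == c].mean(axis=0))
    centroids = np.asarray(centroids)
    D = squareform(pdist(centroids))                    # 4. clusters graph:
    G = nx.Graph()                                      #    edges only between
    G.add_nodes_from(range(len(centroids)))             #    nearby clusters
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if D[i, j] < edge_threshold:
                G.add_edge(i, j)
    return list(greedy_modularity_communities(G))       # 5. communities search
```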
2. Sample clustering • Any clustering method works: • K-means • Hierarchical clustering • … • It is convenient to use hierarchical clustering: • It is a rather fast clustering method • We can estimate the number of clusters using the natural restrictions, instead of special criteria like we usually need for k-means • We can drill into the internal structure without any additional calculations (see the sketch below)
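The last point is the key convenience: the linkage matrix is computed once and can then be cut at any depth for free. A short SciPy sketch on placeholder data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))            # placeholder sample

Z = linkage(X, method="ward")             # computed once per sample

# Cutting the same dendrogram at different levels costs almost nothing,
# so we can zoom into a cluster's inner structure without re-clustering.
coarse = fcluster(Z, t=5,  criterion="maxclust")    # 5 top-level clusters
fine   = fcluster(Z, t=20, criterion="maxclust")    # finer view, same tree
print(np.unique(coarse).size, np.unique(fine).size)
```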
2.1. Dendrogram clustering • Recursive splitting of large clusters • With the natural restrictions: • Set the max possible cluster size (in %) • Set the min cluster size (in %); any smaller cluster is noise • Max number of splits • … • A sketch of such recursive splitting follows
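A minimal sketch of recursive dendrogram splitting under the restrictions above; the fraction thresholds and depth limit are illustrative defaults, not values from the talk:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def split_with_restrictions(X, max_frac=0.5, min_frac=0.02, max_depth=4):
    """Recursively split clusters that exceed max_frac of the dataset;
    clusters below min_frac are treated as noise (label -1)."""
    n_total = len(X)
    labels = -np.ones(n_total, dtype=int)
    next_label = [0]

    def recurse(idx, depth):
        if len(idx) < min_frac * n_total:
            return                              # too small: leave as noise
        if len(idx) <= max_frac * n_total or depth >= max_depth:
            labels[idx] = next_label[0]         # acceptable size: keep as-is
            next_label[0] += 1
            return
        Z = linkage(X[idx], method="ward")
        halves = fcluster(Z, t=2, criterion="maxclust")   # split in two
        for part in (1, 2):
            recurse(idx[halves == part], depth + 1)

    recurse(np.arange(n_total), 0)
    return labels
```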
3. Do clusters' preprocessing • Reduce noise points: • Cluster smoothing (a simple variant is sketched below) • Make clusters more convenient for associating: • A cluster can be similar to several other clusters (1-to-many) • If we split it, it can turn into several 1-to-1 relations • And some other heuristics...
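The slides don't specify the smoothing heuristic, so here is one plausible, clearly hypothetical variant: trim the points furthest from the cluster centroid before computing the center used for association.

```python
import numpy as np

def trim_cluster(points, keep_frac=0.9):
    """Cluster smoothing sketch (illustrative heuristic, not the author's):
    keep the keep_frac of points closest to the centroid, drop the rest
    as noise so the centroid becomes more robust."""
    center = points.mean(axis=0)
    dist = np.linalg.norm(points - center, axis=1)
    cutoff = np.quantile(dist, keep_frac)
    return points[dist <= cutoff]
```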
4. Associate clusters from different samples with each other • How similar are the clusters to each other? • Set a relationship measure: • Simplest measure: the distance between the clusters' centers (sketched below) • But we can use any suitable measure
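The simplest measure from the slide, pairwise centroid distances, in a few lines; the toy clusters here are stand-ins for the per-sample clustering results:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
# Toy stand-ins for clusters found on two different samples.
clusters_a = [rng.normal(loc=m, size=(50, 5)) for m in (0.0, 3.0)]
clusters_b = [rng.normal(loc=m, size=(60, 5)) for m in (0.1, 2.9, 6.0)]

# Simplest relationship measure: distance between cluster centers
# (any other similarity measure can be swapped in here).
centers_a = np.array([c.mean(axis=0) for c in clusters_a])
centers_b = np.array([c.mean(axis=0) for c in clusters_b])
D = cdist(centers_a, centers_b)          # D[i, j] = dist(a_i, b_j)
print(np.round(D, 2))
```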
4. Associate clusters from different samples with each other • Clusters relationship structure: the clusters graph • But we are not interested in edges between very different clusters • So we need some threshold: • It can be estimated manually, then hard-coded • It can be estimated automatically (one simple rule is sketched below)
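A sketch of building the thresholded clusters graph; using a low quantile of the pairwise distances as the automatic threshold is an illustrative rule, not necessarily the one used in the talk:

```python
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

def cluster_graph(centers, threshold=None, quantile=0.1):
    """Nodes are per-sample clusters; edges connect clusters closer than
    a threshold. Without an explicit threshold, estimate it as a distance
    quantile (illustrative automatic rule)."""
    D = squareform(pdist(centers))
    if threshold is None:
        threshold = np.quantile(D[np.triu_indices_from(D, k=1)], quantile)
    G = nx.Graph()
    G.add_nodes_from(range(len(centers)))
    for i, j in zip(*np.triu_indices_from(D, k=1)):
        if D[i, j] < threshold:
            # Closer clusters get heavier edges for the community search.
            G.add_edge(int(i), int(j), weight=float(1.0 / (D[i, j] + 1e-9)))
    return G
```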
5. Communities search in networks • Methods (igraph): • walktrap.community • edge.betweenness.community • fastgreedy.community • spinglass.community • … • It is possible that some clusters will not fall into any community; we then mark these clusters as a special type of community
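The slide lists R igraph functions; python-igraph exposes the same algorithms as graph methods. A sketch on a toy graph (a stand-in for the clusters graph), with isolated nodes marked as the special "no community" type:

```python
import igraph as ig

# Toy graph; in the real pipeline nodes are the per-sample clusters.
g = ig.Graph.Famous("Zachary")

# python-igraph equivalent of walktrap.community; the other listed
# methods are community_edge_betweenness, community_fastgreedy,
# community_spinglass.
communities = g.community_walktrap().as_clustering()

# Clusters that ended up isolated belong to no real community:
# mark them as a special community type.
special = [v.index for v in g.vs if g.degree(v.index) == 0]
print(len(communities), "communities;", len(special), "special nodes")
```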
5.1 Community structure detection based on edge betweenness • edge.betweenness.community() implements the Girvan–Newman algorithm • Betweenness: the number of geodesics (shortest paths) going through an edge • Algorithm: • Calculate the edge betweenness for all edges • Remove the edge with the highest betweenness • Recalculate betweenness • Repeat until all edges are removed, or the modularity function is optimized (depending on the variation)
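The same algorithm is available in NetworkX as a generator over successive splits; a sketch of the "stop at best modularity" variation, on a toy graph:

```python
import itertools
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

G = nx.karate_club_graph()                 # toy stand-in for the clusters graph

# girvan_newman() yields one partition after each edge-removal round;
# keep the partition with the best modularity (one common stopping rule).
best = max(
    itertools.islice(girvan_newman(G), 10),    # inspect the first 10 splits
    key=lambda parts: modularity(G, parts),
)
print([sorted(c) for c in best])
```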
Summary • Issues in clustering algorithms • Why stability is important for business questions • A 2-stage clustering algorithm: • 1st stage: apply simple clustering on samples • 2nd stage: do clustering on the clusters graph • Real data clustering example • The algorithm can be easily parallelized: • Most time is spent on the 2nd step (clustering each sample), and the samples are processed independently, as sketched below
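Since each subsample is clustered independently, the expensive per-sample step parallelizes trivially. A minimal sketch with the standard library; the sample sizes and cluster count are placeholders:

```python
import numpy as np
from multiprocessing import Pool
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_one_sample(args):
    """Cluster a single subsample; runs independently in each worker."""
    X_sample, k = args
    Z = linkage(X_sample, method="ward")
    return fcluster(Z, t=k, criterion="maxclust")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 20))            # placeholder dataset
    samples = [(X[rng.choice(len(X), 500, replace=False)], 8)
               for _ in range(16)]
    with Pool() as pool:                       # per-sample clustering in parallel
        label_sets = pool.map(cluster_one_sample, samples)
    print(len(label_sets), "sample clusterings done")
```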