Stable Clustering Vladyslav Kolbasin
Clustering data • Clustering is part of the exploratory process • Standard definition: • Clustering is grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups • There is no "true" solution for clustering • We don't have any "true Y values" • Usually we want to do some data exploration or simplification, or even find a taxonomy of the data • Usually we don't have a precise mathematical definition of the clustering task • Usually we iterate through different methods, each optimizing a different mathematical target function, then pick the one that works best
Usual clustering methods • Methods: • K-means • Hierarchical clustering • Spectral clustering • DBSCAN • BIRCH • … • Issues: • Need to estimate the number of clusters • Non-determinism • Instability
Audience data • A lot of attributes: 9000-30000-... • All attributes are binary • There are several data providers • No single attribute is especially important on its own
Stability importance • Data comes from different providers and is very noisy • Results should not change from run to run • An audience usually doesn't change much over a short period • Many algorithms "explode" when we increase the data size: • Non-linear complexity of clustering • The optimal number of clusters grows for bigger data
Stable clustering. Requirements • Let's add an additional requirement to clustering: • the clustering result should be a structure on the data set that is "stable" • So there should be similar results when: • We change a small portion of the data • We apply clustering to several datasets drawn from the same underlying model • We apply clustering to several subsets of the initial dataset • We don't want to process gigabytes or terabytes just to get several stable clusters that are independent of the randomness in sampling • One simple way to quantify this stability is sketched below
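One way to make the "similar results on subsets" requirement measurable is to cluster two overlapping subsamples and compare the labels on the shared points, e.g. with the adjusted Rand index. A minimal sketch, assuming a placeholder dataset and k-means as the base method (neither is fixed by the slides):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                 # placeholder dataset

# Draw two overlapping subsamples and cluster each one independently.
idx_a = rng.choice(len(X), size=600, replace=False)
idx_b = rng.choice(len(X), size=600, replace=False)
common = np.intersect1d(idx_a, idx_b)           # points present in both samples

labels_a = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X[idx_a])
labels_b = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(X[idx_b])

# Map labels back to the shared points and compare the two clusterings.
pos_a = {i: p for p, i in enumerate(idx_a)}
pos_b = {i: p for p, i in enumerate(idx_b)}
ari = adjusted_rand_score([labels_a[pos_a[i]] for i in common],
                          [labels_b[pos_b[i]] for i in common])
print(f"stability (ARI on shared points): {ari:.2f}")   # ~1.0 means stable
```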
Stable clustering. Requirements • Natural restrictions: • We don't want too many clusters • We don't want too small or too big clusters • Too small clusters are usually useless for further processing • Too big clusters do not bring significantly new information • Some points can be noise points, so let's try to find only the significant tendencies • It would be a big benefit if we could easily scale the results: • To be able to look at the inner structure of a selected cluster without a full rerun • Any additional instruments for manual analysis of the clustering are welcome
Stable clustering ideas • Do not use the whole dataset; use many small subsamples instead • Use several samplings to mine as much information from the data as possible • Average all the clusterings over the samples to get a stable result
Stable clustering algorithm • Select N samples of the whole dataset • Do clustering for each sample • So for each sample we have a set of clusters (possibly very different across samples) • Do some preprocessing of the clusters • Associate clusters from different samples with each other: • Build a relationship structure, the clusters graph • Set a relationship measure, i.e. a distance measure • Do clustering on the relationship structure: • Do communities search • A skeleton of the whole pipeline is sketched below
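A compact sketch of the two-stage pipeline as listed above, not the author's actual implementation: step 3 (preprocessing) is skipped, ward-linkage hierarchical clustering stands in for the base method, and `k_per_sample`, `edge_threshold` and the greedy-modularity community search are illustrative choices.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def stable_clustering(X, n_samples=20, sample_size=500,
                      k_per_sample=8, edge_threshold=1.0, seed=0):
    """Two-stage sketch: (1) cluster many random subsamples,
    (2) cluster the per-sample clusters via a graph of centroids."""
    rng = np.random.default_rng(seed)
    centroids = []
    for _ in range(n_samples):                          # 1. select N samples
        idx = rng.choice(len(X), size=min(sample_size, len(X)),
                         replace=False)
        Z = linkage(X[idx], method="ward")              # 2. cluster a sample
        labels = fcluster(Z, t=k_per_sample, criterion="maxclust")
        for c in np.unique(labels):                     # one centroid per cluster
            centroids.append(X[idx][labels == c].mean(axis=0))
    centroids = np.asarray(centroids)
    D = squareform(pdist(centroids))                    # 4. clusters graph:
    G = nx.Graph()                                      #    edges only between
    G.add_nodes_from(range(len(centroids)))             #    nearby clusters
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if D[i, j] < edge_threshold:
                G.add_edge(i, j)
    return list(greedy_modularity_communities(G))       # 5. communities search
```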
2. Sample clustering • Any clustering method works: • K-means • Hierarchical clustering • … • It is convenient to use hierarchical clustering: • It is a rather fast clustering method • We can estimate the number of clusters using the natural restrictions, instead of special criteria like we usually need for k-means • We can drill into the internal structure without any additional calculations (see the sketch below)
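The last point is the key convenience: the linkage matrix is computed once and can then be cut at any depth for free. A short SciPy sketch on placeholder data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))            # placeholder sample

Z = linkage(X, method="ward")             # computed once per sample

# Cutting the same dendrogram at different levels costs almost nothing,
# so we can zoom into a cluster's inner structure without re-clustering.
coarse = fcluster(Z, t=5,  criterion="maxclust")    # 5 top-level clusters
fine   = fcluster(Z, t=20, criterion="maxclust")    # finer view, same tree
print(np.unique(coarse).size, np.unique(fine).size)
```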
2.1. Dendrogram clustering • Recursive splitting of large clusters • With the natural restrictions: • Set the max possible cluster size (in %) • Set the min cluster size (in %); any smaller cluster is noise • Max number of splits • … • A sketch of such recursive splitting follows
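A minimal sketch of recursive dendrogram splitting under the restrictions above; the fraction thresholds and depth limit are illustrative defaults, not values from the talk:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def split_with_restrictions(X, max_frac=0.5, min_frac=0.02, max_depth=4):
    """Recursively split clusters that exceed max_frac of the dataset;
    clusters below min_frac are treated as noise (label -1)."""
    n_total = len(X)
    labels = -np.ones(n_total, dtype=int)
    next_label = [0]

    def recurse(idx, depth):
        if len(idx) < min_frac * n_total:
            return                              # too small: leave as noise
        if len(idx) <= max_frac * n_total or depth >= max_depth:
            labels[idx] = next_label[0]         # acceptable size: keep as-is
            next_label[0] += 1
            return
        Z = linkage(X[idx], method="ward")
        halves = fcluster(Z, t=2, criterion="maxclust")   # split in two
        for part in (1, 2):
            recurse(idx[halves == part], depth + 1)

    recurse(np.arange(n_total), 0)
    return labels
```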
3. Do clusters' preprocessing • Reduce noise points: • Cluster smoothing (a simple variant is sketched below) • Make clusters more convenient for associating: • A cluster can be similar to several other clusters (1-to-many) • If we split it, it can turn into several 1-to-1 relations • And some other heuristics...
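The slides don't specify the smoothing heuristic, so here is one plausible, clearly hypothetical variant: trim the points furthest from the cluster centroid before computing the center used for association.

```python
import numpy as np

def trim_cluster(points, keep_frac=0.9):
    """Cluster smoothing sketch (illustrative heuristic, not the author's):
    keep the keep_frac of points closest to the centroid, drop the rest
    as noise so the centroid becomes more robust."""
    center = points.mean(axis=0)
    dist = np.linalg.norm(points - center, axis=1)
    cutoff = np.quantile(dist, keep_frac)
    return points[dist <= cutoff]
```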
4. Associate clusters from different samples with each other • How similar are the clusters to each other? • Set a relationship measure: • Simplest measure: the distance between the clusters' centers (sketched below) • But we can use any suitable measure
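The simplest measure from the slide, pairwise centroid distances, in a few lines; the toy clusters here are stand-ins for the per-sample clustering results:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
# Toy stand-ins for clusters found on two different samples.
clusters_a = [rng.normal(loc=m, size=(50, 5)) for m in (0.0, 3.0)]
clusters_b = [rng.normal(loc=m, size=(60, 5)) for m in (0.1, 2.9, 6.0)]

# Simplest relationship measure: distance between cluster centers
# (any other similarity measure can be swapped in here).
centers_a = np.array([c.mean(axis=0) for c in clusters_a])
centers_b = np.array([c.mean(axis=0) for c in clusters_b])
D = cdist(centers_a, centers_b)          # D[i, j] = dist(a_i, b_j)
print(np.round(D, 2))
```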
4. Associate clusters from different samples with each other • Clusters relationship structure: the clusters graph • But we are not interested in edges between very different clusters • So we need some threshold: • It can be estimated manually, then hard-coded • It can be estimated automatically (one simple rule is sketched below)
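A sketch of building the thresholded clusters graph; using a low quantile of the pairwise distances as the automatic threshold is an illustrative rule, not necessarily the one used in the talk:

```python
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

def cluster_graph(centers, threshold=None, quantile=0.1):
    """Nodes are per-sample clusters; edges connect clusters closer than
    a threshold. Without an explicit threshold, estimate it as a distance
    quantile (illustrative automatic rule)."""
    D = squareform(pdist(centers))
    if threshold is None:
        threshold = np.quantile(D[np.triu_indices_from(D, k=1)], quantile)
    G = nx.Graph()
    G.add_nodes_from(range(len(centers)))
    for i, j in zip(*np.triu_indices_from(D, k=1)):
        if D[i, j] < threshold:
            # Closer clusters get heavier edges for the community search.
            G.add_edge(int(i), int(j), weight=float(1.0 / (D[i, j] + 1e-9)))
    return G
```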
5. Communities search in networks • Methods (igraph): • walktrap.community • edge.betweenness.community • fastgreedy.community • spinglass.community • … • It is possible that some clusters will not fall into any community; we then mark these clusters as a special type of community
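The slide lists R igraph functions; python-igraph exposes the same algorithms as graph methods. A sketch on a toy graph (a stand-in for the clusters graph), with isolated nodes marked as the special "no community" type:

```python
import igraph as ig

# Toy graph; in the real pipeline nodes are the per-sample clusters.
g = ig.Graph.Famous("Zachary")

# python-igraph equivalent of walktrap.community; the other listed
# methods are community_edge_betweenness, community_fastgreedy,
# community_spinglass.
communities = g.community_walktrap().as_clustering()

# Clusters that ended up isolated belong to no real community:
# mark them as a special community type.
special = [v.index for v in g.vs if g.degree(v.index) == 0]
print(len(communities), "communities;", len(special), "special nodes")
```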
5.1 Community structure detection based on edge betweenness • edge.betweenness.community() implements the Girvan–Newman algorithm • Betweenness: the number of geodesics (shortest paths) going through an edge • Algorithm: • Calculate the edge betweenness for all edges • Remove the edge with the highest betweenness • Recalculate betweenness • Repeat until all edges are removed, or the modularity function is optimized (depending on the variation)
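The same algorithm is available in NetworkX as a generator over successive splits; a sketch of the "stop at best modularity" variation, on a toy graph:

```python
import itertools
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

G = nx.karate_club_graph()                 # toy stand-in for the clusters graph

# girvan_newman() yields one partition after each edge-removal round;
# keep the partition with the best modularity (one common stopping rule).
best = max(
    itertools.islice(girvan_newman(G), 10),    # inspect the first 10 splits
    key=lambda parts: modularity(G, parts),
)
print([sorted(c) for c in best])
```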
Summary • Issues in clustering algorithms • Why stability is important for business questions • A 2-stage clustering algorithm: • 1st stage: apply simple clustering on samples • 2nd stage: do clustering on the clusters graph • Real data clustering example • The algorithm can be easily parallelized: • Most time is spent on the 2nd step (clustering each sample), and the samples are processed independently, as sketched below
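Since each subsample is clustered independently, the expensive per-sample step parallelizes trivially. A minimal sketch with the standard library; the sample sizes and cluster count are placeholders:

```python
import numpy as np
from multiprocessing import Pool
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_one_sample(args):
    """Cluster a single subsample; runs independently in each worker."""
    X_sample, k = args
    Z = linkage(X_sample, method="ward")
    return fcluster(Z, t=k, criterion="maxclust")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 20))            # placeholder dataset
    samples = [(X[rng.choice(len(X), 500, replace=False)], 8)
               for _ in range(16)]
    with Pool() as pool:                       # per-sample clustering in parallel
        label_sets = pool.map(cluster_one_sample, samples)
    print(len(label_sets), "sample clusterings done")
```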