Applying Electromagnetic Field Theory Concepts to Clustering with Constraints Huseyin Hakkoymaz, Georgios Chatzimilioudis, Dimitrios Gunopulos and Heikki Mannila
Motivation • Well-known problem, the curse of dimensionality: • As the number of dimensions increases, distance metrics start losing their usefulness • Relative distances • Unlike exact distances, relative distance-based metrics have some immunity to the curse • Example: shortest paths computed over the edges of a graph • Local distance adjustments • In many domains, local changes affect the whole system • Cancer cells in a body, sensor depletion in a network, etc. • The same idea is valid for distance metrics • Relative distances supported by pairwise constraints perform much better • Constraints cause changes in local distances • Figure: (a) Change of a unit shape as dimensionality increases (b) The distance matrix becomes useless if the number of dimensions keeps increasing
Motivation (2) • Best environment to realize these objectives: a graph • The graph is considered as an electromagnetic field (EMF) • Pairwise constraints are expressed naturally • Constraints act as EMF sources exerting force over edges • The force causes reduction or escalation of edge weights • No limit on the amount of reduction/escalation, thanks to the graph domain • Cartesian space metrics are bounded by the triangle inequality • Figure label: Negative Constraint
Related Work • Distance metric learning [Xing et al.]: • Global linear transformation of data points • Different weights for each dimension • Shortcomings: • May fail in some cases • Plain Euclidean distance may work better • Integrating constraints and metric learning in semi-supervised clustering [Bilenko et al.]: • Local weights for each cluster • Readjustment of weights at each iteration • Combines constraints and metric learning in the objective function • Shortcomings: • Sometimes fails to adjust weights locally • No guarantee of better accuracy with more constraints • Figure panels: K-Means; K-Means + distance metric; K-Means (w1x = w1y, w2x = w2y); MPCK-Means (w1x > w1y, w2x < w2y)
Related Work • Semi-supervised Graph Clustering: A Kernel Approach [Kulis et al.]: • Mapping of data points into a new feature space • Similarity between Kernel-KMeans and graph clustering objectives • Works for both vector and graph data • Shortcomings: • Optimal kernel required for good results • Time to compute the optimal kernel is high • Relies mostly on the min-cut objective, not distance • Figure panels: Correct Clustering; SS-Kernel-KMeans Approach
Magnetically Affected Paths (MAP) • Two special edges for constraints: • Positive edge: must-link constraint • Negative edge: cannot-link constraint • Definitions: • Reduction ratio: amount of decrement in edge weight (+) • Escalation ratio: amount of increment in edge weight (-) • Figure labels: Positive Edges; Negative Edge
Magnetically Affected Paths (MAP) • Each constraint edge affects regular edges based on: • Constraint type • Vertical distance (vd): distance to the constraint axis • Horizontal distance (hd): distance to the midpoint of the constraint axis • Vertical and horizontal effects follow a probabilistic model • If vd increases, the effect decreases for both (+) and (-) constraints • If hd increases, the effect decreases for (-) constraints • hd has no effect on (+) constraints • Figure: effect plotted against vd and hd; vd(u,v) and hd(u,v) of an edge e(u,v) measured against the constraint axis s-t and its midpoint
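The two distances above can be illustrated with a small geometric sketch. The slides do not say how vd and hd are measured for an edge, so this assumes that an edge e(u,v) is represented by its midpoint, that vd is the perpendicular distance to the line through s and t, and that hd is measured along that line from the midpoint of s-t. The function names are hypothetical.

```python
import math

def constraint_distances(p, s, t):
    """Return (vd, hd) of point p relative to the constraint axis s-t.

    Assumption: vd is the perpendicular distance to the line through s and t,
    hd is the distance along that line from the midpoint of s-t.
    """
    axis = [b - a for a, b in zip(s, t)]
    length = math.sqrt(sum(c * c for c in axis))
    if length == 0:
        raise ValueError("constraint endpoints must differ")
    unit = [c / length for c in axis]
    mid = [(a + b) / 2 for a, b in zip(s, t)]
    rel = [x - m for x, m in zip(p, mid)]
    along = sum(r * u for r, u in zip(rel, unit))      # signed offset on the axis
    perp_sq = sum(r * r for r in rel) - along * along  # Pythagoras
    return math.sqrt(max(perp_sq, 0.0)), abs(along)

def edge_distances(u, v, s, t):
    """(vd, hd) of edge e(u,v), taking the edge midpoint as its position (assumption)."""
    edge_mid = [(a + b) / 2 for a, b in zip(u, v)]
    return constraint_distances(edge_mid, s, t)

# Constraint axis from (0,0) to (4,0); edge from (2,2) to (4,2).
print(edge_distances((2, 2), (4, 2), (0, 0), (4, 0)))  # (2.0, 1.0)
```

Under these assumptions the edge sits 2 units off the axis and 1 unit from its midpoint; how those values map to an effect strength depends on the probabilistic model in the slides.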
Magnetically Affected Paths (MAP) • Compute the escalation/reduction ratio of each constraint [formulas not preserved in this transcript], where: • r = vertical distance effect • ∆ = horizontal distance effect • qe = weight of a cannot-link constraint • qr = weight of a must-link constraint • Typically, qe/qr ≈ 1.6 • Figure: edge (u,v) with weight w(u,v) near a constraint (s,t)
Magnetically Affected Paths (MAP) • Compute the overall escalation/reduction ratio on an edge • The overall effect on an edge is quantified as the total effect of all constraints [formula not preserved in this transcript] • Multiply the overall ratio by the edge weight to assign the new edge weight (1 < α < ∞) [formula not preserved in this transcript]
EMC (ElectroMagnetic Field Based Clustering) Framework • Three-step clustering framework • Graph Construction • Readjustment of Edge Weights • Clustering Process
EMC (ElectroMagnetic Field Based Clustering) Framework • Graph Construction • Select the n-nearest neighbors for each object • Connect the neighborhood and use Euclidean distance as edge weight • If graph not connected, add new edges between disconnected components • Readjustment of Edge Weights • Apply the MAP concept on graph • all (+) and (-) edges applied before clustering step • Extract new affinity matrix using new edge weights • Employ k-shortest path distance as distance metric • Better than single shortest path • Can utilize MAP better • Very slow for large graphs
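The graph construction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not say which edges are added between disconnected components, so the sketch assumes the closest pair of points across components is bridged. The names `build_knn_graph` and `connected_components` are hypothetical.

```python
import math

def connected_components(graph):
    """Return the connected components of an adjacency dict as lists of nodes."""
    seen, comps = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            comp.append(node)
            for nb in graph[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        comps.append(comp)
    return comps

def build_knn_graph(points, n_neighbors):
    """Undirected n-nearest-neighbor graph with Euclidean edge weights."""
    n = len(points)

    def dist(i, j):
        return math.dist(points[i], points[j])

    graph = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: dist(i, j))
        for j in nearest[:n_neighbors]:
            graph[i][j] = graph[j][i] = dist(i, j)
    # Assumption: join components through their closest pair of points.
    while True:
        comps = connected_components(graph)
        if len(comps) == 1:
            return graph
        rest = [j for comp in comps[1:] for j in comp]
        i, j = min(((a, b) for a in comps[0] for b in rest),
                   key=lambda pair: dist(*pair))
        graph[i][j] = graph[j][i] = dist(i, j)

graph = build_knn_graph([(0, 0), (1, 0), (10, 0), (11, 0)], n_neighbors=1)
print(graph[1][2])  # 9.0, the bridge between the two components
```

The brute-force neighbor search is quadratic in the number of points, which is acceptable for a sketch but would need a spatial index for large datasets.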
EMC (ElectroMagnetic Field Based Clustering) Framework • Clustering Process • Run a clustering algorithm using the new affinity matrix • Any clustering algorithm compatible with graphs • K-Means • Hierarchical • SS-Kernel-KMeans, etc. • We have used K-Medoids and Hierarchical clustering algorithms • Since they have similar results, we report only K-Medoids results • A small number of constraints improves accuracy significantly • Other algorithms need more constraints to achieve the same performance
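Because K-Medoids needs only pairwise distances, it can run directly on the matrix extracted from the readjusted graph. The sketch below is a simple alternating variant (assign points, then re-pick each medoid), not necessarily the variant used in the experiments; the function name, the caller-supplied initial medoids and the iteration cap are assumptions.

```python
def k_medoids(dist, medoids, max_iter=100):
    """Cluster items given a full distance matrix and initial medoid indices.

    Returns (medoids, labels), where labels[i] is the index into `medoids`
    of the medoid closest to item i.
    """
    n = len(dist)
    medoids = list(medoids)

    def assign(current):
        return [min(range(len(current)), key=lambda c: dist[i][current[c]])
                for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for c, old in enumerate(medoids):
            members = [i for i in range(n) if labels[i] == c]
            if not members:          # keep the old medoid for an empty cluster
                updated.append(old)
                continue
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(members,
                               key=lambda m: sum(dist[m][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(medoids)

# Four items on a line at positions 0, 1, 10 and 11.
dist = [[0, 1, 10, 11],
        [1, 0, 9, 10],
        [10, 9, 0, 1],
        [11, 10, 1, 0]]
print(k_medoids(dist, medoids=[0, 1]))  # ([0, 2], [0, 0, 1, 1])
```

Even with both initial medoids in the same group, the update step moves one of them to the other group after a single iteration.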
Two improvements for k-shortest paths • K-SD shortest path algorithm • Based on Dijkstra's algorithm • Each vertex keeps k distance entries • Paths are distinct (two paths cannot have a common edge) • Just k times slower than Dijkstra's algorithm • Divide-and-Conquer approach (Multilevel approach) • Partition the graph using multilevel graph partitioning • kmetis: partitions large graphs into equal-sized subgraphs • Very fast (takes just a few seconds to partition very large graphs) • Identify hubs • The nodes residing on the boundary of a partition • Connected to at least two partitions • These are the only way from one partition to the next partition • Figure: Hubs between two partitions
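Hub identification follows directly from the definition above: a node is a hub if at least one of its neighbors lies in a different partition. The sketch assumes the partition labels are already available (for example from a graph partitioner); `find_hubs` and its argument layout are hypothetical.

```python
def find_hubs(graph, part):
    """Map each partition id to the set of its hub nodes.

    graph: adjacency dict {node: {neighbor: weight}}
    part:  dict {node: partition id}
    """
    hubs = {}
    for node, neighbors in graph.items():
        # A hub has at least one edge crossing into another partition.
        if any(part[nb] != part[node] for nb in neighbors):
            hubs.setdefault(part[node], set()).add(node)
    return hubs

# Path graph 0 - 1 - 2 - 3 split into partitions {0, 1} and {2, 3}.
graph = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {2: 1.0}}
part = {0: 0, 1: 0, 2: 1, 3: 1}
print(find_hubs(graph, part))  # {0: {1}, 1: {2}}
```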
Two improvements for k-shortest paths • Divide-and-Conquer approach (Cont.) • Extract distance matrix for each partition • Merge the distance matrices using the hubs • At least 20 times faster compared to original K-SD shortest path algorithm • Applicable to very large graphs
Constructing the hub graph and extracting the SHub matrix
Computing the K-SD shortest path distance • Update distances from the first partition's node1 to the second partition's hubs through the first hub • SHub is used for the transition from first-partition hubs to second-partition hubs
Computing the K-SD shortest path distance • Update distances from the first partition's node1 to the second partition's hubs through the second hub • SHub is used for the transition from first-partition hubs to second-partition hubs
Computing the K-SD shortest path distance • Update distances from the first partition's node1 to the second partition's hubs through the last hub • SHub is used for the transition from first-partition hubs to second-partition hubs
Computing the K-SD shortest path distance • Update distances from the second partition's nodes to the first partition's node1 through the second partition's hubs • At this moment, all second-partition hubs have their distances to the first partition's node1 • SHub is used for the transition from first-partition hubs to second-partition hubs
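The merge walkthrough above can be sketched for the simplest case: a single shortest distance rather than k distinct paths. The sketch assumes three inputs that are hypothetical in name and layout: `d_first` (distances inside the first partition), `s_hub` (hub-to-hub distances between the two partitions) and `d_second` (distances inside the second partition).

```python
import math

def cross_partition_distance(d_first, d_second, s_hub, source, target):
    """Shortest distance from `source` (first partition) to `target` (second).

    d_first[source][h1]  : distance from source to a first-partition hub h1
    s_hub[h1][h2]        : distance from hub h1 to second-partition hub h2
    d_second[h2][target] : distance from hub h2 to target
    """
    # Step 1: go through each first-partition hub in turn and update the
    # best known distance from source to every second-partition hub.
    to_second_hubs = {}
    for h1, row in s_hub.items():
        for h2, hop in row.items():
            candidate = d_first[source][h1] + hop
            if candidate < to_second_hubs.get(h2, math.inf):
                to_second_hubs[h2] = candidate
    # Step 2: extend from the second-partition hubs to the target node.
    return min(d + d_second[h2][target] for h2, d in to_second_hubs.items())

d_first = {"a": {"h1": 2, "h2": 5}}
s_hub = {"h1": {"g1": 4, "g2": 1}, "h2": {"g1": 1, "g2": 7}}
d_second = {"g1": {"z": 3}, "g2": {"z": 10}}
print(cross_partition_distance(d_first, d_second, s_hub, "a", "z"))  # 9
```

Extending this to K-SD would mean keeping k entries per hub instead of one and enforcing edge-distinct paths, which the sketch does not attempt.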
Experiments • Implemented in Java and Matlab • Synthetic and real datasets • Datasets from UCI Machine Learning Repository: • Soybean, Iris, Wine, Ionosphere, Balance, Breast cancer, Satellite
Experiments • EMCK-Means Experiments: • Graph construction • Varied the number of paths and the number of nearest neighbors • Readjustment phase • The number of constraints is increased in steps of 10% · |Dataset| • Algorithms compared: • MPCK-Means: Unifies distance-based and metric-based approaches • Diagonal Metric: Learns a distance metric with weighted dimensions • EMCK-Means: MAP implementation with K-Medoids • SS-Kernel-KMeans: Performs graph clustering based on the min-cut objective • Experimental Setup: • Same constraint sets used for each algorithm • Constraints are chosen at random • x% · N constraints, where N is the dataset size • Each algorithm is run 200 times
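Drawing x% · N random constraints could look like the sketch below. The slides do not say how a random pair becomes a must-link or cannot-link constraint; this assumes the usual practice of comparing ground-truth class labels. The function name and the seed parameter are hypothetical.

```python
import random

def sample_constraints(labels, percent, seed=0):
    """Draw round(percent/100 * N) distinct random pairs and split them into
    must-link and cannot-link lists using ground-truth labels (assumption)."""
    rng = random.Random(seed)
    n = len(labels)
    count = round(percent / 100 * n)
    if count > n * (n - 1) // 2:
        raise ValueError("more constraints requested than distinct pairs")
    must_link, cannot_link, seen = [], [], set()
    while len(seen) < count:
        i, j = rng.sample(range(n), 2)
        pair = (min(i, j), max(i, j))
        if pair in seen:
            continue                 # skip a pair that was already drawn
        seen.add(pair)
        if labels[i] == labels[j]:
            must_link.append(pair)
        else:
            cannot_link.append(pair)
    return must_link, cannot_link

labels = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
must_link, cannot_link = sample_constraints(labels, percent=30)
print(len(must_link) + len(cannot_link))  # 3
```

Fixing the seed makes it easy to hand the same constraint set to every algorithm, as the setup requires.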
Experiments • Clustering results for EMCK-Means on: • Wine, Balance, Breast Cancer, Ionosphere, Iris and Soybean datasets • The number of shortest paths is varied from 5 to 20
Comparison of Algorithms • Comparison of EMC, MPCK-Means, KMeans+Diagonal metric and SS-Kernel-KMeans • Outperforms on Iris, Balance and Ionosphere • Reasonable for Soybean and Breast Cancer • Almost no gain at all for Wine
Conclusions • The EMC framework offers flexible and more accurate clustering in the graph domain • We can integrate other clustering algorithms into the framework • A small number of constraints improves accuracy significantly • More constraints can be applied at any time • Running time drops significantly as the number of partitions, p, increases • Future Work • Multilevel EMC • Coarsen the graph • Perform clustering • Refinement • Performs much faster than other algorithms without any significant change in accuracy • No hubs or merge process