Applying Electromagnetic Field Theory Concepts to Clustering with Constraints
Huseyin Hakkoymaz, Georgios Chatzimilioudis, Dimitrios Gunopulos and Heikki Mannila
Motivation
• Well-known problem, the Dimensionality Curse:
  • As the # of dimensions increases, distance metrics start losing their usefulness
• Relative distances
  • Unlike exact distances, relative distance-based metrics have some immunity to the curse
  • Shortest paths are computed using the edges of a graph
• Local distance adjustments
  • In many domains, local changes affect the whole system
    • Cancer cells in a body, sensor depletion in a network, etc.
  • The same idea holds for distance metrics
  • Relative distances supported by pairwise constraints perform much better
  • Constraints cause changes in local distances
[Figures: (a) change of a unit shape as dimensionality increases; (b) the distance matrix becomes useless as dimensionality keeps increasing]
Motivation (2)
• The best environment to realize these objectives: a GRAPH
  • The graph is treated as an electromagnetic field (EMF)
  • Pairwise constraints are expressed naturally
• Constraints act as EMF sources exerting force over edges
  • The force causes reduction or escalation of edge weights
• No limit on the reduction/escalation amount, thanks to the graph domain
  • Cartesian-space metrics are bounded by the triangle inequality
[Figure: a negative constraint]
Related Work
• Distance metric learning [Xing et al.]:
  • Global linear transformation of the data points
  • A different weight for each dimension
  • Shortcomings:
    • May fail in some cases where plain Euclidean distance does better
• Integrating constraints and metric learning in semi-supervised clustering [Bilenko et al.]:
  • Local weights for each cluster
  • Readjustment of the weights at each iteration
  • Combines constraints and metric learning in the objective function
  • Shortcomings:
    • Sometimes fails to adjust weights locally
    • No guarantee of better accuracy with more constraints
[Figure panels: K-Means; K-Means + distance metric (w1x = w1y, w2x = w2y); MPCK-Means (w1x > w1y, w2x < w2y)]
Related Work
• Semi-supervised graph clustering: a kernel approach [Kulis et al.]:
  • Maps the data points into a new feature space
  • Similarity between the Kernel-KMeans and graph-clustering objectives
  • Works for both vector and graph data
  • Shortcomings:
    • An optimal kernel is required for good results
    • Computing the optimal kernel takes a long time
    • Relies mostly on the min-cut objective, not on distance
[Figure panels: correct clustering vs. the SS-Kernel-KMeans approach]
Magnetically Affected Paths (MAP)
• Two special edge types for constraints:
  • Positive edge: must-link constraints
  • Negative edge: cannot-link constraints
• Definitions:
  • Reduction ratio: amount of decrease in an edge weight (+)
  • Escalation ratio: amount of increase in an edge weight (-)
[Figure: positive edges and a negative edge]
Magnetically Affected Paths (MAP)
• Each constraint edge affects regular edges based on:
  • Constraint type
  • Vertical distance (vd): distance to the constraint axis
  • Horizontal distance (hd): distance to the midpoint of the constraint axis
• Vertical and horizontal effects follow a probabilistic model:
  • If vd increases, the effect decreases for both (+) and (-) constraints
  • If hd increases, the effect decreases for (-) constraints
  • hd has no effect on (+) constraints
[Figure: an edge e(u,v) and a constraint axis (s,t), with vd(u,v) measured to the axis and hd(u,v) to its midpoint]
Magnetically Affected Paths (MAP)
• Compute the escalation/reduction ratio of each constraint, where
  • r = vertical distance effect
  • ∆ = horizontal distance effect
  • qe = weight of a cannot-link constraint
  • qr = weight of a must-link constraint
• Typically, qe/qr ≈ 1.6
[Formula figures omitted; illustrated on an edge w(u,v) under a constraint (s,t)]
Magnetically Affected Paths (MAP)
• Compute the overall escalation/reduction ratio on an edge
  • The overall effect on an edge is quantified as the total effect of all constraints
• Multiply the overall ratio by the edge weight to assign the new edge weight (1 < α < ∞); see the sketch below
[Formula figures omitted]
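The ratio formulas survive only as images in the original slides, so the following is a minimal sketch of the readjustment step under assumed definitions: the vertical and horizontal effects decay exponentially with distance, per-constraint effects combine multiplicatively, and all identifiers (MapAdjuster, Constraint, decay, q) are illustrative rather than the paper's exact formulation.

```java
import java.util.List;

/** Minimal sketch of MAP edge-weight readjustment; decay and combination rules are assumed. */
final class MapAdjuster {

    /** A pairwise constraint between points a and b with strength q (assumed 0 < q < 1). */
    record Constraint(double[] a, double[] b, boolean mustLink, double q) {}

    /** Returns the readjusted weight of edge (u, v): overall ratio times the original weight. */
    static double adjust(double[] u, double[] v, double weight, List<Constraint> constraints) {
        double ratio = 1.0;
        double[] mid = midpoint(u, v);
        for (Constraint c : constraints) {
            double vd = distanceToLine(mid, c.a(), c.b());     // vertical distance to the axis
            double hd = dist(mid, midpoint(c.a(), c.b()));     // horizontal distance to its midpoint
            if (c.mustLink()) {
                ratio *= 1.0 - c.q() * decay(vd);              // (+): vd only, reduces the weight
            } else {
                ratio *= 1.0 + c.q() * decay(vd) * decay(hd);  // (-): vd and hd, escalates the weight
            }
        }
        return weight * ratio;
    }

    // Assumed decay function: the effect shrinks exponentially with distance.
    private static double decay(double d) { return Math.exp(-d); }

    private static double[] midpoint(double[] a, double[] b) {
        double[] m = new double[a.length];
        for (int i = 0; i < a.length; i++) m[i] = (a[i] + b[i]) / 2.0;
        return m;
    }

    private static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    /** Euclidean distance from point p to the line through a and b (the constraint axis). */
    private static double distanceToLine(double[] p, double[] a, double[] b) {
        double dot = 0, len2 = 0;
        for (int i = 0; i < p.length; i++) {
            dot += (p[i] - a[i]) * (b[i] - a[i]);
            len2 += (b[i] - a[i]) * (b[i] - a[i]);
        }
        double t = (len2 == 0) ? 0 : dot / len2;
        double[] proj = new double[p.length];
        for (int i = 0; i < p.length; i++) proj[i] = a[i] + t * (b[i] - a[i]);
        return dist(p, proj);
    }
}
```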
EMC (ElectroMagnetic Field Based Clustering) Framework
• A 3-step clustering framework:
  • Graph construction
  • Readjustment of edge weights
  • Clustering process
EMC (ElectroMagnetic Field Based Clustering) Framework
• Graph construction (a sketch follows this slide)
  • Select the n nearest neighbors of each object
  • Connect the neighborhood, using Euclidean distance as the edge weight
  • If the graph is not connected, add new edges between the disconnected components
• Readjustment of edge weights
  • Apply the MAP concept to the graph
    • All (+) and (-) edges are applied before the clustering step
  • Extract the new affinity matrix using the new edge weights
  • Employ the k-shortest-path distance as the distance metric
    • Better than a single shortest path
    • Can utilize MAP better
    • Very slow for large graphs
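As a companion to the graph-construction step above, here is a minimal sketch assuming a brute-force nearest-neighbor search; the repair pass for disconnected components is only noted in a comment, and the names (GraphBuilder, knnGraph) are illustrative.

```java
import java.util.*;

/** Sketch of the graph-construction step: an n-nearest-neighbor graph with Euclidean weights. */
final class GraphBuilder {

    /** adj.get(i) maps a neighbor index to the edge weight (Euclidean distance). */
    static List<Map<Integer, Double>> knnGraph(double[][] points, int n) {
        int m = points.length;
        List<Map<Integer, Double>> adj = new ArrayList<>();
        for (int i = 0; i < m; i++) adj.add(new HashMap<>());
        for (int i = 0; i < m; i++) {
            // Brute-force search: sort all points by distance to point i.
            Integer[] order = new Integer[m];
            for (int j = 0; j < m; j++) order[j] = j;
            final double[] p = points[i];
            Arrays.sort(order, Comparator.comparingDouble(j -> dist(p, points[j])));
            for (int k = 1; k <= n && k < m; k++) {   // order[0] is i itself
                int j = order[k];
                double w = dist(points[i], points[j]);
                adj.get(i).put(j, w);                 // undirected graph: store both directions
                adj.get(j).put(i, w);
            }
        }
        // A connectivity pass would go here: find components and bridge them with new edges.
        return adj;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```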
EMC (ElectroMagnetic Field Based Clustering) Framework
• Clustering process
  • Run a clustering algorithm using the new affinity matrix
  • Any clustering algorithm compatible with graphs works:
    • K-Means
    • Hierarchical
    • SS-Kernel-KMeans, etc.
• We used the K-Medoids and hierarchical clustering algorithms (K-Medoids is sketched below)
  • Since they give similar results, we report only the K-Medoids results
• A small number of constraints improves accuracy significantly
  • Other algorithms need more constraints to achieve the same performance
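A minimal sketch of the clustering step, assuming a plain K-Medoids loop over the precomputed distance matrix D (e.g., the k-shortest-path distances); initialization and convergence checks are simplified, and the names are illustrative.

```java
import java.util.Random;

/** Sketch of K-Medoids over a precomputed distance matrix (the readjusted affinities). */
final class KMedoids {

    /** Returns the cluster index of each point; D[i][j] is the graph distance between i and j. */
    static int[] cluster(double[][] D, int k, int iterations, long seed) {
        int n = D.length;
        int[] medoids = new Random(seed).ints(0, n).distinct().limit(k).toArray();
        int[] assignment = new int[n];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: each point joins its nearest medoid under D.
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (D[i][medoids[c]] < D[i][medoids[best]]) best = c;
                assignment[i] = best;
            }
            // Update step: the new medoid minimizes the total distance to its cluster members.
            for (int c = 0; c < k; c++) {
                double bestCost = Double.POSITIVE_INFINITY;
                for (int i = 0; i < n; i++) {
                    if (assignment[i] != c) continue;
                    double cost = 0;
                    for (int j = 0; j < n; j++) if (assignment[j] == c) cost += D[i][j];
                    if (cost < bestCost) { bestCost = cost; medoids[c] = i; }
                }
            }
        }
        return assignment;
    }
}
```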
Two improvements for k-shortest paths
• K-SD shortest path algorithm (see the sketch after this slide)
  • Based on Dijkstra's algorithm
  • Each vertex keeps k distance entries
  • Paths are distinct (two paths cannot have a common edge)
  • Only about k times slower than Dijkstra's algorithm
• Divide-and-conquer (multilevel) approach
  • Partition the graph using multilevel graph partitioning
    • kmetis: partitions large graphs into equal-sized subgraphs
    • Very fast (takes just a few seconds to partition very large graphs)
  • Identify hubs
    • The nodes residing on the boundary of a partition
    • Connected to at least two partitions
    • The only way from one partition to the next
[Figure: hubs between two partitions]
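The slides describe K-SD only at a high level, so the sketch below reconstructs the "distinct paths" property the simple way: run Dijkstra up to k times and delete the edges of each path found. This matches the stated behavior (edge-disjoint paths, roughly k times the cost of one Dijkstra run) but is an assumed reconstruction, not the paper's exact algorithm.

```java
import java.util.*;

/** Sketch: up to k edge-disjoint shortest path lengths from s to t via repeated Dijkstra. */
final class KShortestDisjoint {

    static List<Double> kPaths(List<Map<Integer, Double>> adj, int s, int t, int k) {
        // Work on a copy so the caller's graph is left untouched.
        List<Map<Integer, Double>> g = new ArrayList<>();
        for (Map<Integer, Double> m : adj) g.add(new HashMap<>(m));
        List<Double> lengths = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            int[] prev = new int[g.size()];
            double[] dist = dijkstra(g, s, prev);
            if (dist[t] == Double.POSITIVE_INFINITY) break;  // no further disjoint path exists
            lengths.add(dist[t]);
            for (int v = t; v != s; v = prev[v]) {           // delete the edges of this path
                g.get(prev[v]).remove(v);
                g.get(v).remove(prev[v]);
            }
        }
        return lengths;
    }

    private static double[] dijkstra(List<Map<Integer, Double>> g, int s, int[] prev) {
        double[] dist = new double[g.size()];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        dist[s] = 0;
        PriorityQueue<double[]> pq = new PriorityQueue<>(Comparator.comparingDouble(e -> e[0]));
        pq.add(new double[]{0, s});
        while (!pq.isEmpty()) {
            double[] top = pq.poll();
            int u = (int) top[1];
            if (top[0] > dist[u]) continue;                  // stale queue entry
            for (Map.Entry<Integer, Double> e : g.get(u).entrySet()) {
                double nd = dist[u] + e.getValue();
                if (nd < dist[e.getKey()]) {
                    dist[e.getKey()] = nd;
                    prev[e.getKey()] = u;
                    pq.add(new double[]{nd, e.getKey()});
                }
            }
        }
        return dist;
    }
}
```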
Two improvements for k-shortest paths
• Divide-and-conquer approach (cont.)
  • Extract the distance matrix for each partition
  • Merge the distance matrices using the hubs
  • At least 20 times faster than the original K-SD shortest path algorithm
  • Applicable to very large graphs
SHub
• Constructing the hub graph and extracting the SHub matrix
SHub: computing the K-SD shortest path distance
• Update distances from the first partition's node1 to the second partition's hubs through the first hub
• SHub is used for the transition from first-partition hubs to second-partition hubs

SHub: computing the K-SD shortest path distance
• Update distances from the first partition's node1 to the second partition's hubs through the second hub
• SHub is used for the transition from first-partition hubs to second-partition hubs

SHub: computing the K-SD shortest path distance
• Update distances from the first partition's node1 to the second partition's hubs through the last hub
• SHub is used for the transition from first-partition hubs to second-partition hubs

SHub: computing the K-SD shortest path distance
• Update distances from the nodes of the second partition to the first partition's node1 through the second partition's hubs
• At this point, all second-partition hubs have their distances to the first partition's node1
• SHub is used for the transition from first-partition hubs to second-partition hubs (see the sketch below)
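The merge step walked through above reduces to a min-plus relaxation over hub pairs: the distance from a node in one partition to a node in another is the best combination of (node to a partition-1 hub) + (SHub transition) + (partition-2 hub to the node). A minimal sketch under that reading, with all names illustrative:

```java
/** Sketch of the hub-based merge: cross-partition distance via min-plus relaxation over hubs. */
final class HubMerge {

    /**
     * dToHub1[h1]   : distance from a source node in partition 1 to that partition's h1-th hub
     * sHub[h1][h2]  : hub-to-hub transition distance (the SHub matrix)
     * dFromHub2[h2] : distance from partition 2's h2-th hub to the target node
     */
    static double crossDistance(double[] dToHub1, double[][] sHub, double[] dFromHub2) {
        double best = Double.POSITIVE_INFINITY;
        for (int h1 = 0; h1 < dToHub1.length; h1++)        // relax through every hub pair,
            for (int h2 = 0; h2 < dFromHub2.length; h2++)  // as in the slide animation above
                best = Math.min(best, dToHub1[h1] + sHub[h1][h2] + dFromHub2[h2]);
        return best;
    }
}
```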
Experiments
• Implemented in Java and Matlab
• Synthetic and real datasets
  • Datasets from the UCI Machine Learning Repository:
    • Soybean, Iris, Wine, Ionosphere, Balance, Breast Cancer, Satellite
Experiments
• EMCK-Means experiments:
  • Graph construction
    • Varied the # of paths and the # of nearest neighbors
  • Readjustment phase
    • The number of constraints is increased in steps of 10% · |Dataset|
• Compared against:
  • MPCK-Means: unifies the distance-based and metric-based approaches
  • Diagonal metric: learns a distance metric with weighted dimensions
  • EMCK-Means: the MAP implementation with K-Medoids
  • SS-Kernel-KMeans: performs graph clustering based on the min-cut objective
• Experimental setup:
  • The same constraint sets are used for each algorithm
  • Constraints are chosen at random: x% · N, where N is the dataset size
  • Each algorithm is run 200 times
Experiments
• Clustering results for EMCK-Means on:
  • the Wine, Balance, Breast Cancer, Ionosphere, Iris and Soybean datasets
• The number of shortest paths is varied from 5 to 20
Comparison of Algorithms
• Comparison of EMC, MPCK-Means, KMeans + diagonal metric, and SS-Kernel-KMeans
  • Outperforms the others on Iris, Balance and Ionosphere
  • Reasonable results on Soybean and Breast Cancer
  • Almost no gain at all on Wine
Conclusions
• The EMC framework offers flexible and more accurate clustering in the graph domain
  • Other clustering algorithms can be integrated into the framework
  • A small number of constraints improves accuracy significantly
  • More constraints can be applied at any time
  • The running time drops significantly as the number of partitions p increases
• Future work
  • Multilevel EMC
    • Coarsen the graph
    • Perform the clustering
    • Refine the result
    • Runs much faster than the other algorithms without any significant change in accuracy
    • No hubs or merge process needed