Take a Walk and Cluster Genes: A TSP-based Approach to Optimal Rearrangement Clustering

Sharlee Climer and Weixiong Zhang
This research was supported in part by NDSEG and Olin Fellowships and by NSF grants IIS-0196057 and ITR/EIA-0113618.

Overview
• Introduction
• Example
• Results
• Conclusion
Washington University in St. Louis

Introduction
• Rearrangement clustering
• Rearrange rows of a matrix
• Minimize the sum of the differences between adjacent rows
• $\min \sum_i d(i, i+1)$
• Rows correspond to objects
• Columns correspond to features
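
The objective on this slide can be sketched in a few lines of Python. The slide does not say which difference measure d is used, so the sum of absolute feature differences below is an assumption, and the names `row_distance` and `rearrangement_cost` are hypothetical.

```python
def row_distance(a, b):
    """Difference between two rows; sum of absolute differences is an assumed choice."""
    return sum(abs(x - y) for x, y in zip(a, b))


def rearrangement_cost(matrix, order):
    """Sum of d(i, i+1) over adjacent rows when rows are placed in the given order."""
    return sum(
        row_distance(matrix[order[i]], matrix[order[i + 1]])
        for i in range(len(order) - 1)
    )


# Toy matrix: rows are objects, columns are features.
matrix = [
    [0, 0, 1],
    [5, 5, 5],
    [0, 1, 1],
    [5, 4, 5],
]

original = rearrangement_cost(matrix, [0, 1, 2, 3])    # similar rows are separated
rearranged = rearrangement_cost(matrix, [0, 2, 1, 3])  # similar rows are adjacent
print(original, rearranged)
```

Placing similar rows next to each other lowers the cost, which is what the rearrangement is meant to achieve.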

Introduction
• Applications
• Information retrieval
• Manufacturing
• Software engineering

Example

Example
• Bond Energy Algorithm (BEA)
• Introduced in 1972 (McCormick, Schweitzer, White)
• Approximate solution
• Still widely used

Example
• Optimal solution
• Lenstra (1974) observed equivalence to the Traveling Salesman Problem (TSP)
• Given n cities and the distance between each pair
• Find shortest cycle visiting every city
• NP-hard problem

Example
• Transform into a TSP
• Each object corresponds to a city
• Distance between two cities equals the difference between the corresponding objects
• Dummy city added to problem
• Costs from dummy city to all other cities equal a constant
• Location of dummy city indicates the position at which to cut the cycle into a path
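
A minimal sketch of this transformation, assuming a tiny instance so that exhaustive search can stand in for a real TSP solver. The helper names (`add_dummy_city`, `brute_force_tour`, `cut_at_dummy`) and the 1-D toy data are hypothetical and are not taken from the talk.

```python
from itertools import permutations


def add_dummy_city(dist, constant=0.0):
    """Append one dummy city whose distance to every real city is a constant."""
    n = len(dist)
    augmented = [row[:] + [constant] for row in dist]
    augmented.append([constant] * n + [0.0])
    return augmented


def brute_force_tour(dist):
    """Shortest cycle by exhaustive search; only feasible for very small inputs."""
    n = len(dist)
    best_tour, best_len = None, float("inf")
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        length = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if length < best_len:
            best_tour, best_len = list(tour), length
    return best_tour


def cut_at_dummy(tour, dummy):
    """Remove the dummy city and open the cycle into a path at its position."""
    i = tour.index(dummy)
    return tour[i + 1:] + tour[:i]


# Four objects with one feature each; distance is the absolute difference.
values = [0, 10, 1, 11]
dist = [[abs(a - b) for b in values] for a in values]

tour = brute_force_tour(add_dummy_city(dist))
path = cut_at_dummy(tour, dummy=len(values))
print(path)
```

Because the dummy city is equally far from every real city, it adds the same amount to every tour, so the shortest cycle corresponds to the shortest path through the real cities.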

Example
• TSP solvers were extremely slow, even for small problems, in the 1970s
• Massive research efforts to solve TSP over the last three decades
• Current solvers
• Concorde (Applegate, Bixby, Chvátal, Cook, 2001)
• Solved a 15,112-city TSP

Example
• BEA and TSP offer approximate and optimal solutions, respectively
• We have observed a flaw in the objective function when the objects form natural clusters
• The objective minimizes the sum of the distances between every pair of adjacent rows
• Inter-cluster distances tend to be significantly larger than intra-cluster distances
• Summation dominated by inter-cluster distances
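
The effect described on this slide can be illustrated with a toy calculation. The 1-D points, the cluster labels, and the function name `split_cost` are all hypothetical; the sketch only shows how a single jump between clusters can outweigh every within-cluster step.

```python
def split_cost(points, labels, order):
    """Split the adjacent-row sum into intra-cluster and inter-cluster parts."""
    intra = inter = 0
    for a, b in zip(order, order[1:]):
        d = abs(points[a] - points[b])
        if labels[a] == labels[b]:
            intra += d
        else:
            inter += d
    return intra, inter


# Two tight groups that lie far apart (illustrative values only).
points = [0, 1, 2, 100, 101, 102]
labels = [0, 0, 0, 1, 1, 1]
order = list(range(len(points)))

intra, inter = split_cost(points, labels, order)
share = inter / (intra + inter)
print(intra, inter, round(share, 2))
```

In this toy case the one inter-cluster step accounts for almost the whole objective, so the ordering inside each cluster has little influence on the total.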

Example
• TSPCluster addresses this flaw
• Add k dummy cities
• k clusters are specified by the output
• TSP solver ignores inter-cluster distances
• Minimizes sum of intra-cluster distances
• Use sufficiently small constant for distances to/from dummy cities
• Dummy cities never adjacent to each other
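
The k-dummy-city idea can be sketched as follows. This is not the authors' implementation: exhaustive search stands in for a real TSP solver, the large dummy-to-dummy distance `big` is an assumed way of keeping dummy cities apart (the slide does not say how this is enforced), and all function names are hypothetical.

```python
from itertools import permutations


def add_dummy_cities(dist, k, constant=0.0, big=1e9):
    """Append k dummy cities: a constant distance to real cities, `big` between dummies."""
    n = len(dist)
    size = n + k
    augmented = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == j:
                continue
            if i < n and j < n:
                augmented[i][j] = dist[i][j]   # real-to-real distance is unchanged
            elif i >= n and j >= n:
                augmented[i][j] = big          # assumption: keeps dummies non-adjacent
            else:
                augmented[i][j] = constant     # dummy-to-real distance
    return augmented


def brute_force_tour(dist):
    """Shortest cycle by exhaustive search; only feasible for very small inputs."""
    n = len(dist)
    best_tour, best_len = None, float("inf")
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        length = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if length < best_len:
            best_tour, best_len = list(tour), length
    return best_tour


def split_at_dummies(tour, n):
    """Cut the cycle at every dummy city (index >= n) to obtain ordered clusters."""
    start = next(i for i, city in enumerate(tour) if city >= n)
    rotated = tour[start + 1:] + tour[:start + 1]  # rotated cycle ends with a dummy
    clusters, current = [], []
    for city in rotated:
        if city >= n:
            if current:
                clusters.append(current)
            current = []
        else:
            current.append(city)
    return clusters


values = [0, 1, 50, 51, 100, 101]  # three obvious pairs (toy data)
dist = [[abs(a - b) for b in values] for a in values]

tour = brute_force_tour(add_dummy_cities(dist, k=3))
clusters = split_at_dummies(tour, n=len(values))
print(clusters)
```

Each dummy city replaces one expensive inter-cluster edge with two constant-cost edges, so the solver is left minimizing only the distances inside the clusters.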

Results
• Arabidopsis
• 499 genes
• 25 conditions
• Comparison with BEA
• Used BEA similarity measure
• BEA score: 447,070
• TSPCluster score: 452,109 (k = 1)

Results (figure panels: BEA, TSPCluster)

Results
• Compared with Cluster (Eisen et al., 1998) and k-ary (Bar-Joseph et al., 2003)
• Used Pearson correlation coefficient
• Cluster: 398
• k-ary: 427
• TSPCluster: 436 (k = 1)
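
A hedged sketch of how an ordering might be scored with the Pearson correlation coefficient. It assumes the reported score is the sum of correlations between adjacent rows, with higher being better; the slide does not state this explicitly, and the function names and toy rows are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length rows."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def adjacency_score(matrix, order):
    """Assumed score: sum of Pearson correlations between adjacent rows."""
    return sum(
        pearson(matrix[order[i]], matrix[order[i + 1]])
        for i in range(len(order) - 1)
    )


rows = [
    [1, 2, 3],
    [3, 2, 1],
    [2, 4, 6],
]

print(adjacency_score(rows, [0, 1, 2]))  # correlated rows are separated
print(adjacency_score(rows, [0, 2, 1]))  # correlated rows are adjacent
```

To feed a similarity such as this into a distance-minimizing TSP solver, it would first have to be converted into a distance; how that conversion was done is not described in the slides.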

Results (figure panels: Cluster, k-ary, TSPCluster)

Results
• TSPCluster with k equal to 2 to 50
• How many clusters?
• Average inter-cluster distances
• BEA local peaks: 6, 13, 19, 26, 29, 35, 40, 47
• Pearson correlation coefficient local peaks: 3, 9, 12, 21, 26, 40
• Computation time varied from less than half a minute to about 3 minutes
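
One way to read "local peaks" is as values of k whose average inter-cluster distance exceeds that of both neighboring k. The sketch below assumes that reading; the helper `local_peaks` is hypothetical, and the numbers are made-up illustrative values, not results from the talk.

```python
def local_peaks(score_by_k):
    """Return each k whose score is strictly greater than the scores at k-1 and k+1."""
    peaks = []
    for k in sorted(score_by_k):
        left = score_by_k.get(k - 1)
        right = score_by_k.get(k + 1)
        if left is not None and right is not None:
            if score_by_k[k] > left and score_by_k[k] > right:
                peaks.append(k)
    return peaks


# Hypothetical average inter-cluster distances for k = 2..7 (illustration only).
avg_inter_cluster = {2: 1.0, 3: 3.0, 4: 2.0, 5: 2.5, 6: 4.0, 7: 1.5}
print(local_peaks(avg_inter_cluster))
```

The endpoints of the range are never reported as peaks because they have only one neighbor to compare against.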

Results (figure panels: k = 26, k = 40)

Conclusion
• Most problems have errors in their data
• Error introduced by approximation algorithms can’t be expected to “undo” this error
• Computers are cheap
• Computers and solvers are sophisticated
• Don’t always have to resort to approximate solutions, even for NP-hard problems

Conclusion
• Rearrangement clustering provides a linear ordering
• Linear ordering inherent to many applications
• Information retrieval
• Manufacturing
• Software engineering

Conclusion
• Gene data arranged in linear order to examine data
• Linear ordering not necessarily essential to gene clustering problems
• Current work
• Optimally solve subproblems in clustering algorithms

Questions?
