Iclust: Information-Based Clustering for Gene Expression Data Analysis

Iclust: information based clustering
Noam Slonim
The Lewis-Sigler Institute for Integrative Genomics, Princeton University
Joint work with Gurinder Atwal, Gasper Tkacik, and Bill Bialek

Running example: gene expression data
[Figure: matrix of K genes (rows) by N conditions (columns). Each entry is the (log) ratio of the mRNA expression level of a gene in a specific condition; some entries are shown as "??".]
• Relations between genes?
• Relations between experimental conditions?

Information as a correlation/similarity measure
Some nice features of the information measure:
• Model independent
• Responsive to any type of dependency
• Captures more than just pairwise relations
• Suitable for both continuous and discrete data
• Independent of the measurement scale
• Axiomatic

Mutual information - definition
We have some “uncertainty” about the state of gene-A; but now someone told us the state of gene-B…
• The resulting reduction in the uncertainty about the state of gene-A is called the mutual information between these two variables: how much we can learn from the state of gene-B about the state of gene-A (and vice versa).
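The definition above can be made concrete with a small sketch. It assumes each gene has just two hypothetical states ("low"/"high") and that the joint probability table is known exactly; the function names are illustrative and not part of any Iclust code. It uses the standard identity I(A;B) = H(A) + H(B) - H(A,B), which equals the reduction in uncertainty H(A) - H(A|B).

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability table, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information_bits(joint):
    """I(A;B) = H(A) + H(B) - H(A,B) for a joint table P(a, b)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()              # normalize, in case counts were passed
    h_a = entropy_bits(joint.sum(axis=1))    # marginal over b gives P(a)
    h_b = entropy_bits(joint.sum(axis=0))    # marginal over a gives P(b)
    return h_a + h_b - entropy_bits(joint)

# Hypothetical tables: rows are gene-A states, columns are gene-B states.
perfectly_coupled = [[0.5, 0.0], [0.0, 0.5]]   # knowing B removes all uncertainty about A
independent = [[0.25, 0.25], [0.25, 0.25]]     # knowing B tells us nothing about A

mi_coupled = mutual_information_bits(perfectly_coupled)
mi_independent = mutual_information_bits(independent)
```

In the coupled table, learning the state of gene-B removes the full one bit of uncertainty about gene-A; in the independent table it removes none.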

Model independence & responsiveness to “complicated” relations
[Figure: four scatter plots of gene-A expression level vs. gene-B expression level, labeled: MI ~1 bit, Corr. ~0.9; MI ~2 bits, Corr. ~0.6; MI ~1.3 bits, Corr. ~0; MI ~0 bits, Corr. ~0.]
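A rough illustration of the third panel's situation (high information, near-zero correlation) can be sketched with synthetic data. Everything here is an assumption made for illustration: the quadratic relation, the noise level, and the simple histogram ("plug-in") estimator, which is biased for small samples and is not necessarily the estimator used for the results in these slides.

```python
import numpy as np

def binned_mi_bits(x, y, bins=8):
    """Plug-in mutual information estimate (bits) from a 2-D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    joint = counts / counts.sum()
    p_x = joint.sum(axis=1, keepdims=True)   # shape (bins, 1)
    p_y = joint.sum(axis=0, keepdims=True)   # shape (1, bins)
    mask = joint > 0                         # skip empty cells to avoid log(0)
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (p_x @ p_y)[mask])))

rng = np.random.default_rng(0)
gene_a = rng.uniform(-1.0, 1.0, size=5000)
# Hypothetical non-monotonic dependence: gene-B follows the square of gene-A.
gene_b = gene_a ** 2 + rng.normal(0.0, 0.05, size=5000)

corr = float(np.corrcoef(gene_a, gene_b)[0, 1])
mi = binned_mi_bits(gene_a, gene_b)
```

Because the relation is symmetric around zero, the Pearson correlation should come out close to zero, while the binned information estimate stays well above zero.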

Capturing more than just pairwise relations
[Figure: two panels of expression vs. experiment index. gene-A/gene-B expression: MI ~0 bits, Corr. ~0. gene-A/gene-B/gene-C expression: triplet-information ~1.0 bits.]
Using a model-dependent correlation measure might result in missing significant dependencies in our data.
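A textbook case with the same flavor is XOR: three binary variables where the third is the exclusive-or of the first two. The sketch below assumes that "triplet information" means the multi-information (the sum of single-variable entropies minus the joint entropy); the binary XOR variables are a stand-in chosen for illustration, not the genes in the figure.

```python
import itertools
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability table, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Joint table P(a, b, c): a and b are fair coins, c = a XOR b.
joint = np.zeros((2, 2, 2))
for a, b in itertools.product((0, 1), repeat=2):
    joint[a, b, a ^ b] = 0.25

def marginal(keep):
    """Marginal over the axes listed in `keep` (sums out all other axes)."""
    drop = tuple(k for k in range(3) if k not in keep)
    return joint.sum(axis=drop)

# Multi-information of the triplet: H(A) + H(B) + H(C) - H(A,B,C).
triplet_info = sum(entropy_bits(marginal((k,))) for k in range(3)) - entropy_bits(joint)

# Pairwise mutual information for every pair: H(X) + H(Y) - H(X,Y).
pairwise_info = {
    (i, j): entropy_bits(marginal((i,))) + entropy_bits(marginal((j,)))
            - entropy_bits(marginal((i, j)))
    for i, j in itertools.combinations(range(3), 2)
}
```

Every pair of variables is statistically independent, so all pairwise values are zero, yet the triplet carries one full bit. Any analysis restricted to pairwise measures would miss this dependency.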

Mutual information vs. Pearson correlation: results in bacteria gene-expression data
[Figure: Mycobacterium tuberculosis, 81 experiments; panels: Mutual information; Pearson correlation.]

Information relations between gene expression profiles
Given the expression of gene-A, how much information do we have about the expression of gene-B (when averaging over all conditions)? The sample size is the number of conditions: 173 in the Gasch data.
Once we find these information relations, we often want to apply cluster analysis. Numerous clustering methods are available, but they typically assume a particular model. For example, K-means corresponds to the modeling assumption that each cluster can be described by a spherical Gaussian.
Back to square one…?

Iclust – information based clustering
What is a “good” cluster? A simple proposal: given a cluster, we pick two items at random, and we want them to be as “similar” to each other as possible. Namely, we wish to maximize the average information relations in our clusters, or to find clusters such that in each cluster all items are highly informative about each other.
Formally, we wish to maximize [equation not recovered], or … or … [alternative forms not recovered].
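The "pick two items at random from a cluster" proposal can be sketched numerically. The code assumes a pairwise similarity matrix, a uniform prior over items, and soft assignments P(c|i); it computes s(c) as the sum over i and j of P(i|c) P(j|c) s(i,j), and the average over clusters weighted by P(c). This is one reading of the slide's description, with illustrative names, and the 4-item matrix is made up.

```python
import numpy as np

def average_cluster_similarity(sim, p_c_given_i):
    """Return (<s>, s(c)) for similarity matrix `sim` and soft assignments P(c|i).

    Assumes a uniform prior over items and that no cluster is empty.
    """
    sim = np.asarray(sim, dtype=float)
    q = np.asarray(p_c_given_i, dtype=float)   # shape (N items, Nc clusters)
    n_items = sim.shape[0]
    p_c = q.mean(axis=0)                       # P(c) = sum_i P(c|i) / N
    p_i_given_c = q / (n_items * p_c)          # Bayes' rule with P(i) = 1/N
    # Expected similarity of two items drawn independently from cluster c.
    s_c = np.einsum("ic,ij,jc->c", p_i_given_c, sim, p_i_given_c)
    return float(p_c @ s_c), s_c

# Hypothetical similarities: items {0, 1} are alike, items {2, 3} are alike.
sim = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
good_split = [[1, 0], [1, 0], [0, 1], [0, 1]]   # groups similar items together
bad_split = [[1, 0], [0, 1], [1, 0], [0, 1]]    # mixes dissimilar items

good_score, _ = average_cluster_similarity(sim, good_split)
bad_score, _ = average_cluster_similarity(sim, bad_split)
```

The split that groups mutually similar items scores 1.0, while the mixed split scores 0.5, matching the intuition that a good cluster is one whose members are similar to each other on average.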

Iclust – information based clustering (cont.)
A penalty term that we wish to minimize, as in rate-distortion theory: [equation not recovered]
[Figure: three example clusterings]
• S(c) is maximized, but the penalty term is maximized as well (no compression).
• The penalty term is minimized (maximal compression), but S(c) is minimized as well.
• Intermediate, interesting cases: small penalty with high S(c).

Iclust – information based clustering (cont.)
The intuitive clustering problem can be turned into a general mathematical optimization problem:
[Equation not recovered; its annotated terms are: clustering parameters; expected information relations among data items; tradeoff parameter; information between data items and clusters.]
Clustering is formulated as trading bits of similarity against bits of descriptive power, without any further assumptions.
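The tradeoff can be illustrated with a small sketch. It assumes the objective has the form F = <s> - T * I(C;i), with <s> the expected within-cluster similarity, I(C;i) the information between items and clusters, and T the tradeoff parameter; this form is a reading of the term annotations above, and the similarity matrix and the value T = 0.25 are made up. The similarities here are in arbitrary units rather than bits.

```python
import numpy as np

def iclust_objective(sim, p_c_given_i, temperature):
    """Return (<s>, I(C;i) in bits, F) assuming F = <s> - T * I(C;i).

    Assumes a uniform prior over items and no empty clusters.
    """
    sim = np.asarray(sim, dtype=float)
    q = np.asarray(p_c_given_i, dtype=float)   # P(c|i), shape (N, Nc)
    n_items = sim.shape[0]
    p_c = q.mean(axis=0)
    p_i_given_c = q / (n_items * p_c)
    avg_s = float(p_c @ np.einsum("ic,ij,jc->c", p_i_given_c, sim, p_i_given_c))
    # I(C;i) = (1/N) sum_i sum_c P(c|i) log2[P(c|i) / P(c)]; zero entries contribute 0.
    ratio = np.where(q > 0, q / p_c, 1.0)
    info = float(np.sum(q * np.log2(ratio)) / n_items)
    return avg_s, info, avg_s - temperature * info

sim = np.kron(np.eye(2), np.ones((2, 2)))          # two blocks of two similar items
one_cluster = np.ones((4, 1))                      # maximal compression
two_blocks = np.kron(np.eye(2), np.ones((2, 1)))   # intermediate case
singletons = np.eye(4)                             # no compression

results = {name: iclust_objective(sim, q, temperature=0.25)
           for name, q in [("one_cluster", one_cluster),
                           ("two_blocks", two_blocks),
                           ("singletons", singletons)]}
```

With these made-up numbers, the single cluster pays no penalty but has low similarity, the singletons reach full similarity at a cost of 2 bits, and the two-block solution reaches full similarity for only 1 bit, so it gets the highest F of the three.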

Relations with classical rate distortion
[Equations not recovered: the Iclust formulation and the classical rate-distortion formulation, side by side.]
• For the special case of pairwise relations, the difference is whether the sum over i2 is taken before or after d is computed.
• If the distortion/similarity matrix is a kernel matrix, the formulations are equivalent.

And yet – some important differences
The two formulations induce different decoding schemes. A sender observes a pattern Φi, but is allowed to send only the cluster index, c.
• In classical rate distortion, the receiver is assumed to decode by [equation not recovered]: deterministic decoding with vocabulary size Nc.
• In Iclust, the receiver is assumed to decode by [equation not recovered]: stochastic decoding with vocabulary size N.
• Iclust is applicable when the raw data is given directly as pairwise relations.
• Iclust does not require a definition of a “prototype” (or “centroid”).
• Iclust can handle more than just pairwise correlations.

Iclust vs. classical rate-distortion decoding
[Figure: original image with 220 gray levels; Iclust (stochastic) decoding, 2 clusters; RD (deterministic) decoding, 2 clusters.]

Iclust algorithm - freely available Web implementation
[Equation not recovered; its annotated terms are: average “similarity” of i to c members; average “similarity” among c members.]
• Responsive to any type of dependency among the data
• Invariant to changes in the data representation
• Allows clustering based on more than pairwise relations
For more details: Slonim, Atwal, Tkacik, and Bialek (2005), Information based clustering, PNAS, in press. See www.princeton.edu/~nslonim
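To show how the two annotated terms could drive an iterative algorithm, here is a minimal sketch. It assumes a pairwise update of the form P(c|i) ∝ P(c) exp{[2 s(c;i) - s(c)] / T}, where s(c;i) is the average similarity of item i to the members of c and s(c) is the average similarity among the members of c. That form, the fixed iteration count, the random initialization, and all names are assumptions for illustration; this is not the Web implementation mentioned above and should be checked against the cited paper before any real use.

```python
import numpy as np

def iclust_sketch(sim, n_clusters, temperature, n_iter=200, seed=0):
    """Iterate an assumed self-consistent update for soft assignments P(c|i)."""
    sim = np.asarray(sim, dtype=float)
    n_items = sim.shape[0]
    rng = np.random.default_rng(seed)
    q = rng.dirichlet(np.ones(n_clusters), size=n_items)   # random initial P(c|i)
    for _ in range(n_iter):
        p_c = np.maximum(q.mean(axis=0), 1e-12)    # floor guards against empty clusters
        p_i_given_c = q / (n_items * p_c)          # Bayes' rule, uniform P(i)
        s_ci = sim @ p_i_given_c                   # s(c;i): similarity of i to c members
        s_c = np.einsum("ic,ic->c", p_i_given_c, s_ci)   # s(c): similarity among c members
        logits = np.log(p_c) + (2.0 * s_ci - s_c) / temperature
        logits -= logits.max(axis=1, keepdims=True)      # stabilize the exponentials
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)          # renormalize each row to sum to 1
    return q

# Hypothetical similarity matrix: two blocks of three mutually similar items.
sim = np.kron(np.eye(2), np.ones((3, 3)))
assignments = iclust_sketch(sim, n_clusters=2, temperature=0.2)
```

Items with identical similarity profiles always receive identical assignments under this update. On a clearly block-structured matrix like this one, the two blocks would be expected to separate at low temperature, although the sketch makes no convergence guarantees and does no annealing over T.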

Iclust – clusters examples

Clusters of genes
• C18: RPS10A, RPS10B, RPS11A, RPS11B, RPS12, … (proteins of the small ribosomal subunit)
• C15: FRS1, KRS1, SES1, TYS1, VAS1, … (enzymes that attach amino acids to tRNA)
• C4: PGM2, UGP1, TSL1, TPS1, TPS2, … (enzymes involved in the trehalose anabolism pathway)

Clusters of stocks (data: dynamics of stock prices)
Given the price of stock-A, how much information do we have about the price of stock-B (when averaging over many days)?
• C17: Wal-Mart, Target, Home Depot, Best Buy, Staples, …
• C12: Microsoft, Apple Comp., Dell, HP, Motorola, …
• C2: NY Times, Tribune Co., Meredith Corp., Dow Jones & Co., Knight-Ridder Inc., …

Clusters of movies (data: rating by viewers)
Given the rating of movie-A, how much information do we have about the rating of movie-B (when averaging over many viewers)?
• C12: Snow White, Cinderella, Dumbo, Pinocchio, Aladdin, …
• C1: Psycho, Apocalypse Now, The Godfather, Taxi Driver, Pulp Fiction, …
• C7: Star Wars, Return of the Jedi, The Terminator, Alien, Apollo 13, …

Coherence results – comparison to alternative algorithms
[Figure: three panels (ESR, S&P 500, EachMovie), each comparing against K-means, K-medians, and Hierarchical clustering.]

Quick Summary
• Information as the core measure of data analysis, with many appealing features.
• Iclust: a novel information-theoretic formulation of clustering, with some intriguing relations to classical rate-distortion clustering.
• Validations: finding coherent gene clusters based on information relations in gene-expression data …
• … and finding coherent stock clusters and coherent movie clusters …
• … and genotype-phenotype association in bacteria, based on phylogenetic data: Slonim, Elemento & Tavazoie (2005), Mol. Systems Biol., in press.
• … and more?
