Mining Phenotype Structures Chun Tang and Aidong Zhang Bioinformatics Journal, 20(6):829-838, 2004
Microarray Data Analysis
• Analysis from two angles:
  • sample as object, gene as attribute
  • gene as object, sample/condition as attribute
Supervised Analysis
• Select training samples (hold out…)
• Sort genes (t-test, ranking…)
• Select informative genes (top 50 ~ 200)
• Cluster based on informative genes
[Figure: expression matrix over genes g1 … g4132, with informative genes showing a clear high/low pattern between Class 1 and Class 2 samples]
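As an illustration of this supervised workflow, the sketch below ranks genes with a two-sample t-test and keeps the top-scoring ones as informative genes. The matrix shape and the 14/14 class split mirror the MS-IFN data set used later; the function name, the random data, and the choice of top_k = 100 are illustrative, not taken from the paper.

```python
# Minimal sketch of the supervised workflow described above, assuming a
# genes-by-samples matrix X (rows = genes) and known class labels.
import numpy as np
from scipy import stats

def select_informative_genes(X, labels, top_k=100):
    """Rank genes by a two-sample t-test between the two classes and
    return the indices of the top_k highest-scoring genes."""
    class_a = X[:, labels == 0]
    class_b = X[:, labels == 1]
    t_scores, _ = stats.ttest_ind(class_a, class_b, axis=1)
    ranked = np.argsort(-np.abs(t_scores))   # largest |t| first
    return ranked[:top_k]

# Example: 4132 genes x 28 samples with a 14/14 class split (as in MS-IFN).
X = np.random.rand(4132, 28)
labels = np.array([0] * 14 + [1] * 14)
informative = select_informative_genes(X, labels, top_k=100)
```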
Unsupervised Analysis
• We focus on unsupervised sample partitioning, which assumes that no phenotype information is assigned to any sample.
• Since the initial biological identification of sample classes has been slow, typically evolving through years of hypothesis-driven research, automatically discovering sample patterns is a significant contribution to microarray data analysis.
• Many mature statistical methods cannot be applied unless the phenotypes of the samples are known in advance.
Unsupervised Analysis - Automatic Phenotype Structure Mining
[Figure: automatic phenotype structure mining over samples 1-10, separating informative genes (e.g. gene1-gene4) from non-informative genes (e.g. gene5-gene7)]
• An informative gene is a gene which manifests the samples' phenotype distinction.
• Phenotype structure: sample partition + informative genes.
Automatic Phenotype Structure Mining
[Diagram: gene expression matrix → mining → result showing the phenotype distinction of the samples and the informative genes]
• Given an n×m data matrix M and the number K of sample phenotypes, the goal is to find K mutually exclusive groups of the samples matching their empirical phenotypes, and to find the set of informative genes which manifests this phenotype distinction.
Requirements
• The expression levels of each informative gene should be similar over the samples within each phenotype.
• The expression levels of each informative gene should display a clear dissimilarity between each pair of phenotypes.
Challenges (1)
• The number of genes is very large while the number of samples is very limited, so no distinct class structure of the samples can be properly detected by existing techniques.
Challenges (2)
• The limited informative genes are buried in a large amount of noise.
[Figure: a few informative genes (e.g. gene5, gene9, gene12) hidden among genes 1-15]
Challenges (3)
• The values within the data matrices are all real numbers.
• None of the informative genes follows an ideal "high-low" pattern.
[Example genes: LTC4 synthase (U50136), Fumarylacetoacetate (M55150), C-myb (U22376), PROTEASOME IOTA (X59417)]
Related Work
• New tools using traditional methods:
  • SOM
  • K-means
  • Hierarchical clustering
  • Graph-based clustering
  • PCA
• The similarity measures used in these methods are based on the full gene space.
• PCs do not necessarily have strong correlation with informative genes.
Related Work (Cont'd)
• Clustering with feature selection (CLIFF, two-way ordering, SamCluster):
  • Filtering the invariant genes: rank variance, PCA, CV
  • Partitioning the samples: Ncut, Min-Max Cut, hierarchical clustering
  • Pruning genes based on the partition: Markov blanket filter, t-test
Related Work (Cont’d) • Subspace clustering : • Bi-clustering • δ-clustering
Related Work (Cont'd)
• Subspace clustering only measures trend similarity, but in our model we require each gene to show consistent signals on the samples of the same phenotype.
Related Work (Cont'd)
• Subspace clustering algorithms only detect locally correlated features and objects, without considering the dissimilarity between different clusters. We want genes that can differentiate all phenotypes.
Our Contributions
• We transformed the phenotype structure mining problem into an optimization problem.
• A series of statistics-based metrics are defined as objective functions.
• A heuristic search method and a mutual reinforcing adjustment approach are proposed to find phenotype structures.
Model - Measurements
[Figure: sample groups S1 and S2 over an informative gene set G', illustrating intra-consistency within each group, inter-divergence between the groups, and the resulting phenotype quality]
Intra-consistency
[Figure: examples of gene expression patterns that are consistent vs. not consistent across a group of samples]
Intra-pattern-consistency (Cont'd)
In a subset of genes (candidate informative genes), does every gene have good consistency on a set of samples?
• Variance of a single gene over the samples within one phenotype.
• Intra-pattern-consistency: the average row variance, i.e. the average of the variances of the genes in the subset; the smaller this value, the better the intra-phenotype consistency.
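A minimal sketch of this measure, read as the average row variance of the candidate informative genes within each phenotype group; the variable names and the genes-by-samples layout are assumptions.

```python
# Intra-pattern-consistency: average variance of each candidate informative
# gene over the samples of each phenotype group (smaller = more consistent).
import numpy as np

def intra_consistency(X, groups, genes):
    """X: genes-by-samples matrix; groups: list of sample-index arrays,
    one per phenotype; genes: indices of candidate informative genes."""
    variances = [np.var(X[np.ix_(genes, g)], axis=1) for g in groups]
    return float(np.mean(np.concatenate(variances)))
```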
Inter-pattern-divergence
How well can a subset of genes (candidate informative genes) discriminate two phenotypes of samples?
• Both the intra-pattern-consistency and the inter-pattern-divergence of the same gene are reflected.
• Average block distance: the sum of the average differences between the phenotypes; the larger the inter-phenotype divergence, the better.
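A companion sketch of the inter-pattern-divergence measure, read as the sum over phenotype pairs of the average absolute difference of gene means. This is one plausible reading of the average block distance; the paper's exact normalization may differ.

```python
# Inter-pattern-divergence: for every pair of phenotype groups, the average
# absolute difference between each gene's mean expression in the two groups,
# summed over all pairs (larger = better separation).
import numpy as np
from itertools import combinations

def inter_divergence(X, groups, genes):
    divergence = 0.0
    for g1, g2 in combinations(groups, 2):
        mean1 = X[np.ix_(genes, g1)].mean(axis=1)
        mean2 = X[np.ix_(genes, g2)].mean(axis=1)
        divergence += float(np.abs(mean1 - mean2).mean())
    return divergence
```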
Pattern Quality • The purpose of pattern discovery is to identify the empirical patterns where the intra-pattern-consistency inside each phenotype is high and the inter-pattern-divergence between each pair of phenotypes is large. The higher the value, the better the quality.
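The two measures can be combined into a single phenotype-quality score. The paper's exact combination is not reproduced on this slide, so the sketch below simply takes the ratio of divergence to consistency, which is large exactly when consistency is small (good) and divergence is large (good); it reuses the two helpers sketched above.

```python
# Hedged sketch of a phenotype-quality score: ratio of inter-divergence to
# intra-consistency (assumed form, not the paper's exact formula).
def phenotype_quality(X, groups, genes, eps=1e-9):
    return inter_divergence(X, groups, genes) / (intra_consistency(X, groups, genes) + eps)
```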
Measurements
• Intra-consistency
• Inter-divergence
• Phenotype quality
Phenotype Quality
[Figure: the candidate partition with the highest phenotype quality]
Model - Formalized Problem
Input:
• m samples and n genes
• the corresponding gene expression matrix M
• the number of phenotypes K
Output:
• a K-partition of the samples (phenotypes) and a subset of genes (informative space) such that the phenotype quality is maximized
Strategy
Maintain a candidate phenotype structure and iteratively adjust it toward the optimal solution.
Basic elements:
• A candidate structure:
  • a partition of the samples {S1, S2, …, Sk}
  • a subset of genes G' ⊆ G
  • the corresponding phenotype quality
• An adjustment:
  • for a gene not in G', insert it into G'
  • for a gene in G', remove it from G'
  • for a sample in a group, move it to another group
• The quality gain measures the change of phenotype quality before and after the adjustment.
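A sketch of the three adjustment operations and of the quality gain, reusing the phenotype_quality helper sketched earlier; the list-based encoding of groups and gene sets is illustrative.

```python
# Adjustments on a candidate structure (a sample partition plus a gene
# subset), and the quality gain of an adjustment.

def insert_gene(genes, g):
    return sorted(set(genes) | {g})

def remove_gene(genes, g):
    return sorted(set(genes) - {g})

def move_sample(groups, sample, src, dst):
    new_groups = [list(grp) for grp in groups]
    new_groups[src].remove(sample)
    new_groups[dst].append(sample)
    return new_groups

def quality_gain(X, groups, genes, new_groups, new_genes):
    # Change of phenotype quality before vs. after the adjustment.
    return phenotype_quality(X, new_groups, new_genes) - phenotype_quality(X, groups, genes)
```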
Heuristic Searching
[Flowchart: candidate structure generation → intermediate candidate structure → pick a gene/sample → compute the adjustment's quality gain Ω → if Ω > 0, apply the adjustment → continue iterative adjusting]
Heuristic Searching
• Start with a random K-partition of the samples and a subset of genes as the candidate informative space.
• Iteratively adjust the partition and the gene set toward a better solution (genes and samples are visited in random order):
  • for each gene, try possible insert/remove;
  • for each sample, try the best movement (insert a gene, remove a gene, move a sample).
Heuristic Search
• For each possible adjustment, compute the quality gain Ω:
  • for each gene, try possible insert/remove;
  • for each sample, try the best movement.
• If Ω > 0, conduct the adjustment.
• If Ω < 0, conduct the adjustment with a probability that depends on Ω and the annealing temperature T(i).
• T(i) is a decreasing simulated-annealing function and i is the iteration number; T(0) = 1 and T(i) = 1/(i+1) in our implementation.
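A sketch of the acceptance rule with the annealing schedule stated above (T(0) = 1, T(i) = 1/(i+1)). The probability used for a negative quality gain is written in the standard annealing form exp(gain / T(i)); the slide does not give the exact expression, so treat that choice as an assumption.

```python
# Simulated-annealing acceptance of an adjustment with quality gain `gain`.
import math
import random

def maybe_accept(gain, iteration):
    if gain > 0:
        return True
    temperature = 1.0 / (iteration + 1)      # T(i) = 1/(i+1), so T(0) = 1
    return random.random() < math.exp(gain / temperature)
```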
Mutual Reinforcing Adjustment - Motivation
• Drawbacks of the heuristic searching method: blind initialization, equal treatment of samples and genes, noisy samples.
• The phenotype quality of a subset of the informative genes together with a partial phenotype partition should also be high.
• Mining phenotypes and informative genes directly from high-dimensional noisy data is difficult, so we start from small groups whose data distribution and patterns are much easier to detect.
• The mining of phenotypes and of informative genes should mutually reinforce each other.
Mutual Reinforcing Adjustment - Major Steps
• Partition the matrix: divide the original matrix into a series of exclusive sub-matrices by partitioning both the samples and the genes.
• Reference partition detection: post a partial or approximate phenotype structure called a reference partition of the samples.
  • compute the reference degree for each sample group;
  • select K groups of samples;
  • do partition adjustment.
• Gene adjustment: adjust the candidate informative genes.
  • compute W for the reference partition on G;
  • perform the possible adjustments of each gene.
• Refinement phase.
Method Detail - Iteration Phase
[Diagram: all samples and informative genes G' → partitioning the matrix → reference partition detection → reference partition → gene adjustment → informative genes G'' → to the next iteration]
Partitioning the Matrix
• Partition the samples and genes into multiple groups using CAST; a threshold t decides the size of each group.
• Grouping is based on the Pearson correlation coefficient.
• Outliers are filtered out from any group.
• Samples or genes in the same group share similar patterns.
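A much-simplified, hedged sketch of this grouping step: greedily grow a group around a seed and admit objects whose average Pearson correlation with the group exceeds the threshold t. Real CAST also removes low-affinity members and handles outliers; this only illustrates the idea.

```python
# Greedy correlation-threshold grouping (simplified stand-in for CAST).
import numpy as np

def greedy_groups(X, t=0.8):
    """X: objects-by-features matrix (rows are genes or samples)."""
    corr = np.corrcoef(X)                  # pairwise Pearson correlations
    unassigned = list(range(X.shape[0]))
    groups = []
    while unassigned:
        seed = unassigned.pop(0)
        group = [seed]
        for obj in list(unassigned):
            if corr[obj, group].mean() >= t:
                group.append(obj)
                unassigned.remove(obj)
        groups.append(group)
    return groups
```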
Reference Partition Detection
• Select groups of samples as potential phenotypes:
  • pick the first group with the highest reference degree;
  • select the other groups by considering the inter-phenotype divergence w.r.t. the already selected groups.
Check the Missing Samples
• Probabilistically insert the remaining samples (those not in any selected group) into the most likely matching group.
• In later iterations, use the candidate gene sets to improve the reference partition.
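A sketch of placing a left-over sample into its most likely matching group: assign it to the group whose mean profile over the current candidate informative genes it correlates with best. The correlation-based matching rule is an assumption, not the paper's exact criterion.

```python
# Assign a sample that is not yet covered to its best-matching group.
import numpy as np

def assign_missing_sample(X, sample, groups, genes):
    """X: genes-by-samples matrix; sample: column index of the left-over
    sample; groups: list of sample-index lists; genes: candidate genes."""
    profile = X[genes, sample]
    centroids = [X[np.ix_(genes, g)].mean(axis=1) for g in groups]
    scores = [np.corrcoef(profile, c)[0, 1] for c in centroids]
    return int(np.argmax(scores))          # index of the best-matching group
```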
Gene Adjustment
• Gene adjustment: test the possible adjustments (insert a gene, remove a gene) that lead to an improvement.
Method - Refinement Phase
• The partition corresponding to the best state may not cover all the samples.
• Add every sample not covered by the reference partition into its matching group, yielding the phenotypes of the samples.
• Then a gene adjustment phase is conducted: we execute all adjustments with a positive quality gain, yielding the informative space.
• Time complexity: O(n·m²·I).
Mining Multiple Phenotype Structures
[Figure: the same ten samples admit an empirical phenotype structure and a hidden phenotype structure, each manifested by a different subset of genes]
Output: p phenotype structures where the t-th structure is a Kt-partition of the samples (phenotypes) and a subset of genes (informative space) which manifests the sample partition. The overall phenotype quality is maximized.
Extended Algorithm Strategy
Maintain p candidate phenotype structures and iteratively adjust them toward the optimal solution.
Basic elements of each candidate structure:
• A candidate structure:
  • a Kt-partition of the samples
  • a subset of genes Gt ⊆ G
  • the corresponding phenotype quality
• An adjustment:
  • for a gene gi ∉ Gt, insert it into Gt
  • for a gene gi ∈ Gt, move it from Gt to Gt' (t ≠ t') or remove it from all structures
  • for a sample si in a group S', move it to another group
• The quality gain measures the change of pattern quality before and after the adjustment.
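A tiny sketch of the structure-level gene move introduced here: a gene can be moved from the gene set of one candidate structure to the gene set of another. The (groups, genes) tuple encoding of a candidate structure is illustrative.

```python
# Move a gene between the gene sets of two candidate phenotype structures.
def move_gene(structures, gene, src, dst):
    """structures: list of (groups, genes) pairs, one per candidate
    phenotype structure; move `gene` from structure src to structure dst."""
    new = [(grps, list(gs)) for grps, gs in structures]
    new[src][1].remove(gene)
    new[dst][1].append(gene)
    return new
```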
The Extended Algorithm (Cont'd)
[Diagram: gene adjustments (insert, move, remove) between candidate structure 1 and candidate structure 2, and sample movements within each structure]
Mining Multiple Phenotype Structures (Cont’d) • Partially informative genes
Formalized Problem
Input:
• m samples and n genes
• the corresponding gene expression matrix M
• the number of phenotype structures p
• the set of numbers {K1, K2, …, Kp}
Output:
• p phenotype structures where the t-th structure is a Kt-partition of the samples (phenotypes) and a subset of genes (informative space) which manifests the sample partition; the overall phenotype quality is maximized
The Algorithm
• Candidate structure generation:
  • cluster the genes into p' groups (p' > p) using CAST;
  • generate sample partitions one by one on the clusters of genes, selecting the genes with the best quality.
• Iterative adjustment:
  • for each gene, try possible insert/move/remove;
  • for each sample, examine all possible adjustments and select the best movement.
The Algorithm (Cont'd)
• Gene: p possible adjustments (insert, move, remove)
• Sample: Kt − 1 possible adjustments for each partition
The Algorithm (Cont'd)
• Data standardization: the original gene intensity values are transformed into relative values.
• Random order of genes and samples.
• Conduct a negative action with a probability (simulated annealing technique).
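A hedged sketch of the standardization step. The slide does not reproduce the exact formula for the relative values, so a per-gene z-score (subtract the gene's mean, divide by its standard deviation) is shown as one common choice.

```python
# Per-gene standardization of a genes-by-samples intensity matrix
# (assumed z-score form, not necessarily the paper's exact transform).
import numpy as np

def standardize_genes(X):
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    return (X - mean) / np.where(std == 0, 1.0, std)
```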
Experiments
• Data sets:
  • Multiple sclerosis data
    • MS-IFN: 4132 × 28 (14 MS vs. 14 IFN)
    • MS-CON: 4132 × 30 (15 MS vs. 15 Control)
  • Leukemia data
    • 7129 × 38 (27 ALL vs. 11 AML)
    • 7129 × 34 (20 ALL vs. 14 AML)
  • Colon cancer data
    • 2000 × 62 (22 normal vs. 40 tumor colon tissues)
  • Hereditary breast cancer data
    • 3226 × 22 (7 BRCA1, 8 BRCA2, 7 sporadic)
Rand Index
• Rand index: a measurement of the agreement between the ground truth (P) and the clustering result (Q):
  • "a": the number of pairs of objects that are in the same class in P and in the same class in Q;
  • "b": the number of pairs of objects that are in the same class in P but not in the same class in Q;
  • "c": the number of pairs of objects that are in the same class in Q but not in the same class in P;
  • "d": the number of pairs of objects that are in different classes in P and in different classes in Q.
• Rand index = (a + d) / (a + b + c + d)
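A direct implementation of the Rand index as defined above, comparing a ground-truth labelling P with a clustering result Q over the same samples.

```python
# Rand index over two label sequences of equal length.
from itertools import combinations

def rand_index(P, Q):
    a = b = c = d = 0
    for i, j in combinations(range(len(P)), 2):
        same_p, same_q = P[i] == P[j], Q[i] == Q[j]
        if same_p and same_q:
            a += 1
        elif same_p and not same_q:
            b += 1
        elif same_q and not same_p:
            c += 1
        else:
            d += 1
    return (a + d) / (a + b + c + d)

# Example: ground truth vs. a 2-partition produced by the algorithm.
print(rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))
```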
Experiments
[Table: mean and standard deviation of the number of iterations and of the response time (in seconds) with respect to the matrix size]
Phenotype Structure Detection (Cont'd)
• The mutual reinforcing approach applied to the MS-IFN group:
  • (A) shows the distribution of the original 28 samples; each point represents a sample with 4132 genes mapped to two-dimensional space.
  • (B) shows the distribution in the middle of the adjustment.
  • (C) shows the distribution of the same 28 samples after the iterations; 76 genes were selected as the informative space.