Workshop report: Biclustering Methods for Microarray Data, Hasselt University, Belgium. Guy Harari. FABIA: factor analysis for bicluster acquisition. Sepp Hochreiter et al., University of Linz, Austria.
Workshop report: Biclustering Methods for Microarray Data, Hasselt University, Belgium. Guy Harari
FABIA: factor analysis for bicluster acquisition. Sepp Hochreiter et al., University of Linz, Austria
FABIA – Motivation • Plaid models: for bicluster i: • They use a least-squares fit for model selection • Thus they assume Gaussian effects • However, microarray datasets are not Gaussian: they have heavy tails
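FABIA – model • Biclusters have multiplicatively coherent values: each bicluster is an outer product $\lambda z^T$ • $\lambda$ – prototype vector (over genes) • $z$ – factor vector (over samples) • In the example above: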
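FABIA – model • For p biclusters and additive Gaussian noise: $X = \sum_{i=1}^{p} \lambda_i z_i^T + \Upsilon = \Lambda Z + \Upsilon$ • The j-th sample (column in X) is: $x_j = \sum_{i=1}^{p} \lambda_i z_{ij} + \epsilon_j = \Lambda \tilde{z}_j + \epsilon_j$ • where $\tilde{z}_j$ is the j-th column of Z. • Λ and Z are sparse.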
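A minimal sketch of the multiplicative model in plain Python, assuming toy values for one prototype vector and one factor vector (the names `lam`, `z`, `outer` and `fabia_signal` are illustrative and do not come from the talk). It shows that a noise-free bicluster is an outer product, so sparse vectors yield a matrix that is zero outside the bicluster and whose member rows are multiples of each other.

```python
def outer(lam, z):
    """Outer product lam * z^T as a list of rows (genes x samples)."""
    return [[l * v for v in z] for l in lam]


def add_matrices(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fabia_signal(lams, zs):
    """Noise-free part of the model: the sum over biclusters of lam_i z_i^T."""
    total = outer(lams[0], zs[0])
    for lam, z in zip(lams[1:], zs[1:]):
        total = add_matrices(total, outer(lam, z))
    return total


# Toy sparse vectors (hypothetical values): 4 genes, 5 samples.
lam = [0.0, 2.0, -1.0, 0.0]     # genes 1 and 2 belong to the bicluster
z = [1.0, 0.0, 3.0, 0.0, 0.0]   # samples 0 and 2 belong to the bicluster
signal = fabia_signal([lam], [z])
```

Here row 2 of `signal` is -0.5 times row 1, which is the multiplicative coherence the slide refers to; adding Gaussian noise to every entry would give the full model.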
Generative Model for Factor Analysis • The data are assumed to be produced by: • Picking values independently for some Gaussian hidden factors. • Linearly combining the factors using a factor loading matrix. • Adding Gaussian noise to each input.
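Generative Model for Factor Analysis • Assume factors and noise are independent. • Assume also $z \sim \mathcal{N}(0, I)$ and $\epsilon \sim \mathcal{N}(0, \Psi)$. • Select the number of factors by e.g. the Kaiser criterion (keep factors whose eigenvalue exceeds 1). • Extract factors using e.g. maximum likelihood.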
FABIA – model • Fix the value for j. • Factors are the $z_{ij}$'s, $i = 1, \dots, p$. • Biclusters shouldn’t be correlated. • $\lambda_{ki}$ are the loading matrix’s entries. • $\Psi$ is diagonal – independent Gaussian noise.
Sparseness • We want sparse solutions for $\Lambda$ and $Z$ • So use a Laplace distribution for the factors $z$: • For the loadings $\lambda$ use one of: • FABIA: • FABIAS: parameter
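A short numerical check of why a Laplace prior fits both the sparseness goal and the heavy-tail motivation. This sketch assumes a unit-variance Laplace density, $\frac{1}{\sqrt{2}} e^{-\sqrt{2}|z|}$; the exact parametrization used in the talk was not recoverable, and the function names are illustrative.

```python
import math


def gaussian_pdf(x):
    """Standard normal density (mean 0, variance 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def laplace_pdf(x):
    """Laplace density scaled to variance 1 (scale b = 1/sqrt(2))."""
    return math.exp(-math.sqrt(2.0) * abs(x)) / math.sqrt(2.0)


# Compare the two densities at the centre and in the tail.
for x in (0.0, 1.0, 4.0):
    print(f"x={x}: gaussian={gaussian_pdf(x):.6f} laplace={laplace_pdf(x):.6f}")
```

At equal variance the Laplace density puts more mass both near zero (favouring sparse values) and far out in the tails than the Gaussian does.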
Model Selection • Center the data to zero median. • Normalization – divide values by the row’s standard deviation. • Use EM where the parameters are $\Lambda$ and $\Psi$. • Rank biclusters according to mutual information: • Determine the members of each bicluster using two thresholds, one for the values of $\lambda$ and one for the values of $z$.
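The two preprocessing steps can be sketched as follows. This is an illustration under stated assumptions: centering is done per gene (row), and the population standard deviation is used; the talk does not specify either choice, and `preprocess` is a hypothetical helper name.

```python
import statistics


def preprocess(matrix):
    """Center each gene (row) to zero median, then divide by the row's std."""
    result = []
    for row in matrix:
        med = statistics.median(row)
        centered = [v - med for v in row]
        std = statistics.pstdev(centered)   # assumption: population std
        if std == 0:
            result.append(centered)         # constant row: leave it at zero
        else:
            result.append([v / std for v in centered])
    return result
```

The median is used for centering because it is less sensitive than the mean to the heavy-tailed values mentioned in the motivation.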
Experiments – Simulated Datasets • n = 1000 genes, l = 100 samples • p = 10 multiplicative biclusters • Generate $\lambda_i$: • Choose the number of genes in bicluster i uniformly at random from {10,…,210}. • Choose that many genes from {1,…,1000}. • Set components not in bicluster i to . • Set components in bicluster i to .
Experiments – Simulated Datasets • Generate $z_i$: • Choose the number of samples in bicluster i uniformly at random from {5,…,25}. • Choose that many samples from {1,…,100}. • Set components not in bicluster i to . • Set components in bicluster i to . • Add random noise to all entries according to . • Compute the dataset with $X = \sum_{i=1}^{p} \lambda_i z_i^T + \Upsilon$, where $\Upsilon$ is the noise.
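The generation procedure can be sketched as below. The sizes (1000 genes, 100 samples, 10 biclusters, and the ranges {10,…,210} and {5,…,25}) come from the slides; the Gaussian distributions for member, non-member and noise values are placeholders chosen for this sketch, since the original values were lost, and all function names are hypothetical.

```python
import random


def sparse_vector(length, min_size, max_size, inside, outside, rng):
    """Vector whose randomly chosen members come from inside(), the rest from outside()."""
    size = rng.randint(min_size, max_size)           # uniform on {min_size..max_size}
    members = set(rng.sample(range(length), size))   # chosen without replacement
    vec = [inside() if k in members else outside() for k in range(length)]
    return vec, members


def simulate(n=1000, l=100, p=10, seed=0):
    """Return (data matrix, list of (gene set, sample set)) for p biclusters."""
    rng = random.Random(seed)

    # ASSUMPTION: placeholder distributions, not the values from the talk.
    def background():
        return rng.gauss(0.0, 0.1)

    def gene_signal():
        return rng.choice([-1, 1]) * rng.gauss(3.0, 1.0)

    def sample_signal():
        return rng.gauss(2.0, 1.0)

    # Start from pure noise, then add one outer product per bicluster.
    x = [[rng.gauss(0.0, 1.0) for _ in range(l)] for _ in range(n)]
    truth = []
    for _ in range(p):
        lam, genes = sparse_vector(n, 10, 210, gene_signal, background, rng)
        z, samples = sparse_vector(l, 5, 25, sample_signal, background, rng)
        for k in range(n):
            for j in range(l):
                x[k][j] += lam[k] * z[j]
        truth.append((genes, samples))
    return x, truth
```

Keeping the planted gene and sample sets in `truth` is what makes it possible to score a method's output against the known biclusters afterwards.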
Evaluation – consensus score • For two sets of biclusters: • Compute the similarity between each pair of biclusters, one from each set. • Find the assignment with maximum total similarity using the Munkres (Hungarian) algorithm. • Penalize different numbers of biclusters: divide the sum of similarities of the assigned biclusters by the number of biclusters in the larger set. • Use the Jaccard index for computing similarity.
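The steps above can be sketched in a few lines. Two simplifications are assumed here: the Jaccard index is taken over matrix cells (gene, sample pairs), which the slide does not spell out, and the Munkres algorithm is replaced by a brute-force search over all assignments, which gives the same optimum but is only practical for a handful of biclusters. All names are illustrative.

```python
from itertools import permutations


def cells(genes, samples):
    """Expand a bicluster into the set of matrix cells it covers."""
    return {(g, s) for g in genes for s in samples}


def jaccard(b1, b2):
    """Jaccard index of two biclusters given as sets of cells."""
    union = len(b1 | b2)
    return len(b1 & b2) / union if union else 0.0


def consensus_score(set_a, set_b):
    """Best one-to-one matching similarity, divided by the size of the larger set."""
    small, large = sorted((set_a, set_b), key=len)
    if not large:
        return 1.0  # assumption: two empty results are treated as identical
    sim = [[jaccard(a, b) for b in large] for a in small]
    # Brute-force stand-in for Munkres: try every one-to-one assignment.
    best = max(
        sum(sim[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(len(large)), len(small))
    )
    return best / len(large)
```

Dividing by the size of the larger set means that a method returning extra or missing biclusters cannot reach a score of 1 even if every matched pair agrees perfectly.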
Simulated Datasets - Results • Average score and STD for each method:
Simulated Datasets - Results • Avg. and STD of information content and similarity:
Simulated additive datasets • Generate biclusters in the same way. • Use additive model for each bicluster: • Choose from and from . • Choose from one of three models: • Low signal – • Moderate signal – • High signal –
Additive Datasets - results • Low signal:
Additive Datasets - results • Moderate signal:
Additive Datasets - results • High signal:
Gene Expression Datasets • Breast cancer (van ’t Veer et al., 2002) – 3 classes (clusters) were found in Hoshida et al. (2007). • Multiple tissue types dataset (Su et al., 2002). • Diffuse large-B-cell lymphoma (DLBCL) dataset (Rosenwald et al., 2002) – 3 classes (clusters) were found in Hoshida et al. (2007).
Biological Interpretation • Breast cancer: • Bicluster 1 is related to the cell cycle (GO and KEGG) and to the proteins CDC2 (division control) and KIF (mitosis). • Bicluster 2 is related to immune response (GO) and cytokine-cytokine receptor interaction (KEGG), and to cytokine-related proteins such as CCR5, CCL4 and CSF2RB. • Multiple tissue – no biological interpretation.
Biological Interpretation • DLBCL: • Bicluster 1 is related to the ribosome (GO, KEGG) and to B-cell receptor signaling (KEGG). • Bicluster 2 is related to the immune system (GO, KEGG).
Drug Design • Goal: find compounds with similar effects on gene expression. • Use Affymetrix GeneChip HT HG-U133+ PM array plates with 12×8 samples per plate. • The selected compounds are active on a cancer cell line. • Each compound was tested in a group of three replicates.
Drug Design • 3 biclusters were found to contain 2-5 replicate sets. • One of them extracted genes related to mitosis (GO). • The compounds of this bicluster are now under investigation by Johnson & Johnson Pharmaceutical R&D.
Biclustering Gene Expression Time Series. Sara C. Madeira, Technical University of Lisbon
Introduction • Input: columns correspond to samples taken at consecutive time points. • Output: biclusters with contiguous columns. • Motivation: a biological process starts and ends within a contiguous time interval, leading to increased/decreased activity of some genes. • Goal: find all maximal contiguous column coherent (CCC) biclusters, sorted by a statistical score.
Discretization • Let $A'$ be the input expression matrix. • Define • Standardize $A'$ to mean = 0 and STD = 1 by gene.
Discretization • Define • where D denotes down-regulation, U up-regulation and N no change, • and t = 1 is the standard deviation of a gene.
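A minimal sketch of a three-symbol discretization consistent with the description above. The exact rule on the slide was lost, so the thresholding used here (D at or below -t, U at or above t, N in between, applied to values standardized per gene) is an assumption, and `discretize_row` is a hypothetical name.

```python
import statistics


def discretize_row(values, t=1.0):
    """Standardize one gene to mean 0 and std 1, then map each value to D, N or U."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    symbols = []
    for v in values:
        score = (v - mean) / std if std else 0.0   # constant gene: no change
        if score <= -t:
            symbols.append("D")   # down-regulation
        elif score >= t:
            symbols.append("U")   # up-regulation
        else:
            symbols.append("N")   # no change
    return "".join(symbols)
```

For example, `discretize_row([0, 0, 0, 0, 10])` yields `"NNNNU"`: the row has mean 2 and standard deviation 4, so only the last value lies at least one standard deviation above the mean.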
CCC-Bicluster • Definition: A CCC-Bicluster is a subset of rows $I$ and a contiguous subset of columns $J$ such that $A_{ij} = A_{lj}$ for all rows $i, l \in I$ and columns $j \in J$. • Note that each CCC-Bicluster defines a string S which is common to every row in I.
Suffix Trees • Each internal node, other than the root, has at least two children. • Each edge is labeled with a nonempty substring of S (here “BANANA”). • No two edges out of a node have edge labels starting with the same symbol. • The concatenation of the edge labels on the path from the root to a leaf is a suffix of S.
Example • Internal node = row-maximal, right-maximal CCC-Bicluster
Main Result • Every (inclusion-)maximal CCC-Bicluster with at least two rows corresponds to an internal node in the suffix tree such that: • It does not have incoming suffix links, or • It has incoming suffix links only from nodes having fewer leaves in their subtrees. • Each such internal node defines a maximal CCC-Bicluster with at least two rows. • This implies an O(nm) time algorithm for finding all maximal CCC-Biclusters.
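To make the notion of a maximal CCC-Bicluster concrete, here is a brute-force enumerator. It is not the suffix-tree algorithm from the talk and does not run in O(nm) time; it simply groups rows by their substring for every contiguous column range and keeps the groups that cannot be extended to the left or right. The function name and the tuple layout of the result are choices made for this sketch.

```python
from collections import defaultdict


def maximal_ccc_biclusters(rows):
    """Enumerate maximal CCC-Biclusters with at least two rows by brute force.

    rows: list of equal-length strings over the alphabet {D, N, U}.
    Returns a set of (row_indices, first_col, last_col, pattern) tuples.
    """
    m = len(rows[0]) if rows else 0
    found = set()
    for first in range(m):
        for last in range(first, m):
            # Grouping by substring makes every group row-maximal for this range.
            groups = defaultdict(list)
            for i, row in enumerate(rows):
                groups[row[first:last + 1]].append(i)
            for pattern, members in groups.items():
                if len(members) < 2:
                    continue
                # Extendable if all members agree on the neighbouring column.
                left = first > 0 and len({rows[i][first - 1] for i in members}) == 1
                right = last < m - 1 and len({rows[i][last + 1] for i in members}) == 1
                if not left and not right:
                    found.add((tuple(members), first, last, pattern))
    return found
```

On the rows `"UDN"`, `"UDU"`, `"NDN"` this returns three biclusters: rows 0 and 1 with pattern `UD` on columns 0-1, all three rows with pattern `D` on column 1, and rows 0 and 2 with pattern `DN` on columns 1-2. A baseline like this is mainly useful for checking a suffix-tree implementation on small inputs.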
Experiments – Simulated Datasets • Generate a random 1000 x 50 dataset. • Apply the algorithm to it. • Plant 10 CCC-Biclusters in the same dataset. • Apply the algorithm to the dataset again. • Define a similarity measure, the Jaccard index (over genes and conditions), and a statistical test. • Filter out similar biclusters and those that didn’t pass the statistical test.
The Statistical Test • Null hypothesis – expression values of a subset of genes evolve independently. • Expression patterns are modeled by a first-order Markov Chain, e.g. for the pattern : where
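A sketch of how the probability of an expression pattern can be computed under a first-order Markov chain. The chain here is homogeneous and its initial and transition frequencies are estimated from all discretized rows; the talk's exact estimation scheme was not recoverable, so both choices, and the function names, are assumptions.

```python
from collections import Counter


def fit_markov(rows):
    """Estimate initial and transition frequencies from discretized rows."""
    first = Counter(row[0] for row in rows)
    pairs = Counter((a, b) for row in rows for a, b in zip(row, row[1:]))
    outgoing = Counter(a for row in rows for a in row[:-1])
    initial = {s: c / len(rows) for s, c in first.items()}
    transition = {(a, b): c / outgoing[a] for (a, b), c in pairs.items()}
    return initial, transition


def pattern_probability(pattern, initial, transition):
    """P(s1) times the product of P(s_{j+1} | s_j) over consecutive symbols."""
    prob = initial.get(pattern[0], 0.0)
    for a, b in zip(pattern, pattern[1:]):
        prob *= transition.get((a, b), 0.0)
    return prob
```

With the rows `"UD"`, `"UD"`, `"UN"`, `"DN"`, the pattern `UD` gets probability 3/4 × 2/3 = 0.5. A pattern that is unlikely under this null model but shared by many genes is what makes a bicluster significant.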
The Statistical Test • n – the number of genes in the dataset. • I – the subset of genes in a CCC-Bicluster. • The significance of a CCC-Bicluster B with an expression pattern is:
Simulated Datasets - results • 165 CCC-Biclusters passed the test at the 1 percent level, after Bonferroni correction.
Experiments – Real Datasets • Use the yeast heat shock response dataset from Gasch et al. • 25 CCC-Biclusters were found to be highly significant at the 1% level after Bonferroni correction. • 9 of them were removed after the similarity check. • The results were tested for GO enrichment (hypergeometric test).
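The hypergeometric enrichment test and the Bonferroni correction mentioned above can be sketched with the standard library alone. The parameter names are illustrative, and the counts in the usage note are made-up numbers rather than results from the talk.

```python
from math import comb


def hypergeom_pvalue(population, annotated, drawn, hits):
    """P(X >= hits) when `drawn` genes are sampled without replacement from
    `population` genes, `annotated` of which carry the GO term."""
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)
```

As a usage example with hypothetical counts, `bonferroni(hypergeom_pvalue(6000, 100, 50, 10), 500)` would give the corrected p-value for a 50-gene bicluster containing 10 of the 100 genes annotated with a term, when 500 terms are tested.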
Improvements • Allow errors: replacement of D/U with N and vice versa. • Discover biclusters with opposite patterns (anti-correlated). • Allow scaled and time-lagged (shifted) patterns. • TriClustering – genes x time points x exemplars (different patients/stress conditions).
Other talks • “biclust” R package – Ludwig Maximilian University of Munich (Institute of Statistics) and Hasselt University. • ISA and related tools (R packages) – Gabor Csardi, University of Lausanne, Switzerland. • Clustering of dose-response microarray data – Hasselt University, Johnson & Johnson PR&D. • Model- and graph-based clustering of genomic data – Freiburg Institute for Advanced Studies, Germany.