Bioinformatics

Bioinformatics Microarray Analysis: Clustering 27/11/2006

Preprocessing
• Array by array approach: Background corr, Log transformation, Filtering, Normalization, Ratio, Test statistic (T-test)
• ANOVA based: Background corr, Log transformation, Filtering, Linearisation, Bootstrapping

Overview • MICROARRAY ANALYSIS • Gene expression • Omics era • Transcript profiling • Experiment design • Preprocessing • Further analysis • Identification of differentially expressed genes • Clustering

Overview further analysis
• Raw data → Preprocessing → Preprocessed data
• Preprocessed data → Test statistic → Differentially expressed genes
• Preprocessed data → Clustering → Clusters of coexpressed genes

Differentially expressed genes: 2 sample design (Type 1: comparison of 2 samples)
• Control sample vs. induced sample
• Statistical testing
• Retrieve statistically over- or under-expressed genes

Differentially expressed genes: Test statistic
Comparison of 2 experiments:
• Fold test
• T-test
• SAM
• …
A plethora of different methods is available. Which one performs best?
• Different underlying statistical assumptions
• Implications for the final result
• Difficult to define the best method
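As a rough illustration of the first two tests listed above, the sketch below computes a log2 fold change and a t statistic for one gene. It assumes the inputs are replicate log2 intensities, and it picks Welch's form of the t-test because the slides do not say which variant is meant. The function names and the numbers are hypothetical.

```python
import math
from statistics import mean, variance

def log2_fold_change(control, induced):
    """Mean log2 expression in the induced sample minus that in the control."""
    return mean(induced) - mean(control)

def welch_t(control, induced):
    """Welch's t statistic (two samples, variances not assumed equal)."""
    se = math.sqrt(variance(control) / len(control)
                   + variance(induced) / len(induced))
    return (mean(induced) - mean(control)) / se

# Hypothetical replicate measurements (log2 intensities) for one gene.
control = [7.9, 8.1, 8.0, 8.2]
induced = [9.9, 10.2, 10.0, 10.1]

lfc = log2_fold_change(control, induced)   # about 2, i.e. roughly 4-fold up
t_stat = welch_t(control, induced)
print(f"log2 fold change = {lfc:.2f}, t = {t_stat:.1f}")
```

A fold test only looks at `lfc`, whereas the t statistic also accounts for the spread of the replicates, which is one reason the methods can rank genes differently.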

Clustering
• Measure expression of all genes
  • During time (dynamic profile)
  • In different conditions
• Clustering: identify coexpressed genes
• Motif finding: identify mechanism of coregulation

Clustering: Multiple array design
• Study of the mitotic cell cycle of Saccharomyces cerevisiae with oligonucleotide arrays (Cho et al. 1999)
  • 15 time points (E=18)
  • time points 90 & 100 min deleted (Zhang et al. 1999, Tavazoie et al. 1999)
• Original dataset: 6178 genes
• Preprocessing:
  • select the 4634 most variable genes (the 25 % least variable are removed)
  • variance normalized
  • adaptive quality based clustering (32 clusters) (95%)

Preprocessing Log2 Ratio clustering

Clustering: principle
1) gene = vector with expression values, usually log2(R/G)
Gene 1: 2 4 6
. . .
Gene n: 6 6 4
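A minimal sketch of how one gene's expression vector could be built, assuming R and G are the background-corrected red and green channel intensities of a two-colour array. The function name and the intensities are made up for illustration.

```python
import math

def log_ratios(red, green):
    """Turn paired red/green intensities into log2(R/G) values."""
    return [math.log2(r / g) for r, g in zip(red, green)]

# Hypothetical intensities for one gene measured on three arrays.
red = [400.0, 800.0, 100.0]
green = [100.0, 200.0, 400.0]

gene_vector = log_ratios(red, green)
print(gene_vector)  # [2.0, 2.0, -2.0]
```

The log transformation makes a 4-fold induction (+2) and a 4-fold repression (-2) symmetric around zero.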

Clustering: principle
2) measure distances between expression vectors
3) group genes with minimal distance
• Different metric distances
• Different algorithms
(Figure: data from microarrays plotted on the axes Gene 1 and Gene 2)

Rescaling
• mean centering: for each gene the average expression value (row average) is subtracted from its expression values, so that the average expression value of the gene becomes 0
• mean centering and variance normalization: each gene is additionally divided by its standard deviation, so that its variance becomes 1
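The two rescaling options can be sketched as follows. This assumes that the second option scales each gene to unit variance, which means dividing the centred values by the standard deviation; the function names are illustrative only.

```python
from statistics import mean, pstdev

def mean_center(profile):
    """Subtract the row average so the profile has mean 0."""
    m = mean(profile)
    return [v - m for v in profile]

def standardize(profile):
    """Mean-center, then scale to unit variance (divide by the std. dev.)."""
    centered = mean_center(profile)
    sd = pstdev(profile)
    if sd == 0:
        return centered  # a completely flat profile cannot be scaled
    return [v / sd for v in centered]

# Same shape, very different expression levels (hypothetical values).
low = [1.0, 2.0, 3.0]
high = [10.0, 20.0, 30.0]

z_low = standardize(low)
z_high = standardize(high)
print(z_low, z_high)
```

After standardization the two profiles coincide, so a distance-based algorithm treats them as the same pattern.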

Rescaling
• Effects of rescaling
  • genes with a similar expression profile but strongly different expression levels end up closer together in the M-dimensional space and are more easily grouped together
  • the noise in the dataset is amplified

Influence of normalisation on the clustering (figure legend: cyan = clustering without rescaling, red = clustering with rescaling)
Rescaling
• most algorithms require normalisation. Without normalisation the algorithms tend to cluster the noisy genes, since these lie closer together than the few genes that alter their expression level

Rescaling (figure: noisy profiles and significant profiles, before and after normalization)
• Noisy profiles are rescaled as well and might deteriorate the quality of the cluster
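The effect can be illustrated with a small sketch: a flat, noisy profile reaches the same amplitude as a real signal once it is standardized, which is why a variance filter is often applied before rescaling. The helper names, the `keep_fraction` parameter and the toy profiles below are assumptions made for this example.

```python
from statistics import mean, pstdev, pvariance

def standardize(profile):
    """Mean-center and scale a profile to unit variance."""
    m, sd = mean(profile), pstdev(profile)
    return [(v - m) / sd for v in profile]

def filter_by_variance(profiles, keep_fraction):
    """Keep the given fraction of genes with the highest variance."""
    ranked = sorted(profiles, key=lambda g: pvariance(profiles[g]), reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return {g: profiles[g] for g in ranked[:n_keep]}

# Hypothetical log ratios: two real responses and two flat, noisy genes.
profiles = {
    "flat_noisy": [0.02, -0.01, 0.01, -0.02],
    "induced": [0.0, 1.0, 2.0, 1.0],
    "flat_noisy_2": [-0.01, 0.02, -0.02, 0.01],
    "repressed": [0.0, -1.0, -2.0, -1.0],
}

boosted = standardize(profiles["flat_noisy"])   # noise blown up to unit variance
kept = filter_by_variance(profiles, keep_fraction=0.5)
print(boosted)
print(sorted(kept))
```

Filtering first removes the flat genes, so only profiles with a genuine change in expression are rescaled and passed on to the clustering step.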

Clustering methods
• Hierarchical clustering algorithms: Agglomerative, Divisive
• Non-hierarchical clustering algorithms: K-means, A.Q.B.C., SOM
• Algorithms require
  • Specific preprocessing
  • Specific metric
  • Specific parameter settings
  • Specific properties
• Design algorithm that combines biologically relevant characteristics

Clustering Distance • Minkowski distance
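The formula itself did not survive in the transcript. For reference, the standard definition of the Minkowski distance between two expression vectors x and y in the M-dimensional space is given below; the symbols x, y and p are chosen here and are not taken from the slide.

```latex
d_p(x, y) = \left( \sum_{i=1}^{M} \lvert x_i - y_i \rvert^{p} \right)^{1/p}
```

Setting p = 1 gives the Manhattan distance and p = 2 the Euclidean distance.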

Clustering Distance • Similarity measures • Pearson correlation coefficient • Mutual information • Variance weighted distance measures
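A minimal sketch of the first similarity measure, the Pearson correlation coefficient, together with one common way of turning it into a distance (1 minus the correlation). The conversion is an assumption; the slides do not say how the similarity is used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pearson_distance(x, y):
    """0 for perfectly correlated profiles, 2 for anti-correlated ones."""
    return 1.0 - pearson(x, y)

r_same_shape = pearson([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
r_opposite = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(r_same_shape, r_opposite)
```

Because the coefficient ignores offset and scale, it compares the shape of two profiles rather than their absolute expression levels.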

Algorithms: hierarchical clustering
• Agglomerative method (phylogenetic classification)
• Calculate pairwise distances between genes (distance matrix)
  • Metrics: Pearson correlation, Mutual information, Euclidean distance
• Distance matrix is searched for the two most similar genes (clusters)
  • Rules: Single linkage, Average linkage, Complete linkage

Algorithms: hierarchical clustering • The two selected clusters (genes) are merged to produce a new object (e.g. average of two merged objects) • Distance is recalculated (between genes, between merged objects, between genes & merged objects) • Process is repeated until all genes are clustered
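The merge loop can be sketched in plain Python as below. This is a toy version, assuming Euclidean distance and the average linkage rule; it recomputes all distances at every step instead of updating a distance matrix, and the function names are illustrative.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(c1, c2, data):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(data, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(data))]   # start: one gene per cluster
    merges = []                                  # (members_a, members_b, distance)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], data)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges

# Hypothetical profiles: genes 0 and 1 rise over time, genes 2 and 3 fall.
profiles = [[0.0, 1.0, 2.0], [0.1, 1.1, 2.1], [2.0, 1.0, 0.0], [2.2, 0.8, 0.2]]
clusters, merges = agglomerate(profiles, n_clusters=2)
print(clusters)
```

The recorded merge distances are what a dendrogram draws as branch lengths, and stopping at `n_clusters` plays the role of the cut-off value.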

Algorithms: hierarchical clustering
• Properties
  • deterministic
  • user-defined parameters: cut-off value, metric definition, rule
• Advantages
  • visualisation possible: dendrogram
  • length of the branches is indicative of the distance between the clusters
• Disadvantages
  • the number of clusters is user-defined (it depends on the chosen cut-off value)

Algorithms: K-means

Algorithms: K-means Predefined number of clusters = 5 Initialisation : randomly choose cluster centers (red points)

Algorithms: K-means Attribute each point (gene) to cluster with closest center

Algorithms: K-means Recalculate cluster centers = mean expression profile of genes in cluster

Algorithms: K-means Repeat the whole process
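The four steps shown on the preceding slides can be put together in a short sketch. It assumes Euclidean distance and picks the initial centers by sampling data points at random, which is one common choice; the slides only say the centers are chosen randomly. The names and the toy data are illustrative.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(data, k, n_iter=100, seed=0):
    """Plain K-means: returns one cluster label per point and the centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(data, k)]   # random initialisation
    labels = None
    for _ in range(n_iter):
        # Attribute each point (gene) to the cluster with the closest center.
        new_labels = [min(range(k), key=lambda c: euclidean(p, centers[c]))
                      for p in data]
        if new_labels == labels:
            break                                      # assignments are stable
        labels = new_labels
        # Recalculate each center as the mean profile of its members.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Two well-separated hypothetical groups of points.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = kmeans(data, k=2)
print(labels)
```

Changing `seed` changes the initial centers, which is the source of the non-determinism; on less clean data different seeds can give different clusters.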

Algorithms: K-means
• Properties
  • User-defined parameters: number of clusters, number of iterations
  • Non-deterministic: dependent on the initialisation
• Advantages
  • Easy to understand
  • Fast
• Disadvantages
  • number of clusters has to be user-specified
  • outcome is sensitive to the parameters (elaborate parameter fine-tuning is essential)
  • all genes in the dataset will be clustered: the presence of noisy genes will disturb the average profile and the quality of the cluster of interest

Algorithms: K-means
Sensitivity of K-means towards parameter setting
K-means, nr. of clusters: 10; nr. of iterations: 100
• number of clusters = low: big clusters containing noise

Cluster algorithms: Analysis (K-means)
K-means, nr. of clusters: 60; nr. of iterations: 100
• number of clusters = high: smaller clusters → higher resolution

literature/knowledge, dataset
• small clusters
  • contain genes with highly similar profile (+)
  • some information given up in first step (-)
• big clusters
  • contain all real positives (+)
  • increasing number of false positives (-)
• validate “core” clusters, extend clusters
• Motif finding (DNA level)

Algorithms: Quality based clustering (figure: normalized expression data from microarrays plotted on the axes Gene 1 and Gene 2)

Algorithms: Quality based clustering
• http://www.esat.kuleuven.ac.be/~thijs/Work/Clustering.html
• Quality = cluster radius computed by fitting a model to the data by EM
• Initialisation:
  • cluster center = mean expression profile of entire dataset
  • cluster radius = radius of hypersphere enclosing entire dataset
  • qual = radius of hypersphere / 2
• Recalculate cluster center based on qual

Algorithms: Quality based clustering
• Recalculate cluster center
  • Find genes with distance < qual from the center
  • Recalculate center = mean expression profile of these genes
• Recalculate qual
  • For every gene calculate distance to new center & plot distribution
  • Randomize dataset, calculate distances to new center & plot distribution
  • Compare the distributions
  • Derive new qual
• Iterate until stop criterion (actual cluster radius < qual) is reached
• Genes in cluster are discarded from the dataset, repeat for next cluster
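The overall loop can be sketched as follows. This is a heavily simplified stand-in, not the adaptive algorithm itself: the statistical recalculation of qual (comparison with a randomized dataset, EM fit) is replaced by a fixed, user-supplied `qual`, and the search radius simply shrinks from the enclosing hypersphere down to `qual`. The names `find_cluster`, `quality_clustering`, `shrink` and `min_genes` are assumptions made for this example.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_profile(profiles):
    return [sum(col) / len(profiles) for col in zip(*profiles)]

def find_cluster(data, qual, shrink=0.9, max_iter=200):
    """Locate one cluster by recentring on the genes inside a shrinking radius."""
    center = mean_profile(data)                        # mean of entire dataset
    radius = max(euclidean(p, center) for p in data)   # enclosing hypersphere
    members = list(range(len(data)))
    for _ in range(max_iter):
        radius = max(qual, radius * shrink)
        new_members = [i for i, p in enumerate(data)
                       if euclidean(p, center) < radius]
        if not new_members:
            return []
        center = mean_profile([data[i] for i in new_members])
        if radius == qual and new_members == members:
            break                                      # radius and members stable
        members = new_members
    return members

def quality_clustering(data, qual, min_genes=2):
    """Find a cluster, discard its genes from the dataset, repeat."""
    remaining = list(range(len(data)))
    clusters = []
    while len(remaining) >= min_genes:
        local = find_cluster([data[i] for i in remaining], qual)
        if len(local) < min_genes:
            break
        clusters.append([remaining[i] for i in local])
        chosen = set(local)
        remaining = [g for k, g in enumerate(remaining) if k not in chosen]
    return clusters, remaining

# Hypothetical data: two tight groups and one isolated, noisy gene (index 7).
data = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
        [5.0, 5.0], [5.2, 5.0], [5.0, 5.2], [5.2, 5.2],
        [-10.0, 12.0]]
clusters, unclustered = quality_clustering(data, qual=1.0)
print(clusters, unclustered)
```

Unlike K-means, the number of clusters is not given in advance, and a gene that lies within `qual` of no cluster center stays unclustered.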

Algorithms: Quality based clustering • Adaptive quality based clustering Initialise cluster center = mean expression profile Find genes with distance < qual from the center

Algorithms: Quality based clustering • Adaptive quality based clustering Recalculate center = mean of blue genes Recalculate qual : actual radius of the cluster > qual

Algorithms: Quality based clustering • Recalculate qual

Algorithms: Quality based clustering
A.Q.B.C.: QP: 95%; min nr genes 15
• determines the number of clusters automatically
• determines the number of iterations automatically
• determines for each cluster an optimal radius (statistically determines whether clusters should be merged or separated)
• 1 important user-defined parameter, i.e. the confidence level (default: 95 %, i.e. 0.95): a gene assigned to a cluster has a probability of 95 % or more of belonging to that cluster

Algorithms: Quality based clustering Comparison with K-means K-means, nr. of clusters: 32; nr. of iterations: 100 finding optimal parameter setting requires a lot of parameter finetuning

Comparison with K-means: K-means (nr. of clusters = 32) vs. A.Q.B.C.
• MCB, replication & DNA synthesis: NOG=200, NOG=188, Common = 159
• M14, organisation of the centrosome: NOG=153, NOG=118, NOG=87, Common = 42, Common = 44
• ECB, budding & cell polarity: NOG=147, NOG=18, Common = 10

Comparison with K-means (NOG=147)
• K-means forces every gene into a cluster, so its clusters contain more noise

Comparison with K-means NOG=147 NOG=18 ECB budding & cell polarity Genes with noisy profile rejected from A.Q.B.C. cluster are retained by K-means Small groups of genes with highly similar profiles

INTRODUCTION MICROARRAY ANALYSIS VALIDATION OF THE RESULTS • Statistical validation • Biological validation

Cluster validation: literature/knowledge, dataset
• small clusters
  • contain genes with highly similar profile (+)
  • some information given up in first step (-)
• big clusters
  • contain all real positives (+)
  • increasing number of false positives (-)
• validate “core” clusters
• Motif finding (DNA level)

Cluster Validation: Functional Enrichment
Rationale:
• Clustering yields accession nrs (AC0020D11428)
• Manual query of literature/knowledge data (SRS, Medline, GeneCards, MIPS, Gene Ontology): huge task
• Text mining and Gene Ontologies

Cluster Validation: Functional Enrichment
MIPS functional category (http://mips.gsf.de/proj/yeast/CYGD/db/index.html)
• METABOLISM (1066 ORFs)
  • amino acid metabolism (204 ORFs)
    • amino acid biosynthesis (118 ORFs)
      • biosynthesis of the cysteine-aromatic group (1 ORF)
      • biosynthesis of the pyruvate family (alanine, isoleucine, leucine, valine) and D-alanine (1 ORF)
    • regulation of amino acid metabolism (33 ORFs)
    • amino acid transport (23 ORFs)
    • amino acid degradation (catabolism) (35 ORFs)
      • degradation of amino acids of the glutamate group (1 ORF)
      • degradation of glutamate (1 ORF)
      • degradation of amino acids of the cysteine-aromatic group (1 ORF)
      • degradation of glycine (1 ORF)
    • other amino acid metabolism activities (5 ORFs)
  • nitrogen and sulfur metabolism (74 ORFs)
    • nitrogen and sulfur utilization (38 ORFs)
    • regulation of nitrogen and sulphur utilization (29 ORFs)
    • nitrogen and sulfur transport (7 ORFs)
  • nucleotide metabolism (144 ORFs)
    • purine ribonucleotide metabolism (45 ORFs)
    • pyrimidine ribonucleotide metabolism (29 ORFs)
    • deoxyribonucleotide metabolism (11 ORFs)
    • metabolism of cyclic and unusual nucleotides (8 ORFs)
    • regulation of nucleotide metabolism (13 ORFs)
    • polynucleotide degradation (23 ORFs)
    • nucleotide transport (14 ORFs)
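One common way to score functional enrichment is a hypergeometric test: given how many ORFs in the genome carry a category label, how surprising is the number found in a cluster? The slides do not name the statistic used, so the sketch below is an assumption. The genome size (6178) and the category size (144 ORFs, nucleotide metabolism) are taken from the slides; the cluster size and overlap are hypothetical.

```python
from math import comb

def enrichment_p_value(n_genome, n_category, n_cluster, n_overlap):
    """P(X >= n_overlap) when n_cluster genes are drawn at random from a
    genome of n_genome genes, n_category of which belong to the category."""
    upper = min(n_category, n_cluster)
    total = comb(n_genome, n_cluster)
    tail = sum(comb(n_category, k) * comb(n_genome - n_category, n_cluster - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Hypothetical cluster of 100 genes, 12 of them annotated to the category.
p = enrichment_p_value(n_genome=6178, n_category=144, n_cluster=100, n_overlap=12)
print(f"p = {p:.2e}")
```

By chance about 100 × 144 / 6178 ≈ 2.3 genes of the category would be expected in such a cluster, so 12 is a strong over-representation. When many categories are tested, the p-values would still need a multiple-testing correction.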

Cluster Validation: Motif Detection
• cDNA arrays
• Preprocessing of the data
• Clustering
• Upstream regions
• Gibbs sampling
• EMBL
• BLAST
• Motif finding
