Microarray data analysis: Introduction. Department of Bioinformatics, Centro de Investigación Príncipe Felipe, and Functional Genomics node, INB, Spain. http://www.gepas.org http://www.babelomics.org http://bioinfo.cipf.es National Institute of Bioinformatics, Functional Genomics node.
Structure of the course: Introduction; From images to numbers; Normalization; Clustering; Gene selection; Predictors; Functional annotation. Tools: GEPAS (DNMAD, Expresso, Clustering, Gene selection, Tnasas) and Babelomics.
Background. "Progress in science depends on new techniques, new discoveries and new ideas, probably in that order." Sydney Brenner, 1980. The introduction and popularisation of high-throughput techniques has drastically changed the way in which biological problems can be addressed and hypotheses can be tested. But not necessarily the way in which we really address or test them…
The pre-genomics paradigm. From genotype to phenotype. Genes in the DNA... …code for proteins... …whose structure accounts for function... …which, plus the environment... …produce the final phenotype. >protein kinase acctgttgatggcgacagggactgtatgctgatctatgctgatgcatgcatgctgactactgatgtgggggctattgacttgatgtctatc....
From genotype to phenotype (in the functional post-genomics scenario). Genes in the DNA (now: 22240, NCBI build 35, 12/04; 50% display alternative splicing; 25%-60% unknown)... …which can be different because of the variability (10 million SNPs)... …when expressed in the proper moment and place (a typical tissue expresses between 5000 and 10000 genes)... …code for proteins (100K-500K proteins) that undergo post-translational modifications, somatic recombination... …whose structures account for function... …in cooperation with other proteins, forming complex interaction networks (each protein has an average of 8 interactions)... …whose final effect configures the phenotype. >protein kinase acctgttgatggcgacagggactgtatgctgatctatgctgatgcatgcatgctgactactgatgtgggggctattgacttgatgtctatc....
Bioinformatics tools for pre-genomic sequence data analysis. From a sequence: molecular databases, search results, alignment, phylogenetic tree, conserved region, motif, motif databases, secondary and tertiary protein structure, information. The aim: extracting as much information as possible from one single piece of data.
Post-genomic vision. Questions: Who? What do we know? And who else? In what way? Where, when and how much? Sources: genome sequencing; literature, databases; 2-hybrid systems; mass spectrometry for protein complexes; SNPs; expression arrays.
Genome-wide data and a note of caution. Genome-wide technologies allow us to produce vast amounts of data. But... dealing with many data (omic data) increases the occurrence of spurious associations due to chance. Classical order: question, experiment, test. Is gene A involved in process B? Genome-wide order: experiment, (sometimes) test, question. Is there any gene (or set of genes) involved in any process? Sure, but... is it real? (Many hypotheses are rejected while this one is accepted a posteriori: numerology.) The test is dependent on the hypothesis and not vice versa.
Post-genomic vision: whole system picture. Genes, polymorphisms, gene expression, interactions, databases, information. The new tools: clustering, feature selection, information mining.
Gene expression profiling. Some uses and related problems • Differences at phenotype level are the visible consequence of differences at molecular level which, in many cases, can be detected by measuring the levels of gene expression. The same holds for different experiments, treatments, strains, etc. • Classification of phenotypes / experiments. Can I distinguish among classes (either known or unknown), values of variables, etc. using molecular gene expression data? (sensitivity) • Selection of differentially expressed genes among the phenotypes / experiments. Did I select the relevant genes, all the relevant genes and nothing but the relevant genes? (specificity) • Biological roles the genes are carrying out in the cell. What general biological roles are really represented in the set of relevant genes? (mechanism of action)
Studies must be hypothesis driven. What is our aim? Class discovery? Sample classification? Gene selection? ... Unsupervised: Can we find groups of experiments with similar gene expression profiles? Co-expressing genes... What do they have in common? Supervised: Different phenotypes... Molecular classification of samples. What genes are responsible for the differences? Reverse engineering: Genes interacting in a network (A, B, C, D, E...)... What does the network look like?
DNA microarrays: a paradigm of a post-genomic technique Cy5 Cy3 cDNA arrays Oligonucleotide arrays
Transforming images into numbers. Two-color: test sample labeled red (Cy5), reference sample labeled green (Cy3). Red: gene overexpressed in test sample. Green: gene underexpressed in test sample. Yellow: equally expressed. Red/green: ratio of expression. One color: intensity of a gene using the perfect-match (PM) probes (and in some cases the mismatch (MM) probes). Scanners generate a graphic file. Software analyzes the file: GenePix Pro (by Axon Instruments, Inc.) or Imagene (by Biodiscovery, Inc.). There are free systems too: TIGR Spotfinder, ScanAlyze, etc.
Normalisation. There are many sources of error that can affect and seriously bias the interpretation of the results: differences in the efficiency of labeling, the hybridisation, local effects, etc. Normalisation is a necessary step before proceeding with the analysis. Figure: before (left) and after (right) normalization. A) Box plots, B) box plots of subarrays and C) MA plots (ratio versus intensity): (a) after normalization by average, (b) after print-tip lowess normalization, (c) after normalization taking into account spatial effects.
Box plots. Provide information on the median, the upper and lower quartiles, the range, and individual extreme values. The central box in the plot represents the inter-quartile range (IQR), which is defined as the difference between the 75th percentile and the 25th percentile, i.e., the upper and lower quartiles. The line in the middle of the box represents the median, a measure of central location of the data. Extreme values, more than 1.5 IQR above the 75th percentile or more than 1.5 IQR below the 25th percentile, are plotted individually.
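The quantities a box plot draws can be computed directly. The following is a minimal sketch, assuming a flat list of log-ratios for one array; the function name `boxplot_stats`, the example values and the choice of the "inclusive" quartile convention are illustrative assumptions, not part of the original course material (plotting packages may use slightly different quartile conventions).

```python
import statistics


def boxplot_stats(values):
    """Return the summary statistics that a box plot displays."""
    data = sorted(values)
    # Quartiles: 25th percentile, median, 75th percentile.
    # method="inclusive" is one of several common conventions (an assumption here).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Values beyond 1.5 IQR from the box are plotted individually.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    extremes = [v for v in data if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "extremes": extremes}


# Hypothetical log-ratios for one array; 3.5 is an obvious extreme value.
log_ratios = [-0.4, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 3.5]
stats = boxplot_stats(log_ratios)
print(stats)
```

Comparing these summaries across arrays before and after normalisation gives the same information as the side-by-side box plots.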
MA-plots. The MA-plots show the relationship between A (the "average signal", 0.5 * (log R + log G), where R is the background-subtracted red [mean of F635 - median of B635] and G the background-subtracted green [mean of F532 - median of B532]) and M (the log [base 2] differential ratio, log(R/G)). These plots are shown both before and after normalization, with different color lines for the lowess lines of each print-tip.
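As a worked example of the definitions of M and A, here is a minimal sketch for a single spot. The function name `ma_values` and the intensities are hypothetical; base 2 is assumed for A as well as for M (the slide only states the base for M), and spots whose background-subtracted signal is not positive are simply skipped, which is one possible policy among several.

```python
import math


def ma_values(red_fg, red_bg, green_fg, green_bg):
    """Compute (M, A) for one spot from foreground and background intensities."""
    r = red_fg - red_bg      # background-subtracted red (F635 - B635)
    g = green_fg - green_bg  # background-subtracted green (F532 - B532)
    if r <= 0 or g <= 0:
        return None  # the log is undefined; such spots are skipped in this sketch
    m = math.log2(r / g)                      # log differential ratio
    a = 0.5 * (math.log2(r) + math.log2(g))   # average log signal
    return m, a


# Hypothetical spot: red signal four times the green signal after background subtraction.
m, a = ma_values(1100.0, 100.0, 350.0, 100.0)
print(m, a)
```

Plotting M against A for every spot gives the MA-plot; a well-normalised array has its cloud of points centred on M = 0 across the whole range of A.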
Affy plots: Box plots Processed intensities (background corrected, normalized, pm-mm adjusted and summarized in a single number within each probe set).
The data. Different classes of experimental conditions (A, B, C), e.g. cancer types, tissues, drug treatments, survival time, etc. Dimensions of the data matrix: genes (thousands) by experimental conditions (from tens up to no more than a few hundreds). One dimension gives the expression profile of all the genes for an experimental condition (array); the other gives the expression profile of a gene across the experimental conditions. • Characteristics of the data: • Number of variables (genes) is orders of magnitude larger than the number of experiments • Low signal to noise ratio • High redundancy and intra-gene correlations • Most of the genes are not informative with respect to the trait we are studying (they account for unrelated physiological conditions, etc.) • Many genes have no annotation!!
Unsupervised problem: class discovery. Our interest is in discovering clusters of items which we do not know beforehand. Can we find groups of experiments with similar gene expression profiles? Co-expressing genes... • What genes co-express? • How many different expression patterns do we have? • What do they have in common? • Etc.
Unsupervised clustering methods. Method + distance: produce groups of items based on their global similarity. Non-hierarchical: K-means, PCA, SOM. Hierarchical: UPGMA, SOTA. Labels on the slide: quick and robust; different levels of information.
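To make "method + distance" concrete, here is a minimal sketch of average-linkage hierarchical clustering (UPGMA) on a precomputed distance table. The gene names, the distances and the helper names are hypothetical; the code recomputes cluster distances from scratch at every step, which is fine for illustration but too slow for thousands of genes.

```python
from itertools import combinations


def average_linkage(c1, c2, dist):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def upgma(items, dist):
    """Repeatedly merge the two closest clusters; return the list of merges."""
    clusters = [(item,) for item in items]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], dist),
        )
        height = average_linkage(c1, c2, dist)
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2, height))
    return merges


# Hypothetical distances between four expression profiles.
dist = {
    frozenset(("g1", "g2")): 0.1,
    frozenset(("g1", "g3")): 0.8,
    frozenset(("g1", "g4")): 1.0,
    frozenset(("g2", "g3")): 0.9,
    frozenset(("g2", "g4")): 1.0,
    frozenset(("g3", "g4")): 0.3,
}
merges = upgma(["g1", "g2", "g3", "g4"], dist)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

The sequence of merges and their heights is exactly what a dendrogram displays, which is why hierarchical methods give information at several levels of resolution.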
An unsupervised problem: clustering of genes. • Gene clusters are unknown beforehand • Distance function • Cluster gene expression patterns based uniquely on their similarities. • Results are subjected to further interpretation (if possible)
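The choice of distance function decides what "similar" means. The sketch below contrasts two common options, a correlation-based distance and the Euclidean distance; the function name `pearson_distance` and the three toy profiles are assumptions made for illustration, and profiles with zero variance are not handled.

```python
import math


def pearson_distance(x, y):
    """1 - Pearson correlation: small when two profiles have the same shape."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)


# Hypothetical profiles across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same trend as gene_a, different scale
gene_c = [4.0, 3.0, 2.0, 1.0]      # opposite trend

print(pearson_distance(gene_a, gene_b), math.dist(gene_a, gene_b))
print(pearson_distance(gene_a, gene_c), math.dist(gene_a, gene_c))
```

With the correlation distance, gene_a and gene_b are neighbours because they share a trend; with the Euclidean distance they are far apart because their magnitudes differ. Which behaviour is wanted depends on the biological question.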
Clustering of experiments: The rationale. If enough genes have their expression levels altered in the different experiments, we might be able to find these classes by comparing gene expression profiles. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Overview of the combined in vitro and breast tissue specimen cluster diagram. A scaled-down representation of the 1,247-gene cluster diagram. The black bars show the positions of the clusters discussed in the text: (A) proliferation-associated, (B) IFN-regulated, (C) B lymphocytes, and (D) stromal cells. Perou et al., PNAS 96 (1999)
Clustering of experiments: The problems. Any gene (regardless of its relevance for the classification) has the same weight in the comparison. If relevant genes are not in overwhelming majority, this produces noise and/or irrelevant trends.
Supervised problem: We have information on the classes (e.g. wt vs. mutant) and we want to select differentially expressed genes or to predict class membership on the basis of gene expression. Different phenotypes... Molecular classification of samples. What genes are responsible for the differences?
Supervised problems: Class prediction and gene selection, based on gene expression profiles. Information on classes (defined on criteria external to the gene expression measurements) is used. Problems: How can classes A, B, C... be distinguished based on the corresponding profiles of gene expression? How can a continuous phenotypic trait (resistance to drugs, survival, etc.) be predicted? And which genes among the thousands analysed are relevant for the classification? Class prediction works along the experimental conditions (from tens up to no more than a few hundreds); gene selection works along the genes (thousands).
Gene selection. The simplest way: univariate, gene-by-gene. Other multivariate approaches can be used. • Two classes: T-test, Bayes, Data-adaptive, Clear • Multiclass: Anova, Clear • Continuous variable (e.g. level of metabolite): Pearson, Spearman, Regression • Survival: Cox model. The T-rex tool
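The univariate, gene-by-gene idea for two classes can be sketched as follows. This is a minimal illustration, assuming Welch's unequal-variance form of the t statistic (the slide only says "T-test") and a hypothetical data layout: a dictionary from gene name to one value per array, plus a list of class labels. It ranks genes by the absolute statistic and does not compute p-values; genes with zero variance in both classes would need special handling.

```python
import math
import statistics


def welch_t(a, b):
    """Welch t statistic for the difference in means between two groups."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def rank_genes(expression, labels):
    """Score every gene independently and sort by absolute t statistic."""
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == "A"]
        b = [v for v, lab in zip(values, labels) if lab == "B"]
        scores[gene] = welch_t(a, b)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical data: three arrays of class A followed by three of class B.
labels = ["A", "A", "A", "B", "B", "B"]
expression = {
    "gene1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # shifts between the classes
    "gene2": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # same mean in both classes
}
ranking = rank_genes(expression, labels)
print(ranking)
```

Because every gene is tested on its own, the resulting p-values need a multiple-testing correction before any gene is declared differentially expressed.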
A simple problem: gene selection for class discrimination. ~15,000 genes. Case (10) / control (10). Genes differentially expressed among classes (t-test), with p-value < 0.05
Sorry... the data was a collection of random numbers labelled for two classes. This is a multiple-testing statistical contrast. You were not interested a priori in the first, best discriminant, gene. Adjusted p-values must be used!
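The effect can be reproduced in a few lines. The sketch below is an illustration under stated assumptions: instead of simulating arrays and running t-tests, it draws 15,000 p-values uniformly at random, which is how p-values behave when no gene truly differs, and then applies the Benjamini-Hochberg FDR adjustment as one example of an adjusted p-value. The function name and the random seed are arbitrary.

```python
import random


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg FDR-adjusted p-values, in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


rng = random.Random(1)
n_genes = 15000
null_pvalues = [rng.random() for _ in range(n_genes)]  # no real differences

raw_hits = sum(p < 0.05 for p in null_pvalues)
adjusted_hits = sum(q < 0.05 for q in benjamini_hochberg(null_pvalues))
print("unadjusted p < 0.05:", raw_hits)
print("FDR-adjusted p < 0.05:", adjusted_hits)
```

Roughly 5% of the genes, several hundred, pass the unadjusted threshold purely by chance, whereas after adjustment typically none or almost none remain.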
Genes differentially expressed between normal endometrium (NE) and endometrioid endometrial carcinomas (EEC). Hierarchical clustering of 86 genes with different expression patterns between normal endometrium and endometrioid endometrial carcinoma (FDR-adjusted p < 0.05), selected among the ~7000 genes in the CNIO oncochip. Moreno et al., Breast and Gynaecological Cancer Laboratory, Molecular Pathology Programme, CNIO
Of predictors and molecular signatures. What is a predictor? Given classes A and B and a new sample: New, A or B? Diff (A, New) = 2; Diff (B, New) = 13. Most probably the new sample belongs to A. Algorithms: DLDA, KNN, SVM, random forests, PAM, etc.
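The "which class is the new sample closer to?" reasoning is what a k-nearest-neighbours (KNN) predictor does. Below is a minimal sketch, assuming Euclidean distance and majority voting; the function name `knn_predict`, the two-gene profiles and the choice k = 3 are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(training, new_profile, k=3):
    """Assign the class most common among the k closest training samples."""
    neighbours = sorted(training, key=lambda item: math.dist(item[0], new_profile))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (expression profile over two genes, class label).
training = [
    ([1.0, 1.0], "A"), ([1.2, 0.9], "A"), ([0.8, 1.1], "A"),
    ([5.0, 5.0], "B"), ([5.2, 4.8], "B"), ([4.9, 5.1], "B"),
]
prediction = knn_predict(training, [1.5, 1.2])
print(prediction)
```

With real data the profiles would contain only the genes chosen by a gene selection step, and k would be tuned by cross-validation.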
Cross-validation. The performance of a classifier can be estimated through a process of cross-validation. Typical schemes are three-fold, ten-fold and leave-one-out (LOO), the latter when few samples are available for training.
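The splitting step of cross-validation can be sketched as a small generator. This is an illustrative helper, not the course's implementation: the name `k_fold_indices` is hypothetical, samples are assigned to folds in a simple interleaved order, and in practice one would usually shuffle the samples and keep the class proportions balanced within each fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (training indices, test indices) for each of the k folds."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        yield train_idx, test_idx


# Three-fold split of six samples.
three_fold = list(k_fold_indices(6, 3))
print(three_fold)

# Leave-one-out is the special case k = number of samples.
loo = list(k_fold_indices(4, 4))
print(loo)
```

Each sample is used for evaluation exactly once, and the error rate is the fraction of held-out samples that the classifier, trained on the remaining ones, gets wrong.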
Selection bias. Feature selection is done on all the samples (A, B) before the CV; only training and evaluation are then repeated inside the CV loop. This produces artificially small errors.
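Unbiased CV. Feature selection, training and evaluation are all repeated inside the CV loop: features are selected and the classifier is trained using the training samples only, and evaluation is done on the samples left out.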
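The difference between the two schemes can be checked on data with no real signal. The sketch below is an illustration under several assumptions: random Gaussian "expression" values for 10 + 10 samples and 500 genes, gene selection by absolute t statistic, a nearest-centroid classifier standing in for the predictors named earlier, and leave-one-out CV. All function names and parameters are hypothetical.

```python
import math
import random
import statistics


def t_score(a, b):
    """Absolute Welch t statistic between two groups of values."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return abs(statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def select_genes(samples, labels, n_genes):
    """Indices of the n_genes genes that best separate classes A and B."""
    scores = []
    for g in range(len(samples[0])):
        a = [s[g] for s, lab in zip(samples, labels) if lab == "A"]
        b = [s[g] for s, lab in zip(samples, labels) if lab == "B"]
        scores.append((t_score(a, b), g))
    return [g for _, g in sorted(scores, reverse=True)[:n_genes]]


def nearest_centroid(samples, labels, genes, new_sample):
    """Predict the class whose mean profile (on the chosen genes) is closest."""
    best_label, best_dist = None, float("inf")
    for lab in sorted(set(labels)):
        members = [s for s, l in zip(samples, labels) if l == lab]
        centroid = [statistics.mean(m[g] for m in members) for g in genes]
        d = math.dist(centroid, [new_sample[g] for g in genes])
        if d < best_dist:
            best_label, best_dist = lab, d
    return best_label


def loo_error(samples, labels, n_genes, select_inside):
    """Leave-one-out error; select_inside=False reproduces the selection bias."""
    fixed = None if select_inside else select_genes(samples, labels, n_genes)
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        genes = select_genes(train_x, train_y, n_genes) if select_inside else fixed
        if nearest_centroid(train_x, train_y, genes, samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)


rng = random.Random(0)
labels = ["A"] * 10 + ["B"] * 10
samples = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in labels]  # pure noise

biased = loo_error(samples, labels, n_genes=10, select_inside=False)
unbiased = loo_error(samples, labels, n_genes=10, select_inside=True)
print("selection outside CV:", biased)
print("selection inside CV:", unbiased)
```

Since the labels carry no information, an honest error estimate should be close to 50%. Selecting genes on all the samples first tends to give a much lower, misleading figure, while selecting inside the loop tends to stay near chance.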
Predictor of clinical outcome in breast cancer. Genes are arranged according to their correlation with the prognostic groups. Prognostic classifier with optimal accuracy. van’t Veer et al., Nature, 2002
Functional annotation. My data... Clustering. What are these groups? How are they structured? What is this gene? Cell cycle...? Data mining: links to databases (DBs) and information.
Functional annotation. Use of biological information as a validation criterion. Information mining of DNA array data. Allows quick assignment of function, biological role and other properties to groups of genes. Used to understand the molecular functional basis that accounts for the differences between two (or more) conditions. Sources of information: pros and cons. Free text (Pubmed abstracts): many gene-abstract correspondences, but problems with context, synonyms, useless terms, etc. Curated terms: fewer gene-term correspondences, but accurate, unambiguous and controlled.
Functional annotation. The two-step approach. Example: we might be interested in understanding, e.g., which genes differ between strains, conditions, etc. Typically: we examine each gene, selecting only those that show significant differences using an appropriate statistical model and correcting for multiple testing (first test). Or: finding clusters of co-expressed genes. Then: we can extract biological terms associated with the genes (Metabolism, Transport, ..., Reproduction) and test whether these are differentially distributed between groups A and B, i.e. test them for real differences (second test).
Testing two GO terms (remember, we have to test thousands). Are these two groups of genes carrying out different biological roles? Biosynthesis: 60% in group A, 20% in group B. Sporulation: 20% in group A, 20% in group B. Contingency table for biosynthesis (rows: Group A, Group B; columns: Biosynthesis, Other): 6, 4; 2, 9. The popular Fisher’s test. Genes in group A are significantly associated with biosynthesis, but not with sporulation.
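Fisher's exact test on a 2x2 table can be written with nothing more than binomial coefficients. The sketch below is a minimal two-sided version; the function name is hypothetical, and the arrangement of the four counts read from the slide (group A: 6 biosynthesis, 4 other; group B: 2 biosynthesis, 9 other) is an assumption about a flattened figure.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )


# Rows: group A, group B. Columns: biosynthesis, other (assumed layout).
p_value = fisher_exact_two_sided(6, 4, 2, 9)
print(p_value)
```

The same function is applied to every GO term in turn, so the p-values it returns are unadjusted and still need a multiple-testing correction across the thousands of terms tested.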
GO terms found in sets of 50 genes. Each row corresponds to a random selection of 50 genes from the E. coli genome, compared to the rest of the genome (as most programs do). GO terms in blue (p-value < 0.05 in individual test) have asymmetrical distributions by chance (see adjusted p-values).
How to test significant differences in the distribution of biological terms between groups of genes? FatiGO: GO-driven data analysis. Constitutes a statistical framework able to deal with multiple-testing questions. GO: source of information. A reduced number of curated terms. The Gene Ontology Consortium. 2000. Gene Ontology: tool for the unification of biology. Nature Genetics 25: 25-29
FatiGO results. The application extracts biologically relevant terms (showing a significant differential distribution) for a set of genes. Output: number of genes with GO term at the level and ontology selected for each cluster; unadjusted p-value; step-down min p adjusted p-value; FDR (indep.) adjusted p-value; FDR (arbitrary depend.) adjusted p-value. Tables: GO term - genes; genes of old versions (Unigene); genes without result; repeated genes. GO tree with different levels of information.
Understanding why genes differ in their expression between two different conditions. Lymphomas from mature lymphocytes (LB) and precursor T-lymphocyte (PTL). Genes differentially expressed, selected among the ~7000 genes in the CNIO oncochip. Genes differentially expressed between both groups were mainly related to immune response (activated in mature lymphocytes). Martinez et al., Human Genetics Laboratory. Molecular Pathology Programme, CNIO
Biological processes shown by the genes differentially expressed between PTL and LB • Obvious? NO • You now know that there are no other co-variables (e.g. age, sex, etc.) • If you did not have a strong biological hypothesis, now you have an explanation. Martinez et al., Human Genetics Laboratory. Molecular Pathology Programme, CNIO
So far so good... and what if the first step fails? • Situations in which no differentially expressed genes are found between the studied conditions are quite common. • Causes: • Noise • Few samples • Internal heterogeneity, etc. • e.g.: 17 NTG vs. 8 IGT 18 DM2 • Mootha et al., Nat Genet. 2003 Jul;34(3):267-73. Figure: genes of classes A and B ordered by the statistic (from - to +), with a lower and an upper threshold.
Threshold-free approach. Including information in the procedure of gene selection. Our hypothesis is different: now it is about sets of genes. Two classes (A and B). Genes arranged by differential expression between them (statistic from - to +). Genes in set 1: unrelated to the molecular processes that account for the classes. Genes in set 2: related (mainly active in class A but not in B).
FatiScan. Test along the list of genes ranked by the statistic (from + to -): a term (e.g. Response to external stimulus) can be over-represented towards one end of the list and under-represented towards the other. Al-Shahrour et al. 2005. Bioinformatics
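A threshold-free scan of this kind can be sketched as follows. This is an illustration in the spirit of the approach, not FatiScan's actual implementation: the ranked list is cut at several evenly spaced positions, and at each cut a two-sided Fisher's exact test compares how often a term is annotated above and below the cut. The function names, the number of partitions and the toy gene list are assumptions.

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    return sum(
        prob(x) for x in range(max(0, col1 - row2), min(row1, col1) + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )


def scan_partitions(ranked_genes, term_genes, n_partitions=4):
    """Test one term at several cuts of a ranked gene list."""
    results = []
    n = len(ranked_genes)
    for k in range(1, n_partitions + 1):
        cut = k * n // (n_partitions + 1)
        upper, lower = ranked_genes[:cut], ranked_genes[cut:]
        in_upper = sum(g in term_genes for g in upper)
        in_lower = sum(g in term_genes for g in lower)
        p = fisher_two_sided(in_upper, len(upper) - in_upper,
                             in_lower, len(lower) - in_lower)
        results.append((cut, in_upper, in_lower, p))
    return results


# Hypothetical list of 20 genes ranked by the statistic (most positive first),
# with one term annotated mostly among the top-ranked genes.
ranked = [f"g{i}" for i in range(20)]
term = {"g0", "g1", "g2", "g3", "g5"}
results = scan_partitions(ranked, term)
for cut, in_upper, in_lower, p in results:
    print(cut, in_upper, in_lower, p)
```

No gene has to pass a significance threshold on its own: the evidence comes from the position of the whole set of annotated genes in the ranking. The p-values obtained over many terms and partitions still require adjustment for multiple testing.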
FatiScan looks for: GO, KEGG pathways, Interpro motifs, Swissprot keywords, CisRed motifs, TFBSs