Statistical Methods for the Screening and Classification of Microarray Gene Expression Data. Geoff McLachlan Department of Mathematics & Institute for Molecular Bioscience University of Queensland. http://www.maths.uq.edu.au/~gjm.
Statistical Methods for the Screening and Classification of Microarray Gene Expression Data Geoff McLachlan Department of Mathematics & Institute for Molecular Bioscience University of Queensland http://www.maths.uq.edu.au/~gjm
Institute for Molecular Bioscience, University of Queensland
Liat Jones Richard Bean Justin Zhu
Outline of Workshop
Part 1: Introduction to Microarray Technology
Part 2: Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
Part 3: Supervised Classification of Tissue Samples
Part 4: Unsupervised Classification: Cluster Analysis of Tissue Samples and Gene Profiles
Part 5: Linking Microarray Data with Survival Analysis
A microarray is a new technology which allows the measurement of the expression levels of thousands of genes simultaneously.
• (1) Sequencing of the genome (human, mouse, and others)
• (2) Improvement in technology to generate high-density arrays on chips (glass slides or nylon membrane)
The entire genome of an organism can be probed at a single point in time.
Draft of the Human Genome Public Sequence Nature, Feb. 2001 Celera Sequence Science, Feb. 2001
The Challenge for Statistical Analysis of Microarray Data Microarrays present new problems for statistics because the data are very high dimensional with very little replication. The challenge is to extract useful information and discover knowledge from the data, such as gene functions, gene interactions, regulatory pathways, metabolic pathways etc.
Vital Statistics by C. Tilstone, Nature 424, 610-612, 2003. “DNA microarrays have given geneticists and molecular biologists access to more data than ever before. But do these researchers have the statistical know-how to cope?” Branching out: cluster analysis can group samples that show similar patterns of gene expression.
Representation of Data from M Microarray Experiments
Columns: Sample 1, Sample 2, ..., Sample M
Rows: Gene 1, Gene 2, ..., Gene N
Each column is an Expression Signature; each row is an Expression Profile.
Assume we have extracted gene expression values from intensities.
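The layout on this slide can be sketched as a small N x M table. This is only an illustration: the numbers are made up, the helper names are hypothetical, and it assumes (as the slide layout suggests) that a row is a gene's expression profile and a column is a sample's expression signature.

```python
# Toy expression matrix: rows are genes (N = 3), columns are samples (M = 4).
# The values are invented purely for illustration.
expression = [
    [2.1, 1.9, 0.3, 0.2],   # gene 1
    [0.5, 0.4, 0.6, 0.5],   # gene 2
    [1.0, 1.2, 2.8, 3.1],   # gene 3
]

def expression_profile(matrix, gene_index):
    """Row of the matrix: one gene measured across all M samples."""
    return list(matrix[gene_index])

def expression_signature(matrix, sample_index):
    """Column of the matrix: all N genes measured in one sample."""
    return [row[sample_index] for row in matrix]

n_genes = len(expression)
n_samples = len(expression[0])
profile_gene1 = expression_profile(expression, 0)
signature_sample1 = expression_signature(expression, 0)
```

In real experiments N is in the thousands while M is often below a hundred, which is the "high dimensional with very little replication" situation described earlier.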
It is assumed that the (logged) expression levels have been preprocessed with adjustment for array effects.
The majority of time on a data analysis project will be spent “cleaning” the data before doing any analysis.
• Paradoxically, most statistical training assumes that the data arrive “precleaned.” Students, whether in PhD programs or an undergraduate introductory course, are not taught routinely to check data for accuracy or even to worry about it. Exacerbating the problem further are claims by software vendors that their techniques can produce valid results no matter what the quality of the incoming data.
De Veaux and Hand (How to Lie with Bad Data, Statist. Sci., 2005)
“Large-scale gene expression studies are not a passing fashion, but are instead one aspect of new work of biological experimentation, one involving large-scale, high throughput assays.” Speed et al., 2002, Statistical Analysis of Gene Expression Microarray Data, Chapman and Hall/CRC
Growth of microarray and microarray methodology literature listed in PubMed from 1995 to 2003. The category ‘all microarray papers’ includes those found by searching PubMed for microarray* OR ‘gene expression profiling’. The category ‘statistical microarray papers’ includes those found by searching PubMed for ‘statistical method*’ OR ‘statistical techniq*’ OR ‘statistical approach*’ AND microarray* OR ‘gene expression profiling’.
Mehta et al (Nature Genetics, Sept. 2004): “The field of expression data analysis is particularly active with novel analysis strategies and tools being published weekly”, and the value of many of these methods is questionable. Some results produced by using these methods are so anomalous that a breed of ‘forensic’ statisticians (Ambroise and McLachlan, 2002; Baggerly et al., 2003) who doggedly detect and correct other HDB (high-dimensional biology) investigators’ prominent mistakes, has been created.
Analyzing Microarray Gene Expression Data (UQ, Wiley)
Analysis of Microarray Gene Expression Data (Harvard, Kluwer)
The Analysis of Gene Expression Data (Johns Hopkins, Springer)
The Statistical Analysis of Gene Expression Data (Berkeley, C&H)
Statistics for Microarrays
Design and Analysis of DNA Microarrays
Exploration and Analysis of Microarrays
Data Analysis Tools for DNA Microarrays
In the sequel, references to most of the material presented can be found in my joint book, McLachlan, Do, and Ambroise (2004), Analyzing Microarray Gene Expression Data, Hoboken, NJ: Wiley.
Contents • Microarrays in Gene Expression Studies • Cleaning and Normalization • Some Cluster Analysis Methods • Clustering of Tissue Samples • Screening and Clustering of Genes • Discriminant Analysis • Supervised Classification of Tissue Samples • Linking Microarray Data with Survival Analysis
mRNA Levels Indirectly Measure Gene Activity
• Essentially every cell contains the same genes.
• Type and amount of mRNA produced by a cell tells which genes are being expressed.
• Cells differ in the genes which are active at any one time.
• Gene expression is transcription of DNA to mRNA.
• mRNA is translated to proteins.
Technical Background Two recent advances: • Human Genome Project (also other sequenced genomes: mouse, dog etc) • DNA microarray technology -- works by exploiting the ability of a given mRNA molecule to bind specifically to (hybridize) the DNA template from which it originated
What is a DNA microarray? • Small, solid supports onto which the sequences from thousands (tens of thousands) of genes are attached at fixed locations. • They may be glass slides, or silicon chips or nylon membranes. • The DNA is printed, spotted or synthesized directly onto the support • The spots can be DNA, cDNA or oligonucleotides.
The microarray experiment Spot DNA (known) Sample (unknown)
Microarrays Indirectly Measure Levels of mRNA
• mRNA is extracted from the cell
• mRNA is reverse transcribed to cDNA (mRNA itself is unstable)
• cDNA is labeled with fluorescent dye (TARGET)
• The sample is hybridized to known DNA sequences on the array, tens of thousands of genes (PROBE)
• If present, complementary target binds to probe DNA (complementary base pairing)
• Target bound to probe DNA fluoresces
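Complementary base pairing, the mechanism behind hybridization, can be illustrated with a few lines of code. This is a simplified sketch, not a model of real hybridization chemistry: it treats binding as all-or-nothing exact complementarity, and the function names are hypothetical.

```python
# Watson-Crick pairing for DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the strand that pairs with `seq`, read in the usual 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

def is_perfect_match(probe, target):
    """True when `target` is exactly the reverse complement of `probe`."""
    return target.upper() == reverse_complement(probe)

probe = "ATGC"
matching_target = reverse_complement(probe)
```

In practice partially complementary targets can also bind, which is the cross-hybridization problem listed among the sources of experimental error.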
The microarray experiment • mRNA from the cell (sample) is washed over the surface – HYBRIDIZATION • measure the amount of bound mRNA at each spot Allows the measurement of expression for thousands of genes from the amount of bound mRNA.
A Spotted cDNA Microarray Experiment
• Compare the gene expression levels for two cell populations on a single microarray, e.g. tumour and normal cells
Microarray Image
Red: High expression in target labelled with cyanine 5 dye
Green: High expression in target labelled with cyanine 3 dye
Yellow: Similar expression in both target samples
Assumptions:
(1) Gene expression and mRNA: cellular mRNA levels directly reflect gene expression.
(2) mRNA and fluorescence intensity: intensity of bound target is a measure of the abundance of the mRNA in the sample.
Experimental Error
• Sample contamination
• Poor quality/insufficient mRNA
• Reverse transcription bias
• Fluorescent labeling bias
• Hybridization bias
• Cross-linking of DNA (double strands)
• Poor probe design (cross-hybridization)
• Defective chips (scratches, degradation)
• Background from non-specific hybridization
Why are microarrays important? • They contain a very large number of genes and are very small. • Compare gene expression within a single sample or in two different cell types or tissue samples • Examine expressions in a single sample on a genome-wide scale (GENOMICS) • Infer new gene functions, diagnostic tools – e.g. in cancer provides a molecular view.
The Microarray Technologies
Spotted Microarray:
• cDNAs, clones, or short and long oligonucleotides deposited onto glass slides
• Each gene (or EST) represented by its purified PCR product
• Simultaneous analysis of two samples (treated vs untreated cells) provides internal control
• Relative gene expressions
Affymetrix GeneChip:
• Short oligonucleotides synthesized in situ onto glass wafers
• Each gene represented multiply, using 16-20 (preferably non-overlapping) 25-mers
• Each oligonucleotide has a single-base mismatch partner for internal control of hybridization specificity
• Absolute gene expressions
Each with its own advantages and disadvantages
Pros and Cons of the Technologies
Spotted Microarray:
• Flexible and cheaper
• Allows study of genes not yet sequenced (spotted ESTs can be used to discover new genes and their functions)
• Variability in spot quality from slide to slide
• Provides information only on relative gene expressions between cells or tissue samples
Affymetrix GeneChip:
• More expensive yet less flexible
• Good for whole genome expression analysis where genome of that organism has been sequenced
• High quality with little variability between slides
• Gives a measure of absolute expression of genes
Aims of a Microarray Experiment
• Observe changes in a gene in response to external stimuli (cell samples exposed to hormones, drugs, toxins)
• Compare gene expressions between different tissue types (tumour vs normal cell samples)
• Gain understanding of the function of unknown genes and of the disease process at the molecular level
• Ultimately, use as tools in Clinical Medicine for diagnosis, prognosis and therapeutic management.
Importance of Experimental Design
• Good DNA microarray experiments should have clear objectives.
• Not performed as “aimless data mining in search of unanticipated patterns that will provide answers to unasked questions” (Richard Simon, BioTechniques 34:S16-S21, 2003)
Replicates Technical replicates: arrays that have been hybridized to the same biological source (using the same treatment, protocols, etc.) Biological replicates: arrays that have been hybridized to different biological sources, but with the same preparation, treatments, etc.
Extracting Data from the Microarray
• Cleaning: image processing, filtering, missing value estimation
• Normalization: remove sources of systematic variation
Sample 1 Sample 2 Sample 3 Sample 4 etc…
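Two of the steps named on this slide, missing value estimation and normalization, can be sketched as follows. The slide does not say which methods are used, so both choices here are assumptions made for illustration: missing values are filled with the gene's row mean, and each array is centred on its median. The function names and data are hypothetical.

```python
from statistics import mean, median

def impute_row_mean(matrix):
    """Replace missing values (None) with the mean of the gene's observed values."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            raise ValueError("gene has no observed values; filter it out first")
        row_mean = mean(observed)
        filled.append([row_mean if v is None else v for v in row])
    return filled

def median_center_arrays(matrix):
    """Subtract each array's (column's) median so every array has median 0."""
    n_samples = len(matrix[0])
    medians = [median(row[j] for row in matrix) for j in range(n_samples)]
    return [[row[j] - medians[j] for j in range(n_samples)] for row in matrix]

# Rows are genes, columns are arrays; None marks a spot flagged as unusable.
raw = [
    [1.0, 2.0, None],
    [3.0, 4.0, 5.0],
    [5.0, 9.0, 7.0],
]
cleaned = impute_row_mean(raw)
normalized = median_center_arrays(cleaned)
```

Median centring removes only a constant offset per array; intensity-dependent effects would need a more elaborate normalization.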
Examples of spot imperfections. A. donut shape; B. oval or pear shape; C. holey heterogeneous interior; D. high-intensity artifact; E. sickle shape; F. scratches.
Gene Expressions from Measured Intensities
Spotted Microarray: log2(Intensity Cy5 / Intensity Cy3)
Affymetrix: (Perfect Match Intensity - Mismatch Intensity)
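The two expression measures on this slide can be written out as a short sketch. The function names are hypothetical, and averaging the PM minus MM differences over a gene's probe pairs is an assumption about how the per-probe differences are summarized; the slide gives only the difference itself.

```python
from math import log2

def log_ratio(cy5_intensity, cy3_intensity):
    """Spotted array: log2(Cy5 / Cy3) for one spot; both intensities must be positive."""
    if cy5_intensity <= 0 or cy3_intensity <= 0:
        raise ValueError("intensities must be positive to take a log ratio")
    return log2(cy5_intensity / cy3_intensity)

def average_difference(perfect_match, mismatch):
    """Affymetrix-style summary: mean of (PM - MM) over a gene's probe pairs."""
    if not perfect_match or len(perfect_match) != len(mismatch):
        raise ValueError("need equal-length, non-empty PM and MM lists")
    diffs = [pm - mm for pm, mm in zip(perfect_match, mismatch)]
    return sum(diffs) / len(diffs)

up_regulated = log_ratio(800.0, 200.0)     # Cy5 four times Cy3
unchanged = log_ratio(100.0, 100.0)        # equal intensities
down_regulated = log_ratio(50.0, 200.0)    # Cy5 a quarter of Cy3
gene_signal = average_difference([120.0, 100.0], [20.0, 40.0])
```

On the log2 scale a fourfold increase and a fourfold decrease are symmetric about zero (+2 and -2), which is one reason ratios are logged before analysis.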
Data Transformation Rocke and Durbin (2001), Munson (2001), Durbin et al. (2002), and Huber et al. (2002)
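The slide gives only citations, so the formula it displayed is not recoverable here. As a hedged illustration, the sketch below shows the generalized logarithm, a variance-stabilizing transformation commonly associated with this line of work, together with its relation to the inverse hyperbolic sine. The parameter `c` is a tuning constant that would normally be estimated from the data; the value used below is arbitrary.

```python
from math import asinh, log, sqrt

def glog(y, c):
    """Generalized log: ln(y + sqrt(y^2 + c)), defined for zero and negative y when c > 0."""
    if c <= 0:
        raise ValueError("c must be positive")
    return log(y + sqrt(y * y + c))

c = 4.0  # arbitrary illustrative value, not taken from the cited papers

# Unlike the plain log, glog is finite at zero and for background-corrected
# intensities that have gone negative.
at_zero = glog(0.0, c)
at_negative = glog(-1.0, c)

# For large intensities glog behaves like ln(2y), i.e. an ordinary log scale.
large = glog(1000.0, c)

# glog differs from arsinh(y / sqrt(c)) only by the constant ln(sqrt(c)).
offset = glog(3.0, c) - asinh(3.0 / sqrt(c))
```

The practical point is that low intensities are treated roughly linearly while high intensities are treated logarithmically, which avoids the blow-up of the log ratio near zero.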