Some Statistical Issues in Microarray Data Analysis. Alex Sánchez Estadística i Bioinformàtica Departament d’Estadística Universitat de Barcelona Unitat d’Estadística i Bioinformàtica IR-HUVH. Outline. Introduction Experimental design Selecting differentially expressed genes
Some Statistical Issues in Microarray Data Analysis Alex Sánchez Estadística i Bioinformàtica Departament d’Estadística Universitat de Barcelona Unitat d’Estadística i Bioinformàtica IR-HUVH
Outline • Introduction • Experimental design • Selecting differentially expressed genes • Statistical tests • Significance testing • Linear models and Analysis of the variance • Multiple testing • Software for microarray data analysis
Why are we talking about statistics? • A microarray experiment is, as its name says, an experiment, that is: • It has been performed to determine whether some previous hypotheses are true or false (although it can also lead to new hypotheses) • It is subject to errors which may arise from many sources
Sources of variability • Biological heterogeneity in population • Specimen collection/handling effects (tumor: surgical bx, FNA; cell line: culture condition, confluence level) • Biological heterogeneity in specimen • RNA extraction • RNA amplification • Fluor labeling • Hybridization • Scanning (PMT voltage, laser power) (Geschwind, Nature Reviews Neuroscience, 2001)
Categories of variability • Systematic variability: amount of RNA in the biopsy; efficiencies of lab procedures such as RNA extraction, reverse transcription, labeling or photodetection • Random variation: PCR yield, DNA quality, spotting efficiency, spot size, cross-/unspecific hybridization, stray signal
Dealing with systematic variability • Systematic variability has similar effects on many measurements • Corrections can be estimated from data • CALIBRATION or NORMALIZATION is the general name for processes that correct for systematic variability
Dealing with random variation • Random variation cannot be explicitly accounted for • Usual way to deal with it is to assume some ERROR MODELS (e.g. εi ~ N(0, σ²)) • Assuming these error models are true… • EXPERIMENTAL DESIGN is (must be) used to control the action of random variation • STATISTICAL INFERENCE is (must be) used to extract conclusions in the presence of random variation
[Workflow diagram: Biological question -> Experimental design -> Microarray experiment -> Image analysis -> Normalization -> Quality measurement (Failed / Pass) -> Analysis (Estimation, Testing, Clustering, Discrimination) -> Biological verification and interpretation. A "Today" label marks the stages covered in this talk.]
Why experimental design? • The objective of experimental design is to make the analysis of the data and the interpretation of the results • As simple and as powerful as possible • Given the purpose of the experiment • And the constraints of the experimental material
Scientific aims and design choice • The primary focus of the experiments needs to be clearly stated, whether it is: • to identify differentially expressed genes • to search for specific gene-expression patterns • to identify phenotypic subclasses • Aim of the experiment guides design choice • Sometimes only one choice is reasonable • Sometimes different options available
Designing microarray experiments • The appropriate design of a microarray experiment must consider • Design of the array • Allocation of mRNA samples to the slides
I: Layout of the array • Which sequences to use • cDNAs: selection of cDNA from a library • Riken, NIA, etc • Affymetrix PMs and MMs • Oligo probe selection (from Operon, Agilent, etc) • Control probes • What percentage? Where should controls be placed? • How many sequences to use • Should there be replicate spots within a slide?
II: Allocating samples in slides • Types of Samples • Replication: technical vs biological • Pooled vs individual samples • Different design layout / data analysis: • Scientific aim of the experiment • Efficiency, Robustness, Extensibility • Physical limitations (cost) : • Number of slides • Amount of material
Basic principles of experimental design • Apply the following principles to best attain the objectives of experimental design • Replication • Local control or Blocking • Randomization
1. Replication • It’s important • To reduce uncertainty (increase precision) • To obtain sufficient power for the tests • As a formal basis for inferential procedures • Consider different types of replicates • Technical • Duplicate spots • Multiple hybridizations from the same sample • Biological • Replicate most at the level that is expected to vary most!
Biological vs Technical Replicates @ Nature reviews & G. Churchill (2002)
Replication vs Pooling • mRNA from different samples is often combined to form a “pooled sample” or pool. Why? • If each sample doesn’t yield enough mRNA • To compensate for an excess of variability? • Statisticians tend not to like it, but pooling may be OK if properly done • Combine several samples in each pool • Use several pools from different samples • Do not use pools when individual information is important (e.g. paired designs)
2. Blocking • Assume we wish to perform an experiment to compare two treatments. • The samples or their processing may not be homogeneous: there are blocks • Subjects: Male/Female • Arrays produced in two lots (February, March) • If there are systematic differences between blocks, the effects of interest (e.g. treatment) may be confounded • Are observed differences attributable to the treatment effect or to confounding factors?
Confounding block with treatment effects • Two alternative designs to investigate treatment effects • Left: Treatment effects confounded with Sex and Batch effects • Right: Treatments are balanced between blocks • Influence of blocks is automatically compensated • Statistical analysis may separate block from treatment effect
3. Randomisation • Randomly assigning samples to groups to eliminate unspecific disturbances • Randomly assign individuals to treatments. • Randomise order in which experiments are performed. • Randomisation required to ensure validity of statistical procedures. • Block what you can and randomize what you cannot
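The rule "block what you can and randomize what you cannot" can be illustrated with a small allocation sketch. It assumes two treatment labels ("T" and "C") and blocks given as lists of sample identifiers; the function name, the sample names and the fixed seed are all hypothetical choices made for the illustration, not part of the original slides.

```python
import random

def randomize_within_blocks(blocks, treatments=("T", "C"), seed=0):
    """Randomly assign treatments to samples, separately inside each block.

    blocks: dict mapping a block name (e.g. sex or array lot) to a list of sample ids.
    Returns a dict mapping each sample id to a treatment label.
    """
    rng = random.Random(seed)  # fixed seed only so the example is reproducible
    allocation = {}
    for block_name, sample_ids in blocks.items():
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)  # randomization step
        # Deal the treatments out in turn so every block stays balanced.
        for position, sample_id in enumerate(shuffled):
            allocation[sample_id] = treatments[position % len(treatments)]
    return allocation

blocks = {
    "male": ["m1", "m2", "m3", "m4"],
    "female": ["f1", "f2", "f3", "f4"],
}
allocation = randomize_within_blocks(blocks)
for sample_id, treatment in sorted(allocation.items()):
    print(sample_id, treatment)
```

Because the shuffle happens inside each block, treatment is balanced across sex by construction, while the choice of which individual receives which treatment is left to chance.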
Experimental layout • How are mRNA samples assigned to arrays? • The experimental layout has to be chosen so that the resulting analysis is as efficient and robust as possible • Sometimes there is only one reasonable choice • Sometimes several choices are available
Example 1: Only one design choice • Case 1: Meaningful biological control (C). Samples: liver tissue from 4 mice treated by cholesterol modifying drugs. Question 1: genes that respond differently between the T and the C. Question 2: genes that responded similarly across two or more treatments relative to control. [Diagram: T1, T2, T3, T4, each linked to C] • Case 2: Use of universal reference. Samples: different tumor samples. Question: to discover tumor subtypes. [Diagram: T1, T2, …, Tn-1, Tn, each linked to Ref]
Example 2: A number of different designs are suitable for use (1) • Time course experiments • Design choice depends on the comparisons of interest [Figure: alternative layouts for four time points T1, T2, T3, T4, one of them using a common reference (Ref)]
How can we decide? • A-optimality: choose the design which minimizes the variance of the estimates of the effects of interest • A simple example: direct vs indirect estimates • Direct (A and B hybridized together): average(log(A/B)), variance σ²/2 • Indirect (A and B each hybridized against a reference R): log(A/R) - log(B/R), variance 2σ²
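The direct versus indirect comparison can be checked numerically with the usual least-squares result that the variance of a contrast c'β is σ² c'(X'X)⁻¹c. The sketch below assumes two slides per design, independent errors with common variance σ² on each measured log ratio, and hypothetical names (`contrast_variance`, the design matrices); it is an illustration of the idea, not the author's own calculation.

```python
import numpy as np

def contrast_variance(X, c):
    """Variance of the least-squares estimate of c'beta, in units of sigma^2."""
    X = np.asarray(X, dtype=float)
    c = np.asarray(c, dtype=float)
    return float(c @ np.linalg.inv(X.T @ X) @ c)

# Direct design: both slides measure log(A/B) directly.
# One parameter: delta = log(A/B).
X_direct = [[1.0],
            [1.0]]
var_direct = contrast_variance(X_direct, [1.0])

# Indirect design: slide 1 measures log(A/R), slide 2 measures log(B/R).
# Two parameters: a = log(A/R), b = log(B/R); the effect of interest is a - b.
X_indirect = [[1.0, 0.0],
              [0.0, 1.0]]
var_indirect = contrast_variance(X_indirect, [1.0, -1.0])

print("direct:  ", var_direct, "x sigma^2")
print("indirect:", var_indirect, "x sigma^2")
```

With the same number of slides, the indirect estimate is four times as variable as the direct one under these assumptions, which is the kind of comparison an optimality criterion formalizes.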
Summary • Selection of mRNA samples is important • Most important: biological replicates • Technical replicates also useful, but different • If needed and possible use pooling wisely • Choice of experimental layout guided by • The scientific question • Experimental design principles • Efficiency and robustness considerations • Correspondence between experimental Designs-Linear Models-ANOVA can be exploited to select model and analyze data
Experimental design, Linear Models and Analysis of the Variance • In experimental design the different sources of variability influencing the observed response may be identified. • These sources can be related to the response using a linear model • Analysis of the variance can be used to separately estimate and test the relative importance of each source of variability.
Statistical methods to detect differentially expressed genes
Class comparison: Identifying differentially expressed genes • Identify genes differentially expressed between different conditions such as • Treatment, cell type,... (qualitative covariates) • Dose, time, ... (quantitative covariate) • Survival, infection time,... ! • Estimate effects/differences between groups, typically using log-ratios, i.e. the difference on the log scale log(X)-log(Y) [=log(X/Y)]
What is a “significant change”? • Depends on the variability within groups, which may be different from gene to gene. • To assess the statistical significance of differences, conduct a statistical test for each gene.
Different settings for statistical tests • Indirect comparisons: 2 groups, 2 samples, unpaired • E.g. 10 individuals: 5 suffer diabetes, 5 healthy • One sample from each individual • Typically: Two sample t-test or similar • Direct comparisons: Two groups, two samples, paired • E.g. 6 individuals with brain stroke. • Two samples from each: one from healthy (region 1) and one from affected (region 2). • Typically: One sample t-test (also called paired t-test) or similar based on the individual differences between conditions.
Different ways to do the experiment • An experiment may use cDNA arrays (“two-colour”) or Affymetrix arrays (“one-colour”). • Depending on the technology used, the allocation of conditions to slides changes.
“Natural” measures of discrepancy For Direct comparisons in two colour or paired-one colour. For Indirect comparisons in two colour or Direct comparisons in one colour.
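The formulas on this slide did not survive in the transcript. As a hedged reconstruction, the standard choices for the two settings are written out below; the symbols (M, X, Y, s, n) are notation introduced here for illustration, and the unpaired statistic is shown in its unequal-variance form, which is one common option among several.

```latex
% Direct (paired) comparison: M_{gi} is the log ratio of gene g on replicate i, i = 1, ..., n
\bar{M}_g = \frac{1}{n}\sum_{i=1}^{n} M_{gi},
\qquad
t_g = \frac{\bar{M}_g}{s_g / \sqrt{n}}

% Indirect (unpaired) comparison: X_{gi} and Y_{gj} are log expression values
% of gene g in groups of size n_1 and n_2
\bar{X}_g - \bar{Y}_g,
\qquad
t_g = \frac{\bar{X}_g - \bar{Y}_g}{\sqrt{s_{X,g}^2 / n_1 + s_{Y,g}^2 / n_2}}
```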
Some Issues • Can we trust average effect sizes (average difference of means) alone? • Can we trust the t statistic alone? • Here is evidence that the answer is no. • Averages can be driven by outliers. Courtesy of Y.H. Yang
Some Issues • Can we trust average effect sizes (average difference of means) alone? • Can we trust the t statistic alone? • Here is evidence that the answer is no. • t’s can be driven by tiny variances. Courtesy of Y.H. Yang
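Both failure modes can be reproduced with two made-up genes. The numbers below are invented purely for illustration (they are not from Y.H. Yang's figures): one gene has a single outlying log ratio, the other has a negligible mean but almost no spread.

```python
import numpy as np

def one_sample_t(values):
    """One-sample t-statistic for the mean of a vector of log ratios."""
    values = np.asarray(values, dtype=float)
    standard_error = values.std(ddof=1) / np.sqrt(values.size)
    return values.mean() / standard_error

# Hypothetical log ratios across five replicate slides.
outlier_gene = [0.1, -0.1, 0.0, 0.1, 4.0]           # one extreme value
tiny_variance_gene = [0.05, 0.06, 0.05, 0.06, 0.05]  # small effect, tiny spread

mean_outlier = float(np.mean(outlier_gene))
mean_tiny = float(np.mean(tiny_variance_gene))
t_outlier = one_sample_t(outlier_gene)
t_tiny = one_sample_t(tiny_variance_gene)

print("outlier gene:       mean =", round(mean_outlier, 3), " t =", round(t_outlier, 2))
print("tiny-variance gene: mean =", round(mean_tiny, 3), " t =", round(t_tiny, 2))
```

Ranking by the average favours the gene with the outlier, while ranking by t favours the gene whose change is biologically negligible, which is why neither statistic should be used alone.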
Variations in t-tests (1) • Let • Rg: mean observed log ratio • SEg: standard error of Rg estimated from data on gene g. • SE: standard error of Rg estimated from data across all genes. • Global t-test: t = Rg/SE • Gene-specific t-test: t = Rg/SEg
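A minimal sketch of the two variants follows. The slide does not say how the across-genes standard error is obtained; the code assumes it comes from averaging the per-gene variances, which is only one possible choice. The function name and the simulated data are hypothetical.

```python
import numpy as np

def t_variants(M):
    """Return (global_t, gene_specific_t) for a genes x replicates matrix of log ratios."""
    M = np.asarray(M, dtype=float)
    k = M.shape[1]                      # number of replicates per gene
    R = M.mean(axis=1)                  # Rg: mean log ratio per gene
    gene_variances = M.var(axis=1, ddof=1)
    se_gene = np.sqrt(gene_variances / k)            # SEg: from gene g only
    se_global = np.sqrt(gene_variances.mean() / k)   # SE: assumption, pooled over genes
    return R / se_global, R / se_gene

rng = np.random.default_rng(0)
M = rng.normal(loc=0.0, scale=0.5, size=(100, 4))  # simulated null data
global_t, gene_t = t_variants(M)
print(global_t[:3], gene_t[:3])
```

The global version ranks genes exactly as the mean log ratio does, since every gene shares the same denominator; the gene-specific version lets each gene's own variability decide, at the cost of an unstable denominator when there are few replicates.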
T-tests extensions SAM (Tibshirani, 2001) Regularized-t (Baldi, 2001) EB-moderated t (Smyth, 2003)
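These extensions share one idea: stabilize the gene-specific denominator by borrowing information from the other genes. As a rough sketch in the spirit of SAM, a constant s0 can be added to each gene's standard error. SAM chooses s0 by its own data-driven criterion; here a percentile of the standard errors is used as a stand-in, and the function name, the percentile default and the data are all assumptions for illustration.

```python
import numpy as np

def sam_like_statistic(M, s0_percentile=50):
    """Return (ordinary t, regularized d) for a genes x replicates matrix of log ratios."""
    M = np.asarray(M, dtype=float)
    k = M.shape[1]
    R = M.mean(axis=1)
    se = M.std(axis=1, ddof=1) / np.sqrt(k)
    s0 = np.percentile(se, s0_percentile)  # assumption: stand-in for SAM's choice of s0
    t_gene = R / se
    d = R / (se + s0)                      # inflated denominator damps tiny-variance genes
    return t_gene, d

# Hypothetical data: the first gene has a tiny effect and a tiny variance.
M = np.array([
    [0.05, 0.06, 0.05, 0.06],
    [1.20, 0.80, 1.50, 0.90],
    [-0.30, 0.40, 0.10, -0.20],
    [2.10, 1.70, 2.40, 1.90],
])
t_gene, d = sam_like_statistic(M)
print("top gene by t:", int(np.argmax(np.abs(t_gene))))
print("top gene by d:", int(np.argmax(np.abs(d))))
```

In this toy example the tiny-variance gene tops the ordinary t ranking but drops once the denominator is regularized, which is the behaviour these extensions aim for.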
Up to here…: Can we generate a list of candidate genes? • With the tools we have, the reasonable steps to generate a list of candidate genes may be: • Data: Gene 1: M11, M12, …, M1k; Gene 2: M21, M22, …, M2k; …; Gene G: MG1, MG2, …, MGk • For every gene, calculate Si = t(Mi1, Mi2, …, Mik), e.g. t-statistics, S, B, … • Statistics of interest: S1, S2, …, SG • ? • A list of candidate DE genes • We need an idea of how significant these values are • We’d like to assign them p-values
Nominal p-values • After a test statistic is computed, it is convenient to convert it to a p-value: the probability that a test statistic, say S(X), takes values equal to or greater than that taken on the observed sample, say S(X0), under the assumption that the null hypothesis is true: p = P{S(X) ≥ S(X0) | H0 true}
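The definition can be turned into a calculation once a null distribution is fixed. The sketch below assumes, purely for illustration, that the statistic is standard normal under H0; the function names are hypothetical, and the two-sided version is only valid for a null distribution symmetric about zero.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def upper_tail_p_value(observed, null=STANDARD_NORMAL):
    """p = P{S(X) >= S(X0) | H0}, for a statistic with the given null distribution."""
    return 1.0 - null.cdf(observed)

def two_sided_p_value(observed, null=STANDARD_NORMAL):
    """Two-sided version, assuming the null distribution is symmetric about zero."""
    return 2.0 * (1.0 - null.cdf(abs(observed)))

print(upper_tail_p_value(1.96))
print(two_sided_p_value(1.96))
```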
Significance testing • Test of significance at the α level: • Reject the null hypothesis if the p-value is smaller than the significance level α • It has advantages but is not free from criticism • Genes with p-values falling below a prescribed level may be regarded as significant
Calculation of p-values • Standard methods for calculating p-values: (i) Refer to a statistical distribution table (Normal, t, F, …) or (ii) Perform a permutation analysis
(i) Tabulated p-values • Tabulated p-values can be obtained for standard test statistics (e.g. the t-test) • They often rely on the assumption of normally distributed errors in the data • This assumption can be checked (approximately) using a • Histogram • Q-Q plot
Example Golub data, 27 ALL vs 11 AML samples, 3051 genes A t-test yields 1045 genes with p< 0.05
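To put the 1045 genes in perspective, it helps to compute how many genes would be expected below the threshold by chance alone. The short calculation assumes the extreme case in which no gene is differentially expressed, so every p-value is uniform under the null; the variable names are illustrative.

```python
n_genes = 3051      # genes tested in the Golub example
alpha = 0.05        # nominal significance level
n_significant = 1045

# If every null hypothesis were true, each test would reject with probability alpha.
expected_false_positives = n_genes * alpha

print("expected by chance:", expected_false_positives)
print("observed:", n_significant)
```

About 150 rejections would be expected even with no real differences, so the nominal p-values cannot be read gene by gene; this is what motivates the multiple testing topic listed in the outline.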
(ii) Permutation tests • Based on data shuffling. Few distributional assumptions (the sample labels must be exchangeable under the null hypothesis) • Random interchange of labels between samples • Estimate p-values for each comparison (gene) by using the permutation distribution of the t-statistics • Repeat for every possible permutation (or for a random subset of them), b = 1…B: • Permute the n data points for the gene (x). The first n1 are referred to as “treatments”, the remaining n2 as “controls” • For each gene, calculate the corresponding two-sample t-statistic, tb • After all the B permutations are done, put p = #{b: |tb| ≥ |tobserved|}/B
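The procedure can be sketched for a single gene as follows. Instead of enumerating every permutation, the code draws B random permutations, a common Monte Carlo shortcut; it also uses the unequal-variance form of the two-sample t-statistic. The function names, the value of B, the seed and the data are hypothetical.

```python
import numpy as np

def two_sample_t(x, y):
    """Two-sample t-statistic (unequal-variance form)."""
    return (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1) / x.size + y.var(ddof=1) / y.size)

def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Estimate p = #{b: |t_b| >= |t_observed|} / B from random label permutations."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    n1 = x.size
    t_observed = two_sample_t(x, y)
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        # First n1 values play the role of "treatments", the rest of "controls".
        t_b = two_sample_t(shuffled[:n1], shuffled[n1:])
        if abs(t_b) >= abs(t_observed):
            count += 1
    return count / n_permutations

# Hypothetical log expression values for one gene.
treatments = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]
controls = [1.0, 1.2, 0.9, 1.1, 0.8, 1.05]
p_value = permutation_p_value(treatments, controls)
print("permutation p-value:", p_value)
```

For a full analysis the same loop would be run for every gene, ideally reusing the same label permutations across genes so that the dependence between genes is preserved.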