Supervised microarray data analysis • Mark van de Wiel
Quality control
• Protocols
• Perform a small-scale, well-controlled experiment to assess the influence of experimental factors (microarrays from different batches, printing tips, dyes, linearity of the scanner, etc.)
• Continuous factors (temperature, humidity, spot size over time, intensity of a control spot over time) can be monitored with standard control chart techniques.
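
As an illustration of the control chart idea, here is a minimal sketch of a Shewhart-style individuals chart in Python. The intensity values, the function names and the 3-sigma rule are assumptions chosen for the example; the slides do not prescribe a specific chart.

```python
from statistics import mean, stdev

def control_limits(baseline, n_sigma=3.0):
    """Return (lower, center, upper) limits estimated from trusted baseline readings."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation of the baseline
    return center - n_sigma * spread, center, center + n_sigma * spread

def out_of_control(readings, limits):
    """Return the indices of readings that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, x in enumerate(readings) if x < lower or x > upper]

# Hypothetical control-spot intensities (arbitrary units) from earlier, trusted runs.
baseline = [100.2, 99.5, 101.1, 100.4, 99.8, 100.9, 99.1, 100.0]
limits = control_limits(baseline)

# Hypothetical new hybridisations to be monitored.
new_runs = [100.3, 99.7, 104.5, 100.1]
flagged = out_of_control(new_runs, limits)
print(flagged)
```

A flagged run is a prompt to look at the lab notes for that hybridisation, not proof that the array is unusable.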
Design of the experiment
• Think very carefully about what the biological goals are.
• What software do you have at your disposal to analyse the data?
• Do we need a reference sample or not?
• ‘Biological design’: which tissues to combine on an array (cDNA)? More than one biological factor: factorial design.
• Dye bias: dye-swap.
• Design on the array: negative/positive controls, repeats, how many genes? Run a pilot study first; distribute the repeats over experimental factors (spatial position, printing tips, etc.).
• Save some space on the (cDNA) microarray for assessing variability due to experimental factors (e.g. print the same control gene with several printing tips).
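
Why a dye-swap helps can be shown with a little arithmetic. The sketch below assumes that the dye bias for a gene is an additive constant on the log-ratio scale; that model and the function name are assumptions for illustration only.

```python
def dye_swap_estimate(m_forward, m_swapped):
    """Combine log-ratios log2(Cy5/Cy3) from a dye-swapped pair of arrays.

    Assumed model (not from the slides):
        m_forward =  effect + dye_bias
        m_swapped = -effect + dye_bias
    """
    effect = (m_forward - m_swapped) / 2.0    # dye bias cancels in the difference
    dye_bias = (m_forward + m_swapped) / 2.0  # true effect cancels in the sum
    return effect, dye_bias

# Hypothetical gene: true log2 fold change 1.0, dye bias 0.3.
effect, dye_bias = dye_swap_estimate(1.0 + 0.3, -1.0 + 0.3)
print(effect, dye_bias)
```

Without the swapped array, the single log-ratio 1.3 would confound the biological effect with the dye bias.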
Analysis: Multiple testing (after normalization)
• Objective: control the number of falsely selected genes
• FWE: family-wise error rate
  • Weak FWE control: P(falsely select at least one gene i, i = 1, ..., 20,000 | no gene truly expressed) is kept below the significance level
  • Strong FWE control: P(falsely select at least one gene i, i = 1, ..., 20,000 | some genes expressed, some genes not expressed) is kept below the significance level
• FDR: false discovery rate
  • F: number of false rejections (genes selected although not expressed), T: total number of rejections
  • FDR control: bound the expected proportion E[F/T] (with F/T taken as 0 when T = 0)
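
To make the two error criteria concrete, here is a minimal sketch of one well-known procedure for each: Bonferroni (strong FWE control) and Benjamini-Hochberg (FDR control, valid for independent or positively dependent tests). The slides do not name these two procedures; they are swapped in as simple examples, and the p-values are made up.

```python
def bonferroni(pvalues, alpha=0.05):
    """Indices of genes selected with the Bonferroni correction (FWE control)."""
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= alpha / m]

def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of genes selected with the Benjamini-Hochberg step-up rule (FDR control)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose ordered p-value lies under the BH line.
        if pvalues[i] <= alpha * rank / m:
            k_max = rank
    return sorted(order[:k_max])

# Hypothetical p-values for ten genes.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
fwe_selected = bonferroni(pvalues)
fdr_selected = benjamini_hochberg(pvalues)
print(fwe_selected, fdr_selected)
```

On these numbers the FDR rule selects one gene more than the FWE rule, which is the typical direction of the difference.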
Multiple testing: FWE vs FDR
• Control of the FDR implies weak control of the FWE
• Advantage of strong FWE control: the significance level is controlled under all situations
• Disadvantage: less power than FDR control
• FWE-based procedures tend to select fewer genes than FDR-based procedures
• Software:
  • Bioconductor: step-down Westfall-Young (Dudoit et al.), control of FDR and FWE
  • SAM (permutation-based ‘control’ of FDR)
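
The permutation idea behind Westfall-Young type procedures can be sketched in a few lines. Note the simplifications: this is the single-step maxT variant, not the step-down version available in Bioconductor, and it uses an absolute difference of group means instead of a t-statistic. Data, function names and the number of permutations are hypothetical.

```python
import random
from statistics import mean

def abs_mean_diff(values, labels):
    """Absolute difference between the means of group 0 and group 1."""
    a = [v for v, g in zip(values, labels) if g == 0]
    b = [v for v, g in zip(values, labels) if g == 1]
    return abs(mean(a) - mean(b))

def maxt_adjusted_pvalues(data, labels, n_perm=200, seed=0):
    """Single-step maxT adjusted p-values from random label permutations."""
    rng = random.Random(seed)
    observed = [abs_mean_diff(row, labels) for row in data]
    exceed = [0] * len(data)
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)
        # Largest statistic over all genes under this relabelling.
        max_stat = max(abs_mean_diff(row, perm) for row in data)
        for g, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[g] += 1
    # Add-one correction keeps the adjusted p-values strictly positive.
    return [(c + 1) / (n_perm + 1) for c in exceed]

# Hypothetical log-expression data: 3 genes, 4 samples per group.
data = [
    [0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9, 5.05],   # clearly different between groups
    [0.2, -0.1, 0.0, 0.1, 0.1, 0.0, -0.2, 0.15],   # noise
    [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.1],    # noise
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
adjusted = maxt_adjusted_pvalues(data, labels)
print(adjusted)
```

Because every gene is compared with the maximum over all genes, the adjustment accounts for the number of tests and for the dependence between genes. With unstandardised statistics the maximum is dominated by high-variance genes, which is one reason real implementations standardise.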
SAM
• Developed at Stanford, Tibshirani et al. (Paper: Tusher et al., PNAS 98, 5116-5121). The claim is FDR control.
• Plus:
  • Ease of use, add-in to Excel
  • Allows asymmetric cut-offs
• Minus:
  • The distribution under the null hypothesis (‘no expression’) needs to be the same for all genes to guarantee FDR control
  • Combination with a k-fold rule: no control of FDR anymore
• Solutions:
  • Use (normal) rank scores and a simple rank statistic
  • Explicitly test on k-fold expression; combine with an FDR criterion
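
For orientation, the gene-wise statistic that SAM is built on can be sketched as a t-like ratio with a small constant s0 added to the denominator. This is a reading of the Tusher et al. paper and not the SAM software itself; the function name, the example data and the choice s0 = 1 are assumptions.

```python
import math
from statistics import mean

def sam_statistic(group_a, group_b, s0=0.0):
    """Relative difference d = (mean_a - mean_b) / (s + s0) for one gene.

    s is the pooled standard error of the mean difference; s0 is a small
    positive constant that damps d for genes with very small s.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    ss = sum((x - mean_a) ** 2 for x in group_a) + sum((x - mean_b) ** 2 for x in group_b)
    s = math.sqrt((1.0 / n_a + 1.0 / n_b) * ss / (n_a + n_b - 2))
    return (mean_a - mean_b) / (s + s0)

# Hypothetical expression values for one gene in two conditions.
d_plain = sam_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])           # ordinary pooled t
d_damped = sam_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], s0=1.0)  # with fudge constant
print(d_plain, d_damped)
```

With s0 = 0 this reduces to the ordinary two-sample t-statistic. The permutation null distribution is pooled over genes, which is where the requirement of a common null distribution for all genes comes from.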
Modelling vs Normalisation + Testing
• Modelling forces you to state what the assumptions are (linearity, normality, independence, etc.)
• Normalisation steps may not be commutative
• Non-linearities can be dealt with by normalisation methods
• Advanced modelling requires the help of a statistician/bioinformatician
• Standard approach to modelling: ANOVA. The model has two levels:
  • Normalisation level, which includes linear corrections for dye and microarray effects
  • Gene expression level, which includes effects at the gene level, including interactions (the interaction of interest is usually gene*variety)
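
A minimal sketch of the normalisation level of such a two-level ANOVA: estimate the overall mean plus array and dye main effects from the log intensities and pass the residuals on to the gene-level model. It assumes a balanced design (every array carries both dyes and all genes), so that marginal means give the least-squares main effects. All names and numbers are hypothetical.

```python
from statistics import mean

def normalisation_residuals(obs):
    """Remove overall mean, array and dye main effects from log intensities.

    obs: list of (array, dye, gene, log_intensity) tuples from a balanced design.
    Returns the same tuples with log_intensity replaced by the residual.
    """
    mu = mean(y for _a, _d, _g, y in obs)
    arrays = {a for a, _d, _g, _y in obs}
    dyes = {d for _a, d, _g, _y in obs}
    array_eff = {a: mean(y for a2, _d, _g, y in obs if a2 == a) - mu for a in arrays}
    dye_eff = {d: mean(y for _a, d2, _g, y in obs if d2 == d) - mu for d in dyes}
    return [(a, d, g, y - mu - array_eff[a] - dye_eff[d]) for a, d, g, y in obs]

# Hypothetical balanced data: 2 arrays x 2 dyes x 2 genes, built without noise.
A = {0: 1.0, 1: -1.0}             # array effects
D = {"Cy3": 0.5, "Cy5": -0.5}     # dye effects
G = {"g1": 0.2, "g2": -0.2}       # gene-level signal left for the second level
obs = [(a, d, g, 10.0 + A[a] + D[d] + G[g]) for a in A for d in D for g in G]

residuals = normalisation_residuals(obs)
print(residuals)
```

The gene expression level would then model these residuals with gene effects and interactions such as gene*variety.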
Software
• Freeware: SAM, Bioconductor
• Specialized commercial software: Spotfire, Genespring, Genesight, Rosetta
  • Most contain: normalisation, variance-stabilizing transformations, ANOVA, testing (most do not yet include the advanced multiple testing criteria)
• Statistical software: SAS, S-Plus, SPSS
  • Much more debugged, long history, better documentation (it is often very unclear what the specialized packages really do)
• Advantages of specialized software: user-friendly, visualisation (nice pictures), link with databases, annotation
• Try several!
Bayesian models
• + Natural translation to networks (pathways)
• + Complex models (linearity is not necessary, interactions)
• + Prior biological knowledge can be included
• + Nesting of the models (image analysis + normalisation + gene expression)
• + Inference for complex functions of gene expression data is relatively easy
• - No ‘easy’ software
• - Computational methods may take time to find reliable estimates
• Example network
Validation
• Cross-validation: leave some data out and see how well the left-out values are predicted by the model (note that for normalisation procedures it may be harder to predict the data from the normalized data)
• Biological validation (spikes: known concentrations)
  • Very useful for validating the normalisation procedure or the model:
  • Pretend that spikes with equal concentrations that are used under different conditions (different dyes, microarray batch) are different quantities.
  • Estimate the ratio of the two estimates after normalisation or modelling.
  • The ratio should be approximately equal to 1.
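
The spike check can be written down directly. The sketch below compares normalised spike estimates obtained under two conditions and tests whether their geometric-mean ratio is close to 1. The 10% tolerance, the function names and the intensities are assumptions for illustration.

```python
import math
from statistics import mean

def spike_ratio(estimates_a, estimates_b):
    """Geometric-mean ratio of paired spike estimates from two conditions."""
    log_ratios = [math.log(a / b) for a, b in zip(estimates_a, estimates_b)]
    return math.exp(mean(log_ratios))

def passes_spike_check(estimates_a, estimates_b, tolerance=0.1):
    """True if the ratio deviates from 1 by at most the given tolerance."""
    return abs(spike_ratio(estimates_a, estimates_b) - 1.0) <= tolerance

# Hypothetical normalised intensities of the same spikes measured with two dyes.
cy3 = [1000.0, 2000.0, 4000.0]
cy5_good = [1020.0, 1960.0, 4080.0]   # small residual differences
cy5_bad = [1500.0, 3000.0, 6000.0]    # systematic 1.5-fold dye effect left in

print(passes_spike_check(cy3, cy5_good), passes_spike_check(cy3, cy5_bad))
```

A ratio far from 1 for equal-concentration spikes indicates that the normalisation or the model has not removed the experimental effect.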
Comparison and meta-analysis
• Objective comparisons between methods are very much needed!
• Simulations may help (because then we know the truth). Setting up realistic simulations may be hard!
• Competition between several methods (CAMDA ’03: lung cancer)
• Future goals:
  • Methods that allow for combining data from several experiments
  • From relative quantities to absolute quantities
  • Absolute quantities allow for direct comparison between labs (otherwise, only if labs have used the same reference material etc.)
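
A toy simulation shows what "knowing the truth" buys: since we decide which genes are differentially expressed, the realised proportion of false selections can be computed exactly. Everything here is an assumption made for the sketch (independent genes, unit-variance normal noise, a z-test with known variance, the effect size, the cut-off). Real microarray data are far less tidy, which is the point of the warning above.

```python
import math
import random

def simulate_pvalues(n_genes=1000, n_true=100, effect=3.0, n_per_group=4, seed=1):
    """Simulate two-group data; return (truth flags, two-sided z-test p-values)."""
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n_per_group)  # std. error of a mean difference, unit variance
    truth, pvalues = [], []
    for g in range(n_genes):
        is_de = g < n_true                      # first n_true genes are truly changed
        shift = effect if is_de else 0.0
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(shift, 1.0) for _ in range(n_per_group)]
        z = (sum(b) / n_per_group - sum(a) / n_per_group) / se
        pvalues.append(math.erfc(abs(z) / math.sqrt(2.0)))  # two-sided normal p-value
        truth.append(is_de)
    return truth, pvalues

def false_discovery_proportion(truth, selected):
    """Fraction of selected genes that are not truly changed (0 if none selected)."""
    if not selected:
        return 0.0
    return sum(1 for i in selected if not truth[i]) / len(selected)

truth, pvalues = simulate_pvalues()
selected = [i for i, p in enumerate(pvalues) if p <= 0.001]
fdp = false_discovery_proportion(truth, selected)
print(len(selected), fdp)
```

Repeating this over many seeds and over several selection methods gives an objective comparison of their error rates and power under the simulated conditions.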
Useful overview papers, books
• Design: Churchill, G.A. (2002) Fundamentals of experimental design for cDNA microarrays. Nature Genet. 32, 490-495
• Analysis: Slonim, D.K. (2002) From patterns to pathways: gene expression data analysis comes of age. Nature Genet. 32, 502-508
• Normalisation: Quackenbush, J. (2002) Microarray data normalization and transformation. Nature Genet. 32, 496-501
• Pitfalls: Simon, R. et al. (2003) Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 95, 14-18
• Books:
  • Baldi & Hatfield (2002) DNA Microarrays and Gene Expression. Cambridge University Press
  • Speed, T. (2003) Statistical Analysis of Gene Expression Microarray Data. Chapman & Hall
• Acknowledgement: Nicola Armstrong (EURANDOM)