BIOL6900 Chapter 9 Gene expression: Microarray data analysis
Outline: microarray data analysis Gene expression Microarrays Preprocessing normalization scatter plots Inferential statistics t-test ANOVA Exploratory (descriptive) statistics distances clustering principal components analysis (PCA)
Microarrays: tools for gene expression A microarray is a solid support (such as a membrane or glass microscope slide) on which DNA of known sequence is deposited in a grid-like array. Page 312
Microarrays: tools for gene expression The most common form of microarray is used to measure gene expression. RNA is isolated from matched samples of interest. The RNA is typically converted to cDNA, labeled with fluorescence (or radioactivity), then hybridized to microarrays in order to measure the expression levels of thousands of genes. Page 312
Advantages of microarray experiments
Fast: data on >20,000 transcripts within weeks
Comprehensive: entire yeast or mouse genome on a chip
Flexible: custom arrays can be made to represent genes of interest
Easy: submit RNA samples to a core facility
Cheap? Chip representing 20,000 genes for $300
Table 8-10 Page 313
Disadvantages of microarray experiments
Cost
■ Some researchers can’t afford to do appropriate numbers of controls, replicates
RNA significance
■ The final product of gene expression is protein
■ “Pervasive transcription” of the genome is poorly understood (ENCODE project)
■ There are many noncoding RNAs not yet represented on microarrays
Quality control
■ Impossible to assess elements on array surface
■ Artifacts with image analysis
■ Artifacts with data analysis
■ Not enough attention to experimental design
■ Not enough collaboration with statisticians
Table 8-11 Page 313
Sample acquisition → Data acquisition → Data analysis → Data confirmation → Biological insight Fig. 8.17 Page 314
Stage 1: Experimental design
Stage 2: RNA and probe preparation
Stage 3: Hybridization to DNA arrays
Stage 4: Image analysis
Stage 5: Microarray data analysis
Stage 6: Biological confirmation
Stage 7: Microarray databases
Fig. 8.17 Page 314
ArrayExpress at the European Bioinformatics Institute http://www.ebi.ac.uk/arrayexpress/
MIAME In an effort to standardize microarray data presentation and analysis, Alvis Brazma and colleagues at 17 institutions introduced Minimum Information About a Microarray Experiment (MIAME). The MIAME framework standardizes six areas of information: ►experimental design ►microarray design ►sample preparation ►hybridization procedures ►image analysis ►controls for normalization Visit http://www.mged.org Page 319
Interpretation of RNA analyses The relationship of DNA, RNA, and protein: DNA is transcribed to RNA. RNA quantities and half-lives vary. There tends to be a low positive correlation between RNA and protein levels. The pervasive nature of transcription: The Encyclopedia of DNA Elements (ENCODE) project identified functional features of genomic DNA, initially in 30 megabases (1% of the human genome). One of its observations was the “pervasive nature of transcription”: the vast majority of DNA is transcribed, although the function is unknown. Page 320
Outline: microarray data analysis Gene expression Microarrays Preprocessing normalization scatter plots Inferential statistics t-test ANOVA Exploratory (descriptive) statistics distances clustering principal components analysis (PCA) Begin Chapter 9!
Microarray data analysis • begin with a data matrix (gene expression values versus samples) genes (RNA transcript levels) Fig. 9.1 Page 333
Microarray data analysis • begin with a data matrix (gene expression values versus samples) Typically, there are many genes (>> 20,000) and few samples (~ 10) Fig. 9.1 Page 333
Microarray data analysis • begin with a data matrix (gene expression values versus samples) Preprocessing Inferential statistics Descriptive statistics Fig. 9.1 Page 333
Microarray data analysis: preprocessing
Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:
• different labeling efficiencies of Cy3, Cy5
• uneven spotting of DNA onto an array surface
• variations in RNA purity or quantity
• variations in washing efficiency
• variations in scanning efficiency
Page 337
Microarray data analysis: preprocessing The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription. A basic assumption of most normalization procedures is that the average gene expression level does not change in an experiment. Page 337
Data analysis: global normalization Global normalization is used to correct two or more data sets. In one common scenario, samples are labeled with Cy3 (green dye) or Cy5 (red dye) and hybridized to DNA elements on a microarray. After washing, probes are excited with a laser and detected with a scanning confocal microscope. Page 343
Data analysis: global normalization Global normalization is used to correct two or more data sets. Example: total fluorescence in the Cy3 channel = 4 million units and in the Cy5 channel = 2 million units. Then the uncorrected ratio for a gene could show 2,000 units versus 1,000 units. This would artifactually appear to show 2-fold regulation. Page 343
Data analysis: global normalization
Global normalization procedure
Step 1: subtract background intensity values (use a blank region of the array)
Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets)
Page 343
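The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `global_normalize`, the single background value per channel, and the intensity values are all hypothetical, and dividing by the mean ratio is just one simple way to make the average ratio equal 1.

```python
def global_normalize(cy5, cy3, bg5=0.0, bg3=0.0):
    """Return Cy5/Cy3 ratios rescaled so that their mean is 1 (illustrative)."""
    # Step 1: subtract the background intensity from each channel.
    red = [v - bg5 for v in cy5]
    green = [v - bg3 for v in cy3]
    # Raw ratio per gene (assumes background-corrected Cy3 values are > 0).
    raw = [r / g for r, g in zip(red, green)]
    # Step 2: divide by the mean ratio so the average ratio becomes 1.
    mean_ratio = sum(raw) / len(raw)
    return [x / mean_ratio for x in raw]


# Hypothetical data: the Cy3 channel is uniformly twice as bright as Cy5,
# so every uncorrected ratio is 0.5 even though nothing is regulated.
cy3 = [2100.0, 4100.0, 8100.0, 1100.0]
cy5 = [1050.0, 2050.0, 4050.0, 550.0]
normalized = global_normalize(cy5, cy3, bg5=50.0, bg3=100.0)
```

With these made-up values every normalized ratio comes out as 1, so the apparent 2-fold difference between channels disappears.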
Scatter plots
• Useful to represent gene expression values from two microarray experiments (e.g. control, experimental)
• Each dot corresponds to a gene expression value
• Most dots fall along a line
• Outliers represent up-regulated or down-regulated genes
Page 338
Differential Gene Expression in Different Tissue and Cell Types [Figure panels: fibroblast, brain, astrocyte, astrocyte]
[Figure: scatter plot of expression level (sample 2) versus expression level (sample 1), each running from low to high; genes away from the diagonal are labeled up or down] Fig. 9.2 Page 338
Log-log transformation Fig. 9.2 Page 338
Scatter plots
Typically, data are plotted on log-log coordinates.
Visually, this spreads out the data and offers symmetry.

time    behavior      raw ratio value   log2 ratio value
t=0     basal         1.0               0.0
t=1h    no change     1.0               0.0
t=2h    2-fold up     2.0               1.0
t=3h    2-fold down   0.5               -1.0
Page 339
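The symmetry of the log2 scale can be checked directly. A minimal Python sketch that reuses the four time points from the table (the variable names are illustrative):

```python
import math

# Raw expression ratios at each time point, taken from the table above.
raw_ratios = {"t=0": 1.0, "t=1h": 1.0, "t=2h": 2.0, "t=3h": 0.5}

# On the log2 scale, 2-fold up and 2-fold down are equally far from zero.
log2_ratios = {time: math.log2(ratio) for time, ratio in raw_ratios.items()}
```

On the raw scale, down-regulation is squeezed between 0 and 1 while up-regulation is unbounded; after the log2 transform both directions occupy the same range.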
[Figure: log ratio (up, down) plotted against mean log intensity (expression level, low to high)] Fig. 9.2 Page 338
You can make these plots in Excel, but for many bioinformatics applications use R. Visit http://www.r-project.org to download it. See chapter 9 (p. 341, 371) for a tutorial on microarray data analysis.
SNOMAD converts array data to scatter plots http://www.snomad.org [Figure: a linear-linear plot and a log-log plot of EXP versus CON, and a plot of Log10(Ratio) versus Mean(Log10(Intensity)); lines mark 2-fold changes, and regions are labeled EXP > CON and EXP < CON] Page 344
Robust multi-array analysis (RMA)
• Developed by Rafael Irizarry (Dept. of Biostatistics), Terry Speed, and others
• Available at www.bioconductor.org as an R package
• Also available in various software packages (including Partek, www.partek.com and Iobion Gene Traffic)
• See Bolstad et al. (2003) Bioinformatics 19; Irizarry et al. (2003) Biostatistics 4
• There are three steps:
[1] Background adjustment based on a normal plus exponential model (no mismatch data are used)
[2] Quantile normalization (nonparametric fitting of signal intensity data to normalize their distribution)
[3] Fitting a log scale additive model robustly. The model is additive: probe effect + sample effect
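Step [2] is the easiest of the three to illustrate. The sketch below shows the general idea of quantile normalization in plain Python; it is not the Bioconductor implementation, it skips steps [1] and [3], it breaks ties by position rather than averaging them, and the function name and data are hypothetical.

```python
def quantile_normalize(arrays):
    """Give every array the same intensity distribution (illustrative sketch).

    arrays: list of equal-length lists, one list of gene intensities per array.
    """
    n_genes = len(arrays[0])
    # Sort each array, then average the values that share the same rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n_genes)
    ]
    normalized = []
    for a in arrays:
        # Gene indices ordered from lowest to highest intensity in this array.
        order = sorted(range(n_genes), key=lambda i: a[i])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            # Replace each value by the across-array mean for its rank.
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized


# Three hypothetical arrays measuring the same four genes.
arrays = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
result = quantile_normalize(arrays)
```

After normalization each array contains exactly the same set of values; only the assignment of values to genes differs, which preserves the rank order of genes within each array.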
Histograms of raw intensity values for 14 arrays (plotted in R) before and after RMA was applied. [Figure axes: log signal intensity versus array] Page 342
Precision: reproducibility of the result. Accuracy: quality of the result (relative to a gold standard). Good performance: precision with accuracy. Page 344
Robust multi-array analysis (RMA) RMA offers a large increase in precision (relative to Affymetrix MAS 5.0 software). [Figure: precision of MAS 5.0 and RMA; log expression SD plotted against average log expression] Page 345
Robust multi-array analysis (RMA) RMA offers comparable accuracy to MAS 5.0. [Figure: accuracy; observed log expression plotted against log nominal concentration] Page 345
Inferential statistics Inferential statistics are used to make inferences about a population from a sample. Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference. We use a test statistic to decide whether or not to reject the null hypothesis. For many applications, we set the significance level α to 0.05 and call a result significant when p < 0.05. Page 346
Inferential statistics
A t-test is a commonly used test statistic to assess the difference in mean values between two groups.

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{difference between mean values}}{\text{variability (standard error of the difference)}} $$

Questions
Is the sample size (n) adequate?
Are the data normally distributed?
Is the variance of the data known?
Is the variance the same in the two groups?
Is it appropriate to set the significance level to p < 0.05?
Page 348
Inferential statistics
A t-test is a commonly used test statistic to assess the difference in mean values between two groups.

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{difference between mean values}}{\text{variability (standard error of the difference)}} $$

Notes
• t is a ratio (it thus has no units)
• We assume the two populations are Gaussian
• The two groups may be of different sizes
• Obtain a P value from t using a table
• For a two-sample t test, the degrees of freedom is N - 2, where N is the total number of observations in the two groups. For any value of t, P gets smaller as df gets larger
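As a worked illustration of the formula and notes above, the following Python sketch computes t and the degrees of freedom for two small groups. It assumes the equal-variance (pooled) form of the two-sample test, which is the form for which df = N - 2; the function name and data are hypothetical, and the P value is left to a table or a statistics package.

```python
import math
from statistics import mean, variance


def two_sample_t(group1, group2):
    """Return (t, df) for an unpaired, equal-variance two-sample t-test."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance: assumes both groups share the same true variance.
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (
        n1 + n2 - 2
    )
    # Standard error of the difference between the two means.
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    t = (mean(group1) - mean(group2)) / se
    return t, n1 + n2 - 2


# Hypothetical expression values for one gene in three controls and three
# disease samples.
control = [10.0, 12.0, 11.0]
disease = [20.0, 22.0, 21.0]
t, df = two_sample_t(disease, control)
```

Here the means differ by 10 units and the standard error is about 0.82, so t is roughly 12.2 with 4 degrees of freedom.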
Analyzing expression data in Excel Question: for each of my 20,000 transcripts, decide whether it is significantly regulated in some disease. [1] Obtain a matrix of genes (rows) and expression values (columns), with the control and disease samples in separate columns. Here there are 20,000 rows of genes of which the first six are shown. There are three control samples and three disease samples. You can also calculate the mean value for each gene (transcript) for the controls and the disease (experimental) samples.
Analyzing expression data in Excel [2] You can calculate the ratios of control versus disease. Note that you can use the formula =E5/I5 in this case to divide the mean control and disease values. Also note that some ratios, such as 2.00, appear to be dramatic while others are not. Some researchers set a cut-off for changes of interest such as two-fold.
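The same ratio calculation can be written outside Excel. The Python sketch below uses two made-up genes with three control and three disease values each and applies a two-fold cut-off in either direction; the gene names, values, and cut-off handling are assumptions for illustration.

```python
from statistics import mean

# Hypothetical expression values: gene -> (control samples, disease samples).
genes = {
    "gene_A": ([100.0, 110.0, 90.0], [50.0, 55.0, 45.0]),
    "gene_B": ([200.0, 210.0, 190.0], [195.0, 205.0, 200.0]),
}


def control_disease_ratio(control, disease):
    """Mean control value divided by mean disease value (like =E5/I5)."""
    return mean(control) / mean(disease)


ratios = {g: control_disease_ratio(c, d) for g, (c, d) in genes.items()}

# Flag genes changed at least two-fold in either direction.
flagged = [g for g, r in ratios.items() if r >= 2.0 or r <= 0.5]
```

With these values gene_A has a ratio of 2.0 and is flagged, while gene_B has a ratio of 1.0 and is not.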
Analyzing expression data in Excel [3] Perform a t-test. When you enter =TTEST into the function box above, a dialog box appears. Enter the range of values for controls and for disease samples, and specify a 1- or 2-tailed test.
Analyzing expression data in Excel [3] Perform a t-test (continued). For a one-tailed test, your prior hypothesis is that the transcript in the disease group is up (or down) relative to controls; the change is unidirectional. For example, in Down syndrome samples you might hypothesize that chromosome 21 transcripts are significantly up-regulated because of the extra copy of chromosome 21.
Analyzing expression data in Excel [3] Perform a t-test (continued). For a two-tailed test, you hypothesize that the two groups are different, but you do not know in which direction.
Analyzing expression data in Excel [3] Note the results: you can have…
• a small p value (<0.05) with a big ratio difference
• a small p value (<0.05) with a trivial ratio difference
• a large p value (>0.05) with a big ratio difference
• a large p value (>0.05) with a trivial ratio difference
Only the first group is worth reporting! Why?
t-test to determine statistical significance

$$ t\ \text{statistic} = \frac{\text{difference between mean of disease and normal}}{\text{variation due to error}} $$

[Figure: total variability split into disease vs normal and error] Page 354
ANOVA partitions total data variability

$$ F\ \text{ratio} = \frac{\text{variation between DS and normal}}{\text{variation due to error}} $$

[Figure: before partitioning, variability is split into disease vs normal and error; after partitioning, it is split into subject, disease vs normal, tissue type, and error] Page 354
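The F ratio can be illustrated with the simplest case, a one-way ANOVA with a single grouping factor. The Python sketch below is only that simple case (it does not partition out subject or tissue type as in the figure), and the function name and data are hypothetical.

```python
from statistics import mean


def one_way_f(groups):
    """Return the F ratio for a one-way ANOVA over a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation between groups: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation due to error: spread of values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Three hypothetical groups of three measurements each.
f_ratio = one_way_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
```

For these values the between-group mean square is 27 and the error mean square is 1, giving F = 27. With exactly two groups, F equals the square of the equal-variance t statistic.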
Inferential statistics

Paradigm                       Parametric test    Nonparametric
Compare two unpaired groups    Unpaired t-test    Mann-Whitney test
Compare two paired groups      Paired t-test      Wilcoxon test
Compare 3 or more groups       ANOVA
Page 350
Inferential statistics Is it appropriate to set the significance level to p < 0.05? If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05. You might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated. But you can expect to see 5% (500 genes) regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction. The level for statistical significance is divided by the number of measurements, e.g. the criterion becomes: p < 0.05/10,000, or p < 5 × 10^-6. The Bonferroni correction is generally considered to be too conservative. Page 351
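The arithmetic on this slide is short enough to verify in Python. The function names below are illustrative, and the list of p values passed to `bonferroni_significant` would come from your own per-gene tests.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cut-off after a Bonferroni correction."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Return True for each p value below the corrected cut-off."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [p < cutoff for p in p_values]


# Numbers from the slide: 10,000 genes tested at the 0.05 level.
expected_by_chance = 0.05 * 10_000   # about 500 genes by chance alone
threshold = bonferroni_threshold(0.05, 10_000)   # about 5e-06

# Three hypothetical p values; the cut-off here is 0.05 / 3.
calls = bonferroni_significant([0.001, 0.02, 0.5])
```

Only the first of the three hypothetical p values survives the correction, which shows how quickly the cut-off tightens as the number of tests grows.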
Inferential statistics: false discovery rate
The false discovery rate (FDR) is a popular multiple comparisons correction. A false positive (also called a type I error) is sometimes called a false discovery. The expected number of false positives equals the p value cut-off of the t-test times the number of genes measured (e.g. for 10,000 genes and a p value of 0.01, there are 100 expected false positives). The FDR is the expected fraction of false discoveries among the transcripts called regulated. You can adjust the false discovery rate. For example:

FDR     # regulated transcripts    # false discoveries
0.1     100                        10
0.05    45                         3
0.01    20                         1

Would you report 100 regulated transcripts of which 10 are likely to be false positives, or 20 transcripts of which one is likely to be a false positive? Page 351
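The slide does not say which FDR procedure is used. As one possibility, the sketch below implements the widely used Benjamini-Hochberg step-up procedure in plain Python; the choice of procedure, the function name, and the p values are all assumptions made for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return True for each test called significant at the given FDR."""
    m = len(p_values)
    # Indices of the tests ordered from smallest to largest p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p value satisfies p <= (k / m) * fdr.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    # Every test ranked at or below max_rank is called significant.
    significant = [False] * m
    for idx in order[:max_rank]:
        significant[idx] = True
    return significant


# Six hypothetical p values, one per transcript.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.60]
calls = benjamini_hochberg(p_values, fdr=0.05)
```

At an FDR of 0.05 only the two smallest p values are called significant here, whereas an uncorrected p < 0.05 cut-off would have accepted four. Raising the `fdr` argument reports more transcripts at the price of more expected false discoveries, which is the trade-off posed in the question above.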