Analysis of DNA Microarray Data: Sensitivity, Specificity, and Other Real-World Issues
1. Definitions and basic considerations
DNA microarrays • Major advantage • Simultaneous measurement of level of expression for nearly all transcribed genes within given cell or tissue • Major disadvantage • Cost
Therefore, to get the most bang for the buck, it is imperative to understand the role of uncertainty in measurement…
Categorical tests (yes/no, based upon threshold) • Gene arrays • Is gene expressed or not? • Is gene differentially expressed under two different experimental conditions? • Medical tests • Does patient have disease or not?
Key concepts for categorical tests • Specificity • true negative rate • 1 – FPR (false positive rate) • Sensitivity • TPR (true positive rate)
Specificity provides the answer to questions like… • What fraction of patients who are disease-free are correctly classified as disease-free? • What fraction of genes that are not differentially expressed are correctly classified as being non-differentially expressed?
Specificity • Specificity is defined as true negative rate • Probability that disease-free patient will be correctly categorized as disease-free • False positive rate (FPR) = 1 – specificity • Probability that disease-free patient will be incorrectly categorized as having disease
Sensitivity and specificity deal with distinct sets of patients or genes • Specificity • Healthy patients lacking the disease • Non-expressed genes • Non-differentially expressed genes • Sensitivity • Sick patients having the disease • Expressed genes • Differentially expressed genes
Sensitivity provides the answer to questions like… • What fraction of patients who have a given disease are correctly classified as diseased? • What fraction of genes that are differentially expressed are correctly classified as being differentially expressed?
Sensitivity • Sensitivity is defined as true positive rate • Probability that diseased patient will be correctly categorized as having the disease
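As a minimal sketch of the two definitions, both rates can be computed from the four counts of a categorical test. The function name and the counts below are hypothetical and chosen only for illustration.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from the four classification counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return sensitivity, specificity

# Hypothetical counts: 100 truly differentially expressed genes, 900 unchanged genes.
sens, spec = sensitivity_specificity(tp=80, fn=20, tn=855, fp=45)
fpr = 1 - spec                     # false positive rate = 1 - specificity
print(f"sensitivity={sens:.2f} specificity={spec:.2f} FPR={fpr:.2f}")
```

Note that the two rates use disjoint sets of genes: sensitivity only looks at the truly changed genes, specificity only at the unchanged ones.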
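Yin and yang of sensitivity and specificity • For a given test, moving the threshold to improve specificity worsens (or at best leaves unchanged) sensitivity • Moving the threshold to improve sensitivity worsens (or at best leaves unchanged) specificity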
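The trade-off can be sketched by sweeping a threshold across two overlapping groups of test values. The values below are made up for illustration, and the rule "positive when value >= threshold" is an assumption.

```python
# Hypothetical test values; the two groups overlap between 4 and 6.
negatives = [1, 2, 3, 4, 5, 6]   # disease-free / not differentially expressed
positives = [4, 5, 6, 7, 8, 9]   # diseased / differentially expressed

def rates_at(threshold):
    """Call a value positive when it is >= threshold; return (sensitivity, specificity)."""
    tp = sum(v >= threshold for v in positives)
    tn = sum(v < threshold for v in negatives)
    return tp / len(positives), tn / len(negatives)

sweep = {t: rates_at(t) for t in range(3, 9)}
for t, (sens, spec) in sweep.items():
    print(f"threshold={t}: sensitivity={sens:.2f} specificity={spec:.2f}")
```

Raising the threshold never lowers specificity and never raises sensitivity in this sketch; only where the groups do not overlap can one rate improve without cost to the other.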
S_MEASURE = measured signal • S_TRUE = true signal • N = noise (error)
Noise-to-Signal (N:S) Ratio • N : S << 1 • reliable and trustworthy measurement • N ~ S • unreliable measurement • N > S • highly unreliable measurement
Sources of uncertainty in categorical measurements • Measurement uncertainty • S_MEASURE does not necessarily equal S_TRUE • N ~ S or N > S • “Overlap” uncertainty • Some patients with disease truly have negative test values • Some patients without disease truly have positive test values
Gene arrays and medical tests have distinct and different sources of uncertainty
Variability in medical tests is mostly “overlap” • Measurement variability • Essentially none (error is of no clinical significance) • N : S << 1 • Hence, perform test once and only once • “Overlap” variability • Ubiquitous and essentially unavoidable • Feature of all medical tests to one degree or another • So what’s the solution? • Search for a better test
Variability in DNA microarrays is mostly measurement uncertainty • Measurement variability • Ever-present • N > S for many genes • “Overlap” variability • None • Absent gene has an expression level of zero, whereas present gene has a non-zero expression level • Differentially expressed gene… • So what’s the solution? • Repeated measurements
Take mean of repeated measurements...
Benefits of repeated measurements • Assuming that noise N is independent between measurements and has a normal (Gaussian) distribution, the error of the mean decreases in proportion to 1/√n, where n is the number of measurements • Example: to reduce N : S by half, take mean of 4 measurements
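A short sketch of the square-root rule follows. The helper names, the true signal of 100, and the noise SD of 20 are hypothetical, and the simulation assumes independent Gaussian noise.

```python
import math
import random
import statistics

def standard_error(noise_sd, n):
    """Standard deviation of the mean of n independent measurements."""
    return noise_sd / math.sqrt(n)

print(standard_error(1.0, 1))   # 1.0
print(standard_error(1.0, 4))   # 0.5

# Simulation check with hypothetical values: true signal 100, noise SD 20.
random.seed(0)

def sd_of_means(n, trials=20000):
    """Empirical SD of the mean of n noisy measurements, over many trials."""
    means = [statistics.fmean(random.gauss(100, 20) for _ in range(n))
             for _ in range(trials)]
    return statistics.stdev(means)

sd1, sd4 = sd_of_means(1), sd_of_means(4)
print(sd1 / sd4)  # close to 2: four measurements halve the error
```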
Signal Log Ratio (SLR) • SLR = logarithm to base 2 of the ratio of the signal for a gene under experimental condition A (S_A1) to that for the same gene under experimental condition N (S_N1) • SLR = log2 (S_A1 / S_N1)
Examples of SLR • S_A1 = 4000, S_N1 = 1000: SLR = log2 (4) = 2 • S_A1 = 2, S_N1 = 16: SLR = log2 (1/8) = -3
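The two examples can be reproduced with a one-line function; the function and argument names here are illustrative, not part of any array software.

```python
import math

def signal_log_ratio(signal_a, signal_n):
    """SLR = log2(signal under condition A / signal under condition N)."""
    return math.log2(signal_a / signal_n)

print(signal_log_ratio(4000, 1000))  # 2.0
print(signal_log_ratio(2, 16))       # -3.0
```

The second example has the larger absolute SLR even though both of its signals are tiny, which is why SLR needs care for weakly expressed genes.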
To get a handle on specificity, perform same-versus-same comparisons • SLR_TRUE must be zero • log2 (1) = 0 • Hence, SLR_MEASURE is all noise
Perform separate analyses for “present” and “absent” genes • Present genes • N : S << 1 • Absent genes • N : S ≥ 1
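A rough simulation suggests why the two classes behave so differently in a same-versus-same comparison. It is a sketch under stated assumptions: additive Gaussian noise, a floor of 1.0 to keep the ratio defined, and made-up signal and noise levels. None of these come from the experiment described here.

```python
import math
import random
import statistics

random.seed(1)

def same_vs_same_slr_sd(true_signal, noise_sd, genes=5000, floor=1.0):
    """SD of the SLR between two noisy measurements of the same true signal."""
    slrs = []
    for _ in range(genes):
        # Additive Gaussian noise is an assumption; the floor keeps the ratio defined.
        a = max(true_signal + random.gauss(0, noise_sd), floor)
        b = max(true_signal + random.gauss(0, noise_sd), floor)
        slrs.append(math.log2(a / b))
    return statistics.stdev(slrs)

present_sd = same_vs_same_slr_sd(true_signal=1500, noise_sd=150)  # N:S = 0.1
absent_sd = same_vs_same_slr_sd(true_signal=50, noise_sd=50)      # N:S = 1
print(f"present: {present_sd:.2f}  absent: {absent_sd:.2f}")
```

Although SLR_TRUE is zero in both cases, the spread of SLR_MEASURE is far wider when N : S is near 1.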
Experimental system • Primary cultures of peritoneal macrophages from mice of 3 strains • BALB/c (normal) • MRL/+ (autoimmune lupus) • MRL/lpr (autoimmune lupus) • Each array represents mRNA pooled from distinct sets of ~ 6 mice harvested on separate days • Macrophages were stimulated with bacterial endotoxin (lipopolysaccharide, LPS) for 8 or 24 hours
Present genes: Same-vs.-same comparison (single array) • Average SLR = ~ 0.02 ± 0.04 (~ 1.014-fold) • not different from zero • that’s good! • Standard deviation = ~ 0.69 ± 0.30 • ~ 32% genes have |SLR| > 0.69 (1.61-fold change) • ~ 4% genes have |SLR| > 1.38 (2.60-fold change) • that’s not good
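The fold values and percentages on this slide can be checked with a few lines. The sketch assumes a zero-mean normal distribution with the reported SD of 0.69 and counts values more than one or two SDs from zero in either direction.

```python
from statistics import NormalDist

def fold_change(slr):
    """Convert a signal log ratio (base 2) to a fold change."""
    return 2 ** slr

sd = 0.69  # standard deviation of SLR reported above for present genes
for k in (1, 2):
    # Two-sided fraction of a zero-mean normal distribution beyond k SDs.
    fraction = 2 * (1 - NormalDist(0, sd).cdf(k * sd))
    print(f"|SLR| > {k * sd:.2f} ({fold_change(k * sd):.2f}-fold): {fraction:.1%}")
```

Under these assumptions roughly a third of unchanged genes would exceed a 1.61-fold cutoff by noise alone, which is the specificity problem the slide points to.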
Present genes: Statistical distribution of SLR • Entire distribution • Not normal (p < 0.01, by D statistic) • Central 95% • Normal (p > 0.2, by D statistic) • Highly noteworthy, since D statistic detects even tiny deviations from normality • 5% at tails overestimate the SLR
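The comparison of the entire distribution with its central 95% can be sketched as below. This assumes the D statistic is a Kolmogorov-Smirnov-type distance to a fitted normal distribution, which the slides do not state. The SLR values are synthetic, and the sketch computes only the distance, not a p-value.

```python
from statistics import NormalDist, fmean, stdev

def ks_distance_to_normal(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the data."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d

def central_fraction(values, keep=0.95):
    """Drop an equal share of values from each tail, keeping the central part."""
    xs = sorted(values)
    cut = int(round(len(xs) * (1 - keep) / 2))
    return xs[cut:len(xs) - cut]

# Synthetic SLR values: a normal core (SD 0.69) plus a few extreme tail values.
core = [NormalDist(0, 0.69).inv_cdf((i + 0.5) / 190) for i in range(190)]
slr = core + [-6, -5, -4, -4, -3.5, 3.5, 4, 4, 5, 6]

d_all = ks_distance_to_normal(slr)
d_central = ks_distance_to_normal(central_fraction(slr))
print(f"entire: D={d_all:.3f}  central 95%: D={d_central:.3f}")
```

In this synthetic case, trimming 2.5% from each tail brings the distance to the fitted normal close to zero, mirroring the pattern reported for present genes.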
If we compare genes in central 95% versus genes in 5% tails… • Center (95% genes) • Mean signal intensity = 1493 • Tails (5% genes) • Mean signal intensity = 620 (p < 10^-19, t-test) • Consistent with intuitive idea that measurement variability is inversely related to level of gene’s expression
Absent genes: Same-vs.-same comparison (single array) • Average SLR = ~ 0.33 ± 0.31 (~ 1.26-fold induction) • definitely not good • Standard deviation = ~ 1.12 ± 0.24 • > 35% genes have SLR > 1.0 (2-fold induction) • > 5% genes have SLR > 2.0 (4-fold induction) • even worse!
Absent genes: Statistical distribution of SLR • Entire distribution • Not normal (p < 0.01, by D statistic) • Central 95% • Not normal (p < 0.01, by D statistic) • Central 60% • Not normal (p < 0.01, by D statistic)
Summary of same-vs.-same comparisons (single array) • Use SLR only for genes that are actually expressed (i.e., “present” genes) • Central 95% normally distributed with standard deviation of ~ 0.69 • 2.5% at each tail exceeds normal distribution • Do not use SLR for genes that are marginally, if at all, expressed (i.e., “absent” genes) • Most of measured signal is noise • SLR is therefore ratio of two small randomly distributed values