This presentation covers differential expression analysis in bioinformatics, including preprocessing, filtering, normalization, and statistical testing methods such as the t-test and SAM.
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006
Preprocessing • Array-by-array approach: background correction, log transformation, filtering, normalization, ratio, test statistic (t-test) • ANOVA-based approach: background correction, log transformation, filtering, linearisation, bootstrapping
Overview of further analysis • Raw data => preprocessing => preprocessed data • Test statistic => differentially expressed genes • Clustering => clusters of coexpressed genes
Preprocessing: test statistic Comparison of 2 experiments: • Fold test • T-test • SAM • … A plethora of different methods is available. Which one performs best? • Different underlying statistical assumptions • Implications for the final result • Difficult to define the best method
Diff Expr Genes: test statistic Type 1: comparison of 2 samples (control sample versus induced sample) • Statistical testing • Retrieve statistically over- or underexpressed genes
Diff Expr Genes: test statistic • Black/white experiment description (array V mice genes) • Condition 1: pygmy mouse, 10 days old (test) • Condition 2: normal mouse, 10 days old (ref) • Goal: detect differentially expressed genes • Experiment design (Latin square): Array 1, Array 2 • Per gene and per condition, 4 measurements are available
Diff Expr Genes: test statistic Fold change (ratio test) • 4 measurements per gene and condition • Calculate the average per condition • Sort the averages • Call a gene differentially expressed if the sample/control ratio exceeds a threshold (usually 2-fold, i.e. |log2(sample/control)| > 1) Drawbacks: • Arbitrary threshold • Discards all information obtained from replicates • Implicitly assumes constant variance, but variance depends on expression value
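The fold test on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names and the intensity values are hypothetical, and the 2-fold threshold is applied symmetrically on the log2 scale.

```python
import math
from statistics import mean

def log2_fold_change(sample, control):
    """Log2 ratio of the replicate averages for one gene."""
    return math.log2(mean(sample) / mean(control))

def fold_test(sample, control, fold_threshold=2.0):
    """Call a gene if its average changes at least fold_threshold-fold, up or down."""
    return abs(log2_fold_change(sample, control)) >= math.log2(fold_threshold)

# Hypothetical intensities: 4 replicate measurements per gene and condition.
gene_a_test, gene_a_ref = [840.0, 810.0, 825.0, 805.0], [400.0, 410.0, 395.0, 395.0]
gene_b_test, gene_b_ref = [30.0, 130.0, 25.0, 155.0], [35.0, 40.0, 45.0, 40.0]

for name, test, ref in [("A", gene_a_test, gene_a_ref), ("B", gene_b_test, gene_b_ref)]:
    print(name, round(log2_fold_change(test, ref), 3), fold_test(test, ref))
```

Both genes pass the threshold, although the replicates of gene B disagree strongly with each other. Because only the averages enter the calculation, the fold test cannot tell the two cases apart.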
Diff Expr Genes: test statistic Why does fold change fail? • The majority of genes are expressed at low levels, where signal/noise is low => not sufficiently conservative • A 2-fold change occurs at random for a large number of genes • High number of false positives • At higher levels of expression, smaller changes in gene expression may be real => too conservative • High number of false negatives Improvements: • T-test • Pairwise fold change: genes are significantly differentially expressed if an R-fold change is observed consistently between paired samples • SAM http://www-stat-class.stanford/SAM/SAMServlet
Diff Expr Genes: test statistic T-test: hypothesis test • Possible if replicates of reference and test are available • Measures the significance of the difference between the reference and test data (level of expression) relative to the observed level of within-class variation (consistency) • Assumptions: • Normal distribution of variables • Population mean and variance are estimated from the data => Student t distribution under the H0 hypothesis • Not all genes need to have the same variance • Under the null hypothesis the sample means should be equal (rescaling obligatory)
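Diff Expr Genes: test statistic Paired t-test (microarray data are paired) • Consider the paired data (ratio per pair) as a new variable • Calculate the average ratio • Calculate the standard deviation of the 4 ratio measurements • Determine the t-value • From the degrees of freedom (df) and the Student t distribution, convert the t-value to a p-value • The p-value is the probability of obtaining a t-value at least as extreme as the observed one if the null hypothesis is true (it is not the probability that the null hypothesis is true)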
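The paired t-test steps above can be sketched as follows. This is a standard-library illustration with hypothetical data and names. Instead of computing a p-value, it compares the t-value against 3.182, the two-sided 5% critical value of the Student t distribution with 3 degrees of freedom (4 pairs minus 1). Taking log2 of the ratios before averaging is an assumption made here to keep up- and down-regulation symmetric.

```python
import math
from statistics import mean, stdev

T_CRIT_DF3 = 3.182  # two-sided 5% critical value, Student t, 3 degrees of freedom

def paired_t(test, ref):
    """Paired t statistic on per-pair log2 ratios; returns (t_value, df)."""
    d = [math.log2(t / r) for t, r in zip(test, ref)]  # paired data as a new variable
    n = len(d)
    standard_error = stdev(d) / math.sqrt(n)
    return mean(d) / standard_error, n - 1

# Hypothetical intensities: 4 paired measurements (test, ref) per gene.
gene_a_test, gene_a_ref = [840.0, 810.0, 825.0, 805.0], [400.0, 410.0, 395.0, 395.0]
gene_b_test, gene_b_ref = [30.0, 130.0, 25.0, 155.0], [35.0, 40.0, 45.0, 40.0]

t_a, df_a = paired_t(gene_a_test, gene_a_ref)
t_b, df_b = paired_t(gene_b_test, gene_b_ref)
print("gene A significant at 5%:", abs(t_a) > T_CRIT_DF3)
print("gene B significant at 5%:", abs(t_b) > T_CRIT_DF3)
```

Gene A changes consistently across all four pairs and is called significant. Gene B has a similar average change but inconsistent pairs, so its t-value stays below the critical value.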
Diff Expr Genes: test statistic t-test • Classical hypothesis tests (t-test, Wilcoxon rank-sum test, ...): • a test statistic is calculated (t-value) • the p-value is calculated: the probability that an equally good or better test statistic is generated if a certain null hypothesis is true • The null hypothesis: the gene has no difference in mean expression levels between the 2 conditions • Low p-value (below rejection level α): the null hypothesis is not likely, so reject it: there is a difference in (mean) expression between the two classes [Figure: distributions of the test statistic for gene x under H0 (D = 0) and H1 (D ≠ 0), with the Type I and Type II error regions]
Diff Expr Genes: test statistic Comparison of fold test with paired t-test • Gene expression levels measured under two different conditions • Rejection level α • pj < α: null hypothesis rejected (result Positive) • pj > α: null hypothesis not rejected (result Negative) • But: multiple testing: Type I and Type II errors = false positives and false negatives
Diff Expr Genes: test statistic SAM • Each gene is assigned a score on the basis of its change in expression relative to the standard deviation of repeated measurements for that gene • H0 (expected relative difference) is estimated by permutation analysis • Permute the samples • Calculate d(i) values for both the experimental samples and the permuted control samples • Rank genes by the magnitude of their d(i) values for both the experimental and the permuted control samples
Diff Expr Genes: test statistic SAM • Observed values • Calculate the d(i) value for each gene • Rank genes according to their d(i) value • Simulated values • Permute the dataset • Calculate the d(i) value for each gene in each permuted dataset • Rank the d(i) values within each permuted dataset • Calculate the average d(i) value at each rank over all permutations • Make a scatterplot of observed versus expected d(i) values
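The permutation procedure on the two SAM slides can be sketched as below. This is a simplified illustration under stated assumptions, not the SAM software itself. It uses the two-class unpaired form of d(i), a fixed fudge constant `s0` (SAM estimates it from the data), and a per-gene cutoff `delta` on the distance between observed and expected d(i). All names and the synthetic data are hypothetical.

```python
import math
import random
from statistics import mean

def d_stat(x, y, s0=0.1):
    """Relative difference: mean difference over (pooled standard error + s0)."""
    nx, ny = len(x), len(y)
    mx, my = mean(x), mean(y)
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    s = math.sqrt((1 / nx + 1 / ny) * ss / (nx + ny - 2))
    return (mx - my) / (s + s0)

def sam_sketch(data, n_test, n_perm=100, delta=1.0, seed=0):
    """data: one row per gene, test columns first. Returns (called genes, observed d)."""
    rng = random.Random(seed)
    n_cols = len(data[0])
    observed = [d_stat(row[:n_test], row[n_test:]) for row in data]
    order = sorted(range(len(data)), key=lambda g: observed[g])  # genes ranked by d(i)
    expected = [0.0] * len(data)  # average permuted d at each rank
    for _ in range(n_perm):
        cols = list(range(n_cols))
        rng.shuffle(cols)  # permute the sample labels, same permutation for all genes
        perm_d = sorted(
            d_stat([row[c] for c in cols[:n_test]], [row[c] for c in cols[n_test:]])
            for row in data
        )
        for rank, value in enumerate(perm_d):
            expected[rank] += value / n_perm
    called = [g for rank, g in enumerate(order) if abs(observed[g] - expected[rank]) > delta]
    return called, observed

# Synthetic log-expression data: 100 genes, 4 test + 4 reference arrays.
rng = random.Random(1)
genes = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(100)]
genes[0] = [v + 50 for v in genes[0][:4]] + genes[0][4:]  # one strongly induced gene
called, observed = sam_sketch(genes, n_test=4)
print(0 in called, len(called))
```

Plotting `observed` in rank order against `expected` gives the scatterplot from the slide: genes far from the diagonal are the candidates for differential expression.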
Diff Expr Genes: test statistic
Test statistic | Assumptions | Distribution under H0
T-test, paired t-test | Errors normally distributed, with equal variance (iid) | Parametrized: Student t-distribution
SAM | No explicit assumption (less stringent) | Order statistics
Note: with a restricted number of repeat measurements it is impossible to evaluate the assumptions.
Diff Expr Genes: test statistic Multiple testing: problem • P-value: measure of significance in terms of the false positive rate • False positive rate: the rate at which truly null features are called significant • Significance level of 5%: on average 5% of the truly null features will be called significant (type I error) • Type I error: null hypothesis rejected when it is true ('accidental' low p-value); the gene is falsely declared differentially expressed = false positive • Multiple testing example: 10000 genes with random expression profiles and α = 5%: one would find about 500 genes with a p-value lower than 5% = false positives • Type II error: null hypothesis not rejected when it is not true (false negative). A gene that is actually differentially expressed is not declared differentially expressed. Adapted from De Smet et al
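The 10000-gene example can be checked with a small simulation. The sketch relies on one standard fact: when the null hypothesis is true, p-values are uniformly distributed between 0 and 1. So the p-values are drawn directly rather than computed from simulated expression profiles. The seed and variable names are arbitrary.

```python
import random

rng = random.Random(42)
n_genes, alpha = 10000, 0.05

# No gene is differentially expressed: every p-value is uniform on [0, 1].
p_values = [rng.random() for _ in range(n_genes)]

false_positives = sum(p < alpha for p in p_values)
print(false_positives, "genes called significant; expected about", int(n_genes * alpha))
```

Every gene counted here is a false positive, since the data contain no real differential expression at all.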
Diff Expr Genes: test statistic Multiple testing: solutions • Control of the familywise error rate (FWE): • P(FP ≥ 1): protection against type I errors • Bonferroni correction: reject the null hypothesis at rejection level α/N, which guarantees that FWE = P(FP ≥ 1) ≤ α • Is OK when very few genes are expected to be actually differentially expressed (i.e., affected by the difference in conditions / for which the null hypothesis is false): every false positive is 'costly' • The rejection level becomes very conservative • But in microarray data, usually a considerable number of genes is actually differentially expressed: control of the FWE results in a severe loss of statistical power (FN or type II error is large) • In practice we do not have to protect against every possible FP • Better solution: FDR, the false discovery rate Adapted from De Smet et al
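A minimal sketch of the Bonferroni correction and its loss of power, using a hypothetical list of p-values. The function name is illustrative.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for each gene whose p-value is at most alpha / N."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Hypothetical p-values: three genes with real effects among 10000 tested genes.
p_values = [1e-7, 1e-4, 0.01] + [0.5] * 9997
rejected = bonferroni_reject(p_values)

print("threshold:", 0.05 / len(p_values))
print("genes called:", sum(rejected))
```

With N = 10000 the rejection level drops to 0.05 / 10000, so a gene with p = 0.0001 is no longer called. This is the loss of statistical power described above.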
Diff Expr Genes: test statistic FDR • We need a sensible balance between the number of true positives and the number of false positives • Therefore it is better to control the 'False Discovery Rate' (FDR) instead of the FWE: • The false positive rate: the rate at which truly null features are called significant • The FDR = % of false positives among all the genes that are declared positive = % of true null hypotheses erroneously rejected among all the null hypotheses rejected Adapted from De Smet et al
Diff Expr Genes: test statistic Difference between p-value and FDR • 5% FDR: 5% false positives among the features called significant • 5% p-value cutoff: 5% false positives among all the null features in the dataset, which says little about the content of the features actually called significant
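Diff Expr Genes: test statistic Estimating the FDR at rejection level t = pi • FDR(t) ≈ E[F(t)] / E[S(t)], with S(t) the number of features called significant at level t and F(t) the number of false positives among them • An estimate of E[S(t)] is the observed S(t): i = the number of observed p-values ≤ pi • E[F(t)] = N0 pi, with N0 the number of truly null features • Estimate N0 [Figure: p-value histograms. No real differential expression (randomised data set): uniform distribution. Non-accidental differential expression: superposition of two distributions. The rejection level splits the genes into TP, FN, FP and TN] Adapted from De Smet et al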
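The estimate N0 pi / i can be turned into a per-gene FDR value as sketched below. Two assumptions are made that the slide does not state. When no estimate of N0 is supplied, N0 is set to the total number of genes N, which is conservative and coincides with the Benjamini-Hochberg adjustment. The estimates are also made monotone in the p-value. The function name and the p-values are hypothetical.

```python
def estimated_fdr(p_values, n0=None):
    """Estimate FDR = N0 * p_i / i with each p_i used as the rejection level."""
    n = len(p_values)
    n0 = n if n0 is None else n0  # assumption: N0 = N when no estimate is given
    order = sorted(range(n), key=lambda k: p_values[k])
    fdr = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the estimate never decreases with p.
    for rank in range(n, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, n0 * p_values[k] / rank)
        fdr[k] = running_min
    return fdr

# Hypothetical p-values for 10 genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
fdr = estimated_fdr(p_values)

print("called at p < 0.05:  ", sum(p < 0.05 for p in p_values))
print("called at FDR <= 5%: ", sum(q <= 0.05 for q in fdr))
```

In this example five genes pass the 5% p-value cutoff but only two pass the 5% FDR cutoff. A smaller estimate of N0 (passed as `n0`) would scale all FDR values down proportionally.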
Overview MICROARRAY PREPROCESSING • Gene expression • Omics era • Transcript profiling • Experiment design • Preprocessing • Slide by slide normalisation • ANOVA • Exercises