Statistical Analysis Dr. Lars Eijssen
Contents • Statistics of differential gene expression • Multiple testing • Unsupervised methods • The arrayanalysis statistics module
Part 1: Statistics of differential gene expression
Data analysis overview (flow diagram) • Microarray scans: untreated (control) and exposed to compound • Image analysis • Raw data • Background correction • Normalisation • Quality control • Further pre-processing • Normalised data • Statistical analysis • List of regulated genes • Pattern analysis • Pathway analysis • Literature data • Results • Slide based on a slide from J. Pennings, RIVM, NL
What is changed? • “Every gene that has changed two-fold is relevant” • Doesn’t take variation into account
Statistical testing • So add a statistical test with the null hypothesis that the gene is not changed between the groups • This gives you, apart from the size of the change, also a measure of its significance (P value)
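To make the idea of testing against a "no change" null hypothesis concrete, here is a minimal sketch for a single gene. It uses an exact permutation test on the difference in group means, chosen only because it needs nothing beyond the Python standard library; it is not the adapted t-test discussed later in these slides. The expression values and the function name `permutation_p_value` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    observed = abs(mean(group_a) - mean(group_b))
    total_sum = sum(pooled)
    extreme = 0
    total = 0
    # Enumerate every way to relabel n_a of the pooled values as "group A".
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        difference = abs(a_sum / n_a - (total_sum - a_sum) / n_b)
        # Small tolerance so floating-point noise does not hide ties.
        if difference >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical log2 expression values for one gene (illustration only).
controls = [7.1, 7.3, 6.9, 7.0]
patients = [8.0, 8.2, 7.9, 8.3]

p = permutation_p_value(patients, controls)
print(f"difference = {mean(patients) - mean(controls):.2f}, P = {p:.4f}")
```

With four samples per group there are only 70 possible relabelings, so the smallest two-sided P value this test can give is 2/70. This is one reason why the number of replicates matters.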
Input for statistics • Normalised data table • Grouping information
Output of statistics • List of differentially expressed genes between experimental groups • How much difference? • How significant? • Replicates
The fold change • One is interested in computing the fold change between experimental groups • For example: Gene_A is 2-fold upregulated in patients versus controls: Gene_A_patient / Gene_A_control = 2 • This is a division between groups
Asymmetry of the fold change • The ‘raw’ ratio (FC) runs from 0 to ∞, with ½, 1 and 2 marked on the axis • Downregulated: packed in (0,1) • Upregulated: spread over (1,∞)
Log transformation • After logging (and normalisation) one can compute the difference in means (‘logFC’) between several experimental groups • log2(Gene_A_patient / Gene_A_control) = log2(2) = 1 • Equivalently: log2(Gene_A_patient) - log2(Gene_A_control) = 1 • A difference is easier to handle statistically than a division
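The difference-in-means calculation can be sketched as follows. The intensities are hypothetical and chosen so that each patient value is exactly twice a control value, which reproduces the logFC of 1 from the slide.

```python
from math import log2
from statistics import mean

# Hypothetical normalised intensities for Gene_A (illustration only).
gene_a_patient = [410.0, 390.0, 400.0]
gene_a_control = [205.0, 195.0, 200.0]

# Log-transform first, then average within each group.
log_patient = [log2(x) for x in gene_a_patient]
log_control = [log2(x) for x in gene_a_control]

# The logFC is the difference between the group means on the log2 scale.
log_fc = mean(log_patient) - mean(log_control)
print(f"logFC = {log_fc:.3f}")
```

Averaging after the log transformation means that the logFC corresponds to the ratio of the geometric means of the two groups, not of the arithmetic means.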
Symmetry of the logged fold change • The logFC ‘spreads out’ the data and offers symmetry • ‘Raw’ ratio (FC): ½, 1, 2 • Log ratio (logFC), the log2 of ½, 1, 2: -1, 0, 1
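A short sketch of the symmetry: a ratio and its inverse end up at the same distance from zero after the log2 transformation. The list of ratios is only an illustration.

```python
from math import log2

# Ratios below 1 are downregulated, ratios above 1 are upregulated.
ratios = [0.25, 0.5, 1.0, 2.0, 4.0]
log_ratios = [log2(r) for r in ratios]

for ratio, log_ratio in zip(ratios, log_ratios):
    print(f"FC = {ratio:<5} logFC = {log_ratio:+.1f}")
```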
Linear regression modelling • Often-used approach • For those familiar: corresponds to ANOVA
A basic example: two groups • Suppose we have an experiment with patients and controls • How can we compute the difference between those for a certain gene?
The model • Gene_expression_A ~ group • This means that gene expression is modelled based on group • The average in patients is allowed to be different from the average in controls
Contrasts • Linear regression modelling computes coefficients for each of the variables in the model • Such as group • From these we can compute the differences between the groups, called contrasts
Contrasts and fold changes • Contrasts directly correspond to the logFC between the groups • To get the FC (ratio) for the data on the original scale we can easily compute: FC = 2^logFC
More extensive models • Gene_expression_A ~ group + day • Gene_expression_A ~ group + day + group*day • Or when group has more than two levels, for example three: Gene_expression_A ~ group enables estimation of three contrasts (group 1 versus 2, group 1 versus 3, and group 2 versus 3)
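For a model that contains only group, the fitted value for each level is simply that group's mean, so the three contrasts are the pairwise differences between group means. A minimal sketch with hypothetical log2 values and hypothetical group names:

```python
from itertools import combinations
from statistics import mean

# Hypothetical log2 expression values per group level (illustration only).
groups = {
    "group1": [7.0, 7.2, 6.8],
    "group2": [8.0, 8.1, 7.9],
    "group3": [6.5, 6.4, 6.6],
}

means = {name: mean(values) for name, values in groups.items()}

# Every pair of levels gives one contrast (difference in mean log2 expression).
contrasts = {
    f"{a} vs {b}": means[a] - means[b]
    for a, b in combinations(groups, 2)
}

for name, log_fc in contrasts.items():
    print(f"{name}: logFC = {log_fc:+.2f}")
```

Models with additional terms such as day or the group*day interaction need a full linear model fit and are not covered by this sketch.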
Significant, but … relevant??? • Is a FC of 1.005, with a P value of 0.0001, biologically relevant? • One can also put a cut-off on the FC
Volcano plot • Shows both the significance and the logFC • P values on a -log10 scale • Image: J. Pennings, RIVM, NL
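The two axes of a volcano plot, and a combined cut-off on both the FC and the P value, can be sketched as follows. The results table and both cut-off values are assumptions for illustration only.

```python
from math import log10

# Hypothetical results: gene -> (logFC, P value). Illustration only.
results = {
    "Gene_A": (1.50, 0.0002),
    "Gene_B": (0.007, 0.0001),  # FC of about 1.005: significant but tiny
    "Gene_C": (-1.20, 0.0300),
    "Gene_D": (2.00, 0.2000),   # large change but not significant
}

LOGFC_CUTOFF = 1.0  # assumed cut-off: at least two-fold up or down
P_CUTOFF = 0.05     # assumed cut-off

# Volcano plot coordinates: x = logFC, y = -log10(P value).
volcano = {gene: (lfc, -log10(p)) for gene, (lfc, p) in results.items()}

# Keep only genes that pass both the fold-change and the significance cut-off.
selected = [
    gene
    for gene, (lfc, p) in results.items()
    if abs(lfc) >= LOGFC_CUTOFF and p < P_CUTOFF
]
print(selected)
```

In the plot, the selected genes are the ones in the upper left and upper right corners.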
Part 2: Multiple testing
Is a P of 0.05 significant? • 5000 to 50000 tests
Suppose 7000 genes • At a P cut-off of 0.05, the expected number of genes that are significant by chance alone: 7000 * 0.05 = 350
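A small simulation sketch of this point, assuming that no gene is truly changed, so that every P value is uniformly distributed between 0 and 1. The seed is arbitrary and only makes the run repeatable.

```python
import random

random.seed(1)  # arbitrary fixed seed so the sketch is repeatable

N_GENES = 7000
ALPHA = 0.05

# Assumption: no gene is truly changed, so P values are uniform on [0, 1).
null_p_values = [random.random() for _ in range(N_GENES)]

# Every gene below the cut-off is a false positive in this scenario.
false_positives = sum(p < ALPHA for p in null_p_values)
expected = N_GENES * ALPHA

print(f"expected by chance: {expected:.0f}, observed in this run: {false_positives}")
```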
Corrections for multiple testing • FWER (family-wise error rate): correct the P value for the number of tests • Simplest example is the Bonferroni correction • Corrected significance threshold = 0.05 / number of tests done • For example: 0.05 / 7000 = 7.14e-06 • Too strict: any results left?
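The Bonferroni correction can be written either as a stricter threshold or, equivalently, as adjusted P values that are compared with the original 0.05. A minimal sketch; the function names and the example P values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Significance threshold after Bonferroni correction."""
    return alpha / n_tests


def bonferroni_adjust(p_values, n_tests):
    """Adjusted P values: multiply by the number of tests, capped at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


threshold = bonferroni_threshold(0.05, 7000)
print(f"threshold = {threshold:.2e}")

# Hypothetical raw P values for three of the 7000 genes (illustration only).
raw = [1e-7, 1e-5, 0.01]
adjusted = bonferroni_adjust(raw, 7000)
print(adjusted)
```

Only the first of the three example genes stays below 0.05 after adjustment, which illustrates how strict the correction is.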
FDR • Other corrections are more realistic • For example, correcting the false discovery rate (FDR) • These corrections control the proportion of false positives among the hits • Number of wrong hits / total number of hits • This means the error is judged against the number of positive (significant) tests rather than the total number of tests
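The slides do not name a specific FDR procedure. As an illustration, the sketch below implements the Benjamini-Hochberg adjustment, which is one commonly used method; treating it as the method meant here is an assumption. The P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    # Indices of the P values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P values (illustration only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
q_values = benjamini_hochberg(p_values)

n_fdr_hits = sum(q <= 0.05 for q in q_values)
n_bonferroni_hits = sum(p <= 0.05 / len(p_values) for p in p_values)
print(f"FDR hits: {n_fdr_hits}, Bonferroni hits: {n_bonferroni_hits}")
```

In this small example the FDR adjustment keeps two genes where Bonferroni keeps one.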
Part 3: Unsupervised methods
Supervised versus unsupervised • Methods such as statistical testing are supervised • One can also apply unsupervised methods • Two of those we have already seen at the QC
Clustering • One can cluster samples, genes or both • Image from J. Pennings, RIVM, NL
Similarity of two expression profiles • Euclidean distance • Correlation distance
Building a tree • Steps 0 to 4 • Start with the separate items a, b, c, d, e • Successive merges: a,b • d,e • c,d,e • a,b,c,d,e • Tree is constructed! • Adapted from Kaufman and Rousseeuw (1990)
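Building such a tree bottom-up can be sketched as repeatedly merging the two closest clusters. The sketch below assumes single linkage (cluster distance is the smallest pairwise distance) and places the five items at hypothetical one-dimensional positions chosen so that the merges come out as a,b, then d,e, then c,d,e, then all five.

```python
from itertools import combinations


def single_linkage(cluster_1, cluster_2, points):
    """Distance between two clusters: the smallest pairwise distance."""
    return min(abs(points[i] - points[j]) for i in cluster_1 for j in cluster_2)


def agglomerate(points):
    """Merge the two closest clusters repeatedly; return the merge order."""
    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], points),
        )
        merged = tuple(sorted(c1 + c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        merges.append(merged)
    return merges


# Hypothetical 1-D positions for the five items (illustration only).
points = {"a": 0.0, "b": 0.5, "c": 4.0, "d": 6.0, "e": 7.0}

merges = agglomerate(points)
for step, merged in enumerate(merges, start=1):
    print(step, ",".join(merged))
```

Other linkage rules (complete, average) and other distance measures can give a different tree for the same data.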
PCA analysis • Here too, scaling is important
Part 4: The arrayanalysis.org statistics module
Limma • A Bioconductor package for R that allows for linear modelling • It uses an ‘adapted’ t-test • Improved estimate of variation • It applies a Bayesian smoothing to the gene-wise variance estimates, which in turn affects the P values • This package is called by arrayanalysis.org
arrayanalysis.org • Besides the QC and normalisation module, it contains a module for statistical analysis • This has not yet been added to the open site • So we will work on the developers’ site in the afternoon
arrayanalysis.org • We use the beta version of the AffyAnalysisStat module • Some inconveniences / small bugs are still there • Don’t worry! • In the practical you will get instructions on how to operate it
Project members • Lars Eijssen • Magali Jaillard • Anwesha Dutta