Statistical Analysis

Statistical Analysis Dr. Lars Eijssen

Contents • Statistics of differential gene expression • Multiple testing • Unsupervised methods • The arrayanalysis.org statistics module

Part 1: Statistics of differential gene expression

Data analysis overview • Microarray scans • Image analysis • Raw data • Background correction • Normalisation • Quality control • Further pre-processing • Normalised data • Statistical analysis • List of regulated genes • Pattern analysis • Pathway analysis • Literature data • Results • Example groups: Untreated (control), Exposed to compound • Slide based on a slide from J. Pennings, RIVM, NL

What is changed? • “Every gene that has changed two-fold is relevant” • Doesn’t take variation into account

Statistical testing • So add a statistical test with the null hypothesis that the gene is not changed between the groups • This gives you, apart from the change, also a significance level (P value)
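As an illustration of testing the null hypothesis "not changed between the groups", here is a minimal sketch of an exact permutation test on the difference in group means. The slides do not prescribe this particular test; it is swapped in because it needs only the Python standard library. The expression values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    total = 0
    at_least_as_extreme = 0
    # Enumerate every way of assigning n_a of the pooled values to group A.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in range(len(pooled)) if i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        # Small tolerance guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


# Hypothetical log2 expression values for one gene.
controls = [7.1, 7.3, 6.9, 7.0]
patients = [8.2, 8.0, 8.4, 8.1]
p_value = permutation_p_value(controls, patients)
```

With four samples per group there are only 70 possible group assignments, so the smallest attainable P value is limited by the number of replicates.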

Input for statistics • Normalised data table • Grouping information

Output of statistics • List of differentially expressed genes between experimental groups • How much difference? • How significant? • Replicates

Recall

The Fold Change • One is interested in computing the fold change between experimental groups • For example: Gene_A is 2-fold upregulated in patients versus controls: Gene_A_patient / Gene_A_control = 2 • This is a division between groups

Asymmetry of the Fold Change • ‘Raw’ ratio (FC) axis: 0, ½, 1, 2, ∞ • Downregulated: packed in (0,1) • Upregulated: spread over (1,∞)

Log transformation • After logging (and normalisation) one can compute the difference in means (‘logFC’) between several experimental groups • log2(Gene_A_patient / Gene_A_control) = log2(2) = 1 • Equivalently: log2(Gene_A_patient) - log2(Gene_A_control) = 1 • A difference is easier to handle statistically than a division

Symmetry of the logged Fold Change • The logFC ‘spreads out’ the data and offers symmetry • ‘Raw’ ratio (FC): ½, 1, 2 • Log ratio (logFC), the log2 of ½, 1, 2: -1, 0, 1
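The asymmetry of the raw ratio and the symmetry of the log ratio can be checked with a few lines of Python. This is a minimal sketch; the intensity values and function names are made up for illustration.

```python
import math


def fold_change(patient, control):
    """Raw ratio between two expression values on the original scale."""
    return patient / control


def log_fold_change(patient, control):
    """Difference of log2 values, equal to log2 of the raw ratio."""
    return math.log2(patient) - math.log2(control)


# Hypothetical intensities: one gene doubled, one gene halved.
fc_up = fold_change(200.0, 100.0)
fc_down = fold_change(100.0, 200.0)
log_fc_up = log_fold_change(200.0, 100.0)
log_fc_down = log_fold_change(100.0, 200.0)

# Raw ratios sit at unequal distances from 1 (no change) ...
raw_distances = (fc_up - 1.0, 1.0 - fc_down)
# ... while the logged values sit at equal distances from 0.
log_distances = (log_fc_up, -log_fc_down)
```

A doubling and a halving are equally large changes biologically, and only the log scale treats them that way.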

Linear regression modelling • Often used approach • For those familiar: corresponds to ANOVA

A basic example: two groups • Suppose we have an experiment with patients and controls • How can we compute the difference between those for a certain gene?

Experimental design

The model • Gene_expression_A ~ group • This means that gene expression is modelled based on group • The average in patients is allowed to be different from the average in controls

Contrasts • Linear regression modelling computes coefficients for each of the variables in the model • Such as group • From these we can compute the differences between the groups, called contrasts

Contrasts and fold changes • Contrasts directly correspond to the logFC between the groups • To get the FC (ratio) for the data on the original scale we can easily compute: FC = 2^logFC
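For the two-group model Gene_expression_A ~ group, the fitted group coefficient equals the difference in group means on the log2 scale, that is, the logFC. The sketch below fits that model by ordinary least squares with a 0/1 group indicator. The data values and the function name are hypothetical.

```python
from statistics import mean


def fit_two_group_model(expression, is_patient):
    """Least-squares fit of expression ~ group with a 0/1 group indicator.

    Returns (intercept, slope): the intercept is the control mean and the
    slope is the patient-minus-control difference.
    """
    x_bar = mean(is_patient)
    y_bar = mean(expression)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(is_patient, expression))
    sxx = sum((x - x_bar) ** 2 for x in is_patient)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical log2 expression values: three controls, then three patients.
log2_expression = [7.0, 7.2, 6.8, 8.0, 8.2, 7.8]
group = [0, 0, 0, 1, 1, 1]

control_mean, log_fc = fit_two_group_model(log2_expression, group)
fc = 2 ** log_fc  # back to a ratio on the original scale
```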

More extensive models • Gene_expression_A ~ group + day • Gene_expression_A ~ group + day + group*day • Or when group has more than two levels, for example three: Gene_expression_A ~ group enables estimation of three contrasts (group 1 versus 2, group 1 versus 3, and group 2 versus 3)
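For a model with group as the only variable, the fitted value of each group is simply its mean, so the pairwise contrasts are differences between group means. The sketch below enumerates the three contrasts for a three-level group. Group names and values are hypothetical, and models with extra terms such as day are not covered here.

```python
from itertools import combinations
from statistics import mean


def pairwise_contrasts(values_by_group):
    """Return {(g1, g2): mean(g2) - mean(g1)} for every pair of groups."""
    group_means = {g: mean(v) for g, v in values_by_group.items()}
    return {
        (g1, g2): group_means[g2] - group_means[g1]
        for g1, g2 in combinations(sorted(group_means), 2)
    }


# Hypothetical log2 expression values of one gene in three groups.
data = {
    "group1": [7.0, 7.2],
    "group2": [8.0, 8.2],
    "group3": [6.5, 6.7],
}
contrasts = pairwise_contrasts(data)
```

Each value in `contrasts` is a logFC, so the sign tells the direction of change of the second group relative to the first.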

Example output

Recall:

Significant, but … relevant??? • Is a FC of 1.005, with a p-value of 0.0001, biologically relevant? • One can also put a cut-off on the FC

Volcano Plot • Shows both the significance and the logFC • P values on a -log10 scale • Image: J. Pennings, RIVM, NL
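A volcano plot places each gene at (logFC, -log10 P), and a combined filter keeps only genes that pass both a P value cut-off and a fold-change cut-off. The sketch below computes both; the gene results and the cut-off values (P < 0.05, FC of at least 1.5 in either direction) are assumptions chosen for illustration.

```python
import math

# Hypothetical per-gene results. Gene_B mimics an FC of about 1.005.
results = [
    {"gene": "Gene_A", "logFC": 1.5, "p": 0.0002},
    {"gene": "Gene_B", "logFC": math.log2(1.005), "p": 0.0001},
    {"gene": "Gene_C", "logFC": -1.2, "p": 0.2},
]


def volcano_point(row):
    """x = logFC, y = -log10(P): small P values end up high in the plot."""
    return row["logFC"], -math.log10(row["p"])


def is_hit(row, p_cutoff=0.05, fc_cutoff=1.5):
    """Keep genes that are both significant and changed by at least fc_cutoff."""
    return row["p"] < p_cutoff and abs(row["logFC"]) >= math.log2(fc_cutoff)


points = [volcano_point(row) for row in results]
hits = [row["gene"] for row in results if is_hit(row)]
```

Gene_B has the smallest P value but is filtered out by the fold-change cut-off, which is exactly the "significant, but relevant?" case.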

Part 2: Multiple testing

Is a P of 0.05 significant? • 5000 to 50000 tests

Suppose 7000 genes • At a cut-off of 0.05, expected: 7000 * 0.05 = 350 significant by chance

Corrections for multiple testing • FWER (family-wise error rate): correct the P value for the number of tests • The simplest example is the Bonferroni correction • Corrected P value cut-off = 0.05 / number of tests done • For example: 0.05 / 7000 = 7.14e-06 • Too strict: any results left?
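The arithmetic behind the expected number of chance findings and the Bonferroni correction can be sketched as follows. The function names are illustrative; the equivalent view of multiplying each P value by the number of tests (capped at 1) is included as well.

```python
def expected_false_positives(n_tests, alpha):
    """Number of tests expected to pass the cut-off by chance alone."""
    return n_tests * alpha


def bonferroni_threshold(alpha, n_tests):
    """Per-test cut-off that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def bonferroni_adjust(p_values):
    """Equivalent view: multiply each P value by the number of tests."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


by_chance = expected_false_positives(7000, 0.05)
threshold = bonferroni_threshold(0.05, 7000)
adjusted = bonferroni_adjust([0.00001, 0.01, 0.5])
```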

FDR • Other corrections are more realistic • For example correcting the False Discovery Rate • These corrections make sure that the proportion of False Positives among the hits is controlled • FDR = number of wrong hits / total number of hits • This means the error rate is expressed relative to the number of positive (significant) tests, not the total number of tests
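One widely used FDR procedure is the Benjamini-Hochberg correction. The slides do not name a specific method, so its use here is an assumption. The sketch below computes adjusted P values: each P value is scaled by (number of tests / rank), and the result is made monotone from the largest P value downwards.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    # Indices of the P values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that an adjusted
    # value never exceeds the adjusted value of a larger P value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw_p)
significant = [i for i, q in enumerate(fdr) if q < 0.05]
```

Under a Bonferroni correction with four tests only P values below 0.0125 would pass; here all four genes stay below an FDR of 0.05.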

Part 3: Unsupervised methods

Supervised versus unsupervised • Methods such as statistical testing are supervised • One can also apply unsupervised methods • Two of those we have already seen at the QC

Clustering • One can cluster samples, genes or both • Image from J. Pennings, RIVM, NL

Similarity of two expression profiles • Euclidean distance • Correlation distance
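The two distance measures can give very different answers for the same pair of profiles. The sketch below uses 1 minus the Pearson correlation as the correlation distance, which is one common definition and an assumption here. The two profiles are hypothetical: they have the same shape but a different overall level.

```python
import math
from statistics import mean


def euclidean_distance(a, b):
    """Straight-line distance: sensitive to differences in overall level."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient of two equally long profiles."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    var_a = sum((x - a_bar) ** 2 for x in a)
    var_b = sum((y - b_bar) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_distance(a, b):
    """1 - correlation: close to 0 when profiles have the same shape."""
    return 1.0 - pearson_correlation(a, b)


# Same pattern across four samples, shifted by a constant.
profile_1 = [1.0, 2.0, 3.0, 4.0]
profile_2 = [11.0, 12.0, 13.0, 14.0]

d_euclid = euclidean_distance(profile_1, profile_2)
d_corr = correlation_distance(profile_1, profile_2)
```

The Euclidean distance calls these profiles far apart, while the correlation distance calls them nearly identical, so the choice of measure determines what "similar" means in a clustering.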

Building a tree • Steps 0 to 4, starting from the single items a, b, c, d, e • Clusters formed along the way: {a,b}, {d,e}, {c,d,e}, {a,b,c,d,e} • Tree is constructed! • Adapted from Kaufman and Rousseeuw (1990)
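Building a tree bottom-up can be sketched as repeated merging of the two closest clusters. The code below uses single linkage (the distance between clusters is the distance between their closest members) on one-dimensional positions. Both the linkage choice and the positions of a to e are assumptions, picked so that the merges mirror the clusters named on the slide.

```python
def single_linkage(points):
    """Agglomerative clustering; returns the merges as (cluster, distance)."""
    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    abs(points[p] - points[q])
                    for p in clusters[i]
                    for q in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] | clusters[j]
        # Replace the two merged clusters by their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merges.append(("".join(sorted(merged)), d))
    return merges


# Hypothetical positions of five items on a line.
positions = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 7.5, "e": 9.0}
tree = single_linkage(positions)
merge_order = [name for name, _ in tree]
```

The merge distances give the heights at which branches join in the resulting dendrogram.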

PCA • Also here scaling is important

Part 4: The arrayanalysis.org statistics module

Limma • A Bioconductor package for R that allows for linear modelling • It uses an ‘adapted’ (moderated) t-test • Improved estimate of variation • It does Bayesian smoothing of the gene-wise variance estimates, which carries through to the P values • This package is called by arrayanalysis.org
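The idea behind the ‘adapted’ t-test can be sketched in plain Python: the variance of each gene is pulled towards a prior variance shared by all genes before the t statistic is computed. This is a conceptual illustration, not limma's API. In limma the prior variance and prior degrees of freedom are estimated from all genes together, whereas here they are fixed, hypothetical inputs.

```python
import math
from statistics import mean, variance


def moderated_t(group_a, group_b, prior_df, prior_var):
    """Two-group t statistic with the variance shrunk towards prior_var.

    prior_df and prior_var are hypothetical fixed values in this sketch.
    Returns (t statistic, shrunk variance).
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / df
    # Weighted average of the prior variance and the gene's own variance.
    shrunk_var = (prior_df * prior_var + df * pooled_var) / (prior_df + df)
    log_fc = mean(group_b) - mean(group_a)
    standard_error = math.sqrt(shrunk_var * (1 / n_a + 1 / n_b))
    return log_fc / standard_error, shrunk_var


# Hypothetical gene with a very small variance in only three replicates.
controls = [7.0, 7.0, 7.0]
patients = [8.0, 8.0, 8.1]

t_plain, var_plain = moderated_t(controls, patients, prior_df=0, prior_var=0.0)
t_mod, var_mod = moderated_t(controls, patients, prior_df=4, prior_var=0.05)
```

With few replicates a gene can show a tiny variance by chance and an inflated ordinary t statistic; shrinking the variance tempers that.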

arrayanalysis.org • Besides the QC and normalisation module, it contains a module for statistical analysis • This has not yet been added to the open site • So we will work on the developers’ site in the afternoon

P value histogram

Number of significant genes table

Results table

Filtered results table (Significant genes list)

arrayanalysis.org • We use the beta version of the AffyAnalysisStat module • Some inconveniences / small bugs are still there • Don’t worry! • In the practical you will get instructions on how to operate it

Project members • Lars Eijssen • Magali Jaillard • Anwesha Dutta
