Learn about the concept of r-power in statistical analysis for controlling Type I errors in multiple hypothesis testing. Explore the mathematical details, examples, and future directions in this research area.
R-power, conservatism and null
Nairanjana Dasgupta, Director, Center for Interdisciplinary Statistics Education and Research; Professor of Statistics, Washington State University
Collaborators
• Nicole Lazar, UGa
• Alan Genz, WSU
• Xiongzhi Chen, WSU
• Boris Houenou, WSU
Papers:
• Dasgupta, N., Lazar, N. A., & Genz, A. C. (2016). A look at multiplicity through misclassification. Sankhya, 98(1), 98-116.
• Houenou, B., Chen, X., and Dasgupta, N. Misclassification using top-r under positive dependence (to be submitted).
Outline:
• Background
• Bit o' Lit
• Idea of ranking
• Notations
• Concept of r-power
• Mathematical details
• What about dependence
• Estimating the number of nulls
• Examples
• Where we go from here
Background:
• With the informatics revolution it has become increasingly necessary and common to test large numbers of hypotheses simultaneously.
• Sir Ronald Fisher is credited with saying "Nature will respond to a carefully thought out questionnaire; indeed if we ask her a single question, she will often refuse to answer till some other topic has been discussed."
• However, simultaneous testing also leads to the well-known problems of inflation of Type I error and false positives.
• The particular problem of controlling Type I error in the presence of multiple hypothesis tests is an old one, going back to work by Fisher, Tukey and other stalwarts of statistics.
Bit o' Lit:
• The growing prevalence of massive data sets arising from the biological and medical sciences in particular shifted the statistical question from multiple comparisons to multiple testing.
• Original methods: Bonferroni (FWER).
• Holm (1979) introduced the idea of step-down methods.
• Simes (1986) formulated a conjecture to provide more powerful tests for multiple testing; the conjecture was later proved by Sarkar (1998).
• This was soon followed by step-up methods (Hochberg, 1988; Hommel, 1988).
More Lit: FDR and the related papers
• Benjamini and Hochberg (1995) suggested a new control quantity, the False Discovery Rate (FDR).
• Idea: instead of trying to control the probability of falsely rejecting even one null, the goal is to control the average proportion of false discoveries (rejected nulls) among all of the rejected hypotheses.
• The utility of this approach for large-scale multiple testing problems was quickly realized.
• There was a flurry of statistical development to modify the original technique and, more generally, to explore techniques for massive multiple testing.
Motivating data: The ALL Data Set Microarrays
• We analyze the Acute Lymphoblastic Leukemia (ALL) example, which is freely available from the Bioconductor website.
• The data set contains genetic microarray results for 128 individuals with ALL.
• Out of these, 95 individuals have B-cell ALL and 33 have T-cell ALL.
• Each microarray has 12625 features, which we will call "probes."
• The Philadelphia chromosome (BCR/ABL) was the first cytogenetic aberration that was proved to be associated with uncontrolled cell division. However, not all individuals with lymphoma carry this aberration.
• OUR question of interest is whether or not there is ANY difference in the genetic signatures among the B-cell ALL patients who carry the aberration and those who do not.
• The sample sizes for the two groups are b1 = 37 (have the BCR/ABL mutation) and b2 = 42 (do not have the BCR/ABL mutation).
• We first look at all the probes to see the quality of the data.
• We then conduct a t-test to find the ones that are significantly differentially expressed in the two groups of patients.
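A minimal R sketch of this per-probe comparison, assuming the Bioconductor ALL and genefilter packages; the subsetting follows the description above, but the exact quality-control filter used in the talk is not reproduced here.

```r
library(ALL)         # Bioconductor data package containing the ALL ExpressionSet
library(genefilter)  # rowttests() for fast per-probe two-sample t-tests

data(ALL)
# B-cell patients whose molecular biology is either BCR/ABL or NEG (no aberration)
keep  <- grepl("^B", as.character(ALL$BT)) &
         ALL$mol.biol %in% c("BCR/ABL", "NEG")
eset  <- ALL[, keep]
group <- factor(as.character(eset$mol.biol))   # BCR/ABL vs. NEG

# (A probe-level quality-control filter would be applied here before testing;
#  the specific screen used in the talk is not reproduced in this sketch.)

tt <- rowttests(exprs(eset), group)            # pooled t-statistic per probe
head(tt[order(tt$p.value), ])                  # most differentially expressed probes
```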
Results for the ALL data set:
• A total of N = 6312 out of the original 12625 probes pass the "quality control" step. We compare these for differential expression in the two groups using the pooled t-test.
• To calculate r-power, we need values of N, k, δ, and r; recall that m = N − k.
• We take N = 6312 and estimate k by 1021, the number of null hypotheses rejected without any multiplicity correction (naïve method).
• To get an idea of r, we look at the number of probes rejected using some different multiplicity criteria:
• 31 (Bonferroni, Hochberg step-up, Holm step-down),
• 70 (Benjamini–Yekutieli FDR), or
• 230 (Benjamini–Hochberg FDR).
Idea of Ranking and Selection in Multiplicity
• Recently, selection based on ranking of test statistics or p-values has been considered; see Kuo and Zaykin (2011; 2013).
• We approach the multiplicity question from the perspective of selection via ranking.
• Our angle: finding the correct way to construct a "top table", or a list of "possible contenders" to follow up on.
• The term "top-table" was introduced by Smyth publicly at a workshop given at the Australian Genstat Conference in December 2002.
• The LIMMA (Smyth, 2005) package in R/Bioconductor directly gives top-tables after moderated t or F statistics are calculated.
• The ranking approach has also been used in the analysis of functional magnetic resonance imaging (e.g. Knecht et al., 2003; Abbott et al., 2010), although it is not as prevalent there.
Ranking contd…
• In general, when one does multiple testing, the researcher cannot physically follow up on all the hypotheses.
• There are time and financial constraints in most studies, and practically the researcher can follow up on only a limited number of hypotheses.
• So, rather than following up on the ones that are "significant", why not focus on how many one can physically follow up on at a given time?
• Let us call this number r.
• Why don't we focus on a list of the top r contenders and see how reliable this list is (what is the probability that it came from the alternative hypotheses)?
Our Idea:
• This talk focuses on two related questions:
• Is there a theoretical way to find r, the number of top candidate hypotheses to follow up in a large-scale study?
• If we already have a specific r in mind, can we figure out how reliable this top-r list is?
• Related questions are:
• Is there a way to find an ideal r?
• What should be taken into account to find an optimal r?
• How do we actually define this, for practitioners to use?
• Can we make this feasible?
• Obviously, some criteria of optimality have to be defined.
• In our approach we define the probability of "correct classification" as that criterion.
Notation: one sample, one sided
• Let Y_1j, ..., Y_Nj for j = 1, ..., b denote the b observations on each of the N hypotheses of interest:
• a total of bN observations,
• N hypotheses with b observations each.
• We assume that b is "large."
• We further assume that Y_ij ∼ N(µ_i, σ²) for i = 1, ..., N. For simplicity we assume equal variances.
• Hypotheses of interest: H_0i: µ_i ≤ µ_0 versus H_Ai: µ_i > µ_0.
• The means µ_i come from two different populations,
• with k of the µ_i = µ_0 and
• (N − k) of the µ_i = µ_1, with µ_1 > µ_0.
• Let ȳ_i and s_i denote the sample mean and sample standard deviation for the ith hypothesis of interest.
Defining the Problem Mathematically:
• Test statistic for one sample, one-sided: t_i = √b (ȳ_i − µ_0) / s_i.
• Hence, under large sample sizes, t_i is approximately
• N(0,1), denoted by f_0, for k of the hypotheses,
• and N(δ,1), denoted by f_1, for (N − k) of the hypotheses,
• with δ = √b (µ_1 − µ_0)/σ being the effect size.
For Two Samples: one sided
• Let Y_g1j, ..., Y_gNj for j = 1, ..., b_g denote the b_g observations from each sample g = 1, 2 for the N hypotheses of interest.
• As before, assume Y_gij ∼ N(µ_gi, σ_g²) for i = 1, ..., N.
• For simplicity we assume equal variances.
• Our hypotheses of interest are
• H_0i: µ_1i − µ_2i ≤ 0
• H_Ai: µ_1i − µ_2i > 0
• ȳ_gi and s_gi denote the sample mean and sample standard deviation for the ith hypothesis of interest for sample g.
Paired:
• When the samples are paired, b_1 = b_2 = b and we define the difference d_ij = Y_1ij − Y_2ij.
• Hence, d_ij ∼ N(µ_di, σ_d²) for i = 1, ..., N.
• With d̄_i as the sample mean of the differences and s_i as the sample standard deviation, the test statistic is the paired t, t_i = √b d̄_i / s_i.
• As before, we assume that t_i is approximately
• N(0,1), denoted by f_0, for k of the hypotheses, and
• N(δ,1), denoted by f_1, for (N − k) of the hypotheses, with
• δ being the effect size.
Un-paired
• Using the usual notation for two sample means, we define our test statistic as the pooled two-sample t:
$$ t_i = \frac{\bar y_{1i} - \bar y_{2i}}{s_{p,i}\sqrt{1/b_1 + 1/b_2}}, $$
where s_p,i is the pooled standard deviation for the ith hypothesis.
• As before, we assume that t_i is approximately
• N(0,1), denoted by f_0, for k of the hypotheses, and
• N(δ,1), denoted by f_1, for (N − k) of the hypotheses, with
• δ being the effect size.
Mathematical Details
• For comparing one or two means, the test statistics are normally distributed and come from two groups:
• group 0 (hypotheses supporting the null) and
• group A (hypotheses supporting the alternative), with
• k and N − k members respectively, and
• δ defined as the corresponding effect size.
• Let the test statistics for the two groups be t_0 and t_A.
Introducing: Order Statistics
In group 0 the t_0 are iid N(0,1); denote by z_(1), ..., z_(k) the order statistics of these random variables. Then, by known results, the pdf of the largest order statistic is
$$ f_{z_{(k)}}(z) = k\,[\Phi(z)]^{k-1}\, f_0(z). $$
In group A the t_A are iid N(δ,1), with pdf and cdf f_δ and F_δ. We denote by u_(1), ..., u_(m), with m = N − k, the order statistics of these random variables. Then the pdf of the (N − k − r) = (m − r)th order statistic, v, is given by
$$ f_{u_{(m-r)}}(v) = \frac{m!}{(m-r-1)!\,r!}\,[F_\delta(v)]^{m-r-1}\,[1 - F_\delta(v)]^{r}\, f_\delta(v). $$
Misclassification
• Misclassification can happen if:
• one of the entries in our top-r list comes from the null.
• This is possible if the (m − r)th order statistic from the alternative is smaller than the largest observation in the null. Hence, if we can find the probability of this event, we can quantify the reliability of our top-r list.
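For illustration, here is a minimal Monte Carlo sketch in R of the probability just described, under the normal model and independence assumptions above. The paper evaluates r-power by numerically integrating the order-statistic densities; this simulation-based version and the name r_power_mc are only an illustrative stand-in.

```r
# Monte Carlo approximation of r-power: the probability that every entry of the
# top-r list comes from the alternative, with k null statistics ~ N(0, 1) and
# m = N - k alternative statistics ~ N(delta, 1), assumed independent.
r_power_mc <- function(N, k, delta, r, n_sim = 10000) {
  m <- N - k
  stopifnot(r <= m)
  correct <- logical(n_sim)
  for (s in seq_len(n_sim)) {
    stats <- c(rnorm(k), rnorm(m, mean = delta))      # group 0 then group A
    top_r <- order(stats, decreasing = TRUE)[1:r]     # indices of the top-r list
    correct[s] <- all(top_r > k)                      # indices > k are alternatives
  }
  mean(correct)
}

# Example with Table-1-style settings (N = 1000, a = k/N = 0.90, delta = 2.2, r = 5):
# r_power_mc(N = 1000, k = 900, delta = 2.2, r = 5)
```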
Tables and Computation:
• Hence, r-power, rP, depends on:
• r, the size of the top table;
• N, the number of hypotheses tested;
• k, the number of hypotheses coming from the null;
• and δ, the distance between the null and the alternative.
• We provide rP for N = 1000 for various choices of the other parameters for the one-sided case.
• Keeping r ≤ (N − k), we report rP for r = 5, 10, 50 and 100.
• For the two-sided case we also provide tables for N = 1000, in which we report rP for r = 5, 10, 50 and 100.
Table 1: Results for r-power for N = 1000 (one-sided case), tabulated by effect size δ (rows) and a = k/N (columns, 0.50 to 0.99), in blocks for r = 5 (δ = 2.2 to 3.0) and r = 100 (δ = 1.0 to 2.8).
Table 5: Results for r-power for the two-sided case with N = 1000, tabulated by δ (rows) and a = k/N (columns, 0.10 to 0.90), shown for r = 10.
Findings on r-power, rP
From the tables it is evident that r-power, rP,
• increases as the effect size δ increases,
• decreases with the ratio p0 = k/N, the proportion of the hypotheses coming from the null.
• As the effect size increases, it is easier to distinguish the alternative from the null, and so power should increase.
• Similarly, when the proportion of hypotheses coming from the null increases, i.e. the "signal" of interest is sparser, these signals become harder to detect.
• A very important part of r-power is figuring out k, the number of hypotheses that came from the null.
• It makes sense to require r < N − k: if we pick r to be larger than the number of hypotheses in the alternative, some entries in our list must come from the null.
Estimating the number of null hypotheses
• Estimating the number of TRUE nulls in a large-scale testing problem has been getting attention recently.
• Since FDR, more and more people have looked at this, so the idea is not new.
• This is where the collaboration with Chee (Dr. Xiongzhi Chen) begins.
• The approach is based on some mathematical ideas, which I set up here.
Estimating the number of nulls:
• Set up:
• Without loss of generality we will assume that we are testing
• H_0i: µ_i = 0 versus H_Ai: µ_i ≠ 0.
• Here we are assuming that the null hypothesis is scaled to 0.
• If we can find a function such that
• G(µ_i) = 1 when µ_i = 0 and
• G(µ_i) = 0 when µ_i ≠ 0,
• then ΣG(µ_i)/N would give us an estimate of p0.
• So the question is: is there such a function?
A possibility:
A function ψ(t; µ) with these properties (approximately, with the approximation improving as t grows) gives us a function to look at. Now we need to tie this in to data. Ideally we would like this function to be an expectation.
An Estimator:
• Now, consider an integral transform h(t, x) of the data, constructed so that when x ∼ N(µ, 1) we have E[h(t, x)] = ψ(t; µ).
• Hence, for a given data set, we can calculate h(t, x_i) for each test statistic, and our estimator is
$$ \hat p_0 = \frac{1}{N}\sum_{i=1}^{N} h(t, x_i), \qquad \hat k = N\,\hat p_0. $$
• Hence, given a data set, we can estimate k, the number of nulls in the data set.
• To maintain an accurate estimate we need t to be large.
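A minimal R sketch of this estimator, assuming the transform h(t, x) from Chee's construction is supplied as a function (its specific form is not reproduced in these slides); estimate_nulls and its arguments are hypothetical names.

```r
# Estimate the proportion and number of true nulls from test statistics x,
# given a transform h_fun(t, x) satisfying E[h_fun(t, X)] = psi(t, mu) when
# X ~ N(mu, 1). Both h_fun and the tuning value t are assumed supplied by the
# user (e.g., from Dr. Chen's construction); the names here are hypothetical.
estimate_nulls <- function(x, h_fun, t) {
  p0_hat <- mean(sapply(x, function(xi) h_fun(t, xi)))  # estimate of p0 = k/N
  list(p0_hat = p0_hat, k_hat = round(length(x) * p0_hat))
}
```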
Back to r-power
• As we saw, r-power depends upon k, the number of nulls, which can now be estimated for a problem.
• It also depends upon N (the number of hypotheses tested), known for a specific problem,
• and the effect size, which we can vary as we compute r-power.
• So, in general, for a specific problem we can estimate r-power.
• However, we did make an assumption when we calculated r-power: that our hypotheses were independent.
• This is generally not a tenable assumption. However, without it, computing r-power would not be mathematically feasible.
• So let us look at the effect of dependence among the hypotheses on r-power.
Effect of Correlation: Simulation Study
• To see the effect of correlation we conducted a simulation study:
• 1000 hypotheses, such that 100 are generated from the alternative and 900 from the null Normal (mean = 0, sd = 1).
• The ones from the alternative were generated in clusters of 10 from a multivariate normal with correlations varying from 0 to 0.95, a mean of δ (varying), and sd = 1.
• For example, one of the data sets had 10 clusters, each with a correlation of 0.95, from an MVN with a mean of 0.5 and a standard deviation of 1.
• We then did our tests and ranked our test statistics from largest to smallest.
Simulation continued:
• For each case, we estimated k using a naïve method (counting the number rejected without any multiplicity correction) and using Dr. Chen's method.
• We then varied r, the number of hypotheses in our list, from 1 to 100. For each list we checked whether the top-r list came entirely from the alternative population or not.
• If a single observation was from the null, that was considered a failure.
• We computed r-power as the number of correct lists out of the total number of simulations.
• A sketch of this simulation is given below.
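A minimal R sketch of the simulation just described, assuming the MASS package for multivariate normal draws; the function name sim_r_power_dep and its defaults are illustrative rather than the exact code used for the talk, and the estimation of k (naïve and Chen methods) is omitted here.

```r
library(MASS)  # mvrnorm() for correlated alternative clusters

# r-power under dependence: 900 null statistics ~ N(0, 1) and 100 alternative
# statistics generated in clusters of size 10 from a multivariate normal with
# exchangeable correlation rho and mean delta. r-power is the proportion of
# simulations in which every member of the top-r list comes from the alternative.
sim_r_power_dep <- function(n_sim = 1000, N = 1000, m = 100, delta = 0.5,
                            rho = 0.95, cluster_size = 10, r = 10) {
  k <- N - m
  Sigma <- matrix(rho, cluster_size, cluster_size)
  diag(Sigma) <- 1
  correct <- logical(n_sim)
  for (s in seq_len(n_sim)) {
    nulls <- rnorm(k)
    alts  <- as.vector(replicate(m / cluster_size,
                                 mvrnorm(1, mu = rep(delta, cluster_size),
                                         Sigma = Sigma)))
    stats <- c(nulls, alts)                           # nulls first, then alternatives
    top_r <- order(stats, decreasing = TRUE)[1:r]     # ranked largest to smallest
    correct[s] <- all(top_r > k)                      # any null in the list = failure
  }
  mean(correct)
}
```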
Graph showing the effect of correlation. The worst-case scenario for power was when correlation = 0 (independence).
Conservative r-power
• Given our simulation results, and the fact that independence is the worst-case scenario, we decided that if we assume independence in our calculation of r-power, we will be conservative and thus err on the side of caution.
• Hence, the definition of r-power we provide also works for cases where independence is not satisfied, and is conservative there.
Looking at r-power: Chee's method versus the naïve method. As the effect size goes up, the two methods agree for the most part.
Recall: r-power under independence (one-sided and two-sided).
Results for the ALL data set:
• Recall that for the ALL data we had
• 31 (Bonferroni, Hochberg step-up, Holm step-down),
• 70 (Benjamini–Yekutieli FDR), or
• 230 (Benjamini–Hochberg FDR) rejections.
• Now we can calculate r-power for this data.
• With δ = 2.0, the r-powers are
• 0.83, 0.46, and 0, respectively;
• with δ = 3.0, the r-powers are
• 1.00, 0.99, and 0.79.
• The second way of using our method would be to estimate r given a previously chosen r-power.
• For example, if we want r-power to be 0.75 with δ = 2.0, r would need to be 40 in this example. With δ = 3.0 and r-power of 0.75, we have r = 70.
Example 2: Analysis of a Functional Magnetic Resonance Imaging (fMRI) Study
• In this experiment, subjects perform a series of saccade, or eye movement, tasks while fMRI data are collected.
• The experiment is carried out in a block design, with blocks of "anti-saccade" alternating with blocks of fixation. In the anti-saccade condition, a visual cue appears on a screen inside the scanner. After some short delay, the subject is to direct the gaze to the mirror-image location on the screen; for example, if the cue appears on the left, the saccade should be to the right. During fixation blocks, the subject simply looks at a cross-hair in the middle of the screen.
• More precisely, the experiment consists of thirteen alternating blocks, an initial fixation block followed by six (anti-saccade, fixation) pairs.
• Each fixation block lasts for 22.5 seconds, whereas each task block lasts for 25.2 seconds, during which there are eight anti-saccade trials.
• Over the course of the study, 81 whole-brain images are collected.
• The images consist of 24 slices of size 64 × 64 voxels each.
Analysis and results from the fMRI study
• Question of interest: identification of locations in the brain which show increased levels of activation in response to the anti-saccade task, compared to fixation.
• Ignoring both the spatial and temporal correlation that are present in the data, at each voxel we calculate a two-sample t-statistic to compare levels of measured signal during task scans to those acquired during rest scans.
• We can then obtain p-values relative to either a standard Normal distribution or a t-distribution with degrees of freedom equal to (number of scans − 2); since this is a large number, the two will give similar results.
• Using our top-r approach, for each subject we find the r most significant voxels, with r = 19, 48, 96, 192, representing roughly 1%, 2.5%, 5%, 10% of the image, and tabulate the corresponding r-power for different choices of k and δ.
• Note that these values are relative to the total number of voxels in the slice, 40 × 48 = 1920.
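A small R sketch of the voxelwise calculation just described, assuming the scans for one subject are available as a matrix with one row per voxel and one column per scan, together with a task/fixation label for each scan; voxel_data, scan_type, and top_r_voxels are illustrative names.

```r
# Voxelwise two-sample t-statistics (task scans vs. fixation scans) for one
# subject, ignoring spatial and temporal correlation as described above.
#   voxel_data : numeric matrix, rows = voxels, columns = scans (assumed available)
#   scan_type  : factor with levels "task" and "fixation", one entry per scan
top_r_voxels <- function(voxel_data, scan_type, r = 19) {
  t_stat <- apply(voxel_data, 1, function(v)
    t.test(v[scan_type == "task"], v[scan_type == "fixation"])$statistic)
  order(t_stat, decreasing = TRUE)[1:r]   # indices of the r most significant voxels
}
```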
Figure 1: Top 19 most significant voxels plotted on the individual subject brains.
Figure 2: Top 48 most significant voxels plotted on the individual subject brains.
Figure 3: Top 96 most significant voxels plotted on the individual subject brains.
Figure 4: Top 192 most significant voxels plotted on the individual subject brains.
Figure 5: For each value of top most significant voxels, the proportion of subjects showing that location in the individual map. These are obtained by averaging the binary images in Figures 1–4.
Results for fMRI data
• These are plotted for all subjects and the different values of r in Figures 1–4.
• Also, as a final step, we get a group map for each value of r by summing up, at each voxel, these 0s and 1s, and dividing by the number of subjects.
• Hence a voxel that is not in the top r for any subject will get a value of 0, and a voxel that is in the top r for all subjects will get a value of 1 in the group image.
• Already for r = 19, or the top 1% of voxels in the image, the typical regions of activation for this task (frontal and supplementary eye fields, and posterior parietal cortex) are apparent.
• As r increases, these areas are highlighted ever more strongly in the group map; however, the r-power decreases.
• Group summaries such as the ones we present here are an effective way of extracting the overall patterns, which can be obscured at the level of individual subjects.
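A short R sketch of the group-map construction just described, assuming top_lists is a list with one vector of top-r voxel indices per subject (for example, the output of the hypothetical top_r_voxels above) and n_voxels is the number of voxels in the slice; both names are illustrative.

```r
# Group map: at each voxel, the proportion of subjects whose top-r list contains
# that voxel (0 = in no subject's list, 1 = in every subject's list).
#   top_lists : list of integer vectors of top-r voxel indices, one per subject
#   n_voxels  : total number of voxels in the slice (e.g., 40 * 48 = 1920)
group_map <- function(top_lists, n_voxels) {
  binary_maps <- sapply(top_lists, function(idx) {
    b <- numeric(n_voxels)
    b[idx] <- 1          # binary image: 1 if the voxel is in this subject's top r
    b
  })
  rowMeans(binary_maps)  # average the binary images across subjects
}
```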
Summary:
• There is still a gap between theory and practice.
• Our manuscript suggests an extra piece to this practice, namely the calculation of the probability of correct classification, given the size of the table. One can use our method in two different ways.
• First, we can decide on a value of r-power to achieve and then, based on this, decide how many hypotheses we should follow up on.
• Second, once we have picked the r top candidates using some criterion, we can then calculate the probability that this list indeed came from the research hypotheses.
• The main contribution of this manuscript is quantification of a measure which we have used, or seen used, for multiplicity adjustment.
• We can now answer with more certainty how sure we are that the "discoveries" (voxels, genes, probes, etc.) did indeed come from the research hypothesis.
• This is a type of reliability measure for the top-table.
Future Work:
• Analytically (and via simulation) examine how well the naïve method and Chee's method compare to the TRUE number of nulls (potential PhD project).
• Write R code or SAS code so that one can calculate r-power in general for data (MS project).
• Our criterion for r-power was stringent: even if there is ONE member in the list that comes from the null, the list is considered wrong. This is family-wise-error-esque in idea. We could extend it: what if we allow k errors in the list and then find r-power? (Related to k-FWER.) This is a good PhD project.