
Filtering Enhanced Variable Selection: A Method for High-Dimensional Data Sets

The study presents "Filtering Enhanced Variable Selection" (FEVS), a class comparison method for high-dimensional data sets that incorporates filtering in order to identify informative variables while reducing the multiple comparison problem. Non-informative variables are removed before analysis, which can improve sensitivity. The work examines the criteria that define a filtering method, the interplay between filtering and variable selection, and the associated statistical controls. FEVS adapts to the data and avoids the arbitrary choice of a single pre-specified filtering method, offering guidance on variable selection strategies for small sample sizes.


Presentation Transcript


  1. A Class Comparison Method with Filtering Enhanced Variable Selection for High-Dimensional Data Sets Lara Lusa* Edward L. Korn** Lisa M. McShane** *Institute of Biomedical Informatics, University of Ljubljana, Slovenia **Biometric Research Branch, National Cancer Institute, NIH, Bethesda, MD, USA

  2. Filtering Enhanced Variable Selection (FEVS): A Class Comparison Method (Multiple Testing Strategy) for High-Dimensional Data Sets. Many variables are measured for each sample (gene expression, proteomics, …), so thousands of hypotheses are tested simultaneously. The aim is to identify variables whose expression is associated with a response or covariate: • different in normal and tumor tissue • different after a treatment • associated with survival • … Filtering removes non-informative variables before data analysis and can reduce the multiple comparison problem. It is widely applied but arbitrary: it can be beneficial (gain in sensitivity) or disadvantageous (truly differentially expressed variables may be filtered out).

  3. What defines a filtering method? Which variables to filter out, and how many (stringency)? The filtering ranking-statistic (FRS) can be class independent (variance; range; interquartile range; 95th−5th percentile) or class dependent (fold difference between two classes). Stringency can be set as • a number of variables (the K variables with largest FRS) • a percentage of variables (the P% of variables with largest FRS) • a threshold (variables with FRS > constant).
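The class-independent FRS options on this slide can be sketched in a few lines. This is an illustrative toy, not the authors' implementation; the variable names and toy data are our own.

```python
# Class-independent filtering ranking-statistics (FRS) and a fixed-K
# stringency rule, as described on the slide. Pure-Python illustration.
import statistics

def iqr(values):
    """Interquartile range: 75th minus 25th percentile."""
    qs = statistics.quantiles(values, n=4)   # returns [Q1, Q2, Q3]
    return qs[2] - qs[0]

def frs_variance(values):
    return statistics.variance(values)

def frs_range(values):
    return max(values) - min(values)

def filter_top_k(data, frs, k):
    """Keep the k variables with the largest FRS (class-independent)."""
    ranked = sorted(data, key=lambda var: frs(data[var]), reverse=True)
    return ranked[:k]

# Toy data: 4 variables measured on 6 samples.
data = {
    "gene_A": [0.1, 0.2, 0.1, 0.2, 0.1, 0.2],   # low variability
    "gene_B": [1.0, 5.0, 2.0, 6.0, 1.5, 5.5],   # high variability
    "gene_C": [3.0, 3.1, 3.0, 3.1, 3.0, 3.1],   # low variability
    "gene_D": [0.0, 4.0, 0.5, 4.5, 0.2, 4.2],   # high variability
}
kept = filter_top_k(data, iqr, k=2)
print(kept)
```

With any of the three statistics, the two highly variable genes survive a top-2 filter; the choice of FRS and of K is exactly the arbitrariness FEVS is designed to avoid.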

  4. Filtering Enhanced Variable Selection. Based on the multivariate permutation testing method, which controls, with specified confidence (1−α), • the actual number of false positives (u) • (approximately) the actual proportion of false positives. The method is extended to allow statistics that are not a function of the data for a single variable only: • Wi = g( {Xi1, …, Xin}, {Yi1, …, Yim}, {X1, …, Xn, Y1, …, Ym} ) • instead of Wi = g( {Xi1, …, Xin}, {Yi1, …, Yim} ).
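The permutation machinery underlying the slide can be illustrated with a minimal two-sample permutation p-value: class labels are shuffled and the statistic recomputed under each relabelling. This is only a sketch of the general idea, not the authors' multivariate procedure, and the statistic and names are illustrative assumptions.

```python
# Minimal permutation sketch. In the extended method the statistic W_i for
# variable i may depend on the full data matrix (e.g. through a filtering
# rank), not only on variable i's own values; here we use a simple
# single-variable statistic for illustration.
import random

def t_like(x, y):
    """Absolute difference of group means: a simple two-sample statistic."""
    return abs(sum(x) / len(x) - sum(y) / len(y))

def permutation_pvalue(x, y, n_perm=999, seed=0):
    """Two-sample permutation p-value: shuffle class labels, recompute."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    observed = t_like(x, y)
    hits = 1                      # the observed labelling counts as one
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if t_like(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return hits / (n_perm + 1)

# Clearly separated groups give a small p-value; identical groups give 1.
p = permutation_pvalue([5.1, 4.9, 5.3, 5.0], [0.1, -0.2, 0.0, 0.1])
print(round(p, 3))
```

The multivariate version of this idea controls, over all variables jointly, the number (or proportion) of false positives with the stated confidence.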

  5. The choice of Wi: the minimum Bonferroni-like adjusted P-value over a set of filtering methods. Use mi instead of pi in the original and permuted data sets. An exhaustive and computationally convenient choice: S = k filtering methods, all with the same filtering ranking-statistic but with all possible degrees of stringency; the rank of the i-th variable according to the FRS (smaller rank = larger variability) determines whether it is retained. By using mi, FEVS combines the results obtainable by applying many different filtering methods.
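One way to read the slide is: for each stringency k (retain the k variables with largest FRS), a retained variable gets a Bonferroni-like adjusted p-value pi × k, and mi is the minimum over all stringencies considered. The sketch below follows that reading; the authors' exact formula may differ in detail, and the toy numbers are our own.

```python
# Hedged sketch of the m_i statistic: minimum Bonferroni-like adjusted
# P-value of variable i over a set of filtering stringencies.
def min_bonferroni(p_values, frs_ranks, stringencies):
    """m_i = min over k of (p_i * k), over stringencies k at which
    variable i survives the filter (its FRS rank r_i <= k)."""
    m = []
    for p, r in zip(p_values, frs_ranks):
        adj = [p * k for k in stringencies if r <= k]   # retained at size k
        m.append(min(adj) if adj else 1.0)              # never retained -> 1
    return m

# Toy example: 4 variables; rank 1 = largest variability.
p_values  = [0.001, 0.002, 0.04, 0.03]
frs_ranks = [1, 3, 2, 4]
m = min_bonferroni(p_values, frs_ranks, stringencies=[2, 4])
print(m)
```

A highly variable variable with a small p-value benefits from the smaller multiplier of a stringent filter, while a variable surviving only the loosest filter pays the full Bonferroni price; using mi in both the original and permuted data sets preserves the permutation-based error control.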

  6. Results: simulations and real data. Settings: • a conservative algorithm (A* of Korn et al.) to reduce the computational burden • IQR to rank the variables • "stop at 100" (excluding filtering methods that retain fewer than 100 variables) • α = 0.05 • 99 permutations in the simulations, repeated 10,000 times. Null hypothesis simulation: 50,000 variables for two groups of 5 samples, independent, g2i ~ iid N(0, σi), σi2 ~ InvGamma(a=3, b=1), FRS = IQR, stop at 100, α = 0.05. The levels of the FEVS are OK.
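The null-data generator in the simulation settings can be sketched directly: draw a per-variable variance from an inverse-gamma distribution, then draw both groups from a zero-mean normal with that variance. Scaled down to 1,000 variables (rather than 50,000) and written in pure Python for illustration; function and parameter names are ours.

```python
# Sketch of the simulation's null-data generator: sigma_i^2 ~ InvGamma(a=3,
# b=1), then every observation of variable i ~ N(0, sigma_i).
import random
import statistics

def simulate_null(n_vars=1000, n_per_group=5, seed=1):
    rng = random.Random(seed)
    data = []
    for _ in range(n_vars):
        # sigma_i^2 ~ InvGamma(a=3, b=1): reciprocal of Gamma(shape=3, scale=1)
        sigma = (1.0 / rng.gammavariate(3.0, 1.0)) ** 0.5
        group1 = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        group2 = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        data.append((group1, group2))
    return data

data = simulate_null()
# InvGamma(3, 1) has mean b/(a-1) = 0.5, so the average per-variable
# sample variance should be close to 0.5.
avg_var = statistics.mean(
    statistics.variance(g1 + g2) for g1, g2 in data
)
print(round(avg_var, 2))
```

Because the null is true for every variable, any selections made by FEVS on such data are false positives, which is what allows the simulation to check that the test's levels are correct.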

  7. Alternative hypothesis: FEVS adapts to the data and does not necessarily pick one filtering method. [Figure: sensitivity vs. mean shift for FEVS and individual filtering methods; two panels, n1 = n2 = 5, k = 50,000 and n1 = n2 = 20, k = 50,000.] Settings: MPT, u = 0, 95% confidence, IQR, fixed-%, 300 DE genes with mean shifts from 0.6 to 3.5 (10 DE each); g1i ~ iid N(μi, σi), g2i ~ iid N(0, σi), σi2 ~ InvGamma(a=3, b=1), μi = 0 for i > 300.

  8. ER (Ligand-Binding Assay): 34 ER-/65 ER+; GRADE: 54 Grade1-2/45 Grade3. 7650 clones (6878 unique)

  9. Results: real data example. FRS = IQR, α = 0.05, FEVS stop at 100. Most "FEVS-exclusive genes" were found to be associated with grade or ER status in other microarray data sets.

  10. Conclusions • It does not seem feasible to identify a universally optimal filtering method • FEVS avoids the arbitrariness of choosing a single pre-specified filtering method • Any type of filtering method can be embedded in FEVS (or in multivariate permutation procedures), including class-dependent FRS; these did not show a clear benefit over class-independent methods, and we showed analytically that they are expected to be less powerful than class-independent filtering • Naïve filtering methods (longest list, wrong class-dependent filtering) give tests with the wrong levels • Choice of FRS: IQR, variance, and 95th−5th percentile gave similar results • FEVS works better for small sample sizes

  11. Perspectives • FEVS can be extended to control the proportion of false positives • The loss of power observed when filtering methods that retained few variables were included in FEVS could be due to the conservatism of the A* procedure, and might be overcome • A fast implementation of the methods made publicly available • More statistical methods are going to be included

  12. Acknowledgments NIH Biowulf/LoBoS3 cluster

  13. Results: null hypothesis. The levels of the FEVS are OK under the null. [Figure: observed vs. nominal levels for FEVS.] Settings: 50,000 variables for two groups of 5 samples, independent, g2i ~ iid N(0, σi), σi2 ~ InvGamma(a=3, b=1), FRS = IQR, stop at 100, α = 0.05.

  14. Results: "extreme" example. n1 = n2 = 5, k = 50,000, independent variables, 100 DE (50 + 50): 50 with small mean differences and small variances, g1i ~ iid N(1, 0.1) for 1 ≤ i ≤ 50; 50 with big mean differences and big variances, g1i ~ iid N(5.5, 0.8) for 51 ≤ i ≤ 100; g1i ~ iid N(0, σi) for i > 100; g2i ~ iid N(0, σi) for i = 1, …, 50,000; σi2 ~ InvGamma(a=3, b=1); u = 0, FRS = IQR, α = 0.05. FEVS is the union of the results from various filtering methods; it does not necessarily pick one.

  15. Results: external "validation". Number of microarray data sets that included grading (out of 15) in which the "FEVS-exclusive genes" (found by FEVS but not without filtering) were found to be associated with grading.
