The study presents "Filtering Enhanced Variable Selection" (FEVS), a class comparison method with filtering for high-dimensional data sets, aiming to identify informative variables while reducing multiple comparison issues. The method removes non-informative variables before analysis, offering potential gains in sensitivity. The research explores the criteria defining filtering methods, variable selection, and statistical controls. FEVS adapts to the data, avoiding arbitrary choices among filtering methods, and the study provides insights into optimizing variable selection strategies for small sample sizes.
A Class Comparison Method with Filtering Enhanced Variable Selection for High-Dimensional Data Sets Lara Lusa* Edward L. Korn** Lisa M. McShane** *Institute of Biomedical Informatics, University of Ljubljana, Slovenia **Biometric Research Branch, National Cancer Institute, NIH, Bethesda, MD, USA
Filtering Enhanced Variable Selection (FEVS): A Class Comparison Method (Multiple Testing Strategy) for High-Dimensional Data Sets
Many variables are measured for each sample (gene expression, proteomics, …), so thousands of hypotheses are tested simultaneously. The goal is to identify variables whose expression is associated with a response or covariate:
• different in normal and tumor tissue
• different after a treatment
• associated with survival
• …
Filtering, the removal of non-informative variables before data analysis, can reduce the multiple comparison problem. It is widely applied, but the choice of filter is arbitrary:
• beneficial? it can gain sensitivity
• disadvantageous? it can filter out truly differentially expressed variables
What defines a filtering method?
• Which variables to filter out? The filtering ranking-statistic (FRS):
 • class independent: variance, range, interquartile range (IQR), 95th-5th percentile
 • class dependent: fold difference between two classes
• How many variables to filter out? The stringency:
 • a number of variables: the K variables with the largest FRS
 • a percentage of variables: the P% of variables with the largest FRS
 • variables with a certain FRS: FRS > constant
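As a concrete illustration of a class-independent filter at one fixed stringency (retain the K variables with the largest FRS), here is a minimal sketch using the IQR as FRS; the function name is hypothetical:

```python
import numpy as np

def filter_variables(X, k_keep):
    """Class-independent filtering: retain the k_keep variables with the
    largest interquartile range (IQR) computed across all samples.
    X: (n_variables, n_samples) data matrix."""
    iqr = np.percentile(X, 75, axis=1) - np.percentile(X, 25, axis=1)
    order = np.argsort(iqr)[::-1]   # variables sorted by FRS, largest first
    return np.sort(order[:k_keep])  # indices of the retained variables
```

Swapping the `iqr` line for a variance or range computation gives the other class-independent FRS choices; the stringency is controlled by `k_keep`.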
Filtering Enhanced Variable Selection is based on the multivariate permutation testing method…
• … that controls, with specified confidence 1 − α, either
 • the actual number of false positives (u), or
 • (approximately) the actual proportion of false positives
• … extended to allow statistics that are not a function of the data for a single variable only:
 Wi = g({Xi1, …, Xin}, {Yi1, …, Yim}, {X1, …, Xn, Y1, …, Ym})
 instead of Wi = g({Xi1, …, Xin}, {Yi1, …, Yim})
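The flavor of such a permutation procedure can be conveyed by a simplified single-step max-|t| sketch for u = 0 false positives; the paper uses the step-down A* procedure of Korn et al., which this only approximates, and all names here are hypothetical:

```python
import numpy as np

def mpt_select(X, Y, n_perm=999, alpha=0.05, seed=None):
    """Single-step multivariate permutation test (sketch): select the
    variables whose observed |t| exceeds the (1 - alpha) quantile of the
    permutation distribution of max_i |t_i|, aiming at no false positives
    with confidence 1 - alpha.
    X, Y: (n_variables, n_samples) matrices for the two classes."""
    rng = np.random.default_rng(seed)

    def abs_t(A, B):
        # Pooled-variance two-sample t-statistic, one per variable.
        na, nb = A.shape[1], B.shape[1]
        s2 = ((na - 1) * A.var(axis=1, ddof=1) +
              (nb - 1) * B.var(axis=1, ddof=1)) / (na + nb - 2)
        se = np.sqrt(s2 * (1.0 / na + 1.0 / nb)) + 1e-12
        return np.abs(A.mean(axis=1) - B.mean(axis=1)) / se

    observed = abs_t(X, Y)
    pooled = np.hstack([X, Y])
    nx = X.shape[1]
    max_t = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(pooled.shape[1])  # shuffle the class labels
        max_t[b] = abs_t(pooled[:, idx[:nx]], pooled[:, idx[nx:]]).max()
    cutoff = np.quantile(max_t, 1.0 - alpha)
    return np.flatnonzero(observed > cutoff)
```

Because the class labels are permuted jointly across all variables, the procedure respects the correlation structure of the data, which is what makes the multivariate permutation approach attractive for gene-expression matrices.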
The choice of Wi: the minimum Bonferroni-like adjusted P-value ranking statistic mi
• Consider S = k filtering methods: the same filtering ranking-statistic (FRS) at all possible degrees of stringency (an exhaustive and computationally convenient choice)
• The rank of the i-th variable according to the FRS determines which filtered lists retain it (smaller rank = larger variability)
• Use mi instead of pi in the original and permuted data sets
• FEVS: by using mi, it combines the results obtainable by applying many different filtering methods
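A minimal sketch of the combined statistic, under the assumption that the Bonferroni-like multiplier is the size of the retained list and that the lists are nested by FRS rank (so the minimum over stringencies is attained at the smallest list containing the variable); the function and variable names are hypothetical:

```python
import numpy as np

def fevs_statistic(p, frs_rank):
    """Sketch of m_i, the minimum Bonferroni-like adjusted P-value over
    all stringencies of a single FRS.
    p[i]:        unadjusted p-value of variable i.
    frs_rank[i]: 0-based rank of variable i by the FRS
                 (rank 0 = largest variability, kept by every filter).
    For the filter retaining the top-k variables, the adjusted p-value of
    a retained variable is p[i] * k; since p[i] * k grows with k, the
    minimum over stringencies retaining i is at k = frs_rank[i] + 1."""
    p = np.asarray(p, dtype=float)
    k_min = np.asarray(frs_rank) + 1
    return p * k_min
```

Smaller m_i is more significant, so high-variability variables (small FRS rank) pay a smaller multiplicity penalty, which is exactly how the combined statistic folds many filtering methods into one test.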
Results: simulations and real data
Settings used:
• a conservative algorithm (the A* of Korn et al.) to reduce the computational burden
• IQR to rank the variables
• "stop at 100": excluding filtering methods that retain fewer than 100 variables
• α = 0.05
• 99 permutations in the simulations, repeated 10,000 times
Results: alternative hypothesis. FEVS adapts to the data and does not necessarily pick one filtering method.
[Figure: sensitivity vs. mean shift, for n1 = n2 = 5 and n1 = n2 = 20, k = 50,000]
MPT with u = 0, 95% confidence, IQR, fixed-%; 300 DE genes with mean shifts from 0.6 to 3.5 (10 DE genes per shift); g1i ~iid N(μi, σi), g2i ~iid N(0, σi), σi² ~ InvGamma(a=3, b=1), μi = 0 for i > 300.
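Sensitivity here is the fraction of truly differentially expressed variables that a procedure selects; a trivial helper (name hypothetical) makes the definition explicit:

```python
def sensitivity(selected, true_de):
    """Fraction of the truly differentially expressed variables
    that the selection procedure identified."""
    true_de = set(true_de)
    return len(true_de & set(selected)) / len(true_de)
```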
Real data: ER (ligand-binding assay): 34 ER-/65 ER+; grade: 54 Grade 1-2/45 Grade 3; 7,650 clones (6,878 unique).
Results: real data example (FRS = IQR, α = 0.05, stop at 100)
Most "FEVS-exclusive genes" were found to be associated with grade or ER status in other microarray data sets.
Conclusions
• It does not seem feasible to identify a universally optimal filtering method
• FEVS avoids the arbitrariness of choosing a single pre-specified filtering method
• Any type of filtering method can be embedded in FEVS (or in multivariate permutation procedures), including class-dependent FRS
 • class-dependent filtering did not show a clear benefit over class-independent methods; we showed analytically that it is expected to be less powerful than class-independent filtering
• Naïve filtering methods (longest list, wrong class-dependent filtering) produce tests with incorrect levels
• Choice of FRS: IQR, variance, and 95th-5th percentile gave similar results
• FEVS works better for small sample sizes
Perspectives
• FEVS can be extended to control the proportion of false positives
• The loss of power observed when filtering methods that retained few variables were included in FEVS could be due to the conservatism of the A* procedure and might be overcome
• A fast implementation of the methods will be made publicly available
• More statistical methods are going to be included
Acknowledgments NIH Biowulf/LoBoS3 cluster
Results: null hypothesis
The levels of the FEVS tests are correct under the null.
[Figure: observed levels of FEVS]
50,000 independent variables for two groups of 5 samples, g2i ~iid N(0, σi), σi² ~ InvGamma(a=3, b=1), FRS = IQR, stop at 100, α = 0.05.
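One null data set with the parameters above (independent variables, σi² ~ InvGamma(a=3, b=1), both groups N(0, σi)) can be generated as follows; the function name is hypothetical:

```python
import numpy as np

def simulate_null(k=50_000, n1=5, n2=5, a=3.0, b=1.0, seed=None):
    """Generate one null data set: k independent variables with
    sigma_i^2 ~ InvGamma(a, b), and both groups drawn from N(0, sigma_i).
    Returns two (k, n) matrices, one per group."""
    rng = np.random.default_rng(seed)
    # An InvGamma(a, b) draw is the reciprocal of a Gamma(shape=a, rate=b)
    # draw; NumPy's gamma is parameterised by scale = 1/rate.
    sigma = np.sqrt(1.0 / rng.gamma(shape=a, scale=1.0 / b, size=k))
    g1 = rng.normal(size=(k, n1)) * sigma[:, None]
    g2 = rng.normal(size=(k, n2)) * sigma[:, None]
    return g1, g2
```

Repeating this 10,000 times and running FEVS on each replicate gives the observed levels reported above.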
Results: "extreme" example
n1 = n2 = 5, k = 50,000 independent variables, 100 DE (50 + 50):
• 50 with small mean differences and small variances: g1i ~iid N(1, 0.1), 1 ≤ i ≤ 50
• 50 with big mean differences and big variances: g1i ~iid N(5.5, 0.8), 51 ≤ i ≤ 100
• g1i ~iid N(0, σi) for i > 100; g2i ~iid N(0, σi) for i = 1, …, 50,000; σi² ~ InvGamma(a=3, b=1); u = 0, FRS = IQR, α = 0.05
FEVS is the union of the results from the various filtering methods; it does not necessarily pick one.
Results: external "validation"
Number of microarray data sets that included grading (out of 15) in which the "FEVS-exclusive genes" (found by FEVS but not without filtering) were found to be associated with grading.