
Filtering Enhanced Variable Selection: A Method for High-Dimensional Data Sets

The study presents "Filtering Enhanced Variable Selection" (FEVS), a class comparison method for high-dimensional data sets that incorporates filtering in order to identify informative variables while reducing the multiple comparison problem. Non-informative variables are removed before analysis, which can improve sensitivity. The work examines the criteria that define a filtering method, the interplay between filtering and variable selection, and the associated statistical controls. FEVS adapts to the data and avoids the arbitrary choice of a single pre-specified filtering method, offering guidance on variable selection strategies for small sample sizes.


Presentation Transcript


  1. A Class Comparison Method with Filtering Enhanced Variable Selection for High-Dimensional Data Sets Lara Lusa* Edward L. Korn** Lisa M. McShane** *Institute of Biomedical Informatics, University of Ljubljana, Slovenia **Biometric Research Branch, National Cancer Institute, NIH, Bethesda, MD, USA

  2. Filtering Enhanced Variable Selection (FEVS): A Class Comparison Method (Multiple Testing Strategy) for High-Dimensional Data Sets. Many variables are measured for each sample (gene expression, proteomics, …), so thousands of hypotheses are tested simultaneously. The aim is to identify variables whose expression is associated with a response or covariate: • different in normal and tumor tissue • different after a treatment • associated with survival • … Filtering removes non-informative variables before data analysis and can reduce the multiple comparison problem. It is widely applied but arbitrary: it can be beneficial (gain in sensitivity) or disadvantageous (truly differentially expressed variables may be filtered out).

  3. What defines a filtering method? Which variables to filter out, and how many (stringency)? The filtering ranking-statistic (FRS) can be class independent (variance; range; interquartile range; 95th−5th percentile) or class dependent (fold difference between two classes). Stringency can be set as • a number of variables (the K variables with largest FRS) • a percentage of variables (the P% of variables with largest FRS) • a threshold (variables with FRS > constant).
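The class-independent FRS options on this slide can be sketched in a few lines. This is an illustrative toy, not the authors' implementation; the variable names and toy data are our own.

```python
# Class-independent filtering ranking-statistics (FRS) and a fixed-K
# stringency rule, as described on the slide. Pure-Python illustration.
import statistics

def iqr(values):
    """Interquartile range: 75th minus 25th percentile."""
    qs = statistics.quantiles(values, n=4)   # returns [Q1, Q2, Q3]
    return qs[2] - qs[0]

def frs_variance(values):
    return statistics.variance(values)

def frs_range(values):
    return max(values) - min(values)

def filter_top_k(data, frs, k):
    """Keep the k variables with the largest FRS (class-independent)."""
    ranked = sorted(data, key=lambda var: frs(data[var]), reverse=True)
    return ranked[:k]

# Toy data: 4 variables measured on 6 samples.
data = {
    "gene_A": [0.1, 0.2, 0.1, 0.2, 0.1, 0.2],   # low variability
    "gene_B": [1.0, 5.0, 2.0, 6.0, 1.5, 5.5],   # high variability
    "gene_C": [3.0, 3.1, 3.0, 3.1, 3.0, 3.1],   # low variability
    "gene_D": [0.0, 4.0, 0.5, 4.5, 0.2, 4.2],   # high variability
}
kept = filter_top_k(data, iqr, k=2)
print(kept)
```

With any of the three statistics, the two highly variable genes survive a top-2 filter; the choice of FRS and of K is exactly the arbitrariness FEVS is designed to avoid.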

  4. Filtering Enhanced Variable Selection. Based on the multivariate permutation testing method, which controls, with specified confidence (1−α), • the actual number of false positives (u) • (approximately) the actual proportion of false positives. The method is extended to allow statistics that are not a function of the data for a single variable only: • Wi = g( {Xi1, …, Xin}, {Yi1, …, Yim}, {X1, …, Xn, Y1, …, Ym} ) • instead of Wi = g( {Xi1, …, Xin}, {Yi1, …, Yim} ).
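The permutation machinery underlying the slide can be illustrated with a minimal two-sample permutation p-value: class labels are shuffled and the statistic recomputed under each relabelling. This is only a sketch of the general idea, not the authors' multivariate procedure, and the statistic and names are illustrative assumptions.

```python
# Minimal permutation sketch. In the extended method the statistic W_i for
# variable i may depend on the full data matrix (e.g. through a filtering
# rank), not only on variable i's own values; here we use a simple
# single-variable statistic for illustration.
import random

def t_like(x, y):
    """Absolute difference of group means: a simple two-sample statistic."""
    return abs(sum(x) / len(x) - sum(y) / len(y))

def permutation_pvalue(x, y, n_perm=999, seed=0):
    """Two-sample permutation p-value: shuffle class labels, recompute."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    observed = t_like(x, y)
    hits = 1                      # the observed labelling counts as one
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if t_like(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return hits / (n_perm + 1)

# Clearly separated groups give a small p-value; identical groups give 1.
p = permutation_pvalue([5.1, 4.9, 5.3, 5.0], [0.1, -0.2, 0.0, 0.1])
print(round(p, 3))
```

The multivariate version of this idea controls, over all variables jointly, the number (or proportion) of false positives with the stated confidence.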

  5. The choice of Wi: the minimum Bonferroni-like adjusted P-value over a set of filtering methods. Use mi instead of pi in the original and permuted data sets. An exhaustive and computationally convenient choice: S = k filtering methods, all with the same filtering ranking-statistic but with all possible degrees of stringency; the rank of the i-th variable according to the FRS (smaller rank = larger variability) determines whether it is retained. By using mi, FEVS combines the results obtainable by applying many different filtering methods.
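One way to read the slide is: for each stringency k (retain the k variables with largest FRS), a retained variable gets a Bonferroni-like adjusted p-value pi × k, and mi is the minimum over all stringencies considered. The sketch below follows that reading; the authors' exact formula may differ in detail, and the toy numbers are our own.

```python
# Hedged sketch of the m_i statistic: minimum Bonferroni-like adjusted
# P-value of variable i over a set of filtering stringencies.
def min_bonferroni(p_values, frs_ranks, stringencies):
    """m_i = min over k of (p_i * k), over stringencies k at which
    variable i survives the filter (its FRS rank r_i <= k)."""
    m = []
    for p, r in zip(p_values, frs_ranks):
        adj = [p * k for k in stringencies if r <= k]   # retained at size k
        m.append(min(adj) if adj else 1.0)              # never retained -> 1
    return m

# Toy example: 4 variables; rank 1 = largest variability.
p_values  = [0.001, 0.002, 0.04, 0.03]
frs_ranks = [1, 3, 2, 4]
m = min_bonferroni(p_values, frs_ranks, stringencies=[2, 4])
print(m)
```

A highly variable variable with a small p-value benefits from the smaller multiplier of a stringent filter, while a variable surviving only the loosest filter pays the full Bonferroni price; using mi in both the original and permuted data sets preserves the permutation-based error control.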

  6. Results: simulations and real data. Settings: • a conservative algorithm (A* of Korn et al.) to reduce the computational burden • IQR to rank the variables • "stop at 100" (excluding filtering methods that retain fewer than 100 variables) • α = 0.05 • 99 permutations in the simulations, repeated 10,000 times. Null hypothesis simulation: 50,000 variables for two groups of 5 samples, independent, g2i ~ iid N(0, σi), σi2 ~ InvGamma(a=3, b=1), FRS = IQR, stop at 100, α = 0.05. The levels of the FEVS are OK.
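The null-data generator in the simulation settings can be sketched directly: draw a per-variable variance from an inverse-gamma distribution, then draw both groups from a zero-mean normal with that variance. Scaled down to 1,000 variables (rather than 50,000) and written in pure Python for illustration; function and parameter names are ours.

```python
# Sketch of the simulation's null-data generator: sigma_i^2 ~ InvGamma(a=3,
# b=1), then every observation of variable i ~ N(0, sigma_i).
import random
import statistics

def simulate_null(n_vars=1000, n_per_group=5, seed=1):
    rng = random.Random(seed)
    data = []
    for _ in range(n_vars):
        # sigma_i^2 ~ InvGamma(a=3, b=1): reciprocal of Gamma(shape=3, scale=1)
        sigma = (1.0 / rng.gammavariate(3.0, 1.0)) ** 0.5
        group1 = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        group2 = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        data.append((group1, group2))
    return data

data = simulate_null()
# InvGamma(3, 1) has mean b/(a-1) = 0.5, so the average per-variable
# sample variance should be close to 0.5.
avg_var = statistics.mean(
    statistics.variance(g1 + g2) for g1, g2 in data
)
print(round(avg_var, 2))
```

Because the null is true for every variable, any selections made by FEVS on such data are false positives, which is what allows the simulation to check that the test's levels are correct.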

  7. Alternative hypothesis: FEVS adapts to the data and does not necessarily pick one filtering method. [Figure: sensitivity vs. mean shift for FEVS and individual filtering methods; two panels, n1 = n2 = 5, k = 50,000 and n1 = n2 = 20, k = 50,000.] Settings: MPT, u = 0, 95% confidence, IQR, fixed-%, 300 DE genes with mean shifts from 0.6 to 3.5 (10 DE each); g1i ~ iid N(μi, σi), g2i ~ iid N(0, σi), σi2 ~ InvGamma(a=3, b=1), μi = 0 for i > 300.

  8. ER (Ligand-Binding Assay): 34 ER-/65 ER+; GRADE: 54 Grade1-2/45 Grade3. 7650 clones (6878 unique)

  9. Results: real data example. FRS = IQR, α = 0.05, FEVS stop at 100. Most "FEVS-exclusive genes" were found to be associated with grade or ER status in other microarray data sets.

  10. Conclusions • It does not seem feasible to identify a universally optimal filtering method • FEVS avoids the arbitrariness of choosing a single pre-specified filtering method • Any type of filtering method can be embedded in FEVS (or in multivariate permutation procedures), including class-dependent FRS; these did not show a clear benefit over class-independent methods, and we showed analytically that they are expected to be less powerful than class-independent filtering • Naïve filtering methods (longest list, wrong class-dependent filtering) give tests with the wrong levels • Choice of FRS: IQR, variance, and 95th−5th percentile gave similar results • FEVS works better for small sample sizes

  11. Perspectives • FEVS can be extended to control the proportion of false positives • The loss of power observed when filtering methods that retained few variables were included in FEVS could be due to the conservatism of the A* procedure, and might be overcome • A fast implementation of the methods made publicly available • More statistical methods are going to be included

  12. Acknowledgments NIH Biowulf/LoBoS3 cluster

  13. Results: null hypothesis. The levels of the FEVS are OK under the null. [Figure: observed vs. nominal levels for FEVS.] Settings: 50,000 variables for two groups of 5 samples, independent, g2i ~ iid N(0, σi), σi2 ~ InvGamma(a=3, b=1), FRS = IQR, stop at 100, α = 0.05.

  14. Results: "extreme" example. n1 = n2 = 5, k = 50,000, independent variables, 100 DE (50 + 50): 50 with small mean differences and small variances, g1i ~ iid N(1, 0.1) for 1 ≤ i ≤ 50; 50 with big mean differences and big variances, g1i ~ iid N(5.5, 0.8) for 51 ≤ i ≤ 100; g1i ~ iid N(0, σi) for i > 100; g2i ~ iid N(0, σi) for i = 1, …, 50,000; σi2 ~ InvGamma(a=3, b=1); u = 0, FRS = IQR, α = 0.05. FEVS is the union of the results from various filtering methods; it does not necessarily pick one.

  15. Results: external "validation". Number of microarray data sets that included grading (out of 15) in which the "FEVS-exclusive genes" (found by FEVS but not without filtering) were found to be associated with grading.
