Université Libre de Bruxelles, DEA Bioinformatique
Feature Selection Stability Analysis for Classification Using Microarray Data
Panagiotis Moulos
Outline
• Introduction
  • Motivation
  • Stability Measure Approach
  • The bias/variance tradeoff
  • Contributions
• Materials and Methods
  • Stability Metrics
  • Example (Hamming Distance)
  • Experimental Analysis
• Results
  • Visualizing Instability
  • Stability Results
  • Accuracy Results
  • Remarks
  • Feature Aggregation
• Discussion
  • General Remarks
  • Future Work
Motivation
• Microarrays are invaluable tools for cancer studies at the molecular level: prognosis, early diagnosis
• Microarray data analysis
  • Supervised/unsupervised learning for tumor classification
  • Feature selection techniques for important gene identification → a cancer genetic signature, used for prognosis and (early) diagnosis
• However, these signatures are sensitive to perturbations: a small perturbation (e.g. removing one sample) may lead to a completely different signature
  • Full gene ranking list (1,3,4,5,2) → signature (1,3,4); full gene ranking list (2,5,3,4,1) → signature (2,5,3). But what is the similarity between (2,5,3) and (1,3,4)?
Stability Measure Approach
• The problem of similarity between two gene lists can be approached mathematically through the theory of permutations
• Given a set Gn = (g1, g2, …, gn) of objects, a permutation π is a bijective function from Gn to Gn
• In the context of microarray data:
  • The n genes (features) involved are labeled with a unique number between 1, …, n
  • Every full gene ranking list is exactly a permutation π on the set {1, …, n}, where the image π(i) of the ith gene is its rank inside π
  • If we are interested only in the top N ranked genes of Gn, we define π* as the partial ranking list of Gn, which contains the first N elements of π
Stability Measure Approach (Example)
• A full ranking list: G5 = (1,2,3,4,5)
• A permutation: π = (3,2,5,4,1), where π(1) = 3, π(2) = 2, π(3) = 5, π(4) = 4, π(5) = 1
• A partial ranking list with the top N = 3 ranked genes: π* = (3,2,5), where π*(1) = 3, π*(2) = 2, π*(3) = 5
• How can we summarize the variability between
  • full ranking lists π and σ?
  • partial ranking lists π* and σ*?
• Several metrics have been proposed in the statistical literature (e.g. Critchlow, 1985)
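The example above translates directly into code. A minimal sketch, following the slide's convention that position i of the list holds π(i) and that π* is simply the first N elements of π:

```python
# Full ranking list as a permutation of the gene labels {1, ..., 5}:
# the element at index i-1 is pi(i), the rank image of the ith gene.
pi = [3, 2, 5, 4, 1]

# Partial ranking list: the first N = 3 elements of pi.
N = 3
pi_star = pi[:N]
print(pi_star)  # [3, 2, 5]
```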
The bias/variance tradeoff
• A central issue in choosing a model for a given problem is selecting the level of structural complexity (number of variables/parameters, etc.) that best suits the data it must accommodate
• Variance contribution: too many parameters → inclusion of noise → overfitting
• Bias contribution: too few parameters → not enough flexibility → misfit
• Deciding on the correct amount of flexibility in a model is therefore a tradeoff between these two sources of misfit. This is called the bias/variance tradeoff
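The tradeoff can be seen in a toy example (illustrative numbers, not from the talk): fit four noisy samples of a roughly linear trend with a 2-parameter line (biased, smooth) and with a 4-parameter interpolating cubic (flexible, zero training error):

```python
# Toy data: approximately y = x + 1 plus noise.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 2.3, 2.7, 4.1]

def linear_fit(xs, ys):
    """Least-squares line (2 parameters): returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def lagrange(xs, ys, x):
    """Degree-3 interpolating polynomial (4 parameters): passes
    exactly through every training point."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

slope, intercept = linear_fit(train_x, train_y)
mse_linear = sum((slope * x + intercept - y) ** 2
                 for x, y in zip(train_x, train_y)) / len(train_x)
mse_interp = sum((lagrange(train_x, train_y, x) - y) ** 2
                 for x, y in zip(train_x, train_y)) / len(train_x)

# The 4-parameter model drives the training error to zero, but it has
# absorbed the noise: at a new point x = 4 it extrapolates far from the
# underlying trend y = x + 1, while the biased 2-parameter line stays
# close to it.
print(mse_linear, mse_interp)
print(slope * 4 + intercept, lagrange(train_x, train_y, 4.0))
```

The flexible model wins on the training sample (variance side) and loses on fresh data; the rigid model misfits the sample slightly (bias side) but generalizes better.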
Contributions
• Experimental study of signature stability in gene expression datasets by resampling (bootstrap, jackknife) the datasets for different ranking/feature selection methods
• Study of several forms of feature selection stability using statistical similarity measures
• Assessment of classification performance for each feature selection and classification method
• Study of the possible correlation between feature selection stability and classification accuracy for all feature selection and classification methods
• Proposal of a feature aggregation procedure to obtain more stable probabilistic gene signatures
Stability Metrics
• Measures of feature selection stability
• Stability of selection: the stability of the appearance of certain features after resampling the original dataset
  • Hamming distance
  • Inconsistency
• Stability of ranking: the stability of both the appearance and the ranking order of certain features after resampling the original dataset
  • Spearman's footrule
  • Kendall's tau
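The two ranking-stability metrics can be sketched as follows, assuming both lists are full rankings over the same genes, stored as rank vectors (position i holds gene i's rank):

```python
from itertools import combinations

def spearman_footrule(pi, sigma):
    """Spearman's footrule: sum of absolute rank differences."""
    return sum(abs(p - s) for p, s in zip(pi, sigma))

def kendall_tau_distance(pi, sigma):
    """Kendall's tau distance: number of gene pairs that the two
    rankings order in opposite ways (discordant pairs)."""
    return sum(
        1
        for i, j in combinations(range(len(pi)), 2)
        if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
    )

print(spearman_footrule([1, 2, 3], [3, 2, 1]))     # 4
print(kendall_tau_distance([1, 2, 3], [3, 2, 1]))  # 3
```

Both distances are 0 for identical rankings and grow as the two lists disagree, which is what makes them usable as (in)stability scores across resamplings.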
Example (Hamming Distance)
• Example of a stability metric: the Hamming distance between the signatures obtained from the m resamplings (resampling 1, …, resampling m) of the original dataset
• Calculation of the Hamming distance for m = 5 resamplings, n = 10 genes, N = 5 selected genes
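A minimal sketch of this computation with toy numbers, encoding each signature as a binary selection vector over the n genes and averaging the Hamming distance over all pairs of resamplings (the exact normalization used in the thesis may differ):

```python
from itertools import combinations

def selection_mask(signature, n):
    """Binary vector over gene labels 1..n: 1 if the gene was selected."""
    return [1 if g in signature else 0 for g in range(1, n + 1)]

def hamming(a, b):
    """Number of positions where two selection vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def mean_pairwise_hamming(signatures, n):
    """Average Hamming distance over all pairs of resampled signatures."""
    masks = [selection_mask(s, n) for s in signatures]
    pairs = list(combinations(masks, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)

# Toy case: m = 3 resamplings, n = 5 genes, N = 3 selected genes.
sigs = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3}]
print(mean_pairwise_hamming(sigs, 5))  # (2 + 0 + 2) / 3
```

A smaller average distance means the same genes keep reappearing across resamplings, i.e. higher stability of selection.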
Experimental Analysis
• Datasets: breast cancer (HBC, Tamoxifen), leukemia (MLL, Golub), lymphoma
• Classification algorithms: k-NN (k = 5), Support Vector Machines
• Feature selection algorithms
  • Filters: Gram-Schmidt orthogonalization; k-NN and SVM correlation-based filters (genes ranked according to the misclassification error of a classifier trained on one gene)
  • Wrapper: sequential forward selection wrapper
• Feature aggregation (main personal contribution)
  • Gather together all the different signatures
  • Remove duplicates
  • Exclude features whose selection frequency falls below a threshold
• Resampling strategies
  • Bootstrap (at each step, resample the patients of a dataset with replacement)
  • Jackknife (at each step, remove 1-5% of the samples)
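The two resampling strategies and the aggregation step can be sketched as follows (function names and the frequency threshold are illustrative, not taken from the thesis):

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Bootstrap: draw n_samples patient indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def jackknife_indices(n_samples, fraction, rng):
    """Jackknife: drop a small random fraction (e.g. 1-5%) of samples."""
    n_drop = max(1, round(fraction * n_samples))
    dropped = set(rng.sample(range(n_samples), n_drop))
    return [i for i in range(n_samples) if i not in dropped]

def aggregate_signatures(signatures, threshold):
    """Feature aggregation: pool all signatures, drop duplicates within
    each, and keep only genes whose selection frequency across the
    signatures reaches the threshold."""
    counts = Counter(g for sig in signatures for g in set(sig))
    m = len(signatures)
    return {g for g, c in counts.items() if c / m >= threshold}

rng = random.Random(0)
boot = bootstrap_indices(5, rng)  # 5 indices, possibly with repeats
print(sorted(aggregate_signatures([[1, 2], [1, 3], [1, 2]], 0.5)))  # [1, 2]
```

Gene 1 appears in 3/3 signatures and gene 2 in 2/3, so both clear the 0.5 threshold; gene 3 (1/3) is excluded. The selection frequencies themselves are what makes the aggregated signature "probabilistic".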
Visualizing Instability
Stability Results (1)
• Stability of Selection (Bootstrap)
• Stability of Ranking (Bootstrap)
Stability Results (2)
• Stability of Selection (Jackknife)
• Stability of Ranking (Jackknife)
Accuracy Results
• Baseline: no filtering or wrapping
Remarks
• Stability
  • Stability is inversely proportional to the size of the perturbation
  • Gram-Schmidt orthogonalization outperforms the classifier-based correlation filters
  • Filters are more stable than the wrapper
  • Stability of selection and stability of ranking are correlated
• Accuracy
  • Accuracy is proportional to the size of the perturbation
  • Gram-Schmidt orthogonalization is outperformed by the classifier-based correlation filters
  • Filters outperform the wrapper
  • Performance improves after the application of feature selection techniques
Feature Aggregation
• A class permutation test shows no overfitting
• A t-test between mean accuracies before and after aggregation reveals an improvement in the performance of the wrapper, but not of the filters
General Remarks
• Similarity metrics: which one to use depends on the kind of stability under study
• Filters are more stable and more accurate than wrappers: although wrappers return few variables, their selection procedure can be highly variable
• One would expect high stability to lead to high accuracy; however, this is not always the case. Why?
  • The best compromise between bias and variance depends on many parameters (feature selection algorithm, number N of top-ranked genes, etc.)
• Aggregation, seen through the bias/variance tradeoff:
  • Filters: high stability, i.e. lower variance in feature selection; the model is less flexible (higher bias), so aggregation does not improve accuracy
  • Wrapper: low stability, i.e. higher variance in feature selection; the model is more flexible (lower bias), so aggregation improves accuracy by adjusting the variance to reach a better compromise between bias and variance
Conclusions – Future Work
• Conclusions
  • We have shown that genetic signatures are sensitive to perturbations
  • Stability analysis using similarity metrics is necessary in order to evaluate signature sensitivity
  • The aggregation procedure creates a distribution of selected genes which can be used as a more stable probabilistic genetic signature for cancer microarray studies
  • It is better to use a more stable probabilistic signature consisting of more genes than a perturbation-sensitive signature consisting of fewer genes
• Future work
  • Study gene ranking using Markov chains (MC): the selection of a gene during the selection process could depend on the previously selected gene (1st-order MC) or on several previously selected genes (higher-order MC)
  • Compare the stability of forward selection and backward elimination wrappers
  • Further research on the relation between stability and accuracy: use of more algorithms; feature ranking based on the stability/accuracy ratio
  • Study the effect on genetic signatures of updating classification models with new data
  • Biological interpretation of the genes selected in probabilistic signatures
Acknowledgements
Many thanks to:
• Gianluca Bontempi (Machine Learning Group, ULB)
• Christos Sotiriou (Microarray Unit, IJB)
• Benjamin Haibe-Kains (PhD student, MLG ULB, IJB)
• Mrs Yiota Poirazi and the Computational Biology Group, FORTH (for this opportunity)