New EDA-approaches to feature selection for classification (of biological sequences) Yvan Saeys
Outline
• Feature selection in the data mining process
• Need for dimensionality reduction techniques in biology
• Feature selection techniques
• EDA-based wrapper approaches
  • Constrained EDA approach
  • EDA-ranking, EDA-weighting
• Application to biological sequence classification
• Why am I here?
Yvan Saeys, Donostia 2004
Feature selection in the data mining process
pre-processing → feature extraction → feature selection → model induction/classification → post-processing
Yvan Saeys, Donostia 2004
Need for dimensionality reduction techniques in biology
• Many biological processes are far from completely understood
• In order not to miss relevant information:
  • take into account as many features as possible
  • use dimension reduction techniques to identify the relevant feature subspaces
• Additional difficulty: many feature dependencies
Yvan Saeys, Donostia 2004
Dimension reduction techniques
• Feature selection
• Projection
• Feature ranking
• Feature weighting
• Compression
• …
Yvan Saeys, Donostia 2004
Benefits of feature selection
• Attain as good or even better classification performance using a small subset of features
• Provide more cost-effective classifiers:
  • fewer features to take into account → faster classifiers
  • fewer features to store → smaller datasets
• Gain more insight into the processes that generated the data
Yvan Saeys, Donostia 2004
Feature selection: another layer of complexity
• Bias-variance tradeoff of a classifier
• Model selection: find the best classifier with the best parameters for the best feature subset
• For every feature subset: model selection
• Extra dimension in the search process
Yvan Saeys, Donostia 2004
Feature selection strategies
• Filter approach
• Wrapper approach
• Embedded approach
• Feature selection based on signal processing techniques
(Diagram: filter: FS runs before the classification model; wrapper: an FS search method wraps around the classification model; embedded: the classification model's own parameters drive FS)
Yvan Saeys, Donostia 2004
Filter approach
• Independent of the classification model
• Uses only the dataset of annotated examples
• A relevance measure is calculated for each feature, e.g.:
  • feature-class entropy
  • Kullback-Leibler divergence (cross-entropy)
  • information gain, gain ratio
• Relevance scores are normalized into weights (see the sketch below)
• Fast, but discards feature dependencies
Yvan Saeys, Donostia 2004
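As a rough illustration (not from the slides), a minimal Python sketch of a filter scoring binary features by information gain and normalizing the scores into weights; all function names are illustrative:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a Bernoulli distribution with parameter p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(x, y):
    """Information gain of a binary feature x w.r.t. binary labels y:
    IG = H(Y) - sum_v P(x = v) * H(Y | x = v)."""
    gain = entropy(y.mean())
    for v in (0, 1):
        mask = x == v
        if mask.any():
            gain -= mask.mean() * entropy(y[mask].mean())
    return gain

def filter_weights(X, y):
    """Score every feature independently, then normalize scores into weights."""
    scores = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    return scores / scores.sum()
```

Because each feature is scored on its own, this runs in a single pass over the data, which is exactly why filters are fast but blind to feature dependencies.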
Wrapper approach
• Specific to a classification algorithm
• The search for a good feature subset is guided by a search algorithm (e.g. greedy forward or backward search)
• The evaluation of the classifier guides the search towards good feature subsets
• Examples: sequential forward or backward search, simulated annealing, stochastic iterative sampling (e.g. GA, EDA); a forward-search sketch follows below
• Computationally intensive, but able to take feature dependencies into account
Yvan Saeys, Donostia 2004
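A minimal sketch of one such wrapper, greedy sequential forward search around a cross-validated classifier. The scikit-learn classifier and the CV setup are illustrative stand-ins, not the configuration used in the talk:

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

def sequential_forward_search(X, y, max_features):
    """Greedy wrapper: at each step, add the feature whose inclusion gives
    the best cross-validated accuracy of the wrapped classifier."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        def cv_score(j):
            return cross_val_score(BernoulliNB(), X[:, selected + [j]], y, cv=5).mean()
        best = max(remaining, key=cv_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Every candidate step retrains and re-evaluates the classifier, which is where the computational cost of wrappers comes from.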
Embedded approach
• Specific to a classification algorithm
• Model parameters are directly used to discard features
• Examples:
  • reduced error pruning in decision trees
  • feature elimination using the weight vector of a linear discriminant function (sketched below)
• Usually needs only a few additional calculations
• Able to take feature dependencies into account
Yvan Saeys, Donostia 2004
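A sketch of the second example: RFE-style elimination driven by the weight vector of a linear discriminant. LinearSVC is an illustrative choice of linear model, not necessarily the one from the talk:

```python
import numpy as np
from sklearn.svm import LinearSVC

def weight_based_elimination(X, y, n_keep):
    """Embedded selection: fit a linear model, discard the feature with the
    smallest absolute weight, refit, and repeat until n_keep features remain."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        model = LinearSVC(dual=False).fit(X[:, active], y)
        worst = int(np.argmin(np.abs(model.coef_[0])))
        del active[worst]
    return active
```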
EDA-based wrapper approaches Yvan Saeys, Donostia 2004
EDA-based wrapper approaches
• Observations for (biological) datasets with many features:
  • many feature subsets result in the same classification performance
  • many features are irrelevant
  • the search process spends most of its time in subsets containing approximately half of the features
Yvan Saeys, Donostia 2004
EDA-based wrapper approaches
• Only a small fraction of the features is relevant
• Evaluating a classification model is faster when only a small number of features is present
• Constrained Estimation of Distribution Algorithm (CDA):
  • determine an upper bound U for the maximally allowed number of features in every individual (sample)
  • apply a filter to the generated (sampled) individuals: allow at most U features in each subset
  (a minimal sketch follows below)
Yvan Saeys, Donostia 2004
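A minimal sketch of how such a constrained, UMDA-style EDA could look. The population sizes, the probability clamping, the random switch-off used as the constraint "filter", and the `fitness` callback (e.g. cross-validated accuracy of the wrapped classifier on the selected features) are all illustrative assumptions, not the exact algorithm from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def constrained_eda(fitness, n_features, U, pop_size=100, n_select=50, n_gen=50):
    """UMDA-style EDA over feature-subset bitstrings, with the CDA
    constraint: every sampled individual keeps at most U features."""
    p = np.full(n_features, 0.5)                 # initial univariate distribution
    for _ in range(n_gen):
        pop = (rng.random((pop_size, n_features)) < p).astype(int)
        for ind in pop:                          # constraint filter: at most U ones
            ones = np.flatnonzero(ind)
            if len(ones) > U:
                drop = rng.choice(ones, size=len(ones) - U, replace=False)
                ind[drop] = 0                    # randomly switch off the excess
        scores = np.array([fitness(ind) for ind in pop])
        elite = pop[np.argsort(scores)[-n_select:]]
        p = np.clip(elite.mean(axis=0), 0.01, 0.99)  # re-estimate; avoid full convergence
    return p                                     # marginal probabilities per feature
```

The returned vector `p` can be read directly as feature weights, which is the idea developed on the ranking/weighting slides below.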
EDA-based wrapper approaches: CDA
• Advantages:
  • Huge reduction of the search space. Example, 400 features:
    • full search space: 2^400 ≈ 2.6 × 10^120 feature subsets
    • U = 100: ~3.3 × 10^96 feature subsets
    • a reduction by 23 orders of magnitude (checked in the snippet below)
  • Faster evaluation of a classification model
  • Scalable to datasets containing a very large number of features
  • Scalable to more complex classification models (e.g. an SVM with a higher-order polynomial kernel)
Yvan Saeys, Donostia 2004
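The arithmetic behind these numbers can be checked directly: restricting subsets to at most U = 100 of 400 features sums binomial coefficients instead of counting all bitstrings:

```python
from math import comb

full = 2 ** 400                                      # all subsets of 400 features
constrained = sum(comb(400, k) for k in range(101))  # subsets with at most U = 100
print(f"{full:.2e}")         # ≈ 2.58e+120
print(f"{constrained:.2e}")  # ≈ 3.3e+96
```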
CDA: example (#Ev = number of classifier evaluations; Balanced/Unbalanced = running time per dataset)

Classifier  Method  #Features  #Ev    Average #Features  Balanced  Unbalanced
NBM         SBE     150        68875  294.40             0 h 34 m  1 h 58 m
                    80         76960  275.98             0 h 36 m  2 h 09 m
                    40         79380  269.48             0 h 37 m  2 h 11 m
NBM         CDA     150        67100  150                0 h 20 m  0 h 46 m
                    80         67100  80                 0 h 09 m  0 h 21 m
                    40         67100  40                 0 h 05 m  0 h 11 m
LSVM        SBE     150        68875  294.40             2 h 15 m  2 h 38 m
                    80         76960  275.98             2 h 19 m  2 h 52 m
                    40         79380  269.48             2 h 20 m  2 h 54 m
LSVM        CDA     150        67100  150                0 h 38 m  0 h 59 m
                    80         67100  80                 0 h 17 m  0 h 27 m
                    40         67100  40                 0 h 14 m  0 h 19 m
PSVM        SBE     150        13875  296.26             9 h 11 m  62 h 02 m
                    80         15520  277.68             9 h 42 m  63 h 24 m
                    40         16020  271.03             9 h 48 m  63 h 40 m
PSVM        CDA     150        13510  150                4 h 54 m  16 h 48 m
                    80         13510  80                 2 h 48 m  9 h 38 m
                    40         13510  40                 1 h 52 m  6 h 16 m
Yvan Saeys, Donostia 2004
EDA-based feature ranking
• Traditional approach to FS: only use the best individual found during the search → the "optimal" feature subset
• Many questions remain unanswered:
  • the single best subset provides only a static view of the whole elimination process
  • how many features can still be eliminated before classification performance drops sharply?
  • which features can still be eliminated?
  • can we get a more dynamic analysis?
Yvan Saeys, Donostia 2004
Feature ranking Yvan Saeys, Donostia 2004
EDA-based feature ranking/weighting
• Don't use only the single best individual
• Use the whole distribution to assess feature weights
• Use the weights to rank the features (see the sketch below)
Yvan Saeys, Donostia 2004
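Building on the CDA sketch above, a minimal illustration of reading the final univariate distribution as feature weights and turning it into a ranking:

```python
import numpy as np

def eda_ranking(p):
    """Rank features by their marginal probability under the final EDA
    distribution: p[j] estimates how often feature j appears in good subsets."""
    order = np.argsort(p)[::-1]   # heaviest (most probable) feature first
    return order, p[order]

# Feature selection then reduces to thresholding the weights,
# e.g. keeping the U heaviest features.
```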
EDA-based feature weighting
• Can be used for:
  • feature weighting
  • feature ranking
  • feature selection
• Problem: how "convergent" should the final population be?
  • not enough convergence: no good feature subsets found yet (early stop)
  • too much convergence: in the limit, all individuals are the same
• Solution: convergent enough, but not too convergent
Yvan Saeys, Donostia 2004
How to quantify "enough but not too convergent"?
• Define the scaled Hamming distance between two individuals A and B as
  HDs(A,B) = HD(A,B) / N
  where HD is the Hamming distance and N is the number of features
• Convergence of a distribution: the average scaled Hamming distance between all pairs of individuals
(a minimal sketch follows below)
Yvan Saeys, Donostia 2004
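A direct transcription of this definition, assuming the population is stored as a 0/1 matrix with one individual per row:

```python
import numpy as np

def convergence(pop):
    """Average scaled Hamming distance HDs(A,B) = HD(A,B)/N over all pairs:
    ~0.5 for a random initial population, 0 when all individuals coincide."""
    n_ind, n_feat = pop.shape
    total, pairs = 0.0, 0
    for i in range(n_ind):
        for j in range(i + 1, n_ind):
            total += np.sum(pop[i] != pop[j]) / n_feat
            pairs += 1
    return total / pairs
```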
Application to gene prediction
(Figure: gene structure and expression. DNA with promoter region (enhancer, core promoter), transcription start site, exons and introns, start and stop codons → transcription → pre-mRNA → splicing (removal of introns) → mRNA with poly-A tail → translation → protein)
Yvan Saeys, Donostia 2004
Splice site prediction
(Figure: pre-mRNA splicing. Donor sites at exon→intron boundaries start with GT..; acceptor sites at intron→exon boundaries end in ..AG. Splicing removes introns I1-I3 and joins exons Ex1-Ex4, which are then translated into protein)
Yvan Saeys, Donostia 2004
Splice site prediction: features
• Position-dependent features
  • e.g. an A at position 1, a C at position 17, …
• Position-independent features
  • e.g. subsequence "TCG" occurs, "GAG" occurs, …
(Figure: example local context atcgatcagtatcgat GT ctgagctatgag with a position ruler 1, 2, 3, …, 17, …, 28; see the encoding sketch below)
Yvan Saeys, Donostia 2004
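A minimal sketch of both feature types for a nucleotide sequence; the encoding is illustrative, not the exact pipeline from the talk:

```python
from itertools import product

def encode(seq, k=3):
    """Position-dependent: one binary indicator per (position, nucleotide).
    Position-independent: one binary indicator per k-mer, set to 1 if the
    k-mer occurs anywhere in the sequence."""
    seq = seq.upper()
    pos_dep = {(i, b): int(seq[i] == b)
               for i in range(len(seq)) for b in "ACGT"}
    pos_indep = {"".join(m): int("".join(m) in seq)
                 for m in product("ACGT", repeat=k)}
    return pos_dep, pos_indep

# A local context of 100 nucleotides yields 100 x 4 = 400 binary
# position-dependent features, as in the acceptor dataset below.
```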
Acceptor prediction
• Dataset:
  • 3000 positives, 18,000 negatives
  • local context of 100 nucleotides [50,50]
  • 100 4-valued features → 400 binary features
• Classifiers:
  • Naïve Bayes method
  • C4.5
  • Linear SVM
Yvan Saeys, Donostia 2004
A trial on acceptor prediction
• 400 binary features (position-dependent nucleotides)
• Initial distribution: P(fi) = 0.5
  • C(D0) ≈ 0.5 (each pair of individuals has on average half of the features in common)
• Fully converged distribution: C(D) = 0 (all individuals are the same)
Yvan Saeys, Donostia 2004
Evolution of convergence Yvan Saeys, Donostia 2004
Evaluation of convergence rate Yvan Saeys, Donostia 2004
EDA-based feature ranking
• Best results obtained with a "semi-converged" population
• Not looking for the best subset anymore, but for the best distribution
• Advantages:
  • needs fewer iterations
  • gives a dynamic view of the feature selection process
Yvan Saeys, Donostia 2004
EDA-based feature weighting
• Color-coding feature weights to visualize new patterns:
  • a color-coded mapping of the interval [0,1]: cold (low) → middle → hot (high)
(a plotting sketch follows below)
Yvan Saeys, Donostia 2004
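A minimal plotting sketch, assuming the 400 weights are laid out as 4 nucleotides × 100 positions; matplotlib's "hot" colormap stands in for the cold-to-hot coding described above:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_weights(weights, context_len):
    """Visualize per-position feature weights on a cold-to-hot scale:
    rows = nucleotides A, C, G, T; columns = positions in the local context."""
    grid = np.asarray(weights).reshape(4, context_len)  # assumes nucleotide-major layout
    plt.imshow(grid, cmap="hot", vmin=0.0, vmax=1.0, aspect="auto")
    plt.yticks(range(4), list("ACGT"))
    plt.xlabel("position in local context")
    plt.colorbar(label="feature weight")
    plt.show()
```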
Donor prediction: 400 features
(Figure annotations: local context, 3-base periodicity, G-rich region?, T-rich region?)
Yvan Saeys, Donostia 2004
Donor prediction: 528 features Yvan Saeys, Donostia 2004
Donor prediction: 2096 features Yvan Saeys, Donostia 2004
Acceptor prediction: 400 features
(Figure annotations: local context, 3-base periodicity, T-rich region (poly-pyrimidine stretch))
Yvan Saeys, Donostia 2004
Acceptor prediction: 528 features Yvan Saeys, Donostia 2004
Acceptor prediction: 2096 features
(Figure annotations: AG-scanning, TG)
Yvan Saeys, Donostia 2004
Comparison with NBM Yvan Saeys, Donostia 2004
Related & future work
• Embedded feature selection in SVMs with C-retraining
• Feature selection tree: combination of filter feature selection and a decision tree
• Combining Bayesian decision trees and feature selection
• Combinatorial pattern matching in biological sequences
• Feature Selection Toolkit for large-scale applications (FeaST)
Yvan Saeys, Donostia 2004
Why am I here?
• Establish collaboration between our research groups:
  • getting to know each other
  • thinking about future collaborations
  • defining collaborative research projects
• Exchange thoughts / learn more about EDA methods:
  • probabilistic graphical models for classification
  • biological problems
• Some 'test cases' during these months: apply some of 'your' techniques to 'our' data
• …
Yvan Saeys, Donostia 2004
Thank you !! Yvan Saeys, Donostia 2004