DATA MINING APPROACH IGES 2008
Principle
• What is machine learning?
  – Algorithms that allow computers to learn.
  – Unsupervised learning, e.g. cluster analysis
  – Supervised learning, e.g. decision trees, logistic regression
• Data mining
  – Machine learning in large databases.
• Pattern recognition
  – An application of machine learning to speech recognition, face recognition, etc.
Concept: classification
Example of a "play golf" dataset
[Figure: a decision tree splitting on humidity and temperature, contrasted with a lazy classification of the same data]
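A minimal sketch of the idea, assuming a toy "play golf"-style dataset encoded numerically (the feature values and labels below are invented for illustration): an eager learner such as a decision tree builds split rules on temperature and humidity up front, while a lazy classifier such as k-nearest neighbours simply stores the data and decides at prediction time.

```python
# Toy sketch: eager (decision tree) vs. lazy (k-NN) classification
# on an invented "play golf"-style dataset (features: temperature, humidity).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[85, 85], [80, 90], [83, 78], [70, 96],
              [68, 80], [65, 70], [64, 65], [72, 95]])  # temperature, humidity
y = np.array([0, 0, 1, 1, 1, 0, 1, 0])                  # 1 = play golf, 0 = don't

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)    # eager: builds rules now
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)     # lazy: just stores the data

print(export_text(tree, feature_names=["temperature", "humidity"]))
print(tree.predict([[75, 70]]), knn.predict([[75, 70]]))
```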
Classification: performance measurement
Confusion matrix: a confusion matrix is a visualization tool typically used in supervised learning (in unsupervised learning it is usually called a matching matrix). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. One benefit of a confusion matrix is that it makes it easy to see whether the system is confusing two classes (i.e. commonly mislabelling one as another).
Classification: performance measurement
Classification measures (TP = true positive, TN = true negative, FP = false positive, FN = false negative):
• Accuracy = (TP + TN) / (TP + TN + FP + FN): what proportion of cases and controls are correctly classified?
• Sensitivity = TP / (TP + FN): what proportion of cases are correctly classified?
• Specificity = TN / (TN + FP): what proportion of controls are correctly classified?
• Balanced Accuracy = (Sensitivity + Specificity) / 2
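A minimal sketch of these measures, assuming binary labels where 1 = case and 0 = control (the label vectors are invented for illustration): the confusion matrix is built with scikit-learn and the measures above are derived from its four cells.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # actual class (1 = case)
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])   # predicted class

# Rows = actual, columns = predicted; with labels=[0, 1] the layout is
# [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)     # proportion of cases correctly classified
specificity = tn / (tn + fp)     # proportion of controls correctly classified
balanced    = (sensitivity + specificity) / 2
print(accuracy, sensitivity, specificity, balanced)
```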
Classification: performance measurement
Receiver Operating Characteristic (ROC) curve: sensitivity (true positive rate) plotted against 1 - specificity (false positive rate) as the decision threshold varies.
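A minimal sketch of a ROC curve, assuming a classifier that outputs class probabilities (here a logistic regression on invented data); scikit-learn's roc_curve sweeps the decision threshold and auc summarises the curve in a single number.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                               # invented predictors
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # invented labels

probs = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
# fpr = 1 - specificity, tpr = sensitivity; computed here on the training
# data, so the curve is optimistic (use held-out data in practice).
fpr, tpr, thresholds = roc_curve(y, probs)
print("AUC =", auc(fpr, tpr))
```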
Classification: validation
Cross-validation:
• Leave-one-out cross-validation (LOOCV)
  – Better for small datasets
  – Unbiased estimate of the prediction error
• n-fold CV (e.g. 5-fold or 10-fold CV)
  – Better for larger datasets
  – Biased estimate of the prediction error
  – Lower variance
  – May need to be repeated several times and the results averaged
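A minimal sketch comparing LOOCV and 10-fold CV with scikit-learn, assuming a generic classifier on invented genotype-like data; repeating the k-fold split with different shuffles and averaging, as suggested above, is shown with a short loop.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(100, 10))   # invented SNP-like genotypes coded 0/1/2
y = rng.integers(0, 2, size=100)         # invented case/control labels
clf = DecisionTreeClassifier(max_depth=3)

# Leave-one-out: one fold per sample
loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

# 10-fold CV repeated with different shuffles, then averaged
repeats = [cross_val_score(clf, X, y,
                           cv=KFold(n_splits=10, shuffle=True, random_state=seed)).mean()
           for seed in range(5)]
print(loo_acc, np.mean(repeats))
```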
Classification: filters
Statistical filters:
• Minor allele frequency (e.g. > 0.1)
• LD (e.g. r2 < 0.9)
• Chi-square
• Information gain
• Interaction information
• Gain ratio
• Principal components
• ReliefF
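A minimal sketch of two of these filters on invented genotype data: a minor-allele-frequency cut-off and a per-SNP chi-square test of association (via scipy's chi2_contingency). The thresholds are the illustrative values from the slide; the 0.05 p-value cut-off is an added assumption.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(2)
geno = rng.integers(0, 3, size=(300, 50))   # invented genotypes coded 0/1/2
status = rng.integers(0, 2, size=300)       # invented case/control labels

# Filter 1: minor allele frequency > 0.1
maf = geno.mean(axis=0) / 2.0
maf = np.minimum(maf, 1.0 - maf)
keep_maf = maf > 0.1

# Filter 2: chi-square test of genotype vs. status for each SNP
pvals = []
for j in range(geno.shape[1]):
    table = np.zeros((3, 2))
    for g in range(3):
        for c in range(2):
            table[g, c] = np.sum((geno[:, j] == g) & (status == c))
    chi2, p, dof, _ = chi2_contingency(table + 1e-9)  # tiny offset guards empty cells
    pvals.append(p)
keep_chi2 = np.array(pvals) < 0.05
print(keep_maf.sum(), keep_chi2.sum())
```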
Classification: filters
Knowledge filters:
• Prior statistical results
• Prior experimental results
• Biochemical pathways
• Gene Ontology
• Protein-protein interactions
Classification: wrapper
Many search methods:
– Exhaustive
– Hill climbing
  • Neural networks (NN)
  • Simulated annealing (SA)
– Beam
  • Evolutionary algorithms (EA)
  • Genetic algorithms (GA)
  • Genetic programming (GP)
  • Estimation of distribution algorithms (EDA)
Classification: wrapper
Example of GENN: Grammatical Evolution to optimize Neural Networks
[Figure: GE cycle in which a population of neural networks is evolved through recombination, mutation and duplication; GE parameters include the mutation rate, the maximum number of generations, etc.; performance is tracked as classification error and prediction error]
• Step 1: random generation of an initial population of neural networks.
• Step 2: the data are split into 10 parts; 9/10 are used for training.
• Step 3: selection of the n best neural networks.
• Step 4 (GE part 1): evaluation of the networks' performance on the training data; the networks are ranked by error rate and the best ones are selected.
• Step 5 (GE part 2): among the networks selected at step 4, recombination, duplication and mutation events are simulated, creating a new generation of networks.
• Step 6: evaluation of the final network on the remaining 1/10 of the data, measuring its ability to classify affected and unaffected individuals.
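This is not GENN itself, but a heavily simplified sketch of the general wrapper idea on invented data: a small genetic algorithm evolves binary masks selecting SNP subsets, fitness is the cross-validated accuracy of a classifier trained on the selected SNPs, and the best individuals are recombined and mutated each generation. Population size, mutation rate and the decision-tree classifier are all illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.integers(0, 3, size=(200, 30))   # invented genotypes
y = rng.integers(0, 2, size=200)         # invented case/control labels

def fitness(mask):
    """Cross-validated accuracy of a tree trained on the selected SNPs."""
    if mask.sum() == 0:
        return 0.0
    clf = DecisionTreeClassifier(max_depth=3)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=5).mean()

pop = rng.integers(0, 2, size=(20, X.shape[1]))        # random initial population
for generation in range(10):
    scores = np.array([fitness(ind) for ind in pop])   # evaluate on training data
    parents = pop[np.argsort(scores)[::-1][:10]]       # keep the best individuals
    children = []
    for _ in range(10):                                # recombination + mutation
        a, b = parents[rng.integers(0, 10, size=2)]
        cut = rng.integers(1, X.shape[1])
        child = np.concatenate([a[:cut], b[cut:]])     # one-point crossover
        flip = rng.random(X.shape[1]) < 0.02           # mutation rate (assumed)
        children.append(np.where(flip, 1 - child, child))
    pop = np.vstack([parents, np.array(children)])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected SNPs:", np.flatnonzero(best))
```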
Available data mining packages
• Orange: http://www.ailab.si/orange/
• WEKA: http://www.cs.waikato.ac.nz/ml/
Ensemble Learning & Random Forest
Machine Learning Methods
• Decision trees
• Linear regression
• Neural networks
• k-nearest neighbour
• Naïve Bayesian classifiers
• Support Vector Machines
• Ensemble learning methods
  – Bagging, Boosting, Random Forests & RuleFit
• and many more ...
Ensemble Learning Methods
Ensemble learning refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions.
Ensemble Learning Methods
• Accuracy: a more reliable mapping can be obtained by combining the output of multiple "experts".
• Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach), e.g. mixture of experts, ensemble feature selection.
• There is no single model that works for all pattern recognition problems (no free lunch theorem)!
"To solve really hard problems, we'll have to use several different representations... It is time to stop arguing over which type of pattern-classification technique is best... Instead we should work at a higher level of organization and discover how to build managerial systems to exploit the different virtues and evade the different limitations of each of these ways of comparing things." (Minsky, 1991)
Ensemble Learning Methods
• How to generate base classifiers? Generation strategies:
  – Decision tree learning: ID3, C4.5 & CART
  – Instance-based learning: k-nearest neighbour
  – Bayesian classification: Naïve Bayes
  – Neural networks
  – Regression analysis
  – Clustering, etc.
• How to integrate them? Integration strategies:
  – BAGGing = Bootstrap AGGregation (Breiman, 1996)
  – Boosting (Schapire and Singer, 1998)
  – Random Forests (Breiman, 2001)
Ensemble Learning Methods
Bootstrap aggregating (bagging): a meta-algorithm that improves the stability and classification accuracy of machine-learning classification and regression models. Given a standard training set D, bagging generates m new training sets by sampling examples from D uniformly and with replacement (bootstrap samples). The m models are fitted on these m bootstrap samples and combined by averaging the output (for regression) or by voting (for classification).
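A minimal sketch of bagging from scratch on invented data: m bootstrap samples are drawn from the training set D with replacement, one tree is fitted per sample, and predictions are combined by majority vote (scikit-learn's BaggingClassifier packages the same idea).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 8))                 # invented training set D
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # invented labels

m, n = 25, len(X)
models = []
for _ in range(m):
    idx = rng.integers(0, n, size=n)          # bootstrap sample: n draws with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(X_new):
    votes = np.array([mdl.predict(X_new) for mdl in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote over the m trees

print(bagged_predict(X[:5]), y[:5])
```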
Ensemble Learning Methods
Boosting is a machine-learning meta-algorithm for performing supervised learning. Most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. After a weak learner is added, the data are reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight.
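A minimal sketch of this reweighting loop (a discrete AdaBoost-style scheme on invented data, with labels coded as -1/+1): misclassified examples gain weight, correctly classified ones lose weight, and the final strong classifier is a weighted vote of the weak learners. The number of rounds and the stump depth are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 5))                                    # invented predictors
y = np.where(X[:, 0] + 0.3 * rng.normal(size=300) > 0, 1, -1)    # labels in {-1, +1}

n = len(X)
w = np.full(n, 1.0 / n)                        # start with uniform example weights
stumps, alphas = [], []
for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.clip(np.sum(w[pred != y]) / np.sum(w), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)      # weak learner's weight in the final vote
    w *= np.exp(-alpha * y * pred)             # misclassified examples gain weight
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

def boosted_predict(X_new):
    score = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.sign(score)

print(np.mean(boosted_predict(X) == y))        # training accuracy of the ensemble
```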
Random Forest
[Figure: trees 1, 2, 3, ..., i each cast a Yes/No vote; the final classification is based on the votes from all N trees]
Random Forest
Each tree is constructed using the following algorithm (N = number of cases, M = number of variables in the classifier):
• Choose the number m of input variables to be used to determine the decision at a node of the tree; m should be much smaller than M.
• Choose a training set for this tree by drawing N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the remaining cases to estimate the error of the tree by predicting their classes.
• For each node of the tree, randomly choose m variables on which to base the decision at that node, and calculate the best split based on these m variables in the training set.
• Each tree is fully grown and not pruned (as may be done when constructing a normal tree classifier).
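A minimal sketch using scikit-learn's RandomForestClassifier on invented genotype data: max_features plays the role of m (the number of variables tried at each node), bootstrap=True takes the N-with-replacement sample per tree, and oob_score=True uses the left-out cases to estimate the error, as described above. The parameter values are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
X = rng.integers(0, 3, size=(400, 100))        # invented genotypes, M = 100 SNPs
y = rng.integers(0, 2, size=400)               # invented case/control status

forest = RandomForestClassifier(
    n_estimators=500,       # number of fully grown, unpruned trees
    max_features="sqrt",    # m ~ sqrt(M) variables tried at each node
    bootstrap=True,         # each tree sees a bootstrap sample of the N cases
    oob_score=True,         # out-of-bag cases estimate the prediction error
    random_state=0,
).fit(X, y)

print("OOB accuracy:", forest.oob_score_)
print("top SNPs by importance:", np.argsort(forest.feature_importances_)[::-1][:5])
```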
Random Forest
• Its accuracy is as good as AdaBoost's and sometimes better
• It is relatively robust to outliers and noise
• It is faster than bagging or boosting
• It gives useful internal estimates of error, strength, correlation and variable importance
• It is simple and can easily be parallelized
Random Forest (Heidema 2006, BMC Genetics)
Information Theory
Shannon entropy: H(X) = -Σi pi log(pi)
• Probability = uncertainty about a state
• Entropy = uncertainty about the system
B.A. McKinney et al. Evaporative cooling feature selection for genotypic data involving interactions. Bioinformatics
Information Theory
Surprisal or self-entropy of event i: si = -log(pi). If there is no uncertainty about a system, its entropy is 0.
Example for a system X with two possible states (using log base 10): p1 = 10^-5 gives s1 = 5, p2 ≈ 1 gives s2 ≈ 0, and therefore H(X) ≈ 0.
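A minimal sketch of these two quantities in code, using log base 10 so that the surprisal of p1 = 10^-5 comes out as 5, as in the example above.

```python
import numpy as np

def surprisal(p, base=10):
    """Self-entropy s_i = -log(p_i)."""
    return -np.log(p) / np.log(base)

def entropy(probs, base=10):
    """Shannon entropy H(X) = sum_i p_i * s_i, with the convention 0*log(0) = 0."""
    probs = np.asarray(probs, dtype=float)
    nz = probs > 0
    return float(np.sum(probs[nz] * surprisal(probs[nz], base)))

p = [1e-5, 1 - 1e-5]                           # the two-state system X from the slide
print(surprisal(1e-5), surprisal(1 - 1e-5))    # ~5 and ~0
print(entropy(p))                              # H(X) ~ 0: almost no uncertainty
```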
Information Theory
Maximum Entropy Principle (Jaynes, Physical Review 106, 620, 1957)
• The least biased probability distribution is the one that maximizes the information entropy subject to the prior information.
• With no prior information, the least biased (maximum entropy) distribution is uniform.
• Note that entropy is not disorder:
  Sequence 1: 1111100000, p1 = 0.5, p0 = 0.5, H(sequence 1) = log(2)
  Sequence 2: 1100010110, p1 = 0.5, p0 = 0.5, H(sequence 2) = log(2)
Information Theory
Mutual information (correlation): I(A;B) = H(A) + H(B) - H(A,B)
• Information = removal of uncertainty; usually called information gain.
• H(A) is the uncertainty from SNP A, H(B) the uncertainty from SNP B, and I(A;B) the uncertainty removed because of the correlation between A and B.
B.A. McKinney et al. Evaporative cooling feature selection for genotypic data involving interactions. Bioinformatics
Information Theory
Information gain
• Consider two attributes, A and B (two SNPs), and a class label C (disease status).
• The interaction information is IG(ABC) = I(A,B;C) - I(A;C) - I(B;C): the gain obtained by considering the pair (A,B) jointly, beyond the two individual effects.
Information Theory
Information gain (A = SNP1, B = SNP2, C = disease status)
• If IG(ABC) > 0: evidence for an attribute interaction
• If IG(ABC) < 0: the information between A and B is redundant
• If IG(ABC) = 0: evidence of conditional independence, or a mixture of synergy and redundancy
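A minimal sketch of these quantities from plug-in entropy estimates on invented genotype data: IG(ABC) is computed as I(A,B;C) - I(A;C) - I(B;C), so a positive value suggests synergy between the two SNPs with respect to disease status and a negative value redundancy. The simulated status is deliberately interaction-like.

```python
import numpy as np

def H(*vars):
    """Plug-in Shannon entropy (in bits) of the joint distribution of the arguments."""
    joint = np.stack(vars, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def I(X, Y):
    """Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return H(X) + H(Y) - H(X, Y)

rng = np.random.default_rng(7)
A = rng.integers(0, 3, size=500)                   # invented SNP1 genotypes
B = rng.integers(0, 3, size=500)                   # invented SNP2 genotypes
C = (A + B + rng.integers(0, 2, size=500)) % 2     # invented, interaction-like status

AB = A * 3 + B                                     # joint genotype of the SNP pair
IG = I(AB, C) - I(A, C) - I(B, C)                  # interaction information IG(ABC)
print("I(A;C) =", I(A, C), "I(B;C) =", I(B, C), "IG(ABC) =", IG)
```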
Information Theory
Attribute selection based on entropy
• Entropy-based IG is estimated for each individual attribute (i.e. main effects) and for each pairwise combination of attributes (i.e. interaction effects).
• Pairs of attributes are sorted, and those with the highest IG (percentage of entropy in the class removed) are selected for further consideration.
Information Theory
Example application: measuring sequencing quality.
> CCAGAGGCCCAACUGGUAAACGGGC
> CCG-AAGCUCAACGGGAUAAUGAGC
> CCG-AAGCCGAACGGGAAAACCGGC
> CC-CAAGCGC-AGGGGAGAA-GCGC
> CCG-ACGCCA-ACGGGAGAA-UGGC
> CCGUUUUCAG-UCGGGAAAAACUGA
> CCGUUACUCC-UCGGGAUAAAGGAG
> CCGUAAGAGG-ACGGGAUAAACCUC
> CCG-UAGGAG-GCGGGAUAU-CUCC
Kullback-Leibler (relative) entropy: D(p||q) = Σi pi log(pi / qi), where pi is the observed probability and qi the prior probability (under H0, 0.25 for each nucleotide). This gives the relative uncertainty on base U (or any other base) at one specific position.
Gorodkin et al. Comput. Appl. Biosci., Vol. 13, no. 6, pp. 583-586, 1997.
Tuerk et al. PNAS 89, pp. 6988-6992, 1992.
Ahola et al. A statistical score for assessing the quality of multiple sequence alignments. Bioinformatics.
D'haeseleer. What are DNA sequence motifs? Nature Biotechnology.
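A minimal sketch computing the column-wise relative (Kullback-Leibler) entropy for the first three aligned sequences above, with qi = 0.25 for each nucleotide under H0; columns dominated by one base score high, and gap characters are simply skipped here for brevity.

```python
import numpy as np

aln = ["CCAGAGGCCCAACUGGUAAACGGGC",
       "CCG-AAGCUCAACGGGAUAAUGAGC",
       "CCG-AAGCCGAACGGGAAAACCGGC"]           # first three sequences from the slide
bases = "ACGU"
q = 0.25                                       # prior probability under H0

scores = []
for col in zip(*aln):
    col = [b for b in col if b in bases]       # ignore gap characters '-'
    if not col:
        scores.append(0.0)
        continue
    p = np.array([col.count(b) / len(col) for b in bases])
    nz = p > 0
    scores.append(float(np.sum(p[nz] * np.log2(p[nz] / q))))  # sum_i p_i log(p_i / q_i)

print(np.round(scores, 2))                     # one relative-entropy score per column
```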
MDR : Multifactor Dimensionality Reduction
MDR: Steps 1 & 2
• Step 1: partition the data into some number of equal parts for cross-validation.
• Step 2: a set of N genetic and/or discrete environmental factors is selected from the list of all factors.
MDR: Step 3
• The N factors and their multifactor classes or cells are represented in N-dimensional space.
• The ratio of the number of cases to the number of controls is evaluated within each multifactor cell.
MDR: Step 4
• Each multifactor cell in N-dimensional space is labeled high-risk if the ratio meets or exceeds some threshold T (e.g. T = 1.0) and low-risk otherwise.
• The cells labeled high-risk form one group and those labeled low-risk another, which reduces the N-dimensional model to one dimension.
MDR: Steps 5 & 6
• Step 5: all possible combinations of N factors are evaluated for their ability to classify affected and unaffected individuals in the training data, and the best N-factor model is selected.
• Step 6: the independent test data from the cross-validation are used to estimate the prediction error of the best model selected.
MDR overview
MDR: final step
• Steps 1 through 6 are repeated for each possible cross-validation interval.
• The final step: determine which multifactor levels (e.g. genotypes) are high risk and which are low risk using the entire dataset.
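A minimal sketch of the core reduction step (steps 3 and 4 above) for a single pair of SNPs on invented data: each of the 3 x 3 genotype cells is labelled high-risk when its case:control ratio meets the threshold T, and the pair is then collapsed into a single high/low-risk attribute whose naive training accuracy can be evaluated. This is only the cell-labelling idea, not the full cross-validated MDR procedure.

```python
import numpy as np

rng = np.random.default_rng(8)
snp1 = rng.integers(0, 3, size=400)            # invented genotypes for one SNP pair
snp2 = rng.integers(0, 3, size=400)
status = rng.integers(0, 2, size=400)          # 1 = case, 0 = control
T = 1.0                                        # high-risk threshold on case:control ratio

high_risk = np.zeros((3, 3), dtype=bool)
for g1 in range(3):
    for g2 in range(3):
        cell = (snp1 == g1) & (snp2 == g2)
        cases = np.sum(cell & (status == 1))
        controls = np.sum(cell & (status == 0))
        ratio = cases / controls if controls > 0 else (np.inf if cases > 0 else 0.0)
        high_risk[g1, g2] = ratio >= T         # step 4: label the multifactor cell

# Reduce the two-SNP model to a single high/low-risk attribute (one dimension)
reduced = high_risk[snp1, snp2].astype(int)
accuracy = np.mean(reduced == status)          # naive training classification accuracy
print(high_risk, accuracy)
```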
MDR: new features
Interpretation: interaction graphs
• Comprised of a node for each attribute, with pairwise connections between them.
• Each node is labeled with the percentage of entropy removed (i.e. IG) by that attribute.
• Each connection is labeled with the percentage of entropy removed by each pairwise Cartesian product of attributes.
• IG > 0: evidence for an attribute interaction; IG < 0: the information between SNP1 and SNP2 is redundant; IG = 0: evidence of conditional independence or a mixture of synergy and redundancy.
MDR: new features
Interpretation: interaction graphs
[Figure: interaction graph with nodes SNP1, SNP2 and Class; the node labels give each attribute's effect, the SNP1-SNP2 edge gives the attribute correlation, and I(SNP1;SNP2;Class) gives the attribute interaction]
MDR: new features
Interpretation: dendrograms
• Hierarchical clustering is used to build a dendrogram that places strongly interacting attributes close together at the leaves of the tree.
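A minimal sketch of the idea with scipy, assuming a pairwise interaction-strength matrix has already been computed (the values below are invented): strong interaction is converted into small distance, so that strongly interacting SNPs end up close together at the leaves of the dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

snps = ["SNP1", "SNP2", "SNP3", "SNP4"]
interaction = np.array([[0.00, 0.90, 0.10, 0.05],     # invented pairwise interaction
                        [0.90, 0.00, 0.15, 0.05],     # strengths (e.g. information gain)
                        [0.10, 0.15, 0.00, 0.60],
                        [0.05, 0.05, 0.60, 0.00]])

dist = 1.0 - interaction                              # strong interaction -> small distance
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist), method="average")       # agglomerative hierarchical clustering
tree = dendrogram(Z, labels=snps, no_plot=True)       # pass no_plot=False to draw it
print(tree["ivl"])                                    # leaf order: SNP1 and SNP2 pair off first
```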