Machine Learning with TMVA: A ROOT-Based Tool for Multivariate Data Analysis
PANDA Computing Workshop, Groningen, 23.1.2008
The TMVA developer team: Andreas Höcker, Peter Speckmeyer, Jörg Stelzer, Helge Voss
General Event Classification Problem
[Figure: example with k=2 variables (x1, x2) and n=3 categories H1, H2, H3]
• Event described by k variables (that are found to be discriminating): (x_i) ∈ ℝ^k
• Events can be classified into n categories: H1 … Hn
• General classifier: f: ℝ^k → {1,…,n}, (x_i) ↦ {1,…,n}
• TMVA: only n=2
  • Commonly the case in HEP (signal/background)
• Most classification methods: f: ℝ^k → ℝ^d, (x_i) ↦ (y_i)
  • Further: ℝ^d → {1,…,n}, (y_i) ↦ {1,…,n}
• TMVA: d=1; y ≥ y_sep: signal, y < y_sep: background
Outline
• Introduction to the event classification problem
• Classifiers in TMVA
• Usage of TMVA
General Event Classification Problem
[Figure: the same k=2, n=3 example with three candidate decision boundaries: rectangular cuts? linear boundaries? non-linear boundaries?]
• Example: k=2 variables x1, x2; n=3 categories H1, H2, H3
• The problem: how to draw the boundaries between H1, H2, and H3 such that f(x) returns the true nature of x with maximum correctness
• A simple example like this can still be done by hand.
General Event Classification Problem
• What is the optimal boundary f(x) to separate the categories?
• More pragmatic: which classifier is best at finding (or most closely estimating) this optimal boundary? → Machine Learning
  • Large input variable space and complex correlations make manual optimization very difficult
Two general ways to build f(x):
• Supervised learning: the category of each event in the training sample is known. The machine adapts to give the smallest misclassification error on the training sample.
• Unsupervised learning: the correct category of each event is unknown. The machinery tries to discover structures in the dataset.
• All classifiers in TMVA are supervised learning methods.
Classification Problems in HEP
• In HEP mostly two-class problems: signal (S) and background (B)
  • Event level (Higgs searches, …)
  • Cone level (tau-vs-jet reconstruction, …)
  • Track level (particle identification, …)
  • Lifetime and flavour tagging (b-tagging, …)
• Input information
  • Kinematic variables (masses, momenta, decay angles, …)
  • Event properties (jet/lepton multiplicity, sum of charges, …)
  • Event shape (sphericity, Fox-Wolfram moments, …)
  • Detector response (silicon hits, dE/dx, Cherenkov angle, shower profiles, muon hits, …)
Conventional Linear Classifiers
Cut Based
• Widely used because transparent
• Machine optimization is challenging:
  • MINUIT fails for large n due to sparse population of the input parameter space
  • Alternatives are Monte Carlo sampling, genetic algorithms, simulated annealing
Projective Likelihood Estimator
• Probability density estimators for each variable, combined into one
• Much liked in HEP
• Returns the likelihood of a sample belonging to a class
• Projection ignores correlations between variables
  • Significant performance loss for correlated variables
Linear Fisher Discriminant
• Axis in parameter space onto which samples are projected, chosen such that signal and background are pushed far away from each other (the standard closed form is given below)
• Optimal classifier for linearly correlated Gaussian-distributed variables
• Means of signal and background must be different
R.A. Fisher, Annals Eugenics 7, 179 (1936).
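For reference, the Fisher discriminant has a standard closed form (a textbook result, shown only graphically on the original slide):

\[
y_{\mathrm{Fi}}(x) = w \cdot x + w_0, \qquad w \propto W^{-1}\,(\mu_S - \mu_B),
\]

where \(\mu_S\) and \(\mu_B\) are the signal and background means of the input variables and \(W\) is the within-class covariance matrix.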
Common Non-linear Classifiers
Neural Network
• Feed-forward multilayer perceptron
• Non-linear activation function for each neuron
• Weierstrass theorem: can approximate any continuous function to arbitrary precision with a single hidden layer and an infinite number of neurons
[Figure: activation function]
PDE Range-Search, k-Nearest Neighbours
• n-dimensional signal and background PDFs; the probability is obtained by counting the number of signal and background events in the vicinity of the test event (see the counting formula below)
• Range search: the vicinity is a predefined volume
• k-nearest neighbour: adaptive (k events in the volume)
[Figure: test event and its counting volume in the (x1, x2) plane, categories H0 and H1]
Function Discriminant Analysis
• User-provided separation function fitted to the training data
• Simple, transparent discriminator for non-linear problems
• In-between solution (better than Fisher, but not good for complex examples)
T. Carli and B. Koblitz, Nucl. Instrum. Meth. A501, 576 (2003) [hep-ex/0211019]
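The counting estimate sketched above takes the generic form (TMVA's exact normalization and volume weighting may differ; see the Users Guide):

\[
y(x) = \frac{n_S(V(x))}{n_S(V(x)) + n_B(V(x))},
\]

where \(n_S\) and \(n_B\) count the signal and background training events inside the volume \(V(x)\) around the test event.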
Classifiers Recent in HEP
Boosted Decision Trees
• A decision tree is a series of cuts that split the sample into ever smaller subsets; leaves are assigned either signal or background status
• Each split tries to maximize the gain in separation (Gini index; see the formulas below)
• Bottom-up pruning of the decision tree
  • Protects from overtraining (*) by removing statistically insignificant nodes
• A single DT is easy to understand but not powerful
Boosting
• Increase the weight of incorrectly identified events and build a new decision tree
• Final classifier: a 'forest' of decision trees, linearly combined
  • Large coefficient for trees with small misclassification
• Improved performance and stability; little tuning required for good performance
(*) Performance on the training sample statistically better than on an independent test sample
[Plot: D0 single-top analysis using boosted decision trees]
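For concreteness, the Gini index and the AdaBoost prescription commonly used here (and, to my understanding, the TMVA default, not spelled out on the slide):

\[
\mathrm{Gini} = p\,(1-p), \qquad \alpha_i = \ln\frac{1-\mathrm{err}_i}{\mathrm{err}_i}, \qquad y_{\mathrm{boost}}(x) = \sum_i \alpha_i\, h_i(x),
\]

where \(p\) is the signal purity of a node, \(\mathrm{err}_i\) the misclassification rate of tree \(i\), and \(h_i(x) = \pm 1\) its decision.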
Classifiers Recent in HEP
Support Vector Machines
• Optimal hyperplane between linearly separable data (1962)
  • Wrongly classified events add an extra term to the cost function, which is minimized
• Non-separable data becomes linearly separable in higher dimensions: a map Φ: ℝ^n into a higher-dimensional feature space
• Kernel trick (suggested 1964, applied to SVMs in 1992)
  • The cost function depends only on Φ(x)·Φ(y) = K(x,y); no explicit knowledge of Φ is required (a typical kernel is given below)
[Figure: separable data with optimal hyperplane, margin, and support vectors in the (x1, x2) plane]
Learning via Rule Ensembles
• A rule is a set of cuts defining regions in the input parameter space
• Rules are extracted from a forest of decision trees (either from BDT or a random-forest generator)
• Linear combination of rules; coefficients fitted by minimizing the risk of misclassification
• Good performance
C. Cortes and V. Vapnik, "Support vector networks", Machine Learning 20, 273 (1995).
J. Friedman and B.E. Popescu, "Predictive Learning via Rule Ensembles", Technical Report, Statistics Department, Stanford University, 2004.
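A typical kernel choice, in one common parameterization, is the Gaussian kernel behind the "SVM_Gauss" setting used in the quick-start example later:

\[
K(x, y) = \exp\!\left(-\frac{\lVert x - y\rVert^2}{2\sigma^2}\right).
\]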
Data Preprocessing: Decorrelation
[Plots: variable distributions, original vs. SQRT-decorrelated vs. PCA-decorrelated]
• Various classifiers perform sub-optimally in the presence of correlations between input variables (Cuts, projective LH); others become slower (BDT, RuleFit)
• Removal of linear correlations by rotating the input variables:
  • Determine the square root C′ of the covariance matrix C, i.e., C = C′C′
  • Transform the original variables (x_i) into the decorrelated variable space (x′_i) by x′ = C′⁻¹x
• Principal Component Analysis (PCA) is also implemented (a booking sketch follows)
• Note that the decorrelation is only complete if
  • the correlations are linear, and
  • the input variables are Gaussian distributed
• Not a very accurate assumption in general
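In TMVA this preprocessing is requested per classifier through the booking options. A minimal sketch, assuming the VarTransform option syntax of the TMVA version bundled with ROOT 5.18 (check the Users Guide for your release):

  // Book a projective likelihood with square-root decorrelation of the inputs
  factory->BookMethod( TMVA::Types::kLikelihood, "LikelihoodD",
                       "!V:VarTransform=Decorrelate" );
  // The same preprocessing with a PCA transformation instead
  factory->BookMethod( TMVA::Types::kLikelihood, "LikelihoodPCA",
                       "!V:VarTransform=PCA" );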
Is There a Best Classifier?
• Performance
  • In the presence/absence of linear/non-linear correlations
• Speed
  • Training/evaluation time
• Robustness, stability
  • Sensitivity to overtraining, weak input variables
  • Size of the training sample
• Dimensional scalability
  • Do performance, speed, and robustness deteriorate in large dimensions?
• Clarity
  • Can the learning procedure/result be easily understood/visualized?
No Single Best
[Table: qualitative comparison of the classifiers against the criteria above; not recoverable from the extraction]
What is TMVA?
Motivation:
• Classifiers perform very differently depending on the data; all should be tested on a given problem
• Situation for many years: usually only a small number of classifiers were investigated by analysts
• A tool was needed that enables the analyst to simultaneously evaluate the performance of a large number of classifiers on his/her dataset
Design criteria:
• Performance and convenience (a good tool does not have to be difficult to use)
  • Training, testing, and evaluation of many classifiers in parallel
  • Preprocessing of input data: decorrelation (PCA, Gaussianization)
  • Illustrative tools to compare the performance of all classifiers (ranking of classifiers, ranking of input variables, choice of working point)
  • Active protection against overtraining
  • Straightforward application to test data
• Special needs of high energy physics addressed
  • Two classes, event weights, familiar terminology
Using TMVA
A typical TMVA analysis consists of two main steps:
• Training phase: training, testing, and evaluation of classifiers using data samples with known signal and background composition
• Application phase: using selected trained classifiers to classify unknown data samples
Technical Aspects
• TMVA is open source, written in C++, and based on and part of ROOT
  • Development on SourceForge; all information at http://sf.tmva.net
  • Bundled with ROOT since 5.11-03
• Training requires the ROOT environment; the resulting classifiers are also available as standalone C++ code
• Four core developers, many contributors
  • > 2200 downloads since March 2006 (not counting ROOT users)
• Mailing list for reporting problems
• Users Guide at http://sf.tmva.net: 97 pp., classifier descriptions, code examples; arXiv: physics/0703039
Quick Start
  cd /afs/cern.ch/sw/lcg/external/root/5.18.00/slc4_ia32_gcc34/root
  . bin/thisroot.sh
  cd ~
  cp -r $ROOTSYS/tmva/test macros; cd macros   # directory needs to be called macros
  root -l TMVAnalysis.C\(\"MLP,BDT,SVM_Gauss\"\)
Training with TMVA
[Screenshot: the TMVA GUI]
• The user usually starts with the template TMVAnalysis.C (full listing near the end of this section):
  • Choose the training variables
  • Choose the input data
  • Select classifiers (training options are described in the manual and can be printed by specifying option 'H'; a one-line example follows)
• Templates TMVAnalysis.C (also .py) available at $TMVA/macros/ and $ROOTSYS/tmva/test/
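For example, booking a BDT so that it prints its method-specific option help during setup (the 'H' flag; the NTrees value here is an illustrative setting, not a recommendation):

  factory->BookMethod( TMVA::Types::kBDT, "BDT", "H:!V:NTrees=400" );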
Evaluation Output
Remark on overtraining:
• Occurs when the classifier training becomes sensitive to the events of the particular training sample, rather than just to their generic features
• Sensitivity to overtraining depends on the classifier: e.g., Fisher is insensitive, BDT very sensitive
• Detect overtraining: compare performance between training and test samples
• Counteract overtraining: e.g., smooth likelihood PDFs, prune decision trees, …

Evaluation results ranked by best signal efficiency and purity (area)
------------------------------------------------------------------------------
MVA              Signal efficiency at bkg eff. (error):   | Sepa-    Signifi-
Methods:         @B=0.01    @B=0.10    @B=0.30    Area    | ration:  cance:
------------------------------------------------------------------------------
Fisher       :   0.268(03)  0.653(03)  0.873(02)  0.882   | 0.444    1.189
MLP          :   0.266(03)  0.656(03)  0.873(02)  0.882   | 0.444    1.260
LikelihoodD  :   0.259(03)  0.649(03)  0.871(02)  0.880   | 0.441    1.251
PDERS        :   0.223(03)  0.628(03)  0.861(02)  0.870   | 0.417    1.192
RuleFit      :   0.196(03)  0.607(03)  0.845(02)  0.859   | 0.390    1.092
HMatrix      :   0.058(01)  0.622(03)  0.868(02)  0.855   | 0.410    1.093
BDT          :   0.154(02)  0.594(04)  0.838(03)  0.852   | 0.380    1.099
CutsGA       :   0.109(02)  1.000(00)  0.717(03)  0.784   | 0.000    0.000
Likelihood   :   0.086(02)  0.387(03)  0.677(03)  0.757   | 0.199    0.682
------------------------------------------------------------------------------
(Classifiers are listed from better to worse.)

Testing efficiency compared to training efficiency (overtraining check)
------------------------------------------------------------------------------
MVA              Signal efficiency: from test sample (from training sample)
Methods:         @B=0.01            @B=0.10            @B=0.30
------------------------------------------------------------------------------
Fisher       :   0.268 (0.275)      0.653 (0.658)      0.873 (0.873)
MLP          :   0.266 (0.278)      0.656 (0.658)      0.873 (0.873)
LikelihoodD  :   0.259 (0.273)      0.649 (0.657)      0.871 (0.872)
PDERS        :   0.223 (0.389)      0.628 (0.691)      0.861 (0.881)
RuleFit      :   0.196 (0.198)      0.607 (0.616)      0.845 (0.848)
HMatrix      :   0.058 (0.060)      0.622 (0.623)      0.868 (0.868)
BDT          :   0.154 (0.268)      0.594 (0.736)      0.838 (0.911)
CutsGA       :   0.109 (0.123)      1.000 (0.424)      0.717 (0.715)
Likelihood   :   0.086 (0.092)      0.387 (0.379)      0.677 (0.677)
------------------------------------------------------------------------------
(Note the large train/test gaps for PDERS and BDT.)
More Evaluation Output
Input variable ranking (how useful is a variable?):
--- Fisher : Ranking result (top variable is best ranked)
--- Fisher : ----------------------------------------------------------------
--- Fisher : Rank : Variable : Discr. power
--- Fisher : ----------------------------------------------------------------
--- Fisher : 1    : var4     : 2.175e-01
--- Fisher : 2    : var3     : 1.718e-01
--- Fisher : 3    : var1     : 9.549e-02
--- Fisher : 4    : var2     : 2.841e-02
--- Fisher : ----------------------------------------------------------------

Classifier correlation and overlap (do the classifiers perform the same separation into signal and background?):
--- Factory : Inter-MVA overlap matrix (signal):
--- Factory : ------------------------------
--- Factory :             Likelihood  Fisher
--- Factory : Likelihood:     +1.000  +0.667
--- Factory : Fisher:         +0.667  +1.000
--- Factory : ------------------------------
If two classifiers have similar performance but significantly non-overlapping classifications, check whether you can combine them!
Graphical Evaluation
• Classifier output distributions for an independent test sample
[Plots: signal and background output distributions for the booked classifiers]
Graphical Evaluation
• There is no unique way to express the performance of a classifier; several benchmark quantities are computed by TMVA
• Signal efficiency at various background efficiencies (= 1 − rejection) when cutting on the classifier output
• The separation, with ŷ_S and ŷ_B the normalized signal and background output PDFs:
  ⟨S²⟩ = ½ ∫ (ŷ_S(y) − ŷ_B(y))² / (ŷ_S(y) + ŷ_B(y)) dy
• The "rarity" R(y) = ∫_{−∞}^{y} ŷ_B(y′) dy′ is also implemented (flat for background)
  • Allows comparison of signal shapes between different classifiers
  • Quick check: the rarity of background-like data should be flat
Visualization Using the GUI
• Projective likelihood PDFs, MLP training, BDTs, …
[Plots: e.g., a pruned decision tree; average no. of nodes before/after pruning: 4193 / 968]
Choosing a Working Point
• Depending on the problem, the user might want to
  • achieve a certain signal purity, signal efficiency, or background reduction, or
  • find the selection that results in the highest signal significance (depending on the expected signal and background statistics), as sketched below
• Using the TMVA graphical output one can determine at which classifier output value to cut in order to separate signal from background
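A common figure of merit for the second case (one conventional choice; the appropriate significance measure depends on the analysis):

\[
\Sigma(y_{\mathrm{cut}}) = \frac{\varepsilon_S(y_{\mathrm{cut}})\, N_S}{\sqrt{\varepsilon_S(y_{\mathrm{cut}})\, N_S + \varepsilon_B(y_{\mathrm{cut}})\, N_B}},
\]

where \(N_S\) and \(N_B\) are the expected signal and background yields, \(\varepsilon_S\) and \(\varepsilon_B\) the efficiencies of the cut on the classifier output, and the working point is the cut value that maximizes \(\Sigma\).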
Applying the Trained Classifier
• Use the TMVA::Reader class; example in TMVApplication.C (full listing at the end of this section):
  • Set the input variables
  • Book the classifier with the weight file (contains all information)
  • Compute the classifier response inside the event loop and use it
• Template TMVApplication.C available at $TMVA/macros/ and $ROOTSYS/tmva/test/
Applying the Trained Classifier (II)
• Also available as a standalone C++ class without any ROOT dependence
• Can be put into an executable (ideal for GRID jobs)

Excerpt from ClassApplication.C:

  std::vector<std::string> inputVars;
  // ... fill inputVars with the training variable names, in the same order
  classifier = new ReadMLP( inputVars );
  for (int i = 0; i < nEv; i++) {
    std::vector<double> inputVec = …; // the input variable values for event i
    double retval = classifier->GetMvaValue( inputVec );
  }
Extending TMVA
• A user might have his own implementation of a multivariate classifier, or might want to use an external one
• With ROOT 5.18.00 (16 Jan 2008) the user can seamlessly evaluate and compare his own classifier within TMVA:
  • Requirement: the class must be derived from TMVA::MethodBase and implement the TMVA::IMethod interface
  • The class must be added to the Factory via ROOT's plugin mechanism (a hedged sketch follows)
  • Training, testing, evaluation, and comparison can then be done as usual
• Example in TMVAnalysis.C
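A minimal sketch of the plugin registration, assuming a hypothetical class TMVA::MethodMyMethod in a library MyMethodLib; the handler name and especially the constructor signature must match what the TMVA Factory of your ROOT version expects, so treat this as illustrative only:

  // Register the hypothetical method with ROOT's plugin manager
  // before booking it through the Factory (all names are placeholders).
  gROOT->GetPluginManager()->AddHandler( "TMVA@@MethodBase", "MyMethod",
      "TMVA::MethodMyMethod", "MyMethodLib",
      "MethodMyMethod(DataSetInfo&,TString)" );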
Conclusion Remarks
• Multivariate classifiers are no black boxes; we just need to understand them
• Cuts and likelihood are transparent: if they perform well, use them
• In the presence of correlations, other classifiers are better
  • though at any rate more difficult to understand
• Enormous growth in acceptance in HEP over the recent decade
• TMVA provides the means to train, evaluate, compare, and apply different classifiers
• TMVA also tries, through visualization, to improve the understanding of the internals of each classifier

Acknowledgments: The fast development of TMVA would not have been possible without the contribution and feedback from many developers and users, to whom we are indebted. We thank in particular the CERN summer students Matt Jachowski (Stanford) for the implementation of TMVA's new MLP neural network, Yair Mahalalel (Tel Aviv) for a significant improvement of PDERS, and Or Cohen for the development of the general classifier boosting, the Krakow student Andrzej Zemla and his supervisor Marcin Wolter for programming a powerful Support Vector Machine, as well as Rustem Ospanov for the development of a fast k-NN algorithm. We are grateful to Doug Applegate, Kregg Arms, René Brun and the ROOT team, Tancredi Carli, Zhiyi Liu, Elzbieta Richter-Was, Vincent Tisserand and Alexei Volk for helpful conversations.
Outlook
Primary development for this summer: generalized classifiers
1. Be able to boost or bag any classifier
2. Combine any classifier with any other classifier, using any combination of input variables in any phase-space region
Item 1 is ready and now in testing mode, to be deployed after the upcoming ROOT release.
A Word on the Treatment of Systematics
• There is no difference in principle between systematics evaluation for single discriminating variables and for MV classifiers
  • Use a control sample to estimate the uncertainty on the classifier output (not necessarily for each input variable)
  • Advantage: correlations are automatically taken into account
Some things could be done:
• Example: var4 may in reality have a shifted central value and hence a worse discrimination power
• One can ignore the systematic in the training:
  • var4 appears stronger in the training than it might be
  • suboptimal performance (bad training, not wrong)
  • the classifier response will strongly depend on var4 and hence have a larger systematic uncertainty
• Better: train with a shifted (weakened) var4, then evaluate the systematic error on the classifier output (see the sketch below)
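One way to set up such a weakened training exploits the fact that AddVariable accepts formula expressions (as in the TMVAnalysis.C listing below); the offset of 0.1 is a made-up illustration, not a recommendation:

  // Train with var4 systematically shifted to weaken its apparent power;
  // compare the resulting classifier output with the nominal training.
  factory->AddVariable( "var1+var2", 'F' );
  factory->AddVariable( "var1-var2", 'F' );
  factory->AddVariable( "var3", 'F' );
  factory->AddVariable( "var4+0.1", 'F' ); // shifted central value (illustrative)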
Checker Board Example
• Performance achieved without parameter tuning: PDERS and BDT are the best "out of the box" classifiers
• After specific tuning, SVM and MLP also perform well
[Plots: checker-board toy data and background-rejection curves, compared to the theoretical maximum]
Linear, Cross, and Circular Correlations
• Illustrate the behaviour of linear and non-linear classifiers
[Plots: toy samples with linear correlations (same for signal and background), linear correlations (opposite for signal and background), and circular correlations (same for signal and background)]
Linear, Cross, and Circular Correlations
• Plot test events weighted by classifier output (red: signal-like, blue: background-like)
[Plots: Likelihood, Likelihood-D, PDERS, Fisher, MLP, and BDT responses for the samples with linear (same), cross-linear (opposite), and circular (same) correlations]
Final Performance
• Background rejection versus signal efficiency curves:
[Plots: ROC curves for the linear, cross, and circular examples]
Stability with Respect to Irrelevant Variables
• Toy example with 2 discriminating and 4 non-discriminating variables
[Plots: ROC curves using only the two discriminating variables vs. using all variables in the classifiers]
TMVAnalysis.C Script for Training

void TMVAnalysis()
{
   TFile* outputFile = TFile::Open( "TMVA.root", "RECREATE" );

   // create the Factory
   TMVA::Factory *factory = new TMVA::Factory( "MVAnalysis", outputFile, "!V" );

   // give training/test trees
   TFile *input = TFile::Open( "tmva_example.root" );
   factory->AddSignalTree    ( (TTree*)input->Get("TreeS"), 1.0 );
   factory->AddBackgroundTree( (TTree*)input->Get("TreeB"), 1.0 );

   // register the input variables
   factory->AddVariable( "var1+var2", 'F' );
   factory->AddVariable( "var1-var2", 'F' );
   factory->AddVariable( "var3", 'F' );
   factory->AddVariable( "var4", 'F' );
   factory->PrepareTrainingAndTestTree( "",
      "NSigTrain=3000:NBkgTrain=3000:SplitMode=Random:!V" );

   // select the MVA methods
   factory->BookMethod( TMVA::Types::kLikelihood, "Likelihood",
      "!V:!TransformOutput:Spline=2:NSmooth=5:NAvEvtPerBin=50" );
   factory->BookMethod( TMVA::Types::kMLP, "MLP",
      "!V:NCycles=200:HiddenLayers=N+1,N:TestRate=5" );

   // train, test, and evaluate
   factory->TrainAllMethods();
   factory->TestAllMethods();
   factory->EvaluateAllMethods();

   outputFile->Close();
   delete factory;
}
TMVApplication.C Script for Application

void TMVApplication()
{
   // create the Reader
   TMVA::Reader *reader = new TMVA::Reader( "!Color" );

   // register the variables
   Float_t var1, var2, var3, var4;
   reader->AddVariable( "var1+var2", &var1 );
   reader->AddVariable( "var1-var2", &var2 );
   reader->AddVariable( "var3", &var3 );
   reader->AddVariable( "var4", &var4 );

   // book the classifier(s)
   reader->BookMVA( "MLP classifier", "weights/MVAnalysis_MLP.weights.txt" );

   // prepare the event loop
   TFile *input = TFile::Open( "tmva_example.root" );
   TTree* theTree = (TTree*)input->Get( "TreeS" );
   // ... set branch addresses for the user TTree

   for (Long64_t ievt=3000; ievt<theTree->GetEntries(); ievt++) {
      theTree->GetEntry( ievt );

      // compute the input variables
      var1 = userVar1 + userVar2;
      var2 = userVar1 - userVar2;
      var3 = userVar3;
      var4 = userVar4;

      // calculate the classifier output
      Double_t out = reader->EvaluateMVA( "MLP classifier" );
      // do something with it ...
   }
   delete reader;
}