Status of TMVA, the Toolkit for MultiVariate Analysis
Eckhard von Toerne (University of Bonn)
For the TMVA core developer team: A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E.v.T., H. Voss
Outline
• Overview
• New developments
• Recent physics results that use TMVA
• Web site: http://tmva.sourceforge.net/
• See also: "TMVA - Toolkit for Multivariate Data Analysis", A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E. v. Toerne, H. Voss et al., arXiv:physics/0703039v5 [physics.data-an]
What is TMVA
• Supervised learning
• Classification and regression tasks
• Easy to train, evaluate and compare various MVA methods
• Various preprocessing methods (decorrelation, PCA, Gaussianisation, ...)
• Integrated in ROOT
[Figure: distribution of the MVA output]
TMVA workflow
• Training:
  • Classification: learn the features of the different event classes from a sample with known signal/background composition
  • Regression: learn the functional dependence between input variables and targets
• Testing:
  • Evaluate the performance of the trained classifier/regressor on an independent test sample
  • Compare different methods
• Application:
  • Apply the classifier/regressor to real data
Classification/Regression
• Classification of signal/background: how to find the best decision boundary?
• Regression: how to determine the correct model?
[Figure: three panels showing different decision boundaries between the classes H0 and H1 in the (x1, x2) plane]
How to choose a method?
• If you have a training sample with only a few events: the number of "parameters" must be limited; use a linear classifier or FDA, a small BDT, or a small MLP
• If the variables are uncorrelated (or have only linear correlations): likelihood
• If you just want something simple: use Cuts, LD, Fisher
• For complex problems: use BDT, MLP, SVM
List of acronyms:
• BDT = boosted decision tree, see manual page 103
• ANN = artificial neural network
• MLP = multi-layer perceptron, a specific form of ANN, also the name of our flagship ANN, manual p. 92
• FDA = functional discriminant analysis, see manual p. 87
• LD = linear discriminant, manual p. 85
• SVM = support vector machine, manual p. 98; SVM currently available only for classification
• Cuts = as in "cut selection", manual p. 56
• Fisher = Ronald A. Fisher, classifier similar to LD, manual p. 83
Artificial Neural Networks: Feed-forward Multilayer Perceptron
[Diagram: Nvar discriminating input variables feed 1 input layer, k hidden layers (sizes M1 ... Mk), and 1 output layer with 2 output classes (signal and background); each node applies an "activation" function to a weighted sum of its inputs]
• Modelling of arbitrary nonlinear functions as a nonlinear combination of simple "neuron activation functions"
• Advantages:
  • very flexible, no assumption about the function necessary
• Disadvantages:
  • "black box"
  • needs tuning
  • seed dependent
Boosted Decision Trees
• Grow a forest of decision trees and determine the event class/target by majority vote
• Weights of misclassified events are increased in the next iteration
• Advantages:
  • ignores weak variables
  • works out of the box
• Disadvantages:
  • vulnerable to overtraining
No Single Best Classifier…
• The properties of the function discriminant (FDA) depend on the chosen function
Neyman-Pearson Lemma
• Neyman-Pearson: the likelihood ratio used as "selection criterion" y(x) gives for each selection efficiency the best possible background rejection, i.e. it maximizes the area under the "Receiver Operating Characteristic" (ROC) curve
• The "limit" in the ROC curve is therefore given by the likelihood ratio
[Figure: ROC curve, background rejection (1 - ε_backgr.) vs signal efficiency (ε_signal), both from 0 to 1; the diagonal corresponds to random guessing, curves bending toward the upper right mean better classification; moving along the curve trades the Type-1 error (small to large) against the Type-2 error (large to small)]
• Varying the cut on y(x) moves the working point (efficiency and purity) along the ROC curve
• How to choose the cut? One needs to know the prior probabilities (S, B abundances):
  • Measurement of a signal cross section: maximum of S/√(S+B), or equivalently √(ε·p)
  • Discovery of a signal: maximum of S/√B
  • Precision measurement: high purity (p)
  • Trigger selection: high efficiency (ε)
Performance with toy data
• 3-dimensional distribution; signal: sum of Gaussians; background: flat
• Theoretical limit calculated using the Neyman-Pearson lemma
• Neural net (MLP) with two hidden layers and backpropagation training; the Bayesian option has little influence on high-statistics training
• The TMVA ANN converges towards the theoretical limit for sufficient Ntrain (~100k)
Recent developments
• Current version: TMVA 4.1.2, in ROOT release 5.30
• Unit test framework for daily software and method performance validation (C. Rosemann, E.v.T.)
• Multiclass classification for MLP, BDTG, FDA
• BDT automatic parameter optimization for building the tree architecture
• New method to treat data with distinct sub-populations (Method Category)
• Optional Bayesian treatment of ANN weights in MLP with back-propagation (Jiahang Zhong)
• Extended PDEFoam functionality (A. Voigt)
• Variable transformations on a user-defined subset of variables
Unit test (C. Rosemann, E.v.T.)
• Automated framework to verify functionality and performance (ours is based on B. Eckel's description)
• A slimmed version runs every night on various OS
*************************************************************** * TMVA - U N I T test : Summary * *************************************************************** Test 0 : Event [107/107]....................................OK Test 1 : VariableInfo [31/31]...............................OK Test 2 : DataSetInfo [20/20]................................OK Test 3 : DataSet [15/15]....................................OK Test 4 : Factory [16/16]....................................OK Test 7 : LDN_selVar_Gauss [4/4].............................OK .... Test 107 : BoostedPDEFoam [4/4]..............................OK Test 108 : BoostedDTPDEFoam [4/4]............................OK Total number of failures: 0 ***************************************************************
And now: switching from statistics to physics … acknowledging the hard work of our users
Review of recent results
• CMS Coll., H-->WW search, Phys.Lett.B699:25-47,2011.
• BDT trained on individual mH samples with 10 variables
• Expect 1.47 signal events at mH = 160 GeV, compared to 1.27 events with the cut-based analysis (on 4 variables) and the same background
Review of recent results
• Super-Kamiokande Coll., "Kinematic reconstruction of atmospheric neutrino events in a large water Cherenkov detector with proton identification", Phys.Rev.D79:112010,2009.
• 7 input variables
• MLP with one hidden layer
[Figure: MLP output distributions for signal and background]
Review of recent results
• CDF Coll., "First Observation of Electroweak Single Top Quark Production", Phys.Rev.Lett.103:092002,2009.
• BDT analysis with ~20 input variables
• lepton + missing ET + jets
• Results for the s+t channel
Review of recent results using TMVA
• CDF+D0 combined Higgs working group, hep-ex/1107.5518. (SVM)
• CMS Coll., H-->WW search, Phys.Lett.B699:25-47,2011. (BDT)
• IceCube Coll., astro-ph/1101.1692. (MLP)
• D0 Coll., top pairs, Phys.Rev.D84:012008,2011. (BDT)
• IceCube Coll., Phys.Rev.D83:012001,2011. (BDT)
• IceCube Coll., Phys.Rev.D82:112003,2010. (BDT)
• D0 Coll., Higgs search, Phys.Rev.Lett.105:251801,2010. (BDT)
• CDF Coll., single top, Phys.Rev.D82:112005,2010. (BDT)
• D0 Coll., single top, Phys.Lett.B690:5-14,2010. (BDT)
• D0 Coll., top pairs, Phys.Rev.D82:032002,2010. (Likelihood)
• CDF Coll., single top obs., Phys.Rev.Lett.103:092002,2009. (BDT)
• Super-Kamiokande Coll., Phys.Rev.D79:112010,2009. (MLP)
• BABAR Coll., Phys.Rev.D79:051101,2009. (BDT)
• + other papers
• + several ATLAS papers with TMVA about to come out…
• + many ATLAS results about to be published…
Thank you for using TMVA!
Summary
• TMVA: a versatile package for classification and regression tasks
• Integrated into ROOT
• Easy to train classifiers/regression methods
• A multitude of physics results based on TMVA are coming out
• Thank you for your attention!
Credits
• TMVA is open source software
• Use & redistribution of source permitted according to the terms of the BSD license
• Several similar data-mining efforts exist, with rising importance in most fields of science and industry
Contributors to TMVA: Andreas Hoecker (CERN, Switzerland), Jörg Stelzer (CERN, Switzerland), Peter Speckmayer (CERN, Switzerland), Jan Therhaag (Universität Bonn, Germany), Eckhard von Toerne (Universität Bonn, Germany), Helge Voss (MPI für Kernphysik Heidelberg, Germany), Moritz Backes (Geneva University, Switzerland), Tancredi Carli (CERN, Switzerland), Asen Christov (Universität Freiburg, Germany), Or Cohen (CERN, Switzerland and Weizmann, Israel), Krzysztof Danielowski (IFJ and AGH/UJ, Krakow, Poland), Dominik Dannheim (CERN, Switzerland), Sophie Henrot-Versille (LAL Orsay, France), Matthew Jachowski (Stanford University, USA), Kamil Kraszewski (IFJ and AGH/UJ, Krakow, Poland), Attila Krasznahorkay Jr. (CERN, Switzerland, and Manchester U., UK), Maciej Kruk (IFJ and AGH/UJ, Krakow, Poland), Yair Mahalalel (Tel Aviv University, Israel), Rustem Ospanov (University of Texas, USA), Xavier Prudent (LAPP Annecy, France), Arnaud Robert (LPNHE Paris, France), Christoph Rosemann (DESY), Doug Schouten (S. Fraser U., Canada), Fredrik Tegenfeldt (Iowa University, USA, until Aug 2007), Alexander Voigt (CERN, Switzerland), Kai Voss (University of Victoria, Canada), Marcin Wolter (IFJ PAN Krakow, Poland), Andrzej Zemla (IFJ PAN Krakow, Poland), Jiahang Zhong (Academia Sinica, Taipei).
A complete TMVA training/testing session

void TMVAnalysis()
{
   // Create Factory
   TFile* outputFile = TFile::Open( "TMVA.root", "RECREATE" );
   TMVA::Factory *factory = new TMVA::Factory( "MVAnalysis", outputFile, "!V" );

   // Initialize trees
   TFile *input = TFile::Open("tmva_example.root");

   // Add variables / targets
   factory->AddVariable("var1+var2", 'F');
   factory->AddVariable("var1-var2", 'F');
   //factory->AddTarget("tarval", 'F');   // for regression

   factory->AddSignalTree    ( (TTree*)input->Get("TreeS"), 1.0 );
   factory->AddBackgroundTree( (TTree*)input->Get("TreeB"), 1.0 );
   //factory->AddRegressionTree( (TTree*)input->Get("regTree"), 1.0 );

   factory->PrepareTrainingAndTestTree( "", "",
      "nTrain_Signal=200:nTrain_Background=200:nTest_Signal=200:nTest_Background=200:NormMode=None" );

   // Book MVA methods
   factory->BookMethod( TMVA::Types::kLikelihood, "Likelihood",
      "!V:!TransformOutput:Spline=2:NSmooth=5:NAvEvtPerBin=50" );
   factory->BookMethod( TMVA::Types::kMLP, "MLP",
      "!V:NCycles=200:HiddenLayers=N+1,N:TestRate=5" );

   // Train, test and evaluate
   factory->TrainAllMethods();   // factory->TrainAllMethodsForRegression();
   factory->TestAllMethods();
   factory->EvaluateAllMethods();

   outputFile->Close();
   delete factory;
}
What is a multi-variate analysis?
• "Combine" all input variables into one output variable
• Supervised learning means learning by example: the program extracts patterns from training data
[Diagram: input variables → classifier → output]
Metaclassifiers: Category Classifier and Boosting
• The category classifier is custom-made for HEP
• Use different classifiers for different phase space regions and combine them into a single output
• TMVA supports boosting for all classifiers
• Use a collection of "weak learners" to improve their performance (boosted Fisher, boosted neural nets with few neurons each, …)