Explore the use of data mining in analyzing atmospheric neutrinos with IceCube: data preprocessing, training and validation of learning algorithms, results, and the transfer to other IceCube detector configurations. Learn about random forest training, validation, and background rejection in a real-life application.
Application of Data Mining Algorithms in Atmospheric Neutrino Analyses with IceCube
Tim Ruhe, TU Dortmund
Outline
• Data mining is more...
• Why is IceCube interesting (from a machine learning point of view)?
• Data preprocessing and dimensionality reduction
• Training and validation of a learning algorithm
• Results
• Other detector configurations
• Summary & Outlook
Data Mining is more...
• Examples (annotated): historical data and simulations feed a learning algorithm, which produces a model
• Preprocessing comes first: garbage in, garbage out
• Validation checks the model before it is used
• Application: the model turns new (not annotated) data into information, knowledge, Nobel prize(s)
Why is IceCube interesting from a machine learning point of view?
• Huge amount of data
• Highly imbalanced distribution of event classes (signal and background)
• Huge amount of data to be processed by the learner (Big Data)
• Real-life problem
Preprocessing (1): Reducing the Data Volume Through Cuts
• Background rejection: 91.4%
• Signal efficiency: 57.1%
BUT: the remaining background is significantly harder to reject!
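Both figures of merit are simple ratios of event counts before and after the cuts. A minimal Python sketch, with hypothetical counts chosen to reproduce the quoted percentages (they are not the analysis values):

```python
# Minimal sketch: background rejection and signal efficiency of a cut.
# The event counts below are placeholders, not the analysis values.
n_bkg_before, n_bkg_after = 1_000_000, 86_000
n_sig_before, n_sig_after = 100_000, 57_100

background_rejection = 1.0 - n_bkg_after / n_bkg_before
signal_efficiency = n_sig_after / n_sig_before

print(f"Background rejection: {background_rejection:.1%}")  # 91.4%
print(f"Signal efficiency:    {signal_efficiency:.1%}")     # 57.1%
```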
Preprocessing (2): Variable Selection
Starting from 2600 variables:
• Check for missing values; exclude a variable if more than 30% of its values are missing.
• Check for potential bias; exclude everything that is useless, redundant, or a source of potential bias.
• Check for correlations; exclude one variable of every pair with a correlation of 1.0.
• Automated feature selection reduces the remainder to 477 variables.
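The missing-value and correlation checks can be automated; the bias check needs physics judgment and is left out here. A minimal pandas sketch, assuming a numeric feature table:

```python
import pandas as pd

def prefilter(df: pd.DataFrame, max_missing: float = 0.30) -> pd.DataFrame:
    """Drop variables with too many missing values or perfect correlations."""
    # 1) Drop columns where more than 30% of the values are missing.
    df = df.loc[:, df.isna().mean() <= max_missing]

    # 2) Drop one variable of every pair with |correlation| == 1.0.
    corr = df.corr().abs()
    cols = corr.columns
    drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= 1.0 and cols[j] not in drop:
                drop.add(cols[j])
    return df.drop(columns=list(drop))
```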
Relevance vs. Redundancy: MRMR (continuous case)

Redundancy (mean absolute correlation within the selected variable set $S$):
$$W = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} \left| c(x_i, x_j) \right|$$

Relevance (F-statistic between each variable and the class label $c$):
$$V = \frac{1}{|S|} \sum_{x_i \in S} F(x_i, c)$$

MRMR (maximize the difference or the quotient):
$$\max_S \, (V - W) \quad \text{or} \quad \max_S \, (V / W)$$
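A sketch of greedy MRMR selection in the difference form, using scikit-learn's f_classif for the F-test relevance and Pearson correlation for the redundancy; it illustrates the criterion on toy data, not the plugin actually used in the analysis:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

def mrmr(X: np.ndarray, y: np.ndarray, k: int) -> list:
    """Greedy MRMR (difference form): F-test relevance, |Pearson r| redundancy."""
    relevance = f_classif(X, y)[0]                # F(x_i, c) for every variable
    corr = np.abs(np.corrcoef(X, rowvar=False))   # |c(x_i, x_j)|
    selected = [int(np.argmax(relevance))]        # start with the most relevant
    while len(selected) < k:
        rest = [i for i in range(X.shape[1]) if i not in selected]
        # score = relevance minus mean correlation to the already selected set
        scores = [relevance[i] - corr[i, selected].mean() for i in rest]
        selected.append(rest[int(np.argmax(scores))])
    return selected

X, y = make_classification(n_samples=2_000, n_features=30, random_state=0)
print(mrmr(X, y, k=10))
```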
Feature Selection Stability

Jaccard index of two selected feature sets:
$$J(F_a, F_b) = \frac{|F_a \cap F_b|}{|F_a \cup F_b|}$$

Average over many ($l$) sets of variables:
$$\hat{J} = \frac{2}{l(l-1)} \sum_{i=1}^{l-1} \sum_{j=i+1}^{l} J(F_i, F_j)$$
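A minimal sketch of this stability measure on small example sets:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: overlap of two feature sets relative to their union."""
    return len(a & b) / len(a | b)

def average_jaccard(feature_sets: list) -> float:
    """Mean pairwise Jaccard index over l selected feature sets."""
    l = len(feature_sets)
    pairs = [(feature_sets[i], feature_sets[j])
             for i in range(l) for j in range(i + 1, l)]
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

print(average_jaccard([{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}]))
```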
Training and Validation of a Random Forest
• Use an ensemble of simple decision trees
• Obtain the final classification as an average over all trees
• 5-fold cross-validation to validate the performance of the forest (see the sketch below)
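A sketch of this setup in scikit-learn on toy data (the analysis itself was built in RapidMiner on simulated IceCube events):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in data; the real analysis used simulated muon and neutrino events.
X, y = make_classification(n_samples=5_000, n_features=25, weights=[0.9, 0.1],
                           random_state=42)

# An ensemble of simple decision trees; the forest output is the average vote.
forest = RandomForestClassifier(n_estimators=500, random_state=42)

# 5-fold cross-validation to validate the performance of the forest.
scores = cross_val_score(forest, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```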
Random Forest and Cross Validation in Detail (1)
• Background muons (CORSIKA, Polygonato): 750,000 in total, 600,000 available for training
• Neutrinos (NuGen, E-2 spectrum): 70,000 in total, 56,000 available for training
• Sampling: 27,000 events from each class per training round
Random Forest and Cross Validation in Detail (2)
• Train a forest of 500 trees on the 27,000 + 27,000 sampled events
• Apply it to the held-out events: 150,000 background muons and 14,000 neutrinos available for testing
• Repeat (×5), as sketched below
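A sketch of the balanced sampling per cross-validation round, with index arrays standing in for the simulated events; the numbers are those quoted above:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# 600,000 simulated muons and 56,000 simulated neutrinos are available for
# training; each round draws 27,000 of each, trains a 500-tree forest on the
# balanced sample, and tests on the held-out events.
muon_pool = np.arange(600_000)       # index arrays standing in for events
neutrino_pool = np.arange(56_000)

for round_index in range(5):
    muons = rng.choice(muon_pool, size=27_000, replace=False)
    neutrinos = rng.choice(neutrino_pool, size=27_000, replace=False)
    train_sample = np.concatenate([muons, neutrinos])
    # ... train on train_sample, then apply to the remaining
    # 150,000 muons and 14,000 neutrinos available for testing
```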
Random Forest Output
We need an additional cut on the output of the Random Forest!
Random Forest Output: Cut at 500 trees
• 28,830 ± 480 neutrino candidates expected from simulation
• Applied to experimental data: 27,771 neutrino candidates
This yields:
• Background rejection: 99.9999%
• Signal efficiency: 18.2%
• Estimated purity: (99.59 ± 0.37)%
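A sketch of cutting on the forest's vote fraction and reading purity and signal efficiency off labeled test data; the toy data set and the 0.98 threshold are illustrative (the slide's cut requires all 500 trees to agree):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

# Fraction of the 500 trees voting "signal"; the slide's cut corresponds to
# requiring all trees to agree, here 0.98 is used for the toy data.
votes = forest.predict_proba(X_te)[:, 1]
selected = votes >= 0.98

purity = y_te[selected].mean() if selected.any() else float("nan")
efficiency = selected[y_te == 1].mean()   # fraction of true signal kept
print(f"purity: {purity:.4f}, signal efficiency: {efficiency:.3f}")
```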
Unfolding the spectrum
This is no Data Mining... TRUEE ...but it ain't magic either
Moving on... IC79
• The entire analysis chain can be applied to other detector configurations...
• ...with minor changes (e.g. ice model)
• 212 neutrino candidates per day
• 66,885 neutrino candidates in total
• 330 ± 200 background muons
Summary and Outlook
• MRMR + Random Forest: 99.9999% background rejection
• Purities above 99% are routinely achieved
• Future improvements? By starting at an earlier analysis level...
RapidMiner in a Nutshell
• Developed at the Department of Computer Science at TU Dortmund (formerly YALE)
• Operator-based, written in Java
• It used to be open source
• Many, many plugins due to a rather active community
• One of the most widely used data mining tools
What I like about it
• Data flow is nicely visualized and can easily be followed and comprehended
• Rather easy to learn, even without programming experience
• Large community (updates, bugfixes, plugins)
• Professional tool (they actually make money with it!)
• Good support
• Many tutorials can be found online, even specialized ones
• Most operators work like a charm
• Extendable
Relevance vs. Redundancy: MRMR (discrete case)

Redundancy (mean pairwise mutual information within the selected set $S$):
$$W = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j)$$

Relevance (mutual information between each variable and the class label $c$):
$$V = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c)$$

with the mutual information
$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$

MRMR (maximize the difference or the quotient):
$$\max_S \, (V - W) \quad \text{or} \quad \max_S \, (V / W)$$
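A minimal sketch of this estimate for discrete arrays; scikit-learn's mutual_info_score computes the same quantity:

```python
import numpy as np

def mutual_information(x: np.ndarray, y: np.ndarray) -> float:
    """I(X;Y) = sum over (x,y) of p(x,y) * log(p(x,y) / (p(x) p(y)))."""
    n = len(x)
    joint = {}
    for xi, yi in zip(x, y):
        joint[(xi, yi)] = joint.get((xi, yi), 0) + 1
    px = {v: np.mean(x == v) for v in set(x.tolist())}
    py = {v: np.mean(y == v) for v in set(y.tolist())}
    return sum((c / n) * np.log((c / n) / (px[xi] * py[yi]))
               for (xi, yi), c in joint.items())

x = np.array([0, 0, 1, 1, 1, 0])
y = np.array([0, 0, 1, 1, 0, 0])
print(mutual_information(x, y))   # > 0: the variables share information
```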
Feature Selection Stability

Jaccard:
$$J(F_a, F_b) = \frac{|F_a \cap F_b|}{|F_a \cup F_b|}$$

Kuncheva (two sets of fixed size $k$ drawn from $n$ variables, with overlap $r = |F_a \cap F_b|$):
$$I_C(F_a, F_b) = \frac{rn - k^2}{k(n - k)}$$
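A sketch of Kuncheva's index; unlike Jaccard it corrects the raw overlap for what two random draws of k out of n variables would share by chance:

```python
def kuncheva(a: set, b: set, n_total: int) -> float:
    """Kuncheva's consistency index for two equal-size feature sets.

    Corrects the raw overlap r = |a & b| for the overlap expected by chance
    when drawing k features out of n_total.
    """
    k, r = len(a), len(a & b)
    assert len(b) == k, "Kuncheva's index assumes equal-size sets"
    return (r * n_total - k ** 2) / (k * (n_total - k))

print(kuncheva({"a", "b", "c"}, {"a", "b", "d"}, n_total=20))
```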
Ensemble Methods
• With weights (e.g. Boosting)
• Without weights (e.g. Random Forest)
Random Forest: What is randomized?
• Randomness 1: the events each tree is trained on (bagging)
• Randomness 2: the variables that are available for a split
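A sketch of both randomizations with a single scikit-learn decision tree: a bootstrap sample of events (bagging) plus a random subset of variables offered at each split via max_features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
rng = np.random.default_rng(0)

# Randomness 1 (bagging): each tree sees a bootstrap sample of the events.
rows = rng.integers(0, len(X), size=len(X))

# Randomness 2: only a random subset of the variables is offered per split;
# max_features="sqrt" makes the tree draw that subset at every split itself.
tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
tree.fit(X[rows], y[rows])
```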