Gamma/Hadron separation in atmospheric Cherenkov telescopes Overview

Gamma/Hadron separation in atmospheric Cherenkov telescopes
• Overview
• multi-wavelength astrophysics
• imaging Cherenkov telescopes (IACT-s)
• image classification
• methods under study
• trying for a rigorous comparison
R.K.Bock, Durham, March 2002

Wavelength regimes in astrophysics
• extend over 20 orders of magnitude in energy, if one adds infrared, radio and microwave observations
• Cherenkov telescopes use visible light, but few quanta: ‘imaging’ takes on a different meaning
• some instruments have to be satellite-based, due to the absorbing effect of the atmosphere

Full sky at different wavelengths

An AGN at different wavelengths

Objects of interest: active galactic nuclei
Black holes spin and develop a jet with shock waves: electrons and protons get accelerated and impart their energy to high-energy γ-rays

Principle of imaging Cherenkov telescopes
• a shower develops in the atmosphere; charged relativistic particles emit Cherenkov radiation (at wavelengths from visible to UV)
• some photons arrive at ground level and get reflected by a mirror to a camera
• high sensitivity and good time resolution are vital, precision is not: high-reflectivity mirrors, the best possible photomultipliers in the camera
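As a reminder of why the light is emitted in a narrow cone, the standard Cherenkov condition can be written as follows. The symbols are not in the slides: n is the refractive index of the medium, β the particle velocity in units of c, and θ_c the emission angle.

```latex
\cos\theta_c = \frac{1}{n\,\beta}, \qquad \text{emission only if } \beta > \frac{1}{n}
```

For air near sea level (n ≈ 1.0003, an assumed round value) and β → 1 this gives θ_c of roughly 1.4°, and the angle is smaller higher up where the air is thinner.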

Principle of imaging Cherenkov telescopes

Principle of image parameters
• hadron showers (cosmics) dominate the hardware trigger; image analysis must discriminate gammas from hadrons
• showers show different characteristics (as in any calorimeter): feature extraction using principal component analysis and other characteristics must be used; experiment with a view to the best separation
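The principal-component step can be sketched as a charge-weighted second-moment analysis of the camera image. This is a minimal illustration, not the MAGIC analysis code: the function name `image_moments` and the returned quantities (length, width, orientation) are assumptions chosen for the example.

```python
import numpy as np

def image_moments(x, y, charge):
    """Return (length, width, angle) of a camera image.

    x, y   : pixel coordinates in the camera plane
    charge : signal per pixel, used as weight
    The names and the choice of outputs are illustrative assumptions.
    """
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, charge))
    w = w / w.sum()                      # normalise the weights
    dx = x - np.sum(w * x)               # offsets from the image centroid
    dy = y - np.sum(w * y)
    cov = np.array([[np.sum(w * dx * dx), np.sum(w * dx * dy)],
                    [np.sum(w * dx * dy), np.sum(w * dy * dy)]])
    eigval, eigvec = np.linalg.eigh(cov)  # eigenvalues in ascending order
    width, length = np.sqrt(np.clip(eigval, 0.0, None))
    major = eigvec[:, 1]                  # principal (major) axis
    angle = np.arctan2(major[1], major[0])
    return length, width, angle

# Toy image: five equally bright pixels along the x axis
length, width, angle = image_moments([-2, -1, 0, 1, 2], [0, 0, 0, 0, 0],
                                     [1, 1, 1, 1, 1])
```

The eigenvalues of the weighted covariance matrix give the spread along and across the image axis; further parameters would be derived from the same moments.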

One of the predecessor telescopes (HEGRA) in 1999

Photomontage of the MAGIC telescope in La Palma (2000)

Installing the mirror dish of MAGIC, La Palma, Dec 2001


Multivariate classification
• cuts are in the n-space of features (in our case image parameters); the problem gets unwieldy even at low n
• correlations between the features make simple cuts in the variables an ineffective method
• decorrelation by standard methods (e.g. Karhunen-Loève) does not solve the problem, being a linear operation
• finding new variables does help, so do cuts along one axis whose values depend on features along a different axis: dynamic cuts (subjective!)
• ideally, a transformation to a single test statistic should be found
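The difference between a static and a dynamic cut can be shown in a few lines. Everything specific here is an assumption for illustration: the parameter names `width` and `size`, the linear dependence on log10(size), and the coefficients `a`, `b` and `width_max`.

```python
import numpy as np

def static_cut(width, width_max=0.12):
    """Keep events whose width is below one fixed limit (hypothetical value)."""
    return np.asarray(width, dtype=float) < width_max

def dynamic_cut(width, size, a=0.05, b=0.03):
    """Keep events whose width is below a limit that depends on size.

    The limit a + b*log10(size) and its coefficients are hypothetical;
    in practice they would be tuned on training data.
    """
    width = np.asarray(width, dtype=float)
    size = np.asarray(size, dtype=float)
    return width < a + b * np.log10(size)

# Two events with the same width but very different size
keep = dynamic_cut([0.12, 0.12], [100.0, 10000.0])
```

Here the same width fails the cut for the small image and passes for the large one, which a single fixed limit cannot express.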

Different classification methods
• cuts in the image parameters (including dynamic cuts)
• mathematically optimized cuts in the image parameters: classification and regression tree (CART), commercial products available
• linear discriminant analysis (LDA)
• composite (2-D) probabilities (CP)
• kernel methods
• artificial neural networks (ANN)

There are many general methods on the market (this slide from A.Faruque, Mississippi State University)

Method details and comments: cuts and supercuts
• wide experience exists in many physics experiments and for all IACT-s; any method claiming to be superior must use results from these as a yardstick
• needs an optimization criterion; does not result in a relation between gamma acceptance and hadron contamination (i.e. no single test statistic)
• usually leads to separate studies and approximations for each new data set (this is past experience); often difficult to reproduce
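One possible optimization criterion for a single cut is the quality factor Q = ε_γ / √ε_h, scanned over candidate cut values. The slides do not name a criterion, so Q, the function name `scan_cut` and the "keep if value < threshold" convention are assumptions for this sketch.

```python
import numpy as np

def scan_cut(gamma_vals, hadron_vals, thresholds):
    """Scan an upper cut 'value < t' and return (eff_gamma, eff_hadron, Q).

    Q = eff_gamma / sqrt(eff_hadron) is one common choice of criterion;
    it is set to 0 where no hadron survives, to avoid dividing by zero.
    """
    g = np.asarray(gamma_vals, dtype=float)
    h = np.asarray(hadron_vals, dtype=float)
    t = np.asarray(thresholds, dtype=float)
    eff_g = (g[None, :] < t[:, None]).mean(axis=1)   # gamma acceptance
    eff_h = (h[None, :] < t[:, None]).mean(axis=1)   # hadron acceptance
    with np.errstate(divide="ignore", invalid="ignore"):
        q = np.where(eff_h > 0, eff_g / np.sqrt(eff_h), 0.0)
    return eff_g, eff_h, q

# Toy values of one image parameter for Monte Carlo gammas and hadrons
eff_g, eff_h, q = scan_cut([0.1, 0.2, 0.3, 0.4],
                           [0.25, 0.35, 0.45, 0.55],
                           [0.3, 0.5])
best_threshold = [0.3, 0.5][int(np.argmax(q))]
```

A different criterion (for example significance at fixed observation time) would generally pick a different cut, which is why results from cut-based analyses are hard to compare.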

Method details and comments: CART
• developed originally by high-energy physicists to do away with the randomness in optimizing cuts (Breiman, Friedman, Olshen, Stone, 1984)
• now developed into a data mining method, commercially available from several companies
• basic operations: growing a tree, pruning it, splitting the leaves again, done in some heuristic succession
• the problem is to find a robust measure to choose from the many trees that are (or can be) grown
• made for large samples: no experience with IACT-s, but there are promising early results
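The elementary step in growing a tree is the search for the best split of one node along one variable. The sketch below uses the Gini impurity as the split measure; the function names are illustrative, and a full CART implementation would add recursion over all variables, pruning and tree selection.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of 0/1 labels (1 = gamma, 0 = hadron)."""
    if labels.size == 0:
        return 0.0
    p = labels.mean()
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return (threshold, impurity) of the best single cut on one variable.

    Candidate thresholds lie midway between neighbouring sorted values;
    threshold is None if no split improves on the unsplit node.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels, dtype=float)
    order = np.argsort(values)
    v, l = values[order], labels[order]
    best = (None, gini(l))
    for i in range(1, v.size):
        if v[i] == v[i - 1]:
            continue                       # no threshold between equal values
        left, right = l[:i], l[i:]
        impurity = (left.size * gini(left) + right.size * gini(right)) / l.size
        if impurity < best[1]:
            best = (0.5 * (v[i - 1] + v[i]), impurity)
    return best

threshold, impurity = best_split([0.1, 0.2, 0.3, 0.4], [1, 1, 0, 0])
```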

Method details and comments: LDA
• parametric method, finding linear combinations of the original image parameters such that the separation between signal (gamma) and background (hadron) distributions gets maximized
• fast, simple and (probably) very robust
• ignores non-linear correlations in n-dimensional space (because of linear transformation)
• little experience with LDA in IACT-s; early tests show that higher-order variables are needed (e.g. x, y → x²y)

Method details and comments: LDA

Method details and comments: LDA
Like Principal Component Analysis (PCA), LDA is used for data classification and dimensionality reduction. LDA maximizes the ratio of between-class variance to within-class variance for any pair of data sets, which gives the best separability achievable with a linear transformation. The prime difference between LDA and PCA is that PCA performs feature classification (e.g. image parameters!) while LDA performs data classification. PCA changes both the shape and location of the data in its transformed space, whereas LDA provides more class separability by building a decision region between the classes.
The formalism is simple: the transformation into the ‘best separable space’ is performed by the eigenvectors of a matrix readily derived from the data (for our application: in two classes, gammas and hadrons).
Caveat: both PCA and LDA are linear transformations; they may be of limited efficiency when non-linearity is involved.
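For two classes the eigenvector problem reduces to a single direction w proportional to S_w⁻¹ (m_γ − m_h), where S_w is the within-class scatter matrix and m the class means. The sketch below computes that direction; the function name and the toy Gaussian samples are assumptions for illustration, and at least two features per event are assumed.

```python
import numpy as np

def fisher_direction(gammas, hadrons):
    """Unit vector maximising between-class over within-class variance.

    gammas, hadrons : arrays of shape (n_events, n_features), n_features >= 2
    """
    g = np.asarray(gammas, dtype=float)
    h = np.asarray(hadrons, dtype=float)
    mean_g, mean_h = g.mean(axis=0), h.mean(axis=0)
    # Within-class scatter: sum of the two (unnormalised) covariance matrices
    s_w = (np.cov(g, rowvar=False) * (len(g) - 1)
           + np.cov(h, rowvar=False) * (len(h) - 1))
    w = np.linalg.solve(s_w, mean_g - mean_h)
    return w / np.linalg.norm(w)

# Toy samples: the classes differ only along the first feature
rng = np.random.default_rng(1)
toy_gammas = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(500, 2))
toy_hadrons = rng.normal(loc=[3.0, 0.0], scale=1.0, size=(500, 2))
w = fisher_direction(toy_gammas, toy_hadrons)
test_statistic_g = toy_gammas @ w     # one number per event
test_statistic_h = toy_hadrons @ w
```

Projecting each event onto w yields a single test statistic; higher-order variables such as products of image parameters can simply be appended as extra feature columns before the fit.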

Method details and comments: kernel
• kernel density estimation is a nonparametric multivariate classification technique. Its advantage is generality: the class-conditional densities are estimated consistently, without assuming a parametric form
• uses individual event likelihoods, defined as the closeness to the population of gamma events or hadron events in n-dimensional space. The closeness is expressed by a kernel function as metric
• mathematically convincing, but leads to practical problems, including limitations in dimensionality; there is also some randomness in choosing the kernel function
• has been tried at Whipple (the earliest functioning IACT), results look convincing; however, Whipple still uses supercuts; only first experience with kernels in MAGIC: positive
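A minimal version of the kernel approach estimates the gamma and hadron densities at each event with a Gaussian kernel and combines them into a single number. The isotropic Gaussian kernel, the common fixed `bandwidth` and the function names are assumptions; the choice of kernel and bandwidth is exactly the arbitrary element mentioned above.

```python
import numpy as np

def kde_gauss(train, points, bandwidth):
    """Gaussian kernel density estimate of 'train' evaluated at 'points'."""
    train = np.atleast_2d(np.asarray(train, dtype=float))
    points = np.atleast_2d(np.asarray(points, dtype=float))
    n, d = train.shape
    diff = points[:, None, :] - train[None, :, :]        # all pairwise offsets
    sq = np.sum(diff ** 2, axis=2) / bandwidth ** 2
    norm = n * (np.sqrt(2.0 * np.pi) * bandwidth) ** d
    return np.exp(-0.5 * sq).sum(axis=1) / norm

def gamma_likelihood(gammas, hadrons, events, bandwidth=0.5):
    """Return f_gamma / (f_gamma + f_hadron) per event, a value in [0, 1]."""
    f_g = kde_gauss(gammas, events, bandwidth)
    f_h = kde_gauss(hadrons, events, bandwidth)
    return f_g / (f_g + f_h + np.finfo(float).tiny)

# Toy training populations in a 2-D feature space
toy_gammas = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
toy_hadrons = [[5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
likelihood = gamma_likelihood(toy_gammas, toy_hadrons,
                              [[0.0, 0.0], [5.0, 5.0]])
```

The cost grows with the product of training and test sample sizes, and the density estimate degrades quickly as the number of dimensions rises, which is the dimensionality limitation noted above.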

Method details and comments: kernel

Method details and comments: composite probabilities (2-D)
• intuitive determination of event probabilities by multiplying the probabilities in all 2-D projections that can be made from image parameters, using constant bin content for some data
• shown on some IACT data to at least match the best existing results (but strict comparisons suffered from changing data sets)

Method details and comments: composite probabilities (2-D)
The CP program uses same-content binning in 2 dimensions.
Bins are set up for gammas (red), probabilities are evaluated for protons (blue).
All possible 2-D projections are used.
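The idea can be sketched as follows; this is not the CP program. Several simplifications are assumed: bin edges are taken from the marginal quantiles of the gamma sample (which gives only approximately equal content per 2-D cell), the per-cell probability is defined as f_γ / (f_γ + f_h) with a small regularisation `eps`, and all function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def inner_edges(values, n_bins):
    """Inner bin edges such that each 1-D bin holds about equal content."""
    q = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(values, q)

def cell_fractions(ix, iy, n_bins):
    """Fraction of a sample falling into each (ix, iy) cell."""
    counts = np.zeros((n_bins, n_bins))
    np.add.at(counts, (ix, iy), 1.0)
    return counts / len(ix)

def composite_probability(gammas, hadrons, events, n_bins=4, eps=1e-3):
    """Product over all 2-D projections of a per-cell gamma probability."""
    gammas, hadrons, events = (np.asarray(a, dtype=float)
                               for a in (gammas, hadrons, events))
    prob = np.ones(len(events))
    for i, j in combinations(range(gammas.shape[1]), 2):
        ex = inner_edges(gammas[:, i], n_bins)   # bins are set up on gammas
        ey = inner_edges(gammas[:, j], n_bins)

        def cells(sample):
            return (np.searchsorted(ex, sample[:, i], side="right"),
                    np.searchsorted(ey, sample[:, j], side="right"))

        f_g = cell_fractions(*cells(gammas), n_bins)
        f_h = cell_fractions(*cells(hadrons), n_bins)
        p_gamma = (f_g + eps) / (f_g + f_h + 2.0 * eps)
        prob *= p_gamma[cells(events)]
    return prob

# Toy samples with three image parameters
rng = np.random.default_rng(2)
toy_gammas = rng.normal(0.0, 1.0, size=(500, 3))
toy_hadrons = rng.normal(3.0, 1.0, size=(500, 3))
prob = composite_probability(toy_gammas, toy_hadrons,
                             [[0.0, 0.0, 0.0], [3.0, 3.0, 3.0]])
```

Because the projections are not independent, the product is a ranking statistic rather than a calibrated probability.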

Method details and comments: ANN-s
• method has been presented often in the past; resembles the CART method but works in locally linearly transformed data
• substantial randomness in choosing depth of tree, training method, transfer function...
• so far no convincing results on IACT-s; Whipple have tried and rejected it

Gamma events in MAGIC before and after cleaning

Proton events in MAGIC before and after cleaning

Comparison MC gammas / MC protons

Different methods on the same data set
Typically, optimization parameters are fully defined by cost, purity, and sample size

We are running a comparative study: criteria
• strictly defined disjoint training and control samples
• must give estimators for hadron contamination and gamma acceptance (purity and cost)
• should ideally result in a smooth function relating purity with cost, i.e. result in a single test statistic
• if not, must show results for several optimization criteria, e.g. estimated hadron contamination at fixed gamma acceptance values, significance, etc.
• for MC events, can control results by comparing classification to the known origin of events
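Two of these criteria are easy to express in code: a disjoint split into training and control samples, and the curve relating gamma acceptance to hadron acceptance as the cut on a single test statistic is varied. The function names, the 50/50 split and the "keep if statistic >= cut" convention are assumptions for this sketch.

```python
import numpy as np

def split_disjoint(n_events, train_fraction=0.5, seed=0):
    """Return disjoint index arrays (training, control) covering all events."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_events)
    n_train = int(train_fraction * n_events)
    return perm[:n_train], perm[n_train:]

def acceptance_curve(stat_gamma, stat_hadron, n_points=50):
    """Acceptances for cuts 'statistic >= cut' scanned over the full range.

    Inputs are test-statistic values of control-sample MC gammas and hadrons,
    whose true origin is known.
    """
    g = np.sort(np.asarray(stat_gamma, dtype=float))
    h = np.sort(np.asarray(stat_hadron, dtype=float))
    cuts = np.linspace(min(g[0], h[0]), max(g[-1], h[-1]), n_points)
    # searchsorted(..., 'left') counts the events strictly below each cut
    acc_g = 1.0 - np.searchsorted(g, cuts, side="left") / g.size
    acc_h = 1.0 - np.searchsorted(h, cuts, side="left") / h.size
    return cuts, acc_g, acc_h

train_idx, control_idx = split_disjoint(10)
cuts, acc_g, acc_h = acceptance_curve([0.6, 0.7, 0.8, 0.9],
                                      [0.1, 0.2, 0.3, 0.7], n_points=3)
```

Plotting `acc_h` against `acc_g` for each method on the same control sample gives directly comparable curves; methods without a single test statistic contribute only isolated points.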

Even if there were a clear conclusion...
• there remain some serious caveats
• these methods all assume an abstract space of image parameters, which is OK in Monte Carlo situations only
• real data are subject to influences that distort this space:
  • starfield and night sky background
  • atmospheric conditions
  • unavoidable detector changes and malfunction
• no method can invent new independent parameters
• we assume that in the final analysis, gammas will be Monte Carlo, measurements are on/off: we must deal with variables which may not be representative in Monte Carlo events and yet influence the observed image parameters; e.g. zenith angle changes continuously, energy is something we want to observe, hence unknown
• some compromise between frequent Monte Carlo-ing and parametric corrections to parameters is the likely solution
