Explore how artificial intelligence can assist in data mining and pattern recognition in astronomy, with a focus on neural networks and their applications.
Neural Tools for Astronomical Data Mining: The Astrovirtual collaboration
Giuseppe Longo
Department of Physics - DSF, University Federico II of Napoli & INAF-NA
longo@na.infn.it
In collaboration with:
C. Donalek, E. Puddu, S. Sessa – DSF & INAF-NA
A. Ciaramella, G. Raiconi, A. Staiano, A. Volpicelli, R. Tagliaferri – DMI/SA
F. Pasian, R. Smareglia & A. Zacchei – INAF-TS
Munich, 10-14 June 2002
Some quotes…
"A major paradigm shift is now taking place in astronomy and space science. Astronomy has suddenly become an immensely data-rich field, with numerous digital sky surveys across a range of wavelengths, with many Terabytes of pixels and with billions of detected sources, often with tens of measured parameters for each object… traditional data analysis methods are inadequate to cope with this sudden increase in the data volume…"
R.J. Brunner, S.G. Djorgovski and T.A. Prince, Massive Datasets in Astronomy, astro-ph/0106
"We would all testify to the growing gap between the generation of data and our understanding of it…"
Ian H. Witten & E. Frank, Data Mining
Where may A.I. fit into astronomical work?
K.D.D.
A.I. tools (soft computing: neural networks, fuzzy sets, genetic algorithms, etc.)
The purpose of KDD (Knowledge Discovery in Databases) is to identify patterns and to extract new knowledge from databases in which the dimension, complexity or amount of data has so far been prohibitively large for unaided human efforts… Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful…
• This is not a technology which you can apply blindly and expect to get good results. Different problems yield to different techniques…
• The implementation of effective KDD tools is expensive (time, computing, need for specialists) and requires coordinated efforts between astronomers and computer scientists (even on a semantic level)
Neural networks as grey boxes
[Figure: network diagram with inputs x1 … x4, hidden-layer units z1 … zn and output y; labels: INPUT, OUTPUT, guess, feedback]
• Input layer (n neurons)
• M hidden layers (1 or 2)
• Output layer (n' < n neurons)
• Neurons are connected by weighted links and respond through activation functions
• Different NNs are obtained from different topologies, different activation functions, etc.
INTERPOLATION, PATTERN RECOGNITION
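The layer structure listed above can be made concrete with a minimal sketch of a forward pass through a network with one hidden layer. Everything here is an assumption chosen for illustration: the tanh activation, the linear output layer, the function name `mlp_forward` and all the weights are made up, and nothing is taken from the AstroVirtual code.

```python
import math

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer perceptron (illustrative only).

    x        : list of n input values
    w_hidden : m rows of n weights (one row per hidden neuron)
    b_hidden : m hidden biases
    w_out    : n' rows of m weights (one row per output neuron)
    b_out    : n' output biases
    """
    # Hidden layer: weighted sum of the inputs passed through tanh.
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w_hidden, b_hidden)]
    # Output layer: linear combination of the hidden activations.
    return [sum(w * zi for w, zi in zip(row, z)) + b
            for row, b in zip(w_out, b_out)]

# Four inputs, three hidden neurons, one output (all weights are made up).
x = [0.5, -1.0, 0.25, 2.0]
w_hidden = [[0.1, 0.2, -0.3, 0.0],
            [0.0, -0.1, 0.4, 0.2],
            [0.3, 0.0, 0.0, -0.2]]
b_hidden = [0.0, 0.1, -0.1]
w_out = [[1.0, -1.0, 0.5]]
b_out = [0.2]
y = mlp_forward(x, w_hidden, b_hidden, w_out, b_out)
```

Training (the "feedback" arrow in the diagram) would adjust the weights so that the guess `y` approaches the known target; that step is left out of this sketch.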
Some "astronomical" examples
Pixel space / Catalogue space
• Object detection, deblending (segmentation)
• Data quality (quality of auxiliary & scientific frames, …)
• Data compression
• Search for known (supervised clustering)
• Search for unknown
• Time series analysis (unevenly sampled data, etc.)
• All tasks requiring pattern recognition or interpolation (classification, etc.)
• Visualization of multiparametric spaces
Supervised vs unsupervised
• Supervised
  • The NN learns from a set of examples
  • Requires "a priori knowledge" (i.e. training, validation & test sets)
  • Very accurate & faster than traditional methods
• Unsupervised
  • The NN works on statistical properties of the data
  • Does not require any "a priori knowledge"
  • May be complemented by "labeled" data
Each tool has its pros and cons
• MLPs (multi-layer perceptrons): fast, mainly supervised, easy implementation of non-linearity
• SOM (self-organizing maps): a little slower, unsupervised, non-linear, great visualization capabilities, non-physical output
• GTM (generative topographic mapping): slower, unsupervised, great visualization, physical output
• PCA & ICA (principal and independent component analysis), linear and non-linear: poor visualization, physical output, best on correlated inputs
• Fuzzy similarities: slow on large volumes of data, ill-defined problems
• Etc…
The AstroVirtual package
[Flowchart: open/import; header check (non-compliant: head/proc., compliant: preprocessing); supervised/unsupervised choice; parameter options; parameter and training options; labeled/unlabeled choice; training set preparation; label preparation; feature selection via unsupervised clustering; tools: fuzzy set, SOM, GTM, MLP, RBF, etc.; interpretation]
Code written in MATLAB & C++
DEMO on this laptop
ASTRONOMICAL APPLICATIONS
• Object extraction
• Star/galaxy classification
• Data quality from telemetry data (TNG – LTA)
• Photometric redshifts for SDSS-EDR
• Time series analysis (Cepheids, binaries, AGN, etc.)
PARTICLE PHYSICS
• Data analysis of VIRGO experiment (noise removal)
• Data analysis of neutrino-oscillation (CERN/INFN) experiment (apex position and energy)
• Data analysis of ARGO experiment (event detection and energy)
Unsupervised S/G classification
• Input data: DPOSS catalogue (ca. 5x10^5 objects)
• SOM (output is a U-Matrix); GTM (output is a PDF)
• Feature selection (backward elimination strategy)
• Compression of input space and re-design of network
• Classification
• Labeling (500 well classified objects)
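To illustrate what unsupervised SOM clustering does, here is a minimal sketch that trains a small map on two synthetic clusters. It rests on several assumptions: the map is a 1-D chain rather than the 2-D grid a real application would normally use, the learning-rate and neighbourhood schedules are arbitrary choices, and the data are random toy points standing in for catalogue parameters, not DPOSS measurements.

```python
import math
import random

def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda k: math.dist(weights[k], x))

def train_som_chain(data, n_nodes=6, epochs=30, lr0=0.5, seed=0):
    """Train a 1-D SOM (a chain of nodes) with online updates."""
    rng = random.Random(seed)
    dim = len(data[0])
    lo = [min(p[d] for p in data) for d in range(dim)]
    hi = [max(p[d] for p in data) for d in range(dim)]
    # Linear initialisation: spread the nodes evenly across the data range.
    weights = [[lo[d] + (hi[d] - lo[d]) * k / (n_nodes - 1) for d in range(dim)]
               for k in range(n_nodes)]
    sigma0 = n_nodes / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.5 * frac  # neighbourhood shrinks towards 0.5
        samples = list(data)
        rng.shuffle(samples)
        for x in samples:
            bmu = best_matching_unit(weights, x)
            for k in range(n_nodes):
                # Gaussian neighbourhood measured along the chain.
                h = math.exp(-((k - bmu) ** 2) / (2.0 * sigma ** 2))
                for d in range(dim):
                    weights[k][d] += lr * h * (x[d] - weights[k][d])
    return weights

# Two synthetic clusters standing in for two object classes (toy data).
toy = random.Random(1)
cluster_a = [[toy.gauss(0.0, 0.3), toy.gauss(0.0, 0.3)] for _ in range(50)]
cluster_b = [[toy.gauss(5.0, 0.3), toy.gauss(5.0, 0.3)] for _ in range(50)]
weights = train_som_chain(cluster_a + cluster_b)
bmu_a = best_matching_unit(weights, [0.0, 0.0])
bmu_b = best_matching_unit(weights, [5.0, 5.0])
```

No labels are used during training; the two clusters simply end up on different nodes. Attaching names such as "star" or "galaxy" to those nodes is the separate labeling step listed in the last bullet.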
Star/Galaxy classification: automatic selection of significant features
Unsupervised SOM (DPOSS data)
Labeling: localization of a set of 500 faint stars
G.T.M. unsupervised clustering; S/G – CDF Field
[Figure panels: stars p.d.f., galaxies p.d.f., cumulative p.d.f.]
G.T.M. unsupervised clustering; S/G – CDF Field (5x10^5 objects)
[Figure panels: stars p.d.f., galaxies p.d.f., cumulative p.d.f.]
Photometric redshifts: a mixed case
[Flowchart: SDSS-EDR DB; SOM supervised, feature selection; SOM unsupervised, set construction; MLP supervised, experiments; best MLP model; SOM unsupervised, completeness; reliability map]
• Input data set: SDSS-EDR photometric data (galaxies)
• Training/validation/test set: SDSS-EDR spectroscopic subsample
Step 1: feature selection (BES, backward elimination strategy)
Unsupervised/labeled SOM
Input parameters (ra, dec, fibermag, petromag, mag, petro_r50, rho, etc.)
Selected features: r; u-g; g-r; r-i; i-z; r50; r90; rho
Step 2: auxiliary set construction
Unsupervised (SOM) to identify significant clusters in N-dimensional input space (complete coverage of training set)
Construction of training/validation and test sets representative of the input data
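The backward elimination idea of Step 1 can be sketched as a greedy loop: start from all input parameters and keep dropping the one whose removal harms a quality score the least. This is a generic sketch under stated assumptions. The `score` callable is hypothetical and supplied by the caller (the slides suggest a SOM-based criterion, which is not reproduced here), and the toy score below simply pretends that three named parameters carry all the information.

```python
def backward_elimination(features, score, tolerance=0.0):
    """Greedy backward elimination.

    features  : list of feature names to start from
    score     : callable mapping a list of features to a quality value
                (higher is better); hypothetical, supplied by the caller
    tolerance : largest drop in score accepted when removing a feature
    """
    selected = list(features)
    current = score(selected)
    while len(selected) > 1:
        # Score every subset obtained by dropping exactly one feature.
        trials = [(score([f for f in selected if f != drop]), drop)
                  for drop in selected]
        best_score, drop = max(trials, key=lambda t: t[0])
        if best_score < current - tolerance:
            break  # every removal hurts too much: stop here
        selected.remove(drop)
        current = best_score
    return selected

# Toy score (an assumption, not the criterion used in the talk): count the
# "informative" features present and apply a small penalty per feature kept.
INFORMATIVE = {"r", "g-r", "r50"}

def toy_score(subset):
    return len(INFORMATIVE.intersection(subset)) - 0.01 * len(subset)

kept = backward_elimination(["ra", "dec", "r", "g-r", "r50"], toy_score)
```

With this toy score the positional parameters `ra` and `dec` are discarded first, and the loop stops as soon as removing any remaining feature would lower the score.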
Step 3: experiments to find the optimal architecture
Varying the numbers of inputs, hidden units, training patterns, training epochs, Bayesian cycles and inner loops, etc.
Convergence computed on validation set
Error derived from test set
Robust error: 0.02176
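The division of labour between the three sets in Step 3 can be sketched as follows: candidates are compared on the validation set only, and the test set is kept aside for the final error. This is a simplified illustration. It uses a plain random split, whereas the talk builds representative sets with a SOM, and the validation errors and candidate names below are made-up placeholders, not results from the experiments.

```python
import random

def split_three_way(items, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle items and cut them into training, validation and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

def pick_architecture(validation_errors):
    """Return the candidate with the lowest validation error."""
    return min(validation_errors, key=validation_errors.get)

# Indices of 1000 hypothetical objects with known redshift.
train, valid, test = split_three_way(range(1000))

# Made-up validation errors for three hypothetical hidden-layer sizes.
best = pick_architecture({"10 hidden": 0.031,
                          "20 hidden": 0.024,
                          "40 hidden": 0.027})
```

Only after `best` has been chosen would the error be computed on `test`, so that the quoted figure is not biased by the selection itself.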
Step 4: computation of confusion matrices & flagging out spurious outputs
Unsupervised SOM clustering with a posteriori labeling from test set
• We train the SOM and assign to each neuron a label corresponding to the class (e.g. redshift < 0.5 = class 1, redshift > 0.5 = class 2)
• Then we evaluate the confusion matrix on the test set and use these statistics to evaluate the completeness of the catalog
[Figure panels: 60 nodes, 120 nodes, …]
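The a posteriori labeling and the confusion matrix of Step 4 can be sketched without the SOM itself, starting from the index of the winning neuron for each object. The assumptions are: neurons are labeled by majority vote, one list of objects is used for labeling and another for evaluation, and the neuron indices and classes below are toy values. The `completeness` and `contamination` definitions are the usual catalogue ones and may differ in detail from those used in the talk.

```python
from collections import Counter, defaultdict

def label_nodes(bmu_indices, classes):
    """Give each SOM node the majority class of the labeled objects it wins."""
    votes = defaultdict(Counter)
    for node, cls in zip(bmu_indices, classes):
        votes[node][cls] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in votes.items()}

def confusion_matrix(true_classes, predicted_classes, class_names):
    """matrix[t][p] = number of objects of true class t assigned to class p."""
    matrix = {t: {p: 0 for p in class_names} for t in class_names}
    for t, p in zip(true_classes, predicted_classes):
        matrix[t][p] += 1
    return matrix

def completeness(matrix, cls):
    """Fraction of true members of cls that were recovered as cls."""
    return matrix[cls][cls] / sum(matrix[cls].values())

def contamination(matrix, cls):
    """Fraction of objects assigned to cls that belong to another class."""
    assigned = sum(row[cls] for row in matrix.values())
    return 1.0 - matrix[cls][cls] / assigned

# Toy values: class 1 = redshift < 0.5, class 2 = redshift > 0.5.
node_label = label_nodes([0, 0, 0, 1, 1, 2, 2, 2],
                         [1, 1, 2, 1, 1, 2, 2, 1])

eval_nodes = [0, 1, 2, 2, 0, 1]   # winning node of each evaluation object
eval_true = [1, 1, 2, 1, 2, 1]    # its spectroscopic class
eval_pred = [node_label[n] for n in eval_nodes]
cm = confusion_matrix(eval_true, eval_pred, [1, 2])
```

In this toy case `completeness(cm, 1)` is 0.75 and `contamination(cm, 1)` is 0.25. An evaluation object that lands on a neuron with no label would raise a `KeyError` here; a real pipeline would need to flag such objects instead.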
+ new & deeper training set
ASTROVIRTUAL CATALOGUE
Preliminary results from an application to TNG-LTA
• TNG telemetry continuously monitors a series of parameters (pointing, tracking, actuators of mirrors, etc.)
• Input data: 31 parameters (apparently uncorrelated)
• SOM unsupervised clustering with "a posteriori" labeling
• Quality labels from randomly chosen images obtained during the acquisition of telemetric data
3-D U-Matrix; similarity coloring
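For readers unfamiliar with the U-matrix, a minimal sketch of how one can be computed from trained SOM weights is given below. It assumes a rectangular grid with at least two nodes and 4-connected neighbours, which is only one of several common conventions; the weights are toy values, not the telemetry map shown in the figure.

```python
import math

def u_matrix(weights):
    """U-matrix of a rectangular SOM.

    weights[i][j] is the weight vector of the node at row i, column j.
    Each output cell holds the mean distance to the 4-connected neighbours.
    The grid must contain at least two nodes.
    """
    rows, cols = len(weights), len(weights[0])
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            dists = [math.dist(weights[i][j], weights[i + di][j + dj])
                     for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                     if 0 <= i + di < rows and 0 <= j + dj < cols]
            result[i][j] = sum(dists) / len(dists)
    return result

# Toy 2x3 map: the first two columns share one weight vector, the third differs.
toy_weights = [[[0, 0], [0, 0], [3, 4]],
               [[0, 0], [0, 0], [3, 4]]]
u = u_matrix(toy_weights)
```

High values mark boundaries between groups of similar neurons, which is why plotting the matrix as a 3-D surface shows clusters as valleys separated by ridges.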
Up: good tracking. Below: bad tracking. [figure label: ?]
CONCLUSIONS
• KDD requires strong interaction of domain experts with "true" computer scientists
• Implementation of KDD tools takes a lot of time… in order to be worth the effort, they need to be as general as possible
• They may not be the "solution", but they will certainly help in any classification, pattern recognition or interpolation problem encountered in the usage of large databases
• On a short time scale (ca. 3-5 years) KDD will not affect everyday astronomical work (present-day astronomical work is not based on large DBs) and will be confined to large projects only
• On a longer time scale KDD will become a more widespread tool. Most probably A.I. KDD tools will be hidden behind most DB engines