Classification of GAIA data

Coryn A.L. Bailer-Jones
Max-Planck-Institut für Astronomie, Heidelberg
calj@mpia.de

Overview
• GAIA classification objectives and available data
• Approaches to classification: principles and problems
• Example classification using RVS-like data
• Some specific issues
• Summary

GAIA classification objectives
• discrete classification of objects
  - as star, galaxy, quasar, solar system object, supernova etc.
• determination of astrophysical parameters (APs) for stars
  - Teff, logg, [Fe/H], [α/Fe], CNO, A(λ), Vrot, Vrad, activity
• combination with parallax to determine stellar luminosity, radius, (mass, age)
• identification of unresolved binaries (and parametrization of components where possible)
• efficient identification of new types of objects
Goal: catalogue of object classifications and astrophysical parameters

GAIA data
• BBP: 4+ broad band filters (all objects)
• MBP: 10-20 medium band filters (all objects)
  → object classification; stellar Teff, logg, [Fe/H], A(λ)
• RVS: 849-874 nm spectrum, ~0.04 nm/pixel (G<17)
  → stellar Vrad, Vrot, specific element abundances
• Astrometry
  → parallax, kinematics, unresolved binaries
• Time domain
  → ~50 epochs over 5 years (photometric variability)
→ Inhomogeneous data
“Redshift” problem: to get RV, need correct SpT template, but to determine SpT (may) need to know the wavelength shift
→ use MBP data to give SpT and iterate
Generally: use MBP data to give initial classification of RVS data

Classification principles
“Supervised” approach:
• use pre-classified data (templates) to infer the desired mapping
• apply mapping to any new data to give APs or classes
But the desired mapping is generally degenerate...

Minimum Distance Methods (MDMs)
• Search for nearest neighbours (templates) in data space
• Assign parameters according to these
• Generally interpolate:
  - either in data space: φ = f(d; w)
  - or in parameter space: D = g(φ; w)
• Need to scale data dimensions
• e.g. k-nn, χ² minimization, cross-correlation
• a local classification method
Notation: φ = astrophysical parameter(s); d1, d2 = data; D = distance to a template
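A minimal sketch of one minimum distance method (k-nn with inverse-distance weighting), assuming NumPy. The function name `knn_parameters`, the toy template grid, and the choice of unit-variance scaling are illustrative assumptions and are not taken from the slides.

```python
import numpy as np

def knn_parameters(templates, params, d, k=3):
    """Estimate parameters for data vector d from its k nearest templates."""
    # Scale each data dimension to unit variance so that no dimension dominates
    scale = templates.std(axis=0)
    scale[scale == 0] = 1.0
    t = templates / scale
    x = d / scale
    # Euclidean distance D from d to every template in the scaled data space
    dist = np.sqrt(((t - x) ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:k]
    # Inverse-distance weights; the small constant avoids division by zero
    w = 1.0 / (dist[nearest] + 1e-12)
    return (w[:, None] * params[nearest]).sum(axis=0) / w.sum()

# Hypothetical toy grid: 5 templates, 2 data dimensions, 1 parameter (a Teff-like value)
templates = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]])
params = np.array([[4000.0], [5000.0], [6000.0], [7000.0], [20000.0]])

exact_match = knn_parameters(templates, params, np.array([0.0, 0.0]), k=1)
interpolated = knn_parameters(templates, params, np.array([0.5, 0.5]), k=4)
```

Because only the k closest templates contribute, the estimate is purely local: a distant template with a very different parameter value never enters the average, whether or not it is also consistent with the data.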

Classification principles
• selecting just local neighbours in data space can lead to systematic errors or missed solutions
• need to find global (forward) mapping and identify degenerate regions
• more complex in higher dimensional spaces (data or parameters)
• severity of degeneracy depends upon the density of template grid and noise in the data
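The dependence of degeneracy on noise can be illustrated with a small sketch, assuming NumPy. It searches all templates globally, keeps those whose χ² against the data is below a threshold, and flags the data vector as degenerate if the surviving templates disagree strongly in parameter space. The function names, threshold, tolerance, and toy values are hypothetical.

```python
import numpy as np

def consistent_solutions(templates, params, d, sigma, chi2_max):
    """Return the parameters of every template consistent with d (global search)."""
    chi2 = (((templates - d) / sigma) ** 2).sum(axis=1)
    return params[chi2 <= chi2_max]

def is_degenerate(solutions, tolerance):
    """True if consistent solutions span more than `tolerance` in any parameter."""
    if len(solutions) < 2:
        return False
    spread = solutions.max(axis=0) - solutions.min(axis=0)
    return bool((spread > tolerance).any())

# Hypothetical toy case: templates 0 and 1 have nearly identical data
# but very different parameters
templates = np.array([[1.00, 2.00], [1.02, 1.98], [3.00, 0.50]])
params = np.array([[4000.0], [9000.0], [6000.0]])
d = np.array([1.00, 2.00])

noisy = consistent_solutions(templates, params, d, sigma=0.05, chi2_max=4.0)
clean = consistent_solutions(templates, params, d, sigma=0.005, chi2_max=4.0)
noisy_flag = is_degenerate(noisy, tolerance=500.0)
clean_flag = is_degenerate(clean, tolerance=500.0)
```

With the larger noise both of the first two templates are acceptable, so two widely separated parameter values are returned; with ten times less noise only one survives.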

Artificial Neural Networks (ANNs)
• Functional mapping: astrophysical parameters = f(data; weights)
• Weights determined by training on pre-classified data (templates)
  → least squares minimization of total classification error (numerical methods)
  → global interpolation of data
As with MDM, degeneracy is a problem
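A minimal sketch of the idea, assuming NumPy: a network with one hidden layer whose weights are found by gradient descent on the mean squared error over pre-classified templates. The architecture, learning rate, and synthetic data are assumptions for illustration only and are not the network used for the results in this talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical templates: 200 data vectors of 5 "pixels", each with one known parameter
X = rng.normal(size=(200, 5))
y = np.tanh(X @ np.array([0.5, -0.3, 0.8, 0.1, -0.6]))[:, None]

n_hidden = 8
W1 = rng.normal(scale=0.5, size=(5, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
b2 = np.zeros(1)

def forward(X):
    """parameters = f(data; weights): tanh hidden layer, linear output."""
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2

def mse(X, y):
    return float(((forward(X)[1] - y) ** 2).mean())

initial_error = mse(X, y)
learning_rate = 0.05
for _ in range(2000):
    h, out = forward(X)
    grad_out = 2.0 * (out - y) / len(X)        # d(MSE)/d(output)
    grad_W2 = h.T @ grad_out
    grad_b2 = grad_out.sum(axis=0)
    grad_h = (grad_out @ W2.T) * (1.0 - h ** 2)  # back-propagate through tanh
    grad_W1 = X.T @ grad_h
    grad_b1 = grad_h.sum(axis=0)
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2
final_error = mse(X, y)
```

Unlike the nearest-neighbour approach, every template influences every weight, which is what makes the interpolation global.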

Classification example with high-res spectra
• Database of 611 real stellar spectra from Cenarro et al. (2001)
• variation over Teff, logg, [Fe/H]
• coverage: 849-874 nm (same as GAIA RVS)
• resolution: 0.15 nm @ 0.075 nm/pixel (poorer than GAIA?)
• SNR: median=70; 90% in range 20-140
Randomly split data set into two sets: train a neural network on one set and test its performance on the other.
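The random split can be sketched as follows, assuming NumPy. The set sizes (300 training, 311 test) are those quoted for this data set in the talk; the seed and variable names are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed, for reproducibility only

n_spectra = 611                   # size of the Cenarro et al. database
order = rng.permutation(n_spectra)

train_idx = order[:300]           # indices of spectra used to train the network
test_idx = order[300:]            # remaining 311 spectra, held out for testing
```

Holding the test spectra out of training entirely means the reported errors measure how the network generalizes rather than how well it memorizes the templates.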

Distribution over APs in Cenarro et al. data
[Figure: blue = training data (300); red = test data (311)]

Results: Teff and logg

Results: [Fe/H]

Requirements of the classification scheme
• produce both discrete classification and continuous parametrization (e.g. star vs. quasar, APs of stars)
• recognition of degeneracies in presence of noise (i.e. recognise multiple classifications for given data vector)
• robustly handle missing and censored data
• possible RVS lossy compression (as function of magnitude)
  → handle different amounts/formats of data
• reliable determination of parametrization uncertainties
• accommodate ever-improving stellar models
all this for a very wide range of types of objects ...

Classification schemes
[Figure panels: Hierarchical; Parallel. P = probability; APs = astrophysical parameters]

Model training
Real spectra and synthetic spectra are not identical:
• systematic differences (modelling uncertainties, e.g. opacities)
• increased cosmic scatter in real spectra (unaccounted-for APs)
1. Can synthetic spectra be used to reliably parametrize GAIA data?
2. Are performances representative of what can be achieved?
3. Do synthetic spectra give the best optimization of phot/spec systems?
2+3 require accurate synthetic spectra (or large set of real spectra)
Can overcome mismatch problem for (1):
• use real GAIA data of pre-selected targets to apply corrections to synthetic SEDs
• APs of these targets determined from higher resolution ground-based spectra

Summary
• classification with GAIA data is a challenging problem
• methods used so far in (astronomical) classification literature are suboptimal for this purpose
  → further development of methods is a high priority
• particular problems to overcome are:
  - degeneracy (especially with MBP data and compressed RVS data)
  - inhomogeneous data
• development of classification methods is very dependent on appropriate data (real or synthetic)
  - both of targets of interest
  - and of “contaminating” objects

ICAP: the GAIA classification working group
• WG responsible for addressing classification issues for GAIA
• 14 core members; 17 associate members
GAIA Classification meeting
2-3 December, Heidelberg, Germany
Anyone interested in classification issues broadly related to GAIA is welcome to attend
http://www.mpia.de/GAIA/
