Data provenance in astronomy. Bob Mann, Wide-Field Astronomy Unit, University of Edinburgh (rgm@roe.ac.uk). Outline: Data and databases in astronomy; Case Study: UK Infrared Deep Sky Survey; Conclusions.
Data provenance in astronomy • Bob Mann • Wide-Field Astronomy Unit, University of Edinburgh (rgm@roe.ac.uk)
Outline • Data and databases in astronomy • Case Study: UK Infrared Deep Sky Survey • Conclusions
Astronomers observe across the whole electromagnetic spectrum • Galaxy images look different across spectrum, due to: • Inherent angular resolution of the telescope • Different emission processes
Astronomical data: original form • Different detector technologies used across the spectrum, yielding different types of data: e.g. • Ultraviolet/optical/infrared • Image: array of pixel values • X-ray • Event list: positions, arrival times, energies of all detected photons • Radio • Interferometric visibilities: sparse Fourier transform of a region of the sky
Astronomical data: final form • Most research done using catalogue data • i.e. tables of attributes of detected sources – mainly discrete sources (stars, galaxies, etc) • Data compression • Catalogue – few % of image data volume • Amenable to representation in relational DB • Natural indexing by location in sky • …but original data products (images, spectra, event lists) sometimes needed
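As a minimal sketch of the "relational DB" and "natural indexing by location in sky" points above, the following uses Python's built-in `sqlite3` module. The table name, column names and the three sample rows are hypothetical and are not taken from any real survey schema; the box query also ignores RA wrap-around and the cos(dec) factor, which a real archive would have to handle.

```python
import sqlite3

# In-memory database standing in for a catalogue table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source ("
    "source_id INTEGER PRIMARY KEY, ra REAL, dec REAL, k_flux REAL)"
)
# Index on sky position so that positional queries avoid a full table scan.
conn.execute("CREATE INDEX idx_source_position ON source (dec, ra)")

# Made-up example rows: (source_id, ra in degrees, dec in degrees, flux).
rows = [
    (1, 150.10, 2.20, 12.5),
    (2, 150.12, 2.21, 3.1),
    (3, 210.50, -1.00, 8.7),
]
conn.executemany("INSERT INTO source VALUES (?, ?, ?, ?)", rows)


def sources_in_box(conn, ra_min, ra_max, dec_min, dec_max):
    """Return the ids of sources inside a simple RA/Dec box (degrees)."""
    cur = conn.execute(
        "SELECT source_id FROM source "
        "WHERE dec BETWEEN ? AND ? AND ra BETWEEN ? AND ? "
        "ORDER BY source_id",
        (dec_min, dec_max, ra_min, ra_max),
    )
    return [row[0] for row in cur]


nearby = sources_in_box(conn, 150.0, 150.2, 2.0, 2.5)
print(nearby)
```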
Astronomical databases • Telescope archives • Heterogeneous collections of raw data files from all observations taken • Download data for reduction and analysis • Sky survey archives • Homogeneous data and pipeline reduction • “Science Archive” – do science on DB • Bibliographic archives – scans of journals
Astronomical data processing • Data reduction • Remove instrumental signatures from raw data and produce “science-ready” data • Software packages written for specific instruments • Data analysis • Derive scientific results from science-ready data products – e.g. statistical analyses • Some astro-specific packages/environments – e.g. IRAF • Some use of programming languages • Fortran, C/C++, Python, Java • Some use of commercial packages • e.g. Interactive Data Language (IDL)
Outline • Data and databases in astronomy • Case Study: UKIDSS • Introduction to UKIDSS • Data life-cycle in UKIDSS • Provenance in UKIDSS • Conclusions
UK Infrared Deep Sky Survey • Set of five infrared sky surveys • Covering ~1/6 of the sky • From large/shallow to very small/very deep • See www.ukidss.org • Observations: 2005-2012 using Wide Field Camera (WFCAM) on UK Infrared Telescope (UKIRT) in Hawaii
UKIDSS data life-cycle (1) • Summit of Mauna Kea • Data acquired from 4 WFCAM detectors • Summit pipeline: instrument health • Data written to LTO tape in NDF format • Tapes couriered to Cambridge weekly • Cambridge • Raw data converted from NDF to FITS • Data reduction pipeline run on nightly basis: ~100Gb/night • Remove instrumental signatures, combine images, detect and classify objects, calibrate positions & fluxes
UKIDSS data life-cycle (2) • Edinburgh • Ingest data from Cambridge: catalogues into RDBMS; image metadata into RDBMS; images on disk • Combine data from multiple nights: generate new catalogues from stacked images • Prepare release databases for WFCAM Science Archive (WSA): see http://surveys.roe.ac.uk/wsa • Users worldwide • Extract raw images from Cambridge • Extract image and catalogues in FITS files from Edinburgh • Run queries on catalogues & image metadata in WSA
Provenance in UKIDSS • Why is provenance important in UKIDSS? • What provenance information is recorded? • How will this be used?...and by whom? • …and is this adequate?
Importance of provenance • Much UKIDSS science is rare object search • [Figure axes: ratio of fluxes in H & K bands; ratio of fluxes in J & H bands] • Objects with these colours would be very unusual – and possibly very interesting. Are they real? • Need ability to trace back to reduced image within which object was detected – maybe back to raw image
Structure of a FITS file • Primary Header, followed by Primary Data Array • Extensions: further (Header, Data) pairs • Header: composed of 80-character ASCII records • Data units can be images or tables
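The header layout on this slide can be sketched in a few lines of plain Python. The 80-character record length comes from the slide; the padding to 2880-byte blocks and the closing `END` record come from the FITS standard. The helper names `make_header` and `read_cards` are invented for illustration, and the sketch builds a header in memory rather than reading a real UKIDSS file.

```python
CARD_LEN = 80      # each header record is 80 ASCII characters
BLOCK_LEN = 2880   # FITS headers are padded to a multiple of 2880 bytes


def make_header(cards):
    """Pad each record to 80 characters, append END, pad to a full block."""
    text = "".join(card.ljust(CARD_LEN) for card in list(cards) + ["END"])
    padding = (-len(text)) % BLOCK_LEN
    return (text + " " * padding).encode("ascii")


def read_cards(header_bytes):
    """Split header bytes into 80-character records, stopping at END."""
    cards = []
    for start in range(0, len(header_bytes), CARD_LEN):
        card = header_bytes[start:start + CARD_LEN].decode("ascii")
        if card.rstrip() == "END":
            break
        cards.append(card.rstrip())
    return cards


# A minimal primary header with an empty primary data array (NAXIS = 0)
# and a flag saying that extensions may follow.
cards = [
    "SIMPLE  =                    T / conforms to FITS standard",
    "BITPIX  =                    8",
    "NAXIS   =                    0",
    "EXTEND  =                    T",
]
header = make_header(cards)
print(len(header), read_cards(header))
```

In practice a FITS library would do this parsing; the sketch only shows why a header can be read with nothing more than fixed-width slicing.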
FITS header records • Almost all records of the form KEYWORD = 'value' / COMMENT • Some standard keywords defined, but considerable freedom to define new ones • Relevant metadata for particular instruments • Amongst standard set is HISTORY • Format: HISTORY free text • Provenance information can be stored in a series of HISTORY records
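A simplified sketch of splitting one header record into keyword, value and comment is given below. It assumes the fixed layout from the FITS standard (keyword in columns 1 to 8, `= ` in columns 9 and 10) and treats anything else, such as HISTORY, as free text. The function name `parse_card` and the `FILTER` example record are hypothetical, values are returned as strings without type conversion, and long-string continuation is not handled.

```python
def parse_card(card):
    """Return (keyword, value, comment) for one 80-character header record."""
    keyword = card[:8].strip()
    if card[8:10] != "= ":
        # Commentary records (HISTORY, COMMENT) carry free text, no value.
        return keyword, None, card[8:].strip()

    rest = card[10:]
    stripped = rest.lstrip()
    if stripped.startswith("'"):
        # Quoted string value: scan to the closing quote so that a "/"
        # inside the string is not mistaken for the comment separator.
        chars = []
        i = 1
        while i < len(stripped):
            if stripped[i] == "'":
                if stripped[i + 1:i + 2] == "'":
                    chars.append("'")   # doubled quote is an escaped quote
                    i += 2
                    continue
                break
            chars.append(stripped[i])
            i += 1
        value = "".join(chars).rstrip()
        remainder = stripped[i + 1:]
        comment = remainder.partition("/")[2].strip()
        return keyword, value, comment

    # Unquoted value (number or logical): everything before the first "/".
    value, _, comment = rest.partition("/")
    return keyword, value.strip(), comment.strip()


print(parse_card("FILTER  = 'K       '           / filter name"))
print(parse_card("NAXIS   =                    2 / number of axes"))
print(parse_card("HISTORY 20060615 17:30:02"))
```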
UKIDSS FITS files (1) • Raw image files • Primary header: telescope/instrument set-up, observing conditions, target, observational parameters • Primary data array: empty • Extensions: (header, data) pairs for each of four detectors: header has detector-specific metadata; data is compressed image • Header keywords defined in Interface Control Document between Hawaii & Cambridge
UKIDSS FITS files (2) • Reduced image files • Primary header & data array: metadata propagated from raw data file • Headers of extensions include HISTORY records for data reduction steps run at Cambridge, e.g.
  HISTORY 20060615 17:30:02
  HISTORY $Id: cir_stage1.c,v 1.11 2005/12/15 14:44:04 jim Exp $
  HISTORY 20060615 17:31:04
  HISTORY $Id: cir_qblkmed.c,v 1.9 2005/08/12 14:35:19 jim Exp $
  HISTORY 20060615 17:32:36
  HISTORY $Id: cir_xtalk.c,v 1.5 2005/10/17 14:58:50 jim Exp $
  HISTORY 20060615 20:01:58
  HISTORY $Id: cir_arith.c,v 1.8 2005/02/25 10:14:55 jim Exp $
• Slide annotations mark the What, When and Who in these records
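The HISTORY records quoted on this slide can be turned into structured provenance steps with a short script. The sketch below rests on two assumptions that the slide does not spell out: that each timestamp record belongs to the `$Id$` record that follows it, and that the `$Id$` string follows the usual RCS/CVS layout (file, revision, check-in date and time, author, state). The function name `parse_history` and the dictionary keys are invented for illustration.

```python
import re
from datetime import datetime

# Usual RCS/CVS keyword layout: $Id: file,v revision date time author state $
ID_PATTERN = re.compile(
    r"\$Id: (?P<file>\S+),v (?P<revision>\S+) "
    r"(?P<date>\d{4}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<author>\S+) (?P<state>\S+) \$"
)


def parse_history(records):
    """Pair each timestamp record with the $Id$ record that follows it."""
    steps = []
    run_at = None
    for record in records:
        text = record[len("HISTORY"):].strip()
        match = ID_PATTERN.search(text)
        if match:
            steps.append({
                "run_at": run_at,                     # when (assumed pairing)
                "module": match.group("file"),        # what
                "revision": match.group("revision"),  # which version
                "author": match.group("author"),      # who checked it in
            })
            run_at = None
        else:
            try:
                run_at = datetime.strptime(text, "%Y%m%d %H:%M:%S")
            except ValueError:
                run_at = None  # free text that is not a timestamp
    return steps


records = [
    "HISTORY 20060615 17:30:02",
    "HISTORY $Id: cir_stage1.c,v 1.11 2005/12/15 14:44:04 jim Exp $",
    "HISTORY 20060615 17:31:04",
    "HISTORY $Id: cir_qblkmed.c,v 1.9 2005/08/12 14:35:19 jim Exp $",
    "HISTORY 20060615 17:32:36",
    "HISTORY $Id: cir_xtalk.c,v 1.5 2005/10/17 14:58:50 jim Exp $",
    "HISTORY 20060615 20:01:58",
    "HISTORY $Id: cir_arith.c,v 1.8 2005/02/25 10:14:55 jim Exp $",
]
steps = parse_history(records)
for step in steps:
    print(step["run_at"], step["module"], step["revision"], step["author"])
```

Note that the author field in an `$Id$` string names whoever last checked in the source file, which is not necessarily the person who ran the pipeline.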
UKIDSS FITS files (3) • Catalogue files • Primary header: metadata propagated from raw image • Primary data array: empty • Headers of extensions include metadata for catalogue generation process – invocations of software modules in HISTORY records, with parameter values in separate records • Header keywords for both reduced images and catalogues are defined in an Interface Control Document between Cambridge & Edinburgh
User access to provenance info • All header records from all FITS files ingested into WSA except HISTORY records • So, users can track provenance through queries against WSA, and can get HISTORY records by downloading files • Hopefully enough to determine whether an unusual object is real, but is this good enough?
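To illustrate what "tracking provenance through queries" could look like, here is a minimal sketch using Python's built-in `sqlite3` module. The `image` and `detection` tables, their columns, the file names and the sample rows are all hypothetical; they are not the WSA schema, only a stand-in for the idea that ingested header metadata lets a catalogue entry be traced back to the reduced and raw image files.

```python
import sqlite3

# Hypothetical two-table schema: image metadata and catalogue detections.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE image (
    image_id      INTEGER PRIMARY KEY,
    file_name     TEXT,   -- reduced image file
    raw_file_name TEXT,   -- raw image it was derived from
    filter_name   TEXT
);
CREATE TABLE detection (
    detection_id INTEGER PRIMARY KEY,
    image_id     INTEGER REFERENCES image(image_id),
    flux         REAL
);
""")
# Made-up example rows.
conn.executemany(
    "INSERT INTO image VALUES (?, ?, ?, ?)",
    [
        (1, "reduced_0001.fit", "raw_0001.fit", "J"),
        (2, "reduced_0002.fit", "raw_0002.fit", "K"),
    ],
)
conn.executemany(
    "INSERT INTO detection VALUES (?, ?, ?)",
    [(10, 1, 4.2), (11, 2, 9.9)],
)


def trace_detection(conn, detection_id):
    """Return (reduced file, raw file) for the image a detection came from."""
    return conn.execute(
        "SELECT i.file_name, i.raw_file_name "
        "FROM detection AS d JOIN image AS i ON i.image_id = d.image_id "
        "WHERE d.detection_id = ?",
        (detection_id,),
    ).fetchone()


print(trace_detection(conn, 11))
```

Once the file names are known, the HISTORY records themselves would still have to be read from the downloaded file, since the slide says they are not ingested into the database.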
Recap: Astronomical data processing • Data reduction • Remove instrumental signatures from raw data and produce “science-ready” data • Software packages written for specific instruments • Data analysis • Derive scientific results from science-ready data products – e.g. statistical analyses • Some astro-specific packages/environments – e.g. IRAF • Some use of programming languages • Fortran, C/C++, Python, Java • Some use of commercial packages • e.g. Interactive Data Language (IDL) ?
Provenance in data analysis: Two main problems • Less controlled software environment • Little bits of code written for a specific analysis, not tried and tested pipeline modules • Use of data from many sources • UKIDSS/WSA is state-of-the-art for provenance • Many (esp. older) data resources not so good • Provenance of combined dataset only as good as provenance of worst constituent dataset?
Does this matter? • Provenance information for data analysis is recorded in the journal paper (sort of) • Improving links between online literature and data sources • Increasing importance of large sky surveys with well controlled environments • Moving more of the data analysis from the user’s desktop to the data centre
Conclusions • Modern sky survey systems record & publish extensive provenance for data reduction • Very little provenance recorded from data analysis – except description in journal paper • More could surely be done – but would researchers support overhead of doing so? • Improvements as more analysis in data centre • Could/should we be doing more?