
Data provenance in astronomy

Bob Mann, Wide-Field Astronomy Unit, University of Edinburgh (rgm@roe.ac.uk). Outline: Data and databases in astronomy; Case Study: UK Infrared Deep Sky Survey; Conclusions.



  1. Data provenance in astronomy • Bob Mann • Wide-Field Astronomy Unit, University of Edinburgh (rgm@roe.ac.uk)

  2. Outline • Data and databases in astronomy • Case Study: UK Infrared Deep Sky Survey • Conclusions

  3. Outline • Data and databases in astronomy • Case Study: UK Infrared Deep Sky Survey • Conclusions

  4. Astronomers observe across the whole electromagnetic spectrum • Galaxy images look different across spectrum, due to: • Inherent angular resolution of the telescope • Different emission processes

  5. Astronomical data: original form • Different detector technologies used across the spectrum, yielding different types of data: e.g. • Ultraviolet/optical/infrared • Image: array of pixel values • X-ray • Event list: positions, arrival times, energies of all detected photons • Radio • Interferometric visibilities: sparse Fourier transform of a region of the sky

  6. Astronomical data: final form • Most research done using catalogue data • i.e. tables of attributes of detected sources – mainly discrete sources (stars, galaxies, etc) • Data compression • Catalogue – few % of image data volume • Amenable to representation in relational DB • Natural indexing by location in sky • …but original data products (images, spectra, event lists) sometimes needed

  7. Astronomical databases • Telescope archives • Heterogeneous collections of raw data files from all observations taken • Download data for reduction and analysis • Sky survey archives • Homogeneous data and pipeline reduction • “Science Archive” – do science on DB • Bibliographic archives – scans of journals

  8. Astronomical data processing • Data reduction • Remove instrumental signatures from raw data and produce “science-ready” data • Software packages written for specific instruments • Data analysis • Derive scientific results from science-ready data products – e.g. statistical analyses • Some astro-specific packages/environments – e.g. IRAF • Some use of programming languages • Fortran, C/C++, Python, Java • Some use of commercial packages • e.g. Interactive Data Language (IDL)

  9. Outline • Data and databases in astronomy • Case Study: UKIDSS • Introduction to UKIDSS • Data life-cycle in UKIDSS • Provenance in UKIDSS • Conclusions

  10. UK Infrared Deep Sky Survey • Set of five infrared sky surveys • Covering ~1/6 of the sky • From large/shallow to very small/very deep • See www.ukidss.org • Observations: 2005-2012 using Wide Field Camera (WFCAM) on UK Infrared Telescope (UKIRT) in Hawaii

  11. UKIDSS data life-cycle (1) • Summit of Mauna Kea • Data acquired from 4 WFCAM detectors • Summit pipeline: instrument health • Data written to LTO tape in NDF format • Tapes couriered to Cambridge weekly • Cambridge • Raw data converted from NDF to FITS • Data reduction pipeline run on nightly basis: ~100Gb/night • Remove instrumental signatures, combine images, detect and classify objects, calibrate positions & fluxes

  12. UKIDSS data life-cycle (2) • Edinburgh • Ingest data from Cambridge: catalogues into RDBMS; image metadata into RDBMS; images on disk • Combine data from multiple nights: generate new catalogues from stacked images • Prepare release databases for WFCAM Science Archive (WSA): see http://surveys.roe.ac.uk/wsa • Users worldwide • Extract raw images from Cambridge • Extract images and catalogues in FITS files from Edinburgh • Run queries on catalogues & image metadata in WSA

  13. Provenance in UKIDSS • Why is provenance important in UKIDSS? • What provenance information is recorded? • How will this be used?...and by whom? • …and is this adequate?

  14. Importance of provenance • Much UKIDSS science is rare object search • [Figure: colour-colour plot; axes: ratio of fluxes in J & H bands, ratio of fluxes in H & K bands] • Objects with these colours would be very unusual – and possibly very interesting. Are they real? • Need ability to trace back to reduced image within which object was detected – maybe back to raw image

  15. Structure of a FITS file • Primary header • Primary data array • Extensions: each a (header, data) pair • Header: composed of 80-character ASCII records • Data units can be images or tables

  16. FITS header records • Almost all records of the form KEYWORD = 'value' / COMMENT • Some standard keywords defined, but considerable freedom to define new ones • Relevant metadata for particular instruments • Amongst standard set is HISTORY • Format: HISTORY free text • Provenance information can be stored in a series of HISTORY records
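The record layout described on this slide can be sketched in a few lines of Python. This is a simplified illustration, not a full FITS parser: it assumes fixed-format records (keyword in the first 8 characters, "= " in characters 9-10 for value records), returns every value as a string, and ignores continuation conventions. The function names `parse_card` and `split_cards` are made up for this example.

```python
def parse_card(card):
    """Split one 80-character header record into (keyword, value, comment).

    Simplified sketch: values are returned as strings, and any record
    without "= " in characters 9-10 is treated as free text (e.g. HISTORY).
    """
    card = card.ljust(80)[:80]
    keyword = card[:8].strip()
    if card[8:10] != "= ":
        # Commentary record: free text follows the keyword
        return keyword, card[8:].strip(), ""
    stripped = card[10:].lstrip()
    if stripped.startswith("'"):
        # Quoted string value; a doubled quote '' stands for one quote
        chars = []
        i = 1
        while i < len(stripped):
            ch = stripped[i]
            if ch == "'":
                if stripped[i + 1:i + 2] == "'":
                    chars.append("'")
                    i += 2
                    continue
                break
            chars.append(ch)
            i += 1
        value = "".join(chars).rstrip()
        rest = stripped[i + 1:]
    else:
        # Unquoted value: everything up to the first "/"
        value, sep, rest = stripped.partition("/")
        value = value.strip()
        rest = sep + rest
    comment = rest.partition("/")[2].strip()
    return keyword, value, comment


def split_cards(header_text):
    """Cut a header string into 80-character records, stopping at END."""
    cards = []
    for start in range(0, len(header_text), 80):
        card = header_text[start:start + 80]
        if card[:8].strip() == "END":
            break
        cards.append(card)
    return cards
```

For real work a maintained FITS library is the better choice; the point here is only that both KEYWORD = value records and HISTORY free-text records fit the same fixed-width scheme.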

  17. UKIDSS FITS files (1) • Raw image files • Primary header: telescope/instrument set-up, observing conditions, target, observational parameters • Primary data array: empty • Extensions: (header, data) pairs for each of four detectors: header has detector-specific metadata; data is compressed image • Header keywords defined in Interface Control Document between Hawaii & Cambridge

  18. UKIDSS FITS files (2) • Reduced image files • Primary header & data array: metadata propagated from raw data file • Headers of extensions include HISTORY records for data reduction steps run at Cambridge, recording what was run, when, and by whom, e.g.
      HISTORY 20060615 17:30:02
      HISTORY $Id: cir_stage1.c,v 1.11 2005/12/15 14:44:04 jim Exp $
      HISTORY 20060615 17:31:04
      HISTORY $Id: cir_qblkmed.c,v 1.9 2005/08/12 14:35:19 jim Exp $
      HISTORY 20060615 17:32:36
      HISTORY $Id: cir_xtalk.c,v 1.5 2005/10/17 14:58:50 jim Exp $
      HISTORY 20060615 20:01:58
      HISTORY $Id: cir_arith.c,v 1.8 2005/02/25 10:14:55 jim Exp $
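HISTORY records like those on this slide can be turned into a structured list of processing steps. The sketch below makes two assumptions that the slide does not state explicitly: that each timestamp record belongs to the `$Id$` record that follows it, and that the timestamp is the time the module was run. The names `ID_PATTERN` and `parse_history` are invented for this illustration.

```python
import re
from datetime import datetime

# Matches a version-control $Id$ string of the form shown on the slide:
#   $Id: <file>,v <revision> <date> <time> <author> <state> $
ID_PATTERN = re.compile(
    r"\$Id: (?P<module>\S+),v (?P<revision>\S+) "
    r"(?P<committed>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<author>\S+) (?P<state>\S+) \$"
)


def parse_history(records):
    """Pair each timestamp record with the $Id$ record that follows it.

    `records` is a list of HISTORY texts with the keyword already removed.
    Assumption: a timestamp applies to the next $Id$ record only.
    """
    steps = []
    run_at = None
    for text in records:
        text = text.strip()
        match = ID_PATTERN.search(text)
        if match:
            step = match.groupdict()
            step["run_at"] = run_at  # None if no timestamp preceded it
            steps.append(step)
            run_at = None
        else:
            try:
                run_at = datetime.strptime(text, "%Y%m%d %H:%M:%S")
            except ValueError:
                run_at = None  # unrecognised free text is skipped
    return steps


records = [
    "20060615 17:30:02",
    "$Id: cir_stage1.c,v 1.11 2005/12/15 14:44:04 jim Exp $",
    "20060615 17:31:04",
    "$Id: cir_qblkmed.c,v 1.9 2005/08/12 14:35:19 jim Exp $",
]
for step in parse_history(records):
    print(step["run_at"], step["module"], step["revision"], step["author"])
```

Each resulting step carries the module and revision (what), the run timestamp (when), and the author recorded in the `$Id$` string (who).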

  19. UKIDSS FITS files (3) • Catalogue files • Primary header: metadata propagated from raw image • Primary data array: empty • Headers of extensions include metadata for catalogue generation process – invocations of software modules in HISTORY records, with parameter values in separate records • Header keywords for both reduced images and catalogues are defined in an Interface Control Document between Cambridge & Edinburgh

  20. User access to provenance info • All header records from all FITS files ingested into WSA except HISTORY records • So, users can track provenance through queries against WSA, and can get HISTORY records by downloading files • Hopefully enough to determine whether an unusual object is real, but is this good enough?
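The kind of query this slide has in mind, tracing a catalogue object back to the image files it was detected in, can be illustrated with a toy relational schema. Everything in the sketch below is hypothetical: the table names (`frame`, `detection`), column names, file names, and source IDs are invented for illustration and are not the WSA schema, and an in-memory SQLite database stands in for the archive.

```python
import sqlite3


def build_demo_archive():
    """Create a tiny in-memory stand-in for an archive (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE frame (
            frame_id  INTEGER PRIMARY KEY,
            file_name TEXT,
            band      TEXT
        );
        CREATE TABLE detection (
            detection_id INTEGER PRIMARY KEY,
            source_id    INTEGER,
            frame_id     INTEGER REFERENCES frame(frame_id)
        );
        """
    )
    # Made-up rows: one source detected in two frames, another in one
    db.executemany(
        "INSERT INTO frame VALUES (?, ?, ?)",
        [(1, "demo_J.fit", "J"), (2, "demo_H.fit", "H"), (3, "demo_K.fit", "K")],
    )
    db.executemany(
        "INSERT INTO detection VALUES (?, ?, ?)",
        [(10, 42, 1), (11, 42, 2), (12, 7, 3)],
    )
    return db


def frames_for_source(db, source_id):
    """Return (file_name, band) for every frame in which a source was detected."""
    return db.execute(
        """
        SELECT f.file_name, f.band
        FROM detection AS d
        JOIN frame AS f ON f.frame_id = d.frame_id
        WHERE d.source_id = ?
        ORDER BY f.frame_id
        """,
        (source_id,),
    ).fetchall()


db = build_demo_archive()
print(frames_for_source(db, 42))
```

Once the file names are known, the HISTORY records, which the slide says are not ingested, would still have to be read from the downloaded FITS files themselves.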

  21. Recap: Astronomical data processing • Data reduction • Remove instrumental signatures from raw data and produce “science-ready” data • Software packages written for specific instruments • Data analysis • Derive scientific results from science-ready data products – e.g. statistical analyses • Some astro-specific packages/environments – e.g. IRAF • Some use of programming languages • Fortran, C/C++, Python, Java • Some use of commercial packages • e.g. Interactive Data Language (IDL) ?

  22. Provenance in data analysis: Two main problems • Less controlled software environment • Little bits of code written for a specific analysis, not tried and tested pipeline modules • Use of data from many sources • UKIDSS/WSA is state-of-the-art for provenance • Many (esp. older) data resources not so good • Provenance of combined dataset only as good as provenance of worst constituent dataset?

  23. Does this matter? • Provenance information for data analysis is recorded in the journal paper (sort of) • Improving links between online literature and data sources • Increasing importance of large sky surveys with well controlled environments • Moving more of the data analysis from the user’s desktop to the data centre

  24. Conclusions • Modern sky survey systems record & publish extensive provenance for data reduction • Very little provenance recorded from data analysis – except description in journal paper • More could surely be done – but would researchers support overhead of doing so? • Improvements as more analysis in data centre • Could/should we be doing more?
