210 likes | 360 Views
Introduction to Sky Survey Problems. Bob Mann. Introduction to sky survey database problems. Astronomical data Astronomical databases The Virtual Observatory – concept & status Large sky survey databases Spatial indexing in astronomical databases Case Study: SDSS & SkyServer.
E N D
Introduction to Sky Survey Problems Bob Mann
Introduction to sky survey database problems • Astronomical data • Astronomical databases • The Virtual Observatory – concept & status • Large sky survey databases • Spatial indexing in astronomical databases • Case Study: SDSS & SkyServer
ROSAT ~keV DSS Optical IRAS 25m 2MASS 2m GB 6cm WENSS 92cm NVSS 20cm IRAS 100m Observational Astronomy • Electromagnetic spectrum
Astronomical data – in original form • Optical • Image: array of pixel values • X-ray • Event list: positions, arrival times, energies of all detected photons • Radio • Interferometric visibilities: sparse Fourier transform of a region of the sky Very different types of data
Astronomical data – in final form • Most research done using catalogue data • i.e. tables of attributes of detected sources – mainly discrete sources (stars, galaxies, etc) • Data compression • Catalogue - few% of image data volume • Amenable to representation in relational DB • Natural indexing by location in sky
Astronomical Databases • Sky survey archives • Homogeneous data, standard reduction pipeline • “Science Archive” – do science on DB • Telescope archives • Semi-indexed collections of raw data files from all observations taken – heterogeneous • Download data for reduction and analysis • Specialist data centres – collections of catalogues • Bibliographic databases– scans of major journals
The Virtual Observatory • Concept: • Interoperable federation of all the world’s significant astronomical databases • Facilitate multi-wavelength astronomy • Status: • Several projects underway – AstroGrid in UK • 5+ years’ work to create a fully working VO The VO sets the context for the design of new sky survey databases
AstroGrid: www.astrogrid.org • Consortium: • Edinburgh, Leicester, Cambridge, RAL, MSSL, Jodrell Bank, Queens Belfast • 3 year (~£4M) project: • 1 yr Phase A Study – finished end of 2002 • 2 yr Phase B Implementation – to end 2004 • Web (later Grid) service framework; in Java • Currently building web services, portals, etc - researching OGSA and OGSA-DAI
Large sky survey databases • Major science driver for AstroGrid – and VO • New science – mining multi-wavelength data • Largest are optical/near-infrared sky surveys • Largest of these hosted in Edinburgh: • current - SuperCOSMOS, SDSS (mirror) • future - WFCAM, VISTA • Each yield 1-10TB of catalogue data in RDBMS
Spatial queries in astronomy • Two important types: • Select entries (with predicate) in area of sky • Match entries (esp. between two tables) • Second is special case of first • i.e. both boil down to “point-within-distance-of-point” • but distances in two cases can be very different • Advantage in using a hierarchical spatial indexing scheme • Perform spatial query at appropriate granularity
Spatial Indexingin Astronomy • The Celestial Sphere • Many coordinate systems • Most common is the equatorial system, with Right Ascension and Declination as analogues of Longitude & Latitude
Spatial indexing in astronomical databases • Basic DBMS indexes are 1-D – e.g. B-trees • Some DBMSs support general 2-D indexing • Usually using R-trees (or variants) – rectangles: astronomical experiments not too successful: [Clive] • Some DBMSs have native spatial indexing • Little knowledge of this in astronomy - want to know more But • The Celestial Sphere is a sphere(!) • Many geographical spatial DBs use planar projections • So, astronomers have felt the need to develop spatial indexing prescriptions of their own
Hierarchical Triangular Mesh - HTM • Developed by Sloan survey archive team at JHU • Start with projection of octahedron on sphere and subdivide triangles at their midpoints • Generate unique pixel ID code based on position in the sky and level in hierarchy – can index that with B-tree
Hierarchical Equal Area Iso-Latitude Pixelisation (HEALPix) • Developed by Kris Gorski (now JPL/Caltech) • Start with division of sphere into twelve equal area curvilinear quadrilaterals, then divide each into four • Like HTM, produces a pixel code on which a B-tree index can be made (Ian – HEALPix in Oracle?)
Sky survey DB case study:SkyServer for SDSS • Sloan Digital Sky Survey (SDSS): • first of new generation of sky surveys • US-led team, dedicated telescope & camera • Image half of northern sky in 5 optical bands • Then obtain optical spectra for 1,000,000 galaxies • Estimated ~1TB of catalogue data
SDSS Archive • First of new generation of sky survey archives • Represents the state-of-the-art in sky survey databases • Developed by Alex Szalay’s team at Johns Hopkins • Project started in earnest in about 1996 • OODBMSs seen as the coming thing • SDSS chose Objectivity/DB for their archive: ~15 staff-years of effort later, they’d rewritten much of the DBMS themselves…and then jumped ship and started using MS SQL Server! - SkyServer (in collaboration with Jim Gray, MS Research)
SkyServer design considerations • Power & flexibility to pose arbitrary queries • Simple – astronomers ignorant of SQL! • Hide messy spherical trigonometry • Distance on sphere between (a1,d1) and (a2,d2) is given in SQL by 2.0*asin(sqrt(square(sin(0.5*(radians(d1-d2)))) + cos(radians(d1))*cos(radians(d2))* square(sin(0.5*(radians(a1-a2))))) • Don’t want users typing this • Don’t really want DBMS to evaluate expressions like this often
SkyServer spatial queries • Simple table-valued functions exposed to user: • E.g. select count(*) fromfGetNearbyObjEq(a,d,radius) (a,d)=(Right Ascension, Declination) • Functions call SQL Server Extended Stored Procedure • HTM index manipulation routines, implemented in a Dynamically Linked Library (DLL) • DLL generated from HTM package in C++
Lessons from HTM implementation in SkyServer • SQL is not great for spherical trigonometry • Messy to write, slow to compute • Have to define stored procedures/functions • Expose a clean interface to users • Let them pose queries the way they want to • Replace trig operations by integer arithmetic • Library of HTM index operations underneath • Precompute tables of neighbouring objects • Far fewer spatial match operations at query time
Problems with this approach • How easy to develop stored procedures, etc? • Needs detailed knowledge of DBMS • Extended Stored Procedure calls slow • How well will query optimiser use HTM? • …less well than built-in spatial index?… • …but that might be poorly suited to astronomical applications… • How easy to implement all this in DBMSs other than SQL Server? But this works reasonably well in practice!