110 likes | 194 Views
László Dobos 1,2 , Tamás Budavári 2 , Nolan Li 2 , Alex Szalay 2 , István Csabai 1 dobos@ complex.elte.hu , budavari @ jhu.edu 1 Eötvös Loránd University, Budapest, Hungary Department of Physics of Complex Systems
E N D
László Dobos1,2, Tamás Budavári2, Nolan Li2,Alex Szalay2, István Csabai1 dobos@complex.elte.hu , budavari@jhu.edu 1 Eötvös Loránd University, Budapest, HungaryDepartment of PhysicsofComplex Systems 2 The JohnsHopkins University, Baltimore, USADepartment of Physics & Astronomy 24th SSDBM conference, 25-27 June, 2012 ‒Chania, Crete, Greece SkyQuery: distributedprobabilisticjoininfrastructureforastronomy
Crossmatching • Astronomicalcatalogs • in RDBMS • o(100 million) objects • o(1TB – 10TB) DB size • Donebycoordinates • RA, Dec • Astrometricerror • Differentskycoverage • Differentwavelengthrange • Movingobjects etc.
The multiwavelengthsky infrared (2MASS) visible (DSS) ultraviolet (Galex)
Problemdescription • Alldatain RDBMS • runcomputationinsidethedatabase • usemultiple servers and parallelize • must be transparentforusers • Astronomers„script” whattheydo • multiplere-runs, tweakparameters etc. • huge web forms: no-no • Use SQL toformulatetheproblem • functions and languageextensionstosupportastronomy • extra syntaxtodescribethecoordinate-basedprobabilisticjoin • spatialconstraints: celestialregions
Sample SQL query SELECTs.objId, g.objID, t.objID, s.ra, s.dec, g.ra, g.dec, t.ra, t.dec, x.ra, x.decFROMSDSSDR7:GalaxiesAS sCROSS JOIN Galex:GalaxiesAS g CROSS JOIN TwoMASS:ExtendedSourcesAS tXMATCH BAYESIAN AS xMUST s ONPOINT(s.cx, s.cy, s.cz), 0.1MUST g ONPOINT(g.ra, g.dec), 0.2 MAY t ONPOINT(t.ra, t.dec), 0.5HAVING LIMIT 1e3 REGIONCIRCLE J2000 165.7, 0.3, 60 Standard SQL Probabilisticcrossmatch Spatialconstraint
Howtoextend SQL • SQL is declarative • everythingcan be executedthatcanbeexpressed • extensions must be executableinanycase • Queryoptimization is hard • Design languagewitheasyoptimizationin mind • constrainonthelevel of thegrammar • customclausesinstead of complexwhereclauselogic
DistributedSkyQueryArchitecture GraywulfDatabase Server Cluster Internet Remote VirtualObservatory Data Source SQL queries MyDB SkyQueryWeb Interface Job Scheduler XMATCH query ClusterRegistry SDSS × 2MASS = ?
Modules I. • Registry • completedescription of the server cluster • frommachinegrouptodiskvolumes • containsallinfoforoptimaldatabaseallocation • Management tools • allocate, resize, copy, mirror etc. databases • monitor cluster status • Scheduler • co-locationawarequeryexecution • jobsimplementedasworkflows • .Net WorkflowFoundation (WF4) • parallel executionout-of-the-box • extensivelogging, persistence, retrylogic etc.
Modules II. • SQL parsergenerator • supportsgrammarinheritence • easyto add customextensionstoplain SQL • Metadatatools • Tag SQL scriptswithmetadata • Makeitaccessiblefrom web interface • Extractprovenanceinformationfromuserqueries • User web interface • write and submitqueries • accesstoowndatabase (MyDB) • Jobs • Crossmatchworkflow, etc.
Futurework • Currentsystem: focusonastronomy/crossmatch • Implementspatialconstraints • Extendto a genericframework + API • mirroring, sharding of datasets • querypartitioning • limited distributedjoins • transparentaccesstoremotedatasets • smartcaching of remotedata / queryresults