On-demand associations using database server clusters
László Dobos, Tamás Budavári, Alex Szalay, István Csabai
Eötvös University / JHU
IDIES Inaugural Symposium, Baltimore
Cross-match problem in astronomy
• Astronomical catalogs in the TB range, O(100M) detections per catalog
• Geographically distributed:
  • reliable, lightweight transfer protocol needed
  • should benefit from co-located datasets
• Goals:
  • find the same object in every catalog
  • find drop-outs (requires complete description of footprints)
  • on-demand: do it quickly (< 5 min)
• Matching primarily based on celestial coordinates
  • astrometric error
  • error can vary from object to object
• Additional match criteria: size, color, etc.
Cross-match problem in astronomy
• The math:
  • Bayesian model selection [Budavári & Szalay 2008, "Probabilistic Cross-Identification of Astronomical Sources"]
  • First step: cut on distance
  • Including additional match criteria is easy and natural
  • Tested on simulations [Heinis et al. 2009]
• The problems:
  • one-to-one matching of objects is expensive
  • trigonometric computations
  • IO intensive if the dataset is big: the right subset of data always has to be kept in memory
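As an illustration of the Bayesian step, here is a minimal sketch of the two-catalog Bayes factor in the small-error limit of the cited Budavári & Szalay paper, B = 2 / (σ1² + σ2²) · exp(−ψ² / (2(σ1² + σ2²))), with all angles in radians. The function names are hypothetical, and treating the per-catalog values 0.1 and 0.5 as astrometric errors in arcseconds is an assumption; this is not the engine's actual code.

```python
import math

ARCSEC_TO_RAD = math.pi / (180.0 * 3600.0)


def angular_separation(ra1, dec1, ra2, dec2):
    """Angular separation in radians of two positions given in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return 2.0 * math.asin(min(1.0, math.sqrt(a)))


def bayes_factor(psi, sigma1, sigma2):
    """Two-catalog Bayes factor; separation psi and errors sigma1, sigma2 in radians."""
    s2 = sigma1 ** 2 + sigma2 ** 2
    return 2.0 / s2 * math.exp(-psi ** 2 / (2.0 * s2))


# Assumed astrometric errors: 0.1" and 0.5".
sigma_s = 0.1 * ARCSEC_TO_RAD
sigma_g = 0.5 * ARCSEC_TO_RAD
psi = angular_separation(205.0, 0.0, 205.0001, 0.0)  # about 0.36" apart
is_match = bayes_factor(psi, sigma_s, sigma_g) > 1e3  # threshold as in HAVING x.BF > 1e3
```

Because the Bayes factor falls off monotonically with separation, a threshold on it implies a maximum separation, which is why a plain distance cut can serve as the first step.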
Hardware and data layout
• JHU Graywulf cluster:
  • Dell PowerEdge 2950 + Dell PowerVault MD1000, 2 × PERC 5/E RAID controllers
  • 1.2-1.4 GB/sec nominal IO bandwidth, InfiniBand
  • 2 × 4-core Xeon, 8-32 GB RAM
  • 5-20 machines partially assigned to the cross-match engine
• Catalogs are mirrored on every node
• User catalogs uploaded to / located at a dedicated node
• Remote data sources (via various protocols)
• Queries are partitioned and executed in parallel on every machine
Xmatch definition language
• A cross-match query:
    SELECT s.objId as SobjID, s.ra, s.dec, g.ra, g.dec, j_m
    FROM SDSS:PhotoObjAll AS s
    CROSS JOIN GALEX:PhotoObjAll AS g
    XMATCH BAYESIAN AS x
        MUST s ON Point(s.ra, s.dec), 0.1
        MUST g ON Point(g.ra, g.dec), 0.5
        HAVING x.BF > 1e3
    WHERE s.type = 3
        AND s.ra BETWEEN 200 AND 210 AND s.dec BETWEEN -2 AND 2
        AND g.ra BETWEEN 200 AND 210 AND g.dec BETWEEN -2 AND 2
• A partitioned query:
    SELECT s.ObjID
    FROM SDSS:PhotoObjAll s PARTITION ON Ra
    WHERE Ra BETWEEN 200 AND 210 AND Dec BETWEEN -5 AND 5
Query execution 1
• Parse:
  • proprietary SQL parser written from scratch
  • covers ~80% of SQL Server's SELECT statement grammar
  • extensions can be added easily by changing the BNF grammar
• Job assignment: (to be implemented)
  • determine sets of collocated catalogs using a central registry
  • send part of the cross-match job to a remote service
  • return only the cross-matched result, not full raw datasets
  • merge resultsets at any node
• Partition:
  • cross-match queries: on right ascension
  • simple queries: on the specified column
  • partitions are determined from a histogram: a histogram query is executed on a subsample to get metrics
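The partitioning step can be sketched as follows, assuming the goal is partitions that hold roughly equal numbers of objects. Instead of an explicit binned histogram, this sketch takes quantiles of a sorted subsample, which is equivalent to reading boundaries off a cumulative histogram. The function name and the half-open range convention are assumptions, not the system's actual implementation.

```python
def partition_bounds(sample, n_parts, lo, hi):
    """Split [lo, hi) into n_parts half-open ranges with roughly equal sample counts.

    sample  -- values of the partitioning column (e.g. right ascension) from a subsample
    n_parts -- number of partitions, e.g. one per worker machine
    """
    values = sorted(v for v in sample if lo <= v < hi)
    if not values:
        # No information from the subsample: fall back to equal-width ranges.
        step = (hi - lo) / n_parts
        edges = [lo + i * step for i in range(n_parts)] + [hi]
    else:
        # Cut at the sample quantiles so that each range gets a similar share.
        cuts = [values[(i * len(values)) // n_parts] for i in range(1, n_parts)]
        edges = [lo] + cuts + [hi]
    return list(zip(edges[:-1], edges[1:]))


# Example: a subsample whose density varies strongly with right ascension.
ra_sample = [200.0 + 10.0 * (i / 1000.0) ** 2 for i in range(1000)]
parts = partition_bounds(ra_sample, 4, 200.0, 210.0)
```

Since the ranges are half-open and contiguous, every value of the partitioning column falls into exactly one partition.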
Query execution 2
• Cache:
  • cache remote datasets
  • copy myDB tables to worker nodes
  • can benefit from filters defined in the query
• Execute:
  • construct T-SQL queries
  • execute T-SQL queries on nodes in parallel
  • automatically retry on failure
• Merge:
  • merge resultsets
  • benefit from clever partitioning: no duplicates
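The execute and merge steps can be illustrated with a small sketch. Here `task` is a hypothetical stand-in for running one generated T-SQL query on a worker node; the real system coordinates this through its workflow engine, so the thread pool, the retry count, and all names below are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(task, partition, max_attempts=3):
    """Run task(partition), retrying on any exception up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(partition)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error


def execute_partitions(task, partitions, max_workers=4, max_attempts=3):
    """Execute one task per partition in parallel and merge the resultsets."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(run_with_retry, task, part, max_attempts)
            for part in partitions
        ]
        # Collect in submission order so the merged output is deterministic.
        results = [future.result() for future in futures]
    merged = []
    for rows in results:
        # Partitions are disjoint, so a plain concatenation has no duplicates.
        merged.extend(rows)
    return merged
```

With disjoint partitions the merge is a plain concatenation, so no de-duplication pass is needed.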
Applied technologies
• Relational Database Management System:
  • SQL Server 2008
  • CLR integration with parallel execution support
• Windows Workflow Foundation:
  • coordinates the complex execution workflow
  • transactions help keep the system consistent
  • parallel execution support
• SMO:
  • SQL Server Management Objects
  • easy access to the database schema
Zone algorithm
• Zone algorithm:
  • Pure T-SQL: can leverage the query optimizer of SQL Server
  • Divide the sphere into zones
  • ZoneID: very simple hash on declination
  • Indexes built on ZoneID and right ascension allow very quick pre-filtering of match candidates
  • parallelizes very well on multi-core machines
  • [Gray, Szalay & Nieto-Santisteban 2006, The Zones Algorithm for Finding Points-Near-a-Point or Cross-Matching Spatial Datasets]
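The idea behind the zone algorithm can be sketched outside the database. The production version is pure T-SQL; this in-memory Python version only illustrates the zone hash and the pre-filtering on (ZoneID, right ascension). It assumes ZoneID = floor((dec + 90) / zone height), positions in degrees, no wrap-around at RA 0/360, and objects away from the poles. All function names are hypothetical.

```python
import bisect
import math
from collections import defaultdict


def zone_id(dec, zone_height):
    """Simple hash on declination: index of the horizontal stripe containing dec."""
    return int(math.floor((dec + 90.0) / zone_height))


def separation_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def build_zone_index(catalog, zone_height):
    """Group (obj_id, ra, dec) rows by zone, each zone sorted by right ascension."""
    zones = defaultdict(list)
    for obj_id, ra, dec in catalog:
        zones[zone_id(dec, zone_height)].append((ra, dec, obj_id))
    for rows in zones.values():
        rows.sort()
    return zones


def zone_match(cat_a, cat_b, radius, zone_height):
    """Return (id_a, id_b) pairs closer than radius degrees."""
    index = build_zone_index(cat_b, zone_height)
    pairs = []
    for id_a, ra, dec in cat_a:
        # The RA window widens with declination; cap it to stay finite (assumption).
        ra_win = radius / math.cos(math.radians(min(abs(dec) + radius, 89.0)))
        first = zone_id(dec - radius, zone_height)
        last = zone_id(dec + radius, zone_height)
        for zone in range(first, last + 1):
            rows = index.get(zone, [])
            # Pre-filter: jump to the first row inside the RA window.
            start = bisect.bisect_left(rows, (ra - ra_win,))
            for ra_b, dec_b, id_b in rows[start:]:
                if ra_b > ra + ra_win:
                    break
                # Exact test only on the few surviving candidates.
                if separation_deg(ra, dec, ra_b, dec_b) <= radius:
                    pairs.append((id_a, id_b))
    return pairs
```

The expensive trigonometric distance test runs only on candidates that survive the zone and right-ascension window, which is the same pre-filtering role the indexes play in the database.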
Summary and future work
• On-demand cross-matching is feasible
• Parser and partitioning logic built for handling cross-match job descriptions
• Workflow built for executing partitioned jobs
• New technologies allow rapid development of complex workflows and high-performance data warehouses
• Future work:
  • Develop GUI
  • Install and publish system
  • Add support for remote datasets
  • Add support to benefit from collocated datasets