170 likes | 321 Views
MonetDB/SQL Meets SkyServer: the Challenges of a Scientific Database. Milena Ivanova, Niels Nes, Romulo Goncalves, Martin Kersten CWI, Amsterdam Presented at SSDBM, July 2007, Banff, Canada. 230 million object images 1 million spectra 4TB catalog data 9TB images.
E N D
MonetDB/SQL Meets SkyServer:the Challenges of a Scientific Database Milena Ivanova, Niels Nes, Romulo Goncalves, Martin Kersten CWI, Amsterdam Presented at SSDBM, July 2007, Banff, Canada
230 million object images • 1 million spectra • 4TB catalog data • 9TB images A project to make a map of a large part of the Universe SkyServer provides public access to SDSS for astronomers, students, and wide public
SkyServer Schema Vertical fragment of 100+ popular columns 446 columns >370 million rows Materialized join of Photo and Spectra
Outline • MonetDB/SQL • SkyServer porting lessons • Query log lessons • Evaluation • Outlook
MonetDB Background PhotoObjAll Ra BAT Dec BAT U BAT
select count(*) from photoobjall; SQL XQuery function user.s3_1():void; X1:bat[:oid,:lng] := sql.bind("sys","photoobjall","objid",0); X6:bat[:oid,:lng] := sql.bind("sys","photoobjall","objid",1); X9:bat[:oid,:lng] := sql.bind("sys","photoobjall","objid",2); X13:bat[:oid,:oid] := sql.bind_dbat("sys","photoobjall",1); X8 := algebra.kunion(X1,X6); X11 := algebra.kdifference(X8,X9); X12 := algebra.kunion(X11,X9); X14 := bat.reverse(X13); X15 := algebra.kdifference(X12,X14); X16 := calc.oid(0@0); X18 := algebra.markT(X15,X16); X19 := bat.reverse(X18); X20 := aggr.count(X19); sql.exportValue(1,"sys.","count_","int",32,0,6,X20,""); end s3_1; MonetDB Kernel MonetDB Architecture MAL Tactical Optimizer MAL MonetDB Server
SkyServer with MonetDB Goal: To provide SkyServer mirror with similar functionality using MonetDB Three phases: 1%, 10%, entire SDSS data set Can we • Do better in terms of performance and functionality? • Improve query processing by novel parallelism and query cracking techniques? • Extend functionality to support, e.g. LOFAR?
Portability Lessons • Need for rich SQL environment (PSM) • Cast to SQL:2003 standard • Replacement of data types and operations • Specific extensions ignored or replaced • Avoid data redundancy • Auxiliary tables replaced by views:10% size reduction
Spatial Search Lesson • HTM (Hierarchical Triangular Mesh) • Implemented in C++, C# • Good for point-near-point and point-in-region queries • Zones • Implemented in SQL • Good for point-near-point (x3) • Efficient for batch-oriented spatial join(x32) • Enables SQL optimizer usage
Query Log Lessons • Query logs important for both application and science • Analysed 1.2M queries, August 2006 • Spatial access prevails (83%) • Small core of photo and spectro tables accessed • 64% photo, 44% spectro, 27% both
Common Patterns • Limited number of query patterns • Correlation to web site interface • Most popular query (25%) SELECT top 10 p.objID, p.run, p.rerun, p.camcol, p.field, p.obj, p.type, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z, p.Err_u, p.Err_g, p.Err_r, p.Err_i, p.Err_z FROM fGetNearbyObjEq(195,2.5,3) n, PhotoPrimary p WHERE n.objID = p.objID;
Query coverage on the sky Query coverage on the sky Query Coverage
Spatial Overlap • 24% queries overlap • Mean sequence length of 9.4, max of 6200 • Overlap and equality patterns for script-based interaction • Zoom in/zoom out patterns for manual interaction
Evaluation on 100GB • ‘Color-cut’ for low-z quasars SELECT g, run, rerun, camcol, field, objID, FROM Galaxy WHERE ( ( g <= 22) and (u - g >= -0.27) and (u - g < 0.71) and (g - r >= -0.24) and (g - r < 0.35) and (r - i >= -0.27) and (r - i < 0.57) and (i - z >= -0.35) and (i - z < 0.7) ); • Moving asteroids SELECT objID, sqrt(power(rowv,2) + power(colv,2)) as velocity FROM PhotoObj WHERE power(rowv,2) + power(colv,2) > 50 and rowv >= 0 and colv >= 0;
Status • Staircase to the sky • 1GB: done • 100GB: towards completion • Entire 4TB DR6: in progress • Web site
Inspirations • Self-organization vs. hard-coded zoning • Adaptive segmentation (ICDE’08) • Adaptive replication (EDBT’08) • Results caching and reuse • Workload-driven optimization