E N D
Promised Abstract • Can database technology help manage and mine scientific data? That is the question I have been trying to answer with my astronomy colleagues (especially Alex Szalay.) We have had some success but still face many problems. I will start by describing the odyssey of putting the Sloan Digital Sky Survey online and give some statistics about how it is used and what we are doing now. That segment will end with a discussion of how the integration of SQL with the CLR (common language run time) makes it much easier for us to handle scientific datatypes and spatial access methods. The World-Wide Telescope is an attempt to federate all the astronomy archives of the world. I will briefly describe the architecture of SkyQuery a prototype portal to several archives each of which is a web service. Urls: http://research.microsoft.com/~grayhttp://skyserver.pha.jhu.edu/dr1/http://skyQuery.net
Session CodeYukon Features For SkyServer Database Jim Gray: Microsoft Alex Szalay (and friends): Johns Hopkins Help from: Cathan Cook (personal SkyServer), Maria A. Nieto-Santisteban (image cutout service)
SkyServer Overview (10 min) • 10 minute SkyServer tour • Pixel space http://skyserver.sdss.org/en/ • Record space: • Doc space: Ned • Set space: • Web & Query Logs • Dr1 WebService • You can download (thanks to Cathan Cook ) • Data + Database code: • Website: • Data Mining the SDSS SkyServer DatabaseMSR-TR-2002-01 select top 10 * from weblog..weblog where yy = 2003 and mm=7 and dd =25 order by seq desc select top 10 * from weblog..sqlLog order by theTime Desc http://research.microsoft.com/~gray/SDSS
Cutout Service (10 min)A typical web service • Show it • Show WSDL • Show fixing a bug • Rush through code. • You can download it.Maria A. Nieto-Santisteban did most of this (Alex and I started it) http://research.microsoft.com/~gray/SDSS/personal_skyserver.htm
SkyQuery: http://skyquery.net/ • Distributed Query tool using a set of web services • Fifteen astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England)… • Feasibility study, built in 6 weeks • Tanu Malik (JHU CS grad student) • Tamas Budavari (JHU astro postdoc) • With help from Szalay, Thakar, Gray • Implemented in C# and .NET • Allows queries like: SELECT o.objId, o.r, o.type, t.objId FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t WHERE XMATCH(o,t)<3.5 AND AREA(181.3,-0.76,6.5) AND o.type=3 and (o.I - t.m_j)>2
Each SkyNode publishes Schema Web Service Database Web Service Portal is Plans Query (2 phase) Integrates answers Is itself a web service ImageCutout SkyQuery Portal 2MASS INT SDSS FIRST SkyQuery Structure
Four Database Topics • Sparse tables: column vs row store tag and index tables pivot • Maplist (cross apply) • Bookmark bug • Object Relational has arrived.
Column Store Pyramid • Users see fat base tables (universal relation) • Define popular columns indextag table 10% ~ 100 columns • Make many skinny indices1% ~ 10 columns • Query optimizer picks right plan • Automate definition & use • Fast read, slow insert/update • Data warehouse • Note: prior to Yukon, index had 16 column limit. A bane of my existence. BASE Obese query TAG Fat query Typical Semi-join INDICIES Simple
Examples create tablebase (idbigint, f1intprimary key,f2int, …,f1000int) create index tagonbase (id)include(f1, …, f100) create indexskinnyonbase(f2,…f17) BASE Obese query TAG Typical Semi-join Fat query INDICIES Simple
A Semi-Join Example create tablefat(aint primary key, bint, cint, fatchar(988)) declare@iint, @jint; set@i = 0 again: insertfatvalues(@i, cast(100*rand() as int), cast(100*rand() as int), ' ') set@i = @i + 1; if(@i < 1000000)gotoagain create indexabonfat(a,b) create indexac onfat(a,c) dbcc dropcleanbuffers with no_infomsgs selectcount(*)fromfatwith(index (0))wherec = b -- Table 'fat'. Scan 3, reads 137,230, CPU : 1.3 s, elapsed 31.1s. dbcc dropcleanbuffers with no_infomsgs selectcount(*) fromfatwhereb=c -- Table 'fat'. Scan 2, reads: 3,482 CPU 1.1 s, elapsed: 1.4 s. b=c 137 K IO 31 sec 1GB ab 8MB 8MB ac b=c 3.4K IO 1.4 sec
T Object year color 4PNC4502000 white Moving From Rows to ColumnsPivot & UnPivot What if the table is sparse? LDAP has 7 mandatory and 1,000 optional attributesStore row, col, value create tableFeatures ( objectvarchar , attributevarchar, valuevarchar, primary key ( object, attribute)) select*from(featurespivotvalueonattribute in(year, color) )asTwhereobject = ‘4PNC450’ Features object attribute value ●●●● 4PNC450 year 2000 4PNC450 color white 4PNC450 make Ford 4PNC450 model Taurus ●●●●
p1 f(p1) p2 f(p2) pn f(pn) Maplist Meets SQL – cross apply selectp.*, q.* fromparentaspcross applyf(p.a, p.b, p.c)asq wherep.type = 1 • Your table-valued function F(a,b,c) returns all objects related to a,b,c. • spatial neighbors, • sub-assemblies, • members of a group, • items in a folder,… • Apply this function to each row • Classic drill-downuse outer apply if f() may be null
The Bookmark Bug • SQL is a non-procedural language. • The compiler/optimizer picks the procedurebased on statistics. • If the stats are wrong or missing….Bad things happen.Queries can run VERY slowly. • Strategy 1: allow users to specify plan. • Strategy 2: make the optimizer smarter (and accept hints from the user.)
An Example of the Problem • A query selects some fields of an index and of huge table. • Bookmark plan: • look in index for a subset. • Lookup subset in Fat table. • This is • great if subset << table. • terrible if subset ~ table. • If statistics are wrong, or if predicates not independent,you get the wrong plan. • How to fix the statistics? Huge table Index
A Fix: Let user ask for stats • Create Statistics on View(f1,..,fn) • Then the optimizer has the right dataPicks the right plan.Statistics on Views, C. Galindo-Legaria, M. Josi, F. Waas, M. Wu, VLDB 2003, • Q3: Select count(*) from Galaxy where r < 22 and r_extinction > 0.120Bookmark: 34 M random IO, 520 minutesCreate Statistics on Galaxy(objID ) Scan: 5 M sequential IO 18 minutes • Ultimately this should be automated,but for now,… it’s a step in the right direction.
Object Relational Has Arrived • VMs are moving inside the DB • Yukon includes Common Language Runtime (Oracle & DB2 have similar mechanisms). • So, C++, VB, C# and Java are co-equal with TransactSQL. • You can define classes and methodsSQL will store the instancesAccess them via methods • You can put your analysis code INSIDE the database. • Minimizes data movement.You can’t move petabytes to the clientBut we will soon have petabyte databases. data code data code +code
And.. • Fully-async and synchronous (blocking) callsand multi-concurrent-result sets per connection (transaction) • Queues built in (service broker): • Fire-and forget asynchronous processing • It listens to Port 80 for SOAP calls : TP-lite is back It’s a web service • Notification service and data mining and olap and reporting and xml and xquery and.... ) • But, back to OR.
Some BackgroundTable valued functions • SQL operates on tables. • If you can make tables, you can extend SQL • This is the idea behind OLE/DB create function Evens(@maxVal int) returns @T table (a int) begin while (@maxVal > 0) begin if (@maxVal % 2 = 0) insert @T values(@maxVal) set @maxVal = @maxVal -1 end return end select * from Evens(10) a ----------- 10 8 6 4 2
Using table Valued Functions For Spatial Search • Use function to return likely key ranges. • Use filter predicate to eliminate objects outside the query box. Table valued function returns candidate ranges of some space-filling curve. Select objID From Objects O join fGetRanges( @latitude, @longitude, @radius) R on O.htmID between R.begin and R.end where abs(o.Lat - @latitude) + abs(o.Lon – @longitude) < @radius Filter discards false positives.
The Pre CLR design Transact SQL sp_HTM (20 lines) 469 lines of “glue” looking like: // Get Coordinates param datatype, and param length information of if (srv_paraminfo(pSrvProc, 1, &bType1, &cbMaxLen1, &cbActualLen1, NULL, &fNull1) == FAIL) ErrorExit("srv_paraminfo failed..."); // Is Coordinate param a character string if (bType1 != SRVBIGVARCHAR && bType1 != SRVBIGCHAR && bType1 != SRVVARCHAR && bType1 != SRVCHAR) ErrorExit("Coordinate param should be a string."); // Is Coordinate param non-null if (fNull1 || cbActualLen1 < 1 || cbMaxLen1 <= cbActualLen1) ErrorExit("Coordinate param is null."); // Get pointer to Coordinate param pzCoordinateSpec = (char *) srv_paramdata (pSrvProc, 1); if (pzCoordinateSpec == NULL) ErrorExit("Coordinate param is null."); pzCoordinateSpec[cbActualLen1] = 0; // Get OutputVector datatype, and param length information if (srv_paraminfo(pSrvProc, 2, &bType2, &cbMaxLen2, &cbActualLen2, NULL, &fNull2) == FAIL) ErrorExit("Failed to get type info on HTM Vector param..."); The HTM code body
usingSystem; usingSystem.Data; using System.Data.SqlServer; using System.Data.SqlTypes; using System.Runtime.InteropServices; namespace HTM { public classHTM_wrapper { [DllImport("SQL_HTM.dll")]static extern unsafe void*xp_HTM_Cover_get (byte *str); public static unsafe voidHTM_cover_RS(string input) { // convert the input from Unicode (array of 2 bytes) to an array of bytes (not shown) byte* input;byte* output; // invoke the HTM routine output = (byte *)xp_HTM_Cover_get(input); // Convert the array to a table SqlResultSet outputTable = SqlContext.GetReturnResultSet(); if(output[0] == 'O') { // if Output is “OK” uintc = *(UInt32*)(s + 4); // cast results as dataset Int64 *r = (Int64*)(s + 8); // Int64 r[c-1,2] for(inti = 0; i < c; ++i) { SqlDataRecord newRecord = outputTable.CreateRecord(); newRecord.SetSqlInt64(0, r[0]); newRecord.SetSqlInt64(1, r[1]); r++;r++; outputTable.Insert(newRecord); } } // return outputTable; } } } The “glue” CLR designDiscard 450 lines of UGLY code C# SQL sp_HTM (50 lines) Thanks!!! To Peter Kukol (who wrote this) The HTM code body
The Clean CLR designDiscard all glue codereturn array cast as table CREATE ASSEMBLY HTM_A FROM '\\localhost\HTM\HTM.dll' CREATE FUNCTION HTM_cover( @inputNVARCHAR(100) ) RETURNS@tTABLE ( HTM_ID_STARTBIGINT NOT NULLPRIMARY KEY, HTM_ID_END BIGINT NOT NULL ) AS EXTERNAL NAME HTM_A:HTM_NS.HTM_C::HTM_cover Your/My code goes here usingSystem; usingSystem.Data; usingSystem.Data.Sql; using System.Data.SqlServer; using System.Data.SqlTypes; using System.Runtime.InteropServices; namespace HTM_NS { public classHTM_C { public static Int64[,2]HTM_cover(string input) { // invoke the HTM routine return(Int64[,2]) xp_HTM_Cover(input); // the actual HTM C# or C++ or Java or VB code goes here. } } }
Performance (Beta1) On a 2.2 Ghz Xeon • Call a Transact SQL function 33μs • Call a C# function 50μs • Table valued function 1,580 μs + per row 42 μs • Array (== table) valued function 200 μs + per row 27 μs
The Code CREATE ASSEMBLY ReturnOneA FROM '\\localhost\C:\ReturnOne.dll' GO CREATE FUNCTION ReturnOne_Int( @input INT) RETURNS INT AS EXTERNAL NAME ReturnOneA:ReturnOneNS.ReturnOneC::ReturnOne_Int GO --------------------------------------------- -- time echo an integer declare @i int, @j int, @cpu_seconds float, @null_loop float declare @start datetime, @end datetime set @j = 0 set @i = 10000 set @start = current_Timestamp while(@i > 0) begin set @j = @j + 1 set @i = @i -1 end set @end = current_Timestamp set @null_loop = datediff(ms, @start,@end) / 10.0 set @i = 10000 set @start = current_Timestamp while(@i > 0) begin select @j = dbo.ReturnOne_Int(@i) set @j = @j + 1 set @i = @i -1 end set @end = current_Timestamp set @cpu_seconds = datediff(ms, @start,@end) / 10.0 - @null_loop print 'average cpu time for 1,000 calls to ReturnOne_Int was ' + str(@cpu_seconds,8,2)+ ' micro seconds' Program in DB in different language (Tsql) calling function Function written in C# inside the DB using System; using System.Data; using System.Data.SqlServer; using System.Data.SqlTypes; using System.Runtime.InteropServices; namespace ReturnOneNS { public class ReturnOneC { public static int ReturnOne_Int(intinput) { return input; } } }
What Is the Significance? • No more inside/outside DB dichotomy. • You can put your code near the data. • Indeed, we are letting users put personal databases near the data archive. • This avoids moving large datasets. • Just move questions and answers.
Meta-Message • Trying to fit science data into databases • When it does not fit, something is wrong. • Look for solutions • Many solutions come from OR extensions • Some are fundamental engine changes • More structure in DB • Richer operator sets • Better statistics
© 2002 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
© 2003 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only.MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.