1 / 29

Promised Abstract

Promised Abstract.

fedora
Download Presentation

Promised Abstract

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Promised Abstract • Can database technology help manage and mine scientific data? That is the question I have been trying to answer with my astronomy colleagues (especially Alex Szalay.) We have had some success but still face many problems. I will start by describing the odyssey of putting the Sloan Digital Sky Survey online and give some statistics about how it is used and what we are doing now. That segment will end with a discussion of how the integration of SQL with the CLR (common language run time) makes it much easier for us to handle scientific datatypes and spatial access methods. The World-Wide Telescope is an attempt to federate all the astronomy archives of the world. I will briefly describe the architecture of SkyQuery a prototype portal to several archives each of which is a web service. Urls: http://research.microsoft.com/~grayhttp://skyserver.pha.jhu.edu/dr1/http://skyQuery.net

  2. Session CodeYukon Features For SkyServer Database Jim Gray: Microsoft Alex Szalay (and friends): Johns Hopkins Help from: Cathan Cook (personal SkyServer), Maria A. Nieto-Santisteban (image cutout service)

  3. SkyServer Overview (10 min) • 10 minute SkyServer tour • Pixel space http://skyserver.sdss.org/en/ • Record space: • Doc space: Ned • Set space: • Web & Query Logs • Dr1 WebService • You can download (thanks to Cathan Cook ) • Data + Database code: • Website: • Data Mining the SDSS SkyServer DatabaseMSR-TR-2002-01 select top 10 * from weblog..weblog where yy = 2003 and mm=7 and dd =25 order by seq desc select top 10 * from weblog..sqlLog order by theTime Desc http://research.microsoft.com/~gray/SDSS

  4. Cutout Service (10 min)A typical web service • Show it • Show WSDL • Show fixing a bug • Rush through code. • You can download it.Maria A. Nieto-Santisteban did most of this (Alex and I started it) http://research.microsoft.com/~gray/SDSS/personal_skyserver.htm

  5. SkyQuery: http://skyquery.net/ • Distributed Query tool using a set of web services • Fifteen astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England)… • Feasibility study, built in 6 weeks • Tanu Malik (JHU CS grad student) • Tamas Budavari (JHU astro postdoc) • With help from Szalay, Thakar, Gray • Implemented in C# and .NET • Allows queries like: SELECT o.objId, o.r, o.type, t.objId FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t WHERE XMATCH(o,t)<3.5 AND AREA(181.3,-0.76,6.5) AND o.type=3 and (o.I - t.m_j)>2

  6. Each SkyNode publishes Schema Web Service Database Web Service Portal is Plans Query (2 phase) Integrates answers Is itself a web service ImageCutout SkyQuery Portal 2MASS INT SDSS FIRST SkyQuery Structure

  7. Four Database Topics • Sparse tables: column vs row store tag and index tables pivot • Maplist (cross apply) • Bookmark bug • Object Relational has arrived.

  8. Column Store Pyramid • Users see fat base tables (universal relation) • Define popular columns indextag table 10% ~ 100 columns • Make many skinny indices1% ~ 10 columns • Query optimizer picks right plan • Automate definition & use • Fast read, slow insert/update • Data warehouse • Note: prior to Yukon, index had 16 column limit. A bane of my existence. BASE Obese query TAG Fat query Typical Semi-join INDICIES Simple

  9. Examples create tablebase (idbigint, f1intprimary key,f2int, …,f1000int) create index tagonbase (id)include(f1, …, f100) create indexskinnyonbase(f2,…f17) BASE Obese query TAG Typical Semi-join Fat query INDICIES Simple

  10. A Semi-Join Example create tablefat(aint primary key, bint, cint, fatchar(988)) declare@iint, @jint; set@i = 0 again: insertfatvalues(@i, cast(100*rand() as int), cast(100*rand() as int), ' ') set@i = @i + 1; if(@i < 1000000)gotoagain create indexabonfat(a,b) create indexac onfat(a,c) dbcc dropcleanbuffers with no_infomsgs selectcount(*)fromfatwith(index (0))wherec = b -- Table 'fat'. Scan 3, reads 137,230, CPU : 1.3 s, elapsed 31.1s. dbcc dropcleanbuffers with no_infomsgs selectcount(*) fromfatwhereb=c -- Table 'fat'. Scan 2, reads: 3,482 CPU 1.1 s, elapsed: 1.4 s. b=c 137 K IO 31 sec 1GB ab 8MB 8MB ac b=c 3.4K IO 1.4 sec

  11. T Object year color 4PNC4502000 white Moving From Rows to ColumnsPivot & UnPivot What if the table is sparse? LDAP has 7 mandatory and 1,000 optional attributesStore row, col, value create tableFeatures ( objectvarchar , attributevarchar, valuevarchar, primary key ( object, attribute)) select*from(featurespivotvalueonattribute in(year, color) )asTwhereobject = ‘4PNC450’ Features object attribute value ●●●● 4PNC450 year 2000 4PNC450 color white 4PNC450 make Ford 4PNC450 model Taurus ●●●●

  12. p1 f(p1) p2 f(p2) pn f(pn) Maplist Meets SQL – cross apply selectp.*, q.* fromparentaspcross applyf(p.a, p.b, p.c)asq wherep.type = 1 • Your table-valued function F(a,b,c) returns all objects related to a,b,c. • spatial neighbors, • sub-assemblies, • members of a group, • items in a folder,… • Apply this function to each row • Classic drill-downuse outer apply if f() may be null

  13. The Bookmark Bug • SQL is a non-procedural language. • The compiler/optimizer picks the procedurebased on statistics. • If the stats are wrong or missing….Bad things happen.Queries can run VERY slowly. • Strategy 1: allow users to specify plan. • Strategy 2: make the optimizer smarter (and accept hints from the user.)

  14. An Example of the Problem • A query selects some fields of an index and of huge table. • Bookmark plan: • look in index for a subset. • Lookup subset in Fat table. • This is • great if subset << table. • terrible if subset ~ table. • If statistics are wrong, or if predicates not independent,you get the wrong plan. • How to fix the statistics? Huge table Index

  15. A Fix: Let user ask for stats • Create Statistics on View(f1,..,fn) • Then the optimizer has the right dataPicks the right plan.Statistics on Views, C. Galindo-Legaria, M. Josi, F. Waas, M. Wu, VLDB 2003, • Q3: Select count(*) from Galaxy where r < 22 and r_extinction > 0.120Bookmark: 34 M random IO, 520 minutesCreate Statistics on Galaxy(objID ) Scan: 5 M sequential IO 18 minutes • Ultimately this should be automated,but for now,… it’s a step in the right direction.

  16. Object Relational Has Arrived • VMs are moving inside the DB • Yukon includes Common Language Runtime (Oracle & DB2 have similar mechanisms). • So, C++, VB, C# and Java are co-equal with TransactSQL. • You can define classes and methodsSQL will store the instancesAccess them via methods • You can put your analysis code INSIDE the database. • Minimizes data movement.You can’t move petabytes to the clientBut we will soon have petabyte databases. data code data code +code

  17. And.. • Fully-async and synchronous (blocking) callsand multi-concurrent-result sets per connection (transaction) • Queues built in (service broker): • Fire-and forget asynchronous processing • It listens to Port 80 for SOAP calls : TP-lite is back It’s a web service • Notification service and data mining and olap and reporting and xml and xquery and.... ) • But, back to OR.

  18. Some BackgroundTable valued functions • SQL operates on tables. • If you can make tables, you can extend SQL • This is the idea behind OLE/DB create function Evens(@maxVal int) returns @T table (a int) begin while (@maxVal > 0) begin if (@maxVal % 2 = 0) insert @T values(@maxVal) set @maxVal = @maxVal -1 end return end select * from Evens(10) a ----------- 10 8 6 4 2

  19. Using table Valued Functions For Spatial Search • Use function to return likely key ranges. • Use filter predicate to eliminate objects outside the query box. Table valued function returns candidate ranges of some space-filling curve. Select objID From Objects O join fGetRanges( @latitude, @longitude, @radius) R on O.htmID between R.begin and R.end where abs(o.Lat - @latitude) + abs(o.Lon – @longitude) < @radius Filter discards false positives.

  20. The Pre CLR design Transact SQL sp_HTM (20 lines) 469 lines of “glue” looking like: // Get Coordinates param datatype, and param length information of if (srv_paraminfo(pSrvProc, 1, &bType1, &cbMaxLen1, &cbActualLen1, NULL, &fNull1) == FAIL) ErrorExit("srv_paraminfo failed..."); // Is Coordinate param a character string if (bType1 != SRVBIGVARCHAR && bType1 != SRVBIGCHAR && bType1 != SRVVARCHAR && bType1 != SRVCHAR) ErrorExit("Coordinate param should be a string."); // Is Coordinate param non-null if (fNull1 || cbActualLen1 < 1 || cbMaxLen1 <= cbActualLen1) ErrorExit("Coordinate param is null."); // Get pointer to Coordinate param pzCoordinateSpec = (char *) srv_paramdata (pSrvProc, 1); if (pzCoordinateSpec == NULL) ErrorExit("Coordinate param is null."); pzCoordinateSpec[cbActualLen1] = 0; // Get OutputVector datatype, and param length information if (srv_paraminfo(pSrvProc, 2, &bType2, &cbMaxLen2, &cbActualLen2, NULL, &fNull2) == FAIL) ErrorExit("Failed to get type info on HTM Vector param..."); The HTM code body

  21. usingSystem; usingSystem.Data; using System.Data.SqlServer; using System.Data.SqlTypes; using System.Runtime.InteropServices; namespace HTM { public classHTM_wrapper { [DllImport("SQL_HTM.dll")]static extern unsafe void*xp_HTM_Cover_get (byte *str); public static unsafe voidHTM_cover_RS(string input) { // convert the input from Unicode (array of 2 bytes) to an array of bytes (not shown) byte* input;byte* output; // invoke the HTM routine output = (byte *)xp_HTM_Cover_get(input); // Convert the array to a table SqlResultSet outputTable = SqlContext.GetReturnResultSet(); if(output[0] == 'O') { // if Output is “OK” uintc = *(UInt32*)(s + 4); // cast results as dataset Int64 *r = (Int64*)(s + 8); // Int64 r[c-1,2] for(inti = 0; i < c; ++i) { SqlDataRecord newRecord = outputTable.CreateRecord(); newRecord.SetSqlInt64(0, r[0]); newRecord.SetSqlInt64(1, r[1]); r++;r++; outputTable.Insert(newRecord); } } // return outputTable; } } } The “glue” CLR designDiscard 450 lines of UGLY code C# SQL sp_HTM (50 lines) Thanks!!! To Peter Kukol (who wrote this) The HTM code body

  22. The Clean CLR designDiscard all glue codereturn array cast as table CREATE ASSEMBLY HTM_A FROM '\\localhost\HTM\HTM.dll' CREATE FUNCTION HTM_cover( @inputNVARCHAR(100) ) RETURNS@tTABLE ( HTM_ID_STARTBIGINT NOT NULLPRIMARY KEY, HTM_ID_END BIGINT NOT NULL ) AS EXTERNAL NAME HTM_A:HTM_NS.HTM_C::HTM_cover Your/My code goes here usingSystem; usingSystem.Data; usingSystem.Data.Sql; using System.Data.SqlServer; using System.Data.SqlTypes; using System.Runtime.InteropServices; namespace HTM_NS { public classHTM_C { public static Int64[,2]HTM_cover(string input) { // invoke the HTM routine return(Int64[,2]) xp_HTM_Cover(input); // the actual HTM C# or C++ or Java or VB code goes here. } } }

  23. Performance (Beta1) On a 2.2 Ghz Xeon • Call a Transact SQL function 33μs • Call a C# function 50μs • Table valued function 1,580 μs + per row 42 μs • Array (== table) valued function 200 μs + per row 27 μs

  24. The Code CREATE ASSEMBLY ReturnOneA FROM '\\localhost\C:\ReturnOne.dll' GO CREATE FUNCTION ReturnOne_Int( @input INT) RETURNS INT AS EXTERNAL NAME ReturnOneA:ReturnOneNS.ReturnOneC::ReturnOne_Int GO --------------------------------------------- -- time echo an integer declare @i int, @j int, @cpu_seconds float, @null_loop float declare @start datetime, @end datetime set @j = 0 set @i = 10000 set @start = current_Timestamp while(@i > 0) begin set @j = @j + 1 set @i = @i -1 end set @end = current_Timestamp set @null_loop = datediff(ms, @start,@end) / 10.0 set @i = 10000 set @start = current_Timestamp while(@i > 0) begin select @j = dbo.ReturnOne_Int(@i) set @j = @j + 1 set @i = @i -1 end set @end = current_Timestamp set @cpu_seconds = datediff(ms, @start,@end) / 10.0 - @null_loop print 'average cpu time for 1,000 calls to ReturnOne_Int was ' + str(@cpu_seconds,8,2)+ ' micro seconds' Program in DB in different language (Tsql) calling function Function written in C# inside the DB using System; using System.Data; using System.Data.SqlServer; using System.Data.SqlTypes; using System.Runtime.InteropServices; namespace ReturnOneNS { public class ReturnOneC { public static int ReturnOne_Int(intinput) { return input; } } }

  25. What Is the Significance? • No more inside/outside DB dichotomy. • You can put your code near the data. • Indeed, we are letting users put personal databases near the data archive. • This avoids moving large datasets. • Just move questions and answers.

  26. Meta-Message • Trying to fit science data into databases • When it does not fit, something is wrong. • Look for solutions • Many solutions come from OR extensions • Some are fundamental engine changes • More structure in DB • Richer operator sets • Better statistics

  27. © 2002 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.

  28. evaluations

  29. © 2003 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only.MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.

More Related