290 likes | 301 Views
This presentation discusses various topics related to database activities and trends, including scalability projects, DB projects, eScience projects, and more. It covers topics such as geospatial data, disk and network performance, GPU programming, finite element analysis, and portable PubMed Central. The presentation will also include a Q&A session.
E N D
Database Activities and TrendsJim GrayMicrosoft Research2 June 2006, Microsoft, TechNet, LondonOutline • What I have been doing 15 minutes • BIG Changes in DBs 15 minutes • Q&A 20 minutes
Scalability Projects • TerraServer: Geospatial data online • Now part of Virtual Earth http://local.live.com/ • SATA disk evaluation • Copy 1.5 Petabytes (count types of errors) MSR-TR-2005-166 • Disk and Network performancemove 1GB/s from CERN to Pasadena MSR-TR-2004-62 • Bricks • BI-Bricks: cheap boxes/disks for BI • Server Bricks: TerraServer Bricks: MSR-TR-2004-107
DB Projects • Spatial data access inside SQL • Gives a good example of using CLR to extend SQL • Sample is part of SQL 2005 programming samples. • Many papers …, MSR-TR-2005-122, MSR-TR-2006-52 • To Blob or NOT to Blob? • Explored what is the break-even point of Blobs vs Files. Guess what! Almost all files should be blobs. MSR-TR-2006-45 • GPU TeraSort: • You have been hearing about Many-Core from Intel • Nvidia & ATI give you 100 cores today (2x next year) 10x the operations per second than the CPU 10x the memory bandwidth of the CPU • How to program them? • Sort represents IO, memory, processing. • GPU TerraSort demos this MSR-TR-2005-183 • Accelerator: C# extension is a GPU compiler. MSR-TR-2005-184 Not me, but very cool!
eScienceProjects • SkyServer: Astronomy data online http://skyserver.sdss.org/ • A real Data Grid app – Web services are popular • SkyQuery and CasJobs use web services. http://casjobs.sdss.org/CasJobs/ • Spatial access built as SQL 2005 C# extensions. • Doing Finite Element Analysis with a DB and Vis toolsSupporting Finite Element Analysis with a Relational Database Backend; Part I: There is Life beyond Files MSR-TR-2005-49 • Ecological sensors (soil, water, ocean,…) • Only public thing so far: http://lifeunderyourfeet.org/ • Many papers coming • Starting BioInfo efforts (Portable PubMed Central, ….)
Portable PubMedCentral • “Information at your fingertips” • Helping build PortablePubMedCentral • Deployed US, China, England, Italy, South Africa, (Japan soon). • Each site can accept documents • Archives replicated • Federate thru web services • Working to integrate Word/Excel/…with PubmedCentral – e.g. WordML, XSD, • To be clear: NCBI is doing 99% of the work,but it is very cool and very significant.
Outline • What I have been doing 15 minutes • BIG Changes in DBs 15 minutes • Q&A 20 minutes
Added: • +Text, Time, Space • + Cubes, Data mining • + XML, XQuery • + Programming Languages • + Triggers and queues • + Replication, Pub/sub • + Extract-Transform-Load • + Many more extensions coming Procedures Queues XML Replication ETL Text Cubes Data Mine Time Space Notification … sets sets utilities utilities records records os os DB System Architecture • The classic DBMS model Worked, but applications wanted to query other data types A Mess?
DataSet sets utilities records os DB Systems evolved to be containers for information servicesdevelop, deploy, and execution environment • Classic ++ • + Programming Languages • + Triggers and queues • + Replication, Pub/sub • + Extract-Transform-Load • + Text, Time, Space • + Cubes, Data mining • + XML, XQuery • + Many more extensions coming • DBMS is an ecosystemOO is the key structuring strategy: • Everything is a class • Database is a complex object • Core object is DataSet • Classes publish/consume them • Depends on strong Object Model
The Object-Relational Worldmarry programming languages and DBMSs • Stored procedures evolve to “real” languagesJava, C#,.. With real object models. • Data encapsulated: a class with methods • Classes may be persistent • Tables are enumerable & index-ablerecord sets with foreign keys • Records are vectors of objects • Opaque or transparent types • Set operators on transparent classes • Transactions: • Preserve invariants • A composition strategy • An exception strategy • Ends Inside-DB Outside-DB dichotomy Business Objects
Tables or Text or cube Or….. Question Dataset Entity Set in ADO.NET 3.0 Ask not“How to add objects to databases?”, Ask“What kind of object is a database?” Q: Given an object model, what is a DB? A: DataSet class and methods(nested relation with metadata) This is the basis for the ecosystem Distributed DB Extensible DB Interoperable DB …. This was implicit in ODBCbut is now explicit within the DBMS ecosystem Input: Command (any language) Output: Dataset
Queues & WorkflowsSODA (Service Oriented Data Architecture)Service Oriented Database Architecture: App Server-Lite? MSR TR 2005 129 • Apps are loosely connected via Queued messages • Queues are databases. • Basis for workflow • Queues: the first class to add to an OR DBMS • Queues fire triggers.Active databases • Synergy with DBMSsecurity, naming, persistence, types, query,… Workflow: Script Execute Administer & Expedite all built on queues
Text, Temporal, and Spatial Data Access select Title, Abstract, Rank from Books join FreeTextTable(Title, Abstract, ‘XML semistructured') T on BookID = T.Key • Q: What comes after queues? A: Basic types: text, time, space,… • Great application of OR technology • Key idea: table valued functions == indicesAn index is a table, organized differentlyQuery executor uses index to map: Key → set (aka sequence of rows) • Table valued function can do this mapOptimizer can use it. • +extras: cost function, cardinality,… • BIG DEAL: Approximate answers: Rank and Support select galaxy, distance fromGetNearbyObjEQ(22,37) select store, holiday, sum(sales) from Sales join HolidayDates(2004) T on Sales.day = T.day group by store, holiday
Tables or Text or cube Or….. Question Dataset What’s new here? • DBMS have tight-integration withlanguage classes (Java, C#, VB,.. ) • The DB is a class • You can add classes to DB. • Adding indices is “easy”If you have a new idea. • Now have solid Queue systemsAdding workflow is “easy”If you have a new idea. • This is a vehicle for publishing data on the Web.
1990 FORD CHEVY 1991 1992 1993 RED WHITE BLUE Cubes SELECT <axis_spec> FROM <cube_spec> WHERE <slicer_spec> • Data cubes now standard • MDX is very powerful(Multi-Dimensional eXpressions) • Dimension, Measure, Operator concepts highly evolved beyond snowflake schema • Cube stores cohabit with row storesROLAP + MOLAP + (x xOLAP)(relational +multidimensional online analytic processing) • Very sophisticated algorithms • A big part of the ecosystem
Semi-Structured Data • “Everyone starts with the same schema: <stuff/>.”Then they refine it.” J. Widom • We are a “strong schema” community • That has pros-and-cons. • Files <stuff/> and XML <<foo/> <bar/>>are here to stay. Get over it! • File directories are becoming databases; • Pivot on any attribute • Folders are standing queries. • Freetext+schema search (better precision/recall) • XSD (xml schema) and xQuery are transitional;But we have to do them to get to the real answer. • Cohabit with row-stores. • Challenge: figure out what comes after XSD+xQuery
Data Mining and Machine Learning • Tasks: classification, association, prediction • Tools: Decision trees, Bayes, A Priori, clustering, regression, Neural net,… • now unified with DBs • Create table T (x,y,z,u,v,w)Learn “x,y,z” from “u,v,w” using <algorithm> • Train T with data. • Then can ask: • Probability x,y,z,u,v,w • What are the u,v,w probabilities given x,y,z • Example: Learn height from age. • Anyone with a data mining algorithm hasfull access to the DBMS infrastructure. • Challenge: Better learning algorithms.
DM – DB Synergy Create the model: CREATE MINING MODEL HeightFromAgeSex ( ID long key, Gender text discrete, Age long continuous, Height long continuous PREDICT) USING Decision_Trees Train a data mining model: INSERT INTO Height SELECT ID, Gender, Age, Height FROM People Predict height from model: SELECT height, PredictProbability(height) FROM Height PREDICTION JOIN New ON New.Gender = Height.Gender AND New.Age = Height.Age learn height from Gender + Age DB verbs to drive Modeler Probabilistic Reasoning
Q? facts A! Q Q fact, fact, fact… Q Q Q Q Q Notification Notification,Stream Processing, and Sensor Processing • Traditionally: Query billions of facts • Streams:millions of queries one new fact • New protein compare to all DNA • Change in price or time • Implications • New aggregation operators (extension) • New programming style • Streams in products: • Queries represented as records • New query optimizations. • Sensor networks • push queries out to sensors. • Simpler programming model • Optimizes power & bandwidth
DataSet sets sets utilities utilities records records os os Restatement: DB Systems evolved to becontainers for information servicesdevelop, deploy, and execution environment • DBMS is an ecosystemKey structuring strategy: • Everything is a class • Database is a complex object • Core object is DataSet • The architecture lets you add your new ideas.
Language + DB Integration(the Microsoft contribution) • LINQ are a BIG deal (SQL and XML) http://msdn.microsoft.com/data • Entity Sets are next step in Data SetsADO.NET V3 automates entities
Relational Engine Customer SalesPerson Rows Connection Command DataReader Data access today Explicit DB connections void EmpsByDate(DateTime date) { using( SqlConnection con = new SqlConnection( Settings.Default.AdventureWorksSQL)) { con.Open(); SqlCommand cmd = con.CreateCommand(); cmd.CommandText = @" SELECT SalesPersonID, FirstName, HireDate FROM SalesPerson sp INNER JOIN Employee e ON sp.SalesPersonID = e.EmployeeID INNER JOIN Contact c ON e.EmployeeID = c.ContactID WHERE e.HireDate < @date"; cmd.Parameters.AddWithValue("@date", date); DbDataReader r = cmd.ExecuteReader(); while(r.Read()) { Console.WriteLine( "{0:d}:\t{1}", r["HireDate"], r["FirstName"]); } } Opaque command text Entities ≠ Rows Untyped resultsets
SalesData Order Domain Objects ObjectContext Objects Query<T> Relational Engine Customer SalesPerson MapConnection MapCommand Entities MapDataReader Connection Command Rows DataReader Data access tomorrow publicpartialclassAdventureWorksDB : System.Data.Objects.ObjectContext { publicSystem.Data.Objects. Query<SalesOrder> SalesOrders { … } publicSystem.Data.Objects. Query<SalesPerson> SalesPeople { … } } void EmpsByDate(DateTime date) { using (AdventureWorksDBaw = newAdventureWorksDB()){ varpeople =frompin aw.SalesPeople where p.HireDate < date select p; foreach (SalesPerson pinpeople) { Console.WriteLine("{0:d}\t{1}", p.HireDate, p.FirstName); } } Auto-Gen classes No explicit connections Strongly typed commands Strongly typed results
Outline • What I have been doing 15 minutes • BIG Changes in DBs 15 minutes • Q&A 20 minutes