1 / 19

Using SQL to Build User-Defined Aggregates and Extenders for O-R Systems

Using SQL to Build User-Defined Aggregates and Extenders for O-R Systems. Haixun Wang, Carlo Zaniolo drwang@us.ibm.com , zaniolo@cs.ucla.edu Computer Science Dept. University of California, Los Angeles. State of the Art. DBMSs struggling to keep up with new applications

elsa
Download Presentation

Using SQL to Build User-Defined Aggregates and Extenders for O-R Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Using SQL to Build User-Defined Aggregates and Extenders for O-R Systems Haixun Wang, Carlo Zaniolo drwang@us.ibm.com,zaniolo@cs.ucla.edu Computer Science Dept. University of California, Los Angeles

  2. State of the Art • DBMSs struggling to keep up with new applications • multimedia, time series, spatial/temporal DB • OLAP, data mining • New language constructs (based on aggregates) • Grouping Sets, Rollup, Cube, OLAP Functions, … • UDFs: the main extension mechanism • Data Blades for time series, spatial, XML, etc. • Very Hard to use • No aggregate functions supported

  3. User-Defined Aggregates (UDAs) • Not in SQL99—though in earlier SQL3 drafts, and supported in Informix (and others) • SQL3 UDAs suffer from serious limitations and ease-of-use problems • Claim: we have a great solution for the UDA problem • Our UDAs are better than UDFs for extending DBs • AXL – a system to make it easy to define UDAs • AXL – can be used to do Data Mining in SQL

  4. UDAs in SQL3 AGGREGATE FUNCTION myavg(val NUMBER) RETURN NUMBER STATE state INITIALIZE myavg_init ITERATE myavg_iterate TERMINATE myavg_return • INITIALIZE: gives an initial value to the aggregate • ITERATE : computes the intermediate value for each new record • TERMINATE: returns the final value computed for the aggregate • myavg_init, myavg_iterate, myavg_return are 3 functions that the user must write in a procedural programming language • On line aggregation: this would require another function for early returns

  5. Limitation of SQL3 UDAs • Cannot define on-line aggregation or rollups • Aggregation as a function from a set (or multiset) to a single value • Aggregates can not be used inside recursion (the nonmonotonicity curse) • Ease of use is a major issue

  6. Ease of Use • THE PROBLEM: UDFs are very hard to write and debug. In “unfenced mode” they jeopardize the integrity of the system. UDAs defined using several UDFs are prone to the same problem. • ASOLUTION: Use a high-level language for defining UDAs. But who wants a new DB language? • THE IDEAL SOLUTION: Use SQL to define new aggregates. Substantial benefits: • Users are already familiar with SQL • No impedance mismatch of data types and programming paradigms • DB advantages: scalability, data independence, optimizability, parallelizability • But how far can we take SQL?

  7. AXL by Example: average AGGREGATE avg(value INT) : REAL { TABLE state(sum INT, cnt INT); INITIALIZE:{ INSERT INTO state(value, 1); } ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt+1; } TERMINATE: { INSERT INTO RETURN SELECT sum/cnt FROM state; } }

  8. Second Example Show the average salary of senior managers who make 3 times more than the average employees. • SQL: SELECTavg(salary) FROMemployee WHEREtitle = ‘senior manager’ AND salary > 3 * (SELECTavg(salary)FROMemployee) Two scans of the employee table required • With AXL UDAs: SELECTsscan(title, salary) FROMemployee

  9. AXL: Using a Single Scan AGGREGATEsscan(titleCHAR(20), salaryINT) : REAL { TABLEstate(sumINT, cntINT) ASVALUES(0,0); TABLEseniors(salaryINT); INITIALIZE: ITERATE: { UPDATE stateSETsum=sum+salary, cnt=cnt+1; INSERT INTOseniors VALUES(salary)WHEREtitle = ‘senior manager’; } TERMINATE: { INSERT INTO RETURN SELECTavg(s.salary)FROMseniors ASs WHEREs.salary > 3 * (SELECTsum/cntFROMstate); } }

  10. Early Returns • AVG normally converges early: an early approximation is all is needed in several applications • Online aggregation means that early returns are produced during the computation • Early returns are useful in many other computations: for instance to find the local max and min in a sequence of values -- and various temporal aggregates

  11. Return avg for Every 100 Records AGGREGATEolavg(value INT): REAL { TABLE state(sum INT, cnt INT); INITIALIZE: { INSERT INTO state VALUES (value,1); } ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt+1; INSERT INTO RETURN SELECT sum/cnt FROM state WHERE cnt MOD 100 = 0; } TERMINATE: { INSERT INTO RETURN SELECT sum/cnt FROM state; } }

  12. Temporal Coalescing AGGREGATE coalesce(from TIME, to TIME): (start TIME, end TIME) { TABLE state(cFrom TIME, cTo TIME); INITIALIZE: { INSERT INTO state VALUES (from, to); } ITERATE: { UPDATE state SET cTo = to WHERE cTo >= from AND cTo < to; INSERT INTO RETURN SELECT cFrom, cTo FROM state WHERE cTo < from; UPDATE state SET cFrom = from, cTo = to WHERE cTo < from; } TERMINATE: { INSERT INTO RETURN SELECT cFrom, cTo FROM state; } }

  13. Recursive Aggregates • In AXL, aggregates can call other aggregates. Particularly, an aggregate can call itself recursively. AGGREGATE alldesc(P CHAR(20)): CHAR(20) { INITIALIZE: ITERATE: { INSERT INTO RETURN VALUES(P); INSERT INTO RETURN SELECT alldesc(Child) FROM children WHERE Parent = P; } } Find all the descendents of Tom: SELECT alldesc(Child) FROM children WHERE Parent = ‘Tom’;

  14. AXL: Where SQL and Data Mining Intersects • Loosely-coupled: • Cache Mining: current data mining applications have a loose connection with databases • Tightly-coupled: • UDFs based data mining functions • Ideal: • AXL powered by recursive aggregates

  15. Decision Tree Classifiers Training set: tennis Stream of Column/Value Pairs (together with RecId and Category)

  16. Convert training set to column/value pairs AGGREGATE dissemble(v1 INT, v2 INT, v3 INT, v4 INT, yorn INT) : (col INT, val INT, YorN INT) { INITIALIZE: ITERATE: { INSERT INTO RETURN VALUES(1, v1, yorn), (2,v2,yorn), (3,v3,yorn), (4,v4,yorn);} } CREATE VIEW col-val-pairs(recId INT, col INT, val INT, YorN INT) SELECT mcount(), dissemble(Outlook, Temp, Humidity, Wind, PlayTennis) FROM tennis; SELECT sprint(recId, col, val, YorN) FROM col-val-pairs;

  17. SPRINT Algorithm in AXL [ 1] AGGREGATE sprint(iNode INT, iRec INT, iCol INT, iValue REAL, iYorN INT) [ 2] { TABLE treenodes(Rec INT, Col INT, Val REAL, YorN INT, KEY(Col, Value)); [ 3]TABLE summary(Col INT, SplitGini REAL, SplitVal REAL, Yc INT, Nc INT); [ 4] TABLE split(Rec INT, LeftOrRight INT, KEY (RecId)); [ 5] TABLE mincol(Col INT, Val REAL, Gini REAL); [ 6] TABLE node(Node INT) ASVALUES(iNode); [ 7] INITIALIZE: ITERATE: { [ 8] INSERT INTO treenodes VALUES (iRec, iCol, iValue, iYorN); [ 9]UPDATE summary [10]SET Yc=Yc+iYorN, Nc=Nc+1-iYorN, (SplitGini, SplitVal) = giniudf(Yc, Nc, N, SplitGini, SplitVal) [11]WHERE Col=iCol; [12] } [13] TERMINATE: { [14]INSERT INTO mincol SELECT minpointvalue(Col, SplitGini, SplitVal) FROM summary; [15] INSERT INTO result SELECT n.Node, m.Col, m.Value FROM mincol AS m, node AS n; [16] INSERT INTO split SELECT t.Rec, (t.Value>m.Value) FROM treenodes AS t, mincol AS m [17]WHERE t.Col = m.Col AND m.Gini > 0; [18]SELECT sprint(n.Node*2+s.LeftOrRight, t.Rec, t.Col, t.Val, t.YorN) [19] FROM treenodes AS t, split AS s, node AS n WHERE t.Rec = s.Rec [20]GROUP BY s.LeftOrRight; [21] } [22]}

  18. Performance SPRINT Algorithm: AXL vs. C Categorical Classifier: AXL vs. C

  19. Implementation of AXL • AXL V1.2: approaching 40,000 lines of code • AXL compiler translates AXL programs into C++ code • DB2 add-on: emulate UDAs using UDFs • Standalone: running under both Win98/NT and UNIX • Open interface of physical data model. • Currently using Berkeley DB as our storage manager • In memory tables • Limited Optimization • Using B+-Tree indexes to support equality/range query • Predicate push-down / push-up

More Related