190 likes | 290 Views
Using SQL to Build User-Defined Aggregates and Extenders for O-R Systems. Haixun Wang, Carlo Zaniolo drwang@us.ibm.com , zaniolo@cs.ucla.edu Computer Science Dept. University of California, Los Angeles. State of the Art. DBMSs struggling to keep up with new applications
E N D
Using SQL to Build User-Defined Aggregates and Extenders for O-R Systems Haixun Wang, Carlo Zaniolo drwang@us.ibm.com,zaniolo@cs.ucla.edu Computer Science Dept. University of California, Los Angeles
State of the Art • DBMSs struggling to keep up with new applications • multimedia, time series, spatial/temporal DB • OLAP, data mining • New language constructs (based on aggregates) • Grouping Sets, Rollup, Cube, OLAP Functions, … • UDFs: the main extension mechanism • Data Blades for time series, spatial, XML, etc. • Very Hard to use • No aggregate functions supported
User-Defined Aggregates (UDAs) • Not in SQL99—though in earlier SQL3 drafts, and supported in Informix (and others) • SQL3 UDAs suffer from serious limitations and ease-of-use problems • Claim: we have a great solution for the UDA problem • Our UDAs are better than UDFs for extending DBs • AXL – a system to make it easy to define UDAs • AXL – can be used to do Data Mining in SQL
UDAs in SQL3 AGGREGATE FUNCTION myavg(val NUMBER) RETURN NUMBER STATE state INITIALIZE myavg_init ITERATE myavg_iterate TERMINATE myavg_return • INITIALIZE: gives an initial value to the aggregate • ITERATE : computes the intermediate value for each new record • TERMINATE: returns the final value computed for the aggregate • myavg_init, myavg_iterate, myavg_return are 3 functions that the user must write in a procedural programming language • On line aggregation: this would require another function for early returns
Limitation of SQL3 UDAs • Cannot define on-line aggregation or rollups • Aggregation as a function from a set (or multiset) to a single value • Aggregates can not be used inside recursion (the nonmonotonicity curse) • Ease of use is a major issue
Ease of Use • THE PROBLEM: UDFs are very hard to write and debug. In “unfenced mode” they jeopardize the integrity of the system. UDAs defined using several UDFs are prone to the same problem. • ASOLUTION: Use a high-level language for defining UDAs. But who wants a new DB language? • THE IDEAL SOLUTION: Use SQL to define new aggregates. Substantial benefits: • Users are already familiar with SQL • No impedance mismatch of data types and programming paradigms • DB advantages: scalability, data independence, optimizability, parallelizability • But how far can we take SQL?
AXL by Example: average AGGREGATE avg(value INT) : REAL { TABLE state(sum INT, cnt INT); INITIALIZE:{ INSERT INTO state(value, 1); } ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt+1; } TERMINATE: { INSERT INTO RETURN SELECT sum/cnt FROM state; } }
Second Example Show the average salary of senior managers who make 3 times more than the average employees. • SQL: SELECTavg(salary) FROMemployee WHEREtitle = ‘senior manager’ AND salary > 3 * (SELECTavg(salary)FROMemployee) Two scans of the employee table required • With AXL UDAs: SELECTsscan(title, salary) FROMemployee
AXL: Using a Single Scan AGGREGATEsscan(titleCHAR(20), salaryINT) : REAL { TABLEstate(sumINT, cntINT) ASVALUES(0,0); TABLEseniors(salaryINT); INITIALIZE: ITERATE: { UPDATE stateSETsum=sum+salary, cnt=cnt+1; INSERT INTOseniors VALUES(salary)WHEREtitle = ‘senior manager’; } TERMINATE: { INSERT INTO RETURN SELECTavg(s.salary)FROMseniors ASs WHEREs.salary > 3 * (SELECTsum/cntFROMstate); } }
Early Returns • AVG normally converges early: an early approximation is all is needed in several applications • Online aggregation means that early returns are produced during the computation • Early returns are useful in many other computations: for instance to find the local max and min in a sequence of values -- and various temporal aggregates
Return avg for Every 100 Records AGGREGATEolavg(value INT): REAL { TABLE state(sum INT, cnt INT); INITIALIZE: { INSERT INTO state VALUES (value,1); } ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt+1; INSERT INTO RETURN SELECT sum/cnt FROM state WHERE cnt MOD 100 = 0; } TERMINATE: { INSERT INTO RETURN SELECT sum/cnt FROM state; } }
Temporal Coalescing AGGREGATE coalesce(from TIME, to TIME): (start TIME, end TIME) { TABLE state(cFrom TIME, cTo TIME); INITIALIZE: { INSERT INTO state VALUES (from, to); } ITERATE: { UPDATE state SET cTo = to WHERE cTo >= from AND cTo < to; INSERT INTO RETURN SELECT cFrom, cTo FROM state WHERE cTo < from; UPDATE state SET cFrom = from, cTo = to WHERE cTo < from; } TERMINATE: { INSERT INTO RETURN SELECT cFrom, cTo FROM state; } }
Recursive Aggregates • In AXL, aggregates can call other aggregates. Particularly, an aggregate can call itself recursively. AGGREGATE alldesc(P CHAR(20)): CHAR(20) { INITIALIZE: ITERATE: { INSERT INTO RETURN VALUES(P); INSERT INTO RETURN SELECT alldesc(Child) FROM children WHERE Parent = P; } } Find all the descendents of Tom: SELECT alldesc(Child) FROM children WHERE Parent = ‘Tom’;
AXL: Where SQL and Data Mining Intersects • Loosely-coupled: • Cache Mining: current data mining applications have a loose connection with databases • Tightly-coupled: • UDFs based data mining functions • Ideal: • AXL powered by recursive aggregates
Decision Tree Classifiers Training set: tennis Stream of Column/Value Pairs (together with RecId and Category)
Convert training set to column/value pairs AGGREGATE dissemble(v1 INT, v2 INT, v3 INT, v4 INT, yorn INT) : (col INT, val INT, YorN INT) { INITIALIZE: ITERATE: { INSERT INTO RETURN VALUES(1, v1, yorn), (2,v2,yorn), (3,v3,yorn), (4,v4,yorn);} } CREATE VIEW col-val-pairs(recId INT, col INT, val INT, YorN INT) SELECT mcount(), dissemble(Outlook, Temp, Humidity, Wind, PlayTennis) FROM tennis; SELECT sprint(recId, col, val, YorN) FROM col-val-pairs;
SPRINT Algorithm in AXL [ 1] AGGREGATE sprint(iNode INT, iRec INT, iCol INT, iValue REAL, iYorN INT) [ 2] { TABLE treenodes(Rec INT, Col INT, Val REAL, YorN INT, KEY(Col, Value)); [ 3]TABLE summary(Col INT, SplitGini REAL, SplitVal REAL, Yc INT, Nc INT); [ 4] TABLE split(Rec INT, LeftOrRight INT, KEY (RecId)); [ 5] TABLE mincol(Col INT, Val REAL, Gini REAL); [ 6] TABLE node(Node INT) ASVALUES(iNode); [ 7] INITIALIZE: ITERATE: { [ 8] INSERT INTO treenodes VALUES (iRec, iCol, iValue, iYorN); [ 9]UPDATE summary [10]SET Yc=Yc+iYorN, Nc=Nc+1-iYorN, (SplitGini, SplitVal) = giniudf(Yc, Nc, N, SplitGini, SplitVal) [11]WHERE Col=iCol; [12] } [13] TERMINATE: { [14]INSERT INTO mincol SELECT minpointvalue(Col, SplitGini, SplitVal) FROM summary; [15] INSERT INTO result SELECT n.Node, m.Col, m.Value FROM mincol AS m, node AS n; [16] INSERT INTO split SELECT t.Rec, (t.Value>m.Value) FROM treenodes AS t, mincol AS m [17]WHERE t.Col = m.Col AND m.Gini > 0; [18]SELECT sprint(n.Node*2+s.LeftOrRight, t.Rec, t.Col, t.Val, t.YorN) [19] FROM treenodes AS t, split AS s, node AS n WHERE t.Rec = s.Rec [20]GROUP BY s.LeftOrRight; [21] } [22]}
Performance SPRINT Algorithm: AXL vs. C Categorical Classifier: AXL vs. C
Implementation of AXL • AXL V1.2: approaching 40,000 lines of code • AXL compiler translates AXL programs into C++ code • DB2 add-on: emulate UDAs using UDFs • Standalone: running under both Win98/NT and UNIX • Open interface of physical data model. • Currently using Berkeley DB as our storage manager • In memory tables • Limited Optimization • Using B+-Tree indexes to support equality/range query • Predicate push-down / push-up