360 likes | 485 Views
QUERY BY EXCEL. Witkowski, S. Bellamkonda, T. Bozkaya, A. Naimat, L. Sheng, S. Subramanian, A. Waingold. Oracle Corporation. Spreadsheets. Spreadsheets are established analytical tools: Attractive user interface Easy to use computational model Interactivity for what if analysis.
E N D
QUERY BY EXCEL • Witkowski, S. Bellamkonda, T. Bozkaya, • A. Naimat, L. Sheng, • S. Subramanian, A. Waingold Oracle Corporation
Spreadsheets Spreadsheets are established analytical tools: • Attractive user interface • Easy to use computational model • Interactivity for what if analysis But, they do not offer: • Scalability • Parallelization • A unified view of the data model
Our proposal QUERY BY EXCEL (QBX) Combines • Presentational interactive modeling power of Excel (spreadsheet tools) • Computational power and scalability of RDBMS via analytical extensions
QBX – How it works • Analyst builds a model using Excel. The model is translated to SQL and stored in relational views. • Analysts designate areas in Excel as relational sources (RTables). An RTable can be transformed into another RTable using Excel operations corresponding to Outer Join, Selection, Projection and Aggregation. Analyst does not write any SQL during this process. • Analysts write Excel formulas on samples of relational sources that fit in a spreadsheet. The translated SQL works on the whole data for scalability. • Business Reporting tools can access the relational views for consolidation.
Analytic SQL Extensions for QBX • SQL MODEL (Witkowski, et al. Sigmod 2003) • SQL PIVOT (Cunningham, et al. Vldb 2004)
SQL MODEL • Allows mapping relational data from row format to multi-dimensional format for facilitation of spreadsheet-like computations. < existing parts of a query block > MODEL PARTITION BY (<keys>) DIMENSION BY (<keys>) MEASURES (<keys>) <Options> RULES ( <formula>, <formula>, …, <formula> )
prod time s vcr 2001 9 dvd 2001 0 SQL MODEL: Example SELECT prod, time, s FROM sales MODEL Relation Array 1 2 3 4 1999 5 6 7 8 2000 DIMENSION BY (prod, time) MEASURES (s) time 9 0 1 2 2001 vcr dvd tv pc prod
prod time s vcr 2001 9 dvd 2001 0 SQL MODEL: Example SELECT prod, time, s FROM sales MODEL Relation Array 1 2 3 4 1999 5 6 7 8 2000 DIMENSION BY (prod, time) MEASURES (s) time 9 0 1 2 2001 vcr dvd tv pc RULES UPSERT ( s[vcr, 2002] = s[vcr, 2001] + s[vcr, 2000], s[*, 2000] = s[cv(prod), cv(time) - 1] * 2 ) prod Predict vcr sales in 2002 Sales in 2000 2x of previous
prod time s vcr 2001 9 dvd 2001 0 SQL MODEL: Example SELECT prod, time, s FROM sales MODEL Relation Calc 1 2 3 4 1999 5 6 7 8 2000 DIMENSION BY (prod, time) MEASURES (s) 9 0 1 2 2001 2002 11 RULES UPSERT ( s[vcr, 2002] = s[vcr, 2001] + s[vcr, 2000], s[*, 2000] = s[cv(prod), cv(time) - 1] * 2 ) vcr dvd tv pc Predict vcr sales in 2002 Sales in 2000 2x of previous
prod time s vcr 2001 9 dvd 2001 0 SQL MODEL: Example SELECT prod, time, s FROM sales MODEL Relation Calc 1 2 3 4 1999 2 4 6 8 2000 DIMENSION BY (prod, time) MEASURES (s) 9 0 1 2 2001 2002 11 RULES UPSERT ( s[vcr, 2002] = s[vcr, 2001] + s[vcr, 2000], s[*, 2000] = s[cv(prod), cv(time) - 1] * 2 ) vcr dvd tv pc Predict vcr sales in 2002 Sales in 2000 2x of previous
prod time s vcr 2001 9 dvd 2001 0 SQL MODEL: Example SELECT prod, time, s FROM sales MODEL Relation Calc 1 2 3 4 1999 2 4 6 8 2000 DIMENSION BY (prod, time) MEASURES (s) 9 0 1 2 2001 2002 11 RULES UPSERT ( s[vcr, 2002] = s[vcr, 2001] + s[vcr, 2000], s[*, 2000] = s[cv(prod), cv(time) - 1] * 2 ) vcr dvd tv pc Predict vcr sales in 2002 Sales in 2000 2x of previous Relation again vcr 2001 9 dvd 2001 0 vcr 2002 11
QBX Architecture Excel Analyst Application RDBMS User Excel -> SQL Translation QBX generated SQL Objects RDBMS Interaction & Modeling QBX Database Schema Persistence EXCEL RDBMS
QBX Metadata • Excels( eid, name, owner, ExcelBinary, SQLView) • Cells( eid, sheet, row, col, x, f) For this Excel spreadsheet, we store five rows in Cells Table: C1, C2, C3, D1, D3 • RTables( eid, RTable, sheet, row, col, sample, RTableView, …)
QBX Infrastructure • Interaction and Modeling Component • Persistence Component • Translation Component
Excel to SQL Translation • Fix Frame translation • Table Translation • Unified Translation
Fix Frame Translation SELECT sheet, row, col, x FROM cells MODEL DBY (sheet,row,col) MEA (x) RULES AUTOMATIC ORDER ( x[1,1,2] = x[1,1,1] + x[1,2,1], -- B1=A1+A2 x[1,2,2] = x[1,3,1] + 1, -- B2=A3+1 x[1,3,2] = sum(x)[1,1<=row<=3,1] –- B3=sum(A1:A3) );
Fix Frame Translation-VLOOKUP - HLOOKUP We use REFERENCE SQL MODEL: VLOOKUP(key, (<rs,cs>, <re,ce>), col): REFERENCE vlookup_ref ON ( SELECT k.x key, v.x value FROM cells k, cells v WHERE k.col=cs AND v.col=cs+col-1 AND k.row >= rs AND k.row <= re AND v.row=k.row ) DIMENSION BY (key) MEASURES (value)
C3 Fix Frame Translation-VLOOKUP - HLOOKUP EXAMPLE: A3 =Vlookup(C3, A1:B4, 2) SELECT row, col, x FROM cells MODEL REFERENCE vlookup_ref ON (SELECT k.x key,v.x value FROM cells k,cells v WHERE k.col = 1 AND v.col = 2 AND k.row >= 0 AND k.row <= 4 AND v.row = k.row) DIMENSION BY(key) MEASURES(value) MAIN DIMENSION BY (row, col) MEASURES (x) RULES ( x[3,1] = vlookup_ref.value[ x[3,3] ] ); 6
Table Translation • Table Translation creates named, protected regions within Excel named RTables. • We remember associated metadata for RTable regions (PK, PK-FK constraints, etc..) • A direct RTable represents a • An RDBMS table (entire table or sample) • An RDBMS view (Direct Rtables can be created through QBX menu) • A derived RTable represents the result of relational operations on other RTables.
Rtable Example TIME_D(I2:J4) PROD_D(I7:J9) FACT(A2:D10) REGION_D(I12:J14)
Table Translation - Operations • Inter-column calculations Adding new (calculated) column to an Rtable • Projection • Joining of Rtables The closest Excel operation to join is Hlookup/Vlookup, Which is similar to relational OUTER JOIN. Steps: (R1 LEFT OUTER JOIN R2) • A new column is added. • The new column is populated with (VLOOKUP(R1.Col1, R2, R2.Col2) • Aggregation
Inter-column Calculations • Computations involving columns of the same row. EX: A1 = B1+D1 MODEL DBY(row,col) MEASURES (x) RULES ( x[ANY, 1] = x[cv(row),2] + x[cv(row),4] )
Table Translation – Join Example E3=VLOOKUP(B3,I8:J9,2) F3=VLOOKUP(C3,I3:J4,2)
Join SQL SELECT f.city, f.prod, f.month, g.state, p.categ, t.year, sale, row_number() over (order by state,year,categ) rn FROM fact f outer join time_d t on f.month = t.month outer join prod_d p on f.prod = p.prod outer join geog_d g on f.city = g.city ORDER BY city NULLS LAST, prod NULLS LAST, month NULLS LAST, state NULLS LAST, categ NULLS LAST, year NULLS LAST;
Table Translation- Aggregation Aggregation in Excel can be done through DATA PIVOTTABLE operation. This corresponds to (in RDBMS): • Aggregation via SQL GROUP BY operator • Aggregation via SQL PIVOT operator
Aggregation – SQL GROUP BY SELECT state, year, sum(amt) amt, row_number() over (order by state,year) rn FROM fact f outer join time_d t on f.month = t.month outer join prod_d p on f.prod = p.prod outer join geog_d g on f.city = g.city GROUP BY GROUPING SETS ((state,year),(state)) ORDER BY state NULLS LAST, year NULLS LAST;
Relative Referencing to RTables Introducing a new lookup function for referencing values in Rtables: RTLOOKUP(RTREGION, COL, {PKEYS}) O3 = N3/rtlookup(L2:N8,3,L3,NULL) O4 = N3/rtlookup(L2:N8,3,L3,NULL)
Relative Referencing to RTables CREATE VIEW AGG_Q AS SELECT state, year, ratio, sum(amt) amt, row_number() over () (order by state, year) rn FROM fact f outer join time_d t on f.month=t.month outer join prod_d p on f.prod = p.prod outer join geog_d g on f.city = g.city GROUP BY GROUPING SETS ((state,year),(state)) MODEL DBY (state, year) MEA (total, 0 ratio) ( ratio[ANY,ANY]= total[CV(state),CV(year)]/total[CV(state), null] ) ORDER BY state nulls last, year nulls last;
Optimizations • Collapsing of Equivalent Rules • For Loops vs. Existential Form • Existing optimization of SQL Model functionality • Rule pruning • Filter pushdown • Others…
A1=B1+D1 MODEL DBY(row,col) MEA(x) RULES ( x[for row from 1 to 20,1] = x[CV(row),2] + x[CV(row), 4] ); == Optimizations- Collapsing of Rules MODEL DBY(row,col) MEA(x) RULES ( x[1,1] = x[1,2] + x[1,4], x[2,1] = x[2,2] + x[2,4], … x[20,1] = x[20,2] +x[20,4] );
A1=B1+D1 MODEL DBY(row,col) MEA(x) RULES ( x[1<=row<=20,1] = x[CV(row),2] + x[CV(row), 4] ); VS Optimization – For Loops vs Existential Rules MODEL DBY(row,col) MEA(x) RULES ( x[for row from 1 to 20,1] = x[CV(row),2] + x[CV(row), 4] );
Use For Loops Use scan Optimization – For Loops vs Existential Rules
Conclusion • Our goal is to translate Excel computation to SQL so that Business Models built in Excel can be stored and queried in RDBMS. • We proposed translation techniques for expressing Excel computation in RDBMS SQL using new analytic extensions. • We proposed representation techniques for relational data in Excel by using Rtables and described how Excel operations on RTables can be simulated in SQL. • We discussed how this proposed system would fit into our RDBMS SQL execution engine and benefit from all its capabilities and optimizations.
What is ahead? • Excel • Pivoting and advanced filtering turned out to be essential, but a few more relational friendly extensions would go a long way, particularly in simulating joins, window function computations. • SQL • RDBMS SQL needs to be extended to cover the functionality provided in Excel, particularly financial functions.
Q & Q U E S T I O N S A N S W E R S A