390 likes | 495 Views
A Thin Monitoring Layer for Top- k Aggregation Queries over a Database. Data Cube. s um(price *quantity). Brand. Country. Product Type. Data Cube. What are the top-2 product types with the highest revenue of each brand in each country?
E N D
A Thin Monitoring Layer for Top-k Aggregation Queries over a Database
Data Cube sum(price*quantity) Brand Country Product Type
Data Cube What are the top-2 product types with the highest revenue of each brand in each country? What are the top-2 brands with the highest revenue in each country?
Top-k Queries • Primary Attribute: The attribute/dimension over which the selection is performed (e.g. product type) • Secondary Attributes: Used to filter specific results (e.g. brand, country) • Aggregated Attributes: Used to compute an aggregated score (e.g. price, quantity) • Aggregate Function: e.g. sum One top-k query for each combination of secondary attribute instances (filtering condition)
Filtering Conditions: Example (1) brand={X} - country = {Y, W} brand=X AND country=Y brand=X AND country=W SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=Y GROUP BY type ORDER BY SUM(price*quantity) LIMIT K SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=W GROUP BY type ORDER BY SUM(price*quantity) LIMIT K
Filtering Conditions: Example (2) country = {Y, W} - brand={X} country=Y country=W brand=X SELECT type, SUM(price*quantity) FROM relation WHERE country=Y GROUP BY type ORDER BY SUM(price*quantity) LIMIT K SELECT type, SUM(price*quantity) FROM relation WHERE country=W GROUP BY type ORDER BY SUM(price*quantity) LIMIT K SELECT type, SUM(price*quantity) FROM relation WHERE brand=X GROUP BY type ORDER BY SUM(price*quantity) LIMIT K
Filtering Conditions: Example (3) country = {Y, W} - brand={X} SELECT type, SUM(price*quantity) FROM relation GROUP BY type ORDER BY SUM(price*quantity) LIMIT K
Updates Insertions to the underlying database that contain all information related to the top-k queries INSERT INTO relation (type, brand, country, price, quantity) VALUES (T, X, Y, 100, 3)
Problem How to maintain all these queries in the presence of fast updates?
Outline • Setting/Problem • Algorithms • Naïve Approach • Estimates Approach • Groups Approach • Experimental Results • Conclusions
Example SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=Y GROUP BY type ORDER BY SUM(price*quantity) LIMIT 2 Update: (type, X, Y, 300)
Naïve Approach Case 1: type in the top-2, e.g. (B,X,Y,300) +300 Case 2: type NOT in the top-2, e.g. (K,X,Y,300) Verification Query: SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=Y AND type=K GROUP BY type
Estimates ApproachIn-memory Structures • top-(k+N) instances with exact aggregated scores • B instances with estimated aggregated scores • best possible score (basic score) + inserted values Buffer top-2 top-5
Estimates Approach Case 1.1: type in the top-2, e.g. (B,X,Y,300) +300 top-2 top-2 top-5 top-5
Estimates Approach Case 1.2: type in the top-5, e.g. (D,X,Y,300) top-2 top-2 +300 top-5 top-5
Estimates Approach Case 2: type in the Buffer, e.g. (P,X,Y,300) Buffer Buffer top-2 +300 top-5 Verification Query:SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=Y AND type=P GROUP BY type
Estimates Approach Sub-case 2.1: score(P) < score(E), e.g. score(P) = 756 Buffer top-2 top-5
Estimates Approach Sub-case 2.2: score(P) > score(E), e.g. score(P) = 2178 Buffer top-2 top-5
Estimates Approach Sub-case 2.3: score(P) > score(B), e.g. score(P) = 2407 Buffer top-2 top-5
Estimates Approach Case 3: type NOT in in-memory structures, e.g. (T,X,Y,300) Estimated Score(T) = basic score + 300 = 2287 Buffer Full Reset Query SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=Y AND type IN (O,P,Q,R) GROUP BY type
Estimates Approach Case 3: type NOT in in-memory structures, e.g. (T,X,Y,300) score(O)=1254, score(P)=432, score(Q)=2050, score(R)=1990 Buffer top-2 top-5
Queries Characteristics • SAME primary attribute • SAME aggregate attributes • SAME aggregate function • SAME top-k condition • DIFFERENT filtering condition
Groups Approach • The updates are forwarded from top to bottom in the lattice • Each ranking forwards the queried results to the rankings lying in lower levels in the lattice
Groups Approach: Example Update: (type, X, Y, 300) Ranking: brand=X, country=ANY SELECT type, SUM(price*quantity) FROM relation WHERE brand=X GROUP BY type ORDER BY SUM(price*quantity) LIMIT 2
Groups Approach Case 2: type in the Buffer, e.g. (P,X,Y,300) Verification Query: SELECT type, brand, country, price*quantity FROM relation WHERE brand=X AND type=P Case 4: type NOT in in-memory structures, e.g. (T,X,Y,300) Buffer Reset Query: SELECT type, brand, country, price*quantity FROM relation WHERE brand=X AND type IN (O,P,Q,R)
Groups Approach • Tuples(type, brand, country, price*quantity) limited to those satisfying its filtering condition • Uses them to compute the scores. • Forwards them to the rankings lower in lattice • Rankings receiving tuples use those qualifying to their filtering condition to compute the scores
Outline • Problem • Algorithms • Naïve Approach • Estimates Approach • Groups Approach • Experimental Results • Conclusions
Experiments (1) • TPC-H data • Select on part.p_partKey (200,000 unique values) • Filter on customer.c_mktsegment, orders.o_orderpriority and region.r_name • Aggregation sum on lineitem.l_quantity • 216 total rankings • 30,000 updates/insertions
Experiments (2) Updates • Random: inserts quantity between 1 and 50 for a random part.p_partKey. • 80-20: inserts quantity between 1 and 50 for a part.p_partKey selected according to the 80-20 rule N-extra Gap • Difference between top-k and top-(k+N) scores 100% (1*50) and 200% (2*50)
Naïve Approach • 80-20 updates: 239,985 Verification Queries, 4 secs/update • Random updates: 239,977 Verification Queries, 4 secs/update
Outline • Problem • Algorithms • Naïve Approach • Estimates Approach • Groups Approach • Experimental Results • Conclusions
Conclusion • Twoalgorithms to maintain top-k rankings in the presence of fast updates arriving in an underlying database • Exact top-kresults • Faster than a Naïve approach while Groups Approach limits further the communication with the database • Preliminary results which provide insights on the impact of the various parameters in the effectiveness of our methods