A Thin Monitoring Layer for Top- k Aggregation Queries over a Database

A Thin Monitoring Layer for Top-k Aggregation Queries over a Database

Data Cube sum(price*quantity) Brand Country Product Type

Data Cube What are the top-2 product types with the highest revenue of each brand in each country? What are the top-2 brands with the highest revenue in each country?

Top-k Queries • Primary Attribute: The attribute/dimension over which the selection is performed (e.g. product type) • Secondary Attributes: Used to filter specific results (e.g. brand, country) • Aggregated Attributes: Used to compute an aggregated score (e.g. price, quantity) • Aggregate Function: e.g. sum One top-k query for each combination of secondary attribute instances (filtering condition)

Filtering Conditions: Example (1) brand={X} - country = {Y, W} brand=X AND country=Y brand=X AND country=W SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=Y GROUP BY type ORDER BY SUM(price*quantity) LIMIT K SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=W GROUP BY type ORDER BY SUM(price*quantity) LIMIT K

Filtering Conditions: Example (2) country = {Y, W} - brand={X} country=Y country=W brand=X SELECT type, SUM(price*quantity) FROM relation WHERE country=Y GROUP BY type ORDER BY SUM(price*quantity) LIMIT K SELECT type, SUM(price*quantity) FROM relation WHERE country=W GROUP BY type ORDER BY SUM(price*quantity) LIMIT K SELECT type, SUM(price*quantity) FROM relation WHERE brand=X GROUP BY type ORDER BY SUM(price*quantity) LIMIT K

Filtering Conditions: Example (3) country = {Y, W} - brand={X} SELECT type, SUM(price*quantity) FROM relation GROUP BY type ORDER BY SUM(price*quantity) LIMIT K

Updates Insertions to the underlying database that contain all information related to the top-k queries INSERT INTO relation (type, brand, country, price, quantity) VALUES (T, X, Y, 100, 3)

Problem How to maintain all these queries in the presence of fast updates?

Outline • Setting/Problem • Algorithms • Naïve Approach • Estimates Approach • Groups Approach • Experimental Results • Conclusions

Example SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=Y GROUP BY type ORDER BY SUM(price*quantity) LIMIT 2 Update: (type, X, Y, 300)

Naïve Approach Case 1: type in the top-2, e.g. (B,X,Y,300) +300 Case 2: type NOT in the top-2, e.g. (K,X,Y,300) Verification Query: SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=Y AND type=K GROUP BY type

Estimates ApproachIn-memory Structures • top-(k+N) instances with exact aggregated scores • B instances with estimated aggregated scores • best possible score (basic score) + inserted values Buffer top-2 top-5

Estimates Approach Case 1.1: type in the top-2, e.g. (B,X,Y,300) +300 top-2 top-2 top-5 top-5

Estimates Approach Case 1.2: type in the top-5, e.g. (D,X,Y,300) top-2 top-2 +300 top-5 top-5

Estimates Approach Case 2: type in the Buffer, e.g. (P,X,Y,300) Buffer Buffer top-2 +300 top-5 Verification Query:SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=Y AND type=P GROUP BY type

Estimates Approach Sub-case 2.1: score(P) < score(E), e.g. score(P) = 756 Buffer top-2 top-5

Estimates Approach Sub-case 2.2: score(P) > score(E), e.g. score(P) = 2178 Buffer top-2 top-5

Estimates Approach Sub-case 2.3: score(P) > score(B), e.g. score(P) = 2407 Buffer top-2 top-5

Estimates Approach Case 3: type NOT in in-memory structures, e.g. (T,X,Y,300) Estimated Score(T) = basic score + 300 = 2287 Buffer Full Reset Query SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=Y AND type IN (O,P,Q,R) GROUP BY type

Estimates Approach Case 3: type NOT in in-memory structures, e.g. (T,X,Y,300) score(O)=1254, score(P)=432, score(Q)=2050, score(R)=1990 Buffer top-2 top-5

Queries Characteristics • SAME primary attribute • SAME aggregate attributes • SAME aggregate function • SAME top-k condition • DIFFERENT filtering condition

Lattice organisation

Groups Approach • The updates are forwarded from top to bottom in the lattice • Each ranking forwards the queried results to the rankings lying in lower levels in the lattice

Groups Approach: Example Update: (type, X, Y, 300) Ranking: brand=X, country=ANY SELECT type, SUM(price*quantity) FROM relation WHERE brand=X GROUP BY type ORDER BY SUM(price*quantity) LIMIT 2

Groups Approach Case 2: type in the Buffer, e.g. (P,X,Y,300) Verification Query: SELECT type, brand, country, price*quantity FROM relation WHERE brand=X AND type=P Case 4: type NOT in in-memory structures, e.g. (T,X,Y,300) Buffer Reset Query: SELECT type, brand, country, price*quantity FROM relation WHERE brand=X AND type IN (O,P,Q,R)

Groups Approach • Tuples(type, brand, country, price*quantity) limited to those satisfying its filtering condition • Uses them to compute the scores. • Forwards them to the rankings lower in lattice • Rankings receiving tuples use those qualifying to their filtering condition to compute the scores

Outline • Problem • Algorithms • Naïve Approach • Estimates Approach • Groups Approach • Experimental Results • Conclusions

Experiments (1) • TPC-H data • Select on part.p_partKey (200,000 unique values) • Filter on customer.c_mktsegment, orders.o_orderpriority and region.r_name • Aggregation sum on lineitem.l_quantity • 216 total rankings • 30,000 updates/insertions

Experiments (2) Updates • Random: inserts quantity between 1 and 50 for a random part.p_partKey. • 80-20: inserts quantity between 1 and 50 for a part.p_partKey selected according to the 80-20 rule N-extra Gap • Difference between top-k and top-(k+N) scores 100% (1*50) and 200% (2*50)

80-20 Updates: Queries

80-20 Updates: Time

Random Updates: Queries

Random Updates: Time

Naïve Approach • 80-20 updates: 239,985 Verification Queries, 4 secs/update • Random updates: 239,977 Verification Queries, 4 secs/update

Outline • Problem • Algorithms • Naïve Approach • Estimates Approach • Groups Approach • Experimental Results • Conclusions

Conclusion • Twoalgorithms to maintain top-k rankings in the presence of fast updates arriving in an underlying database • Exact top-kresults • Faster than a Naïve approach while Groups Approach limits further the communication with the database • Preliminary results which provide insights on the impact of the various parameters in the effectiveness of our methods

Thank you!

Additional Instances

A Thin Monitoring Layer for Top- k Aggregation Queries over a Database

A Thin Monitoring Layer for Top- k Aggregation Queries over a Database

Presentation Transcript

THIN LAYER CHROMATOGRAPHY

Evaluating Top- K Selection Queries

Evaluating Top-k Queries Over Web-Accessible Databases

A Generic Framework for Monitoring Continuous Spatial Queries over Moving Objects

Nested Queries and Aggregation

Thin Layer Chromatography

THIN-LAYER CHROMATOGRAPHY

6 Rank Aggregation and Top-k Queries

Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

Cleaning Uncertain Data for Top-k Queries

Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation

Best Position Algorithms for Top-k Queries

Continuous Top-k Dominating Queries

Evaluating top-k Queries over Web-Accessible Databases

Reverse Top- k Queries

Window-aware Load Shedding for Aggregation Queries over Data Streams

Thin – Layer Chromatography

Evaluating Top-k Queries over Web-Accessible Databases

THIN LAYER CHROMATOGRAPHY

THIN LAYER CHROMATOGRAPHY

THIN LAYER CHROMATOGRAPHY