250 likes | 387 Views
Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data. Paper by Chris Olston, Jennifer Widom Presented by Faizaan Kersi Marolia. MOTIVATION. DATA CACHING OR REPLICATION IS A COMMON TECHNIQUE FOR REDUCING THE LATENCY TO ACCESS DATA FROM REMOTE SOURCES
E N D
Offering a Precision-Performance Tradeoff for AggregationQueries over Replicated Data Paper by Chris Olston, Jennifer Widom Presented by Faizaan Kersi Marolia
MOTIVATION • DATA CACHING OR REPLICATION IS A COMMON TECHNIQUE FOR REDUCING THE LATENCY TO ACCESS DATA FROM REMOTE SOURCES • PROPOGATING ALL UPDATES MAY BE INFEASIBLE DUE TO NETWORK AND/OR COMPUTATIONAL RESOURCES • STALE REPLICATION – CACHED COPIES OF DATA VALUES ARE ALLOWED TO BECOME OUT OF DATE
MOTIVATION • QUERY OVER CACHED DATA ITEM MAY RETURN AN UNBOUNDEDLY IMPRECISE ANSWER • QUERIES OVER REMOTE MASTER DATA WOULD RETURN A PRECISE ANSWER BUT WITH POOR PERFORMANCE • HENCE WE HAVE A PRECISION-PERFORMANCE TRADEOFF • IT IS REQUIRED TO LET THE USER CONTROL THE PRECISION AND PERFORMANCE OFFERED BY THE SYSTEM
TRAPP • TRADEOFF IN REPLICATION PRECISION AND PERFORMANCE • THE SYSTEM SUPPORTS A CONTINUOUS, MONOTONICALLY DECREASING TRADEOFF BETWEEN PRECISION AND PERFORMANCE
TRAPP/AG – OVERVIEW • FOCUS IS OVER AGGREGATION QUERIES OVER NUMERIC DATA • AGGREGATE QUERIES DISCUSSED ARE THE STANDARD RELATIONAL AGGREGATE FUNCTIONS – COUNT, MIN, MAX, SUM AND AVG • CONVENTIONAL SYSTEMS - QUERY RETURNS SINGLE VALUE. TRAPP/AG RETURNS A BOUNDED IMPRECISE ANSWER [L , H] – THE RANGE IN WHICH THE PRECISE ANSWER IS GUARANTEED
TRAPP/AG - APPROACH • PRECISION IS QUANTIFIED AS THE WIDTH OF THE RANGE ( H – L ) • THE PRECISION CONSTRAINT IS USER SPECIFIED AND IS PROVIDED AS A CONSTANT R >= 0 • HENCE, THE QUERY IS REQUIRED TO RETURN A RESULT [L , H] SUCH THAT 0 <= H – L <= R • NOTE THIS ALLOWS FOR BOTH EXACT AND UNBOUNDED PRECISION
TRAPP/AG - APPROACH • WHENEVER A DATA SOURCE REFRESHES A CACHE’S VALUE FOR A DATA OBJECT, ALONG WITH THE EXACT VALUE IT ALSO SENDS A PARAMETRIZED BOUND [L(t) , H(t)]. • AT t=0 BOTH FUNCTIONS EQUAL THE CURRENT VALUE • EVENTUALLY WHEN ANOTHER REFRESH OCCURS, THE SOURCE SENDS A NEW PAIR OF BOUND FUNCTIONS
TRAPP/AG - APPROACH REFRESHES OCCUR FOR ONE OF TWO REASONS: • VALUE-INITIATED REFRESH THE MASTER VALUE DOES NOT SATISFY THE BOUND SPECIFIED AT A CACHE. THE DATA SOURCE IS THEN OBLIGATED TO REFRESH THE CACHE • QUERY-INITIATED REFRESH QUERY EXECUTED AT CACHE REQUIRES THE EXACT VALUE IN ORDER TO MEET THE PRECISION CONSTRAINT
TRAPP/AG - APPROACH BOUND AND PRECISE VALUE OVER TIME
EXECUTING A QUERY • STANDARD FORM OF A QUERY SELECT AGGREGATE(T.a) WITHIN R FROM T WHERE PREDICATE • AGGREGATE IS ONE OF THE AGGREGATION FUNCTIONS. R IS A NON-NEGATIVE CONSTANT SPECIFYING THE PRECISION REQUIRED. IF R IS NOT SPECIFIED THEN R = INFINITY
EXECUTING A QUERY Suppose one required to compute an aggregate over column T.a of a cached table T. The value of T.a for each tuple t_i is stored in the cache as a bound [L_i , H_i] While computing the aggregate, the query processor has the option for each tuple t_i of either reading the cached bound [L_i , H_i] or refreshing t_i to obtain the master value V_i. The cost to refresh t_i is C_i The final answer to the aggregate is a bound [L , H]
EXECUTING A QUERY • COMPUTE INITIAL BOUNDED ANSWER AND CHECK WHETHER THE PRECISION CONSTRAINT IS MET • IF NOT, AN ALGORITHM CHOOSE_REFRESH IS INVOKED WHICH SELECTS THE SUBSET OF TUPLES THAT HAVE TO BE REFRESHED. QUERY-INITIATED REFRESHES ARE TRIGGERED • RECOMPUTE THE BOUNDED ANSWER BASED ON THE CACHE’S PARTIALLY REFRESHED COPY OF THE TABLE
TRAPP/AG - ASSUMPTIONS • TIME TO REFRESH A BOUND IS SMALL SO THE IMPRECISION INTRODUCED IS INSIGNIFICANT • VALUE-INITIATED REFRESHES DO NOT OCCUR WHEN A QUERY IS BEING PROCESSED • OBJECT INSERTIONS OR DELETIONS ARE PROPOGATED IMMEDIATELY TO ALL CACHES • LEVEL OF PRECISION OFFERED DOES NOT ACCOUNT FOR ELAPSES OF TIME WHILE SENDING REFRESH MESSAGES OR IN PROCESSING A QUERY
TRAPP/AG - ASSUMPTIONS • THE SYSTEM KNOWS THE COST OF REFRESHING THE INDIVIDUAL DATA OBJECTS FROM THEIR SOURCES • THE COST OF REFRESHING A SET OF TUPLES IS THE SUM OF THE COSTS OF REFRESHING EACH MEMBER OF THE SET (ALTHOUGH IN PRACTICE, ALL TUPLES FROM THE SAME SOURCE MAY INCUR A COMMON COST OVERHEAD)
Computing MIN (MAX) (No Selection Predicate) • [ L , H ] = [ min(L_i) , max(H_i) ] COMPUTED OVER ALL TUPLES t_i IN TABLE T • THE CHOOSE_REFRESH ALGORITHM SELECTS ALL TUPLES t_i SUCH THAT L_i < min( H_k ) – R HENCE THE TUPLES REFRESHED ARE THOSE WHOSE LOWER BOUND IS LESS THAN THE MINIMUM UPPER BOUND MINUS THE PRECISION CONSTRAINT
Computing SUM(No Selection Predicate) • [ L , H ] = [ sum(L_i) , sum(H_i) ] COMPUTED OVER ALL TUPLES t_i IN TABLE T • CHOOSE_REFRESH INVOLVES FINDING THE TUPLES NOT TO REFRESH. CALL THIS SET K. AFTER REFRESHING ALL REQUIRED TUPLES, sum( H_i – L_i ) in T = sum( H_i – L_i ) in K HENCE WE REQUIRE TO CHOOSE K SO AS TO MAXIMIZE sum( C_i ) over K UNDER THE CONSTRAINT sum( H_i – L_i ) in K <= R
Computing COUNT(No Selection Predicate) • SINCE WE REQUIRE THAT ALL INSERTIONS AND DELETIONS BE PROPOGATED TO ALL THE DATA CACHES IMMEDIATELY, THE CARDINALITY OF THE CACHED COPY IS ALWAYS EQUAL TO THE CARDINALITY OF THE MASTER COPY. HENCE THERE IS NO NEED FOR ANY REFRESH
Computing AVG(No Selection Predicate) • FIRST COMPUTE COUNT • THEN COMPUTE SUM WITH THE PRECISION CONSTRAINT R = R*COUNT
INCLUDING Selection PREDICATES Consider a selection predicate involving at least one column of T that contains bounded values. The system can partition T into three disjoint sets: • T- :Tuples that cannot possible satisfy the predicate • T+ : Tuples that are guaranteed to satisfy the predicate • T? : Tuples that may or may not satisfy the predicate All algorithms for computing bounded answers and choosing tuples to refresh examine only T+ and T?
Computing MIN (MAX)(With Selection Predicate) The bounded MIN answer as well as the CHOOSE_REFRESH algorithm are identical to the case without a selection predicate but with taking T+ union T? as the set of tuples.
Computing SUM(With Selection Predicate) • The bounded answer [ L , H ] is computed as: L = sum(L_i) over all t_i inT+ + sum(L_i) over all t_i inT? with L_i < 0 H = sum(H_i) over all t_i inT+ + sum(H_i) over all t_i inT? with H_i > 0
Computing COUNT(With Selection Predicate) • The bounded answer [ L , H ] is computed as: L = cardinality( T+ ) H = cardinality( T+ ) + cardinality( T? ) • The CHOOSE_REFRESH algorithm is based on the fact that refreshing a tuple in in T? is guaranteed to remove it from T?. Hence, one is required to find the cardinality( T? ) – R cheapest tuples in T?
Computing AVG(With Selection Predicate) • WITH A PREDICATE, COUNT IS A BOUNDED VALUE AS WELL AS SUM. HENCE IT IS NOT A SIMPLE MATTER OF DIVIDING THE ENDPOINTS OF THE SUM BOUND BY THE EXACT COUNT • ONE CAN CALCULATE AN APPROXIMATE BOUND [ L , H ] BY COMPUTING SUM AND COUNT AND SETTING L = min ( Lsum / Hcount , Lsum / Lcount ) & H = max ( Hsum / Lcount , Hsum / Hcount )
Online References An intra-IIT reference of this presentation is at: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data The complete paper together with algorithms and proofs of correctness exists at: http://www-db.stanford.edu/pub/papers/trapp-ag.ps