Conditioning and Aggregating Uncertain Data Streams: Going Beyond Expectations • Thanh Tran, Andrew McGregor, Yanlei Diao, Liping Peng, Anna Liu • University of Massachusetts, Amherst • Presented by Xin Miao
Previous works (Thanh Tran, Yanlei Diao) • PODS: A New Model and Processing Algorithms for Uncertain Data Streams. SIGMOD 2010 • Capturing Data Uncertainty in High-Volume Stream Processing. CIDR 2009 • Probabilistic Inference over RFID Streams in Mobile Environments. ICDE 2009 • Efficient Data Interpretation and Compression over RFID Streams. ICDE 2008
Uncertain Data Streams • Data: incomplete, imprecise, misleading • Results: unknown quality
Computational Astrophysics • Astrophysical surveys generate observations of 10^8 stars and galaxies, at nightly data rates of 0.5 TB to 20 TB • Observations are noisy: (o_id, time, (x,y)^p, luminosity^p, color^p), where attributes marked with a superscript p are uncertain and may be continuous or discrete
Computational Astrophysics • Queries are issued to detect dynamic features, transient events, and anomalous behaviors • Quality of the returned answer?
Q1: SELECT group_id, max(O.luminosity)
    FROM Observations O [RANGE 1 hour]
    GROUP BY area_id(O.(x,y), AREA_DEF) as group_id
    HAVING max(O.luminosity) > 20
Query answer: group_id = 10, existence prob. = 0.60, max_luminosity^p = [distribution shown in figure]
RFID Tracking and Monitoring • RFID technology is used for object tracking and monitoring, e.g., in supply chain and health care management • Raw RFID readings are noisy and incomplete • Inference yields a stream of object locations: (time, tag_id, weight, (x,y)^p, size^p), where the uncertain attributes may be continuous or discrete
RFID Tracking and Monitoring • Query detecting violations of a fire code, over Location: (time, tag_id, weight, (x,y)^p, size^p) • Quality of the returned alert?
Q2: SELECT group_id, sum(S.weight)
    FROM Locations S
    GROUP BY area_id(S.(x,y)) as group_id
    HAVING sum(S.weight) > 200
Query answer: group_id = 10, existence prob. = 0.75, sum_weight^p = [distribution shown in figure]
Problem Statement • Uncertain attributes: discrete and continuous • Complex relational operations: WHERE, GROUP BY-HAVING, AGGREGATION • Objectives: • Computing result distributions with bounded errors • Query processing on high-volume streams
Challenges • Characterizing the uncertainty of query results requires the probability distributions of uncertain attributes in intermediate and final query results • Computing distributions of aggregates is hard • Computing an aggregate of n discrete random variables may require enumerating an exponential number, 2^n, of possible worlds • E.g., Q1:
    SELECT group_id, max(O.luminosity)
    FROM Observations O [RANGE 1 hour]
    GROUP BY area_id(O.(x,y), AREA_DEF) as group_id
    HAVING max(O.luminosity) > 20
Query answer: group_id = 10, TEP = 0.6, max_luminosity^p = [distribution shown in figure]
Challenges • Offering query answers with bounded errors is crucial. • State-of-the-art: Monte Carlo simulation with unbounded errors • MCDB, SIGMOD’08 • Handling uncertain data in array database, ICDE’08 • Database support for probabilistic attributes and tuples, ICDE’08 • In data streams, query processing needs to employ incremental computation as tuples arrive
State of the Art • Continuous random variables modeled by Gaussian mixture models [SIGMOD’10] • Closed-form solutions for aggregates • Cannot be applied for selection, group by, aggregates • Monte Carlo simulation gives relative approximations for evaluating HAVING predicates [Re & Suciu, VLDBJ’09] • For discrete distributions, the best technique to compute sum needs O(nD^3), where D is the domain size [Kanagal & Deshpande, ICDE’09]
Solution • Query evaluation framework: • Mixed-type data model • Approximate representations • Approximation metrics • Approximation algorithms for aggregates • Randomized algorithms: all aggregates • Deterministic algorithms • max, min • sum, count • Query planning
Mixed-type Data Model for Query Evaluation • Uncertain attributes: • A_x: continuous • A_y: discrete • Certain attributes: • A_z: continuous/discrete • Mixed-type distribution: g = <p, f> • p: tuple existence probability (TEP) • f: joint density function for all uncertain attributes
Data Model Example • Location: (o_id, time, x^p, luminosity^p, color^p), with a TEP per tuple [Table: example tuples, not preserved]
Conditioning • What are the TEP and the pdf after conditioning, e.g., on 1 < x^p < 2 or on 2 < x^p < 3? [Figure: pdf of x^p with the conditioning ranges marked]
Conditioning • Truncated distribution: condition on range I, then normalize • Before: mixed-type <p, f> with support S • After: mixed-type <p', f'> with support S' = S ∩ I
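The truncate-and-normalize step can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the uncertain attribute is a single Gaussian, and the new TEP is assumed to be the old TEP multiplied by the probability mass that falls inside the range (the slide itself only names truncation and normalization). The names `MixedType` and `condition_on_range` are hypothetical.

```python
import math
from dataclasses import dataclass

def normal_cdf(x, mu, sigma):
    """CDF of a Gaussian N(mu, sigma^2); also works for x = +/- infinity."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

@dataclass
class MixedType:
    """Hypothetical mixed-type distribution <p, f>: TEP p plus a Gaussian
    density f truncated to the support [lo, hi]."""
    p: float
    mu: float
    sigma: float
    lo: float = -math.inf
    hi: float = math.inf

    def mass(self, a, b):
        """Probability that the (already normalized) attribute lies in [a, b]."""
        a, b = max(a, self.lo), min(b, self.hi)
        if a >= b:
            return 0.0
        total = normal_cdf(self.hi, self.mu, self.sigma) - normal_cdf(self.lo, self.mu, self.sigma)
        if total <= 0.0:
            return 0.0
        return (normal_cdf(b, self.mu, self.sigma) - normal_cdf(a, self.mu, self.sigma)) / total

def condition_on_range(g, a, b):
    """Condition g on the range I = [a, b].
    Assumption: p' = p * P(attribute in I); the support becomes S ∩ I,
    and normalization is applied lazily inside mass()."""
    return MixedType(p=g.p * g.mass(a, b), mu=g.mu, sigma=g.sigma,
                     lo=max(a, g.lo), hi=min(b, g.hi))
```

For example, conditioning `MixedType(p=0.8, mu=0.0, sigma=1.0)` on [-1, 1] keeps about 68% of the mass, so the TEP drops to roughly 0.546, while the conditioned attribute has all of its mass inside [-1, 1].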
Relational Processing under Mixed-type Model • Execution of Q1 [Figure: query plan, bottom-up: ObjStream → GROUPBY/AGGR_MAX, with conditioning C_Gi on iL ≤ x ≤ (i+1)L for group i → σ_MAX>20 → {areaNo, max_lumin^p}]
Relational Processing under Mixed-type Model • Execution of Q1 [Figure: same plan, ObjStream → GROUPBY/AGGR_MAX → σ_MAX>20 → {areaNo, max_lumin^p}; for group G_i, the distribution of max_Gi(luminosity) passes through σ(max > 20)] • Normalized again!
Approximation Framework: Representation • Employ cumulative distribution functions (CDFs) to approximate distributions of aggregates • Two forms of CDFs: StepCDFs and LinCDFs [Figure: StepCDF and LinCDF examples]
Approximation Framework: Metric • Kolmogorov-Smirnov (KS) distance between two CDFs F and F': KS(F, F') = sup_x |F(x) - F'(x)| • (ε, δ)-approximation: the KS distance between the approximate distribution and the exact distribution is at most ε, with probability at least 1-δ • δ = 0: deterministic; δ > 0: randomized
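The KS distance is easy to compute when both CDFs are step functions, because the supremum is then attained at one of the knots. The following is a small sketch under that assumption; the knot-list representation and the names `step_cdf` and `ks_distance` are illustrative, not taken from the slides.

```python
from bisect import bisect_right

def step_cdf(points):
    """Build a right-continuous step CDF from sorted (x, cumulative_prob) knots.
    F(x) is the cumulative probability at the largest knot <= x, or 0 before
    the first knot."""
    xs = [x for x, _ in points]
    cs = [c for _, c in points]

    def F(x):
        i = bisect_right(xs, x)
        return cs[i - 1] if i > 0 else 0.0

    return F, xs

def ks_distance(points_a, points_b):
    """KS(F, F') = sup_x |F(x) - F'(x)| for two step CDFs.
    Both functions are constant between consecutive knots, so checking the
    union of the knots is enough."""
    Fa, xa = step_cdf(points_a)
    Fb, xb = step_cdf(points_b)
    return max(abs(Fa(x) - Fb(x)) for x in sorted(set(xa) | set(xb)))

exact = [(1, 0.2), (2, 0.5), (3, 1.0)]
approx = [(1, 0.25), (3, 1.0)]
print(ks_distance(exact, approx))  # largest gap is at x = 2: |0.5 - 0.25|
```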
Bounded-Error Monte-Carlo Simulation • (ε, δ)-approximation of aggregates A = f(Y_1, Y_2, ...), e.g., sum, count, avg, min, max • For the t-th tuple, generate m = ln(2/δ) / (2ε^2) samples y_t1, y_t2, ..., y_tm from the distribution of Y_t • Compute m aggregated values a_i = A(y_1i, y_2i, ...), based on an existing deterministic algorithm Φ • Return the CDF built from these values • Time and space are a factor O(ε^-2 log δ^-1) greater than those of Φ • The number of samples is O(1/ε^2): high cost for small ε [Figure: m samples y_t1, ..., y_tm at each time instance t]
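The sampling scheme above can be sketched as follows. This is an illustrative sketch, not the authors' code: each input distribution is assumed to be given as a sampler function taking a random generator, the tuples are assumed independent, and the names `num_samples` and `monte_carlo_cdf` are hypothetical. The sample count is the slide's m = ln(2/δ) / (2ε^2), rounded up.

```python
import math
import random
from bisect import bisect_right

def num_samples(eps, delta):
    """Sample count m = ln(2/delta) / (2 * eps^2), rounded up."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

def monte_carlo_cdf(samplers, aggregate, eps, delta, rng):
    """Empirical CDF of aggregate(Y_1, ..., Y_t).
    samplers: one function per tuple that draws a value from Y_t using rng.
    aggregate: a deterministic aggregate over a list, e.g. max or sum."""
    m = num_samples(eps, delta)
    # The i-th aggregated value uses the i-th sample of every tuple.
    values = sorted(aggregate([draw(rng) for draw in samplers]) for _ in range(m))

    def cdf(x):
        return bisect_right(values, x) / m

    return cdf, m

rng = random.Random(0)
samplers = [lambda r: r.random() for _ in range(3)]  # three uniform(0, 1) inputs
cdf, m = monte_carlo_cdf(samplers, max, eps=0.1, delta=0.05, rng=rng)
print(m)  # 185 samples for eps = 0.1, delta = 0.05
```

Halving ε quadruples m, which is the cost problem the deterministic algorithms on the following slides are meant to address.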
Distributions of MAX • M_t = max(Y_1, Y_2, ..., Y_t) • Simple case: each Y_i is modeled by a distribution over the universe U = {1, 2, 3, ..., n} • Objective: compute the CDF of M_t, namely F_t^M • Basic algorithm: complexity O(tn) • Inefficient for stream processing when n is large
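The formula of the basic algorithm did not survive in this transcript. One standard way to obtain the O(tn) cost, assuming the Y_i are independent and every tuple exists with certainty, is to multiply CDFs pointwise, since P(M_t ≤ x) = P(M_{t-1} ≤ x) · P(Y_t ≤ x). The sketch below follows that assumption; `update_max_cdf` and `max_cdf` are hypothetical names.

```python
def update_max_cdf(prev_cdf, pmf):
    """One O(n) update step.
    prev_cdf[i] = P(M_{t-1} <= i + 1), pmf[i] = P(Y_t = i + 1).
    Returns the list with entries P(M_t <= i + 1), assuming independence."""
    out = []
    running = 0.0  # running CDF of Y_t
    for f_prev, p in zip(prev_cdf, pmf):
        running += p
        out.append(f_prev * running)
    return out

def max_cdf(pmfs, n):
    """CDF of max(Y_1, ..., Y_t) over the universe {1, ..., n}: O(t * n) total."""
    cdf = [1.0] * n  # neutral element of the pointwise product
    for pmf in pmfs:
        cdf = update_max_cdf(cdf, pmf)
    return cdf

# Two independent fair coins over {1, 2}: P(max <= 1) = 0.25, P(max <= 2) = 1.
print(max_cdf([[0.5, 0.5], [0.5, 0.5]], n=2))
```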
MAX: Intuition • Dynamically partition the universe into consecutive intervals • Use the estimates of the two ends of an interval to estimate any intermediate point, since the CDF is non-decreasing
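The monotonicity argument behind this intuition can be written out. The inequality below is a sketch for the exact CDF F at the two ends a and b of an interval; the algorithm itself works with estimates of those values, which adds its own error term.

```latex
\[
a \le x \le b \;\Longrightarrow\; F(a) \le F(x) \le F(b),
\qquad\text{so}\qquad
F(b) - F(a) \le \varepsilon
\;\Longrightarrow\;
|c - F(x)| \le \varepsilon \ \text{ for every } c \in [F(a), F(b)].
\]
```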
MAX: Approximate Representation with Invariants • Approximate using a StepCDF • Partition the universe into consecutive intervals: [1, n] = ∪_i [a_i, b_i], with a_(i+1) = b_i + 1 • Maintain the estimates of the cumulative probabilities, c_ai and c_bi, for [a_i, b_i] • Invariants: (1) estimates of the two ends of an interval are close (accuracy); (2) estimates of two adjacent intervals are separated (performance) [Figure: adjacent intervals I_1 = [a_1, b_1] and I_2 = [a_2, b_2] with estimates c_a1, c_b1, c_a2, c_b2]
MAX: Algorithm • Employ a splitting scheme • On seeing tuple t: • Update the estimates • Subpartition to ensure Invariant (1) • Adjust by splitting and shifting while ensuring Invariant (2) [Figure: steps 1 to 3 on an interval I = [a, b] with estimates c_a, c_b; it is split at the values v1, v2, v3 into subintervals I_1, I_21, I_22 with updated estimates c'_a, c'_b]
MAX: Algorithm • At any step in the algorithm, the number of intervals is bounded [bound missing] • The maximum generation of an interval is [formula missing]
MAX: Analysis • Estimates of the two ends of an interval are bounded, hence estimates of any point in an interval are bounded • The number of intervals is bounded; the number of times an interval is split is bounded, i.e., log U • (ε, 0) algorithm for max; update time is O(ε^-1 log U ln ε^-1) • Extension to continuous distributions: a general approach is to consider a large universe of size 2^64; the complexity is then proportional to log 2^64 = 64
Distributions of SUM • Approximate representation using quantiles • Uniform quantiles [Figure: CDF over [a, b] with quantile levels 1/3, 2/3, and 1]
SUM: Intuition • Assume each input takes values from a finite set of size at most [bound missing] • On receiving each new tuple, we can produce an intermediate approximation of the sum • Approximate this intermediate result using a LinCDF
SUM: Algorithm • a) LinCDF before updating • b) Shifting and scaling the LinCDF • c) Composing with linear interpolation
SUM: Algorithm • c) Composing with linear interpolation • d) Simplify to get a new LinCDF • Can be improved to log time using binary search
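The shifting and scaling step has a simple exact counterpart, sketched below under the assumption that the new input Y_t is discrete and independent of the running sum: F_t(x) = Σ_v P(Y_t = v) · F_{t-1}(x − v), i.e., one shifted copy of the previous CDF per value v, scaled by its probability. This sketch omits the LinCDF interpolation and the simplification step, and the name `shift_and_scale_update` is hypothetical.

```python
def empty_sum_cdf(x):
    """CDF of the empty sum, which is the constant 0."""
    return 1.0 if x >= 0 else 0.0

def shift_and_scale_update(prev_cdf, pmf):
    """Exact update of the CDF of a running sum.
    prev_cdf: callable CDF of S_{t-1}.
    pmf: dict mapping each value v of Y_t to P(Y_t = v).
    Returns a callable CDF of S_t = S_{t-1} + Y_t, assuming independence."""
    def new_cdf(x):
        # Shift prev_cdf by v, scale by P(Y_t = v), and add the copies up.
        return sum(p * prev_cdf(x - v) for v, p in pmf.items())
    return new_cdf

coin = {0: 0.5, 1: 0.5}
F1 = shift_and_scale_update(empty_sum_cdf, coin)
F2 = shift_and_scale_update(F1, coin)
print([F2(x) for x in (-1, 0, 1, 2)])  # [0.0, 0.25, 0.75, 1.0]
```

Without a simplification step, each evaluation of the composed CDF gets more expensive with every tuple, which is why the slides reduce the result to a compact LinCDF after each update.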
SUM: Analysis • (ε, 0) algorithm for sum • Space • Update time • Supporting continuous distributions • Discretize input distribution by LinCDF or StepCDF • Total error is the sum of discretization error and approximation error
Approximate Query Answers • Bound quantities such as […], […] and […] • Query approximation objective, for the approximate query answer set and the exact answer set: • Both sets should have the same number of tuples • The distance between corresponding attributes in corresponding tuples is at most ε with prob. 1-δ • Extended KS distance, named KSM: KSM(G, G') = max( |p - p'|, sup_x |p F(x) - p' F'(x)|, sup_x |p (1 - F(x)) - p' (1 - F'(x))| )
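The KSM formula translates directly into code. The sketch below assumes that G = <p, F> and G' = <p', F'> are given as a TEP plus a callable CDF, and it approximates each supremum by a maximum over a finite grid of points `xs`; the grid and the function name `ksm` are assumptions made for illustration.

```python
def ksm(p, F, p2, F2, xs):
    """KSM(G, G') for G = <p, F> and G' = <p2, F2>.
    The suprema over x are approximated by maxima over the grid xs."""
    d_tep = abs(p - p2)
    d_below = max(abs(p * F(x) - p2 * F2(x)) for x in xs)
    d_above = max(abs(p * (1.0 - F(x)) - p2 * (1.0 - F2(x))) for x in xs)
    return max(d_tep, d_below, d_above)

def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return min(max(x, 0.0), 1.0)

grid = [0.0, 0.25, 0.5, 0.75, 1.0]
# Same attribute distribution, TEPs 0.6 and 0.5: the distance is about 0.1.
print(ksm(0.6, uniform_cdf, 0.5, uniform_cdf, grid))
```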
Approximate Query Answers • Consider a Select-From-Where-Groupby-Having block • One aggregate predicate in the Having clause [Figure: query plan with error labels, top to bottom: Selection/Projection (ε > 0); Aggregation (ε); Selection/Group By (ε = 0); annotation "Error occurs here!"; sample answer tuple: gid = 10, TEP = 0.6, max_luminosity^p]
Query Planning • Planning: find a query plan that meets the objective • Proposition on selection: a selection with a range condition on an attribute with an (ε, δ)-approximation yields a (2ε, δ)-approximation • If the selection uses a union of ranges, the approximation error is twice the sum of the errors, i.e., 2 Σ_i ε_i • Top-down approach to provision error bounds: if the error is ε, we should provision ε/2 for the approximation of sum
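One way to see where the factor 2 can come from is sketched below. It assumes a half-open range (a, b] and an approximate CDF F' whose KS distance to the exact CDF F is at most ε; this is an illustration, not necessarily the argument used by the authors.

```latex
\[
\Pr[a < X \le b] = F(b) - F(a),
\]
\[
\bigl|\,(F'(b) - F'(a)) - (F(b) - F(a))\,\bigr|
\;\le\; |F'(b) - F(b)| + |F'(a) - F(a)|
\;\le\; 2\varepsilon .
\]
```

Read top-down, a target error of ε after the selection is met when the aggregate below it is approximated within ε/2, since 2 · (ε/2) = ε.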
Conclusion • Evaluation framework and approximation techniques for complex operations • Randomized algorithms: general • Deterministic algorithms: often better • For complex queries, the errors are bounded while having throughput of thousands of tuples per sec • Future work • Wider range of aggregates • Correlation among derived attributes • Query optimization
Discussion • Error bound of SUM • Assumption on Vt