
Multiple Aggregations over Data Stream



Presentation Transcript


  1. Multiple Aggregations over Data Stream. Rui Zhang, Nick Koudas, Beng Chin Ooi, Divesh Srivastava. SIGMOD 2005

  2. Outline • Introduction to the Gigascope DSMS • Multiple Aggregations Problem • The proposed approach - choice of phantoms - space allocation problem • Conclusion

  3. Gigascope • A DSMS designed to monitor high-speed IP traffic data • Two-level architecture: LFTAs sit close to the Network Interface Card and run simple low-level queries over the high-speed data stream, which serve to reduce data volumes • HFTAs run in main memory and process the lower-speed data stream fed by the LFTAs

  4. Single Aggregation in Gigascope
Select A, count(*) From R Group by A;
The LFTA keeps a small hash table of (group, count) pairs for the query; its entries are evicted to the HFTA, which maintains the full aggregate. [Figure: example (group, count) entries such as (2,3), (3,1), (24,1), (4,1), (17,1) flowing from the LFTA hash table to the HFTA]

  5. Cost of Processing a Single Aggregation • probe (c1): the cost of looking up the LFTA hash table and possibly updating the resident entry in case of a collision • eviction (c2): the cost of transferring an entry from an LFTA to the HFTA
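These two unit costs give the per-epoch cost shape used on the later slides. As a hedged summary (the exact accounting, including end-of-epoch flushes, is in the paper), for one query over n records with LFTA collision rate x:

    E \;\approx\; n\,c_1 \;+\; x\,n\,c_2

Each record pays one probe, and the fraction x of probes that collide each force one eviction; this is exactly the form of E1 and E2 on slide 9.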

  6. Processing Multiple Aggregations Naively
R(A, B, C), with three queries:
Select A, count(*) From R Group by A;
Select B, count(*) From R Group by B;
Select C, count(*) From R Group by C;
The LFTA keeps one hash table per query (Hash Table A, Hash Table B, Hash Table C); every arriving record probes all three, and at the end of the epoch every resident entry is evicted to the HFTA.
[Figure: five example records (2,3,4), (24,4,3), (2,3,4), (2,3,4), (4,2,3) probing the three hash tables]
Total cost for the example: 15c1 + 1c2 + 7c2 (15 probes, one eviction forced by a collision during the epoch, seven evictions at the end of the epoch).

  7. Processing Multiple Aggregations by Maintaining Phantoms
R(A, B, C). In addition to the query hash tables A, B and C, the LFTA maintains a phantom hash table on ABC. Each arriving record probes only Hash Table ABC; when an ABC entry is evicted (on a collision or at the end of the epoch), its count is rolled down into Hash Tables A, B and C, whose entries are then evicted to the HFTA.
[Figure: the same five records (2,3,4), (24,4,3), (2,3,4), (2,3,4), (4,2,3) probing Hash Table ABC, whose entries fan out to Hash Tables A, B and C]
Total cost for the example: 14c1 + 8c2.
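The two cost figures can be checked by simple counting. A small sketch (mine, not the paper's), assuming the tables are large enough that all evictions happen at the end of the epoch; the slide instead counts one of the naive plan's eight evictions during the epoch, but the totals agree:

    # Running example of slides 6-7: count LFTA probes (c1) and evictions (c2)
    # for the naive plan and for the plan with phantom ABC.
    records = [(2, 3, 4), (24, 4, 3), (2, 3, 4), (2, 3, 4), (4, 2, 3)]

    # Naive: every record probes the A, B and C tables directly.
    naive_probes = len(records) * 3
    naive_evictions = sum(len({r[i] for r in records}) for i in range(3))

    # With phantom ABC: every record probes ABC once; each distinct ABC group
    # then probes the A, B and C tables once when it is rolled down.
    abc_groups = set(records)
    phantom_probes = len(records) + len(abc_groups) * 3
    phantom_evictions = naive_evictions    # the same entries reach the HFTA

    print(f"naive:   {naive_probes}c1 + {naive_evictions}c2")      # 15c1 + 8c2
    print(f"phantom: {phantom_probes}c1 + {phantom_evictions}c2")  # 14c1 + 8c2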

  8. The Problem • Consider a set of aggregation queries over a data stream that differ only in their group-by attributes. Determine an optimal sharing configuration for the queries under limited LFTA memory: - choice of phantoms - space allocation [Figure: the relation-feeding lattice for the running example: candidate phantoms ABCD, ABC, ABD, BCD above the given queries Q1-Q4 on AB, BC, BD and CD]

  9. Idea of Maintaining Phantoms
• x1: the collision rate of the query hash tables without the phantom
• x1': the collision rate of the query hash tables with the phantom
• x2: the collision rate of the phantom ABC
• The total cost per epoch of n records for the three queries:
  - without the phantom: E1 = 3nc1 + 3x1nc2
  - with the phantom: E2 = nc1 + 3x2nc1 + 3x1'x2nc2
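A minimal sketch of these two formulas as code, useful for plugging in collision-rate estimates; the parameter values in the example call are made up:

    def epoch_costs(n, c1, c2, x1, x1p, x2):
        """E1 (no phantom) and E2 (with phantom ABC) from slide 9."""
        e1 = 3 * n * c1 + 3 * x1 * n * c2                      # three probes per record
        e2 = n * c1 + 3 * x2 * n * c1 + 3 * x1p * x2 * n * c2  # one probe, roll-down on collision
        return e1, e2

    # Example: cheap probes, expensive evictions, fairly low phantom collision rate.
    e1, e2 = epoch_costs(n=1_000_000, c1=1.0, c2=10.0, x1=0.4, x1p=0.5, x2=0.2)
    print(e1, e2, "benefit of the phantom:", e1 - e2)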

  10. Example
To be fair, the total space M used for the hash tables should be the same with or without the phantom: without it, each of the tables A, B and C gets M/3 (collision rate x1); with it, each of ABC, A, B and C gets M/4 (collision rates x2 and x1').
Per record:
  E1 = 3c1 + 3x1c2
  E2 = c1 + 3x2c1 + 3x1'x2c2
  E1 - E2 = (2 - 3x2)c1 + 3(x1 - x1'x2)c2 = F(x1, x2, x1')
When x2 is small (x2 → 0), E1 - E2 > 0 and the phantom reduces the cost; since the tables shrink from M/3 to M/4, x1' exceeds x1, so the benefit is not automatic. In this example the phantom does reduce the cost.
[Figure: the two configurations, three tables of size M/3 versus four tables of size M/4]

  11. The Collision Rate Estimation
• g: number of groups of a relation; b: number of buckets in its hash table
• With uniform hashing, the probability that exactly k of the g groups land in a given bucket is binomial: P(k) = C(g, k)(1/b)^k(1 - 1/b)^(g-k)
• Bk = b * P(k): the expected number of buckets holding k groups
• nrg: the expected number of records for each group
• Key point: within a bucket holding k groups, each record collides at rate (1 - 1/k)
[Figure: the distribution of Bk for g = 3000, b = 1000]
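One plausible way to put these pieces together into an overall collision-rate estimate; this is my sketch of the idea, not necessarily the paper's exact formula, and it assumes every record landing in a k-group bucket collides with probability 1 - 1/k:

    from math import comb

    def estimated_collision_rate(g, b):
        """Estimate the collision rate of a hash table with b buckets and g groups."""
        x = 0.0
        for k in range(1, g + 1):
            p_k = comb(g, k) * (1 / b) ** k * (1 - 1 / b) ** (g - k)
            b_k = b * p_k                     # expected number of buckets with k groups
            # each such bucket holds a fraction k/g of the records, colliding at rate 1 - 1/k
            x += b_k * k * (1 - 1 / k) / g
            if k > g / b and p_k < 1e-12:     # the binomial tail is negligible: stop early
                break
        return x

    print(estimated_collision_rate(g=3000, b=1000))   # the slide's example sizes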

  12. Algorithmic Strategies for Choosing the Phantoms
• Benefit = the difference between the maintenance costs without and with the phantom
Greedy by Increasing Collision Rate
• Start from the configuration I that includes only the queries
• Calculate the maintenance cost if a phantom R is added to I
• Comparing it with the maintenance cost when R is not in I gives R's benefit
• After we add this phantom to I, we iterate over the remaining phantoms (a sketch of the loop follows this slide)
• As more phantoms are added to I, the overall collision rate goes up and the benefit decreases
• Stop when the benefit becomes negative
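A minimal sketch of this greedy loop, assuming a cost(config) helper that evaluates the model of slides 9-11; adding the largest-benefit phantom at each step is my reading of the slide, and the names are placeholders, not the paper's code:

    def greedy_choose_phantoms(queries, candidate_phantoms, cost):
        """Add phantoms to the configuration while they keep reducing the estimated cost."""
        config = set(queries)                    # start with the queries only
        remaining = set(candidate_phantoms)
        while remaining:
            # benefit of each remaining phantom relative to the current configuration
            benefit = {r: cost(config) - cost(config | {r}) for r in remaining}
            best = max(benefit, key=benefit.get)
            if benefit[best] <= 0:               # no phantom helps any more: stop
                break
            config.add(best)
            remaining.remove(best)
        return config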

  13. Algorithmic Strategies for Choosing the Phantoms: Greedy by Increasing Collision Rate, Example
Available memory = 12000. Group counts in the lattice: ABCD g=2837; ABC g=2117, ABD g=2387, BCD g=2249; AB g=1846, BC g=1946, BD g=1899, CD g=1999 (the queries Q1-Q4).
• With queries only, space is split in proportion to the group counts: Allocate AB = (1846/7690) * 12000, and likewise for BC, BD and CD (7690 = 1846 + 1946 + 1899 + 1999).
• Try ABCD (linear proportional allocation): Allocate ABCD = (2837/10527) * 12000, Allocate AB = (1846/10527) * 12000, and likewise for BC, BD and CD (10527 = 7690 + 2837). From bABCD we get xABCD and hence the benefit of ABCD, computed as E1 - E2 = F(x1, x2, x1').
• The process ends when the benefit becomes negative.
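The allocation arithmetic spelled out; a small sketch that just reproduces the two proportional splits quoted on the slide:

    groups = {"AB": 1846, "BC": 1946, "BD": 1899, "CD": 1999}
    memory = 12000

    # Queries only: split the memory in proportion to the group counts.
    total = sum(groups.values())                          # 7690
    baseline = {q: memory * g / total for q, g in groups.items()}

    # Trying phantom ABCD (g = 2837): re-split over all five tables.
    with_abcd = dict(groups, ABCD=2837)
    total2 = sum(with_abcd.values())                      # 10527
    proportional = {q: memory * g / total2 for q, g in with_abcd.items()}

    print(round(baseline["AB"]))        # ≈ 2881 buckets for AB with queries only
    print(round(proportional["ABCD"]))  # ≈ 3234 buckets for ABCD when it is tried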

  14. Space Allocation
• For the two-level graph (a phantom AB with collision rate x0 feeding the queries A and B with collision rates x1 and x2), write the expected cost e in terms of the space given to each hash table and set the partial derivatives of e to 0.
• e is minimized when each table's space is proportional to the square root of its number of groups; this is the optimal solution for the two-level graph.
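A hedged sketch of where the square root comes from, under the simplifying assumption that each table's contribution to e grows roughly like g_i / b_i (groups per bucket); the paper's actual expression for e is more detailed, but the optimization has the same shape:

    \min_{b_1,\dots,b_m} \sum_i \frac{g_i}{b_i}
    \quad \text{subject to} \quad \sum_i b_i = B

    \frac{\partial}{\partial b_i}\Big(\sum_j \frac{g_j}{b_j} + \lambda\big(\textstyle\sum_j b_j - B\big)\Big)
      = -\frac{g_i}{b_i^2} + \lambda = 0
    \;\Rightarrow\;
    b_i = \sqrt{g_i/\lambda}
    \;\Rightarrow\;
    b_i = B \cdot \frac{\sqrt{g_i}}{\sum_j \sqrt{g_j}}

i.e. the space allocated to each table is proportional to the square root of its number of groups, matching the slide's conclusion.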

  15. Algorithmic Strategies for Choosing the Phantoms • One way of allocating hash table space to a relation is in proportion to the number of groups in the relation • For a relation with g groups we allocate a constant times g; the constant is set large (so that collision rates stay low)

  16. Algorithmic Strategies for Choosing the Phantoms
Greedy by Increasing Space
• We calculate the benefit of each phantom according to the cost model
• We divide each phantom R's benefit by the space R would occupy, giving its benefit per unit space
• We choose the phantom with the largest benefit per unit space as the first phantom to instantiate, and repeat (a sketch of the loop follows this slide)
• The process ends when the benefit per unit space becomes negative
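A minimal sketch of this variant, with cost, space and the memory budget as assumed helpers (the budget check comes from the example on the next slide); again not the paper's code:

    def greedy_by_increasing_space(queries, candidate_phantoms, cost, space, budget):
        """Instantiate phantoms in order of benefit per unit of space."""
        config = set(queries)
        used = sum(space(q) for q in queries)
        remaining = set(candidate_phantoms)
        while remaining:
            def benefit_per_space(r):
                return (cost(config) - cost(config | {r})) / space(r)
            best = max(remaining, key=benefit_per_space)
            if benefit_per_space(best) <= 0 or used + space(best) > budget:
                break                           # negative benefit or space exhausted
            config.add(best)
            used += space(best)
            remaining.remove(best)
        return config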

  17. Algorithmic Strategies for Choosing the Phantoms: Greedy by Increasing Space, Example
Benefit/space is the metric, with the benefit again computed as E1 - E2 = (2 - 3x2)c1 + 3(x1 - x1'x2)c2.
• Available memory = 12000; the query tables AB, BC, BD and CD (g = 1846, 1946, 1899, 1999) take 7690, leaving 12000 - 7690 = 4310.
• Try ABCD (g = 2837, benefit = 2): it fits and is instantiated, leaving 4310 - 2837 = 1473.
• The remaining candidates ABC (g = 2117), ABD (g = 2387) and BCD (g = 2249) are considered next (the slide shows benefits of 1 and -1 among them), but none fits in the remaining 1473.
• The process ends when the benefit becomes negative or the space is exhausted.

  18. Drawback • The proportionality constant of slide 15 needs to be tuned to find the best performance

  19. Space Allocation • By Abel's impossibility theorem, general polynomial equations of degree five or higher cannot be solved algebraically (in radicals) • More general multi-level configurations generate equations of even higher degree, which are likewise unsolvable in closed form • We therefore use heuristics, based on the analysis available, to decide the space allocation for these unsolvable cases

  20. Space Allocation • Super-node with Linear Combination • Super-node with Square Root Combination • Linear Proportional Allocation • Square Root Proportional Allocation

  21. Conclusion • We address the problem of efficiently computing multiple aggregations over high-speed data streams • In a real DSMS, the value of g (the number of groups) is not known in advance.
