
Multiple Aggregations over Data Stream



Presentation Transcript


  1. Multiple Aggregations over Data Stream. Rui Zhang, Nick Koudas, Beng Chin Ooi, Divesh Srivastava. SIGMOD 2005

  2. Outline • Introduction to the Gigascope DSMS • Multiple Aggregations Problem • The proposed approach - choice of phantoms - space allocation problem • Conclusion

  3. Gigascope • A DSMS designed to monitor high-speed IP traffic data • Two-level architecture: LFTAs sit close to the Network Interface Card and run simple low-level queries over the high-speed data stream, which serve to reduce data volumes • HFTAs run in main memory and process the lower-speed data stream fed by the LFTAs

  4. Single Aggregation in Gigascope
Select A, count(*) From R Group by A;
The LFTA keeps a small hash table of (group, count) pairs for the query; its entries are evicted to the HFTA, which maintains the full aggregate. [Figure: example (group, count) entries such as (2,3), (3,1), (24,1), (4,1), (17,1) flowing from the LFTA hash table to the HFTA]

  5. Cost of Processing a Single Aggregation • probe (c1): the cost of looking up the LFTA hash table and possibly updating the resident entry in case of a collision • eviction (c2): the cost of transferring an entry from an LFTA to the HFTA
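These two unit costs give the per-epoch cost shape used on the later slides. As a hedged summary (the exact accounting, including end-of-epoch flushes, is in the paper), for one query over n records with LFTA collision rate x:

    E \;\approx\; n\,c_1 \;+\; x\,n\,c_2

Each record pays one probe, and the fraction x of probes that collide each force one eviction; this is exactly the form of E1 and E2 on slide 9.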

  6. Processing Multiple Aggregations Naively
R(A, B, C), with three queries:
Select A, count(*) From R Group by A;
Select B, count(*) From R Group by B;
Select C, count(*) From R Group by C;
The LFTA keeps one hash table per query (Hash Table A, Hash Table B, Hash Table C); every arriving record probes all three, and at the end of the epoch every resident entry is evicted to the HFTA.
[Figure: five example records (2,3,4), (24,4,3), (2,3,4), (2,3,4), (4,2,3) probing the three hash tables]
Total cost for the example: 15c1 + 1c2 + 7c2 (15 probes, one eviction forced by a collision during the epoch, seven evictions at the end of the epoch).

  7. Processing Multiple Aggregations by Maintaining Phantoms
R(A, B, C). In addition to the query hash tables A, B and C, the LFTA maintains a phantom hash table on ABC. Each arriving record probes only Hash Table ABC; when an ABC entry is evicted (on a collision or at the end of the epoch), its count is rolled down into Hash Tables A, B and C, whose entries are then evicted to the HFTA.
[Figure: the same five records (2,3,4), (24,4,3), (2,3,4), (2,3,4), (4,2,3) probing Hash Table ABC, whose entries fan out to Hash Tables A, B and C]
Total cost for the example: 14c1 + 8c2.
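The two cost figures can be checked by simple counting. A small sketch (mine, not the paper's), assuming the tables are large enough that all evictions happen at the end of the epoch; the slide instead counts one of the naive plan's eight evictions during the epoch, but the totals agree:

    # Running example of slides 6-7: count LFTA probes (c1) and evictions (c2)
    # for the naive plan and for the plan with phantom ABC.
    records = [(2, 3, 4), (24, 4, 3), (2, 3, 4), (2, 3, 4), (4, 2, 3)]

    # Naive: every record probes the A, B and C tables directly.
    naive_probes = len(records) * 3
    naive_evictions = sum(len({r[i] for r in records}) for i in range(3))

    # With phantom ABC: every record probes ABC once; each distinct ABC group
    # then probes the A, B and C tables once when it is rolled down.
    abc_groups = set(records)
    phantom_probes = len(records) + len(abc_groups) * 3
    phantom_evictions = naive_evictions    # the same entries reach the HFTA

    print(f"naive:   {naive_probes}c1 + {naive_evictions}c2")      # 15c1 + 8c2
    print(f"phantom: {phantom_probes}c1 + {phantom_evictions}c2")  # 14c1 + 8c2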

  8. The Problem • Consider a set of aggregation queries over a data stream that differ only in their group-by attributes. Determine an optimal sharing configuration for the queries under limited LFTA memory: - choice of phantoms - space allocation [Figure: the relation-feeding lattice for the running example: candidate phantoms ABCD, ABC, ABD, BCD above the given queries Q1-Q4 on AB, BC, BD and CD]

  9. Idea of Maintaining Phantoms
• x1: the collision rate of the query hash tables without the phantom
• x1': the collision rate of the query hash tables with the phantom
• x2: the collision rate of the phantom ABC
• The total cost per epoch of n records for the three queries:
  - without the phantom: E1 = 3nc1 + 3x1nc2
  - with the phantom: E2 = nc1 + 3x2nc1 + 3x1'x2nc2
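A minimal sketch of these two formulas as code, useful for plugging in collision-rate estimates; the parameter values in the example call are made up:

    def epoch_costs(n, c1, c2, x1, x1p, x2):
        """E1 (no phantom) and E2 (with phantom ABC) from slide 9."""
        e1 = 3 * n * c1 + 3 * x1 * n * c2                      # three probes per record
        e2 = n * c1 + 3 * x2 * n * c1 + 3 * x1p * x2 * n * c2  # one probe, roll-down on collision
        return e1, e2

    # Example: cheap probes, expensive evictions, fairly low phantom collision rate.
    e1, e2 = epoch_costs(n=1_000_000, c1=1.0, c2=10.0, x1=0.4, x1p=0.5, x2=0.2)
    print(e1, e2, "benefit of the phantom:", e1 - e2)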

  10. Example
To be fair, the total space M used for the hash tables should be the same with or without the phantom: without it, each of the tables A, B and C gets M/3 (collision rate x1); with it, each of ABC, A, B and C gets M/4 (collision rates x2 and x1').
Per record:
  E1 = 3c1 + 3x1c2
  E2 = c1 + 3x2c1 + 3x1'x2c2
  E1 - E2 = (2 - 3x2)c1 + 3(x1 - x1'x2)c2 = F(x1, x2, x1')
When x2 is small (x2 → 0), E1 - E2 > 0 and the phantom reduces the cost; since the tables shrink from M/3 to M/4, x1' exceeds x1, so the benefit is not automatic. In this example the phantom does reduce the cost.
[Figure: the two configurations, three tables of size M/3 versus four tables of size M/4]

  11. The Collision Rate Estimation
• g: number of groups of a relation; b: number of buckets in its hash table
• With uniform hashing, the probability that exactly k of the g groups land in a given bucket is binomial: P(k) = C(g, k)(1/b)^k(1 - 1/b)^(g-k)
• Bk = b * P(k): the expected number of buckets holding k groups
• nrg: the expected number of records for each group
• Key point: within a bucket holding k groups, each record collides at rate (1 - 1/k)
[Figure: the distribution of Bk for g = 3000, b = 1000]
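One plausible way to put these pieces together into an overall collision-rate estimate; this is my sketch of the idea, not necessarily the paper's exact formula, and it assumes every record landing in a k-group bucket collides with probability 1 - 1/k:

    from math import comb

    def estimated_collision_rate(g, b):
        """Estimate the collision rate of a hash table with b buckets and g groups."""
        x = 0.0
        for k in range(1, g + 1):
            p_k = comb(g, k) * (1 / b) ** k * (1 - 1 / b) ** (g - k)
            b_k = b * p_k                     # expected number of buckets with k groups
            # each such bucket holds a fraction k/g of the records, colliding at rate 1 - 1/k
            x += b_k * k * (1 - 1 / k) / g
            if k > g / b and p_k < 1e-12:     # the binomial tail is negligible: stop early
                break
        return x

    print(estimated_collision_rate(g=3000, b=1000))   # the slide's example sizes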

  12. Algorithmic Strategies for Choosing the Phantoms
• Benefit = the difference between the maintenance costs without and with the phantom
Greedy by Increasing Collision Rate
• Start from the configuration I that includes only the queries
• Calculate the maintenance cost if a phantom R is added to I
• Comparing it with the maintenance cost when R is not in I gives R's benefit
• After we add this phantom to I, we iterate over the remaining phantoms (a sketch of the loop follows this slide)
• As more phantoms are added to I, the overall collision rate goes up and the benefit decreases
• Stop when the benefit becomes negative
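A minimal sketch of this greedy loop, assuming a cost(config) helper that evaluates the model of slides 9-11; adding the largest-benefit phantom at each step is my reading of the slide, and the names are placeholders, not the paper's code:

    def greedy_choose_phantoms(queries, candidate_phantoms, cost):
        """Add phantoms to the configuration while they keep reducing the estimated cost."""
        config = set(queries)                    # start with the queries only
        remaining = set(candidate_phantoms)
        while remaining:
            # benefit of each remaining phantom relative to the current configuration
            benefit = {r: cost(config) - cost(config | {r}) for r in remaining}
            best = max(benefit, key=benefit.get)
            if benefit[best] <= 0:               # no phantom helps any more: stop
                break
            config.add(best)
            remaining.remove(best)
        return config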

  13. Algorithmic Strategies for Choosing the Phantoms: Greedy by Increasing Collision Rate, Example
Available memory = 12000. Group counts in the lattice: ABCD g=2837; ABC g=2117, ABD g=2387, BCD g=2249; AB g=1846, BC g=1946, BD g=1899, CD g=1999 (the queries Q1-Q4).
• With queries only, space is split in proportion to the group counts: Allocate AB = (1846/7690) * 12000, and likewise for BC, BD and CD (7690 = 1846 + 1946 + 1899 + 1999).
• Try ABCD (linear proportional allocation): Allocate ABCD = (2837/10527) * 12000, Allocate AB = (1846/10527) * 12000, and likewise for BC, BD and CD (10527 = 7690 + 2837). From bABCD we get xABCD and hence the benefit of ABCD, computed as E1 - E2 = F(x1, x2, x1').
• The process ends when the benefit becomes negative.
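The allocation arithmetic spelled out; a small sketch that just reproduces the two proportional splits quoted on the slide:

    groups = {"AB": 1846, "BC": 1946, "BD": 1899, "CD": 1999}
    memory = 12000

    # Queries only: split the memory in proportion to the group counts.
    total = sum(groups.values())                          # 7690
    baseline = {q: memory * g / total for q, g in groups.items()}

    # Trying phantom ABCD (g = 2837): re-split over all five tables.
    with_abcd = dict(groups, ABCD=2837)
    total2 = sum(with_abcd.values())                      # 10527
    proportional = {q: memory * g / total2 for q, g in with_abcd.items()}

    print(round(baseline["AB"]))        # ≈ 2881 buckets for AB with queries only
    print(round(proportional["ABCD"]))  # ≈ 3234 buckets for ABCD when it is tried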

  14. Space Allocation
• For the two-level graph (a phantom AB with collision rate x0 feeding the queries A and B with collision rates x1 and x2), write the expected cost e in terms of the space given to each hash table and set the partial derivatives of e to 0.
• e is minimized when each table's space is proportional to the square root of its number of groups; this is the optimal solution for the two-level graph.
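A hedged sketch of where the square root comes from, under the simplifying assumption that each table's contribution to e grows roughly like g_i / b_i (groups per bucket); the paper's actual expression for e is more detailed, but the optimization has the same shape:

    \min_{b_1,\dots,b_m} \sum_i \frac{g_i}{b_i}
    \quad \text{subject to} \quad \sum_i b_i = B

    \frac{\partial}{\partial b_i}\Big(\sum_j \frac{g_j}{b_j} + \lambda\big(\textstyle\sum_j b_j - B\big)\Big)
      = -\frac{g_i}{b_i^2} + \lambda = 0
    \;\Rightarrow\;
    b_i = \sqrt{g_i/\lambda}
    \;\Rightarrow\;
    b_i = B \cdot \frac{\sqrt{g_i}}{\sum_j \sqrt{g_j}}

i.e. the space allocated to each table is proportional to the square root of its number of groups, matching the slide's conclusion.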

  15. Algorithmic Strategies for Choosing the Phantoms • One way of allocating hash table space to a relation is in proportion to the number of groups in the relation • For a relation with g groups we allocate a constant times g; the constant is set large (so that collision rates stay low)

  16. Algorithmic Strategies for Choosing the Phantoms
Greedy by Increasing Space
• We calculate the benefit of each phantom according to the cost model
• We divide each phantom R's benefit by the space R would occupy, giving its benefit per unit space
• We choose the phantom with the largest benefit per unit space as the first phantom to instantiate, and repeat (a sketch of the loop follows this slide)
• The process ends when the benefit per unit space becomes negative
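A minimal sketch of this variant, with cost, space and the memory budget as assumed helpers (the budget check comes from the example on the next slide); again not the paper's code:

    def greedy_by_increasing_space(queries, candidate_phantoms, cost, space, budget):
        """Instantiate phantoms in order of benefit per unit of space."""
        config = set(queries)
        used = sum(space(q) for q in queries)
        remaining = set(candidate_phantoms)
        while remaining:
            def benefit_per_space(r):
                return (cost(config) - cost(config | {r})) / space(r)
            best = max(remaining, key=benefit_per_space)
            if benefit_per_space(best) <= 0 or used + space(best) > budget:
                break                           # negative benefit or space exhausted
            config.add(best)
            used += space(best)
            remaining.remove(best)
        return config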

  17. Algorithmic Strategies for Choosing the Phantoms: Greedy by Increasing Space, Example
Benefit/space is the metric, with the benefit again computed as E1 - E2 = (2 - 3x2)c1 + 3(x1 - x1'x2)c2.
• Available memory = 12000; the query tables AB, BC, BD and CD (g = 1846, 1946, 1899, 1999) take 7690, leaving 12000 - 7690 = 4310.
• Try ABCD (g = 2837, benefit = 2): it fits and is instantiated, leaving 4310 - 2837 = 1473.
• The remaining candidates ABC (g = 2117), ABD (g = 2387) and BCD (g = 2249) are considered next (the slide shows benefits of 1 and -1 among them), but none fits in the remaining 1473.
• The process ends when the benefit becomes negative or the space is exhausted.

  18. Drawback • The proportionality constant of slide 15 needs to be tuned to find the best performance

  19. Space Allocation • By Abel's impossibility theorem, general polynomial equations of degree five or higher cannot be solved algebraically (in radicals) • More general multi-level configurations generate equations of even higher degree, which are likewise unsolvable in closed form • We therefore use heuristics, based on the analysis available, to decide the space allocation for these unsolvable cases

  20. Space Allocation • Super-node with Linear Combination • Super-node with Square Root Combination • Linear Proportional Allocation • Square Root Proportional Allocation

  21. Conclusion • We address the problem of efficiently computing multiple aggregations over high-speed data streams • In a real DSMS, the value of g (the number of groups) is not known in advance.
