
Generating Wide-Area Content-Based Publish/Subscribe Workloads

Albert Yu, Pankaj K. Agarwal, Jun Yang. Duke University. Overview: publish/subscribe systems; data extraction; workload generation; conclusion and future work.


Presentation Transcript


  1. Generating Wide-Area Content-Based Publish/Subscribe Workloads Albert Yu, Pankaj K. Agarwal, Jun Yang Duke University

  2. Overview • Publish/Subscribe systems • Data extraction • Workload generation • Conclusion and future work

  3. Publish/Subscribe [Diagram labels: Publishers; Subscribers; Brokers; Broker network]

  4. Two tasks • Subscription processing • Match each published event against a large set of subscriptions. • Notification dissemination • Notify the interested subscribers over a network.

  5. Event and network spaces • Event space • Event is a point. • Subscription defines a region (ex: rectangle). • Network space • Network location is a point. • Distance between two network locations approximates the latency between them. [Figure labels: events e1, e2, e3, e4; subscriptions S1, S2]
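The two spaces on this slide can be illustrated with a minimal Python sketch. The class name `Subscription`, the helper names, and all coordinates are hypothetical and chosen only for illustration; the slides do not specify any data layout.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Subscription:
    """Hypothetical record: a rectangle in event space plus a network point."""
    name: str
    low: Tuple[float, ...]       # lower corner of the rectangle (event space)
    high: Tuple[float, ...]      # upper corner of the rectangle (event space)
    location: Tuple[float, ...]  # subscriber position (network space)

    def matches(self, event: Sequence[float]) -> bool:
        # An event is a point; it matches if it lies inside the rectangle.
        return all(lo <= x <= hi for lo, x, hi in zip(self.low, event, self.high))


def match(event: Sequence[float], subscriptions: List[Subscription]) -> List[str]:
    """Names of all subscriptions whose rectangle contains the event."""
    return [s.name for s in subscriptions if s.matches(event)]


def latency_estimate(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance in network space, used as a stand-in for latency."""
    return dist(a, b)


# Made-up example data.
subs = [
    Subscription("S1", (0.0, 0.0), (2.0, 2.0), (0.0, 0.0)),
    Subscription("S2", (1.0, 1.0), (4.0, 3.0), (3.0, 4.0)),
]
```

A linear scan like `match` is only a baseline; it is shown here to make the geometry concrete, not as a proposal for subscription processing at scale.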

  6. Lack of publicly available, realistic workloads • Privacy concerns and commercial interests • Lack of widely deployed systems supporting powerful content-based subscriptions

  7. Goal • Collect the limited statistics that are publicly available. • Generate a workload consistent with these statistics. • Generate other workloads according to user-defined deviations.

  8. Workload components • A set of subscriptions, each of which corresponds to: • A rectangular region of interest in the event space • A point in the network space • An event distribution over the event space • A set of brokers (optional) • A point in the network space

  9. Motivation: Broker-subscriber assignment • How to assign subscribers to brokers? • Clustering subscribers with similar interests • Potentially minimizes network traffic. • Letting a broker handle subscribers that are far away • Violates delivery latency requirements. • Increases communication costs. • Balancing semantic similarity and network proximity in dissemination network design is a hard optimization problem. • The optimal tradeoff depends on the number of events matching shared versus disjoint interests.

  10. Motivation: Broker-subscriber assignment • Take into account • Subscription interest • Subscription location • Event distribution • Exploring correlation between event and network spaces provides more optimization opportunities.

  11. Related work • Characterize pub/sub systems • Properties of RSS feeds [Liu et al.’05] • Stock popularity in NYSE [Tock et al.’05] • Simple synthesized workloads • Event space • Uniform and Gaussian distributions [Baldoni et al.’07] • Zipf distribution [Bianchi et al.’07] • Network space • Subscribers are located uniformly or randomly in the network [Baldoni et al.’07, Papaemmanouil and Cetintemel’05]

  12. Two work phases of our generator • Data extraction • Workload generation

  13. Data extraction [Diagram labels: Data extraction; Data generator; Summary statistics of subscriber interests, locations, and events]

  14. Data extraction (Cont’d) For each cell of the event space: • Subscription count • Event count • Distribution of subscribers over the network

  15. Data extraction (Cont’d) • Data from Google Groups • Data from PlanetLab • Our approach can be applied to other data sources that offer similar types of summary information.

  16. Google Groups • Google defines hierarchies over topics and regions. • Ex: Asian languages -> Eastern Asian languages -> Korean • Tag each group with three attributes.

  17. Google Groups (Cont’d) • Treat topic and language as dimensions of the event space. • Each interest • Pair of topic and language. • Ex: Interest = (t7, l4) [Figure: event space grid over topics t4 to t7 and languages l4 to l7, with the topic hierarchy (t1 to t7) and language hierarchy (l1 to l7)]

  18. Google Groups (Cont’d) • Collect a statistical summary for each interest • # messages per month posted to groups associated with that interest. • # members in each group associated with that interest.

  19. Google Groups (Cont’d) • Divide all Google groups associated with the same interest by their geographic regions. • Count #members within each geographic region. • Rough indication of the distribution of subscribers over the network.

  20. Google Groups (Cont’d) For each cell (interest) of the event space: • Subscription count • Event count • Distribution of subscribers by geographic region
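The per-cell summary can be sketched as a simple aggregation. This is a hedged illustration only: the record layout, the function name `summarize`, and the sample figures are hypothetical, and it assumes that member counts stand in for subscription counts and monthly message counts stand in for event counts.

```python
from collections import defaultdict

# Hypothetical group records: (topic, language, region, members, messages_per_month).
# The numbers are invented for illustration and are not from the slides.
groups = [
    ("soccer", "Korean", "Asia", 120, 40),
    ("soccer", "Korean", "US", 30, 10),
    ("consulting", "English", "US", 500, 200),
]


def summarize(records):
    """Aggregate group records into one summary per (topic, language) cell."""
    subscriptions = defaultdict(int)
    events = defaultdict(int)
    by_region = defaultdict(lambda: defaultdict(int))
    for topic, language, region, members, messages in records:
        interest = (topic, language)
        subscriptions[interest] += members        # assumed proxy for subscriptions
        events[interest] += messages              # assumed proxy for events
        by_region[interest][region] += members
    summary = {}
    for interest, total in subscriptions.items():
        shares = {r: n / total for r, n in by_region[interest].items()} if total else {}
        summary[interest] = {
            "subscriptions": total,
            "events": events[interest],
            "region_share": shares,  # fraction of subscribers per geographic region
        }
    return summary


summary = summarize(groups)
```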

  21. Network Location • Data from Google Groups gives us a rough distribution of subscriptions by geographic regions. • Still need actual network locations. • PlanetLab nodes • IP-address • Embed inter-node latencies in a low-dimensional Euclidean space [Dabek et al’04, Ledlie et al’02, Ng et al’02] [Diagram labels: Google Groups; Geographic regions; PlanetLab nodes; Coordinates]
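One way to turn a regional distribution into concrete network locations is to pick a region according to its share and then pick an embedded node in that region. The slides do not say how locations are actually drawn, so the sampling scheme, the function name `sample_location`, and the coordinates below are assumptions made for illustration.

```python
import random

# Hypothetical embedded coordinates of measurement nodes, keyed by region.
nodes_by_region = {
    "Asia": [(10.0, 2.0), (11.5, 3.0)],
    "US": [(0.0, 0.0), (1.0, 0.5), (2.0, -1.0)],
}


def sample_location(region_share, nodes, rng):
    """Pick a region by its subscriber share, then a node uniformly within it."""
    regions = list(region_share)
    weights = [region_share[r] for r in regions]
    region = rng.choices(regions, weights=weights, k=1)[0]
    return region, rng.choice(nodes[region])


rng = random.Random(0)  # fixed seed so a generated workload is reproducible
region, point = sample_location({"Asia": 0.8, "US": 0.2}, nodes_by_region, rng)
```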

  22. Popularities of interests • The top three interests are (business services, English), (small business, English), and (consulting, English). • Removing the top 24 interests reduces # members from 8.1 million to 4.3 million.

  23. Distribution of interests in event space for different geographic regions [Figure: three panels (Asia, US, Europe); labels in each panel: Simplified Chinese, English; Super-interest]

  24. Two work phases of our generator • Data extraction • Workload generation

  25. Workload generation • Input (from data extraction): summary statistics of subscriber interests, locations, and events • Parameters: skewness parameter; interest generalization parameter; range perturbation parameter; workload size parameter • Output: a set of range subscriptions; a set of events

  26. Workload generation • Interest diffusion • Interest generalization • Categorical-to-range subscription conversion • Workloads of different sizes

  27. Workload generation • Interest diffusion • Interest generalization • Categorical-to-range subscription conversion • Workloads of different sizes [Figure: event space grid of per-cell subscription counts before and after interest diffusion]

  28. Workload generation • Interest diffusion • Interest generalization • Categorical-to-range subscription conversion • Workloads of different sizes [Figure: topic hierarchy and language hierarchy]

  29. Workload generation • Interest diffusion • Interest generalization • Categorical-to-range subscription conversion • Workloads of different sizes • Ex: (soccer, Korean)

  30. Workload generation • Interest diffusion • Interest generalization • Categorical-to-range subscription conversion • Workloads of different sizes
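The slides do not spell out how a categorical interest becomes a range subscription. One plausible approach, sketched here purely as an assumption, is to number the leaves of each hierarchy in depth-first order so that every node covers a contiguous interval of leaf positions; a (topic, language) pair then maps to a rectangle. The parent/child structure of the hierarchies below is also assumed, as are the function names.

```python
def leaf_intervals(tree, root):
    """Map each node to the half-open interval [lo, hi) of leaf positions below it."""
    intervals = {}
    next_pos = 0

    def visit(node):
        nonlocal next_pos
        lo = next_pos
        children = tree.get(node, [])
        if not children:
            next_pos += 1  # a leaf occupies exactly one position
        for child in children:
            visit(child)
        intervals[node] = (lo, next_pos)

    visit(root)
    return intervals


# Assumed hierarchies: each maps a node to its list of children.
topics = {"t1": ["t2", "t3"], "t2": ["t4", "t5"], "t3": ["t6", "t7"]}
langs = {"l1": ["l2", "l3"], "l2": ["l4", "l5"], "l3": ["l6", "l7"]}

topic_iv = leaf_intervals(topics, "t1")
lang_iv = leaf_intervals(langs, "l1")


def to_rectangle(interest):
    """Convert a (topic, language) interest into a rectangle of leaf positions."""
    topic, lang = interest
    return topic_iv[topic], lang_iv[lang]
```

Under this numbering a leaf interest such as `("t7", "l4")` becomes a unit cell, while a generalized interest such as `("t3", "l2")` becomes a larger rectangle covering all of its descendants.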

  31. Interest diffusion • Popularity of an interest = number of subscriptions in its subtree. • Siblings of an interest are “related.” • Reduce the popularity variance among the siblings. [Figure: topic hierarchy t1 to t7]

  32. Interest diffusion • Goal • Given a user-specified value p, reduce all popularity variances by a factor of p for all levels of granularity. • Under the following constraints • Total subscription count remains constant. • Popularity of an interest = sum of child popularities. [Figure: topic hierarchy t1 to t7]

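  33. Subscription count [Figure: event space grid with the topic hierarchy (t1 to t7) and language hierarchy (l1 to l7)] • Coarse-level counts: 500, 230, 70, 400 (total 1200) • Fine-level counts under each coarse cell: 500 = 100 + 300 + 10 + 90; 230 = 200 + 10 + 10 + 10; 70 = 10 + 20 + 10 + 30; 400 = 100 + 50 + 200 + 50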

  34. Coarse-level counts: 500, 230, 70, 400 • Mean: $(500 + 230 + 70 + 400) / 4 = 300$ • Variance: $[(500 - 300)^2 + (230 - 300)^2 + (70 - 300)^2 + (400 - 300)^2] / 4 = 26950$

  35. Goal : Given a user-specified value p, reduce all popularity variances by a factor of p. • Coarse level (500, 230, 70, 400): mean 300, variance 26950 • Fine level, children of 500: mean 125, variance 11425 • Children of 230: mean 57.5, variance 6768.75 • Children of 70: mean 17.5, variance 68.75 • Children of 400: mean 100, variance 3750

  36. Goal : Given a user-specified value p, reduce all popularity variances by a factor of p. • Old popularities: C1, C2, C3, C4; old mean: C • New popularities: C1*, C2*, C3*, C4*; new mean: C* [Figure: cells over topics t2, t3 and languages l2, l3]
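The slide does not state the update rule. One rule that is consistent with the numbers in the P = 0.81 worked example (an inference from those numbers, not a formula taken from the slides) scales each deviation from the mean by $\sqrt{p}$:

```latex
C_i^{*} = C^{*} + \sqrt{p}\,\bigl(C_i - C\bigr), \qquad i = 1, \dots, 4,
\qquad\text{so that}\qquad
\frac{1}{4}\sum_{i=1}^{4}\bigl(C_i^{*} - C^{*}\bigr)^{2}
  = p \cdot \frac{1}{4}\sum_{i=1}^{4}\bigl(C_i - C\bigr)^{2}.
```

Because the deviations $C_i - C$ sum to zero, the new popularities sum to $4C^{*}$, which keeps each parent equal to the sum of its children. For instance, with $p = 0.81$, $C = C^{*} = 300$ and $C_1 = 500$, the rule gives $C_1^{*} = 300 + 0.9 \cdot 200 = 480$.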

  37. Proceed top-down from the coarsest level of granularity to the finest level of granularity. [Figure: subscription counts with per-level means and variances, as on slide 35]

  38. P = 0.81 • Coarse-level counts: 500 -> 480; 230 -> 237; 70 -> 93; 400 -> 390 • Coarse-level mean stays 300; variance 26950 -> 21829.5 • Fine-level means: 125 -> 120; 57.5 -> 59.25; 17.5 -> 23.25; 100 -> 97.5

  39. P = 0.81 • Fine-level counts: (100, 300, 10, 90) -> (97.5, 277.5, 16.5, 88.5); (200, 10, 10, 10) -> (187.5, 16.5, 16.5, 16.5); (10, 20, 10, 30) -> (16.5, 25.5, 16.5, 34.5); (100, 50, 200, 50) -> (97.5, 52.5, 187.5, 52.5) • Fine-level variances: 11425 -> 9254.25; 6768.75 -> 5482.69; 68.75 -> 55.6875; 3750 -> 3037.5
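The worked example can be reproduced with a short Python sketch. It assumes the update rule inferred from the slide numbers (scale each sibling's deviation from the mean by the square root of p, and recentre on the parent's new count divided by the number of children); the function names and the grouping of fine-level counts under each coarse cell are reconstructions, not the authors' code.

```python
from math import sqrt


def variance(values):
    """Population variance, as used in the slide calculations."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def diffuse(children, new_total, p):
    """Shrink sibling variance by a factor of p while making them sum to new_total."""
    n = len(children)
    old_mean = sum(children) / n
    new_mean = new_total / n
    scale = sqrt(p)  # scaling deviations by sqrt(p) scales the variance by p
    return [new_mean + scale * (c - old_mean) for c in children]


p = 0.81
coarse = [500, 230, 70, 400]
fine = [
    [100, 300, 10, 90],   # children of the cell with count 500
    [200, 10, 10, 10],    # children of 230
    [10, 20, 10, 30],     # children of 70
    [100, 50, 200, 50],   # children of 400
]

# Top-down: diffuse the coarse level first (the total stays at 1200),
# then diffuse each group of children toward its parent's new count.
new_coarse = diffuse(coarse, sum(coarse), p)
new_fine = [diffuse(kids, total, p) for kids, total in zip(fine, new_coarse)]
```

Up to floating-point rounding, `new_coarse` comes out as 480, 237, 93, 390 and the first group of `new_fine` as 97.5, 277.5, 16.5, 88.5, matching the slide. The resulting counts are fractional; how they are rounded to whole subscriptions is not covered here.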

  40. Along the language dimension [Figure panels: before diffusion; after diffusion]

  41. Along the topic dimension [Figure panels: before diffusion; after diffusion]

  42. Conclusion and Future work • Make the most of the limited amount of publicly available information to generate realistic workloads. • Make deviations easy for users to understand and control. • Extensions • Changes to event distributions and subscriptions over time. • Subscriptions beyond multi-dimensional range predicates. • Statistical models.

  43. Thank you
