Computing on JetStream: Streaming Analytics in the Wide-Area
Matvey Arye. Joint work with: Ari Rabkin, Sid Sen, Mike Freedman and Vivek Pai
The Rise of Global Distributed Systems
[Figure: a CDN's globally distributed edge nodes]
Traditional Analytics
[Figure: CDN nodes backhauling data to a centralized database]
Bandwidth is Expensive
[Figure: price trends 2005-2008, from "Above the Clouds", Armbrust et al.]
Bandwidth Trends
[Figure: ~20% annual trend lines; source: TeleGeography's Global Bandwidth Research Service]
Bandwidth Costs
• Amazon EC2 bandwidth: $0.05 per GB
• Wireless broadband: $2 per GB
• Cell phone broadband (AT&T/Verizon): $6 per GB (other providers are similar)
• Satellite bandwidth: $200-$460 per GB; may drop to ~$20
This Approach is Not Scalable
[Figure: a growing CDN overwhelming a centralized database]
The Coming Future: Dispersed Data
[Figure: data stored in dispersed databases at many sites]
Wide-Area Computer Systems
• Military: global network, drones/UAVs, surveillance
• Web services: CDNs, ad services, IaaS, social media
• Infrastructure: energy grid
Need: Queries on a Global View
• CDNs: popularity of websites globally; tracking security threats
• Military: threat "chatter" correlation; big-picture view of the battlefield
• Energy grid: wide-area view of energy production and expenditure
Standing Computation
[Figure: sources feed local cubes; processed data crosses the network bottleneck to a union cube, further processing, and finally the user]
Some queries are easy
Example: "Alert me when servers crash." Each crash is a single small event.
Others are hard
Example: "How popular are all of my domains? All of my URLs?" Answering requires aggregating request streams from every CDN node.
Before JetStream
[Figure: bandwidth needed for backhaul vs. a fixed 95%-level provisioning line, over two days]
• Analyst's remorse: not enough data, wasted bandwidth
• Buyer's remorse: system overload or overprovisioning
What Happens During Overload?
[Figure: over one day, bandwidth needed for backhaul exceeds available bandwidth; latency climbs as the queue grows]
Queue size grows without bound!
The JetStream Vision
[Figure: over two days, the bandwidth used by JetStream tracks available bandwidth even when backhaul demand exceeds it]
JetStream lets programs adapt to shortages and backfill later. This requires new abstractions for programmers.
System Architecture
[Figure: the JetStream API submits a query graph to the planner library; the coordinator sends the optimized query to daemons on worker nodes. The control plane spans coordinator and daemons; the data plane runs on compute resources at several sites, fed by stream sources.]
An Example Query
[Figure: at sites A and B, a file-read operator parses a log file into local storage, which a query reads every 10 s; results flow to central storage at site C]
Adaptive Degradation
[Figure: dataflow operators turn local data into summarized or approximated data before it crosses the network, steered by feedback control]
• Feedback control decides when to degrade
• User-defined policies decide how to degrade data
Monitoring Available Bandwidth
[Figure: time markers interleaved into the data stream]
• Sources insert time markers into the data stream every k seconds
• The network monitor records the time t it took to process the interval
• The ratio k/t estimates available capacity
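The marker scheme above can be sketched in a few lines. This is an illustrative monitor, not the JetStream implementation; the class and method names are assumptions.

```python
import time


class CongestionMonitor:
    """Estimates available capacity from periodic time markers.

    Sources insert a marker every `k` seconds of data; the monitor
    measures the wall-clock time `t` needed to forward one marker
    interval. The ratio k/t approximates the fraction of the incoming
    rate the link can sustain (1.0 = keeping up, < 1.0 = congested).
    """

    def __init__(self, k_seconds):
        self.k = k_seconds
        self.last_marker = None
        self.capacity_ratio = 1.0  # optimistic until we see two markers

    def on_marker(self, now=None):
        """Called when a time marker reaches the monitor; returns k/t."""
        now = now if now is not None else time.monotonic()
        if self.last_marker is not None:
            t = now - self.last_marker
            if t > 0:
                self.capacity_ratio = self.k / t
        self.last_marker = now
        return self.capacity_ratio
```

For example, if markers are inserted every 10 s of data but arrive 20 s apart, the estimate is 0.5: the link carries half the incoming rate.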
Ways to Degrade Data • Can coarsen a dimension • Can drop low-rank values
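Both degradations above are easy to illustrate on a per-URL request count table. A minimal sketch, with hypothetical helper names; the real operators run inside the dataflow, not on plain dicts.

```python
from collections import Counter
from urllib.parse import urlsplit


def coarsen_urls(counts):
    """Coarsen the URL dimension: aggregate per-URL counts to per-domain."""
    out = Counter()
    for url, n in counts.items():
        out[urlsplit(url).netloc] += n
    return dict(out)


def drop_low_rank(counts, k):
    """Drop low-rank values: keep only the k most frequent entries."""
    return dict(Counter(counts).most_common(k))
```

Coarsening preserves every request (just at lower resolution), while dropping low-rank values preserves exact counts but only for the heavy hitters.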
An Interface for Degradation (I)
[Figure: incoming data passes through a coarsening operator before the network connection, which is sending 4x too much]
First attempt: the policy is specified by choosing an operator. Operators read the congestion sensor and respond.
Quality Depends on the Level of Coarsening
[Figure: quality at different coarsening levels, measured on CoralCDN logs]
Getting the Most Data Quality for the Least Bandwidth
Issue: some degradation techniques give good quality but have unpredictable savings.
Solution: use multiple techniques.
• Start with the technique that gives the best quality
• Supplement with other techniques when bandwidth is scarce
This keeps latency bounded and minimizes analyst's remorse.
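The composite policy above can be sketched as follows. This is an assumed illustration, not JetStream code: `coarsen_fn` stands in for the best-quality technique (whose savings depend on the data), and random sampling stands in for the predictable supplement.

```python
import random


def composite_degrade(tuples, capacity_ratio, coarsen_fn, rng=None):
    """Fit `tuples` into `capacity_ratio` of their original volume.

    First apply the best-quality technique (coarsening); if its
    data-dependent savings are not enough, supplement with sampling
    so the output stays within budget and latency stays bounded.
    """
    rng = rng or random.Random(0)
    coarse = coarsen_fn(tuples)            # best quality, unpredictable savings
    budget = capacity_ratio * len(tuples)  # tuples the link can carry
    if len(coarse) <= budget:
        return coarse                      # coarsening alone suffices
    keep_p = budget / len(coarse)          # predictable supplement: sample
    return [t for t in coarse if rng.random() < keep_p]
```

When the link is only mildly constrained, the output is exactly the coarsened data; sampling kicks in only under severe shortage.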
Allowing Composite Policies
[Figure: incoming data passes through a sampling operator and a coarsening operator; the network is sending 4x too much]
• Chaos if two operators respond to the same sensor simultaneously
• Operator placement is constrained in ways that don't match the degradation policy
Introducing a Controller
[Figure: a controller watches the network connection ("sending 4x too much") and commands the sampling and coarsening operators, e.g. "drop 75% of data!"]
• Introduce a controller for each network connection that determines which degradations to apply
• Degradation policies are attached per controller
• Policy is no longer constrained by operator topology
Mergeability is Nontrivial
[Figure: 5-second windows (01-05, 06-10, ...) cannot be cleanly unified with 6-second windows (01-06, 07-12, ...), but they do merge cleanly into 10-second and 30-second windows]
• Data at arbitrary degradation levels can't be cleanly unified
• Degradation operators therefore need fixed levels
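The window example above is concrete enough to code. A minimal sketch (function name assumed): merging fine windows into coarse ones only works when the coarse width is a multiple of the fine width, which is exactly why 5 s merges into 10 s or 30 s but 6 s does not.

```python
def coarsen_windows(windows, old, new):
    """Merge counts from `old`-second windows into `new`-second windows.

    `windows` maps window start times to counts. Valid only when `new`
    is a multiple of `old`; otherwise a fine window would straddle two
    coarse windows and its count could not be split correctly.
    """
    if new % old != 0:
        raise ValueError(f"{old}s windows do not tile {new}s windows")
    out = {}
    for start, count in windows.items():
        key = start - start % new  # start of the enclosing coarse window
        out[key] = out.get(key, 0) + count
    return out
```

Fixing the levels in advance (5 s, 10 s, 30 s, ...) guarantees every pair of levels in use is mergeable.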
Interfacing with the Controller
[Figure: each operator reports its current state ("shrinking data by 50%") and its possible levels ([0%, 50%, 75%, 95%, ...]); the controller, seeing the network "sending 4x too much", responds "go to level 75%"]
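The controller's decision in the figure can be sketched as level selection over the operator's advertised fixed levels. Function names are assumptions, not the JetStream API; levels are expressed as fractions of data dropped.

```python
def needed_drop(overload):
    """Fraction of data to drop when sending `overload`x too much.

    Sending 4x too much means only 1/4 can get through, so 75%
    must be dropped.
    """
    return 1.0 - 1.0 / overload


def pick_level(levels, needed_fraction):
    """Choose the smallest advertised degradation level that drops at
    least `needed_fraction` of the data; fall back to the most
    aggressive level if none suffices."""
    for lvl in sorted(levels):
        if lvl >= needed_fraction:
            return lvl
    return max(levels)
```

With levels [0%, 50%, 75%, 95%] and a 4x overload, the controller picks 75%, matching the slide.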
A Planner for Policy
• Query planners: query + data distribution => execution plan. Why not do the same for degradation policy?
• What is the "query"? For us the policy affects data ingestion, so it affects all subsequent queries.
• Planning: all potential queries + data distribution => policy
Experimental Setup
• 80 nodes on the VICCI testbed in the US and Germany, aggregating to Princeton
• Policy: drop data if bandwidth is insufficient
Without Adaptation
[Figure: system behavior under bandwidth shaping, without adaptation]
With Adaptation
[Figure: system behavior under bandwidth shaping, with adaptation]
Operating on Dispersed Data
[Figure: dispersed databases at many sites]
Cube Dimensions
[Figure: a cube with a time dimension (01:01:00, 01:01:01) and a URL dimension (bar.com/m, bar.com/n, foo.com/q, foo.com/r)]
Cube Aggregates
[Figure: each cell, e.g. (01:01:01, bar.com/m), holds aggregates such as request count and max latency]
Cube Rollup
[Figure: along the URL dimension, URLs roll up to bar.com/* and foo.com/*]
Full Hierarchy
[Figure: leaf cells (5,90), (3,75), (8,199), (21,40) roll up by domain to (8,90) and (29,199), and to (37,199) at URL: *; each cell is a (request count, max latency) pair over times 01:01:00-01:01:01]
Rich Structure
[Figure: the rollup lattice over the time hierarchy (01:01:00-01:01:59) and URL hierarchy contains many intermediate cells, labeled A-E]
Two Kinds of Aggregation
• Rollups: across dimensions
• Inserts: across sources
The data cube model constrains the system to use the same aggregate function for both. The cost of this constraint: no queries on tuple arrival order. The benefit: reasoning becomes much easier!
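The single-aggregate-function constraint can be illustrated with the (request count, max latency) cells from the hierarchy figure. A minimal sketch: one `merge` function serves both inserts (combining contributions from different sources into one cell) and rollups (combining cells across a dimension).

```python
from functools import reduce


def merge(a, b):
    """Cube aggregate for (request_count, max_latency) cells.

    Counts add; max latencies take the max. Because the function is
    associative and commutative, it does not matter whether cells are
    merged across sources (inserts) or across dimensions (rollups),
    or in what order -- the result is the same.
    """
    return (a[0] + b[0], max(a[1], b[1]))


# Leaf cells from the slide's hierarchy figure:
cells = [(5, 90), (3, 75), (8, 199), (21, 40)]
```

Rolling the two bar.com cells up gives (8, 90), the two foo.com cells give (29, 199), and rolling everything up to URL: * gives (37, 199), matching the figure.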
An Example Query
[Figure: at sites A and B, a file-read operator parses a log file into local storage, which a query reads every 10 s; results flow to central storage at site C]
Subscribers
• Extract data from cubes to send downstream
• Control the latency vs. completeness trade-off
[Figure: at site A, the "query every 10 s" subscriber pulls from local storage fed by the file-read and log-parsing operators]
Subscriber API
A subscriber is an "operator++":
• Notified of every tuple inserted into the cube
• Can slice and roll up the cube
Possible policies:
• Wait for all upstream nodes to contribute
• Wait for a timer to go off
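The two policies above can be combined into one emission rule. A minimal sketch with assumed names (this is not the JetStream subscriber interface): a window is ready to send downstream either when all upstream sources have contributed or when a timeout expires, trading completeness for latency.

```python
class Subscriber:
    """Decides when a cube time window is complete enough to emit.

    Tracks, per window, which sources have contributed and when the
    first contribution arrived. `ready` fires when either all
    `n_sources` have reported (completeness) or `timeout` seconds
    have passed since the first insert (bounded latency).
    """

    def __init__(self, n_sources, timeout):
        self.n_sources = n_sources
        self.timeout = timeout
        self.windows = {}  # window start -> (sources seen, first-insert time)

    def on_insert(self, window, source, now):
        """Notification of a tuple insert; returns whether to emit."""
        seen, t0 = self.windows.get(window, (set(), now))
        seen.add(source)
        self.windows[window] = (seen, t0)
        return self.ready(window, now)

    def ready(self, window, now):
        seen, t0 = self.windows[window]
        return len(seen) == self.n_sources or now - t0 >= self.timeout
```

Setting the timeout very high recovers the pure "wait for all upstream nodes" policy; setting it to zero recovers the pure timer policy.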
Future Work
• Reliability of individual queries
• Statistical methods and multi-round protocols (currently working on improving top-k)
• Fairness that gives the best data quality
Thanks for listening!