Computing on Jetstream: Streaming Analytics In the Wide-Area

Matvey Arye. Joint work with: Ari Rabkin, Sid Sen, Mike Freedman and Vivek Pai

The Rise of Global Distributed Systems [Image: CDN]

Traditional Analytics: Centralized Database [Image: CDN]

Bandwidth is Expensive: Price Trends 2005-2008 [Above the Clouds, Armbrust et al.]

Bandwidth Trends [TeleGeography's Global Bandwidth Research Service]

Bandwidth Trends [Chart annotations: 20%, 20%] [TeleGeography's Global Bandwidth Research Service]

Bandwidth Costs
• Amazon EC2 bandwidth: $0.05 per GB
• Wireless broadband: $2 per GB
• Cell phone broadband (AT&T/Verizon): $6 per GB (other providers are similar)
• Satellite bandwidth: $200 - $460 per GB (may drop to ~$20)

This Approach is Not Scalable: Centralized Database [Image: CDN]

The Coming Future: Dispersed Data [Figure: several dispersed databases]

Wide-Area Computer Systems
• Web Services: CDNs, Ad Services, IaaS, Social Media
• Infrastructure: Energy Grid
• Military: Global Network, Drones, UAVs, Surveillance

Need Queries on a global view
• CDNs: popularity of websites globally; tracking security threats
• Military: threat “chatter” correlation; big picture view of battlefield
• Energy Grid: wide-area view of energy production and expenditure

Some queries are easy Server Crashed Alert me when servers crash

Others are hard [Figure: many streams of CDN requests] How popular are all of my domains? URLs?

Before JetStream [Figure: bandwidth needed for backhaul over time (two days), with the 95% level marked]
• Analyst’s remorse: not enough data, wasted bandwidth
• Buyer’s remorse: system overload or overprovisioning

What Happens During Overload? [Figure: bandwidth (available vs. needed for backhaul) and latency over time (one day)] Queue size grows without bound!

The JetStream Vision [Figure: bandwidth over time (two days): needed for backhaul, available, used by JetStream]
• JetStream lets programs adapt to shortages and backfill later.
• Need new abstractions for programmers

System Architecture [Figure labels: query graph, optimized query, JetStream API, planner library, coordinator daemon, control plane, worker node, data plane, compute resources (several sites), stream source]

An Example Query [Figure labels: Site A and Site B each have Local Storage, File Read Operator, Parse Log File, and Query Every 10 s; Site C has Central Storage]

Adaptive Degradation [Figure labels: local data, dataflow operators, network, summarized or approximated data, feedback control]
• Feedback control to decide when to degrade
• User-defined policies for how to degrade data

Monitoring Available Bandwidth [Figure: data stream with a time marker]
• Sources insert time markers into the data stream every k seconds
• Network monitor records the time t it took to process each interval
• => k/t estimates available capacity
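
The k/t estimate above can be sketched as follows. This is a minimal illustration, not JetStream's code: the function names `capacity_ratio` and `fraction_to_drop` are hypothetical, and reading k/t as "fraction of the offered load the link can carry" is an assumption drawn from the slide.

```python
def capacity_ratio(k_seconds: float, t_seconds: float) -> float:
    """Return k/t: markers are k seconds apart at the source, and the
    network monitor needed t seconds to process that interval.
    A value below 1 means the sender is offering more than the link carries."""
    if k_seconds <= 0 or t_seconds <= 0:
        raise ValueError("k and t must be positive")
    return k_seconds / t_seconds


def fraction_to_drop(ratio: float) -> float:
    """Fraction of data to shed so the remainder fits (assumed policy)."""
    return max(0.0, 1.0 - ratio)


# Markers every 10 s, but each interval took 40 s to get through:
ratio = capacity_ratio(10, 40)        # 0.25, i.e. sending 4x too much
print(ratio, fraction_to_drop(ratio))
```

With k = 10 and t = 40 the ratio is 0.25, which corresponds to the "sending 4x too much, drop 75% of data" situation used on later slides.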

Ways to Degrade Data • Can coarsen a dimension • Can drop low-rank values
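
Both degradations can be sketched on a toy table of request counts. This is an illustrative sketch only: the `(timestamp, url) -> count` layout and the helper names `coarsen_time` and `drop_low_rank` are assumptions, not part of the original slides.

```python
from collections import Counter


def coarsen_time(counts, bucket_seconds):
    """Coarsen the time dimension: fold per-second counts into
    buckets that are bucket_seconds wide."""
    out = Counter()
    for (ts, url), n in counts.items():
        out[(ts - ts % bucket_seconds, url)] += n
    return dict(out)


def drop_low_rank(counts, k):
    """Keep only the k cells with the highest counts."""
    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)


counts = {(0, "a"): 1, (1, "a"): 2, (5, "a"): 4, (1, "b"): 7}
coarse = coarsen_time(counts, 5)
print(coarse)
print(drop_low_rank(coarse, 2))
```

Coarsening preserves totals but loses resolution; dropping low-rank values preserves resolution for the popular cells but loses the tail.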

An Interface for Degradation (I) [Figure labels: incoming data, coarsening operator, sampled data, network; sensor reading: sending 4x too much]
First attempt: policy specified by choosing an operator. Operators read the congestion sensor and respond.

Coarsening reduces data volumes

But not always

Depends on level of coarsening [Data from CoralCDN logs]

Getting The Most Data Quality For The Least BW
Issue: Some degradation techniques result in good quality but have unpredictable savings.
Solution: Use multiple techniques
• Start off with technique that gives best quality
• Supplement with other techniques when BW scarce
=> Keeps latency bounded; minimizes analyst’s remorse
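
One way to picture "start with the best-quality technique, supplement when bandwidth is scarce" is an escalation ladder driven by congestion feedback. The sketch below is an assumption-laden illustration: the functions `escalate` and `relax`, and the idea of integer degradation steps per technique, are hypothetical and not taken from the slides.

```python
def escalate(levels, max_levels):
    """Take one more degradation step. Techniques are ordered best
    quality first; a later technique is only touched once every
    earlier one is already at its maximum level."""
    new = list(levels)
    for i, (cur, top) in enumerate(zip(levels, max_levels)):
        if cur < top:
            new[i] = cur + 1
            return new
    return new  # everything is already fully degraded


def relax(levels):
    """Undo one step, backing off the lowest-quality technique first."""
    new = list(levels)
    for i in reversed(range(len(new))):
        if new[i] > 0:
            new[i] -= 1
            return new
    return new


# Index 0 = coarsening (preferred), index 1 = sampling (fallback).
state, limits = [0, 0], [2, 3]
for _ in range(3):          # three congested intervals in a row
    state = escalate(state, limits)
print(state)                # coarsening exhausted, sampling engaged once
```

Because each step is taken only after observing congestion, the ladder does not need to know in advance how much a given technique will save.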

Allowing Composite Policies [Figure labels: incoming data, sampling operator, coarsening operator, network; sensor reading: sending 4x too much]
• Chaos if two operators are simultaneously responding to the same sensor
• Operator placement constrained in ways that don’t match degradation policy

Introducing a Controller [Figure labels: incoming data, sampling operator, coarsening operator, controller, network; sensor reading: sending 4x too much; controller command: drop 75% of data!]
• Introduce a controller for each network connection that determines which degradations to apply
• Degradation policies for each controller
• Policy no longer constrained by operator topology

Degradation

Mergeability is Nontrivial
Every 5: 01 - 05, 06 - 10, 11 - 15, 16 - 20, 21 - 25, 26 - 30
Every 6: 01 - 06, 07 - 12, 13 - 18, 19 - 24, 25 - 30
Every 10: 01 - 10, 11 - 20, 21 - 30
Every 30??: 01 - 30
• Can’t cleanly unify data at arbitrary degradation
• Degradation operators need to have fixed levels
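
The window arithmetic behind this slide can be checked with a short sketch. It assumes windows are aligned at a common origin, so two streams can be unified only at a window size that is a multiple of both; the helper names below are hypothetical.

```python
from math import gcd


def finest_common_window(a: int, b: int) -> int:
    """Smallest window size into which windows of size a and size b
    both nest without splitting (their least common multiple)."""
    return a * b // gcd(a, b)


def is_mergeable_ladder(levels) -> bool:
    """True if every level divides the next, so data degraded to any
    level can be rolled up to any coarser level."""
    ordered = sorted(levels)
    return all(hi % lo == 0 for lo, hi in zip(ordered, ordered[1:]))


print(finest_common_window(5, 6))    # windows only line up every 30
print(finest_common_window(5, 10))   # 5 nests inside 10
print(is_mergeable_ladder([1, 5, 10, 30]), is_mergeable_ladder([5, 6]))
```

Restricting operators to a fixed ladder such as 1, 5, 10, 30 keeps every pair of levels mergeable at the coarser of the two, which is one reading of why the slide asks for fixed levels.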

Interfacing with the Controller [Figure labels: incoming data, sampling operator, coarsening operator, controller, network]
• Sensor reading: Sending 4x too much
• Operator to controller: Shrinking data by 50%. Possible levels: [0%, 50%, 75%, 95%, …]
• Controller to operator: Go to level 75%
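
The exchange on this slide can be sketched as a level-selection function. This is a guess at the logic, not the actual JetStream interface: `choose_level` is a hypothetical name, levels are assumed to be "fraction of data removed", and the capacity ratio is assumed to be the fraction of the current load the link can carry.

```python
def choose_level(levels, capacity_ratio):
    """Pick the mildest advertised level whose remaining data fits.

    levels: fractions of data an operator can remove, e.g. 0.75 for 75%.
    capacity_ratio: fraction of the undegraded load the link can carry.
    Falls back to the most aggressive level if nothing fits."""
    for level in sorted(levels):
        if 1.0 - level <= capacity_ratio:
            return level
    return max(levels)


advertised = [0.0, 0.5, 0.75, 0.95]
print(choose_level(advertised, 0.25))   # sending 4x too much
```

With the link able to carry a quarter of the load, the function selects 0.75, matching the "Go to level 75%" command in the figure.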

A Planner for Policy
Query planners: Query + Data Distribution => Execution Plan
Why not do this for degradation policy? What is the query? For us the policy affects the data ingestion => affects all subsequent queries
Planning: All Potential Queries + Data Distribution => Policy

Experimental Setup [Figure label: Princeton]
• 80 nodes on VICCI testbed in US and Germany
• Policy: Drop data if insufficient BW

Without Adaptation Bandwidth Shaping

WITH Adaptation Bandwidth Shaping

Composite policies

Operating on Dispersed Data [Figure: several dispersed databases]

Cube Dimensions [Figure: grid with a Time axis (01:01:00, 01:01:01) and a URL axis (bar.com/m, bar.com/n, foo.com/q, foo.com/r)]

Cube Aggregates [Figure: the cell at Time 01:01:01, URL bar.com/m holds the aggregates Count Requests and Max Latency]

Cube Rollup [Figure: on the URL axis, bar.com/m and bar.com/n roll up to bar.com/*; foo.com/q and foo.com/r roll up to foo.com/*; Time axis: 01:01:00, 01:01:01]

Full Hierarchy [Figure: leaf cells (5,90), (3,75), (8,199), (21,40) on the URL axis (bar.com/m, bar.com/n, foo.com/q, foo.com/r) roll up to (8,90) and (29,199), and then to (37,199) at URL: *, Time: 01:01:01; Time axis: 01:01:00, 01:01:01]
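
The numbers in this hierarchy are consistent with each cell holding a (request count, max latency) pair, where rollup sums the counts and takes the maximum latency. The sketch below reproduces them; pairing each leaf value with a specific URL in the order listed is an assumption, and the function names are hypothetical.

```python
from functools import reduce


def merge(a, b):
    """Combine two (count, max_latency) aggregates."""
    return (a[0] + b[0], max(a[1], b[1]))


def rollup_by_domain(cells):
    """Roll URL cells up to one cell per domain, e.g. bar.com/*."""
    out = {}
    for url, agg in cells.items():
        domain = url.split("/")[0] + "/*"
        out[domain] = merge(out[domain], agg) if domain in out else agg
    return out


leaves = {
    "bar.com/m": (5, 90),
    "bar.com/n": (3, 75),
    "foo.com/q": (8, 199),
    "foo.com/r": (21, 40),
}
domains = rollup_by_domain(leaves)
print(domains)                          # per-domain aggregates
print(reduce(merge, domains.values()))  # aggregate for URL: *
```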

Rich Structure [Figure: cube cells (5,90), (3,75), (8,199), (21,40) over URLs bar.com/m, bar.com/n, foo.com/q, foo.com/r and times 01:01:00 through 01:01:59, with elements labeled A, B, C, D, E, …]

Two kinds of aggregation
• Rollups – Across Dimensions
• Inserts – Across Sources
The data cube model constrains the system to use the same aggregate function for both.
Constraint: no queries on tuple arrival order. Makes reasoning easier!
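
A small sketch of why the arrival-order constraint makes reasoning easier: if inserts from different sources are folded into a cell with the same commutative, associative aggregate used for rollups, the cube ends up in the same state whatever order the tuples arrive in. The `Cube` class and the (count, max latency) aggregate are illustrative assumptions, not JetStream's implementation.

```python
from itertools import permutations


def merge(a, b):
    """Aggregate for (count, max_latency) pairs; order-insensitive."""
    return (a[0] + b[0], max(a[1], b[1]))


class Cube:
    """Toy cube: maps (time, url) keys to merged aggregates."""

    def __init__(self):
        self.cells = {}

    def insert(self, key, agg):
        # An insert is the same merge a rollup would perform.
        self.cells[key] = merge(self.cells[key], agg) if key in self.cells else agg


updates = [
    (("01:01:01", "bar.com/m"), (2, 90)),   # from one source
    (("01:01:01", "bar.com/m"), (3, 40)),   # from another source
    (("01:01:01", "bar.com/n"), (3, 75)),
]

results = []
for order in permutations(updates):
    cube = Cube()
    for key, agg in order:
        cube.insert(key, agg)
    results.append(cube.cells)

print(all(r == results[0] for r in results))
```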

An Example Query [Figure labels: Site A and Site B each have Local Storage, File Read Operator, Parse Log File, and Query Every 10 s; Site C has Central Storage]

Subscribers [Figure labels at Site A: File Read Operator, Parse Log File, Local Storage, Query Every 10 s]
• Extract data from cubes to send downstream
• Control latency vs completeness tradeoff

Subscriber API
• Notified of every tuple inserted into cube
• Can slice and rollup cubes
Possible policies:
• Wait for all upstream nodes to contribute
• Wait for a timer to go off
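
The two policies can be combined in a sketch of a subscriber that emits a window once every upstream node has contributed or a timer has expired, whichever comes first. Everything here is hypothetical: the class name, method names, and the use of caller-supplied timestamps are assumptions made for illustration, not the JetStream subscriber API.

```python
class WindowSubscriber:
    """Decides when a window of cube data is ready to send downstream."""

    def __init__(self, upstream, timeout_s):
        self.upstream = set(upstream)   # nodes expected to contribute
        self.timeout_s = timeout_s      # latency bound for the window
        self.seen = set()
        self.opened_at = None

    def on_insert(self, source, now):
        """Called for every tuple inserted into the cube."""
        if self.opened_at is None:
            self.opened_at = now
        self.seen.add(source)

    def should_emit(self, now):
        if self.opened_at is None:
            return False
        complete = self.upstream <= self.seen
        timed_out = now - self.opened_at >= self.timeout_s
        return complete or timed_out


sub = WindowSubscriber({"A", "B"}, timeout_s=10)
sub.on_insert("A", now=0)
print(sub.should_emit(now=5))    # B is missing and the timer has not fired
sub.on_insert("B", now=6)
print(sub.should_emit(now=6))    # all upstream nodes have contributed
```

A short timeout favors latency; a long one favors completeness, which is the tradeoff subscribers are meant to control.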

Future Work
• Reliability
• Individual queries
• Statistical methods
• Multi-round protocols
• Currently working on improving top-k
• Fairness that gives best data quality
Thanks for listening!
