Making Every Bit Count in Wide Area Analytics. Ariel Rabkin. Joint work with: Matvey Arye, Siddhartha Sen, Michael J. Freedman, and Vivek Pai. Global Systems Have Global Data. The Rise of Big Distributed Data. CDNs: Akamai has ~20 million requests per second
Making Every Bit Count in Wide Area Analytics
Ariel Rabkin
Joint work with: Matvey Arye, Siddhartha Sen, Michael J. Freedman, and Vivek Pai
The Rise of Big Distributed Data
• CDNs:
  • Akamai has ~20 million requests per second
  • CloudFlare has about 300 MB/s of logs; volume doubles every 4 months
• Sensor data (e.g., power grid, highways)
• Smart camera networks
Trends
[Figure: data volumes and wide-area bandwidth, plotted as amount per dollar over time]
Analyzing Low-rate Events is Easy
Query: “Alert me when server crashes!” Event: “Server crashed!”
High-rate Events can be Costly
[Figure: many streams of requests]
Query: every minute, compute request counts by URL
Backhaul has Bad Dynamics
Example: backhaul count of events every 5 minutes. Choice of summaries is made upfront, statically.
• Buyer’s remorse: chose to collect unnecessary and expensive data
• Analyst’s remorse: summaries insufficient for analysis; no way to retroactively get more data
Local Storage!
[Figure: streams of requests feeding local aggregation and storage]
Query: every minute, compute request counts by URL
Challenge: Bandwidth Scarcity
Analyst: “I want the request count for every URL every second.”
System: “I can’t do that, Ari. That costs 100 MB/sec. You only have 12 MB/sec. Want to impose a rank cutoff, value cutoff, or change frequency?”
Analyst: “Can I get the top 1000 URLs every second?”
System: “I can do that for 900 KB/sec.”
Analyst: “Great, do it!”
Challenge: Varying Scarcity
[Figure: bandwidth over time, with curves labeled “Needed”, “Available” (marked with question marks), and “Can do”]
Policy: first aggregate over longer time periods, up to 30 seconds. Then only keep the top URLs.
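The two-step policy on this slide can be sketched in Python as follows. This is a minimal sketch, not the system's actual policy language: the function name `choose_degradation`, its parameters, and the cost model (cost falls as 1/window and scales linearly with the number of URLs kept) are all assumptions made for illustration.

```python
def choose_degradation(needed_bps: int, available_bps: int,
                       total_urls: int, max_window_s: int = 30):
    """Pick (window_seconds, top_k) so the stream fits the available bandwidth.

    Assumption (not from the slides): cost scales as 1/window and
    linearly with the number of URLs kept.
    """
    if needed_bps <= 0 or available_bps <= 0 or total_urls <= 0:
        raise ValueError("rates and URL count must be positive")
    # Step 1: aggregate over longer time periods, capped at max_window_s.
    window = min(max_window_s, -(-needed_bps // available_bps))  # ceiling division
    if needed_bps <= available_bps * window:
        return window, total_urls
    # Step 2: still too expensive at the longest window, so keep only the top URLs.
    top_k = max(1, total_urls * available_bps * window // needed_bps)
    return window, top_k


# 100 MB/s needed, 12 MB/s available: a longer window alone is enough.
mild = choose_degradation(100_000_000, 12_000_000, total_urls=100_000)
# 100 MB/s needed, 1 MB/s available: the window hits 30 s, then a rank cutoff applies.
severe = choose_degradation(100_000_000, 1_000_000, total_urls=100_000)
```

Under these assumptions, time aggregation is always tried first, and the rank cutoff only kicks in once the 30-second cap is reached.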
Data Processing Requirements
• Aggregatable: StoredData += Update
• Merge-able: Data + Data = Merged Representation
• Reducible: Data → Data (a smaller representation)
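The three requirements can be illustrated with per-URL request counts. This is a minimal sketch in Python; the helper names (`update`, `merge`, `reduce_top_k`) and the choice of a top-k cutoff as the reduction are assumptions, not taken from the slides.

```python
from collections import Counter


def update(stored: Counter, url: str, n: int = 1) -> None:
    """Aggregatable: fold one update into the stored data in place."""
    stored[url] += n


def merge(a: Counter, b: Counter) -> Counter:
    """Merge-able: combine data from two sites into one representation."""
    return a + b


def reduce_top_k(data: Counter, k: int) -> Counter:
    """Reducible: produce a smaller version (here, the k largest counts)."""
    return Counter(dict(data.most_common(k)))


site_a, site_b = Counter(), Counter()
for url in ["/a", "/a", "/b"]:
    update(site_a, url)
for url in ["/a", "/c"]:
    update(site_b, url)

merged = merge(site_a, site_b)   # counts from both sites
top1 = reduce_top_k(merged, 1)   # only the most-requested URL survives
```

Counts satisfy all three properties because addition is associative and commutative; other aggregates may need a different merge function.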
[Figure: raw byte strings (e.g., MapReduce); database tables]
The Data Cube Model
Cube: a multidimensional array, with one or more aggregates, indexed by a set of dimensions.
• Aggregation function used for:
  • Updates
  • Roll-ups
  • Merging cubes
  • Degrading cubes
Examples: roll-up of mysite.com by time from 12:00 to 12:01: 8. Roll-up of sites at time 12:00: 16.
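A roll-up can be sketched as summing the aggregate over every dimension that is collapsed. In the Python sketch below, the cell values are hypothetical: the slide's underlying table is not recoverable, so the counts were chosen only so that the two roll-ups reproduce the totals quoted above (8 and 16). The `roll_up` helper and the site name `yoursite.com` are likewise illustrative.

```python
from collections import defaultdict

# Hypothetical cells: (site, minute) -> request count.
cells = {
    ("mysite.com", "12:00"): 3,
    ("mysite.com", "12:01"): 5,
    ("yoursite.com", "12:00"): 13,
    ("yoursite.com", "12:01"): 2,
}


def roll_up(cells, keep):
    """Sum the aggregate over every dimension whose index is not in `keep`."""
    out = defaultdict(int)
    for key, count in cells.items():
        out[tuple(key[i] for i in keep)] += count
    return dict(out)


by_site = roll_up(cells, keep=[0])  # collapse the time dimension
by_time = roll_up(cells, keep=[1])  # collapse the site dimension
```

The same summing function would also serve for updates and for merging two cubes, which is the point of building the model around a single aggregation function.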
[Figure: raw byte strings (e.g., MapReduce); database tables; data cube]
A Vision for Wide-Area Analytics
[Figure: local cubes and a merged cube, connected by dataflow operators across a network bottleneck]
Dataflow adapted to bandwidth
Adaptivity
[Figure: local cube, dataflow operators, network bottleneck]
Adaptivity
[Figure: local cube, summarized cube, dataflow operators, network bottleneck, feedback control]
• Key ingredients:
  • Cube summarization as mechanism
  • User-defined policies
  • Feedback control
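One way to picture the feedback-control ingredient is a loop that raises or lowers a summarization level based on how much data is queued behind the bottleneck. The Python sketch below is an assumption-heavy illustration: the controller `next_level`, the use of backlog as the feedback signal, and the watermark thresholds are all hypothetical, not the system's actual mechanism.

```python
def next_level(level: int, backlog_bytes: int, max_level: int,
               high_water: int = 1_000_000, low_water: int = 100_000) -> int:
    """Return the next summarization level (0 = full detail).

    Hypothetical rule: summarize more when the send queue backs up,
    and restore detail when the queue drains.
    """
    if backlog_bytes > high_water:
        return min(level + 1, max_level)
    if backlog_bytes < low_water:
        return max(level - 1, 0)
    return level


# Replay a made-up sequence of backlog measurements through the controller.
level = 0
trace = []
for backlog in [0, 2_000_000, 3_000_000, 500_000, 50_000]:
    level = next_level(level, backlog, max_level=3)
    trace.append(level)
```

The gap between the two watermarks keeps the level from oscillating when the backlog hovers near a single threshold.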
Conclusions
• The hard problems in wide-area analysis:
  • Reasoning about bandwidth/data quality tradeoffs
  • Optimizing data quality under changing conditions
  • Jointly optimizing bandwidth and other resources
• We are building a system.
• We call it JetStream. Stay tuned…
Bandwidth Costs do not Decline Smoothly [TeleGeography's Global Bandwidth Research Service]
2012 Bandwidth Price Shifts
[Figure: route Frankfurt-London; labels: 20%, 20%]
[TeleGeography's Global Bandwidth Research Service]
Diurnal Load Makes Overprovisioning Expensive
• Leased lines waste capacity during off-peak
• Public internet gets congested during peak
Benefit: Iteration
Can iteratively pose different queries.
[Figure: streams of requests feeding local aggregation and storage; a revised query is posed]
Benefit: Adaptation
Can adapt the data volume collected to the available bandwidth.
[Figure: streams of requests feeding local aggregation and storage; limited bandwidth]
Benefit: Adaptation
Can adapt the data volume collected to the available bandwidth.
[Figure: streams of requests feeding local aggregation and storage; ample bandwidth]
A dataflow model for wide-area analytics
• Operator: defines data transformation on tuples. Can do input or output.
• Cube: structured storage of data
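The operator/cube split might look like the following in Python. This is a sketch under stated assumptions: the tuple shape `(url, count)`, the operator names `parse_lines` and `only_prefix`, the log format, and the `Cube` class are all hypothetical and are not the system's API.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

Row = Tuple[str, int]  # hypothetical tuple shape: (url, count)


def parse_lines(lines: Iterable[str]) -> Iterator[Row]:
    """Input operator: turn raw log lines into (url, 1) tuples."""
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            yield fields[0], 1


def only_prefix(rows: Iterable[Row], prefix: str) -> Iterator[Row]:
    """Transformation operator: pass through URLs under a given prefix."""
    return (row for row in rows if row[0].startswith(prefix))


class Cube:
    """Structured storage: request counts indexed by URL."""

    def __init__(self) -> None:
        self.cells: Counter = Counter()

    def insert(self, row: Row) -> None:
        url, count = row
        self.cells[url] += count


# Made-up log lines in an assumed "<url> <status>" format.
log = ["/img/a.png 200", "/api/x 500", "/img/a.png 304", ""]
cube = Cube()
for row in only_prefix(parse_lines(log), "/img/"):
    cube.insert(row)
```

Operators here are plain generators chained together, and the cube is the only stateful piece, which mirrors the division of labor the slide describes.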
[Figure: sources generate data, which is ingested into local cubes; processing yields processed data; network bottleneck]
[Figure: processing; processed data]