Scheduling and Guided Search for Cloud Applications Cristiana Amza
Big Data is Here • Data growth (by 2015) = 100x in ten years [IDC 2012] • Population growth = 10% in ten years • Monetizing data for commerce, health, science, services, … [source: Economist] courtesy of Babak Falsafi
Data Growing Faster than Technology • Growing technology gap • WinterCorp Survey, www.wintercorp.com courtesy of Babak Falsafi
Challenge 1: Costs of a Datacenter • 3-year server and 10-year infrastructure amortization • Estimated costs of a datacenter: 46,000 servers, $3,500,000 per month to run • Servers & power are 88% of total cost Data courtesy of James Hamilton [SIGMOD’11 Keynote]
Datacenter Energy Not Sustainable • A modern datacenter: 17x a football stadium, $3 billion, 20 MW! • In the modern world, 6% of all electricity, growing at >20%! [Figure: billion kilowatt-hours/year, 2001–2017, versus the consumption of 50 million homes] courtesy of Babak Falsafi
Challenge 2: Data Management (Anomalies) Cloudy with a chance of failure • “Whoops – Facebook loses 1 billion photos” Chris Keall, 10 March 2009, The National Business Review • “When the Cloud Fails: T-Mobile, Microsoft Lose Sidekick Customer Data” Om Malik, 10 October 2009, gigaom.com • “Amazon Can’t Recover All Its Cloud Data From Outage” Max Eddy, 27 April 2011, www.geekosystem.com • “Cloud Storage Often Results in Data Loss” Chad Brooks, 10 October 2011, www.businessnewsdaily.com courtesy of Haryadi S. Gunawi
Problems are entrenched • I have been working in this area since 2001 • Problems have only grown more complex/intractable • Same old distributed-systems problems • New: levels of indirection (remote processing, deep software stacks, VMs, etc.) • E.g., cloud monitoring and logging data (terabytes per day) • But no notable success stories with analyzing such data
Challenge 3: Paradigm Limitations • MapReduce parallelism: embarrassingly parallel/simplistic • Works for aggregate ops • Simple scheduling [Figure: Map tasks feeding Reduce tasks]
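The aggregate-only parallelism criticized above can be sketched in a few lines; the word-count example below is illustrative, not taken from the talk:

```python
# Minimal sketch of the MapReduce aggregate pattern.
from collections import defaultdict

def map_phase(records):
    # Each mapper emits (key, value) pairs independently:
    # the "embarrassingly parallel" part.
    for record in records:
        for word in record.split():
            yield word, 1

def reduce_phase(pairs):
    # The only coordination is a simple aggregate op per key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

result = reduce_phase(map_phase(["big data", "big cloud"]))
# result == {"big": 2, "data": 1, "cloud": 1}
```

The scheduling is simple precisely because the map tasks share no state and the reduce step is a single associative aggregate.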
Hadoop/Enterprise: Separate Storage Silos • Hardware $$$ • Cross-silo data management $$$ • Periodic data ingest
What can we do? Find Meaningful Apps • We can produce/find tons of data • Need to analyze something of vital importance to justify draining vital resources • Otherwise the simplest solution is to stop creating the problem(s)
What can we do? Consolidate Research Agendas • Find overarching, mission-critical paradigms • State of the art: MapReduce is too simplistic • Develop standards, common tools, and benchmarks • Integrate solutions, think holistically • Enforce accountability for the data center/Cloud provider
Opportunity 1: The Brain Challenge • Started to explore Neuroscience workloads in 2010 • A Brain Summit/Workshop was held at IBM TJ Watson • Started a collaboration with Stephen Strother at Baycrest a year later • An application that is both data- and compute-intensive • Boils down to an optimization problem in a highly parametrized search space
Opportunity 2: Guided Modeling • Performance modeling, energy modeling, anomaly modeling, biophysical modeling • All tend to be interpolations/searches/optimizations in highly parametrized spaces • Key idea: develop a common framework that works for all • Extend the way MapReduce standardized aggregation ops • Guidance: operators such as “Reduction”, “Linear Interpolation”, etc.
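One way to picture standardized guidance operators is as a small catalog of named fitting/aggregation primitives; the catalog and operator names below are a hypothetical sketch, not the actual framework API:

```python
# Hypothetical catalog of guidance operators, extending the way
# MapReduce standardized aggregation ops (names are illustrative).
import numpy as np

CATALOG = {
    # Aggregate a set of samples down to one number.
    "Reduction": lambda xs, ys: float(np.sum(ys)),
    # Fit y = a*x + b and return the coefficients (highest degree first).
    "LinearInterpolation": lambda xs, ys: np.polyfit(xs, ys, 1),
}

def apply_guidance(name, xs, ys):
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    return CATALOG[name](xs, ys)

coeffs = apply_guidance("LinearInterpolation", [1, 2, 3], [2, 4, 6])
# coeffs ~ [2.0, 0.0]: slope 2, intercept 0
```

A modeler would then express domain knowledge ("this relationship is linear") by naming an operator rather than hand-deriving a model.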
Building Models Takes Time • Actuate a live system and take experimental samples • 32 data points each for storage memory and DB memory (up to 16 GB), sampled in 512 MB chunks; 15 minutes for each point • Exhaustive sampling of the 32x32 = 1024 sampling points takes 11 days! [Figure: average latency surface over storage memory vs. DB memory, from low to high latency]
Goal: Reduce Time by Model Reuse Provide a resource-to-performance mapping: • Dynamic Resource Allocation [FAST’09] • Capacity Planning [SIGMOD’13] • What-if Queries [SIGMETRICS’10] • Anomaly Detection [SIGMETRICS’10] • Towards a Virtual Brain Model [HPCS’14] [Figure: average latency as a function of storage and DB resources, from less to more]
Management Interactions Service Provider: • Use fewer resources: Customer wants 1000 TPS. What is the most efficient configuration (e.g., CPU/memory) to deliver it? • Share resources: Can I place customer A’s DB alongside customer B’s DB? Will their service levels be met? Customer DBA: • Use the right amount of resources: What will the performance (e.g., query latency) be if I use 8 GB of RAM instead of 16 GB? • Solve performance problems: I’m only getting 500 TPS. What’s wrong? Is the cloud to blame? Need to build performance models to understand
Libraries/Archive of Models • Analytical models (knowledge driven): no samples required; difficult to derive; fragile to maintain • Gray-box models: few samples needed; can be adapted; still need to derive • Black-box models (data driven): minimal assumptions; need lots of samples; could over-fit Use an ensemble of models
Model Ensemble Approach 1. Guidance as trends and patterns 2. Automatically tune the models using data 3. Rank the models (test & rank); use a blend; repeat if needed
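The test-and-rank step can be sketched as fitting several candidate shapes to the sampled points and scoring each on held-out data; the polynomial candidates and synthetic data below are illustrative, not the actual system:

```python
# Sketch of "test & rank": score candidate models on held-out samples
# and keep the one with the lowest error (or blend the best few).
import numpy as np

def fit_and_score(xs, ys, degree, train, test):
    coeffs = np.polyfit(xs[train], ys[train], degree)
    pred = np.polyval(coeffs, xs[test])
    return float(np.mean((pred - ys[test]) ** 2))   # held-out MSE

xs = np.arange(10.0)
ys = 3.0 * xs + 1.0                       # synthetic, truly linear data
train, test = np.arange(0, 8), np.arange(8, 10)
scores = {d: fit_and_score(xs, ys, d, train, test) for d in (1, 2, 3)}
best = min(scores, key=scores.get)        # lowest held-out error wins
```

On this noise-free linear data the degree-1 candidate reproduces the held-out points essentially exactly; on real samples the ranking is what separates models that generalize from those that over-fit.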
How to Specify Guidance • SelfTalk: a language to describe relationships • Specifies model inputs and parameters • Provides a catalog of common functions • Curve-fitting and validation algorithms • Details in the SIGMETRICS’10 paper
Refine Models Using Data • Use hints to link relations to metrics • This hint says that CPU is linearly correlated to QPS, but the working set should be in RAM • Learns parameters using data (or requests more data)
    HINT myHint
    RELATION LINEAR(x,y)
    METRIC (x,y) {
      x.name = MySQL.CPU
      y.name = MySQL.QPS
    }
    CONTEXT (a) {
      a.name = MySQL.BufPoolAlloc
      a.value >= 512MB
    }
Rank Models and Blend 1. Divide the search space into regions 2. Use n-fold cross-validation to rank 3. Associate the best model with each region [Figure: average latency surface over storage memory vs. DB memory (16 GB), partitioned into regions]
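The per-region association above can be sketched in one dimension; the region boundaries and candidate models below are made up for illustration:

```python
# Sketch: divide the parameter range into regions and attach the
# best-scoring candidate model to each region.
import numpy as np

def best_model_per_region(xs, ys, boundaries, candidates):
    assignment = {}
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        mask = (xs >= lo) & (xs < hi)
        errs = {name: float(np.mean((f(xs[mask]) - ys[mask]) ** 2))
                for name, f in candidates.items()}
        assignment[(lo, hi)] = min(errs, key=errs.get)
    return assignment

xs = np.linspace(0, 2, 40)
ys = np.where(xs < 1, xs, 1.0)            # linear ramp, then saturation
models = {"linear": lambda x: x,
          "constant": lambda x: np.ones_like(x)}
regions = best_model_per_region(xs, ys, [0.0, 1.0, 2.01], models)
# {(0.0, 1.0): "linear", (1.0, 2.01): "constant"}
```

This mirrors the latency surface on the slide: different resource regions follow different shapes, so a single global model would fit all of them poorly.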
Prototype “How should I partition resources between two applications A and B?” • Inputs: SelfTalk, a catalog of models, data • System: MySQL & storage server
Runtime Engine • Model a new workload • Reuse similar data/models (model matching) • Expand samples; refine if necessary (model validation and refinement) • A selective, iterative, ensemble learning process backed by a model repository
Ex 1: Predicting Buffer Pool Latencies • Analytical model replaced with data-driven models
Ex 3: Modeling and Job Scheduling for the Brain • Data centers usually have a heterogeneous structure: a variety of multicores, GPUs, etc. • Different stages of the application have different resource demands (CPU- versus data-intensive) • Job scheduling onto available resources becomes non-trivial • Guided modeling helps
Functional MRI • Goal: studying brain functionality • Procedure: asking patients (subjects) to perform a task and capturing brain slices, measuring blood oxygen level • Correlating images to identify brain activity
Functional MRI • Overall pipeline: Subject Selection & Experimental Design → Data Acquisition → Data Preprocessing → Analysis Model → Results
NPAIRS as Our Application • NPAIRS goal: processing images to find image correlations • Feature extraction: a common technique in image-processing applications (e.g., face recognition) • Using Principal Component Analysis to extract eigenvectors • Finding a set of eigenvectors that is a good representative of the whole set of subjects • Feeds into machine learning methods, heuristic search, etc.
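The PCA step can be sketched as an eigendecomposition of the image covariance; this is a generic illustration with random data, not the NPAIRS code:

```python
# Sketch of PCA feature extraction: keep the leading eigenvectors of
# the covariance matrix as a compact basis for the image set.
import numpy as np

def top_eigenvectors(data, k):
    # data: one flattened image per row
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]            # top-k principal directions

rng = np.random.default_rng(0)
images = rng.normal(size=(20, 5))          # 20 "images" of 5 voxels each
basis = top_eigenvectors(images, 2)        # shape (5, 2), orthonormal columns
```

Projecting each subject's images onto such a basis is what reduces the search to a (still highly parametrized) lower-dimensional space.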
NPAIRS split-half resampling [Figure: the full data is split (split J) into two halves; each split-half’s scans plus “design” yield a statistical parametric map (SPM_SJ1, SPM_SJ2); comparing the two maps gives a reproducibility estimate (r)]
Job Modeling: Exhaustive Sampling • Sample set: 1 to 99 • Fitness score R²: 0.995 • Total run time: 64933
Uniform Sampling • Sample set: 2, 12, 22, 32, 42, 52, 62, 72, 82, 92 • Using 5-fold cross-validation • Fitness score R²: 0.990 • Total run time: 6368
Guidance: Step Function + Fast Sampling • Sample set: 2, 4, 8, 12, 16, 20, 24, 32, 48, 96 • Fitness score R²: 0.993 • Total run time: 5313 • 16.6% time saving!
Conclusions • Big Data processing is driving a quantum leap in IT • Hampered by slow progress in data center management • We propose to investigate guided modeling • Promising preliminary results with Neuroscience workloads • 7x speedup of NPAIRS on small CPU+GPU cluster
Modeling Procedure • Get the sample set and split it using 5-fold cross-validation • Fit the model using 4 folds of sample data • Test the model using the remaining fold • Try all 5 splits, and sum the model error • If the error is less than the threshold, stop: we have found the model
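The steps above can be sketched as follows; the linear model and synthetic run-time curve are stand-ins for whatever candidate shape is being validated:

```python
# Minimal sketch of the 5-fold procedure: fit on 4 folds, test on the
# held-out fold, and sum the error over all 5 splits.
import numpy as np

def five_fold_error(xs, ys, degree):
    idx = np.arange(len(xs))
    folds = np.array_split(idx, 5)
    total = 0.0
    for i in range(5):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(5) if j != i])
        coeffs = np.polyfit(xs[train], ys[train], degree)
        total += float(np.sum((np.polyval(coeffs, xs[test]) - ys[test]) ** 2))
    return total

xs = np.linspace(1, 99, 25)
ys = 2.0 * xs + 5.0                        # synthetic run-time curve
err = five_fold_error(xs, ys, degree=1)    # ~0: a linear fit is exact here
```

If `err` stays above the threshold, the procedure would request more samples or a different candidate shape rather than accept the model.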
Notes • Total run time is the sum of the sampling times; modeling time is negligible • Use the exhaustive data set as ground truth and the fitted model to predict values, then compute the coefficient of determination R²
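The R² calculation described above, sketched with made-up numbers standing in for the exhaustive (ground-truth) and predicted values:

```python
# Coefficient of determination: 1 - SS_res / SS_tot, where "truth" is
# the exhaustive sample set and "pred" comes from the fitted model.
import numpy as np

def r_squared(truth, pred):
    ss_res = np.sum((truth - pred) ** 2)
    ss_tot = np.sum((truth - truth.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

truth = np.array([1.0, 2.0, 3.0, 4.0])
pred = np.array([1.1, 1.9, 3.2, 3.8])
score = r_squared(truth, pred)
# score == 0.98: SS_res = 0.10, SS_tot = 5.0
```

An R² near 1 (like the 0.990-0.995 scores on the earlier slides) means the cheap sampled model tracks the exhaustive data closely.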