Elastic and Efficient Execution of Data-Intensive Applications on Hybrid Cloud Tekin Bicer Computer Science and Engineering Ohio State University
Introduction • Scientific simulations and instruments • X-ray Photon Correlation Spectroscopy • CCD Detector: 120MB/s now; 44GB/s by 2015 • Global Cloud Resolving Model • 1PB for a 4km grid cell • Typically performed on local clusters • No longer sufficient • Problems • Data analysis, storage, I/O performance • Cloud Technologies • Elasticity • Pay-as-you-go model
Hybrid Cloud Motivation • Cloud technologies are typically associated with computational resources • Massive data generation can exhaust local storage • Hybrid Cloud • Local resources: base • Cloud resources: additional • Cloud provides both compute and storage resources
Usage of Hybrid Cloud • [Figure: a data source feeds local storage and local nodes; overflow data and work are pushed to cloud storage and cloud compute nodes]
Challenges • Data-Intensive Processing • Transparent Data Access and Analysis • Programmability of Large-Scale Applications • Meeting User Constraints • Enabling Cloud Bursting • Minimizing Storage and I/O Cost • Domain Specific Compression • In-Situ and In-Transit Data Analysis • Proposed solutions: • MATE-HC: Map-reduce with AlternaTE API over Hybrid Cloud • Dynamic Resource Allocation Framework for Hybrid Cloud • Compression Methodology and System for Large-Scale Applications
Programmability of Large-Scale Applications on Hybrid Cloud • Geographically distributed resources • Ease of programmability • Reduction-based programming structures • MATE-HC • A middleware for transparent data access and processing • Selective job assignment • Multi-threaded data retrieval
Middleware for Hybrid Cloud • [Figure: two clusters, each with its own job assignment and global reduction, cooperating on remote data analysis]
MATE vs. Map-Reduce Processing Structure • Reduction Object represents the intermediate state of the execution • Reduction function is commutative and associative • Sorting and grouping overheads are eliminated by the reduction function/object
Simple Example • Dataset partitioned across three compute nodes: [3 5 8 4 1], [3 5 2 6 7], [9 4 2 4 8] • Local Reduction (+): Robj = 21, 23, 27 • Global Reduction (+): Result = 71
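A minimal Python sketch of this processing structure, using the slide's numbers; the reduction object is just an accumulator, and because "+" is commutative and associative no sorting or grouping step is needed:

```python
from functools import reduce

partitions = [[3, 5, 8, 4, 1],   # node 1
              [3, 5, 2, 6, 7],   # node 2
              [9, 4, 2, 4, 8]]   # node 3

def local_reduction(chunk):
    robj = 0                     # reduction object: intermediate state
    for value in chunk:
        robj += value            # accumulate; no sorting/grouping step
    return robj

robjs = [local_reduction(p) for p in partitions]  # [21, 23, 27]
result = reduce(lambda a, b: a + b, robjs)        # global reduction
print(result)                                     # 71
```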
Experiments • 2 geographically distributed clusters • Cloud: EC2 instances running in Virginia • 22 nodes x 8 cores • Local: campus cluster (Columbus, OH) • 150 nodes x 8 cores • 3 applications with 120GB of data • KMeans: k=1000; KNN: k=1000 • PageRank: 50x10^6 links w/ 9.2x10^8 edges • Goals: • Evaluating the system overhead with different job distributions • Evaluating the scalability of the system
Summary • MATE-HC is a data-intensive middleware developed for Hybrid Cloud • Our results show that • Low inter-cluster comm. overhead • Job distribution is important • Overall slowdown is modest • Proposed system is scalable
Outline • Data-Intensive Processing • Programmability of Large-Scale Applications • Transparent Data Access and Analysis • Meeting User Constraints • Enabling Cloud Bursting • Minimizing Storage and I/O Cost • Domain Specific Compression • In-Situ and In-Transit Data Analysis • MATE-HC: Map-reduce with AlternaTE API over Hybrid Cloud • Dynamic Resource Allocation Framework for Cloud Bursting • Compression Methodology and System for Large-Scale Applications
Dynamic Resource Allocation for Cloud Bursting • Performance of cloud resources and workload vary • Problems: • Extended execution times • Unable to meet user constraints • Cloud resources can dynamically scale • Cloud Bursting • In-house resources: base workload • Cloud resources: adapt to performance requirements • Dynamic Resource Allocation Framework • A model for capturing "Time" and "Cost" constraints with cloud bursting
System Components • Local cluster and Cloud • MATE-HC processing structure • Pull-based job distribution • Head Node • Coarse-grained job assignment • Consideration of locality • Master node • Fine-grained job assignment • Job stealing • Remote data processing
Resource Allocation Framework • Estimate the required time for local cluster processing • Estimate the required time for cloud cluster processing • Estimate the # of jobs remaining after local jobs are consumed, using the ratio of local computational throughput in the system • All variables can be profiled during execution, except the estimated # of stolen jobs
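The slide's bullets suggest estimates of roughly the following shape. This is only a plausible sketch; the symbols (node counts $n_l$, $n_c$; profiled per-job times $\tau_l$, $\tau_c$; remaining jobs $J_l$, $J_c$; jobs left when local jobs run out $J_{rem}$) are assumptions, not the author's notation:

```latex
% Local share of total throughput (profiled during execution):
\rho = \frac{n_l/\tau_l}{\,n_l/\tau_l + n_c/\tau_c\,}
% Estimated stolen jobs once local jobs are consumed:
J_s \approx \rho \, J_{rem}
% Time estimates for the two clusters:
T_{local} \approx \frac{(J_l + J_s)\,\tau_l}{n_l}, \qquad
T_{cloud} \approx \frac{(J_c - J_s)\,\tau_c}{n_c}
% Pick the smallest n_c such that \max(T_{local}, T_{cloud}) \le T_{constraint}.
```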
Execution of Resource Allocation Framework • Head Node • Evaluates profiled info. • Estimates # of cloud instances • Before each job assignment • Informs Master nodes • Master Node • Each cluster has one • Collects profile info. • During job request time • (De)allocates resources • Slave Nodes • Request and consume jobs
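A minimal Python sketch of this loop; all helper names (profile, estimate_cloud_instances, scale_to, pick_master, assign) are hypothetical stand-ins for the feedback mechanism, not the author's API:

```python
import time

def head_node_loop(jobs, masters, deadline, max_cloud=16):
    # masters: {'local': Master, 'cloud': Master}. profile(), scale_to(),
    # assign(), estimate_cloud_instances(), and pick_master() are
    # hypothetical helpers illustrating the feedback loop.
    while jobs:
        stats = {name: m.profile() for name, m in masters.items()}
        n = estimate_cloud_instances(stats, deadline - time.time())
        masters['cloud'].scale_to(min(n, max_cloud))  # (de)allocate instances
        job = jobs.pop(0)
        pick_master(stats, job).assign(job)  # locality-aware, coarse-grained
```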
Experimental Setup • Two Applications • KMeans (520GB): Local=104GB; Cloud=416GB • PageRank (520GB): Local=104GB; Cloud=416GB • Local cluster: max. 16 nodes x 8 cores = 128 cores • Cloud resources: max. 16 nodes x 8 cores = 128 cores • Evaluation of model • Local nodes are dropped during execution • Observed how the system adapted
KMeans – Time Constraint • System is not able to meet the time constraint because max. # of cloud instances is reached • # Local Inst.: 16 (fixed) • # Cloud Inst.: Max 16 (varies) • Local: 104GB, Cloud: 416GB • All other configurations meet the time constraint with <1.5% error rate
KMeans – Cloud Bursting • # Local Inst.: 16 (fixed) • # Cloud Inst.: Max 16 (varies) • Local: 104GB, Cloud: 416GB • 4 local nodes are dropped during execution • After 25% and 50% of the time constraint has elapsed, error rate <1.9% • After 75% of the time constraint has elapsed, error rate <3.6% • Reason for the higher error rate: shorter time to profile the new environment
Summary • MATE-HC: MapReduce type of processing • Federated resources • Developed a resource allocation model • Based on feedback mechanism • Time and cost constraints • Two data-intensive applications (KMeans, PR) • Error rate for time < 3.6% • Error rate for cost < 1.2%
Outline • Data-Intensive Processing • Programmability of Large-Scale Applications • Transparent Data Access and Analysis • Meeting User Constraints • Enabling Cloud Bursting • Minimizing Storage and I/O Cost • Domain Specific Compression • In-Situ and In-Transit Data Analysis • MATE-HC: Map-reduce with AlternaTE API over Hybrid Cloud • Dynamic Resource Allocation Framework for Cloud Bursting • Compression Methodology and System for Large-Scale Applications
Data Management using Compression • Generic compression algorithms • Good for low-entropy sequences of bytes • Scientific datasets are hard to compress • Floating-point numbers: exponent and mantissa • The mantissa can be highly entropic • Using compression is challenging • Suitable compression algorithms • Utilization of available resources • Integration of compression algorithms
Compression Methodology • Common properties of scientific datasets • Multidimensional arrays • Consist of floating point numbers • Relationship between neighboring values • Domain specific solutions can help • Approach: • Prediction-based differential compression • Predict the values of neighboring cells • Store the difference
Example: GCRM Temperature Variable Compression • Temperature record: the values of neighboring cells are highly related • X' table (after prediction): stores the difference from the predicted value • X'' (compressed values): 5 bits of prediction metadata + the difference bits • Lossless and lossy compression • Fast, with good compression ratios
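A minimal lossless sketch of the idea, assuming (as the slide suggests) that each value is predicted from its neighbor; here the "compressed" output is a (bit-count, difference) pair rather than a genuinely packed bitstream, and the per-value header is wider than the 5 bits on the slide:

```python
import struct

def f2b(v):  # IEEE-754 bit pattern of a double
    return struct.unpack('<Q', struct.pack('<d', v))[0]

def b2f(b):
    return struct.unpack('<d', struct.pack('<Q', b))[0]

def compress(values):
    # Predict each value as its neighbor; the XOR of the bit patterns is
    # small (many leading zeros) when the prediction is good, so only the
    # significant difference bits need to be stored.
    out, prev = [], 0
    for v in values:
        bits = f2b(v)
        diff = bits ^ prev
        out.append((diff.bit_length(), diff))  # header + difference bits
        prev = bits
    return out

def decompress(stream):
    vals, prev = [], 0
    for _, diff in stream:
        prev ^= diff
        vals.append(b2f(prev))
    return vals

temps = [270.11, 270.12, 270.10, 270.13]  # neighboring cells are related
assert decompress(compress(temps)) == temps
```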
Compression Framework • Improve end-to-end application performance • Minimize the application I/O time • Pipelining I/O and (de)compression operations • Hide computational overhead • Overlapping application computation with compression framework • Easy implementation of compression algorithms • Easy integration with applications • Similar API to POSIX I/O
A Compression Framework for Data-Intensive Applications • Chunk Resource Allocation (CRA) Layer • Initialization of the system • Generates chunk requests and enqueues them for processing • Converts original offset and data-size requests into compressed-domain requests • Parallel I/O Layer (PIOL) • Creates parallel chunk requests to the storage medium • Each chunk request is handled by a group of threads • Provides abstraction for different data transfer protocols • Parallel Compression Engine (PCE) • Applies encode()/decode() functions to chunks • Manages an in-memory cache with informed prefetching • Creates I/O requests
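A minimal sketch of the CRA layer's offset translation, with hypothetical helpers: fetch(offset, size) reads compressed bytes from storage, codec.decode() decompresses one chunk, and chunk_index maps each logical chunk id to its (compressed offset, compressed size):

```python
CHUNK = 4 * 1024 * 1024  # logical (uncompressed) chunk size, an assumption

class CompressedFile:
    def __init__(self, chunk_index, fetch, codec):
        self.index = chunk_index  # chunk id -> (c_off, c_size)
        self.fetch = fetch        # reads compressed bytes from storage
        self.codec = codec        # encode()/decode() pair
        self.cache = {}           # decompressed-chunk cache

    def read(self, offset, size):
        # Translate a logical (offset, size) request into compressed
        # chunk requests, decompress, and assemble the result.
        data = bytearray()
        first, last = offset // CHUNK, (offset + size - 1) // CHUNK
        for cid in range(first, last + 1):
            if cid not in self.cache:
                c_off, c_size = self.index[cid]
                self.cache[cid] = self.codec.decode(self.fetch(c_off, c_size))
            data += self.cache[cid]
        start = offset - first * CHUNK
        return bytes(data[start:start + size])
```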
Integration with a Data-Intensive Computing System • Remote data processing • Sensitive to I/O bandwidth • Processes data in the local cluster, the cloud, or both (Hybrid Cloud)
Experimental Setup • Two datasets: • GCRM: 375GB (L:270 + R:105) • NPB: 237GB (L:166 + R:71) • 16x8 cores (Intel Xeon 2.53GHz) • Storage of datasets • Lustre FS (14 storage nodes) • Amazon S3 (Northern Virginia) • Compression algorithms • CC, FPC, LZO, bzip, gzip, lzma • Applications: AT, MMAT, KMeans
Performance of MMAT • [Figure: breakdown of performance] • Overhead (local): 15.41% • Read speedup: 1.96
Lossy Compression (MMAT) • #e: # of dropped bits • Error bound: 5x10^-5
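A minimal sketch of the "#e dropped bits" knob, assuming it zeroes the e lowest mantissa bits of each double; the achievable error bound depends on e and the data's exponent range:

```python
import struct

def drop_bits(value, e):
    # Lossy variant: clear the e low-order mantissa bits of a double,
    # trading precision for smaller (more compressible) differences.
    bits = struct.unpack('<Q', struct.pack('<d', value))[0]
    bits &= ~((1 << e) - 1)  # zero out the e lowest mantissa bits
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

print(drop_bits(270.1234567, 20))  # close to the input; small, bounded error
```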
Summary • Management and analysis of scientific datasets are challenging • Generic compression algorithms are inefficient for scientific datasets • Proposed a compression framework and methodology • Domain specific compression algorithms are fast and space efficient • 51.68% compression ratio • 53.27% improvement in exec. time • Easy plug-and-play of compression • Integration of the proposed framework and methodology with a data analysis middleware
Outline • Data-Intensive Processing • Programmability of Large-Scale Applications • Transparent Data Access and Analysis • Meeting User Constraints • Enabling Cloud Bursting • Minimizing Storage and I/O Cost • Domain Specific Compression • In-Situ and In-Transit Data Analysis • MATE-HC: Map-reduce with AlternaTE API over Hybrid Cloud • Dynamic Resource Allocation Framework for Cloud Bursting • Compression Methodology and System for Large-Scale Applications
In-Situ and In-Transit Analysis • Compression can ease data management • But may not always be sufficient • In-situ data analysis • Co-locate data source and analysis code • Data analysis during data generation • In-transit data analysis • Remote resources are utilized • Forward generated data to “staging nodes”
In-Situ and In-Transit Data Analysis • Significant reduction in generated dataset size • Noise elimination, data filtering, stream mining… • Timely insights • Parallel data analysis • MATE-Stream • Dynamic resource allocation and load balancing • Hybrid data analysis • Both in-situ and in-transit
Parallel In-Situ Data Analysis • [Figure: a data source feeds a dispatcher, which distributes data to parallel local-reduction (LR) processes, each holding its own reduction object (Robj)] • Data Generation • Scientific instruments, simulations, etc. • (Un)bounded data • Local Reduction • Filtering, stream mining • Data reduction • Continuous local reduction • Local Combination • Intermediate results • Timely insights • Continuous global reduction
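A minimal Python sketch of this structure: a dispatcher distributes an (un)bounded stream to parallel local-reduction workers (threads here), whose reduction objects are then combined; the filtering predicate and the use of sum are illustrative assumptions:

```python
from queue import Queue
from threading import Thread

def local_reduce(inq, outq):
    # Continuous local reduction over an (un)bounded stream: filter
    # each item and fold it into the reduction object (Robj).
    robj = 0
    for item in iter(inq.get, None):  # None marks end of stream
        if item >= 0:                 # filtering / noise elimination (example)
            robj += item              # commutative + associative fold
    outq.put(robj)                    # intermediate result for combination

def run(stream, nworkers=4):
    inqs = [Queue() for _ in range(nworkers)]
    outq = Queue()
    workers = [Thread(target=local_reduce, args=(q, outq)) for q in inqs]
    for w in workers:
        w.start()
    for i, item in enumerate(stream):  # dispatcher: round-robin distribution
        inqs[i % nworkers].put(item)
    for q in inqs:
        q.put(None)
    for w in workers:
        w.join()
    return sum(outq.get() for _ in workers)  # local combination of results

print(run(range(10)))  # 45
```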
Elastic In-Situ Data Analysis • [Figure: the dispatcher dynamically spawns additional local-reduction processes, each with its own reduction object] • Insufficient resource utilization • Dynamically extend resources • New local reduction processes
Elastic In-Situ and In-Transit Data Analysis • [Figure: node N0 runs in-situ local reduction and forwards data to staging node N1, whose reduction processes perform local and global combination (GRobj)] • Staging node is set • Forward data
Future Directions • Scientific applications are difficult to modify • Integration with existing data sources • GridFTP, (P)NetCDF, HDF5, etc. • Data transfer is expensive (especially for in-transit analysis) • Utilization of advanced network technologies • Software-Defined Networking (SDN) • Long-running nature of large-scale applications • Failures are inevitable • Exploit features of the processing structure
Conclusions • Data-intensive applications and instruments can easily exhaust local resources • Hybrid cloud can provide additional resources • Challenges: transparent data access and processing; meeting user constraints; minimizing I/O and storage cost • MATE-HC: transparent and efficient data processing on Hybrid Cloud • Developed a "dynamic resource allocation framework" and integrated it with MATE-HC • Time and cost sensitive data processing • Proposed a "compression methodology and a system" to minimize storage cost and the I/O bottleneck • Design of "in-situ and in-transit data analysis" (ongoing work)
MATE-EC2 Design • Data organization • Three levels: Buckets, Chunks and Units • Metadata information • Chunk Retrieval • Threaded Data Retrieval • Selective Job Assignment • Load Balancing and handling heterogeneity • Pooling mechanism
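A minimal sketch of threaded chunk retrieval using ranged S3 GETs via boto3; the bucket/key arguments and the 16-thread default are illustrative assumptions, not MATE-EC2's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

CHUNK = 8 * 1024 * 1024  # 8MB chunks (one of the sizes evaluated below)
s3 = boto3.client('s3')

def fetch_chunk(bucket, key, cid):
    # Ranged GET: retrieve only the bytes belonging to chunk `cid`.
    rng = 'bytes={}-{}'.format(cid * CHUNK, (cid + 1) * CHUNK - 1)
    obj = s3.get_object(Bucket=bucket, Key=key, Range=rng)
    return cid, obj['Body'].read()

def retrieve(bucket, key, nchunks, nthreads=16):
    # A pool of retrieval threads fetches chunks concurrently.
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        futures = [pool.submit(fetch_chunk, bucket, key, c)
                   for c in range(nchunks)]
        return dict(f.result() for f in futures)
```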
MATE-EC2 vs. EMR • PageRank speedups vs. combine: 4.08 – 7.54 • KMeans speedups vs. combine: 3.54 – 4.58
Different Chunk Sizes • KMeans • 1 retrieval thread • Performance increase, 128KB vs. 8M-and-larger chunks: 2.07 to 2.49
K-Means (Data Retrieval) • Fig. 1: 16 retrieval threads; 8M vs. other chunk sizes, speedup: 1.13-1.30 • Fig. 2: 128M chunk size; 1 thread vs. more threads, speedup: 1.37-1.90 • Dataset: 8.2GB
Job Assignment • KMeans speedups: 1.01 (8M) and 1.10-1.14 (other chunk sizes) • PCA (2 iterations) speedups: 1.19-1.68
Heterogeneous Configuration Overheads • KMeans: 1% • PCA: 1.1%, 7.4%, 11.7%
KMeans – Cost Constraint • System meets the cost constraints with <1.1% error rate • System tries to minimize the execution time under the provided cost constraint • When the maximum # of cloud instances is allocated, the error rate is again <1.1%
Prefetching and In-Memory Cache • Overlapping application-layer computation with I/O • Reusability of already-accessed data is small • Prefetching and caching of prospective chunks • Default policy is LRU • User can analyze history and provide a prospective chunk list (informed prefetching via prefetch(…)) • Cache uses a row-based locking scheme for efficient consecutive chunk requests
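A minimal sketch of such a cache: LRU eviction by default, with user-supplied prospective chunk ids driving informed prefetching. load() is a hypothetical chunk reader, prefetching here is synchronous for brevity (a real framework would prefetch asynchronously to overlap with computation), and the row-based locking is omitted:

```python
from collections import OrderedDict

class ChunkCache:
    # In-memory chunk cache with LRU eviction and informed prefetching.
    def __init__(self, load, capacity=64):
        self.load = load                 # hypothetical: read + decompress one chunk
        self.capacity = capacity
        self.cache = OrderedDict()       # chunk id -> data, kept in LRU order

    def get(self, cid, hints=()):
        # `hints`: prospective chunk ids from the user's history analysis;
        # with no hints this degenerates to plain LRU caching.
        for c in (*hints, cid):
            if c not in self.cache:
                self.cache[c] = self.load(c)  # fetch (or prefetch) chunk
            self.cache.move_to_end(c)         # mark as most recently used
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return self.cache[cid]
```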