Compute Intensive Research on Cloud Computing Infrastructure • Systems Research Group, Computer Science Department, University of Illinois at Urbana-Champaign • Roy H. Campbell, Reza Farivar, Abhishek Verma, Cristina Abad • rhc@illinois.edu, {farivar2, verma7, cabad}@illinois.edu
Motivation and Goals • Research teams and practitioners are embracing cloud computing technologies for compute intensive tasks • E.g. genetic algorithms, financial algorithms, bioinformatics, astronomy, machine learning, web analytics • Many economic advantages • It is not clear whether such tasks perform optimally using MapReduce on COTS clusters (especially GPU clusters) • Research goal: investigate the bottlenecks of combining COTS clusters, MapReduce, and compute intensive tasks
Summary • Financial Computations • Genetic Algorithms for Optimization • Astronomy • Gene Alignment • Partitioned Iterative Algorithms: Best Effort • Clouds, Machine Learning and Reliability • Storage Workload Characterization • Workload Modeling
Financial Computations • Black-Scholes future options pricing on a MapReduce cluster • Using MITHRA, our modified "MapReduce on GPU clusters" middleware • MITHRA runs map() on GPUs as CUDA kernels • reduce() runs on the cluster CPUs • Better use of GPU hardware by exploiting increased locality
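A minimal sketch of the map/reduce split for Monte Carlo Black-Scholes pricing (plain Python stand-ins for MITHRA's GPU map kernel and CPU reducer; the function names and option parameters are illustrative, not MITHRA's actual API):

```python
import math
import random

# Illustrative option parameters: spot S0, strike K, risk-free rate r,
# volatility sigma, time to maturity T (years).
S0, K, r, sigma, T = 100.0, 105.0, 0.05, 0.2, 1.0

def map_payoff(seed, n_samples=10_000):
    """Stand-in for the map() side: Monte Carlo payoffs for one batch.
    In MITHRA this work runs on the GPU as a CUDA kernel."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)
        st = S0 * math.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)
        total += max(st - K, 0.0)  # European call payoff
    return total, n_samples

def reduce_price(partials):
    """Stand-in for the reduce() side, run on the cluster CPUs:
    combine partial sums and discount to present value."""
    partials = list(partials)
    payoff_sum = sum(p for p, _ in partials)
    count = sum(n for _, n in partials)
    return math.exp(-r * T) * payoff_sum / count

# One "map task" per seed; reduce combines the partial results.
price = reduce_price(map_payoff(seed) for seed in range(16))
print(f"Estimated call price: {price:.2f}")
```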
Genetic Algorithms for Optimization • 1. Initialize the population with random individuals • 2. Evaluate the fitness value of each individual (Map) • 3. Select good solutions using tournament selection without replacement (Reduce) • 4. Create new individuals by recombining the selected population using uniform crossover (Reduce) • 5. Repeat steps 2-4 until the convergence criteria are met
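A minimal sketch of one generation expressed in MapReduce style (plain Python; the OneMax fitness function, tournament size, and bit-string encoding are illustrative assumptions):

```python
import random

GENOME_LEN, TOURNAMENT = 32, 4

def fitness(ind):
    return sum(ind)  # illustrative OneMax fitness: count of 1 bits

def map_evaluate(population):
    """Map phase: evaluate the fitness of each individual."""
    return [(fitness(ind), ind) for ind in population]

def reduce_breed(scored, size):
    """Reduce phase: tournament selection without replacement,
    then uniform crossover to create new individuals."""
    new_pop = []
    while len(new_pop) < size:
        p1 = max(random.sample(scored, TOURNAMENT))[1]
        p2 = max(random.sample(scored, TOURNAMENT))[1]
        child = [random.choice(bits) for bits in zip(p1, p2)]  # uniform crossover
        new_pop.append(child)
    return new_pop

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(100)]
for generation in range(50):  # convergence test elided for brevity
    population = reduce_breed(map_evaluate(population), len(population))
print(max(fitness(ind) for ind in population))
```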
Astronomy • Use Hadoop Streaming to run multiple, parallel instances of an astronomy source extraction program: SExtractor • Use MapReduce intermediate key grouping/sorting to help merge catalog records • [Pipeline figure, four phases: (1) file fetch from HDFS; (2) pre-processing / metadata generation, assigning unique IDs; (3) SExtractor MapReduce job writing individual catalogs to HDFS; (4) merging MapReduce job / post-processing, using X,Y as the key, writing the merged catalog back to HDFS]
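A minimal Hadoop Streaming mapper sketch for the SExtractor phase (the SExtractor invocation, its flags, and the catalog column order are illustrative assumptions, not the project's actual scripts):

```python
#!/usr/bin/env python
# Streaming mapper: each input line names one image file. Run
# SExtractor on it and emit (x,y)-keyed catalog records so the
# shuffle/sort phase groups candidate duplicates for the merge job.
import subprocess
import sys

for line in sys.stdin:
    image = line.strip()
    if not image:
        continue
    # Illustrative invocation; real flags depend on the SExtractor config.
    out = subprocess.run(["sex", image, "-CATALOG_TYPE", "ASCII"],
                         capture_output=True, text=True).stdout
    for record in out.splitlines():
        fields = record.split()
        if record.startswith("#") or len(fields) < 2:
            continue  # skip headers/comments
        x, y = fields[0], fields[1]  # assumed column order
        print(f"{x},{y}\t{image}\t{record}")
```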
Gene Alignment: Distributed Filtering • [Figure: the reference (TGCCTTCATTTCGTTATGTACCCAGTAGTCATAAAAGCACTAGCTTGCCAAGTT) is decomposed into overlapping 6-mers (TGCCTT, GCCTTC, CCTTCA, ...); masks zero out part of each 6-mer (e.g. TGCCTT, TGCC00, TG00TT, 00CCTT) and one sorted array is kept per mask; together the sorted masked arrays form a distributed pigeonhole filter]
Masked Read Matching • [Figure: a short read (CCATCA) is masked with the same patterns (CCAT00, CC00CA, 00ATCA) and looked up in the sorted masked arrays; by the pigeonhole principle, a read within the mismatch budget matches at least one of its masked variants exactly; here CC00CA hits the entry derived from the reference 6-mer CCTTCA]
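A minimal sketch of the masked-lookup idea (the 6-mer length and mask patterns are read off the slide figures; the index layout is an illustrative assumption, not the actual implementation):

```python
from bisect import bisect_left

KMER = 6
# Masks matching the figure's variants: '0' marks wildcard positions
# (TGCC00-, TG00TT-, and 00CCTT-style entries).
MASKS = ["111111", "111100", "110011", "001111"]

def apply_mask(kmer, mask):
    return "".join(c if m == "1" else "0" for c, m in zip(kmer, mask))

def build_arrays(reference):
    """One sorted array per mask: the distributed pigeonhole filter."""
    return {mask: sorted((apply_mask(reference[i:i + KMER], mask), i)
                         for i in range(len(reference) - KMER + 1))
            for mask in MASKS}

def candidate_positions(read, arrays):
    """Look up every masked variant of the read; an exact hit in any
    array yields a candidate alignment position to verify fully."""
    hits = set()
    for mask, entries in arrays.items():
        key = apply_mask(read, mask)
        keys = [e[0] for e in entries]
        i = bisect_left(keys, key)
        while i < len(entries) and entries[i][0] == key:
            hits.add(entries[i][1])
            i += 1
    return hits

ref = "TGCCTTCATTTCGTTATGTACCCAGTAGTCATAAAAGCACTAGCTTGCCAAGTT"
arrays = build_arrays(ref)
# CCATCA has one mismatch against CCTTCA at reference position 2;
# the 110011 mask (CC00CA) still produces an exact hit there.
print(sorted(candidate_positions("CCATCA", arrays)))
```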
Iterative Computations • Examples: YouTube video suggestion, BFS, PageRank, clustering, pattern recognition
Partitioned Iterative Convergence: Best Effort • [Figure: an input partitioner splits the data across cluster nodes; each node runs local iterations on its sub-model with a per-node convergence test; new sub-models pass through a model effect applicator and shared model management into a global model merge; a convergence test against the global convergence criteria on the current model(s) decides whether another model update round is needed]
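A minimal sketch of the partition / locally-iterate / globally-merge loop (1-D k-means as the iterative workload; the merge-by-averaging rule and the convergence thresholds are illustrative assumptions, not the actual best-effort model):

```python
import random

def local_iterate(points, centroids, tol=1e-3):
    """Local iterations on one cluster node: Lloyd's k-means steps on
    this node's partition until the local convergence test passes."""
    while True:
        buckets = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda j: (p - centroids[j]) ** 2)
            buckets[j].append(p)
        new = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
        if max(abs(a - b) for a, b in zip(new, centroids)) < tol:
            return new  # new sub-model
        centroids = new

def global_merge(sub_models):
    """Global model merge: average the per-partition sub-models."""
    return [sum(cs) / len(cs) for cs in zip(*sub_models)]

data = [random.gauss(mu, 1.0) for mu in (0, 10) for _ in range(500)]
random.shuffle(data)
partitions = [data[i::3] for i in range(3)]      # input partitioner
model = [random.choice(data) for _ in range(2)]  # initial global model

for round_ in range(20):                         # model update rounds
    subs = [local_iterate(part, model) for part in partitions]
    new_model = global_merge(subs)
    if max(abs(a - b) for a, b in zip(new_model, model)) < 1e-3:
        break                                    # global convergence criteria met
    model = new_model
print(sorted(model))
```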
Clouds, Machine Learning and Reliability • Trend: clouds will expand into diverse roles • Big Data: data mining and machine learning • Real-time data: streaming clouds (e.g. Storm) • Economic pressure: massive cloud adoption • Results fed into cyber-physical systems • Result: the reliability and security of (1) clouds and (2) ML algorithms on clouds will impact real-world phenomena • Current cloud solutions are orders of magnitude less dependable than the minimum requirements of cyber-physical systems
Cloud Storage Workload Characterization • Studied how MapReduce interacts with the storage layer • Findings relevant to storage system design and tuning: • Workloads are dominated by high file churn: 80%-90% of files are accessed 1-10 times in 6 months • A small percentage of files are very popular • Young files account for a high percentage of accesses but a small percentage of bytes stored • Requests are bursty • Files are very short-lived: 90% of deletions target files less than 1.25 hours old
Big Data Storage Workloads: Modeling and Synthetic Generation • One potential storage bottleneck: the metadata server, which must handle a large number of bursty requests • New schemes have been proposed, but their evaluation has been insufficient: no adequate traces or models exist • Mimesis: a synthetic workload generator • Suitable for Big Data workloads • Reproduces the desired statistical properties of the original trace • Accurate: low RMSE (root mean squared error) when used in place of original traces • Used to evaluate an LRU metadata cache for HDFS
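A minimal sketch of the synthetic-generation idea, fitting distributions to a trace and replaying them (the lognormal interarrivals, exponential lifetimes, and parameter values are illustrative assumptions, not Mimesis's actual models):

```python
import random

# Statistics assumed to be fitted from an original trace: bursty
# interarrivals (heavy-tailed lognormal) and short file lifetimes.
INTERARRIVAL_MU, INTERARRIVAL_SIGMA = -2.0, 1.5  # log-seconds
MEAN_LIFETIME = 0.5 * 3600                       # seconds

def synthetic_trace(n_files):
    """Emit (time, op, file_id) events whose interarrival and
    lifetime distributions match the fitted statistics."""
    events, t = [], 0.0
    for fid in range(n_files):
        t += random.lognormvariate(INTERARRIVAL_MU, INTERARRIVAL_SIGMA)
        events.append((t, "create", fid))
        lifetime = random.expovariate(1.0 / MEAN_LIFETIME)
        events.append((t + lifetime, "delete", fid))
    return sorted(events)

# Replay the synthetic events, e.g. against a metadata cache under test.
for time, op, fid in synthetic_trace(5):
    print(f"{time:10.2f}  {op:6}  file{fid}")
```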
Performance Modeling of MapReduce Environments • Performance modeling techniques for MapReduce environments: analytical models, simulation, experimental measurements • Service level objectives: Automatic Resource Inference and Allocation for MapReduce workloads • Optimization of makespans for sets of jobs and DAGs • Comparison of hardware alternatives
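A minimal sketch of one analytical model of this kind, the classic completion-time bounds for greedily scheduled tasks (the parameter names, the midpoint estimate, and the example numbers are illustrative):

```python
def stage_bounds(n_tasks, n_slots, avg_dur, max_dur):
    """Bounds for one stage (map or reduce) of n greedily scheduled
    tasks on k slots: at least n*avg/k, at most (n-1)*avg/k + max."""
    low = n_tasks * avg_dur / n_slots
    up = (n_tasks - 1) * avg_dur / n_slots + max_dur
    return low, up

def job_estimate(map_stats, reduce_stats):
    """Estimate job makespan as map stage plus reduce stage, taking
    the midpoint of each stage's lower and upper bounds."""
    total_low = total_up = 0.0
    for n, k, avg, mx in (map_stats, reduce_stats):
        low, up = stage_bounds(n, k, avg, mx)
        total_low += low
        total_up += up
    return (total_low + total_up) / 2

# 200 map tasks on 40 slots (avg 30s, max 55s);
# 20 reduce tasks on 10 slots (avg 120s, max 150s).
print(f"estimated makespan: "
      f"{job_estimate((200, 40, 30, 55), (20, 10, 120, 150)):.0f}s")
```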
Comparison of Hardware Alternatives • Designed synthetic MapReduce applications based on the CPU, memory, disk and network resources used • Goal: find a minimal set (basis) of these synthetic applications onto which any MapReduce workload can be projected • Using the performance of the basis on old and new hardware, we estimated the performance of any workload on the new hardware to within 10% error
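A minimal sketch of the projection idea via least squares (the basis size, the resource-profile dimensions, the simple linear model, and all numbers are illustrative assumptions):

```python
import numpy as np

# Runtimes (seconds) of 3 basis benchmarks on the old cluster, each
# measured under 4 resource profiles (CPU-, memory-, disk-, and
# network-bound phases). Numbers are made up for illustration.
basis_old = np.array([[120.,  40.,  80.,  30.],
                      [ 35., 150.,  60.,  45.],
                      [ 50.,  55., 140., 160.]])

# The same basis benchmarks measured on the new hardware.
basis_new = np.array([[ 70.,  30.,  75.,  28.],
                      [ 22., 110.,  55.,  40.],
                      [ 33.,  42., 130., 150.]])

# A real workload's profile on the old hardware.
workload_old = np.array([95., 82., 90., 75.])

# Project the workload onto the basis: find weights w minimizing
# ||w @ basis_old - workload_old||.
w, *_ = np.linalg.lstsq(basis_old.T, workload_old, rcond=None)

# Estimate the workload on the new hardware by applying the same
# weights to the basis measured there.
workload_new = w @ basis_new
print("weights:", np.round(w, 2))
print("estimated new-hardware runtime:", round(workload_new.sum(), 1), "s")
```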