Explore high-level abstractions for data and computation needs in distributed computing challenges. Learn about the All-Pairs problem, mistakes in computing, and solutions like Chirp_array and DataLab. Discover how to handle large data sets efficiently using ensemble techniques.
High Level Abstractions for Data-Intensive Computing
Christopher Moretti, Hoang Bui, Brandon Rich, and Douglas Thain
University of Notre Dame
Computing’s central challenge, “How not to make a mess of it,” has not yet been met. -Edsger Dijkstra
Overview
• Many systems today give end users access to hundreds or thousands of CPUs.
• But it is far too easy for the naive user to create a big mess in the process.
• Our Solution:
  • Deploy high-level abstractions that describe both data and computation needs.
• Some examples of current work:
  • All-Pairs: An abstraction for biometric workloads.
  • Distributed Ensemble Classification
  • DataLab: A system and language for data-parallel computation.
Distributed Computing is Hard!
• What is Condor?
• Which resources? How many?
• What happens when things fail?
• How do I fit my workload into jobs?
• How long will it take?
• What about job input data?
• How can I measure job stats?
• What do I do with the results?
The All-Pairs Problem
All-Pairs( Set S1, Set S2, Function F ) yields a matrix M: M[i,j] = F(S1[i], S2[j])
Example workload:
• 60K images of 20KB each: >1GB of input
• 60K x 60K = 3.6B comparisons @ 50/s = 2.3 CPU-years
• 3.6B results x 8B output each = 29GB
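The definition and the workload figures above can be checked with a short sketch. This is only an illustration of the abstraction on in-memory sequences, not the production engine; the names `all_pairs` and `workload_estimate` are hypothetical, and the estimate assumes decimal units (1GB = 10^9 bytes) and a 365-day year.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

A = TypeVar("A")
B = TypeVar("B")
R = TypeVar("R")


def all_pairs(s1: Sequence[A], s2: Sequence[B], f: Callable[[A, B], R]) -> List[List[R]]:
    """Return the matrix M with M[i][j] = f(s1[i], s2[j])."""
    return [[f(a, b) for b in s2] for a in s1]


def workload_estimate(n_items: int, item_bytes: int,
                      rate_per_sec: float, result_bytes: int) -> Dict[str, float]:
    """Rough size of comparing a set of n_items against itself."""
    comparisons = n_items * n_items
    seconds_per_year = 365 * 24 * 3600
    return {
        "input_gb": n_items * item_bytes / 1e9,
        "comparisons": comparisons,
        "cpu_years": comparisons / rate_per_sec / seconds_per_year,
        "output_gb": comparisons * result_bytes / 1e9,
    }


# The slide's example: 60K images of 20KB, 50 comparisons/s, 8 bytes per result.
est = workload_estimate(60_000, 20_000, 50, 8)
small = all_pairs([1, 2], [10, 20], lambda a, b: a + b)
```

The sequential list comprehension is exactly the part that does not scale: at the slide's rate it would take more than two years on one CPU, which is why the work has to be spread across many machines.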
[Figure: Biometric All-Pairs Comparison using function F]
Naïve Mistakes
Naïve approach: For all $X : For all $Y : cmp $X to $Y
[Figure: CPUs in a batch system all reading from one file server. Each CPU reads 10TB!]
Computing Problem: Even expert users don’t know how to tune jobs optimally, and can make 100 CPUs even slower than one by overloading the file server, network, or resource manager.
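One way to see why the naïve loop overloads the file server is to count file fetches. The sketch below is a simple counting model under stated assumptions, not a measurement from the talk: the naïve version fetches both inputs for every comparison, while a blocked version (a hypothetical partitioning, with a made-up `block` parameter) fetches each block's row files and column files once per job.

```python
def naive_file_reads(n: int) -> int:
    """Every one of the n*n comparisons fetches both of its input files."""
    return 2 * n * n


def blocked_file_reads(n: int, block: int) -> int:
    """Each job covers a sub-matrix and fetches its row and column files once."""
    # Sizes of the chunks along one side; the last chunk may be smaller.
    sizes = [min(block, n - start) for start in range(0, n, block)]
    return sum(rows + cols for rows in sizes for cols in sizes)


# Illustrative numbers only: 1000 files, jobs of 100 x 100 comparisons.
naive = naive_file_reads(1000)
blocked = blocked_file_reads(1000, 100)
```

Under this model the blocked layout cuts server fetches by a factor equal to the block size, which hints at why choosing the partitioning matters more than simply adding CPUs.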
All-Pairs Abstraction
• binary function F
• set S of files
• invocation: M = AllPairs(F,S)
All-Pairs Production System at Notre Dame
• 300 active storage units
• 500 CPUs, 40TB disk
Workflow through the web portal and the All-Pairs engine:
1 - Upload F and S into web portal.
2 - AllPairs(F,S)
3 - O(log n) distribution by spanning tree.
4 - Choose optimal partitioning and submit batch jobs.
5 - Collect and assemble results.
6 - Return result matrix to user.
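The O(log n) distribution step can be illustrated with a doubling schedule: in each round, every node that already holds the data sends it to one node that does not. This is a sketch of one simple way to get a logarithmic number of rounds; the slides do not say what tree shape the production system uses, and `spanning_tree_schedule` and the node names are hypothetical.

```python
from collections import deque
from typing import Hashable, Iterable, List, Tuple

Transfer = Tuple[Hashable, Hashable]  # (sender, receiver)


def spanning_tree_schedule(source: Hashable, targets: Iterable[Hashable]) -> List[List[Transfer]]:
    """Return one list of (sender, receiver) transfers per round."""
    holders = [source]          # nodes that already have the data
    pending = deque(targets)    # nodes still waiting for it
    rounds: List[List[Transfer]] = []
    while pending:
        transfers: List[Transfer] = []
        for sender in holders:
            if not pending:
                break
            transfers.append((sender, pending.popleft()))
        # Receivers become senders only in the next round.
        holders.extend(receiver for _, receiver in transfers)
        rounds.append(transfers)
    return rounds


# 300 storage units, as in the slide; the "portal" source name is made up.
schedule = spanning_tree_schedule("portal", [f"node{k}" for k in range(300)])
```

Because the number of holders doubles each round, 300 targets are reached in `len(schedule)` rounds rather than 300 sequential copies from a single server.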
Returning the Result Matrix
• Too many files. Hard to do prefetching.
• Too large files. Must scan entire file.
• Row/Column ordered. How can we build it?
[Figure: result matrix with sample values such as 4.37, 6.01, 2.22, 7.13, 8.94, 6.72, 1.34, …, 0.98]
Result Storage by Abstraction
Chirp_array allows users to create, manage, and modify large arrays without having to know their underlying storage form.
Operations on chirp_array:
• create a chirp_array
• open a chirp_array
• set value A[i,j]
• get value A[i,j]
• get row A[i]
• get column A[j]
• set row A[i]
• set column A[j]
[Figure: CPUs, each with a local disk]
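The operations listed above can be mimicked with a small file-backed array to show what hiding the underlying form means. This is not the chirp_array API: the class `ArraySketch`, its method names, and the single-file row-major layout with 8-byte cells are all assumptions made for illustration, whereas the real system spreads the data across many disks.

```python
import os
import struct
import tempfile

CELL = struct.Struct("<d")      # one 8-byte float per cell (assumed layout)
HEADER = struct.Struct("<QQ")   # rows, cols


class ArraySketch:
    """Hypothetical stand-in: a 2-D array stored row-major in one local file."""

    def __init__(self, fh, rows, cols):
        self._fh, self.rows, self.cols = fh, rows, cols

    @classmethod
    def create(cls, path, rows, cols):
        fh = open(path, "w+b")
        fh.write(HEADER.pack(rows, cols))
        fh.write(bytes(CELL.size * rows * cols))  # all cells start at 0.0
        fh.flush()
        return cls(fh, rows, cols)

    @classmethod
    def open_existing(cls, path):
        fh = open(path, "r+b")
        rows, cols = HEADER.unpack(fh.read(HEADER.size))
        return cls(fh, rows, cols)

    def _offset(self, i, j):
        if not (0 <= i < self.rows and 0 <= j < self.cols):
            raise IndexError((i, j))
        return HEADER.size + CELL.size * (i * self.cols + j)

    def set(self, i, j, value):
        self._fh.seek(self._offset(i, j))
        self._fh.write(CELL.pack(value))

    def get(self, i, j):
        self._fh.seek(self._offset(i, j))
        return CELL.unpack(self._fh.read(CELL.size))[0]

    def get_row(self, i):
        return [self.get(i, j) for j in range(self.cols)]

    def get_col(self, j):
        return [self.get(i, j) for i in range(self.rows)]

    def set_row(self, i, values):
        for j, value in enumerate(values):
            self.set(i, j, value)

    def set_col(self, j, values):
        for i, value in enumerate(values):
            self.set(i, j, value)

    def close(self):
        self._fh.close()


# Usage: write through one handle, then reopen and read by cell, row, and column.
path = os.path.join(tempfile.mkdtemp(), "results.arr")
writer = ArraySketch.create(path, 3, 4)
writer.set(1, 2, 4.37)
writer.set_row(0, [6.01, 2.22, 4.37, 7.13])
writer.close()

reader = ArraySketch.open_existing(path)
cell = reader.get(1, 2)
row0 = reader.get_row(0)
col2 = reader.get_col(2)
reader.close()
```

The caller only ever names cells, rows, and columns; whether the bytes live in one file, many files, or on many servers can change behind the same interface.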
Result Storage with chirp_array
• chirp_array_get(i,j)
[Figure: a cache, three CPUs, and three disks]
Data Mining on Large Data Sets
Problem: Supercomputers are expensive, and not all scientists have access to them for solving problems that need very large amounts of memory. Classification on large data sets without sufficient memory can degrade throughput, degrade accuracy, or fail outright.
Data Mining Using Ensembles (From Steinhaeuser and Chawla, 2007)
[Figure: training data, then partitioning/sampling (optional), then algorithm 1 … algorithm n producing classifier 1 … classifier n; a test instance is given to each classifier, followed by voting and the final classification]
Abstraction for Ensembles Using Natural Parallelism
• User: Here are my algorithms. Here is my data set. Here is my test set.
• Abstraction Engine: Choose optimal partitioning and submit batch jobs.
• CPUs: Return local votes for tabulation and final prediction.
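The partition, train, vote, and tabulate flow can be sketched in a few lines. This is a toy under stated assumptions: a nearest-centroid learner stands in for whatever algorithms the user supplies, the partitions are processed in a plain loop where the real engine would submit batch jobs, and the function names and data are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Tuple[float, ...], str]  # (features, label)


def partition(data: Sequence[Sample], n_parts: int) -> List[Sequence[Sample]]:
    """Split the training data round-robin into n_parts pieces."""
    return [data[k::n_parts] for k in range(n_parts)]


def train_nearest_centroid(samples: Sequence[Sample]) -> Callable[[Tuple[float, ...]], str]:
    """Toy learner: classify a point by the closest per-label mean."""
    sums, counts = {}, Counter()
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for d, value in enumerate(features):
            acc[d] += value
        counts[label] += 1
    centroids = {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

    def classify(x: Tuple[float, ...]) -> str:
        return min(centroids,
                   key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])))

    return classify


def ensemble_predict(training: Sequence[Sample], n_parts: int,
                     test_instance: Tuple[float, ...]) -> Tuple[str, List[str]]:
    """Train one classifier per partition, collect local votes, and tabulate."""
    votes = [train_nearest_centroid(part)(test_instance)
             for part in partition(training, n_parts)]
    return Counter(votes).most_common(1)[0][0], votes


# Made-up two-class data set.
training = [((0.0, 0.1), "a"), ((0.2, 0.0), "a"), ((0.1, 0.2), "a"),
            ((1.0, 0.9), "b"), ((0.9, 1.1), "b"), ((1.1, 1.0), "b")]
label_a, votes_a = ensemble_predict(training, 3, (0.1, 0.1))
label_b, votes_b = ensemble_predict(training, 3, (0.95, 1.0))
```

Each classifier only ever sees its own partition, so no single machine needs memory for the whole data set; only the small list of local votes travels back for tabulation.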
DataLab Abstractions
• file system: tcsh, emacs, perl
• distributed data structures: set S, file F
• function evaluation: Y = F(X)
• job operations: job_start, job_commit, job_wait, job_remove
[Figure: parrot connecting these abstractions to chirp servers running on unix filesystems, with set partitions A, B, C and the evaluation of F taking X to Y]
DataLab Language Syntax
apply F on S into T
[Figure: set S and set T, each with partitions A, B, C on chirp servers, with a copy of F per partition]
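One plausible reading of `apply F on S into T` is that F runs on every member of S where that member is stored, and the outputs form a set T with the same placement. The sketch below models that reading only; the slides give the syntax but not the full semantics, so `apply_on`, the dictionary-of-servers representation, and the use of threads in place of remote chirp servers are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


def apply_on(f: Callable[[X], Y], s: Dict[str, List[X]]) -> Dict[str, List[Y]]:
    """Apply f to every member of s, one task per (simulated) server.

    s maps a server name to the members of the set stored there; the
    result keeps the same placement, so each output stays with its input.
    """
    def run_on_server(entry):
        server, members = entry
        return server, [f(x) for x in members]

    with ThreadPoolExecutor() as pool:
        return dict(pool.map(run_on_server, s.items()))


# Set S spread over three hypothetical servers A, B, C.
S = {"A": [1, 2], "B": [3], "C": []}
T = apply_on(lambda x: x * x, S)
```

Keeping the computation next to the data is the design point: only F moves to the servers, while the members of S and T never cross the network.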
For More Information
• Christopher Moretti: cmoretti@cse.nd.edu
• Douglas Thain: dthain@cse.nd.edu
• Cooperative Computing Lab: http://cse.nd.edu/~ccl