Big Data Challenge: data scales from Mega (10^6) and Giga (10^9) through Tera (10^12) to Peta (10^15), addressed by high-level languages such as Pig Latin.
Understanding MapReduce in Context (layered architecture, top to bottom):
• Applications: support scientific simulations (data mining and data analysis) such as kernels, genomics, proteomics, information retrieval, polar science, scientific simulation data analysis and management, dissimilarity computation, clustering, multidimensional scaling, generative topographic mapping; plus security, provenance, portal, services and workflow
• Programming Model: high-level language
• Runtime: cross-platform iterative MapReduce (collectives, fault tolerance, scheduling)
• Storage: distributed file systems, object store, data parallel file system
• Infrastructure: Windows Server HPC bare-system, Linux HPC bare-system, Amazon cloud, Azure cloud, Grid appliance, virtualization
• Hardware: CPU nodes, GPU nodes
Judy Qiu, Bingjing Zhang, School of Informatics and Computing, Indiana University
Contents
• Data Mining and Data Analysis Applications
• Twister: Runtime for Iterative MapReduce
• Twister Tutorial Examples
• K-means Clustering
• Multidimensional Scaling
• Matrix Multiplication
Motivation
• Iterative algorithms are common in many domains
• Traditional MapReduce and classical parallel runtimes cannot execute iterative algorithms efficiently
• Hadoop: repeated data access to HDFS; no optimization for data caching or data transfers
• MPI: no native support for fault tolerance, and the programming interface is complicated
What is Iterative MapReduce?
• Iterative MapReduce
  • MapReduce is a programming model that instantiates the paradigm of bringing computation to data
  • Iterative MapReduce extends the MapReduce programming model to support iterative algorithms for data mining and data analysis
• Interoperability
  • Is it possible to use the same computational tools on HPC and Cloud?
  • Enables scientists to focus on science, not on programming distributed systems
• Reproducibility
  • Is it possible to use cloud computing for scalable, reproducible experimentation?
  • Sharing results, data, and software
Twister Iterative MapReduce v0.9: new infrastructure for iterative MapReduce programming
• Distinction between static and variable data
• Configurable long-running (cacheable) map/reduce tasks
• Pub/sub messaging based communication and data transfers
• Broker network for facilitating communication
• Version 0.9 (hands-on)
• Version beta
Iterative MapReduce Frameworks
• Twister [1]
  • Map -> Reduce -> Combine -> Broadcast
  • Long-running map tasks (data in memory)
  • Centralized driver based, statically scheduled
• Daytona [2]
  • Microsoft iterative MapReduce on Azure using cloud services
  • Architecture similar to Twister
• Twister4Azure [3]
  • Decentralized; map/reduce input caching, reduce output caching
• HaLoop [4]
  • On-disk caching, map/reduce input caching, reduce output caching
• Spark [5]
  • Iterative MapReduce using Resilient Distributed Datasets to ensure fault tolerance
• Pregel [6]
  • Graph processing from Google
• Mahout [7]
  • Apache project supporting data mining algorithms
Others
• Network Levitated Merge [8]
  • RDMA/InfiniBand based shuffle and merge
• Mate-EC2 [9]
  • Local reduction object
• Asynchronous Algorithms in MapReduce [10]
  • Local and global reduce
• MapReduce Online [11]
  • Online aggregation and continuous queries
  • Pushes data from map to reduce
• Orchestra [12]
  • Data transfer improvements for MapReduce
• iMapReduce [13]
  • Asynchronous iterations, one-to-one map and reduce mapping, automatically joins loop-variant and loop-invariant data
• CloudMapReduce [14] and Google AppEngine MapReduce [15]
  • MapReduce frameworks utilizing cloud infrastructure services
Applications & Different Interconnection Patterns (figure): comparison of the interconnection patterns of Map-Only, MapReduce, Iterative MapReduce, and MPI (Pij), illustrating the domain of MapReduce and its iterative extensions.
Parallel Data Analysis using Twister
• MDS (Multidimensional Scaling)
• Clustering (K-means)
• SVM (Support Vector Machine)
• Indexing
• …
Application #1: Twister MDS Output
MDS projection of 100,000 protein sequences showing a few experimentally identified clusters, in preliminary work with Seattle Children's Research Institute.
Application #2: Data-Intensive K-means Clustering
• Image classification: 1.5 TB of data; 500 features per image; 10k clusters
• 1000 map tasks; 1 GB data transfer per map task node
Iterative MapReduce (figure): the Data Deluge meets MapReduce (data centered, QoS, now experienced in many domains) and classic parallel runtimes such as MPI (efficient and proven techniques). The goal is to expand the applicability of MapReduce to more classes of applications, from Map-Only and MapReduce to Iterative MapReduce and further extensions.
Twister Programming Model (figure)
• The main program runs in its own process space and may contain many MapReduce invocations or iterative MapReduce invocations
• configureMaps(..) and configureReduce(..) set up cacheable map/reduce tasks on the worker nodes, which read static data from local disk
• Communications and data transfers go via the pub/sub broker network and direct TCP
• Skeleton of an iterative main program:

  configureMaps(..)
  configureReduce(..)
  while(condition){
      runMapReduce(..)        // iterations: Map() and Reduce() run on the workers
      monitorTillCompletion(..)
      Combine() operation
      updateCondition()
  } //end while
  close()
Twister Design Features (concepts and features in Twister)
• Static data loaded only once: Configure()
• Long-running map/reduce task threads (cached): Map(Key, Value), Reduce(Key, List<Value>)
• Direct TCP broadcast/scatter transfer of dynamic KeyValue pairs; direct data transfer via pub/sub
• Combine operation to collect all reduce outputs: Combine(Map<Key, Value>)
• Iteration driven by the Main Program
• Fault detection and recovery support between iterations
Twister APIs
• configureMaps()
• configureMaps(String partitionFile)
• configureMaps(List<Value> values)
• configureReduce(List<Value> values)
• addToMemCache(Value value)
• addToMemCache(List<Value> values)
• cleanMemCache()
• runMapReduce()
• runMapReduce(List<KeyValuePair> pairs)
• runMapReduceBCast(Value value)
• map(MapOutputCollector collector, Key key, Value val)
• reduce(ReduceOutputCollector collector, Key key, List<Value> values)
• combine(Map<Key, Value> keyValues)
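As a rough illustration of how these calls combine in an iterative job, here is a Java-syntax pseudocode sketch of a K-means driver. Only the API method names above and the MemCacheAddress / TwisterMonitor / driver usage shown on the Broadcast slide later in this tutorial come from the material; the class wiring and every helper marked as a placeholder are assumptions, so treat this as a sketch of the pattern rather than working Twister code.

```java
// Java-syntax pseudocode for an iterative Twister driver. The method names
// (configureMaps, addToMemCache, runMapReduceBCast, monitorTillCompletion,
// cleanMemCache, close) and the MemCacheAddress/TwisterMonitor types appear
// on the slides; the driver construction and all "placeholder" helpers do not.
public class KMeansDriver {
    static final double THRESHOLD = 0.0001;   // placeholder convergence threshold

    public static void main(String[] args) throws Exception {
        TwisterDriver driver = createDriver(args);          // placeholder: build and configure the driver
        driver.configureMaps("kmeans_partition_file");      // static data located via the partition file

        List<double[]> centroids = loadInitialCentroids(args);   // placeholder
        double diff = Double.MAX_VALUE;

        while (diff > THRESHOLD) {
            // Broadcast this iteration's centroids to all long-running (cached) map tasks.
            MemCacheAddress memCacheKey = driver.addToMemCache(toValue(centroids));  // toValue: placeholder
            TwisterMonitor monitor = driver.runMapReduceBCast(memCacheKey);
            monitor.monitorTillCompletion();

            List<double[]> newCentroids = readCombinedOutput(driver);   // placeholder: result of combine()
            diff = distanceBetween(centroids, newCentroids);            // placeholder
            centroids = newCentroids;
            driver.cleanMemCache();   // discard the previous iteration's broadcast data
        }
        driver.close();
    }
}
```

The key design point this sketch is meant to show is that static data is configured once, while only the small, iteration-variant data (the centroids) is broadcast each round.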
Twister Architecture (figure)
• Master node: runs the Twister Driver and the Main Program, connected to the pub/sub broker network and collective communication service
• Worker nodes: each runs a Twister Daemon with a worker pool of cacheable map/reduce tasks and a local disk
• Scripts perform data distribution, data collection, and partition file creation
Data Storage and Caching
• Uses local disk to store static data files
• Uses the Data Manipulation Tool to manage data
• Uses the partition file to locate data
• Data on disk can be cached into local memory
• Supports NFS (version 0.9)
Data Manipulation Tool
• Provides basic functionality to manipulate data across the local disks of the compute nodes
• Data partitions are assumed to be files (in contrast to the fixed-size blocks in Hadoop)
• Supported commands: mkdir, rmdir, put, putall, get, ls
• Copies resources and creates the partition file
• Data resides in a common directory in the local disks of the individual nodes (Node 0 … Node n), e.g. /tmp/twister_data
Partition File
• The partition file allows duplicate entries to indicate replica availability
• A data file and its replicas may reside on multiple nodes
• Provides the information required for caching
Pub/Sub Messaging
• For small control messages only
• Currently supported brokers:
  • NaradaBrokering: single broker, manual configuration
  • ActiveMQ: multiple brokers, automatic configuration
Broadcast
• Use the addToMemCache operation to broadcast dynamic data required by all tasks in each iteration
• Replaces the original broker-based methods with direct TCP data transfers
• Automatic algorithm selection
• Topology-aware broadcasting

  MemCacheAddress memCacheKey = driver.addToMemCache(centroids);
  TwisterMonitor monitor = driver.runMapReduceBCast(memCacheKey);
Twister Communications (figure)
• Broadcasting: data could be large; chain and minimum spanning tree (MST) algorithms
• Map Collectives: local merge across map tasks
• Reduce Collectives: collect but no merge across reduce tasks
• Combine: direct download or gather
Twister Broadcast Comparison: sequential vs. parallel implementations.
Failure Recovery
• Recovery happens at iteration boundaries
• Individual task failures are not handled
• Any failure (hardware or daemons) results in the following failure-handling sequence:
  • Terminate currently running tasks (remove from memory)
  • Poll for currently available worker nodes (and daemons)
  • Configure map/reduce using static data (re-assign data partitions to tasks depending on data locality)
  • Re-execute the failed iteration
Bioinformatics Pipeline (figure)
• Gene sequences (N = 1 million) go through pairwise alignment and distance calculation, an O(N^2) step, producing a distance matrix
• A reference sequence set (M = 100k) is selected and mapped to 3D reference coordinates (x, y, z) by Multi-Dimensional Scaling (MDS)
• The remaining N - M sequences (900k) are mapped by interpolative MDS with pairwise distance calculation, producing N - M coordinates
• The combined x, y, z coordinates are rendered as a 3D plot for visualization
Million Sequence Challenge
• Input data size: 680k sequences
• Sample data size: 100k
• Out-of-sample data size: 580k
• Test environment: PolarGrid with 100 nodes, 800 workers
Dimension Reduction Algorithms
• Multidimensional Scaling (MDS) [1]
  • Given the proximity information among points
  • An optimization problem: find a mapping of the given data in the target dimension, based on pairwise proximity information, while minimizing the objective function
  • Objective functions: STRESS (1) or SSTRESS (2)
  • Only needs the pairwise distances δij between the original points (typically not Euclidean)
  • dij(X) is the Euclidean distance between the mapped (3D) points
• Generative Topographic Mapping (GTM) [2]
  • Finds an optimal K-representation of the given data (in 3D), known as the K-cluster problem (NP-hard)
  • The original algorithm uses the EM method for optimization
  • A Deterministic Annealing algorithm can be used to find a global solution
  • The objective function is to maximize the log-likelihood

[1] I. Borg and P. J. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, New York, NY, U.S.A., 2005.
[2] C. Bishop, M. Svensén, and C. Williams. GTM: The generative topographic mapping. Neural Computation, 10(1):215-234, 1998.
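The slide cites the two objective functions only by name; for reference, their standard forms from Borg and Groenen [1], written in the δij / dij(X) notation above (with optional weights wij), are:

$$\text{STRESS:}\quad \sigma(X) = \sum_{i<j\le N} w_{ij}\,\big(d_{ij}(X) - \delta_{ij}\big)^2 \qquad (1)$$

$$\text{SSTRESS:}\quad \sigma^2(X) = \sum_{i<j\le N} w_{ij}\,\big(d_{ij}(X)^2 - \delta_{ij}^2\big)^2 \qquad (2)$$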
High Performance Data Visualization
• Developed parallel MDS and GTM algorithms to visualize large and high-dimensional data
• Processed 0.1 million PubChem data points having 166 dimensions
• Parallel interpolation can process up to 2M PubChem points

Figure captions:
• MDS for 100k PubChem data: 100k PubChem data points having 166 dimensions are visualized in 3D space; colors represent 2 clusters separated by their structural proximity
• GTM for 930k genes and diseases: genes (green) and diseases (other colors) are plotted in 3D space, aiming at finding cause-and-effect relationships
• GTM with interpolation for 2M PubChem data: 2M PubChem data points are plotted in 3D with the GTM interpolation approach; red points are 100k sampled data and blue points are 4M interpolated points

[3] PubChem project, http://pubchem.ncbi.nlm.nih.gov/
Twister-MDS Demo
• Input: 30k metagenomics data
• MDS reads the pairwise distance matrix of all sequences
• Output: 3D coordinates visualized in PlotViz
• Workflow (figure): I. the client node sends a message to the Twister Driver on the master node to start the job; II. intermediate results are sent via the ActiveMQ broker to the MDS Monitor and PlotViz on the client node
Twister-MDS Demo Movie: http://www.youtube.com/watch?v=jTUD_yLrW1s&feature=share
How it works? (illustration from Wikipedia)
Parallelization of K-means Clustering (figure): the current cluster centers are broadcast to all map tasks, and each map task works on its own data partition.
K-means Clustering Algorithm for MapReduce

  Do
      Broadcast Cn

      [Perform in parallel] – the map() operation (E-step)
      for each Vi
          for each Cn,j
              Dij <= Euclidean(Vi, Cn,j)
          Assign point Vi to Cn,j with minimum Dij
      for each Cn,j
          newCn,j <= Sum(Vi in j'th cluster)
          newCountj <= Sum(1 if Vi in j'th cluster)

      [Perform sequentially] – the reduce() operation (M-step, global reduction)
      Collect all Cn
      Calculate new cluster centers Cn+1
      Diff <= Euclidean(Cn, Cn+1)

  while (Diff > THRESHOLD)

Notation:
• Vi is the i'th data vector
• Cn,j is the j'th cluster center in the n'th iteration
• Dij is the Euclidean distance between the i'th vector and the j'th cluster center
• K is the number of cluster centers
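Written as equations in the same notation, each iteration performs the assignment step followed by the center update (the set Sn,j of points assigned to center j is introduced here only for convenience):

$$\text{E-step:}\quad \text{assign } V_i \text{ to the center } j^* = \arg\min_{j}\, \lVert V_i - C_{n,j} \rVert$$

$$\text{M-step:}\quad C_{n+1,j} = \frac{1}{|S_{n,j}|} \sum_{V_i \in S_{n,j}} V_i$$

The loop terminates once the change between successive center sets, Diff = Euclidean(Cn, Cn+1), drops below THRESHOLD.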
Twister K-means Execution (figure): each map task is configured with a <c, File_i> pair identifying its cached data partition and receives the broadcast centroid set C1, C2, C3, …, Ck; the map tasks emit <K, V> pairs that the reduce and combine steps merge into the new centroid set C1 … Ck for the next iteration.
K-means Clustering Code Walk-through (a sketch of the map and reduce tasks follows below)
• Main program: job configuration, execution of iterations
• KMeansMapTask: configuration, map
• KMeansReduceTask: reduce
• KMeansCombiner: combine
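To make the walk-through concrete, here is a minimal, hypothetical sketch of the map and reduce logic, written against the map()/reduce() signatures listed on the Twister APIs slide. The field types and every helper marked "placeholder" (loadPartition, decodeCentroids, squaredDistance, encodePartialSums, mergeAndDivide, encodeCentroids) are illustrative assumptions, not the actual tutorial code; imports are omitted.

```java
// Sketch only: the map()/reduce() signatures follow the Twister APIs slide, but the
// field types and all "placeholder" helpers are illustrative assumptions.
public class KMeansMapTask {
    private double[][] points;   // this task's static data partition, cached across iterations

    public void configure(String partitionEntry) {
        points = loadPartition(partitionEntry);   // placeholder: read the data file from local disk
    }

    public void map(MapOutputCollector collector, Key key, Value val) {
        double[][] centroids = decodeCentroids(val);   // broadcast centroids for this iteration
        int k = centroids.length, dim = points[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];

        // E-step: assign every point in the partition to its nearest center,
        // accumulating per-center partial sums and counts.
        for (double[] p : points) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int j = 0; j < k; j++) {
                double d = squaredDistance(p, centroids[j]);
                if (d < bestDist) { bestDist = d; best = j; }
            }
            for (int x = 0; x < dim; x++) sums[best][x] += p[x];
            counts[best]++;
        }
        collector.collect(key, encodePartialSums(sums, counts));   // placeholder encoding
    }
}

class KMeansReduceTask {
    public void reduce(ReduceOutputCollector collector, Key key, List<Value> values) {
        // M-step: merge the partial sums and counts from all map tasks,
        // then divide to obtain the new centers for the next iteration.
        collector.collect(key, encodeCentroids(mergeAndDivide(values)));   // placeholder helpers
    }
}
```

The combiner would then gather the reduce output (the combine(Map<Key, Value>) call in the API list) on the driver side and turn it back into the centroid array for the next iteration's broadcast.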
Download Resources
• Step 1: Download VirtualBox: https://www.virtualbox.org/wiki/Downloads
• Step 2: Download the appliance: http://salsahpc.indiana.edu/ScienceCloud/apps/salsaDPI/virtualbox/chef_ubuntu.ova
• Step 3: Import the appliance
Prepare the Environment
• Step 4: Start the VM
• Step 5: Open 3 terminals (PuTTY or equivalent) and connect to the VM (IP: 192.168.56.1, username: root, password: school2012)
Start ActiveMQ in Terminal 1
• Step 6:
  cd $ACTIVEMQ_HOME/bin
  ./activemq console