Chapter 8 Cloud Programming with Hadoop and Spark
Scalable Parallel Computing Over Large Clusters • The MapReduce from Google and Apache Hadoop and Spark programming models • All of these are open source available to all programmers • Run on a large-scale cluster of servers • Either on clouds or on supercomputers
Characteristics of Scalable Computing • Handling the whole data flow of parallel and distributed programming is time-consuming • Needs specialized knowledge of programming • Dealing with these issues may affect the productivity of the programmer • May even delay the program’s time to market • May distract the programmer from concentrating on the logic of the program itself • Parallel and distributed programming paradigms or models are offered • To abstract many parts of the data flow • Aim to provide users with an abstraction layer to hide implementation details of the data flow
Characteristics of Scalable Computing (cont.) • Simplicity of writing parallel programs is an important metric for parallel and distributed programming paradigms • Other motivations • To improve productivity of programmers • To decrease programs’ time to market • To leverage underlying resources more efficiently • To increase system throughput • To support higher levels of abstraction • For running parallel programs in a distributed cluster of servers • Partitioning is applicable to computation and data
Characteristics of Scalable Computing (cont.) • Computation partitioning • Computation partitioning splits a given job or program into smaller tasks • Partitioning greatly depends on correctly identifying portions of the job that can be performed concurrently • Identifying parallelism in the structure of the program to be divided into parts to be run on different workers • Different parts may process different data or a copy of the same data • Data partitioning • Data partitioning is splitting the input or intermediate data into smaller pieces • Identifying parallelism in the input data to be divided into pieces to be processed on different workers • Data pieces may be processed by different parts of a program or a copy of the same program
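Data partitioning as described above can be sketched in a few lines. This is a minimal illustration only: the function name `split_data` and the choice of contiguous, near-equal pieces are assumptions, not something the slides prescribe.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_data(records: Sequence[T], num_pieces: int) -> List[List[T]]:
    """Split records into num_pieces contiguous pieces of near-equal size."""
    if num_pieces <= 0:
        raise ValueError("num_pieces must be positive")
    base, extra = divmod(len(records), num_pieces)
    pieces: List[List[T]] = []
    start = 0
    for i in range(num_pieces):
        # The first `extra` pieces take one additional record each.
        size = base + (1 if i < extra else 0)
        pieces.append(list(records[start:start + size]))
        start += size
    return pieces


# Ten records split for three workers.
pieces = split_data(list(range(10)), 3)
print(pieces)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each piece could then be handed to a different worker running a copy of the same program.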
Characteristics of Scalable Computing (cont.) • Allocation • Assigning either the smaller parts of a program or smaller pieces of data to underlying resources • Aims to appropriately assign parts or pieces to be run simultaneously on different workers • Usually handled by resource allocators in the system • Synchronization • Different workers may perform different tasks • Synchronization and coordination between workers is necessary to prevent race conditions • Multiple accesses to a shared resource by different workers may cause a race condition • Data dependency between different workers also needs to be properly managed • Data dependency happens when a worker needs the processed data of other workers
Characteristics of Scalable Computing (cont.) • Communication • Data dependency is one of the main reasons for communication between workers • Communication is always triggered when the intermediate data is ready to be sent among workers • Scheduling • The computation parts or data pieces may be more than the available workers • A scheduler selects a sequence of tasks or data pieces to be assigned to the workers • The resource allocator performs the mapping of computation or data pieces to workers • The scheduler only picks the next part from the queue of unassigned tasks based on a set of rules called scheduling policy
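The division of labor between scheduler and allocator can be sketched as follows. This is a toy model under stated assumptions: the policy is plain first-in-first-out, the function name `fifo_schedule` and the task and worker labels are hypothetical, and every task is assumed to take one round.

```python
from collections import deque
from typing import Deque, Dict, List


def fifo_schedule(tasks: List[str], workers: List[str]) -> List[Dict[str, str]]:
    """Assign tasks to workers round by round under a FIFO scheduling policy."""
    if not workers:
        raise ValueError("at least one worker is required")
    queue: Deque[str] = deque(tasks)  # queue of unassigned tasks
    rounds: List[Dict[str, str]] = []
    while queue:
        assignment: Dict[str, str] = {}
        for worker in workers:
            if not queue:
                break
            # Scheduler: pick the oldest unassigned task (the FIFO policy).
            # Allocator: map that task onto this worker.
            assignment[worker] = queue.popleft()
        rounds.append(assignment)
    return rounds


# Five tasks, two workers: three rounds are needed.
rounds = fifo_schedule(["t1", "t2", "t3", "t4", "t5"], ["w1", "w2"])
print(rounds)
```

Swapping the `popleft()` call for a priority-based pick would change the scheduling policy without touching the allocation step.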
Characteristics of Scalable Computing (cont.) • For multiple jobs or programs • A scheduler selects a sequence of jobs to be run on the distributed computing system • Resource scheduling demands high efficiency in Hadoop and Spark programming • The loose coupling of components in distributed resources makes it possible to schedule elastic VMs • This yields higher fault tolerance and scalability than traditional programming models using the message passing interface (MPI) library
Workers in a Cluster • Workers are installed in the physical servers in the cloud cluster • Can be VM instances or application containers • The scheduling of VMs is automatically launched • Explicit scheduling is often performed in clouds • Either for individual worker roles • Or for gang-scheduling supported in MapReduce • Queues provide a natural way to manage the task assignment in a fault-tolerant distributed environment
From MapReduce to Hadoop and Spark • Both MapReduce and Hadoop are designed for bipartite graph computing • Not for general-purpose applications • Spark extends the MapReduce model • On the speed side, supports interactive queries and streaming processing • Using in-memory computing • Offers the ability to run computations in memory • More efficient than MapReduce running on disks for complex applications • Spark is highly accessible • Offering simple APIs in Python, Java, Scala, structured query language (SQL), etc
From MapReduce to Hadoop and Spark (cont.) • Spark can run in Hadoop clusters • Access any Hadoop data source like Cassandra • MapReduce is a software framework • Built with a master-worker model • Supports parallel and distributed computing on large data sets • Abstracts the data flow of running a parallel program on a distributed computing system • By providing users with two interfaces in the form of two functions: Map and Reduce • Users can override these two functions to interact with and manipulate the data flow of running the programs
From MapReduce to Hadoop and Spark (cont.) • Data parallel languages largely aimed at loosely coupled clusters of servers • The language and runtime spawn many task executions simultaneously • MapReduce applies dynamic execution, fault tolerance, and easy-to-use APIs • Performs Map and Reduce functions in a pipelined fashion • MapReduce software framework was first proposed and implemented by Google • Google MapReduce is written in C++
From MapReduce to Hadoop and Spark (cont.) • Evolved from use in a search engine to Google App Engine cloud • Hadoop library is developed for MapReduce programming in Java environments • The original Hadoop implements MapReduce in batch mode over distributed disks • Spark improves on Hadoop with in-memory processing • In both batch and streaming modes • Over a directed acyclic graph (DAG) based computing paradigm
From MapReduce to Hadoop and Spark (cont.) • Initially, Google’s MapReduce was applied only in fast search engines • Then MapReduce enabled cloud computing • Apache Hadoop has made MapReduce possible for big data processing on large server clusters or clouds • Apache Spark removes many constraints imposed by MapReduce and Hadoop programming • In general-purpose batch or streaming applications
Application Software Libraries for Big Data Processing • Some open-source software tools and programming projects • Often used to process big data • Most have been developed for big data storage, mining, and analysis in academia and industry • The tool names, categories, URLs, languages used, and relevant sections, and major functionality and applications of these tool sets
Application Software Libraries for Big Data Processing (cont.)
Application Software Libraries for Big Data Processing (cont.) • These software libraries are classified into • Compute engine (Hadoop, Spark) • Data storage (HDFS, Cassandra) • Resource management (YARN, Mesos) • Query engine (Impala, Spark SQL) • Message system (StormMQ) • Data mining (Weka) • Data analytics (MLlib, Mahout) • Graph processing (GraphX)
Hadoop Programming with YARN and HDFS • The basic concept of MapReduce for batch processing of large data sets • In batch processing, we deal with a static data set which will not change during execution • Streaming data or real-time data cannot be handled well in batch mode • The batch processing considers only static data sets executed in the original MapReduce framework
The MapReduce Compute Engine • The MapReduce software framework • Provides an abstraction layer for data and control flow • The logical data flow from the Map to the Reduce • The control flow is hidden from users
The MapReduce Compute Engine (cont.) • The data flow in a MapReduce framework is predefined • Data partitioning, mapping and scheduling, synchronization, communication, and output of results • Partitioning is controlled in user programs • By specifying the partitioning block size and data fetch patterns • The abstraction layer provides two well-defined interfaces in two functions: Map and Reduce • These mapper and reducer functions can be defined by the user to achieve specific objectives • The user overrides the Map and Reduce functions • Invokes the provided MapReduce(Spec, &Results) function from the library to start the flow of data
The MapReduce Compute Engine (cont.) • The MapReduce function takes a specification object, called Spec • This object is first initialized inside the user’s program • The user writes code to fill it with the names of input and output files, as well as other tuning parameters • This object is also filled with the names of the Map and the Reduce functions • The MapReduce library is essentially the controller of the MapReduce pipeline • Coordinates the dataflow from the input end to the output end in a synchronous manner • The API tools are used to provide an abstraction • To shield the MapReduce software framework from arbitrary intervention by users
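The role of the specification object can be illustrated with a small sequential sketch. Everything named here is hypothetical: the `Spec` field names, the `map_reduce` driver, and the line-and-character counting job are illustrative stand-ins, not the Google or Hadoop API.

```python
import os
import tempfile
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class Spec:
    """Hypothetical specification object filled in by the user's program."""
    input_files: List[str]
    output_file: str
    map_fn: Callable[[int, str], Iterable[Tuple[str, int]]]
    reduce_fn: Callable[[str, List[int]], int]
    num_reduce_tasks: int = 1  # tuning parameter, unused in this sequential sketch


def map_reduce(spec: Spec, results: Dict[str, int]) -> None:
    """Toy controller: run map over every line, group by key, then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for path in spec.input_files:
        with open(path, encoding="utf-8") as handle:
            offset = 0  # character offset of the line, used as the input key
            for line in handle:
                for key, value in spec.map_fn(offset, line.rstrip("\n")):
                    groups[key].append(value)
                offset += len(line)
    with open(spec.output_file, "w", encoding="utf-8") as out:
        for key in sorted(groups):
            results[key] = spec.reduce_fn(key, groups[key])
            out.write(f"{key}\t{results[key]}\n")


def count_map(offset: int, line: str) -> Iterable[Tuple[str, int]]:
    """User-defined Map: emit one line count and the line's character count."""
    yield ("lines", 1)
    yield ("chars", len(line))


def sum_reduce(key: str, values: List[int]) -> int:
    """User-defined Reduce: add up all values of one key."""
    return sum(values)


results: Dict[str, int] = {}
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "input.txt")
    with open(in_path, "w", encoding="utf-8") as f:
        f.write("alpha\nbeta\n")
    spec = Spec([in_path], os.path.join(tmp, "output.txt"), count_map, sum_reduce)
    map_reduce(spec, results)
print(results)  # {'chars': 9, 'lines': 2}
```

The user code touches only `Spec`, `count_map`, and `sum_reduce`; the grouping and output steps stay inside the driver, which is the abstraction the slides describe.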
Logical Dataflow • The input data to both the Map and the Reduce function have a particular structure • The same argument goes for the output data too • The input data to the Map function is arranged in the form of a (key, value) pair • The value is the actual data • The key part is only used to control the data flow • e.g., The key is the line offset within the input file and the value is the content of the line • The output data from the Map function is structured as (key, value) pairs • Called intermediate (key, value) pairs
Logical Dataflow (cont.) • The Map function processes each input (key, value) pair • To produce a few intermediate (key, value) pairs • The aim is to process all input (key, value) pairs to the Map function in parallel • e.g., The map function emits each word w plus an associated count of occurrences • Just a 1 is recorded in this pseudo-code • The Reduce function receives the intermediate (key, value) pairs
Logical Dataflow (cont.) • In the form of a group of intermediate values • (key, [set of values]) associated with one intermediate key • The MapReduce framework forms these groups • By first sorting the intermediate (key, value) pairs • Then grouping values with the same key • Sorting the data is done to simplify the grouping process • The Reduce function processes each (key, [set of values]) group • Produces a set of (key, value) pairs as output • e.g., The reduce function merges the word counts by different map workers • Into a total count as output
Logical Dataflow (cont.) • Word Count Using MapReduce over Partitioned Data Set • One of the well-known MapReduce problems • The word count problem for a simple input file containing only two lines • most people ignore most poetry • most poetry ignores most people
Logical Dataflow (cont.) • The Map function simultaneously produces a number of intermediate (key, value) pairs for each line content • Each word is the intermediate key with 1 as its intermediate value, e.g., (ignore, 1) • The MapReduce library collects all the generated intermediate (key, value) pairs • Sorts them to group the 1s for identical words, e.g., (people, [1,1]) • Groups are then sent to the Reduce function in parallel • It can sum up the 1 values for each word • Generate the actual number of occurrences for each word in the file, e.g., (people, 2)
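The word count flow for the two-line input can be traced in plain Python. This is a sequential sketch, not a distributed run; the names `map_fn` and `reduce_fn` are illustrative, and the line index stands in for the line offset used as the input key.

```python
from itertools import groupby
from operator import itemgetter

lines = ["most people ignore most poetry", "most poetry ignores most people"]


def map_fn(offset, line):
    """Emit (word, 1) for every word in the line."""
    for word in line.split():
        yield (word, 1)


def reduce_fn(word, counts):
    """Sum the 1s collected for one word."""
    return (word, sum(counts))


# Map phase: one call per input line.
intermediate = [pair for offset, line in enumerate(lines)
                for pair in map_fn(offset, line)]

# Sort by key so identical words become adjacent, then group their values.
intermediate.sort(key=itemgetter(0))
grouped = [(key, [value for _, value in group])
           for key, group in groupby(intermediate, key=itemgetter(0))]

# Reduce phase: one call per unique word.
word_counts = dict(reduce_fn(key, values) for key, values in grouped)
print(grouped)
print(word_counts)
```

The grouped list contains `('people', [1, 1])` and the final result contains `'people': 2`, matching the pairs quoted on the slide.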
Logical Dataflow (cont.) • Hadoop Implementation of a MapReduce WebVisCounter Program • WebVisCounter counts the number of times • Users connect to or visit a given website using a particular operating system
Logical Dataflow (cont.) • e.g., Windows XP or Ubuntu Linux • The input data is a typical web server log file • A line has eight fields separated by tabs or spaces • The Map function parses each line to extract the type of the used OS as a key and assigns a value 1 to it • The Reduce function in turn sums up the number of 1s for each unique key
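The logic of WebVisCounter can be sketched in plain Python, although the program described above is a Hadoop implementation. Several details are assumptions made only for this sketch: fields are tab-separated, the operating system is the eighth field, and the sample log lines are invented.

```python
from collections import defaultdict

OS_FIELD = 7  # assumed position of the OS field (eighth of eight fields)


def map_fn(line):
    """Extract the OS type as the key and emit (os, 1); skip malformed lines."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 8:
        yield (fields[OS_FIELD], 1)


def reduce_fn(os_name, ones):
    """Sum the 1s recorded for one operating system."""
    return (os_name, sum(ones))


# Invented sample log lines in the assumed eight-field layout.
log = [
    "10.0.0.1\t-\t-\t[01/Jan/2010:10:00:00]\tGET /index.html\t200\t512\tWindows XP",
    "10.0.0.2\t-\t-\t[01/Jan/2010:10:00:05]\tGET /about.html\t200\t734\tUbuntu Linux",
    "10.0.0.3\t-\t-\t[01/Jan/2010:10:00:09]\tGET /index.html\t200\t512\tWindows XP",
    "truncated line",
]

groups = defaultdict(list)
for line in log:
    for os_name, one in map_fn(line):
        groups[os_name].append(one)

visits = dict(reduce_fn(os_name, ones) for os_name, ones in groups.items())
print(visits)  # {'Windows XP': 2, 'Ubuntu Linux': 1}
```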
Parallel Batch Processing • Each Map server applies the map function to each input data split • Many mapper functions run concurrently on hundreds or thousands of machine instances • Many intermediate key-value pairs are generated • Stored in local disks for subsequent use • The original MapReduce is slow on large clusters • Due to disk-based handling of intermediate results • The Reduce server collates the values using the reduce function • The reducer function can be max., min., average, dot product of two vectors, etc
Formal MapReduce Model • The Map function is applied in parallel to every input (key, value) pair • Produces a new set of intermediate (key, value) pairs • MapReduce library collects all the produced intermediate pairs from all input pairs • Sorts them based on the key part • Groups the values of all occurrences of the same key • The Reduce function is applied in parallel to each group • To produce the collection of values as output
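The two functions can be summarized by their type signatures. The symbols $k_1, v_1, k_2, v_2$ are notation introduced here for the input and intermediate key and value types; they do not appear on the slides.

```latex
\begin{align*}
\text{Map}    &: (k_1, v_1) \;\longrightarrow\; \text{list}(k_2, v_2) \\
\text{Reduce} &: (k_2, \text{list}(v_2)) \;\longrightarrow\; \text{list}(v_2)
\end{align*}
```

Between the two stages, the library turns the flat collection of $(k_2, v_2)$ pairs into one $(k_2, \text{list}(v_2))$ group per unique key.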
Formal MapReduce Model (cont.) • After grouping all the intermediate data • The values of all occurrences of the same key are sorted and grouped together • Each key becomes unique in all intermediate data • Finding unique keys is the starting point to solving a typical MapReduce problem • The intermediate (key, value) pairs as the output of map function will be automatically produced • Examples of how to define keys and values • Count the number of occurrences of each word in a collection of documents in the above example
Formal MapReduce Model (cont.) • Count the number of occurrences of anagrams in a collection of documents • Anagrams are words that are formed by rearranging the letters of another word • e.g., listen can be reworked into the word silent • The unique keys are an alphabetically sorted sequence of letters for each word, e.g., eilnst • The intermediate value is the number of occurrences • The main responsibility of the MapReduce framework • To efficiently run a user’s program on a distributed computing system • Carefully handles all partitioning, mapping, synchronization, communication, and scheduling details of such data flows
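The anagram example differs from word count only in how the key is defined. A minimal sketch, with invented documents and illustrative function names:

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Emit (sorted letters, 1) so that all anagrams share one key."""
    for word in text.lower().split():
        yield ("".join(sorted(word)), 1)


def reduce_fn(key, values):
    """Sum the occurrences recorded for one anagram class."""
    return (key, sum(values))


# Invented input documents.
documents = {"d1": "listen silent enlist", "d2": "silent night thing"}

groups = defaultdict(list)
for doc_id, text in documents.items():
    for key, value in map_fn(doc_id, text):
        groups[key].append(value)

anagram_counts = dict(reduce_fn(key, values) for key, values in groups.items())
print(anagram_counts)  # {'eilnst': 4, 'ghint': 2}
```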
Formal MapReduce Model (cont.) • Distinct steps of the MapReduce engine • Data partitioning of input files • The MapReduce library splits the input data files into multiple pieces that match the number of map workers • Called splits or blocks • Fork out the user program to masters and workers • One copy of the program runs on the master node • The map and reduce tasks fork out to map and reduce workers, respectively • Assign map tasks and reduce tasks • The master picks idle workers and assigns tasks to them
Formal MapReduce Model (cont.) • Read partitioned data blocks into map workers • Each map worker reads its own block of input data • A map worker may handle one or more input splits • Perform the operations of the map workers • MapReduce library generates many copies of a user program and distributes them on available workers • The Map function receives the input data split as a set of (key, value) pairs to process and produce the intermediate (key, value) pairs • Sort and group (key, value) pairs • MapReduce applies a simple synchronization policy to coordinate map workers with reduce workers • The communication between them starts when all map tasks finish
Formal MapReduce Model (cont.) • MapReduce framework sorts and groups the intermediate (key, value) pairs before forwarding them to reduce workers • The intermediate (key, value) pairs with identical keys are grouped together • All values inside each group should be processed by only one Reduce function to generate the final result • Perform the reduce function • Reduce worker iterates over the grouped (key, value) pairs • For each unique key, the key and corresponding values are sent to one Reduce function • Write the results to output files • The reduce worker will output the final results to output files
Formal MapReduce Model (cont.) • Intermediate (key, value) pairs produced are partitioned into R regions • R is equal to the number of reduce tasks • This guarantees that (key, value) pairs with identical keys are stored in the same region • A partitioning function could simply be a hash function to forward the data into particular regions • e.g., Hash(key) mod R • Reduce workers may face a problem of network congestion • Caused by the reduction or merging operation performed
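The Hash(key) mod R rule can be sketched as below. The choice of `zlib.crc32` is an assumption made for this sketch: Python's built-in `hash()` for strings is salted per process, so a stable checksum is used to keep region assignments reproducible across workers. The function names are illustrative.

```python
import zlib
from typing import List, Tuple


def partition(key: str, num_regions: int) -> int:
    """Return the region index in [0, R) for a key: Hash(key) mod R."""
    return zlib.crc32(key.encode("utf-8")) % num_regions


def partition_pairs(pairs: List[Tuple[str, int]],
                    num_regions: int) -> List[List[Tuple[str, int]]]:
    """Place every intermediate (key, value) pair into its region."""
    regions: List[List[Tuple[str, int]]] = [[] for _ in range(num_regions)]
    for key, value in pairs:
        regions[partition(key, num_regions)].append((key, value))
    return regions


pairs = [("most", 1), ("people", 1), ("most", 1), ("poetry", 1), ("people", 1)]
regions = partition_pairs(pairs, 3)  # R = 3 reduce tasks
for index, region in enumerate(regions):
    print(index, region)
```

Because the region depends only on the key, both `("most", 1)` pairs always land in the same region, whichever map worker produced them.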
Compute-Data Locality • The MapReduce implementation takes advantage of Google File System (GFS) • As the underlying layer • MapReduce could perfectly adapt itself to GFS • GFS is a distributed file system • Files are divided into fixed-size blocks (chunks) • Blocks are distributed and stored on cluster nodes • MapReduce library splits the input data (files) into fixed-size blocks • Ideally performs the Map function in parallel on each block
Compute-Data Locality (cont.) • GFS has already stored files as a set of blocks • MapReduce just needs to send a copy of the user’s program • Containing the Map function to the nodes that already store the data blocks • The notion of sending computation toward data rather than sending data toward computation
MapReduce for Parallel Matrix Multiplication • In multiplying two $n \times n$ matrices $A = (a_{ij})$ and $B = (b_{ij})$ • Need to perform $n^2$ dot product operations to produce an output matrix $C = (c_{ij})$ • Each dot product produces an output element $c_{ij} = a_{i1} \times b_{1j} + a_{i2} \times b_{2j} + \cdots + a_{in} \times b_{nj}$ • Corresponding to the i-th row vector in matrix A multiplied by the j-th column vector in matrix B • Mathematically, each dot product takes n multiply-and-add time units to complete • The total matrix multiply complexity equals $n \times n^2 = n^3$ since there are $n^2$ output elements • In theory, the $n^2$ dot products are totally independent of each other
MapReduce for Parallel Matrix Multiplication (cont.) • They can be done on $n^2$ servers in n time units • When n is very large, say millions or higher • Too expensive to build a cluster with $n^2$ servers • In practice, only $N \ll n^2$ servers are used • The ideal speedup is expected to be N • MapReduce Multiplication of Two Matrices • Apply the MapReduce method to multiply two 2×2 matrices: $A = (a_{ij})$ and $B = (b_{ij})$ • With two mappers and one reducer
MapReduce for Parallel Matrix Multiplication (cont.) • Map the first and second rows of matrix A and the entire matrix B to the first and second Map servers, respectively • Four keys are used to identify four blocks of data processed • K11, K12, K21, and K22 • Simply denoted by the matrix element indices • Partition matrix A and matrix $B^T$ by rows into two blocks, horizontally • $B^T$ is the transposed matrix of B • Data blocks are read into the two mappers • All intermediate computing results are identified by their <key, value> pairs
MapReduce for Parallel Matrix Multiplication (cont.) • The generation, sorting, and grouping of four <key, value> pairs by each mapper in two stages • Each short pair <key, value> holds a single partial-product value identified by its key • The long pair holds two partial products identified by each block key • The Reducer is used to sum up the output matrix elements using four long <key, value(s)> pairs • Consider six mappers and two reducers • Each mapper handles n/6 adjacent rows of the input matrix • Each reducer generates n/2 of the output matrix C
MapReduce for Parallel Matrix Multiplication (cont.) • When the matrix order becomes very large • The time to multiply very large matrices becomes cost prohibitive • A dataflow graph for the above example
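The 2×2 case with two mappers and one reducer can be traced in plain Python. The matrix values below are invented for illustration, and the loop over rows stands in for the two mappers running in parallel; keys follow the K11, K12, K21, K22 naming as 1-based index tuples.

```python
from collections import defaultdict

A = [[1, 2], [3, 4]]  # invented sample matrices
B = [[5, 6], [7, 8]]


def map_fn(i, row_a, b_transposed):
    """Mapper i: emit one short pair <K_ij, a_ik * b_kj> per partial product."""
    for j, row_bt in enumerate(b_transposed):
        for a_ik, b_kj in zip(row_a, row_bt):
            yield ((i + 1, j + 1), a_ik * b_kj)


def reduce_fn(key, partial_products):
    """Reducer: sum the partial products of one output element c_ij."""
    return key, sum(partial_products)


# B is transposed so that both inputs are partitioned by rows.
b_t = [list(column) for column in zip(*B)]

# Group the short pairs into long pairs <K_ij, [p1, p2]>.
groups = defaultdict(list)
for i, row in enumerate(A):  # one mapper per row of A
    for key, value in map_fn(i, row, b_t):
        groups[key].append(value)

C = dict(reduce_fn(key, values) for key, values in groups.items())
print(dict(groups))  # long pairs, e.g. (1, 1): [5, 14]
print(C)             # {(1, 1): 19, (1, 2): 22, (2, 1): 43, (2, 2): 50}
```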
Three parallel computing models • Map-only Model • A simplified parallel processing mode is the map-only execution mode • This model applies to embarrassingly parallel computations • All subdivided tasks are totally independent from one another • Carried out in one stage • Applications suitable for applying a map-parallel model for parallel computing • Document conversion, e.g., PDF->HTML • Brute force searches in cryptography • Parametric sweeps and gene assembly • Polar Grid MATLAB data analysis
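A map-only job has no grouping or reduce stage, so a parallel `map` over independent inputs is enough to illustrate it. In this sketch a thread pool stands in for the cluster workers, and `simulate` is a hypothetical stand-in for one point of a parametric sweep.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(parameter: float) -> float:
    """Hypothetical independent task: evaluate one point of a parametric sweep."""
    return parameter ** 2 - 3 * parameter + 2


parameters = [0.0, 1.0, 2.0, 3.0]

# Every task is independent, so the whole job finishes in a single map stage.
with ThreadPoolExecutor(max_workers=4) as pool:
    sweep = list(pool.map(simulate, parameters))

print(sweep)  # [2.0, 0.0, 0.0, 2.0]
```

`pool.map` returns results in input order, so each output can be matched to its parameter without any keys.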
Three parallel computing models (cont.) • Classic MapReduce Model • For parallel execution of tasks that can be described by two-stage bipartite graphs • Computational tasks suitable to this model • High energy physics (HEP) histograms • Distributed search and distributed sort • Information retrieval, data mining and clustering • Calculation of pairwise distances for sequences (BLAST) • Expectation maximization algorithms • Linear algebra and k-means • Deterministic annealing clustering • Multidimensional scaling (MDS)
Three parallel computing models (cont.) • Iterative MapReduce Model • Applies the classic MapReduce, iteratively • In many passes through the engine • The best example of iterative MapReduce is the Twister software tool • Developed at Indiana University • This model has been commercialized by Microsoft • Indiana University has applied Twister in bioinformatics applications • The potentials of Twister • Expectation maximization algorithms • Linear algebra and k-means • Data mining and clustering
Three parallel computing models (cont.) • Deterministic annealing clustering • Multidimensional scaling (MDS)
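Iterative MapReduce can be illustrated with k-means, where each pass is one map stage (assign points to the nearest centroid) followed by one reduce stage (average each cluster). This is a plain-Python sketch on invented one-dimensional data; it does not use the Twister API, and all function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_fn(point: float, centroids: List[float]) -> Tuple[int, float]:
    """Emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda c: abs(point - centroids[c]))
    return nearest, point


def reduce_fn(cluster: int, members: List[float]) -> Tuple[int, float]:
    """Emit (cluster index, mean of the points assigned to it)."""
    return cluster, sum(members) / len(members)


def kmeans_iterative(points: List[float], centroids: List[float],
                     max_passes: int = 20) -> List[float]:
    """Repeat the map and reduce stages until the centroids stop changing."""
    centroids = list(centroids)
    for _ in range(max_passes):
        groups: Dict[int, List[float]] = defaultdict(list)
        for point in points:  # map stage
            cluster, value = map_fn(point, centroids)
            groups[cluster].append(value)
        updated = list(centroids)
        for cluster, members in groups.items():  # reduce stage
            updated[cluster] = reduce_fn(cluster, members)[1]
        if updated == centroids:  # converged: no further pass needed
            break
        centroids = updated
    return centroids


final_centroids = kmeans_iterative([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
print(final_centroids)  # [2.0, 11.0]
```

The centroids produced by one pass are the input to the next, which is the data dependency that the classic single-pass model does not express.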
Three parallel computing models (cont.) • Three software packages that are based on using MapReduce • Consider the differences in four technical aspects • Execution mode, data handling, job scheduling, and high-level language (HLL) support • Google MapReduce was written in C++ • Used primarily in batch processing based on the bipartite MapReduce graph • Emphasizes data locality • Supported by the HLL Sawzall, GFS, and BigTable in the Google cloud • Apache Hadoop supports not only batch mode but also real-time applications