Explore the journey of MapReduce, from Google's inception to Apache's Hadoop and Spark, enabling big data processing and overcoming limitations. Dive into the map and reduce functions, logical data flow, and user overrides in this software framework.
Lecture 4: MapReduce Software Frameworks and CUDA GPU Architectures
From MapReduce to Hadoop and Spark • MapReduce is a software framework • Designed for bipartite graph computing • Built with a master-worker model • Supports parallel and distributed computing on large data sets • Abstracts the data flow of running a parallel program on a distributed computing system • By providing users with two interfaces in the form of two functions, i.e., Map and Reduce • Users can override these two functions to interact with and manipulate the data flow of running the programs
From MapReduce to Hadoop and Spark (cont.) • MapReduce applies dynamic execution, fault tolerance, and easy-to-use APIs • Performs Map and Reduce functions in a pipelined fashion • MapReduce software framework was first proposed and implemented by Google • Google's MapReduce library is written in C++ • Evolved from use in a search engine to Google App Engine cloud • Initially, Google’s MapReduce was applied only in fast search engines • Then MapReduce enabled cloud computing
From MapReduce to Hadoop and Spark (cont.) • Apache Hadoop has made MapReduce possible for big data processing on large server clusters or clouds • Apache Spark removes many constraints imposed by MapReduce and Hadoop programming • In general-purpose batch or streaming applications • The MapReduce framework is only for batch processing of large data sets • Deals with a static data set which will not change during execution • Streaming data or real-time data cannot be handled well in batch mode
The MapReduce Compute Engine • The MapReduce framework provides an abstraction layer for data and control flow • The logical data flow from the Map to the Reduce • The control flow is hidden from users
The MapReduce Compute Engine (cont.) • The MapReduce library is essentially the controller of the MapReduce pipeline • Coordinates the dataflow from the input end to the output end in a synchronous manner • The API tools provide an abstraction that shields the MapReduce software framework from arbitrary intervention by users • The data flow in a MapReduce framework is predefined • Data partitioning, mapping and scheduling, synchronization, communication, and output of results • Partitioning is controlled in user programs • By specifying the partitioning block size and data fetch patterns
The MapReduce Compute Engine (cont.) • The abstraction layer provides two well-defined interfaces in two functions: Map and Reduce • These mapper and reducer functions can be defined by the user to achieve specific objectives • The user overrides the Map and Reduce functions • Map and Reduce functions take a specification object, called Spec • First initialized inside the user’s program • The user writes code to fill it with the names of input and output files as well as other tuning parameters • Also filled with the names of the Map and the Reduce functions • Invokes the provided MapReduce(Spec, &Results) function from the library to start the flow of data
The MapReduce Compute Engine (cont.) • The overall structure of a user’s MapReduce program
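The structure described above (fill a Spec, override Map and Reduce, then call the library) might look roughly like the following sketch. Everything here is an assumption made for illustration: the `Spec` fields, the `map_reduce` driver, and the in-memory input lines are hypothetical stand-ins, not the actual Google or Hadoop API, and the driver runs in a single process.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class Spec:
    """Hypothetical specification object filled in by the user program."""
    input_lines: List[str]  # stands in for the names of the input files
    map_fn: Callable[[int, str], Iterable[Tuple[str, int]]]
    reduce_fn: Callable[[str, List[int]], int]
    num_reduce_tasks: int = 1  # example tuning parameter, unused by this toy driver


def map_reduce(spec: Spec, results: Dict[str, int]) -> None:
    """Toy single-process stand-in for the library call MapReduce(Spec, &Results)."""
    groups = defaultdict(list)
    # Map phase: the key is the line index (a simplification of the line offset).
    for offset, line in enumerate(spec.input_lines):
        for key, value in spec.map_fn(offset, line):
            groups[key].append(value)
    # Reduce phase: one call per unique intermediate key, in sorted key order.
    for key in sorted(groups):
        results[key] = spec.reduce_fn(key, groups[key])


# User overrides of the two interfaces.
def my_map(offset: int, line: str):
    for word in line.split():
        yield word, 1


def my_reduce(key: str, values: List[int]) -> int:
    return sum(values)


def main() -> Dict[str, int]:
    spec = Spec(input_lines=["a b a"], map_fn=my_map, reduce_fn=my_reduce)
    results: Dict[str, int] = {}
    map_reduce(spec, results)  # starts the flow of data
    return results
```

The user code only touches `my_map`, `my_reduce`, and the `Spec`; the control flow stays inside the driver, which mirrors the abstraction the slides describe.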
Logical Dataflow • The input data to both the Map and the Reduce function have a particular structure • The same argument goes for the output data too • The input data to the Map function is arranged in the form of a (key, value) pair • The value is the actual data • The key part is only used to control the data flow • e.g., The key is the line offset within the input file and the value is the content of the line • The output data from the Map function is structured as (key, value) pairs • Called intermediate (key, value) pairs
Logical Dataflow (cont.) • The Map function processes each input (key, value) pair • To produce a few intermediate (key, value) pairs • The aim is to process all input (key, value) pairs to the Map function in parallel • e.g., The map function emits each word w plus an associated count of occurrences • Just a 1 is recorded in this pseudo-code • The Reduce function receives the intermediate (key, value) pairs
Logical Dataflow (cont.) • In the form of a group of intermediate values (key, [set of values]) associated with one intermediate key • MapReduce framework forms these groups by first sorting the intermediate (key, value) pairs • Then grouping values with the same key • Sorting the data is done to simplify the grouping process • The Reduce function processes each (key, [set of values]) group • Produces a set of (key, value) pairs as output • e.g., The reduce function merges the word counts produced by different map workers • Into a total count as output
Logical Dataflow (cont.) • Word Count Using MapReduce over Partitioned Data Set • One of the well-known MapReduce problems • The word count problem for a simple input file containing only two lines • most people ignore most poetry • most poetry ignores most people • The Map function simultaneously produces a number of intermediate (key, value) pairs for each line content • Each word is the intermediate key with 1 as its intermediate value, e.g., (ignore, 1) • The MapReduce library collects all the generated intermediate (key, value) pairs
Logical Dataflow (cont.) • Sorts them to group the 1s for identical words, e.g., (people, [1,1]) • Groups are then sent to the Reduce function in parallel • It can sum up the 1 values for each word • Generate the actual number of occurrences for each word in the file, e.g., (people, 2)
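The two-line word count example above can be traced with a minimal single-process sketch. The function names (`map_fn`, `shuffle`, `reduce_fn`) are chosen here for illustration, and using the line index as the input key is a simplification of the line offset.

```python
from collections import defaultdict


def map_fn(offset, line):
    """Emit one intermediate (word, 1) pair per word in the line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Sort the intermediate pairs by key, then group the values per key."""
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return dict(groups)


def reduce_fn(key, values):
    """Sum the 1s recorded for one word."""
    return key, sum(values)


lines = [
    "most people ignore most poetry",
    "most poetry ignores most people",
]

# Map phase: one call per input line (these calls are independent of each other).
intermediate = [pair for offset, line in enumerate(lines)
                for pair in map_fn(offset, line)]

# Sort and group phase, e.g. ("people", [1, 1]).
grouped = shuffle(intermediate)

# Reduce phase: one call per unique word, e.g. ("people", 2).
counts = dict(reduce_fn(key, values) for key, values in grouped.items())
```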
Logical Dataflow (cont.) • Hadoop Implementation of a MapReduce WebVisCounter Program • WebVisCounter counts the number of times that users connect to or visit a given website using a particular operating system • The input data is a typical web server log file
Logical Dataflow (cont.) • Data flow in WebVisCounter program execution • The Map function parses each line to extract the type of the used OS as a key and assigns a value 1 to it • The Reduce function in turn sums up the number of 1s for each unique key
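A rough sketch of the WebVisCounter data flow follows. The slides do not show the log format, so the sample log lines, the `OS_TOKENS` list, and the substring matching are all assumptions made for illustration; a real program would parse the server's actual log layout, and the Hadoop version would be written against Hadoop's own Java interfaces rather than plain Python.

```python
from collections import defaultdict

# Assumed OS names to look for in each log line (hypothetical list).
OS_TOKENS = ["Windows", "Mac OS X", "Linux"]

# Hypothetical web server log lines ending with a user-agent string.
LOG_LINES = [
    '10.0.0.1 - - [01/Jan/2010:00:00:01] "GET / HTTP/1.1" 200 512 "Mozilla/5.0 (Windows NT 6.1)"',
    '10.0.0.2 - - [01/Jan/2010:00:00:02] "GET / HTTP/1.1" 200 512 "Mozilla/5.0 (X11; Linux x86_64)"',
    '10.0.0.3 - - [01/Jan/2010:00:00:03] "GET / HTTP/1.1" 200 512 "Mozilla/5.0 (Windows NT 5.1)"',
    '10.0.0.4 - - [01/Jan/2010:00:00:04] "GET / HTTP/1.1" 200 512 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6)"',
]


def map_fn(line):
    """Extract the OS type as the key and assign the value 1 to it."""
    for token in OS_TOKENS:
        if token in line:
            return [(token, 1)]
    return [("Other", 1)]


def reduce_fn(key, values):
    """Sum the 1s for one unique OS key."""
    return key, sum(values)


def web_vis_counter(lines):
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


visits_by_os = web_vis_counter(LOG_LINES)
```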
Logical Dataflow (cont.) • Each Map server applies the map function to each input data split • Many mapper functions run concurrently on hundreds or thousands of machine instances • Many intermediate key-value pairs are generated • Stored in local disks for subsequent use • The original MapReduce is slow on large clusters • Due to disk-based handling of intermediate results • The Reduce server collates the values using the reduce function • The reducer function can be max., min., average, dot product of two vectors, etc
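The reducer choices listed above (max, min, average, dot product) differ only in how they collapse the grouped values for one key. A minimal sketch, with made-up keys and values and hypothetical function names:

```python
def reduce_max(key, values):
    return key, max(values)


def reduce_min(key, values):
    return key, min(values)


def reduce_avg(key, values):
    return key, sum(values) / len(values)


def reduce_dot(key, pairs):
    """Dot product: each grouped value is an (x_k, y_k) pair of vector elements."""
    return key, sum(x * y for x, y in pairs)


# One (key, [set of values]) group as delivered to a reducer.
group = ("sensor-1", [3, 9, 6])

largest = reduce_max(*group)
smallest = reduce_min(*group)
average = reduce_avg(*group)
dot = reduce_dot("u.v", [(1, 4), (2, 5), (3, 6)])
```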
Formal MapReduce Model • The Map function is applied in parallel to every input (key, value) pair • Produces a new set of intermediate (key, value) pairs • MapReduce library collects all the produced intermediate pairs from all input pairs • Sorts them based on the key part • Groups the values of all occurrences of the same key • The Reduce function is applied in parallel to each group • To produce the collection of values as output
Formal MapReduce Model (cont.) • After grouping all the intermediate data • The values of all occurrences of the same key are sorted and grouped together • Each key becomes unique in all intermediate data • Finding unique keys is the starting point to solving a typical MapReduce problem • The intermediate (key, value) pairs as the output of map function will be automatically produced • Examples of how to define keys and values • Count the number of occurrences of each word in a collection of documents in the above example
Formal MapReduce Model (cont.) • Count the number of occurrences of anagrams in a collection of documents • Anagrams are words that are formed by rearranging the letters of another word • e.g., listen can be reworked into the word silent • The unique keys are an alphabetically sorted sequence of letters for each word, e.g., eilnst • The intermediate value is the number of occurrences • The main responsibility of the MapReduce framework • To efficiently run a user’s program on a distributed computing system • Carefully handles all partitioning, mapping, synchronization, communication, and scheduling details of such data flows
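For the anagram example above, the only problem-specific decision is the key: the alphabetically sorted letters of each word. A minimal sketch, with hypothetical function names and made-up documents:

```python
from collections import defaultdict


def anagram_key(word):
    """Alphabetically sorted letters of the word, e.g. listen -> eilnst."""
    return "".join(sorted(word.lower()))


def map_fn(document):
    """Emit (sorted-letters key, 1) for every word in the document."""
    return [(anagram_key(word), 1) for word in document.split()]


def count_anagrams(documents):
    groups = defaultdict(list)
    for document in documents:
        for key, value in map_fn(document):
            groups[key].append(value)
    # Reduce: total number of occurrences per anagram key.
    return {key: sum(values) for key, values in groups.items()}


documents = ["listen silent", "enlist the tinsel"]
anagram_counts = count_anagrams(documents)
```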
Formal MapReduce Model (cont.) • Intermediate (key, value) pairs produced are partitioned into R regions • R is equal to number of reduce tasks • Guarantees that (key, value) pairs with identical keys are stored in the same region • Reduce workers may face network congestion • Caused by reduction or merging operation performed
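One common way to obtain the guarantee above is to pick the region as a hash of the key modulo R. The sketch below assumes that scheme; `zlib.crc32` is used only because it is deterministic across runs, and the function names are hypothetical.

```python
import zlib


def partition(key, num_regions):
    """Map a key to one of R regions; identical keys always get the same region."""
    return zlib.crc32(key.encode("utf-8")) % num_regions


def partition_pairs(pairs, num_regions):
    """Distribute intermediate (key, value) pairs over R regions."""
    regions = [[] for _ in range(num_regions)]
    for key, value in pairs:
        regions[partition(key, num_regions)].append((key, value))
    return regions


R = 3  # equal to the number of reduce tasks
pairs = [("most", 1), ("people", 1), ("most", 1), ("poetry", 1), ("people", 1)]
regions = partition_pairs(pairs, R)

# For every key, collect the set of regions in which it appears.
regions_per_key = {}
for index, region in enumerate(regions):
    for key, _ in region:
        regions_per_key.setdefault(key, set()).add(index)
```

Because each key lands in exactly one region, a single reduce task sees all values for that key.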
Compute-Data Locality • The MapReduce implementation takes advantage of Google File System (GFS) as the underlying layer • MapReduce can perfectly adapt itself to GFS • GFS is a distributed file system • Files are divided into fixed-size blocks (chunks) • Blocks are distributed and stored on cluster nodes • MapReduce library splits the input data (files) into fixed-size blocks • Ideally performs the Map function in parallel on each block
Compute-Data Locality (cont.) • GFS has already stored files as a set of blocks • MapReduce just needs to send a copy of the user’s program containing the Map function to the nodes already stored as data blocks • The notion of sending computation toward data rather than sending data toward computation
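The idea of sending computation toward data can be illustrated with a toy scheduler that prefers an idle node already holding the block. The block placement, node names, and `schedule_map_tasks` function are hypothetical; a real scheduler also tracks load, replicas, and rack topology.

```python
def schedule_map_tasks(block_locations, idle_nodes):
    """Assign each block's map task to an idle node that stores the block if possible.

    block_locations: dict of block id -> list of nodes holding that block
    idle_nodes: set of nodes currently able to run a map task
    Toy simplification: a node may be handed several tasks.
    """
    fallback = sorted(idle_nodes)[0]  # used only when no holder is idle
    assignment = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if node in idle_nodes]
        if local:
            assignment[block] = (local[0], "local")    # ship the program, not the data
        else:
            assignment[block] = (fallback, "remote")   # the data must cross the network
    return assignment


block_locations = {"b0": ["n1", "n2"], "b1": ["n3"], "b2": ["n4"]}
idle_nodes = {"n2", "n3"}
assignment = schedule_map_tasks(block_locations, idle_nodes)
```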
MapReduce for Parallel Matrix Multiplication • In multiplying two $n \times n$ matrices $A = (a_{ij})$ and $B = (b_{ij})$ • Need to perform $n^2$ dot product operations to produce an output matrix $C = (c_{ij})$ • Each dot product produces an output element $c_{ij} = a_{i1} b_{1j} + a_{i2} b_{2j} + \cdots + a_{in} b_{nj}$ • Corresponding to the i-th row vector in matrix A multiplied by the j-th column vector in matrix B • Mathematically, each dot product takes n multiply-and-add time units to complete • The total matrix multiply complexity equals $n \times n^2 = n^3$ since there are $n^2$ output elements • In theory, the $n^2$ dot products are totally independent of each other
MapReduce for Parallel Matrix Multiplication (cont.) • Can be done on $n^2$ servers in n time units • When n is very large, say millions or higher • Too expensive to build a cluster with $n^2$ servers • In practice, only $N \ll n^2$ servers are used • The ideal speedup is expected to be N • MapReduce Multiplication of Two Matrices • Apply the MapReduce method to multiply two 2×2 matrices: $A = (a_{ij})$ and $B = (b_{ij})$ • With two mappers and one reducer
MapReduce for Parallel Matrix Multiplication (cont.) • Map the first and second rows of matrix A, each together with the entire matrix B, to the first and second Map servers, respectively • Four keys are used to identify four blocks of data processed • K11, K12, K21, and K22 • Simply denoted by the matrix element indices • Partition matrix A and matrix $B^T$ by rows into two blocks, horizontally • $B^T$ is the transposed matrix of B • Data blocks are read into the two mappers • All intermediate computing results are identified by their <key, value> pairs
MapReduce for Parallel Matrix Multiplication (cont.) • The generation, sorting, and grouping of four <key, value> pairs by each mapper in two stages • Each short pair <key, value> holds a single partial-product value identified by its key • The long pair holds two partial products identified by each block key • The Reducer is used to sum up the output matrix elements using four long <key, value(s)> pairs • Consider six mappers and two reducers • Each mapper handles n/6 adjacent rows of the input matrix • Each reducer generates n/2 of the output matrix C
MapReduce for Parallel Matrix Multiplication (cont.) • When the matrix order becomes very large • The time to multiply very large matrices becomes cost prohibitive • A dataflow graph for the above example
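The 2×2 example above (two mappers, one reducer, keys K11 to K22) can be sketched as follows. This is an illustrative single-process version under assumptions: each loop iteration stands in for one mapper, keys are written as 1-based `(i, j)` tuples rather than the labels K11 and so on, and the function names are hypothetical.

```python
from collections import defaultdict


def map_row(i, row, B):
    """Mapper for row i of A: emit short pairs <(i, j), a_ik * b_kj>."""
    n = len(row)
    return [((i, j + 1), row[k] * B[k][j]) for j in range(n) for k in range(n)]


def reduce_sum(key, values):
    """Reducer: sum the partial products of one output element c_ij."""
    return key, sum(values)


def mapreduce_matmul(A, B):
    n = len(A)
    # Map phase: mapper i receives row i of A and the entire matrix B.
    short_pairs = []
    for i in range(n):
        short_pairs.extend(map_row(i + 1, A[i], B))
    # Sort and group: build long pairs <key, [partial products]>.
    long_pairs = defaultdict(list)
    for key, value in sorted(short_pairs):
        long_pairs[key].append(value)
    # Reduce phase: one sum per key yields one element of C.
    C = [[0] * n for _ in range(n)]
    for key, values in long_pairs.items():
        (i, j), total = reduce_sum(key, values)
        C[i - 1][j - 1] = total
    return C, dict(long_pairs)


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, long_pairs = mapreduce_matmul(A, B)
```

Each of the four long pairs holds two partial products, so the map phase performs the n × n² = 8 multiplications and the reducer performs the additions.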
GPU Computing to Exascale and Beyond • Multicore CPUs may increase from the tens of cores to hundreds or more in the future • CPU has reached its limit in terms of exploiting massive parallelism due to limited memory speed • Triggered the development of many-core GPUs with hundreds or more thin cores • x86 processors have been extended to serve HPC systems in some high-end server processors • Many RISC processors have been replaced with multicore x86 processors and many-core GPUs • This trend indicates that x86 upgrades will dominate in data centers and supercomputers • The GPU also has been applied in large clusters to build supercomputers in massively parallel processors (MPPs)
GPU Computing to Exascale and Beyond (cont.) • In the future, the processor industry will develop asymmetric/heterogeneous chip multiprocessors • With both fat CPU cores and thin GPU cores on chip • Internal to each node of the cloud • Multithreading is practiced with a large number of cores in many-core GPU clusters • Four challenges for exascale computing • Energy and power & Memory and storage • Needs to optimize the storage hierarchy and tailor the memory to the applications • Concurrency and locality & System resiliency • Needs to promote self-aware OS and runtime support and build locality-aware compilers and auto-tuners • Self-aware OS/RT systems have the ability to adapt to the current situation and react to runtime events
GPU Computing to Exascale and Beyond (cont.) • A graphics processing unit (GPU) is a graphics coprocessor or accelerator • Mounted on a computer’s graphics/video card • Offloads the CPU from tedious graphics tasks in video editing applications • The world’s first GPU, the GeForce 256, was marketed by NVIDIA in 1999 • Modern GPU chips can process a minimum of 10 million polygons per second • Used in nearly every computer on the market today • Some features were also integrated into certain CPUs • Traditional CPUs are structured with only a few cores
GPU Computing to Exascale and Beyond (cont.) • e.g., The Xeon X5670 CPU has six cores • Modern GPU chips can be built with hundreds of cores • GPUs have a throughput architecture that exploits massive parallelism by executing many concurrent threads slowly • Instead of executing a single long thread in a conventional microprocessor very quickly • Parallel GPUs or GPU clusters have been adopted • As an alternative to CPUs with limited parallelism • General-purpose computing on GPUs, known as GPGPU, has appeared in the HPC field • NVIDIA’s CUDA (Compute Unified Device Architecture) model is for HPC using GPGPUs
How GPUs Work • Early GPUs functioned as coprocessors attached to the CPU • Today, the NVIDIA GPU has been upgraded to 128 cores on a single chip • Each core can handle eight threads of instructions • Translates to having up to 1,024 threads executed concurrently on a single GPU • True massive parallelism, compared to only a few threads that can be handled by a conventional CPU • A step toward exascale computing (Eflops, or $10^{18}$ flops) • The GPU is optimized to deliver much higher throughput with explicit management of on-chip memory • The CPU, by contrast, is optimized for low latency through caches
How GPUs Work (cont.) • Modern GPUs are not restricted to accelerated graphics or video coding • Also used in HPC systems • To power supercomputers with massive parallelism at multicore and multithreading levels • Designed to handle large numbers of floating-point operations in parallel • In a way, the GPU offloads the CPU from all data-intensive calculations • Not just those related to video processing widely used in mobile phones, game consoles, PCs, servers, etc. • e.g., The NVIDIA CUDA Tesla or Fermi is used in GPU clusters or in HPC systems for parallel processing of massive floating-point data
How GPUs Work (cont.) • The interaction between a CPU and GPU • In performing parallel execution of floating-point operations concurrently • The CPU is the conventional multicore processor with limited parallelism to exploit • The GPU has a many-core architecture • Hundreds of simple processing cores organized as multiprocessors • Each core can have one or more threads • The CPU’s floating-point kernel computation role is largely offloaded to the many-core GPU • The CPU instructs the GPU to perform massive data processing
How GPUs Work (cont.) • The bandwidth must be matched between the on-board main memory and the on-chip GPU memory • This process is carried out in NVIDIA’s CUDA programming using the GeForce 8800 or Tesla and Fermi GPUs • The NVIDIA Fermi GPU Chip with 512 CUDA Cores
How GPUs Work (cont.) • In 2010, three of the five fastest supercomputers in the world used large numbers of GPU chips to accelerate floating-point computations • i.e., The Tianhe-1a, Nebulae, and Tsubame • The architecture of the Fermi GPU • A next-generation GPU from NVIDIA • A streaming multiprocessor (SM) module • Multiple SMs can be built on a single GPU chip • The Fermi chip has 16 SMs implemented with 3 billion transistors • Together, the 16 SMs comprise up to 512 streaming processors (SPs), known as CUDA cores • The Tesla GPUs used in the Tianhe-1a have a similar architecture, with 448 CUDA cores
How GPUs Work (cont.) • The Fermi GPU is a newer generation of GPU • Can be used in desktop workstations to accelerate floating-point calculations • Or for building large-scale data centers • There are 32 CUDA cores per SM • Each CUDA core has a simple pipelined integer ALU and an FPU that can be used in parallel • Each SM has 16 load/store units • Allowing source and destination addresses to be calculated for 16 threads per clock • There are four special function units (SFUs) for executing transcendental instructions • These instructions perform trigonometric and logarithmic operations on floating-point operands
How GPUs Work (cont.) • All functional units and CUDA cores are interconnected by an NoC (network on chip) to a large number of SRAM banks (L2 caches) • Each SM has a 64 KB L1 cache • The 768 KB unified L2 cache is shared by all SMs to serve all load and store operations • Memory controllers are used to connect to 6 GB of off-chip DRAMs • The SM schedules threads in groups of 32 parallel threads called warps • In total, 256/512 FMA (fused multiply and add) operations can be done in parallel to produce 32/64-bit floating-point results • The 512 CUDA cores on the chip can work in parallel to deliver up to 515 Gflops of double-precision results
How GPUs Work (cont.) • With 16 SMs of 32 CUDA cores each, a single Fermi GPU has a peak double-precision speed of about 515 Gflops • About 2,000 such GPUs would be needed to reach Pflops performance • In the future, thousand-core GPUs may appear in Exascale systems • Reflects a trend toward building future MPPs with hybrid architectures of both types of chips • The progress of GPUs along with CPU advances in power efficiency, performance, and programmability • All systems using the hybrid CPU/GPU architecture consume much less power
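The Fermi organization described in this section (16 SMs, 32 CUDA cores per SM, threads scheduled in warps of 32) can be checked with some quick arithmetic. The constants come from the slides; the helper names and the sample thread counts are made up for illustration.

```python
import math

SMS_PER_CHIP = 16   # streaming multiprocessors on the Fermi chip
CORES_PER_SM = 32   # CUDA cores per SM
WARP_SIZE = 32      # threads scheduled together by an SM


def total_cuda_cores():
    """CUDA cores on the whole chip."""
    return SMS_PER_CHIP * CORES_PER_SM


def warps_needed(num_threads):
    """Number of warps an SM must schedule to cover num_threads threads."""
    return math.ceil(num_threads / WARP_SIZE)


cores = total_cuda_cores()
warps_for_256 = warps_needed(256)
warps_for_1000 = warps_needed(1000)  # the last warp is only partly filled
```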
GPU Clusters for Massive Parallelism • Commodity GPUs have become high-performance accelerators for data-parallel computing • Modern GPU chips contain hundreds of processor cores per chip • Each GPU chip is capable of achieving up to 1 Tflops for single-precision (SP) arithmetic • And more than 80 Gflops for double-precision (DP) calculations • Recent HPC-optimized GPUs contain up to 4 GB of on-board memory • Capable of sustaining memory bandwidths exceeding 100 GB/second
GPU Clusters for Massive Parallelism (cont.) • GPU clusters are built with a large number of GPU chips • With the capability to achieve Pflops performance in some of the Top 500 systems • Most GPU clusters are structured with homogeneous GPUs of the same hardware class, make, and model • The software used in a GPU cluster includes the OS, GPU drivers, and a clustering API such as MPI • The high performance of a GPU cluster is attributed mainly to • The massively parallel multicore architecture, and high throughput in multithreaded floating-point arithmetic • Significantly reduced time in massive data movement using large on-chip cache memory
GPU Clusters for Massive Parallelism (cont.) • GPU clusters already are more cost-effective than traditional CPU clusters • Result in a quantum jump in speed performance • Highly reduced space, power, and cooling demands • Can operate with a reduced number of OS images, compared with CPU-based clusters • These reductions in power, environment, and management complexity • Make GPU clusters very attractive for use in future HPC applications • A GPU cluster is often built as a heterogeneous system • Consisting of three major components
GPU Clusters for Massive Parallelism (cont.) • The CPU host nodes, the GPU nodes and the cluster interconnect between them • The GPU nodes are formed with general-purpose GPUs (GPGPUs) to carry out numerical calculations • The host node controls program execution • The cluster interconnect handles inter-node communications • To guarantee the performance • Multiple GPUs must be fully supplied with data streams over high-bandwidth network and memory • Host memory should be optimized to match with the on-chip cache bandwidths on the GPUs
Echelon GPU Cluster Architecture • The architecture of a GPU accelerator chip suggested for Exascale computing • For use in building an NVIDIA GPU cluster • An Echelon GPU chip incorporates 1,024 stream cores and 8 latency-optimized CPU-like cores • Eight stream cores form a stream multiprocessor (SM) • There are 128 SMs in the Echelon GPU chip • Each SM is designed with 8 processor cores to yield a 160 Gflops peak speed • In total, the chip has a peak speed of 20.48 Tflops • These nodes are interconnected by a NoC to 1,024 SRAM banks (L2 caches) • Each cache bank has a 256 KB capacity
Echelon GPU Cluster Architecture (cont.) • The MCs (memory controllers) are used to connect to off-chip DRAMs • The NI (network interface) is to scale the size of the GPU cluster hierarchically • The architecture of NVIDIA Echelon GPU system
Echelon GPU Cluster Architecture (cont.) • The entire system is built with N cabinets • Labeled C0, C1, ... , CN • Each cabinet is built with 16 compute modules • Labeled as M0, M1, ... , M15 • Each compute module is built with 8 GPU nodes • Labeled as N0, N1, ... , N7 • Each GPU node is the innermost block labeled as PC • A single cabinet can house 128 GPU nodes • Each compute module features a performance of 160 Tflops and 12.8 TB/s over 2 TB of memory • Each cabinet has the potential to deliver 2.6 Pflops over 32 TB memory and 205 TB/s bandwidth • The N cabinets are interconnected by a Dragonfly network with optical fiber
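The Echelon figures quoted in this section follow from multiplying up the hierarchy (SM, chip, compute module, cabinet). The short sketch below redoes that arithmetic using only numbers from the slides; the variable names are chosen here for illustration, and the computed module and cabinet peaks round to the quoted 160 Tflops and 2.6 Pflops.

```python
import math

# Figures taken from the slides.
SM_PEAK_GFLOPS = 160
CORES_PER_SM = 8
SMS_PER_CHIP = 128
NODES_PER_MODULE = 8
MODULES_PER_CABINET = 16
MODULE_MEMORY_TB = 2
MODULE_BANDWIDTH_TBPS = 12.8

# Chip level.
stream_cores_per_chip = SMS_PER_CHIP * CORES_PER_SM            # 1,024 stream cores
chip_peak_tflops = SMS_PER_CHIP * SM_PEAK_GFLOPS / 1000        # 20.48 Tflops

# Compute module level (8 GPU nodes).
module_peak_tflops = NODES_PER_MODULE * chip_peak_tflops       # about 160 Tflops

# Cabinet level (16 compute modules).
nodes_per_cabinet = MODULES_PER_CABINET * NODES_PER_MODULE     # 128 GPU nodes
cabinet_peak_pflops = MODULES_PER_CABINET * module_peak_tflops / 1000   # about 2.6 Pflops
cabinet_memory_tb = MODULES_PER_CABINET * MODULE_MEMORY_TB              # 32 TB
cabinet_bandwidth_tbps = MODULES_PER_CABINET * MODULE_BANDWIDTH_TBPS    # about 205 TB/s
```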