PARALLEL PROCESSOR ORGANIZATIONS Jehan-François Pâris jfparis@uh.edu
Chapter Organization • Overview • Writing parallel programs • Multiprocessor Organizations • Hardware multithreading • Alphabet soup (SISD, SIMD, MIMD, …) • Roofline performance model
The hardware side • Many parallel processing solutions • Multiprocessor architectures • Two or more microprocessor chips • Multicore architectures • Several processors on a single chip
The software side • Two ways for software to exploit parallel processing capabilities of hardware • Job-level parallelism • Several sequential processes run in parallel • Easy to implement (OS does the job!) • Process-level parallelism • A single program runs on several processors at the same time
Overview • Some problems are embarrassingly parallel • Many computer graphics tasks • Brute force searches in cryptography or password guessing • Much more difficult for other applications • Communication overhead among sub-tasks • Amdahl's law • Balancing the load
Amdahl's Law • Assume a sequential process takes • tp seconds to perform operations that could be performed in parallel • ts seconds to perform purely sequential operations • The maximum speedup will be (tp + ts)/ts
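A minimal C sketch of this bound (the function name amdahl_bound and the 95 s / 5 s split are illustrative assumptions, not figures from the slides):

#include <stdio.h>

/* Amdahl's law: with tp seconds of parallelizable work and ts seconds
   of purely sequential work, no number of processors can yield a
   speedup greater than (tp + ts) / ts. */
double amdahl_bound(double tp, double ts)
{
    return (tp + ts) / ts;
}

int main(void)
{
    /* Example: 95 s parallelizable, 5 s sequential -> bound of 20 */
    printf("max speedup = %.1f\n", amdahl_bound(95.0, 5.0));
    return 0;
}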
Balancing the load • Must ensure that workload is equally divided among all the processors • Worst case is when one of the processors does much more work than all others
Example (I) • Computation partitioned among n processors • One of them does 1/m of the work with m < n • That processor becomes a bottleneck • Maximum expected speedup: n • Actual maximum speedup: m
Example (II) • Computation partitioned among 64 processors • One of them does 1/8 of the work • Maximum expected speedup: 64 • Actual maximum speedup: 8
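The same reasoning can be captured in a one-line C helper (the name load_balance_bound is hypothetical): since the busiest processor alone needs worst_share times the sequential running time, the speedup can never exceed 1 / worst_share.

/* Upper bound on speedup when the busiest processor handles a
   fraction worst_share of the total work. */
double load_balance_bound(double worst_share)
{
    return 1.0 / worst_share;   /* e.g., 1/(1.0/8.0) = 8 for the example above */
}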
A last issue • Humans like to address issues one after the other • We have meeting agendas • We do not like to be interrupted • We write sequential programs
Rene Descartes • Seventeenth-century French philosopher • Invented • Cartesian coordinates • Methodical doubt • [To] never accept anything for true which I did not clearly know to be such • Proposed a scientific method based on four precepts
Method's third rule • The third, to conduct my thoughts in such order that, by commencing with objects the simplest and easiest to know, I might ascend by little and little, and, as it were, step by step, to the knowledge of the more complex; assigning in thought a certain order even to those objects which in their own nature do not stand in a relation of antecedence and sequence.
Shared memory multiprocessors • [Diagram: several processing units (PU), each with its own cache, connected by an interconnection network to a shared RAM and I/O]
Shared memory multiprocessor • Can offer • Uniform memory access to all processors (UMA) • Easiest to program • Non-uniform memory access to all processors (NUMA) • Can scale up to larger sizes • Offers faster access to nearby memory
Computer clusters • [Diagram: several nodes, each with its own PU, cache, and RAM, connected by an interconnection network]
Computer clusters • Very easy to assemble • Can take advantage of high-speed LANs • Gigabit Ethernet, Myrinet, … • Data exchanges must be done through message passing
Message passing (I) • If processor P wants to access data in the main memory of processor Q it must • Send a request to Q • Wait for a reply • For this to work, processor Q must have a thread • Waiting for message from other processors • Sending them replies
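A minimal sketch of this request/reply exchange using MPI (an assumption, since the slides do not name a specific library); process 0 plays the role of P, process 1 plays the role of Q, and the tag values are arbitrary:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                    /* processor P: needs Q's data */
        int request = 42;               /* identifies the data item wanted */
        MPI_Send(&request, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&value, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("P received %d\n", value);
    } else if (rank == 1) {             /* processor Q: owns the data */
        int request;
        MPI_Recv(&request, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        value = 2 * request;            /* stand-in for a local memory lookup */
        MPI_Send(&value, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}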
Message passing (II) • In a shared memory architecture, each processor can directly access all data • A proposed solution • Distributed shared memory offers the users of a cluster the illusion of a single address space for their shared data • Still has performance issues
When things do not add up • Memory capacity is very important for big computing applications • If the data can fit into main memory, the computation will run much faster
A problem • A company replaced • A single shared-memory computer with 32 GB of RAM • By four “clustered” computers with 8 GB each • The result: more I/O activity than ever • What happened?
The explanation • Assume the OS occupies one GB of RAM • The old shared-memory computer still had 31 GB of free RAM • Each of the clustered computers has only 7 GB of free RAM • The total RAM available to the program went down from 31 GB to 4 × 7 = 28 GB!
Grid computing • The computers are distributed over a very large network • Sometimes computer time is donated • Volunteer computing • SETI@home • Works well with embarrassingly parallel workloads • Searches in an n-dimensional space
General idea • Let the processor switch to another thread of computation while the current one is stalled • Motivation: • Increased cost of cache misses
Implementation • Entirely controlled by the hardware • Unlike multiprogramming • Requires a processor capable of • Keeping track of the state of each thread • One set of registers (including the PC) for each concurrent thread • Quickly switching among concurrent threads
Approaches • Fine-grained multithreading: • Switches between threads on each instruction • Provides the highest throughput • Slows down the execution of individual threads
Approaches • Coarse-grained multithreading • Switches between threads whenever a long stall is detected • Easier to implement • Cannot eliminate all stalls
Approaches • Simultaneous multithreading: • Takes advantage of the ability of modern multiple-issue hardware to execute instructions from different threads in parallel • Best solution
Overview • Used to describe processor organizations where • The same instructions can be applied to • Multiple data instances • Encountered in • Vector processors in the past • Graphics processing units (GPU) • x86 multimedia extensions
Classification • SISD: • Single instruction, single data • Conventional uniprocessor architecture • MIMD: • Multiple instructions, multiple data • Conventional multiprocessor architecture
Classification • SIMD: • Single instruction, multiple data • Perform the same operations on a set of similar data • Think of adding two vectors for (i = 0; i < VECSIZE; i++) sum[i] = a[i] + b[i];
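To make the same loop explicit on the x86 multimedia extensions mentioned earlier, here is a minimal sketch using SSE intrinsics (an illustrative example, not from the slides; it assumes VECSIZE is a multiple of 4):

#include <xmmintrin.h>                  /* SSE intrinsics */
#define VECSIZE 1024

void vec_add(float *sum, const float *a, const float *b)
{
    /* each SSE instruction operates on four floats at once */
    for (int i = 0; i < VECSIZE; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 elements of a */
        __m128 vb = _mm_loadu_ps(&b[i]);            /* load 4 elements of b */
        _mm_storeu_ps(&sum[i], _mm_add_ps(va, vb)); /* add and store 4 sums */
    }
}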
Vector computing • A kind of SIMD architecture • Used by Cray computers • Pipelines multiple executions of a single instruction with different data (“vectors”) through the ALU • Requires • Vector registers able to store multiple values • Special vector instructions: say lv, addv, …
Benchmarking • Two factors to consider • Memory bandwidth • Depends on interconnection network • Floating-point performance • Best known benchmark is LINPACK
Roofline model • Takes into account • Memory bandwidth • Floating-point performance • Introduces arithmetic intensity • Total number of floating-point operations in a program divided by the total number of bytes transferred to main memory • Measured in FLOPs/byte
Roofline model • Attainable GFLOP/s = min(Peak Memory Bandwidth × Arithmetic Intensity, Peak Floating-Point Performance)
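A small C sketch of this formula (the function name attainable_gflops and the numbers in main are illustrative assumptions):

#include <stdio.h>

/* Roofline model: attainable performance is the lesser of what the
   memory system can deliver and what the floating-point units can do. */
double attainable_gflops(double peak_bw_gb_s,      /* GB/s */
                         double intensity_flop_b,  /* FLOPs per byte */
                         double peak_gflops)       /* GFLOP/s */
{
    double memory_bound = peak_bw_gb_s * intensity_flop_b;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void)
{
    /* 16 GB/s bandwidth, 0.5 FLOPs/byte, 32 GFLOP/s peak:
       memory bandwidth limits the computation to 8 GFLOP/s */
    printf("%.1f GFLOP/s\n", attainable_gflops(16.0, 0.5, 32.0));
    return 0;
}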
Roofline model • [Plot of attainable GFLOP/s against arithmetic intensity: a sloped region where floating-point performance is limited by memory bandwidth, then a flat ceiling at peak floating-point performance]