
PARALLEL PROCESSOR ORGANIZATIONS

PARALLEL PROCESSOR ORGANIZATIONS. Jehan-François Pâris jfparis@uh.edu. Chapter Organization. Overview Writing parallel programs Multiprocessor Organizations Hardware multithreading Alphabet soup (SISD, SIMD, MIMD, …) Roofline performance model. OVERVIEW. The hardware side.

Presentation Transcript


  1. PARALLEL PROCESSOR ORGANIZATIONS Jehan-François Pâris jfparis@uh.edu

  2. Chapter Organization • Overview • Writing parallel programs • Multiprocessor Organizations • Hardware multithreading • Alphabet soup (SISD, SIMD, MIMD, …) • Roofline performance model

  3. OVERVIEW

  4. The hardware side • Many parallel processing solutions • Multiprocessor architectures • Two or more microprocessor chips • Multiple architectures • Multicore architectures • Several processors on a single chip

  5. The software side • Two ways for software to exploit parallel processing capabilities of hardware • Job-level parallelism • Several sequential processes run in parallel • Easy to implement (OS does the job!) • Process-level parallelism • A single program runs on several processors at the same time

  6. WRITING PARALLEL PROGRAMS

  7. Overview • Some problems are embarrassingly parallel • Many computer graphics tasks • Brute force searches in cryptography or password guessing • Much more difficult for other applications • Communication overhead among sub-tasks • Amdahl's law • Balancing the load

  8. Amdahl's Law • Assume a sequential process takes • tp seconds to perform operations that could be performed in parallel • ts seconds to perform purely sequential operations • The maximum speedup will be (tp + ts) / ts
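
The bound on this slide can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and it assumes the parallel part divides perfectly among n processors with no communication overhead, so the running time on n processors is ts + tp / n.

```c
#include <assert.h>

/* Speedup on n processors, assuming the tp seconds of parallelizable
 * work split evenly and without overhead (an idealization). */
double amdahl_speedup(double tp, double ts, int n)
{
    assert(n >= 1 && tp >= 0.0 && ts > 0.0);
    return (tp + ts) / (ts + tp / n);
}

/* Limit of the speedup as n grows without bound: (tp + ts) / ts. */
double amdahl_max_speedup(double tp, double ts)
{
    assert(tp >= 0.0 && ts > 0.0);
    return (tp + ts) / ts;
}
```

For instance, with tp = 90 and ts = 10, amdahl_max_speedup(90.0, 10.0) gives 10, and amdahl_speedup(90.0, 10.0, n) stays below that value for every n.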

  9. Balancing the load • Must ensure that workload is equally divided among all the processors • Worst case is when one of the processors does much more work than all others

  10. Example (I) • Computation partitioned among n processors • One of them does 1/m of the work with m < n • That processor becomes a bottleneck • Maximum expected speedup: n • Actual maximum speedup: m

  11. Example (II) • Computation partitioned among 64 processors • One of them does 1/8 of the work • Maximum expected speedup: 64 • Actual maximum speedup: 8
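
The two examples above follow from one observation: the run finishes only when the most heavily loaded processor finishes. A minimal sketch, assuming running time is proportional to each processor's share of the work and ignoring communication (the function name and the work[] array are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* work[i] is the fraction of the total work given to processor i
 * (the fractions are assumed to sum to 1). The speedup is bounded
 * by 1 / (largest fraction). */
double imbalance_speedup(const double *work, size_t n)
{
    assert(work != NULL && n > 0);
    double max = work[0];
    for (size_t i = 1; i < n; i++) {
        if (work[i] > max)
            max = work[i];
    }
    assert(max > 0.0);
    return 1.0 / max;
}
```

With 64 processors where one does 1/8 of the work and the other 63 share the remaining 7/8 evenly, the function returns 8, matching Example (II).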

  12. A last issue • Humans like to address issues one after the other • We have meeting agendas • We do not like to be interrupted • We write sequential programs

  13. René Descartes • Seventeenth-century French philosopher • Invented • Cartesian coordinates • Methodical doubt • [N]ever to accept anything for true which I did not clearly know to be such • Proposed a scientific method based on four precepts

  14. Method's third rule • The third, to conduct my thoughts in such order that, by commencing with objects the simplest and easiest to know, I might ascend by little and little, and, as it were, step by step, to the knowledge of the more complex; assigning in thought a certain order even to those objects which in their own nature do not stand in a relation of antecedence and sequence.

  15. MULTIPROCESSOR ORGANIZATIONS

  16. Shared memory multiprocessors [Figure: several PUs, each with its own cache, connected through an interconnection network to RAM and I/O]

  17. Shared memory multiprocessor • Can offer • Uniform memory access to all processors (UMA) • Easiest to program • Non-uniform memory access to all processors (NUMA) • Can scale up to larger sizes • Offers faster access to nearby memory

  18. Computer clusters [Figure: several PUs, each with its own cache and its own RAM, connected through an interconnection network]

  19. Computer clusters • Very easy to assemble • Can take advantage of high-speed LANs • Gigabit Ethernet, Myrinet, … • Data exchanges must be done through message passing

  20. Message passing (I) • If processor P wants to access data in the main memory of processor Q it must • Send a request to Q • Wait for a reply • For this to work, processor Q must have a thread • Waiting for message from other processors • Sending them replies

  21. Message passing (II) • In a shared memory architecture, each processor can directly access all data • A proposed solution • Distributed shared memory offers to the users of a cluster the illusion of a single address space for their shared data • Still has performance issues

  22. When things do not add up • Memory capacity is very important for big computing applications • If the data can fit into main memory, the computation will run much faster • A company replaced • Single shared memory computer with 32 GB of RAM

  23. A problem • A company replaced • Single shared memory computer with 32 GB of RAM • Four “clustered” computers with 8 GB each • More I/O than ever • What happened?

  24. The explanation • Assume OS occupies one GB of RAM • The old shared-memory computer had 31 GB of free RAM • Each of the clustered computers has 7 GB of free RAM • The total RAM available to the program went down from 31 GB to 4 × 7 = 28 GB!
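
The arithmetic on this slide can be written out as a small sketch. The function name is hypothetical, and the one-GB-per-node OS footprint is the slide's own assumption:

```c
#include <assert.h>

/* Total RAM (in GB) left for the application when every node runs
 * its own copy of the OS, which occupies os_gb on each node. */
int total_free_ram_gb(int nodes, int ram_per_node_gb, int os_gb)
{
    assert(nodes >= 1 && ram_per_node_gb >= os_gb && os_gb >= 0);
    return nodes * (ram_per_node_gb - os_gb);
}
```

Here total_free_ram_gb(1, 32, 1) is 31 and total_free_ram_gb(4, 8, 1) is 28: the cluster pays the OS overhead four times instead of once.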

  25. Grid computing • The computers are distributed over a very large network • Sometimes computer time is donated • Volunteer computing • SETI@home • Works well with embarrassingly parallel workloads • Searches in an n-dimensional space

  26. HARDWARE MULTITHREADING

  27. General idea • Let the processor switch to another thread of computation while the current one is stalled • Motivation: • Increased cost of cache misses

  28. Implementation • Entirely controlled by the hardware • Unlike multiprogramming • Requires a processor capable of • Keeping track of the state of each thread • One set of registers, including the PC, for each concurrent thread • Quickly switching among concurrent threads

  29. Approaches • Fine-grained multithreading: • Switches between threads for each instruction • Provides highest throughputs • Slows down execution of individual threads

  30. Approaches • Coarse-grained multithreading • Switches between threads whenever a long stall is detected • Easier to implement • Cannot eliminate all stalls

  31. Approaches • Simultaneous multithreading: • Takes advantage of the ability of modern hardware to perform different tasks in parallel on instructions from different threads • Best solution

  32. ALPHABET SOUP

  33. Overview • Used to describe processor organizations where • Same instructions can be applied to • Multiple data instances • Encountered in • Vector processors in the past • Graphics processing units (GPU) • x86 multimedia extensions

  34. Classification • SISD: • Single instruction, single data • Conventional uniprocessor architecture • MIMD: • Multiple instructions, multiple data • Conventional multiprocessor architecture

  35. Classification • SIMD: • Single instruction, multiple data • Perform same operations on a set of similar data • Think of adding two vectors for (i = 0; i < VECSIZE; i++) sum[i] = a[i] + b[i];
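
The vector addition on this slide can be wrapped in a complete C function. This is a plain scalar sketch (the function name and the explicit length parameter n are assumptions, not part of the slides): every iteration applies the same operation to different data, which is the pattern SIMD hardware targets, although whether a compiler actually emits SIMD instructions for it depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise sum of two vectors of length n. Each iteration is
 * independent of the others: same operation, different data. */
void vector_add(const float *a, const float *b, float *sum, size_t n)
{
    assert(a != NULL && b != NULL && sum != NULL);
    for (size_t i = 0; i < n; i++)
        sum[i] = a[i] + b[i];
}
```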

  36. Vector computing • Kind of SIMD architecture • Used by Cray computers • Pipelines multiple executions of single instruction with different data (“vectors”) through the ALU • Requires • Vector registers able to store multiple values • Special vector instructions: say lv, addv, …

  37. Benchmarking • Two factors to consider • Memory bandwidth • Depends on interconnection network • Floating-point performance • Best known benchmark is LINPACK

  38. Roofline model • Takes into account • Memory bandwidth • Floating-point performance • Introduces arithmetic intensity • Total number of floating-point operations in a program divided by total number of bytes transferred to main memory • Measured in FLOPs/byte

  39. Roofline model • Attainable GFLOPs/s = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
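
The formula above translates directly into code. A minimal sketch, assuming bandwidth is given in GB/s, arithmetic intensity in FLOPs/byte, and peak performance in GFLOPs/s (the function and parameter names are hypothetical):

```c
#include <assert.h>

/* Attainable GFLOPs/s under the roofline model: the smaller of the
 * memory-bound rate and the peak floating-point rate. */
double attainable_gflops(double peak_bw_gb_per_s,
                         double intensity_flops_per_byte,
                         double peak_gflops)
{
    assert(peak_bw_gb_per_s >= 0.0);
    assert(intensity_flops_per_byte >= 0.0);
    assert(peak_gflops >= 0.0);
    double memory_bound = peak_bw_gb_per_s * intensity_flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

With illustrative numbers (16 GB/s of bandwidth and a 16 GFLOPs/s peak), a kernel at 0.5 FLOPs/byte is memory bound at 8 GFLOPs/s, while one at 4 FLOPs/byte reaches the 16 GFLOPs/s roof.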

  40. Roofline model [Figure: roofline plot. The flat part of the roof is the peak floating-point performance; along the sloped part, floating-point performance is limited by memory bandwidth]
