Parallel Computer Architectures (2nd week)
• References
• Flynn’s Taxonomy
• Classification of Parallel Computers Based on Architectures
• Trend in Parallel Computer Architectures
REFERENCES
• Scalable Parallel Computing, Kai Hwang and Zhiwei Xu, McGraw-Hill, Chapter 1
• Parallel Computing: Theory and Practice, Michael J. Quinn, McGraw-Hill, Chapter 3
• Parallel Processing Course, Yang-Suk Kee (yskee@iris.snu.ac.kr), School of EECS, Seoul National University
FLYNN’S TAXONOMY
• A way to classify parallel computers (Flynn, 1972)
• Based on the notions of instruction streams and data streams
• SISD (Single Instruction stream over a Single Data stream)
• SIMD (Single Instruction stream over Multiple Data streams)
• MISD (Multiple Instruction streams over a Single Data stream)
• MIMD (Multiple Instruction streams over Multiple Data streams)
• Popularity: MIMD > SIMD > MISD
SISD (Single Instruction Stream over a Single Data Stream)
• Conventional sequential machines
• IS: Instruction Stream, DS: Data Stream, CU: Control Unit, PU: Processing Unit, MU: Memory Unit
[Figure: SISD architecture. The CU issues an instruction stream to the PU, which exchanges a data stream with the MU and I/O.]
SIMD (Single Instruction Stream over Multiple Data Streams)
• Vector computers
• Special-purpose computations
• PE: Processing Element, LM: Local Memory
[Figure: SIMD architecture with distributed memory. One CU broadcasts the instruction stream to PE1..PEn, each paired with its own local memory LM1..LMn; the program and the data sets are loaded from the host.]
MISD (Multiple Instruction Streams over a Single Data Stream)
• Processor arrays, systolic arrays
• Special-purpose computations
• Is it different from a pipelined computer?
[Figure: MISD architecture (the systolic array). Control units CU1..CUn issue separate instruction streams to PU1..PUn, which pass a single data stream from one PU to the next; program and data reside in a common memory, with I/O at the ends.]
MIMD (Multiple Instruction Streams over Multiple Data Streams)
• General-purpose parallel computers
[Figure: MIMD architecture with shared memory. Each CUi issues its own instruction stream to its PUi, and all PUs access a shared memory and I/O over their own data streams.]
CLASSIFICATION BASED ON ARCHITECTURES
• Pipelined Computers
• Dataflow Architectures
• Data Parallel Systems
• Multiprocessors
• Multicomputers
PIPELINED COMPUTERS
• Instructions are divided into a number of steps (segments, stages)
• At the same time, several instructions can be loaded in the machine and be executed in different steps

                    Cycle:  1   2   3   4   5   6   7   8
    Instruction i           IF  ID  EX  WB
    Instruction i+1             IF  ID  EX  WB
    Instruction i+2                 IF  ID  EX  WB
    Instruction i+3                     IF  ID  EX  WB
    Instruction i+4                         IF  ID  EX  WB
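The table shows a 4-stage pipeline (k = 4) issuing one instruction per cycle, so n = 5 instructions finish in k + (n - 1) = 8 cycles rather than n × k = 20. A minimal sketch of this cycle count (my own illustration, not from the course material):

```c
#include <stdio.h>

/* Cycle counts for n instructions on a k-stage pipeline vs. a
 * non-pipelined machine that runs each instruction to completion. */
int main(void) {
    int k = 4, n = 5;                 /* stages IF, ID, EX, WB; 5 instructions */
    int sequential = n * k;           /* 5 * 4 = 20 cycles */
    int pipelined  = k + (n - 1);     /* fill the pipe, then 1 result per cycle */
    printf("sequential: %d cycles, pipelined: %d cycles\n",
           sequential, pipelined);    /* 20 vs. 8, as in the table above */
    return 0;
}
```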
DATAFLOW ARCHITECTURES
• Data-driven model
  • A program is represented as a directed acyclic graph in which a node represents an instruction and an edge represents the data dependency between the connected nodes
  • Firing rule: a node can be scheduled for execution if and only if its input data become valid for consumption
• Dataflow languages
  • Id, SISAL, Silage, ...
  • Single assignment, applicative (functional) languages, no side effects
  • Explicit parallelism
DATAFLOW GRAPH
• Example: z = (a + b) * c
[Figure: the dataflow representation of an arithmetic expression. Tokens a and b feed a + node; its result and token c feed a * node, which produces z.]
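To make the firing rule concrete, here is a minimal sketch in C (my own illustration; the `Node` structure and `deliver` function are names I chose, not from the course) that evaluates z = (a + b) * c by delivering tokens to nodes and firing each node only once both operands have arrived:

```c
#include <stdio.h>

typedef struct {
    double operand[2];    /* operand slots for a two-input node */
    int arrived;          /* how many operands have arrived so far */
    char op;              /* '+' or '*' */
} Node;

/* Deliver a token to a node; fire the node only when both inputs are valid. */
static int deliver(Node *n, double value, double *result) {
    n->operand[n->arrived++] = value;
    if (n->arrived < 2)
        return 0;                          /* firing rule not yet satisfied */
    *result = (n->op == '+') ? n->operand[0] + n->operand[1]
                             : n->operand[0] * n->operand[1];
    return 1;                              /* node fired */
}

int main(void) {
    Node add = { {0}, 0, '+' }, mul = { {0}, 0, '*' };
    double a = 2.0, b = 3.0, c = 4.0, sum, z;

    deliver(&add, a, &sum);                /* token a arrives: + cannot fire yet */
    if (deliver(&add, b, &sum))            /* token b arrives: + fires */
        deliver(&mul, sum, &z);            /* result token flows to the * node */
    if (deliver(&mul, c, &z))              /* token c arrives: * fires */
        printf("z = %g\n", z);             /* z = (2 + 3) * 4 = 20 */
    return 0;
}
```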
DATA-FLOW COMPUTER
• Execution of instructions is driven by data availability
• Basic components
  • Data are directly held inside instructions
  • Data availability check unit
  • Token matching unit
• Chain reaction of asynchronous instruction executions
• What is the difference between this and normal (control-flow) computers?
DATAFLOW COMPUTERS
• Advantages
  • Very high potential for parallelism
  • High throughput
  • Free from side effects
• Disadvantages
  • Time lost waiting for unneeded arguments
  • High control overhead
  • Difficulty in manipulating data structures
DATAFLOW MACHINE (Manchester Dataflow Computer)
• First actual hardware implementation
• Tokens have the form <data, tag, dest, marker>; matching is on <tag, dest>
[Figure: the Manchester dataflow ring. Tokens from the host enter a token queue, pass through the matching unit (backed by an overflow unit), fetch their instruction from the instruction store (func1..funck), and results re-enter the ring or leave for the host through the network I/O switch.]
DATAFLOW REPRESENTATION

    input d, e, f
    c0 := 0
    for i from 1 to 4 do
    begin
        ai := di / ei
        bi := ai * fi
        ci := bi + ci-1
    end
    output a, b, c

[Figure: the unrolled dataflow graph. Four / nodes compute a1..a4 from (d1,e1)..(d4,e4); four * nodes compute b1..b4 using f1..f4; a chain of four + nodes accumulates c1..c4 starting from c0.]
EXECUTION ON CONTROL FLOW MACHINES
• Assume all the external inputs are available before entering the do loop
• Latencies: + takes 1 cycle, * takes 2 cycles, / takes 3 cycles
• Sequential execution on a uniprocessor takes 24 cycles: each iteration computes ai (3 cycles), bi (2 cycles), and ci (1 cycle), so the 4 iterations take 4 × (3 + 2 + 1) = 24 cycles
• How long will it take to execute this program on a dataflow computer with 4 processors?
EXECUTION ON A DATAFLOW MACHINE
• Data-driven execution on a 4-processor dataflow computer takes 9 cycles: the four divisions a1..a4 run in parallel (cycles 1-3), the four multiplications b1..b4 run in parallel (cycles 4-5), and the additions form a dependent chain c1, c2, c3, c4 (cycles 6-9)
• Can we further reduce the execution time of this program?
PROBLEMS OF DATAFLOW COMPUTERS
• Excessive copying of large data structures in dataflow operations
  • I-structure: a tagged memory unit for overlapped usage by the producer and consumer
  • Retreat from the pure dataflow approach (shared memory)
• Handling complex data structures
• Chain reaction control is difficult to implement
• Complexity of matching store and memory units
• Exposes too much parallelism (?)
DATA PARALLEL SYSTEMS
• Programming model
  • Operations performed in parallel on each element of a data structure
  • Logically single thread of control, performing sequential or parallel steps
  • Conceptually, a processor is associated with each data element
DATA PARALLEL SYSTEMS (cont’d)
• SIMD architectural model
  • Array of many simple, cheap processors, each with little memory
  • Processors don’t sequence through instructions themselves; they are attached to a control processor that issues the instructions
  • Specialized and general communication, cheap global synchronization
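To make the programming model concrete, here is a minimal sketch in C (my own illustration; the array names are assumptions, not from the course): a logically single-threaded program applies one operation to every element, which a SIMD machine would execute in lockstep, one element per processing element.

```c
#include <stdio.h>
#define N 8

int main(void) {
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8}, b[N], c[N];
    for (int i = 0; i < N; i++) b[i] = 2.0;

    /* Logically a single step "c = a + b"; on a SIMD array the control
     * processor issues one add instruction and all N processing elements
     * perform their element's addition simultaneously. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    for (int i = 0; i < N; i++)
        printf("%g ", c[i]);
    printf("\n");
    return 0;
}
```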
VECTOR PROCESSOR
• Instruction set includes operations on vectors as well as scalars
• Two types of vector computers
  • Processor arrays
  • Pipelined vector processors
PROCESSOR ARRAY
• A sequential computer connected to a set of identical processing elements that simultaneously perform the same operation on different data, e.g., the CM-200
[Figure: processor array. A front-end computer (CPU, program and data memory, I/O processor) drives an instruction path to identical processing elements, each with its own data memory, joined by an interconnection network with data paths and I/O.]
PIPELINED VECTOR PROCESSOR
• Streams vectors from memory to the CPU
• Uses pipelined arithmetic units to manipulate the data
• e.g., Cray-1, Cyber-205
MULTIPROCESSOR
• Consists of many fully programmable processors, each capable of executing its own program
• Shared address space architecture
• Classified into 2 types
  • Uniform Memory Access (UMA) multiprocessors
  • Non-Uniform Memory Access (NUMA) multiprocessors
SHARED ADDRESS SPACE ARCHITECTURE
[Figure: virtual address spaces for a collection of processes communicating via shared addresses. Each process P0..Pn keeps a private portion of its address space, while the shared portion maps onto common physical addresses in the machine's physical address space, accessed with ordinary loads and stores.]
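A minimal sketch of this model using POSIX threads (my own illustration; `counter` and `worker` are names I chose, not from the course): two threads communicate through an ordinary shared variable via loads and stores, with a mutex guarding the shared portion of the address space.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                      /* lives in the shared portion */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);            /* serialize the shared store */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);       /* 200000: both threads saw one memory */
    return 0;
}
```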
UMA MULTIPROCESSOR
• Uses a central switching mechanism to reach a centralized shared memory
• All processors have equal access time to global memory
• Tightly coupled system
• Problem: cache consistency
• Pi: processor i, Ci: cache i
[Figure: UMA organization. Processors P1..Pn, each with its cache C1..Cn, reach the memory banks and I/O through a central switching mechanism.]
UMA MULTIPROCESSOR (cont’d)
• Crossbar switching mechanism
[Figure: processors with caches and I/O connected to the memory modules through a crossbar switch.]
UMA MULTIPROCESSOR (cont’d)
• Shared-bus switching mechanism
[Figure: processors with caches, I/O, and the memory modules all attached to a single shared bus.]
UMA MULTIPROCESSOR (cont’d)
• Packet-switched network
[Figure: processors with caches connected to the memory modules through a packet-switched network.]
NUMA MULTIPROCESSOR
• Distributed shared memory formed by combining the local memories of all processors
• Memory access time depends on whether the address is local to the processor
• Caching shared (particularly nonlocal) data?
[Figure: NUMA organization with distributed memory. Each processor P has a cache and a local memory module; the nodes are connected by a network.]
CURRENT TYPES OF MULTIPROCESSORS
• PVP (Parallel Vector Processor)
  • A small number of proprietary vector processors connected by a high-bandwidth crossbar switch
• SMP (Symmetric Multiprocessor)
  • A small number of COTS (commercial off-the-shelf) microprocessors connected by a high-speed bus or crossbar switch
• DSM (Distributed Shared Memory)
  • Similar to SMP, but the memory is physically distributed among the nodes
PVP (Parallel Vector Processor)
• VP: Vector Processor, SM: Shared Memory
[Figure: vector processors VP connected to shared memory modules SM through a crossbar switch.]
SMP (Symmetric Multi-Processor)
• P/C: microprocessor and cache
[Figure: P/C nodes connected to shared memory modules SM by a bus or crossbar switch.]
DSM (Distributed Shared Memory)
• DIR: cache directory
[Figure: each node has a microprocessor and cache (P/C) on a memory bus (MB) with a local memory (LM), a cache directory (DIR), and a network interface (NIC); the nodes are connected by a custom-designed network.]
MULTICOMPUTERS
• Consists of many processors, each with its own memory
• No shared memory
• Processors interact via message passing (loosely coupled system)
[Figure: nodes, each a processor P with its memory M, connected by a message-passing interconnection network.]
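A minimal sketch of the message-passing model using MPI (my own illustration; the course does not prescribe MPI): each process owns only its local memory, and the value crosses the interconnection network through explicit send and receive calls.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                           /* exists only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value); /* now a copy in rank 1's memory */
    }
    MPI_Finalize();
    return 0;
}
```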
CURRENT TYPES OF MULTICOMPUTERS
• MPP (Massively Parallel Processing)
  • Total number of processors > 1000
• Cluster
  • Each node in the system has fewer than 16 processors
• Constellation
  • Each node in the system has 16 or more processors
MPP (Massively Parallel Processing)
• MB: Memory Bus, NIC: Network Interface Circuitry
[Figure: each node has a microprocessor and cache (P/C) on a memory bus (MB) with a local memory (LM) and a NIC; the nodes are connected by a custom-designed network.]
Cluster
• LD: Local Disk, IOB: I/O Bus
[Figure: each node has a microprocessor and cache (P/C) on a memory bus (MB) with memory (M) and a bridge to an I/O bus (IOB) carrying a local disk (LD) and a NIC; the nodes are connected by a commodity network (Ethernet, ATM, Myrinet, VIA).]
CONSTELLATION
• IOC: I/O Controller
[Figure: each node contains 16 or more processors (P/C) sharing memory modules (SM) through a hub, plus an I/O controller (IOC), a local disk (LD), and a NIC; the nodes are connected by a custom or commodity network.]
Trend in Parallel Computer Architectures
[Chart: number of HPC systems (0 to 400) by architecture class over the years 1997-2002, for MPPs, Constellations, Clusters, and SMPs.]
TOWARD ARCHITECTURAL CONVERGENCE
• Evolution and the role of software have blurred the boundary
  • Send/recv can be supported on a shared address space (SAS) machine via buffers
  • A global address space can be constructed on a message-passing (MP) machine
• Hardware organization is converging too
  • Tighter network interface (NI) integration for message passing (low latency, higher bandwidth)
  • At a lower level, hardware SAS passes hardware messages
  • Nodes connected by a general network and communication assists
• Clusters of workstations/SMPs
  • Emergence of fast SANs (System Area Networks)
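As a concrete instance of the first point, here is a minimal sketch (my own construction, not from the course) of send/recv built on a shared address space: a one-slot buffer guarded by a mutex and condition variable plays the role of the network between two threads.

```c
#include <pthread.h>
#include <stdio.h>

static int slot, full = 0;                    /* shared buffer acts as the "network" */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

static void send_msg(int v) {                 /* send = store into the shared buffer */
    pthread_mutex_lock(&m);
    while (full) pthread_cond_wait(&cv, &m);  /* wait until the slot drains */
    slot = v; full = 1;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
}

static int recv_msg(void) {                   /* recv = load from the shared buffer */
    pthread_mutex_lock(&m);
    while (!full) pthread_cond_wait(&cv, &m); /* wait until a message arrives */
    int v = slot; full = 0;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
    return v;
}

static void *receiver(void *arg) {
    (void)arg;
    for (int i = 0; i < 3; i++)
        printf("received %d\n", recv_msg());
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, receiver, NULL);
    for (int i = 0; i < 3; i++)
        send_msg(i);                          /* messages 0, 1, 2 cross the buffer */
    pthread_join(t, NULL);
    return 0;
}
```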
Converged Architecture of Current Supercomputers
[Figure: several multiprocessor nodes, each with multiple processors P sharing a memory, connected by an interconnection network.]