CS61V Parallel Architectures
Architecture Classification • The Flynn taxonomy (proposed in 1966!) • Functional taxonomy based on the notion of streams of information: data and instructions • Platforms are classified according to whether they have a single (S) or multiple (M) stream of data or instructions.
Flynn’s Classification • Architecture categories: SISD, SIMD, MISD, MIMD
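The naming rule behind the four categories can be sketched in a few lines. This is only an illustration: the function name `flynn_category` and its two integer parameters are assumptions made for the example, not part of the original slides.

```python
def flynn_category(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category name for the given numbers of streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    # S = single stream, M = multiple streams
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"


print(flynn_category(1, 1))    # SISD
print(flynn_category(1, 64))   # SIMD
print(flynn_category(4, 1))    # MISD
print(flynn_category(4, 4))    # MIMD
```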
SISD • Classic von Neumann machine • Basic components: CPU (control unit, ALU) and Main Memory (RAM) • Connected via Bus (aka von Neumann bottleneck) • Examples: standard desktop computer, laptop
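SISD [Figure: SISD block diagram with units labelled M, C, and P, connected by instruction streams (IS) and a data stream (DS)]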
SIMD • Pure SIMD machine: • single CPU devoted exclusively to control • collection of subordinate ALUs each w/small amount of memory • Instruction cycle: CPU broadcasts, ALUs execute or idle • lock-step progress (effectively a global clock) • Key point: completely synchronous execution of statements • Vector and matrix computation lend themselves to an SIMD implementation • Examples of SIMD computers: Illiac IV, MPP, DAP, CM-2, and MasPar MP-2
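The broadcast/execute-or-idle cycle can be mimicked sequentially. The sketch below is a toy model, not a description of any real machine: `simd_step`, the four-element `values` list, and the use of Python lambdas as "instructions" are all assumptions made for illustration.

```python
def simd_step(op, local_values, active):
    """Broadcast one operation: active ALUs apply it, idle ALUs keep their value."""
    return [op(v) if on else v for v, on in zip(local_values, active)]


# Each list position stands for the local memory of one subordinate ALU.
values = [1, -2, 3, -4]

# Instruction 1: every ALU doubles its local value.
values = simd_step(lambda v: v * 2, values, [True] * len(values))

# Instruction 2: only ALUs holding a negative value negate it; the rest idle.
values = simd_step(lambda v: -v, values, [v < 0 for v in values])

print(values)  # [2, 4, 6, 8]
```

Every ALU sees the same instruction in the same step; the activity flags are the only source of divergence, which is what lock-step execution means here.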
Data Parallel Systems • Programming model • Operations performed in parallel on each element of data structure • Logically single thread of control, performs sequential or parallel steps • Conceptually, a processor associated with each data element • Architectural model • Array of many simple, cheap processors with little memory each • Processors don’t sequence through instructions • Attached to a control processor that issues instructions • Specialized and general communication, cheap global synchronization • Original motivations • Matches simple differential equation solvers • Centralize high cost of instruction fetch/sequencing
Data Parallel Programming In this approach, we must determine how large amounts of data can be split up. In other words, we need to identify small chunks of data which require similar processing. • These chunks of data are then assigned to different sites where they can be processed. The computations at each node may require some intermediate results from peer nodes. • The same executable could be running on each processing site, but each processing site would have a different dataset. • For data parallelism to work best, the volume of communicated values should be small compared with the volume of locally computed results.
Data Parallel Programming Data parallel decomposition can be implemented using an SPMD (single program, multiple data) programming model. One processing element is regarded as "first among equals": • This processor starts up the program and initialises the other processors. It then works as an equal to these processors. • Each PE is doing approximately the same calculation on different data.
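A minimal sketch of this decomposition, run sequentially for clarity. The helper names `split`, `program`, and `run_spmd`, and the sum-of-squares workload, are hypothetical; a real SPMD system would run `program` concurrently on separate processing elements.

```python
def split(data, n_pes):
    """Split data into n_pes contiguous chunks of nearly equal size."""
    base, extra = divmod(len(data), n_pes)
    chunks, start = [], 0
    for rank in range(n_pes):
        size = base + (1 if rank < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def program(rank, chunk):
    """The single program: every PE runs this on its own chunk."""
    return sum(x * x for x in chunk)


def run_spmd(data, n_pes):
    # PE 0 acts as "first among equals": it starts up and hands out the data.
    chunks = split(data, n_pes)
    # Each PE (including PE 0) then does the same calculation on different data.
    # Simulated sequentially here; the PEs would run in parallel on real hardware.
    partials = [program(rank, chunk) for rank, chunk in enumerate(chunks)]
    # Only one value per PE is communicated, which is small compared with the local work.
    return sum(partials)


print(run_spmd(list(range(10)), 4))  # 285
```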
Data Parallel Programming Data-parallel architectures introduced the new programming-language concept of a distributed or parallel array. Typically the set of semantic operations allowed on a distributed array was somewhat different from the operations allowed on a sequential array. Unfortunately, each data parallel language had features tied to a particular manufacturer's parallel computer architecture, e.g. *LISP, C* and CM Fortran for Thinking Machines Corporation’s Connection Machine series of computers. In the 1980s and 1990s microprocessors grew in power and availability, and fell in price. Building SIMD computers out of simple but specialized compute nodes gradually became less economical than putting a general-purpose commodity microprocessor at every node. Eventually SIMD computers were displaced almost completely by Multiple Instruction Multiple Data (MIMD) parallel computer architectures.
Example - ILLIAC IV ILLIAC IV, built in 1974 at the University of Illinois, was the first large system to employ semiconductor primary memory. The ILLIAC IV was a SIMD computer for array processing. It consisted of: • a control unit (CU) and • 64 processing elements (PEs). Each processing element had 2K (2,048) 64-bit words of memory associated with it. The CU could access all 128K words of memory (64 PEs x 2K words) through a bus, but each PE could only directly access its local memory.
Example - ILLIAC IV An 8 by 8 grid interconnect joined each PE to 4 neighbours. The CU interpreted program instructions scattered across the memory, and broadcast them to the PEs. Neither the PEs nor the CU were general-purpose computers in the modern sense--the CU had quite limited arithmetic capabilities. Between 1975 and 1981 it was the world's fastest computer.
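To make the "8 by 8 grid, 4 neighbours" statement concrete, here is a small sketch that computes the neighbours of a PE. It assumes row-major numbering of PEs 0 to 63 and wraparound at the grid edges; both are assumptions for illustration, and the exact wiring of the real machine may differ in detail. The function name `neighbours` is hypothetical.

```python
def neighbours(pe, side=8):
    """Return the four grid neighbours of a PE, assuming wraparound edges."""
    row, col = divmod(pe, side)
    return sorted({
        ((row - 1) % side) * side + col,   # north
        ((row + 1) % side) * side + col,   # south
        row * side + (col - 1) % side,     # west
        row * side + (col + 1) % side,     # east
    })


print(neighbours(0))    # [1, 7, 8, 56]
print(neighbours(27))   # [19, 26, 28, 35]
```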
Example - ILLIAC IV The ILLIAC IV had thirteen rotating fixed-head disks which formed part of the central system memory. It was one of the first computers to use all-semiconductor main memory.
Data Parallel Languages CFD was a data parallel language developed in the early 70s at the Computational Fluid Dynamics Branch of Ames Research Center. CFD was a "FORTRAN-like" language, rather than a FORTRAN dialect. The language design was extremely pragmatic. No attempt was made to hide the hardware peculiarities from the user; in fact, every attempt was made to give programmers access to and control of all of the ILLIAC hardware so they could construct an efficient program. CFD had five basic datatypes: • CU INTEGER • CU REAL • CU LOGICAL • PE REAL • PE INTEGER.
Data Parallel Languages The type of a variable statically encoded its home: • either on the control unit or on the processing elements. Apart from restrictions on their home, the two INTEGER and REAL types behave like the corresponding types in ordinary FORTRAN. The CU LOGICAL type was more idiosyncratic: • it had 64 independent bits that acted as flags controlling activity of the PEs.
Data Parallel Languages Scalars and arrays of the five types could be declared as in FORTRAN. • An ordinary variable or array of type CU REAL, for example, would be allocated in the (very small) control unit memory. • An ordinary variable or array of type PE REAL would be allocated somewhere in the collective memory of the processing elements (accessible by the control unit over the data bus), e.g.
    CU REAL A, B(100)
    PE INTEGER I
    PE REAL D(25), E(1000)
The last data structure available in CFD was a new kind of array called a vector-aligned array.
Data Parallel Languages Only the first dimension could be distributed, and the extent of that dimension had to be exactly 64. A vector-aligned array would be of PE INTEGER or PE REAL type, and the syntax for the distributed dimension involved an asterisk:
    PE INTEGER J(*)
    PE REAL X(*,4), Y(*,2,8)
These are parallel arrays. J(1) is stored on the first PE, J(2) is stored on the second PE, and so on. Similarly, X(1,1), X(1,2), X(1,3), X(1,4) are stored on PE 1; X(2,1), X(2,2), X(2,3), X(2,4) are stored on PE 2; etc.
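The storage rule (the first subscript selects the PE, the remaining subscripts live in that PE's local memory) can be sketched as a mapping from PE number to the subscripts it holds. The function `distribute` and the dictionary representation are assumptions made for this illustration; they are not CFD features.

```python
from itertools import product

N_PES = 64  # the distributed extent is fixed at 64 in the text


def distribute(trailing_extents):
    """Map each PE (1-based) to the subscripts of the elements it stores."""
    ranges = [range(1, n + 1) for n in trailing_extents]
    return {
        pe: [(pe, *rest) for rest in product(*ranges)]
        for pe in range(1, N_PES + 1)
    }


J = distribute(())       # PE INTEGER J(*)
X = distribute((4,))     # PE REAL X(*,4)
Y = distribute((2, 8))   # PE REAL Y(*,2,8)

print(J[1])       # [(1,)]
print(X[2])       # [(2, 1), (2, 2), (2, 3), (2, 4)]
print(len(Y[64])) # 16
```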
Data Parallel Languages A vector expression was a vector-aligned array with a (*) subscript in the first dimension. Communication between neighbouring PEs was captured by allowing the (*) to have some shift added, as in:
    DIFP(*) = P(* + 1) - P(* - 1)
All shifts were cyclic (end-around) shifts, so this parallel statement is equivalent to the sequential statements:
    DIFP(1) = P(2) - P(64)
    DIFP(2) = P(3) - P(1)
    ...
    DIFP(64) = P(1) - P(63)
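The end-around shift semantics can be checked with a short sketch. The helper `shifted` and the test data in `P` are assumptions for illustration; Python lists are 0-based, so `P[0]` plays the role of P(1).

```python
def shifted(vec, shift):
    """Cyclic shift: element i of the result is vec[(i + shift) mod n]."""
    n = len(vec)
    return [vec[(i + shift) % n] for i in range(n)]


# Arbitrary test data: P(k) = k*k for k = 1..64.
P = [k * k for k in range(1, 65)]

# DIFP(*) = P(* + 1) - P(* - 1)
DIFP = [a - b for a, b in zip(shifted(P, +1), shifted(P, -1))]

print(DIFP[0])    # P(2) - P(64) = 4 - 4096 = -4092
print(DIFP[1])    # P(3) - P(1)  = 9 - 1    = 8
print(DIFP[63])   # P(1) - P(63) = 1 - 3969 = -3968
```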
Data Parallel Languages Essential flexibility was added by allowing vector assignments to be executed conditionally with a vector test, e.g.
    IF(A(*) .LT. 0) A(*) = -A(*)
Less structured methods of masking operations by explicitly assigning PE activity flags in CU LOGICAL variables were also available; • there were special primitives for restricting activity to simply-specified ranges of PEs. • PEs could concurrently access different addresses in their local memory by using vector subscripts:
    DIAG(*) = RHO(*, X(*))
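Both statements can be imitated on ordinary lists. The sketch uses 4 "PEs" instead of 64 to keep the data readable, and the helper `where_assign` and all sample values are assumptions made for this example.

```python
def where_assign(mask, new, old):
    """Masked assignment: take the new value where the mask is set, else keep the old one."""
    return [n if m else o for m, n, o in zip(mask, new, old)]


# IF (A(*) .LT. 0) A(*) = -A(*)
A = [3, -1, 0, -7]
A = where_assign([a < 0 for a in A], [-a for a in A], A)
print(A)  # [3, 1, 0, 7]

# DIAG(*) = RHO(*, X(*)): each PE reads a different address in its own local memory.
# Row i of RHO is the local data of PE i; X holds 1-based column subscripts.
RHO = [[10, 11, 12],
       [20, 21, 22],
       [30, 31, 32],
       [40, 41, 42]]
X = [1, 2, 3, 1]
DIAG = [RHO[pe][X[pe] - 1] for pe in range(len(X))]
print(DIAG)  # [10, 21, 32, 40]
```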
Connection Machine (Tucker, IEEE Computer, Aug. 1988)
CM-5 • Repackaged SparcStation • 4 per board • Fat-Tree network • Control network for global synchronization
Whither SIMD machines? Trade off individual processor performance for collective performance: • CM-1 had 64K PEs, each 1-bit! Problems with SIMD • Inflexible - not all problems can use this style of parallelism • cannot leverage commodity microprocessor technology => cannot be general-purpose architectures Special-purpose SIMD architectures are still viable (array processors, DSP chips)
Vector Processors Definition: a processor that can do element-wise operations on entire vectors with a single instruction, called a vector instruction • These are specified as operations on vector registers • A processor comes with some number of such registers • A vector register holds ~32-64 elements • The number of elements is larger than the amount of parallel hardware (called vector pipes or lanes), say 2-4 • The hardware performs a full vector operation in #elements-per-vector-register / #pipes steps [Figure: vr3 = vr1 + vr2; logically performs #elts adds in parallel, actually performs #pipes adds in parallel]
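The difference between the logical view (all adds at once) and the hardware view (a few lanes at a time) can be sketched as follows. The function `vector_add` and the choice of 64 elements with 4 lanes are assumptions picked from the ranges quoted above, not figures for a specific machine.

```python
def vector_add(vr1, vr2, lanes):
    """Add two vector registers element-wise, processing `lanes` elements per step."""
    result, steps = [], 0
    for start in range(0, len(vr1), lanes):
        # One step: each lane adds one pair of elements.
        result.extend(a + b for a, b in zip(vr1[start:start + lanes],
                                            vr2[start:start + lanes]))
        steps += 1
    return result, steps


vr1 = list(range(64))   # a 64-element vector register
vr2 = [1] * 64
vr3, steps = vector_add(vr1, vr2, lanes=4)

print(steps)      # 16, i.e. 64 elements / 4 lanes
print(vr3[:4])    # [1, 2, 3, 4]
```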
Vector Processors Advantages • quick fetch and decode of a single instruction for multiple operations • the instruction provides the processor with a regular source of data, which can arrive at each cycle and be processed in a pipelined fashion • The compiler does the work for you, of course Memory-to-memory • no registers • can process very long vectors, but startup time is large • appeared in the 70s and died in the 80s Examples: Cray, Fujitsu, Hitachi, NEC
Vector Processors What about: for (j = 0; j < 100; j++) A[j] = B[j] * C[j] Scalar code: load, operate, store for each iteration Both instructions and data consume memory bandwidth The solution: A vector instruction
Vector Processors A[0:99] = B[0:99] * C[0:99] • Single instruction requires memory bandwidth for data only. • No control overhead for loops Pitfalls • extension to instruction set, vector functional units, vector registers, memory subsystem changes for vectors
Vector Processors Merits of vector processor • Very deep pipeline without data hazard • The computation of each result is independent of the computation of previous results • Instruction bandwidth requirement is reduced • A vector instruction specifies a great deal of work • Control hazards are nonexistent • A vector instruction represents an entire loop. • No loop branch
Vector Processors (Cont’d) The high latency of initiating a main memory access is amortized • A single access is initiated for the entire vector rather than a single word • Known access pattern • Interleaved memory banks A vector operation is faster than a sequence of scalar operations on the same number of data items!
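The amortization argument can be put into numbers with an idealised model. Everything here is an assumption for illustration: the 12-cycle latency, the function names, and the premise that interleaved banks deliver one word per cycle once the first word has arrived.

```python
def scalar_access_cycles(n_words, latency):
    """Each word is fetched by its own access and pays the full latency."""
    return n_words * latency


def vector_access_cycles(n_words, latency):
    """One access for the whole vector: pay the latency once, then one word per cycle."""
    return latency + (n_words - 1)


LATENCY = 12   # hypothetical memory startup latency in cycles
N = 64         # words in one vector

print(scalar_access_cycles(N, LATENCY))   # 768
print(vector_access_cycles(N, LATENCY))   # 75
```

Under these assumptions the startup cost per word drops from 12 cycles to well under 1, which is the sense in which the latency is amortized.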
Vector Programming Example Y = a * X + Y (RISC machine)
      LD    F0, a
      ADDI  R4, Rx, #512   ; last address to load
Loop: LD    F2, 0(Rx)      ; load X(i)
      MULTD F2, F0, F2     ; a x X(i)
      LD    F4, 0(Ry)      ; load Y(i)
      ADDD  F4, F2, F4     ; a x X(i) + Y(i)
      SD    F4, 0(Ry)      ; store into Y(i)
      ADDI  Rx, Rx, #8     ; increment index to X
      ADDI  Ry, Ry, #8     ; increment index to Y
      SUB   R20, R4, Rx    ; compute bound
      BNZ   R20, Loop      ; check if done
The loop body repeats 64 times.
Vector Programming Example (Cont’d) Y = a * X + Y (Vector machine)
      LD     F0, a         ; load scalar
      LV     V1, Rx        ; load vector X
      MULTSV V2, F0, V1    ; vector-scalar multiply
      LV     V3, Ry        ; load vector Y
      ADDV   V4, V2, V3    ; add
      SV     Ry, V4        ; store the result
6 instructions (low instruction bandwidth)
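The instruction-bandwidth gap between the two listings can be worked out directly. The counts below are read off the listings (2 setup instructions and a 9-instruction loop body for the RISC version, 6 instructions for the vector version); the variable names are illustrative only.

```python
ELEMENT_BYTES = 8      # each element is one 8-byte double (index step of #8)
VECTOR_BYTES = 512     # the loop bound is Rx + 512
n = VECTOR_BYTES // ELEMENT_BYTES        # 64 iterations

SETUP = 2              # LD F0 and ADDI R4 before the loop
LOOP_BODY = 9          # LD, MULTD, LD, ADDD, SD, ADDI, ADDI, SUB, BNZ

scalar_count = SETUP + LOOP_BODY * n     # dynamic instructions on the RISC machine
vector_count = 6                         # LD, LV, MULTSV, LV, ADDV, SV

print(n)                                 # 64
print(scalar_count)                      # 578
print(scalar_count // vector_count)      # 96
```

So the scalar loop executes 578 instructions where the vector code executes 6, roughly a 96-fold reduction in instructions fetched and decoded.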
A Vector-Register Architecture (DLXV) [Figure: block diagram showing main memory, the vector load-store unit, vector registers, scalar registers, FP add/subtract functional units, and the crossbars connecting them]
Vector Machines
Machine          Registers   Elements per register   Load/store units   Functional units
CRAY-1           8           64                      1                  6
Fujitsu VP200    8 - 256     32-1024                 2                  3
CRAY X-MP        8           64                      2Ld/1St            8
Hitachi S820     32          256                     4                  4
NEC SX/2         8 + 8192    256                     8                  16
Convex C-1       8           128                     1                  4
CRAY-2           8           64                      1                  5
CRAY Y-MP        8           64                      2Ld/1St            8
CRAY C-90        8           128                     4                  8
NEC SX/4         8 + 8192    256                     8                  16
MISD • Multiple instruction, single data • Doesn’t really exist, unless you consider pipelining an MISD configuration
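MISD [Figure: MISD block diagram with memory (M) and several C/P pairs, each with its own instruction stream (IS), sharing a single data stream (DS)]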