Lecture on High Performance Processor Architecture (CS05162) Introduction to Data-Level Parallel Architecture An Hong han@ustc.edu.cn Fall 2007 University of Science and Technology of China Department of Computer Science and Technology
Outline • Basic Concepts • Roots: Vector Supercomputers • Vector Architectures of the Future • High-end supercomputers • High-performance microprocessors for multimedia • Stream Processor Architecture USTC CS AN Hong
DLP vs. ILP (superscalar and VLIW) • ILP (instruction-level parallelism): different operations (e.g., one load, one add, one multiply, and one divide) in the same cycle • Each operation is performed in a different execution unit • General-purpose computing • Superscalar, VLIW, DSP (VLIW) • DLP (data-level parallelism): the same arithmetic/logical operation is performed on multiple data elements • Typically, the execution unit is wide (64-bit or 128-bit) and holds multiple data elements, e.g., four 16-bit elements in a 64-bit execution unit • Multimedia, scientific computing • Vector, SIMD, SPMD USTC CS AN Hong
Vector Processing • Vector processors have high-level operations that work on linear arrays of numbers: "vectors" • Initially developed for supercomputing applications; today important for multimedia • [Figure: a scalar add (add r3, r1, r2) performs 1 operation on single registers, while a vector add (addv v3, v1, v2) performs N operations, adding corresponding elements of v1 and v2 across the vector length into v3] USTC CS AN Hong
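To make the figure concrete, here is a minimal C sketch of the semantics of the single vector instruction addv v3, v1, v2 (the function name and the explicit vector_length parameter are illustrative, not part of any real ISA):

/* add  r3, r1, r2   -> one instruction, one result (scalar)      */
/* addv v3, v1, v2   -> one instruction, N results (vector)       */
/* The vector add, expressed as the C loop it replaces:           */
void addv(double *v3, const double *v1, const double *v2, int vector_length)
{
    for (int i = 0; i < vector_length; i++)
        v3[i] = v1[i] + v2[i];   /* same operation on every element pair */
}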
Roots: Vector Supercomputers USTC CS AN Hong
Supercomputers • In the 70s-80s, Supercomputer = Vector Machine • Definitions of a supercomputer: • Fastest machine in the world at a given task • A device to turn a compute-bound problem into an I/O-bound problem • Any machine costing $30M+ • Any machine designed by Seymour Cray • The CDC 6600 (designed by Seymour Cray at Control Data Corporation, 1964) is regarded as the first supercomputer USTC CS AN Hong
Supercomputer Applications • Typical application areas • Military research (nuclear weapons, cryptography) • Scientific research • Weather forecasting • Oil exploration • Industrial design (car crash simulation) • …… All involve huge computations on large data sets USTC CS AN Hong
Vector Supercomputers(Epitomized by Cray-1, 1976) • Scalar Unit + Vector Extensions. • Load/Store Architecture. • Vector Registers. • Vector Instructions. • Hardwired Control. • Highly Pipelined Functional Units. • Interleaved Memory System. • No Data Caches. • No Virtual Memory USTC CS AN Hong
Cray-1(1976) USTC CS AN Hong
Cray-1: the world's most expensive love-seat • 1976 • 80 MHz • 138 MFLOPS peak • 8 MB memory USTC CS AN Hong
A Modern Vector Supercomputer: NEC SX-5 (1998) • CMOS technology • 250 MHz clock (312 MHz in 2001) • CPU fits on one multi-chip module • SDRAM main memory (up to 128 GB) • Scalar unit • 4-way superscalar with out-of-order and speculative execution • 64 KB I-cache and 64 KB data cache • Vector unit • 8 foreground VRegs + 64 background VRegs (256 elements/VReg) • 1 multiply unit, 1 divide unit, 1 add/shift unit, 1 logical unit, 1 mask unit • 16 lanes, 8 GFLOPS peak (32 FLOPS/cycle) • 1 load & store unit (32x8-byte accesses/cycle) • 64 GB/s memory bandwidth per processor • SMP structure • 16 CPUs connected to memory through a crossbar • 1 TB/s shared memory bandwidth USTC CS AN Hong
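As a quick sanity check on the peak numbers above, here is a back-of-the-envelope sketch in C. The assumption that each of the 16 lanes completes one multiply and one add per cycle is an inference, not a documented SX-5 detail:

#include <stdio.h>

int main(void)
{
    const double clock_hz       = 250e6; /* 250 MHz clock (1998 version)          */
    const int    lanes          = 16;    /* 16 vector lanes                        */
    const int    flops_per_lane = 2;     /* assumed: 1 multiply + 1 add per lane   */

    double flops_per_cycle = lanes * flops_per_lane;           /* 32 FLOPS/cycle   */
    double peak_gflops     = flops_per_cycle * clock_hz / 1e9; /* 8 GFLOPS peak    */

    printf("%.0f FLOPS/cycle, %.0f GFLOPS peak\n", flops_per_cycle, peak_gflops);
    return 0;
}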
Components of a Vector Processor • Scalar registers: single element for an FP scalar or address • Scalar CPU: registers, datapaths, instruction fetch logic • Vector registers: fixed-length bank holding a single vector • has at least 2 read and 1 write ports • typically 8-32 vector registers, each holding 64-128 64-bit elements • MM: can be viewed as an array of 64b, 32b, 16b, or 8b elements • Vector functional units (FUs): fully pipelined, start a new operation every clock • typically 4 to 8 FUs: FP add, FP multiply, FP reciprocal (1/X), integer add, logical, shift; may have multiple copies of the same unit • Multiple datapaths (pipelines) used in each unit to process multiple elements per cycle • Vector load-store units (LSUs): fully pipelined unit to load or store a vector; may have multiple LSUs • Multiple elements fetched/stored per cycle • Crossbar: connects FUs, LSUs, and registers USTC CS AN Hong
Vector Programming Model USTC CS AN Hong
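As a placeholder for the figure, a minimal C sketch of the state visible to the programmer of a vector-register machine; the register counts and MVL value are illustrative assumptions, not SX-5 or Cray-1 values:

#define MVL       64               /* assumed maximum vector length                */
#define NUM_VREGS 8                /* assumed number of vector registers           */

typedef struct {
    double vreg[NUM_VREGS][MVL];   /* vector registers, each holding MVL elements  */
    double freg[32];               /* scalar floating-point registers              */
    long   greg[32];               /* scalar general-purpose / address registers   */
    int    vl;                     /* vector length register, 0..MVL               */
} VectorProgrammingModel;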
Styles of Vector Architectures • Memory-memory vector processors • All vector operations are memory to memory; vector memory-memory instructions hold all vector operands in main memory • The first vector machines, the CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines • Vector-register processors • All vector operations are between vector registers (except vector load and store) • Vector equivalent of load-store architectures • Includes all vector machines since the late 1980s; the Cray-1 ('76) was the first vector-register machine USTC CS AN Hong
Styles of Vector Architectures • Example source code:
  for (i=0; i<N; i++) {
    C[i] = A[i] + B[i];
    D[i] = A[i] - B[i];
  }
• Vector memory-memory code:
  ADDV C, A, B
  SUBV D, A, B
• Vector register code:
  LV V1, A
  LV V2, B
  ADDV V3, V1, V2
  SV V3, C
  SUBV V4, V1, V2
  SV V4, D
• We assume vector-register machines for the rest of the lecture USTC CS AN Hong
Vector Arithmetic Execution • Use deep pipeline (=> fast clock) to execute element operations • Simplifies control of the deep pipeline because elements in a vector are independent (=> no hazards!) • [Figure: six-stage multiply pipeline] USTC CS AN Hong
Vector Instruction Execution USTC CS AN Hong
Basic Vector Instructions: e.g., the "DLXV" vector instruction set USTC CS AN Hong
Vector Code Example USTC CS AN Hong
Automatic Code Vectorization for (i=0; i < N; i++) C[i] = A[i] + B[i]; USTC CS AN Hong
Vector Memory System • Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency • Bank busy time: cycles between accesses to the same bank USTC CS AN Hong
Vector Memory Operations • Load/store operations move groups of data between registers and memory • Three types of addressing • Unit stride • Fastest • Non-unit (constant) stride • Indexed (gather-scatter) • Vector equivalent of register indirect • Good for sparse arrays of data • Increases the number of programs that vectorize • a compress/expand variant also exists • Support for various combinations of data widths in memory • {.L, .W, .H, .B} x {64b, 32b, 16b, 8b} USTC CS AN Hong
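A minimal C sketch of the three addressing patterns above; each loop shows the memory access pattern a single vector load would cover, and the array and parameter names are illustrative:

void vector_loads(double *v1, double *v2, double *v3,
                  const double *A, const int *index,
                  int vl, int stride)
{
    for (int i = 0; i < vl; i++)   /* unit-stride load, e.g. LV V1, A            */
        v1[i] = A[i];

    for (int i = 0; i < vl; i++)   /* strided load, e.g. walking a matrix column */
        v2[i] = A[(long)i * stride];

    for (int i = 0; i < vl; i++)   /* indexed (gather) load for sparse data      */
        v3[i] = A[index[i]];
}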
Optimization 1: Vector Chaining • Suppose: vmul.vv V1,V2,V3 followed by vadd.vv V4,V1,V5 # RAW hazard on V1 • Chaining • The vector register (V1) is treated not as a single entity but as a group of individual registers • Pipeline forwarding can work on individual vector elements • Flexible chaining: allow a vector to chain to any other active vector operation => more read/write ports • [Figure: unchained execution runs vmul to completion before vadd starts; chained execution overlaps vadd with vmul. The Cray X-MP introduced memory chaining.] USTC CS AN Hong
Optimization 1: Vector Chaining • Vector Chaining Advantage • Without chaining, must wait for the last element of the result to be written before starting the dependent instruction • With chaining, can start the dependent instruction as soon as the first result element is written USTC CS AN Hong
Optimization 1: Vector Chaining • Vector version of register bypassing • introduced with Cray-1 USTC CS AN Hong
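A rough back-of-the-envelope sketch of the benefit; the pipeline depths and vector length below are illustrative assumptions, not Cray-1 figures:

#include <stdio.h>

int main(void)
{
    const int n         = 64;  /* assumed vector length                    */
    const int mul_depth = 7;   /* assumed multiply pipeline depth (cycles) */
    const int add_depth = 6;   /* assumed add pipeline depth (cycles)      */

    /* Unchained: vadd cannot start until the last vmul result is written. */
    int unchained = (mul_depth + n) + (add_depth + n);

    /* Chained: vadd starts as soon as the first vmul result appears.      */
    int chained = mul_depth + add_depth + n;

    printf("unchained: %d cycles, chained: %d cycles\n", unchained, chained);
    return 0;
}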
Optimization 2: Multi-lane Implementation (Vector Unit Structure) • [Figure: pipelined datapath with 4 lanes and 2 functional units, connected to/from the memory system] USTC CS AN Hong
Optimization 2: Multi-lane Implementation (Vector Unit Structure) • Elements of each vector register are interleaved across the lanes • Each lane receives identical control • Multiple element operations executed per cycle • Modular, scalable design • No need for inter-lane communication for most vector instructions USTC CS AN Hong
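A small C sketch of the interleaving (the lane count is an illustrative assumption): element i of every vector register lives in lane i mod NUM_LANES, so the lanes can work in lockstep without communicating.

#define NUM_LANES 4                 /* illustrative lane count */

/* Element i of a vector register is stored in lane (i % NUM_LANES),
   at position (i / NUM_LANES) within that lane's register partition. */
static inline int lane_of(int element)      { return element % NUM_LANES; }
static inline int slot_in_lane(int element) { return element / NUM_LANES; }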
Vector Instruction Parallelism: Chaining and Multi-lane Execution • Can overlap execution of multiple vector instructions • Example machine has 32 elements per vector register and 8 lanes, so each vector instruction occupies the lanes for 32/8 = 4 cycles USTC CS AN Hong
Vector Length • A vector register can hold some maximum number of elements for each data width (maximum vector length, or MVL) • What to do when the application vector length is not exactly MVL? • Vector-length (VL) register controls the length of any vector operation, including a vector load or store • E.g. vadd.vv with VL=10 is for (I=0; I<10; I++) V1[I]=V2[I]+V3[I] • VL can be anything from 0 to MVL • How do you code an application where the vector length is not known until run-time? USTC CS AN Hong
Vector Strip Mining • Problem: vector registers have finite length (MVL); suppose the application vector length > MVL • Solution: break loops into pieces that fit into vector registers, called "strip mining" • Generation of a loop that handles MVL elements per iteration • A set of operations on MVL elements is translated to a single vector instruction • Example: vector SAXPY of N elements • First loop handles (N mod MVL) elements; the remaining loops handle MVL elements each (for DLXV, MVL = 64)
  VL = (N mod MVL);         // set VL = N mod MVL
  for (I=0; I<VL; I++)      // 1st loop is a single set of
      Y[I] = A*X[I] + Y[I]; //   vector instructions
  low = (N mod MVL);
  VL = MVL;                 // set VL to MVL
  for (I=low; I<N; I++)     // 2nd loop requires N/MVL
      Y[I] = A*X[I] + Y[I]; //   sets of vector instructions
USTC CS AN Hong
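A self-contained C sketch of the same idea; this is a hedged illustration rather than the DLXV code, with each inner loop standing in for one vector instruction executed under the current VL:

#include <stddef.h>

#define MVL 64   /* assumed maximum vector length */

/* Strip-mined SAXPY: Y = a*X + Y for a vector of arbitrary length n. */
void saxpy_strip_mined(size_t n, double a, const double *X, double *Y)
{
    size_t vl = n % MVL;      /* first (possibly short) strip                     */
    if (vl == 0 && n > 0)
        vl = MVL;             /* n is a multiple of MVL: start with a full strip  */
    size_t low = 0;

    while (low < n) {
        /* This inner loop models ONE vector instruction with VL = vl. */
        for (size_t i = low; i < low + vl; i++)
            Y[i] = a * X[i] + Y[i];

        low += vl;
        vl = MVL;             /* all remaining strips are full MVL strips         */
    }
}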
Vector Strip Mining USTC CS AN Hong
Vector Startup • Two components of the vector startup penalty • functional unit latency (time through the pipeline) • dead time or recovery time (time before another vector instruction can start down the pipeline) USTC CS AN Hong
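A rough model of these two components (a sketch with assumed numbers, not figures for any particular machine): one vector instruction takes roughly the pipeline latency plus one element per cycle per lane, and dead time lowers the utilization of back-to-back vector instructions in the same pipeline.

#include <stdio.h>

int main(void)
{
    const int n         = 64; /* assumed vector length                            */
    const int latency   = 6;  /* assumed functional-unit pipeline depth (cycles)  */
    const int dead_time = 4;  /* assumed recovery cycles between instructions     */
    const int lanes     = 1;  /* single lane for simplicity                       */

    int busy  = n / lanes;            /* cycles spent producing results           */
    int total = latency + busy;       /* one instruction, start to finish         */

    /* Fraction of cycles doing useful work when such instructions issue back to back. */
    double efficiency = (double)busy / (busy + dead_time);

    printf("per-instruction time: %d cycles, pipeline utilization: %.0f%%\n",
           total, 100.0 * efficiency);
    return 0;
}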
Vector Instruction Set Advantages (vs. scalar or VLIW ISAs) • Semantic advantages • Compact (vs. VLIW) • One short vector instruction encodes N operations (a whole loop), i.e., a lot of work => far fewer instructions => instruction fetch and decode bandwidth are greatly reduced and more effectively utilized • Far fewer address computations, loop counter increments, and branch computations => reduces branches and branch problems in pipelines USTC CS AN Hong
Vector Instruction Set Advantages (vs. scalar or VLIW ISAs) • Expressive • Natural way to express DLP; the compiler (or programmer) ensures there are no dependencies and tells the hardware that these N operations: • are independent => can be executed in parallel • use the same functional unit => functional unit replication • access disjoint registers • access registers in the same pattern as previous instructions • access a contiguous block of memory (unit-stride load/store) • access memory in a known pattern (strided load/store) • Scalable • can run the same object code on more parallel pipelines or lanes • Simpler design: can aggressively clock the design by deeply pipelining it => heavy pipelining, high clock rate USTC CS AN Hong
Number of instructions executed • [Figure: instruction counts, vector vs. superscalar] USTC CS AN Hong
Number of operations executed • [Figure: operation counts, R1000/C34] • Due to a combination of heavily nested IF constructs in the main loop, high register pressure, and lack of support for multiple vector mask registers USTC CS AN Hong
Operation & Instruction Count: RISC v. Vector Processor(from F. Quintana, U. Barcelona.) Vector reduces ops by 1.2x, instructions by 20x USTC CS AN Hong
Vector Instruction Set Advantages (vs. scalar or VLIW ISAs) • Vector instructions access memory with a known pattern • Only useful data is requested • Every data item requested by the processor is actually used • Spatial locality can be exploited by requesting multiple data items with a single address • Effective prefetching • Amortize memory latency over a large number of elements • Stride information can be used by the hardware to optimize memory accesses • Can exploit a high-bandwidth memory system • No (data) caches required! • Do use an instruction cache USTC CS AN Hong
Vector Instruction Set Advantages (vs. scalar or VLIW ISAs) • Datapath control • The semantic content of vector instructions already includes the notion of parallel operations • No need for the complex dispatch window and reorder buffers required by a superscalar machine • Without increasing the complexity or the pressure on the decode unit, add vector pipes and wider paths from the vector registers to the functional units => scales to higher levels of parallelism • Low power and real-time performance • Vector instructions have the property of "localizing" computations => reduce power consumption by turning off all units not needed during the execution of a long-running vector instruction • Power on: the functional unit and register busses • Power off: the instruction fetch unit, the reorder buffer, and other large power-hungry blocks of the processor USTC CS AN Hong
Vector Power Consumption • Can trade off parallelism for power • Power = C * Vdd^2 * f • If we double the lanes, peak performance doubles • Halving f restores peak performance but also allows halving of Vdd • Power_new = (2C) * (Vdd/2)^2 * (f/2) = Power/4 • Simpler logic • Replicated control for all lanes • No multiple-issue or dynamic execution logic • Simpler to gate clocks • Each vector instruction explicitly describes all the resources it needs for a number of cycles • Conditional execution leads to further savings USTC CS AN Hong
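A quick numeric check of the Power/4 claim above (a sketch using arbitrary illustrative baseline values for C, Vdd, and f):

#include <stdio.h>

static double power(double c, double vdd, double f)
{
    return c * vdd * vdd * f;   /* dynamic power: C * Vdd^2 * f */
}

int main(void)
{
    double c = 1.0, vdd = 1.0, f = 1.0;         /* arbitrary baseline values    */
    double base    = power(c, vdd, f);          /* original design              */
    double doubled = power(2*c, vdd/2, f/2);    /* 2x lanes, Vdd/2, f/2         */

    printf("ratio = %.2f\n", doubled / base);   /* prints 0.25 => Power/4       */
    return 0;
}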
Vectors Lower Power USTC CS AN Hong
Example Vector Machines USTC CS AN Hong (from: Hennessy)
Vector Linpack Performance(MFLOPS) USTC CS AN Hong (from: Hennessy)
Vector architectures of the future: high-performance microprocessors for multimedia USTC CS AN Hong
Applications: Limited to scientific computing? USTC CS AN Hong
Possible evolution of vector architectures USTC CS AN Hong
Approaches to Mediaprocessing • Multimedia processing can be implemented on: • General-purpose processors with SIMD extensions • Vector processors • VLIW with SIMD extensions (aka mediaprocessors) • DSPs • ASICs/FPGAs USTC CS AN Hong
What is Multimedia Processing? • Desktop: • 3D graphics (games) • Speech recognition (voice input) • Video/audio decoding (mpeg-mp3 playback) • Servers: • Video/audio encoding (video servers, IP telephony) • Digital libraries and media mining (video servers) • Computer animation, 3D modeling & rendering (movies) • Embedded: • 3D graphics (game consoles) • Video/audio decoding & encoding (set top boxes) • Image processing (digital cameras) • Signal processing (cellular phones) USTC CS AN Hong
The Need for Multimedia ISAs • Why aren’t general-purpose processors and ISAs sufficient for multimedia (despite Moore’s law)? • Performance • A 1.2GHz Athlon can do MPEG-4 encoding at 6.4fps • One 384Kbps W-CDMA channel requires 6.9 GOPS • Power consumption • A 1.2GHz Athlon consumes ~60W • Power consumption increases with clock frequency and complexity • Cost • A 1.2GHz Athlon costs ~$62 to manufacture and has a list price of ~$600 (module) • Cost increases with complexity, area, transistor count, power, etc USTC CS AN Hong