ECE8833 Polymorphous and Many-Core Computer Architecture Lecture 4 Billion-Transistor Architecture 97 (Part II) Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering
Practitioners’ Groups Everyone has an acronym! • IRAM • Implementation at Berkeley • CMP • Led to Sun Niagara and the multicore (r)evolution • SMT • Intel HyperThreading (arguably Intel first envisioned the idea), IBM POWER5, Alpha 21464 • Many credit this technology to UCSB’s multistreaming work in the early 1990s • RAW • Led to Tilera’s TILE64
C. E. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, K. Yelick
Future Roadblocks that Inspired IRAM • Latency issues • Continually increasing performance gap between processor and memory • DRAM optimized for density, not speed • Bandwidth issues • Off-chip bus is slow and narrow • High capacitance, high energy • Especially hurts scientific codes, databases, etc.
IRAM Approach • Move DRAM closer to processor • Enlarge on-chip bandwidth • Fewer I/O pins • Smaller package • Serial interface Anything look familiar?
IRAM Chip Design Research • How much larger and slower is a processor designed in a straight DRAM process vs. a standard logic process? • Microprocessor fabs offer fast transistors for fast logic and many metal layers for accelerating communication and simplifying power distribution • DRAM fabs offer many poly layers to give small DRAM cells and low leakage for a low refresh rate • Speed of the page buffer vs. registers and cache • New DRAM interface based on fast serial links (2.5Gbit/s, or ~300MB/s per pin) • Quantify the bandwidth vs. area/power tradeoff • Area overhead of IRAM vs. a DRAM • Extra power dissipation of IRAM vs. a DRAM • Performance of IRAM with the same area and power as DRAM (“processor for free”) Source: David Patterson’s slide in his IRAM Overview talk
IRAM Architecture Research • How much slower can a processor with a high-bandwidth memory be and yet be as fast as a conventional computer? (very interesting point) • Compare memory management schemes (e.g., vector registers, scratch pad, wide TLB/cache) • Compare schemes for running large programs, i.e., spanning multiple IRAMs • Quantify the value of compact programs and data (e.g., compact code, on-the-fly compression) • Quantify pros and cons of a standard instruction set vs. a custom IRAM instruction set Source: David Patterson’s slide in his IRAM Overview talk
IRAM Compiler Research • Explicit SW control of memory management vs. conventional implicit HW designs • Protection (software fault isolation) • Paging (dynamic relocation, overlapped I/O accesses) • “Cache” control (vector registers, scratch pad) • I/O interrupt/polling • Evaluate benchmark performance in conjunction with the architectural research • Number crunching (vector vs. superscalar) • Memory intensive (database, operating system) • Real-time benchmarks (stability and performance) • Pointer intensive (GCC compiler) • Impact of language on IRAM (Fortran 77 vs. HPF, C/C++ vs. Java) Source: David Patterson’s slide in his IRAM Overview talk
Potential IRAM Architecture • “New Model”: VSIW = Very Short Instruction Word! • Compact: describe N operations with 1 short (vector) instruction • Predictable: (real-time) perf. vs. statistical perf. (cache) • Multimedia ready: choose Nx64b, 2Nx32b, or 4Nx16b • Easy to get high performance; the N operations: • Are independent • Use the same functional unit • Access disjoint registers • Access registers in the same order as previous instructions • Access contiguous memory words or a known pattern • Hides memory latency (and any other latency) • Compiler technology already developed. Source: David Patterson’s slide in his IRAM talk
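To make the VSIW idea concrete, here is a minimal C sketch of a strip-mined loop in which each commented "vector instruction" stands for N independent element operations issued as one compact instruction. The mnemonics and VLEN are illustrative assumptions, not VIRAM's actual ISA.

    /* Strip-mined DAXPY: each commented "vector instruction" describes
     * up to VLEN independent operations on disjoint register elements,
     * issued as one compact VSIW-style instruction. */
    #define VLEN 64                      /* assumed maximum vector length */

    void daxpy(long n, double a, const double *x, double *y)
    {
        for (long i = 0; i < n; i += VLEN) {
            long vl = (n - i < VLEN) ? (n - i) : VLEN;  /* set vector length */
            /* vld  v1, x+i   -- one instruction, vl independent loads      */
            /* vld  v2, y+i                                                 */
            /* vfma v2, a, v1 -- vl independent multiply-adds, same FU      */
            /* vst  v2, y+i   -- contiguous stores, known access pattern    */
            for (long j = 0; j < vl; j++)       /* scalar equivalent */
                y[i + j] += a * x[i + j];
        }
    }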
Berkeley Vector-Intelligent RAM Why vector processing? • Scalable design • Higher code density • Runs at a higher clock rate • Better energy efficiency due to easier clock gating of the vector/scalar units • Lower die temperature keeps the DRAM data-retention rate good • On-chip DRAM is sufficient for embedded applications • Use external off-chip DRAM as secondary memory • Pages are swapped between on-chip and off-chip DRAM
VIRAM-1 Floorplan • 180nm CMOS, 6-layer copper • 125 million transistors, 325 mm2 • 2 watts @ 200MHz • 13MB of eDRAM macros from IBM and 4 vector units (8KB of vector registers in total) • VRF = 32x64b, 64x32b, or 128x16b [Floorplan labels: 64-bit MIPS M5Kc core; ¼ of the 8KB VRF (custom layout); IBM embedded DRAM macros, 13Mbit each] [Gebis et al. DAC student contest 04]
S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, D. M. Tullsen
SMT Concept vs. Other Alternatives [Figure: functional-unit (FU1-FU4) occupancy over execution time for a conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), a chip multiprocessor (CMP), and simultaneous multithreading (or Intel's HT); shading distinguishes Threads 1-5 and unused slots] • The early SMT idea was developed at UCSB (Mario Nemirovsky's group, HICSS'94) • The name SMT was christened by the group at the University of Washington (ISCA'95)
Exploiting Choice: SMT Inst Fetch Policies • FIFO, round-robin: simple, but may be too naive • RR.X.Y • Fetch from X threads, up to Y instructions each • RR.1.8 • RR.2.4 or RR.4.2 • RR.2.8 • What are the main design and/or performance issues when X > 1? [Tullsen et al. ISCA96]
Exploiting Choice: SMT Inst Fetch Policies • Adaptive fetching policies (an ICOUNT sketch follows) • BRCOUNT (reduce wrong-path issuing) • Count # of branch instructions in the decode/rename/IQ stages • Give top priority to the thread with the least BRCOUNT • MISSCOUNT (reduce IQ clog) • Count # of outstanding D-cache misses • Give top priority to the thread with the least MISSCOUNT • ICOUNT (reduce IQ clog) • Count # of instructions in the decode/rename/IQ stages • Give top priority to the thread with the least ICOUNT • IQPOSN (reduce IQ clog) • Give lowest priority to threads with instructions closest to the head of the INT or FP instruction queues • Because threads with the oldest instructions are most prone to IQ clog • No counter needed [Tullsen et al. ISCA96]
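A minimal sketch of how an ICOUNT-style fetch arbiter could be expressed in C; the data layout and field names are assumptions for illustration, not the paper's hardware.

    #include <limits.h>

    #define NTHREADS 8

    /* Per-thread count of instructions in the decode/rename/IQ stages. */
    typedef struct { int icount; int active; } thread_t;

    /* ICOUNT policy: give fetch priority to the active thread with the
     * fewest in-flight front-end instructions (least IQ clog). */
    int icount_pick(const thread_t t[NTHREADS])
    {
        int best = -1, best_cnt = INT_MAX;
        for (int i = 0; i < NTHREADS; i++) {
            if (t[i].active && t[i].icount < best_cnt) {
                best_cnt = t[i].icount;
                best = i;
            }
        }
        return best;   /* -1 if no thread is fetchable this cycle */
    }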
Exploiting Choice: SMT Inst Fetch Policies [Tullsen et al. ISCA96]
Alpha 21464 (EV8) • Leading-edge process technology • 1.2 to 2.0GHz • 0.125µm CMOS • SOI-compatible • Cu interconnect, 7 metal layers • Low-k dielectrics • Chip characteristics • 1.2V Vdd, 250W (EV6: 72W and EV7: 125W) • 250 million transistors, 350mm2 • 1100 signal pins in flip-chip packaging Slide Source: Dr. Joel Emer
EV8 Architecture Overview • Enhanced OoO execution • 8-wide issue superscalar processor • Large on-die L2 (1.75MB) • 8 DRDRAM channels • On-chip router for system interconnect • Directory-based ccNUMA for up to 512-way SMP • 4-way SMT Slide Source: Dr. Joel Emer
SMT Pipeline [Figure: pipeline — Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire, with per-thread PCs and register maps feeding the shared Regs, Icache, and Dcache] • Replicated • PCs • Register maps • Shared resources • RF • Instruction queue • First and second level caches • Translation buffers • Branch predictor Slide Source: Dr. Joel Emer
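The replicated-vs.-shared split above can be summarized as a data-structure sketch; the sizes and names are illustrative assumptions, not EV8's actual organization.

    #include <stdint.h>

    #define NTHREADS 4                  /* EV8 is 4-way SMT */

    /* Replicated per hardware thread. */
    struct thread_ctx {
        uint64_t pc;                    /* per-thread program counter */
        uint8_t  reg_map[32];           /* per-thread register map    */
    };

    /* Shared among all threads. */
    struct smt_core {
        struct thread_ctx ctx[NTHREADS];
        uint64_t phys_regs[512];        /* single shared physical register file */
        /* the instruction queue, first/second level caches, translation
           buffers, and branch predictor are likewise shared (omitted) */
    };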
Intel HyperThreading • Intel Xeon Processor, Xeon MP Processor, and ATOM • Enables Simultaneous Multi-Threading (SMT) • Exploits ILP through TLP (Thread-Level Parallelism) • Issues and executes multiple threads at the same time • Appears as 2 logical processors • Share the same execution resources • Duplicate architectural state and certain microarchitectural state • IPs, iTLB, streaming buffer • Architectural register file • Return stack buffer • Branch history buffer • Register Alias Table
Sharing Resources in Intel HT • P4's trace cache (TC) and µcode ROM are alternately accessed each cycle by the two logical processors unless one is stalled on a TC miss • TLB shared, tagged with a logical-processor ID, but partitioned • x86 does not employ ASIDs • Hard-partitioning appears to be the only option to allow HT • µop queue (split in half) after fetch from the TC • ROB (126/2 in P4) • Load buffer (48/2 in P4) • Store buffer (24/2 or 32/2 in P4) • General µop queue and memory µop queue (each halved) • Retirement alternates between the 2 logical processors
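A toy sketch of the hard-partitioning idea: with HT on, each logical processor may allocate only into its own fixed half of a structure such as the ROB. The entry count comes from the slide; the allocator itself (no wraparound or free logic) is a deliberately simplified illustration.

    #define ROB_ENTRIES 126   /* P4 reorder buffer size, per the slide */
    #define NLOGICAL    2

    /* Each logical processor (lp) owns a fixed half of the ROB. */
    static int rob_base (int lp) { return lp * (ROB_ENTRIES / NLOGICAL); }
    static int rob_limit(int lp) { return rob_base(lp) + ROB_ENTRIES / NLOGICAL; }

    /* *tail is the per-lp tail pointer, initialized to rob_base(lp). */
    int rob_alloc(int lp, int *tail)
    {
        if (*tail >= rob_limit(lp))
            return -1;            /* this logical processor's half is full */
        return (*tail)++;         /* entry index within the partition */
    }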
HT in Intel ATOM • First in-order processor with HT • HT claimed to add 8% silicon area • Claimed 30% performance increase at 15% power increase • Threads compete for the shared cache space • No dedicated multiplier: uses the SIMD multiplier • No dedicated integer divider: uses the FP divider [Die photo labels: 32KB L1-I, 24KB L1-D, 512KB L2, 25mm2 @ 45nm] Source: Microprocessor Report and Intel
Main Argument • A single thread of control has limited parallelism (ILP is dead) • The cost of extracting it is prohibitive due to complexity • Achieve parallelization with SW, not HW • Inherently parallel multimedia applications • Widespread multi-tasking OSes • Emerging parallelizing compilers (cf. SUIF), mainly for loop-level parallelism • Why not SMT? • Interconnect delay issue • Partitioning is less localized than in a CMP • Use relatively simple single-thread processors • Exploit only a “modest” amount of ILP per core • Execute multiple threads in parallel • Bottom line
Commercial CMP (AMD Phenom II Quad-Core) • AMD K10 (Barcelona) microarchitecture, core code-named “Deneb” • 45nm process • 4 cores, private 512KB L2 per core • Shared 6MB L3 (2MB in the original Phenom) • Integrated northbridge • Up to 4 DIMMs • Sideband Stack Optimizer (SSO) • Parallelizes many POPs and PUSHs (which were dependent on each other through the stack pointer) • Converts them into pure load/store instructions • No µops occupy the FUs for stack-pointer adjustment
Intel Core i7 (Nehalem) • 4 cores, HT support in each core • 8MB shared L3 • 3 DDR3 channels, 25.6GB/s memory BW • Turbo Boost Technology • New P-state (Performance) • DVFS when the workload operates under the max power envelope • Same frequency for all cores
UltraSPARC T1 • Up to eight cores, each 4-way threaded • Fine-grained multithreading • Thread-selection logic takes out threads that encounter long-latency events • Round-robin, cycle-by-cycle • 4 threads in a group share a processing pipeline (Sparc pipe) • 1.2 GHz (90nm) • In-order, 8 instructions per cycle (single issue from each core) • 1 shared FPU • Caches • 16KB 4-way 32B-line L1-I • 8KB 4-way 16B-line L1-D • Blocking caches (a reason for MT) • 4-banked 12-way 3MB L2 + 4 memory controllers (shared by all) • Data moves between the L2 and the cores over an integrated crossbar switch providing high throughput (200GB/s)
UltraSPARC T1 • Thread-select logic marks a thread inactive based on: • Instruction type • A predecode bit in the I-cache flags long-latency instructions • Misses • Traps • Resource conflicts
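A sketch of the resulting selection policy: cycle-by-cycle round-robin over the four threads of a pipe, skipping any thread the logic has marked inactive. Illustrative C, not Sun's RTL.

    #define NTHREADS 4     /* threads sharing one Sparc pipe */

    /* Reasons a thread is parked, per the list above. */
    enum { ACTIVE = 0, LONG_LAT_INST, CACHE_MISS, TRAP, RESOURCE_CONFLICT };

    /* Cycle-by-cycle round-robin select that skips inactive threads. */
    int t1_select(const int state[NTHREADS], int last)
    {
        for (int k = 1; k <= NTHREADS; k++) {
            int t = (last + k) % NTHREADS;
            if (state[t] == ACTIVE)
                return t;
        }
        return -1;   /* all threads parked: the pipe idles this cycle */
    }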
UltraSPARC T2 • A fatter version of the T1 • 1.4GHz (65nm) • 8 threads per core, 8 cores on-die • 1 FPU per core (vs. 1 FPU per die in T1), 16 INT EUs (8 in T1) • L2 increased to an 8-banked 16-way 4MB shared cache • 8-stage integer pipeline (vs. 6 for T1) • 16 instructions per cycle • One PCI Express port (x8 1.0) • Two 10 Gigabit Ethernet ports with packet classification and filtering • Eight encryption engines • Four dual-channel FBDIMM memory controllers • 711 signal I/Os, 1,831 pins total • The subsequent T2 Plus supports 2 sockets: 16 cores / 128 threads
Sun ROCK Processor • 16 cores, two threads per core • Hardware scout threading (runahead) • Invisible to SW • A long-latency instruction automatically starts the HW scout: L1 D$ miss, micro-DTLB miss, divide • Warms up the branch predictor • Prefetches memory • Execute Ahead (EXE) • Retires independent instructions while scouting • Simultaneous Speculative Threading (SST) [ISCA'09] • Two hardware threads for one program • Runahead speculatively executes under a cache miss • OoO retirement • HTM support
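A toy C model of the scouting idea described above: registers carry a valid bit, results that depend on the missing load are marked unknown, and independent loads still issue so their cache lines are prefetched for the re-execution after the checkpoint restores. Everything here (the types, the prefetch callback) is invented for illustration; ROCK does this in hardware.

    #include <stdbool.h>

    typedef struct { long val; bool valid; } reg_t;
    typedef struct { int dst, src1, src2; bool is_load; long addr; } inst_t;

    /* Run ahead under a miss: propagate "unknown" through dependents,
     * prefetch for independent loads, discard results afterwards. */
    void scout(reg_t r[], const inst_t prog[], int n, void (*prefetch)(long))
    {
        for (int i = 0; i < n; i++) {
            const inst_t *in = &prog[i];
            bool ok = r[in->src1].valid && r[in->src2].valid;
            if (in->is_load && ok)
                prefetch(in->addr);   /* warm the cache for the real run */
            r[in->dst].valid = ok;    /* miss-dependent results stay unknown */
        }
        /* when the miss returns, the checkpoint restores and the real
           execution re-runs from the missing load */
    }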
Many-Core Processors: Intel Teraflops (Polaris) • 2KB data memory and 3KB instruction memory per tile • No coherence support • 2 FMACs per tile • Next generation will have 3D-integrated memory • SRAM first • DRAM in the future
E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, A. Agarwal
MIT RAW Design Tenets • Long wires across the chip will be the constraint • Expose the architecture to software (parallelizing compilers) • Explicit parallelization • Pins • Communication • Use a tile-based architecture • Similar designs sponsored by the DARPA PCA program: UT TRIPS, Stanford Smart Memories • Simple point-to-point static routing network • One cycle across each tile • More scalable than a bus • Harnessed by the compiler with a precise count of wire hops • Use a dynamic router to support memory accesses that cannot be analyzed statically
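Because routes are static and each tile crossed costs one cycle, the compiler can price a communication exactly. A minimal sketch of that cost model, ignoring any injection/extraction overhead (an assumption of this sketch):

    #include <stdlib.h>

    /* On a 2-D mesh with static routing, a word travels the Manhattan
     * distance between tiles, one cycle per tile crossed. */
    int route_latency(int src_x, int src_y, int dst_x, int dst_y)
    {
        return abs(dst_x - src_x) + abs(dst_y - src_y);
    }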
Application Mapping on RAW [Figure: tiles simultaneously running a custom data path pipeline built by the compiler that streams video data to a frame buffer and screen; a two-way threaded Java program; four-way parallelized scalar code; httpd; and idle tiles in sleep mode (power saving). Fast inter-tile ALU forwarding: 3 cycles] [Taylor IEEE MICRO'02]
Scalar Operand Network Design [Figure: progression from a non-pipelined scalar operand network, to one pipelined with a bypass link, to one pipelined with a bypass link and multiple ALUs; lots of live values end up in the SON] [Taylor et al. HPCA'03]
Communication Scalability Issue • RB (# of result buses) × WS (window size) tag comparisons are made per cycle; e.g., 8 result buses against a 64-entry window means 512 comparators • Long, dense wires lengthen the cycle time • So pipeline the wire • The cost of processing incoming information is high • A similar problem exists in bus-based snoopy cache protocols [Figure labels: routing area, large MUX, complex compare logic]
Scalar Operand Network [Figure: many register files joined by a switch form a multiscalar operand network (a distributed ILP machine) on a 2-D, p2p interconnect (e.g., Raw or TRIPS)] [Taylor et al. HPCA'03]
Mapping Operations to Tile-based Architecture • Done at compile time (RAW) or at runtime • “Point-to-point” 2-D mesh • Tradeoff: computation vs. communication • Compute affinity (data flows through fewer hops) • How to maintain control flow? Example (one placement is sketched below): i = a[j]; q = b[i]; r = q + j; s = q >> 3; t = r * s; b[j] = l; b[t] = t; [Figure: the corresponding dataflow graph — ld a, ld b, +, >>, *, and two st b nodes — spread across the register files of the mesh]
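One plausible placement of that dataflow onto a 2x2 mesh, annotated per operation; the tile coordinates and hop counts are hypothetical (a real Raw compiler's placement may differ):

    /* Hypothetical placement on a 2x2 mesh; comments show which tile
     * executes each operation and how far operands travel. */
    void example(int *a, int *b, int j, int l)
    {
        int i = a[j];     /* tile (0,0): ld a                              */
        int q = b[i];     /* tile (0,1): ld b; i travels 1 hop             */
        int r = q + j;    /* tile (1,1): +;  q forwarded 1 hop             */
        int s = q >> 3;   /* tile (0,1): >>; reuses q locally, no hop      */
        int t = r * s;    /* tile (1,1): *;  s forwarded 1 hop             */
        b[j] = l;         /* tile (0,0): st b                              */
        b[t] = t;         /* tile (1,0): st b; t forwarded 1 hop           */
    }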
RAW Core-to-Core Communication • Static router • Wires are place-and-routed by software • P2P scalar transport • Compilers (or assembly writers) handle predictable communication • Dynamic router • Transports dynamic, unpredictable operations • Interrupts • Cache misses • Communication that is unpredictable at compile time
Architectural Comparison • Raw replaces the buses of a superscalar with a switched network • The switched network is tightly integrated into the processor's pipeline to support single-cycle message injection and receive operations • Raw software (the compiler) has to implement functions such as instruction scheduling and dependency checking • Raw yields this complexity to software so that more hardware can be devoted to ALUs and memory [Figure: RAW vs. superscalar vs. multiprocessor organizations]
RAW’s Four On-Chip Mesh Networks [Figure: compute pipeline attached to the mesh; 8 32-bit channels; registered at input; longest wire = length of a tile] [Slide Source: Michael B. Taylor]
Raw Architecture [Slide Source: Volker Strumpen]
Raw Compute Processor Pipeline R24-27 map to 4 on-chip physical networks Fast ALU-to-network (4 cycles) 0-cycle local bypass [Taylor IEEE MICRO’02]
RAW Processor Tile • Each tile contains: • Tile processor • 32-bit MIPS, 8-stage in-order, single issue • 32KB instruction memory • 32KB data cache (not coherent, user-managed) • Switch processor • 8K-instruction memory • Executes basic move and branch instructions • Transfers between the local switch and neighbor switches • Dynamic router • Hardware-controlled (not directly under the programmer's control)
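The per-tile resources just listed, summarized as a struct sketch (sizes from the slide; the layout and names are illustrative):

    #include <stdint.h>

    /* One Raw tile, per the resource list above (illustrative layout). */
    struct raw_tile {
        /* Tile (compute) processor: 32-bit MIPS-like, 8-stage, in-order */
        uint32_t imem[32 * 1024 / 4];    /* 32KB instruction memory      */
        uint32_t dcache[32 * 1024 / 4];  /* 32KB data cache, SW-managed  */

        /* Switch processor: executes moves/branches between the local
           switch and neighbor switches; 8K-instruction memory */
        uint32_t switch_imem[8 * 1024];

        /* Dynamic router state is hardware-controlled and not
           directly visible to the programmer */
    };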
Raw Programming • Compute the sum c = a + b across four tiles:
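The slide's original code is not reproduced here; in its place, a hypothetical sketch of the communication pattern, using invented send/receive intrinsics for the static network (Raw actually exposes the networks through registers R24-R27, per the pipeline slide above):

    /* Hypothetical static-network intrinsics; names invented for this
     * sketch, not Raw's actual API. */
    extern void raw_static_send(int word);   /* write network output port */
    extern int  raw_static_recv(void);       /* read network input port   */

    /* Tile 0 holds a, tile 1 holds b, tile 2 adds, tile 3 keeps c.
     * The switch processors are assumed to be programmed with the
     * matching static routes: 0->2, 1->2, 2->3. */
    void tile0(int a)  { raw_static_send(a); }                              /* inject a */
    void tile1(int b)  { raw_static_send(b); }                              /* inject b */
    void tile2(void)   { raw_static_send(raw_static_recv() + raw_static_recv()); } /* a + b */
    void tile3(int *c) { *c = raw_static_recv(); }                          /* land the sum in c */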
Data Path: Zoom 1 • Stateful hardware: local data memory (a, c), register (b), and both static networks (snet1 and snet2)