ECE 4100/6100 Advanced Computer Architecture Lecture 4 ISA Taxonomy

ECE 4100/6100 Advanced Computer Architecture • Lecture 4: ISA Taxonomy • Prof. Hsien-Hsin Sean Lee • School of Electrical and Computer Engineering • Georgia Institute of Technology

Instruction Set Architecture • Specification of a microprocessor design • Interface between user and machine’s functionality • Good instruction set design principles • Compatibility • Implementability • Programmability • Usability • Encoding efficiency

Main ISA Design Philosophies • CISC (Complex Instruction Set Computer) • RISC (Reduced Instruction Set Computer) • VLIW (Very Long Instruction Word) • EPIC (Explicitly Parallel Instruction Computing)

CISC • Complex Instruction Set Computers • Close the “semantic gap” between programming and execution • Smaller code size (memory was expensive!) • Simplify compilation • Another state machine (controlled by microcode) inside the machine • Examples: x86, Intel 432, IBM 360, DEC VAX

CISC Example: x86
• MOVSD ;; move a double word, 1-byte instruction
  MOVSD // m32[ES:EDI] = m32[DS:ESI]
• REP ;; 1-byte prefix to repeat string operations
  REP MOVSD // count set up in ECX
• LOCK ADD ds:[esi+ecx*2+0x67452301], 0xEFCDAB89 // 13 bytes
  F0 3E 81 84 4E 01 23 45 67 89 AB CD EF
  (F0 3E = prefixes; 81 = opcode; 84 = ModRM, [--][--]+disp32; 4E = SIB, ESI+ECX*2; then disp32 and imm32, both little-endian)
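
The 13-byte encoding above can be taken apart field by field. The following is a minimal Python sketch that handles only this one instruction, not a general x86 decoder; the function name `decode_lock_add` and the dictionary keys are illustrative labels, not part of any real tool.

```python
import struct

# Register numbering used by the ModRM and SIB bytes in 32-bit mode.
REG32 = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]

def decode_lock_add(code: bytes) -> dict:
    """Split the 13-byte LOCK ADD example into its encoding fields.

    Hypothetical helper for illustration: it assumes the exact layout
    prefix, prefix, opcode, ModRM, SIB, disp32, imm32.
    """
    prefix_lock, prefix_seg, opcode, modrm, sib = code[:5]
    # disp32 and imm32 are stored little-endian after the SIB byte.
    disp32, imm32 = struct.unpack_from("<II", code, 5)
    return {
        "lock_prefix": prefix_lock == 0xF0,
        "ds_override": prefix_seg == 0x3E,
        "opcode": opcode,
        "mod": modrm >> 6,              # 2 means a disp32 follows
        "reg_ext": (modrm >> 3) & 7,    # opcode extension (/0 selects ADD)
        "rm": modrm & 7,                # 4 means a SIB byte follows
        "scale": 1 << (sib >> 6),
        "index": REG32[(sib >> 3) & 7],
        "base": REG32[sib & 7],
        "disp32": disp32,
        "imm32": imm32,
        "length": len(code),
    }

encoding = bytes.fromhex("F03E81844E0123456789ABCDEF")
fields = decode_lock_add(encoding)
print(f"[{fields['base']}+{fields['index']}*{fields['scale']}"
      f"+{fields['disp32']:#x}], {fields['imm32']:#x}")
```

The point of the sketch is how many separately sized fields a single CISC instruction can carry, which is what makes variable-length decoding costly.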

RISC • Observation made by IBM (John Cocke, Eckert-Mauchly Award’85, Turing Award’87, Nat’l Medal of Technology’91, Nat’l Medal of Science’94) • Few of the available instructions are used • CISC : “n+1” phenomenon • Adding an instruction requiring an extra level of decoding logic can slow down the entire ISA • Reduced Instruction Set Computer • Originated at IBM in 1975, a telephone project • To achieve 12 MIPS (300 calls per sec, 20k inst per call) • Simple instructions • IBM 801 in 1978 • More compiler effort to gain performance

A Typical RISC • Smaller number of instructions • Fixed format instruction (e.g., 32 bits) • 3-address, reg-to-reg arithmetic instructions • Single cycle operation for execution • Load-store architecture • Simple address modes • Base + displacement • No indirection • Simple branch conditions • Hardwired control (No microcode) • More compiler effort • Examples: • RISC I and RISC II at Berkeley • MIPS (Microprocessor without Interlocked Pipeline Stages) at Stanford • IBM RISC Technology, Sun SPARC, HP PA-RISC, ARM

RISC Example: MIPS
• R-format (Register-Register): add $1, $2, $3
  Op [31:26] | Rs [25:21] | Rt [20:16] | Rd [15:11] | Shamt [10:6] | Funct [5:0]
• I-format (Register-Immediate): addi $1, $2, -5
  Op [31:26] | Rs [25:21] | Rt [20:16] | immediate [15:0]
• I-format (Load/Store): lw $1, 24($9)
  Op [31:26] | Base [25:21] | Dest [20:16] | immediate [15:0]
• I-format (Branch): beq $4, $0, L1
  Op [31:26] | Rs [25:21] | Rt [20:16] | immediate [15:0]
• J-format (Jump / Call): j L2
  Op [31:26] | target [25:0]
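
The fixed field positions make encoding a matter of shifts and ORs. The sketch below, in Python, packs three of the example instructions into 32-bit words; the helper names (`encode_r`, `encode_i`, `encode_j`) are made up for illustration, and the opcode and funct values are the standard MIPS32 ones for `add`, `addi`, and `lw`. The branch and jump examples are left out because the addresses of `L1` and `L2` are not given.

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """R-format: Op[31:26] Rs[25:21] Rt[20:16] Rd[15:11] Shamt[10:6] Funct[5:0]."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def encode_i(op, rs, rt, imm):
    """I-format: Op[31:26] Rs[25:21] Rt[20:16] immediate[15:0] (two's complement)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def encode_j(op, target):
    """J-format: Op[31:26] target[25:0]."""
    return (op << 26) | (target & 0x3FFFFFF)

# add $1, $2, $3   -> rd=1, rs=2, rt=3, op=0, funct=0x20
add_word = encode_r(0x00, 2, 3, 1, 0, 0x20)
# addi $1, $2, -5  -> rt=1, rs=2, op=0x08
addi_word = encode_i(0x08, 2, 1, -5)
# lw $1, 24($9)    -> dest rt=1, base rs=9, op=0x23
lw_word = encode_i(0x23, 9, 1, 24)

for name, word in [("add", add_word), ("addi", addi_word), ("lw", lw_word)]:
    print(f"{name:5s} {word:#010x}")
# add   0x00430820
# addi  0x2041fffb
# lw    0x8d210018
```

Every instruction is one 32-bit word with the register fields in the same bit positions, so a decoder can read register numbers before it knows which format it is looking at.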

CISC vs. RISC • Some definitions were from the paper by Colwell et al. in 1985

CISC vs. RISC (Reality) • [Comparison table with CISC and RISC columns; entries not recoverable]

Observation and Controversy • “Instruction Sets and Beyond: Computers, Complexity, and Controversy” by Bob Colwell (Eckert-Mauchly Award, 2005) and colleagues from CMU; see also the response from the RISC camp: Patterson (Eckert-Mauchly Award, 2008) and Hennessy (Eckert-Mauchly Award, 2001) • CISC/RISC classification should *not* be a dichotomy • Case in point: MicroVAX-32 by DEC, a single-chip implementation • Subsets the VAX instructions (but still, 175 instructions!) • Emulates complex instructions • A RISC or a CISC? (Well, it has variable-length instructions, is not a ld/st machine, uses microcoded control, and has all VAX addressing modes) • Effective processor design = CISC experiences + RISC tenets • RISC features are not incompatible or mutually exclusive • Large register file (w/ register windows) • RISC/CISC issues are best considered in light of their function-to-implementation level assignment

Modern x86 Machine Design • CISC outfit • RISC inside • E.g., Intel P6/Netburst/Core, AMD Athlon/Phenom/Opteron • Each x86 instruction is decoded into “micro-ops” (μops) or “RISC-ops” on the fly • Internal microarchitecture resembles RISC design philosophy • Processor dynamically schedules μops • Compiler’s scheduling is still beneficial
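
To make the "CISC outfit, RISC inside" idea concrete, here is a toy Python sketch of cracking an x86-style `ADD` into load/ALU/store micro-ops. Everything in it is a simplifying assumption: the `Uop` tuple, the `crack_add` helper, and the `tmp0`/`tmp1` temporaries are invented for illustration, and real decoders differ in detail.

```python
from typing import NamedTuple

class Uop(NamedTuple):
    kind: str    # "load", "alu", or "store" (illustrative categories)
    dst: str
    srcs: tuple

def _is_mem(operand: str) -> bool:
    # Convention assumed here: memory operands are written as "[...]".
    return operand.startswith("[")

def crack_add(dst: str, src: str) -> list:
    """Crack 'ADD dst, src' into RISC-like micro-ops (toy model)."""
    uops = []
    a, b = dst, src
    if _is_mem(src):
        uops.append(Uop("load", "tmp0", (src,)))
        b = "tmp0"
    if _is_mem(dst):
        uops.append(Uop("load", "tmp1", (dst,)))
        a = "tmp1"
    uops.append(Uop("alu", a, (a, b)))          # register-to-register add
    if _is_mem(dst):
        uops.append(Uop("store", dst, (a,)))    # write the result back
    return uops

for uop in crack_add("[esi]", "eax"):
    print(uop)
```

A register-to-register `ADD` yields a single micro-op in this model, while a memory-destination `ADD` becomes a load, an add, and a store, each of which resembles a load-store RISC operation that the core can schedule independently.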

Recent ISA Design Trend • Look at this instruction in MIPS (CISC or RISC?) CABS.LE.PS $fcc0, $f8, $f10 ;; |y| ≤ |w|, |x| ≤ |w|? • Many complex instructions emerged for new apps • Viterbi instruction for wireless communication/DSP • Sum of absolute differences in SSE (PSADBW) or other DSPs: C = Σ|A−B| for MPEG (motion estimation) • In embedded domain, code size is critical • Reducing programming efforts • Optimizing performance via • Specialized hardware (accelerator-based) • Co-processor (controlled by main processor) • ISA plug-in (flexible)
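
As a reference for what a sum-of-absolute-differences instruction computes, here is a scalar Python sketch. The function names and the tiny pixel blocks are made up for illustration; a SIMD instruction would perform the inner loop over many bytes at once.

```python
def sum_abs_diff(a, b):
    """Sum of absolute differences between two equal-length pixel blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))

def best_match(block, candidates):
    """Index of the candidate block with the smallest SAD (motion-estimation style)."""
    return min(range(len(candidates)),
               key=lambda i: sum_abs_diff(block, candidates[i]))

current = [10, 20, 30, 40]
reference_blocks = [
    [50, 60, 70, 80],   # SAD = 160
    [12, 18, 33, 40],   # SAD = 7
    [10, 20, 30, 45],   # SAD = 5
]
print(sum_abs_diff(current, reference_blocks[1]))  # 7
print(best_match(current, reference_blocks))       # 2
```

Motion estimation evaluates this sum for many candidate blocks per frame, which is why a single instruction for it pays off even on an otherwise simple ISA.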

VLIW • Very Long Instruction Word • Originated from microcode compaction • Coined by Josh Fisher (Eckert-Mauchly Award, 2003) • Compiler will • Perform instruction scheduling (latency-aware) • Pack several independent instructions into a VLIW instruction • Issues • Compatibility • Many nops • Very complex compiler • Information unavailable at static compile time (e.g., interprocedural optimization is difficult)

Pioneers • Culler Scientific • Led by Prof. Glen J. Culler (National Medal of Technology winner 2000, Berkeley Prof. David Culler’s father) • Multiflow (Fisher) • Led by Josh Fisher (Eckert-Mauchly Award 2003), John O’Donnell, John Ruttenberg, David Papworth, Bob Colwell (Eckert-Mauchly Award 2005), Geoffrey Lowney, etc. • Several Multiflow TRACE machines were delivered • Cydrome (Rau, Yen’s) in the 80’s • Led by Bob Rau (Eckert-Mauchly Award 2002), David Yen, Wei Yen, etc. • Had a working prototype

Modern Processors • Most DSPs embrace VLIW (e.g., TI C6x, StarCore, ADI TigerSHARC, etc.) • Transmeta Crusoe (internal VLIW ISA, never publicly released)
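
The compiler's packing job described for VLIW, and the resulting nops, can be sketched with a greedy scheduler in Python. This is a toy model under stated assumptions: every operation has unit latency, any slot can hold any operation, and operations arrive in program order with their dependences listed by name. The function `pack_bundles` and the operation names are hypothetical.

```python
def pack_bundles(ops, width):
    """Greedily pack operations into fixed-width VLIW bundles.

    ops: list of (name, set_of_names_it_depends_on), in program order.
    An operation goes into the earliest bundle that is after all of its
    producers and still has a free slot; unused slots become nops.
    """
    bundles = []
    placed = {}  # operation name -> index of the bundle it was put in
    for name, deps in ops:
        # Unit latency assumed: a consumer must be at least one bundle later.
        slot = max((placed[d] + 1 for d in deps), default=0)
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(name)
        placed[name] = slot
    return [b + ["nop"] * (width - len(b)) for b in bundles]

program = [
    ("a", set()),
    ("b", set()),
    ("c", set()),
    ("load", set()),
    ("mul", {"load"}),   # must wait for the load's result
]
schedule = pack_bundles(program, width=3)
for bundle in schedule:
    print(bundle)
```

With a width of 3, the five operations occupy three bundles and four of the nine slots are nops, which illustrates the code-size cost listed under the VLIW issues. A real VLIW compiler would also model per-unit slot restrictions and multi-cycle latencies.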

Intel/HP EPIC • Explicitly Parallel Instruction Computing • A close relative of VLIW (e.g., the compiler holds the key to high performance) • Some new features • Stop bits to address compatibility • ISA enabling data speculation and control speculation (minimum hardware support needed) • Fully predicated ISA • Rotating registers, RSE (not so new, e.g., MRS in RISC I) • Lots of ideas from the Polycyclic architecture (TRW) and Cydrome by the late Bob Rau (Eckert-Mauchly Award, 2002)
An Itanium Instruction Bundle
  ld4 r43=[r38]
  add r38=16,r38
  br.call.sptk b0=printf# ;;

VLIW Tradeoffs • Plentiful registers, simple encodings, … • Potentially lower # of transistors than other designs • Reduced speculation, OoO not needed • Size efficiencies, price, power consumption • Is this true for Itanium? • Drawbacks • Backward compatibility or upgradeability • Due to exposed implementation details • VLIW is orthogonal to other techniques • Pipeline, SMT, and CMP/Multi-core can be built on top of processors including VLIW

Design Philosophy: VLIW vs. Superscalar RISC • [Figure: the same normal source code feeds both designs, and the same ILP hardware is used in both cases. Superscalar: a normal compiler produces object code; scheduling and recognition of operation independence are done by hardware at run time. VLIW: a normal compiler plus scheduling and recognition of operation independence, done by software at compile time.]

Design Philosophy: VLIW vs. Superscalar • VLIW • Requires less hardware and lower power • Programs may need to be changed (recompiled) to run correctly when the implementation changes even slightly (not always, though) • Superscalar • Object-code compatible • Sequential programs can be presented to different superscalar implementations of the same ISA


Superscalar or VLIW? • Reality: the current world is dominated by … • x86: Core (quad-issue) & Atom (dual-issue) • And ARM (Cortex-A8 is dual-issue; Cortex-A9 is out-of-order) • VLIW is largely embraced by the DSP camp

Should we continue to teach this Chapter about ISA?
