CpE 442 Introduction to Computer Architecture The Role of Performance

CpE 442 Introduction to Computer Architecture: The Role of Performance
Instructor: H. H. Ammar

Overview of Today’s Lecture: The Role of Performance
• Review from Last Lecture
• Definition and Measures of Performance
• Summarizing Performance and Performance Pitfalls

Review: What is "Computer Architecture"?
• Coordination of levels of abstraction (top to bottom): Application, Operating System, Compiler, Instruction Set Architecture (Instr. Set Proc., I/O system), Digital Design, Circuit Design
• Under a set of rapidly changing forces

Review: Levels of Representation
High Level Language Program:
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
  ↓ Compiler
Assembly Language Program:
    lw $15, 0($2)
    lw $16, 4($2)
    sw $16, 0($2)
    sw $15, 4($2)
  ↓ Assembler
Machine Language Program:
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
  ↓ Machine Interpretation
Control Signal Specification

Review: Levels of Organization
• SPARCstation 20 (example computer)
• Computer: SPARC Processor, Memory, Devices
• Processor: Control, Datapath
• Devices: Input, Output

Computer Architecture Simulation Tools
1. The HASE Architecture Simulation Environment
2. The New Compiler Technology simulation (shown in class)
3. MIPS Assembly Language Simulators
   a. SPIM, a MIPS32 simulator: http://pages.cs.wisc.edu/~larus/spim.html
   b. MARS (MIPS Assembler and Runtime Simulator): http://courses.missouristate.edu/kenvollmar/mars/

Review: Summary from Last Lecture
• All computers consist of five components
  • Processor: (1) datapath and (2) control
  • (3) Memory
  • (4) Input devices and (5) Output devices
• Not all “memory” is created equal
  • Cache: fast (expensive) memory is placed closer to the processor
  • Main memory: less expensive memory, so we can have more of it
• Input and output (I/O) devices have the messiest organization
  • Wide range of speed: graphics vs. keyboard
  • Wide range of requirements: speed, standard, cost, etc.
  • Least amount of research (so far)

Metrics of performance (by level of abstraction)
• Application: response time, answers per month, operations per second
• Programming Language, Compiler, ISA: (millions of) instructions per second (MIPS); (millions of) floating-point operations per second (MFLOP/s)
• Datapath, Control: megabytes per second
• Function Units; Transistors, Wires, Pins: cycles per second (clock rate)

Relating Processor Metrics
• CPU execution time = CPU clock cycles/pgm × clock cycle time
• or CPU execution time = CPU clock cycles/pgm ÷ clock rate
• Define CPI = the average clock cycles per instruction. CPI tells us something about the Instruction Set Architecture, the implementation of that architecture, and the program being measured
• CPU clock cycles/pgm = Instructions/pgm × CPI
• or CPI = CPU clock cycles/pgm ÷ Instructions/pgm
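The relations on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `cpu_time_seconds` and the sample numbers (1 million instructions, CPI of 1.5, 1 GHz clock) are assumptions chosen only to exercise the formula.

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU execution time = (instructions x CPI) / clock rate."""
    clock_cycles = instruction_count * cpi  # CPU clock cycles per program
    return clock_cycles / clock_rate_hz     # same as cycles x clock cycle time


# Illustrative (assumed) values: 1 million instructions, CPI 1.5, 1 GHz clock.
t = cpu_time_seconds(1_000_000, 1.5, 1.0e9)
print(f"CPU time = {t * 1e3:.3f} ms")
```

Dividing by the clock rate and multiplying by the clock cycle time are the same operation, since cycle time = 1 / clock rate.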

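Aspects of CPU Performance

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

Factors (columns): instr. count, CPI, clock rate
Influences (rows): Program, Compiler, Instr. Set Arch., Organization, Technology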

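Aspects of CPU Performance

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

                 instr. count   CPI    clock rate
Program               X         (x)
Compiler              X         (x)
Instr. Set            X          X
Organization                     X         X
Technology                                 X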

Organizational Trade-offs
• Single-cycle processor design: CPI = 1, large cycle time (slow clock)
• Multi-cycle processor design: CPI > 1, smaller cycle time (faster clock)
• Figure labels: Instruction Mix (Application, Programming Language, Compiler, ISA); CPI (Datapath, Control); Cycle Time (Function Units; Transistors, Wires, Pins)

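CPI: “Average cycles per instruction”

$$\text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Clock Cycles}}{\text{Instruction Count}}$$

$$\text{CPU time} = \text{ClockCycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$

$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where } F_i = \frac{I_i}{\text{Instruction Count}} \text{ is the "instruction frequency"}$$

Invest resources where time is spent!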

Example: Base Machine (Reg / Reg), Typical Mix

Op       Freq (Fi)   CPI(i)   Fi × CPI(i)   % Time
ALU        50%         1         0.5          33%
Load       20%         2         0.4          27%
Store      10%         2         0.2          13%
Branch     20%         2         0.4          27%
Total                            1.5

The CPI = 1.5 cycles per instruction
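The CPI and "% Time" columns of the base-machine table can be reproduced with a short Python sketch. The frequencies and per-class CPIs are taken from the table; the names `base_mix`, `weighted_cpi`, and `time_shares` are illustrative assumptions, not from the slides.

```python
# Instruction mix of the base machine: op -> (frequency F_i, CPI_i).
base_mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}


def weighted_cpi(mix):
    """Overall CPI = sum over instruction classes of F_i * CPI_i."""
    return sum(freq * cpi for freq, cpi in mix.values())


def time_shares(mix):
    """Fraction of execution time spent in each class: F_i * CPI_i / CPI."""
    total = weighted_cpi(mix)
    return {op: freq * cpi / total for op, (freq, cpi) in mix.items()}


cpi = weighted_cpi(base_mix)
shares = time_shares(base_mix)
print(f"CPI = {cpi:.2f}")
for op, share in shares.items():
    print(f"{op:<6} {share:.0%}")
```

The time shares show where the cycles go: ALU instructions are half of the mix but only about a third of the execution time.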

Assume a program of 1 million instructions. Compare the performance of the Base Machine (B), with the above CPI and a 1 GHz clock, against the Enhanced Machine (E), with a 1.333 GHz clock and a one-cycle increase for load, store, and branch instructions.

Enhanced Machine (Reg / Reg)

Op       Freq   CPI(i)   Fi × CPI(i)   % Time
ALU       50%     1         0.5          25%
Load      20%     3         0.6          30%
Store     10%     3         0.3          15%
Branch    20%     3         0.6          30%
Total                       2.0

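Comparing the performance of two machines
• Perf. of machine X = 1 / exec. time of prog. on machine X
• Perf. of E / Perf. of B = exec. time on B / exec. time on E
• = (1.5 × 1 ns) / (2.0 × 0.75 ns) = 1, where 1 ns and 0.75 ns are the clock cycle times at 1 GHz and 1.333 GHz (the instruction count cancels)
• Performance of B is the same as that of E
• No gain in performance: the faster clock is offset by the higher CPI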
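The comparison can be checked numerically with a minimal Python sketch. It assumes the CPI values 1.5 (B) and 2.0 (E) from the two tables and treats "1.333 GHz" as exactly 4/3 GHz; the helper name `exec_time` is hypothetical.

```python
def exec_time(instruction_count, cpi, clock_rate_hz):
    """Execution time in seconds = instructions x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


INSTRUCTIONS = 1_000_000  # program size stated on the slide

# Base machine: CPI 1.5 at 1 GHz.
t_b = exec_time(INSTRUCTIONS, 1.5, 1.0e9)
# Enhanced machine: CPI 2.0 at 1.333 GHz (assumed here to mean 4/3 GHz).
t_e = exec_time(INSTRUCTIONS, 2.0, 4.0e9 / 3.0)

# Performance is 1 / execution time, so Perf(E) / Perf(B) = t_b / t_e.
speedup = t_b / t_e
print(f"B: {t_b * 1e3:.3f} ms, E: {t_e * 1e3:.3f} ms, E vs. B: {speedup:.3f}")
```

Both machines need about 1.5 ms, so the ratio comes out at 1.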

Rate Metrics
• MIPS = Instruction Count / (Time × 10^6) = Clock Rate / (CPI × 10^6)
  • machines with different instruction sets?
  • programs with different instruction mixes?
  • dynamic frequency of instructions
  • uncorrelated with performance
• MFLOP/s = FP Operations / (Time × 10^6)
  • machine dependent
  • often not where time is spent

Example showing why MIPS can fail
Compare performance with Compilers 1 and 2 for a given program on a given machine.

Instruction count (in billions) per instruction class:

                         A     B     C
Compiler 1               5     1     1
Compiler 2              10     1     1
Clock cycles per instr.  1     2     3

Clock cycles using compiler 1 = 5×1 + 1×2 + 1×3 = 10 billion
Clock cycles using compiler 2 = 10×1 + 1×2 + 1×3 = 15 billion
Assuming a 1 GHz clock:
CPU Time 1 = 10 billion cycles / 1 GHz = 10 secs
CPU Time 2 = 15 billion cycles / 1 GHz = 15 secs
Yet the MIPS rating is
MIPS 1 = instr. count / (CPU time in sec × 10^6) = 7 × 10^9 / (10 × 10^6) = 700
MIPS 2 = 12 × 10^9 / (15 × 10^6) = 800
Compiler 2 earns the higher MIPS rating even though its code takes 5 seconds longer to run.
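The compiler example can be reproduced with a small Python sketch. The instruction counts, cycles per class, and 1 GHz clock come from the slide; the names `cpu_time` and `mips_rating` are illustrative assumptions.

```python
CLOCK_RATE_HZ = 1.0e9  # 1 GHz clock, as assumed on the slide

# Clock cycles needed by one instruction of each class.
cycles_per_class = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class produced by each compiler.
compiler1 = {"A": 5e9, "B": 1e9, "C": 1e9}
compiler2 = {"A": 10e9, "B": 1e9, "C": 1e9}


def cpu_time(counts):
    """CPU time in seconds = total clock cycles / clock rate."""
    cycles = sum(counts[c] * cycles_per_class[c] for c in counts)
    return cycles / CLOCK_RATE_HZ


def mips_rating(counts):
    """MIPS = instruction count / (CPU time in seconds * 10^6)."""
    return sum(counts.values()) / (cpu_time(counts) * 1e6)


for name, counts in (("Compiler 1", compiler1), ("Compiler 2", compiler2)):
    print(f"{name}: {cpu_time(counts):.0f} s, {mips_rating(counts):.0f} MIPS")
```

The ranking by MIPS is the opposite of the ranking by execution time, which is why time, not a rate, is the measure to trust.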

Why Do Benchmarks?
• How we evaluate differences
  • Different systems
  • Changes to a single system
• Provide a target
  • Benchmarks should represent a large class of important programs
  • Improving benchmark performance should help many programs
• For better or worse, benchmarks shape a field
• Good ones accelerate progress
  • good target for development
• Bad benchmarks hurt progress
  • help real programs v. sell machines/papers?
  • Inventions that help real programs don’t help benchmark

Programs to Evaluate Processor Performance
• (Toy) Benchmarks
  • 10-100 lines
  • e.g., sieve, puzzle, quicksort
• Synthetic Benchmarks
  • attempt to match average frequencies of real workloads
  • e.g., Whetstone, Dhrystone
• Kernels
  • Time-critical excerpts of real programs
• Real programs
  • e.g., gcc, spice

Successful Benchmark: SPEC
http://www.spec.org/benchmarks.html
http://mrob.com/pub/comp/benchmarks/spec.html#CPU_06
• EE Times + 5 companies band together to form the Systems Performance Evaluation Committee (SPEC): Sun, MIPS, HP, Apollo, DEC
• Create standard list of programs, inputs, reporting: some real programs, includes OS calls, some I/O

SPEC second round, SPEC95 • 8 integer benchmarks in C and 10 floating pt benchmarks in Fortran

Amdahl's Law
Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

$$\text{ExTime}(\text{with } E) = \left((1-F) + \frac{F}{S}\right) \times \text{ExTime}(\text{without } E)$$

$$\text{Speedup}(\text{with } E) = \frac{\text{ExTime}(\text{without } E)}{\left((1-F) + \frac{F}{S}\right) \times \text{ExTime}(\text{without } E)} = \frac{1}{(1-F) + \frac{F}{S}} \le \frac{1}{1-F}$$

The speedup is bounded by the factor 1/(1-F).
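A minimal Python sketch of Amdahl's Law follows. The function names `amdahl_speedup` and `speedup_bound` and the sample values (F = 0.5, S = 2) are assumptions for illustration; they do not come from the slides.

```python
import math


def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` (F) of the task is sped up by `factor` (S)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction F must lie in [0, 1]")
    if factor <= 0.0:
        raise ValueError("factor S must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def speedup_bound(fraction):
    """Upper bound 1 / (1 - F), approached as the factor S grows without limit."""
    return math.inf if fraction == 1.0 else 1.0 / (1.0 - fraction)


# Assumed example: half of the task (F = 0.5) runs twice as fast (S = 2).
s = amdahl_speedup(0.5, 2.0)
print(f"speedup = {s:.3f}, bound = {speedup_bound(0.5):.1f}")
```

Even with an arbitrarily large S, the example cannot exceed a speedup of 2, because the unimproved half of the task still takes its original time.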

Performance Evaluation Summary

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

• Time is the measure of computer performance!
• Good products are created when we have:
  • Good benchmarks
  • Good ways to summarize performance
• Without good benchmarks and summaries, the choice is between improving the product for real programs and improving it to get more sales => sales almost always wins
• Remember Amdahl’s Law: speedup is limited by the unimproved part of the program
