Lecture 4: Performance (conclusion) & Instruction Set Architecture
Michael B. Greenwald
Computer Architecture, CIS 501, Spring 1999
Philosophy • Open-ended problems, messy solutions • Positive: • More like the real world (but still contrived!) • Get good at approximating and back-of-envelope calculation • Negative: • How do you know you have the right answer?
How to Summarize Performance • Track total execution time of all (weighted) benchmarks (convert to execution time, sum, convert back to the measure, and average). • Arithmetic mean (weighted arithmetic mean): Σ(T_i)/n or Σ(W_i × T_i) • Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS): n/Σ(1/R_i) or 1/Σ(W_i/R_i) • Compare to a reference architecture (SPEC92 vs. VAX-11/780): normalized execution time • handy for scaling performance (e.g., X times faster than SPARCstation 10) • But do not take the arithmetic mean of normalized execution times; use the geometric mean: (Π norm time_i)^(1/n)
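A minimal sketch of these three summaries in Python; the benchmark times, weights, rates, and reference times below are invented for illustration, not taken from any real machine:

    import math

    # Hypothetical execution times (seconds) and weights for three benchmarks.
    times = [2.0, 10.0, 0.5]
    weights = [0.5, 0.3, 0.2]          # weights must sum to 1
    rates = [80.0, 40.0, 120.0]        # e.g., MFLOPS for the same benchmarks

    # Arithmetic mean of times, and its weighted form.
    arith = sum(times) / len(times)
    w_arith = sum(w * t for w, t in zip(weights, times))

    # Harmonic mean of rates, and its weighted form.
    harm = len(rates) / sum(1.0 / r for r in rates)
    w_harm = 1.0 / sum(w / r for w, r in zip(weights, rates))

    # Geometric mean of normalized times (reference machine times made up).
    ref_times = [4.0, 12.0, 1.5]
    norm = [t / r for t, r in zip(times, ref_times)]
    geom = math.prod(norm) ** (1.0 / len(norm))

    print(arith, w_arith, harm, w_harm, geom)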
Normalized Execution Times and Geometric Means • The geometric mean is independent of normalization: the numerators are the same regardless of reference machine, and the denominators cancel out in any machine-to-machine comparison.
Normalized Execution Times and Geometric Means • But the geometric mean does not track total execution time: halving a 1 µsec program has the same effect on the mean as halving a 10-hour program.
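A quick numeric check of the normalization-independence claim; the times for machines A, B, and the two reference machines are invented:

    import math

    def geo_mean(xs):
        return math.prod(xs) ** (1.0 / len(xs))

    # Invented times (seconds) for machines A and B on three programs.
    a = [2.0, 6.0, 1.0]
    b = [4.0, 3.0, 2.0]

    # Normalize against two different reference machines.
    for ref in ([8.0, 12.0, 4.0], [1.0, 1.0, 1.0]):
        ratio = geo_mean([ai / ri for ai, ri in zip(a, ref)]) / \
                geo_mean([bi / ri for bi, ri in zip(b, ref)])
        print(ratio)   # same A/B ratio no matter which reference is used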
SPEC First Round • One program spent 99% of its time in a single line of code • A new compiler front end could improve its reported performance dramatically
Impact of Means on SPECmark89 for IBM 550

                 Ratio to VAX        Time (sec)        Weighted Time
    Program     Before   After     Before   After     Before   After
    gcc             30      29         49      51       8.91    9.22
    espresso        35      34         65      67       7.64    7.86
    spice           47      47        510     510       5.69    5.69
    doduc           46      49         41      38       5.81    5.45
    nasa7           78     144        258     140       3.43    1.86
    li              34      34        183     183       7.86    7.86
    eqntott         40      40         28      28       6.68    6.68
    matrix300       78     730         58       6       3.43    0.37
    fpppp           90      87         34      35       2.97    3.07
    tomcatv         33     138         20      19       2.01    1.94
    Mean            54      72        124     108      54.42   49.99
                (Geometric)       (Arithmetic)     (Weighted Arith.)
    Ratio          1.33              1.16               1.09
Marketing Metrics • MIPS = Instruction Count / (Execution Time × 10^6) = Clock Rate / (CPI × 10^6) • What about machines with different instruction sets? • Programs with different instruction mixes? • Dynamic frequency of instructions • Uncorrelated with performance • MFLOPS = FP Operations / (Execution Time × 10^6) • Machine dependent • Often not where time is spent • Normalized FLOP weights: add, sub, compare, mult = 1; divide, sqrt = 4; exp, sin, ... = 8
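A sketch of both metrics, using the normalized FLOP weights from the slide; the instruction count, time, clock rate, and operation profile are made-up numbers:

    # Hypothetical dynamic counts from a profiling run (values are invented).
    instructions = 500_000_000
    exec_time = 2.0                      # seconds
    clock_rate = 250e6                   # Hz
    fp_ops = {"add": 10_000_000, "mult": 8_000_000,
              "divide": 500_000, "sin": 100_000}

    # Weights from the normalized-MFLOPS convention above.
    weights = {"add": 1, "sub": 1, "compare": 1, "mult": 1,
               "divide": 4, "sqrt": 4, "exp": 8, "sin": 8}

    mips = instructions / (exec_time * 1e6)
    cpi = clock_rate * exec_time / instructions
    mflops = sum(n * weights[op] for op, n in fp_ops.items()) / (exec_time * 1e6)

    print(f"MIPS = {mips:.0f}, CPI = {cpi:.2f}, normalized MFLOPS = {mflops:.1f}")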
Normalized Performance: k-MIPS Machines • MIPS is normalized, on a per-program basis, to a VAX-11/780 • Same problems as SPECmarks • Also needs to be summarized by a geometric mean (performance is 1/Time, and the inverse of a geometric mean is the geometric mean of the inverses).
Performance Evaluation • “For better or worse, benchmarks shape a field” • Good products are created when you have: • Good benchmarks • Good ways to summarize performance • Since sales are partly a function of performance relative to the competition, vendors invest in improving the product as reported by the performance summary • If the benchmarks or summary are inadequate, vendors must choose between improving the product for real programs and improving it to boost reported performance and sales; sales almost always win! • For computer systems the key performance metric is total execution time
Summary, #1 • Designing to last through trends:

                Capacity          Speed
    Logic       2x in 3 years     2x in 3 years
    DRAM        4x in 3 years     2x in 10 years
    Disk        4x in 3 years     2x in 10 years

• 6 years to graduate => 16X CPU speed, 16X DRAM/disk size • Time to run the task: execution time, response time, latency • Tasks per day, hour, week, sec, ns, ...: throughput, bandwidth • “X is n times faster than Y” means: ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = n
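A tiny sanity check on those trend numbers, compounding per-period growth over a six-year program. The capacity rate comes from the table above; the 16X CPU figure is consistent with performance doubling roughly every 1.5 years, which is my assumption here since the table lists only raw logic speed:

    # Growth factor after `years`, given `factor` growth per `period` years.
    def scale(factor, period, years):
        return factor ** (years / period)

    print(scale(4, 3, 6))      # capacity at 4x per 3 years -> 16x over 6 years
    print(scale(2, 1.5, 6))    # CPU perf. at 2x per 1.5 years -> 16x over 6 years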
Summary, #2 • Amdahl’s Law: Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced) • CPI Law: CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle) • Execution time is the REAL measure of computer performance! • Good products created when you have good benchmarks and good ways to summarize performance • Die cost goes roughly with (die area)^4 • Can the PC industry support engineering/research investment?
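A minimal sketch of both laws; the enhanced fraction, speedup, instruction count, CPI, and clock rate below are invented for illustration:

    # Amdahl's Law: overall speedup from enhancing a fraction of execution time.
    def amdahl(fraction_enhanced, speedup_enhanced):
        return 1.0 / ((1.0 - fraction_enhanced)
                      + fraction_enhanced / speedup_enhanced)

    # Enhancing 40% of the time by 10x yields only ~1.56x overall.
    print(amdahl(0.4, 10))

    # CPI Law: CPU time = instructions x CPI x cycle time.
    instructions = 1_000_000_000
    cpi = 1.5
    clock_rate = 200e6            # Hz, so cycle time = 1 / clock_rate
    cpu_time = instructions * cpi / clock_rate
    print(cpu_time)               # 7.5 seconds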
Computer Architecture Is ... “the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.” Amdahl, Blaauw, and Brooks, 1964
Architecture, Organization, and Hardware • Instruction set architecture: the programmer-visible interface between software and hardware. • Organization: high-level aspects of the computer’s design, such as the memory system, bus structure, and the internal CPU design that supports the ISA. • Hardware: detailed logic design, packaging, etc.; can be thought of as the implementation of the organization.
Interface Design • A good interface: • Lasts through many implementations (portability, compatibility) • Is used in many different ways (generality) • Provides convenient functionality to higher levels • Permits an efficient implementation at lower levels (Diagram: one interface serving many uses over time, implemented successively by imp 1, imp 2, imp 3)
Instruction Set Architectures • The interface exposed to the programmer • Assembly language programming is declining; the main clients are now compiler back ends. • Even so, simplicity and regularity are still useful: • for compiler writers: they restrict choices and make tradeoffs clear. • for implementability: a simple interface leads to a simple implementation, and fewer restrictions make it amenable to different types of implementation. • Generality and flexibility are still useful: • Must last several generations • Must be useful to a range of different clients
Evolution of Instruction Sets • Single accumulator (EDSAC, 1950) • Accumulator + index registers (Manchester Mark I, IBM 700 series, 1953) • Separation of programming model from implementation: • High-level-language based (B5000, 1963) • Concept of a family (IBM 360, 1964) • General-purpose register machines: • Load/store architectures (CDC 6600, Cray-1, 1963-76) • Complex instruction sets (VAX, Intel 432, 1977-80) • RISC (MIPS, SPARC, 88000, IBM RS6000, ..., 1987)
Evolution of Instruction Sets • Major advances in computer architecture are typically associated with landmark instruction set designs • Ex: stack vs. GPR (System/360) • Design decisions must take into account: • technology • machine organization • programming languages • compiler technology • operating systems • And design decisions, in turn, influence these
Design Space of ISA • Five primary dimensions: • Operand storage: where besides memory? • Number of (explicit) operands (0, 1, 2, 3) • Effective address: how is a memory location specified? • Type & size of operands: byte, int, float, vector, ...; how is it specified? • Operations: add, sub, mul, ...; how are they specified? • Other aspects: • Successor: how is it specified? • Conditions: how are they determined? • Encodings: fixed or variable? Wide? • Parallelism
ISA Metrics • Aesthetics: • Orthogonality: no special registers, few special cases, all operand modes available with any data type or instruction type • Completeness: support for a wide range of operations and target applications • Regularity: no overloading of the meanings of instruction fields • Streamlined: resource needs easily determined • Ease of compilation (programming?) • Ease of implementation • Scalability
Basic ISA Classes • Accumulator: • 1 address: add A (acc <- acc + mem[A]) • 1+x address: addx A (acc <- acc + mem[A + x]) • Stack: • 0 address: add (tos <- tos + next) • General-purpose register: • 2 address: add A B (EA(A) <- EA(A) + EA(B)) • 3 address: add A B C (EA(A) <- EA(B) + EA(C)) • Conventionally viewed as 3 distinct architectures; actually a continuum with N registers. Accumulator and stack machines just place different kinds of restrictions on register usage and load/store patterns.
Basic ISA Classes: C = A + B; in each style

    Stack       Accumulator     GPR (register-memory)   GPR (register-register)
    Push A      Load A          Load  R1,A               Load  R1,A
    Push B      Add  B          Add   R1,B               Load  R2,B
    Add         Store C         Store C,R1               Add   R3,R1,R2
    Pop C                                                Store C,R3
Stack Machines • Instruction set: +, -, *, /, ..., push A, pop A • Example: a*b - (a + c*b), with the stack contents after each instruction:

    push a     [a]
    push b     [a, b]
    *          [a*b]
    push a     [a*b, a]
    push c     [a*b, a, c]
    push b     [a*b, a, c, b]
    *          [a*b, a, c*b]
    +          [a*b, a+c*b]
    -          [a*b - (a+c*b)]
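A minimal stack-machine interpreter in Python that replays the trace above; the instruction encoding and variable bindings are my own illustration, not from the lecture:

    # Tiny stack-machine evaluator: "push x" pushes a variable's value;
    # an operator pops two values and pushes the result.
    def run(program, env):
        stack = []
        for instr in program:
            if instr == "+":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif instr == "-":
                b, a = stack.pop(), stack.pop()
                stack.append(a - b)
            elif instr == "*":
                b, a = stack.pop(), stack.pop()
                stack.append(a * b)
            else:                          # "push x"
                stack.append(env[instr.split()[1]])
        return stack.pop()

    # a*b - (a + c*b) with a=2, b=3, c=4: 6 - (2 + 12) = -8
    prog = ["push a", "push b", "*",
            "push a", "push c", "push b", "*", "+", "-"]
    print(run(prog, {"a": 2, "b": 3, "c": 4}))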
Continuum of ISA Classes • All can have an arbitrary number of registers • Accumulator: each register is special, implicit in the opcode • Stack: the registers act as a top-of-stack cache • GPR: no special meaning, so you can keep adding registers • What values are stored in each register? • Accumulator: forced by the instruction • Stack: mostly determined by order of evaluation • GPR: almost no restrictions
The Case Against Special-Purpose Registers • Performance is derived from the existence of several fast registers, not from the way they are organized • Data does not always “surface” when needed • Constants, repeated operands, and common subexpressions mean that TOP (top-of-stack copy) and Swap instructions are required • Code density is about equal to that of GPR instruction sets • Registers have short addresses • Keep things in registers and reuse them • Slightly simpler to write a poor compiler, but not an optimizing compiler