
Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture

This lecture covers benchmarks, performance metrics, cost, and instruction set architecture. Topics include Amdahl's Law, CPU performance, benchmark programs for evaluating processor performance, integrated circuit costs, and ISA design.

Presentation Transcript


  1. Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001

  2. Administrative • Some textbooks are here, more to arrive Friday afternoon • Read Chapter 3 • Homework #1 due September 11 • SimpleScalar: read some of the documentation first • See the web page for details • Questions: contact Fareed (fareed@cs.duke.edu) • Policy on academic integrity (cheating, etc.) • Homework: discussion of topics is encouraged, peers are a great resource, but hand in your own work • Projects: work in pairs, learn how to collaborate CPS 220

  3. Review • Designing to last through technology trends:
        Technology   Capacity        Speed
        Logic        2x in 3 years   2x in 3 years
        DRAM         4x in 3 years   1.4x in 10 years
        Disk         4x in 3 years   1.4x in 10 years
  • Time to run the task: execution time, response time, latency
  • Tasks per day, hour, week, sec, ns, …: throughput, bandwidth
  • "X is n times faster than Y" means ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = n
  CPS 220
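
A small sketch (mine, not from the lecture) of the "n times faster" definition above, using made-up execution times:

```python
# "X is n times faster than Y": n = ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y).
def times_faster(extime_x: float, extime_y: float) -> float:
    """Return n, the factor by which machine X is faster than machine Y."""
    return extime_y / extime_x

# Example with invented times: X takes 10 s, Y takes 15 s -> X is 1.5x faster.
print(times_faster(extime_x=10.0, extime_y=15.0))
```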

  4. The Danger of Extrapolation • Dot-com stock value • Technology Trends • Power dissipation? • Cost of new fabs? • Alternative technologies? • GaAs • Optical CPS 220

  5. Amdahl's Law
  ExTime_new = ExTime_old x [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
  Speedup_overall = ExTime_old / ExTime_new
                  = 1 / [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
  CPS 220
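
A minimal Python sketch of Amdahl's Law as written on this slide (function and argument names are my own):

```python
def new_exec_time(extime_old: float, fraction_enhanced: float, speedup_enhanced: float) -> float:
    """ExTime_new = ExTime_old x [(1 - F) + F / S]."""
    return extime_old * ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

def overall_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Speedup_overall = 1 / [(1 - F) + F / S]."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Made-up example: speed up 40% of the execution time by 10x -> about 1.56x overall.
print(overall_speedup(0.4, 10.0))
```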

  6. Review: Performance
  CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)
  Cycles / Instruction is the "Average Cycles Per Instruction" (CPI); the instruction mix gives the "Instruction Frequency" of each class.
  Invest resources where time is spent!
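
A quick sketch of the CPU time equation; the numbers below are invented for illustration:

```python
def cpu_time(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """CPU time = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)."""
    return instruction_count * cpi * cycle_time_s

# 1e9 instructions at an average CPI of 1.5 on a 500 MHz clock (2 ns cycle) -> 3.0 s.
print(cpu_time(1e9, 1.5, 2e-9))
```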

  7. Example: Base Machine (Reg / Reg)
  Typical mix:
        Op       Freq   Cycles
        ALU      50%    1
        Load     20%    2
        Store    10%    2
        Branch   20%    2
  • Add register/memory operations: one source operand in memory, one source operand in a register, cycle count of 2
  • Branch cycle count increases to 3
  • What fraction of the loads must be eliminated for this to pay off?
  CPS 220

  8. Example Solution
  Exec Time = Instr Cnt x CPI x Clock
        Op        Old: Freq  Cycles  CPI       New: Freq  Cycles  CPI
        ALU            .50   1       .5             .5 - X  1      .5 - X
        Load           .20   2       .4             .2 - X  2      .4 - 2X
        Store          .10   2       .2             .1      2      .2
        Branch         .20   2       .4             .2      3      .6
        Reg/Mem        -     -       -              X       2      2X
        Total         1.00           1.5            1 - X          (1.7 - X) / (1 - X)
  CPI_new = Cycles_new / Instructions_new must be normalized to the new instruction count: the new cycles per original instruction sum to 1.7 - X, and the new instruction count is 1 - X.
  CPS 220

  9. Example Solution (continued)
  Exec Time = Instr Cnt x CPI x Clock   (table as on the previous slide)
  Set the old and new execution times equal (clock unchanged):
  Instr Cnt_old x CPI_old x Clock_old = Instr Cnt_new x CPI_new x Clock_new
  1.00 x 1.5 = (1 - X) x (1.7 - X) / (1 - X)
  1.5 = 1.7 - X
  X = 0.2
  ALL loads must be eliminated for this to be a win!
  CPS 220
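
A quick numerical check of the break-even point (the instruction mix and cycle counts come from the slides; the code itself is only an illustrative sketch):

```python
# Old machine: CPI = 0.5*1 + 0.2*2 + 0.1*2 + 0.2*2 = 1.5 (per original instruction).
old_cpi = 0.50 * 1 + 0.20 * 2 + 0.10 * 2 + 0.20 * 2

def new_cycles_per_old_instruction(x: float) -> float:
    """Cycles per ORIGINAL instruction after a fraction x of the mix becomes
    reg/mem ops (each removing one ALU op and one load), with branches at 3 cycles."""
    return (0.5 - x) * 1 + (0.2 - x) * 2 + 0.1 * 2 + 0.2 * 3 + x * 2   # = 1.7 - x

# Break-even with an unchanged clock: 1.7 - x == 1.5, i.e. x = 0.2 (all loads removed).
print(old_cpi, new_cycles_per_old_instruction(0.2))   # 1.5 1.5
```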

  10. Programs to Evaluate Processor Performance • (Toy) benchmarks • 10-100 line programs • e.g.: sieve, puzzle, quicksort • Synthetic benchmarks • Attempt to match average frequencies of real workloads • e.g., Whetstone, Dhrystone • Kernels • Time-critical excerpts of real programs • e.g., Livermore loops • Real programs • e.g., gcc, compress, database, graphics, etc. CPS 220

  11. Benchmarking Games • Differing configurations used to run the same workload on two systems • Compiler wired to optimize the workload • Test specification written to be biased towards one machine • Workload arbitrarily picked • Very small benchmarks used • Benchmarks manually translated to optimize performance CPS 220

  12. Common Benchmarking Mistakes • Not validating measurements • Collecting too much data but doing too little analysis • Only average behavior represented in test workload • Loading level (other users) controlled inappropriately • Caching effects ignored • Buffer sizes not appropriate • Inaccuracies due to sampling ignored • Ignoring monitoring overhead • Not ensuring same initial conditions • Not measuring transient (cold start) performance • Using device utilizations for performance comparisons CPS 220

  13. SPEC: System Performance Evaluation Cooperative • First round 1989 • 10 programs yielding a single number • Second round 1992 • SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs) • Compiler flags unlimited; e.g., the March 1993 flags for the DEC 4000 Model 610: • spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)= memcpy(b,a,c)" • wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200 • nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas • Third round 1995 • Single flag setting for all programs; new set of programs ("benchmarks useful for 3 years") • SPEC2000: two options, 1) specific flags, 2) whatever you want CPS 220

  14. SPEC First Round • One program spent 99% of its time in a single line of code • A new compiler front end could improve it dramatically CPS 220

  15. How to Summarize Performance • Arithmetic mean (weighted arithmetic mean) tracks execution time: Σ(Ti)/n or Σ(Wi x Ti) • Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: n / Σ(1/Ri) or 1 / Σ(Wi/Ri) • Normalized execution time is handy for scaling performance • But do not take the arithmetic mean of normalized execution times; use the geometric mean, (∏ Ri)^(1/n) CPS 220
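
A minimal sketch of the three means on this slide (the data and names are mine, chosen only for illustration):

```python
from math import prod

times = [2.0, 4.0, 8.0]        # execution times Ti, in seconds
rates = [100.0, 50.0, 25.0]    # rates Ri, e.g. MFLOPS

arithmetic_mean = sum(times) / len(times)                 # tracks total execution time
harmonic_mean = len(rates) / sum(1.0 / r for r in rates)  # the right mean for rates
geometric_mean = prod(times) ** (1.0 / len(times))        # use for normalized times

print(arithmetic_mean, harmonic_mean, geometric_mean)
```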

  16. Reporting Results • Reproducibility • List everything another researcher needs to duplicate the results • May include archiving your simulation/software infrastructure • Processor, cache hierarchy, main memory, disks, compiler version and optimization flags, OS version, application inputs, etc.

  17. Performance Evaluation • Since sales are a function of performance relative to the competition, there is a big investment in improving the product as reported by the performance summary • Good products are created when you have: • Good benchmarks • Good ways to summarize performance • If the benchmarks/summary are inadequate, then you must choose between improving the product for real programs vs. improving the product to get more sales; sales almost always wins! • Execution time or bandwidth is the measure of computer performance! • What about cost? CPS 220

  18. Integrated Circuit Costs Die cost goes roughly with (die area)^4

  19. Real World Examples
        Chip          Metal layers  Line width (microns)  Wafer cost  Defects/cm2  Area (mm2)  Dies/wafer  Yield  Die cost
        386DX         2             0.90                  $900        1.0          43          360         71%    $4
        486DX2        3             0.80                  $1200       1.0          81          181         54%    $12
        PowerPC 601   4             0.80                  $1700       1.3          121         115         28%    $53
        HP PA 7100    3             0.80                  $1300       1.0          196         66          27%    $73
        DEC Alpha     3             0.70                  $1500       1.2          234         53          19%    $149
        SuperSPARC    3             0.70                  $1700       1.6          256         48          13%    $272
        Pentium       3             0.80                  $1500       1.5          296         40          9%     $417
  • From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15
  CPS 220
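
For reference, a sketch of the textbook die-cost model that tables like this are built from (the formulas follow Hennessy and Patterson; the 150 mm wafer diameter and alpha = 3 are my assumptions, and Gwennap's published figures use his own yield assumptions, so the numbers will not match exactly):

```python
from math import pi, sqrt

def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> float:
    r = wafer_diameter_mm / 2.0
    return pi * r * r / die_area_mm2 - pi * wafer_diameter_mm / sqrt(2.0 * die_area_mm2)

def die_yield(defects_per_cm2: float, die_area_mm2: float, alpha: float = 3.0) -> float:
    return (1.0 + defects_per_cm2 * (die_area_mm2 / 100.0) / alpha) ** (-alpha)

def die_cost(wafer_cost: float, wafer_diameter_mm: float,
             die_area_mm2: float, defects_per_cm2: float) -> float:
    return wafer_cost / (dies_per_wafer(wafer_diameter_mm, die_area_mm2)
                         * die_yield(defects_per_cm2, die_area_mm2))

# Roughly the Pentium row: $1500 wafer, 150 mm diameter, 296 mm2 die, 1.5 defects/cm2.
print(die_cost(1500.0, 150.0, 296.0, 1.5))
```

Because yield falls as die area grows while dies per wafer shrink, the modeled cost rises much faster than linearly in area, which is the "roughly (die area)^4" rule of thumb from slide 18.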

  20. Cost/Performance: What is the Relationship of Cost to Price? [Figure: breakdown of list price: average discount 25% to 40%, gross margin 34% to 39%, direct cost 6% to 8%, component cost 15% to 33%] • Component costs • Direct costs (add 25% to 40%): recurring costs (labor, purchasing, scrap, warranty) • Gross margin (add 82% to 186%): nonrecurring costs (R&D, marketing, sales, equipment maintenance, rental, financing cost, pretax profits, taxes) • Average discount to get list price (add 33% to 66%): volume discounts and/or retailer markup CPS 220
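
A small sketch turning these markup ranges into a list-price estimate (the specific mid-range percentages chosen below are mine, not figures from the lecture):

```python
def list_price(component_cost: float,
               direct_cost_markup: float = 0.33,        # slide range: 25% to 40%
               gross_margin_markup: float = 1.30,       # slide range: 82% to 186%
               discount_markup: float = 0.50) -> float: # slide range: 33% to 66%
    direct_cost = component_cost * (1.0 + direct_cost_markup)
    avg_selling_price = direct_cost * (1.0 + gross_margin_markup)
    return avg_selling_price * (1.0 + discount_markup)

# $500 of components -> roughly a $2300 list price under these assumptions.
print(list_price(500.0))
```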

  21. Instruction Set Architecture

  22. Instruction Set Architecture • 1950s to 1960s: computer architecture course = computer arithmetic • 1970 to mid 1980s: computer architecture course = instruction set design, especially ISAs appropriate for compilers • 1990s: computer architecture course = design of CPU, memory system, I/O system, multiprocessors • 2000s: computer architecture course = • Power issues • Wire delays (distributed microarchitecture) • New technologies • New applications (media, network, etc.) CPS 220

  23. Computer Architecture? ". . . the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." Amdahl, Blaauw, and Brooks, 1964 CPS 220

  24. Towards Evaluation of ISA and Organization [Figure: the instruction set as the layer between software and hardware] CPS 220

  25. Interface Design • A good interface: • Lasts through many implementations (portability, compatibility) • Is used in many different ways (generality) • Provides convenient functionality to higher levels • Permits an efficient implementation at lower levels [Figure: one interface serving several implementations (imp 1, imp 2, imp 3) over time] CPS 220

  26. Evolution of Instruction Sets
  Single Accumulator (EDSAC, 1950)
  → Accumulator + Index Registers (Manchester Mark I, IBM 700 series, 1953)
  → Separation of Programming Model from Implementation: High-level Language Based (B5000, 1963) and Concept of a Family (IBM 360, 1964)
  → General Purpose Register Machines: Load/Store Architecture (CDC 6600, Cray-1, 1963-76) and Complex Instruction Sets (VAX, Intel 432, 1977-80)
  → RISC (MIPS, SPARC, 88000, IBM RS/6000, . . ., 1987)
  CPS 220

  27. Evolution of Instruction Sets • Major advances in computer architecture were typically associated with landmark instruction set designs • Ex: stack vs. GPR (System 360) • Design decisions must take into account: • technology • machine organization • programming languages • compiler technology • operating systems • And they in turn influence these CPS 220

  28. Design Space of ISA • Five primary dimensions: • Number of explicit operands (0, 1, 2, 3) • Operand storage: where besides memory? • Effective address: how is a memory location specified? • Type & size of operands (byte, int, float, vector, . . .): how are they specified? • Operations (add, sub, mul, . . .): how are they specified? • Other aspects: • Successor: how is it specified? • Conditions: how are they determined? • Encodings: fixed or variable? Wide? • Parallelism CPS 220

  29. ISA Metrics • Aesthetics: • Regularity (Orthogonality) • No special registers, few special cases, all operand modes available with any data type or instruction type • Primitives not solutions • Completeness • Support for a wide range of operations and target applications • Streamlined • Resource needs easily determined • Ease of compilation (programming?) • Ease of implementation • Scalability • Density (Network BW and Power Consumption) CPS 220

  30. Basic ISA Classes
  Accumulator:
        1 address      add A          acc ← acc + mem[A]
        1+x address    addx A         acc ← acc + mem[A + x]
  Stack (e.g., the Java VM):
        0 address      add            tos ← tos + next
  General Purpose Register:
        2 address      add A B        A ← A + B
        3 address      add A B C      A ← B + C
  Load/Store:
        3 address      add Ra Rb Rc   Ra ← Rb + Rc
                       load Ra Rb     Ra ← mem[Rb]
                       store Ra Rb    mem[Rb] ← Ra
  CPS 220
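
An illustrative comparison (my own, not from the slide) of the code needed to compute C = A + B in each class:

```python
sequences = {
    "stack":        ["push A", "push B", "add", "pop C"],
    "accumulator":  ["load A", "add B", "store C"],
    "reg-mem GPR":  ["load R1, A", "add R1, B", "store C, R1"],
    "load/store":   ["load R1, A", "load R2, B", "add R3, R1, R2", "store C, R3"],
}
for style, code in sequences.items():
    print(f"{style:12s} {len(code)} instructions: {'; '.join(code)}")
```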

  31. Stack Machines • Instruction set: +, -, *, /, . . ., push A, pop A • Example: a*b - (a + c*b) compiles to: push a; push b; *; push a; push c; push b; *; +; - [Figure: stack contents after each instruction] CPS 220
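
A tiny stack-machine evaluator (my own sketch) running the slide's instruction sequence for a*b - (a + c*b):

```python
def run(program, env):
    stack = []
    for op in program:
        if op in ("+", "-", "*"):
            b, a = stack.pop(), stack.pop()
            stack.append({"+": a + b, "-": a - b, "*": a * b}[op])
        else:                       # "push x"
            stack.append(env[op.split()[1]])
    return stack.pop()

prog = ["push a", "push b", "*", "push a", "push c", "push b", "*", "+", "-"]
print(run(prog, {"a": 2, "b": 3, "c": 4}))   # 2*3 - (2 + 4*3) = -8
```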

  32. The Case Against Stacks • Performance is derived from the existence of several fast registers, not from the way they are organized • Data does not always “surface” when needed • Constants, repeated operands, common subexpressions • so TOP and Swap instructions are required • Code density is about equal to that of GPR instruction sets • Registers have short addresses • Keep things in registers and reuse them • Slightly simpler to write a poor compiler, but not an optimizing compiler • So, why JAVA? CPS 220

  33. VAX-11 • Variable format, 2- and 3-address instructions • 32-bit word size, 16 GPRs (four reserved) • Rich set of addressing modes (apply to any operand) • Rich set of operations • bit field, stack, call, case, loop, string, poly, system • Rich set of data types (B, W, L, Q, O, F, D, G, H) • Condition codes CPS 220

  34. Kinds of Addressing Modes [Figure: register file and memory paths for operands Ri, Rj, v] • Register direct: Ri • Immediate (literal): v • Direct (absolute): M[v] • Register indirect: M[Ri] • Base+Displacement: M[Ri + v] • Base+Index: M[Ri + Rj] • Scaled Index: M[Ri + Rj*d + v] • Autoincrement: M[Ri++] • Autodecrement: M[Ri--] • Memory Indirect: M[M[Ri]] CPS 220
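
A toy sketch (register and memory values are invented) showing how each mode resolves to an operand; autoincrement and autodecrement are omitted since they also update Ri:

```python
reg = {"Ri": 100, "Rj": 4}
mem = {7: 42, 100: 7, 104: 8, 116: 9, 200: 100}
v, d = 200, 4   # literal/displacement and scale factor for these examples

operands = {
    "register direct":   reg["Ri"],                      # Ri
    "immediate":         v,                               # v
    "direct (absolute)": mem[v],                          # M[v]
    "register indirect": mem[reg["Ri"]],                  # M[Ri]
    "base+displacement": mem[reg["Ri"] + 4],              # M[Ri + 4]
    "base+index":        mem[reg["Ri"] + reg["Rj"]],      # M[Ri + Rj]
    "scaled index":      mem[reg["Ri"] + reg["Rj"] * d],  # M[Ri + Rj*d + v], with v = 0 here
    "memory indirect":   mem[mem[reg["Ri"]]],             # M[M[Ri]]
}
print(operands)
```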

  35. A "Typical" RISC • 32-bit fixed format instruction (3 formats) • 32 64-bit GPR (R0 contains zero) • 3-address, reg-reg arithmetic instruction • Single address mode for load/store: base + displacement • no indirection • Simple branch conditions • Delayed branch see: SPARC, MIPS, MC88100, AMD2900, i960, i860 PARisc, POWERPC, DEC Alpha, Clipper, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3 CPS 220

  36. Example: MIPS (like DLX)
  Register-Register:   Op [31:26]  Rs1 [25:21]  Rs2 [20:16]  Rd [15:11]  Opx [10:0]
  Register-Immediate:  Op [31:26]  Rs1 [25:21]  Rd [20:16]   immediate [15:0]
  Branch:              Op [31:26]  Rs1 [25:21]  Rs2/Opx [20:16]  immediate [15:0]
  Jump / Call:         Op [31:26]  target [25:0]
  CPS 220
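
A small sketch (mine) that packs and unpacks a register-register instruction using the field boundaries above (Op[31:26], Rs1[25:21], Rs2[20:16], Rd[15:11], Opx[10:0]); the field values are made up:

```python
def encode_rr(op: int, rs1: int, rs2: int, rd: int, opx: int) -> int:
    return (op << 26) | (rs1 << 21) | (rs2 << 16) | (rd << 11) | opx

def decode_rr(word: int):
    return ((word >> 26) & 0x3F, (word >> 21) & 0x1F,
            (word >> 16) & 0x1F, (word >> 11) & 0x1F, word & 0x7FF)

word = encode_rr(op=0, rs1=8, rs2=9, rd=10, opx=0x20)   # an add-style encoding, for illustration
print(hex(word), decode_rr(word))
```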

  37. Next Time • Data path design • Pipelining • Homework #1 Due Sept 11 CPS 220

  38. Review: Execution Time and Amdahl's Law
  CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)
  ExTime_new = ExTime_old x [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
  Speedup_overall = ExTime_old / ExTime_new = 1 / [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
  CPS 220

  39. Review: How to Summarize Performance • Arithmetic mean (weighted arithmetic mean) tracks execution time: Σ(Ti)/n or Σ(Wi x Ti) • Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: n / Σ(1/Ri) or 1 / Σ(Wi/Ri) • Normalized execution time is handy for scaling performance • But do not take the arithmetic mean of normalized execution times; use the geometric mean, (∏ Ri)^(1/n) CPS 220

  40. Review: Performance Evaluation • Benchmarks (toy, synthetic, kernels, full applications) • Games • Mistakes • Influence of making the sale CPS 220

  41. Review: Integrated Circuit Costs Die cost goes roughly with (die area)^4
