1 / 43

CPU Performance Analysis: CPI and Time Allocation

This text discusses CPI (Average clock cycles per instruction) and how to analyze CPU performance by determining the time allocation for different instructions. It also explores examples and calculations for typical and improved scenarios, such as faster load times and branch prediction. Additionally, it compares the impact of better data caching and executing multiple ALU instructions at once. The text concludes with an analysis of two code sequences and their respective CPI values.

Download Presentation

CPU Performance Analysis: CPI and Time Allocation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. TEST 1 – Tuesday March 3Lectures 1 - 8, Ch 1,2 • HW Due Feb 24 • 1.4.1 p.60 • 1.4.4 p.60 • 1.4.6 p.60 • 1.5.2 p.60-61 • 1.5.4 p.61 • 1.5.5 p.61

  2. CPI “Average clock cycles per instruction” • CPI = Clock Cycles / Instruction Count • = (CPU Time * Clock Rate) / Instruction Count Invest Resources where time is Spent! CPU Time = Instruction Count x CPI / Clock Rate = Instruction Count x CPI x Clock Cycle Time Average CPI = SUM of CPI (i) * I(i) for i=1, n Instruction Count Average CPI = SUM of CPI(i) * F(i) for i = 1, n F(i) is the Instruction Frequency

  3. Example (RISC processor) Base Machine (Reg / Reg) Op Freq Cycles F(i)CPI(i) % Time ALU 50% 1 Load 20% 5 Store 10% 3 Branch 20% 2 Typical Mix

  4. Example (RISC processor) Base Machine (Reg / Reg) Op Freq Cycles F(i)CPI(i) % Time ALU 50% 1 .5 Load 20% 5 1.0 Store 10% 3 .3 Branch 20% 2 .4 2.2 = CPI ave Typical Mix

  5. Example (RISC processor) Base Machine (Reg / Reg) Op Freq Cycles F(i)CPI(i) % Time ALU 50% 1 .5 Load 20% 5 1.0 Store 10% 3 .3 Branch 20% 2 .4 2.2 = CPI ave Typical Mix CPU Time(i) = Instr Cnt(i) * CPI(i) * Clk Cycle Time CPU Time Inst Cnt * CPI ave * Clk Cycle Time % Time = F(i) * CPI(i) / CPI ave

  6. Example (RISC processor) Base Machine (Reg / Reg) Op Freq Cycles F(i)CPI(i) % Time ALU 50% 1 .5 23% Load 20% 5 1.0 45% Store 10% 3 .3 14% Branch 20% 2 .4 18% 2.2 = CPI ave Typical Mix CPU Time(i) = Instr Cnt(i) * CPI(i) * Clk Cycle Time CPU Time Inst Cnt * CPI ave * Clk Cycle Time % Time = F(i) * CPI(i) / CPI ave

  7. Example (RISC processor) Base Machine (Reg / Reg) Op Freq Cycles F(i)CPI(i) % Time ALU 50% 1 .5 23% Load 20% 5 1.0 45% Store 10% 3 .3 14% Branch 20% 2 .4 18% 2.2 = CPI ave Typical Mix How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?

  8. Example (RISC processor) Base Machine (Reg / Reg) Op Freq Cycles F(i)CPI(i) % Time ALU 50% 1 .5 23% Load 20% 5 (2) 1.0 (.4) 45% Store 10% 3 .3 14% Branch 20% 2 .4 18% 2.2 (1.6) Typical Mix How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? 2.2/1.6 = 1.375 CPU Time = Inst Cnt * CPI ave * Clk Cycle Time

  9. Example (RISC processor) Base Machine (Reg / Reg) Op Freq Cycles F(i)CPI(i) % Time ALU 50% 1 .5 23% Load 20% 5 1.0 45% Store 10% 3 .3 14% Branch 20% 2 .4 18% 2.2 Typical Mix How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? CPI = 1.6 How does this compare with using branch prediction to shave a cycle off the branch time?

  10. Example (RISC processor) Base Machine (Reg / Reg) Op Freq Cycles F(i)CPI(i) % Time ALU 50% 1 .5 23% Load 20% 5 1.0 45% Store 10% 3 .3 14% Branch 20% 2 (1) .4 (.2) 18% 2.2 (2.0) Typical Mix How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? CPI = 1.6 How does this compare with using branch prediction to shave a cycle off the branch time? CPI = 2.0

  11. Example (RISC processor) Base Machine (Reg / Reg) Op Freq Cycles F(i)CPI(i) % Time ALU 50% 1 .5 23% Load 20% 5 1.0 45% Store 10% 3 .3 14% Branch 20% 2 .4 18% 2.2 Typical Mix How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? CPI = 1.6 How does this compare with using branch prediction to shave a cycle off the branch time? CPI = 2.0 What if two ALU instructions could be executed at once?

  12. Example (RISC processor) Base Machine (Reg / Reg) Op Freq Cycles F(i)CPI(i) % Time ALU 50% 1 (.5) .5 (.25) 23% Load 20% 5 1.0 45% Store 10% 3 .3 14% Branch 20% 2 .4 18% 2.2 (1.95) Typical Mix How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? CPI = 1.6 How does this compare with using branch prediction to shave a cycle off the branch time? CPI = 2.0 What if two ALU instructions could be executed at once? CPI=1.95

  13. A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A has 1 cycle Class B has 2 cycles Class C has 3 cycles The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of CThe second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.Which sequence will be faster? How much?What is the CPI for each sequence?

  14. A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A has 1 cycle Class B has 2 cycles Class C has 3 cycles The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C 2*1+1*2+2*3 = 10 The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C. 4*1+1*2+1*3 = 9 Which sequence will be faster? How much?What is the CPI for each sequence?

  15. A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A has 1 cycle Class B has 2 cycles Class C has 3 cycles The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C 2*1+1*2+2*3 = 10 The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C. 4*1+1*2+1*3 = 9 Which sequence will be faster? How much? 10 / 9 = 1.11What is the CPI for each sequence?

  16. A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A has 1 cycle Class B has 2 cycles Class C has 3 cycles The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C 2*1+1*2+2*3 = 10 The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C. 4*1+1*2+1*3 = 9 Which sequence will be faster? How much? 10 / 9 = 1.11What is the CPI for each sequence? 10/5 = 2 9/6 = 1.5

  17. A popular performance metric is MIPS, the number of millions of instructions per second. For a given program, Instruction Count MIPS = 6 Execution time x 10

  18. A popular performance metric is MIPS, the number of millions of instructions per second. For a given program, Instruction Count MIPS = 6 Execution time x 10 • Cannot compare if instruction set is different

  19. A popular performance metric is MIPS, the number of millions of instructions per second. For a given program, Instruction Count MIPS = 6 Execution time x 10 • Cannot compare if instruction set is different • Highly dependent on the program

  20. A popular performance metric is MIPS, the number of millions of instructions per second. For a given program, Instruction Count MIPS = 6 Execution time x 10 • Cannot compare if instruction set is different • Highly dependent on the program • Can be inversely proportional to performance

  21. MIPS example Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions: Class A has 1 cycle,Class B has 2 cycles, Class C has 3 cycles Instruction counts ( billions) Code from A B C Compiler 1 5 1 1 Compiler 2 10 1 1 • Which sequence will be faster according to MIPS? • Which sequence will be faster according to execution time?

  22. MIPS example Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions: Class A Class B Class C CPI 1 2 3 Instruction counts ( billions) Code from A B C Total Compiler 1 5 1 1 7 Compiler 2 10 1 1 12 CPU cycles Exec Time MIPS Compiler 1 5+1x2+1x3=10 billion Compiler 2

  23. MIPS example Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions: Class A Class B Class C CPI 1 2 3 Instruction counts ( billions) Code from A B C Total Compiler 1 5 1 1 7 Compiler 2 10 1 1 12 CPU cycles Exec Time MIPS Compiler 1 10 billion Compiler 2 15 billion

  24. MIPS example Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions: Class A Class B Class C CPI 1 2 3 Instruction counts ( billions) Code from A B C Total Compiler 1 5 1 1 7 Compiler 2 10 1 1 12 CPU cycles Exec Time MIPS Compiler 1 10 billion 1010x10-8=100 Compiler 2 15 billion

  25. MIPS example Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions: Class A Class B Class C CPI 1 2 3 Instruction counts ( billions) Code from A B C Total Compiler 1 5 1 1 7 Compiler 2 10 1 1 12 CPU cycles Exec Time MIPS Compiler 1 10 billion 100 sec Compiler 2 15 billion 150 sec

  26. MIPS example Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions: Class A Class B Class C CPI 1 2 3 Instruction counts ( billions) Code from A B C Total Compiler 1 5 1 1 7 Compiler 2 10 1 1 12 CPU cycles Exec Time MIPS Compiler 1 10 billion 100 sec 7x103/100 Compiler 2 15 billion 150 sec

  27. MIPS example Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions: Class A Class B Class C CPI 1 2 3 Instruction counts ( billions) Code from A B C Total Compiler 1 5 1 1 7 Compiler 2 10 1 1 12 CPU cycles Exec Time MIPS Compiler 1 10 billion 100 sec 70 Compiler 2 15 billion 150 sec 12x103/150

  28. MIPS example Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions: Class A Class B Class C CPI 1 2 3 Instruction counts ( billions) Code from A B C Total Compiler 1 5 1 1 7 Compiler 2 10 1 1 12 CPU cycles Exec Time MIPS Compiler 1 10 billion 100 sec 70 Compiler 2 15 billion 150 sec 80

  29. Benchmarks • Performance best determined by running a real application • Use programs typical of expected workload • Or, typical of expected class of applications e.g., compilers/editors, scientific applications, graphics, etc.

  30. Benchmarks • Performance best determined by running a real application • Use programs typical of expected workload • Or, typical of expected class of applications e.g., compilers/editors, scientific applications, graphics, etc. • Small benchmarks • nice for architects and designers • easy to standardize • can be abused

  31. SPEC CPU 2006 • SPEC - System Performance Evaluation Cooperative • 12 Integer Benchmarks • 17 Floating Point Benchmarks • Fig 1.20 P.49

  32. Amdahl's Law Execution Time After Improvement = Execution Time Unaffected + ( Execution Time Affected / Amount of Improvement )

  33. Amdahl's Law Execution Time After Improvement = Execution Time Unaffected + ( Execution Time Affected / Amount of Improvement ) • Example: Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?

  34. Amdahl's Law Execution Time After Improvement = Execution Time Unaffected + ( Execution Time Affected / Amount of Improvement ) • Example: Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster? Improved Time = (100 – 80) + 80/n = 100/4

  35. Amdahl's Law Execution Time After Improvement = Execution Time Unaffected + ( Execution Time Affected / Amount of Improvement ) • Example: Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster? Improved Time = (100 – 80) + 80/n = 100/4 20 + 80/n = 25 80 = 5n ; n = 16

  36. Amdahl's Law Execution Time After Improvement = Execution Time Unaffected + ( Execution Time Affected / Amount of Improvement ) • Example: Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster? How about making it 5 times faster?

  37. Amdahl's Law Execution Time After Improvement = Execution Time Unaffected + ( Execution Time Affected / Amount of Improvement ) • Example: Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster? How about making it 5 times faster? Improved Time = (100 –80) + 80/n = 100/5

  38. Amdahl's Law Execution Time After Improvement = Execution Time Unaffected + ( Execution Time Affected / Amount of Improvement ) • Example: Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster? How about making it 5 times faster? Improved Time = (100 –80) + 80/n = 100/5 20 + 80/n = 20 80/n = 0 Impossible!

  39. Remember • Performance is specific to a particular program/s • Total execution time is a consistent summary of performance

  40. Remember • Performance is specific to a particular program/s • Total execution time is a consistent summary of performance • For a given architecture performance increases come from: • increases in clock rate (without adverse CPI affects)

  41. Remember • Performance is specific to a particular program/s • Total execution time is a consistent summary of performance • For a given architecture performance increases come from: • increases in clock rate (without adverse CPI affects) • improvements in processor organization that lower CPI

  42. Remember • Performance is specific to a particular program/s • Total execution time is a consistent summary of performance • For a given architecture performance increases come from: • increases in clock rate (without adverse CPI affects) • improvements in processor organization that lower CPI • compiler enhancements that lower CPI and/or instruction count

More Related