Computer Architecture CSE 3322
Send email to Pramod Kumar, pxk3008@exchange.uta.edu, with the names and emails of your four project team members by Mon Sept 15. If not on a team, send your email address to Pramod.
Web Site: crystal.uta.edu/~jpatters/cse3322
CPI “Average clock cycles per instruction”
• CPI = Clock Cycles / Instruction Count
      = (CPU Time * Clock Rate) / Instruction Count
• CPU Time = Instruction Count x CPI / Clock Rate
           = Instruction Count x CPI x Clock Cycle Time
• Average CPI = [ SUM of CPI(i) * I(i) for i = 1, n ] / Instruction Count
• Average CPI = SUM of CPI(i) * F(i) for i = 1, n, where F(i) = I(i) / Instruction Count is the Instruction Frequency
Invest resources where time is spent!
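The formulas above can be checked numerically. The following is a minimal Python sketch, not taken from the slides: the function names `cpu_time` and `average_cpi` and the sample instruction counts are illustrative assumptions.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU Time = Instruction Count x CPI / Clock Rate, in seconds."""
    return instruction_count * cpi / clock_rate_hz


def average_cpi(cycles_per_class, counts_per_class):
    """Average CPI = SUM of CPI(i) * I(i), divided by Instruction Count."""
    total_instructions = sum(counts_per_class)
    total_cycles = sum(c * n for c, n in zip(cycles_per_class, counts_per_class))
    return total_cycles / total_instructions


# Hypothetical program with three instruction classes (assumed values).
cycles = [1, 2, 3]          # CPI(i) for each class
counts = [500, 300, 200]    # I(i) for each class
cpi = average_cpi(cycles, counts)
seconds = cpu_time(sum(counts), cpi, 100e6)  # assumed 100 MHz clock

print(f"Average CPI = {cpi:.2f}")
print(f"CPU Time    = {seconds:.3e} s")
```

Dividing each count by the total gives F(i), so the same average can be computed from frequencies alone, without knowing the absolute instruction count.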
CPI Example
Suppose we have two implementations of the same instruction set. For some program with instruction count I:
Machine A has a clock cycle time of 10 ns and a CPI of 2.0: CPU Time = I * 2.0 * 10 ns = I * 20 ns
Machine B has a clock cycle time of 20 ns and a CPI of 1.2: CPU Time = I * 1.2 * 20 ns = I * 24 ns
Which machine is faster for this program, and by how much?
A is 24/20 = 1.2 times faster than B.
Note: CPI is smaller for B, yet B is slower because its clock cycle is twice as long.
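As a quick check of this comparison, here is a small Python sketch. The helper name `time_per_instruction_ns` is an assumption made for illustration; the instruction count I cancels out because both machines run the same program.

```python
def time_per_instruction_ns(cpi, cycle_time_ns):
    """Average time per instruction = CPI x Clock Cycle Time."""
    return cpi * cycle_time_ns


time_a = time_per_instruction_ns(2.0, 10)  # Machine A
time_b = time_per_instruction_ns(1.2, 20)  # Machine B

# The instruction count is identical on both machines, so the ratio of
# per-instruction times equals the ratio of CPU times.
speedup_a_over_b = time_b / time_a
print(f"A: {time_a:.0f} ns/instr, B: {time_b:.0f} ns/instr")
print(f"A is {speedup_a_over_b:.2f} times faster than B")
```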
Example (RISC processor)
Base Machine (Reg / Reg), Typical Mix

Op       Freq   Cycles   F(i)*CPI(i)     % Time
ALU      50%    1        0.5             23%
Load     20%    5        1.0             45%
Store    10%    3        0.3             14%
Branch   20%    2        0.4             18%
Total                    2.2 = CPI ave

% Time = CPU Time(i) / CPU Time
       = (Instr Cnt(i) * CPI(i) * Clk Cycle Time) / (Inst Cnt * CPI ave * Clk Cycle Time)
       = F(i) * CPI(i) / CPI ave
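The table can be reproduced with a few lines of Python. This is a sketch under the slide's numbers; the dictionary layout and the names `cpi_average` and `time_share` are illustrative assumptions.

```python
# Instruction mix from the slide: op -> (frequency, cycles).
mix = {
    "ALU":    (0.50, 1),
    "Load":   (0.20, 5),
    "Store":  (0.10, 3),
    "Branch": (0.20, 2),
}


def cpi_average(mix):
    """CPI ave = SUM of F(i) * CPI(i)."""
    return sum(freq * cycles for freq, cycles in mix.values())


def time_share(mix):
    """% Time for each op = F(i) * CPI(i) / CPI ave."""
    cpi_ave = cpi_average(mix)
    return {op: freq * cycles / cpi_ave for op, (freq, cycles) in mix.items()}


shares = time_share(mix)
print(f"CPI ave = {cpi_average(mix):.1f}")
for op, share in shares.items():
    print(f"{op:7s} {share:6.1%}")
```

Loads are only 20% of the instructions but account for the largest share of the time, which is why the slide says to invest resources where time is spent.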
Example (RISC processor)
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
Load: Cycles 5 -> 2, so F(i)*CPI(i) goes from 1.0 to 0.4 and CPI ave goes from 2.2 to 1.6.
Since CPU Time = Inst Cnt * CPI ave * Clk Cycle Time, the speedup is 2.2 / 1.6 = 1.375.
Example (RISC processor)
How does this compare with using branch prediction to shave a cycle off the branch time?
Branch: Cycles 2 -> 1, so F(i)*CPI(i) goes from 0.4 to 0.2 and CPI ave goes from 2.2 to 2.0 (versus CPI = 1.6 for the better data cache).
Example (RISC processor)
What if two ALU instructions could be executed at once?
ALU: Cycles 1 -> 0.5, so F(i)*CPI(i) goes from 0.5 to 0.25 and CPI ave goes from 2.2 to 1.95.
Summary: better data cache, CPI = 1.6; branch prediction, CPI = 2.0; two ALU instructions at once, CPI = 1.95.
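The three what-if questions can be compared side by side. The sketch below assumes the instruction count and clock cycle time stay fixed, so speedup is simply the ratio of average CPIs; the helper `with_cycles` and the scenario labels are illustrative names, not from the slides.

```python
# Base machine mix: op -> (frequency, cycles).
base = {
    "ALU":    (0.50, 1.0),
    "Load":   (0.20, 5.0),
    "Store":  (0.10, 3.0),
    "Branch": (0.20, 2.0),
}


def cpi_average(mix):
    """CPI ave = SUM of F(i) * CPI(i)."""
    return sum(freq * cycles for freq, cycles in mix.values())


def with_cycles(mix, op, new_cycles):
    """Return a copy of the mix with one op's cycle count replaced."""
    changed = dict(mix)
    freq, _ = changed[op]
    changed[op] = (freq, new_cycles)
    return changed


scenarios = {
    "better data cache (Load = 2 cycles)": with_cycles(base, "Load", 2.0),
    "branch prediction (Branch = 1 cycle)": with_cycles(base, "Branch", 1.0),
    "two ALU ops at once (ALU = 0.5 cycle)": with_cycles(base, "ALU", 0.5),
}

base_cpi = cpi_average(base)
results = {name: cpi_average(mix) for name, mix in scenarios.items()}
for name, cpi in results.items():
    print(f"{name}: CPI = {cpi:.2f}, speedup = {base_cpi / cpi:.3f}")
```

The data cache change gives the largest gain because it targets the operation with the largest share of execution time.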
A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions:
Class A takes 1 cycle
Class B takes 2 cycles
Class C takes 3 cycles
The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C. Cycles: 2*1 + 1*2 + 2*3 = 10
The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C. Cycles: 4*1 + 1*2 + 1*3 = 9
Which sequence will be faster? How much? The second, by 10 / 9 = 1.11
What is the CPI for each sequence? First: 10/5 = 2. Second: 9/6 = 1.5
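The same arithmetic in a short Python sketch; the names `CYCLES`, `total_cycles`, and `cpi` are assumptions chosen for illustration.

```python
# Cycles per instruction class, as given on the slide.
CYCLES = {"A": 1, "B": 2, "C": 3}


def total_cycles(sequence):
    """Sum of (instruction count x cycles) over all classes."""
    return sum(CYCLES[cls] * count for cls, count in sequence.items())


def cpi(sequence):
    """CPI = total cycles / instruction count."""
    return total_cycles(sequence) / sum(sequence.values())


seq1 = {"A": 2, "B": 1, "C": 2}  # 5 instructions
seq2 = {"A": 4, "B": 1, "C": 1}  # 6 instructions

for name, seq in (("first", seq1), ("second", seq2)):
    print(f"{name}: {total_cycles(seq)} cycles, CPI = {cpi(seq):.2f}")
print(f"second is {total_cycles(seq1) / total_cycles(seq2):.2f} times faster")
```

The second sequence executes more instructions yet finishes in fewer cycles, so instruction count alone does not predict performance.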
A popular performance metric is MIPS, the number of millions of instructions per second. For a given program,
MIPS = Instruction Count / (Execution Time x 10^6)
• Cannot compare if instruction set is different
• Highly dependent on the program
• Can be inversely proportional to performance
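One way to see why MIPS depends so heavily on the program is to substitute the CPU time formula into the definition. The derivation below is a sketch that assumes Execution Time equals CPU Time = Instruction Count x CPI / Clock Rate.

```latex
\begin{align*}
\text{MIPS} &= \frac{\text{Instruction Count}}{\text{Execution Time} \times 10^{6}} \\
            &= \frac{\text{Instruction Count}}
                    {\dfrac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}} \times 10^{6}} \\
            &= \frac{\text{Clock Rate}}{\text{CPI} \times 10^{6}}
\end{align*}
```

The instruction count cancels, so on a fixed clock the MIPS rating tracks only the average CPI of the instruction mix. A program padded with cheap instructions gets a better rating even when it takes longer to run.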
MIPS example
Two different compilers are being tested for a 100 MHz machine with three different classes of instructions: Class A has 1 cycle, Class B has 2 cycles, Class C has 3 cycles.

Instruction counts (billions)
Code from     A     B     C
Compiler 1    5     1     1
Compiler 2    10    1     1

• Which sequence will be faster according to MIPS?
• Which sequence will be faster according to execution time?
MIPS example
Two different compilers are being tested for a 100 MHz machine with three different classes of instructions:

       Class A   Class B   Class C
CPI    1         2         3

Instruction counts (billions)
Code from     A     B     C     Total
Compiler 1    5     1     1     7
Compiler 2    10    1     1     12

              CPU cycles                        Exec Time                       MIPS
Compiler 1    5 + 1x2 + 1x3 = 10 billion        10^10 x 10^-8 = 100 sec         7x10^3 / 100 = 70
Compiler 2    10 + 1x2 + 1x3 = 15 billion       1.5x10^10 x 10^-8 = 150 sec     12x10^3 / 150 = 80

Compiler 2 has the higher MIPS rating, but Compiler 1 has the shorter execution time.
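The table can be recomputed with the sketch below, which uses the slide's numbers. The function name `evaluate` and the dictionary layout are illustrative assumptions.

```python
CLOCK_RATE_HZ = 100e6              # 100 MHz machine
CPI = {"A": 1, "B": 2, "C": 3}     # cycles per instruction class


def evaluate(counts):
    """Return (execution time in seconds, MIPS) for per-class instruction counts."""
    instructions = sum(counts.values())
    cycles = sum(CPI[cls] * n for cls, n in counts.items())
    exec_time = cycles / CLOCK_RATE_HZ
    mips = instructions / (exec_time * 1e6)
    return exec_time, mips


compiler1 = {"A": 5e9, "B": 1e9, "C": 1e9}
compiler2 = {"A": 10e9, "B": 1e9, "C": 1e9}

time1, mips1 = evaluate(compiler1)
time2, mips2 = evaluate(compiler2)
print(f"Compiler 1: {time1:.0f} s, {mips1:.0f} MIPS")
print(f"Compiler 2: {time2:.0f} s, {mips2:.0f} MIPS")
```

Compiler 2 gets the higher MIPS rating while taking longer to run, which is the case where MIPS moves in the opposite direction from performance.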
Benchmarks
• Performance best determined by running a real application
  • Use programs typical of expected workload
  • Or, typical of expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.
• Small benchmarks
  • nice for architects and designers
  • easy to standardize
  • can be abused
Benchmarks
• SPEC (System Performance Evaluation Cooperative)
  • companies have agreed on a set of real programs and inputs
  • can still be abused
  • valuable indicator of performance (and compiler technology)
SPEC95
• Eighteen application benchmarks (with inputs) reflecting a technical computing workload
• Eight integer applications
  • go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
• Ten floating-point intensive applications
  • tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
• Must run with standard compiler flags
  • eliminate special undocumented incantations that may not even generate working code for real programs
Amdahl's Law
Execution Time After Improvement = Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)
• Example: Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?
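The example can be worked by solving the formula for the amount of improvement. In the Python sketch below, the function names `time_after_improvement` and `required_improvement` are assumptions made for illustration; only the formula itself comes from the slide.

```python
def time_after_improvement(unaffected, affected, improvement):
    """Amdahl's Law: new time = unaffected + affected / improvement."""
    return unaffected + affected / improvement


def required_improvement(total, affected, target_speedup):
    """Solve Amdahl's Law for the improvement factor that hits a target speedup."""
    unaffected = total - affected
    target_time = total / target_speedup
    if target_time <= unaffected:
        # The unaffected part alone already takes at least the target time.
        raise ValueError("target speedup is unreachable")
    return affected / (target_time - unaffected)


# Slide example: 100 s total, 80 s in multiply, want 4 times faster.
# Target time is 100 / 4 = 25 s, so 25 = 20 + 80 / n, giving n = 80 / 5.
n = required_improvement(100, 80, 4)
print(f"multiply must be {n:.0f} times faster")
print(f"check: {time_after_improvement(20, 80, n):.0f} s")
```

Multiplication has to become 16 times faster to make the whole program 4 times faster. Asking for 5 times faster would raise the error, because the 20 seconds that multiply does not touch already equals the target time.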