Lec 3 Sept 2 • complete Chapter 1 • exercises from Chapter 1 • quiz # 1 • Chapter 2 start
Performance Summary • Performance depends on • Algorithm: affects instruction count (IC), possibly cycles per instruction (CPI) • Programming language: affects IC, CPI • Compiler: affects IC, CPI • Instruction set architecture: affects IC, CPI, clock cycle time (Tc)
Exercise 1.2.1 For a color display using 8 bits for each primary color (R, G, B) per pixel and with a resolution of 1280 x 800 pixels, what should be the size (in bytes) of the frame buffer to store a frame? Each frame requires $1280 \times 800 \times 3 = 3{,}072{,}000$ bytes $\approx 3$ MB. If a computer has 3 GB of memory to store such frames, how many frames can be stored? $3 \times 10^9 / (3 \times 10^6) \approx 1000$ frames (976 if the unrounded frame size is used).
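The frame buffer arithmetic can be checked with a short script. This is a minimal sketch; the function name `frame_buffer_bytes` is made up for illustration, and 1 GB is taken as $10^9$ bytes, as on the slide.

```python
def frame_buffer_bytes(width, height, bits_per_channel=8, channels=3):
    """Bytes needed to store one frame (one value per color channel per pixel)."""
    return width * height * channels * bits_per_channel // 8


# Values from Exercise 1.2.1
frame_size = frame_buffer_bytes(1280, 800)

# Assumption: 3 GB = 3 * 10**9 bytes (decimal gigabytes, as on the slide)
memory_bytes = 3 * 10**9
frames_that_fit = memory_bytes // frame_size

print(frame_size, frames_that_fit)
```

The exact count is slightly below 1000 because each frame is a little larger than 3 × 10^6 bytes.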
Exercise 1.3 Consider three processors P1, P2 and P3 with the same instruction set, with clock rates and CPI given below:

        clock rate   CPI
P1      2 GHz        1.5
P2      1.5 GHz      1.0
P3      3 GHz        2.5

1.3.1. Which processor has the highest performance? Suppose the program has N instructions. Execution time = N × CPI / clock rate, so:
Time taken to execute on P1 = $1.5N / (2 \times 10^9) = 0.75N \times 10^{-9}$ s
Time taken to execute on P2 = $N / (1.5 \times 10^9) = 0.67N \times 10^{-9}$ s
Time taken to execute on P3 = $2.5N / (3 \times 10^9) = 0.83N \times 10^{-9}$ s
P2 has the best performance (since it takes the least time to execute).
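The comparison in Exercise 1.3.1 can be reproduced as follows. This is a sketch only; the dictionary layout and the helper name `time_per_instruction` are illustrative choices, not part of the exercise.

```python
# (clock rate in Hz, CPI) for each processor, from Exercise 1.3
processors = {
    "P1": (2.0e9, 1.5),
    "P2": (1.5e9, 1.0),
    "P3": (3.0e9, 2.5),
}


def time_per_instruction(clock_rate_hz, cpi):
    """Average seconds per instruction: CPI / clock rate."""
    return cpi / clock_rate_hz


# For a program of N instructions the total time is N times this value,
# so the smallest time per instruction means the highest performance.
times = {name: time_per_instruction(*spec) for name, spec in processors.items()}
fastest = min(times, key=times.get)

for name, t in times.items():
    print(f"{name}: {t * 1e9:.2f} ns per instruction")
print("Highest performance:", fastest)
```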
Exercise 1.3.2. If the processors P1, P2 and P3 of Exercise 1.3 each execute a program in 10 seconds, find the number of cycles and the number of instructions. For P1: number of cycles = time × clock rate = $10 \times 2 \times 10^9 = 2 \times 10^{10}$. Time taken to execute on P1 = $1.5 N_1 / (2 \times 10^9) = 0.75 N_1 \times 10^{-9} = 10$ s, so $N_1 = 1.33 \times 10^{10}$ instructions.
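The same steps apply to all three processors. The sketch below assumes the clock rates and CPI values from Exercise 1.3; the helper name `cycles_and_instructions` is hypothetical.

```python
# (clock rate in Hz, CPI) for each processor, from Exercise 1.3
processors = {
    "P1": (2.0e9, 1.5),
    "P2": (1.5e9, 1.0),
    "P3": (3.0e9, 2.5),
}


def cycles_and_instructions(clock_rate_hz, cpi, seconds):
    """Return (clock cycles, instruction count) for a run of the given length."""
    cycles = clock_rate_hz * seconds   # cycles = time x clock rate
    instructions = cycles / cpi        # instructions = cycles / CPI
    return cycles, instructions


results = {
    name: cycles_and_instructions(rate, cpi, seconds=10.0)
    for name, (rate, cpi) in processors.items()
}

for name, (cycles, instructions) in results.items():
    print(f"{name}: {cycles:.3e} cycles, {instructions:.3e} instructions")
```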
Exercise 1.4.3 Given below are the number of instructions of a program:

arith   store   load   branch   total
500     50      100    50       700

Assuming the instructions take 1, 5, 5 and 2 cycles respectively, what is the execution time on a 2 GHz processor? Solution: execution time = cycle time × CPI × no. of instructions. Cycle time = $1/(2 \times 10^9)$ s = 0.5 ns. CPI = $(500 \times 1 + 50 \times 5 + 100 \times 5 + 50 \times 2)/700 = 1350/700 \approx 1.93$. So the total time = $0.5 \times 10^{-9} \times 1.93 \times 700 = 675 \times 10^{-9}$ s.
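The weighted-CPI calculation can be written out directly. This is a minimal sketch using the instruction counts and cycle costs from Exercise 1.4.3; the variable names are illustrative.

```python
# Instruction counts and cycles per instruction class, from Exercise 1.4.3
counts = {"arith": 500, "store": 50, "load": 100, "branch": 50}
cycles_per_class = {"arith": 1, "store": 5, "load": 5, "branch": 2}

CLOCK_RATE_HZ = 2.0e9  # 2 GHz processor

total_instructions = sum(counts.values())
total_cycles = sum(counts[k] * cycles_per_class[k] for k in counts)

average_cpi = total_cycles / total_instructions
exec_time = total_cycles / CLOCK_RATE_HZ  # same as IC x CPI x cycle time

print(f"CPI = {average_cpi:.2f}, time = {exec_time * 1e9:.0f} ns")
```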
Exercise 1.6 • Compilers have a profound impact on the performance of an application on a given processor. This problem explores the impact compilers have on execution time.

              Compiler A                            Compiler B
       no. of instructions   exec. time     no. of instructions   exec. time
(a)    $1.0 \times 10^9$     1 s            $1.2 \times 10^9$     1.4 s
(b)    $1.4 \times 10^9$     0.8 s          $1.2 \times 10^9$     0.7 s

Find the average CPI for each program given that the processor has a cycle time of 1 ns. Exec. time = CPI × cycle time × no. of instructions. (a) Compiler A: CPI = $1 / (10^{-9} \times 10^9) = 1$
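The remaining cases follow the same rearrangement, CPI = exec. time / (cycle time × no. of instructions). The sketch below applies it to all four table entries; the dictionary keys and the name `average_cpi` are illustrative.

```python
CYCLE_TIME_S = 1e-9  # 1 ns cycle time, from the exercise

# (program, compiler) -> (instruction count, execution time in seconds)
measurements = {
    ("a", "A"): (1.0e9, 1.0),
    ("a", "B"): (1.2e9, 1.4),
    ("b", "A"): (1.4e9, 0.8),
    ("b", "B"): (1.2e9, 0.7),
}


def average_cpi(instruction_count, exec_time_s, cycle_time_s=CYCLE_TIME_S):
    """Rearranged from: exec time = CPI x cycle time x instruction count."""
    return exec_time_s / (cycle_time_s * instruction_count)


cpis = {key: average_cpi(*value) for key, value in measurements.items()}

for (program, compiler), cpi in cpis.items():
    print(f"({program}) compiler {compiler}: CPI = {cpi:.2f}")
```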
Power Trends §1.5 The Power Wall • In CMOS IC technology: Power = Capacitive load × Voltage² × Frequency • Power: ×30 • Voltage: 5V → 1V • Frequency: ×1000
Reducing Power • Suppose a new CPU has • 85% of capacitive load of old CPU • 15% voltage and 15% frequency reduction • The power wall • We can’t reduce voltage further • We can’t remove more heat • How else can we improve performance?
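The effect of the new CPU can be estimated with the CMOS dynamic power relation, power = capacitive load × voltage² × frequency. The sketch below works in ratios relative to the old CPU, so the old values are all set to 1; the function name `dynamic_power` is hypothetical.

```python
def dynamic_power(capacitive_load, voltage, frequency):
    """Dynamic power: capacitive load x voltage squared x frequency."""
    return capacitive_load * voltage**2 * frequency


# Old CPU normalized to 1 in every factor
old_power = dynamic_power(1.0, 1.0, 1.0)

# New CPU: 85% of the capacitive load, 15% lower voltage, 15% lower frequency
new_power = dynamic_power(0.85, 0.85, 0.85)

ratio = new_power / old_power  # equals 0.85 ** 4
print(f"new / old power = {ratio:.2f}")
```

Under these assumptions the new CPU uses roughly half the power of the old one.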
Exercise 1.7.4. Given the following information about each processor, calculate its capacitive load: Processor 80286: clock rate = 12.5 MHz, power = 3.3 W, voltage = 5 V. Solution: Use the equation power = capacitive load × voltage² × clock rate. Capacitive load = $3.3 / (5 \times 5 \times 12.5 \times 10^6) = 0.01056 \times 10^{-6}$ F $\approx 1.06 \times 10^{-8}$ F
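Rearranging the power equation for capacitive load gives a one-line calculation. This is a sketch with the 80286 values from the exercise; the function name is illustrative.

```python
def capacitive_load(power_w, voltage_v, clock_rate_hz):
    """Rearranged from: power = capacitive load x voltage^2 x clock rate."""
    return power_w / (voltage_v**2 * clock_rate_hz)


# Processor 80286: 3.3 W, 5 V, 12.5 MHz
load_farads = capacitive_load(3.3, 5.0, 12.5e6)
print(f"capacitive load = {load_farads:.3e} F")
```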
Uniprocessor Performance §1.6 The Sea Change: The Switch to Multiprocessors Constrained by power, instruction-level parallelism, memory latency
Multiprocessors • General-purpose uni-cores have reached limits of historic performance scaling • Power consumption • Wire delays • DRAM access latency • Diminishing returns of more instruction-level parallelism (Slide from Prof. Saman Amarasinghe)
Multiprocessors • Multicore microprocessors • More than one processor per chip • Requires explicitly parallel programming • Compare with instruction level parallelism • Hardware executes multiple instructions at once • Hidden from the programmer • Hard to do • Programming for performance • Load balancing • Optimizing communication and synchronization
Manufacturing ICs • Yield: proportion of working dies per wafer §1.7 Real Stuff: The AMD Opteron X4
Integrated Circuit Cost • Nonlinear relation to area and defect rate • Wafer cost and area are fixed • Defect rate determined by manufacturing process • Die area determined by architecture and circuit design
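The nonlinear dependence on area and defect rate can be illustrated with a short sketch. It assumes the usual textbook approximations: cost per die = cost per wafer / (dies per wafer × yield), dies per wafer ≈ wafer area / die area, and yield = 1 / (1 + defects per area × die area / 2)². All numeric inputs below are hypothetical and chosen only to show the trend.

```python
def dies_per_wafer(wafer_area, die_area):
    """Approximation that ignores partial dies at the wafer edge."""
    return wafer_area / die_area


def die_yield(defects_per_area, die_area):
    """Assumed yield model: 1 / (1 + defects per area x die area / 2)^2."""
    return 1.0 / (1.0 + defects_per_area * die_area / 2.0) ** 2


def cost_per_die(wafer_cost, wafer_area, die_area, defects_per_area):
    """Cost per wafer divided by the number of working dies."""
    good_dies = dies_per_wafer(wafer_area, die_area) * die_yield(
        defects_per_area, die_area
    )
    return wafer_cost / good_dies


# Hypothetical inputs: areas in mm^2, defects per mm^2
for die_area in (50.0, 100.0, 200.0):
    cost = cost_per_die(5000.0, 70000.0, die_area, 0.002)
    print(f"die area {die_area:.0f} mm^2 -> cost per die {cost:.2f}")
```

Doubling the die area more than doubles the cost per die, because fewer dies fit on the wafer and a smaller proportion of them work.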
SPEC CPU Benchmark • Programs used to measure performance • Supposedly typical of actual workload • Standard Performance Evaluation Corp (SPEC) • Develops benchmarks for CPU, I/O, Web, … • SPEC CPU2006 • Elapsed time to execute a selection of programs • Negligible I/O, so focuses on CPU performance • Normalize relative to reference machine • Summarize as geometric mean of performance ratios • CINT2006 (integer) and CFP2006 (floating-point)
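The summary step can be sketched as follows: each benchmark's ratio is reference time divided by measured time, and the ratios are combined with a geometric mean. The benchmark times below are hypothetical placeholders, not SPEC results.

```python
import math


def spec_ratio(reference_time_s, measured_time_s):
    """Performance ratio relative to the reference machine (higher is better)."""
    return reference_time_s / measured_time_s


def geometric_mean(ratios):
    """n-th root of the product of n ratios."""
    return math.prod(ratios) ** (1.0 / len(ratios))


# Hypothetical (reference time, measured time) pairs in seconds
timings = [(9000.0, 600.0), (8000.0, 1000.0), (12000.0, 400.0)]

ratios = [spec_ratio(ref, measured) for ref, measured in timings]
print(f"summary ratio = {geometric_mean(ratios):.2f}")
```

A geometric mean is used so that the comparison between two machines does not depend on which machine is chosen as the reference.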
CINT2006 for Opteron X4 2356 High cache miss rates
Amdahl’s Law $s = \dfrac{1}{f + (1 - f)/p} \le \min(p, 1/f)$, where f = fraction unaffected and p = speedup of the rest. Amdahl’s law: the speedup achieved if a fraction f of a task is unaffected and the remaining 1 – f part runs p times as fast.
Amdahl’s Law in design Example • A processor spends 30% of its time on floating-point (flp) addition, 25% on flp multiplication, and 10% on flp division. Evaluate the following enhancements, each costing the same to implement: • Redesign the flp adder to make it twice as fast. • Redesign the flp multiplier to make it three times as fast. • Redesign the flp divider to make it 10 times as fast. Solution • Adder redesign speedup = 1 / [0.7 + 0.3 / 2] = 1.18 • Multiplier redesign speedup = 1 / [0.75 + 0.25 / 3] = 1.20 • Divider redesign speedup = 1 / [0.9 + 0.1 / 10] = 1.10 • What if both the adder and the multiplier are redesigned?
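The three cases, and the closing question about combining the adder and multiplier redesigns, can be checked with a small generalization of Amdahl’s law to several independent enhancements. This is a sketch; the function name and the list-of-pairs input format are illustrative, and each pair gives the fraction of time affected together with its speedup factor.

```python
def amdahl_speedup(enhancements):
    """Overall speedup for a list of (fraction_of_time, speedup_factor) pairs.

    The fractions must not overlap; whatever is not listed is unaffected.
    """
    affected = sum(fraction for fraction, _ in enhancements)
    unaffected = 1.0 - affected
    new_time = unaffected + sum(
        fraction / factor for fraction, factor in enhancements
    )
    return 1.0 / new_time


adder = amdahl_speedup([(0.30, 2)])
multiplier = amdahl_speedup([(0.25, 3)])
divider = amdahl_speedup([(0.10, 10)])
adder_and_multiplier = amdahl_speedup([(0.30, 2), (0.25, 3)])

print(f"adder {adder:.2f}, multiplier {multiplier:.2f}, divider {divider:.2f}")
print(f"adder + multiplier {adder_and_multiplier:.2f}")
```

With both the adder and the multiplier redesigned, the new time is 0.45 + 0.15 + 0.083 ≈ 0.683 of the original, a speedup of about 1.46.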
Amdahl’s Law – limit to improvement §1.8 Fallacies and Pitfalls • Pitfall: improving an aspect of a computer and expecting a proportional improvement in overall performance • Example: multiply accounts for 80 s of a 100 s run • How much improvement in multiply performance is needed to get 5× overall? • A 5× speedup means a total of 20 s, so 20 = 80/n + 20, which requires 80/n = 0 • Can’t be done! • Corollary: make the common case fast
Pitfall: MIPS as a Performance Metric • MIPS: Millions of Instructions Per Second • Doesn’t account for • Differences in ISAs between computers • Differences in complexity between instructions • CPI varies between programs on a given CPU
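The dependence on CPI is visible once MIPS is written as instruction count / (execution time × 10^6), which simplifies to clock rate / (CPI × 10^6). The sketch below evaluates one CPU on two programs; the clock rate and CPI values are hypothetical.

```python
def mips(clock_rate_hz, cpi):
    """MIPS = instruction count / (execution time x 10^6)
            = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


CLOCK_RATE_HZ = 2.0e9  # hypothetical 2 GHz CPU

# Hypothetical CPI values for two different programs on the same CPU
mips_program_1 = mips(CLOCK_RATE_HZ, cpi=1.0)
mips_program_2 = mips(CLOCK_RATE_HZ, cpi=2.5)

print(f"program 1: {mips_program_1:.0f} MIPS")
print(f"program 2: {mips_program_2:.0f} MIPS")
```

The same CPU reports two different MIPS ratings, so a single MIPS figure says little without knowing the program and the instruction set.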
Concluding Remarks • Cost/performance is improving • Due to underlying technology development • Hierarchical layers of abstraction • In both hardware and software • Instruction set architecture • The hardware/software interface • Execution time: the best performance measure • Power is a limiting factor • Use parallelism to improve performance §1.9 Concluding Remarks