ENGR9859 Computer Engineering Fundamentals – Computer Architecture Fall 2006 R. Venkatesan EN-4036 737-8900 venky@engr.mun.ca
Course details • Objective: Review basic computer architecture topics and thus prepare students for ENGR9861 (high-performance computer architecture) • Course evaluation: final exam (60%), assignments (2 x 20%) • Classes: Mondays 11 – 12, EN-4033 • Problem sets: 3 (selected problems in the first 2 are to be submitted) • Tutorials: Thursdays 11 – 12, EN-4033, Wang Guan • Plagiarism: consult the course information sheet • Textbook: Hennessy & Patterson, Computer Architecture: A Quantitative Approach • Syllabus: Chapters 1, 2 & 5 in the textbook • Computer organization • Measuring performance, speedup • Instruction set architecture, MIPS • Memory organization: cache, main memory, virtual memory • Course notes: will be made available on the web
Computer Organization • Input & output units: microphone, keyboard, mouse, monitor, speaker, etc. • CPU: processor (microprocessor) that includes the ALU, registers, internal buses, cache and controller • Program Counter: contains the address of the next instruction to be executed • General purpose registers: temporary storage (not memory) • Memory unit: kernel of the operating system in ROM, higher levels of cache, DRAM (main memory), disks • Network: LAN, WLAN, WAN; Ethernet, Internet, ATM
Ways to improve the performance of a computer • Use faster materials: silicon, GaAs, InP • Use faster technology: photochemical lithography • Employ better architecture within one processor • Selection of instruction set: RISC/CISC, VLIW • Cache (levels of cache): higher throughput • Virtual memory: relocatability, security • Pipelining: k stages gives a maximum speedup of k • Superpipelining • Superscalar (multiple pipelines) with dynamic scheduling • Branch prediction • Use multiple processors: emphasis of ENGR9861 • Scalability, level of parallelism • Shared memory, array processing, multicomputers, MPP • Employ better software: compilers, etc.
Speedup • Any (architectural) enhancement will hopefully lead to better performance, and speedup is a measure of this improvement. • Performance improvement should be based on the total CPU time taken to execute the application, and not just any of the component times like memory access time or clock period. • If the whole processor is replicated, then the fraction enhanced is 100%, as the whole computation will be impacted. • If an enhancement affects only a part of the computation, then we need to determine the fraction of the CPU time impacted by the enhancement.
Amdahl’s Law • This simple but important law tells us always to aim for enhancements that affect a large fraction of the computation, if not the whole computation. • Overall speedup = 1 / ((1 - fraction enhanced) + fraction enhanced / speedup of the enhanced fraction)
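As a minimal sketch, the law can be written in Python (the helper name amdahl_speedup and the sample numbers are ours, for illustration only):

    def amdahl_speedup(fraction_enhanced, speedup_enhanced):
        # Overall speedup when only part of the computation is sped up.
        return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

    # Speeding up 70% of the computation by a factor of 5:
    print(amdahl_speedup(0.7, 5))   # ~2.27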
CPU (Computation) time • CPU time is the product of three quantities: • Number of instructions executed, or Instruction Count (IC): remember this is not the code (program) size • Average number of clock cycles per instruction (CPI): if CPI varies for different instructions, a weighted average is needed • Clock period (τ) • CPU time = IC * CPI * τ • An architectural (or compiler-based) enhancement aimed at decreasing one of the above three factors might end up increasing one or both of the other two. It is the product of the three quantities after applying the enhancement that gives the new CPU time.
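A tiny sketch of the product in Python, with hypothetical figures chosen only to show the units working out:

    ic = 2_000_000        # instructions executed (hypothetical)
    cpi = 1.5             # average clock cycles per instruction (hypothetical)
    tau = 0.5e-9          # clock period in seconds, i.e. a 2 GHz clock (hypothetical)
    cpu_time = ic * cpi * tau
    print(cpu_time)       # 0.0015 s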
Measuring performance of computers • CPU clock speed is an indicator, but not sufficient • CPU speed, memory speed, other devices, system configuration: give a better idea, but still not sufficient • Metrics such as millions of instructions per second (MIPS), billions of floating-point operations per second (GFLOPS), trillions of operations per second (TOPS), thousands of polygons per second (kpolys), millions of network transactions per second, etc. give only a partial picture – they leave out the instruction count (IC) needed for the task • Benchmark programs: real applications such as gcc and TeX; toy benchmarks such as the sieve of Eratosthenes and Puzzle; kernels such as the Livermore loops and Linpack; synthetic benchmarks such as Whetstone, Dhrystone, Dhampstone • Benchmark suites such as SPECint95 and SPEC CPU2000 offer collections of the above. Performance is compared with a selected standard (using the geometric mean) and given as a dimensionless number.
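A sketch of how a SPEC-style summary number is formed from per-benchmark ratios; the ratios below are made up for illustration, not real SPEC results:

    import math

    # Ratio of reference-machine time to measured time for each benchmark (made up).
    ratios = [1.8, 2.4, 0.9, 3.1]
    summary = math.prod(ratios) ** (1 / len(ratios))   # geometric mean
    print(summary)                                     # dimensionless figure of merit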
Speedup example • Three enhancements for different parts of the computation are contemplated, with speedups of 40, 20 and 5, respectively. E1 improves 20%, E2 improves 30% and E3 improves 70% of the computation. Assuming all three cost the same, which is the best choice? • Speedup due to E1 = 1 / ((1-0.2) + 0.2/40) = 1.242 • Speedup due to E2 = 1 / ((1-0.3) + 0.3/20) = 1.399 • Speedup due to E3 = 1 / ((1-0.7) + 0.7/5) = 2.272 • So, a higher fraction enhanced is more beneficial than a huge speedup for a small fraction. • So, the frequency of execution of different instructions becomes important – statistics.
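The same arithmetic as a quick check in Python, using the fractions and speedups from the bullets above:

    for name, f, s in [("E1", 0.2, 40), ("E2", 0.3, 20), ("E3", 0.7, 5)]:
        speedup = 1 / ((1 - f) + f / s)     # Amdahl's law
        print(name, round(speedup, 2))      # E1 1.24, E2 1.4, E3 2.27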
RISC/CISC • Complex instruction set computers have thousands of instructions, whereas RISC processors are designed to execute about 100 instructions efficiently. • Most processors now have both RISC and CISC features. However, architects and designers should remember Amdahl’s law and implement the most frequently executed instructions efficiently in hardware. Some of the rarely executed instructions can even be microcoded. • Simplicity and uniformity (consistency and orthogonality) in architectural choices lead to hardware with a short critical path (worst-case propagation delay), resulting in high performance.
Instruction types • Arithmetic & logic (ALU) instructions • Only register operands • Register operands plus one immediate operand (encoded within the instruction) • Memory access instructions: load & store • Transfer-of-control instructions: branches and jumps that are conditional or unconditional, procedure calls, returns, interrupts, exceptions, traps, etc. • Special instructions: flag setting, etc. • Floating-point instructions • Graphics, digital signal processing (DSP) instructions • Input/output instructions: memory-mapped I/O? • Other instructions
Instruction occurrence • ALU (including ALU immediates): 25-40% • Loads: 20-30% • Stores: 10-15% • Conditional branches: 15-30% • Other instructions: >2% • Data processing applications use loads and stores more frequently than ALU instructions; numeric (scientific & engineering) applications use more ALU instructions. • Graphics processors, network processors, DSPs and vector coprocessors will have a different mix.
Size for word, instruction & memory access • The most common word sizes for general-purpose CPUs are 32 bits and 64 bits; even now, 8-bit and 16-bit microcontrollers are common. • Instruction size is variable or fixed, the latter being common in high-performance processors; it can be smaller than, equal to or larger than the word size. • Most GP CPUs use 64-bit words and 32-bit instructions; very long instruction word (VLIW) processors are emerging. • Memory access, especially writing, needs to support byte (8-bit) granularity because of the character data type. That is why memory is byte-addressable, even when the word size is larger.
Memory organization • LSB is bit 0 or MSB is bit 0: only a convention. • Every byte of memory has a unique address, so each word spans several addresses. For example, in a 64-bit computer each word spans 8 addresses – and they must be 8 consecutive addresses. • Little endian (the lowest address holds the LSB of a word) or big endian: only a convention • Aligned or non-aligned memory access: • Aligned: possible gaps in memory but fast and simple hardware • Non-aligned: efficient storage but complex and slow hardware • Most (but not all) modern processors require that the size of all operands in ALU instructions equal the word size – again, for fast and simple hardware.
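A small Python sketch (using the standard struct module) showing the two byte-order conventions on the same 64-bit value; the value itself is arbitrary:

    import struct

    value = 0x0102030405060708
    little = struct.pack("<Q", value)   # little endian: lowest address gets the LSB
    big = struct.pack(">Q", value)      # big endian: lowest address gets the MSB
    print(little.hex())                 # 0807060504030201
    print(big.hex())                    # 0102030405060708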
A typical modern CPU’s instruction set • MIPS-64 follows a GPR load-store architecture. • Uses 64-bit words, 32-bit instructions. • Has 32 64-bit GPRs; R0 is read-only and holds 0; R31 is used with procedure calls; 32 more 32-bit FPRs that can be accessed in pairs as 64-bit entities. • Instruction formats: • R-type: 6-bit opcode, 3x5-bit register ids, 5-bit shift amount, 6-bit function. This is used by ALU reg-reg instructions • I-type: 6-bit opcode, 2x5-bit register ids, 16-bit disp./imm. This is used by ALU reg-imm., load, store, cond. branch. • J-type: used only for J and JAL • Load, store: byte, half word, word, double word! • ALU, FP instructions; BEQ & BNE; J, JR, JAL, JALR.
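A rough sketch of packing R-type fields into a 32-bit word, assuming the standard MIPS field layout (opcode, rs, rt, rd, shamt, funct); the specific field values below are placeholders, not taken from the slides:

    def encode_r_type(opcode, rs, rt, rd, shamt, funct):
        # 6 + 5 + 5 + 5 + 5 + 6 = 32 bits
        return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

    # Hypothetical reg-reg ALU instruction operating on R2, R2 -> R2.
    word = encode_r_type(0b000000, 2, 2, 2, 0, 0b101100)
    print(f"{word:032b}")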
Sample code in MIPS-64
      DADDI R1, R0, #8000   ; 1000 words in the array, 8 bytes each
here: DADDI R1, R1, #-8     ; a word has 8 bytes
      LD    R2, 30000(R1)   ; array starts at 30000
      DADD  R2, R2, R2      ; double the contents
      SD    30000(R1), R2   ; store back into the array
      BNEZ  R1, here        ; if not done, repeat
A 1000-word array, starting at memory location 30000, is accessed word by word (64-bit words), and the value of each word is doubled. Memory access: approximately 40% of all instructions executed access memory. Without a cache, data memory is accessed 2*1000 times during the execution of the code. If the L1 data cache is at least 64 kB, then there would be only compulsory misses of the cache during the execution of the code.
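A quick back-of-the-envelope check of the cache remark in Python; the 64-byte block size is an assumption, not given on the slide:

    array_start = 30000                 # base address used in the code above
    array_bytes = 1000 * 8              # 1000 double words of 8 bytes each
    block_size = 64                     # assumed L1 block size in bytes
    data_accesses = 2 * 1000            # one load and one store per word
    first_block = array_start // block_size
    last_block = (array_start + array_bytes - 1) // block_size
    compulsory_misses = last_block - first_block + 1    # first touch of each block
    print(data_accesses, compulsory_misses)             # 2000 accesses, 126 misses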
Compilers and architecture • Architectures should be designed so that compilers can be developed easily. Architects should be aware of how compilers work, and compiler designers can produce good compilers if they are aware of the details of the architecture. • Simple compilers perform table look-up. Optimizing compilers can come up with machine code that is up to 50 times faster in execution. • Optimization is done by compilers at various levels: • Front end: transform to a common intermediate form • High level: loop transformations and procedure inlining • Global optimizer: global and local optimizations; register allocation • Code generator: machine-dependent optimizations
Example – compiler & architecture • A program execution entails running 10 million instructions, 45% of which are ALU instructions. An optimizing compiler removes a third of them. The processor runs at a 1 GHz clock speed. Instruction execution times are: 1 cc for ALU, 2 cc for loads & stores, which comprise 25% of all instructions, and 3 cc for conditional branches. Compute the MIPS ratings and CPU times before and after optimization. • Before optimization: CPIavg = 0.45*1 + 0.25*2 + 0.3*3 = 1.85 MIPS = 10^-6 / (1.85 * 10^-9) = 540.5 CPU time = 10^-9 * 10^6 * (4.5*1 + 2.5*2 + 3.0*3) = 0.0185 s • After optimization: IC = 0.85 * 10^7 (the removed ALU instructions were 15% of the original count) CPIavg = (0.3*1 + 0.25*2 + 0.3*3)/0.85 = 1.7/0.85 = 2.0 MIPS = 10^-6 / (2.0 * 10^-9) = 500 (smaller!?!) CPU time = 10^-9 * 10^6 * (3.0*1 + 2.5*2 + 3.0*3) = 17 ms
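The same calculation as a short Python check (the variable names are ours):

    tau = 1e-9                                        # 1 GHz clock period
    ic_before = 10_000_000

    cpi_before = 0.45 * 1 + 0.25 * 2 + 0.30 * 3       # 1.85
    mips_before = 1 / (cpi_before * tau * 1e6)        # ~540.5
    time_before = ic_before * cpi_before * tau        # 0.0185 s

    # The compiler removes a third of the ALU instructions, i.e. 15% of all instructions.
    ic_after = 0.85 * ic_before
    cpi_after = (0.30 * 1 + 0.25 * 2 + 0.30 * 3) / 0.85   # 2.0
    mips_after = 1 / (cpi_after * tau * 1e6)              # 500 -- a lower MIPS rating...
    time_after = ic_after * cpi_after * tau               # ...yet a shorter CPU time, 0.017 s

    print(cpi_before, mips_before, time_before)
    print(cpi_after, mips_after, time_after)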