EECS 470 Computer Architecture Lecture 2 Coverage: Chapters 1-2
A Quantitative Approach • Hardware systems performance is generally easy to quantify • Machine A is 10% faster than Machine B • Of course Machine B’s advertising will show the opposite conclusion • Example: Pentium 4 vs. AMD Hammer • Many software systems tend to have much more subjective performance evaluations.
Measuring Performance • Use Total Execution Time: $\frac{1}{n}\sum_{i=1}^{n} \text{Time}_i$ • A is 3 times faster than B for programs P1, P2 • Issue: emphasizes long-running programs
Measuring Performance • Weighted Execution Time: $\text{Weighted AM} = \sum_{i=1}^{n} \text{Weight}_i \,\text{Time}_i$, where $\sum_{i=1}^{n} \text{Weight}_i = 1$ • What if P1 is executed far more frequently? See the sketch below.
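A minimal sketch of the weighted mean in C; the times and weights below are made-up illustrative values, not from the lecture:

  #include <stdio.h>

  /* Weighted arithmetic mean: sum of weight_i * time_i, where the weights
     sum to 1. All numbers here are invented for illustration. */
  int main(void) {
      double time[]   = {10.0, 100.0};   /* P1, P2 execution times (s) */
      double weight[] = {0.9, 0.1};      /* P1 runs far more often     */
      double wam = 0.0;
      for (int i = 0; i < 2; i++)
          wam += weight[i] * time[i];
      printf("Weighted AM = %.1f s\n", wam);  /* 0.9*10 + 0.1*100 = 19.0 */
      return 0;
  }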
Measuring Performance • Normalized Execution Time: • Compare machine performance to a reference machine and report a ratio. • SPEC ratings measure relative performance to a reference machine.
Example using execution times • Conclusion: B is faster than A • Using total execution time, it is 1001/110 ≈ 9.1 times faster
Averaging Performance Over Benchmarks • Arithmetic mean (AM) = $\frac{1}{n}\sum_{i=1}^{n} \text{Time}_i$ • Geometric mean (GM) = $\sqrt[n]{\prod_{i=1}^{n} \text{Time}_i}$ • Harmonic mean (HM) = $\frac{n}{\sum_{i=1}^{n} \frac{1}{\text{Rate}_i}}$
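The three means as a runnable C sketch; the benchmark times and rates are illustrative values, not from the lecture:

  #include <stdio.h>
  #include <math.h>

  double am(const double t[], int n) {   /* arithmetic mean of times */
      double s = 0.0;
      for (int i = 0; i < n; i++) s += t[i];
      return s / n;
  }
  double gm(const double t[], int n) {   /* geometric mean of times */
      double p = 1.0;
      for (int i = 0; i < n; i++) p *= t[i];
      return pow(p, 1.0 / n);
  }
  double hm(const double r[], int n) {   /* harmonic mean of rates */
      double s = 0.0;
      for (int i = 0; i < n; i++) s += 1.0 / r[i];
      return n / s;
  }

  int main(void) {
      double times[] = {1.0, 1000.0};
      double rates[] = {100.0, 10.0};    /* e.g., MFLOPS */
      printf("AM = %.1f  GM = %.1f  HM = %.1f\n",
             am(times, 2), gm(times, 2), hm(rates, 2));
      return 0;
  }

With many benchmarks, computing the GM as exp of the mean of logs avoids overflowing the running product.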
Which is the right Mean? • Arithmetic when dealing with execution time • Harmonic when dealing with rates • flops • MIPS • Hertz • Geometric mean gives an “equi-weighted” average
Use Harmonic Mean with Rates • Rates (MFLOPS) from the preceding execution-time table • Notice that the total-time ordering is preserved in the HM of the rates
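A sketch of why the ordering is preserved, with hypothetical MFLOPS values (the original table's numbers are not reproduced here): with equal work W per program, total time = W × Σ(1/rate), so a higher HM of the rates always means a lower total time.

  #include <stdio.h>

  int main(void) {
      double a[] = {100.0, 10.0};   /* machine A rates (hypothetical) */
      double b[] = {50.0, 20.0};    /* machine B rates (hypothetical) */
      double inv_a = 1.0/a[0] + 1.0/a[1];   /* proportional to A's total time */
      double inv_b = 1.0/b[0] + 1.0/b[1];   /* proportional to B's total time */
      printf("HM(A) = %.1f, relative total time(A) = %.3f\n", 2.0/inv_a, inv_a);
      printf("HM(B) = %.1f, relative total time(B) = %.3f\n", 2.0/inv_b, inv_b);
      /* B: HM = 28.6 > A: HM = 18.2, and B's total time is lower. */
      return 0;
  }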
Normalized Times • Don't take the AM of normalized execution times: the result depends on which machine you normalize to • The GM is consistent under normalization, but it doesn't track total execution time
Notes & Benchmarks • AM ≥ GM • GM(Xᵢ) / GM(Yᵢ) = GM(Xᵢ/Yᵢ) • The GM is unaffected by normalizing – it just doesn't track execution time • Why does SPEC use it? • SPEC – System Performance Evaluation Cooperative • http://www.specbench.org/ • EEMBC – benchmarks for embedded applications: Embedded Microprocessor Benchmark Consortium • http://www.eembc.org/
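A one-line derivation of the GM ratio property, which is why normalizing leaves GM-based comparisons unchanged:

\[
\frac{\mathrm{GM}(X_i)}{\mathrm{GM}(Y_i)}
= \frac{\bigl(\prod_{i=1}^{n} X_i\bigr)^{1/n}}{\bigl(\prod_{i=1}^{n} Y_i\bigr)^{1/n}}
= \Bigl(\prod_{i=1}^{n} \frac{X_i}{Y_i}\Bigr)^{1/n}
= \mathrm{GM}\!\Bigl(\frac{X_i}{Y_i}\Bigr)
\]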
Amdahl's Law • Rule of Thumb: Make the common case faster • $\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left[(1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}\right]$ • Attack the longest-running part until it is no longer the longest; repeat
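The formula as a quick C sketch; the 80%/10× numbers are illustrative:

  #include <stdio.h>

  /* Amdahl's Law: t_new = t_old * ((1 - f) + f / s),
     so overall speedup = t_old / t_new = 1 / ((1 - f) + f / s). */
  double overall_speedup(double f, double s) {
      return 1.0 / ((1.0 - f) + f / s);
  }

  int main(void) {
      /* Enhance 80% of execution time by 10x. */
      printf("speedup = %.2fx\n", overall_speedup(0.8, 10.0));  /* ~3.57x */
      return 0;
  }

Note how the unenhanced 20% bounds the result: even with an infinite speedup on the enhanced fraction, the overall speedup cannot exceed 5×.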
Instruction Set Design • Software Systems: named variables; complex semantics. • Hardware systems: tight timing requirements; small storage structures; simple semantics • Instruction set: the interface between very different software and hardware systems
Design decisions • How much “state” is in the microarchitecture? • Registers; Flags; IP/PC • How is that state accessed/manipulated? • Operand encoding • What commands are supported? • Opcode; opcode encoding
Design Challenges: or why is architecture still relevant? • Clock frequency is increasing • This changes the number of levels of gates that can be completed each cycle, so old designs don't work • It also tends to increase the ratio of time spent on wires (fixed speed of light) • Power • Faster chips are hotter; bigger chips are hotter
Design Challenges (cont) • Design Complexity • More complex designs to fix frequency/power issues lead to increased development/testing costs • Failures (design or transient) can be difficult to understand (and fix) • We seem far less willing to live with hardware errors (e.g. FDIV) than software errors (which are often dealt with through upgrades – that we pay for!)
Techniques for Encoding Operands • Explicit operands: • Includes a field to specify which state data is referenced • Example: register specifier • Implicit operands: • All state data can be inferred from the opcode • Example: function return (CISC-style)
Accumulator • Architectures with one implicit register • Acts as source and/or destination • One other source explicit • Example: C = A + B • Load A // Acc ← A • Add B // Acc ← Acc + B • Store C // C ← Acc • Ref: “Instruction Level Distributed Processing: Adapting to Shifting Technology”
Stack • Architectures with implicit “stack” • Acts as source(s) and/or destination • Push and Pop operations have 1 explicit operand • Example: C = A + B • Push A // Stack = {A} • Push B // Stack = {A, B} • Add // Stack = {A+B} • Pop C // C ← A+B; Stack = {} • Compact encoding; may require more instructions though • See the interpreter sketch below
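A tiny stack-machine interpreter in C for this exact sequence; the opcode set and encoding are hypothetical, chosen to mirror the slide:

  #include <stdio.h>

  /* Push/Pop carry one explicit memory operand; Add is fully implicit. */
  enum op { PUSH, ADD, POP };
  struct insn { enum op op; int *addr; };

  int main(void) {
      int A = 2, B = 3, C = 0;
      struct insn prog[] = { {PUSH, &A}, {PUSH, &B}, {ADD, NULL}, {POP, &C} };
      int stack[16], sp = 0;
      for (int pc = 0; pc < 4; pc++) {
          switch (prog[pc].op) {
          case PUSH: stack[sp++] = *prog[pc].addr;   break;
          case ADD:  sp--; stack[sp-1] += stack[sp]; break;
          case POP:  *prog[pc].addr = stack[--sp];   break;
          }
      }
      printf("C = %d\n", C);  /* 5 */
      return 0;
  }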
Registers • Most general (and common) approach • Small array of storage • Explicit operands (register file index) • Example: C = A + B
Register-memory:     Load/store:
Load R1, A           Load R1, A
Add R3, R1, B        Load R2, B
Store R3, C          Add R3, R1, R2
                     Store R3, C
Memory • Big array of storage • More complex ways of indexing than registers • Build addressing modes to support efficient translation of software abstractions • Uses less space in the instruction than a 32-bit immediate field • A[i]: use base (A) + displacement (i), possibly scaled • ptr->a: use base (ptr) + displacement (offset of a)
Addressing modes
Register            Add R4, R3
Immediate           Add R4, #3
Base/Displacement   Add R4, 100(R1)
Register Indirect   Add R4, (R1)
Indexed             Add R4, (R1+R2)
Direct              Add R4, (1001)
Memory Indirect     Add R4, @(R3)
Autoincrement       Add R4, (R2)+
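A sketch of the effective-address arithmetic for a few of these modes; the register values are made up, and memory-indirect and autoincrement would additionally read memory or update R2:

  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      uint32_t R1 = 0x1000, R2 = 0x0008;
      uint32_t base_disp = R1 + 100;  /* Add R4, 100(R1): base + displacement  */
      uint32_t reg_ind   = R1;        /* Add R4, (R1): address taken from R1   */
      uint32_t indexed   = R1 + R2;   /* Add R4, (R1+R2): sum of two registers */
      uint32_t direct    = 1001;      /* Add R4, (1001): address in the insn   */
      printf("%x %x %x %u\n", base_disp, reg_ind, indexed, direct);
      return 0;
  }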
Other Memory Issues • What is the size of each element in memory? • Byte: 0–255 • Half word: 0–65535 • Word: 0 to ~4 billion
Other Memory Issues • Big-endian or little-endian? Store 0x114488FF:
Big-endian (address 0x000 points to the most significant byte):     11 44 88 FF
Little-endian (address 0x000 points to the least significant byte): FF 88 44 11
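You can observe this on a real machine by aliasing a word with a byte pointer; a minimal C sketch:

  #include <stdint.h>
  #include <stdio.h>

  /* Store 0x114488FF and inspect the bytes, as in the diagram above.
     A little-endian host prints ff 88 44 11; a big-endian host
     prints 11 44 88 ff. */
  int main(void) {
      uint32_t word = 0x114488FF;
      uint8_t *byte = (uint8_t *)&word;
      for (int i = 0; i < 4; i++)
          printf("offset %d: %02x\n", i, byte[i]);
      return 0;
  }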
Other Memory Issues • Non-word loads? Memory at 0x000: 11 44 88 FF
• ldb R3, (000) → R3 = 00 00 00 11 (sign-extended; byte 0x11 is positive)
• ldb R3, (003) → R3 = FF FF FF FF (sign-extended; byte 0xFF is negative)
• ldbu R3, (003) → R3 = 00 00 00 FF (zero-filled)
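The same sign-extend vs. zero-fill behavior, modeled in C with int8_t vs. uint8_t:

  #include <stdint.h>
  #include <stdio.h>

  /* Byte loads from the memory image {11, 44, 88, FF} above.
     Casting through int8_t models ldb (sign-extend); reading a plain
     uint8_t models ldbu (zero-fill). */
  int main(void) {
      uint8_t mem[4] = {0x11, 0x44, 0x88, 0xFF};
      int32_t  r_ldb  = (int8_t)mem[3];  /* ldb  R3,(003) -> 0xFFFFFFFF */
      uint32_t r_ldbu = mem[3];          /* ldbu R3,(003) -> 0x000000FF */
      printf("ldb:  %08x\n", (uint32_t)r_ldb);
      printf("ldbu: %08x\n", r_ldbu);
      return 0;
  }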
Other Memory Issues • Alignment? • Word accesses: only addresses ending in 00 (binary) • Half-word accesses: only addresses ending in 0 • Byte accesses: any address • ldw R3, (002) is illegal! • Why is it important to be aligned? How can it be enforced?
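One common way hardware (or software) checks this, as a sketch: a size-aligned address has its low log2(size) bits equal to zero, i.e. addr mod size == 0.

  #include <stdint.h>
  #include <stdio.h>

  int aligned(uint32_t addr, uint32_t size) {   /* size = 1, 2, or 4 */
      return (addr & (size - 1)) == 0;
  }

  int main(void) {
      printf("ldw (002): %s\n", aligned(0x002, 4) ? "ok" : "illegal");
      printf("ldh (002): %s\n", aligned(0x002, 2) ? "ok" : "illegal");
      return 0;
  }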
Techniques for Encoding Operators • Opcode is translated to control signals that • direct data (MUX control) • select the operation for the ALU • set read/write selects for register/memory/PC • Tradeoff between how flexible the control is and how compact the opcode encoding is • Microcode – direct control of signals (Improv) • Opcode – compact representation of a set of control signals • You can make decode easier with careful opcode selection (as done in HW1)
Handling Control Flow • Conditional branches (short range) • Unconditional branches (jumps) • Function calls • Returns • Traps (OS calls and exceptions) • Predicates (conditional retirement)
Encoding branch targets • PC-relative addressing • Makes linking code easier • Indirect addressing • Jumps into shared libraries, virtual functions, case/switch statements • Some unusual modes to simplify target address calculation • (segment offset) or (trap number)
Condition codes • Flags • Implicit: flag(s) specified in opcode (bgt) • Flag(s) set by earlier instructions (compare, add, etc.) • Register • Uses a register; requires explicit specifier • Comparison operation • Two registers with compare operation specified in opcode.
Higher Level Semantics: Functions • Function call semantics • Save the address of the next instruction for the return • Manage parameters • Allocate space on the stack • Jump to the function • Simple approach: use a jump instruction + other instructions • Complex approach: build implicit operations into a new “call” instruction
Role of the Compiler • Compilers make the complexity of the ISA (from the programmer's point of view) less relevant • Non-orthogonal ISAs are more challenging • State allocation (register allocation) is better left to compiler heuristics • Complex semantics lead to more global optimization – easier for a machine to do. People are good at optimizing 10 lines of code; compilers are good at optimizing 10M lines.
Next time • Compiler optimizations • Interaction between compilers and architectures • Higher level machine codes (Java VM) • Starting Pipelining: Appendix A