VLSI ARCHITECTURE DESIGN COURSE LECTURE #4-5 • The Generic Processor • Microarchitecture trends • Performance/power/frequency implications • Insights Today's lecture: Comprehend performance, power and area implications of various Microarchitectures
References of the day • “Computer Architecture - A Quantitative Approach” (second edition), John L. Hennessy, David A. Patterson, Chapters 3-4 (p. 125-370) • “Computer Organization and Design”, John L. Hennessy, David A. Patterson, Chapters 5-6, 9 (p. 268-451, 594-646) • “Tuning the Pentium Pro Micro-Architecture”, David Papworth, IEEE Micro, April 1996 • “IA-64 Application Architecture Tutorial”, Allan D. Knies, Hot Chips 11, August 1999 • “Billion-Transistor Architecture: There and Back Again”, Doug Burger, James Goodman, Computer, March • “VLIW Architecture for a Trace Scheduling Compiler”, R. Colwell, R. Nix, J. O’Donnell, D. Papworth, P. Rodman, ACM 1987 • “The VLIW Machine: A Multiprocessor for Compiling Scientific Code”, Joseph Fisher, Computer, July 1984 • “The IBM 360/91: Machine Philosophy and Instruction Handling”, R.M. Tomasulo et al., IBM Journal of Research and Development 11:1, 1967 • Some of the lecture material was prepared by Ronny Ronen
Computing platform: Messages • Balanced design • Power ∝ CV²f • System performance • Transaction overhead • Memory as a scratch pad • Scheduling • System efficiency • … • CPU • ILP and IPC vs. frequency • External vs. internal frequency • Speculation • Branch prediction • $ (caches) • Memory disambiguation • Instruction and data prefetch • Value prediction • … • Multithread • Multithread on a single core • Multi-core systems • $ in multi-core • Asymmetry • NUMA • Scheduling in MC • Multi-core vs. multi-thread machines • …
The Generic Processor • Sophisticated organization to “service” instructions • Instruction supply • Instruction cache • Branch prediction • Instruction decoder • ... • Execution engine • Instruction scheduler • Register files • Execution units • ... • Data supply • Data cache • TLBs • … • Goal: maximum throughput through a balanced design (figure: instruction supply, execution engine, data supply)
Power & Performance • Performance ∝ 1/Execution Time = (IPC × Frequency) / #-of-instructions-in-task • For a given instruction stream, performance depends on the number of instructions executed per time unit: Performance ∝ IPC × Frequency • Sometimes measured in MIPS (Million Instructions Per Second) • Power ∝ C × V² × Frequency • C = overall capacitance; for a given technology it is roughly proportional to the number of transistors • Energy efficiency = Performance/Power, measured in MIPS/Watt • Message: Power = C × V² × Frequency
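The relations on this slide can be tried out numerically. The sketch below is only an illustration: the two designs and all of their numbers (IPC, frequency, capacitance, voltage) are hypothetical, and the function names are not from the lecture.

```python
def performance_mips(ipc, freq_hz):
    """Performance ~ IPC x Frequency, expressed in millions of instructions per second."""
    return ipc * freq_hz / 1e6


def dynamic_power_watts(capacitance_f, voltage_v, freq_hz):
    """Power = C x V^2 x Frequency (the slide's message)."""
    return capacitance_f * voltage_v ** 2 * freq_hz


def mips_per_watt(ipc, freq_hz, capacitance_f, voltage_v):
    """Energy efficiency = Performance / Power."""
    return performance_mips(ipc, freq_hz) / dynamic_power_watts(
        capacitance_f, voltage_v, freq_hz
    )


# Hypothetical design A: narrow and fast (IPC 1.0 at 2 GHz, 10 nF, 1.0 V).
design_a = dict(ipc=1.0, freq_hz=2e9, capacitance_f=1e-8, voltage_v=1.0)
# Hypothetical design B: twice the transistors (2x C), wider and slower,
# which allows a lower supply voltage (IPC 2.0 at 1 GHz, 20 nF, 0.8 V).
design_b = dict(ipc=2.0, freq_hz=1e9, capacitance_f=2e-8, voltage_v=0.8)

for name, d in (("A", design_a), ("B", design_b)):
    print(
        name,
        performance_mips(d["ipc"], d["freq_hz"]), "MIPS,",
        round(mips_per_watt(**d), 2), "MIPS/Watt",
    )
```

With these made-up numbers both designs deliver the same MIPS, but the wider, lower-voltage design comes out ahead in MIPS/Watt because voltage enters the power term squared.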
Microprocessor Performance Evolution [John DeVale & Bryan Black, 2006] (figure: IPC chart with data points for MRM, YNH, Itanium, Intel P-M, Power 3, Power 4, AMD Opteron, AMD Athlon, Intel P4) • Message: Frequency vs. IPC
Real life: Performance vs. frequency (figure: measured performance at several frequencies) • Message: Internal vs. external frequency • Source: Intel® Pentium® 4 Processor and Intel® 850 Performance Brief, April 2002
Microarchitecture • Microprocessor core: performance/power/area insights • Parallelism • Pipeline stalls/bypasses • Superpipeline • Static/dynamic scheduling • Branch prediction • Memory hierarchy • VLIW / EPIC
Parallelism Evolution • Performance, power, area insights? (figure: arrangements of processor elements for the basic configuration, pipeline, superscalar in-order, VLIW, and superscalar out-of-order; PE = Processor Element; a, b, c, ..., n = instructions)
Static Scheduling: VLIW / EPIC • Performance, power, area insights? • Static scheduling of instructions by the compiler • VLIW: Very Long Instruction Word (MultiFlow, TI C6x family) • EPIC: Explicitly Parallel Instruction Computing (IA-64) • Shorter pipe, wider machine, global view => potentially huge ILP (wider and simpler than plain superscalar!) • Many nops; sensitive to varying latencies (memory accesses) • Low utilization • Huge code size • Highly dependent on the compiler • EPIC overcomes some of these limitations, but at increased complexity: • Advanced loads (hide memory latency) • Predicated execution (avoid branches) • Decoder templates (reduce nops) • (figure: pipeline-stage diagrams over time for integer, float, memory and branch instructions; I: integer, F: float, M: memory, B: branch, st: stall, gray: nop) • Examples: Intel Itanium® processor, DSPs • Perf/power: increase or decrease?
Dynamic Scheduling • Performance, power, area insights? • Scheduling instructions at run time, by the HW • Advantages: • Works on the dynamic instruction flow: can schedule across procedures, modules... • Can see dynamic values (memory addresses) • Can accommodate varying latencies and cases (e.g., cache miss) • Disadvantages: • Can schedule within a limited window only • Must be fast, so it cannot be too smart • Perf/power: increase or decrease?
Out Of Order Execution • In-order execution: instructions are processed in program order, which limits the potential parallelism • OOO: instructions are executed based on “data flow” rather than program order • Example (src -> dest) • Before: (1) load (r10), r21; (2) mov r21, r31 (2 depends on 1); (3) load a, r11; (4) mov r11, r22 (4 depends on 3); (5) mov r22, r23 (5 depends on 4) • After: (1) load (r10), r21; (3) load a, r11; <wait for loads to complete>; (2) mov r21, r31; (4) mov r11, r22; (5) mov r22, r23 • Usually highly superscalar • (figure: pipeline timing diagrams for in-order vs. out-of-order processing of instructions 1-5, assuming unlimited resources and a 2-cycle load latency) • Examples: Intel Pentium® II/III/4, Compaq Alpha 21264
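The five-instruction example can be replayed with a small timing model. This is a minimal sketch, not the slide's exact pipeline diagram: it counts only execute cycles, assumes unlimited resources and a 2-cycle load as on the slide, and additionally assumes that a mov takes 1 cycle and that an in-order machine may not start an instruction before its predecessor has started.

```python
# (text, execute latency in cycles, indices of the instructions it depends on)
PROGRAM = [
    ("load (r10), r21", 2, []),   # (1)
    ("mov r21, r31",    1, [0]),  # (2) depends on (1)
    ("load a, r11",     2, []),   # (3)
    ("mov r11, r22",    1, [2]),  # (4) depends on (3)
    ("mov r22, r23",    1, [3]),  # (5) depends on (4)
]


def finish_times(program, in_order):
    """Return the cycle at which each instruction's result is available."""
    start, finish = [], []
    for i, (_, latency, deps) in enumerate(program):
        # Data-flow constraint: wait for all producers to finish.
        t = max([finish[d] for d in deps], default=0)
        # In-order constraint (assumed model): never start before the
        # previous instruction in program order has started.
        if in_order and i > 0:
            t = max(t, start[i - 1])
        start.append(t)
        finish.append(t + latency)
    return finish


in_order_cycles = max(finish_times(PROGRAM, in_order=True))
ooo_cycles = max(finish_times(PROGRAM, in_order=False))
print("in order:", in_order_cycles, "cycles; out of order:", ooo_cycles, "cycles")
```

Under these assumptions the out-of-order run finishes earlier because the second load starts alongside the first instead of waiting behind the dependent mov.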
Out Of Order (cont.) • Performance, power, area insights? • Advantages • Helps exploit Instruction Level Parallelism (ILP) • Helps cover latencies (e.g., cache miss, divide) • Artificially increases the register file size (i.e., the number of registers) • Superior/complementary to the compiler scheduler • Dynamic instruction window • Makes use of more registers than the architectural registers • Complex microarchitecture • Complex scheduler • Large instruction window • Speculative execution • Requires a reordering back-end mechanism (retirement) for: • Precise interrupt resolution • Misprediction/speculation recovery • Memory ordering • Perf/power: increase or decrease?
Branch Prediction • Performance, power, area insights? • Goal: ensure instruction supply by correct prefetching • In the past, the prefetcher assumed fall-through • Loses on unconditional branches (e.g., call) • Loses on frequently taken branches (e.g., loops) • Dynamic branch prediction • Predicts whether a branch is taken/not taken • Predicts the branch target address • Misprediction cost varies (higher with increased pipeline depth) • Typical branch prediction rates: ~90%-96% => 4%-10% misprediction => 10-25 branches between mispredictions => 50-125 instructions between mispredictions • Misprediction cost increases with • Pipeline depth • Machine width • e.g., 3 wide × 10 stages = 30 instructions flushed! • Perf/power: increase or decrease?
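The arithmetic behind the figures above can be checked in a few lines. This is a sketch; the value of 5 instructions per branch is not stated on the slide and is inferred from its own numbers (50/10 and 125/25), and the function names are illustrative.

```python
INSTRUCTIONS_PER_BRANCH = 5  # assumption implied by the slide's 10-25 vs. 50-125


def branches_between_mispredictions(misprediction_rate):
    """On average one misprediction occurs every 1/rate branches."""
    return 1.0 / misprediction_rate


def instructions_between_mispredictions(misprediction_rate):
    return branches_between_mispredictions(misprediction_rate) * INSTRUCTIONS_PER_BRANCH


def flushed_instructions(machine_width, pipeline_stages):
    """Upper bound on in-flight instructions thrown away by one misprediction."""
    return machine_width * pipeline_stages


for rate in (0.10, 0.04):
    print(
        f"misprediction rate {rate:.0%}:",
        round(branches_between_mispredictions(rate)), "branches,",
        round(instructions_between_mispredictions(rate)), "instructions between mispredictions",
    )
print("flushed per misprediction (3 wide x 10 stages):", flushed_instructions(3, 10))
```

Comparing the 30 flushed instruction slots with the 50-125 useful instructions between mispredictions gives a feel for why the cost grows with both depth and width.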
Caches • In computer engineering, a cache (pronounced /kæʃ/ kash in US and /keɪʃ/ kaysh in Aust/NZ) is a component that transparently stores data so that future requests for that data can be served faster (Wikipedia)
Memory hierarchy • Performance, power, area insights? • From small and fast to big and slow (capacity, speed): • CPU registers: <500B, 0.25ns • L1 cache: 64KB, 1-2ns • L2 cache: 8MB, 5ns • Main memory (DRAM): 4GB, 100ns • Disk/Flash: 100GB, 1ms/10us • Perf/power: what are the parameters to consider here?
Environment and motivation • Moore’s Law: 2X transistors (cores?) per chip every technology generation; however, current process generations provide almost the same clock rate • A processor running a single process can compute only as fast as memory • A 3 GHz processor can execute an “add” operation in 0.33 ns • Today’s “external main memory” latency is 50-100 ns • Naïve implementation: loads/stores can be 300x slower than other operations
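The 300x figure follows directly from the clock rate and the memory latency quoted on the slide. A short check, assuming (as the slide does) that an add takes one clock cycle:

```python
CLOCK_HZ = 3e9                    # 3 GHz processor
ADD_TIME_NS = 1e9 / CLOCK_HZ      # one cycle, about 0.33 ns


def slowdown_vs_add(memory_latency_ns):
    """How many single-cycle operations fit into one main-memory access."""
    return memory_latency_ns / ADD_TIME_NS


for latency_ns in (50, 100):
    print(f"{latency_ns} ns memory access ~ {slowdown_vs_add(latency_ns):.0f} add operations")
```

The 100 ns end of the range gives the slide's 300x; the 50 ns end gives 150x.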
Cache Motivation: CPU-DRAM gap (latency) (figure: performance over time; µProc improves 60%/yr (2X/1.5 yr, “Moore’s Law”); DRAM improves 9%/yr (2X/10 yrs); the processor-memory performance gap grows 50%/year) • Memory latency can be handled by: • Multi-threaded engine (no cache): every memory access = off-chip access; BW and power implications? • Caches: every cache miss = off-chip access; BW and power implications?
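The gap growth quoted in the figure can be approximated from the two annual improvement rates. This is a rough sketch using only the 60%/yr and 9%/yr figures; it assumes simple compounding, which is an idealization.

```python
CPU_GROWTH_PER_YEAR = 0.60   # µProc: 60%/yr
DRAM_GROWTH_PER_YEAR = 0.09  # DRAM: 9%/yr


def gap_after(years):
    """Ratio of processor to DRAM performance improvement after `years`."""
    yearly_ratio = (1 + CPU_GROWTH_PER_YEAR) / (1 + DRAM_GROWTH_PER_YEAR)
    return yearly_ratio ** years


print(f"gap growth per year: {gap_after(1) - 1:.0%}")
print(f"gap after 10 years: about {gap_after(10):.0f}x")
```

The yearly ratio comes out a little under 1.5, in line with the slide's "grows 50% / year".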
Memory Hierarchy: number of CPU cycles to reach each memory domain (latency) • Registers: 1 C • Memory: 300 C • SSD: 10,000 C • Disk: 1,000,000 C • C = CPU cycles • 046267 Computer Architecture 1, U. Weiser
Cache • A cache is a smaller, faster memory that stores copies of the data from the most frequently used main memory locations • 048750 CMP Cache/Mem Arch – Uri W., Evgeny B.
Memory Hierarchy, solution I: single-core environment • A fast memory structure (cache) between CPU and memory solves the latency issue • Registers: 1 C • Cache: 10 C • Memory: 300 C • SSD: 10,000 C • Disk: 1,000,000 C • C = CPU cycles
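How much the cache helps depends on how often it misses. The sketch below applies the standard textbook average-memory-access-time formula (hit time + miss rate × miss penalty) to the slide's 10 C cache and 300 C memory; the miss rates are hypothetical values chosen for illustration.

```python
CACHE_LATENCY_CYCLES = 10    # from the slide
MEMORY_LATENCY_CYCLES = 300  # from the slide


def average_access_cycles(miss_rate):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return CACHE_LATENCY_CYCLES + miss_rate * MEMORY_LATENCY_CYCLES


for miss_rate in (0.01, 0.05, 0.20):  # hypothetical miss rates
    print(f"miss rate {miss_rate:.0%}: {average_access_cycles(miss_rate):.0f} cycles on average")
```

Even a modest miss rate adds noticeably to the 10-cycle hit time, which is why the miss rate is one of the parameters to consider.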
Memory Hierarchy, solution II: multi-thread environment • Execution of many threads hides latency: while one thread waits on a memory access (300 C), other threads execute • Memory: 300 C • SSD: 10,000 C • Disk: 1,000,000 C • Resulting performance, off-chip bandwidth and power: Performance1, BW1, P1
Memory Hierarchy, solution II: multi-thread environment • A memory structure ($) between CPU and memory serves as a BW filter • Cache: 10 C • Memory: 300 C • SSD: 10,000 C • Disk: 1,000,000 C • Same performance: Performance1, with off-chip bandwidth BW1 × MR and power P1 × MR • MR = cache miss rate
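The bandwidth-filter effect is a simple scaling by the miss rate. The sketch below uses the slide's relations BW1 × MR and P1 × MR; the baseline bandwidth, power and miss rate are hypothetical numbers chosen only to make the scaling visible.

```python
def filtered_by_cache(baseline, miss_rate):
    """Only misses go off-chip, so off-chip traffic (and the power the slide
    attributes to it) scales with the cache miss rate."""
    return baseline * miss_rate


BW1_GB_PER_S = 100.0   # hypothetical off-chip bandwidth without a cache
P1_WATTS = 20.0        # hypothetical power associated with that traffic
MISS_RATE = 0.10       # hypothetical cache miss rate (MR)

print("off-chip bandwidth with cache:", round(filtered_by_cache(BW1_GB_PER_S, MISS_RATE), 2), "GB/s")
print("power with cache:", round(filtered_by_cache(P1_WATTS, MISS_RATE), 2), "W")
```

Performance is assumed unchanged here, as on the slide, because the threads already hide the latency; the cache's contribution is the reduction in off-chip traffic.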
Power, Performance, Area: Insights - 1 • Energy to process one instruction: Wi • Increases with the complexity of the processor, e.g., an OOO processor consumes more energy per instruction than an in-order processor • Energy efficiency = Perf/Power • Deteriorates as speculation increases and complexity grows • Area efficiency = Performance/Area • Leakage becomes a major issue • Effectiveness of area: how to get more performance for a given area (secondary to power)
Power, Performance, Area: Insights - 2 • Performance • Perf ∝ IPC × f • Voltage scaling • Increase the operating voltage to increase frequency • f = k × V (within a given voltage range) • Power & energy consumption • P ∝ C × V² × f => P ~ α × C × V³ • E = P × t • Tradeoff • Maximum performance • Minimum energy: 1% perf => 1% power (without voltage scaling) • Maximum performance within constrained power: 1% perf => 3% power (with voltage scaling)
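The "1% perf costs 1% or 3% power" rule follows from P ∝ C × V² × f together with f = k × V. A minimal numerical check, assuming performance scales linearly with frequency (fixed IPC) and that capacitance stays constant:

```python
def relative_power(freq_scale, with_voltage_scaling):
    """Power relative to the baseline after scaling frequency by freq_scale.

    P ~ C * V^2 * f. Without voltage scaling V is fixed, so P scales with f.
    With voltage scaling V must rise in proportion to f (f = k * V),
    so P scales with f^3.
    """
    voltage_scale = freq_scale if with_voltage_scaling else 1.0
    return voltage_scale ** 2 * freq_scale


for scaled in (False, True):
    extra = relative_power(1.01, scaled) - 1
    label = "with" if scaled else "without"
    print(f"+1% frequency {label} voltage scaling: +{extra:.2%} power")
```

For small changes the cubic relation gives roughly three times the relative power increase of the linear one, which is the slide's 1% vs. 3%.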
Power, Performance, Area: Insights - 3 • Many things do not scale: • Wire delays • Power • Memory latencies and bandwidth • Instruction Level Parallelism (ILP) • … • We solve one: we fertilize the others! • Performance = frequency × IPC • Increasing IPC => more work per instruction • Prediction, renaming, scheduling, etc. • More useless work: speculation, replays... • More frequency => more pipe stages • Fewer gate delays per stage • More gate delays per instruction overall • Bigger loss due to flushes, cache misses, prefetch misses • We may “gain” performance, but at the cost of a lot of area and power!
Static Scheduling: VLIW / EPIC, a short architectural case study • Why “new”? … CISC = old • Why revive it? … OOO complexity • Advantages: simplicity (pipeline, dependency, dynamic) • Reasons: • EOL of x86? • Business? • Servers? • Questions to ask? • Technical • Business • Controllers • Questions to ask? • Technical • Business
Static Issuing - example: VLIW (Very Long Instruction Word), Multiflow 7/200 • A VLIW performs many program steps at once • Many operations are grouped together into a Very Long Instruction Word and execute together • (figure: memory and register file feeding the LD/ST, FADD, FMUL and IALU units; instruction word fields: LD/ST, FADD, FMUL, IALU, BRANCH) • Ref: “VLIW Architecture for a Trace Scheduling Compiler”, Colwell, Nix, O’Donnell
Multiflow 7/200 (cont’d): Compiler Basic Concept • An optimizing compiler arranges instructions according to instruction timing • Example: A = (B+C) * (D+E); F = G*H + X*Y • Code for A: LD #B, R1; LD #C, R2; FADD R1, R2, R3; LD #D, R4; LD #E, R5; FADD R4, R5, R6; FMUL R6, R3, R1; STO R1, #A • Code for F: LD #G, R7; LD #H, R8; FMUL R7, R8, R9; LD #X, R4; LD #Y, R5; FMUL R4, R5, R6; FADD R6, R9, R1; STO R1, #F • Assume latencies: Load 3, FADD 3, FMUL 3, Store 1
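Multiflow 7/200 (cont’d): Compiler Basic Concept, example (cont.) • A = (B+C) * (D+E); F = G*H + X*Y • Assume latencies: Load 3, FADD 3, FMUL 3, Store 1 • Instruction word fields: LD/ST, IALU, FADD, FMUL, BR • Schedule (one instruction word per cycle; IALU and BR fields are unused):
Cycle 1: LD/ST: LD #B, R1
Cycle 2: LD/ST: LD #C, R2
Cycle 3: LD/ST: LD #D, R4
Cycle 4: LD/ST: LD #E, R5
Cycle 5: LD/ST: LD #G, R7; FADD: FADD R1,R2,R3
Cycle 6: LD/ST: LD #H, R8
Cycle 7: LD/ST: LD #X, R4; FADD: FADD R4,R5,R6
Cycle 8: LD/ST: LD #Y, R5
Cycle 9: FMUL: FMUL R7,R8,R9
Cycle 10: FMUL: FMUL R3,R6,R1
Cycle 11: FMUL: FMUL R4,R5,R6
Cycle 12: - - - (stalled cycle)
Cycle 13: LD/ST: STO R1, #A
Cycle 14: FADD: FADD R9,R6,R1
Cycle 15: - - - (stalled cycle)
Cycle 16: - - - (stalled cycle)
Cycle 17: LD/ST: STO R1, #F
• - - - : stalled cycle, takes time but no space • Overall latency: 17 cycles • Very low code efficiency: <25%!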
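The 17-cycle result can be reproduced with a small greedy scheduler. This is a sketch under stated assumptions, not the Multiflow trace scheduling compiler: it places each operation of the example, in program order, in the earliest cycle that respects the slide's latencies, register reuse, and one operation per functional unit per cycle. The five-field instruction word and the "stalled cycles take no space" rule come from the slide; everything else (names, data layout) is illustrative.

```python
LOAD, FADD, FMUL, STORE = 3, 3, 3, 1   # latencies from the slide
FIELDS_PER_WORD = 5                    # LD/ST, IALU, FADD, FMUL, BR

# (text, functional unit, latency, destination register or None, source registers)
PROGRAM = [
    ("LD #B, R1", "LD/ST", LOAD, "R1", ()),
    ("LD #C, R2", "LD/ST", LOAD, "R2", ()),
    ("FADD R1, R2, R3", "FADD", FADD, "R3", ("R1", "R2")),
    ("LD #D, R4", "LD/ST", LOAD, "R4", ()),
    ("LD #E, R5", "LD/ST", LOAD, "R5", ()),
    ("FADD R4, R5, R6", "FADD", FADD, "R6", ("R4", "R5")),
    ("FMUL R6, R3, R1", "FMUL", FMUL, "R1", ("R6", "R3")),
    ("STO R1, #A", "LD/ST", STORE, None, ("R1",)),
    ("LD #G, R7", "LD/ST", LOAD, "R7", ()),
    ("LD #H, R8", "LD/ST", LOAD, "R8", ()),
    ("FMUL R7, R8, R9", "FMUL", FMUL, "R9", ("R7", "R8")),
    ("LD #X, R4", "LD/ST", LOAD, "R4", ()),
    ("LD #Y, R5", "LD/ST", LOAD, "R5", ()),
    ("FMUL R4, R5, R6", "FMUL", FMUL, "R6", ("R4", "R5")),
    ("FADD R6, R9, R1", "FADD", FADD, "R1", ("R6", "R9")),
    ("STO R1, #F", "LD/ST", STORE, None, ("R1",)),
]


def schedule(program):
    """Return the issue cycle (1-based) chosen for each operation."""
    value_ready = {}  # register -> first cycle its newest value can be read
    last_read = {}    # register -> latest issue cycle of an earlier reader
    busy = set()      # (unit, cycle) slots already taken
    issue = []
    for _, unit, latency, dest, srcs in program:
        # True dependencies: all sources must have been produced.
        cycle = max([1] + [value_ready.get(r, 1) for r in srcs])
        if dest is not None:
            # Do not overwrite a register before earlier readers have issued,
            # and do not complete before an earlier write to the same register.
            cycle = max(cycle, last_read.get(dest, 1))
            cycle = max(cycle, value_ready.get(dest, 0) - latency + 1)
        while (unit, cycle) in busy:      # one operation per unit per cycle
            cycle += 1
        busy.add((unit, cycle))
        for r in srcs:
            last_read[r] = max(last_read.get(r, 0), cycle)
        if dest is not None:
            value_ready[dest] = cycle + latency
        issue.append(cycle)
    return issue


issue_cycles = schedule(PROGRAM)
total_cycles = max(c + op[2] - 1 for c, op in zip(issue_cycles, PROGRAM))
instruction_words = len(set(issue_cycles))   # stalled cycles take no space
code_efficiency = len(PROGRAM) / (instruction_words * FIELDS_PER_WORD)

print("overall latency:", total_cycles, "cycles")
print("instruction words:", instruction_words)
print(f"code efficiency: {code_efficiency:.1%}")
```

Under these assumptions the 16 operations occupy 14 instruction words of 5 fields each, which is where the "<25%" code efficiency comes from.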
Intel® Itanium™ Processor Block Diagram (figure; blocks shown: L1 instruction cache and fetch/pre-fetch engine with ITLB; IA-32 decode and control; branch prediction; instruction queue (8 bundles); 9 issue ports (B B B M M I I F F); register stack engine / re-mapping; branch & predicate registers; 128 integer registers; 128 FP registers; scoreboard, predicates, NaTs, exceptions; branch units; integer and MM units; dual-port L1 data cache and DTLB; ALAT; floating point units (SIMD FMAC); L2 cache; L3 cache; bus controller; ECC on caches and bus)
IA-64 Instruction Template • Bundle: 128 bits = Instruction 2 (41 bits) + Instruction 1 (41 bits) + Instruction 0 (41 bits) + template (5 bits) • Instruction types: M: Memory • I: Shifts, MM • A: ALU • B: Branch • F: Floating point • L+X: Long • Template types: • Regular: MII, MLX, MMI, MFI, MMF • Stop: MI_I, M_MI • Branch: MIB, MMB, MFB, MBB, BBB • All come in two versions: with a stop at the end and without a stop at the end • Microarchitecture considerations: • Can run N bundles per clock (Merced = 2) • Limits on the number of memory ports (Merced = 2, future > 2?)
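The bundle layout (three 41-bit instructions plus a 5-bit template in 128 bits) can be made concrete with a little bit manipulation. This is a sketch only: it assumes the template occupies the lowest 5 bits with Instruction 0, 1 and 2 above it, following the order shown on the slide, and the template and slot values used below are arbitrary placeholders rather than real IA-64 encodings.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS  # 5 + 3 * 41 = 128


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for index, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for index in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


# Placeholder values, not real encodings.
example = pack_bundle(0x10, [0x123456789, 0x0ABCDEF01, (1 << SLOT_BITS) - 1])
print(f"bundle = 0x{example:032x} ({BUNDLE_BITS} bits)")
print(unpack_bundle(example))
```

Because the template is shared by all three slots, the decoder learns the unit type of each instruction (and where the stops are) from 5 bits instead of decoding every slot first.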