
Introduction to Energy Aware Computing

Explore the advancements and trends in energy-aware computing, from high-end processors consuming 10 kW to low-end processors delivering higher performance with lower energy consumption. Learn about power efficiency, top supercomputers' energy profiles, and strategies for reducing power consumption at all design levels.



Presentation Transcript


  1. Introduction to Energy Aware Computing Henk Corporaal www.ics.ele.tue.nl/~heco ASCI Winterschool on Energy Aware Computing Soesterberg, March 2012

  2. Intel trends (Core i7: 3 GHz, 100 W) • #transistors keeps following Moore's law • but frequency and performance per core no longer do Henk Corporaal

  3. Types of compute systems Henk Corporaal

  4. A 20 nm scenario (high-end processor) • This means: • a 2 cm2 processor would consume 10 kW • a power bound of 100 W allows only 1% of the chip to be active at any time: dark silicon Henk Corporaal

  5. Intel's answer: 48-core x86 Henk Corporaal

  6. Power versus Energy • Power P = α·f·C·Vdd² • α: switching activity (<1); f: frequency; C: switching capacitance; Vdd: supply voltage • heat / temperature constraint • wear-out • peak power delivery constraint • Energy E = P·t or, for time-varying P: E = ∫P(t)·dt • battery life • cost: electricity bill • Note: lowering f reduces P, but not necessarily E; E may even increase due to leakage (static power dissipation) Henk Corporaal
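A minimal sketch in C of this trade-off, assuming an illustrative power model (the constants alpha, C, Vdd and P_leak are made up, not from the slides): dynamic energy per operation is fixed at α·C·Vdd², but leakage is integrated over the runtime, so simply lowering f can increase the energy of a fixed task.

    #include <stdio.h>

    /* Energy for a task of 'ops' operations at frequency f (Hz).
       P_dyn = alpha * f * C * Vdd^2, P_static = leakage (assumed constant).
       E = (P_dyn + P_static) * t, with t = ops / f.                        */
    static double task_energy(double ops, double f, double alpha,
                              double C, double Vdd, double P_leak)
    {
        double t     = ops / f;                    /* execution time (s) */
        double P_dyn = alpha * f * C * Vdd * Vdd;  /* dynamic power (W)  */
        return (P_dyn + P_leak) * t;               /* energy (J)         */
    }

    int main(void)
    {
        /* Hypothetical numbers, chosen only to illustrate the effect. */
        double ops = 1e9, alpha = 0.2, C = 1e-9, Vdd = 1.0, P_leak = 0.3;

        for (double f = 0.25e9; f <= 2e9; f *= 2)
            printf("f = %4.2f GHz : E = %.3f J\n", f / 1e9,
                   task_energy(ops, f, alpha, C, Vdd, P_leak));
        /* Dynamic energy per task is constant (alpha*C*Vdd^2 per op),
           but the leakage term grows as the task takes longer, so the
           lowest frequency is not the lowest energy; only lowering Vdd
           along with f (DVFS, see later slides) really helps.          */
        return 0;
    }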

  7. What's happening at the top Henk Corporaal

  8. Top500 nr 1 • 1st : K Computer: • 10.51 Petaflop/s on Linpack • 705024 SPARC64 cores (8 per die; 45 nm) (Fujitsu design) • Tofu interconnect (6-D torus) • 12.7 MegaWatt Henk Corporaal

  9. Top500 nr 2 • 2nd : Chinese Tianhe-1A: • 2.57 Petaflop/s • 186368 cores (Xeon + NVIDIA processors) • 4.0 MegaWatt Henk Corporaal

  10. What's happening at the low end…. • March 14, 2012: ARM announced the Cortex M0+ • "The 32-bit Cortex-M0+ consumes just 9µA/MHz on a low-cost 90nm LP process, around one third of the energy of any 8- or 16-bit processor available today, while delivering significantly higher performance" • 2-stage pipeline • option: 1-cycle MUL Henk Corporaal

  11. Low end: How much energy in the air? [Rabaey 2009] Henk Corporaal

  12. Computational efficiency (Mops/mW): what do we need? [Figure (Woh et al., ISCA 2009): power efficiency (Mops/mW, 1 to 10000) versus power (Watts, 0.1 to 100) for IBM Cell, Imagine, SODA (90 nm), SODA (65 nm), VIRAM, TI C6X and Pentium M, with the 3G wireless, 4G wireless and mobile HD video requirements marked.] • 4G wireless and mobile HD video need roughly 1000 Mops/mW • This means 1 pJ / operation, or 1 TeraOp/Watt Henk Corporaal

  13. Green500: Top 10 in green supercomputing Henk Corporaal

  14. Green500: evolution • 2008: best result = 536 MFlops/Watt => 1.87 nJ / floating-point operation • 2009: best result = 723 MFlops/Watt => 1.38 nJ / floating-point operation • Cell cluster, ranking 110 in top500 • 2010: best result = 1684 MFlops/Watt => 594 pJ / floating-point operation • IBM BlueGene/Q prototype 1, ranking 101 in top500, peak perf. 65 TFlops; see also http://www.theregister.co.uk/2010/11/22/ibm_blue_gene_q_super/ • 2011: best result = 2097 MFlops/Watt => 476 pJ / floating-point operation • IBM BlueGene/Q prototype 2 • power consumption: 41 kW / peak 85 TFlop/s Henk Corporaal
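The energy-per-operation numbers above follow directly from the efficiency figures: energy per flop is the inverse of flops per watt. A small sketch of the conversion (the MFlops/Watt values are the ones listed on the slide; the code itself is only illustrative):

    #include <stdio.h>

    int main(void)
    {
        /* Green500 best results (MFlops/Watt), as listed on the slide. */
        const double mflops_per_watt[] = { 536, 723, 1684, 2097 };
        const int    year[]            = { 2008, 2009, 2010, 2011 };

        for (int i = 0; i < 4; i++) {
            /* flops per joule = MFlops/W * 1e6; energy/flop is its inverse */
            double pj_per_flop = 1e12 / (mflops_per_watt[i] * 1e6);
            printf("%d: %6.0f MFlops/W  ->  %7.1f pJ/flop\n",
                   year[i], mflops_per_watt[i], pj_per_flop);
        }
        return 0;
    }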

  15. Energy cost • At ~$1M per MW, energy costs are substantial • 1 petaflop in 2010 uses 3 MW • 1 exaflop in 2018 possible in 200 MW with "usual" scaling • 1 exaflop in 2018 at 20 MW is the DOE (Department of Energy) target • see also the MontBlanc EU project: www.montblanc-project.eu • goal: 200 PFlops for 10 MWatt in 2017 • [Graph: normal vs. desired scaling; from Katy Yelick, Berkeley] Henk Corporaal

  16. Reducing power @ all design levels • Algorithmic level • Compiler level • Architecture level • Organization level • Circuit level • Silicon level • Important concepts: • Lower Vdd and freq. (even if errors occur) / dynamically adapt Vdd and freq. • Reduce circuitry • Exploit locality • Reduce switching activity, glitches, etc. • P = α·f·C·Vdd² ; E = ∫P·dt => E/cycle = α·C·Vdd² Henk Corporaal

  17. Algorithmic level • The best indicator for energy is …..…. the number of cycles • Try alternative algorithms with lower complexity • E.g. quick-sort, O(n log n), instead of bubble-sort, O(n²) • … but be aware of the 'constant': O(n log n) really means c·(n log n) • Heuristic approach • go for a good solution, not the best !! • Biggest gains at this level !! Henk Corporaal
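A hedged illustration of why cycle (operation) count is the first-order energy indicator: counting comparisons for a classic O(n²) sort versus an O(n log n) one. The counting harness and the input size are mine, not from the slides.

    #include <stdio.h>
    #include <stdlib.h>

    static long cmps;                 /* global comparison counter */

    static void bubble_sort(int *a, int n)
    {
        for (int i = 0; i < n - 1; i++)
            for (int j = 0; j < n - 1 - i; j++) {
                cmps++;
                if (a[j] > a[j + 1]) { int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t; }
            }
    }

    static int cmp(const void *x, const void *y)
    {
        cmps++;
        return (*(const int *)x > *(const int *)y) - (*(const int *)x < *(const int *)y);
    }

    int main(void)
    {
        enum { N = 10000 };
        int *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
        for (int i = 0; i < N; i++) a[i] = b[i] = rand();

        cmps = 0; bubble_sort(a, N);            printf("bubble sort: %ld comparisons\n", cmps);
        cmps = 0; qsort(b, N, sizeof *b, cmp);  printf("qsort      : %ld comparisons\n", cmps);
        /* Roughly N^2/2 versus N*log2(N) comparisons: fewer operations
           executed means fewer switched gates and fewer memory accesses,
           hence less energy, before any lower-level trick is applied.    */
        free(a); free(b);
        return 0;
    }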

  18. Compiler level • Source-to-source transformations • loop transformations to improve locality • Strength reduction • e.g. replace Const * A with adds and shifts • Replace floating point with fixed point • Reduce register pressure / number of accesses to the register file • Use software bypassing • Scenarios: current workloads are highly dynamic • determine and predict execution modes • group execution modes into scenarios • perform special optimizations per scenario • DVFS: Dynamic Voltage and Frequency Scaling • More advanced loop optimizations • Reorder instructions to reduce bit-transitions Henk Corporaal
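A sketch of two of these transformations, strength reduction and fixed-point conversion, on a trivial kernel. The constant 10, the 0.7324 coefficient and the Q15 format are illustrative assumptions, not taken from the slides.

    #include <stdint.h>

    /* Original: multiply by a constant using a full multiplier. */
    int32_t scale_mul(int32_t a)
    {
        return 10 * a;
    }

    /* Strength-reduced: 10*a = 8*a + 2*a, i.e. shifts and adds only.
       Cheaper on cores without a fast multiplier; a good compiler
       often does this automatically.                                */
    int32_t scale_shift_add(int32_t a)
    {
        return (a << 3) + (a << 1);
    }

    /* Floating point replaced by fixed point: multiply by ~0.7324
       using a Q15 coefficient (0.7324 * 2^15 ~= 24000).
       Accuracy loss must be checked against the application's needs. */
    int32_t gain_q15(int32_t x)
    {
        const int32_t coeff_q15 = 24000;          /* ~0.7324 in Q15 */
        return (int32_t)(((int64_t)x * coeff_q15) >> 15);
    }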

  19. Architecture level • Going parallel • Going heterogeneous • tune your architecture, exploit SFUs (special function units) • trade-off between flexibility / programmability / genericity and efficiency • Add local memories • prefer scratchpad instead of cache • Cluster FUs and register files (see next slide) • Reduce bit-width • sub-word parallelism (SIMD) Henk Corporaal

  20. Organization (micro-arch.) level • Enabling Vdd reduction • Pipelining • cheap way of parallelism • enables a lower frequency, hence a lower Vdd • Note 1: don't pipeline if you don't need the performance • Note 2: don't exaggerate (like the 31-stage Pentium 4) • Reduce register traffic • avoid unnecessary reads and writes • make bypass registers visible Henk Corporaal

  21. Circuit level • Clock gating • Power gating • Multiple Vdd modes • Reduce glitches: balancing digital paths • Exploit zeros • Special SRAM cells • normal SRAM cannot scale below Vdd = 0.7 - 0.8 Volt • Razor method; replay • Allow errors and add redundancy to architecturally invisible structures • branch predictor • caches • .. and many more .. Henk Corporaal

  22. Silicon level • Higher Vt (V_threshold) • Back-biasing control • see the thesis of Maurice Meijer (2011) • SOI (Silicon on Insulator) • silicon junction is above an electrical insulator (silicon dioxide) • lowers parasitic device capacitance • Better transistors: FinFET • multi-gate • reduced leakage (off-state current) • .. and many more • Wait for the lectures of Pineda on Friday Henk Corporaal

  23. Let's detail a few examples • Algorithmic level: exploiting locality • Compiler level: software bypassing • Architecture level: going parallel • Organization level: Razor • Circuit level: exploiting zeros in a multiplier • Silicon level: sub-threshold Henk Corporaal

  24. Algorithm level: exploiting locality • [Diagram: a generic platform with a four-level memory hierarchy — CPUs with I/D caches, HW accelerators and local memories on chip (level 1), the L2 cache behind a bus-interface bridge (level 2), main memory (level 3), and SCSI disks on the SCSI bus (level 4), connected by on-chip and off-chip busses.] Henk Corporaal

  25. Data transfer and storage power • Power(memory) ≈ 33 × Power(arithmetic) Henk Corporaal
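A rough sketch of why such a ratio makes locality the dominant lever. The 33× ratio is taken from the slide; the kernel, its operation counts and the two access patterns below are assumptions for illustration only.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed relative costs: 1 unit per arithmetic op,
           33 units per background-memory access (ratio from the slide). */
        const double E_op = 1.0, E_mem = 33.0;

        /* A kernel doing 4 arithmetic ops per element.
           Variant A: streams intermediates to/from background memory,
                      3 background accesses per element.
           Variant B: keeps the working set in local/foreground memory,
                      only 1 background access per element remains.      */
        double ops_per_elem = 4.0;
        double energyA = ops_per_elem * E_op + 3.0 * E_mem;
        double energyB = ops_per_elem * E_op + 1.0 * E_mem;

        printf("per element: A = %.0f units, B = %.0f units (%.1fx less)\n",
               energyA, energyB, energyA / energyB);
        return 0;
    }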

  26. Loop transformations • Loop transformations • improve regularity of accesses • improve temporal locality: production → consumption • Expected influence • reduce temporary storage and (anticipated) background storage • Work horse: loop merging • typically many enabling transformations are needed before you can merge loops Henk Corporaal

  27. Loop transformations: merging • [Figure: production/consumption of each array location plotted against time, before and after merging.] • Before: for (i=0; i<N; i++) B[i] = f(A[i]); for (j=0; j<N; j++) C[j] = f(B[j],A[j]); • After: for (i=0; i<N; i++) { B[i] = f(A[i]); C[i] = f(B[i],A[i]); } • Locality improved ! Henk Corporaal

  28. Loop transformations, example • [Diagram: CPU with foreground memory, background memory and an external memory interface.] • Before: for (i=0; i<N; i++) B[i] = f(A[i]); for (i=0; i<N; i++) C[i] = g(B[i]); — each loop takes N cycles (2N in total) and needs 2 background-memory ports • After merging: for (i=0; i<N; i++){ B[i] = f(A[i]); C[i] = g(B[i]); } — N cycles, 1 background + 1 foreground port (B[i] can now live in foreground memory) Henk Corporaal

  29. Loop transformations, example: enabling transformation required • Before (intermediate storage for all N elements of A, i.e. storage size N): for (j=1; j<=M; j++) for (i=1; i<=N; i++) A[i] = foo(A[i]); for (i=1; i<=N; i++) out[i] = A[i]; • After loop interchange and merging (storage size 1): for (i=1; i<=N; i++) { for (j=1; j<=M; j++) { A[i] = foo(A[i]); } out[i] = A[i]; } Henk Corporaal
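A runnable sketch of this transformation, under the assumption that foo is a pure per-element function (foo, M and N below are placeholders): after interchanging and merging the loops, each A[i] can live in a single scalar instead of an N-element intermediate array, and both variants produce the same output.

    #include <stdio.h>

    #define N 8
    #define M 3

    static int foo(int x) { return 2 * x + 1; }   /* placeholder kernel */

    int main(void)
    {
        int in[N], out_orig[N], out_xform[N];
        for (int i = 0; i < N; i++) in[i] = i;

        /* Original: all N intermediate values of A must be stored
           between the j-loop nest and the copy loop.               */
        int A[N];
        for (int i = 0; i < N; i++) A[i] = in[i];
        for (int j = 0; j < M; j++)
            for (int i = 0; i < N; i++)
                A[i] = foo(A[i]);
        for (int i = 0; i < N; i++) out_orig[i] = A[i];

        /* After loop interchange + merging: one scalar suffices,
           so the intermediate storage shrinks from N to 1.         */
        for (int i = 0; i < N; i++) {
            int a = in[i];
            for (int j = 0; j < M; j++)
                a = foo(a);
            out_xform[i] = a;
        }

        for (int i = 0; i < N; i++)
            printf("%d %s %d\n", out_orig[i],
                   out_orig[i] == out_xform[i] ? "==" : "!=", out_xform[i]);
        return 0;
    }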

  30. Compiler level: software bypassing • [Diagram: storage hierarchy from datapath buffers via the register file and local memory to global memory — the lower levels are more energy efficient, the higher levels larger.] • The register file consumes a considerable amount of the total processor power • > 15% in a simple 5-stage RISC (2R1W, 32b x 32) • Even more in VLIW and SIMD as size and number of ports increase Henk Corporaal

  31. Reducing RF accesses • Example sequence: add r3, r4, r7 ; add r12, r3, r7 ; sw 0(r1), r12 — only 3 RF reads (r4, r7, r1) are actually needed • Many RF accesses can be eliminated • Bypass read: read operands from the bypass network instead of the RF • Writeback elimination: skip the writeback if the variable is dead • Operand sharing: the same variable on the same port only needs to be read from the RF once Henk Corporaal

  32. Move-Pro: an improved TTA • The original TTA has a few drawbacks: • separate scheduling of operands may increase circuit activity • the trigger port introduces extra scheduling constraints • TTA code density is likely to be lower compared to RISC/VLIW • may need more slots for the same performance • increases instruction-fetching energy • Being able to perform bypassing is critical to code density: • an FU output buffer is added to help • eventually it is up to the compiler to get a good code density • Unified input ports with buffer: • isolate FUs • enable operand sharing • Example moves (32-bit / 16-bit x3): R4 ->ALU[add].o ; R7 ->ALU[add].t ; ALU.o->R3 Henk Corporaal

  33. Compiler framework • Low-level IR • similar to RISC assembly • with extra metadata for the backend • Local instruction scheduling Henk Corporaal

  34. Scheduling example: software bypassing & scheduling • Direct translation results in bad code density • more instructions also means worse performance • Bypassing improves code density and reduces RF accesses • performance and energy consumption are also improved Henk Corporaal

  35. Graph-based resource model • #Issue-nodes is the same as #issue-slots • Nodes represent resources • resources are duplicated for each cycle • Edges represent connectivity or storage • Each node has a capacity and a cost • cost determined by the power model • instruction cost is taken into account Henk Corporaal

  36. Energy results compared to RISC • 3 configurations: • R1: RISC, 2R1W RF • M2: 2-issue MOVE-Pro, 2R1W RF • M3: 3-issue MOVE-Pro, 2R1W RF • 8KB (32-bit) / 9KB (48-bit) I-Mem • Results: • RF energy saving >70% • no loss in instruction-memory energy • R1 and M2 have the same performance Henk Corporaal

  37. Architecture level: going parallel • Running into the • frequency wall • ILP wall • memory wall • energy wall • Chip area enabler: Moore's law goes well below 22 nm • What to do with all this area? • multiple processors fit easily on a single die • Application demands • Cost effective • reuse: just connect existing processors or processor cores • Low power: parallelism may allow lowering Vdd Henk Corporaal

  38. Low power through parallelism • Sequential processor • switching capacitance C • frequency f • voltage V • P1 = f·C·V² • Parallel processor (two times the number of units) • switching capacitance 2C • frequency f/2 • voltage V' < V • P2 = (f/2)·2C·V'² = f·C·V'² < P1 • Check yourself whether this works for pipelining as well ! Henk Corporaal
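A tiny numeric sketch of the argument. The frequency, capacitance and the voltages (1.0 V sequential, 0.8 V at half frequency) are assumptions chosen only to make the ratio concrete.

    #include <stdio.h>

    int main(void)
    {
        double alpha = 0.2, C = 1e-9;          /* arbitrary illustrative values */
        double f  = 1e9, V  = 1.0;             /* sequential: 1 GHz at 1.0 V    */
        double f2 = f / 2, V2 = 0.8;           /* parallel: two units at f/2    */

        double P1 = alpha * f  * C       * V  * V;   /* sequential processor      */
        double P2 = alpha * f2 * (2 * C) * V2 * V2;  /* 2x units, same throughput */

        printf("P1 = %.3f W, P2 = %.3f W (%.0f%% of P1)\n",
               P1, P2, 100 * P2 / P1);
        /* Same work per second, but P2/P1 = (V2/V)^2 = 0.64 here,
           because the lower frequency permits a lower supply voltage. */
        return 0;
    }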

  39. 4-D model of parallel architectures • How to speed up your favorite processor? • Super-pipelining • Powerful instructions • MD-technique: multiple data operands per operation • MO-technique: multiple operations per instruction • Multiple instruction issue • single stream: superscalar • multiple streams • single core, multiple threads: simultaneous multi-threading • multiple cores Henk Corporaal

  40. Architecture methods 1. Pipelined execution of instructions • [Diagram: simple 5-stage pipeline with four instructions in flight over cycles 1-8; IF: Instruction Fetch, DC: Instruction Decode, RF: Register Fetch, EX: Execute instruction, WB: Write Result Register] • Purpose of pipelining: • reduce #gate_levels in the critical path • reduce CPI close to one (instead of a large number for the multi-cycle machine) • more efficient hardware • Some bad news: hazards or pipeline stalls • structural hazards: add more hardware • control hazards, branch penalties: use branch prediction • data hazards: bypassing required Henk Corporaal

  41. Architecture methods 1. Super-pipelining • Superpipelining: • split one or more of the critical pipeline stages • Superpipelining degree S: S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op), where f(Op) is the frequency of operation Op and lt(Op) is the latency of operation Op Henk Corporaal

  42. Architecture methods 2. Powerful instructions (1) • MD-technique • multiple data operands per operation • SIMD: Single Instruction Multiple Data • Vector instruction: for (i=0; i<64; i++) c[i] = a[i] + 5*b[i]; or c = a + 5*b • Assembly: set vl,64 ; ldv v1,0(r2) ; mulvi v2,v1,5 ; ldv v1,0(r1) ; addv v3,v1,v2 ; stv v3,0(r3) Henk Corporaal

  43. Architecture methods 2. Powerful instructions (1): SIMD execution method • [Diagram: PE1 … PEn execute instruction 1, 2, 3, … in lockstep over time.] • SIMD computing • all PEs (Processing Elements) execute the same operation • typical mesh or hypercube connectivity • exploit data locality of e.g. image processing applications • dense encoding (few instruction bits needed) Henk Corporaal

  44. Architecture methods 2. Powerful instructions (1) • Sub-word parallelism • SIMD on a restricted scale • used for multimedia instructions • many processors support this • Examples: MMX, SSE, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3Dnow, Trimedia II • Example operation: Σ_{i=1..4} |a_i − b_i| Henk Corporaal
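A hedged sketch of the Σ|a_i − b_i| example using x86 SSE2 sub-word instructions. It operates on 16 byte-wide sub-words rather than the 4 sub-words of the slide, and the input data is made up; it only illustrates the idea of one instruction working on many packed sub-words.

    #include <stdio.h>
    #include <stdint.h>
    #include <emmintrin.h>            /* SSE2 intrinsics */

    int main(void)
    {
        /* 16 pairs of 8-bit sub-words, processed by single instructions. */
        uint8_t a[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16 };
        uint8_t b[16] = {16,15,14,13,12,11,10, 9, 8, 7, 6, 5, 4, 3, 2, 1 };

        __m128i va  = _mm_loadu_si128((const __m128i *)a);
        __m128i vb  = _mm_loadu_si128((const __m128i *)b);
        __m128i sad = _mm_sad_epu8(va, vb);   /* PSADBW: sums |a_i - b_i|
                                                 into two 16-bit partial sums */
        uint16_t lo = (uint16_t)_mm_extract_epi16(sad, 0);
        uint16_t hi = (uint16_t)_mm_extract_epi16(sad, 4);

        printf("sum |a_i - b_i| = %u\n", (unsigned)(lo + hi));  /* expect 128 */
        return 0;
    }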

  45. Architecture methods 2. Powerful instructions (2) • MO-technique: multiple operations per instruction • Two options: • CISC (Complex Instruction Set Computer) • this is what we did in the 'old' days of microcoded processors • VLIW (Very Long Instruction Word) • VLIW instruction example (one field per FU): FU 1: sub r8, r5, 3 | FU 2: and r1, r5, 12 | FU 3: mul r6, r5, r2 | FU 4: ld r3, 0(r5) | FU 5: bnez r5, 13 Henk Corporaal

  46. VLIW architecture: central register file • [Diagram: nine exec units grouped into three issue slots, all connected to a single central register file.] • Q: How many ports does the register file need for n-issue? Henk Corporaal

  47. Clustered VLIW • [Diagram: a level-1 instruction cache feeding per-cluster loop buffers; each cluster has its own register file and FUs; the clusters share the level-1 data cache and a level-2 (shared) cache.] • Clustering = splitting up the VLIW data path (the same can be done for the instruction path) • Exploit locality @ level 0, for instructions and data Henk Corporaal

  48. Architecture methods 3. Multiple instruction issue (per cycle) • Who guarantees semantic correctness? (can instructions be executed in parallel?) • The user: specifies multiple instruction streams • multi-processor: MIMD (Multiple Instruction Multiple Data) • Hardware: run-time detection of ready instructions • superscalar, single instruction stream • The compiler: compile into a dataflow representation • dataflow processors • multi-threaded processors Henk Corporaal

  49. Four-dimensional representation of the architecture design space <I, O, D, S> • [Figure: the four axes — data/operation 'D' (SIMD around 100, vector around 10), instructions/cycle 'I' (superscalar, MIMD, dataflow from ~10 to ~100), operations/instruction 'O' (VLIW around 10, CISC), and superpipelining degree 'S' (superpipelined around 10) — with RISC near the origin.] • S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op) • Mpar = I·O·D·S • You should exploit this amount of parallelism !!! Henk Corporaal

  50. Examples of many-core / PE architectures • SIMD • Xetal (320 PEs), Imap (128 PEs), AnySP (Michigan Univ) • VLIW • ADRES, TriMedia • more dynamic: Itanium (static scheduling, run-time mapping), TRIPS/EDGE (run-time scheduling) • Multi-threaded • idea: hide long latencies • Denelcor HEP (1982), SUN Niagara (2005) • Multi-processor • RaW, PicoChip, Intel/AMD, GRID, Farms, ….. • Hybrid, like Imagine, GPUs, XC-Core, Cell • actually, most are hybrid !! Henk Corporaal
