Explore the advancements and trends in energy-aware computing, from high-end processors consuming 10 kW to low-end processors delivering higher performance with lower energy consumption. Learn about power efficiency, top supercomputers' energy profiles, and strategies for reducing power consumption at all design levels.
Introduction to Energy Aware Computing
Henk Corporaal
www.ics.ele.tue.nl/~heco
ASCI Winterschool on Energy Aware Computing, Soesterberg, March 2012
Intel Trends • #transistors follows Moore's law • but not frequency and performance per core • example: Core i7, 3 GHz, 100 W
Types of compute systems
A 20 nm scenario (high-end processor) • This means: • a 2 cm² processor consumes 10 kW • a power bound of 100 W allows only 1% of the chip to be active at any time (100 W / 10 kW): dark silicon
Intel's answer: 48-core x86
Power versus Energy • Power P = α·f·C·Vdd² • α switching activity (< 1); f frequency; C switched capacitance; Vdd supply voltage • heat / temperature constraint • wear-out • peak power delivery constraint • Energy E = P·t, or for time-varying P: E = ∫P(t)·dt • battery life • cost: electricity bill • Note: lowering f reduces P, but not necessarily E; E may even increase due to leakage (static power dissipation), as illustrated in the sketch below
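A minimal numerical sketch of these relations in C (the activity factor, capacitance, supply voltage and leakage power are illustrative assumptions, not values from the slides): halving f halves dynamic power, but the task then runs twice as long, so leakage energy grows and total energy E can go up.

    #include <stdio.h>

    int main(void) {
        /* Assumed, illustrative parameters; not from the slides */
        double alpha  = 0.2;    /* switching activity            */
        double C      = 1e-9;   /* switched capacitance [F]      */
        double Vdd    = 1.0;    /* supply voltage [V]            */
        double Pleak  = 0.05;   /* static (leakage) power [W]    */
        double cycles = 1e9;    /* cycles needed by the task     */

        for (double f = 1e9; f >= 0.25e9; f /= 2) {
            double Pdyn = alpha * f * C * Vdd * Vdd;   /* P = a.f.C.Vdd^2 */
            double t    = cycles / f;                  /* execution time  */
            double E    = (Pdyn + Pleak) * t;          /* E = P.t         */
            printf("f = %.2f GHz  P = %.3f W  E = %.3f J\n",
                   f / 1e9, Pdyn + Pleak, E);
        }
        return 0;
    }

With these numbers the total energy rises from 0.25 J at 1 GHz to 0.40 J at 0.25 GHz, even though power keeps dropping.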
What's happening at the top
Top500 nr 1 • 1st: K Computer • 10.51 Petaflop/s on Linpack • 705,024 SPARC64 cores (8 per die; 45 nm) (Fujitsu design) • Tofu interconnect (6-D torus) • 12.7 MegaWatt
Top500 nr 2 • 2nd: Chinese Tianhe-1A • 2.57 Petaflop/s • 186,368 cores (Xeon + NVIDIA processors) • 4.0 MegaWatt
What's happening at the low end…. • March 14, 2012: ARM announced the Cortex M0+ • "The 32-bit Cortex-M0+ consumes just 9µA/MHz on a low-cost 90nm LP process, around one third of the energy of any 8- or 16-bit processor available today, while delivering significantly higher performance" • 2-stage pipeline • option: 1-cycle MUL
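A quick sanity check on what 9 µA/MHz means in energy terms (the 1.2 V supply voltage below is an assumed value for illustration, not an ARM specification): 9 µA per MHz corresponds to 9 pC of charge per cycle, i.e. roughly 9·Vdd pJ per cycle.

    #include <stdio.h>

    int main(void) {
        double uA_per_MHz = 9.0;   /* ARM's figure for the Cortex-M0+    */
        double Vdd        = 1.2;   /* assumed supply voltage [V]         */
        /* 9 uA per MHz = 9e-6 A per 1e6 cycles/s = 9 pC of charge/cycle */
        double q_per_cycle = uA_per_MHz * 1e-6 / 1e6;    /* [C/cycle]    */
        double e_per_cycle = q_per_cycle * Vdd;          /* [J/cycle]    */
        printf("energy/cycle = %.1f pJ\n", e_per_cycle * 1e12);  /* ~10.8 pJ */
        return 0;
    }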
Low end: How much energy in the air? [Rabaey 2009]
[Figure (Woh e.a., ISCA 2009): power efficiency (Mops/mW) versus power (Watts) for IBM Cell, SODA (90 nm and 65 nm), Imagine, VIRAM, Pentium M, TI C6x, mobile HD video, and 3G/4G wireless workloads]
Computational efficiency (Mops/mW): what do we need? The 4G wireless point sits around 1000 Mops/mW; this means 1 pJ / operation, or 1 TeraOp/Watt.
Green500: Top 10 in green supercomputing
Green500: evolution (see the conversion sketch below) • 2008: best result = 536 MFlops/Watt => 1.87 nJ / FloatingPt operation • 2009: best result = 723 MFlops/Watt => 1.38 nJ / FloatingPt operation • Cell cluster, ranking 110 in the Top500 • 2010: best result = 1684 MFlops/Watt => 594 pJ / FloatingPt operation • IBM BlueGene/Q prototype 1, ranking 101 in the Top500, peak perf. 65 TFlops; see also http://www.theregister.co.uk/2010/11/22/ibm_blue_gene_q_super/ • 2011: best result = 2097 MFlops/Watt => 476 pJ / FloatingPt operation • IBM BlueGene/Q prototype 2 • power consumption: 41 kW / peak 85 TFlop/s
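The pJ-per-operation figures above are simply the reciprocal of the MFlops/Watt numbers; a small C sketch that reproduces the conversion:

    #include <stdio.h>

    int main(void) {
        double mflops_per_watt[] = { 536, 723, 1684, 2097 };  /* Green500 best, 2008-2011 */
        int years[] = { 2008, 2009, 2010, 2011 };
        for (int i = 0; i < 4; i++) {
            /* 1 W / (MFlops/W * 1e6 Flop/s) = energy in J per floating-point operation */
            double j_per_flop = 1.0 / (mflops_per_watt[i] * 1e6);
            printf("%d: %6.0f MFlops/W -> %5.0f pJ/Flop\n",
                   years[i], mflops_per_watt[i], j_per_flop * 1e12);
        }
        return 0;
    }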
Energy cost • At ~$1M per MW, energy costs are substantial • 1 petaflop in 2010 uses 3 MW • 1 exaflop in 2018 possible in 200 MW with "usual" scaling • 1 exaflop in 2018 at 20 MW is the DOE (Dep. of Energy) target • see also the MontBlanc EU project: www.montblanc-project.eu • goal: 200 PFlops for 10 MWatt in 2017 • [Figure: normal scaling vs. desired scaling; from Katy Yelick, Berkeley]
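Reading the slide's ~$1M per MW as an annual electricity cost (an assumption about the unit of time), the systems above work out to roughly $3M/year for the 3 MW petaflop machine, $200M/year for an exaflop machine with "usual" scaling, and $20M/year at the DOE target of 20 MW.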
Reducing power @ all design levels • Algorithmic level • Compiler level • Architecture level • Organization level • Circuit level • Silicon level • Important concepts: • lower Vdd and freq. (even if errors occur) / dynamically adapt Vdd and freq. • reduce circuit • exploit locality • reduce switching activity, glitches, etc. • P = α·f·C·Vdd²   E = ∫P·dt   E/cycle = α·C·Vdd²
Algorithmic level • The best indicator for energy is …..…. the number of cycles • Try alternative algorithms with lower complexity • e.g. quick-sort, O(n log n), vs. bubble-sort, O(n²) (a small counting sketch follows below) • … but be aware of the 'constant': O(n log n) really means c·(n log n) • Heuristic approach • go for a good solution, not the best !! Biggest gains at this level !!
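A minimal sketch of the cycle-count argument in plain C (the element count and the use of the library qsort as the O(n log n) representative are my choices for illustration): counting comparisons makes the n² vs. n log n gap visible, and on a cycle-counting energy model fewer operations directly means less energy.

    #include <stdio.h>
    #include <stdlib.h>

    static long ncmp;                          /* comparison counter */

    static int cmp(const void *a, const void *b) {
        ncmp++;
        return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
    }

    static void bubble_sort(int *v, int n) {
        for (int i = 0; i < n - 1; i++)
            for (int j = 0; j < n - 1 - i; j++) {
                ncmp++;
                if (v[j] > v[j + 1]) { int t = v[j]; v[j] = v[j + 1]; v[j + 1] = t; }
            }
    }

    int main(void) {
        enum { N = 10000 };
        int *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
        for (int i = 0; i < N; i++) a[i] = b[i] = rand();

        ncmp = 0; bubble_sort(a, N);           printf("bubble sort: %ld comparisons\n", ncmp);
        ncmp = 0; qsort(b, N, sizeof *b, cmp); printf("qsort:       %ld comparisons\n", ncmp);

        free(a); free(b);
        return 0;
    }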
Compiler level • Source-to-source transformations • loop transformations to improve locality • Strength reduction • e.g. replace Const * A with adds and shifts (a sketch follows below) • Replace floating point with fixed point • Reduce register pressure / number of accesses to the register file • use software bypassing • Scenarios: current workloads are highly dynamic • determine and predict execution modes • group execution modes into scenarios • perform special optimizations per scenario • DVFS: Dynamic Voltage and Frequency Scaling • More advanced loop optimizations • Reorder instructions to reduce bit transitions
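A small sketch of the strength-reduction idea mentioned above (plain C; the constant 10 and the function names are chosen for illustration): a multiplication by a constant is rewritten as shifts and adds, which is typically cheaper in cycles and energy on small cores without a fast multiplier.

    #include <stdint.h>
    #include <stdio.h>

    /* original: multiply by the constant 10 */
    static uint32_t times10_mul(uint32_t a) { return 10u * a; }

    /* strength-reduced: 10*a = 8*a + 2*a = (a << 3) + (a << 1) */
    static uint32_t times10_shift_add(uint32_t a) { return (a << 3) + (a << 1); }

    int main(void) {
        for (uint32_t a = 0; a < 5; a++)
            printf("%u: %u %u\n", (unsigned)a,
                   (unsigned)times10_mul(a), (unsigned)times10_shift_add(a));
        return 0;
    }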
Architecture level • Going parallel • Going heterogeneous • tune your architecture, exploit SFUs (special function units) • trade-off between flexibility / programmability / genericity and efficiency • Add local memories • prefer scratchpads instead of caches • Cluster FUs and register files (see next slide) • Reduce bit-width • sub-word parallelism (SIMD)
Organization (micro-arch.) level • Enabling Vdd reduction • Pipelining • cheap way of parallelism • enabling lower freq. → lower Vdd (for the same throughput) • Note 1: don't pipeline if you don't need the performance • Note 2: don't exaggerate (like the 31-stage Pentium 4) • Reduce register traffic • avoid unnecessary reads and writes • make bypass registers visible
Circuit level • Clock gating • Power gating • Multiple Vdd modes • Reduce glitches: balance digital paths • Exploit zeros • Special SRAM cells • normal SRAM cannot scale below Vdd = 0.7 - 0.8 Volt • Razor method: replay • Allow errors and add redundancy to architecturally invisible structures • branch predictor • caches • .. and many more ..
Silicon level • Higher Vt (threshold voltage) • Back-biasing control • see the thesis of Maurice Meijer (2011) • SOI (Silicon On Insulator) • the silicon junction sits above an electrical insulator (silicon dioxide) • lowers parasitic device capacitance • Better transistors: FinFET • multi-gate • reduced leakage (off-state current) • .. and many more • See the lectures by Pineda on Friday
Let's detail a few examples • Algorithmic level: exploiting locality • Compiler level: software bypassing • Architecture level: going parallel • Organization level: Razor • Circuit level: exploiting zeros in a multiplier • Silicon level: sub-threshold operation
Algorithm level: Exploiting locality • Generic platform: [Figure: multi-level memory hierarchy; level 1: CPUs with I-cache and D-cache plus HW accelerators with local memories, connected by on-chip busses and a bus-if/bridge; level 2: L2 cache on the chip bus; level 3: main memory; level 4: disks on a SCSI bus]
Data transfer and storage power: Power(memory) = 33 × Power(arithmetic)
Loop transformations • improve regularity of accesses • improve temporal locality: production → consumption • Expected influence • reduce temporary storage and (anticipated) background storage • Work horse: loop merging • typically many enabling transformations are needed before you can merge loops
Loop transformations: Merging
Before:
  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (j=0; j<N; j++)
    C[j] = f(B[j], A[j]);
After merging:
  for (i=0; i<N; i++) {
    B[i] = f(A[i]);
    C[i] = f(B[i], A[i]);
  }
Locality improved! [Figure: production/consumption of B over time and location, before and after merging; after merging each B[i] is consumed right after it is produced]
Loop transformations, example:
  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=0; i<N; i++)
    C[i] = g(B[i]);
requires 2 background memory ports (B goes through background memory), whereas the merged version
  for (i=0; i<N; i++) {
    B[i] = f(A[i]);
    C[i] = g(B[i]);
  }
needs only 1 background + 1 foreground port: B[i] can stay in foreground memory (registers). [Figure: CPU with foreground memory and an external memory interface to background memory; cycle annotations N cyc. / 2N cyc. / N cyc.]
Loop transformations, example: enabling transformation required
  for (j=1; j<=M; j++)
    for (i=1; i<=N; i++)
      A[i] = foo(A[i]);
  for (i=1; i<=N; i++)
    out[i] = A[i];
needs storage size N for A, whereas after loop interchange and merging
  for (i=1; i<=N; i++) {
    for (j=1; j<=M; j++)
      A[i] = foo(A[i]);
    out[i] = A[i];
  }
only storage size 1 is needed.
Compiler level: Software bypassing • [Figure: storage hierarchy from datapath buffers and the RF (more efficient) down to local memory and global memory (larger)] • The register file consumes a considerable amount of the total processor power • > 15% in a simple 5-stage RISC (2R1W, 32b x 32) • even more in VLIW and SIMD, as size and number of ports increase
Reducing RF Accesses • Example sequence:
  add r3, r4, r7
  add r12, r3, r7
  sw 0(r1), r12
Only 3 RF reads (r4, r7, r1) are actually needed; the remaining operands can come from the bypass network. • Many RF accesses can be eliminated • Bypass read: read operands from the bypass network instead of the RF • Writeback elimination: skip the writeback if the variable is dead • Operand sharing: the same variable on the same port only needs to be read from the RF once
Move-Pro: an Improved TTA • The original TTA has a few drawbacks: • separate scheduling of operands may increase circuit activity • the trigger port introduces extra scheduling constraints • TTA code density is likely to be lower compared to RISC/VLIW • may need more slots for the same performance • increases instruction fetch energy • Being able to perform bypassing is critical to code density: an FU output buffer is added to help; eventually it is up to the compiler to get good code density • Unified input ports with buffer: isolate FUs, enable operand sharing • Example move code (32-bit and 16-bit x3 instruction formats):
  R4 -> ALU[add].o
  R7 -> ALU[add].t
  ALU.o -> R3
Compiler Framework • Low-level IR • similar to RISC assembly • with extra metadata for the backend • Local instruction scheduling
Scheduling Example: software bypassing & scheduling • Direct translation results in bad code density • more instructions also mean worse performance • Bypassing improves code density and reduces RF accesses • performance and energy consumption are also improved
Graph-based Resource Model • Nodes represent resources • resources are duplicated for each cycle • #issue nodes equals #issue slots • Edges represent connectivity or storage • Each node has a capacity and a cost • cost is determined by the power model • instruction cost is taken into account
Energy Results Compared to RISC • 3 configurations • R1: RISC, 2R1W RF • M2: 2-issue MOVE-Pro, 2R1W RF • M3: 3-issue MOVE-Pro, 2R1W RF • 8KB (32-bit) / 9KB (48-bit) I-Mem • RF energy saving > 70% • no loss in instruction-memory energy • R1 and M2 have the same performance
Architecture level: going parallel • Running into the • frequency wall • ILP wall • memory wall • energy wall • Chip area enabler: Moore's law goes well below 22 nm • what to do with all this area? • multiple processors fit easily on a single die • Application demands • Cost effective • reuse: just connect existing processors or processor cores • Low power: parallelism may allow lowering Vdd
Low power through parallelism
• Sequential processor (one CPU):
  • switching capacitance C
  • frequency f
  • voltage V
  • P1 = f·C·V²
• Parallel processor (CPU1 + CPU2, two times the number of units):
  • switching capacitance 2C
  • frequency f/2
  • voltage V' < V (the lower frequency allows a lower supply voltage)
  • P2 = (f/2)·2C·V'² = f·C·V'² < P1
• Check yourself whether this works for pipelining as well! (see the sketch below)
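A small numerical sketch of this argument in C (the capacitance, frequency, and the assumption that halving the clock allows dropping Vdd from 1.0 V to 0.8 V are illustrative values, not from the slides):

    #include <stdio.h>

    /* dynamic power P = f * C * V^2 (activity factor omitted for simplicity) */
    static double power(double f, double C, double V) {
        return f * C * V * V;
    }

    int main(void) {
        double C = 1e-9, f = 1e9;        /* assumed: 1 nF switched per cycle, 1 GHz  */
        double V = 1.0, Vp = 0.8;        /* assumed: Vdd can drop at half the clock  */

        double P1 = power(f, C, V);            /* sequential: one unit at f, V       */
        double P2 = power(f / 2, 2 * C, Vp);   /* parallel: two units at f/2, V'     */

        printf("P1 = %.2f W (sequential)\n", P1);  /* 1.00 W                         */
        printf("P2 = %.2f W (parallel)\n",   P2);  /* 0.64 W, at the same throughput */
        return 0;
    }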
4-D model of parallel architectures: how to speed up your favorite processor? • Super-pipelining • Powerful instructions • MD-technique: multiple data operands per operation • MO-technique: multiple operations per instruction • Multiple instruction issue • single stream: superscalar • multiple streams • single core, multiple threads: simultaneous multithreading (SMT) • multiple cores
Architecture methods 1. Pipelined Execution of Instructions
• [Figure: simple 5-stage pipeline, 4 instructions over 8 cycles; IF: Instruction Fetch, DC: Instruction Decode, RF: Register Fetch, EX: Execute instruction, WB: Write Result Register]
• Purpose of pipelining:
  • reduce #gate levels in the critical path
  • reduce CPI close to one (instead of a large number for the multicycle machine)
  • more efficient hardware
• Some bad news: hazards, or pipeline stalls
  • structural hazards: add more hardware
  • control hazards, branch penalties: use branch prediction
  • data hazards: bypassing required
Architecture methods 1. Super pipelining
• Superpipelining: split one or more of the critical pipeline stages
• Superpipelining degree S:
  S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op)
  where f(Op) is the frequency of operation Op and lt(Op) is the latency of operation Op
Architecture methods 2. Powerful Instructions (1)
• MD-technique: multiple data operands per operation
• SIMD: Single Instruction Multiple Data
• Vector instruction:
  for (i=0; i<64; i++)
    c[i] = a[i] + 5*b[i];
  or: c = a + 5*b
• Assembly:
  set   vl, 64
  ldv   v1, 0(r2)
  mulvi v2, v1, 5
  ldv   v1, 0(r1)
  addv  v3, v1, v2
  stv   v3, 0(r3)
Architecture methods 2. Powerful Instructions (1)
• [Figure: SIMD execution method; over time, all PEs (PE1 … PEn) execute instruction 1, then instruction 2, and so on]
• SIMD computing:
  • all PEs (Processing Elements) execute the same operation
  • typical mesh or hypercube connectivity
  • exploit data locality of e.g. image processing applications
  • dense encoding (few instruction bits needed)
Architecture methods 2. Powerful Instructions (1)
• Sub-word parallelism
  • SIMD on a restricted scale
  • used for multimedia instructions
  • many processors support this
• Examples: MMX, SSE, SUN VIS, HP MAX-2, AMD K7/Athlon 3DNow!, TriMedia II
• Example operation: Σ_{i=1..4} |a_i − b_i| (sum of absolute differences); a small sketch of the sub-word idea follows below
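A minimal sketch of sub-word parallelism in plain C (a generic "SWAR" packed byte addition; this bit-trick is my illustration, not one of the listed instruction sets): four 8-bit lanes packed into one 32-bit word are added with a single 32-bit operation, with masking to keep carries from crossing lane boundaries. Media ISAs provide such packed operations, including the sum of absolute differences above, directly in hardware.

    #include <stdint.h>
    #include <stdio.h>

    /* Add four packed 8-bit lanes at once (modulo 256 per lane),
       preventing carries from spilling into the neighbouring lane. */
    static uint32_t padd8(uint32_t a, uint32_t b) {
        uint32_t low = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);  /* add low 7 bits per lane */
        return low ^ ((a ^ b) & 0x80808080u);                  /* fix up the top bit      */
    }

    int main(void) {
        uint32_t a = 0x01020304u;                     /* lanes 1, 2, 3, 4     */
        uint32_t b = 0x10203040u;                     /* lanes 16, 32, 48, 64 */
        printf("%08x\n", (unsigned)padd8(a, b));      /* prints 11223344      */
        return 0;
    }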
Architecture methods 2. Powerful Instructions (2)
• MO-technique: multiple operations per instruction
• Two options:
  • CISC (Complex Instruction Set Computer): this is what we did in the 'old' days of microcoded processors
  • VLIW (Very Long Instruction Word)
• VLIW instruction example (one field per FU):
  FU 1: sub r8, r5, 3 | FU 2: and r1, r5, 12 | FU 3: mul r6, r5, r2 | FU 4: ld r3, 0(r5) | FU 5: bnez r5, 13
VLIW architecture: central Register File • [Figure: nine exec units in three issue slots, all sharing one central register file] • Q: How many ports does the register file need for n-issue? (With 2 reads and 1 write per operation: 2n read ports and n write ports.)
Clustered VLIW • [Figure: a level-2 (shared) cache feeding a level-1 instruction cache and a level-1 data cache; each cluster has its own loop buffer, register file, and group of FUs] • Clustering = splitting up the VLIW data path (the same can be done for the instruction path) • Exploit locality @ level 0, for instructions and data
Architecture methods 3. Multiple instruction issue (per cycle) • Who guarantees semantic correctness? • i.e., can instructions be executed in parallel? • User: specifies multiple instruction streams • multi-processor: MIMD (Multiple Instruction Multiple Data) • HW: run-time detection of ready instructions • superscalar, single instruction stream • Compiler: compile into a dataflow representation • dataflow processors • multi-threaded processors
Four-dimensional representation of the architecture design space <I, O, D, S>
• [Figure: axes are Instructions/cycle 'I', Operations/instruction 'O', Data/operation 'D', and Superpipelining degree 'S'; example architectures plotted: RISC, CISC, Superscalar, Dataflow, MIMD, VLIW, Vector, SIMD, Superpipelined]
• Mpar = I · O · D · S; you should exploit this amount of parallelism!
• Example: I = 2, O = 4, D = 4, S = 1 gives Mpar = 2·4·4·1 = 32.
Examples of many-core / PE architectures • SIMD: Xetal (320 PEs), IMAP (128 PEs), AnySP (Michigan Univ) • VLIW: ADRES, TriMedia • more dynamic: Itanium (static scheduling, run-time mapping), TRIPS/EDGE (run-time scheduling) • Multi-threaded • idea: hide long latencies • Denelcor HEP (1982), SUN Niagara (2005) • Multi-processor: Raw, PicoChip, Intel/AMD, GRID, farms, … • Hybrid, like Imagine, GPUs, XC-Core, Cell • actually, most are hybrid!