ECE 260C – VLSI Advanced Topics Term paper presentation. May 27, 2014 Keyuan Huang Ngoc Luong. Low Power Processor Architectures and Software Optimization Techniques. Motivation. Global Mobile Devices and Connections Growth . ~10 billion mobile devices in 2018 Moore’s law is slowing down
ECE 260C – VLSI Advanced Topics: Term Paper Presentation
May 27, 2014
Keyuan Huang, Ngoc Luong
Low Power Processor Architectures and Software Optimization Techniques
Motivation
[Figure: Global Mobile Devices and Connections Growth]
• ~10 billion mobile devices projected for 2018
• Moore’s law is slowing down
• Power dissipation per gate remains unchanged
• How to reduce power?
  • Circuit level optimizations (DVFS, power gating, clock gating)
  • Microarchitecture optimization techniques
  • Compiler optimization techniques
Trend: more innovation in architectural and software techniques to optimize power consumption
Low Power Architectures Overview
• Asynchronous Processors
  • Eliminate the clock and use a handshake protocol
  • Save clock power at the cost of higher area
  • Ex: SNAP, ARM996HS, SUN Sproull
• Application Specific Instruction Set Processors (ASIP)
  • Applications: cryptography, signal processing, vector processing, physical simulation, computer graphics
  • Combine basic instructions with custom instructions based on the application
  • Ex: Tensilica’s Xtensa, Altera’s NIOS, Xilinx Microblaze, Sony’s Cell, IRAM, Intel’s EXOCHI
• Reconfigurable Instruction Set Processors
  • Combine a fixed core with reconfigurable logic (FPGA)
  • Low NRE cost vs ASIP
  • Ex: Chimaera, GARP, PRISC, Warp, Tensilica’s Stenos, OptimoDE, PICO
• No Instruction Set Computer
  • Build a custom datapath based on the application code
  • Compiler has low-level control of hardware resources
  • Ex: WISHBONE system
Qadri, Muhammad Yasir, Hemal S. Gujarathi, and Klaus D. McDonald-Maier. "Low Power Processor Architectures and Contemporary Techniques for Power Optimization--A Review." Journal of computers 4.10 (2009).
Conservation Cores (C-cores)
• Combine a GP processor with an ASIP to reduce energy and energy-delay for a range of applications
• Broader range of applications compared to an accelerator
• Reconfigurable via a patching algorithm
• Automatically synthesizable by a toolchain from C source code
• Energy consumption is reduced by up to 16x for functions and 2.1x for the whole application
Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
C-core organization
• Data path (FU, mux, register)
• Control unit (state machine)
• Cache interface (ld, st)
• Scan chain (CPU interface)
Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
C-core execution
• The compiler inserts stubs into code that is compatible with a c-core
• At run time the stub chooses between the c-core and the CPU: it uses the c-core if one is available, otherwise the GP processor
• The c-core raises an exception when it finishes executing and returns the value to the CPU
Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
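The dispatch decision made by the stub can be sketched in a few lines. This is only a toy software model, not the toolchain's generated code: `System`, `call`, `sum_to_n`, and the lambda standing in for a hardware c-core are all hypothetical names introduced here, and the "completion exception" is modeled as an ordinary function return.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class System:
    """Toy model: a general-purpose CPU plus a table of available c-cores."""
    c_cores: Dict[str, Callable[[int], int]] = field(default_factory=dict)
    log: List[Tuple[str, str]] = field(default_factory=list)

    def call(self, name: str, cpu_impl: Callable[[int], int], arg: int) -> int:
        """Hypothetical stub wrapped around a c-core-compatible function."""
        core = self.c_cores.get(name)
        if core is None:
            # No matching c-core: fall back to the software version on the CPU.
            self.log.append((name, "cpu"))
            return cpu_impl(arg)
        # A c-core is available: hand over the argument and collect the result.
        # (Stands in for "raise exception on completion, return value to CPU".)
        self.log.append((name, "c-core"))
        return core(arg)


def sum_to_n(n: int) -> int:
    """Software version of the function, as it would run on the CPU."""
    return sum(range(n + 1))


# One system with a c-core for sum_to_n, one without any c-cores.
with_core = System(c_cores={"sum_to_n": lambda n: n * (n + 1) // 2})
without_core = System()

result_hw = with_core.call("sum_to_n", sum_to_n, 10)
result_sw = without_core.call("sum_to_n", sum_to_n, 10)
```

Both calls return the same value; only the logged execution target differs, which is the property the stub has to preserve.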
Patching support
• Basic block mapping
• Control flow mapping
• Register mapping
• Patch generation
Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
Patching Example
• Configurable constants
• Generalized single-cycle datapath operators
• Control flow changes
Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
Results
• 18 fully placed-and-routed c-cores vs MIPS
• 3.3x – 16x energy efficiency improvement
• Reduce system energy consumption by up to 47%
• Reduce energy-delay by up to 55% at the full application level
• Even higher energy saving without patching support
Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
Software Optimization Technique
• The memory system accounts for 1/10 to 1/4 of the power in portable computers
• System bus switching activity is controlled by software
• ALU and FPU data paths need good scheduling to avoid pipeline stalls
• Control logic and clock power are reduced by using the shortest possible program to do the computation
K. Roy and M. C. Johnson, Software design for low power, 1996: NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics
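To make the bus switching point concrete, here is a minimal sketch that counts bit toggles between consecutive words on a bus. Using the Hamming distance as a proxy for switching activity is a common simplification and an assumption here, not a model taken from the cited text; the function names and the 4-bit example words are invented for illustration.

```python
def toggles(a: int, b: int) -> int:
    """Number of bus lines that change when the value goes from a to b."""
    return bin(a ^ b).count("1")


def switching_activity(words) -> int:
    """Total bit toggles over a sequence of values driven onto the bus."""
    return sum(toggles(p, q) for p, q in zip(words, words[1:]))


# The same four values, sent in two different orders.
original = [0b0000, 0b1111, 0b0001, 0b1110]
reordered = [0b0000, 0b0001, 0b1111, 0b1110]

activity_original = switching_activity(original)    # 4 + 3 + 4 toggles
activity_reordered = switching_activity(reordered)  # 1 + 3 + 1 toggles
```

Under this proxy the reordered sequence toggles fewer lines for the same data, which is the kind of effect software ordering decisions can have on the bus.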
General categories of software optimization
• Minimizing memory accesses
  • Minimize accesses needed by the algorithm
  • Minimize total memory size needed by the algorithm
  • Use multiple-word parallel loads, not single-word loads
• Optimal selection and sequencing of machine instructions
  • Instruction packing
  • Minimizing circuit state effect
  • Operand swapping
K. Roy and M. C. Johnson, Software design for low power, 1996: NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics
Compiler Managed Partitioned Data Caches for Low Power
Rajiv Ravindran, Michael Chu, Scott Mahlke
Basic Idea: Compiler Managed, Hardware Assisted
• Global program knowledge
• Proactive optimizations
• Efficient execution
Hardware: banking, dynamic voltage/frequency scaling, dynamic resizing
+ Transparent to the user
+ Handle arbitrary instr/data accesses
– Limited program information
Software: software controlled scratch-pad, data/code reorganization
+ Whole program information
+ Proactive
– Conservative
Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007 (2007)
Traditional Cache Architecture
[Figure: 4-way set-associative cache. The address is split into tag, set, and offset; each way holds tag, data, and LRU state; four tag comparators feed a 4:1 mux; a Replace block selects the victim.]
• Lookup: activate all ways on every access
• Replacement: choose among all the ways
Disadvantages
– Fixed replacement policy
– Set index has no program locality
– Set-associativity has high overhead
– Activates multiple data/tag arrays per access
Partitioned Cache Architecture
[Figure: the same four tag/data/LRU arrays, now labeled as partitions P0 to P3, with comparators, a 4:1 mux, and a Replace block.]
Instruction format: Ld/St Reg [Addr] [k-bit vector] [R/U]
• Lookup: restricted to the partitions specified in the bit-vector if ‘R’, else defaults to all partitions
• Replacement: restricted to the partitions specified in the bit-vector
Advantages
+ Improve performance by controlling replacement
+ Reduce cache access power by restricting number of accesses
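The lookup and replacement rules above can be illustrated with a small software model. This is a sketch under several assumptions not stated on the slide: the whole block number is stored as the tag, LRU is used to pick the victim inside the allowed partitions, and every probed partition counts as one tag check. The class and parameter names (`PartitionedCache`, `mask`, `restricted`) are hypothetical.

```python
class PartitionedCache:
    """Toy partitioned cache: one block per (partition, set) slot."""

    def __init__(self, num_sets: int, num_partitions: int):
        self.num_sets = num_sets
        self.num_partitions = num_partitions
        # tags[p][s] is the block number held by partition p in set s.
        self.tags = [[None] * num_sets for _ in range(num_partitions)]
        self.last_used = [[0] * num_sets for _ in range(num_partitions)]
        self.clock = 0
        self.tag_checks = 0  # proxy for tag/data-array activations

    def access(self, block: int, mask, restricted: bool = True) -> bool:
        """Return True on a hit. `mask` lists the partitions in the bit-vector."""
        self.clock += 1
        s = block % self.num_sets
        # 'R': probe only the partitions in the bit-vector; else probe all.
        lookup = list(mask) if restricted else list(range(self.num_partitions))
        self.tag_checks += len(lookup)  # probed in parallel, so all are counted
        for p in lookup:
            if self.tags[p][s] == block:
                self.last_used[p][s] = self.clock
                return True
        # Miss: replacement is always confined to the bit-vector partitions.
        victim = min(mask, key=lambda p: self.last_used[p][s])
        self.tags[victim][s] = block
        self.last_used[victim][s] = self.clock
        return False


cache = PartitionedCache(num_sets=4, num_partitions=4)
first = cache.access(0, mask=[0])                     # cold miss, 1 tag check
second = cache.access(0, mask=[0])                    # hit, 1 tag check
checks_restricted = cache.tag_checks
third = cache.access(0, mask=[0], restricted=False)   # hit, 4 tag checks
checks_total = cache.tag_checks
```

In this model a restricted access to a single partition costs one tag check, while the unrestricted access probes all four, which mirrors the power argument on the slide.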
Partitioned Caches: Example
(a) Annotated code segment:
    for (i = 0; i < N1; i++) {
        …
        for (j = 0; j < N2; j++)
            y[i + j] += *w1++ + x[i + j];    /* ld1/st1, ld3, ld5 */
        for (k = 0; k < N3; k++)
            y[i + k] += *w2++ + x[i + k];    /* ld2/st2, ld4, ld6 */
    }
(b) Fused load/store instructions: {ld1, st1, ld2, st2} for y, {ld5, ld6} for x, {ld3, ld4} for w1/w2
(c) Trace consisting of array references, cache blocks, and load/stores from the example [figure not reproduced]
(d) Actual cache partition assignment for each instruction: ld1 [100], R; ld5 [010], R; ld3 [001], R
Compiler Controlled Data Partitioning
• Goal: place loads/stores into cache partitions
• Analyze the application’s memory characteristics
  • Cache requirements: number of partitions per ld/st
  • Predict conflicts
• Place loads/stores into different partitions
  • Satisfy each one's caching needs
  • Avoid conflicts, overlap if possible
Cache Analysis: Estimating Number of Partitions
• Minimal partitions to avoid conflict/capacity misses
• Probabilistic hit-rate estimate
• Use the reuse distance to compute the number of partitions
[Figure: access trace across the j-loop and k-loop (X W1 Y Y X W1 Y Y X W2 Y Y X W2 Y Y), the cache blocks B1 and B2 it touches, and a memory reference M]
• M has reuse distance = 1
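As a rough illustration of reuse distance, the sketch below uses the usual textbook definition: the number of distinct other blocks touched between two consecutive accesses to the same block. The slide does not spell out its exact definition, so this is an assumption, and the trace is an invented example rather than the one in the figure.

```python
def reuse_distances(trace):
    """Reuse distance of each access; None marks the first touch of a block."""
    last_seen = {}   # block -> index of its most recent access
    distances = []
    for i, block in enumerate(trace):
        if block in last_seen:
            # Distinct blocks accessed strictly between the two accesses.
            between = set(trace[last_seen[block] + 1 : i])
            distances.append(len(between))
        else:
            distances.append(None)
        last_seen[block] = i
    return distances


# Hypothetical block trace, not the one shown on the slide.
example_trace = ["A", "B", "A", "A", "C", "B"]
example_distances = reuse_distances(example_trace)
```

Under this definition, an access with reuse distance D hits in a fully associative LRU cache that holds more than D blocks, which is why the distance can guide how many partitions a reference needs.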
Cache Analysis: Estimating Number of Partitions
• Avoid conflict/capacity misses for an instruction
• Estimates hit-rate based on reuse distance (D), total number of cache blocks (B), and associativity (A) (Brehob et al., ’99)
[Figure: three hit-rate tables, one each for D = 0, D = 1, and D = 2, with axes labeled 1 to 4 and 8, 16, 24, 32]
• Compute energy matrices in reality
• Pick the most energy efficient configuration per instruction
Cache Analysis: Computing Interferences
• Avoid conflicts among temporally co-located references
• Model conflicts using an interference graph
[Figure: the trace X W1 Y Y X W1 Y Y X W2 Y Y X W2 Y Y mapped to references M4 M2 M1 M1 M4 M2 M1 M1 M4 M3 M1 M1 M4 M3 M1 M1, and the resulting interference graph with nodes M1, M2, M3, M4, each with D = 1]
Partition Assignment
• Placement phase can overlap references
• Compute the combined working-set
• Use the graph-theoretic notion of a clique
• For each clique, new D = Σ D of each node
• Combined D for all overlaps = Max (all cliques)
Example (interference graph with M1, M2, M3, M4, each D = 1):
• Clique 1: M1, M2, M4; new reuse distance (D) = 3
• Clique 2: M1, M3, M4; new reuse distance (D) = 3
• Combined reuse distance = Max(3, 3) = 3
Actual cache partition assignment for each instruction (part-0, part-1, part-2): ld1 [100], R; ld5 [010], R; ld3 [001], R, covering {ld1, st1, ld2, st2}, {ld5, ld6}, {ld3, ld4} for arrays y, w1/w2, x
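The clique rule can be checked on the slide's own example. The sketch below enumerates maximal cliques with a plain Bron-Kerbosch search (a standard algorithm chosen here for brevity, not necessarily what the authors' compiler uses). The edge list is an assumption inferred from the two cliques named on the slide.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)  # r cannot be extended: it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Interference edges inferred from the slide's cliques (assumption).
edges = [("M1", "M2"), ("M1", "M3"), ("M1", "M4"), ("M2", "M4"), ("M3", "M4")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# Per-reference reuse distances as given on the slide.
reuse = {"M1": 1, "M2": 1, "M3": 1, "M4": 1}

cliques = maximal_cliques(adj)
# New D per clique is the sum over its nodes; the combined D is the maximum.
clique_d = {tuple(sorted(c)): sum(reuse[v] for v in c) for c in cliques}
combined_d = max(clique_d.values())
```

With these inputs the two maximal cliques are {M1, M2, M4} and {M1, M3, M4}, each with a summed distance of 3, so the combined reuse distance is 3, matching the slide.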
Experimental Setup
• Trimaran compiler and simulator infrastructure
• ARM9 processor model
• Cache configurations:
  • 1-Kb to 32-Kb
  • 32-byte block size
  • 2, 4, 8 partitions vs. 2, 4, 8-way set-associative cache
• Mediabench suite
• CACTI for cache energy modeling
Reduction in Tag & Data-Array Checks
[Figure: average way accesses (0 to 8) vs. cache size (1-K, 2-K, 4-K, 8-K, 16-K, 32-K, Average) for 2-part, 4-part, and 8-part caches]
• 25%, 30%, 36% access reduction on a 2-, 4-, 8-partition cache
Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007 (2007)
Improvement in Fetch Energy
[Figure: percentage energy improvement (0 to 60) for a 16-Kb cache, comparing 2-part vs 2-way, 4-part vs 4-way, and 8-part vs 8-way. Benchmarks: epic, unepic, cjpeg, djpeg, pegwitenc, pegwitdec, rawcaudio, rawdaudio, mpeg2dec, mpeg2enc, pgpencode, pgpdecode, gsmencode, gsmdecode, g721encode, g721decode, Average]
• 8%, 16%, 25% energy reduction on a 2-, 4-, 8-partition cache
Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007 (2007)
Summary
• Maintain the advantages of a hardware cache
• Expose placement and lookup decisions to the compiler
• Avoid conflicts, eliminate redundancies
• Achieve higher performance and lower power consumption
Future Work
• Hybrid scratch-pad and caches
• Develop an advanced toolchain for newer technology nodes such as 28 nm
• Incorporate data cache partitioning into the compiler of the ASIP toolchain
References
• Qadri, Muhammad Yasir, Hemal S. Gujarathi, and Klaus D. McDonald-Maier. "Low Power Processor Architectures and Contemporary Techniques for Power Optimization--A Review." Journal of computers 4.10 (2009).
• Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
• Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007 (2007)
• K. Roy and M. C. Johnson, Software design for low power, 1996: NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics