ECE 260C – VLSI Advanced Topics Term paper presentation

ECE 260C – VLSI Advanced Topics Term paper presentation. May 27, 2014 Keyuan Huang Ngoc Luong. Low Power Processor Architectures and Software Optimization Techniques. Motivation. Global Mobile Devices and Connections Growth . ~10 billion mobile devices in 2018 Moore’s law is slowing down


Presentation Transcript


  1. ECE 260C – VLSI Advanced Topics: Term paper presentation • May 27, 2014 • Keyuan Huang, Ngoc Luong • Low Power Processor Architectures and Software Optimization Techniques

  2. Motivation • Global mobile devices and connections growth: ~10 billion mobile devices in 2018 • Moore’s law is slowing down • Power dissipation per gate remains unchanged • How to reduce power? • Circuit-level optimizations (DVFS, power gating, clock gating) • Microarchitecture optimization techniques • Compiler optimization techniques • Trend: more innovation in architectural and software techniques to optimize power consumption

  3. Low Power Architectures Overview • Asynchronous Processors • Eliminate the clock and use a handshake protocol • Save clock power at the cost of higher area • Ex: SNAP, ARM996HS, SUN Sproull • Application Specific Instruction Set Processors (ASIP) • Applications: cryptography, signal processing, vector processing, physical simulation, computer graphics • Combine basic instructions with custom instructions based on the application • Ex: Tensilica’s Xtensa, Altera’s NIOS, Xilinx Microblaze, Sony’s Cell, IRAM, Intel’s EXOCHI • Reconfigurable Instruction Set Processors • Combine a fixed core with reconfigurable logic (FPGA) • Lower NRE cost than ASIP • Ex: Chimaera, GARP, PRISC, Warp, Tensilica’s Stenos, OptimoDE, PICO • No Instruction Set Computer • Build a custom datapath based on application code • Compiler has low-level control of hardware resources • Ex: WISHBONE system. Qadri, Muhammad Yasir, Hemal S. Gujarathi, and Klaus D. McDonald-Maier. "Low Power Processor Architectures and Contemporary Techniques for Power Optimization--A Review." Journal of Computers 4.10 (2009).

  4. Conservation cores (c-cores) • Combine a general-purpose (GP) processor with ASIP-style hardware to reduce energy and energy-delay for a range of applications • Broader range of applications compared to an accelerator • Reconfigurable via a patching algorithm • Automatically synthesizable by a toolchain from C source code • Energy consumption is reduced by up to 16x for individual functions and up to 2.1x for whole applications Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.

  5. C-core organization • Data path (FU, mux, register) • Control unit (state machine) • Cache interface (ld, st) • Scan chain (CPU interface) Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.

  6. C-core execution • The compiler inserts stubs into code that is compatible with a c-core • At run time, the stub chooses between the c-core and the CPU • If a matching c-core is available, it executes the function; otherwise the GP processor does • The c-core raises an exception when it finishes executing and returns the value to the CPU Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
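
The dispatch decision described on this slide can be sketched in C. This is only an illustrative model under stated assumptions, not the toolchain's generated code: the `ccore_stub` struct, the `ccore_available` flag, and both kernel functions are hypothetical names, and the hardware invocation is modeled as an ordinary function pointer rather than a scan-chain transfer plus completion exception.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical signature shared by the software and c-core versions. */
typedef int (*kernel_fn)(int);

/* Hypothetical descriptor for one function that may have a c-core. */
typedef struct {
    bool      ccore_available; /* assumption: set by the runtime */
    kernel_fn run_on_ccore;    /* stands in for the hardware invocation */
    kernel_fn run_on_cpu;      /* ordinary software version */
} ccore_stub;

/* Model of the stub a compiler would insert at the call site:
 * use the c-core when one is available, else fall back to the CPU. */
int stub_call(const ccore_stub *s, int arg)
{
    assert(s != NULL && s->run_on_cpu != NULL);
    if (s->ccore_available && s->run_on_ccore != NULL) {
        return s->run_on_ccore(arg);
    }
    return s->run_on_cpu(arg);
}

/* Illustrative software kernel: sum of 0..n-1. */
int sum_sw(int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += i;
    }
    return acc;
}

/* Illustrative stand-in for the same function running on a c-core. */
int sum_ccore_model(int n)
{
    return n > 0 ? n * (n - 1) / 2 : 0;
}
```

A caller would fill in one `ccore_stub` per offloadable function and route every call through `stub_call`, so the same binary runs whether or not the c-core is present.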

  7. Patching support • Basic block mapping • Control flow mapping • Register mapping • Patch generation Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.

  8. Patching Example • Configurable constants • Generalized single-cycle datapath operators • Control flow changes Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.

  9. Results • 18 fully placed-and-routed c-cores vs. MIPS • 3.3x – 16x energy efficiency improvement • Reduce system energy consumption by up to 47% • Reduce energy-delay by up to 55% at the full application level • Even higher energy savings without patching support Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.

  10. Software Optimization Techniques • The memory system accounts for 1/10 to 1/4 of the power in portable computers • System bus switching activity is controlled by software • ALU and FPU data paths need good scheduling to avoid pipeline stalls • Control logic and clock power are reduced by using the shortest possible program to do the computation K. Roy and M. C. Johnson, "Software design for low power," NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics, 1996.

  11. General categories of software optimization • Minimizing memory accesses • Minimize accesses needed by the algorithm • Minimize total memory size needed by the algorithm • Use multiple-word parallel loads, not single-word loads • Optimal selection and sequencing of machine instructions • Instruction packing • Minimizing circuit state effect • Operand swapping K. Roy and M. C. Johnson, "Software design for low power," NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics, 1996.

  12. Compiler Managed Partitioned Data Caches for Low Power (Rajiv Ravindran, Michael Chu, Scott Mahlke) • Basic idea: compiler managed, hardware assisted • Hardware approaches: banking, dynamic voltage/frequency scaling, dynamic resizing (+ transparent to the user; + handle arbitrary instruction/data accesses; - limited program information) • Software approaches: software-controlled scratch-pad, data/code reorganization (+ whole-program information; + proactive; - conservative) • Combined goals: global program knowledge, proactive optimizations, efficient execution Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007 (2007)

  13. Traditional Cache Architecture • [Figure: 4-way set-associative cache; the address is split into tag, set, and offset; each way holds tag, data, and LRU state; four tag comparators feed a 4:1 mux and the replacement logic] • Lookup: activate all ways on every access • Replacement: choose among all the ways • Disadvantages: fixed replacement policy; set index has no program locality; set-associativity has high overhead; multiple data/tag arrays are activated per access

  14. Partitioned Cache Architecture • [Figure: each load/store carries Reg, [Addr], a [k-bit vector], and an [R/U] flag; the address is split into tag, set, and offset; partitions P0 to P3 each hold tag, data, and LRU state; four tag comparators feed a 4:1 mux and the replacement logic] • Lookup: restricted to the partitions specified in the bit-vector if ‘R’, else defaults to all partitions • Replacement: restricted to the partitions specified in the bit-vector • Advantages: + improve performance by controlling replacement; + reduce cache access power by restricting the number of accesses

  15. Partitioned Caches: Example • (a) Annotated code segment: for (i = 0; i < N1; i++) { … for (j = 0; j < N2; j++) y[i + j] += *w1++ + x[i + j]; for (k = 0; k < N3; k++) y[i + k] += *w2++ + x[i + k]; } • (b) Fused load/store instructions: ld1/st1, ld2/st2, ld3, ld4, ld5, ld6 • (c) Trace consisting of array references, cache blocks, and loads/stores from the example • (d) Actual cache partition assignment for each instruction: ld1, st1, ld2, st2 (array y) use bit-vector [100], R; ld5, ld6 (w1/w2) use [010], R; ld3, ld4 (array x) use [001], R

  16. Compiler Controlled Data Partitioning • Goal: place loads/stores into cache partitions • Analyze the application’s memory characteristics • Cache requirements: number of partitions per ld/st • Predict conflicts • Place loads/stores into different partitions • Satisfy each instruction’s caching needs • Avoid conflicts, overlap if possible

  17. Cache Analysis: Estimating Number of Partitions • Minimal partitions to avoid conflict/capacity misses • Probabilistic hit-rate estimate • Use the reuse distance to compute the number of partitions • [Figure: access trace over the j-loop (X W1 Y Y X W1 Y Y) and k-loop (X W2 Y Y X W2 Y Y), mapped to cache blocks B1, B2, and M; M has reuse distance = 1]

  18. Cache Analysis: Estimating Number of Partitions • Avoid conflict/capacity misses for an instruction • Estimate the hit rate based on reuse distance (D), total number of cache blocks (B), and associativity (A) (Brehob et al., ’99) • [Figure: three tables for D = 0, D = 1, and D = 2, with axes labeled 1 to 4 and 8, 16, 24, 32] • In practice, compute energy matrices • Pick the most energy-efficient configuration per instruction

  19. Cache Analysis: Computing Interferences • Avoid conflicts among temporally co-located references • Model conflicts using an interference graph • [Figure: the trace X W1 Y Y (j-loop) and X W2 Y Y (k-loop) mapped to memory references M4 M2 M1 M1 and M4 M3 M1 M1; interference graph with nodes M1, M2, M3, M4, each with D = 1]

  20. Partition Assignment • Placement phase can overlap references • Compute the combined working set • Use the graph-theoretic notion of a clique • For each clique, new D = Σ D of each node • Combined D for all overlaps = Max(all cliques) • Example: M1, M2, M3, M4 each have D = 1 • Clique 1: M1, M2, M4, new reuse distance (D) = 3 • Clique 2: M1, M3, M4, new reuse distance (D) = 3 • Combined reuse distance = Max(3, 3) = 3 • [Figure: actual cache partition assignment for each instruction across part-0, part-1, part-2: ld1, st1, ld2, st2 (y) with ld1 [100], R; ld5, ld6 (w1/w2) with ld5 [010], R; ld3, ld4 (x) with ld3 [001], R]
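
The combined reuse-distance rule on this slide (sum D within each clique, then take the maximum over cliques) is simple enough to check with a short C sketch. The cliques are assumed to be given as input; finding them in the interference graph is not shown. The `clique_t` type, `MAX_CLIQUE`, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CLIQUE 8 /* assumption: upper bound on nodes per clique */

/* A clique lists indices into the per-reference reuse-distance array. */
typedef struct {
    size_t n;
    size_t member[MAX_CLIQUE];
} clique_t;

/* New D for one clique: the sum of D over its nodes. */
int clique_distance(const int *node_d, const clique_t *q)
{
    assert(node_d != NULL && q != NULL && q->n <= MAX_CLIQUE);
    int sum = 0;
    for (size_t i = 0; i < q->n; i++) {
        sum += node_d[q->member[i]];
    }
    return sum;
}

/* Combined D for all overlapped references: the maximum over cliques. */
int combined_reuse_distance(const int *node_d, const clique_t *cliques,
                            size_t num_cliques)
{
    assert(cliques != NULL || num_cliques == 0);
    int best = 0;
    for (size_t c = 0; c < num_cliques; c++) {
        int d = clique_distance(node_d, &cliques[c]);
        if (d > best) {
            best = d;
        }
    }
    return best;
}
```

With the slide's numbers (M1 to M4 stored at indices 0 to 3, each with D = 1, and cliques {M1, M2, M4} and {M1, M3, M4}), both cliques sum to 3 and the combined reuse distance is 3.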

  21. Experimental Setup • Trimaran compiler and simulator infrastructure • ARM9 processor model • Cache configurations: • 1-Kb to 32-Kb • 32-byte block size • 2, 4, 8 partitions vs. 2, 4, 8-way set-associative cache • Mediabench suite • CACTI for cache energy modeling

  22. Reduction in Tag & Data-Array Checks • [Figure: average way accesses (0 to 8) vs. cache size (1-K to 32-K, plus average) for 2-part, 4-part, and 8-part caches] • 25%, 30%, 36% access reduction on a 2-, 4-, 8-partition cache Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007 (2007)

  23. Improvement in Fetch Energy • [Figure: percentage energy improvement (0 to 60) for a 16-Kb cache, comparing 2-part vs 2-way, 4-part vs 4-way, and 8-part vs 8-way, across benchmarks epic, unepic, cjpeg, djpeg, pegwitenc, pegwitdec, rawcaudio, rawdaudio, mpeg2dec, mpeg2enc, pgpencode, pgpdecode, gsmencode, gsmdecode, g721encode, g721decode, and the average] • 8%, 16%, 25% energy reduction on a 2-, 4-, 8-partition cache Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007 (2007)

  24. Summary • Maintain the advantages of a hardware cache • Expose placement and lookup decisions to the compiler • Avoid conflicts, eliminate redundancies • Achieve higher performance and lower power consumption

  25. Future Work • Hybrid scratch-pad and caches • Develop an advanced toolchain for newer technology nodes such as 28nm • Incorporate data cache partitioning into the compiler of the ASIP toolchain

  26. References • Qadri, Muhammad Yasir, Hemal S. Gujarathi, and Klaus D. McDonald-Maier. "Low Power Processor Architectures and Contemporary Techniques for Power Optimization--A Review." Journal of Computers 4.10 (2009). • Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010. • Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007 (2007) • K. Roy and M. C. Johnson, "Software design for low power," NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics, 1996.
