Efficient Performance Scaling of Future CGRAs for Mobile Applications. Yongjun Park, Jason Jong Kyu Park, and Scott Mahlke. December 11, 2012. University of Michigan, Ann Arbor. Convergence of Functionalities. Flexible Accelerator! 4G Wireless. Audio. Video. 3D. Navigation.
Efficient Performance Scaling of Future CGRAs for Mobile Applications • Yongjun Park, Jason Jong Kyu Park, and Scott Mahlke • December 11, 2012 • University of Michigan, Ann Arbor 1
Convergence of Functionalities • [Figure: Anatomy of an iPhone4. Labels: 4G Wireless, Audio, Video, 3D, Navigation. Callout: Flexible Accelerator!] • The convergence of functionalities demands a flexible solution because of design cost and programmability. 2
CGRA: Attractive Alternative to ASICs • Array of PEs connected by a mesh-like interconnect • High throughput from a large number of resources • Distributed hardware offers low cost and low power consumption • High flexibility through dynamic reconfiguration 3
Bridging the Gap Between Market Demand and Computation Power • How can performance scale while retaining energy efficiency? [Canali, Internet Computing Magazine, IEEE, 2009] 4
Agenda: Scaling the Energy Efficiency of CGRAs • Investigate the key factors and their feasibility in terms of performance and power efficiency • Hardware scalability vs. hardware flexibility • Interconnection topology • Complex PE vs. simple PE • Vector memory operation support • Homogeneity vs. heterogeneity 5
Experimental Setup • Target applications • Media benchmarks: AAC decoder, H.264 decoder, and 3D rendering • Game physics benchmarks: line of sight, convolution, and conjugate • Target architecture: various types of CGRAs • 16 to 64 heterogeneous/homogeneous resources • IMPACT frontend compiler + edge-centric modulo scheduler • Power measurement • IBM 65 nm technology at 200 MHz, 1 V 6
Q1: Interconnection Topology • Overview • Routing overhead limits performance as the size of the CGRA increases • Common solution: clustering • What is the optimal interconnection topology? • Methodology • Compare the performance of three clustering schemes • Baseline • Fixed partition: the CGRA is physically split into multiple partitions • Flexible partition: the number of partitions can be changed dynamically from 1 to 8 • Total number of PEs: 4 to 128 7
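The flexible-partition scheme can be pictured as choosing how many clusters the array is split into at run time. The following is a minimal Python sketch of that choice, not the authors' implementation. It assumes equal-sized partitions and power-of-two partition counts, neither of which is stated on the slide, and `partition_options` is a hypothetical helper name.

```python
def partition_options(total_pes, max_partitions=8):
    """List (num_partitions, pes_per_partition) pairs for a flexible CGRA.

    Assumptions (not from the slides): partitions are equal-sized and the
    partition count is a power of two.
    """
    options = []
    n = 1
    while n <= max_partitions and n <= total_pes:
        if total_pes % n == 0:
            options.append((n, total_pes // n))
        n *= 2
    return options


# Enumerate the ways a 64-PE array could be split into 1 to 8 partitions.
print(partition_options(64))
```

Under these assumptions, a fixed-partition design corresponds to committing to a single entry of this list at design time, while a flexible design may pick a different entry per loop.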
Q1: Interconnection Topology • [Figure: an application's no-DLP loops and DLP loops mapped onto the baseline, fixed-partition, and flexible-mapping CGRAs] 8
Performance Comparison (Base, Fixed, Flex) • Fixed partitioning does not always show better performance. • Flexible architectures show the best performance and retain scalability. 9
Q2: Complex PEs vs. Simple PEs • Overview • CGRAs with complex PEs have been introduced • Two-level interconnect • The number of RFs can decrease • Multiple instructions can be chained • Challenge: resource utilization • Goal: determine the feasibility of complex PEs in terms of energy consumption • Methodology • Compare the energy consumption of different PE styles • Number of FUs inside a PE: 1 to 6 • Uniform vs. optimized 10
PE Designs 11
Energy Consumption • Energy consumption does not increase dramatically as the number of PEs increases • Within a 1.5x energy budget, complex PEs with 2 to 3 FUs can also be suitable solutions • [Figure annotation: 1.5x energy] 12
Q3: SIMD Memory Support • Overview • SIMD memory support lowers power and reduces the number of instructions • Challenge: degree of DLP • Goal: determine the feasibility of SIMD memory access in terms of energy consumption • Methodology • Compare the energy consumption at different SIMD widths: 1 to 16 13
Relative Energy Consumption • Total energy consumption at wider vector widths can be similar to that of a scalar memory unit • A high degree of spatial locality can compensate for the power overhead 14
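The instruction-count saving from SIMD memory access can be illustrated with simple arithmetic. The sketch below is a hedged illustration rather than the authors' model: it assumes contiguous accesses that can always be packed into full-width vector operations, and the 1024-element array and the helper name `memory_instruction_count` are hypothetical.

```python
def memory_instruction_count(num_elements, simd_width):
    """Memory instructions needed to touch num_elements contiguous elements.

    Assumption (not from the slides): accesses are contiguous, so each
    instruction moves up to simd_width elements.
    """
    if simd_width < 1:
        raise ValueError("simd_width must be at least 1")
    # Integer ceiling division: a partial final vector still costs one access.
    return (num_elements + simd_width - 1) // simd_width


# Sweep the SIMD widths studied on the slide (1 to 16) for a 1024-element array.
for width in (1, 2, 4, 8, 16):
    print(width, memory_instruction_count(1024, width))
```

When the degree of DLP is low, fewer elements can be packed per access and the effective width shrinks, which is the challenge the slide points to.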
Conclusion Beginning • Flexible partitioning should be supported to further improve performance. • Complex PEs can be more energy efficient even at low resource utilization. • Wide SIMD memory support can be realistic given the characteristics of mobile applications. 15
Questions? • For more information • http://cccp.eecs.umich.edu 16
Q1: Homogeneity vs. Heterogeneity • Overview • Heterogeneous CGRAs are common • No experiments exist on the effect of heterogeneity compared with homogeneity • Methodology • Start from a 16-PE homogeneous CGRA (integer ALU, complex ALU, memory unit) • Decrease the number of PEs supporting the complex ALU and the memory unit • Performance goal: 80% of the homogeneous CGRA's performance • How about performance? 17
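The methodology above amounts to a sweep over unit counts with a pass/fail performance threshold. A minimal Python sketch follows, assuming a step size the slide does not give: the unit counts `(16, 12, 8, 4)` are illustrative, and `CgraConfig`, `sweep_configs`, and `meets_goal` are hypothetical names. Performance numbers would come from scheduling each benchmark onto each configuration, which is not modeled here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class CgraConfig:
    total_pes: int      # every PE is assumed to keep its integer ALU
    complex_units: int  # PEs that also support complex ALU operations
    memory_units: int   # PEs that also support memory operations


def sweep_configs(total_pes=16, unit_counts=(16, 12, 8, 4)):
    """Enumerate heterogeneous variants of a homogeneous CGRA.

    The unit_counts steps are an assumption; the slide only says the
    numbers of complex and memory units are decreased.
    """
    return [CgraConfig(total_pes, c, m) for c, m in product(unit_counts, repeat=2)]


def meets_goal(perf, homogeneous_perf, goal=0.8):
    """True if perf reaches the target fraction of the homogeneous baseline."""
    return perf >= goal * homogeneous_perf


configs = sweep_configs()
print(len(configs), configs[0], configs[-1])
```

The first configuration is the homogeneous baseline; the last keeps a quarter of the complex and memory units.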
Performance Degradation • [Figure panels: Media, Game] • The performance degradation is not substantial • Performance is normally not constrained by complex instructions • Performance degradation depends much more on memory operations • While keeping 80% of the baseline performance, the number of both complex and memory units can be decreased by up to 75%. 18
Conclusion Beginning • Heterogeneous FU organization is highly effective. • Flexible partitioning should be supported to further improve performance. • Complex PEs can be more energy efficient even at low resource utilization. • Wide SIMD memory support can be realistic given the characteristics of mobile applications. 19
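One way to see why memory units matter more than complex units is the resource-constrained lower bound on the initiation interval (ResMII) that modulo schedulers in general work against. The sketch below is not the authors' analysis. It assumes each operation occupies one unit for one cycle, and the per-loop operation counts are made-up illustrative values, not benchmark data.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


def res_mii(op_counts, restricted_units, total_pes):
    """Lower bound on the initiation interval from resource limits alone.

    op_counts: operations per loop iteration, keyed by class.
    restricted_units: number of PEs supporting each restricted class;
    classes not listed are assumed to run on any PE.
    """
    # Every PE issues at most one operation per cycle.
    bound = max(1, ceil_div(sum(op_counts.values()), total_pes))
    for kind, units in restricted_units.items():
        ops = op_counts.get(kind, 0)
        if ops == 0:
            continue
        if units < 1:
            raise ValueError(f"no unit supports '{kind}' operations")
        bound = max(bound, ceil_div(ops, units))
    return bound


# Hypothetical loop body: many integer ops, some loads/stores, few complex ops.
loop = {"int": 20, "complex": 2, "memory": 8}
for units in ({"complex": 16, "memory": 16},
              {"complex": 4, "memory": 4},
              {"complex": 4, "memory": 2}):
    print(units, res_mii(loop, units, total_pes=16))
```

With these illustrative counts, cutting the complex units to a quarter never raises the bound, while cutting the memory units far enough does.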
CGRA: Attractive Alternative to ASICs • Suitable for running multimedia applications for future embedded systems • High throughput, low power consumption, high flexibility • [Figure: Morphosys, SiliconHive, and ADRES. Labels: Viterbi at 80 Mbps, H.264 at 30 fps, 50-60 MOps/mW] • Morphosys: 8x8 array with RISC processor • SiliconHive: hierarchical systolic array • ADRES: 4x4 array with tightly coupled VLIW 20