Compiler Supports and Optimizations for PAC VLIW DSP Processors

Y.-C. Lin, C.-L. Tang, C.-J. Wu, M.-Y. Hung, Y.-P. You, Y.-C. Moo, S.-Y. Chen, and J.-K. Lee
National Tsing-Hua University, Taiwan

Outline
• PAC VLIW DSP Architectures
• Optimization Issues
• Preliminary Compiler Supports
• Experimental Results
• Conclusion
LCPC2005

Introduction
• Parallel Architecture Core (PAC) is designed by the SoC Technology Center, ITRI, Taiwan.
  • 32-bit, fixed-point, 5-way issue VLIW DSP
  • Scalable architecture
  • Optimized instruction set for audio/video/image
  • Innovative register file structure
  • Two generations developed
  • TSMC’s 0.13 μm technology (taped out in Aug. 2005)
• High-performance, low-power

Key Issues
• Deploy a general-purpose, high-performance open-source compiler for DSP processors
  • ORC → PAC DSP
• Address issues for the fragmentary register banks of DSP processors
  • Methods for irregular register constraints and instruction scheduling

PAC DSP Overview
• Cluster design:
  • Scalability
  • Explicit inter-cluster data transfer instructions
• Five-way issue:
  • 1 scalar/control unit (B)
  • 2 arithmetic units (I)
  • 2 load/store units (M)
• Distributed register files:
  • 5 local register files (A, AC, R)
  • 2 global register files (D)
• Other features:
  • 8-bit/16-bit SIMD operations
  • Variable instruction word/bundle length
  • Dynamic power management
  • Standard AMBA interface
[Figure: PAC DSP block diagram showing the B-Unit with R registers and clusters of M-Units and I-Units with A, AC, and D registers; more clusters can be added.]

Ping-pong Register File Structure
• Used by the global register file (D)
• Concept:
  • Overlap the processing of different data streams in a cluster
  • The M-Unit and the I-Unit operate on different data streams at the same time: the M-Unit loads and stores while the I-Unit computes, hence the name "ping-pong"
• Benefit:
  • Fewer register file ports, for lower power and smaller size

Ping-pong Register Access
• Each ‘D’ register file contains 2 banks.
• Rules:
  • Accesses by one unit to the 2 banks are mutually exclusive in a cycle.
  • The M-Unit and the I-Unit can only access different banks in a cycle.
• Instructional switcher: only 1 state for each cycle.
[Figure: the M-Unit and the I-Unit connected to Bank 1 and Bank 2 in alternating configurations.]

Issues for Ping-pong Registers (1)
• Example for ping-pong usage:
  • Able to form a bundle: Lw D8, A0 and Add D1, D0, AC0
  • Unable to form a bundle: Lw D2, A0 and Add D1, D0, AC0 (they must be scheduled into 2 bundles since they use the same bank)
• For compiler optimizations:
  • Better register (file/bank) allocation → a better schedule in fewer bundles
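The bundling rule in this example can be sketched as a small legality check. This is only an illustration under stated assumptions: the slides do not say which D registers belong to which bank, so the split used here (D0 to D7 in bank 1, D8 to D15 in bank 2) is hypothetical, as are the function names `d_banks` and `can_bundle`. `Lw` is assumed to run on the M-Unit and `Add` on the I-Unit.

```python
import re

# ASSUMPTION: D0-D7 form bank 1 and D8-D15 form bank 2 (not stated in the slides).
def d_banks(operands):
    """Return the set of ping-pong banks touched by the D registers in operands."""
    banks = set()
    for op in operands:
        m = re.fullmatch(r"D(\d+)", op)
        if m:
            banks.add(1 if int(m.group(1)) < 8 else 2)
    return banks

def can_bundle(m_unit_operands, i_unit_operands):
    """Check the two ping-pong rules for one M-Unit and one I-Unit instruction."""
    m_banks = d_banks(m_unit_operands)
    i_banks = d_banks(i_unit_operands)
    # Rule 1: a single unit may not touch both banks in the same cycle.
    if len(m_banks) > 1 or len(i_banks) > 1:
        return False
    # Rule 2: the M-Unit and the I-Unit must use different banks in a cycle.
    return not (m_banks & i_banks)

# Lw D8, A0 with Add D1, D0, AC0: different banks, so one bundle suffices.
print(can_bundle(["D8", "A0"], ["D1", "D0", "AC0"]))
# Lw D2, A0 with Add D1, D0, AC0: same bank, so two bundles are needed.
print(can_bundle(["D2", "A0"], ["D1", "D0", "AC0"]))
```

A register allocator could call a check like this while choosing banks, preferring assignments that let load/store and arithmetic instructions share a bundle.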

Issues for Ping-pong Registers (2)
• Data transfer between ping-pong banks:
  • First bundle: Lw D8, A0 and Add D1, D0, AC0
  • Invalid operation: Sw D1, A0 with Sub D9, D8, D1 (needs cross ping-pong communication)
  • An additional copy operation is needed: Mov AC1, D1, then Sw D1, A0 with Sub D9, D8, AC1
• For compiler optimizations:
  • Handle data communication between ping-pong banks correctly in any code manipulation
  • Generate as few additional copy operations as possible
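The copy insertion in this example can be sketched as a simple rewrite. It is a minimal illustration, not the PAC compiler's actual pass: the bank split (D0 to D7 versus D8 to D15), the function names, and the choice of AC1 and AC2 as scratch registers are all assumptions made for the example.

```python
import re

# ASSUMPTION: D0-D7 form bank 1 and D8-D15 form bank 2 (not stated in the slides).
def bank_of(reg):
    """Return 1 or 2 for a D register, or None for any other register."""
    m = re.fullmatch(r"D(\d+)", reg)
    if m is None:
        return None
    return 1 if int(m.group(1)) < 8 else 2

def legalize(opcode, dest, sources, scratch_regs=("AC1", "AC2")):
    """Rewrite one instruction so that it touches only one ping-pong bank.

    Each source whose bank differs from the destination's bank is first copied
    into a scratch register (hypothetical choice). Returns the instructions to
    emit, as strings.
    """
    target = bank_of(dest)
    scratch = iter(scratch_regs)
    emitted = []
    new_sources = []
    for src in sources:
        b = bank_of(src)
        if target is not None and b is not None and b != target:
            tmp = next(scratch)
            emitted.append(f"Mov {tmp}, {src}")  # extra cross-bank copy
            new_sources.append(tmp)
        else:
            new_sources.append(src)
    emitted.append(f"{opcode} {dest}, {', '.join(new_sources)}")
    return emitted

print(legalize("Sub", "D9", ["D8", "D1"]))
```

Every `Mov` emitted here lengthens the code, which is why the slide asks for as few additional copy operations as possible.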

Issues for Inter-cluster Communication
• To exploit cluster parallelism:
  • PAC needs an explicit instruction to be issued for inter-cluster communication.
• Optimize code partitioning:
  • Less communication
  • Better scheduling
[Figure: operations A to G partitioned across Cluster1 and Cluster2 (with the B-Unit); an additional cross-cluster copy is needed where a value crosses clusters.]
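The cost of a code partition can be illustrated by counting the cross-cluster copies it forces. This is a sketch only: the dependence edges among operations A to G below are hypothetical, since the slide's figure is not available, and `cross_cluster_copies` is an illustrative name rather than part of the PAC compiler.

```python
# HYPOTHETICAL dependence edges (producer, consumer); the slide's graph is not recoverable.
EDGES = [("A", "C"), ("B", "C"), ("C", "E"), ("D", "F"), ("E", "G"), ("F", "G")]

def cross_cluster_copies(edges, cluster_of):
    """Count values that must be copied between clusters.

    A value produced in one cluster and consumed in another needs one explicit
    copy per (producer, destination cluster) pair.
    """
    needed = {(src, cluster_of[dst])
              for src, dst in edges
              if cluster_of[src] != cluster_of[dst]}
    return len(needed)

# Partition that keeps dependent operations together.
split_a = {"A": 1, "B": 1, "C": 1, "E": 1, "D": 2, "F": 2, "G": 2}
# Partition that alternates clusters along the dependence chains.
split_b = {"A": 1, "B": 2, "C": 1, "E": 2, "D": 1, "F": 2, "G": 1}

print(cross_cluster_copies(EDGES, split_a))
print(cross_cluster_copies(EDGES, split_b))
```

On this toy graph the first partition needs far fewer copies than the second, which is the sense in which better partitioning means less communication.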

More Considerations
• Two optimized codes of the same performance:
  • Upper → smaller code size
  • Lower → lower power consumption

Compiler Supports for PAC DSP
• Essential supports (IA-64 ORC → PAC)
  • New Target_Info
    • PAC architecture and ISA descriptions
    • Complicated hazard descriptions
  • PAC application binary interface (ABI)
    • Data type mapping
    • Memory usage layout
    • Register usage conventions
    • Calling conventions
  • PAC code generation
    • 32-bit WHIRL code generation
    • PAC WHIRL-to-CGIR procedures
    • PAC assembly code emission

Simulated-Annealing (SA) Based Register Allocation Approach
• Motivation:
  • Complex interference among:
    • Register allocation
    • Instruction scheduling
    • Code insertion for distributed register communication
  • We want a machine-learning method that gives near-optimal results.
  • The results serve as a baseline reference for developing heuristic methods.

To Determine: Virtual Register → Register File (Bank)
• Input: unscheduled instructions
• Output: a schedule of the instructions and a register file assignment (RFA) map
  • RFA map = {(v1, f1), (v2, f2), ...}
  • where vi is a virtual register and fi is a register file (bank)
• PAC_Scheduler:
  • Graph-coloring-based register allocation according to the RFA map
  • Instruction scheduling and code insertion for register file communication
• Set up SA:
  • An initial random RFA map
  • schedule_len = PAC_Scheduler(initial RFA map)
  • SA control variables:
    • threshold
    • p_test: a probability test value (0 < p_test < 1)
    • energy: initial value > threshold
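The RFA map and the SA control variables can be represented as plain data. The sketch below is one possible encoding, not the authors' implementation: the register file names in `REGISTER_FILES`, the `SAControl` class, and the `initial_rfa_map` helper are assumptions introduced for illustration.

```python
import random
from dataclasses import dataclass

# ASSUMPTION: illustrative names for the register files/banks a virtual
# register may be mapped to; the slides do not enumerate them this way.
REGISTER_FILES = ["A", "AC", "R", "D_bank1", "D_bank2"]

@dataclass
class SAControl:
    threshold: int   # annealing runs while energy > threshold
    p_test: float    # probability test value, 0 < p_test < 1
    energy: int      # initial value must exceed threshold

    def __post_init__(self):
        if not 0.0 < self.p_test < 1.0:
            raise ValueError("p_test must lie strictly between 0 and 1")
        if self.energy <= self.threshold:
            raise ValueError("initial energy must exceed threshold")

def initial_rfa_map(virtual_regs, rng):
    """Build a random RFA map {v_i: f_i} as a dictionary."""
    return {v: rng.choice(REGISTER_FILES) for v in virtual_regs}

rng = random.Random(0)  # fixed seed so runs are repeatable
rfa = initial_rfa_map(["v1", "v2", "v3"], rng)
ctl = SAControl(threshold=0, p_test=0.9, energy=100)
print(rfa)
print(ctl)
```

A dictionary keeps each virtual register mapped to exactly one register file, so changing a single mapping (vi, fi) is a single assignment.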

To Optimize: Scheduling Result
• SA stop test: energy > threshold?
  • no: output the final RFA map and schedule
  • yes: continue
• Randomly change a mapping (vi, fi) to obtain a new RFA map
• Re-run: new_schedule_len = PAC_Scheduler(new RFA map)
• Better result test: new_schedule_len < schedule_len?
  • yes: keep the new RFA map; energy--; schedule_len = new_schedule_len
  • no: go to the random test
• Random test: a random number > p_test?
  • yes: keep the new RFA map
  • no: restore the old RFA map; energy++
• Return to the SA stop test
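The search loop can be sketched as follows. This is a hedged illustration, not the authors' code. `toy_scheduler` is a hypothetical stand-in for PAC_Scheduler that returns a made-up schedule length, and the file names are invented. The slide's flowchart carries both `energy--` and `energy++` labels; this sketch instead decrements energy once per iteration so that the loop is guaranteed to terminate, which is a simplifying assumption.

```python
import random

FILES = ["A", "AC", "D_bank1", "D_bank2"]  # ASSUMPTION: illustrative file names

def toy_scheduler(rfa_map):
    """HYPOTHETICAL stand-in for PAC_Scheduler.

    Returns a base length of 3 plus one extra bundle for every pair of
    communicating virtual registers placed in different register files.
    """
    uses = [("v1", "v2"), ("v2", "v3"), ("v1", "v3")]
    return 3 + sum(rfa_map[a] != rfa_map[b] for a, b in uses)

def anneal(rfa_map, files, scheduler, threshold, p_test, energy, rng):
    """Search for an RFA map with a short schedule; returns (map, length)."""
    schedule_len = scheduler(rfa_map)
    while energy > threshold:                    # SA stop test
        candidate = dict(rfa_map)
        v = rng.choice(sorted(candidate))        # randomly change one mapping
        candidate[v] = rng.choice(files)
        new_len = scheduler(candidate)           # re-run the scheduler
        if new_len < schedule_len:               # better result test
            rfa_map, schedule_len = candidate, new_len
        elif rng.random() > p_test:              # random test: accept anyway
            rfa_map, schedule_len = candidate, new_len
        # otherwise the old RFA map is kept
        energy -= 1  # ASSUMPTION: fixed decrement per iteration (see lead-in)
    return rfa_map, schedule_len

rng = random.Random(1)
start = {"v1": "A", "v2": "AC", "v3": "D_bank1"}
final_map, final_len = anneal(start, FILES, toy_scheduler,
                              threshold=0, p_test=0.9, energy=200, rng=rng)
print(toy_scheduler(start), final_len, final_map)
```

Accepting an occasional non-improving map through the random test is what lets the search escape local minima. The function returns the map held when the loop ends, matching the slide's "final RFA map", rather than the best map seen.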

Preliminary Experimental Results (DSPStone benchmarks)

Related Works
• Register Allocation
  • R. Leupers: Instruction scheduling for clustered VLIW DSPs. In Proc. Int’l Conference on Parallel Architecture and Compilation Techniques, pp. 291–300, Oct. 2000
• Register File Organizations
  • S. Rixner, W. J. Dally, B. Khailany, P. Mattson, U. J. Kapasi, and J. D. Owens: Register organization for media processing. International Symposium on High Performance Computer Architecture (HPCA), pp. 375–386, 2000
  • Tay-Jyi Lin, Chin-Chi Chang, Chen-Chia Lee, and Chein-Wei Jen: An Efficient VLIW DSP Architecture for Baseband Processing. Proceedings of the 21st International Conference on Computer Design, 2003

Conclusion
• We developed a compiler prototype for a new VLIW DSP architecture called PAC.
  • Based on ORC
• New optimization issues arise from the irregular hardware design:
  • Highly distributed register files
  • Port-access-restricted ping-pong structures
• An SA approach was employed to obtain preliminary results on register allocation for PAC.
• We will extend our work to the upcoming next version of PAC DSP.
