180 likes | 195 Views
This paper presents compiler supports and optimizations for PAC VLIW DSP processors, including optimization issues, preliminary compiler supports, and experimental results. It also discusses the PAC DSP architecture and its unique features such as the innovative register file structure.
E N D
Compiler Supports and Optimizations for PAC VLIW DSP Processors Y.-C. Lin C.-L. Tang C.-J. Wu M.-Y. Hung Y.-P. You Y.-C. Moo S.-Y. Chen and J.-K. Lee National Tsing-Hua University Taiwan
Outline • PAC VLIW DSP Architectures • Optimization Issues • Preliminary Compiler Supports • Experimental Results • Conclusion LCPC2005
Introduction • Parallel Architecture Core (PAC) is designed by SoC Technology Center, ITRI, Taiwan. • 32bit, fixed-point, 5-way issue VLIW DSP • scalable architecture • optimized instruction set for audio/video/image • innovative register file structure • two generations developed • TSMC’s 0.13 μm technology (taped-out in Aug. 2005) High-performance Low-power LCPC2005
Key Issues • Deploy the general-purpose high-performance open source compiler for DSP processors • ORC PAC DSP • Address issues for fragmentary register banks of DSP processors • Methods for irregular register constraints and instruction scheduling LCPC2005
Cluster Cluster Cluster Cluster Cluster B-Unit I-Unit I-Unit M-Unit M-Unit A Registers A Registers A Registers A Registers A Registers A Registers A Registers A Registers A Registers B-Unit M-Unit M-Unit M-Unit M-Unit M-Unit M-Unit M-Unit M-Unit M-Unit M-Unit M-Unit M-Unit B-Unit B-Unit B-Unit B-Unit D Registers D Registers D Registers D Registers D Registers D Registers D Registers Extend More Clusters I-Unit I-Unit I-Unit I-Unit I-Unit I-Unit I-Unit I-Unit I-Unit I-Unit I-Unit M-Unit R Registers R Registers R Registers R Registers AC Registers AC Registers AC Registers AC Registers AC Registers AC Registers AC Registers AC Registers AC Registers I-Unit I-Unit PAC DSP Overview • Cluster Design: • Scalability • Explicit Inter-Cluster Data Transfer Instructions • Five-Way Issues: • 1 Scalar/Control Unit (B) • 2 Arithmetic Unit (I) • 2 Load/Store Unit (M) • Distributed Register Files: • 5 Local Register Files (A, AC, R) • 2 Global Register Files (D) • Other Features: • 8-bit/16-bit SIMD operations • Variable instruction word/bundle length • Dynamic Power Management • Standard AMBA interface A Registers A Registers B-Unit R Registers AC Registers AC Registers LCPC2005
So called as Ping-pong! Load I-Unit M-Unit Compute Load Store Compute M-Unit and I-Unit operate on different data streams at the same time! Store Ping-pong Register File Structure • Used by Global Register File (D) • Concept: • Overlap processing different data streams in a cluster • Benefit: • Decrease the port number for low-power and size LCPC2005
M-Unit M-Unit M-Unit Bank 1 Bank 1 Bank 2 Bank 2 Bank 2 Bank 1 I-Unit I-Unit I-Unit Ping-pong Register Access • Each ‘D’ register file contains 2 banks. • Rules: • Access by one unit to the 2 banks is mutually-exclusivein a cycle. • M-Unit and I-Unit can only access to different banks in a cycle. Instructional Switcher Only 1 state for each cycle! LCPC2005
We need to schedule into 2 bundles since they use the same bank! For compilers optimizations: Better register (file/bank) allocation Better schedule in fewer bundles Issues for Ping-pong Registers(1) Lw D8, A0 Add D1,D0,AC0 • Example for ping-pong usage: • Able toform a bundle • Unable toform a bundle Lw D2, A0 Add D1,D0,AC0 LCPC2005
Lw D8, A0 Add D1,D0,AC0 Need cross ping-pong communication! Additional copy-operation needed! Sw D1, A0 Sub D9,D8,D1 Mov AC1, D1 Sw D1, A0 Sub D9,D8,AC1 Invalid operation! Issues for Ping-pong Registers(2) • Data transfer between ping-pong banks: • For compiler optimizations: • Well-handle data-communication between ping-pong banks within any code manipulation • Generate additional copy-operation as few as possible LCPC2005
A B C D Additional Cross-Cluster Copy E F Cluster2 Cluster1 G Issues for Inter-cluster Communication • To exploit cluster parallelism: • PAC needs explicit instruction to be issued for inter-cluster communication! Cluster1 Cluster2 B-Unit A B C D • Optimize code partitioning: • Fewer communication • Better scheduling E F G LCPC2005
More Considerations • Two optimized codes of the same performance: • Upper Smaller code size • Lower Lower power consumption LCPC2005
Compiler Supports for PAC DSP • Essential supports (IA-64 ORC PAC) • New Target_Info • PAC Architecture and ISA descriptions • Complicated hazard descriptions • PAC application-binary-interface (ABI) • data type mapping • memory usage layout • register usage conventions • calling conventions • PAC code generation • 32-bit WHIRL code generation • PAC WHIRL-to-CGIR procedures • PAC assembly code emission LCPC2005
Register Allocation Instruction Scheduling Code Insertion for Distributed Register Communication Simulated-Annealing (SA) Based Register Allocation Approach • Motivation: • Complex interference from: • We appreciate a machine-learning method to give a near-optimal results. • To be a base reference for developing heuristic methods! LCPC2005
To Determine: Virtual Register Register File (Bank) • Input: un-scheduled instructions • Output: a schedule of the instructions a register file assignment (RFA) map • RFA map = {(v1, f1), (v2, f2), ...} • Where vi : a virtual register, fi : a register file (bank) • PAC_Scheduler: • Graph-coloring based register allocation according to the RFA map • Instruction scheduling and code insertion for register file communication • Setup SA: • An initial random RFA map • schedule_len = PAC_Scheduler ( initial RFA map ) • SA control variables: • threshold • p_test: a probability test value (0 < p_test < 1). • energy: initial value > threshold. LCPC2005
new RFA map Re-run: new_schedule_len = PAC_Scheduler (new RFA map) Randomly change: a mapping (vi, fi) yes SA stop test: energy > threshold Better result test: new_schedule_len < schedule_len new RFA map yes energy--schedule_len = new_schedule_len no no yes Random test: a random number > p_test FinalRFA map & schedule old RFA map energy++ no To Optimize: Scheduling Result LCPC2005
Preliminary Experimental Results (DSPStone benchmarks) LCPC2005
Related Works • Register Allocation • R. Leupers: Instruction scheduling for clustered VLIW DSPs. In Proc. Int’l Conference on Parallel Architecture and Compilation Techniques, pages 291–300, Oct. 2000 • Register File Organizations • S. Rixner, W. J. Dally, B. Khailany, P. Mattson, U. J. Kapasi, and J. D. Owens: Register organization for media processing. International Symposium on High Performance Computer Architecture (HPCA), pp.375-386, 2000 • Tay-Jyi Lin, Chin-Chi Chang. Chen-Chia Lee, and Chein-Wei Jen: An Efficient VLIW DSP Architecture for Baseband Processing. Proceedings of the 21th International Conference on Computer Design, 2003 LCPC2005
Conclusion • We developed a compiler prototype for a new VLIW DSP architecture, called as PAC. • Based on ORC • New optimization issues by the irregular hardware design • Highly distributed register files • Port-access restricted ping-pong structures • A SA approach employed to obtain a preliminary result of exploiting register allocation on PAC • We will extend our works on the upcoming next version of PAC DSP. LCPC2005