VEAL: Virtualized Execution Accelerator for Loops
Nate Clark¹, Amir Hormati², Scott Mahlke²
¹Georgia Tech, ²U. Michigan

How to Get Efficiency?
• Microarchitecture changes
• Multi-/many-core
• Heterogeneity
[Figure labels: Core2 Duo, STI Cell]

How Is Heterogeneity Used?
• Control statically placed in binary
[Diagram labels: Engineer/Compiler, Program, GPP, Hetero.]

Problem With Static Control
• Not forward/backward compatible
[Diagram labels: Engineer/Compiler, Program, CPU, Hetero.]

Solution: Virtualization
• Abstract accelerator features
• Reexamine compiler algorithms
• Key: do the hard stuff offline
[Diagram labels: Engineer/Compiler, Program, Dyn Comp., CPU, Hetero.; phases: Offline, Online]

This Paper:
• Examines loops as a heterogeneity target
  • ASICs often implement loops
• Designs a generalized loop accelerator
  • Not covered in this talk
• Explores how to virtualize loop accelerators
  • I.e., abstract the accelerator interface

Why More Efficient Than GPP?
• Simple control flow
• Decoupled memory accesses
• I-cache unnecessary
• Customize execution resources for loops

Proposed Loop Accelerator
• 1 CCA
• 2 Int units
• 16 regs
• Memory (4x)
• 16 input streams
• 8 output streams
• 0.8 mm², 90 nm

Modulo Scheduling
+ High-quality software pipelining technique
+ Simple control structure (low HW cost)
- Can be slow, i.e., hard to do dynamically
- Restricted to loops with no side exits, no while loops, and if-convertible bodies

Modulo Scheduling Basics
[Figure labels: FU, C, Kernel]
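
The figure for this slide did not survive extraction. As a rough illustration of what the kernel is, the sketch below folds a one-iteration schedule into a kernel of II rows: an operation issued at cycle t lands in kernel row t mod II and belongs to stage t div II. The function name `fold_into_kernel` and the four-operation schedule are hypothetical and are not taken from the talk.

```python
from collections import defaultdict


def fold_into_kernel(schedule, ii):
    """Fold a one-iteration schedule into a kernel with `ii` rows.

    `schedule` maps an operation name to its issue cycle within one
    iteration. The result maps each kernel row (cycle % ii) to a list of
    (operation, stage) pairs, where stage = cycle // ii.
    """
    kernel = defaultdict(list)
    # Visit operations in issue order so each row lists early stages first.
    for op, cycle in sorted(schedule.items(), key=lambda item: item[1]):
        kernel[cycle % ii].append((op, cycle // ii))
    return dict(kernel)


# Hypothetical one-iteration schedule (illustrative only).
schedule = {"ld": 0, "add": 1, "sub": 2, "str": 3}
kernel = fold_into_kernel(schedule, ii=2)

for row in sorted(kernel):
    print(row, kernel[row])
```

With II = 2, each kernel row holds operations from two different loop iterations, which is where the overlap in software pipelining comes from.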

Modulo Scheduling Example
1. CCA mapping
2. II calculation
3. Priority
4. Scheduling
5. Reg. assignment/communication
Priority: 2, 4, 6; 3, 5; 7
[Figure: operations 2 through 7 placed along a Time axis]
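
Steps 2 through 4 above can be sketched in a few lines. This is a simplified sketch under stated assumptions, not the algorithm used in VEAL: it computes only the resource-constrained lower bound on II, takes the priority order as an input (assumed to list every operation after its predecessors), ignores loop-carried dependences, and skips CCA mapping and register assignment. The operation names, unit counts, and unit latency are hypothetical.

```python
import math


def res_mii(op_units, unit_counts):
    """Resource-constrained lower bound on the initiation interval (II)."""
    uses = {}
    for unit in op_units.values():
        uses[unit] = uses.get(unit, 0) + 1
    return max(math.ceil(n / unit_counts[u]) for u, n in uses.items())


def modulo_schedule(priority_order, preds, op_units, unit_counts, latency=1):
    """Place operations, in priority order, into a modulo reservation table.

    Assumes `priority_order` lists each operation after its predecessors
    and that there are no loop-carried dependences.
    """
    ii = res_mii(op_units, unit_counts)
    used = {}   # (kernel row, unit kind) -> number of operations placed
    times = {}  # operation -> issue cycle within one iteration
    for op in priority_order:
        unit = op_units[op]
        earliest = max((times[p] + latency for p in preds.get(op, [])), default=0)
        # Trying ii consecutive cycles covers every kernel row exactly once.
        for t in range(earliest, earliest + ii):
            slot = (t % ii, unit)
            if used.get(slot, 0) < unit_counts[unit]:
                used[slot] = used.get(slot, 0) + 1
                times[op] = t
                break
        else:
            # A full scheduler would retry with ii + 1 at this point.
            raise RuntimeError(f"no free slot for {op} at II={ii}")
    return ii, times


# Hypothetical loop body and machine (illustrative only).
op_units = {"ld": "mem", "add": "int", "sub": "int", "or": "int", "str": "mem"}
unit_counts = {"int": 2, "mem": 1}
preds = {"add": ["ld"], "sub": ["ld"], "or": ["add", "sub"], "str": ["or"]}
order = ["ld", "add", "sub", "or", "str"]

ii, times = modulo_schedule(order, preds, op_units, unit_counts)
print(ii, times)
```

The inner loop only needs to try II consecutive cycles because the reservation table wraps around: cycle t and cycle t + II compete for the same row.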

Measured Scheduling Overhead
• Priority calculation: 70%
• CCA mapping: 19%

Supporting Hybrid Compilation
  Loop: 1 ld, 2 add, 3 sub, 4 brl CCA, 5 or, 6 or, 7 add, 8 str
  CCA:  and, sub, xor, ret
  Data: 0 1 4 6 3 …

  Loop: 1 ld, 2 add, 3 sub, 4 brl CCA, 5 or, …

  Loop: 1 ld, 2 add, 3 sub, and, sub, xor, 5 or, 6 or, 7 add, 8 str
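
One reading of the listings above is that the CCA subgraph is stored as an ordinary function reached through `brl CCA`, so the binary still runs where no accelerator is present, and that a translator can splice the function body back into the loop. The sketch below shows only that splicing step. The instruction encoding (strings, with a `("brl", name)` tuple for calls) and the helper name `inline_cca_calls` are assumptions made for illustration.

```python
def inline_cca_calls(loop, functions):
    """Replace each ("brl", name) call with the callee's body, minus `ret`."""
    inlined = []
    for instr in loop:
        if isinstance(instr, tuple) and instr[0] == "brl":
            body = functions[instr[1]]
            # Drop the return: after inlining, control simply falls through.
            inlined.extend(op for op in body if op != "ret")
        else:
            inlined.append(instr)
    return inlined


# Instruction sequences taken from the slide; the encoding is hypothetical.
loop = ["ld", "add", "sub", ("brl", "CCA"), "or", "or", "add", "str"]
functions = {"CCA": ["and", "sub", "xor", "ret"]}

inlined = inline_cca_calls(loop, functions)
print(inlined)
```

The result matches the last listing on the slide, where `and`, `sub`, `xor` appear in place of instruction 4.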

Summary
• Virtualization is key to heterogeneity
• VEAL speedup: 2.54
  • 2.63 w/o translation (i.e., not binary compatible)
  • 2.17 fully dynamic
• CCA and priority: 89% of scheduling overhead
  • mpeg2dec: 2.1 vs. 1.15
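
To put the three speedup figures side by side, the short calculation below expresses each as a fraction lost relative to the 2.63 figure. It assumes the reading that 2.63 is the untranslated case, 2.54 is VEAL, and 2.17 is fully dynamic translation; the helper name `relative_loss` is hypothetical.

```python
def relative_loss(best, actual):
    """Fraction of the best-case speedup that is lost."""
    return (best - actual) / best


no_translation = 2.63  # not binary compatible
veal = 2.54            # VEAL speedup as reported on the slide
fully_dynamic = 2.17   # everything done at run time

veal_loss = relative_loss(no_translation, veal)
dynamic_loss = relative_loss(no_translation, fully_dynamic)

print(f"VEAL: {veal_loss:.1%} below the untranslated speedup")
print(f"Fully dynamic: {dynamic_loss:.1%} below the untranslated speedup")
```

Under that reading, VEAL gives up about 3% of the untranslated speedup, against roughly 17% for the fully dynamic approach.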

Thank you! Questions?
http://www.cc.gatech.edu/~ntclark
http://cccp.eecs.umich.edu/