Xtensa C and C++ Compiler • Ding-Kai Chen • Tensilica, Inc. • dkchen@tensilica.com
Presentation Outline • XCC history • XCC target -- Xtensa configurable processor • XCC details with examples • User defined C types • Operator overloading • VLIW scheduling • Auto-SIMD vectorization • Operation fusion • SWP Changes
XCC History • Got the first version of SGI Pro64 in May 2000 • First customer release, August 2001 • Release with IPA, August 2002 • Release with SWP, Feedback, VLIW, September 2004 • Release with GCC 4.2 Front End, October 2009 • Supports C and C++ applications • Other languages are not as important for embedded applications
Xtensa Processor • 32-bit RISC processor targeting embedded dataplane applications • 16 32-bit general registers (AR) • 24-bit base instructions • Configurable at design time (not at run time) [Figure: Xtensa Core Architecture]
Xtensa Configuration Options • Many pre-defined options to choose from • Endianness • Windowed vs non-windowed register file • Narrow (16-bit) instructions • Multipliers • Coprocessors (HiFi, Vectra, BBE, FP) • Specialized (e.g., MAX) instructions, etc. [Figure: Configuration Options, Xtensa Core Architecture]
Targeting XCC to Base Xtensa and Tensilica Configurations • As part of retargeting to Xtensa, we used/added • Code-generator generator tool Olive for WHIRL to CGIR translation • Handles a lot of configuration specific code • Support for Xtensa zero-overhead loop instructions • CG Code-size optimization that commonizes instructions from control-flow predecessors • Feedback-directed speed vs code-size tradeoff • Support for flexible VLIW formats • Formats of different bit width and different number of issue slots
Tensilica Instruction Extension (TIE) • TIE is a language for describing custom: • Register files up to 512 bits wide • Instructions up to 128 bits • VLIW formats up to 15 slots • C types mapped to custom register files • Vectorization rules • Fusion patterns • Operator overloading [Figure: Custom TIE, Configuration Options, Xtensa Architecture]
XCC Challenges • Custom extensions in TIE are written at the customer site and cannot be configured at XCC build time • Design goals: • Separation of config-independent code and config-dependent libraries • Re-targeting in minutes after TIE is designed or modified by the processor architect at the customer site • Programming new HW extensions as native C types/operations
Xtensa - Full Development Automation • Inputs: processor configuration and extensions (1. Select from menu 2. Explicit instruction description (TIE)) • Xtensa Processor Generator* • Output: complete hardware design (source pre-verified RTL, EDA scripts, test suite) • Output: customized software tools (C/C++ compiler, debuggers, simulators, RTOSes) • Use standard ASIC/COT design techniques and libraries for any IC fabrication process * US Patent: 6,477,697
TIE register file and operation
in TIE:
    // new register file for int32x4 vectorization
    regfile v 128 16
    // a new C type based on <v> regfile
    // and has 128-bit size and 128-bit alignment
    ctype int32x4 128 128 v
    operation add_v { out v vout, in v va, in v vb } {} {
      assign vout = { va[127:96] + vb[127:96],
                      va[95:64] + vb[95:64],
                      va[63:32] + vb[63:32],
                      va[31:0] + vb[31:0] };
    }
in C:
    void vsum() {
      int i;
      int32x4* va = (int32x4*)a;
      int32x4* vb = (int32x4*)b;
      int32x4* vc = (int32x4*)c;
      for (i=0; i<VSIZE; i++) {
        // C intrinsic call
        vc[i] = add_v(va[i], vb[i]);
      }
    }
• add_v is an intrinsic call in C • In WHIRL, it is an intrinsic_op, which is optimizer friendly
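The lane-wise semantics of add_v can be mimicked in portable C to check results on a host machine. The following is only a sketch: the type int32x4_model and the function add_v_model are hypothetical names invented for illustration and are not part of XCC or TIE. Lane order within the 128-bit register is not modeled.

```c
#include <stdint.h>

/* Hypothetical host-side stand-in for the 128-bit int32x4 TIE type. */
typedef struct { int32_t lane[4]; } int32x4_model;

/* Models the add_v semantics: four independent 32-bit additions.
   The sum is formed in unsigned arithmetic so that overflow wraps
   the way a 32-bit hardware adder does. */
static int32x4_model add_v_model(int32x4_model va, int32x4_model vb)
{
    int32x4_model vout;
    for (int k = 0; k < 4; k++) {
        vout.lane[k] = (int32_t)((uint32_t)va.lane[k] + (uint32_t)vb.lane[k]);
    }
    return vout;
}
```

Calling add_v_model on {1, 2, 3, 4} and {10, 20, 30, 40} yields {11, 22, 33, 44}, one sum per lane.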
TIE C type support • Each TIE C type maps to a new WHIRL mtype • Each TIE regfile maps to an ISA_REGCLASS • GCC FE declares new C types and new intrinsics (added new TIE_TYPE tree code) • WGEN translates TIE C type references to WHIRL loads/stores • Olive tool adds dynamic rules to handle new types and WHIRL opcodes • Added TN_mtype() for register spills/reloads • Made BE optimizations (CSE, EBO, etc.) work
TIE example – generated code
    #<loop> Loop body line 28, nesting depth: 1, iterations: 8
    #<loop> unrolled 4 times
    load_v  v0,a2,0     # [0*II+0] id:20 b+0x0
    load_v  v1,a3,0     # [0*II+1] id:19 a+0x0
    load_v  v2,a2,16    # [0*II+2] id:20 b+0x0
    load_v  v3,a3,16    # [0*II+3] id:19 a+0x0
    load_v  v4,a2,32    # [0*II+4] id:20 b+0x0
    load_v  v5,a3,32    # [0*II+5] id:19 a+0x0
    load_v  v6,a2,48    # [0*II+6] id:20 b+0x0
    load_v  v7,a3,48    # [0*II+7] id:19 a+0x0
    addi    a2,a2,64    # [0*II+8]
    addi    a3,a3,64    # [0*II+9]
    addi    a4,a4,64    # [0*II+10]
    add_v   v0,v1,v0    # [0*II+11]
    add_v   v1,v3,v2    # [0*II+12]
    add_v   v2,v5,v4    # [0*II+13]
    add_v   v3,v7,v6    # [0*II+14]
    store_v v0,a4,-64   # [0*II+15] id:21 c+0x0
    store_v v1,a4,-48   # [0*II+16] id:21 c+0x0
    store_v v2,a4,-32   # [0*II+17] id:21 c+0x0
    store_v v3,a4,-16   # [0*II+18] id:21 c+0x0
Total 19/4 = 4.75 cycles per iteration
TIE updating ld/st
    // pre-increment load/store
    operation load_vu { out v vout, inout AR base, in simm8 offset }
                      { out VAddr, in MemDataIn128 } {
      assign VAddr = base + offset;
      assign vout = MemDataIn128;
      assign base = base + offset;
    }
    operation store_vu { in v vin, inout AR base, in simm8 offset }
                       { out VAddr, out MemDataOut128 } {
      assign VAddr = base + offset;
      assign MemDataOut128 = vin;
      assign base = base + offset;
    }
    proto int32x4_loadiu { out int32x4 vout, inout int32x4* base, in immediate offset } {} {
      load_vu vout, base, offset;
    }
    proto int32x4_storeiu { in int32x4 vin, inout int32x4* base, in immediate offset } {} {
      store_vu vin, base, offset;
    }
TIE updating ld/st • XCC identifies updating ld/st operations • Pre-bias ld/st bases to work with pre-increment • Combine ld/st with addi in CG
    #<loop> Loop body line 28, nesting depth: 1, iterations: 32
    load_vu  v0,a2,16   # [0*II+0] id:20 b+0x0
    load_vu  v1,a3,16   # [0*II+1] id:19 a+0x0
    store_vu v2,a4,16   # [1*II+2] id:21 c+0x0
    add_v    v2,v1,v0   # [0*II+3]
Total 4 cycles per iteration
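Why the base must be pre-biased follows from the pre-increment semantics of load_vu: the base is advanced first and the access happens at the new address, so the base has to start one step before the first element. The sketch below models this with array indices rather than byte addresses. The names load_u_model and sum_with_updating_loads are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a pre-increment ("updating") load: advance the base first,
   then read the element at the new base, as load_vu does. */
static int32_t load_u_model(const int32_t *mem, ptrdiff_t *base, ptrdiff_t offset)
{
    *base += offset;
    return mem[*base];
}

/* Sums n elements starting at index `first`, using only updating loads. */
static int32_t sum_with_updating_loads(const int32_t *mem, ptrdiff_t first, int n)
{
    /* Pre-bias: start one step before the first element so that the
       first pre-increment lands exactly on mem[first]. */
    ptrdiff_t base = first - 1;
    int32_t sum = 0;
    for (int i = 0; i < n; i++) {
        sum += load_u_model(mem, &base, 1);  /* load and bump in one step */
    }
    return sum;
}
```

Because the load itself bumps the base, the loop needs no separate address increment, which corresponds to the addi instructions disappearing from the generated loop above.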
TIE operator overloading • Check for TIE type operands and operator overloading in build_binary_op in c-typeck.c of GCC • Build proper call to mapped TIE intrinsic
in TIE:
    // map "+" operator to add_v for type int32x4
    operator "+" add_v
in C:
    void vsum_op() {
      int i;
      int32x4* va = (int32x4*)a;
      int32x4* vb = (int32x4*)b;
      int32x4* vc = (int32x4*)c;
      for (i=0; i<VSIZE; i++) {
        // more natural using C "+" syntax
        vc[i] = va[i] + vb[i];
      }
    }
TIE VLIW scheduling
in TIE:
    format flix0 64 {slot0,slot1}   // add 2-slot 64-bit VLIW format
    slot_opcodes slot0 { load_v, store_v, load_vu, store_vu, add_v }
    slot_opcodes slot1 { load_v, store_v, load_vu, store_vu, add_v }
.s output:
    #<loop> unrolled 2 times
    { # format flix0
      load_vu  v3,a2,32    # [0*II+0] id:20 b+0x0
      add_v    v5,v4,v3    # [1*II+0]
    }
    { # format flix0
      load_v   v0,a2,-16   # [0*II+1] id:20 b+0x0
      add_v    v2,v1,v0    # [1*II+1]
    }
    { # format flix0
      load_v   v1,a3,16    # [0*II+2] id:19 a+0x0
      load_vu  v4,a3,32    # [0*II+2] id:19 a+0x0
    }
    { # format flix0
      store_v  v2,a4,16    # [1*II+3] id:21 c+0x0
      store_vu v5,a4,32    # [1*II+3] id:21 c+0x0
    }
Total 4/2 = 2 cycles per iteration
TIE VLIW scheduling • XCC initialization includes analysis of TIE VLIW formats • Create resources that model bundling constraints • Consider a simpler case: each opcode is allowed in only 1 slot • Each VLIW slot in a format is viewed as a resource • Different formats are treated separately • Each opcode consumes the resource of the slot it is allowed in • A group of operations can be scheduled in the same cycle if its total resource usage is within the limit • Gets more complicated when opcodes are allowed in multiple slots • Resource reservation modeling allows de-coupling of scheduling and slot assignment in CG • Extended resource reservation word type SI_RRW to arbitrary-length bit-vectors • TI_RES_RES_Resources_Available() also checks for compatible formats
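The simpler case above, where each opcode is allowed in exactly one slot of a single format, can be sketched as a bit-vector check. This is an illustrative model only, not the XCC implementation: the type slot_mask and the function fits_in_one_bundle are hypothetical, and the multi-slot and multi-format cases are not handled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One bit per slot of a single VLIW format; each slot is a resource
   that can be consumed at most once per cycle. */
typedef uint32_t slot_mask;

/* op_slot[i] is the single slot that operation i is allowed in.
   Returns true if all n_ops operations can share one bundle. */
static bool fits_in_one_bundle(const unsigned *op_slot, size_t n_ops,
                               unsigned n_slots)
{
    if (n_slots > 32) {
        return false;               /* mask is only 32 bits wide */
    }
    slot_mask used = 0;
    for (size_t i = 0; i < n_ops; i++) {
        if (op_slot[i] >= n_slots) {
            return false;           /* slot does not exist in this format */
        }
        slot_mask bit = (slot_mask)1 << op_slot[i];
        if (used & bit) {
            return false;           /* slot resource already consumed */
        }
        used |= bit;
    }
    return true;
}
```

With a two-slot format, operations bound to slots {0, 1} fit in one cycle, while two operations both bound to slot 0 do not. A scheduler can ask this question without yet committing to a slot assignment, which is the de-coupling the slide describes.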
TIE auto-SIMD vectorization
in TIE:
    property vector_ctype {int32x4, int32, 4}
    property vector_proto {add_v, xt_add, 4}
in C:
    for (i=0; i<SIZE; i++) {
      c[i] = a[i] + b[i];
    }
with -O3 -LNO:simd -clist, in .w2c:
    int32x4 V_00;
    int32x4 V_;
    int32x4 V_0;
    int32x4 V_4;
    _INT32 i;
    for(i = 0; i <= 127; i = i + 4) {
      V_00 = *(int32x4 *)(&a[i]);
      V_ = *(int32x4 *)(&b[i]);
      V_0 = add_v(V_00, V_);
      V_4 = V_0;
      * (int32x4 *)(&c[i]) = V_4;
    }
TIE auto-SIMD vectorization • Developed independently of (and before) the Open64 vectorizer • Integrated into Phase 2 of LNO • Scan all loops in a nest • Check for presence of vectorized versions of each op in the loop • Check for stride-1 or invariant memory references • Support for loads and stores whose addresses are not aligned to the vector type • Pre-load once before the vector loop • Subsequent loads in the vector loop combine with the prior loads • Support for spatial reuse within a vector using select instructions • E.g. a[i] + a[i+1] in the scalar loop • Pre-load once before the vector loop • Only a single load is then needed for each iteration • Select instructions shuffle data from loads of consecutive iterations
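The pre-load and combine scheme for unaligned or shifted accesses can be sketched on plain arrays. In this illustrative model, each group of LANES consecutive elements stands for one aligned vector, and the per-lane choice between the previous and the current load plays the role of a select instruction. The function read_shifted and the constant LANES are hypothetical names, and the caller is assumed to supply (n_vec + 1) * LANES readable source elements.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 4 };   /* elements per modeled vector */

/* Produces n_vec vectors of data starting at src[shift] (shift < LANES),
   while only ever loading whole aligned vectors from src. */
static void read_shifted(const int32_t *src, size_t shift,
                         int32_t *dst, size_t n_vec)
{
    int32_t prev[LANES], next[LANES];

    /* Pre-load once before the vector loop. */
    for (size_t k = 0; k < LANES; k++) {
        prev[k] = src[k];
    }
    for (size_t v = 0; v < n_vec; v++) {
        /* A single aligned load per iteration. */
        for (size_t k = 0; k < LANES; k++) {
            next[k] = src[(v + 1) * LANES + k];
        }
        /* "Select": tail of the previous load, head of the new one. */
        for (size_t k = 0; k < LANES; k++) {
            size_t idx = shift + k;
            dst[v * LANES + k] = (idx < LANES) ? prev[idx] : next[idx - LANES];
        }
        /* The new load becomes the prior load of the next iteration. */
        for (size_t k = 0; k < LANES; k++) {
            prev[k] = next[k];
        }
    }
}
```

With shift set to 1 this yields the a[i+1] stream of the a[i] + a[i+1] example, so each iteration still performs only one load even though two overlapping vectors are consumed.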
TIE operation fusion • Combine multiple operations into one • E.g., combine an add followed by a shift into one add_shift operation • Performed in CG • Build dataflow graphs from input patterns • Repeatedly search for matches in BBs • Peephole optimization with custom patterns
in TIE:
    imap add_shift_v {
      out v vout, in v va, in v vb, in immediate amount
    } {
      {} {
        // the output pattern
        add_shift_v vout, va, vb, amount;
      }
    } {
      { v v_temp } {
        // the input pattern
        add_v v_temp, va, vb;
        shift_v vout, v_temp, amount;
      }
    }
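A much simplified version of this pattern match can be sketched over a linear instruction list. The sketch only fuses an add_v that is immediately followed by a shift_v of its result, whereas the slide describes matching on dataflow graphs. All names here (insn, fuse_add_shift, the opcode enum) are hypothetical, and the temporary is assumed not to be live outside the basic block.

```c
#include <stddef.h>

typedef enum { OP_ADD_V, OP_SHIFT_V, OP_ADD_SHIFT_V } opcode;

/* Registers are small non-negative integers; -1 marks an unused field. */
typedef struct { opcode op; int dst; int src1; int src2; int imm; } insn;

/* Conservative check: is `reg` read by any instruction in code[from..n)? */
static int reg_used_after(const insn *code, size_t n, size_t from, int reg)
{
    for (size_t i = from; i < n; i++) {
        if (code[i].src1 == reg || code[i].src2 == reg) {
            return 1;
        }
    }
    return 0;
}

/* Rewrites the block in place and returns its new length. */
static size_t fuse_add_shift(insn *code, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n
            && code[i].op == OP_ADD_V
            && code[i + 1].op == OP_SHIFT_V
            && code[i + 1].src1 == code[i].dst              /* shift reads the sum */
            && !reg_used_after(code, n, i + 2, code[i].dst)) { /* temp is dead */
            insn fused = { OP_ADD_SHIFT_V, code[i + 1].dst,
                           code[i].src1, code[i].src2, code[i + 1].imm };
            code[out++] = fused;
            i++;                     /* skip the consumed shift */
        } else {
            code[out++] = code[i];
        }
    }
    return out;
}
```

The dead-temporary check mirrors the role of v_temp in the imap: the intermediate value exists only inside the input pattern, so the fusion is valid only when nothing else reads it.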
TIE operation fusion • Example C code:
    for (i=0; i<VSIZE; i++) { vc[i] = (va[i] + vb[i]) << 2; }
• Original schedule is 5 cycles / 2 iterations = 2.5 cycles per iteration • New schedule with operation fusion is 4 cycles / 2 iterations = 2 cycles per iteration
XCC SWP scheduler • Xtensa has no rotating registers, so 2 register allocators were added: simple and coloring. The simple allocator is used first to get a tighter bound, then coloring is tried. • Performance is critical: added back-tracking for the following • Unrolling (hard to guess best unrolling) • Different priority heuristics for choosing candidates • Different initial op orderings • Register allocation failures • Runs slightly longer but complements the original IA-64 based SWP algorithm well
Conclusion • Open64 is versatile in providing optimized performance for embedded applications. • XCC experience shows that many of the optimizations can be quickly retargeted to ISA extensions. • Sample performance data: • EEMBC Consumer benchmark gained a 6x speedup with automatic vectorization + VLIW scheduling + operation fusion • The XCC solution is not final. It is still evolving with new HW features offered by Tensilica. • Want to explore new ways in TIE to describe HW that supports optimizations.
Tensilica is looking for new talent to join the compiler team. http://www.tensilica.com dkchen@tensilica.com