SIMD Optimization in COINS Compiler Infrastructure Mitsugu Suzuki (The University of Electro-Communications) Nobuhisa Fujinami (Sony Computer Entertainment Inc.)
Agenda • COINS SIMD optimization • Two topics on SIMD optimization • Data Size Inference • SIMD Benchmark • Current status and required improvements
SIMD optimization‥‥Concept and decisions • Implemented as an LIR-to-LIR transformer. • Requires no additional special extensions for source languages. • Optimizations achievable at the source level are postponed → they are HIR-level matters, e.g. vectorization (appropriate loop unrolling), if-peeling, complex if-conversion, etc.
#define AVE(x,y) (((x)>>1)+((y)>>1)+(((x)|(y))&1))
short *v1, *v2, *v3;
/* Assume that all pointers are aligned, and distances of source and
   destination pointers are longer than the size of a vector register. */

for (i = 0; i < M; i++)            // case-A
  *v1++ = AVE(*v2++, *v3++);

for (i = 0; i < M; i++)            // case-B
  v1[i] = AVE(v2[i], v3[i]);

for (i = 0; i < M; i += 4) {       // case-C
  v1[i]   = AVE(v2[i],   v3[i]);
  v1[i+1] = AVE(v2[i+1], v3[i+1]);
  ...
  v1[i+3] = AVE(v2[i+3], v3[i+3]);
}

for (i = 0; i < M; i += 4) {       // case-D
  v1[0] = AVE(v2[0], v3[0]);
  v1[1] = AVE(v2[1], v3[1]);
  ...
  v1[3] = AVE(v2[3], v3[3]);
  v1 += 4; v2 += 4; v3 += 4;
}

× ○  (slide marks: × = not SIMD-optimized, ○ = SIMD-optimized)
#define AVE(x,y) (((x)>>1)+((y)>>1)+(((x)|(y))&1))
struct { short r, g, b, a; } *u1, *u2, *u3;
/* Assume that all pointers are aligned, and distances of source and
   destination pointers are longer than the size of a vector register. */

for (i = 0; i < M; i++) {          // case-E
  u1[i].r = AVE(u2[i].r, u3[i].r);
  u1[i].g = AVE(u2[i].g, u3[i].g);
  u1[i].b = AVE(u2[i].b, u3[i].b);
  u1[i].a = AVE(u2[i].a, u3[i].a);
}

○  (SIMD-optimized)
SIMD optimization‥‥Processing flow • If-conversion. • Decompose basic blocks into DAGs. • Match LIR patterns to specific SIMD operations. • Combine identical basic operations (parallelization); a sketch of this last step follows. (⇒ 3rd page of the handout)
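As an illustration of the last step, here is a minimal hand-written sketch of what combining identical 16-bit operations could look like on an SSE2 target. The target ISA, the function name ave8, and the non-negative-value assumption are mine, not taken from COINS; the slides do not show generated code.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical illustration, not COINS output: once the DAG matcher has
 * recognized eight identical 16-bit AVE sequences in one basic block,
 * "combining same basic operations" replaces them with single vector ops.
 * Values are assumed non-negative (as in the 8-bit example later), so the
 * per-lane logical shift matches the scalar >> in the AVE macro. */
static void ave8(int16_t *v1, const int16_t *v2, const int16_t *v3)
{
    __m128i x = _mm_loadu_si128((const __m128i *)v2);   /* 8 x 16-bit lanes */
    __m128i y = _mm_loadu_si128((const __m128i *)v3);
    __m128i halves = _mm_add_epi16(_mm_srli_epi16(x, 1), _mm_srli_epi16(y, 1));
    __m128i carry  = _mm_and_si128(_mm_or_si128(x, y), _mm_set1_epi16(1));
    _mm_storeu_si128((__m128i *)v1, _mm_add_epi16(halves, carry));
}

The scalar shift/or/and/add sequences of an unrolled loop body collapse into one vector instruction each, which is the combining step applied to the DAG.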
Data size inference‥‥Why needed? Two styles of averaging integers (assumption: both x and y are 8-bit unsigned integers):
#define AVE(x,y) (((x) + (y) + 1) >> 1) ⇒ max 9 bits: zero-extension is needed (normal-instruction-oriented coding)
#define AVE(x,y) (((x)>>1) + ((y)>>1) + (((x)|(y))&1)) ⇒ max 8 bits: no extension is needed (SIMD-instruction-oriented coding)
But the compiler must extend x and y to their integral type (typically 32 bits) ← integral promotion rule
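To make this concrete, here is a small self-contained check (my addition, not from the slides; the names AVE1/AVE2 are mine for the slides' two AVE definitions) that both forms compute the same rounded-up average for every pair of 8-bit unsigned inputs, while only the first needs a 9-bit intermediate:

#include <assert.h>
#include <stdio.h>

#define AVE1(x,y) (((x) + (y) + 1) >> 1)                   /* needs a 9-bit temp */
#define AVE2(x,y) (((x)>>1) + ((y)>>1) + (((x)|(y)) & 1))  /* stays within 8 bits */

int main(void)
{
    unsigned max_sum = 0;
    for (unsigned x = 0; x <= 255; x++)
        for (unsigned y = 0; y <= 255; y++) {
            assert(AVE1(x, y) == AVE2(x, y));        /* same rounded-up average */
            if (x + y + 1 > max_sum) max_sum = x + y + 1;
        }
    /* 511 needs 9 bits, so AVE1's intermediate overflows an 8-bit SIMD lane,
     * while every intermediate of AVE2 fits in 8 bits. */
    printf("max intermediate of AVE1: %u\n", max_sum);
    return 0;
}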
Data size inference‥‥Method • Get the value range for each node. • Get the altering bits from the value range. • Get the meaningful bits for each node, given those required by its upper node. • Value ranges and required bits are obtained from their inference rules. • Patterns of the meaningful bits are matched during instruction selection. (A minimal value-range sketch follows the figures below.)
[Figure: LIR DAGs for *a = ((*b>>1) + (*c>>1) + ((*b | *c) & 1)); and *a = (*b + *c + 1) >> 1;, with the inferred value range annotated on each node — 0..255 for the MEM:I8 loads, 0..127 for the >>1 results, 0..1 for the carry term, 0..254 and 0..255 for the sums of the SIMD-oriented form, and 0..510 / 0..511 for the intermediate sums of the plain form.]
[Figure: the same two LIR DAGs annotated with the meaningful bit count of each node — every node of the SIMD-oriented form needs at most 8 bits, while the plain (x+y+1)>>1 form contains intermediate nodes that need 9 bits.]
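Below is a minimal sketch of the bottom-up value-range step for the operators appearing in these DAGs. It is my illustration under simplified assumptions (unsigned intervals only, a deliberately coarse rule for BOR), not COINS code or its actual inference rules.

#include <stdio.h>

/* Unsigned interval [lo, hi]; bits() gives the number of bits needed
 * to represent a value, i.e. the altering bits of a range's upper bound. */
typedef struct { unsigned lo, hi; } Range;

static unsigned bits(unsigned v) { unsigned n = 0; while (v) { n++; v >>= 1; } return n; }

static Range r_const(unsigned c)         { return (Range){ c, c }; }
static Range r_add(Range a, Range b)     { return (Range){ a.lo + b.lo, a.hi + b.hi }; }
static Range r_shru(Range a, unsigned k) { return (Range){ a.lo >> k, a.hi >> k }; }
static Range r_and(Range a, Range b)     { return (Range){ 0, a.hi < b.hi ? a.hi : b.hi }; }
static Range r_or(Range a, Range b)      {                /* coarse but sound bound */
    unsigned k = bits(a.hi) > bits(b.hi) ? bits(a.hi) : bits(b.hi);
    return (Range){ a.lo > b.lo ? a.lo : b.lo, (1u << k) - 1 };
}

int main(void)
{
    Range b = { 0, 255 }, c = { 0, 255 };            /* zero-extended MEM:I8 loads */

    /* (b>>1) + (c>>1) + ((b|c) & 1): every node stays within 8 bits */
    Range t1 = r_add(r_shru(b, 1), r_shru(c, 1));    /* 0..254 */
    Range t2 = r_and(r_or(b, c), r_const(1));        /* 0..1   */
    Range e1 = r_add(t1, t2);                        /* 0..255 */

    /* (b + c + 1) >> 1: the intermediate sum needs 9 bits */
    Range s  = r_add(r_add(b, c), r_const(1));       /* 1..511 */
    Range e2 = r_shru(s, 1);                         /* 0..255 */

    printf("SIMD-oriented form: result 0..%u, worst node needs %u bits\n",
           e1.hi, bits(t1.hi) > bits(e1.hi) ? bits(t1.hi) : bits(e1.hi));
    printf("plain form: intermediate 1..%u needs %u bits, result 0..%u\n",
           s.hi, bits(s.hi), e2.hi);
    return 0;
}

The computed ranges reproduce the annotations in the figures; the top-down meaningful-bits pass and the matching during instruction selection are omitted here.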
SIMD Benchmark‥‥Why needed? • Existing benchmarks are not suited to tuning SIMD optimization. • SIMD-optimizable patterns are buried among non-SIMD-optimizable ones. • Existing code is far from directly SIMD-optimizable (no "hole-in-one" matching). • Step-wise milestones for SIMD optimization were required.
SIMD Benchmark‥‥Design • SIMD-optimizable code patterns were extracted from real media-processing applications. • Multiple versions were crafted by hand for each code pattern so as to • cover a wide range, from an easily SIMD-optimized level down to the original • be classified by SIMD optimization technique • report execution times for each version
Original / If-peeled (and loop-unrolled / not)

Original:
int16_t acLevel = data[i];
if (acLevel < 0) {
  acLevel = (-acLevel) - quant_d_2;
  if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
  acLevel = (acLevel * mult) >> SCALEBITS;
  sum += acLevel;
  coeff[i] = -acLevel;
} else {
  acLevel = acLevel - quant_d_2;
  if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
  acLevel = (acLevel * mult) >> SCALEBITS;
  sum += acLevel;
  coeff[i] = acLevel;
}

If-peeled:
acLevel = ((data[i] < 0) ? -data[i] : data[i]) - quant_d_2;
acLevel2 = (acLevel * mult) >> SCALEBITS;
sum += ((acLevel < quant_m_2) ? 0 : acLevel2);
coeff[i] = ((acLevel < quant_m_2) ? 0
            : ((data[i] < 0) ? -acLevel2 : acLevel2));
Original / If-converted (and loop-unrolled / not)

Original:
int16_t acLevel = data[i];
if (acLevel < 0) {
  acLevel = (-acLevel) - quant_d_2;
  if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
  acLevel = (acLevel * mult) >> SCALEBITS;
  sum += acLevel;
  coeff[i] = -acLevel;
} else {
  acLevel = acLevel - quant_d_2;
  if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
  acLevel = (acLevel * mult) >> SCALEBITS;
  sum += acLevel;
  coeff[i] = acLevel;
}

If-converted:
acMsk1 = (int)data[i] >> 31;
acLevel = ((data[i] & ~acMsk1) | ((-data[i]) & acMsk1)) - quant_d_2;
acMsk2 = (acLevel < quant_m_2) ? 0 : 0xffff;
acLevel = (acLevel * mult) >> SCALEBITS;
sum += acMsk2 & acLevel;
coeff[i] = acMsk2 & (((-acLevel) & acMsk1) | (acLevel & (~acMsk1)));
Current status and required improvements • The backbone of SIMD optimization has been implemented. • The following are must-haves: • Enrichment of the templates for specific SIMD operations. • Separation of the machine-dependent and machine-independent parts of the SIMD optimizer. • A recovery method for failures in SIMD operation matching. • Alignment and overlap checks for pointers ⇒ will be addressed in the next release. (A sketch of such a check follows.)
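The last point can be illustrated with a small runtime guard. This is a sketch of the general technique under my own assumptions (a 16-byte vector width, and the names VLEN, far_enough and simd_safe), not the check COINS will actually emit; it mirrors the assumption stated on the example slides that all pointers are aligned and that source and destination pointers are at least one vector register apart.

#include <stdint.h>

#define VLEN 16   /* assumed vector width in bytes */

/* True if the two pointers are at least one vector register apart. */
static int far_enough(const void *a, const void *b)
{
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    uintptr_t dist = pa > pb ? pa - pb : pb - pa;
    return dist >= VLEN;
}

/* Take the SIMD path only when the alignment and distance assumptions hold. */
static int simd_safe(const void *dst, const void *src1, const void *src2)
{
    return ((uintptr_t)dst  % VLEN == 0) &&
           ((uintptr_t)src1 % VLEN == 0) &&
           ((uintptr_t)src2 % VLEN == 0) &&
           far_enough(dst, src1) &&
           far_enough(dst, src2);
}

A guard of this kind would let generated code fall back to the scalar loop whenever simd_safe returns 0, instead of relying on programmer-stated assumptions.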