SIMD Optimization in COINS Compiler Infrastructure Mitsugu Suzuki (The University of Electro-Communications) Nobuhisa Fujinami (Sony Computer Entertainment Inc.)
Agenda • COINS SIMD optimization • Two topics on SIMD optimization • Data Size Inference • SIMD Benchmark • Current status and required improvements
SIMD optimization‥‥Concept and decisions • Implemented as an LIR-to-LIR transformer. • Requires no additional special extensions for source languages. • Optimizations achievable at the source level are postponed → they are HIR-level matters, e.g. vectorization (appropriate loop unrolling), if-peeling, complex if-conversion, etc.
#define AVE(x,y) (((x)>>1)+((y)>>1)+(((x)|(y))&1))
short *v1, *v2, *v3;
/* Assume that all pointers are aligned, and distances of source and
   destination pointers are longer than the size of a vector register. */

for (i = 0; i < M; i++)            // case-A
  *v1++ = AVE(*v2++, *v3++);

for (i = 0; i < M; i++)            // case-B
  v1[i] = AVE(v2[i], v3[i]);

for (i = 0; i < M; i += 4) {       // case-C
  v1[i]   = AVE(v2[i],   v3[i]);
  v1[i+1] = AVE(v2[i+1], v3[i+1]);
  ...
  v1[i+3] = AVE(v2[i+3], v3[i+3]);
}

for (i = 0; i < M; i += 4) {       // case-D
  v1[0] = AVE(v2[0], v3[0]);
  v1[1] = AVE(v2[1], v3[1]);
  ...
  v1[3] = AVE(v2[3], v3[3]);
  v1 += 4; v2 += 4; v3 += 4;
}

× ○  (slide marks: × = not SIMD-optimized, ○ = SIMD-optimized)
#define AVE(x,y) (((x)>>1)+((y)>>1)+(((x)|(y))&1))
struct { short r, g, b, a; } *u1, *u2, *u3;
/* Assume that all pointers are aligned, and distances of source and
   destination pointers are longer than the size of a vector register. */

for (i = 0; i < M; i++) {          // case-E
  u1[i].r = AVE(u2[i].r, u3[i].r);
  u1[i].g = AVE(u2[i].g, u3[i].g);
  u1[i].b = AVE(u2[i].b, u3[i].b);
  u1[i].a = AVE(u2[i].a, u3[i].a);
}

○  (SIMD-optimized)
SIMD optimization‥‥Processing flow • If-conversion. • Decompose basic blocks into DAGs. • Match LIR patterns to specific SIMD operations. • Combine identical basic operations (parallelization); a sketch of this last step follows. (⇒ 3rd page of the handout)
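As an illustration of the last step, here is a minimal hand-written sketch of what combining identical 16-bit operations could look like on an SSE2 target. The target ISA, the function name ave8, and the non-negative-value assumption are mine, not taken from COINS; the slides do not show generated code.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical illustration, not COINS output: once the DAG matcher has
 * recognized eight identical 16-bit AVE sequences in one basic block,
 * "combining same basic operations" replaces them with single vector ops.
 * Values are assumed non-negative (as in the 8-bit example later), so the
 * per-lane logical shift matches the scalar >> in the AVE macro. */
static void ave8(int16_t *v1, const int16_t *v2, const int16_t *v3)
{
    __m128i x = _mm_loadu_si128((const __m128i *)v2);   /* 8 x 16-bit lanes */
    __m128i y = _mm_loadu_si128((const __m128i *)v3);
    __m128i halves = _mm_add_epi16(_mm_srli_epi16(x, 1), _mm_srli_epi16(y, 1));
    __m128i carry  = _mm_and_si128(_mm_or_si128(x, y), _mm_set1_epi16(1));
    _mm_storeu_si128((__m128i *)v1, _mm_add_epi16(halves, carry));
}

The scalar shift/or/and/add sequences of an unrolled loop body collapse into one vector instruction each, which is the combining step applied to the DAG.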
Data size inference‥‥Why needed? Two styles of averaging integers (assumption: both x and y are 8-bit unsigned integers):
#define AVE(x,y) (((x) + (y) + 1) >> 1) ⇒ max 9 bits: zero-extension is needed (normal-instruction-oriented coding)
#define AVE(x,y) (((x)>>1) + ((y)>>1) + (((x)|(y))&1)) ⇒ max 8 bits: no extension is needed (SIMD-instruction-oriented coding)
But the compiler must extend x and y to their integral type (typically 32 bits) ← integral promotion rule
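To make this concrete, here is a small self-contained check (my addition, not from the slides; the names AVE1/AVE2 are mine for the slides' two AVE definitions) that both forms compute the same rounded-up average for every pair of 8-bit unsigned inputs, while only the first needs a 9-bit intermediate:

#include <assert.h>
#include <stdio.h>

#define AVE1(x,y) (((x) + (y) + 1) >> 1)                   /* needs a 9-bit temp */
#define AVE2(x,y) (((x)>>1) + ((y)>>1) + (((x)|(y)) & 1))  /* stays within 8 bits */

int main(void)
{
    unsigned max_sum = 0;
    for (unsigned x = 0; x <= 255; x++)
        for (unsigned y = 0; y <= 255; y++) {
            assert(AVE1(x, y) == AVE2(x, y));        /* same rounded-up average */
            if (x + y + 1 > max_sum) max_sum = x + y + 1;
        }
    /* 511 needs 9 bits, so AVE1's intermediate overflows an 8-bit SIMD lane,
     * while every intermediate of AVE2 fits in 8 bits. */
    printf("max intermediate of AVE1: %u\n", max_sum);
    return 0;
}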
Data size inference‥‥Method • Get the value range for each node. • Get the altering bits from the value range. • Get the meaningful bits for each node, given those required by its upper node. • Value ranges and required bits are obtained from their inference rules. • Patterns of the meaningful bits are matched during instruction selection. (A minimal value-range sketch follows the figures below.)
[Figure: LIR DAGs for *a = ((*b>>1) + (*c>>1) + ((*b | *c) & 1)); and *a = (*b + *c + 1) >> 1;, with the inferred value range annotated on each node — 0..255 for the MEM:I8 loads, 0..127 for the >>1 results, 0..1 for the carry term, 0..254 and 0..255 for the sums of the SIMD-oriented form, and 0..510 / 0..511 for the intermediate sums of the plain form.]
[Figure: the same two LIR DAGs annotated with the meaningful bit count of each node — every node of the SIMD-oriented form needs at most 8 bits, while the plain (x+y+1)>>1 form contains intermediate nodes that need 9 bits.]
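Below is a minimal sketch of the bottom-up value-range step for the operators appearing in these DAGs. It is my illustration under simplified assumptions (unsigned intervals only, a deliberately coarse rule for BOR), not COINS code or its actual inference rules.

#include <stdio.h>

/* Unsigned interval [lo, hi]; bits() gives the number of bits needed
 * to represent a value, i.e. the altering bits of a range's upper bound. */
typedef struct { unsigned lo, hi; } Range;

static unsigned bits(unsigned v) { unsigned n = 0; while (v) { n++; v >>= 1; } return n; }

static Range r_const(unsigned c)         { return (Range){ c, c }; }
static Range r_add(Range a, Range b)     { return (Range){ a.lo + b.lo, a.hi + b.hi }; }
static Range r_shru(Range a, unsigned k) { return (Range){ a.lo >> k, a.hi >> k }; }
static Range r_and(Range a, Range b)     { return (Range){ 0, a.hi < b.hi ? a.hi : b.hi }; }
static Range r_or(Range a, Range b)      {                /* coarse but sound bound */
    unsigned k = bits(a.hi) > bits(b.hi) ? bits(a.hi) : bits(b.hi);
    return (Range){ a.lo > b.lo ? a.lo : b.lo, (1u << k) - 1 };
}

int main(void)
{
    Range b = { 0, 255 }, c = { 0, 255 };            /* zero-extended MEM:I8 loads */

    /* (b>>1) + (c>>1) + ((b|c) & 1): every node stays within 8 bits */
    Range t1 = r_add(r_shru(b, 1), r_shru(c, 1));    /* 0..254 */
    Range t2 = r_and(r_or(b, c), r_const(1));        /* 0..1   */
    Range e1 = r_add(t1, t2);                        /* 0..255 */

    /* (b + c + 1) >> 1: the intermediate sum needs 9 bits */
    Range s  = r_add(r_add(b, c), r_const(1));       /* 1..511 */
    Range e2 = r_shru(s, 1);                         /* 0..255 */

    printf("SIMD-oriented form: result 0..%u, worst node needs %u bits\n",
           e1.hi, bits(t1.hi) > bits(e1.hi) ? bits(t1.hi) : bits(e1.hi));
    printf("plain form: intermediate 1..%u needs %u bits, result 0..%u\n",
           s.hi, bits(s.hi), e2.hi);
    return 0;
}

The computed ranges reproduce the annotations in the figures; the top-down meaningful-bits pass and the matching during instruction selection are omitted here.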
SIMD Benchmark‥‥Why needed? • Existing benchmarks are not suited to tuning SIMD optimization. • SIMD-optimizable patterns are buried among non-SIMD-optimizable ones. • Existing code is far from directly SIMD-optimizable (no "hole-in-one" matching). • Step-wise milestones for SIMD optimization were required.
SIMD Benchmark‥‥Design • SIMD-optimizable code patterns were extracted from real media-processing applications. • Multiple versions were crafted by hand for each code pattern so as to • cover a wide range, from an easily SIMD-optimized level down to the original • be classified by SIMD optimization technique • report execution times for each version
Original / If-peeled (and loop-unrolled / not)

Original:
int16_t acLevel = data[i];
if (acLevel < 0) {
  acLevel = (-acLevel) - quant_d_2;
  if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
  acLevel = (acLevel * mult) >> SCALEBITS;
  sum += acLevel;
  coeff[i] = -acLevel;
} else {
  acLevel = acLevel - quant_d_2;
  if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
  acLevel = (acLevel * mult) >> SCALEBITS;
  sum += acLevel;
  coeff[i] = acLevel;
}

If-peeled:
acLevel = ((data[i] < 0) ? -data[i] : data[i]) - quant_d_2;
acLevel2 = (acLevel * mult) >> SCALEBITS;
sum += ((acLevel < quant_m_2) ? 0 : acLevel2);
coeff[i] = ((acLevel < quant_m_2) ? 0
            : ((data[i] < 0) ? -acLevel2 : acLevel2));
Original / If-converted (and loop-unrolled / not)

Original:
int16_t acLevel = data[i];
if (acLevel < 0) {
  acLevel = (-acLevel) - quant_d_2;
  if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
  acLevel = (acLevel * mult) >> SCALEBITS;
  sum += acLevel;
  coeff[i] = -acLevel;
} else {
  acLevel = acLevel - quant_d_2;
  if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
  acLevel = (acLevel * mult) >> SCALEBITS;
  sum += acLevel;
  coeff[i] = acLevel;
}

If-converted:
acMsk1 = (int)data[i] >> 31;
acLevel = ((data[i] & ~acMsk1) | ((-data[i]) & acMsk1)) - quant_d_2;
acMsk2 = (acLevel < quant_m_2) ? 0 : 0xffff;
acLevel = (acLevel * mult) >> SCALEBITS;
sum += acMsk2 & acLevel;
coeff[i] = acMsk2 & (((-acLevel) & acMsk1) | (acLevel & (~acMsk1)));
Current status and required improvements • The backbone of SIMD optimization has been implemented. • The following are must-haves: • Enrichment of the templates for specific SIMD operations. • Separation of the machine-dependent and machine-independent parts of the SIMD optimizer. • A recovery method for failures in SIMD operation matching. • Alignment and overlap checks for pointers ⇒ will be addressed in the next release. (A sketch of such a check follows.)
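The last point can be illustrated with a small runtime guard. This is a sketch of the general technique under my own assumptions (a 16-byte vector width, and the names VLEN, far_enough and simd_safe), not the check COINS will actually emit; it mirrors the assumption stated on the example slides that all pointers are aligned and that source and destination pointers are at least one vector register apart.

#include <stdint.h>

#define VLEN 16   /* assumed vector width in bytes */

/* True if the two pointers are at least one vector register apart. */
static int far_enough(const void *a, const void *b)
{
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    uintptr_t dist = pa > pb ? pa - pb : pb - pa;
    return dist >= VLEN;
}

/* Take the SIMD path only when the alignment and distance assumptions hold. */
static int simd_safe(const void *dst, const void *src1, const void *src2)
{
    return ((uintptr_t)dst  % VLEN == 0) &&
           ((uintptr_t)src1 % VLEN == 0) &&
           ((uintptr_t)src2 % VLEN == 0) &&
           far_enough(dst, src1) &&
           far_enough(dst, src2);
}

A guard of this kind would let generated code fall back to the scalar loop whenever simd_safe returns 0, instead of relying on programmer-stated assumptions.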