160 likes | 394 Views
Optimizing C63 for x86. Group 9. Bird’s-eye view: gprof of reference encoder. Optimizing SAD. Results. Outline. Each sample counts as 0.01 seconds. (gprof column header: % time, cumulative seconds, self seconds, calls, self us/call, total us/call, name)
END
Optimizing C63 for x86 Group 9
Bird’s-eye view: gprof of reference encoder • Optimizing SAD • Results Outline
Each sample counts as 0.01 seconds.

  %   cumulative   self               self     total
 time   seconds   seconds      calls  us/call  us/call  name
93.05     37.36    37.36  500214528     0.07     0.07  sad_block_8x8
 3.04     38.58     1.22     705672     1.73    54.67  me_block_8x8
 1.77     39.29     0.71     712800     1.00     1.00  dequant_idct_block_8x8
 1.25     39.79     0.50     712800     0.70     0.70  dct_quant_block_8x8
 0.35     39.93     0.14     713100     0.20     0.29  flush_bits
 0.17     40.00     0.07   16398942     0.00     0.00  put_bits

granularity: each sample hit covers 2 byte(s) for 0.02% of 40.16 seconds

gprof: reference, foreman
/*
 * Compute the 8x8 sum of absolute differences (SAD) between two blocks
 * using the SSE2 PSADBW instruction, processing two rows (16 bytes) per
 * iteration.
 *
 * block1, block2: top-left byte of each 8x8 block
 * stride:         distance in bytes between consecutive rows (same for both)
 * result:         receives the total SAD over all 64 byte pairs
 *
 * Fix vs. original: the third parameter was declared `intstride` (missing
 * space), which does not compile; it is `int stride`.
 *
 * NOTE(review): the `*(__m64 *)` loads assume the rows are readable as
 * 8-byte units and sidestep strict aliasing — presumably fine with the
 * project's compiler flags, but worth confirming.
 */
void sad_block_8x8(uint8_t *block1, uint8_t *block2, int stride, int *result)
{
  int v;
  __m128i r = _mm_setzero_si128();

  for (v = 0; v < 8; v += 2)
  {
    /* Pack rows v and v+1 (8 bytes each) of both blocks into one 128-bit
     * register apiece.  Both operands use the same row order, so the byte
     * pairing for the SAD is consistent. */
    const __m128i b1 = _mm_set_epi64(*(__m64 *) &block1[(v+0)*stride],
                                     *(__m64 *) &block1[(v+1)*stride]);
    const __m128i b2 = _mm_set_epi64(*(__m64 *) &block2[(v+0)*stride],
                                     *(__m64 *) &block2[(v+1)*stride]);

    /* PSADBW produces one 16-bit partial sum per 64-bit lane; accumulate. */
    r = _mm_add_epi16(r, _mm_sad_epu8(b2, b1));
  }

  /* The two per-lane partial sums live in 16-bit elements 0 and 4. */
  *result =
    _mm_extract_epi16(r, 0) +
    _mm_extract_epi16(r, 4);
}
Gprof: Improved SSE2 SAD uses 3.69 s vs. 38.50 s* • Cachegrind: Lots of branch prediction misses in me_block_8x8
*) Foreman sequence on gpu-7
How well does this perform?
/*
 * Compute the SAD of one 8x8 source block against eight horizontally
 * adjacent 8x8 reference positions (x offsets 0..7) in one pass using
 * SSE4.1 MPSADBW, then pick the minimum with PHMINPOSUW.
 *
 * block1: source 8x8 block
 * block2: reference area; every row is read 16 bytes wide (unaligned),
 *         so bytes for x offsets 0..7 must be readable on all 8 rows
 * stride: distance in bytes between rows (same for both blocks)
 * best:   receives the x offset (0..7) with the smallest SAD
 * result: receives that smallest SAD value
 *
 * Changes vs. original:
 *  - `0b…` binary immediates (GCC extension before C23) replaced with
 *    standard hex literals of the same value;
 *  - the PHMINPOSUW result is extracted with _mm_extract_epi16 instead
 *    of a union with a bitfield, whose layout is implementation-defined;
 *  - target attribute added so this translation unit compiles without
 *    enabling SSE4.1 globally.
 */
__attribute__((target("sse4.1")))
void sad_block_8x8x8(uint8_t *block1, uint8_t *block2, int stride, int *best,
                     int *result)
{
  int v;
  __m128i r = _mm_setzero_si128();

  for (v = 0; v < 8; v += 2)
  {
    /* b1: source rows v (low 8 bytes) and v+1 (high 8 bytes). */
    const __m128i b1 = _mm_set_epi64(*(__m64 *) &block1[(v+1)*stride],
                                     *(__m64 *) &block1[(v+0)*stride]);
    /* b2/b3: 16 reference bytes of rows v and v+1 — covers all eight
     * horizontal search positions for an 8-byte-wide block. */
    const __m128i b2 = _mm_loadu_si128((__m128i *) &block2[(v+0)*stride]);
    const __m128i b3 = _mm_loadu_si128((__m128i *) &block2[(v+1)*stride]);

    /* MPSADBW imm8: bit 2 selects the 4-byte starting offset in the first
     * operand's sliding window; bits 1:0 select the 4-byte group of the
     * second operand.  Groups 0/1 of b1 are source row v, groups 2/3 are
     * row v+1, so the four calls cover both full rows at all 8 offsets. */
    r = _mm_add_epi16(r, _mm_mpsadbw_epu8(b2, b1, 0x0)); /* row v,   src bytes 0-3 */
    r = _mm_add_epi16(r, _mm_mpsadbw_epu8(b2, b1, 0x5)); /* row v,   src bytes 4-7 */
    r = _mm_add_epi16(r, _mm_mpsadbw_epu8(b3, b1, 0x2)); /* row v+1, src bytes 0-3 */
    r = _mm_add_epi16(r, _mm_mpsadbw_epu8(b3, b1, 0x7)); /* row v+1, src bytes 4-7 */
  }

  /* PHMINPOSUW: minimum of the eight 16-bit SADs in element 0, its index
   * in bits 0-2 of element 1. */
  const __m128i mp = _mm_minpos_epu16(r);
  *result = _mm_extract_epi16(mp, 0);
  *best   = _mm_extract_epi16(mp, 1) & 0x7;
}
Gprof: Improved SSE4.1 SAD uses 0.90 s vs. 3.69 s • Cachegrind: Branch prediction misses reduced by a factor of 8 • Intel’s IACA tool: CPU pipeline appears to be filled! Appears both source and reference block loads compete with (V)MPSADBW for the CPU’s execution ports • Assembly: Better utilize AVX’s non-destructive instructions (fewer register copies); better utilize loaded data for SAD computations with the same source block
How well does this perform?
Load source block once:

.macro .load_src
    vmovq   (%rdi),          %xmm0          # src[0]
    vpinsrq $1, (%rdi,%rdx), %xmm0, %xmm0   # src[1]
    vmovq   (%rdi,%rdx,2),   %xmm1          # src[2]
    vmovhps (%rdi,%r8),      %xmm1, %xmm1   # src[3]
    vmovq   (%rdi,%rdx,4),   %xmm2          # src[4]
    vmovhps (%rdi,%r9),      %xmm2, %xmm2   # src[5]
    vmovq   (%rdi,%r8,2),    %xmm3          # src[6]
    vmovhps (%rdi,%rax),     %xmm3, %xmm3   # src[7]

Do SAD for 8x8 vs. 8x8 blocks (relative y = 0..8, x = 0..8):

.macro .8x8x8x8
    vmovdqu (%rsi),      %xmm12   # ref[0]
    .8x8x1  0, 0, 12, 4, 0
    vmovdqu (%rsi,%rdx), %xmm13   # ref[1]
    .8x8x1  0, 1, 13, 4, 0
    .8x8x1  1, 0, 13, 5, 0

SAD 8x8x8x8: fewer src loads, fewer branches
Gprof: Improved SAD 8x8x8x8 uses 0.53 s vs. 0.90 s • Valgrind: Even less branching
How well does this perform?
sad_4x8x8x8x8:
    .load_src                       # Load source block from %rdi
    mov     %rsi, %rdi              # Reference block x, y
    .8x8x8x8
    lea     8(%rdi), %rsi           # Reference block x+8, y
    .8x8x8x8
    lea     (%rdi,%rdx,8), %rsi     # Reference block x, y+8
    .8x8x8x8
    lea     8(%rdi,%rdx,8), %rsi    # Reference block x+8, y+8
    .8x8x8x8
    ...
    vphminposuw ...
    ...
    ret

SAD 4x8x8x8x8: Branchless UV-plane ME
Gprof: Improved 4x8x8x8x8 SAD uses 0.47 s vs. 0.53 s • Valgrind: Even less branching! • Total runtime of Foreman sequence reduced from ~40.1 s to ~1.6 s (factor of 25)
How well does this perform?
NOT ON NDLAB!

Each sample counts as 0.01 seconds.

  %   cumulative   self                self     total
 time   seconds   seconds       calls   s/call   s/call  name
92.90    124.66   124.66  1809584896     0.00     0.00  sad_block_8x8
 3.06    128.76     4.10          49     0.08     2.63  c63_motion_estimate
 1.58    130.89     2.13     2448000     0.00     0.00  dct_quant_block_8x8
 1.58    133.01     2.13     2448000     0.00     0.00  dequant_idct_block_8x8
 0.26    133.36     0.35     2448000     0.00     0.00  write_block
 0.18    133.60     0.24         150     0.00     0.02  dequantize_idct
 0.16    133.82     0.22          49     0.00     0.00  c63_motion_compensate
 0.13    133.99     0.17         150     0.00     0.02  dct_quantize
 0.11    134.14     0.15    55939437     0.00     0.00  put_bits

granularity: each sample hit covers 2 byte(s) for 0.01% of 134.21 seconds

gprof: reference, tractor, 50 frames (-O3)
NOT ON NDLAB!

Each sample counts as 0.01 seconds.

  %   cumulative   self               self     total
 time   seconds   seconds      calls  ms/call  ms/call  name
31.15      1.83     1.83    6869016     0.00     0.00  sad_4x8x8x8x8
24.75      3.28     1.45    2448000     0.00     0.00  dct_quant_block_8x8
23.81      4.67     1.40    2448000     0.00     0.00  dequant_idct_block_8x8
 7.17      5.09     0.42         50     8.40    12.20  write_frame
 4.44      5.35     0.26        150     1.73    11.34  dequantize_idct
 3.24      5.54     0.19   55937692     0.00     0.00  put_bits
 1.71      5.64     0.10        150     0.67    10.64  dct_quantize
 1.54      5.73     0.09    9792000     0.00     0.00  transpose_block_avx
 0.85      5.78     0.05     798700     0.00     0.00  sad_8x8x8x8
 0.68      5.82     0.04         49     0.82    39.09  c63_motion_estimate
 0.51      5.85     0.03         49     0.61     0.61  c63_motion_compensate

granularity: each sample hit covers 2 byte(s) for 0.17% of 5.86 seconds

Speedup: 134.2 s / 5.86 s = ~24x

gprof: improved, tractor, 50 frames (-O3)