Image 2x Shrink: SSE implementation and benchmarking comparison between Merom and Penryn
Dr. Zvi Danovich, Senior Application Engineer
November – December 2007
Agenda
• General description of 2x Shrink
• Step 1: weights computation
• Step 2: components computation
• Benchmarks and conclusions
General description
• Each pixel has 3 color components (r, g, b) and a 4th component ‘a’, the weight; each is 1 byte long.
• Each pair of pixel lines (an even line and an odd line) is interpolated into one line of half the length: 2 opposite pixel pairs, one from each line, are combined into 1 pixel of the shrunk image.
• New (interpolated) component: C = ∑(c·a)0-3 ∕ ∑(a)0-3, where ‘c’ is r, g or b and the sums run over the 4 source pixels 0-3.
• New weight ‘a’: A = min(255, ½ ∑(a)0-3).
• Preliminary step: reading (loading) m128i_Ev01 and m128i_Ev23 from the even source line, and m128i_Od01 and m128i_Od23 from the odd source line.
[Figure: even and odd source lines of (r, g, b, a) pixels, loaded as m128i_Ev01, m128i_Ev23 and m128i_Od01, m128i_Od23, and the resulting “shrunk” pixels 0, 1, 2, 3.]
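The two formulas above can be checked against a plain scalar reference before any vectorization. The following is a minimal sketch, not the author's code: the names `Pixel` and `shrink_2x2` are hypothetical, the rounding to nearest and the handling of a zero weight sum are assumptions (the slides do not specify either), and integer arithmetic is used instead of the floating-point path described later.

```c
#include <stdint.h>

/* One source or result pixel: r, g, b and the weight a, 1 byte each.
   (Hypothetical type name, introduced for this sketch only.) */
typedef struct { uint8_t r, g, b, a; } Pixel;

/* Weighted average wsum / asum, rounded to nearest.
   Assumption: a zero weight sum yields 0; the slides do not cover this case. */
static uint8_t weighted_avg(uint32_t wsum, uint32_t asum)
{
    if (asum == 0)
        return 0;
    return (uint8_t)((wsum + asum / 2) / asum);
}

/* Combine the 4 source pixels of one 2x2 block into 1 shrunk pixel:
   C = sum(c*a) / sum(a) for c in {r, g, b}, A = min(255, sum(a) / 2). */
Pixel shrink_2x2(Pixel p0, Pixel p1, Pixel p2, Pixel p3)
{
    Pixel src[4] = { p0, p1, p2, p3 };
    uint32_t asum = 0, rsum = 0, gsum = 0, bsum = 0;

    for (int i = 0; i < 4; ++i) {
        asum += src[i].a;
        rsum += (uint32_t)src[i].r * src[i].a;
        gsum += (uint32_t)src[i].g * src[i].a;
        bsum += (uint32_t)src[i].b * src[i].a;
    }

    Pixel out;
    uint32_t half = asum / 2;
    out.r = weighted_avg(rsum, asum);
    out.g = weighted_avg(gsum, asum);
    out.b = weighted_avg(bsum, asum);
    out.a = (uint8_t)(half > 255 ? 255 : half);
    return out;
}
```

A reference like this is mainly useful for comparing the output of a vectorized version pixel by pixel; small differences in the last bit are possible if the vector path rounds differently.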
Step 1: weights computation
1.1 Building the partial sums (a0+a1), (a2+a3), …
• Build 8 16-bit ‘a’-s (m128i_8a) from the even and odd lines with 2 shuffles and a logical ‘or’.
• Compute the partial sums by MADD with 8 16-bit ‘1’-s.
[Figure: the ‘a’ bytes of the even and odd lines are gathered into m128i_8a = (a0, a1, a2, a3, a4, a5, a6, a7); MADD with (1, 1, 1, 1, 1, 1, 1, 1) gives (a0+a1, a2+a3, a4+a5, a6+a7).]
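A scalar model may help to see what this sub-step produces. The sketch below is an illustration under assumptions, not the original implementation: `gather_a` and `madd_model` are hypothetical names, and the lane order (each even-line pixel pair followed by the opposite odd-line pair) is assumed so that the later horizontal add yields one sum per 2x2 block. `madd_model` mimics the pairwise multiply-add of the 16-bit MADD instruction (PMADDWD, exposed as `_mm_madd_epi16`).

```c
#include <stdint.h>

/* Widen the 'a' byte (4th byte of each r,g,b,a pixel) of 4 even-line and
   4 odd-line pixels into 8 16-bit lanes.
   Assumed lane order: even pair, opposite odd pair, next even pair, ... */
void gather_a(const uint8_t even[16], const uint8_t odd[16], int16_t a[8])
{
    for (int p = 0; p < 2; ++p) {               /* p: index of the pixel pair */
        a[4 * p + 0] = even[4 * (2 * p) + 3];
        a[4 * p + 1] = even[4 * (2 * p + 1) + 3];
        a[4 * p + 2] = odd[4 * (2 * p) + 3];
        a[4 * p + 3] = odd[4 * (2 * p + 1) + 3];
    }
}

/* Scalar model of MADD on 8 signed 16-bit lanes: multiply lane by lane and
   add each adjacent pair of products into one 32-bit lane. */
void madd_model(const int16_t x[8], const int16_t y[8], int32_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = (int32_t)x[2 * i] * y[2 * i]
               + (int32_t)x[2 * i + 1] * y[2 * i + 1];
}
```

With all lanes of `y` set to 1, `madd_model` returns exactly the pair sums (a0+a1, a2+a3, a4+a5, a6+a7) named on the slide.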
Step 1: weights computation (cont.)
1.2 Building the sums (a0+a1+a2+a3), (a4+a5+a6+a7), … and their reciprocals
• Perform the same computation for the second pair of pixel quads, obtaining (a8+a9, a10+a11, a12+a13, a14+a15).
• Build the final sums using HADD.
• Convert the result to floating point (FP) and compute the reciprocals.
[Figure: HADD of (a0+a1, a2+a3, a4+a5, a6+a7) and (a8+a9, a10+a11, a12+a13, a14+a15) gives (∑(a)0-3, ∑(a)4-7, ∑(a)8-11, ∑(a)12-15); conversion to FP and reciprocal give (1/∑(a)0-3, 1/∑(a)4-7, 1/∑(a)8-11, 1/∑(a)12-15).]
Here we have 4 FP ‘a’-sum reciprocals: the normalization coefficients.
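The horizontal add and the reciprocal step can be modeled in scalar C as follows. This is a sketch under assumptions: `hadd_model` and `reciprocals` are hypothetical names, the reciprocal is computed by exact division (the slides do not say whether a division or an approximate reciprocal instruction is used), and a zero sum is mapped to 0, which is not covered by the slides.

```c
#include <stdint.h>

/* Scalar model of HADD on 32-bit lanes (PHADDD, exposed as _mm_hadd_epi32):
   adjacent lanes of x fill the low half, adjacent lanes of y the high half. */
void hadd_model(const int32_t x[4], const int32_t y[4], int32_t out[4])
{
    out[0] = x[0] + x[1];
    out[1] = x[2] + x[3];
    out[2] = y[0] + y[1];
    out[3] = y[2] + y[3];
}

/* Convert the 4 weight sums to float and take reciprocals.
   Assumption: a zero sum gives a zero coefficient. */
void reciprocals(const int32_t sum[4], float rcp[4])
{
    for (int i = 0; i < 4; ++i)
        rcp[i] = (sum[i] != 0) ? 1.0f / (float)sum[i] : 0.0f;
}
```

Feeding `hadd_model` with the two vectors of pair sums gives one weight sum per shrunk pixel, so a single pass produces the normalization coefficients for 4 output pixels.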
Step 1: weights computation (cont.)
1.3 Building new A0, A1, A2, A3
• Compute the new ‘a’: min(255, ½∑a). SRAI of ((∑a)0, (∑a)1, (∑a)2, (∑a)3) by 1 is an arithmetic shift right by 1 bit, i.e. division by 2; MIN of (½(∑a)0, ½(∑a)1, ½(∑a)2, ½(∑a)3) with (255, 255, 255, 255) applies the upper limit.
• As the values are <= 255, each resulting 32-bit lane is equivalent to a single byte A0, A1, A2, A3 in the lowest position of its lane.
• Finally, a logical shift moves each A to the 4th position.
This is the basis of the resulting quad of pixels.
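In scalar form this sub-step is three operations per lane. The sketch below uses a hypothetical function name, `new_a_lanes`, and assumes that the "4th position" means bits 24-31 of each 32-bit lane, which matches an r, g, b, a byte order in memory on a little-endian machine.

```c
#include <stdint.h>

/* For each of the 4 weight sums: halve it (model of SRAI by 1), limit it
   to 255 (model of MIN) and move it to the 4th byte of the 32-bit lane. */
void new_a_lanes(const int32_t asum[4], uint32_t out[4])
{
    for (int i = 0; i < 4; ++i) {
        int32_t half = asum[i] >> 1;            /* sums are non-negative */
        int32_t a = (half < 255) ? half : 255;
        out[i] = (uint32_t)a << 24;             /* assumed: 4th position = bits 24..31 */
    }
}
```

The largest possible sum is 4 * 255 = 1020, so the halved value can reach 510 and the limit to 255 is actually needed.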
Step 2: components computation
2.1 Computation of 4 ‘b’-s: building the partial sums (a0b0+a1b1), (a2b2+a3b3), …
• Build 8 16-bit ‘b’-s from the 8 8-bit ‘b’-s of the even and odd lines with 2 shuffles and a logical ‘or’.
• Compute the partial sums by MADD with the 8 16-bit ‘a’-s from the previous step.
[Figure: MADD of (b0, b1, b2, b3, b4, b5, b6, b7) with (a0, a1, a2, a3, a4, a5, a6, a7) gives (a0b0+a1b1, a2b2+a3b3, a4b4+a5b5, a6b6+a7b7), written in short notation as (∑(ab)0,1, ∑(ab)2,3, ∑(ab)4,5, ∑(ab)6,7).]
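The weighted pair sums can be modeled in a few lines of scalar C, which also makes the value ranges easy to check. This is a sketch with a hypothetical function name, `weighted_pair_sums`; it takes the 8 component bytes and the 8 weight bytes already in matching lane order.

```c
#include <stdint.h>

/* Model of widening 8 component bytes to 16 bits and applying MADD with
   the 8 16-bit weights: out[i] = a[2i]*c[2i] + a[2i+1]*c[2i+1].
   Both operands are at most 255, so they fit in a signed 16-bit lane, and
   each pair sum is at most 2 * 255 * 255 = 130050, well inside 32 bits. */
void weighted_pair_sums(const uint8_t c[8], const uint8_t a[8], int32_t out[4])
{
    for (int i = 0; i < 4; ++i) {
        int16_t c0 = c[2 * i], c1 = c[2 * i + 1];
        int16_t a0 = a[2 * i], a1 = a[2 * i + 1];
        out[i] = (int32_t)a0 * c0 + (int32_t)a1 * c1;
    }
}
```

The range comment is the point of the example: because every input is an unsigned byte, the signed 16-bit multiply of MADD cannot overflow or misinterpret a sign bit here.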
Step 2: components computation
2.2 Building the sums (a0b0+a1b1+a2b2+a3b3), … and the final results in FP form
• Perform the same computation for the second pair of pixel quads, obtaining (∑(ab)8,9, ∑(ab)10,11, ∑(ab)12,13, ∑(ab)14,15).
• Build the final NON-normalized interpolation sums using HADD: (∑(ab)0-3, ∑(ab)4-7, ∑(ab)8-11, ∑(ab)12-15).
• Convert the result to floating point (cvtepi32_ps) and normalize it by multiplication (mul_ps) with the ‘a’-sum reciprocals from Step 1: (1/∑(a)0-3, 1/∑(a)4-7, 1/∑(a)8-11, 1/∑(a)12-15).
Here we have the 4 final ‘b’ values B0, B1, B2, B3 in FP form.
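The normalization itself is one conversion and one multiplication per lane. The scalar sketch below uses a hypothetical function name, `normalize_lanes`, and assumes the reciprocals were produced by exact division; with an approximate reciprocal the results would deviate more.

```c
#include <stdint.h>

/* Model of cvtepi32_ps followed by mul_ps: convert each non-normalized
   sum to float and scale it by the matching 'a'-sum reciprocal. */
void normalize_lanes(const int32_t wsum[4], const float rcp[4], float out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = (float)wsum[i] * rcp[i];
}
```

Multiplying by a stored reciprocal is not bit-identical to dividing, since 1/∑a is usually not exactly representable, but the reciprocal is computed once per shrunk pixel and reused for r, g and b.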
Step 2: components computation
2.3 Building new B0, B1, B2, B3
• Convert the new ‘B’-s to integer form (cvtps_epi32). As the values are <= 255, each resulting 32-bit lane is equivalent to a single byte B0, B1, B2, B3.
• Logical shift to the 3rd position and logical sum (OR) with the quad of ‘A’-s from the previous step, giving (B0 A0, B1 A1, B2 A2, B3 A3).
The future resulting quad of pixels now has A and B ready.
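For a single lane, the conversion and merge can be sketched as below. The function name `merge_b_into_a` is hypothetical. Two assumptions are made: rounding is done by adding 0.5 and truncating, which differs from the round-to-nearest-even that cvtps_epi32 performs by default only for exact halves, and a defensive limit to 255 is added that the slides do not mention.

```c
#include <stdint.h>

/* Convert one normalized 'B' value to an integer, move it to the 3rd byte
   (bits 16..23) and OR it into a lane that already holds A in bits 24..31. */
uint32_t merge_b_into_a(float b, uint32_t a_lane)
{
    int32_t bi = (int32_t)(b + 0.5f);   /* nearest integer for b >= 0 (assumed rounding) */
    if (bi > 255)
        bi = 255;                       /* assumption: guard against rounding overshoot */
    return ((uint32_t)bi << 16) | a_lane;
}
```

Because the lane positions of A and B do not overlap, the OR acts as a concatenation of the two bytes.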
Step 2: components computation
2.4-2.9 Building new quads of G and R and summing the final results
• Perform sub-steps 2.1-2.3 for the ‘G’-s and ‘R’-s; the quad of ‘G’-s is shifted to the 2nd position before the logical sum, and the quad of ‘R’-s is not shifted.
[Figure: (G0, G1, G2, G3) shifted to the 2nd position OR (R0, R1, R2, R3) OR (B0 A0, B1 A1, B2 A2, B3 A3) gives (R0 G0 B0 A0, R1 G1 B1 A1, R2 G2 B2 A2, R3 G3 B3 A3).]
This final quad of pixels is stored in the resulting image.
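The final assembly of 4 result pixels can be modeled as follows. This is a sketch with a hypothetical function name, `store_quad`; it assumes the inputs are the already rounded component values (0 to 255) and writes the bytes explicitly so that the r, g, b, a order in memory does not depend on the byte order of the machine running the model.

```c
#include <stdint.h>

/* Combine 4 lanes of R, G, B and A values into 4 pixels and store them
   as 16 bytes in r, g, b, a order. */
void store_quad(const uint32_t r[4], const uint32_t g[4],
                const uint32_t b[4], const uint32_t a[4], uint8_t dst[16])
{
    for (int i = 0; i < 4; ++i) {
        /* R is not shifted, G goes to the 2nd byte, B to the 3rd, A to the 4th. */
        uint32_t px = r[i] | (g[i] << 8) | (b[i] << 16) | (a[i] << 24);
        dst[4 * i + 0] = (uint8_t)(px & 0xFF);
        dst[4 * i + 1] = (uint8_t)((px >> 8) & 0xFF);
        dst[4 * i + 2] = (uint8_t)((px >> 16) & 0xFF);
        dst[4 * i + 3] = (uint8_t)((px >> 24) & 0xFF);
    }
}
```

On a little-endian x86 machine the packed 32-bit value already has this byte layout, so a vector version can store the combined register directly.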
Benchmarking (1 thread)
• Merom core - WC, 2.66GHz
• Penryn core – HPTN, 2.88GHz
• VTune CPI = 0.78 and VTune CPI = 0.46 (listed in the same order as the two cores above)
• The speed-up on Penryn (7.0x) is 1.5x better than on Merom (4.6x).
• It is close to the theoretical limit for 8-16bit-vector operations!
• Overall speed-up Penryn(Vector)/Merom(Ser) = 8.1x