
Image 2x Shrink SSE implementation, benchmarking comparison between Merom and Penryn

Dr. Zvi Danovich, Senior Application Engineer, November – December 2007. Agenda: general description of 2x Shrink; Step 1: weights computation; Step 2: components computation; benchmarks and conclusions.


Presentation Transcript


  1. Image 2x Shrink SSE implementation, benchmarking comparison between Merom and Penryn • Dr. Zvi Danovich, Senior Application Engineer • November – December 2007

  2. Agenda • General description of 2x Shrink • Step 1: weights computation • Step 2: components computation • Benchmarks and conclusions

  3. General description • Each pixel has 3 components (r, g, b) and a 4th, ‘a’, the weight; each is 1 byte long • Each pair of pixel lines is interpolated into one line of half the length: 2 opposite pixel pairs are combined into 1 pixel of the shrunk image • New (interpolated) component: C = ∑(c·a)0-3 ∕ ∑(a)0-3, where ‘c’ is r, g or b and the sums run over the 4 source pixels 0-3. New weight: A = min(255, ½ ∑(a)0-3) • Preliminary step: loading m128i_Ev01, m128i_Ev23, m128i_Od01, m128i_Od23 [Diagram: the even source line (m128i_Ev01, m128i_Ev23) and the odd source line (m128i_Od01, m128i_Od23), each pixel stored as r g b a, are combined into the “shrunk” pixels 0 1 2 3]
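As a reference point, the two formulas on this slide can be written in plain scalar C. This is a minimal sketch and not the presenter's code: the name `shrink_pixel_ref`, the flat 16-byte input layout (4 pixels of r, g, b, a), rounding to nearest, and returning 0 for the components when all four weights are 0 are assumptions made for illustration.

```c
#include <stdint.h>

/* Scalar reference for one shrunk pixel (hypothetical helper).
   px:  4 source pixels, 4 bytes each in r, g, b, a order (16 bytes).
   out: the shrunk pixel in r, g, b, a order (4 bytes). */
void shrink_pixel_ref(const uint8_t *px, uint8_t *out)
{
    uint32_t sum_a = 0;
    uint32_t sum_ca[3] = {0, 0, 0};

    for (int i = 0; i < 4; i++) {
        uint32_t a = px[4 * i + 3];
        sum_a += a;
        for (int c = 0; c < 3; c++)
            sum_ca[c] += (uint32_t)px[4 * i + c] * a;   /* accumulate c*a */
    }

    /* C = sum(c*a) / sum(a), rounded to nearest (assumption).
       A zero weight sum yields 0 (assumption, not covered by the slide). */
    for (int c = 0; c < 3; c++)
        out[c] = sum_a ? (uint8_t)((sum_ca[c] + sum_a / 2) / sum_a) : 0;

    /* A = min(255, sum(a) / 2) */
    uint32_t half = sum_a / 2;
    out[3] = (uint8_t)(half > 255 ? 255 : half);
}
```

A scalar version like this is mainly useful as a correctness check for the vector steps that follow.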

  4. Agenda • General description of 2x Shrink • Step 1: weights computation • Step 2: components computation • Benchmarks and conclusions

  5. Step 1: weights computation. 1.1 Building the partial sums (a0+a1), (a2+a3) … • Build 8 × 16-bit ‘a’-s with 2 shuffles and a logical ‘or’ • Compute the partial sums by MADD with 8 × 16-bit ‘1’-s [Diagram: the ‘a’ bytes of the even-line and odd-line pixels (r g b a) are gathered into m128i_8a = (a0 a1 a2 a3 a4 a5 a6 a7); MADD with (1 1 1 1 1 1 1 1) gives (a0+a1, a2+a3, a4+a5, a6+a7)]
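Sub-step 1.1 can be sketched with SSE2 intrinsics as follows. The slide builds the 16-bit words with 2 shuffles and an ‘or’; this sketch swaps in a shift, a pack and a dword shuffle so that it needs nothing beyond SSE2. The function names and the lane order (even pair, then odd pair, for each output pixel) are assumptions read off the diagram, not the presenter's code.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Gather the 'a' bytes of 4 even-line and 4 odd-line RGBA pixels into
   8 x 16-bit words ordered e0 e1 o0 o1 e2 e3 o2 o3, which plays the
   role of the slide's a0 a1 a2 a3 a4 a5 a6 a7. */
__m128i gather_alpha_words(__m128i ev, __m128i od)
{
    __m128i a_ev = _mm_srli_epi32(ev, 24);        /* 'a' is byte 3 of each pixel */
    __m128i a_od = _mm_srli_epi32(od, 24);
    __m128i packed = _mm_packs_epi32(a_ev, a_od); /* e0 e1 e2 e3 o0 o1 o2 o3 */
    /* Swap the two middle dwords to interleave even and odd pairs. */
    return _mm_shuffle_epi32(packed, _MM_SHUFFLE(3, 1, 2, 0));
}

/* MADD with eight 16-bit ones adds each adjacent pair of words into a
   32-bit lane: (a0+a1, a2+a3, a4+a5, a6+a7). */
__m128i alpha_pair_sums(__m128i a16)
{
    return _mm_madd_epi16(a16, _mm_set1_epi16(1));
}
```

The same `a16` vector is worth keeping around, since Step 2 reuses it as the MADD multiplier.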

  6. Step 1: weights computation (cont.) 1.2 Building the sums (a0+a1+a2+a3), (a4+a5+a6+a7) … and reciprocals • Perform the same computation for the second pair of pixel quads, obtaining (a8+a9, a10+a11, a12+a13, a14+a15) • Build the final sums using HADD: HADD of (a0+a1, a2+a3, a4+a5, a6+a7) and (a8+a9, a10+a11, a12+a13, a14+a15) gives (∑(a)0-3, ∑(a)4-7, ∑(a)8-11, ∑(a)12-15) • Convert the result to floating point (FP) and compute the reciprocals: (1/∑(a)0-3, 1/∑(a)4-7, 1/∑(a)8-11, 1/∑(a)12-15) • Here we have 4 FP ‘a’-sum reciprocals, the normalization coefficients
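A possible shape for sub-step 1.2 is sketched below. Two things are assumptions here: the horizontal add is emulated with SSE2 shuffles (the SSSE3 intrinsic `_mm_hadd_epi32` would replace `hadd_epi32_sse2` in one call), and the reciprocal uses an exact division, since the slide does not say whether the approximate `_mm_rcp_ps` was used instead. The function names are hypothetical.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Emulates _mm_hadd_epi32(x, y): (x0+x1, x2+x3, y0+y1, y2+y3). */
__m128i hadd_epi32_sse2(__m128i x, __m128i y)
{
    __m128 xf = _mm_castsi128_ps(x);
    __m128 yf = _mm_castsi128_ps(y);
    /* Even-indexed lanes: x0 x2 y0 y2; odd-indexed lanes: x1 x3 y1 y3. */
    __m128i even = _mm_castps_si128(_mm_shuffle_ps(xf, yf, _MM_SHUFFLE(2, 0, 2, 0)));
    __m128i odd  = _mm_castps_si128(_mm_shuffle_ps(xf, yf, _MM_SHUFFLE(3, 1, 3, 1)));
    return _mm_add_epi32(even, odd);
}

/* Converts the four integer 'a' sums to float and returns 1/sum per lane.
   A zero sum yields +inf; the slides do not cover that case. */
__m128 alpha_sum_reciprocals(__m128i sums)
{
    return _mm_div_ps(_mm_set1_ps(1.0f), _mm_cvtepi32_ps(sums));
}
```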

  7. Step 1: weights computation (cont.) 1.3 Building new A0, A1, A2, A3 • Compute the new ‘a’ as min(255, ½∑a): MIN( SRAI( ((∑a)0 (∑a)1 (∑a)2 (∑a)3), 1 ), (255 255 255 255) ), where the arithmetic shift right by 1 bit is the division by 2 • The result (A0 A1 A2 A3) can equivalently be treated as bytes, as the values are <= 255 • Finally, a logical shift moves each A to the 4th position of its pixel • This is the basis of the resulting quad of pixels
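Sub-step 1.3 maps onto three SSE2 intrinsics, sketched below under the assumption that the four sums arrive as 32-bit lanes. The name `new_alpha_quad` is hypothetical. The 16-bit `_mm_min_epi16` is used on 32-bit lanes, which is safe here only because every halved sum is at most 510 and so fits in the low word of its lane.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* sums: four 32-bit 'a' sums, one per output pixel (each <= 4 * 255).
   Returns A = min(255, sum / 2) placed in byte 3 of each 32-bit pixel. */
__m128i new_alpha_quad(__m128i sums)
{
    __m128i half   = _mm_srai_epi32(sums, 1);                    /* divide by 2 */
    __m128i capped = _mm_min_epi16(half, _mm_set1_epi32(255));   /* clamp to 255 */
    return _mm_slli_epi32(capped, 24);                           /* 4th byte position */
}
```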

  8. Agenda • General description of 2x Shrink • Step 1: weights computation • Step 2: components computation • Benchmarks and conclusions

  9. Step 2: components computation. 2.1 Computing 4 ‘b’-s: building the partial sums (a0b0+a1b1), (a2b2+a3b3) … • Build 8 × 16-bit ‘b’-s with 2 shuffles and a logical ‘or’ • Compute the partial sums by MADD with the 8 × 16-bit ‘a’-s from the previous step [Diagram: the 8 8-bit ‘b’-s of the even-line and odd-line pixels (r g b a) are gathered into (b0 b1 b2 b3 b4 b5 b6 b7); MADD with (a0 a1 a2 a3 a4 a5 a6 a7) gives (a0b0+a1b1, a2b2+a3b3, a4b4+a5b5, a6b6+a7b7) ≡ (∑(ab)0,1, ∑(ab)2,3, ∑(ab)4,5, ∑(ab)6,7) in short notation]
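Sub-step 2.1 could look like the sketch below, which generalizes the gather to any byte position so the same helper serves ‘b’ (position 2) and ‘a’ (position 3). As before, shift, mask, pack and a dword shuffle stand in for the slide's 2 shuffles and ‘or’, and the function names and the `byte_pos` parameter are assumptions for illustration.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Gather one byte component (byte_pos: 0 = r, 1 = g, 2 = b, 3 = a) of
   4 even-line and 4 odd-line pixels into 8 x 16-bit words ordered
   e0 e1 o0 o1 e2 e3 o2 o3. */
__m128i gather_component_words(__m128i ev, __m128i od, int byte_pos)
{
    __m128i count = _mm_cvtsi32_si128(8 * byte_pos);
    __m128i mask  = _mm_set1_epi32(0xFF);
    __m128i c_ev  = _mm_and_si128(_mm_srl_epi32(ev, count), mask);
    __m128i c_od  = _mm_and_si128(_mm_srl_epi32(od, count), mask);
    __m128i packed = _mm_packs_epi32(c_ev, c_od); /* e0 e1 e2 e3 o0 o1 o2 o3 */
    return _mm_shuffle_epi32(packed, _MM_SHUFFLE(3, 1, 2, 0));
}

/* MADD multiplies matching words and adds adjacent products:
   (a0c0+a1c1, a2c2+a3c3, a4c4+a5c5, a6c6+a7c7). */
__m128i weighted_pair_sums(__m128i c16, __m128i a16)
{
    return _mm_madd_epi16(c16, a16);
}
```

Each product is at most 255 × 255, so a pair sum stays far below the 32-bit limit and the signed 16-bit multiply in MADD never sees a negative input.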

  10. Step 2: components computation. 2.2 Building the sums (a0b0+a1b1+a2b2+a3b3), … and final results in FP form • Perform the same computation for the second pair of pixel quads, obtaining (∑(ab)8,9, ∑(ab)10,11, ∑(ab)12,13, ∑(ab)14,15) • Build the final NON-normalized interpolation sums using HADD: HADD of (∑(ab)0,1, ∑(ab)2,3, ∑(ab)4,5, ∑(ab)6,7) and (∑(ab)8,9, ∑(ab)10,11, ∑(ab)12,13, ∑(ab)14,15) gives (∑(ab)0-3, ∑(ab)4-7, ∑(ab)8-11, ∑(ab)12-15) • Convert the result to floating point (FP) with cvtepi32_ps and normalize it with mul_ps by the ‘a’-sum reciprocals from Step 1 (1/∑(a)0-3, 1/∑(a)4-7, 1/∑(a)8-11, 1/∑(a)12-15), giving (B0 B1 B2 B3) • Here we have the 4 final ‘b’ values in FP form
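The normalization in sub-step 2.2 is sketched below. The `cvtepi32_ps` and `mul_ps` calls are the ones the slide names; the horizontal add is again emulated with SSE2 shuffles (an assumption, `_mm_hadd_epi32` from SSSE3 would do the same in one call), and the function names are hypothetical.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* SSE2 stand-in for _mm_hadd_epi32(x, y): (x0+x1, x2+x3, y0+y1, y2+y3). */
__m128i hadd32_emul(__m128i x, __m128i y)
{
    __m128 xf = _mm_castsi128_ps(x);
    __m128 yf = _mm_castsi128_ps(y);
    __m128i even = _mm_castps_si128(_mm_shuffle_ps(xf, yf, _MM_SHUFFLE(2, 0, 2, 0)));
    __m128i odd  = _mm_castps_si128(_mm_shuffle_ps(xf, yf, _MM_SHUFFLE(3, 1, 3, 1)));
    return _mm_add_epi32(even, odd);
}

/* p01, p23: partial sums of a*c for the two pairs of pixel quads.
   recip_a:  the four FP reciprocals of the 'a' sums from Step 1.
   Returns the four interpolated components in FP form. */
__m128 weighted_means(__m128i p01, __m128i p23, __m128 recip_a)
{
    __m128i sums = hadd32_emul(p01, p23);              /* non-normalized sums */
    return _mm_mul_ps(_mm_cvtepi32_ps(sums), recip_a); /* normalize */
}
```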

  11. Step 2: components computation. 2.3 Building new B0, B1, B2, B3 • Convert the new ‘B’-s to integer form with cvtps_epi32; the result (B0 B1 B2 B3) can equivalently be treated as bytes, as the values are <= 255 • Logical shift to the 3rd position and logical sum (OR) with the quad of ‘A’-s from the previous step: (B0 B1 B2 B3) OR (A0 A1 A2 A3) = (B0 A0, B1 A1, B2 A2, B3 A3) • The A and B parts of the future resulting quad of pixels are ready
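Sub-step 2.3 amounts to a convert, a shift and an OR, sketched below. The name `merge_component` and its `byte_pos` parameter are assumptions; for ‘b’ the position is 2 (the 3rd byte). `_mm_cvtps_epi32` rounds according to the current MXCSR rounding mode, which is round-to-nearest-even by default.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* acc:      pixel quad assembled so far (for example the shifted 'A' quad).
   comp_fp:  four interpolated component values in FP form, each <= 255.
   byte_pos: destination byte inside each pixel (0 = r, 1 = g, 2 = b). */
__m128i merge_component(__m128i acc, __m128 comp_fp, int byte_pos)
{
    __m128i c = _mm_cvtps_epi32(comp_fp);                             /* FP -> int32 */
    __m128i shifted = _mm_sll_epi32(c, _mm_cvtsi32_si128(8 * byte_pos));
    return _mm_or_si128(acc, shifted);                                /* logical sum */
}
```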

  12. Step 2: components computation. 2.4-2.9 Building new quads of G and R and summing final results • Perform sub-steps 2.1-2.3 for the ‘G’-s and ‘R’-s; the ‘G’-s quad is shifted to the 2nd position before the logical sum, and the ‘R’-s quad is not shifted: (G0 G1 G2 G3) OR (R0 R1 R2 R3) OR (B0 A0, B1 A1, B2 A2, B3 A3) = (R0 G0 B0 A0, R1 G1 B1 A1, R2 G2 B2 A2, R3 G3 B3 A3) • This final quad of pixels is stored in the resulting image
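Putting the sub-steps together, one output quad could be produced as in the sketch below. It is an end-to-end illustration under several assumptions rather than the benchmarked code: only SSE2 is used (shift, mask and pack instead of the slide's shuffles, and an emulated horizontal add instead of HADD), the reciprocal is an exact division, every output pixel is assumed to have a nonzero weight sum, and the names `sq_words`, `sq_hadd` and `shrink_quad` are hypothetical.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* 8 x 16-bit copies of one byte component, ordered e0 e1 o0 o1 e2 e3 o2 o3. */
static __m128i sq_words(__m128i ev, __m128i od, int byte_pos)
{
    __m128i count = _mm_cvtsi32_si128(8 * byte_pos);
    __m128i mask  = _mm_set1_epi32(0xFF);
    __m128i packed = _mm_packs_epi32(_mm_and_si128(_mm_srl_epi32(ev, count), mask),
                                     _mm_and_si128(_mm_srl_epi32(od, count), mask));
    return _mm_shuffle_epi32(packed, _MM_SHUFFLE(3, 1, 2, 0));
}

/* SSE2 stand-in for _mm_hadd_epi32: (x0+x1, x2+x3, y0+y1, y2+y3). */
static __m128i sq_hadd(__m128i x, __m128i y)
{
    __m128 xf = _mm_castsi128_ps(x), yf = _mm_castsi128_ps(y);
    return _mm_add_epi32(
        _mm_castps_si128(_mm_shuffle_ps(xf, yf, _MM_SHUFFLE(2, 0, 2, 0))),
        _mm_castps_si128(_mm_shuffle_ps(xf, yf, _MM_SHUFFLE(3, 1, 3, 1))));
}

/* ev, od: 8 RGBA pixels (32 bytes) from the even and the odd source line.
   dst:    4 shrunk RGBA pixels (16 bytes). */
void shrink_quad(const uint8_t *ev, const uint8_t *od, uint8_t *dst)
{
    __m128i ev01 = _mm_loadu_si128((const __m128i *)ev);
    __m128i ev23 = _mm_loadu_si128((const __m128i *)(ev + 16));
    __m128i od01 = _mm_loadu_si128((const __m128i *)od);
    __m128i od23 = _mm_loadu_si128((const __m128i *)(od + 16));

    /* Step 1: weight sums, reciprocals and the new 'A' in byte 3. */
    __m128i a01 = sq_words(ev01, od01, 3), a23 = sq_words(ev23, od23, 3);
    __m128i ones = _mm_set1_epi16(1);
    __m128i sum_a = sq_hadd(_mm_madd_epi16(a01, ones), _mm_madd_epi16(a23, ones));
    __m128 recip = _mm_div_ps(_mm_set1_ps(1.0f), _mm_cvtepi32_ps(sum_a));
    __m128i out = _mm_slli_epi32(
        _mm_min_epi16(_mm_srai_epi32(sum_a, 1), _mm_set1_epi32(255)), 24);

    /* Step 2: r (byte 0), g (byte 1), b (byte 2), each OR-ed into place. */
    for (int pos = 0; pos < 3; pos++) {
        __m128i s = sq_hadd(_mm_madd_epi16(sq_words(ev01, od01, pos), a01),
                            _mm_madd_epi16(sq_words(ev23, od23, pos), a23));
        __m128i c = _mm_cvtps_epi32(_mm_mul_ps(_mm_cvtepi32_ps(s), recip));
        out = _mm_or_si128(out, _mm_sll_epi32(c, _mm_cvtsi32_si128(8 * pos)));
    }
    _mm_storeu_si128((__m128i *)dst, out);
}
```

Calling `shrink_quad` once per 8 source pixels of a line pair, with `ev` and `od` advanced by 32 bytes and `dst` by 16 bytes each time, would cover a whole output line whose width is a multiple of 4 pixels.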

  13. Agenda • General description of 2x Shrink • Step 1: weights computation • Step 2: components computation • Benchmarks and conclusions

  14. Benchmarking (1 thread) • Merom core: WC, 2.66GHz • Penryn core: HPTN, 2.88GHz • VTune CPI = 0.78 and VTune CPI = 0.46 (listed on the slide in that order, after Merom and then Penryn) • The speed-up on Penryn (7.0x) is 1.5x better than on Merom (4.6x) • It is close to the theoretical limit for 8-16-bit vector operations! • Overall speed-up Penryn (Vector) / Merom (Ser) = 8.1x
