Image 2x Shrink SSE implementation, benchmarking comparison between Merom and Penryn

Dr. Zvi Danovich, Senior Application Engineer, November – December 2007

Agenda
• General description of 2x Shrink
• Step 1: weights computation
• Step 2: components computation
• Benchmarks and conclusions

General description
• Each pixel has 3 color components (r, g, b) and a 4th component ‘a’, the weight; each is 1 byte long.
• Each pair of pixel lines is interpolated into one line of half the length: 2 opposite pixel pairs are combined into 1 pixel of the shrunk image.
• New (interpolated) component: C = ∑(c·a)0-3 ∕ ∑(a)0-3, where ‘c’ is r, g or b. New weight ‘a’: A = min(255, ½ ∑(a)0-3).
• Preliminary step: reading (loading) m128i_Ev01, m128i_Ev23, m128i_Od01, m128i_Od23.
Diagram: the source even line (pixels r g b a) is loaded into m128i_Ev01 and m128i_Ev23, the source odd line into m128i_Od01 and m128i_Od23; together they yield “shrunk” pixels 0, 1, 2, 3.
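The formulas above can be checked against a plain scalar version before looking at the vector steps. The following is a minimal sketch, not the code from the presentation: the names `Pixel`, `weighted_avg` and `shrink4` are hypothetical, rounding to nearest is an assumption, and the case where all four weights are zero is not covered by the slides, so the sketch simply asserts that it does not occur.

```c
#include <assert.h>
#include <stdint.h>

/* One pixel: r, g, b and the weight 'a', one byte each. */
typedef struct { uint8_t r, g, b, a; } Pixel;

/* Weighted average of one channel: sum(c*a) / sum(a).
   Rounding to nearest is an assumption of this sketch. */
static uint8_t weighted_avg(const uint8_t c[4], const uint8_t a[4],
                            uint32_t sum_a) {
    uint32_t num = 0;
    for (int i = 0; i < 4; ++i)
        num += (uint32_t)c[i] * a[i];
    return (uint8_t)((num + sum_a / 2) / sum_a);
}

/* Combine 4 source pixels (2 from the even line, 2 from the odd line)
   into one pixel of the shrunk image. */
Pixel shrink4(const Pixel p[4]) {
    uint8_t r[4], g[4], b[4], a[4];
    uint32_t sum_a = 0;
    for (int i = 0; i < 4; ++i) {
        r[i] = p[i].r; g[i] = p[i].g; b[i] = p[i].b; a[i] = p[i].a;
        sum_a += a[i];
    }
    assert(sum_a > 0); /* assumption: at least one non-zero weight */

    Pixel out;
    out.r = weighted_avg(r, a, sum_a);
    out.g = weighted_avg(g, a, sum_a);
    out.b = weighted_avg(b, a, sum_a);
    uint32_t half = sum_a / 2;               /* 1/2 * sum(a) */
    out.a = (uint8_t)(half < 255 ? half : 255);
    return out;
}
```

A scalar routine of this kind is also a convenient baseline when validating the output of a vectorized implementation.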

Step 1: weights computation
1.1 Building the partial sums (a0+a1), (a2+a3) …
• Build 8 16-bit ‘a’-s (m128i_8a) from the even and odd lines with 2 shuffles and a logical ‘or’.
• Compute the partial sums by MADD with 8 16-bit ‘1’-s.
Diagram: the ‘a’ bytes of the even-line and odd-line pixels (r g b a) are gathered into m128i_8a = (a0 a1 a2 a3 a4 a5 a6 a7); MADD with (1 1 1 1 1 1 1 1) gives (a0+a1, a2+a3, a4+a5, a6+a7).
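As an illustration of sub-step 1.1, here is a scalar model of the gather and of the pairwise sums. It is a sketch under assumptions: pixels are treated as little-endian 32-bit words with bytes r, g, b, a, and the lane order (two even-line weights followed by two odd-line weights per shrunk pixel) is inferred from the index notation on the slide, not stated by it. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* With bytes r,g,b,a in memory order on a little-endian machine,
   the weight 'a' sits in bits 24..31 of the 32-bit pixel word. */
static int16_t weight_of(uint32_t pixel) {
    return (int16_t)(pixel >> 24);
}

/* Gather the 8 weights that feed two shrunk pixels into 16-bit lanes.
   Assumed lane order: a0,a1 from the even line and a2,a3 from the odd
   line for the first shrunk pixel, then a4..a7 for the second. */
void gather_weights(const uint32_t even[4], const uint32_t odd[4],
                    int16_t a[8]) {
    assert(even && odd && a);
    a[0] = weight_of(even[0]); a[1] = weight_of(even[1]);
    a[2] = weight_of(odd[0]);  a[3] = weight_of(odd[1]);
    a[4] = weight_of(even[2]); a[5] = weight_of(even[3]);
    a[6] = weight_of(odd[2]);  a[7] = weight_of(odd[3]);
}

/* Multiplying every lane by 1 and adding adjacent lanes is what MADD
   against eight 16-bit ones computes: four 32-bit pair sums. */
void pair_sums(const int16_t a[8], int32_t out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = (int32_t)a[2 * i] + (int32_t)a[2 * i + 1];
}
```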

Step 1: weights computation (cont)
1.2 Building the sums (a0+a1+a2+a3), (a4+a5+a6+a7) … and reciprocals
• Perform the same computation for the second pair of pixel quads, obtaining (a8+a9, a10+a11, a12+a13, a14+a15).
• Build the final sums using HADD.
• Convert the result to floating point (FP) and compute the reciprocals.
Diagram: HADD of (a0+a1, a2+a3, a4+a5, a6+a7) and (a8+a9, a10+a11, a12+a13, a14+a15) gives (∑(a)0-3, ∑(a)4-7, ∑(a)8-11, ∑(a)12-15), which are converted to (FP 1/∑(a)0-3, FP 1/∑(a)4-7, FP 1/∑(a)8-11, FP 1/∑(a)12-15).
Here we have 4 FP ‘a’-sum reciprocals: the normalization coefficients.
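Sub-step 1.2 can be modeled in scalar form as a horizontal add followed by a float reciprocal. This is a sketch with hypothetical function names; it uses an ordinary division, since the slides do not say whether an exact division or an approximate reciprocal instruction was used, and it asserts that no weight sum is zero because that case is not described.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a 32-bit horizontal add (HADD): adjacent pairs of x
   fill the two low lanes, adjacent pairs of y the two high lanes. */
void hadd32(const int32_t x[4], const int32_t y[4], int32_t out[4]) {
    out[0] = x[0] + x[1];
    out[1] = x[2] + x[3];
    out[2] = y[0] + y[1];
    out[3] = y[2] + y[3];
}

/* Convert the four weight sums to float and take their reciprocals,
   giving the normalization coefficients. */
void weight_reciprocals(const int32_t sum_a[4], float recip[4]) {
    for (int i = 0; i < 4; ++i) {
        assert(sum_a[i] > 0); /* assumption: no all-zero weight quad */
        recip[i] = 1.0f / (float)sum_a[i];
    }
}
```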

Step 1: weights computation (cont)
1.3 Building new A0, A1, A2, A3
• Compute the new ‘a’: min(255, ½∑a).
  SRAI(((∑a)0 (∑a)1 (∑a)2 (∑a)3), 1): arithmetic shift right by 1 bit, i.e. division by 2.
  MIN((½(∑a)0 ½(∑a)1 ½(∑a)2 ½(∑a)3), (255 255 255 255)) = (A0 A1 A2 A3). Since the values are <= 255, each 32-bit lane is equivalent to a single byte.
• Finally, logical shift of each value to the 4th byte position.
This is the basis of the resulting quad of pixels.
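The shift, clamp and reposition sequence of sub-step 1.3 could look roughly like the sketch below. The function name `new_weights` is hypothetical, and the choice of `_mm_min_epi16` for the MIN step is an assumption (the slides do not name the instruction); it only relies on SSE2 intrinsics and falls back to an equivalent scalar loop when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* From four 32-bit weight sums build A = min(255, sum/2) and move it to
   the top byte of each 32-bit slot (the 'a' byte of an r,g,b,a pixel on
   a little-endian machine). */
void new_weights(const int32_t sum_a[4], uint32_t out[4]) {
    for (int i = 0; i < 4; ++i)
        assert(sum_a[i] >= 0 && sum_a[i] <= 4 * 255);
#if defined(__SSE2__)
    __m128i s = _mm_loadu_si128((const __m128i *)sum_a);
    __m128i half = _mm_srai_epi32(s, 1);   /* division by 2 */
    /* A signed 16-bit min is sufficient here: every half-sum is at most
       510 and the upper 16 bits of each 32-bit lane are zero. */
    __m128i a = _mm_min_epi16(half, _mm_set1_epi32(255));
    _mm_storeu_si128((__m128i *)out, _mm_slli_epi32(a, 24));
#else
    for (int i = 0; i < 4; ++i) {
        int32_t half = sum_a[i] >> 1;
        uint32_t a = (uint32_t)(half < 255 ? half : 255);
        out[i] = a << 24;
    }
#endif
}
```

Both branches produce the same values, so the scalar path can serve as a cross-check for the intrinsic path.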

Step 2: components computation
2.1 Computation of 4 ‘b’-s: building the partial sums (a0b0+a1b1), (a2b2+a3b3) …
• Build 8 16-bit ‘b’-s from the 8 8-bit ‘b’-s of the even and odd lines with 2 shuffles and a logical ‘or’.
• Compute the partial sums by MADD with the 8 16-bit ‘a’-s from the previous step.
Diagram: (b0 b1 b2 b3 b4 b5 b6 b7) MADD (a0 a1 a2 a3 a4 a5 a6 a7) gives (a0b0+a1b1, a2b2+a3b3, a4b4+a5b5, a6b6+a7b7), written in short notation as (∑(ab)0,1, ∑(ab)2,3, ∑(ab)4,5, ∑(ab)6,7).
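The MADD step with the weights as second operand can be modeled in scalar form as follows. This is only an illustrative sketch; `madd16` is a hypothetical name, and the lane contents are whatever the gather of the previous sub-steps produced.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of MADD on 16-bit lanes: multiply lanes pairwise and add
   each adjacent pair of products into one 32-bit lane. With 8-bit inputs
   the largest result is 2 * 255 * 255, which fits easily in 32 bits. */
void madd16(const int16_t x[8], const int16_t y[8], int32_t out[4]) {
    assert(x && y && out);
    for (int i = 0; i < 4; ++i)
        out[i] = (int32_t)x[2 * i] * y[2 * i]
               + (int32_t)x[2 * i + 1] * y[2 * i + 1];
}
```

Calling it with the ‘b’ lanes and the ‘a’ lanes yields the four partial sums ∑(ab)0,1 … ∑(ab)6,7; calling it with all ones as the second operand reproduces the plain pair sums of Step 1.1.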

Step 2: components computation
2.2 Building the sums (a0b0+a1b1+a2b2+a3b3), … and final results in FP form
• Perform the same computation for the second pair of pixel quads, obtaining (∑(ab)8,9, ∑(ab)10,11, ∑(ab)12,13, ∑(ab)14,15).
• Build the final NON-normalized interpolation sums using HADD.
• Convert the result to floating point (FP) and normalize it by multiplication with the ‘a’-sum reciprocals from Step 1.
Diagram: HADD of (∑(ab)0,1, ∑(ab)2,3, ∑(ab)4,5, ∑(ab)6,7) and (∑(ab)8,9, ∑(ab)10,11, ∑(ab)12,13, ∑(ab)14,15) gives (∑(ab)0-3, ∑(ab)4-7, ∑(ab)8-11, ∑(ab)12-15); cvtepi32_ps converts these to FP; mul_ps with (FP 1/∑(a)0-3, FP 1/∑(a)4-7, FP 1/∑(a)8-11, FP 1/∑(a)12-15) gives (B0, B1, B2, B3).
Here we have the 4 final ‘b’ values in FP form.
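The conversion and normalization at the end of sub-step 2.2 amount to one int-to-float conversion and one multiplication per lane. A scalar sketch, with the hypothetical name `normalize`, might look like this; the inputs are assumed to be the four HADD results and the four reciprocals from Step 1.

```c
#include <assert.h>
#include <stdint.h>

/* Convert four non-normalized sums to float (as cvtepi32_ps does per
   lane) and scale each by the matching weight-sum reciprocal (as mul_ps
   does per lane), giving the weighted averages in float form. */
void normalize(const int32_t sum_ab[4], const float recip_a[4],
               float out[4]) {
    assert(sum_ab && recip_a && out);
    for (int i = 0; i < 4; ++i)
        out[i] = (float)sum_ab[i] * recip_a[i];
}
```

Because the reciprocal is rounded to float precision, a result may differ from the exact quotient by a tiny amount; the later conversion to integer absorbs that as long as it rounds to nearest.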

Step 2: components computation
2.3 Building new B0, B1, B2, B3
• Convert the new ‘B’-s to integer form with cvtps_epi32. Since the values are <= 255, each 32-bit lane is equivalent to a single byte.
• Logical shift to the 3rd position and logical sum (OR) with the quad of ‘A’-s from the previous step: (B0 B1 B2 B3) OR (A0 A1 A2 A3) gives (B0 A0, B1 A1, B2 A2, B3 A3).
The future resulting quad of pixels: A and B are ready.

Step 2: components computation
2.4-2.9 Building new quads of G and R and summing the final results
• Repeat sub-steps 2.1-2.3 for the ‘G’-s and ‘R’-s; the ‘G’ quad is shifted to the 2nd position before the logical sum, and the ‘R’ quad is not shifted.
Diagram: (G0 G1 G2 G3) OR (R0 R1 R2 R3) OR (B0 A0, B1 A1, B2 A2, B3 A3) gives (R0 G0 B0 A0, R1 G1 B1 A1, R2 G2 B2 A2, R3 G3 B3 A3).
This final quad of pixels is stored in the resulting image.
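The final merge of the R, G, B and A quads is a matter of rounding, shifting and OR-ing. The sketch below models it in scalar form under two assumptions: pixels are little-endian 32-bit words with bytes r, g, b, a, and ties are rounded up (the tie behaviour of the hardware conversion depends on the active rounding mode). The names `round_channel` and `assemble_quad` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round a non-negative float channel value to the nearest integer.
   Halves round up here, which is an assumption of this sketch. */
static uint32_t round_channel(float v) {
    assert(v >= 0.0f && v < 255.5f);
    return (uint32_t)(v + 0.5f);
}

/* Merge per-channel quads into four packed pixels: R stays in bits 0..7
   (not shifted), G moves to bits 8..15 (2nd position), B to bits 16..23
   (3rd position); A is expected already shifted into bits 24..31. */
void assemble_quad(const float r[4], const float g[4], const float b[4],
                   const uint32_t a_shifted[4], uint32_t out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = round_channel(r[i])
               | (round_channel(g[i]) << 8)
               | (round_channel(b[i]) << 16)
               | a_shifted[i];
}
```

Since every channel value is at most 255, the shifted quads never overlap, which is why a plain OR is enough to combine them.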

Benchmarking (1 thread)
• Merom core - WC, 2.66GHz
• Penryn core - HPTN, 2.88GHz
VTune CPI = 0.78; VTune CPI = 0.46
Speed-up on Penryn (7.0x) is 1.5x better than on Merom (4.6x).
It is close to the theoretical limit for 8-16bit vector operations!
Overall speed-up Penryn(Vector)/Merom(Ser) = 8.1x
