Lecture 18: SSE Instructions and MP4
Ryan Chmiel
Lecture outline
• SSE instruction overview
• Code examples using SSE instructions
• MP4
Streaming SIMD Extensions
• Streaming SIMD Extensions (SSE) define a new architecture for floating-point operations
• Introduced with the Pentium III in February 1999
• The Pentium III includes the x87 floating-point unit, MMX technology, and the new XMM registers
• SSE uses eight new 128-bit registers (XMM0 - XMM7)
• SSE instructions operate on IEEE-754 single-precision (32-bit) real numbers, four per register
• Both packed and scalar operations are supported on the new packed single-precision floating-point data type
Streaming SIMD Extensions
• Packed instructions operate vertically on all four pairs of floating-point data elements in parallel
  • These instructions have the suffix ps (packed single-precision), e.g., addps
• Scalar instructions operate only on the least-significant data element of the two operands
  • These instructions have the suffix ss (scalar single-precision), e.g., addss
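The packed/scalar distinction can be sketched with a small Python model. This is only a model of the semantics, not SSE code: a register is assumed to be a list of four floats with index 0 as the least-significant element, the function names simply mirror the mnemonics, and Python floats are double-precision rather than single-precision.

```python
# Model an XMM register as a list of four floats.
# Index 0 is the least-significant element (bits 31..0).

def addps(dst, src):
    """Packed add: all four element pairs are added."""
    return [d + s for d, s in zip(dst, src)]

def addss(dst, src):
    """Scalar add: only element 0 is added; the rest of dst is kept."""
    return [dst[0] + src[0]] + dst[1:]

xmm1 = [1.0, 2.0, 3.0, 4.0]   # element 0 first
xmm2 = [8.0, 7.0, 6.0, 5.0]

packed = addps(xmm1, xmm2)
scalar = addss(xmm1, xmm2)
print(packed)  # [9.0, 9.0, 9.0, 9.0]
print(scalar)  # [9.0, 2.0, 3.0, 4.0]
```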
Categories of SSE Instructions
• Data Movement
  • movss, movups, movaps, etc.
• Arithmetic Instructions
  • addps, mulss, divps, sqrtss, etc.
• Logical Instructions (only operate on packed data)
  • xorps, orps, andnps, etc.
• Compare Instructions
  • cmpss, cmpps, etc.
• Integer-Real Conversion Instructions
  • cvtps2pi, cvtsi2ss, etc.
• Shuffle/Rearrange Instructions
  • shufps, unpcklps, etc.
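SSE Instruction Examples
Each example starts from the initial register values shown first; the examples are independent, not a sequence. Elements are listed from most significant (left) to least significant (right).

  XMM1                        4.0  3.0  2.0  1.0
  XMM2                        5.0  6.0  7.0  8.0

  Instruction                 Result in destination
  movups xmm3, xmm1           4.0  3.0  2.0  1.0
  movss  xmm1, xmm2           4.0  3.0  2.0  8.0
  movss  xmm3, [RealOne]      0.0  0.0  0.0  1.0
  addps  xmm1, xmm2           9.0  9.0  9.0  9.0
  subss  xmm2, xmm1           5.0  6.0  7.0  7.0
  xorps  xmm2, xmm2           0.0  0.0  0.0  0.0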
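The two movss results above differ because the register-to-register form keeps the upper three elements of the destination, while the load from memory zeroes them. A minimal Python model of that difference, assuming registers are lists with index 0 as the least-significant element (the helper names are made up for illustration):

```python
def movss_reg(dst, src):
    """movss xmm, xmm: copy element 0, keep the upper three of dst."""
    return [src[0]] + dst[1:]

def movss_load(value):
    """movss xmm, [mem]: load one float into element 0, zero the rest."""
    return [value, 0.0, 0.0, 0.0]

def slide_order(reg):
    """List the elements most-significant first, as the slides do."""
    return reg[::-1]

xmm1 = [1.0, 2.0, 3.0, 4.0]   # shown on the slide as 4.0 3.0 2.0 1.0
xmm2 = [8.0, 7.0, 6.0, 5.0]   # shown on the slide as 5.0 6.0 7.0 8.0

print(slide_order(movss_reg(xmm1, xmm2)))  # [4.0, 3.0, 2.0, 8.0]
print(slide_order(movss_load(1.0)))        # [0.0, 0.0, 0.0, 1.0]
```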
SHUFPS
• This instruction takes two floating-point numbers from each operand and combines them into a new four-element value
• The destination operand (xmmreg1) supplies the lower two elements of the result, and the source operand supplies the upper two
• Usage:
  • shufps xmmreg1, xmmreg2/mem128, imm8
  • The first operand must be an xmm register
  • The second operand can be either an xmm register or a memory location
  • The third operand is a bit mask of four two-bit numbers that specifies which value you are choosing for each result position. The elements of a register are numbered as follows:

      bit 127                  bit 0
      XMMREG:   11   10   01   00

  • The four fields of the mask line up with the four result positions: the leftmost field picks the most-significant result element, and the rightmost field picks the least-significant one
• That makes no sense - show me some examples!
SHUFPS Examples
Each example starts from the initial register values shown first. Elements are listed from most significant (left) to least significant (right).

  XMM1                              1.0  2.0  3.0  4.0
  XMM2                              5.0  6.0  7.0  8.0

  Instruction                       Result in destination
  shufps xmm1, xmm2, 11001100b      5.0  8.0  1.0  4.0
  shufps xmm1, xmm2, 01111001b      7.0  5.0  2.0  3.0
  shufps xmm2, xmm1, 01111001b      3.0  1.0  6.0  7.0
  shufps xmm1, xmm1, 10001100b      2.0  4.0  1.0  4.0
  shufps xmm2, xmm2, 10010011b      6.0  7.0  8.0  5.0
  shufps xmm1, xmm1, 00111001b      4.0  1.0  2.0  3.0
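The selection rule can be checked against these examples with a few lines of Python. This is a sketch of the semantics only, assuming a register is modeled as a list with index 0 as the least-significant element; from_slide and to_slide are hypothetical helpers that convert to and from the left-to-right order used on the slides.

```python
def shufps(dst, src, imm8):
    """Model of shufps dst, src, imm8 (registers are lists, element 0 first)."""
    # Split imm8 into four 2-bit selectors; selector i feeds result element i.
    sel = [(imm8 >> (2 * i)) & 0b11 for i in range(4)]
    # Result elements 0 and 1 come from dst, elements 2 and 3 from src.
    return [dst[sel[0]], dst[sel[1]], src[sel[2]], src[sel[3]]]

def from_slide(a, b, c, d):
    """Build a register from slide order (most-significant element first)."""
    return [d, c, b, a]

def to_slide(reg):
    """Convert back to slide order for printing."""
    return reg[::-1]

xmm1 = from_slide(1.0, 2.0, 3.0, 4.0)
xmm2 = from_slide(5.0, 6.0, 7.0, 8.0)

print(to_slide(shufps(xmm1, xmm2, 0b11001100)))  # [5.0, 8.0, 1.0, 4.0]
print(to_slide(shufps(xmm2, xmm1, 0b01111001)))  # [3.0, 1.0, 6.0, 7.0]
print(to_slide(shufps(xmm1, xmm1, 0b00111001)))  # [4.0, 1.0, 2.0, 3.0]
```

The last mask, 00111001b, rotates the register by one element, so every value eventually passes through the least-significant position.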
SSE Instruction Reference
• Available as part of the Intel x86 Instruction Set Reference, found on the Resources page of the ECE 291 website
• Each instruction has a diagram that shows how the instruction manipulates the data and stores the result
• Unfortunately, the reference is in alphabetical order, not grouped by instruction type
• As previously mentioned, look for the -ps and -ss suffixes to determine which instructions are SSE instructions
• Instructions suffixed with -pd and -sd operate on double-precision values and belong to the SSE2 instruction set, which was introduced with the Pentium 4
SSE Instruction Caveats
• You cannot push and pop xmm registers the way you can general-purpose registers. If you use xmm0 in a function and call another function that also uses xmm0, the value will be overwritten. Watch out for this!
• For some of the SSE instructions, such as the arithmetic and logical instructions, a memory location given as the second operand must lie on a 16-byte boundary (its address must be a multiple of 16, so its last hex digit is 0). If it doesn’t, you will get a general protection fault (GPF) at runtime. To avoid this, always move the value at this memory location into an xmm register first and use that xmm register as the second operand.
• To avoid more GPFs, use movups (move unaligned packed single-precision) instead of movaps (move aligned packed single-precision). movaps checks for a 16-byte boundary as described above and will crash your program if the address does not lie on one.
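The 16-byte boundary rule is a simple test on the low four bits of the address. A short Python illustration, with made-up example addresses and hypothetical helper names:

```python
def is_aligned16(addr):
    """True when addr is a multiple of 16, i.e. its low four bits are zero."""
    return (addr & 0xF) == 0

def align_up16(addr):
    """Round addr up to the next 16-byte boundary (unchanged if already aligned)."""
    return (addr + 15) & ~0xF

# Example addresses chosen only for illustration.
for addr in (0x1000, 0x1004, 0x1010, 0x101C):
    status = "aligned" if is_aligned16(addr) else "unaligned"
    print(f"{addr:#06x} -> {status}, next boundary {align_up16(addr):#06x}")
```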
SSE Coding Example 1

  Variable1   dd 4.5, 32.0, -16.123, 291.0
  Variable2   dd 0.0
  …
          movups  xmm0, [Variable1]
          xorps   xmm1, xmm1
          mov     ecx, 4
  .Loop:
          addss   xmm1, xmm0
          shufps  xmm0, xmm0, 00111001b
          loop    .Loop
          movss   [Variable2], xmm1

• What does this do?
SSE Coding Example 1
• It sums the four numbers stored in [Variable1] and stores the result to [Variable2]
• Here’s the main part of the code again - is this the most efficient way to perform this operation?

  .Loop:
          addss   xmm1, xmm0              ; add the low element to the running sum
          shufps  xmm0, xmm0, 00111001b   ; rotate the next element into the low position
          loop    .Loop
          movss   [Variable2], xmm1

• A. Yes
• B. No
• C. I don’t know
• D. I don’t care
SSE Coding Example 1
• At least you’re honest… but B is the correct answer
• You have four numbers to add, and the addps instruction can add four pairs of floating-point numbers at once
• Solution:
  • Line up the four values as two pairs and add both pairs in parallel
  • Line up the two results as one pair and add that pair

          movups  xmm0, [Variable1]
          movups  xmm1, xmm0
          shufps  xmm1, xmm1, 00001110b   ; upper two values do not matter
          addps   xmm1, xmm0              ; two partial sums in the low two elements
          movups  xmm0, xmm1
          shufps  xmm0, xmm0, 00000001b   ; second partial sum into the low element
          addss   xmm0, xmm1
          movss   [Variable2], xmm0
SSE Coding Example 1
• What is wrong with the first approach?
  • It does not take advantage of parallelism - it adds one value at a time, which the regular FPU instructions could do just as well
• What is the benefit of the second approach?
  • It saves two add and two shuffle instructions each time the code is run. It does add a move instruction, but that cost is far outweighed by the removal of the other four instructions.
  • It does not contain any loops or jumps
  • This cuts down on total program running time
• Moral of the story: exploit parallelism whenever you can!
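Both versions of Example 1 can be traced with a small Python model to confirm that they compute the same sum. This is a sketch under stated assumptions: registers are lists with index 0 as the least-significant element, the helper functions only imitate the instructions of the same name, and Python uses double-precision floats where SSE uses single-precision.

```python
def shufps(dst, src, imm8):
    """Model of shufps: selectors 0-1 pick from dst, selectors 2-3 from src."""
    sel = [(imm8 >> (2 * i)) & 0b11 for i in range(4)]
    return [dst[sel[0]], dst[sel[1]], src[sel[2]], src[sel[3]]]

def addps(dst, src):
    return [d + s for d, s in zip(dst, src)]

def addss(dst, src):
    return [dst[0] + src[0]] + dst[1:]

def sum_loop(values):
    """First approach: four scalar adds, rotating the register each time."""
    xmm0, xmm1 = list(values), [0.0] * 4        # xorps xmm1, xmm1
    for _ in range(4):                           # mov ecx, 4 / loop
        xmm1 = addss(xmm1, xmm0)
        xmm0 = shufps(xmm0, xmm0, 0b00111001)
    return xmm1[0]

def sum_parallel(values):
    """Second approach: one packed add, then one scalar add."""
    xmm0 = list(values)
    xmm1 = shufps(xmm0, xmm0, 0b00001110)        # upper pair moved to the low pair
    xmm1 = addps(xmm1, xmm0)                     # two partial sums in elements 0 and 1
    xmm0 = shufps(xmm1, xmm1, 0b00000001)        # element 1 moved to element 0
    xmm0 = addss(xmm0, xmm1)
    return xmm0[0]

values = [4.5, 32.0, -16.123, 291.0]             # Variable1, element 0 first
print(sum_loop(values), sum_parallel(values))
```

The two functions add the values in a different order, so the results can differ in the last few bits; that is normal for floating-point arithmetic.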
SSE Coding Example 2

          movups   xmm0, [Vector]
          movups   xmm1, xmm0
          mulps    xmm1, xmm1
          movups   xmm2, xmm1
          shufps   xmm2, xmm2, 00111001b
          addss    xmm1, xmm2
          shufps   xmm2, xmm2, 00111001b
          addss    xmm1, xmm2
          sqrtss   xmm1, xmm1
          unpcklps xmm1, xmm1
          unpcklps xmm1, xmm1
          divps    xmm0, xmm1
          movups   [Vector], xmm0

• So now what does this do?
SSE Coding Example 2
Register contents are shown from most significant (left) to least significant (right); xxxxxxx marks a value that does not matter.

          movups   xmm0, [Vector]          ; 0.0      Vz       Vy       Vx
          movups   xmm1, xmm0
          mulps    xmm1, xmm1              ; 0.0      Vz*Vz    Vy*Vy    Vx*Vx
          movups   xmm2, xmm1
          shufps   xmm2, xmm2, 00111001b   ; xxxxxxx  0.0      Vz*Vz    Vy*Vy
          addss    xmm1, xmm2              ; xxxxxxx  xxxxxxx  xxxxxxx  Vx*Vx+Vy*Vy
          shufps   xmm2, xmm2, 00111001b   ; xxxxxxx  xxxxxxx  0.0      Vz*Vz
          addss    xmm1, xmm2              ; xxxxxxx  xxxxxxx  xxxxxxx  Vx*Vx+Vy*Vy+Vz*Vz
          sqrtss   xmm1, xmm1              ; xxxxxxx  xxxxxxx  xxxxxxx  sqrt
          unpcklps xmm1, xmm1              ; xxxxxxx  xxxxxxx  sqrt     sqrt
          unpcklps xmm1, xmm1              ; sqrt     sqrt     sqrt     sqrt
          divps    xmm0, xmm1              ; 0.0      Vz/sqrt  Vy/sqrt  Vx/sqrt
          movups   [Vector], xmm0

• It normalizes a vector and overwrites the vector with its normalization
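The same steps can be followed in a Python model to see the normalization happen on concrete numbers. As before this is only a sketch of the semantics: registers are lists with index 0 as the least-significant element (so a vector is stored as [Vx, Vy, Vz, 0.0]), the helpers imitate the instructions they are named after, and a zero-length vector is not handled (Python raises ZeroDivisionError for it).

```python
import math

def shufps(dst, src, imm8):
    """Model of shufps: selectors 0-1 pick from dst, selectors 2-3 from src."""
    sel = [(imm8 >> (2 * i)) & 0b11 for i in range(4)]
    return [dst[sel[0]], dst[sel[1]], src[sel[2]], src[sel[3]]]

def unpcklps(dst, src):
    """Model of unpcklps: interleave the low two elements of dst and src."""
    return [dst[0], src[0], dst[1], src[1]]

def normalize(vector):
    xmm0 = list(vector)                          # [Vx, Vy, Vz, 0.0]
    xmm1 = [a * a for a in xmm0]                 # mulps: squares of each element
    xmm2 = shufps(xmm1, xmm1, 0b00111001)        # Vy*Vy now in element 0
    xmm1[0] += xmm2[0]                           # addss
    xmm2 = shufps(xmm2, xmm2, 0b00111001)        # Vz*Vz now in element 0
    xmm1[0] += xmm2[0]                           # addss: sum of squares
    xmm1[0] = math.sqrt(xmm1[0])                 # sqrtss: vector length
    xmm1 = unpcklps(xmm1, xmm1)                  # length in elements 0 and 1
    xmm1 = unpcklps(xmm1, xmm1)                  # length in all four elements
    return [a / b for a, b in zip(xmm0, xmm1)]   # divps

result = normalize([3.0, 4.0, 0.0, 0.0])
print(result)  # [0.6, 0.8, 0.0, 0.0]
```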
MP4
• For some reason it is taking many people around five minutes to run make on their programs
• A few can’t get it to work at all - make times out and just sits there
• When I do the same thing it takes 15-20 seconds
• This doesn’t make any sense!
• We’re looking into the problem and hope to have it fixed ASAP
• Now, onto the writeup