CS 3214 Computer Systems

CS 3214 Computer Systems
Godmar Back
Lecture 6

Announcements
• Exercise 3 due today
  • Not on Scholar; use submit.pl or the submission script
• Stay tuned for exercise 4
• Project 1 due Wed, Feb 10
  • Please read instructions first
  • Must be done on McB 124 machines or on rlogin cluster
• Auto-fail rule 1:
  • Need at least phase_4 defused to pass class.
CS 3214 Spring 2010

Nested Array Example

    #define PCOUNT 4
    zip_dig pgh[PCOUNT] =
        {{1, 5, 2, 0, 6},
         {1, 5, 2, 1, 3},
         {1, 5, 2, 1, 7},
         {1, 5, 2, 2, 1}};

[Figure: memory layout of zip_dig pgh[4]; the four rows start at addresses 76, 96, 116, and 136, and the array ends at 156]

• Declaration “zip_dig pgh[4]” equivalent to “int pgh[4][5]”
• Variable pgh denotes array of 4 elements
  • Allocated contiguously
• Each element is an array of 5 ints
  • Allocated contiguously
• “Row-major” ordering of all elements guaranteed
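
The row-major guarantee can be checked directly. The sketch below is illustrative only: the typedef of zip_dig is assumed (the slide implies it through the “int pgh[4][5]” equivalence), and row_major_offset and actual_offset are hypothetical helper names.

```c
#include <stddef.h>

typedef int zip_dig[5];              /* assumed typedef: 5 ints per zip code */

#define PCOUNT 4
static zip_dig pgh[PCOUNT] =
    {{1, 5, 2, 0, 6},
     {1, 5, 2, 1, 3},
     {1, 5, 2, 1, 7},
     {1, 5, 2, 2, 1}};

/* Hypothetical helper: byte offset of pgh[i][j] from the start of pgh,
 * computed by hand with the row-major formula (i * 5 + j) * sizeof(int). */
static size_t row_major_offset(size_t i, size_t j)
{
    return (i * 5 + j) * sizeof(int);
}

/* The same offset, measured from where the compiler placed the element. */
static size_t actual_offset(size_t i, size_t j)
{
    return (size_t)((char *)&pgh[i][j] - (char *)pgh);
}
```

The two helpers should agree for every valid (i, j), since C requires the rows of a nested array to be laid out one after another.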

Multi-Level Array Example

    zip_dig a = { 1, 5, 2, 1, 3 };
    zip_dig b = { 0, 2, 1, 3, 9 };
    zip_dig c = { 9, 4, 7, 2, 0 };

    #define UCOUNT 3
    int *univ[UCOUNT] = {a, b, c};

• Variable univ denotes array of 3 elements
• Each element is a pointer
  • 4 bytes
• Each pointer points to array of ints

[Figure: memory layout; the three 20-byte int arrays start at addresses 16, 36, and 56, and univ occupies addresses 160 to 172, holding one 4-byte pointer to each array]
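
Unlike the nested array, an element access through univ needs two memory reads. A minimal sketch, assuming the same zip_dig typedef as before; get_univ_digit is a hypothetical helper name.

```c
typedef int zip_dig[5];   /* assumed typedef, as implied by the nested-array slide */

static zip_dig a = { 1, 5, 2, 1, 3 };
static zip_dig b = { 0, 2, 1, 3, 9 };
static zip_dig c = { 9, 4, 7, 2, 0 };

#define UCOUNT 3
static int *univ[UCOUNT] = {a, b, c};

/* Hypothetical helper: spell out the two memory reads behind univ[index][dig]. */
static int get_univ_digit(int index, int dig)
{
    int *row = univ[index];   /* read 1: fetch the row pointer */
    return row[dig];          /* read 2: fetch the int it points to */
}
```

The source syntax univ[index][dig] looks the same as for a nested array, but the rows need not be adjacent in memory.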

Structures
• Concept
  • Contiguously-allocated region of memory
  • Refer to members within structure by names
  • Members may be of different types
• Accessing Structure Member

    struct rec {
        int i;
        int a[3];
        int *p;
    };

Memory Layout: i at offset 0, a at offset 4, p at offset 16; 20 bytes total

    void set_i(struct rec *r, int val)
    {
        r->i = val;
    }

Assembly

    # %eax = val
    # %edx = r
    movl %eax,(%edx)    # Mem[r] = val
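
The member offsets can be inspected with the standard offsetof macro. The sketch below is illustrative; set_i_by_offset is a hypothetical helper that mimics the single store in the assembly above.

```c
#include <stddef.h>

struct rec {
    int i;
    int a[3];
    int *p;
};

/* Hypothetical helper: write val through a hand-computed member address,
 * which is what "movl %eax,(%edx)" does for member i at offset 0. */
static void set_i_by_offset(struct rec *r, int val)
{
    int *member = (int *)((char *)r + offsetof(struct rec, i));
    *member = val;
}
```

On IA32 the offsets are 0, 4, and 16 as on the slide; on other platforms the offset of p may differ because of pointer size and alignment.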

Generating Pointer to Struct. Member

    struct rec {
        int i;
        int a[3];
        int *p;
    };

Memory Layout: r points to i at offset 0, a at offset 4, p at offset 16

• Generating Pointer to Array Element
  • Offset of each structure member determined at compile time
  • Address of a[idx] is r + 4 + 4*idx

    int *find_a(struct rec *r, int idx)
    {
        return &r->a[idx];
    }

    # %ecx = idx
    # %edx = r
    leal 0(,%ecx,4),%eax     # 4*idx
    leal 4(%eax,%edx),%eax   # r+4*idx+4
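
The address computation done by the two leal instructions can be written out in C. This is a sketch; find_a_by_hand is a hypothetical name, and it uses offsetof and sizeof rather than the literal constants 4 and 4 so that it also holds on platforms other than IA32.

```c
#include <stddef.h>

struct rec {
    int i;
    int a[3];
    int *p;
};

static int *find_a(struct rec *r, int idx)
{
    return &r->a[idx];
}

/* Hypothetical helper: the same address from explicit byte arithmetic,
 * r + offsetof(a) + sizeof(int)*idx, i.e. r + 4 + 4*idx on IA32. */
static int *find_a_by_hand(struct rec *r, int idx)
{
    return (int *)((char *)r + offsetof(struct rec, a)
                   + sizeof(int) * (size_t)idx);
}
```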

Union Allocation
• Principles
  • Overlay union elements
  • Allocate according to largest element
  • Can only use one field at a time

    union U1 {
        char c;
        int i[2];
        double v;
    } *up;

    struct S1 {
        char c;
        int i[2];
        double v;
    } *sp;

[Figure: union U1 overlays c, i[0], and v at up+0, with i[1] at up+4, ending at up+8; struct S1 places c at sp+0, i[0] at sp+4, i[1] at sp+8, and v at sp+16, ending at sp+24 (Windows alignment)]
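
The overlay can be observed with sizeof and offsetof. A minimal sketch; union_size and struct_size are hypothetical helper names, and the exact struct size depends on the platform's alignment rules.

```c
#include <stddef.h>

union U1 {
    char c;
    int i[2];
    double v;
};

struct S1 {
    char c;
    int i[2];
    double v;
};

/* Hypothetical helpers reporting layout facts for the current platform. */
static size_t union_size(void)  { return sizeof(union U1); }
static size_t struct_size(void) { return sizeof(struct S1); }
```

Every member of the union starts at offset 0, so the union is only as large as its largest member (plus any alignment padding), while the struct must hold all members side by side.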

The following slides are taken with permission from
Complete Powerpoint Lecture Notes for Computer Systems: A Programmer's Perspective (CS:APP)
Randal E. Bryant and David R. O'Hallaron
http://csapp.cs.cmu.edu/public/lectures.html
Part 4: Programs and Data

Today
• x86_64
• Advanced compiler use
• Extended/inline asm
• Vectorization
• SIMD Intrinsics
• Floating point
• Buffer Overflows (Part 1)

x86_64
• 64-bit extension of IA32
  • aka EM64T (Intel)
• Please read x86_64 supplemental material
  • http://csapp.cs.cmu.edu/public/docs/asm64-handout.pdf
• Don’t confuse with IA64 “Itanium”

x86_64 Highlights
• Extends the 8 general-purpose registers to 64 bits
  • And adds 8 more 64-bit registers
• C binding: sizeof(int) is still 4; sizeof(anything *), sizeof(long), and sizeof(long int) are now 8
  • NB: sizeof(long long) is 8 on both IA32 and x86_64
• Arguments are passed in registers by default
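
The C binding can be checked on whatever machine the code is compiled for. A small sketch with hypothetical helper names; the values for long and pointers described on the slide apply to x86_64 Linux builds, so the helper reports rather than assumes them.

```c
#include <stddef.h>

/* Hypothetical helper: 1 if this build has 8-byte long and 8-byte pointers,
 * as the slide describes for x86_64; 0 otherwise (for example on IA32). */
static int has_64bit_long_and_pointers(void)
{
    return sizeof(long) == 8 && sizeof(void *) == 8;
}

static size_t size_of_int(void)       { return sizeof(int); }
static size_t size_of_long_long(void) { return sizeof(long long); }
```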

x86_64
See http://www.x86-64.org/documentation.html

Inlined Assembly
• asm("…" : <output> : <input> : <clobber>)
• A means to inject assembly into code and link it with the remainder in a controlled manner
• Compiler doesn’t “know” what the instructions do, so the programmer must describe
  • a) state the compiler must create upon entry: which values must be in which registers, etc.
  • b) state produced by the inline instructions: which registers contain which values, etc.; also any registers that may be clobbered
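
A very small instance of the output/input/clobber format is sketched below. It is an illustration only: add_u32 is a hypothetical function, the asm path assumes GCC or Clang on IA32 or x86_64, and other builds fall back to plain C. __asm__ is the spelling of asm that also works under strict -std= modes.

```c
#include <stdint.h>

/* Hypothetical example: add two 32-bit values with one addl instruction. */
static uint32_t add_u32(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t sum;
    __asm__("addl %2, %0"
            : "=r" (sum)         /* output: any register, operand %0        */
            : "0" (a), "r" (b)   /* inputs: a starts in the same register as
                                    %0; b goes in any register, operand %2  */
            : "cc");             /* clobber: addl changes the flags         */
    return sum;
#else
    return a + b;                /* fallback for other compilers or targets */
#endif
}
```

The "0" constraint tells the compiler that a must be loaded into whichever register it picks for the output, which matches the two-operand form of addl (destination is also a source).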

Inlined Assembly Example

    bool imul32x32_64(uint32_t leftop, uint32_t rightop, uint64_t *presult)
    {
        uint64_t result;
        bool overflow;
        asm("imull %2"   "\n\t"
            "seto %%bl"  "\n\t"
            : "=A" (result), "=b" (overflow)   // output constraints
            : "r" (leftop), "a" (rightop)      // input constraints
            );
        *presult = result;
        return overflow;
    }

Goal: exploit imull’s property of computing the full 64-bit product of two 32-bit operands: “imull %ecx” means (%edx, %eax) := %ecx * %eax

Constraints and operands:
• "r" (leftop): pick any 32-bit register and put leftop in it
• "a" (rightop): make sure %eax contains rightop
• "%2": substitute whichever register was picked for leftop
• "=A": result is in (%edx, %eax)
• "=b": overflow is in %ebx (set through %bl)

Note: the one-operand imull is a signed multiply; for unsigned operands the matching instruction is mull.

Inlined Assembly (2)

    bool imul32x32_64(uint32_t leftop, uint32_t rightop, uint64_t *presult)
    {
        uint64_t result;
        bool overflow;
        asm("imull %2"   "\n\t"
            "seto %%bl"  "\n\t"
            : "=A" (result), "=b" (overflow)   // output constraints
            : "r" (leftop), "a" (rightop)      // input constraints
            );
        *presult = result;
        return overflow;
    }

Generated assembly:

    imul32x32_64:
        pushl %ebp
        movl %esp, %ebp
        subl $12, %esp
        movl %ebx, (%esp)
        movl %esi, 4(%esp)
        movl %edi, 8(%esp)
        movl 8(%ebp), %ecx
        movl 12(%ebp), %eax
    #APP
        imull %ecx
        seto %bl
    #NO_APP
        movl %eax, %esi
        movl 16(%ebp), %eax
        movl %esi, (%eax)
        movl %edx, 4(%eax)
        movzbl %bl, %eax
        movl (%esp), %ebx
        movl 4(%esp), %esi
        movl 8(%esp), %edi
        movl %ebp, %esp
        popl %ebp
        ret

Floating Point on IA32
• History:
  • First implemented in 8087 coprocessor
  • “Stack based”: FPU has 8 registers that form a stack %st(0), %st(1), …
  • Known as ‘x87’ floating point
• Weirdness: internal accuracy is 80 bits (rather than the 64 bits of an IEEE 754 double), so storing a value involves rounding
• Results depend on how often values are moved out of the FPU registers into memory (which depends on the compiler’s code generation strategy and optimization level). Not good!
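
Whether a given build keeps intermediate results in wider precision can be queried through the standard C99 macro FLT_EVAL_METHOD. A sketch; fp_eval_description is a hypothetical helper, and the "typical for" remarks in the strings are common cases rather than guarantees.

```c
#include <float.h>

#ifndef FLT_EVAL_METHOD
#define FLT_EVAL_METHOD (-1)   /* pre-C99 headers: treat as unknown */
#endif

/* Hypothetical helper: describe how this build evaluates floating-point
 * expressions, based on FLT_EVAL_METHOD. */
static const char *fp_eval_description(void)
{
    switch (FLT_EVAL_METHOD) {
    case 0:  return "each type evaluated in its own precision (typical for SSE)";
    case 1:  return "float and double evaluated as double";
    case 2:  return "everything evaluated as long double (typical for x87)";
    default: return "indeterminable or implementation-defined";
    }
}
```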

Floating Point Code Example
• Compute Inner Product of Two Vectors
  • Single precision arithmetic
  • Common computation

    float ipf(float x[], float y[], int n)
    {
        int i;
        float result = 0.0;
        for (i = 0; i < n; i++) {
            result += x[i] * y[i];
        }
        return result;
    }

        pushl %ebp               # setup
        movl %esp,%ebp
        pushl %ebx
        movl 8(%ebp),%ebx        # %ebx=&x
        movl 12(%ebp),%ecx       # %ecx=&y
        movl 16(%ebp),%edx       # %edx=n
        fldz                     # push +0.0
        xorl %eax,%eax           # i=0
        cmpl %edx,%eax           # if i>=n done
        jge .L3
    .L5:
        flds (%ebx,%eax,4)       # push x[i]
        fmuls (%ecx,%eax,4)      # st(0)*=y[i]
        faddp                    # st(1)+=st(0); pop
        incl %eax                # i++
        cmpl %edx,%eax           # if i<n repeat
        jl .L5
    .L3:
        movl -4(%ebp),%ebx       # finish
        movl %ebp, %esp
        popl %ebp
        ret                      # st(0) = result

Floating Point: SSE(*)
• Various extensions to x87 were introduced:
  • SSE, SSE2, SSE3, SSE4, SSE5
• Use 16 128-bit %xmm registers (on x86_64; 32-bit mode has 8)
  • Can be used as 16x8 bit, 4x32 bit, 2x64 bit, etc. for both integer and floating point operations
• Use the -mfpmath=sse and -msse switches to enable (or -msse2, -msse3, -msse4)
• All doubles are 64 bits internally, which gives reproducible results independent of loads/stores
• Aside: if 80 bits is ok, can combine both units with -mfpmath=sse,387 for 24 registers

Floating Point SSE
• Same code compiled with: -msse2 -mfpmath=sse

    float ipf(float x[], float y[], int n)
    {
        int i;
        float result = 0.0;
        for (i = 0; i < n; i++) {
            result += x[i] * y[i];
        }
        return result;
    }

    ipf:
        pushl %ebp
        movl %esp, %ebp
        pushl %ebx
        subl $4, %esp
        movl 8(%ebp), %ebx
        movl 12(%ebp), %ecx
        movl 16(%ebp), %edx
        xorps %xmm1, %xmm1
        testl %edx, %edx
        jle .L4
        movl $0, %eax                # i = 0
        xorps %xmm1, %xmm1           # result = 0.0
    .L5:
        movss (%ebx,%eax,4), %xmm0   # t = x[i]
        mulss (%ecx,%eax,4), %xmm0   # t *= y[i]
        addss %xmm0, %xmm1           # result += t
        addl $1, %eax                # i = i+1
        cmpl %edx, %eax
        jne .L5
    .L4:
        movss %xmm1, -8(%ebp)
        flds -8(%ebp)                # %st(0) = result
        addl $4, %esp
        popl %ebx
        popl %ebp
        ret

Vectorization
• SSE* instruction sets can operate on ‘vectors’
• For instance, if two 128-bit registers are treated as (d1, d0) and (e1, e0), a single instruction can compute (d1+e1, d0+e0); the additions execute in parallel
• Also known as “SIMD”
  • Single instruction, multiple data
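
The (d1+e1, d0+e0) case can be written with the GCC/Clang vector_size attribute. This is a sketch assuming one of those compilers; add_pairs is a hypothetical helper, and whether the addition really becomes one instruction depends on the target and flags.

```c
#include <string.h>

/* GCC/Clang vector extension: a 16-byte vector holding 2 doubles. */
typedef double v2df __attribute__ ((vector_size (16)));

/* Hypothetical helper: out[k] = d[k] + e[k] for k = 0, 1, using one
 * vector addition instead of two scalar ones. */
static void add_pairs(const double d[2], const double e[2], double out[2])
{
    v2df vd, ve, sum;
    memcpy(&vd, d, sizeof vd);       /* load (d1, d0) */
    memcpy(&ve, e, sizeof ve);       /* load (e1, e0) */
    sum = vd + ve;                   /* (d1 + e1, d0 + e0) */
    memcpy(out, &sum, sizeof sum);
}
```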

Floating Point SSE - Vectorized
• Trying to make the compiler transform ipf into the form shown in ipf_vector

    float ipf(float x[], float y[], int n)
    {
        int i;
        float result = 0.0;
        for (i = 0; i < n; i++) {
            result += x[i] * y[i];
        }
        return result;
    }

    float ipf_vector(float x[], float y[], int n)
    {
        int i;
        float p[4];
        float result = 0.0;
        for (i = 0; i < n; i += 4) {
            p[0] = x[i] * y[i];
            p[1] = x[i+1] * y[i+1];
            p[2] = x[i+2] * y[i+2];
            p[3] = x[i+3] * y[i+3];
            result += p[0] + p[1] + p[2] + p[3];
        }
        return result;
    }

Logical transformation, not actual code
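
For experimenting, the logical transformation can be turned into code that handles any n. The sketch below is not what the compiler emits; ipf_unrolled is a hypothetical name, and the cleanup loop for the n % 4 leftover elements is an addition the slide's version leaves out.

```c
/* Reference version from the slide. */
static float ipf(const float x[], const float y[], int n)
{
    int i;
    float result = 0.0f;
    for (i = 0; i < n; i++)
        result += x[i] * y[i];
    return result;
}

/* Hypothetical runnable variant: four products per iteration,
 * then a cleanup loop for the remaining n % 4 elements. */
static float ipf_unrolled(const float x[], const float y[], int n)
{
    int i;
    float p[4];
    float result = 0.0f;
    for (i = 0; i + 3 < n; i += 4) {
        p[0] = x[i]     * y[i];
        p[1] = x[i + 1] * y[i + 1];
        p[2] = x[i + 2] * y[i + 2];
        p[3] = x[i + 3] * y[i + 3];
        result += p[0] + p[1] + p[2] + p[3];
    }
    for (; i < n; i++)
        result += x[i] * y[i];
    return result;
}
```

The two versions add the products in a different order, so for general inputs their results can differ in the last bits; with small integer-valued inputs they agree exactly.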

Example: GCC Vector Extension

    typedef float v4sf __attribute__ ((vector_size (16)));

    float ipf(v4sf x[], v4sf y[], int n)
    {
        int i;
        float partialsum, result = 0.0;
        for (i = 0; i < n; i++) {
            v4sf p = x[i] * y[i];
            float *v = (float *)&p;   // treat vector as float *
            partialsum = v[0] + v[1] + v[2] + v[3];
            result += partialsum;
        }
        return result;
    }

Note: the magic attribute tells gcc that v4sf is a type denoting vectors of 4 floats.

Example: GCC Vector Extensions

    typedef float v4sf __attribute__ ((vector_size (16)));

    float ipf(v4sf x[], v4sf y[], int n)
    {
        int i;
        float partialsum, result = 0.0;
        for (i = 0; i < n; i++) {
            v4sf p = x[i] * y[i];
            float *v = (float *)&p;
            partialsum = v[0] + v[1] + v[2] + v[3];
            result += partialsum;
        }
        return result;
    }

    ipf:
        pushl %ebp
        movl %esp, %ebp
        pushl %ebx
        subl $36, %esp
        movl 16(%ebp), %ebx
        movl 8(%ebp), %edx
        movl 12(%ebp), %eax
        movl $0, %ecx
        xorps %xmm1, %xmm1
    .L5:
        movaps (%eax), %xmm0
        mulps (%edx), %xmm0
        movaps %xmm0, -24(%ebp)
        movss -24(%ebp), %xmm0
        addss -20(%ebp), %xmm0
        addss -16(%ebp), %xmm0
        addss -12(%ebp), %xmm0
        addss %xmm0, %xmm1
        addl $1, %ecx
        addl $16, %edx
        addl $16, %eax
        cmpl %ebx, %ecx
        jne .L5
        movss %xmm1, -28(%ebp)
        flds -28(%ebp)
        addl $36, %esp
        popl %ebx
        popl %ebp
        ret

Comments
• Assembly code on previous slide is slightly simplified (omits first i < n check in case n == 0)
• Two problems with it
  • Problem 1: the partial result (the product vector ‘p’) is allocated on the stack
    • The value is said to be “spilled” to the stack
  • Problem 2: does not use the vector unit for computing the sum

SSE3: hadd_ps
• Treats 128 bits as 4 floats (“packed single”)
• Inputs are 2x128 bit: (A3, A2, A1, A0) and (B3, B2, B1, B0)
• Computes (B3 + B2, B1 + B0, A3 + A2, A1 + A0): a “horizontal” operation, hence “hadd”
• Apply twice to compute the sum of all 4 elements in the lowest element
• Use “intrinsics”: they look like function calls, but are instructions for the compiler to use certain machine instructions
  • Unlike ‘asm’, the compiler knows their meaning: no need to specify input or output constraints, or what’s clobbered
  • Compiler performs register allocation
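
Why two applications leave the full sum in the lowest element can be traced with a scalar model. The sketch below only models the arithmetic in plain C and does not use the real instruction; vec4, hadd_model, and sum4_via_hadd are hypothetical names.

```c
/* Scalar model of a 4-float vector; e[0] is the lowest element (A0). */
typedef struct { float e[4]; } vec4;

/* Model of hadd: result is (B3 + B2, B1 + B0, A3 + A2, A1 + A0),
 * written here from the lowest element upward. */
static vec4 hadd_model(vec4 a, vec4 b)
{
    vec4 r;
    r.e[0] = a.e[0] + a.e[1];   /* A1 + A0 */
    r.e[1] = a.e[2] + a.e[3];   /* A3 + A2 */
    r.e[2] = b.e[0] + b.e[1];   /* B1 + B0 */
    r.e[3] = b.e[2] + b.e[3];   /* B3 + B2 */
    return r;
}

/* Apply the model twice with a zero vector as the second input. */
static float sum4_via_hadd(vec4 p)
{
    vec4 zero = {{0.0f, 0.0f, 0.0f, 0.0f}};
    vec4 once = hadd_model(p, zero);      /* (0, 0, p3+p2, p1+p0) */
    vec4 twice = hadd_model(once, zero);  /* lowest: (p1+p0) + (p3+p2) */
    return twice.e[0];
}
```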

GCC Vector Extensions + XMM Intrinsics

    #include <pmmintrin.h>

    typedef float v4sf __attribute__ ((vector_size (16)));

    float ipf(v4sf x[], v4sf y[], int n)
    {
        int i;
        float partialsum, result = 0.0;
        v4sf zero = _mm_setzero_ps();   // intrinsic, produces vector of 4 0.0f
        for (i = 0; i < n; i++) {
            v4sf p = x[i] * y[i];
            _mm_store_ss(&partialsum,
                         _mm_hadd_ps(_mm_hadd_ps(p, zero), zero));
            result += partialsum;
        }
        return result;
    }

Example: GCC Vector Extensions + XMM Intrinsics

    #include <pmmintrin.h>

    typedef float v4sf __attribute__ ((vector_size (16)));

    float ipf(v4sf x[], v4sf y[], int n)
    {
        int i;
        float partialsum, result = 0.0;
        v4sf zero = _mm_setzero_ps();
        for (i = 0; i < n; i++) {
            v4sf p = x[i] * y[i];
            _mm_store_ss(&partialsum,
                         _mm_hadd_ps(_mm_hadd_ps(p, zero), zero));
            result += partialsum;
        }
        return result;
    }

    ipf:
        pushl %ebp
        movl %esp, %ebp
        pushl %ebx
        subl $4, %esp
        movl 16(%ebp), %ebx
        movl 8(%ebp), %edx
        movl 12(%ebp), %eax
        movl $0, %ecx
        xorps %xmm2, %xmm2
        xorps %xmm1, %xmm1
    .L5:
        movaps (%eax), %xmm0
        mulps (%edx), %xmm0
        haddps %xmm1, %xmm0
        haddps %xmm1, %xmm0
        addss %xmm0, %xmm2
        addl $1, %ecx
        addl $16, %edx
        addl $16, %eax
        cmpl %ebx, %ecx
        jne .L5
        movss %xmm2, -8(%ebp)
        flds -8(%ebp)
        addl $4, %esp
        popl %ebx
        popl %ebp
        ret
