CS 3214 Introduction to Computer Systems

CS 3214Introduction to Computer Systems Godmar Back Lecture 5

Distinguished Lecture Series Frances Allen, Turing Award Winner 2006 Friday 11:15am Haymarket/Squires CS 3214 Fall 2009

CSRC Fall 2009 Career Fair • http://www.cs.vt.edu/partnering/F09Students • Monday 4-7pm CS 3214 Fall 2009

Announcements • No class on Thursday, instead: • Go to Distinguished Lecture on Friday • Relevant topic: Compilers and High-Performance Computing • Read x86_64 supplemental (and Chapter 3, of course) • Exercise 4 due Thursday, Sep 10 • Project 1 due Wed, Sep 16 • Please read instructions first • Must be done on McB 124 machines or on rlogin cluster • Need at least phase_4 defused to pass class. • Don’t delay • If your keycard does not work in 124, contact helpdesk@cs.vt.edu CS 3214 Fall 2009

The following slides are taken with permission from Complete Powerpoint Lecture Notes forComputer Systems: A Programmer's Perspective (CS:APP) Randal E. Bryant and David R. O'Hallaron http://csapp.cs.cmu.edu/public/lectures.html Part 4 Programs and Data CS 3214 Fall 2009

Addendum • For discussion of how IA32 code accesses structs, how structs and unions are laid out and aligned, see last part of Lecture 4 slides CS 3214 Fall 2009

Today • Buffer Overflows (Part 1) • x86_64 • Floating point • Advanced compiler use • Extended/inline asm • Vectorization • SIMD Intrinsics CS 3214 Fall 2009

Buffer Overflows • What is a buffer overflow? • How can it be exploited? • How can it be avoided? • Through programmer measures • Through system measures (and how effective are they?) CS 3214 Fall 2009

String Library Code • Implementation of Unix function gets • No way to specify limit on number of characters to read • Similar problems with other Unix functions • strcpy: Copies string of arbitrary length • scanf, fscanf, sscanf, when given %s conversion specification /* Get string from stdin */ char *gets(char *dest){ int c = getc(); char *p = dest; while (c != EOF && c != '\n') { *p++ = c; c = getc(); } *p = '\0'; return dest; }

Vulnerable Buffer Code /* Echo Line */void echo(){ char buf[4]; /* Way too small! */ gets(buf); puts(buf);} int main(){ printf("Type a string:"); echo(); return 0;}

Buffer Overflow Executions unix>./bufdemo Type a string:123 123 unix>./bufdemo Type a string:12345 Segmentation Fault unix>./bufdemo Type a string:12345678 Segmentation Fault

Stack Frame for main Return Address Saved %ebp %ebp [3] [2] [1] [0] buf Stack Frame for echo Buffer Overflow Stack /* Echo Line */void echo(){ char buf[4]; /* Way too small! */ gets(buf); puts(buf);} echo: pushl %ebp # Save %ebp on stack movl %esp,%ebp subl $20,%esp # Allocate space on stack pushl %ebx # Save %ebx addl $-12,%esp # Allocate space on stackleal -4(%ebp),%ebx # Compute buf as %ebp-4 pushl %ebx # Push buf on stack call gets # Call gets . . .

Stack Frame for main Stack Frame for main Return Address Return Address Saved %ebp %ebp Saved %ebp 0xbffff8d8 [3] [2] [1] [0] buf [3] [2] [1] [0] buf Stack Frame for echo bf Stack Frame for echo ff f8 f8 08 04 86 4d xx xx xx xx unix> gdb bufdemo (gdb) break echo Breakpoint 1 at 0x8048583 (gdb) run Breakpoint 1, 0x8048583 in echo () (gdb) print /x *(unsigned *)$ebp $1 = 0xbffff8f8 (gdb) print /x *((unsigned *)$ebp + 1) $3 = 0x804864d Buffer Overflow Stack Example Before call to gets 8048648: call 804857c <echo> 804864d: mov 0xffffffe8(%ebp),%ebx # Return Point

Stack Frame for main Stack Frame for main Return Address Return Address Saved %ebp 0xbffff8d8 Saved %ebp %ebp [3] [2] [1] [0] buf [3] [2] [1] [0] buf Stack Frame for echo Stack Frame for echo bf ff f8 f8 08 04 86 4d 00 33 32 31 Buffer Overflow Example #1 Before Call to gets Input = “123” No Problem

Stack Frame for main Return Address Stack Frame for main Saved %ebp 0xbffff8d8 [3] [2] [1] [0] buf Stack Frame for echo Return Address Saved %ebp %ebp [3] [2] [1] [0] buf Stack Frame for echo bf ff 00 35 08 04 86 4d 34 33 32 31 Buffer Overflow Stack Example #2 Input = “12345” Saved value of %ebp set to 0xbfff0035 Bad news when later attempt to restore %ebp echo code: 8048592: push %ebx 8048593: call 80483e4 <_init+0x50> # gets 8048598: mov 0xffffffe8(%ebp),%ebx 804859b: mov %ebp,%esp 804859d: pop %ebp # %ebp gets set to invalid value 804859e: ret

Stack Frame for main Return Address Stack Frame for main Saved %ebp 0xbffff8d8 [3] [2] [1] [0] buf Stack Frame for echo Return Address Saved %ebp %ebp [3] [2] [1] [0] buf Invalid address Stack Frame for echo 38 37 36 35 No longer pointing to desired return point 08 04 86 00 34 33 32 31 Buffer Overflow Stack Example #3 Input = “12345678” %ebp and return address corrupted 8048648: call 804857c <echo> 804864d: mov 0xffffffe8(%ebp),%ebx # Return Point

Malicious Use of Buffer Overflow Stack after call to gets() • Input string contains byte representation of executable code • Overwrite return address with address of buffer • When bar() executes ret, will jump to exploit code void foo(){ bar(); ... } foo stack frame return address A B data written by gets() pad void bar() { char buf[64]; gets(buf); ... } exploit code bar stack frame B

Avoiding Overflow Vulnerability /* Echo Line */void echo(){ char buf[4]; /* Way too small! */ fgets(buf, 4, stdin); puts(buf);} • Use Library Routines that check/limit string lengths • fgets instead of gets • strncpy/strlcpyinstead of strcpy • snprintfinstead of sprintf • Don’t use scanf with %s conversion specification • Use fgetsto read the string

Inlined Assembly • asm(“…” : <output> : <input> : <clobber>) • Means to inject assembly into code and link with remained in a controlled manner • Compiler doesn’t “know” what instructions do – thus must describe • a) state compiler must create upon enter: which values must be in which registers, etc. • b) state produced by inline instructions: which registers contain which values, etc. – also: any registers that may be clobbered CS 3214 Fall 2009

Inlined Assembly Example bool imul32x32_64(uint32_t leftop, uint32_t rightop, uint64_t *presult) { uint64_t result; bool overflow; asm("imull %2" "\n\t" "seto %%bl" "\n\t" : "=A" (result), "=b" (overflow) // output constraint : "r" (leftop), "a" (rightop) // input constraint ); *presult = result; return overflow; } Goal: exploit imull’s property to compute 32x32 bit product: imull %ecx means (%edx, %eax) := %ecx * %eax Magic instructions: “r”(leftop) – pick any 32bit register and put leftop in it “a” (rightop) – make sure %eax contains rightop “%2” substitute whichever register picked for ‘leftop’ “=A” result is in (%edx, %eax) “=b” result is in %ebx CS 3214 Fall 2009

imul32x32_64: pushl %ebp movl %esp, %ebp subl $12, %esp movl %ebx, (%esp) movl %esi, 4(%esp) movl %edi, 8(%esp) movl 8(%ebp), %ecx movl 12(%ebp), %eax #APP imull %ecx seto %bl #NO_APP movl %eax, %esi movl 16(%ebp), %eax movl %esi, (%eax) movl %edx, 4(%eax) movzbl %bl, %eax movl (%esp), %ebx movl 4(%esp), %esi movl 8(%esp), %edi movl %ebp, %esp popl %ebp ret Inlined Assembly (2) bool imul32x32_64(uint32_t leftop, uint32_t rightop, uint64_t *presult) { uint64_t result; bool overflow; asm("imull %2" "\n\t" "seto %%bl" "\n\t" : "=A" (result), "=b" (overflow) // output constraint : "r" (leftop), "a" (rightop) // input constraint ); *presult = result; return overflow; } CS 3214 Fall 2009

x86_64 • 64-bit extension of IA32 • aka EM64T (Intel) • Please read x86_64 supplemental material • http://csapp.cs.cmu.edu/public/docs/asm64-handout.pdf • Don’t confuse with IA64 “Itanium” CS 3214 Fall 2009

x86_64 Highlights • Extends 8 general purpose registers to 64bit lengths • And add 8 more 64bit registers • C Binding: sizeof(int) still 4!; sizeof(anything *), sizeof(long), sizeof(long int) now 8. • NB: sizeof(long long) is 8 both on IA32 and x86_64 • Passing arguments in registers by default CS 3214 Fall 2009

x86_64 See http://www.x86-64.org/documentation.html CS 3214 Fall 2009

Floating Point on IA32 • History: • First implemented in 8087 coprocessor • “stack based” – FPU has 8 registers that form a stack %st(0), %st(1), … • Known as ‘x87’ floating point • Weirdness: internal accuracy 80bit (rather than IEEE745 64bit) – thus storing involves rounding • Results depends on how often values are moved out of the FPU registers into memory (which depends on compiler’s code generation strategy/optimization level) – not good! CS 3214 Fall 2009

Floating Point Code Example • Compute Inner Product of Two Vectors • Single precision arithmetic • Common computation pushl %ebp # setup movl %esp,%ebp pushl %ebx movl 8(%ebp),%ebx # %ebx=&x movl 12(%ebp),%ecx # %ecx=&y movl 16(%ebp),%edx # %edx=n fldz # push +0.0 xorl %eax,%eax # i=0 cmpl %edx,%eax # if i>=n done jge .L3 .L5: flds (%ebx,%eax,4) # push x[i] fmuls (%ecx,%eax,4) # st(0)*=y[i] faddp # st(1)+=st(0); pop incl %eax # i++ cmpl %edx,%eax # if i<n repeat jl .L5 .L3: movl -4(%ebp),%ebx # finish movl %ebp, %esp popl %ebp ret # st(0) = result float ipf (float x[], float y[], int n) { inti; float result = 0.0; for (i = 0; i < n; i++) { result += x[i] * y[i]; } return result; } CS 3214 Fall 2009

Floating Point: SSE(*) • Various extensions to x87 were introduced: • SSE, SSE2, SSE3, SSE4, SSE5 • Use 16 128bit %xmm registers • Can be used as 16x8bit, 4x32bit, 2x64bit, etc. for both integer and floating point operations • Use –fpmath=sse –msseswitch to enable (or –msse2, -msse3, -msse4) • All doubles are 64bits internally - gives reproducible results independent of load/stores • Aside: if 80bit is ok, can combine –fpmath=sse,x87 for 24 registers CS 3214 Fall 2009

Floating Point SSE • Same code compiled with:-msse2 -fpmath=sse ipf: pushl %ebp movl %esp, %ebp pushl %ebx subl $4, %esp movl 8(%ebp), %ebx movl 12(%ebp), %ecx movl 16(%ebp), %edx xorps %xmm1, %xmm1 testl %edx, %edx jle .L4 movl $0, %eax ; i = 0 xorps %xmm1, %xmm1; result = 0.0 .L5: movss (%ebx,%eax,4), %xmm0 ; t = x[i] mulss (%ecx,%eax,4), %xmm0 ; t *= y[i] addss %xmm0, %xmm1 ; result += t addl $1, %eax ; i = i+1 cmpl %edx, %eax jne .L5 .L4: movss %xmm1, -8(%ebp) flds -8(%ebp) ; %st(0) = result addl $4, %esp popl %ebx popl %ebp ret float ipf (float x[], float y[], int n) { inti; float result = 0.0; for (i = 0; i < n; i++) { result += x[i] * y[i]; } return result; } CS 3214 Fall 2009

Vectorization • SSE* instruction sets can operate on ‘vectors’ • For instance, if 128bit register is treated as (d1, d0) and (e1, e0), can compute (d1+e1, d0+e0) using single instruction – executes in parallel • Also known as “SIMD” • Single instruction, multiple data CS 3214 Fall 2009

Floating Point SSE - Vectorized • Trying to make compiler achieve transformation shown on right float ipf_vector (float x[], float y[], int n) { inti; float result = 0.0; for (i = 0; i < n; i+=4) { p[0] = x[i] * y[i]; p[1] = x[i+1] * y[i+1]; p[2] = x[i+2] * y[i+2]; p[3] = x[i+3] * y[i+3]; result += p[0]+p[1]+p[2]+p[3]; } return result; } float ipf (float x[], float y[], int n) { inti; float result = 0.0; for (i = 0; i < n; i++) { result += x[i] * y[i]; } return result; } Logical transformation, not actual code CS 3214 Fall 2009

Example: GCC Vector Extension magic attribute that tells gcc that v4sf is a type denoting vectors of 4 floats typedef float v4sf __attribute__ ((vector_size (16))); float ipf (v4sf x[], v4sf y[], int n) { inti; float partialsum, result = 0.0; for (i = 0; i < n; i++) { v4sf p = x[i] * y[i]; float * v = (float *)&p; // treat vector as float * partialsum = v[0] + v[1] + v[2] + v[3]; result += partialsum; } return result; } CS 3214 Fall 2009

ipf: pushl %ebp movl %esp, %ebp pushl %ebx subl $36, %esp movl 16(%ebp), %ebx movl 8(%ebp), %edx movl 12(%ebp), %eax movl $0, %ecx xorps %xmm1, %xmm1 .L5: movaps (%eax), %xmm0 mulps (%edx), %xmm0 movaps %xmm0, -24(%ebp) movss -24(%ebp), %xmm0 addss -20(%ebp), %xmm0 addss -16(%ebp), %xmm0 addss -12(%ebp), %xmm0 addss %xmm0, %xmm1 addl $1, %ecx addl $16, %edx addl $16, %eax cmpl %ebx, %ecx jne .L5 movss %xmm1, -28(%ebp) flds -28(%ebp) addl $36, %esp popl %ebx popl %ebp ret Example: GCC Vector Extensions typedef float v4sf __attribute__ ((vector_size (16))); float ipf (v4sf x[], v4sf y[], int n) { inti; float partialsum, result = 0.0; for (i = 0; i < n; i++) { v4sf p = x[i] * y[i]; float * v = (float *)&p; partialsum = v[0] + v[1] + v[2] + v[3]; result += partialsum; } return result; } CS 3214 Fall 2009

Comments • Assembly code on previous slide is slightly simplified (omits first i < n check in case n ==0) • Two problems with it • Problem 1: ‘partialresult’ is allocated on the stack • value is said to be “spilled” to the stack • Problem 2: • Does not use vector unit for computing sum CS 3214 Fall 2009

SSE3: hadd_ps • Treats 128bit as 4 floats (“parallel single”) • Input are 2x128bit (A3, A2, A1, A0) and (B3, B2, B1, B0) • Computes (B3 + B2, B1 + B0, A3 + A2, A1 + A0) – “horizontal” operation “hadd” • Apply twice to compute sum of all 4 elements in lowest element • Use “intrinsics” – look like function calls, but are instructions for the compiler to use certain instructions • Unlike ‘asm’, compiler knows their meaning: no need to specify input, output constraints, or what’s clobbered • Compiler performs register allocation CS 3214 Fall 2009

GCC Vector Extensions + XMM Intrinsics #include <pmmintrin.h> typedef float v4sf __attribute__ ((vector_size (16))); float ipf (v4sf x[], v4sf y[], int n) { inti; float partialsum, result = 0.0; v4sf zero = _mm_setzero_ps(); // intrinsic, produces vector of 4 0.0f for (i = 0; i < n; i++) { v4sf p = x[i] * y[i]; _mm_store_ss( &partialsum, _mm_hadd_ps(_mm_hadd_ps(p, zero), zero)); result += partialsum; } return result; } CS 3214 Fall 2009

ipf: pushl %ebp movl %esp, %ebp pushl %ebx subl $4, %esp movl 16(%ebp), %ebx movl 8(%ebp), %edx movl 12(%ebp), %eax movl $0, %ecx xorps %xmm2, %xmm2 xorps %xmm1, %xmm1 .L5: movaps (%eax), %xmm0 mulps (%edx), %xmm0 haddps %xmm1, %xmm0 haddps %xmm1, %xmm0 addss %xmm0, %xmm2 addl $1, %ecx addl $16, %edx addl $16, %eax cmpl %ebx, %ecx jne .L5 movss %xmm2, -8(%ebp) flds -8(%ebp) addl $4, %esp popl %ebx popl %ebp ret Example: GCC Vector Extensions + XMM Intrinsics #include <pmmintrin.h> typedef float v4sf __attribute__ ((vector_size (16))); float ipf (v4sf x[], v4sf y[], int n) { inti; float partialsum, result = 0.0; v4sf zero = _mm_setzero_ps(); for (i = 0; i < n; i++) { v4sf p = x[i] * y[i]; _mm_store_ss( &partialsum, _mm_hadd_ps(_mm_hadd_ps(p, zero), zero)); result += partialsum; } return result; } CS 3214 Fall 2009

CS 3214 Introduction to Computer Systems