Lecture 3 – Sequential Performance CSCE 569 Parallel Processing • Topics • Measuring Process time • timeval • timespec • Improving performance • Readings January 22, 2014
Overview • Last Time • Overview • Shared Memory Model • Distributed Memory • UMA / NUMA • OpenMP • MPI • POSIX threads • CUDA • Reduction sum (tree of adds) • New • Slides 35-42 of Lecture 01 • Process times • timeval • timespec • getrusage • Slides 28-40 of Lecture 02 • matrix multiply
Slides 35-42 of Lecture 01 • Lecture 01 • time • timeval structure • gettimeofday • Time Command • Time examples • Time with threads • man getrusage • struct rusage
/class/csce569-001 • Linux shared directory • cd /class/csce569-001 • ls • Assignments Code PachecoText tlpi-dist web
matmul.c - get seconds used • struct timeval { • __kernel_time_t tv_sec; /* seconds */ • __kernel_suseconds_t tv_usec; /* microseconds */ • };
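As a minimal sketch of how two timeval readings can be turned into elapsed wall-clock seconds (the helper name `timeval_elapsed` is ours, not from matmul.c):

```c
#include <sys/time.h>

/* Difference between two gettimeofday() readings, in seconds.
   tv_sec carries seconds, tv_usec carries microseconds. */
double timeval_elapsed(const struct timeval *start,
                       const struct timeval *end)
{
    return (double)(end->tv_sec  - start->tv_sec)
         + (double)(end->tv_usec - start->tv_usec) * 1.0e-6;
}

/* Typical use:
     struct timeval t0, t1;
     gettimeofday(&t0, NULL);
     ... work being timed ...
     gettimeofday(&t1, NULL);
     printf("%f s\n", timeval_elapsed(&t0, &t1));           */
```

Note this measures wall-clock time; the `seconds()` routine below uses getrusage to measure CPU (user + system) time instead.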
double seconds(int nmode){ • struct rusage buf; • double temp; • getrusage(nmode, &buf); • /* Get system time and user time in micro-seconds.*/ • temp = (double)buf.ru_utime.tv_sec*1.0e6 + • (double)buf.ru_utime.tv_usec + • (double)buf.ru_stime.tv_sec*1.0e6 + • (double)buf.ru_stime.tv_usec; • /* Return the sum of system and user time in SECONDS.*/ • return( temp*1.0e-6 ); • }
Matrix Multiplication • Consider • A - m x k matrix (m rows, k columns) • B - k x n matrix (k rows, n columns) • C = A*B • Cij = dot product of the ith row of A and the jth column of B • Cij = Σ (x = 0 to k-1) A[i][x] * B[x][j]
Matrix Multiplication • tstart= seconds(RUSAGE_SELF); • for(i=0;i<m;++i){ • for(j=0;j<n;++j){ • C[i][j] = 0.0; • for(x=0;x<k;++x){ • C[i][j] = C[i][j] + A[i][x] * B[x][j]; • } • } • } • tend = seconds(RUSAGE_SELF);
allocmatrix • double **allocmatrix(int nrows, int ncols) { • double **m; • int i; • m = (double **) malloc((unsigned)(nrows)*sizeof(double*)); • if (!m) nerror("allocation failure 1 in matrix()"); • for(i=0; i<nrows; i++) { • m[i] = (double *) malloc((unsigned)(ncols) * sizeof(double)); • if (!m[i]) nerror("allocation failure 2 in matrix()"); • } • return m; • }
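The next slide mentions a matching `freematrix`; a minimal sketch consistent with the row-by-row allocation in `allocmatrix` above (the exact course version may differ) is:

```c
#include <stdlib.h>

/* Free a matrix allocated as in allocmatrix(): release each row
   first, then the array of row pointers.  Freeing in the other
   order would read m[i] after m has been freed. */
void freematrix(double **m, int nrows)
{
    int i;
    if (m == NULL)
        return;
    for (i = 0; i < nrows; i++)
        free(m[i]);
    free(m);
}
```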
More Notes on Matrix Multiplication • drand48, random for generating elements of arrays randomly • nerror(error_text) – • fprintf(stderr, "Run-time error...\n%s\n", error_text); • freematrix • m = atoi(argv[1]); • atoi = ASCII to integer • atoi(argv[1]) is equivalent to strtol(argv[1], (char **) NULL, 10);
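Unlike atoi, strtol can report bad input through its end-pointer argument. A sketch of a checked dimension parser (the helper name `parse_dim` is ours, for illustration):

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse a positive matrix dimension from argv.  atoi("abc")
   silently returns 0; strtol lets us detect the failure. */
int parse_dim(const char *arg)
{
    char *end;
    long v = strtol(arg, &end, 10);
    if (end == arg || *end != '\0' || v <= 0) {
        fprintf(stderr, "bad dimension: %s\n", arg);
        exit(1);
    }
    return (int)v;
}
```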
struct timespec – finer process times • /usr/include/time.h contains the timespec definition • struct timespec { • __time_t tv_sec; • long int tv_nsec; • }; • int clock_getres(clockid_t clk_id, struct timespec *res); • int clock_gettime(clockid_t clk_id, struct timespec *tp); • int clock_settime(clockid_t clk_id, const struct timespec *tp);
time0.c • struct timespec clk_resolution; • struct timespec start; • struct timespec finish; • int retval; • retval = clock_getres(CLOCK_MONOTONIC, &clk_resolution); • retval = clock_gettime(CLOCK_MONOTONIC, &start); • … do something … • retval = clock_gettime(CLOCK_MONOTONIC, &finish); • dumptimespec(&finish);
dumptimespec(struct timespec *ts){ • printf("seconds=%ld, nanoseconds=%ld\n", • ts->tv_sec, ts->tv_nsec); • }
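time0.c takes a start and finish reading but the subtraction needs care: when finish's nanosecond field is smaller than start's, a second must be borrowed. A sketch (the helper name `timespec_elapsed` is ours):

```c
#include <time.h>

/* Elapsed seconds between two CLOCK_MONOTONIC readings,
   borrowing from tv_sec when the nanosecond difference
   goes negative. */
double timespec_elapsed(const struct timespec *start,
                        const struct timespec *finish)
{
    long sec  = finish->tv_sec  - start->tv_sec;
    long nsec = finish->tv_nsec - start->tv_nsec;
    if (nsec < 0) {            /* borrow one second */
        sec  -= 1;
        nsec += 1000000000L;
    }
    return (double)sec + (double)nsec * 1.0e-9;
}
```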
args0.c – process command line arguments • main(int argc, char *argv[]) • { • int i; • char *tmp; • for(i = 0; i < argc; i++){ • printf("argument[%d] = \"%s\"\n", i, argv[i]); • } • }
Questions on Project00 • Error Log • … • … • …
Slides 28-40 of Lecture 02 • Lecture 02 • Professor Pacheco • TAs • Division of work – data parallelism • Division of work – task parallelism • Div. of work – data parallelism: compute_next_value • Div. of work – task parallelism: compute_next_value • Coordination: communication, load balancing, synchronization • What we’ll be doing: MPI, Pthreads, OpenMP • Type of parallel systems – Shared vs Distributed • Type of parallel systems - diagram • Concurrent, Parallel, and Distributed computing • Concluding Remarks
Project01 – Improving Serial Programs • Amdahl’s Law % parallelizable • cache performance • Average Memory Access Time (AMAT) • valgrind overview • row major strides vs column strides • blocking matrix multiply • cpp macros • Reading arrays - fscanf(file_ptr, " %e", &a[i][j]); • fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
Amdahl’s Law % Parallelizable • Suppose you have an enhancement or improvement in a design component. • The improvement in the overall performance of the system is limited by the fraction of the time the enhancement can be used. Ref. CAAQA
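Concretely, if fraction f of the run benefits from the enhancement and that part is sped up by a factor s, the overall speedup is 1 / ((1 - f) + f/s). A small worked helper:

```c
/* Amdahl's law: overall speedup when fraction f of the execution
   is improved by factor s.  With f = 0.9 and s = 10 the overall
   speedup is 1 / (0.1 + 0.09) ~= 5.26x, and even s -> infinity
   is capped at 1 / (1 - f) = 10x. */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```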
Cache Performance • CSAPP – Bryant & O’Hallaron
valgrind --help • man valgrind • MEMCHECK OPTIONS • CACHEGRIND OPTIONS • CALLGRIND OPTIONS • HELGRIND OPTIONS
Preparing Program • gcc -g -O0 -o myprog myprog.c • Running the program • ./myprog arg1 arg2 • Checking for Memory Leaks with valgrind • valgrind --leak-check=yes ./myprog arg1 arg2
valgrind output • ==19182== Address 0x1BA45050 is 0 bytes after a block of size 40 alloc'd • ==19182== at 0x1B8FF5CD: malloc (vg_replace_malloc.c:130) • ==19182== by 0x8048385: f (example.c:5) • ==19182== by 0x80483AB: main (example.c:11)
valgrind --tool=cachegrind prog • Level 1 and Last Level caches
valgrind --tool=cachegrind mm 10 10 10 • ==3574== I refs: 1,531,400 • ==3574== I1 misses: 1,093 • ==3574== LLi misses: 1,077 • ==3574== I1 miss rate: 0.07% • ==3574== LLi miss rate: 0.07% • ==3574== • ==3574== D refs: 843,924 (542,894 rd + 301,030 wr) • ==3574== D1 misses: 2,116 ( 1,866 rd + 250 wr) • ==3574== LLd misses: 1,695 ( 1,487 rd + 208 wr) • ==3574== D1 miss rate: 0.2% ( 0.3% + 0.0% ) • ==3574== LLd miss rate: 0.2% ( 0.2% + 0.0% ) • ==3574== • ==3574== LL refs: 3,209 ( 2,959 rd + 250 wr) • ==3574== LL misses: 2,772 ( 2,564 rd + 208 wr) • ==3574== LL miss rate: 0.1% ( 0.1% + 0.0% )
valgrind callgrind • valgrind --tool=callgrind mm 10 10 10 • callgrind_annotate callgrind.out.3577
row major strides vs column strides • arrays stored in row major order • row major strides • column strides
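C stores 2-D arrays in row-major order, so the traversal order determines the stride. A sketch contrasting the two (same result, very different cache behavior):

```c
#define N 1024

/* Row-major traversal: the inner loop walks consecutive
   addresses, so every element of a fetched cache line is used. */
double sum_row_major(double a[N][N])
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column traversal: consecutive accesses are N*sizeof(double)
   bytes apart, so for large N nearly every access touches a
   different cache line. */
double sum_col_major(double a[N][N])
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

Running both under valgrind --tool=cachegrind makes the D1 miss-rate difference visible directly.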
Matrix Multiply blocking-loop CAAQA page 90 • for (jj=0; jj < N; jj = jj+B) • for (kk=0; kk < N; kk = kk+B) • for(i=0; i<N; ++i){ • for(j=jj; j<min(jj+B,N); ++j){ • for(k=kk; k<min(kk+B,N); ++k){ • C[i][j] = C[i][j] + A[i][k] * B[k][j]; • } • } • } • Note: C must be zeroed once before the jj loop; zeroing C[i][j] inside the blocked loops would discard the partial sums accumulated by earlier kk blocks.
cpp macros • cpp – C preprocessor directives • #define MAXLINE 1024 • #define min(a,b) ((a)<(b) ? (a) : (b)) • A version that avoids evaluating its arguments twice (GCC statement-expression extension): • #define max(a,b) \ • ({ __typeof__ (a) _a = (a); \ • __typeof__ (b) _b = (b); \ • _a > _b ? _a : _b; }) http://stackoverflow.com/questions/3437404/min-and-max-in-c
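The reason the GCC version exists is that the simple macro evaluates its winning argument twice, which matters when an argument has side effects. A small demonstration (the function name is ours):

```c
#define min(a,b) ((a) < (b) ? (a) : (b))

/* min(i++, 5) expands to ((i++) < (5) ? (i++) : (5)):
   the comparison increments i once, and because the test is
   true the result expression increments it again. */
int demo_double_eval(void)
{
    int i = 0;
    int m = min(i++, 5);
    (void)m;    /* m is 1, not 0 */
    return i;   /* 2, not 1: i++ ran twice */
}
```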
Pointer Arithmetic • If p is a pointer, what is p++? It yields the current value of p, then advances p by sizeof(*p) bytes – to the next element, not the next byte. • If a is the name of an array, then in an expression a decays to a pointer to its first element; for a 2-D array the base address is &a[0][0].
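A short sketch of element-wise pointer stepping (the function is ours, for illustration):

```c
/* Sum an array through a pointer.  a + n points one past the
   n-th double (n elements, not n bytes), and *p++ reads the
   current element and then steps p to the next double. */
double sum_through_pointer(const double *a, int n)
{
    double s = 0.0;
    const double *p   = a;
    const double *end = a + n;
    while (p < end)
        s += *p++;
    return s;
}
```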
Array reference macro • & is the address-of operator: &v is the address of v • Recall from data structures • &A[i][j] = &A[0][0] + skip i rows + skip j elements • = &A[0][0] + i * rowsize + j * elementsize • = &A[0][0] + i * numcols*elementsize + j * elementsize • A(i,j) = address-of macro for A[i][j] with n columns, where A is a double * to the base of a contiguous block • #define A(i,j) (A + (i)*n + (j))
Matrix multiply with address macro • for(i=0;i<m;++i){ • for(j=0;j<n;++j){ • *C(i,j) = 0.0; • for(x=0;x<k;++x){ • *C(i,j) = *C(i,j) + *A(i,x) * *B(x,j); • } • } • } • Note the pointer dereferences
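The address macro assumes the matrix lives in one contiguous block rather than the pointer-per-row layout of allocmatrix. A sketch of that allocation style (macro and function names here are ours; the macro takes the base pointer and column count explicitly to stay self-contained):

```c
#include <stdlib.h>

/* Map 2-D indices onto a flat row-major block:
   element (i,j) lives at base + i*ncols + j. */
#define ELEM(base, i, j, ncols) ((base) + (i)*(ncols) + (j))

/* One malloc for the whole matrix: better locality than one
   malloc per row, and the address macro works directly. */
double *alloc_flat_matrix(int nrows, int ncols)
{
    return malloc((size_t)nrows * (size_t)ncols * sizeof(double));
}
```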
Libraries • BLAS – Basic Linear Algebra Subroutines (1970’s) • BLAS2 • BLAS3 • BOOST
Reading Writing matrices • fprintf(fp, "%e ", a[i][j]); • fscanf(fp, "%e", &a[i][j]); • Problem? A %e round-trip through text loses the low-order bits of each double.
fread and fwrite • #include <stdio.h> • size_t fread(void *ptr, size_t size, size_t nmemb, FILE *str); • size_t fwrite(const void *ptr, size_t size, size_t nmemb, • FILE *str);
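Binary I/O with fwrite/fread writes the raw bytes of each double, so values come back exactly, unlike the %e text round-trip above. A sketch for one row (function names are ours, for illustration):

```c
#include <stdio.h>

/* Write n doubles in binary; 0 on success, -1 on short write. */
int write_row(const char *path, const double *row, size_t n)
{
    FILE *fp = fopen(path, "wb");
    if (!fp) return -1;
    size_t written = fwrite(row, sizeof(double), n, fp);
    fclose(fp);
    return written == n ? 0 : -1;
}

/* Read n doubles back; 0 on success, -1 on short read. */
int read_row(const char *path, double *row, size_t n)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;
    size_t got = fread(row, sizeof(double), n, fp);
    fclose(fp);
    return got == n ? 0 : -1;
}
```

The trade-off: the file is not human-readable and not portable across machines with different endianness or double formats.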