
Lecture 3 – Sequential Performance



  1. Lecture 3 – Sequential Performance CSCE 569 Parallel Processing • Topics • Measuring process time • timeval • timespec • Improving performance • Readings • January 22, 2014

  2. Overview • Last Time • Overview • Shared Memory Model • Distributed Memory • UMA / NUMA • OpenMP • MPI • POSIX threads • CUDA • Reduction sum (tree of adds) • New • Slides 35-42 of Lecture 01 • Process times • timeval • timespec • getrusage • Slides 28-40 of Lecture 02 • matrix multiply

  3. Slides 35-42 of Lecture 01 • Lecture 01 • time • timeval structure • gettimeofday • Time command • Time examples • Time with threads • man getrusage • struct rusage

  4. /class/csce569-001 • Linux shared directory • cd /class/csce569-001 • ls • Assignments Code PachecoText tlpi-dist web

  5. matmul.c - get seconds used
     struct timeval {
         __kernel_time_t       tv_sec;   /* seconds */
         __kernel_suseconds_t  tv_usec;  /* microseconds */
     };
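     struct timeval is also what gettimeofday (recapped from Lecture 01) fills in. A minimal wall-clock timing sketch, not taken from the course code; the timed region is a placeholder:

     #include <stdio.h>
     #include <sys/time.h>

     int main(void) {
         struct timeval t0, t1;
         gettimeofday(&t0, NULL);    /* second argument (timezone) is obsolete; pass NULL */
         /* ... code to be timed ... */
         gettimeofday(&t1, NULL);
         /* combine the seconds and microseconds fields into one double */
         double elapsed = (double)(t1.tv_sec - t0.tv_sec)
                        + (double)(t1.tv_usec - t0.tv_usec) * 1.0e-6;
         printf("wall time = %f s\n", elapsed);
         return 0;
     }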

  6. double seconds(int nmode) {
         struct rusage buf;
         double temp;
         getrusage(nmode, &buf);
         /* Get system time and user time in microseconds. */
         temp = (double)buf.ru_utime.tv_sec * 1.0e6 +
                (double)buf.ru_utime.tv_usec +
                (double)buf.ru_stime.tv_sec * 1.0e6 +
                (double)buf.ru_stime.tv_usec;
         /* Return the sum of system and user time in SECONDS. */
         return temp * 1.0e-6;
     }

  7. Matrix Multiplication • Consider • A - m x k matrix (m rows, k columns) • B - k x n matrix (k rows, n columns) • C = A*B • Cij = dot product of the ith row of A and the jth column of B • Cij = Σ (x = 1..k) A[i][x] * B[x][j]

  8. Matrix Multiplication
     tstart = seconds(RUSAGE_SELF);
     for (i = 0; i < m; ++i) {
         for (j = 0; j < n; ++j) {
             C[i][j] = 0.0;
             for (x = 0; x < k; ++x) {
                 C[i][j] = C[i][j] + A[i][x] * B[x][j];
             }
         }
     }
     tend = seconds(RUSAGE_SELF);

  9. allocmatrix
     double **allocmatrix(int nrows, int ncols) {
         double **m;
         int i;
         m = (double **) malloc((unsigned)(nrows) * sizeof(double *));
         if (!m) nerror("allocation failure 1 in matrix()");
         for (i = 0; i < nrows; i++) {
             m[i] = (double *) malloc((unsigned)(ncols) * sizeof(double));
             if (!m[i]) nerror("allocation failure 2 in matrix()");
         }
         return m;
     }

  10. More Notes on Matrix Multiplication • drand48, random for generating the elements of arrays randomly • nerror(error_text) – fprintf(stderr, "Run-time error...\n%s\n", error_text); • freematrix (see the sketch below) • m = atoi(argv[1]); • atoi = ASCII to integer • atoi(argv[1]) is equivalent to strtol(argv[1], (char **) NULL, 10);
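      The slide names nerror and freematrix without showing them. A plausible sketch consistent with the allocmatrix layout above; the actual course code may differ:

      #include <stdio.h>
      #include <stdlib.h>

      /* Hypothetical reconstruction: print the message and abort the run. */
      void nerror(char *error_text) {
          fprintf(stderr, "Run-time error...\n%s\n", error_text);
          exit(1);
      }

      /* Hypothetical reconstruction: free a matrix built by allocmatrix. */
      void freematrix(double **m, int nrows) {
          int i;
          for (i = 0; i < nrows; i++)
              free(m[i]);    /* each row was malloc'ed separately */
          free(m);           /* then the array of row pointers */
      }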

  11. struct timespec – finer process times
      /* /usr/include/time.h contains the timespec definition:
       * struct timespec {
       *     __time_t  tv_sec;
       *     long int  tv_nsec;
       * };
       */
      int clock_getres(clockid_t clk_id, struct timespec *res);
      int clock_gettime(clockid_t clk_id, struct timespec *tp);
      int clock_settime(clockid_t clk_id, const struct timespec *tp);

  12. time0.c
      struct timespec start;
      struct timespec finish;
      int retval;
      retval = clock_getres(CLOCK_MONOTONIC, &clk_resolution);
      retval = clock_gettime(CLOCK_MONOTONIC, &start);
      /* ... do something ... */
      retval = clock_gettime(CLOCK_MONOTONIC, &finish);
      dumptimespec(&finish);

  13. dumptimespec
      void dumptimespec(struct timespec *ts) {
          printf("seconds=%ld, nanoseconds=%ld\n",
                 ts->tv_sec, ts->tv_nsec);
      }
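      time0.c captures start and finish but the transcript never shows the subtraction. A sketch of the difference, with the usual borrow when the nanosecond field goes negative:

      /* Elapsed time in seconds between two clock_gettime() samples. */
      double elapsed_seconds(const struct timespec *start,
                             const struct timespec *finish) {
          long sec  = finish->tv_sec  - start->tv_sec;
          long nsec = finish->tv_nsec - start->tv_nsec;
          if (nsec < 0) {            /* borrow one second */
              sec  -= 1;
              nsec += 1000000000L;
          }
          return (double)sec + (double)nsec * 1.0e-9;
      }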

  14. args0.c – process command line arguments
      int main(int argc, char *argv[])
      {
          int i;
          for (i = 0; i < argc; i++) {
              printf("argument[%d] = \"%s\"\n", i, argv[i]);
          }
          return 0;
      }
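      For example, a run such as ./args0 first 42 (hypothetical arguments) would print:

      argument[0] = "./args0"
      argument[1] = "first"
      argument[2] = "42"

      argv[0] is always the program name, so the real arguments start at argv[1].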

  15. Questions on Project00 • Error Log • … • … • …

  16. Slides 28-40 of Lecture 02 • Lecture 02 • Professor Pacheco • TAs • Division of work – data parallelism • Division of work – task parallelism • Div. of work – data parallelism: compute_next_value • Div. of work – task parallelism: compute_next_value • Coordination: communication, load balancing, synchronization • What we'll be doing: MPI, Pthreads, OpenMP • Types of parallel systems – shared vs distributed • Types of parallel systems – diagram • Concurrent, Parallel, and Distributed computing • Concluding Remarks (slide 40)

  17. Project01 – Improving Serial Programs • Amdahl's Law – % parallelizable • cache performance • Average Memory Access Time (AMAT) • valgrind overview (slides 21-26) • row-major strides vs column strides • blocking matrix multiply • cpp macros • Reading arrays – fscanf(file_ptr, " %e", &a[i][j]); • fread(void *ptr, size_t size, size_t nmemb, FILE *stream);

  18. Amdahl's Law – % Parallelizable • Suppose you have an enhancement or improvement in a design component. • The improvement in the performance of the whole system is limited by the fraction of the time the enhancement can be used. (Ref. CAAQA)
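      The standard statement of the law (CAAQA's form): Speedup_overall = 1 / ((1 - f) + f / s), where f is the fraction of execution time the enhancement applies to and s is the speedup of that fraction. For example, if 80% of the run is parallelizable and that part is made 4x faster, the overall speedup is 1 / (0.2 + 0.8/4) = 2.5; even with infinite speedup of the parallel part, the limit is 1 / 0.2 = 5x.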

  19. Cache Performance • (figure from CSAPP – Bryant & O'Hallaron)

  20. Average Memory Access Time (AMAT)
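      The slide's body did not survive the transcript; the standard definition is AMAT = hit time + miss rate × miss penalty. For example, a 1-cycle hit time, 5% miss rate, and 100-cycle miss penalty give AMAT = 1 + 0.05 × 100 = 6 cycles.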

  21. valgrind --help • man valgrind • MEMCHECK OPTIONS • CACHEGRIND OPTIONS • CALLGRIND OPTIONS • HELGRIND OPTIONS

  22. Preparing the program • gcc -O0 myprog.c • Running the program • ./myprog arg1 arg2 • Checking for memory leaks with valgrind • valgrind --leak-check=yes myprog arg1 arg2

  23. valgrind output
      ==19182== Address 0x1BA45050 is 0 bytes after a block of size 40 alloc'd
      ==19182==    at 0x1B8FF5CD: malloc (vg_replace_malloc.c:130)
      ==19182==    by 0x8048385: f (example.c:5)
      ==19182==    by 0x80483AB: main (example.c:11)

  24. valgrind --tool=cachegrind prog • Level 1 and Last Level caches

  25. valgrind --tool=cachegrind mm 10 10 10
      ==3574== I   refs:       1,531,400
      ==3574== I1  misses:         1,093
      ==3574== LLi misses:         1,077
      ==3574== I1  miss rate:       0.07%
      ==3574== LLi miss rate:       0.07%
      ==3574==
      ==3574== D   refs:         843,924  (542,894 rd + 301,030 wr)
      ==3574== D1  misses:         2,116  (  1,866 rd +     250 wr)
      ==3574== LLd misses:         1,695  (  1,487 rd +     208 wr)
      ==3574== D1  miss rate:        0.2% (    0.3%   +     0.0%  )
      ==3574== LLd miss rate:        0.2% (    0.2%   +     0.0%  )
      ==3574==
      ==3574== LL refs:            3,209  (  2,959 rd +     250 wr)
      ==3574== LL misses:          2,772  (  2,564 rd +     208 wr)
      ==3574== LL miss rate:         0.1% (    0.1%   +     0.0%  )

  26. valgrind callgrind • valgrind --tool=callgrind mm 10 10 10 • callgrind_annotate callgrind.out.3577

  27. row-major strides vs column strides • C arrays are stored in row-major order • row-major strides (stride 1, cache friendly) • column strides (stride = one whole row; see the sketch below)
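      Because C stores arrays in row-major order, the loop nest order sets the stride. A minimal sketch contrasting the two traversals (the array name and size are illustrative):

      #define N 1024
      static double a[N][N];

      /* Row-major traversal: consecutive j values touch adjacent
         doubles (stride 1), so most accesses hit in cache. */
      void touch_by_rows(void) {
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++)
                  a[i][j] = 0.0;
      }

      /* Column traversal: consecutive i values are N doubles apart
         (stride N), so for large N nearly every access misses. */
      void touch_by_columns(void) {
          for (int j = 0; j < N; j++)
              for (int i = 0; i < N; i++)
                  a[i][j] = 0.0;
      }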

  28. Matrix Multiply – blocking the loops, CAAQA page 90
      /* BS is the block size (CAAQA calls it B; renamed here to avoid
         clashing with matrix B). C must be zeroed before these loops,
         since each (jj, kk) block accumulates a partial sum into it.
         min is the macro defined on the next slide. */
      for (jj = 0; jj < n; jj += BS)
          for (kk = 0; kk < k; kk += BS)
              for (i = 0; i < m; ++i)
                  for (j = jj; j < min(jj+BS, n); ++j)
                      for (x = kk; x < min(kk+BS, k); ++x)
                          C[i][j] = C[i][j] + A[i][x] * B[x][j];

  29. cpp macros • cpp – C preprocessor directives • #define MAXLINE 1024 • #define min(a,b) ((a)<(b) ? (a) : (b)) • This form evaluates each argument twice; GCC's statement-expression version evaluates each exactly once: • #define max(a,b) \ ({ __typeof__ (a) _a = (a); \ __typeof__ (b) _b = (b); \ _a > _b ? _a : _b; }) • http://stackoverflow.com/questions/3437404/min-and-max-in-c
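      Why the statement-expression form exists: the plain macro evaluates each argument twice, which goes wrong when an argument has a side effect. A small illustration with made-up values:

      #include <stdio.h>
      #define min(a,b) ((a)<(b) ? (a) : (b))

      int main(void) {
          int i = 0;
          int v = min(i++, 10);          /* expands to ((i++)<(10) ? (i++) : (10)) */
          printf("v=%d i=%d\n", v, i);   /* prints v=1 i=2, not v=0 i=1 */
          return 0;
      }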

  30. Pointer Arithmetic • If p is a pointer, what is p++? It yields the current value of p, then advances p by sizeof(*p) bytes, i.e. to the next element. • If a is the name of an array, then a decays to a pointer to the base of the array, i.e. the same address as &a[0][0].

  31. Array reference macro • & is the address-of operator: &v is the address of v • Recall from data structures: • &A[i][j] = &A[0][0] + skip i rows + skip j elements • = &A[0][0] + i * rowsize + j * elementsize • = &A[0][0] + i * numcols * elementsize + j * elementsize • A(i,j) = address-of macro for A[i][j] when A is stored as a flat array with n columns: • #define A(i,j) (A + (i)*n + (j))

  32. Matrix multiply with address macro
      for (i = 0; i < m; ++i) {
          for (j = 0; j < n; ++j) {
              *C(i,j) = 0.0;
              for (x = 0; x < k; ++x) {
                  *C(i,j) = *C(i,j) + *A(i,x) * *B(x,j);
              }
          }
      }
      Note the pointer dereferences: each macro yields an address, so every use is prefixed with *.

  33. Libraries • BLAS – Basic Linear Algebra Subprograms (1970s) • BLAS2 – matrix-vector operations • BLAS3 – matrix-matrix operations such as dgemm (see the sketch below) • Boost
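      BLAS3's dgemm computes C = alpha*A*B + beta*C. A sketch of calling it through the CBLAS interface, assuming contiguous row-major storage (link against a BLAS implementation such as OpenBLAS or ATLAS):

      #include <cblas.h>

      /* C (m x n) = A (m x k) * B (k x n); the leading dimension of each
         matrix equals its row length because storage is row-major. */
      void matmul_blas(int m, int n, int k,
                       const double *A, const double *B, double *C) {
          cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                      m, n, k,
                      1.0, A, k,     /* alpha = 1.0, lda = k */
                           B, n,     /* ldb = n              */
                      0.0, C, n);    /* beta  = 0.0, ldc = n */
      }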

  34. Reading and Writing matrices • fprintf(fp, "%e ", a[i][j]); • fscanf(fp, "%e", &a[i][j]); • Problem?

  35. fread and fwrite
      #include <stdio.h>
      size_t fread(void *ptr, size_t size, size_t nmemb, FILE *str);
      size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *str);
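      This is presumably the point of slide 34's "Problem?": text I/O with %e is slow and loses precision on the round trip, while fwrite/fread move the raw bytes. A minimal sketch assuming the matrix is stored as one contiguous block of doubles (the function names are illustrative, not from the course code):

      #include <stdio.h>

      /* Write an m x n matrix as raw doubles; returns 0 on success. */
      int save_matrix(const char *path, const double *a, size_t m, size_t n) {
          FILE *fp = fopen(path, "wb");
          if (!fp) return -1;
          size_t nwritten = fwrite(a, sizeof(double), m * n, fp);
          fclose(fp);
          return nwritten == m * n ? 0 : -1;
      }

      /* Read it back into a caller-supplied buffer of m*n doubles. */
      int load_matrix(const char *path, double *a, size_t m, size_t n) {
          FILE *fp = fopen(path, "rb");
          if (!fp) return -1;
          size_t nread = fread(a, sizeof(double), m * n, fp);
          fclose(fp);
          return nread == m * n ? 0 : -1;
      }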
