ECE 1747H : Parallel Programming

  1. ECE 1747H : Parallel Programming Lecture 1-2: Overview

  2. ECE 1747H • Meeting time: Mon 4-6 PM • Meeting place: RS 310 • Instructor: Cristiana Amza, http://www.eecg.toronto.edu/~amza amza@eecg.toronto.edu, office Pratt 484E

  3. Material • Course notes • Web material (e.g., published papers) • No required textbook, some recommended

  4. Prerequisites • Programming in C or C++ • Data structures • Basics of machine architecture • Basics of network programming • Please send e-mail to ecehelp@ece.toronto.edu to get an eecg account (include your name, student ID, class, and instructor), and to madalin@cs.toronto.edu to get an MPI account (on our research cluster, for the MPI homework).

  5. Other than that • No written homework, no exams • 10% for each small programming assignment (expect 1-2) • 10% class participation • The rest comes from the major course project

  6. Programming Project • Parallelizing a sequential program, or improving the performance or the functionality of a parallel program • Project proposal and final report • In-class project proposal and final report presentation • “Sample” project presentation can be posted

  7. Parallelism (1 of 2) • Ability to execute different parts of a single program concurrently on different machines • Goal: shorter running time • Grain of parallelism: how big are the parts? • Can be an instruction, statement, procedure, … • Will mainly focus on relatively coarse-grain parallelism

  8. Parallelism (2 of 2) • Coarse-grain parallelism mainly applicable to long-running, scientific programs • Examples: weather prediction, prime number factorization, simulations, …

  9. Lecture material (1 of 4) • Parallelism • What is parallelism? • What can be parallelized? • Inhibitors of parallelism: dependences

  10. Lecture material (2 of 4) • Standard models of parallelism • shared memory (Pthreads) • message passing (MPI) • shared memory + data parallelism (OpenMP) • Classes of applications • scientific • servers

  11. Lecture material (3 of 4) • Transaction processing • classic programming model for databases • now being proposed for scientific programs

  12. Lecture material (4 of 4) • Performance of parallel & distributed programs • architecture-independent optimization • architecture-dependent optimization

  13. Course Organization • First 2-3 weeks of semester: • lectures on parallelism, patterns, models • small programming assignments (1-2), done individually • Rest of the semester: • major programming project, done individually or in a small group • Research paper discussions

  14. Parallel vs. Distributed Programming Parallel programming has matured: • A few standard programming models • A few common machine architectures • Portability between models and architectures

  15. Bottom Line • Programmer can now focus on program and use suitable programming model • Reasonable hope of portability • Problem: much performance optimization is still platform-dependent • Performance portability is a problem

  16. ECE 1747H: Parallel Programming Lecture 1-2: Parallelism, Dependences

  17. Parallelism • Ability to execute different parts of a program concurrently on different machines • Goal: shorten execution time

  18. Measures of Performance • To computer scientists: speedup, execution time. • To applications people: size of problem, accuracy of solution, etc.

  19. Speedup of Algorithm • Speedup of algorithm = sequential execution time / execution time on p processors (with the same data set). [Figure: speedup plotted against the number of processors p]

  20. Speedup on Problem • Speedup on problem = sequential execution time of the best known sequential algorithm / execution time on p processors. • A more honest measure of performance. • Avoids picking an easily parallelizable algorithm with poor sequential execution time.
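
In symbols (a restatement of the two definitions above; the names T_seq, T_best, and T_p are shorthand introduced here, not from the slides):

    S_{\mathrm{alg}}(p)  = \frac{T_{\mathrm{seq}}}{T_p}    % vs. the sequential version of the same algorithm
    S_{\mathrm{prob}}(p) = \frac{T_{\mathrm{best}}}{T_p}   % vs. the best known sequential algorithm

where T_seq is the sequential execution time of the parallelized algorithm, T_best is the execution time of the best known sequential algorithm, and T_p is the execution time on p processors, all for the same data set.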

  21. What Speedups Can You Get? • Linear speedup • Confusing term: implicitly means a 1-to-1 speedup per processor. • (Almost always) as good as you can do. • Sub-linear speedup: more common, due to overheads of startup, synchronization, communication, etc.

  22. Speedup [Figure: speedup vs. number of processors p, showing the ideal “linear” curve and the “actual” curve]

  23. Scalability • No really precise definition. • Roughly speaking, a program is said to scale to a certain number of processors p, if going from p-1 to p processors results in some acceptable improvement in speedup (for instance, an increase of 0.5).

  24. Super-linear Speedup? • Due to cache/memory effects: • Subparts fit into cache/memory of each node. • Whole problem does not fit in cache/memory of a single node. • Nondeterminism in search problems. • One thread finds near-optimal solution very quickly => leads to drastic pruning of search space.

  25. Cardinal Performance Rule • Don’t leave (too) much of your code sequential!

  26. Amdahl’s Law • If 1/s of the program is sequential, then you can never get a speedup better than s. • (Normalized) sequential execution time = 1/s + (1 - 1/s) = 1 • Best parallel execution time on p processors = 1/s + (1 - 1/s)/p • When p goes to infinity, parallel execution time = 1/s • Speedup = s.
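
Writing out the slide's argument in LaTeX (a restatement of the bullets above, nothing new):

    T_1 = \frac{1}{s} + \left(1 - \frac{1}{s}\right) = 1          % normalized sequential time
    T_p = \frac{1}{s} + \frac{1 - 1/s}{p}                          % best possible time on p processors
    \lim_{p \to \infty} T_p = \frac{1}{s}
    \text{speedup} = \frac{T_1}{T_p} \le \frac{1}{1/s} = s

For instance, if 10% of the program is sequential (s = 10), no number of processors can push the speedup past 10.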

  27. Why keep something sequential? • Some parts of the program are not parallelizable (because of dependences) • Some parts may be parallelizable, but the overhead dwarfs the gain from parallel execution.

  28. When can two statements execute in parallel? • On one processor: statement 1; statement 2; • On two processors: processor 1 executes statement1, while processor 2 executes statement2.

  29. Fundamental Assumption • Processors execute independently: no control over order of execution between processors

  30. When can 2 statements execute in parallel? • Possibility 1: processor 1 executes statement1, then processor 2 executes statement2. • Possibility 2: processor 2 executes statement2, then processor 1 executes statement1.

  31. When can 2 statements execute in parallel? • Their order of execution must not matter! • In other words, statement1; statement2; must be equivalent to statement2; statement1;

  32. Example 1 a = 1; b = a; • Statements cannot be executed in parallel • Program modifications may make it possible.
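
One such modification, sketched here as an illustration (not from the slides): since a is known to hold 1, the second statement can be rewritten so that it no longer reads a, which removes the true dependence.

    a = 1;
    b = 1;   /* was: b = a;  copy/constant propagation removes the read of a */
    /* The two statements now touch different variables and may run in parallel. */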

  33. Example 2 a = f(x); b = a; • Could be rewritten (e.g., as b = f(x)), but it may not be wise to change the program: f(x) would be computed twice, so sequential execution would take longer.

  34. Example 3 a = 1; a = 2; • Statements cannot be executed in parallel.

  35. True dependence Statements S1, S2. S2 has a true dependence on S1 iff S2 reads a value written by S1.

  36. Anti-dependence Statements S1, S2. S2 has an anti-dependence on S1 iff S2 writes a value read by S1.

  37. Output Dependence Statements S1, S2. S2 has an output dependence on S1 iff S2 writes a variable written by S1.

  38. When can 2 statements execute in parallel? S1 and S2 can execute in parallel iff there are no dependences between S1 and S2 • true dependences • anti-dependences • output dependences Some dependences can be removed.
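
For instance, anti- and output dependences only reuse a storage name rather than pass a value forward, so they can often be removed by renaming. A minimal sketch (the variables x, x2, and y are made up for illustration):

    /* S1 reads x, S2 writes x: an anti-dependence forces S1 before S2. */
    y = x + 1;    /* S1 */
    x = 42;       /* S2 */

    /* Writing to a fresh name removes the dependence: */
    y  = x + 1;   /* S1 */
    x2 = 42;      /* S2 no longer conflicts with S1, so the two may run in parallel */

True dependences cannot be removed by renaming alone, since S2 actually needs the value S1 produces (though, as in Example 1, other rewrites sometimes help).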

  39. Example 4 • Most parallelism occurs in loops. for(i=0; i<100; i++) a[i] = i; • No dependences. • Iterations can be executed in parallel.
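
As a concrete illustration (one possible implementation, using OpenMP, which the course covers later; not part of the original slide), the independent iterations can simply be split across threads:

    #include <omp.h>

    int a[100];

    void init_a(void)
    {
        /* Each iteration writes a distinct element and reads nothing written
           by another iteration, so the iterations can safely run in parallel. */
        #pragma omp parallel for
        for (int i = 0; i < 100; i++)
            a[i] = i;
    }

Compiled with OpenMP support (e.g., gcc -fopenmp), the loop's iterations are divided among the available threads.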

  40. Example 5 for(i=0; i<100; i++) { a[i] = i; b[i] = 2*i; } Iterations and statements can be executed in parallel.

  41. Example 6 for(i=0;i<100;i++) a[i] = i; for(i=0;i<100;i++) b[i] = 2*i; Iterations and loops can be executed in parallel.

  42. Example 7 for(i=0; i<100; i++) a[i] = a[i] + 100; • There is a dependence … on itself! • Loop is still parallelizable.

  43. Example 8 for( i=0; i<100; i++ ) a[i] = f(a[i-1]); • Dependence between a[i] and a[i-1]. • Loop iterations are not parallelizable.

  44. Loop-carried dependence • A loop carried dependence is a dependence that is present only if the statements are part of the execution of a loop. • Otherwise, we call it a loop-independent dependence. • Loop-carried dependences prevent loop iteration parallelization.

  45. Example 9 for(i=0; i<100; i++ ) for(j=0; j<100; j++ ) a[i][j] = f(a[i][j-1]); • Loop-independent dependence on i. • Loop-carried dependence on j. • Outer loop can be parallelized, inner loop cannot.

  46. Example 10 for( j=0; j<100; j++ ) for( i=0; i<100; i++ ) a[i][j] = f(a[i][j-1]); • Inner loop can be parallelized, outer loop cannot. • Less desirable situation. • Loop interchange is sometimes possible.
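
A sketch of that interchange (standard loop interchange, shown as an illustration; the declarations of a and f are hypothetical, and j starts at 1 here so that a[i][j-1] stays in bounds):

    double a[100][100];
    double f(double x);    /* assumed defined elsewhere */

    void compute(void)
    {
        /* After swapping the loops, the dependence-carrying j loop is innermost,
           and the i iterations are independent, as in Example 9.               */
        #pragma omp parallel for
        for (int i = 0; i < 100; i++)
            for (int j = 1; j < 100; j++)
                a[i][j] = f(a[i][j-1]);
    }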

  47. Level of loop-carried dependence • Is the nesting depth of the loop that carries the dependence. • Indicates which loops can be parallelized.

  48. Be careful … Example 11 printf(“a”); printf(“b”); Statements have a hidden output dependence due to the output stream.

  49. Be careful … Example 12 a = f(x); b = g(x); Statements could have a hidden dependence if f and g update the same variable. Also depends on what f and g can do to x.

  50. Be careful … Example 13 for(i=0; i<100; i++) a[i+10] = f(a[i]); • Dependence between a[10], a[20], … • Dependence between a[11], a[21], … • … • Some parallel execution is possible.
