Advanced Computing Techniques & Applications Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn
Course Profile • Lecturer: Dr. Bo Yuan • Contact • Phone: 2603 6067 • E-mail: yuanb@sz.tsinghua.edu.cn • Room: F-301B • Time: 10:25 am – 12:00pm, Friday • Venue: CI-208 • Teaching Assistant • Mr. Shiquan Yang
We will study ... • MPI • Message Passing Interface • API for distributed memory parallel computing (multiple processes) • The dominant model used in cluster computing • OpenMP • Open Multi-Processing • API for shared memory parallel computing (multiple threads) • GPU Computing with CUDA • Graphics Processing Unit • Compute Unified Device Architecture • API for shared memory parallel computing in C (multiple threads) • Parallel Matlab • A popular high-level technical computing language and interactive environment
Aims & Objectives • Learning Objectives • Understand the main issues and core techniques in parallel computing. • Obtain first-hand experience in Cloud Computing. • Be able to develop MPI-based parallel programs. • Be able to develop OpenMP-based parallel programs. • Be able to develop GPU-based parallel programs. • Be able to develop Matlab-based parallel programs. • Graduate Attributes • In-depth Knowledge of the Field of Study • Effective Communication • Independence and Teamwork • Critical Judgment
Learning Activities • Lecture (10) • Introduction (3) • MPI and OpenMP (3) • GPU Computing (3) • Invited Talk (1) • Practice (3) • GPU Programming (1) • Cloud Computing (1) • Parallel Matlab (1) • Others (2) • Industry Tour (1) • Final Exam (1)
Assessment • Final Exam (50%) • Assignment 1 • Weight: 20% • Task: Parallel Programming using MPI • Type: Individual • Assignment 2 • Weight: 10% • Task: Parallel Programming using OpenMP • Type: Individual • Assignment 3 • Weight: 20% • Task: Parallel Programming using CUDA • Type: Individual
Learning Resources • Books • http://www.mcs.anl.gov/~itf/dbpp/ • https://computing.llnl.gov/tutorials/parallel_comp/ • http://www-users.cs.umn.edu/~karypis/parbook/ • Journals • http://www.computer.org/tpds • http://www.journals.elsevier.com/parallel-computing/ • http://www.journals.elsevier.com/journal-of-parallel-and-distributed-computing/ • Amazon Cloud Computing Services • http://aws.amazon.com • CUDA • http://developer.nvidia.com
Rules & Policies • Plagiarism • Plagiarism is the act of misrepresenting as one's own original work the ideas, interpretations, words or creative works of another. • Directly copying paragraphs, sentences, a single sentence or significant parts of a sentence. • Presenting work done in collaboration with others as one's own independent work. • Copying ideas, concepts, research results, computer code, statistical tables, designs, images, sounds or text, or any combination of these. • Paraphrasing, summarizing or simply rearranging another person's words or ideas without changing the basic structure and/or meaning of the text. • Copying or adapting another student's original work into a submitted assessment item.
Rules & Policies • Late Submission • Late submissions will incur a penalty of 10% of the total marks for each day that the submission is late (including weekends). Submissions more than 5 days late will not be accepted. • Assumed Background • Acquaintance with the C language is essential. • Knowledge of computer architecture is beneficial. • We have CUDA-capable GPU cards available!
Half Adder • A: Augend • B: Addend • S: Sum • C: Carry
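A small C sketch of the same logic (added for illustration, not part of the slide): the sum bit S is the XOR of the two inputs and the carry bit C is their AND.

  /* Half adder truth table: S = A XOR B, C = A AND B */
  #include <stdio.h>

  int main(void) {
      for (int a = 0; a <= 1; a++) {
          for (int b = 0; b <= 1; b++) {
              int s = a ^ b;   /* sum bit   */
              int c = a & b;   /* carry bit */
              printf("A=%d B=%d -> S=%d C=%d\n", a, b, s, c);
          }
      }
      return 0;
  }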
Electronic Numerical Integrator And Computer • Programming • Programmed via switches and cables • Reprogramming usually took days. • I/O: Punched Cards • Speed (10-digit decimal numbers) • Machine Cycle: 5,000 cycles per second • Multiplication: 357 times per second • Division/Square Root: 35 times per second
Personal Computer in the 1980s [Images: BASIC; IBM PC/AT]
Top 500 Supercomputers [Chart: performance in GFLOPS]
Complexity of Computing • A: 10×100, B: 100×5, C: 5×50 • (AB)C vs. A(BC) • A: N×N, B: N×N, C = AB • Time Complexity: O(N³) • Space Complexity: O(1)
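A quick sketch (an addition, reusing the dimensions above) of why the association order matters: multiplying a p×q matrix by a q×r matrix costs p·q·r scalar multiplications.

  /* Scalar-multiplication counts for the two association orders */
  #include <stdio.h>

  int main(void) {
      /* A: 10x100, B: 100x5, C: 5x50 (sizes from the slide) */
      long ab_c = 10L*100*5 + 10L*5*50;    /* (AB)C:  5000 + 2500  =  7500 */
      long a_bc = 100L*5*50 + 10L*100*50;  /* A(BC): 25000 + 50000 = 75000 */
      printf("(AB)C: %ld multiplications\n", ab_c);
      printf("A(BC): %ld multiplications\n", a_bc);
      return 0;
  }

Both orders produce the same 10×50 result, but (AB)C needs 7,500 scalar multiplications while A(BC) needs 75,000.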
Why Parallel Computing? • Why we need ever-increasing performance: • Big Data Analysis • Climate Modeling • Gaming • Why we need to build parallel systems: • Increasing the speed of integrated circuits leads to overheating. • Increasing the number of transistors leads to multi-core processors. • Why we need to learn parallel programming: • Running multiple instances of the same program is unlikely to help. • Serial programs need to be rewritten to make them parallel.
Sum Example [Figure: eight values 8, 19, 7, 15, 7, 13, 12, 14 held by cores 0–7; core 0 adds them all one by one to obtain 95]
Sum Example [Figure: the same eight values summed as a tree: pairwise partial sums 27, 22, 20, 26, then 49 and 46, then the total 95 on core 0]
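The pairwise reduction in the figure can be sketched in plain C (an illustration, run serially here; in the parallel version each pair within a round would be handled by a different core).

  /* Tree-style sum of the eight values from the slide */
  #include <stdio.h>

  int main(void) {
      int v[8] = {8, 19, 7, 15, 7, 13, 12, 14};
      for (int stride = 1; stride < 8; stride *= 2)        /* rounds: 1, 2, 4 */
          for (int i = 0; i + stride < 8; i += 2 * stride)
              v[i] += v[i + stride];                       /* pairwise partial sums */
      printf("sum = %d\n", v[0]);                          /* prints 95 */
      return 0;
  }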
Levels of Parallelism • Embarrassingly Parallel • No dependency or communication between parallel tasks • Coarse-Grained Parallelism • Infrequent communication, large amounts of computation • Fine-Grained Parallelism • Frequent communication, small amounts of computation • Greater potential for parallelism • More overhead • Not Parallel • Carrying a baby to term takes 9 months. • Can this be done in 1 month by having 9 women?
Data Decomposition [Figure: the data set split between 2 cores]
Granularity [Figure: a finer decomposition across 8 cores]
Coordination • Communication • Sending partial results to other cores • Load Balancing • Wooden Barrel Principle: the shortest stave limits the barrel's capacity, just as the slowest core limits overall performance. • Synchronization • Race Condition
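A minimal OpenMP sketch of the race condition mentioned above (an addition, not from the slides): unsynchronized increments of a shared counter can lose updates, while a reduction gives the correct total.

  /* Compile with OpenMP enabled, e.g.: gcc -fopenmp race.c */
  #include <stdio.h>

  int main(void) {
      long unsafe = 0, safe = 0;

      #pragma omp parallel for          /* data race: threads update 'unsafe' concurrently */
      for (int i = 0; i < 1000000; i++)
          unsafe++;                     /* some updates are likely to be lost              */

      #pragma omp parallel for reduction(+:safe)
      for (int i = 0; i < 1000000; i++)
          safe++;                       /* per-thread partial sums combined safely         */

      printf("unsafe = %ld, safe = %ld\n", unsafe, safe);
      return 0;
  }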
Data Dependency • Bernstein's Conditions • Dependency types: Flow Dependency, Output Dependency • Examples

Flow dependency (d depends on c, which depends on a and b):
  function Dep(a, b)
      c = a·b
      d = 3·c
  end function

No dependency between the statements:
  function NoDep(a, b)
      c = a·b
      d = 3·b
      e = a + b
  end function
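For reference, the standard formulation of Bernstein's conditions (not spelled out on the slide): two statements P1 and P2 can run in parallel when I1 ∩ O2 = ∅, I2 ∩ O1 = ∅ and O1 ∩ O2 = ∅, where Ii and Oi are the sets of variables read and written by Pi. In Dep the conditions fail because c is written by the first statement and read by the second; in NoDep all three conditions hold.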
What is not parallel?

Loop-Carried Dependence:
  for (k=5; k<N; k++) {
      b[k] = DoSomething(k);
      a[k] = b[k-5] + MoreStuff(k);
  }

Recurrences:
  for (i=1; i<N; i++)
      a[i] = a[i-1] + b[i];

Atypical Loop-Carried Dependence:
  wrap = a[0]*b[0];
  for (i=1; i<N; i++) {
      c[i] = wrap;
      wrap = a[i]*b[i];
      d[i] = 2*wrap;
  }

Solution:
  for (i=1; i<N; i++) {
      wrap = a[i-1]*b[i-1];
      c[i] = wrap;
      wrap = a[i]*b[i];
      d[i] = 2*wrap;
  }
What is not parallel?

Induction Variables:
  i1 = 4;
  i2 = 0;
  for (k=1; k<N; k++) {
      B[i1++] = function1(k, q, r);
      i2 += k;
      A[i2] = function2(k, r, q);
  }

Solution:
  for (k=1; k<N; k++) {
      B[k+3] = function1(k, q, r);
      i2 = (k*k + k)/2;
      A[i2] = function2(k, r, q);
  }
Types of Parallelism • Instruction-Level Parallelism • Task Parallelism • Different tasks on the same or different sets of data • Data Parallelism • Similar tasks on different sets of data • Example • 5 TAs, 100 exam papers, 5 questions • How could this be done with task parallelism? • How could it be done with data parallelism? (One possible split is sketched below.)
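One possible split, sketched with OpenMP (an illustration only; grade_question and grade_paper are hypothetical helpers): task parallelism gives each TA one question to grade across all papers, while data parallelism gives each TA a share of whole papers.

  /* Compile with OpenMP enabled, e.g.: gcc -fopenmp grading.c */
  #include <stdio.h>

  static void grade_question(int q) { printf("question %d graded on every paper\n", q); }
  static void grade_paper(int p)    { printf("paper %d graded completely\n", p); }

  int main(void) {
      /* Task parallelism: 5 different tasks (questions) over the same pile of papers */
      #pragma omp parallel for
      for (int q = 0; q < 5; q++)
          grade_question(q);

      /* Data parallelism: the same task applied to different subsets of the papers */
      #pragma omp parallel for
      for (int p = 0; p < 100; p++)
          grade_paper(p);

      return 0;
  }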
Assembly Line • How long does it take to produce a single car? • How many cars can be worked on at the same time? • How long is the gap between finishing the first and the second car? • The longest stage on the assembly line determines the throughput. [Figure: three stages with times 15, 20 and 5]
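Using the stage times shown (15, 20 and 5, units unspecified): a single car takes 15 + 20 + 5 = 40 time units end to end; up to three cars can be on the line at once, one per stage; and in steady state a finished car rolls off every 20 units, so the gap between the first and second finished car is 20. The slowest stage sets the throughput.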
Instruction Pipeline 1: Add 1 to R5. 2: Copy R5 to R6. • IF: Instruction fetch • ID: Instruction decode and register fetch • EX: Execute • MEM: Memory access • WB: Register write back
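A simplified timing sketch of the two instructions in the five-stage pipeline (added for illustration; actual behaviour depends on the forwarding hardware):

  Cycle:             1    2    3    4    5    6
  1: Add 1 to R5     IF   ID   EX   MEM  WB
  2: Copy R5 to R6        IF   ID   EX   MEM  WB

Instruction 2 wants to read R5 during its ID stage (cycle 3), but instruction 1 only writes R5 back in cycle 5, so the pipeline must either stall or forward the result directly from the EX/MEM stages.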
Computing Models • Concurrent Computing • Multiple tasks can be in progress at any instant. • Parallel Computing • Multiple tasks can be run simultaneously. • Distributed Computing • Multiple programs on networked computers work collaboratively. • Cluster Computing • Homogeneous, Dedicated, Centralized • Grid Computing • Heterogeneous, Loosely Coupled, Autonomous, Geographically Distributed
Concurrent vs. Parallel [Figure: a single core interleaving Job 1 and Job 2 (concurrent execution) vs. Core 1 and Core 2 running Jobs 1–4 at the same time (parallel execution)]
Process & Thread • Process • An instance of a computer program being executed. • Threads • The smallest units of execution scheduled by the OS • Exist as subsets of a process. • Share the resources of their parent process. • Switching between threads is much faster than switching between processes. • Multithreading • Better use of computing resources • Concurrent execution • Makes the application more responsive [Figure: threads within a process]
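A minimal POSIX threads sketch (an addition): two threads created inside one process read the same global variable without any copying, which is the shared-resources point above.

  /* Compile with: gcc threads.c -lpthread */
  #include <stdio.h>
  #include <pthread.h>

  static int shared = 42;                     /* lives in the process, visible to all threads */

  static void *worker(void *arg) {
      printf("thread %ld sees shared = %d\n", (long)arg, shared);
      return NULL;
  }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, worker, (void *)1L);
      pthread_create(&t2, NULL, worker, (void *)2L);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      return 0;
  }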
Parallel Processes [Figure: one program launched as Process 1 on Node 1, Process 2 on Node 2 and Process 3 on Node 3] • Single Program, Multiple Data (SPMD)
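A minimal MPI sketch of the SPMD idea (an illustrative preview of the MPI lectures): the same program is launched as several processes, e.g. with mpirun -np 3, and each process uses its rank to decide which part of the work is its own.

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?      */
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes total */
      printf("process %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
  }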
MapReduce vs. GPU • Pros: • Runs on clusters of hundreds or thousands of commodity computers. • Can handle massive amounts of data with fault tolerance. • Minimal effort required from programmers: Map & Reduce • Cons: • Intermediate results are stored on disk and transferred over network links. • Suitable only for processing independent or loosely coupled jobs. • High upfront hardware cost and operational cost • Low efficiency: GFLOPS per Watt, GFLOPS per Dollar