
CIS669 Distributed and Parallel Processing



Presentation Transcript


  1. CIS669 Distributed and Parallel Processing Lecture 2: Parallel System Architectures and Performance Evaluation Yuan Shi Spring 2002

  2. Parallel System Architectures “Lacking the dignity of a proper discipline, [it] was an orphan in the world of knowledge. The subject became a rag-bag filled with odds and ends of knowledge and pseudo-knowledge, of Biblical dogmas, traveler's tales, and mythical imaginings.” [Boo83, p.100; textbook: p.15]

  3. Where are the flies? • Computers can be built in many different ways for many different applications. Finding common criteria to compare architectures is VERY difficult. • Once built, computational performance (speed) varies greatly from application to application. It is equally difficult to find a common criterion for measuring the “goodness” of a given architecture. • Finally, programming difficulty varies from architecture to architecture. Each “parallel programming” environment dictates a specific programming style that is typically more complex than the serial programming interface.

  4. Most Recent Examples • Cilk (http://supertech.lcs.mit.edu/cilk/) MIT • Passages (http://www.cis.udel.edu/~hiper/hiperspace/projects/gary.htm) Udel • EARTH (http://www.capsl.udel.edu/CURRENTPROJ/EARTH/) Udel

  5. First Battle: Fine vs. Coarse Grain Parallelism • Fine grain Pros: • Large degree of parallelism (do many things at one time) • Fine grain Cons: • Large communication overhead • Difficult programming model • Less reliable

  6. Coarse Grain Parallelism • Coarse Grain Pros: • Ease of programming • More reliable • Less communication overhead • Coarse Grain Cons: • Lower degree of parallelism (do fewer things at the same time)

  7. Question: How to determine the best degree of parallelism? • Timing Models • A timing model is a method for calculating the performance of running programs on any hardware architecture. • A timing model can also be used to calculate the scalability of the running programs and architecture, because scalability is both hardware and application dependent.

  8. Introduction to Timing Models • Time Complexity: T(n) = O(f(n)) ==> The time to run a program on an n-sized input is bounded above by f(n); it will take no more than on the order of f(n) steps to compute n inputs. • Timing Models: • Single Processor: Ts(n) = f(n)/W ==> The time to run a program is approximately the estimated algorithmic steps f(n) divided by the single-processor power W (algorithmic steps per second).
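A minimal Python sketch of the single-processor model; the step count and calibrated processor speed below are illustrative values, not taken from the lecture:

def serial_time(f_n, W):
    # Ts(n) = f(n) / W: estimated algorithmic steps divided by processor power (steps/second)
    return f_n / W

# Example: a 10^9-step computation on a processor calibrated at 5*10^8 steps/second
print(serial_time(1e9, 5e8))   # -> 2.0 seconds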

  9. Timing Model for Multiprocessors • T(n,p) = Tcompute + Tcommunication + TIO = f(n)/(pW) + g(n,p)/μ + k(n,p)/B • T(n,p) = estimated running time for input size n and p processors • g(n,p) = estimated communication volume (bytes) • k(n,p) = estimated IO volume (bytes) • W = single processor power in algorithmic steps/second • μ = interconnection network speed in bytes/second • B = IO speed in bytes/second
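The multiprocessor model translates directly into a small Python helper. This is a sketch only; the parameter names (f_n, g_np, k_np, mu) stand in for the symbols on the slide:

def parallel_time(f_n, g_np, k_np, p, W, mu, B):
    # T(n,p) = f(n)/(p*W) + g(n,p)/mu + k(n,p)/B
    t_compute = f_n / (p * W)   # computation shared by p processors of power W
    t_comm    = g_np / mu       # bytes moved over the interconnection network
    t_io      = k_np / B        # bytes read from / written to storage
    return t_compute + t_comm + t_io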

  10. How to obtain values for W, μ and B? • Each parameter represents a RANGE of values. • Each parameter can be calibrated using computational experiments. • Ts(n) = f(n)/W can be used to derive W = f(n)/Ts(n). • Setting p = 2, T(n,p) can be used to derive μ = g(n,p)/(Tp(n) - f(n)/(pW)) by removing the IO part (easily done). • Instrumenting the sequential source code can derive k(n,p) and B easily.
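A sketch of the calibration procedure just described, assuming f(n), g(n,p) and the measured wall-clock times are already known for the chosen input size:

def calibrate_W(f_n, measured_Ts):
    # W = f(n) / Ts(n), from a measured single-processor run
    return f_n / measured_Ts

def calibrate_mu(g_np, measured_Tp, f_n, p, W):
    # mu = g(n,p) / (Tp(n) - f(n)/(p*W)), after removing the IO part
    return g_np / (measured_Tp - f_n / (p * W))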

  11. Practical Example • Matrix Multiplication: A x B ==> C. • Assumptions: • Each matrix is n x n elements. • Each element is double precision (8 bytes). • Timing Model (using the O(n^3) algorithm): • Tp(n) = n^3/(pW) + g(n)/μ + k(n)/B • Ignoring IO to simplify: Tp(n) = n^3/(pW) + g(n)/μ • Observation: If p = n^2, the system could be VERY FAST since each dot product is computed on an independent processor in parallel with the others. Degree of parallelism = n^2 (fine grain).
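Plugging the O(n^3) algorithm into the model gives a one-line estimator. This is a hypothetical helper, not code from the lecture; the communication volume g is left as a parameter because it depends on how the matrices are partitioned:

def matmul_time(n, p, W, mu, g_bytes=0.0):
    # Tp(n) = n^3/(p*W) + g/mu, ignoring IO as on the slide
    return n ** 3 / (p * W) + g_bytes / mu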

  12. Quantitative Arguments for Coarse Grain Parallelism • What about g(n,p)? A dot product requires one row of A and one column of B ==> minimally 2 x 8 x n = 16n bytes per processor, or 16n^3 bytes transmitted across the network in total. • Comparing g(n,p)/μ with f(n)/W, f(n)/W should be smaller (faster) since W (in GHz, billions of steps/second) is typically >> μ (in MBps). [Figure: processors P1 … Px connected by an interconnection network]
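A quick numerical illustration of the argument, using assumed (not measured) values W = 10^9 steps/second and mu = 10^8 bytes/second: under the fine-grain decomposition the compute term all but vanishes, but the 16n^3 bytes of communication dwarf the serial running time.

n  = 1000        # matrix dimension (illustrative)
W  = 1e9         # processor power in steps/second (assumed)
mu = 1e8         # network speed in bytes/second (assumed, ~100 MB/s)

p = n * n                              # fine grain: one processor per dot product
t_compute = n ** 3 / (p * W)           # essentially zero
t_comm    = 16 * n ** 3 / mu           # every dot product ships its row and column
print(f"fine grain: compute={t_compute:.6f}s  comm={t_comm:.1f}s")   # comm = 160 s
print(f"serial    : compute={n ** 3 / W:.1f}s  comm=0s")             # 1.0 s, no network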

  13. Example II: Massively Parallel Potentials • Fractal calculation involves solving massively many independent equations in the complex plane in order to produce color indices (the number of iterations until a point diverges outside a pre-defined box: http://aleph0.clarku.edu/~djoyce/julia/explorer.html), yielding a striking-looking image. • Ref: http://www.cis.temple.edu/~shi.
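Each pixel's color index is an independent iteration count, so the work parallelizes trivially. A minimal Python sketch; the Julia constant, grid and iteration limit are illustrative choices, not taken from the referenced pages:

from multiprocessing import Pool

def color_index(z, c=complex(-0.8, 0.156), max_iter=256, bound=2.0):
    # Number of iterations of z <- z*z + c before z leaves the pre-defined box
    for i in range(max_iter):
        if abs(z) > bound:
            return i
        z = z * z + c
    return max_iter

if __name__ == "__main__":
    points = [complex(x / 100.0, y / 100.0)
              for x in range(-150, 150) for y in range(-150, 150)]
    with Pool() as pool:                  # every point is computed independently
        indices = pool.map(color_index, points)
    print(len(indices), "color indices computed")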

  14. Conclusions • We need to calculate the proper degree of parallelism BEFORE implementing a software/hardware solution. • Hardware technologies are advancing rapidly; we need a generic architecture platform that can leverage hardware advances without sacrificing programmability.

  15. Idea: Stateless Parallel Processors

  16. A Few Finer Points • The ring must be slotted and unidirectional. This allows multiple stations to transmit at the same time. • The ring must be redundant in order to prevent breakage by the loss of a single processor. • The result: the revised SPP architecture (next slide).
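A toy sketch of the slotted, unidirectional ring idea: fixed slots circulate in one direction, and any station with data can claim an empty slot as it passes, so several stations transmit during the same rotation. The slot count and traffic below are invented for illustration and do not model the redundancy aspect.

NUM_STATIONS = 8
slots = [None] * NUM_STATIONS                    # one circulating slot per ring segment
pending = {0: "msg-A", 3: "msg-B", 5: "msg-C"}   # stations with data to send

for step in range(NUM_STATIONS):
    for station in range(NUM_STATIONS):
        # A station claims the empty slot currently passing by it
        if slots[station] is None and station in pending:
            slots[station] = pending.pop(station)
    slots = [slots[-1]] + slots[:-1]             # advance all slots one hop (unidirectional)

print("slots after one rotation:", slots)        # several messages in flight at once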

  17. Revised SPP Architecture
