
Computer Architecture Parallel Processors



  1. Computer Architecture Parallel Processors

  2. Taxonomy • Flynn’s Taxonomy • Classify by Instruction Stream and Data Stream • SISD Single Instruction Single Data • Conventional processor • SIMD Single Instruction Multiple Data • One instruction stream • Multiple data items • Several Examples Produced • MISD Multiple Instruction Single Data • Systolic Arrays (according to Hwang) • MIMD Multiple Instruction Multiple Data • Multiple Threads of execution • General Parallel Processors

  3. SIMD - Single Instruction Multiple Data • Originally thought to be the ultimate massively parallel machine! • Some machines built • Illiac IV • Thinking Machines CM2 • MasPar • Vector processors (special category!)

  4. SIMD - Single Instruction Multiple Data • Each PE is a simple ALU (1 bit in CM-1, a small processor in some) • Control Processor issues the same instruction to each PE in each cycle • Each PE has different data

  5. SIMD • SIMD performance depends on • Mapping problem → processor architecture • Image processing • Maps naturally to 2D processor array • Calculations on individual pixels trivial • Combining data is the problem! • Some matrix operations also

  6. SIMD • Matrix multiplication (note the B matrix is transposed!) • Each PE: * then + • PE_ij → C_ij
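
A minimal Python sketch of the idea on the slide, assuming an n×n grid of PEs where PE(i,j) holds row i of A and row j of the transposed B (the function name and data layout are illustrative, not a description of any particular machine):

# Sketch: SIMD-style matrix multiply on an n x n grid of PEs.
# All PEs execute the same multiply-accumulate in each "cycle";
# B is supplied transposed so each PE streams through two rows.
def simd_matmul(A, B):
    n = len(A)
    Bt = [[B[k][j] for k in range(n)] for j in range(n)]  # transpose B
    C = [[0] * n for _ in range(n)]
    for k in range(n):              # one broadcast instruction per cycle
        for i in range(n):          # every PE(i,j) acts in lockstep:
            for j in range(n):
                C[i][j] += A[i][k] * Bt[j][k]   # * then +, PE_ij -> C_ij
    return C

print(simd_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]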

  7. Parallel Processing • Communication patterns • If the system provides the “correct” data paths, then good performance is obtained even with slow PEs • Without effective communication bandwidth, even fast PEs are starved of data! • In a multiple PE system, we have • Raw communication bandwidth • Equivalent processor ↔ memory bandwidth • Communications patterns • Imagine the Matrix Multiplication problem if the matrices are not already transposed! • Network topology

  8. Systolic Arrays • Arrays of processors which pass data from one to the next at regular intervals • Similar to SIMD systems • But each processor may perform a different operation • Applications • Polynomial evaluation • Signal processing • Limited as general purpose processors • Communication pattern required needs to match hardware links provided (a recurring problem!)

  9. Systolic Array - iWarp • Linear array of processors • Communication links in forward and backward directions

  10. Systolic Array - iWarp • Polynomial evaluation is simple • Use Horner’s rule • PEs - in pairs • multiply input by x, pass result to right • add a_j to result from left, pass result to right • y = ((((a_n*x + a_{n-1})*x + a_{n-2})*x + a_{n-3})*x + … + a_1)*x + a_0
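
A toy Python sketch of the paired PEs (the function name is hypothetical): each pair multiplies the incoming partial result by x, adds the next coefficient, and passes the result to the right:

def horner_pipeline(coeffs, x):
    # coeffs = [a_n, a_{n-1}, ..., a_0]; the running result flows
    # left to right through the multiply/add PE pairs.
    result = coeffs[0]
    for a in coeffs[1:]:
        result = result * x   # "multiply by x" PE
        result = result + a   # "add a_j" PE, passes result right
    return result

# 2x^2 + 3x + 1 at x = 4 -> 45
print(horner_pipeline([2, 3, 1], 4))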

  11. Systolic Array - iWarp • Similarly FFT is efficient • DFT • y_j = Σ_k a_k*w^{kj} • n² operations needed for n-element DFT • FFT • Divides this into 2 smaller transforms • y_j = Σ_m a_{2m}*w^{2mj} + w^j * Σ_m a_{2m+1}*w^{2mj} • n/2 “even” terms + n/2 “odd” terms • algorithm with log₂n phases of n operations • Total n log₂n • Simple strategy with log₂n PEs
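
A compact radix-2 sketch of that even/odd split, assuming n is a power of two (only the standard cmath module is used; the rest is illustrative, not iWarp code):

import cmath

def fft(a):
    # Recursive radix-2 FFT: split into n/2 "even" and n/2 "odd" terms,
    # combine with twiddle factors w^j -- log2(n) phases of n operations.
    n = len(a)
    if n == 1:
        return a
    even = fft(a[0::2])
    odd = fft(a[1::2])
    w = [cmath.exp(-2j * cmath.pi * j / n) for j in range(n // 2)]
    return ([even[j] + w[j] * odd[j] for j in range(n // 2)] +
            [even[j] - w[j] * odd[j] for j in range(n // 2)])

print(fft([1, 1, 1, 1]))  # [4, 0, 0, 0] (as complex numbers)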

  12. Systolic Arrays - General • Variations • Connection topology • 2D arrays, Hypercubes • Processor capabilities • Trivial - just an ALU • ALU with several registers • Simple CPU - registers, runs own program • Powerful CPU - local memory also • Reconfigurable • FPGAs, etc • Specialised applications only • Problem “shape” maps to interconnect pattern

  13. Vector Processors - The Supercomputers • Optimised for vector & matrix operations • (“Conventional” scalar processor section not shown in the slide’s block diagram)

  14. Vector Processors - Vector operations • Example • Dot product: y = A · B • or in terms of the elements: y = Σ_k a_k * b_k • Fetch each element of each vector in turn • Stride • “Distance” between successive elements of a vector • 1 in dot-product case
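
A small sketch (helper name and flat-memory layout hypothetical) of fetching elements at a given stride, as the address generation hardware would, and accumulating y = Σ_k a_k * b_k:

def strided_dot(mem, a_base, a_stride, b_base, b_stride, n):
    # "Stride" is the distance between successive elements:
    # 1 for a packed vector, the row length for a matrix column.
    y = 0
    for k in range(n):
        y += mem[a_base + k * a_stride] * mem[b_base + k * b_stride]
    return y

mem = [1, 2, 3, 4, 10, 20, 30, 40]
print(strided_dot(mem, 0, 1, 4, 1, 4))  # 1*10 + 2*20 + 3*30 + 4*40 = 300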

  15. Vector Processors - Vector operations • Example • Matrix multiply: C = A B • or in terms of the elements: c_ij = Σ_k a_ik * b_kj

  16. Vector Operations • Fetch data into vector register • Address Generation Unit (AGU) manages stride • Very high effective bandwidth to memory • Long “burst” accesses with the AGU managing addresses

  17. Vector Operations • Operation Types (eg CRAY Y-MP) • Vector • Va op Vb → Vc : add two vectors • Va op Vb → sc : scalar result - dot product • Va op sb → Vc : scalar operand - scale vector • Va → sb : sum, maximum, minimum • Memory Access • Fixed stride : elements of a vector (s=1), column of a matrix (s>1) • Gather : read - offsets in vector register • Scatter : write - offsets in vector register • Mask : vector of bits - bit set for non-zero elements
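
The four register-to-register forms could be sketched with NumPy (the names Va, Vb, sb are illustrative, not CRAY assembler syntax):

import numpy as np

Va = np.array([1.0, 2.0, 3.0])
Vb = np.array([4.0, 5.0, 6.0])
sb = 2.0

Vc = Va + Vb          # Va op Vb -> Vc : add two vectors
sc = np.dot(Va, Vb)   # Va op Vb -> sc : scalar result (dot product)
Vd = Va * sb          # Va op sb -> Vc : scalar operand (scale vector)
s  = Va.sum()         # Va -> sb       : reduction (sum/max/min)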

  18. Vector Operations • Memory Access • Scatter • V0 - Data to be stored • V1 - Offset from start of vector

  19. Vector Operations • Memory Access • Scatter • V0 - Data to be stored • V1 - Offset from start of vector • Gather is converse - read from offsets in V1
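
A NumPy sketch of scatter and its converse, gather, using the V0/V1 register names from the slide (the memory array is illustrative):

import numpy as np

mem = np.zeros(8)
V0 = np.array([10.0, 20.0, 30.0])   # data to be stored
V1 = np.array([5, 0, 3])            # offsets from start of vector

mem[V1] = V0   # scatter: write V0[k] to mem[V1[k]]
V2 = mem[V1]   # gather: read mem[V1[k]] back into a register

print(mem)  # [20.  0.  0. 30.  0. 10.  0.  0.]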

  20. Vector Operations - Sparse Matrices • Matrices representing physical interactions are often sparse, eg off-diagonal elements are negligible • Mask register bits set for non-zero elements • Enables very large sparse matrices to be stored and manipulated
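
A minimal sketch of the mask idea with NumPy (the row data is made up): a bit vector marks the non-zero elements, so only those need to be stored and operated on:

import numpy as np

row = np.array([0.0, 7.0, 0.0, 0.0, 3.0, 0.0])
mask = row != 0                 # mask register: bit set for non-zero elements
packed = row[mask]              # store only the non-zero values: [7. 3.]

result = np.zeros_like(row)
result[mask] = packed * 2.0     # operate on packed data, expand via the mask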

  21. Vector Processors - Performance • Very high peak MFLOPs • Heavily pipelined • 2ns cycle times possible • Chaining • Improves performance, eg A*B + C • Result vector (A*B) fed back to a vector register
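
Chaining can be pictured as streaming each (A*B) result element straight into the adder instead of waiting for the whole product vector; a purely illustrative generator sketch in Python:

def vmul(a, b):
    for x, y in zip(a, b):
        yield x * y               # multiply pipe emits one element per cycle

def vadd(stream, c):
    for x, y in zip(stream, c):
        yield x + y               # adder consumes each product as it appears

# A*B + C with the (A*B) results chained straight into the add unit
A, B, C = [1, 2, 3], [4, 5, 6], [7, 8, 9]
print(list(vadd(vmul(A, B), C)))  # [11, 18, 27]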

  22. Vector Processors - Limitations • Vector Registers • Fast (expensive) memory • Limited length • Need re-loading • Limits processing rate

  23. Vector Processors - Limitations • Cost!! • Specialised • Limited applications • Low volume • High cost • Fast for scalar operations also but • Not cost effective for general purpose computing • Data paths optimised for vector data • Shape doesn’t match anything else!
