SIMD and Associative Computational Models Parallel & Distributed Algorithms
SIMD and Associative Computational Models Part I: SIMD Model
Flynn’s Taxonomy • The best-known classification scheme for parallel computers. • Depends on the parallelism they exhibit in their • Instruction streams • Data streams • A sequence of instructions (the instruction stream) manipulates a sequence of operands (the data stream). • The instruction stream (I) and the data stream (D) can each be either single (S) or multiple (M). • Four combinations: SISD, SIMD, MISD, MIMD
Flynn’s Taxonomy (cont.) • SISD • Single Instruction Stream, Single Data Stream • Most important member is the sequential computer • Some argue that other models are included as well. • SIMD • Single Instruction Stream, Multiple Data Streams • One of the two most important in Flynn’s Taxonomy • MISD • Multiple Instruction Streams, Single Data Stream • Relatively unused terminology. Some argue that this includes pipeline computing. • MIMD • Multiple Instruction Streams, Multiple Data Streams • An important classification in Flynn’s Taxonomy
The SIMD Computer & Model Consists of two types of processors: • A front end or control unit • Stores a copy of the program • Has a program control unit to execute the program • Broadcasts parallel program instructions to the array of processors. • An array of simple processors that function more like ALUs. • They do not store a copy of the program, nor do they have a program control unit. • They execute, in parallel, the commands sent by the front end.
SIMD (cont.) • On a memory access, all active processors must access the same location in their local memory. • All active processors execute the same instruction synchronously, but on different data. • The sequence of different data items is often referred to as a vector.
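The front-end/PE split described on the last two slides can be sketched in a few lines of Python. This is only an illustrative model, not how any particular machine is built: the class names `ProcessingElement` and `FrontEnd`, the `broadcast` method, and the three-word local memories are all assumptions made for the example, and the Python loop merely stands in for lockstep hardware execution.

```python
# Minimal sketch of the SIMD model: one front end, many PEs.
# All class and method names are hypothetical, chosen for illustration.

class ProcessingElement:
    """A PE holds only local memory; it stores no program."""

    def __init__(self, local_memory):
        self.mem = list(local_memory)
        self.active = True

    def execute(self, op, dst, src1, src2):
        # Every active PE uses the same addresses (dst, src1, src2),
        # but each one reads and writes its own local memory.
        if self.active:
            self.mem[dst] = op(self.mem[src1], self.mem[src2])


class FrontEnd:
    """The control unit: holds the program and broadcasts each instruction."""

    def __init__(self, pes):
        self.pes = pes

    def broadcast(self, op, dst, src1, src2):
        # This loop stands in for simultaneous, lockstep execution in hardware.
        for pe in self.pes:
            pe.execute(op, dst, src1, src2)


# Locations 0 and 1 hold the i-th components of two vectors;
# location 2 receives the i-th component of their sum.
pes = [ProcessingElement([i, 10 * i, 0]) for i in range(4)]
front_end = FrontEnd(pes)
front_end.broadcast(lambda x, y: x + y, dst=2, src1=0, src2=1)
result_vector = [pe.mem[2] for pe in pes]
```

Reading location 2 across the PEs gives the result vector, which is the sense in which the value in the i-th processor is the i-th component of a vector.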
Alternate Names for SIMDs • Recall that all active processors of a SIMD computer must simultaneously access the same memory location. • The value in the i-th processor can be viewed as the i-th component of a vector. • SIMD machines are sometimes called vector computers [Jordan et al.] or processor arrays [Quinn 94,04] based on their ability to execute vector and matrix operations efficiently.
Alternate Names (cont.) • In particular, Quinn, the textbook for this course, calls a SIMD a processor array. • Quinn and a few others also consider a pipelined vector processor to be a SIMD. • This is a somewhat non-standard use of the term. • An example is the Cray-1.
How to View a SIMD Machine • Think of all the soldiers in a unit. • A commander selects certain soldiers as active, for example, every even-numbered row. • The commander barks out an order for all the active soldiers to carry out, and they execute the order synchronously.
SIMD Execution Style • Collectively, the individual memories of the processing elements (PEs) store the (vector) data that is processed in parallel. • When the front end encounters an instruction whose operand is a vector, it issues a command to the PEs to perform the instruction in parallel. • Although the PEs execute in parallel, some units can be allowed to skip any particular instruction.
SIMD Computers • SIMD computers that focus on vector operations • Support some vector and possibly matrix operations in hardware • Usually limit or provide less support for non-vector operations involving data in the “vector components”. • General purpose SIMD computers • Support more traditional operations (i.e., on data types other than vectors and matrices). • Usually also provide some vector and possibly matrix operations in hardware.
Interconnection Networks for SIMDs • No specific interconnection network is specified. • The 2D mesh has been used more frequently than others. • Even hybrid networks (e.g., cube connected cycles) have been used.
Example of a 2-D Processor Interconnection Network in a SIMD • Each VLSI chip has 16 processing elements. • Each PE can simultaneously send a value to a specific neighbor (e.g., its left neighbor). • PE = processing element
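A rough sketch of the "everyone sends to the left neighbor" step on a 4 × 4 mesh (16 PEs) follows. The function name `shift_left` and the handling of the mesh edge (either a fill value or wraparound) are assumptions for illustration; the slide does not say what happens at the boundary.

```python
def shift_left(grid, wrap=False, fill=None):
    """One synchronous step: every PE sends its value to its left neighbor.

    The PE in column c therefore receives the value from column c + 1.
    Edge behavior is an assumption: the rightmost column receives `fill`,
    or the value from column 0 when `wrap` is True.
    """
    new_grid = []
    for row in grid:
        incoming = row[1:] + ([row[0]] if wrap else [fill])
        new_grid.append(incoming)
    return new_grid


# A 4 x 4 mesh of 16 PEs, each holding one value.
mesh = [[r * 4 + c for c in range(4)] for r in range(4)]
shifted = shift_left(mesh)
wrapped = shift_left(mesh, wrap=True)
```

Because every PE moves its value in the same direction during the same step, no per-message routing decision is needed.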
SIMD Execution Style • The traditional (SIMD, vector, processor array) execution style ([Quinn 94, pg 62], [Quinn 2004, pgs 37-43]): • The sequential processor that broadcasts the commands to the rest of the processors is called the front end or control unit. • The front end is a general purpose CPU that stores the program and the data that is not manipulated in parallel. • The front end normally executes the sequential portions of the program. • Each processing element has a local memory that cannot be directly accessed by the host or other processing elements.
Masking on Processor Arrays • All the processors work in lockstep except those that are masked out (by setting a mask register). • The parallel if-then-else is frequently used in SIMDs to set masks. • Every active processor tests to see if its data meets the negation of the boolean condition. • If it does, it sets its mask bit so that it will not participate in the operation initially. • Next, the unmasked processors execute the THEN part. • Afterwards, the mask bits (for the original set of active processors) are flipped and the unmasked processors perform the ELSE part. • Note: differs from the sequential version of “If”
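The masking steps above can be sketched as follows. This is a sequential simulation under stated assumptions: the function name `parallel_if_then_else` and its parameters are hypothetical, each list index plays the role of one PE, and the Python loops stand in for lockstep execution.

```python
def parallel_if_then_else(data, condition, then_op, else_op, active=None):
    """Simulate the SIMD parallel if-then-else on one value per PE."""
    n = len(data)
    if active is None:
        active = [True] * n  # assume every PE starts out active
    result = list(data)

    # Step 1: each active PE tests the negation of the condition and,
    # if it holds, sets its mask bit (it sits out the THEN part).
    masked = [active[i] and not condition(data[i]) for i in range(n)]

    # Step 2: the unmasked active PEs execute the THEN part.
    for i in range(n):
        if active[i] and not masked[i]:
            result[i] = then_op(data[i])

    # Step 3: mask bits are flipped within the original active set,
    # and the PEs that are now unmasked execute the ELSE part.
    for i in range(n):
        if active[i] and masked[i]:
            result[i] = else_op(data[i])

    return result


values = [3, -1, 4, -5]
out = parallel_if_then_else(values, lambda x: x >= 0,
                            lambda x: x * 2, lambda x: -x)
```

Unlike a sequential "if", both the THEN and the ELSE parts are stepped through one after the other whenever any PE needs them, which is the difference the slide's closing note points to.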
Data Parallelism (A strength for SIMDs) • All tasks (or processors) apply the same set of operations to different data. • Example: for i ← 0 to 99 do a[i] ← b[i] + c[i] endfor • Accomplished on SIMDs by having all active processors execute the operations synchronously. • MIMDs can also handle data parallel execution, but must synchronize more frequently.
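As a small illustration, the loop on this slide can be written as a single element-wise operation. The input values for `b` and `c` are made up for the example, and plain Python lists stand in for the 100 PEs; on a SIMD machine, PE i would compute its own `a[i]` in the same step as all the others.

```python
# Hypothetical input vectors; index i plays the role of PE i.
b = list(range(100))
c = [2 * i for i in range(100)]

# The whole loop is one data-parallel step: the same operation (+)
# applied to different data in every position.
a = [b_i + c_i for b_i, c_i in zip(b, c)]
```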
Functional/Control/Job Parallelism (A Strictly-MIMD Paradigm) • Independent tasks apply different operations to different data elements. • Example: a ← 2; b ← 3; m ← (a + b) / 2; s ← (a² + b²) / 2; v ← s − m² • First and second statements execute concurrently. • Third and fourth statements execute concurrently.
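A minimal sketch of the same five statements, using Python's standard `concurrent.futures` thread pool to stand in for two independent MIMD processors. The helper names `mean` and `mean_of_squares` are hypothetical; the point is only that the third and fourth statements are different operations that do not depend on each other, while the fifth needs both results.

```python
from concurrent.futures import ThreadPoolExecutor


def mean(a, b):
    return (a + b) / 2


def mean_of_squares(a, b):
    return (a ** 2 + b ** 2) / 2


# Statements 1 and 2 are independent of each other.
a, b = 2, 3

with ThreadPoolExecutor(max_workers=2) as pool:
    # Statements 3 and 4: two different operations submitted as
    # separate tasks, so they may run concurrently.
    m_future = pool.submit(mean, a, b)
    s_future = pool.submit(mean_of_squares, a, b)
    m = m_future.result()
    s = s_future.result()

# Statement 5 depends on both m and s, so it must wait for them.
v = s - m ** 2
```

With a = 2 and b = 3 this gives m = 2.5, s = 6.5, and v = 0.25.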
SIMD Machines • An early SIMD computer designed for vector and matrix processing was the Illiac IV computer • Built at the University of Illinois • See Jordan et al., pg 7 • The MPP, the DAP, the Connection Machines CM-1 and CM-2, and the MasPar MP-1 and MP-2 are examples of SIMD computers • See Akl pg 8-12 and [Quinn, 94]
SIMD Machines • Quinn [1994, pg 63-67] discusses the CM-2 Connection Machine and a smaller & updated CM-200. • Professor Batcher was the chief architect for the STARAN and the MPP (Massively Parallel Processor) and an advisor for the ASPRO. • ASPRO is a small second generation STARAN used by the Navy in spy planes. • Professor Batcher is best known architecturally for the MPP, which belongs to the Smithsonian Institution & is currently displayed at a D.C. airport.
Today’s SIMDs • Many SIMDs are being embedded in SISD machines. • Others are being built as part of hybrid architectures. • Others are being built as special purpose machines, although some of them could be classified as general purpose. • Much of the recent work with SIMD architectures is proprietary.
A Company Building Inexpensive SIMD • WorldScape is producing a COTS (commodity off the shelf) SIMD. • Not a traditional SIMD, as • The PEs are full-fledged CPUs • The hardware doesn’t synchronize every step. • Hardware design supports efficient synchronization. • Their machine is programmed like a SIMD. • The U.S. Navy has observed that their machines process radar an order of magnitude faster than others. • There is quite a bit of information about their work at http://www.wscape.com
An Example of a Hybrid SIMD • Embedded Massively Parallel Accelerators • Systola 1024: PC add-on board with 1024 processors • Fuzion 150: 1536 processors on a single chip • Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2, BioScan (This and next three slides are due to Prabhakar R. Gudla (U of Maryland) at a CMSC 838T Presentation, 4/23/2003.)
Hybrid Computer • Hybrid Architecture: combines the SIMD and MIMD paradigms within a parallel architecture • [Figure: sixteen Systola1024 units connected by a high speed Myrinet switch]
Architecture of Systola 1024 • Instruction Systolic Array: • 32 × 32 mesh of processing elements • wavefront instruction execution • [Figure labels: RAM NORTH, RAM WEST, controller, program memory, ISA interface, host computer bus, processors]
SIMDs Embedded in SISDs • Intel's Pentium 4 includes what they call MMX technology to gain a significant performance boost. • IBM and Motorola incorporated the technology into their G4 PowerPC chip in what they call their Velocity Engine. • Both MMX technology and the Velocity Engine are the chip manufacturers' names for their proprietary SIMD processors and parallel extensions to their operating code. • This same approach is used by NVidia and Evans & Sutherland to dramatically accelerate graphics rendering.
Special Purpose SIMDs in the Bioinformatics Arena • Paracel • Acquired by Celera Genomics in 2000 • Products include the sequence supercomputer GeneMatcher, which has a high throughput sequence analysis capability • Supported over a million processors earlier • GeneMatcher was used by Celera in their race with the U.S. government to complete the sequencing of the human genome • TimeLogic, Inc • Has DeCypher, a reconfigurable SIMD
Advantages of SIMDs • Reference: [Roosta, pg 10] • Less hardware than MIMDs as they have only one control unit. • Control units are complex. • Less memory needed than MIMD • Only one copy of the instructions needs to be stored • Allows more data to be stored in memory. • Less startup time in communicating between PEs.
Advantages of SIMDs • Single instruction stream and synchronization of PEs make SIMD applications easier to program, understand, & debug. • Similar to sequential programming • Control flow operations and scalar operations can be executed on the control unit while PEs are executing other instructions. • MIMD architectures require explicit synchronization primitives, which create a substantial amount of additional overhead.
Advantages of SIMDs • During a communication operation between PEs, • PEs send data to a neighboring PE in parallel and in lock step • No need to create a header with routing information, as “routing” is determined by program steps. • The entire communication operation is executed synchronously • A tight (worst case) upper bound for the time for this operation can be computed. • Less complex hardware in SIMD since no message decoder is needed in PEs • MIMDs need a message decoder in each PE.
SIMD Shortcomings (with some rebuttals) • Claims are from our textbook by Quinn. • Similar statements are found in one of our “primary reference books” by Grama et al. [13]. • Claim 1: Not all problems are data-parallel • While true, most problems seem to have data parallel solutions. • In [Fox et al.], the observation was made in their study of large parallel applications that most were data parallel by nature, but often had points where significant branching occurred.
SIMD Shortcomings (with some rebuttals) • Claim 2: Speed drops for conditionally executed branches • Processors in both MIMD & SIMD normally have to do a significant amount of ‘condition’ testing • MIMD processors can execute multiple branches concurrently. • For an if-then-else statement with execution times for the “then” and “else” parts being roughly equal, about ½ of the SIMD processors are idle during its execution • With additional branching, the average number of inactive processors can become even higher. • With SIMDs, only one of these branches can be executed at a time. • This reason justifies the study of multiple SIMDs (or MSIMDs).
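The "about ½ idle" figure can be checked with a short calculation. The sketch below assumes a simple model that is not spelled out on the slide: the THEN and ELSE parts run back to back, a fraction `p_then` of the active PEs take the THEN branch, and a PE is idle whenever the other branch is running. The function name and parameters are hypothetical.

```python
def simd_branch_utilization(p_then, t_then, t_else):
    """Average fraction of PE time spent doing useful work during one
    if-then-else, under the back-to-back execution model assumed above.

    p_then : fraction of active PEs that take the THEN branch
    t_then : time to execute the THEN part
    t_else : time to execute the ELSE part
    """
    busy = p_then * t_then + (1 - p_then) * t_else
    total = t_then + t_else
    return busy / total


# Equal branch times: utilization is 1/2 regardless of how the PEs split.
equal_split = simd_branch_utilization(0.5, 2, 2)
uneven_split = simd_branch_utilization(0.25, 2, 2)

# Unequal branch times: utilization improves when most PEs take the
# longer branch.
mostly_long_branch = simd_branch_utilization(0.75, 3, 1)
```

With equal branch times the busy time is p·t + (1 − p)·t = t out of a total of 2t, so half of the PE time is idle no matter how the processors divide between the two branches.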
SIMD Shortcomings (with some rebuttals) • Claim 2 (cont.): Speed drops for conditionally executed code • In [Fox et al.], the observation was made that for the real applications surveyed, the MAXIMUM number of active branches at any point in time was about 8. • The cost of the extremely simple processors used in a SIMD is extremely low • Programmers used to worry about ‘full utilization of memory’ but stopped this after memory cost became insignificant overall.
SIMD Shortcomings (with some rebuttals) • Claim 3: Don’t adapt to multiple users well. • This is true to some degree for all parallel computers. • If usage of a parallel processor is dedicated to an important problem, it is probably best not to risk compromising its performance by ‘sharing’. • This reason also justifies the study of multiple SIMDs (or MSIMDs). • SIMD architecture has not received the attention that MIMD has received and can greatly benefit from further research.
SIMD Shortcomings(with some rebuttals) • Claim 4: Do not scale down well to “starter” systems that are affordable. • This point is arguable and its ‘truth’ is likely to vary rapidly over time • WorldScape/ClearSpeed currently sells a very economical SIMD board that plugs into a PC.
SIMD Shortcomings (with some rebuttals) • Claim 5: Requires customized VLSI for processors, and the expense of control units has dropped • Reliance on COTS (commodity, off-the-shelf) parts has dropped the price of MIMDs • Expense of PCs (with control units) has dropped significantly • However, reliance on COTS has fueled the success of ‘low level parallelism’ provided by clusters and restricted new innovative parallel architecture research for well over a decade.
SIMD Shortcomings(with some rebuttals) Claim 5 (cont.) • There is strong evidence that the period of continual dramatic increases in speed of PCs and clusters is ending. • Continued rapid increases in parallel performance in the future will be necessary in order to solve important problems that are beyond our current capabilities • Additionally, with the appearance of the very economical COTS SIMDs, this claim no longer appears to be relevant.