SIMD and Associative Computing. Computational Models and Algorithms. Associative Computing Topics. Introduction References SIMD computing & Architecture Motivation for the MASC model The MASC and ASC Models A Language Designed for the ASC Model
SIMD and Associative Computing Computational Models and Algorithms
Associative Computing Topics • Introduction • References • SIMD computing & Architecture • Motivation for the MASC model • The MASC and ASC Models • A Language Designed for the ASC Model • List of Algorithms and Programs designed for ASC • ASC Algorithm Examples • ASC version of Prim’s MST Algorithm
Comment on Slides Included • Some of these slides will be covered only lightly or else left for students to read. • The emphasis here is to provide an introduction to material covered, not a deep understanding. • Inclusion of these slides will provide a better survey of this material. • This material is a useful background for the Air Traffic Control example and projects we expect to use in this course.
Associative Computing References Note: The KSU papers below are available on the website: http://www.cs.kent.edu/~parallel/ (Click on the link to “papers”) • Maher Atwah, Johnnie Baker, and Selim Akl, An Associative Implementation of Classical Convex Hull Algorithms, Proc. of the IASTED International Conference on Parallel and Distributed Computing and Systems, 1996, 435-438 • Mingxian Jin, Johnnie Baker, and Kenneth Batcher, Timings for Associative Operations on the MASC Model, Proc. of the 15th International Parallel and Distributed Processing Symposium (Workshop on Massively Parallel Processing), San Francisco, April 2001.
Associative Computing References • Jerry Potter, Johnnie Baker, Stephen Scott, Arvind Bansal, Chokchai Leangsuksun, and Chandra Asthagiri, An Associative Computing Paradigm, Special Issue on Associative Processing, IEEE Computer, 27(11):19-25, Nov. 1994. (Note: MASC is called ‘ASC’ in this article.) • Jerry Potter, Associative Computing - A Programming Paradigm for Massively Parallel Computers, Plenum Publishing Company, 1992.
Alternate Names for SIMDs • Recall that all active processors of a true SIMD computer must simultaneously access the same memory location. • The value in the i-th processor can be viewed as the i-th component of a vector. • SIMD machines are sometimes called vector computers [Jordan, et al.] or processor arrays [Quinn 94, 04] based on their ability to execute vector and matrix operations efficiently.
SIMD Architecture • Has only one control unit. • Scientific applications have data parallelism
Data/instruction Storage • Front end computer • Also called the control unit • Holds and runs program • Data manipulated sequentially • Processor array • Data manipulated in parallel
Processor Array Performance • Performance: work done per time unit • Performance of processor array • Speed of processing elements • Utilization of processing elements
Performance Example 1 • 1024 processors • Each adds a pair of integers in 1 μsec (1 microsecond, i.e., one millionth of a second or 10^-6 second). • What is the performance when adding two 1024-element vectors (one per processor)?
Performance Example 2 • 512 processors • Each adds two integers in 1 μsec • What is the performance when adding two vectors of length 600? • Since 600 > 512, 88 processors must each add two pairs of integers. • The other 424 processors add only a single pair of integers.
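The two performance examples can be worked through with a short calculation. The sketch below is only illustrative: the function name and its parameters are made up for this note, and it assumes each PE performs one addition per 1 μsec step, so the elapsed time is set by the PE that holds the most element pairs.

```python
import math

def processor_array_performance(num_pes, vector_len, add_time_us=1.0):
    """Return (steps, additions per second, PE utilization) for one vector add.

    Hypothetical helper: assumes every PE needs add_time_us microseconds
    per addition and that all PEs work in lockstep.
    """
    # The busiest PE holds ceil(vector_len / num_pes) element pairs,
    # and the whole array has to wait for it.
    steps = math.ceil(vector_len / num_pes)
    elapsed_us = steps * add_time_us
    additions_per_second = vector_len * 1_000_000 / elapsed_us
    # Fraction of the available PE time slots that did useful work.
    utilization = vector_len / (num_pes * steps)
    return steps, additions_per_second, utilization

example_1 = processor_array_performance(1024, 1024)  # one lockstep addition
example_2 = processor_array_performance(512, 600)    # two lockstep additions
print(example_1)
print(example_2)
```

Under these assumptions, Example 1 finishes in one step (1024 additions in 1 μsec), while Example 2 needs two steps because 88 PEs hold a second pair, so 600 additions take 2 μsec and a good part of the array sits idle during the second step.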
Example of a 2-D Processor Interconnection Network in a Processor Array Each VLSI chip has 16 processing elements. Each PE can simultaneously send a value to a neighbor. PE = processor element
SIMD Execution Style • The traditional (SIMD, vector, processor array) execution style ([Quinn 94, pg 62], [Quinn 2004, pgs 37-43]): • The sequential processor that broadcasts the commands to the rest of the processors is called the front end or control unit (or sometimes host). • The front end is a general purpose CPU that stores the program and the data that is not manipulated in parallel. • The front end normally executes the sequential portions of the program. • Each processing element has a local memory that cannot be directly accessed by the control unit or other processing elements.
SIMD Execution Style • Collectively, the individual memories of the processing elements (PEs) store the (vector) data that is processed in parallel. • Called the parallel memory • When the front end encounters an instruction whose operand is a vector, it issues a command to the PEs to perform the instruction in parallel. • Although the PEs execute in parallel, some units can be allowed to skip any particular instruction.
Masking in Processor Arrays • All the processors work in lockstep except those that are masked out (by setting a mask register). • The conditional if-then-else is handled differently on processor arrays than in the sequential version • Every active processor tests to see if its data meets the negation of the boolean condition. • If it does, it sets its mask bit so those processors will not participate in the operation initially. • Next, the unmasked processors execute the THEN part. • Afterwards, mask bits (for the original set of active processors) are flipped and the unmasked processors perform the ELSE part.
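The masked if-then-else described above can be simulated sequentially. This is a minimal sketch, not code for any real processor array: the function name, its arguments, and the use of Python lists to stand in for per-PE data and mask bits are all assumptions made for illustration.

```python
def simd_if_then_else(values, condition, then_op, else_op, active=None):
    """Simulate a masked if-then-else over one value per PE (hypothetical helper)."""
    n = len(values)
    if active is None:
        active = [True] * n  # assume every PE starts out active
    result = list(values)

    # Each active PE tests its own data. PEs that fail the test are
    # masked out for the THEN part (True here means "participates").
    participates = [active[i] and condition(values[i]) for i in range(n)]

    # THEN part: one broadcast instruction, executed only by unmasked PEs.
    for i in range(n):
        if participates[i]:
            result[i] = then_op(values[i])

    # Flip the mask, but only within the originally active set.
    participates = [active[i] and not participates[i] for i in range(n)]

    # ELSE part: again one broadcast instruction for the unmasked PEs.
    for i in range(n):
        if participates[i]:
            result[i] = else_op(values[i])
    return result

data = [3, -1, 4, -5, 0]
doubled_or_negated = simd_if_then_else(
    data, lambda x: x >= 0, lambda x: x * 2, lambda x: -x
)
print(doubled_or_negated)
```

The two loops run one after the other, which mirrors the point of the slide: the THEN and ELSE parts are serialized, and a PE that was inactive before the conditional stays untouched throughout.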
SIMD Machines • An early SIMD computer designed for vector and matrix processing was the Illiac IV computer • Initial development at the University of Illinois 1965-70 • Moved to NASA Ames, completed in 1972 but not fully functional until 1976. • See [Jordan, et al., pg 7] and Wikipedia • The MPP, DAP, the Connection Machines CM-1 and CM-2, MasPar MP-1 and MP-2 are examples of SIMD computers • See Akl pg 8-12 and [Quinn, 94] • The CRAY-1 and the Cyber-205 use pipelined arithmetic units to support vector operations and are sometimes called pipelined SIMDs • See [Jordan, et al., pg 7], [Quinn 94, pg 61-2], and [Quinn 2004, pg 37].
SIMD Machines • Quinn [1994, pg 63-67] discusses the CM-2 Connection Machine (with 64K PEs) and a smaller & updated CM-200. • Our Professor Batcher was the chief architect for the STARAN and the MPP (Massively Parallel Processor) and an advisor for the ASPRO • ASPRO is a small second generation STARAN used by the Navy in surveillance planes. • Professor Batcher is best known architecturally for the MPP, which belongs to the Smithsonian Institution & is currently displayed at a D.C. airport.
Today’s SIMDs • Many SIMDs are being embedded in sequential machines. • Others are being built as part of hybrid architectures. • Others are being built as special purpose machines, although some of them could be classified as general purpose. • Much of the recent work with SIMD architectures is proprietary. • Often the fact that a parallel computer is SIMD is not mentioned by the company building it.
ClearSpeed’s Inexpensive SIMD • ClearSpeed is producing a COTS (commodity off the shelf) SIMD Board • Not a traditional SIMD as the hardware doesn’t synchronize every step. • PEs are full CPUs • Hardware design supports efficient synchronization • This machine is programmed like a SIMD. • The U.S. Navy has observed that their machines process radar an order of magnitude faster than others. • There is quite a bit of information about this at www.clearspeed.com and www.wscape.com
Special Purpose SIMDs in the Bioinformatics Arena • Paracel • Acquired by Celera Genomics in 2000 • Products include the sequence supercomputer GeneMatcher, which has a high throughput sequence analysis capability • Supports over a million processors • GeneMatcher was used by Celera in their race with the U.S. government to complete the sequencing of the human genome • TimeLogic, Inc • Has DeCypher, a reconfigurable SIMD
Advantages of SIMDs • Reference: [Roosta, pg 10] • Less hardware than MIMDs as they have only one control unit. • Control units are complex. • Less memory needed than MIMD • Only one copy of the instructions needs to be stored • Allows more data to be stored in memory. • Less startup time in communicating between PEs.
Advantages of SIMDs (cont) • Single instruction stream and synchronization of PEs make SIMD applications easier to program, understand, & debug. • Similar to sequential programming • Control flow operations and scalar operations can be executed on the control unit while PEs are executing other instructions. • MIMD architectures require explicit synchronization primitives, which create a substantial amount of additional overhead.
Advantages of SIMDs (cont) • During a communication operation between PEs, • PEs send data to a neighboring PE in parallel and in lock step • No need to create a header with routing information as “routing” is determined by program steps. • The entire communication operation is executed synchronously • SIMDs are deterministic & have much more predictable running time. • Can normally compute a tight (worst case) upper bound for the time for communications operations. • Less complex hardware in SIMD since no message decoder is needed in the PEs • MIMDs need a message decoder in each PE.
SIMD Shortcomings (with some rebuttals) • Claims are from our textbook [i.e., Quinn 2004]. • Similar statements are found in [Grama, et al.]. • Claim 1: Not all problems are data-parallel • While true, most problems seem to have a data parallel solution. • In [Fox, et al.], the observation was made in their study of large parallel applications at national labs, that most were data parallel by nature, but often had points where significant branching occurred.
SIMD Shortcomings (with some rebuttals) • Claim 2: Speed drops for conditionally executed branches • MIMD processors can execute multiple branches concurrently. • For an if-then-else statement with execution times for the “then” and “else” parts being roughly equal, about ½ of the SIMD processors are idle during its execution • With additional branching, the average number of inactive processors can become even higher. • With SIMDs, only one of these branches can be executed at a time. • This reason justifies the study of multiple SIMDs (or MSIMDs).
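The "about ½ idle" figure in Claim 2 can be checked with a small calculation. The sketch below is an illustrative model only, with a made-up function name: it assumes the THEN and ELSE parts run one after the other, that a fraction of the PEs takes the THEN branch, and that every PE is idle while the other branch executes.

```python
def idle_fraction(frac_then, t_then, t_else):
    """Average fraction of PE time spent idle during a SIMD if-then-else.

    Hypothetical model: frac_then of the PEs execute the THEN part
    (taking t_then), the rest execute the ELSE part (taking t_else),
    and the two parts are serialized.
    """
    total_time = t_then + t_else
    # Useful PE time per processor, averaged over the whole array.
    busy_time = frac_then * t_then + (1.0 - frac_then) * t_else
    return 1.0 - busy_time / total_time

equal_branches = idle_fraction(frac_then=0.3, t_then=5.0, t_else=5.0)
skewed_branches = idle_fraction(frac_then=0.9, t_then=1.0, t_else=9.0)
print(equal_branches, skewed_branches)
```

In this model, equal branch times give an idle fraction of one half no matter how the PEs split between the branches, which matches the slide. When most PEs take the short branch and few take the long one, the idle fraction rises above one half.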
SIMD Shortcomings (with some rebuttals) • Claim 2 (cont): Speed drops for conditionally executed code • In [Fox, et al.], the observation was made that for the real applications surveyed, the MAXIMUM number of active branches at any point in time was about 8. • The cost of the extremely simple processors used in a SIMD is extremely low • Programmers used to worry about ‘full utilization of memory’ but stopped this after memory cost became insignificant overall.
SIMD Shortcomings (with some rebuttals) • Claim 3: Don’t adapt to multiple users well. • This is true to some degree for all parallel computers. • If usage of a parallel processor is dedicated to an important problem, it is probably best not to risk compromising its performance by ‘sharing’ • This reason also justifies the study of multiple SIMDs (or MSIMDs). • SIMD architecture has not received the attention that MIMD has received and can greatly benefit from further research.
SIMD Shortcomings (with some rebuttals) • Claim 4: Do not scale down well to “starter” systems that are affordable. • This point is arguable and its ‘truth’ is likely to vary rapidly over time • ClearSpeed currently sells a very economical SIMD board that plugs into a PC.
SIMD Shortcomings (with some rebuttals) • Claim 5: Requires customized VLSI for processors, and the expense of control units in PCs has dropped. • Reliance on COTS (commodity off-the-shelf) parts has dropped the price of MIMDs • Expense of PCs (with control units) has dropped significantly • However, reliance on COTS has fueled the success of ‘low level parallelism’ provided by clusters and restricted new innovative parallel architecture research for well over a decade.
SIMD Shortcomings (with some rebuttals) • Claim 5 (cont.) • There is strong evidence that the period of continual dramatic increases in speed of PCs and clusters is ending. • Continued rapid increases in parallel performance in the future will be necessary in order to solve important problems that are beyond our current capabilities • Additionally, with the appearance of the very economical COTS SIMDs, this claim no longer appears to be relevant.
Associative Computers Associative Computer: A SIMD computer with a few additional features supported in hardware. • These additional features can be supported (less efficiently) in traditional SIMDs in software. • The name “associative” is due to its ability to locate items in the memory of PEs by content rather than location.
Associative Models The ASC model (for ASsociative Computing) gives a list of the properties assumed for an associative computer. The MASC (for Multiple ASC) Model • Supports multiple SIMD (or MSIMD) computation. • Allows model to have more than one Instruction Stream (IS) • The IS corresponds to the control unit of a SIMD. • ASC is the MASC model with only one IS. • The one IS version of the MASC model is sufficiently important to have its own name.
ASC & MASC are KSU Models • Several professors and their graduate students at Kent State University have worked on these models • The STARAN and the ASPRO fully support the ASC model in hardware. The MPP supports ASC, partly in hardware and partly in software. • Prof. Batcher was chief architect or consultant • He received both the Eckert-Mauchly Award and the Seymour Cray Computer Engineering Award • Dr. Potter developed a language for ASC • Dr. Baker works on algorithms for models and architectures to support models • Dr. Walker is working with a hardware design to support the ASC and MASC models. • Dr. Batcher and Dr. Potter are currently not actively working on ASC/MASC models but still provide advice.
Motivation • The STARAN Computer (Goodyear Aerospace, early 1970’s) and later the ASPRO provided an architectural model for associative computing embodied in the ASC model. • STARAN built to support Air Traffic Control. • ASPRO built to support Air Defense Systems • ASC extends the data parallel programming style to a complete computational model. • ASC provides a practical model that supports massive parallelism. • MASC provides a hybrid data-parallel, control parallel model that supports associative programming. • Descriptions of these models allow them to be compared to other parallel models
The ASC Model (Figure: a single Instruction Stream (IS) connected to an array of cells, each consisting of a PE and its local memory; the cells are linked by a cell network.)
Basic Properties of ASC • Instruction Stream • The IS has a copy of the program and can broadcast instructions to cells in unit time • Cell Properties • Each cell consists of a PE and its local memory • All cells listen to the IS • A cell can be active, inactive, or idle • Inactive cells listen but do not execute IS commands until reactivated • Idle cells contain no essential data and are available for reassignment • Active cells execute IS commands synchronously
Basic Properties of ASC • Responder Processing • The IS can detect if a data test is satisfied by any of its responder cells in constant time (i.e., any-responders property). • The IS can select an arbitrary responder in constant time (i.e., pick-one property).
Basic Properties of ASC • Constant Time Global Operations (across PEs) • Logical OR and AND of binary values • Maximum and minimum of numbers • Associative searches • Communications • There are at least two real or virtual networks • PE communications (or cell) network • IS broadcast/reduction network (which could be implemented as two separate networks)
Basic Properties of ASC • The PE communications network is normally supported by an interconnection network • E.g., a 2D mesh • The broadcast/reduction network(s) are normally supported by a broadcast and a reduction network (sometimes combined). • See posted paper by Jin, Baker, & Batcher (listed in associative references) • Control Features • PEs and the IS and the networks all operate synchronously, using the same clock
Non-SIMD Properties of ASC • Observation: The ASC properties that are unusual for SIMDs are the constant time operations: • Constant time responder processing • Any-responders? • Pick-one • Constant time global operations • Logical OR and AND of binary values • Maximum and minimum value of numbers • Associative Searches • These timings are justified by implementations using a resolver in the paper by Jin, Baker, & Batcher (listed in associative references and posted).
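The responder operations and global reductions listed above can be mimicked in a sequential sketch. All function names here are hypothetical, and the simulation takes time linear in the number of cells; the ASC model assumes hardware that performs each of these operations in constant time.

```python
def search(values, predicate):
    """Associative search: every cell tests its own value; True marks a responder."""
    return [bool(predicate(v)) for v in values]

def any_responders(responders):
    """Any-responders: does at least one cell satisfy the data test?"""
    return any(responders)

def pick_one(responders):
    """Pick-one: select an arbitrary responder (here simply the first), or None."""
    for index, is_responder in enumerate(responders):
        if is_responder:
            return index
    return None

def global_reduce(values, responders, op):
    """Global reduction (min, max, any, all, ...) over the responder cells only."""
    selected = [v for v, r in zip(values, responders) if r]
    return op(selected) if selected else None

# One value per cell (made-up data for the illustration).
cell_values = [7, 3, 9, 3, 12]
responders = search(cell_values, lambda v: v < 10)
smallest = global_reduce(cell_values, responders, min)
holders = search(cell_values, lambda v: v == smallest)
chosen_cell = pick_one(holders)
print(any_responders(responders), smallest, chosen_cell)
```

The usage at the bottom follows a pattern common in associative algorithms: search, reduce to a minimum, search again for the cells holding that minimum, then pick one of them.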
Typical Data Structure for ASC Model
(Figure: the IS broadcasts to cells PE1 to PE7; each row below is the local memory of one cell. A dash marks a field with no value shown.)
PE    Make     Color   Year   Model   Price   On lot   Busy-idle
PE1   Dodge    red     1994   -       -       1        1
PE2   -        -       -      -       -       0        0
PE3   Ford     blue    1996   -       -       1        1
PE4   Ford     white   1998   -       -       0        1
PE5   -        -       -      -       -       0        0
PE6   -        -       -      -       -       0        0
PE7   Subaru   red     1997   -       -       1        1
• Make, Color, etc. are fields the programmer establishes • Various data types are supported. Some examples will show string data, but they are not supported in the ASC simulator.
The Associative Search (Figure: the car table from the Typical Data Structure for ASC Model slide, with the IS broadcasting to PE1 through PE7.) • The IS asks for all cars that are red and on the lot. • PE1 and PE7 respond by setting a mask bit in their PE.
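The associative search on the car table can be sketched as follows. The field values are transcribed from the slide's figure as far as they can be recovered, the Model and Price fields are left out because the figure shows no values for them, and the function and key names are hypothetical.

```python
# One dictionary per cell; "busy" is the busy-idle flag, and idle cells hold no record.
cells = [
    {"pe": "PE1", "busy": 1, "make": "Dodge", "color": "red", "year": 1994, "on_lot": 1},
    {"pe": "PE2", "busy": 0},
    {"pe": "PE3", "busy": 1, "make": "Ford", "color": "blue", "year": 1996, "on_lot": 1},
    {"pe": "PE4", "busy": 1, "make": "Ford", "color": "white", "year": 1998, "on_lot": 0},
    {"pe": "PE5", "busy": 0},
    {"pe": "PE6", "busy": 0},
    {"pe": "PE7", "busy": 1, "make": "Subaru", "color": "red", "year": 1997, "on_lot": 1},
]

def associative_search(cells, **criteria):
    """Return one mask bit per cell: 1 if the cell is busy and matches every field."""
    mask = []
    for cell in cells:
        matches = cell["busy"] == 1 and all(
            cell.get(field) == wanted for field, wanted in criteria.items()
        )
        mask.append(1 if matches else 0)
    return mask

# The IS broadcasts the test "color is red and car is on the lot".
mask = associative_search(cells, color="red", on_lot=1)
responder_names = [cell["pe"] for cell, bit in zip(cells, mask) if bit]
print(mask, responder_names)
```

Items are located by content rather than by address: the IS never needs to know which PE holds which car, only which mask bits came back set.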
MASC Model (Figure: three Instruction Streams (ISs) joined by an IS network; an array of eight cells, each a PE with its local memory, joined by a PE interconnection network.) • Basic Components • An array of cells, each consisting of a PE and its local memory • A PE interconnection network between the cells • One or more Instruction Streams (ISs) • An IS network • MASC is an MSIMD model that supports • both data and control parallelism • associative programming
MASC Basic Properties • Each cell can listen to only one IS • Cells can switch ISs in unit time, based on the results of a data test. • Each IS and the cells listening to it follow rules of the ASC model. • Control Features: • The PEs, ISs, and networks all operate synchronously, using the same clock • Restricted job control parallelism is used to coordinate the interaction of the multiple ISs.
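The idea that cells switch instruction streams based on a data test, and that each IS then drives its own group of cells, can be illustrated with a rough sequential sketch. The function names, the integer IS identifiers, and the representation of an IS program as a single Python function are all assumptions for this illustration, not part of the MASC definition.

```python
def switch_is(values, assignment, data_test, from_is, to_is):
    """Cells listening to from_is that pass data_test start listening to to_is."""
    return [
        to_is if (current == from_is and data_test(v)) else current
        for v, current in zip(values, assignment)
    ]

def masc_step(values, assignment, instructions):
    """Each IS broadcasts one instruction; a cell executes only its own IS's instruction."""
    return [instructions[is_id](v) for v, is_id in zip(values, assignment)]

cell_values = [5, -2, 7, -9]
listening_to = [0, 0, 0, 0]  # every cell initially listens to IS 0

# Cells holding negative data switch to IS 1; the others stay with IS 0.
listening_to = switch_is(cell_values, listening_to, lambda v: v < 0, from_is=0, to_is=1)

# Both branches now proceed in the same step, one per instruction stream.
after_step = masc_step(
    cell_values,
    listening_to,
    {0: lambda v: v + 1, 1: lambda v: abs(v)},
)
print(listening_to, after_step)
```

Compared with the masked if-then-else of a single-IS SIMD, the two branches here are handled in one step by two instruction streams, which is the control parallelism MASC adds on top of data parallelism.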
Characteristics of Associative Programming • Consistent use of a style of programming called data parallel programming • Consistent use of global associative searching and responder processing • Usually, frequent use of the constant time global reduction operations: AND, OR, MAX, MIN • Broadcast of data using the IS bus allows the use of the PE network to be restricted to parallel data movement.