Models and Terminology
Degree of Parallelism • number of computations that can be executed in parallel.
Degree of Parallelism • Suppose an application supports a degree of parallelism A, • the representation (language) of the algorithm allows a degree of parallelism L, • the compiler produces object code with a degree of parallelism C, and the architecture supports a degree of parallelism H; • then, for the most efficient processing, what conditions must hold among the degrees of parallelism of these factors?
Degree of Parallelism • For the most efficient processing, the degrees of parallelism should satisfy A ≥ L ≥ C ≥ H, since each level can exploit at most the parallelism exposed by the level above it. • This implies that efficient development of parallel applications must consider all of: algorithms, programming languages, compilers, operating systems, and hardware structures.
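A minimal sketch of this matching condition in Python; the numeric degrees below are made-up values for illustration only:

```python
# Check the matching condition between the degrees of parallelism
# named on this slide (A = application, L = language,
# C = compiled code, H = hardware).

def efficient(A: int, L: int, C: int, H: int) -> bool:
    """True when no level promises more parallelism than the level
    above it exposes, i.e. A >= L >= C >= H."""
    return A >= L >= C >= H

# Example: the application exposes 64-way parallelism, the language
# captures 32, the compiler emits 16-way code, the hardware has 8 PEs.
print(efficient(64, 32, 16, 8))   # True: each level is fully fed
print(efficient(8, 16, 16, 32))   # False: hardware exceeds what the code exposes
```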
Application characteristics • An application is partitioned into a set of tasks that can be executed in parallel; the evaluation must consider the following characteristics: • granularity, • degree of parallelism, • level of parallelism, and • data dependencies
Granularity • Granularity is determined as a function of the execution time R of the task and its communication time C with other tasks. • If R >> C, the task granularity is large, i.e., coarse-grained (the least communication overhead). • If C >> R, communication overhead dominates and the task is fine-grained. A medium-grained task is the compromise; a classification sketch follows.
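A hedged sketch of the classification; the factor of 10 used to approximate ">>" is an assumption for illustration, not part of the definition:

```python
# Classify task granularity from the slide's two quantities:
# R = task execution time, C = time spent communicating with other tasks.

def granularity(R: float, C: float, factor: float = 10.0) -> str:
    if R >= factor * C:
        return "coarse-grained"   # R >> C: communication overhead is negligible
    if C >= factor * R:
        return "fine-grained"     # C >> R: communication overhead dominates
    return "medium-grained"       # the compromise in between

print(granularity(R=100.0, C=2.0))   # coarse-grained
print(granularity(R=1.0, C=50.0))    # fine-grained
print(granularity(R=5.0, C=4.0))     # medium-grained
```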
Level of Parallelism • The level of parallelism defines granularity: • task level: an application has multiple tasks that can be executed in parallel; • procedure level: a task consists of procedures that can be executed in parallel; • instruction level: instructions in a task can execute in parallel; • operation level: may be considered the same as the instruction level (and perhaps system code); • microcode level: internal to the chip design. The sketch below shows the two levels visible to application code.
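A hedged illustration (the function names are hypothetical) of the two levels an application programmer controls directly. The procedures run serially inside each task here, but they are the units a finer-grained decomposition would parallelize; instruction, operation, and microcode levels sit below what application code controls.

```python
from concurrent.futures import ProcessPoolExecutor

def procedure(i: int) -> int:      # procedure level: a callable inside a task
    return i * i

def task(n: int) -> int:           # task level: whole tasks run in parallel
    return sum(procedure(i) for i in range(n))

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # Four independent tasks of one application execute in parallel.
        print(list(pool.map(task, [10, 20, 30, 40])))
```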
Data dependencies are determined by the precedence constraints between tasks; a scheduling sketch follows.
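One way to see precedence constraints in action: a sketch, assuming four hypothetical tasks A through D, that uses Python's standard graphlib to derive which tasks may run concurrently.

```python
# Precedence constraints between tasks, represented as a DAG, determine
# which tasks may run concurrently. Tasks in the same "wave" have no
# path between them, so they can execute in parallel.
from graphlib import TopologicalSorter   # Python 3.9+

# deps[t] = set of tasks that must finish before t starts
deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

ts = TopologicalSorter(deps)
ts.prepare()
while ts.is_active():
    wave = list(ts.get_ready())      # every task in `wave` is dependency-free
    print("run in parallel:", wave)  # ['A'], then ['B', 'C'], then ['D']
    ts.done(*wave)
```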
Processing paradigms • Applications can be modeled as: serial, serial-parallel-serial without dependencies, and serial-parallel-serial with dependencies.
Processing paradigms • Serial: the degree of parallelism is 1.
Processing paradigms • Serial-parallel-serial without dependencies: models the master-slave condition; a sketch of this paradigm appears below.
Processing paradigms • Serial-parallel-serial with dependencies: possible blocking conditions on some processors.
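A minimal sketch of the serial-parallel-serial (master-slave) paradigm without dependencies, using Python's multiprocessing; the chunking scheme and worker function are illustrative assumptions.

```python
# A serial master partitions the work, slaves process the pieces in
# parallel, and the master serially combines the results.
from multiprocessing import Pool

def slave(chunk: list) -> int:
    return sum(x * x for x in chunk)          # independent work, no dependencies

if __name__ == "__main__":
    data = list(range(1000))
    chunks = [data[i::4] for i in range(4)]   # serial phase: master splits work
    with Pool(processes=4) as pool:
        partial = pool.map(slave, chunks)     # parallel phase: slaves compute
    print(sum(partial))                       # serial phase: master combines
```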
Treleaven and Myers taxonomy • Milutinovic combined the taxonomies of Treleaven and Myers based on what drives the computational flow of the architecture: • Control-flow • Data-flow • Demand-driven
1. Control-driven (control-flow) architecture models • In control-flow architectures the flow of computation is determined by the instruction sequence and by the flow of data as instructions execute. • This sequential flow of execution is controlled by the programmer. • Control-flow architectures: • RISC architectures • CISC architectures • HLL architectures
Control-flow • What motivated the design of RISC processors?
CISC-RISC • The complexity of the CISC control unit left very little chip space for anything else. • By reducing the complexity of the instruction set, more real estate became available for other processor functions, • such as pipelines and larger register files.
CISC-RISC • Historically the x86 instruction set has had too much inertia for RISC processors to catch on in the mass market. • Today the characteristics that distinguish a RISC from a CISC are getting blurred.
HLL Architecture • Direct execution of a high-level language by the processor. • One language per processor. • The SYMBOL machine, developed in 1971, is one example. • An embedded JVM?
2. Data-driven (data-flow) architecture models • Instruction execution is determined by data availability instead of by a program counter; instructions are ready to execute as soon as their operands are available. • Computational results (data tokens) are passed directly between instructions. Once consumed by an executing instruction, data tokens are not reusable by other instructions. A token consists of data and the address (tag) of a destination node.
data-flow • An arriving token is compared against those in a matching store. If matched, the tokens are extracted and the instruction is issued for execution. A data-dependency (data-flow) graph is used to represent the program; a toy simulation follows.
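A toy simulation of token matching and data-driven firing. The graph, which evaluates the hypothetical expression (a + b) * (a - b), and all names in it are assumptions for illustration.

```python
# An instruction fires as soon as all of its operand tokens have
# arrived in the matching store; there is no program counter.
import operator

# node -> (operation, number of operands it waits for, destination node)
graph = {
    "add": (operator.add, 2, "mul"),
    "sub": (operator.sub, 2, "mul"),
    "mul": (operator.mul, 2, None),
}
matching_store = {name: [] for name in graph}

def send_token(dest, value):
    """Deliver a data token; fire the node once its operands match."""
    op, arity, nxt = graph[dest]
    matching_store[dest].append(value)
    if len(matching_store[dest]) == arity:    # all operands available
        result = op(*matching_store[dest])    # the instruction fires
        if nxt is None:
            print("result:", result)
        else:
            send_token(nxt, result)           # result becomes a new token

a, b = 6, 2
for dest in ("add", "sub"):                   # initial tokens enter the graph
    send_token(dest, a)
    send_token(dest, b)
# prints: result: 32   i.e. (6 + 2) * (6 - 2)
```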
3. Reduction architecture (demand driven) models • instructions execute only when results are required as operands for another instruction already enabled for execution.
Consider the evaluation of an expression: • a data-driven mechanism follows a bottom-up approach, while the demand-driven mechanism uses a top-down approach (the two are contrasted in the sketch below).
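A sketch contrasting the two evaluation orders on (a + b) * (a - b); the expression is hypothetical, chosen only to make the traversal orders visible.

```python
# Data-driven (bottom-up): operands trigger computation as they appear.
def data_driven(a, b):
    s = a + b          # computed as soon as a and b are available
    d = a - b          # likewise, even if the product were never needed
    return s * d

# Demand-driven (top-down): nothing is computed until its result is
# demanded; Python lambdas stand in for unevaluated (suspended) expressions.
def demand_driven(a, b):
    s = lambda: a + b          # suspended computation
    d = lambda: a - b
    return s() * d()           # the demand for the product forces both

print(data_driven(6, 2), demand_driven(6, 2))   # 32 32
```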
Flynn’s Taxonomy • Based on the degree of parallelism exhibited by the architecture in its data and control flow mechanisms.
SISD (Single-Instruction [stream] Single Data [stream]) • Serial machine (the typical von Neumann design). Instructions are executed sequentially. • Execution stages may be overlapped (pipelined). An SISD machine may have more than one functional unit, but all functional units are supervised by one control unit.
SIMD (Single-Instruction [stream] Multiple Data [stream]) • Multiple processing elements (PEs) are supervised by the same control unit. The control unit broadcasts the same instruction to all PEs, which operate on different data. • All PEs share the same memory, though it may be subdivided into different modules. • SIMD systems with n processors can provide an n-fold speedup as long as a high degree of parallelism is supported at the instruction level; a miniature illustration follows.
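A miniature illustration of the SIMD idea, assuming NumPy is available; the vectorized operation stands in for an instruction broadcast to all PEs.

```python
# One "instruction" (the vectorized expression) is applied to many data
# elements at once, much as a SIMD control unit drives all PEs in lockstep.
import numpy as np

data = np.arange(8)            # 8 data elements, one per notional PE
result = data * 2 + 1          # the same instruction applied to every element
print(result)                  # [ 1  3  5  7  9 11 13 15]
```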
MISD (Multiple-Instruction [stream] Single Data [stream]) • This architecture has no literal architectural implementation.
MIMD (Multiple-Instruction [stream] Multiple Data [stream]) • The Multiple Instruction Multiple Data architecture is the area in which the vast majority of recent developments in parallel architectures have taken place. • In this architecture each processor has its own instruction stream, executed under the control of that processor's own control unit. • Furthermore, each processor will often have an amount of local memory upon which the instructions primarily operate. • Execution therefore cannot be synchronous without explicit interprocessor synchronization mechanisms (illustrated below).
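A sketch of MIMD in miniature, using threads as stand-ins for processors; the worker functions are hypothetical.

```python
# Each thread runs its own instruction stream on its own local data;
# because the streams are asynchronous, an explicit mechanism (here a
# Barrier) is needed to synchronize them.
import threading

barrier = threading.Barrier(2)

def worker_a():
    local = sum(range(100))                 # its own instructions, local data
    barrier.wait()                          # explicit synchronization point
    print("A done:", local)

def worker_b():
    local = max(x * x for x in range(50))   # a different instruction stream
    barrier.wait()
    print("B done:", local)

threads = [threading.Thread(target=worker_a), threading.Thread(target=worker_b)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```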
MIMD - shared memory • In shared-memory systems, n processors and p memory modules exchange information through an interconnection network; any pair of processors can communicate via shared locations. Ideally, p ≥ n and the interconnection network should allow p simultaneous accesses to keep all processors busy. A sketch of communication through shared locations follows.
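A minimal sketch, assuming Python's multiprocessing: two processes exchange information only by reading and writing shared locations, as in the shared-memory MIMD model. The producer/consumer roles are illustrative assumptions.

```python
from multiprocessing import Process, Array, Lock

def producer(shared, lock):
    with lock:
        for i in range(len(shared)):
            shared[i] = i * i                # write through the shared locations

def consumer(shared, lock):
    with lock:
        print("consumer sees:", list(shared))

if __name__ == "__main__":
    lock = Lock()
    shared = Array("i", 5)                   # 5 shared integer locations
    p = Process(target=producer, args=(shared, lock))
    p.start(); p.join()                      # ensure the writes happen first
    c = Process(target=consumer, args=(shared, lock))
    c.start(); c.join()
```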
MIMD - shared memory • Shared-memory models: • The uniform-memory-access (UMA) model, • The nonuniform-memory-access (NUMA) model, and • The cache-only memory architecture (COMA) model.
UMA Model • In the UMA model physical memory is uniformly shared by all the processors, and all processors have equal access time to all memory modules. Each processor, however, may have its own private cache. Systems with a high degree of sharing are referred to as tightly coupled.
symmetric multiprocessor system • When all processors have equal access to all resources, the system is referred to as a symmetric multiprocessor system; • i.e., all processors are equally capable of running executive programs (the OS kernel) and I/O service routines.
asymmetric system • An asymmetric system features only one processor, or a subset of the processors, with executive capabilities; the remaining processors are referred to as attached processors.
SMP vs aSMP • Symmetric multiprocessor systems have identical processors with identical functions. (By contrast, an asymmetric multiprocessor system allocates resources to a specific processor even if that CPU is overloaded and others are relatively free.) The clear advantage of the symmetric approach is the balancing of the processing load across all resources.
NUMA Model • In this model access time depends on the location of the memory item: shared memory modules are physically distributed to the processors as local memory.
COMA Model • This model assumes cache-only memory; it is a special case of the NUMA model in which the distributed memory modules are converted to local cache memories.
cc-NUMA • Cache-coherent NUMA (cc-NUMA) combines distributed shared memory with cache directories.
DMM • Distributed-memory multicomputers • Clusters • These systems consist of multiple independent computer nodes interconnected via a message-passing network that provides point-to-point interconnection between nodes; a minimal message-passing sketch follows.
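A minimal sketch of point-to-point message passing, using a Pipe between local processes as a stand-in for the interconnection network; the node function and payload are illustrative assumptions.

```python
# Nodes share nothing; they exchange data only as explicit messages
# over a channel, the communication style of a distributed-memory
# multicomputer.
from multiprocessing import Process, Pipe

def node(conn):
    msg = conn.recv()                    # block until a message arrives
    conn.send(msg * 2)                   # reply with a computed result
    conn.close()

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=node, args=(child_end,))
    p.start()
    parent_end.send(21)                  # send work to the remote node
    print("reply:", parent_end.recv())   # reply: 42
    p.join()
```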
DMM • Advantages: • high throughput, • fault-tolerance, • dynamic reconfiguration in response to processing loads.