PARALLEL APPLICATIONS EE 524/CS 561 Kishore Dhaveji 01/09/2000
Parallelism • First step towards parallelism - finding suitable parallel algorithms. • Parallelism - the ability of many independent threads of control to make progress simultaneously. • Available parallelism - the maximum number of independent threads that can work simultaneously.
Amdahl’s law • If only one percent of a problem fails to parallelize, then no matter how much parallelism is available for the rest, the problem can never be solved more than a hundred times faster than in the sequential case.
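A standard worked form of this bound (the textbook statement of Amdahl's law, not taken from the slides): if a fraction s of the work is inherently sequential, then on N processors

```latex
% Amdahl's law: speedup on N processors with sequential fraction s
S(N) = \frac{1}{\,s + \dfrac{1 - s}{N}\,},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{s}.
% With s = 0.01 (one percent fails to parallelize), the speedup can never
% exceed 1 / 0.01 = 100, exactly the factor quoted above.
```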
Reasons for Parallelizing Code • Problem size • Large problems may require more memory than a single machine can affordably provide. • Real time requirements • Turnaround requirements on the order of minutes, hours or days. • e.g. weather forecasting, financial modeling, etc. • If parallelizing is done for performance, the effect of other users on the system should be taken into account.
Categories of Parallel Algorithms • Regular and Irregular • Synchronous and Asynchronous • Coarse and Fine Grained • Bandwidth Greedy and Frugal • Latency Tolerant and Intolerant • Distributed and Shared Address Spaces • Beowulf Systems & Choices of Parallelism
Regular and Irregular • Regular algorithms • Use data structures that fit naturally into rectangular arrays • Easier to partition into separate parallel processes • Irregular algorithms • Use complicated data structures, e.g., trees, graphs, lists, hash-tables, indirect addressing • Require careful consideration for parallelization • Exploit features like sparseness, hierarchies, unpredictable updates
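As a minimal illustration (the helper below is hypothetical, not from the slides): a regular array decomposes into contiguous blocks almost mechanically, whereas trees, graphs, or hash tables need explicit partitioning heuristics that exploit their structure.

```python
# Minimal sketch: partitioning a regular 1-D array into near-equal blocks,
# one per process. Irregular structures (graphs, trees, lists) have no such
# natural decomposition and require explicit partitioning heuristics.
def block_partition(n_elements, n_procs):
    """Return (start, end) index ranges, one per process."""
    base, extra = divmod(n_elements, n_procs)
    ranges, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + size))
        start += size
    return ranges

if __name__ == "__main__":
    # 10 elements over 4 processes -> [(0, 3), (3, 6), (6, 8), (8, 10)]
    print(block_partition(10, 4))
```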
Synchronous and Asynchronous • Parallel parts of synchronous algorithms must proceed in “lockstep”. • In asynchronous algorithms, different parts of the algorithm can “get ahead” of one another. • Asynchronous algorithms often require less bookkeeping and hence are much easier to parallelize.
Coarse and Fine Grained • Grain size refers to the amount of work performed by each process in a parallel computation. • A large grain size calls for less interprocessor communication. • The ratio of communication to computation is proportional to the ratio of surface area to volume - this ratio ought to be small for best results.
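A worked instance of the surface-to-volume argument, assuming a 2-D grid computation with nearest-neighbour communication (the grid setting is an assumption, not stated on the slide):

```latex
% One process owns an n x n block of a 2-D grid.
% It computes on every cell (area) but only exchanges boundary cells (perimeter).
\frac{\text{communication}}{\text{computation}}
  \;\propto\; \frac{\text{perimeter}}{\text{area}}
  \;=\; \frac{4n}{n^2}
  \;=\; \frac{4}{n}.
% Increasing the grain size n drives the ratio toward zero, which is why
% coarse grains perform better on Beowulf-class networks.
```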
Bandwidth Greedy and Frugal • tcomm = tlatency + message length / bandwidth • CPU speed >> memory speed >> network speed (2400 MB/s >> 128-182 MB/s >> 15 MB/s) • Memory bandwidth and network speed are the limiting factors in overall system performance. • Algorithms differ in how much bandwidth they demand - some are bandwidth greedy, others frugal. • Increasing the grain size leads to bandwidth frugal, better performing algorithms.
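A hedged numerical reading of this cost model; the 100 µs latency is an assumed figure, while the 15 MB/s bandwidth is the network number quoted above:

```latex
% t_comm = t_latency + length / bandwidth, with t_latency = 100 us (assumed)
% and bandwidth = 15 MB/s (the network figure from the slide).
t_{\mathrm{comm}}(1\,\mathrm{KB}) \approx 100\,\mu\mathrm{s}
  + \frac{10^{3}}{15 \times 10^{6}}\,\mathrm{s} \approx 167\,\mu\mathrm{s},
\qquad
t_{\mathrm{comm}}(1\,\mathrm{MB}) \approx 100\,\mu\mathrm{s}
  + \frac{10^{6}}{15 \times 10^{6}}\,\mathrm{s} \approx 67\,\mathrm{ms}.
% A bandwidth greedy algorithm sends many large messages and pays the second
% cost; a frugal one keeps messages few and small.
```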
Latency Tolerant and Intolerant • Latency - the time from when a message is sent until its delivery begins. • Half-performance message length (in bytes): n1/2 = latency x bandwidth. • Longer messages are bandwidth dominated; shorter messages are latency dominated. • High latencies are the most conspicuous shortcoming of Beowulf systems, so successful algorithms are usually latency tolerant.
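The n1/2 rule can be read as the message size at which the latency and bandwidth terms of tcomm are equal; with the same assumed 100 µs latency and the 15 MB/s network figure:

```latex
% Half-performance message length: latency term = bandwidth term
n_{1/2} = t_{\mathrm{latency}} \times \mathrm{bandwidth}
        \approx 100\,\mu\mathrm{s} \times 15\,\mathrm{MB/s}
        = 1500\ \text{bytes}.
% Messages much shorter than n_1/2 are latency dominated;
% messages much longer are bandwidth dominated.
```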
Distributed and Shared Address Spaces • Shared address space • All processors access a common shared address space. • Simplifies the design of parallel algorithms. • Prone to race conditions and non-determinacy. • Distributed address space • Communication through message passing procedures. • Designing languages and compilers for this model has been difficult.
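A toy sketch of the two models using Python's multiprocessing module (illustrative only; the slides do not name a particular library): the shared-memory version updates a value both processes can see and needs a lock, while the message-passing version keeps data private and exchanges it explicitly.

```python
# Minimal sketch: the same increment done in a shared address space
# vs. by message passing.
from multiprocessing import Process, Value, Pipe

def shared_worker(counter):
    # Shared address space: the worker updates memory the parent also sees.
    # The lock guards against the race conditions mentioned above.
    with counter.get_lock():
        counter.value += 1

def message_worker(conn):
    # Distributed address space: nothing is shared; the worker receives a
    # value, updates its private copy, and sends the result back.
    value = conn.recv()
    conn.send(value + 1)

if __name__ == "__main__":
    counter = Value("i", 0)
    p = Process(target=shared_worker, args=(counter,))
    p.start(); p.join()
    print("shared:", counter.value)            # -> shared: 1

    parent, child = Pipe()
    q = Process(target=message_worker, args=(child,))
    q.start()
    parent.send(0)
    print("message passing:", parent.recv())   # -> message passing: 1
    q.join()
```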
Choice of Parallelism • Requires large grain size algorithms • Latency tolerant and bandwidth frugal algorithms • Regular and asynchronous algorithms • Tuning of Beowulf system
Process Level Parallelism • Beowulf systems are well suited for process level parallelism, i.e. running multiple independent processes. • The requirement for process level parallelism is the existence of a sequential code that must be run many times. • If the process pool is large, processes can self-schedule and load balance across the system.
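A minimal sketch of self-scheduling process level parallelism, assuming a hypothetical run_case function standing in for the existing sequential code:

```python
# A pool of workers pulls independent runs of the same sequential code from a
# shared task list, so faster workers automatically take more tasks
# (self-scheduling gives load balancing for free).
from multiprocessing import Pool

def run_case(params):
    """Stand-in for one run of the existing sequential code (hypothetical)."""
    seed, size = params
    return sum((seed * i) % size for i in range(size))

if __name__ == "__main__":
    cases = [(seed, 10_000) for seed in range(64)]   # large pool of runs
    with Pool(processes=4) as pool:
        # chunksize=1 lets workers self-schedule one case at a time
        results = pool.map(run_case, cases, chunksize=1)
    print(len(results), "cases completed")
```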
Utilities for Parallel Computing • A dispatcher program takes an arbitrary list of commands and dispatches them to a set of processors. • No more than one command is run on any host at any one time. • The order in which commands are dispatched to hosts and run is arbitrary.
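A hedged sketch of such a dispatcher (the host names are placeholders, and ssh stands in where the original utilities used rsh): each worker thread owns one host, so a host never runs more than one command at a time, and commands are pulled from a shared queue in whatever order workers become free.

```python
# Dispatch an arbitrary list of shell commands across a set of hosts,
# at most one command per host at any one time.
import queue
import subprocess
import threading

HOSTS = ["node01", "node02", "node03", "node04"]   # hypothetical host names

def host_worker(host, commands):
    while True:
        try:
            cmd = commands.get_nowait()
        except queue.Empty:
            return                                  # nothing left to run
        subprocess.run(["ssh", host, cmd], check=False)
        commands.task_done()

def dispatch(command_list):
    commands = queue.Queue()
    for cmd in command_list:
        commands.put(cmd)
    threads = [threading.Thread(target=host_worker, args=(h, commands))
               for h in HOSTS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    dispatch([f"./simulate --run {i}" for i in range(16)])
```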
Overheads - rsh and File I/O • Startup costs of establishing connections, authenticating the user and executing remote commands. • Additional overhead of I/O. • If the executable is stored on an NFS-mounted file system, it is transferred over the network for every invocation. • NFS performance is limited by the server and its network connections.
Overheads - rsh and File I/O • Startup costs can be reduced by pre-staging data and executables on local disks and storing the results of computation locally. • Discipline in the placement of files in directories avoids unnecessary network traffic. • NFS is a sequential system without broadcast; this can be mitigated by caching data that does not change, e.g., input files and executables (swap space and /scratch).
Summary • Process level parallelism is the easiest and most cost-effective approach for Beowulf systems. • Optimizations reduce network traffic and improve I/O performance. • Process level parallelism load balances automatically. • Multiple users can be accommodated with variable priorities and scheduling policies. • A GUI can display the status of the system.