EECS 570
Notes on Chapter 1 – Introduction
• What is Parallel Architecture?
• Evolution and convergence of Parallel Architectures
• Fundamental Design Issues
• Acknowledgements
Slides are derived from work by Steve Reinhardt (Michigan), Mark Hill (Wisconsin), Sarita Adve (Illinois), Babak Falsafi (CMU), Alvy Lebeck (Duke), and J. P. Singh (Princeton). Many thanks!
EECS 570: Fall 2003 -- rev3
What is Parallel Architecture?
• parallel – OED
• parallel (ˈpærəlɛl), a. and sb.
  • 2. d. Computers. Involving the concurrent or simultaneous performance of certain operations; functioning in this way.
  • 1948 Math. Tables & Other Aids to Computation III. 149: The use of plugboard facilities and punched cards permits parallel operation (as distinguished from sequence operation), with further gain in efficiency.
  • 1963 W. H. Ware, Digital Computer Technol. & Design II. xi. 3: Parallel arithmetic tends to be faster than serial arithmetic because it performs operations in all columns at once, rather than in one column at a time.
  • 1974 P. H. Enslow, Multiprocessors & Parallel Processing i. 1: This book focuses on .. the integration of multiple functional units into a multiprocessing or parallel processing system.
Spectrum of Parallelism
[Figure: spectrum from serial → pipelining → superscalar → VLIW → multithreading → multiprocessing → distributed systems, with course numbers 370, 470, 570, and 591 marking which courses cover which part of the range]
• Key differences
  • granularity of operations
  • frequency/overhead of communication
  • degree of parallelism
  • source of parallelism
    • data vs. control
    • parts of a larger task vs. independent tasks
    • source of decomposition (hardware, compiler, programmer, OS, …)
Course Focus: Multithreading & Multiprocessing
• High end: many applications where even a fast CPU isn’t enough
  • Scientific computing: aerodynamics, crash simulation, drug design, weather prediction, materials, …
  • General-purpose computing: graphics, databases, web servers, data mining, financial modeling, …
• Low end:
  • cheap microprocessors make small MPs affordable
• Future:
  • Chip multiprocessors (almost here) – CMPs
  • Multithreading supported on the Pentium 4
    • see http://www.intel.com/homepage/land/hyperthreading_more.htm
Motivation
• N processors in a computer can provide:
  • Higher throughput via many jobs in parallel
    • individual jobs no faster
  • Cost-effectiveness may improve: users share a central resource
  • Lower latency
    • from shrink-wrapped software (e.g., Photoshop™)
    • from parallelizing your application (but this is hard)
    • from reduced queuing delays
• Need something faster than today’s microprocessor?
  • Wait for tomorrow’s microprocessor, or
  • Use many microprocessors in parallel
Historical Perspective
• The end of uniprocessor performance growth has frequently been predicted due to fundamental limits
  • spurred work in parallel processing – cf. Thornton’s arguments for ILP in the CDC 6600: http://www.cs.nmsu.edu/~pfeiffer/classes/473/notes/cdc.html
• No common parallel programming model
  • unlike the von Neumann model for uniprocessors
  • many models: data parallel, shared memory, message passing, dataflow, systolic, graph reduction, declarative logic
  • no pre-existing software to target
  • no common building blocks – high-performance micros are changing this
  • result: lots of one-of-a-kind architectures with no software base
    • the architecture defines the programming model
What’s different today?
Key: a microprocessor is now the fastest uniprocessor you can build
• insurmountable handicap to build on anything else
  • Amdahl’s law
• favorable performance per $
  • general-purpose processors (GPPs) enjoy volume production
• small-scale bus-based shared memory is well understood
  • P6 (Pentium Pro/II/III) supports 4-way “glueless” MP
  • supported by most common OSes (e.g., NT, Solaris, Linux)
Technology Trends
• The natural building block for multiprocessors (the commodity microprocessor) is now also about the fastest processor you can build
What's different today? (cont'd)
• Meanwhile, programming models have converged to a few:
  • shared memory (better: shared address space)
  • message passing
  • data parallel (compiler maps to one of the above)
  • dataflow (more as a concept than a model)
• Result: a parallel system is microprocessors + memory + interconnection network
• Still many design issues to consider
Parallel Architecture Today
• Key: abstractions & interfaces for communication and cooperation
• Communication Architecture
  • equivalent to the Instruction Set Architecture for uniprocessors
• Must consider
  • usability (programmability) & performance
  • feasibility/complexity of implementation (hw or sw)
• Compilers, libraries, and the OS are important bridges today
Modern Layered Framework
Survey of Programming Models
• Shared Address Space
• Message Passing
• Data Parallel
• Others:
  • Dataflow
  • Systolic Arrays (see text)
For each, examine the programming model, motivation, intended applications, and contributions to convergence
Simple Example

  int i;
  double a, x[N], y[N], z[N], sum;

  /* input a, x[], y[] */

  sum = 0;
  for (i = 0; i < N; ++i) {
      z[i] = a * x[i] + y[i];
      sum += z[i];
  }
Dataflow Graph
[Figure: dataflow graph of the simple example – for each element, A is multiplied by X[i], the product is added to Y[i], and the results feed a chain of + nodes that accumulate sum]
• 2 + N-1 cycles to execute on N processors
• what assumptions?
Shared Address Space Architectures
Any processor can directly reference any memory location
• Communication occurs implicitly as a result of loads and stores (see the sketch below)
• Need additional synchronization operations
Convenient:
• Location transparency
• Similar programming model to time-sharing on uniprocessors
  • except processes run on different processors
  • good throughput on multiprogrammed workloads
Within one process (“lightweight” threads): all memory shared
Among processes: portions of address spaces shared (mmap, shmat)
• In either case, variables may be logically global or per-thread
Popularly known as the shared memory machine or model
• Ambiguous: memory may be physically distributed among processors
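A minimal illustrative sketch (not from the original slides) of this model in C with POSIX threads: communication happens through ordinary loads and stores to shared variables, plus an explicit synchronization operation. The variable and function names are invented for the example.

  #include <pthread.h>
  #include <stdio.h>

  /* Shared (logically global) data: both threads reference it directly. */
  static int ready = 0;
  static double value;
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  static void *producer(void *arg)
  {
      pthread_mutex_lock(&lock);      /* explicit synchronization */
      value = 42.0;                   /* communication is just a store ... */
      ready = 1;
      pthread_mutex_unlock(&lock);
      return NULL;
  }

  static void *consumer(void *arg)
  {
      int done = 0;
      while (!done) {
          pthread_mutex_lock(&lock);
          if (ready) {                /* ... and the matching load */
              printf("got %f\n", value);
              done = 1;
          }
          pthread_mutex_unlock(&lock);
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t p, c;
      pthread_create(&p, NULL, producer, NULL);
      pthread_create(&c, NULL, consumer, NULL);
      pthread_join(p, NULL);
      pthread_join(c, NULL);
      return 0;
  }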
Small-scale Implementation
Natural extension of a uniprocessor: already have processor, memory modules, and I/O controllers on an interconnect of some sort
• typically a bus
• may be a crossbar (mainframes)
• occasionally a multistage network (vector machines, ??)
Just add processors!
[Figure: processors, memory modules, and I/O controllers (with I/O devices) attached to a shared interconnect]
Simple Example: SAS version

  /* per-thread */
  int i, my_start, my_end, me;
  /* global (shared) */
  double a, x[N], y[N], z[N], sum;

  /* my_start, my_end based on N, # nodes */
  for (i = my_start; i < my_end; ++i)
      z[i] = a * x[i] + y[i];

  BARRIER;

  if (me == 0) {
      sum = 0;
      for (i = 0; i < N; ++i)
          sum += z[i];
  }
Message Passing Architectures
Complete computer as the building block
• ld/st access only the private address space (local memory)
• Communication via explicit I/O operations (send/receive)
  • specify local memory buffers
  • synchronization implicit in messages
Programming interface often more removed from the basic hardware
• library and/or OS intervention
Biggest machines are of this sort
• IBM SP/2
• DoE ASCI program
• Clusters (Beowulf etc.)
Simple Example: MP version

  int i, me;
  double a, x[N/P], y[N/P], z[N/P], sum, tmp;

  sum = 0;
  for (i = 0; i < N/P; ++i) {
      z[i] = a * x[i] + y[i];
      sum += z[i];
  }

  if (me != 0)
      send(sum, 0);              /* send partial sum to node 0 */
  else
      for (i = 1; i < P; ++i) {  /* node 0 collects the other partial sums */
          recv(tmp, i);
          sum += tmp;
      }
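For comparison, a hedged modern sketch (not from the original slides) of the same collection step using MPI: one collective call plays the role of the explicit send/recv loop above. The partial-sum value here is a stand-in.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int me, P;
      double local_sum, total;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &me);
      MPI_Comm_size(MPI_COMM_WORLD, &P);

      local_sum = me + 1.0;   /* stand-in for this node's partial sum */

      /* One collective call replaces the explicit send/recv loop:
         all partial sums are combined with + and delivered to rank 0. */
      MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (me == 0)
          printf("sum = %f\n", total);

      MPI_Finalize();
      return 0;
  }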
Convergence: Scaling Up SAS
• Problem is the interconnect: cost (crossbar) or bandwidth (bus)
• Distributed memory or non-uniform memory access (NUMA)
  • “Communication assist” turns non-local accesses into simple message transactions (e.g., read-request, read-response)
  • issues: cache coherence, remote memory latency
  • MP hardware specialized for read/write requests
[Figure: two organizations – “dance hall” (processors with caches on one side of the interconnect, memories on the other) and distributed memory (each node has a processor, cache, and memory on the interconnect)]
Separation of Architecture from Model
At the lowest level, shared memory (SM) sends messages
• HW is specialized to expedite read/write messages
What programming model/abstraction is supported at the user level?
• Can I have a shared-memory abstraction on message-passing HW?
• Can I have a message-passing abstraction on shared-memory HW?
Recent research machines integrate both
• Alewife, Tempest/Typhoon, FLASH
Data Parallel Systems
Programming model
• Operations performed in parallel on each element of a data structure
• Logically a single thread of control performs sequential or parallel steps
• Synchronization implicit in sequencing
• Conceptually, a processor associated with each data element
Architectural model
• Array of many simple, cheap processing elements (PEs, really just datapaths) with no instruction memory and little data memory each
• Attached to a control processor that issues instructions
• Specialized and general communication, cheap global synchronization
[Figure: control processor broadcasting instructions to a grid of PEs]
Simple Example: DP version

  double a, x[N], y[N], z[N], sum;

  z = a * x + y;
  sum = reduce(+, z);

Language supports array assignment and global operations
Other examples: document searching, image processing, ...
Some recent (within the last decade+) machines:
• Thinking Machines CM-1, CM-2 (and CM-5)
• MasPar MP-1 and MP-2
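A present-day analogue (an added sketch, not from the original slides): the same data-parallel computation written in C with an OpenMP reduction, which the compiler/runtime maps onto a shared-memory machine. N and the function name are invented; compile with e.g. -fopenmp.

  /* Data-parallel example expressed in C with OpenMP:
     the reduction clause plays the role of reduce(+, z). */
  #define N 1024

  double daxpy_sum(double a, const double x[N], const double y[N], double z[N])
  {
      double sum = 0.0;
      int i;
  #pragma omp parallel for reduction(+:sum)
      for (i = 0; i < N; ++i) {
          z[i] = a * x[i] + y[i];
          sum += z[i];
      }
      return sum;
  }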
DP Convergence with SAS/MP
Popular when the cost savings of a centralized sequencer were high
• 1960s, when a CPU was a cabinet
• Replaced by vectors in the mid-70s
  • more flexible w.r.t. memory layout and easier to manage
• Revived in the mid-80s when a datapath just fit on a chip (w/o control)
  • no longer true with modern microprocessors
Other reasons for demise
• DP applications are simple, regular
  • relatively easy targets for compilers
  • easy to partition across a relatively small # of microprocessors
  • MIMD machines effective for these apps plus many others
Contributions to convergence
• utility of fast global synchronization, reductions, etc.
• high-level model that can compile to either platform
Dataflow Architectures
Represent computation as a graph of essential dependences
• A logical processor at each node, activated by availability of operands
• Messages (tokens) carrying the tag of the next instruction are sent to the next processor
• Tag compared with others in the matching store; a match fires execution
Basic Dataflow Architecture
DF Evolution and Convergence
Key characteristics
• high parallelism: no artificial limitations
• dynamic scheduling: fully exploits multiple execution units
Problems
• Operations have locality; it is nice to group them to reduce communication
• No conventional notion of memory – how do you declare an array?
• Complexity of the matching store: a large associative search!
• Too much parallelism! Management overhead > benefit
Converged to use conventional processors and memory
• Group related ops to exploit registers and cache – fine-grain threading
• Results communicated between threads via messages
Lasting contributions:
• Stresses tightly integrated communication & execution (e.g., create a thread to handle a message)
• Remains a useful concept for ILP hardware and compilers
Programming Model Design Issues
• Naming: How are communicated data and/or the partner node referenced?
• Operations: What operations are allowed on named data?
• Ordering: How can producers and consumers of data coordinate their activities?
• Performance
  • Latency: How long does it take to communicate in a protected fashion?
  • Bandwidth: How much data can be communicated per second? How many operations per second?
Issue: Naming
Single Global Linear Address Space (shared memory)
Single Global Segmented Name Space (global objects / data parallel)
• uniform address space
• uniform accessibility (load/store)
Multiple Local Address/Name Spaces (message passing)
• two-level address space (node + memory address)
• non-uniform accessibility (use messages if node != me)
Naming strategy affects
• Programmer/Software
• Performance
• Design Complexity
Issue: Operations
• SAS
  • ld/st, arithmetic on any item (in the source language)
  • additional ops for synchronization (locks, etc.), usually on memory locations
• Message passing
  • ld/st, arithmetic, etc. only on local items
  • send/recv on a (local memory range, remote node ID) tuple
• Data parallel
  • arithmetic, etc.
  • global operations (sum, max, min, etc.)
Ordering
• Uniprocessor
  • program order of instructions (note: specifies the effect, not the reality)
• SAS
  • uniprocessor ordering within a thread
  • implicit memory ordering among threads is very subtle (see the example below)
  • need explicit synchronization operations
• Message passing
  • uniprocessor ordering within a node; can't recv before send
• Data parallel
  • program order of operations (just like a uniprocessor)
  • all parallelism is within individual operations
  • implicit global barrier after every step
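To make "very subtle" concrete, here is a classic store-buffering example (an added sketch, not from the original slides) written with C11 atomics and POSIX threads. With relaxed ordering, both loads may return 0, an outcome impossible under any simple interleaving of program order, which is why explicit synchronization and ordering operations are needed. All names are invented.

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  /* Store-buffering litmus test. */
  atomic_int x = 0, y = 0;
  int r1, r2;

  void *t0(void *arg)
  {
      atomic_store_explicit(&x, 1, memory_order_relaxed);
      r1 = atomic_load_explicit(&y, memory_order_relaxed);
      return NULL;
  }

  void *t1(void *arg)
  {
      atomic_store_explicit(&y, 1, memory_order_relaxed);
      r2 = atomic_load_explicit(&x, memory_order_relaxed);
      return NULL;
  }

  int main(void)
  {
      pthread_t a, b;
      pthread_create(&a, NULL, t0, NULL);
      pthread_create(&b, NULL, t1, NULL);
      pthread_join(a, NULL);
      pthread_join(b, NULL);
      /* r1 == 0 && r2 == 0 is a legal result here; sequentially consistent
         accesses or explicit fences would forbid it. */
      printf("r1=%d r2=%d\n", r1, r2);
      return 0;
  }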
Issue: Order/Synchronization
Coordination mainly takes three forms:
• mutual exclusion (e.g., spin-locks)
• event notification
  • point-to-point (e.g., producer-consumer)
  • global (e.g., end-of-phase indication, to all or a subset of processes)
• global operations (e.g., sum)
Issues:
• synchronization name space (entire address space or a portion)
• granularity (per byte, per word, ... => overhead)
• low latency, low serialization (hot spots)
• variety of approaches (a spin-lock sketch follows)
  • test&set, compare&swap, load-locked/store-conditional
  • full/empty bits and traps
  • queue-based locks, fetch&op with combining
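An illustrative sketch (not from the original slides) of one of these approaches: a test-and-test&set spin-lock built on C11 atomics. Function and type names are invented; real systems add backoff or use queue-based locks to avoid hot spots.

  #include <stdatomic.h>
  #include <stdbool.h>

  typedef struct { atomic_bool held; } spinlock_t;

  static void spin_init(spinlock_t *l)
  {
      atomic_init(&l->held, false);
  }

  static void spin_lock(spinlock_t *l)
  {
      for (;;) {
          /* test&set: atomically swap in 'true'; we acquired it if it was false */
          if (!atomic_exchange_explicit(&l->held, true, memory_order_acquire))
              return;
          /* test-and-test&set: spin on plain loads until the lock looks free,
             so waiting processors do not flood the interconnect with writes */
          while (atomic_load_explicit(&l->held, memory_order_relaxed))
              ;
      }
  }

  static void spin_unlock(spinlock_t *l)
  {
      atomic_store_explicit(&l->held, false, memory_order_release);
  }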
Communication Performance
Performance characteristics determine the usage of operations at a layer
• Programmers and compilers must choose strategies that lead to performance
Fundamentally, three characteristics:
• Latency: time from send to receive
• Bandwidth: maximum transmission rate (bytes/sec)
• Cost: impact on the program's execution time
If the processor does one thing at a time: bandwidth ∝ 1/latency
• but actually more complex in modern systems
Communication Cost Model
• Communication time for one n-byte message:
  Comm Time = latency + n / bandwidth
• Latency has two parts:
  • overhead: the time the CPU is busy (protection checks, formatting the header, copying data, etc.)
  • the rest of the latency can be lumped as network delay
• Bandwidth is determined by the communication bottleneck
  • the occupancy of a component is the amount of time that component spends dedicated to one message
  • in steady state, the message rate can't do better than 1/(max occupancy)
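A small worked example with invented numbers: with a 10 µs latency and 100 MB/s bandwidth, a 10 KB message takes about 10 µs + 10,000 B / (100 MB/s) ≈ 110 µs, so short messages are latency-dominated and large messages are bandwidth-dominated. The tiny C helper below makes the same calculation explicit.

  #include <stdio.h>

  /* Communication time for one n-byte message (seconds). */
  static double comm_time(double latency_s, double bandwidth_Bps, double n_bytes)
  {
      return latency_s + n_bytes / bandwidth_Bps;
  }

  int main(void)
  {
      /* Hypothetical machine: 10 us latency, 100 MB/s bandwidth. */
      double lat = 10e-6, bw = 100e6;
      printf("100 B message: %.1f us\n", comm_time(lat, bw, 100)   * 1e6);
      printf("10 KB message: %.1f us\n", comm_time(lat, bw, 10000) * 1e6);
      return 0;
  }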
Cost Model (cont'd)
Overall execution-time impact depends on:
• amount of communication
• amount of comm. time hidden by other useful work (overlap)
  comm cost = frequency * (comm time - overlap)
Note that:
• overlap is limited by overhead
• overlap is another form of parallelism
Replication
• Very important technique for reducing communication frequency
• Depends on the naming model
  • Uniprocessor: caches do it automatically & transparently
  • SAS: uniform naming allows transparent replication
    • caches again
    • the OS can do it at page level (useful with distributed memory)
    • many copies for the same name: coherence problem
  • Message passing:
    • send/receive replicates, giving the data a new name – not transparent
    • software at some level must manage it (programmer, library, compiler)
Summary of Design Issues
Functional and performance issues apply at all layers
• Functional: naming, operations, and ordering
• Performance: organization, latency, bandwidth, overhead, occupancy
Replication and communication are deeply related
• management depends on the naming model
Goal of architects: design against the frequency and type of operations that occur at the communication abstraction, constrained by tradeoffs from above or below
• hardware/software tradeoffs