Basic message passing benchmarks, methodology and pitfalls

Rolf Hempel
C & C Research Laboratories, NEC Europe Ltd.
Rathausallee 10, 53757 Sankt Augustin, Germany
email: hempel@ccrl-nece.technopark.gmd.de

SPEC Workshop, Wuppertal, Sept. 13, 1999
Further info.: http://www.ccrl-nece.technopark.gmd.de
NEC R&D Group Overseas Facilities:
· C&C Research Laboratories, NEC Europe Ltd. (Bonn, Germany)
· NEC Research Institute, Inc. (Princeton, U.S.A.)
· C&C Research Laboratories, NEC U.S.A., Inc. (Princeton, San Jose, U.S.A.)
MPI Implementations at NEC CCRLE:
· MPI/SX product development (currently for SX-4/5)
· MPI design for the Earth Simulator
· MPI for the Cenju-4, derived from MPI/SX
· MPI for the PC cluster LAMP
(Image courtesy of National Space Development Agency of Japan / Japan Atomic Energy Research Institute)
Most important MPI implementation: MPI/SX
SX-5 (in the past SX-4): commercial product of NEC, a parallel vector supercomputer
Since 1997: MPI/SX product development at CCRLE
· Standard-compliant, fully tested interface
· Optimized MPI-1, MPI-2 almost finished
· Maintenance & customer support
MPI design for the Earth Simulator
· Massively parallel computer: thousands of processors
· MPI is the basis for all large-scale applications
· Design and implementation at CCRLE
· At the moment: design in collaboration with the OS group at NEC Fuchu
(Image courtesy of National Space Development Agency of Japan / Japan Atomic Energy Research Institute)
The purpose of message-passing benchmarks
Goal: measure the time needed for elementary operations (such as send/receive), so that the behavior of application programs can be modeled.
Problem: it is difficult to measure the time for a single operation:
· no global clock
· clock resolution too low
· difficult to differentiate receive time from synchronization delay
Standard solution: measure loops over communication operations.
Simple example: the ping-pong benchmark:
Process 0: Send to 1, then Recv from 1
Process 1: Recv from 0, then Send to 0
Iterate 1000 times and measure the time T for the entire loop.
Result: time for a single message = T / 2000
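A minimal sketch of such a ping-pong loop (Fortran with MPI; the message length MSGLEN and the use of exactly two processes are illustrative choices, not prescribed by the slide):

      program pingpong
      implicit none
      include 'mpif.h'
      integer NITER, MSGLEN
      parameter (NITER = 1000, MSGLEN = 128)
      real*8 buf(MSGLEN), t1, t2
      integer myrank, ierror, i
      integer status(MPI_STATUS_SIZE)

      call MPI_INIT (ierror)
      call MPI_COMM_RANK (MPI_COMM_WORLD, myrank, ierror)
      do i = 1, MSGLEN
         buf(i) = 0.0d0
      end do

      t1 = MPI_WTIME ()
      do i = 1, NITER
         if (myrank .eq. 0) then
c           process 0: send the message, then wait for the echo
            call MPI_SEND (buf, MSGLEN, MPI_DOUBLE_PRECISION,
     +                     1, 0, MPI_COMM_WORLD, ierror)
            call MPI_RECV (buf, MSGLEN, MPI_DOUBLE_PRECISION,
     +                     1, 0, MPI_COMM_WORLD, status, ierror)
         else if (myrank .eq. 1) then
c           process 1: wait for the message, then echo it back
            call MPI_RECV (buf, MSGLEN, MPI_DOUBLE_PRECISION,
     +                     0, 0, MPI_COMM_WORLD, status, ierror)
            call MPI_SEND (buf, MSGLEN, MPI_DOUBLE_PRECISION,
     +                     0, 0, MPI_COMM_WORLD, ierror)
         end if
      end do
      t2 = MPI_WTIME ()

c     2000 messages were sent in the timed loop (2 per iteration)
      if (myrank .eq. 0) then
         print *, 'time per message [s]: ', (t2 - t1) / (2 * NITER)
      end if
      call MPI_FINALIZE (ierror)
      end

With T = t2 - t1 this is exactly the T / 2000 of the slide above.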
Implicit assumption: the time for a message in an application code will be similar to the benchmark result.
Why is this usually not the case?
1. The receiver in the ping-pong is always ready to receive
⇒ receive in 'solicited message' mode
⇒ delays or intermediate copies can be avoided
2. Only two processes are active
⇒ no contention on the interconnect system, non-scaling effects are not visible (e.g. locks on global data structures)
3. Untypical response to different progress concepts:
Single-threaded MPI implementation: the application process calls MPI_Recv, and the MPI library checks for and services outstanding requests only from within such calls.
Multi-threaded MPI implementation: the application thread calls MPI_Recv, while a separate communication thread in the MPI process constantly checks for and services communication requests.
Single-threaded MPI: progress at the receiver only when an MPI routine is called
⇒ bad sender/receiver synchronization ⇒ bad progress
Multi-threaded MPI: progress at the receiver is independent of the application thread
⇒ communication progress independent of sender/receiver synchronization
This advantage is not seen in the ping-pong benchmark! In a comparison of ping-pong latency on a Myrinet PC cluster, the single-threaded version is always better.
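The progress difference can be made visible with a 'late completion' test instead of the plain ping-pong. The following is only a hedged sketch (message length, delay and the busy-wait loop are illustrative): the receiver posts an MPI_Irecv, stays outside the MPI library for half a second, and only then calls MPI_Wait. With a communication thread the transfer overlaps the delay and the wait is short; with single-threaded progress the data only moves inside MPI_Wait.

      program progress
      implicit none
      include 'mpif.h'
      integer MSGLEN
      parameter (MSGLEN = 1000000)
      real*8 buf(MSGLEN), t0, t1, tw, delay
      integer myrank, ierror, request
      integer status(MPI_STATUS_SIZE)

      delay = 0.5d0
      call MPI_INIT (ierror)
      call MPI_COMM_RANK (MPI_COMM_WORLD, myrank, ierror)
      call MPI_BARRIER (MPI_COMM_WORLD, ierror)

      if (myrank .eq. 0) then
c        sender: one long message
         call MPI_SEND (buf, MSGLEN, MPI_DOUBLE_PRECISION,
     +                  1, 0, MPI_COMM_WORLD, ierror)
      else if (myrank .eq. 1) then
c        receiver: post the receive, then stay outside MPI
c        for 'delay' seconds before completing it
         call MPI_IRECV (buf, MSGLEN, MPI_DOUBLE_PRECISION,
     +                   0, 0, MPI_COMM_WORLD, request, ierror)
         t0 = MPI_WTIME ()
   10    if (MPI_WTIME () - t0 .lt. delay) goto 10
         t1 = MPI_WTIME ()
         call MPI_WAIT (request, status, ierror)
         tw = MPI_WTIME () - t1
c        small tw: the transfer overlapped the delay (good progress)
c        large tw: the data only moved inside MPI_WAIT
         print *, 'time spent in MPI_WAIT [s]: ', tw
      end if
      call MPI_FINALIZE (ierror)
      end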
4. Data may be cached between loop iterations:
The same buffer is passed to MPI_Send and MPI_Recv in every iteration, so for the MPI process active in the ping-pong the message data stays in the cache
⇒ much higher bandwidth than in real applications
(Comparison of ping-pong bandwidth on a Myrinet PC cluster: cached versus non-cached data)
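A common countermeasure, shown here only as a hedged fragment that would replace the buffer and loop body of the ping-pong sketch above (NBUF and MSGLEN are illustrative sizes), is to rotate through a memory pool much larger than the cache, so that every iteration touches cold data:

      integer NITER, MSGLEN, NBUF
      parameter (NITER = 1000, MSGLEN = 16384, NBUF = 64)
      real*8 pool(MSGLEN*NBUF)
      integer off
      ...
      do i = 1, NITER
c        each iteration uses a different slice of the pool, so the
c        message data is (almost) never found in the cache
         off = mod(i-1, NBUF) * MSGLEN + 1
         if (myrank .eq. 0) then
            call MPI_SEND (pool(off), MSGLEN, MPI_DOUBLE_PRECISION,
     +                     1, 0, MPI_COMM_WORLD, ierror)
            call MPI_RECV (pool(off), MSGLEN, MPI_DOUBLE_PRECISION,
     +                     1, 0, MPI_COMM_WORLD, status, ierror)
         else if (myrank .eq. 1) then
            call MPI_RECV (pool(off), MSGLEN, MPI_DOUBLE_PRECISION,
     +                     0, 0, MPI_COMM_WORLD, status, ierror)
            call MPI_SEND (pool(off), MSGLEN, MPI_DOUBLE_PRECISION,
     +                     0, 0, MPI_COMM_WORLD, ierror)
         end if
      end do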
The caching effect can be even worse on CC-NUMA architectures (e.g. SGI Origin 2000):
Operation to be timed: call shmem_get8 (b(1), a(1), n, myrank-1)
Only the first transfer fetches a(1), ..., a(n) from the memory of processor myrank-1; all later transfers are served from the cache of processor myrank.
Consequence: all but the first transfer are VERY fast!
The problem of parametrization
Goal: instead of displaying a performance graph, define a few parameters that characterize the communication behavior.
Most popular parameters:
· Latency: time for a 0 byte message
· Bandwidth: asymptotic throughput for long messages
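The model implicit in these two parameters is a straight-line fit of the message time over the message size,

      t(n) ≈ t_lat + n / B_asym ,

with t_lat the latency and B_asym the asymptotic bandwidth. As a purely illustrative example (the numbers are not from the talk): with t_lat = 10 µsec and B_asym = 100 MB/s, a 1 Kbyte message is predicted to take about 10 µsec + 1024 bytes / (100 MB/s) ≈ 20 µsec.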
Problem: this model assumes a very simple communication protocol; it is not consistent with most MPI implementations.
Example: MPI on the NEC Cenju-4 (MPP system, R10000 processors)
Red fitting: latency = 24 µsec; blue fitting: latency = 8.5 µsec
Something seems to change at message size = 1024 bytes.
It can be even worse: discontinuities in the timing curve.
Example: NEC SX-4, send / receive within a 32-processor node
In most modern MPI implementations there are three different protocols, depending on the message length:
· Short messages: the sender writes the message envelope + data into pre-defined slots at the receiver; the receiver later makes a local copy of the message data.
· 'Eager' messages: the sender sends the message envelope and data, which are buffered in user memory at the receiver until they can be copied into the receive buffer.
· 'Long' messages: only the message envelope is sent first; sender and receiver synchronize, and the data is moved from the send buffer to the receive buffer when the receiver is ready.
Good reasons for the protocol changes:
Short messages:
· Copying does not cost much
· Important not to block the sender for too long
· Use pre-allocated slots for the intermediate copy ⇒ avoid the time to allocate memory
'Eager' messages:
· Copying does not cost much
· Important not to block the sender for too long
· Allocate the intermediate buffer on the fly
Long messages:
· Copying is expensive, avoid intermediate copies
· Rendezvous between sender and receiver
· Synchronous data transfer ⇒ optimal throughput
In the SX-5 implementation: protocol changes at 1024 and 100,000 bytes
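A simple way to locate such switch points experimentally is to run the ping-pong for a geometric sequence of message sizes and look for kinks or jumps in the resulting table. A hedged fragment, assuming the timed loop of the earlier ping-pong sketch has been moved into a hypothetical subroutine PINGPONG(n, niter, tmsg) that returns the time per message for n real*8 elements:

      integer k, n
      real*8 tmsg
      ...
c     message sizes 8 bytes ... 1 Mbyte, doubling each time
      do k = 0, 17
         n = 2 ** k
         call PINGPONG (n, 1000, tmsg)
         if (myrank .eq. 0) then
c           bytes, time per message, effective bandwidth
            print *, 8 * n, tmsg, (8.0d0 * n) / tmsg
         end if
      end do

In the SX-5 implementation mentioned above one would then expect visible changes near 1024 and 100,000 bytes.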
Protocol changes are dangerous for curve fittings!
Famous example: the COMMS1 - COMMS3 benchmarks
Procurement at NEC: a customer requested a certain message latency, to be measured with COMMS1.
Problem: the simplistic curve fitting in COMMS1 ⇒ the SX-4 latency came out too high.
Solution: we made our MPI SLOWER for long messages ⇒ tilt of the fitting line ⇒ much better latency ⇒ a happy customer!
The problem of averaging
Problem: it is difficult to measure the time for a single MPI operation:
· no global clock
· clock resolution too low
· difficult to differentiate receive time from synchronization delay
Solution: averaging over many iterations.
Implicit assumption: all iterations take the same time ⇒ averaging increases accuracy.
This is not always true!
Example: the remote get operation shmem_get8 on the Cray T3E
Experiment with a message size of 10 Kbytes:
· 100 measurement runs (tests)
· In each test, ITER calls to shmem_get8 in a loop
· For each test, the time is divided by ITER ⇒ time for a single operation
The graph on the next slide shows the results for the 100 tests.
The reported results are courtesy of Prof. Glenn Luecke, Iowa State University, Ames, Iowa, U.S.A.
There are 7 huge spikes in the graph. Probable reason: operating system interruptions. Should those tests be included in the average time?
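Whatever the answer is, the benchmark should at least make such outliers visible instead of folding them silently into a single mean. A hedged bookkeeping fragment (variable names are illustrative): store the per-operation time of every test and report minimum, mean and maximum, so that the seven spikes show up immediately.

      integer NTESTS
      parameter (NTESTS = 100)
      real*8 ttest(NTESTS), tmin, tmax, tmean
      integer k
      ...
c     ttest(k) holds the per-operation time of test k
c     (loop time divided by ITER, as in the experiment above)
      tmin = ttest(1)
      tmax = ttest(1)
      tmean = 0.0d0
      do k = 1, NTESTS
         tmin = min(tmin, ttest(k))
         tmax = max(tmax, ttest(k))
         tmean = tmean + ttest(k)
      end do
      tmean = tmean / NTESTS
      print *, 'min / mean / max per operation: ', tmin, tmean, tmax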
A more subtle example: MPI_Reduce
Test setup:
· Root process plus six child processes
· Reduction on an 8 byte real
· Time is measured for 1000 iterations

      Real*8 a, b, t1, t2
      ...
      t1 = MPI_WTIME ()
      do i = 1, 1000
         call MPI_REDUCE (a, b, 1, MPI_DOUBLE_PRECISION,
     +                    MPI_MAX, 0, MPI_COMM_WORLD, IERROR)
      end do
      t2 = MPI_WTIME ()

Let's compare the behavior of two MPI implementations:
First implementation:
· Asynchronous algorithm: a tree with the root as receiver and the other processes as senders
· Senders are not blocked, independent of the receiver status
· Low synchronization delay ⇒ relatively good algorithm
Second implementation:
· Step 1: synchronize all processes
· Step 2: perform the reduction as in the first algorithm
· Synchronous algorithm: senders are blocked until the last process is ready
· High synchronization delay ⇒ relatively bad algorithm
What happens at runtime with the first MPI implementation?
The leaf processes are fastest and are not blocked in a single operation
⇒ they quickly go through the loop
⇒ messages pile up at the intermediate processes
⇒ the growing message queues slow those processes down
⇒ at some point flow control takes over, but progress is slow.
This does not happen with the second (bad) MPI implementation! The unnecessary synchronization keeps messages from piling up.
Important to remember:
· We want to measure the time for ONE reduction operation.
· We only use loop averaging to increase the accuracy.
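One possible way to keep the iterations from overlapping, sketched here only as a hedged variation of the timing loop above (the barrier stays outside the timed region but introduces its own distortions; myrank is the process rank as usual), is to separate the iterations with a barrier and to time each reduction individually:

      real*8 a, b, t1, tred, tmax
      integer i, ierror
      ...
      tred = 0.0d0
      do i = 1, 1000
c        the barrier keeps the fast leaf processes from racing
c        ahead and piling up messages at intermediate processes
         call MPI_BARRIER (MPI_COMM_WORLD, ierror)
         t1 = MPI_WTIME ()
         call MPI_REDUCE (a, b, 1, MPI_DOUBLE_PRECISION,
     +                    MPI_MAX, 0, MPI_COMM_WORLD, ierror)
         tred = tred + (MPI_WTIME () - t1)
      end do
c     report the slowest process; for a collective operation this
c     is usually more honest than the root's own time
      call MPI_REDUCE (tred, tmax, 1, MPI_DOUBLE_PRECISION,
     +                 MPI_MAX, 0, MPI_COMM_WORLD, ierror)
      if (myrank .eq. 0) then
         print *, 'time per reduction [s]: ', tmax / 1000
      end if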
Compound benchmarks
Benchmark driver:
· user interface
· calls the individual components (MPI_Send / MPI_Recv, MPI_Bsend / MPI_Recv, MPI_Isend / MPI_Recv, MPI_Reduce, MPI_Bcast, ...)
· reports timings / statistics
· computes effective benchmark parameters
The whole benchmark suite is executed in a single run ⇒ elegant and flexible to use.
Can there be any problems? ... Guess what!
What if the set of active processes changes between the benchmark components (MPI_Send / MPI_Recv, MPI_Bsend / MPI_Recv, MPI_Isend / MPI_Recv, MPI_Reduce, MPI_Bcast)?
In a parallel program it is a good idea not to print results in every process (or the ordering is up to chance).
Alternative: send messages to process 0 and do all the printing there.
We have seen a public benchmark suite in which:
· In every phase, every process sends a message to process 0: either the timing results for this phase, or the message 'I am not active in this phase'
· At the same time, process 0 is active in the ping-pong with process 1
Result: process 0 gets bombarded with 'unsolicited messages' during the ping-pong.
Good MPI implementation: handles incoming communication requests as early as possible ⇒ good progress
Bad MPI implementation: handles 'unsolicited messages' only if nothing else is to be done ⇒ bad progress
In our case: the bad MPI implementation wins the ping-pong benchmark!
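One way to keep such reporting traffic out of the timed phases, sketched here under the assumption that every process holds its per-phase result (or a dummy value, if it was inactive) in a local variable, and that myrank and nprocs come from MPI_COMM_RANK and MPI_COMM_SIZE as usual, is to synchronize after the phase and then collect the results with a single collective call instead of unsolicited point-to-point messages:

      integer MAXPROC
      parameter (MAXPROC = 128)
      real*8 myresult, allresults(MAXPROC)
      integer ierror, k
      ...
c     the timed phase is over for everybody before reporting starts
      call MPI_BARRIER (MPI_COMM_WORLD, ierror)
      call MPI_GATHER (myresult, 1, MPI_DOUBLE_PRECISION,
     +                 allresults, 1, MPI_DOUBLE_PRECISION,
     +                 0, MPI_COMM_WORLD, ierror)
      if (myrank .eq. 0) then
c        all printing happens here, after the measurement
         print *, 'phase results: ', (allresults(k), k = 1, nprocs)
      end if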
Summary: important to remember when writing an MPI benchmark:
Point-to-point performance is very sensitive to
· the relative timing of send / receive
· the protocol (dependent on the message length)
· contention / locks
· cache effects
⇒ Ping-pong results have very limited value.
There is a trade-off between progress and message latency (invisible in the ping-pong benchmark, but it can be important in real applications).
Parameter fittings usually lead to 'bogus' results (they do not take protocol changes into account).
Summary (continued): important to remember when writing an MPI benchmark:
Loop averaging is highly problematic:
· the variance in single-operation performance becomes invisible
· messages from different iterations may lead to congestion
Keep single benchmarks separate:
· different stages in a suite may interfere with each other
· the 'unsolicited message' problem
It should never happen that making MPI slower ⇒ better benchmark results!