Message Passing On Tightly-Interconnected Multi-Core Processors James Psota and Anant Agarwal MIT CSAIL
Technology Scaling Enables Multi-Cores • Multi-cores offer a novel environment for parallel computing [figure: cluster vs. multi-core]
Traditional Communication On Multi-Processors • Interconnects: Ethernet TCP/IP, Myrinet, Scalable Coherent Interconnect (SCI) • Shared Memory: shared caches or memory, Remote DMA (RDMA) [photos: Beowulf cluster; AMD dual-core Opteron]
On-Chip Networks Enable Fast Communication • Some multi-cores offer… • tightly integrated on-chip networks • direct access to hardware resources (no OS layers) • fast interrupts MIT Raw Processor used for experimentation and validation
Parallel Programming is Hard • Must orchestrate computation and communication • Extra resources present both opportunity and challenge • Trivial to deadlock • Constraints on message sizes • No operating system support
rMPI’s Approach Goals • robust, deadlock-free, scalable programming interface • easy to program through high-level routines Challenge • exploit hardware resources for efficient communication • don’t sacrifice performance
Outline • Introduction • Background • Design • Results • Related Work
The Raw Multi-Core Processor • 16 identical tiles (processing core, network routers) • 4 register-mapped on-chip networks • Direct access to hardware resources • Hardware fabricated in ASIC process [photo: Raw Processor]
Raw’s General Dynamic Network • Handles run-time events • interrupts, dynamic messages • Network guarantees atomic, in-order messages • Dimension-ordered wormhole routed • Maximum message length: 31 words • Blocking sends/receives • Minimal network buffering
MPI: Portable Message Passing API • Gives programmers high-level abstractions for parallel programming • send/receive, scatter/gather, reductions, etc. • MPI is a standard, not an implementation • many implementations for many HW platforms • over 200 API functions • MPI applications portable across MPI-compliant systems • Can impose high overhead
MPI Semantics: Cooperative Communication • Data exchanged cooperatively via explicit send and receive • Receiving process’s memory only modified with its explicit participation • Combines communication and synchronization [figure: process 0 and process 1, each with a private address space, exchanging tagged messages (send(dest=1, tag=17/42), recv(src=0, tag=17/42)) over a communication channel, with interrupt-driven delivery into a temp buffer]
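For reference, a minimal MPI program in C that exercises the cooperative send/receive semantics sketched above; the ranks, tag value, and payload are illustrative only.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        data = 17;
        /* Sender names the destination and a tag; the receiver's
           memory is only written once it posts a matching receive. */
        MPI_Send(&data, 1, MPI_INT, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 42, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d (tag 42)\n", data);
    }

    MPI_Finalize();
    return 0;
}
```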
Outline • Introduction • Background • Design • Results • Related Work
High-Level MPI Layer • Argument checking (MPI semantics) • Buffer prep • Calls appropriate low level functions • LAM/MPI partially ported
Collective Communications Layer • Algorithms for collective operations • Broadcast • Scatter/Gather • Reduce • Invokes low level functions
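As a sketch of how a collective such as broadcast can be layered on the point-to-point routines below it, here is an illustrative linear version; the `rmpi_send`/`rmpi_recv` names are hypothetical stand-ins rather than rMPI's actual internal API, and real collective algorithms are typically tree-based for scalability.

```c
/* Hypothetical point-to-point primitives standing in for rMPI's
   low-level layer (names are assumptions, not rMPI's real API). */
void rmpi_send(const void *buf, int count, int dest);
void rmpi_recv(void *buf, int count, int src);

/* Illustrative linear broadcast built on point-to-point routines:
   the root sends the buffer to every other rank. */
void bcast_linear(void *buf, int count, int root, int rank, int nprocs)
{
    if (rank == root) {
        for (int p = 0; p < nprocs; p++)
            if (p != root)
                rmpi_send(buf, count, p);
    } else {
        rmpi_recv(buf, count, root);
    }
}
```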
Point-to-Point Layer • Low-level send/receive routines • Highly optimized interrupt-driven receive design • Packetization and reassembly
Outline • Introduction • Background • Design • Results • Related Work
rMPI Evaluation • How much overhead does high-level interface impose? • compare against hand-coded GDN • Does it scale? • with problem size and number of processors? • compare against hand-coded GDN • compare against commercial MPI implementation on cluster
End-to-End Latency Overhead vs. Hand-Coded (1) • Experiment measures latency for: • sender: load message from memory • sender: break up and send message • receiver: receive message • receiver: store message to memory
End-to-End Latency Overhead vs. Hand-Coded (2) [chart annotations: 1 word: 481% overhead (packet management complexity); 1000 words: 33% overhead (message overflows cache)]
Performance Scaling: Jacobi [speedup plots for a 16x16 input matrix and a 2048 x 2048 input matrix]
Performance Scaling: Jacobi, 16 processors [plot annotations: sequential version; cache capacity overflow]
Overhead: Jacobi, rMPI vs. Hand-Coded [chart annotations: many small messages; memory access synchronization; 16 tiles: 5% overhead]
Matrix Multiplication: rMPI vs. LAM/MPI [chart annotation: many smaller messages; smaller message length has less effect on LAM]
Related Work • Low-latency communication networks • iWarp, Alewife, INMOS • Multi-core processors • VIRAM, Wavescalar, TRIPS, POWER 4, Pentium D • Alternatives to programming Raw • scalar operand network, CFlow, rawcc • MPI implementations • OpenMPI, LAM/MPI, MPICH
Summary • rMPI provides easy yet powerful programming model for multi-cores • Scales better than commercial MPI implementation • Low overhead over hand-coded applications
Thanks! For more information, see Master’s Thesis: http://cag.lcs.mit.edu/~jim/publications/ms.pdf
rMPI messages broken into packets • GDN messages have a max length of 31 words • rMPI packet format shown for a 65-word [payload] MPI message • Receiver buffers and demultiplexes packets from different sources • Messages received upon interrupt, and buffered until user-level receive [figure: two rMPI sender processes and an rMPI receiver process; numbered packets (1, 2, 3) from different senders interleave at the receiver, each arrival raising an interrupt]
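A hedged sketch of sender-side packetization under the 31-word GDN limit described above; the two-word header layout and the `gdn_send_word()` intrinsic are assumptions for illustration, not Raw's or rMPI's actual interface.

```c
#define GDN_MAX_MSG_WORDS 31   /* hardware limit from the slide above     */
#define PKT_HEADER_WORDS   2   /* assumed header size (src/tag, seq/len)  */
#define PKT_MAX_PAYLOAD   (GDN_MAX_MSG_WORDS - PKT_HEADER_WORDS)

/* Hypothetical intrinsic that pushes one word onto the GDN output port. */
void gdn_send_word(unsigned word);

/* Break an MPI-level message into GDN-sized packets. With a 29-word
   payload per packet, a 65-word message goes out as 29 + 29 + 7 words. */
void send_packetized(const unsigned *payload, int nwords,
                     unsigned dest, unsigned tag)
{
    int seq = 0;
    while (nwords > 0) {
        int chunk = nwords < PKT_MAX_PAYLOAD ? nwords : PKT_MAX_PAYLOAD;
        gdn_send_word((dest << 16) | tag);    /* header word 0 (assumed layout) */
        gdn_send_word((seq  << 16) | chunk);  /* header word 1 (assumed layout) */
        for (int i = 0; i < chunk; i++)
            gdn_send_word(*payload++);
        nwords -= chunk;
        seq++;
    }
}
```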
rMPI: enabling MPI programs on Raw rMPI… • is compatible with current MPI software • gives programmers already familiar with MPI an easy interface to program Raw • gives programmers fine-grain control over their programs when automatic parallelization tools are not adequate • gives users a robust, deadlock-free, and high-performance programming model with which to program Raw ► easily write programs on Raw without overly sacrificing performance
Packet boundary bookkeeping • Receiver must handle packet interleaving across multiple interrupt handler invocations
Receive-side packet management • Global data structures accessed by interrupt handler and MPI Receive threads • Data structure design minimizes pointer chasing for fast lookups • No memcpy for receive-before-send case
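A sketch of the kind of shared bookkeeping this design implies; the structure and field names are assumptions for illustration, not rMPI's actual data structures.

```c
#define MAX_TILES 16   /* Raw has 16 tiles */

/* Assumed receive-side bookkeeping shared by the interrupt handler
   (producer) and MPI receive calls (consumer). */
typedef struct pending_msg {
    unsigned            tag;
    unsigned            words_expected;
    unsigned            words_received;    /* grows across packets */
    unsigned           *buffer;            /* points at the user buffer when
                                              the receive was posted first,
                                              avoiding a memcpy           */
    struct pending_msg *next;
} pending_msg_t;

typedef struct {
    pending_msg_t *head[MAX_TILES];        /* one list per source tile keeps
                                              lookups short (little pointer
                                              chasing)                     */
} recv_state_t;
```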
Interrupt handler CFG • logic supports MPI semantics and packet construction
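A hedged outline of what that control flow might look like, expressed over the structures sketched above; all helper names here are hypothetical.

```c
/* Hypothetical GDN access and bookkeeping helpers (assumptions);
   uses pending_msg_t / recv_state_t from the sketch above. */
int            gdn_words_available(void);
unsigned       gdn_recv_word(void);
pending_msg_t *lookup_or_create(recv_state_t *st, unsigned header0);
void           mark_complete(pending_msg_t *m);  /* unblocks a posted MPI_Recv */

/* Drain the network input, append each packet's payload to the right
   in-flight message, and mark a message complete once all expected
   words have arrived. */
void gdn_interrupt_handler(recv_state_t *st)
{
    while (gdn_words_available() > 0) {
        unsigned h0 = gdn_recv_word();           /* src/tag  (assumed) */
        unsigned h1 = gdn_recv_word();           /* seq/len  (assumed) */
        pending_msg_t *m = lookup_or_create(st, h0);
        unsigned len = h1 & 0xffff;
        for (unsigned i = 0; i < len; i++)
            m->buffer[m->words_received++] = gdn_recv_word();
        if (m->words_received == m->words_expected)
            mark_complete(m);
    }
}
```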
Future work: improving performance • Comparison of rMPI to standard cluster running off-the-shelf MPI library • Improve system performance • further minimize MPI overhead • spatially-aware collective communication algorithms • further Raw-specific optimizations • Investigate new APIs better suited for TPAs
Future work: HW extensions • Simple hardware tweaks may significantly improve performance • larger input/output FIFOs • simple switch logic/demultiplexing to handle packetization could drastically simplify software logic • larger header words (64 bit?) would allow for much larger (atomic) packets • (also, current header only scales to 32 x 32 tile fabrics)
Conclusions • MPI standard was designed for “standard” parallel machines, not for tiled architectures • MPI may no longer make sense for tiled designs • Simple hardware could significantly reduce packet management overhead and increase rMPI performance