An MPI Approach to High-Performance Computing with FPGAs Chris Madill Molecular Structure and Function, Hospital for Sick Children Department of Biochemistry, University of Toronto Supervised by Dr. Paul Chow Electrical and Computer Engineering, University of Toronto SHARCNET Symposium on GPU and CELL Computing 2008
Introduction • Many scientific applications can be accelerated by targeting parallel machines • Coarse-grained parallelization allows applications to be distributed across hundreds or thousands of nodes • FPGAs can accelerate many computing tasks by two to three orders of magnitude over a CPU • This work demonstrates a method for combining high-performance computer clusters with FPGAs for maximum computational power
Popular HPC Configurations [Diagram: cluster configurations, each with nodes and local memory attached to an interconnection network: CPU-only nodes; CPU nodes paired with application-specific processors (ASPs, e.g. GPUs or FPGAs); and nodes combining CPUs and FPGAs.]
How Do You Program This? • FPGAs can speed up applications, however... • High barrier of entry for designing digital hardware • Developing monolithic FPGA designs is very daunting • How does one easily take advantage of FPGAs for accelerating HPC applications?
TMD Task Supercomputer • The Toronto Molecular Dynamics (TMD) machine is an investigation into high-performance computing based on a scalable network of FPGAs • Applications are defined as a simple collection of computing tasks • A task is roughly equivalent to a software process/thread • A task may run as a processor on a CPU node, as an embedded microprocessor, or as a computing engine on an FPGA • A major focus is facilitating the transition from cluster-based applications to the TMD machine
Application Design Flow • Step 1: Application Prototyping • Software prototype of application developed • Profiling identifies compute-intensive routines • Step 2: Application Refinement • Partitioned into tasks communicating using MPI • Communication patterns analyzed to determine network topology • Step 3: TMD Prototyping • Tasks are ported to soft processors on the TMD • On-chip communication network verified • Step 4: TMD Optimization • Intensive tasks replaced with hardware engines • The Message Passing Engine (MPE) handles communication for hardware engines • Hardware engines can easily be moved and replicated [Diagram: processes A, B, and C communicating via MPI, first as an application prototype on a CPU cluster, then as tasks on the FPGA network via TMD-MPI.]
Communication • Uses an essential subset of the MPI standard • Software library for tasks running on processors • Hardware Message Passing Engine (MPE) for hardware-based tasks • Tasks do not know (or care) whether remote tasks run as software processes or hardware engines • MPI isolation of tasks facilitates C-to-gates compilers
Xilinx ACP • The Xilinx Advanced Computing Platform (ACP) consists of FPGA modules that plug directly into a CPU socket • Direct access to the front-side bus (FSB) • CPU and FPGA are both peers in the system • Equal-priority access to main memory
Xilinx ACP • CPU does not have to orchestrate the activity of the FPGA • CPU does not have to relay data to and from FPGAs • FPGA is not on a slow connection to the CPU • All tasks can run independently
Tasks in MD • The total potential energy is the sum of bonded and non-bonded terms, from which forces are derived: U = U_bonds + U_angles + U_dihedrals + U_Lennard-Jones + U_electrostatic, with F = -∇U
Final MD Target [Diagram: a quad-core CPU with main memory on the front-side bus (FSB), connected to three Xilinx ACP modules; each module carries communication FPGAs and user FPGAs hosting non-bonded engines (NBE 1-8) and Ewald engines.]
Conclusion • Target system is a combination of software running on CPUs and FPGA hardware accelerators • Key to performance is identifying hotspots and adding corresponding hardware acceleration • The hardware engineer must focus on only a small part of the overall application • MPI facilitates hardware/software isolation and collaboration
SOCRN Acknowledgements • TMD Group: Prof. Paul Chow, Prof. Régis Pomès 1,2, Danny Gupta, Alireza Heiderbarghi, Alex Kaganov, Daniel Ly, Chris Madill 1,2, Daniel Nunes, Emanuel Ramalho, David Woods • Past Members: David Chui, Christopher Comis, Sam Lee, Daniel Ly, Lesley Shannon, Mike Yan • Arches Computing: Arun Patel, Manuel Saldaña • 1: Molecular Structure and Function, The Hospital for Sick Children • 2: Department of Biochemistry, University of Toronto
TMD-MPI Implementation • Layer 4 (MPI Application Interface): all MPI functions implemented in TMD-MPI that are available to the application • Layer 3 (Collective Operations): barrier synchronization, data gathering, and message broadcasts • Layer 2 (Communication Primitives): point-to-point MPI_Send and MPI_Recv methods used to transmit data between processes • Layer 1 (Hardware Interface): low-level methods to communicate with FSLs for both on- and off-chip communication
Intra-FPGA Communication • Communication links are based on Fast Simplex Links (FSL) • Unidirectional Point-to-Point FIFO • Provides buffering and flow-control • Can be used to isolate different clock domains • FSLs simplify component interconnects • Standardized interface, used by both hardware engines and processors • Can assemble system modules rapidly • Application-specific network topologies can be defined
Inter-FPGA Communication • Inter-FPGA communication uses abstracted communication links • Communication is independent of physical link • Single serial transceivers (FSL-over-Aurora) • Bonded serial transceivers (FSL-over-XAUI) • Parallel Busses (FSL-over-Wires) • FSL-over-10GbE coming soon…