A High-Speed Inter-Process Communication Architecture for FPGA-based Hardware Acceleration of Molecular Dynamics

Presented by: Chris Comis • September 23, 2005 • Supervisor: Professor Paul Chow

Outline • Motivation • System-Level Overview • Protocol Development • Results • Integration into a Programming Model • Conclusions/Questions

What is Molecular Dynamics? • A method of calculating the time-evolution of molecular configurations • Useful in the analysis of protein folding • Many applications in rational drug design

MD is Computationally Challenging • Forces (i.e. F=ma) are calculated between an atom and all other atoms in the system • An O(n²) problem across 10,000+ atoms • Force calculations are performed at femtosecond timesteps • Interesting results may take several μs of simulation (10⁹+ timesteps required) • MD simulations are typically run on supercomputers
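The scale of the problem can be estimated with a few lines of C. This is a rough sketch that assumes a naive all-pairs method with no cutoff, a 1 fs timestep and 1 μs of simulated time; the function names are illustrative only.

```c
#include <stdint.h>

/* Unique atom pairs n(n-1)/2: by Newton's third law (F_ij = -F_ji)
 * each pair needs to be evaluated only once per timestep. */
uint64_t pair_count(uint64_t n_atoms) {
    return n_atoms * (n_atoms - 1) / 2;
}

/* Timesteps needed to cover sim_time_fs femtoseconds of simulated
 * time at a step of dt_fs femtoseconds (ASSUMPTION: dt_fs = 1). */
uint64_t timestep_count(uint64_t sim_time_fs, uint64_t dt_fs) {
    return sim_time_fs / dt_fs;
}

/* Total pairwise force evaluations for a naive all-pairs simulation. */
double total_pair_evaluations(uint64_t n_atoms, uint64_t sim_time_fs,
                              uint64_t dt_fs) {
    return (double)pair_count(n_atoms) *
           (double)timestep_count(sim_time_fs, dt_fs);
}
```

For 10,000 atoms this gives 49,995,000 pairs per timestep and 10⁹ timesteps per simulated microsecond (1 μs = 10⁹ fs), or roughly 5 × 10¹⁶ pairwise force evaluations in total.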

An FPGA-based MD Accelerator • An ongoing collaborative project involves the development of an FPGA-based MD Accelerator • Advantages to an FPGA-based approach: • Massive parallel computation • Forces can be parallelized • Force computations can be accelerated ~88x • High-speed Serial I/O (SERDES) may be leveraged

Area of Focus • Develop communication protocol using high-speed SERDES links • Requirements: • Reliability • Light-weight • Minimal trip-time for small packets • Must be abstracted at the hardware and software levels

A Partial MD Simulator • Computation blocks can be implemented in hardware or as software executed on MicroBlaze soft processors • Software must be written using a programming model [Figure legend: Blocks → computation; Arrows → communication]

System-Level Overview • The MD simulator is simplified to a Producer/Consumer model • The model is then adapted for SERDES development • Producer and consumer hardware blocks are implemented • An FSL (FIFO) is used as an abstracted method of data transport with SERDES logic • An OPB bus interface is added for register access of components • Deep FIFOs are added for logging high-speed data

Protocol Overview • A synchronous acknowledgement-based protocol was chosen • Simple and predictable • An inherent delay in waiting for acknowledgements • To mask this delay: • Multiple producers are connected to the SERDES interface • The link is time-multiplexed across multiple producers
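Why time-multiplexing masks the acknowledgement delay can be shown with an idealized model. This is an assumption, not the measured system: each producer transmits a packet for t_tx, then waits t_wait for its acknowledgement, and the link time used by the acknowledgements themselves is ignored. The function names are hypothetical.

```c
/* Idealized link utilization for k producers sharing one serial link
 * round-robin, each using stop-and-wait: send one packet (t_tx), then
 * wait t_wait for the acknowledgement before sending the next. */
double link_utilization(int k, double t_tx, double t_wait) {
    double period = t_tx + t_wait;   /* one producer's send-and-wait cycle */
    double demand = k * t_tx;        /* link time requested per cycle */
    return demand >= period ? 1.0 : demand / period;
}

/* Smallest number of producers that keeps the link continuously busy.
 * Requires t_tx > 0. */
int producers_to_saturate(double t_tx, double t_wait) {
    int k = 1;
    while (k * t_tx < t_tx + t_wait)
        k++;
    return k;
}
```

With t_wait twice t_tx, a single producer keeps the link busy only a third of the time, while three producers fill the idle gaps completely.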

Protocol Overview • All data has a word width of 4 bytes • Data packets: • Variable size (between 32 and 2016 bytes) • A 32-bit CRC is appended • Acknowledgements: • 8 bytes in size • Can interrupt transmission of data packets
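A packet in this format can be sketched as a word-aligned payload followed by its CRC. The slides do not name the CRC polynomial, its byte order, or whether the 32 to 2016 byte range includes the CRC; this sketch assumes the IEEE 802.3 CRC-32, most-significant-byte-first order, and a payload range that excludes the CRC. `build_frame` and `frame_crc_ok` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORD_BYTES  4
#define MIN_PAYLOAD 32
#define MAX_PAYLOAD 2016
#define CRC_BYTES   4

/* Bitwise CRC-32, IEEE 802.3 polynomial in reflected form (0xEDB88320).
 * ASSUMPTION: the slides do not say which CRC-32 variant is used. */
uint32_t crc32_ieee(const uint8_t *buf, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Copy the payload into out[] and append its CRC, most significant byte
 * first (ASSUMPTION). Returns the frame length, or -1 if the payload is
 * not a whole number of 4-byte words within 32..2016 bytes. */
int build_frame(const uint8_t *payload, size_t len, uint8_t *out, size_t cap) {
    if (len < MIN_PAYLOAD || len > MAX_PAYLOAD || len % WORD_BYTES != 0)
        return -1;
    if (cap < len + CRC_BYTES)
        return -1;
    memcpy(out, payload, len);
    uint32_t crc = crc32_ieee(payload, len);
    for (int i = 0; i < CRC_BYTES; i++)
        out[len + i] = (uint8_t)(crc >> (24 - 8 * i));
    return (int)(len + CRC_BYTES);
}

/* Receiver-side check: recompute the CRC over the payload and compare
 * it with the appended value. Returns 1 if they match. */
int frame_crc_ok(const uint8_t *frame, size_t frame_len) {
    if (frame_len < CRC_BYTES)
        return 0;
    size_t len = frame_len - CRC_BYTES;
    uint32_t rx = 0;
    for (int i = 0; i < CRC_BYTES; i++)
        rx = (rx << 8) | frame[len + i];
    return rx == crc32_ieee(frame, len);
}
```

A single flipped bit anywhere in the payload makes `frame_crc_ok` fail, which is the condition the receiver reports as a data (CRC) error.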

Transmit Logic • Transmitter consists mainly of two components: • Dual-port buffers: • The start address of the packet is kept in case a resend is necessary • Scheduler: • Schedules ready packets in a round-robin fashion [Figure: From Producer via FSL → To Scheduler of SERDES Link]
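The transmitter's bookkeeping can be modelled in software as below. This is a sketch under stated assumptions, not the hardware design: three producers (as in the verification tests), one outstanding packet per producer, and hypothetical names (`tx_scheduler_t`, `schedule_next`, `on_resend`) throughout.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PRODUCERS 3

typedef struct {
    bool ready;          /* a complete packet is waiting in the buffer */
    bool awaiting_ack;   /* packet sent, acknowledgement outstanding */
    uint16_t start_addr; /* packet start in the dual-port buffer, kept for resends */
} tx_buffer_t;

typedef struct {
    tx_buffer_t buf[NUM_PRODUCERS];
    int last;            /* index of the buffer served most recently */
} tx_scheduler_t;

void scheduler_init(tx_scheduler_t *s) {
    for (int i = 0; i < NUM_PRODUCERS; i++) {
        s->buf[i].ready = false;
        s->buf[i].awaiting_ack = false;
        s->buf[i].start_addr = 0;
    }
    s->last = NUM_PRODUCERS - 1;   /* so buffer 0 is considered first */
}

/* A producer has finished writing a packet starting at start_addr. */
void mark_ready(tx_scheduler_t *s, int idx, uint16_t start_addr) {
    s->buf[idx].ready = true;
    s->buf[idx].awaiting_ack = false;
    s->buf[idx].start_addr = start_addr;
}

/* Round-robin: starting after the last served buffer, pick the first one
 * that is ready and not already waiting for an acknowledgement.
 * Returns its index, or -1 if nothing can be sent. */
int schedule_next(tx_scheduler_t *s) {
    for (int i = 1; i <= NUM_PRODUCERS; i++) {
        int idx = (s->last + i) % NUM_PRODUCERS;
        tx_buffer_t *b = &s->buf[idx];
        if (b->ready && !b->awaiting_ack) {
            b->awaiting_ack = true;
            s->last = idx;
            return idx;
        }
    }
    return -1;
}

/* Positive acknowledgement: the buffer can be reused by its producer. */
void on_ack(tx_scheduler_t *s, int idx) {
    s->buf[idx].ready = false;
    s->buf[idx].awaiting_ack = false;
}

/* Resend required: make the packet eligible again and return the
 * retained start address from which it will be reread. */
uint16_t on_resend(tx_scheduler_t *s, int idx) {
    s->buf[idx].awaiting_ack = false;
    return s->buf[idx].start_addr;
}
```

Keeping the start address means a resend needs no copy of the packet: the scheduler simply rereads the same region of the dual-port buffer.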

Receive Logic • Receiver consists mainly of two components: • Dual-port buffers: • The start address of the packet is kept in case errors occur • Three-stage dataflow pipeline: • Stage 1: Determine if incoming data is properly formatted • Stage 2: Evaluate incoming data against all possible errors • Stage 3: Pass results to acknowledgement handler [Figure: From SERDES Link → To Consumer via FSL]
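The three receive stages can be sketched as three functions applied in order to one packet. In hardware the stages work concurrently on successive data; this software model runs them sequentially. The status flags, the order in which errors are checked, and the reduction of the handler's decision to a positive or negative acknowledgement are all assumptions (the slides do not say how a negative outcome is signalled). Error classes follow the Receiver Error Handling slide; hard errors and lost acknowledgements are not modelled.

```c
#include <stdbool.h>

/* Hypothetical per-packet flags reported by the SERDES front end. */
typedef struct {
    bool starts_with_sof;  /* begins with the start-of-packet K-character */
    bool ends_with_eof;    /* ends with the end-of-packet K-character */
    bool soft_error;       /* disparity error or invalid character seen */
    bool crc_ok;           /* recomputed CRC matches the appended one */
    bool buffer_full;      /* no room left in the receive buffer */
} rx_packet_t;

typedef enum { RX_OK, RX_FRAME_ERROR, RX_SOFT_ERROR,
               RX_CRC_ERROR, RX_BUFFER_FULL } rx_status_t;
typedef enum { SEND_ACK, SEND_NACK } ack_action_t;  /* NACK: ASSUMPTION */

/* Stage 1: is the incoming data properly formatted (framed)? */
static bool stage1_format(const rx_packet_t *p) {
    return p->starts_with_sof && p->ends_with_eof;
}

/* Stage 2: evaluate the packet against the remaining error classes. */
static rx_status_t stage2_errors(const rx_packet_t *p, bool framed) {
    if (!framed)        return RX_FRAME_ERROR;
    if (p->soft_error)  return RX_SOFT_ERROR;
    if (!p->crc_ok)     return RX_CRC_ERROR;
    if (p->buffer_full) return RX_BUFFER_FULL;
    return RX_OK;
}

/* Stage 3: pass the result to the acknowledgement handler. */
static ack_action_t stage3_ack(rx_status_t st) {
    return st == RX_OK ? SEND_ACK : SEND_NACK;
}

/* Run all three stages on one packet; optionally report the status. */
ack_action_t receive_packet(const rx_packet_t *p, rx_status_t *status_out) {
    bool framed = stage1_format(p);
    rx_status_t st = stage2_errors(p, framed);
    if (status_out)
        *status_out = st;
    return stage3_ack(st);
}
```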

Design Effort • Majority of design effort was in error handling: • Transmitter: • Determine which packet combinations corrupt the system • Establish a priority among conflicting packet types • Receiver: • Handle all possible combinations of transmission errors

Test Environment • All SERDES tests were performed across Xilinx Virtex-II Pro XC2VP7 and XC2VP30 FPGAs • Ribbon cables were used to transfer serial data between non-impedance-controlled connectors

Reliability and Sustainability • Verification test environment: • Send data concurrently from three producers to three respective consumers • Pseudo-random packet length • Consumers read from FSL at variable rates • Reliability: • Run this test under extremely poor line conditions • Sustainability: • Run this test under normal line conditions for a long period of time
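One common way to produce pseudo-random packet lengths in a test harness is a linear-feedback shift register. The slides do not say how lengths were generated, so this is an illustrative sketch assuming a 16-bit Galois LFSR and the 32 to 2016 byte, 4-byte-aligned range from the protocol overview; `lfsr_next` and `packet_length` are hypothetical names.

```c
#include <stdint.h>

#define MIN_LEN 32
#define MAX_LEN 2016
#define WORD    4
#define NUM_LENGTHS ((MAX_LEN - MIN_LEN) / WORD + 1)   /* 497 legal sizes */

/* 16-bit Galois LFSR with feedback mask 0xB400 (taps 16, 14, 13, 11):
 * maximal length, cycling through all 65,535 non-zero states. */
uint16_t lfsr_next(uint16_t s) {
    uint16_t lsb = s & 1u;
    s >>= 1;
    if (lsb)
        s ^= 0xB400u;
    return s;
}

/* Map an LFSR state to a word-aligned packet length in 32..2016 bytes.
 * The modulo adds a slight bias, which is acceptable for a stress test. */
uint32_t packet_length(uint16_t state) {
    return MIN_LEN + WORD * (uint32_t)(state % NUM_LENGTHS);
}
```

An LFSR is cheap to build in FPGA logic and its sequence is repeatable from a known seed, which helps when a failing test must be reproduced.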

Reliability • 128-second Test Results

Sustainability • 8-hour Test Results

Comparison Against Other Communication Mechanisms • Two configurations are used • Configuration A: Saturate the channel with packets • Configuration B: Loop-back test • Compare against: • Simple FPGA-based 100BaseT Ethernet • TCP/IP FPGA-based 100BaseT Ethernet • TCP/IP Cluster-based Gigabit Ethernet

Throughput Results

One-way Trip Time Results

Area Consumption • Each SERDES Interface takes approximately 8% of a Xilinx XC2VP30 • Debug logic substantially increases area consumption: • FF usage increases 68% • LUT usage increases 43%
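The "approximately 8%" figure can be cross-checked against the 2074 FFs and 2244 LUTs reported on the Conclusions slide. This sketch uses the XC2VP30's 13,696 slices (two LUTs and two flip-flops each) and assumes the debug-logic percentages are increases over those SERDES-only counts; the helper names are illustrative.

```c
/* Virtex-II Pro XC2VP30: 13,696 slices, each with two 4-input LUTs
 * and two flip-flops. */
#define XC2VP30_SLICES 13696
#define XC2VP30_FFS    (2 * XC2VP30_SLICES)   /* 27,392 */
#define XC2VP30_LUTS   (2 * XC2VP30_SLICES)   /* 27,392 */

/* Percentage of a device resource used by a design. */
double percent_used(double used, double available) {
    return 100.0 * used / available;
}

/* Resource count after a relative increase, e.g. +68% -> x1.68.
 * ASSUMPTION: the increase is relative to the SERDES-only count. */
double with_increase(double base, double percent_increase) {
    return base * (1.0 + percent_increase / 100.0);
}
```

This gives about 7.6% of the flip-flops and 8.2% of the LUTs, consistent with "approximately 8%". Under the stated assumption, debug logic raises this to roughly 3,484 FFs (about 12.7%) and 3,209 LUTs (about 11.7%).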

Integration into a Programming Model • Hardware abstraction: FSL • Software abstraction: An MPI-based Programming Model • Modified MPI_Send and MPI_Recv function calls:

    int data_outgoing[64], data_incoming[64];
    MPI_Status status;
    while (1) {
        MPI_Send(data_outgoing, 64, MPI_INT, 0, 0, MPI_COMM_WORLD);
        MPI_Recv(data_incoming, 64, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }
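The slides do not show how the modified calls are implemented. One way a send/receive pair could map onto an FSL is sketched below, assuming a message consists of one header word followed by the payload words. The header layout and the names `hyp_send`, `hyp_recv`, `fsl_put` and `fsl_get` are hypothetical, and the FSL is modelled as a software FIFO so the sketch runs on a host.

```c
#include <stddef.h>
#include <stdint.h>

/* Software stand-in for one FSL channel: a bounded FIFO of 32-bit words.
 * Real FSL writes block when the FIFO is full; this model returns -1. */
#define FSL_DEPTH 512
typedef struct {
    uint32_t data[FSL_DEPTH];
    size_t head, tail, count;
} fsl_fifo_t;

static int fsl_put(fsl_fifo_t *f, uint32_t w) {
    if (f->count == FSL_DEPTH) return -1;          /* FIFO full */
    f->data[f->tail] = w;
    f->tail = (f->tail + 1) % FSL_DEPTH;
    f->count++;
    return 0;
}

static int fsl_get(fsl_fifo_t *f, uint32_t *w) {
    if (f->count == 0) return -1;                  /* FIFO empty */
    *w = f->data[f->head];
    f->head = (f->head + 1) % FSL_DEPTH;
    f->count--;
    return 0;
}

/* Hypothetical header: destination rank in the high 16 bits, message
 * length in 32-bit words in the low 16 bits. */
static uint32_t make_header(unsigned dest, unsigned words) {
    return ((uint32_t)dest << 16) | (words & 0xFFFFu);
}

/* Send `count` integers to rank `dest`: header word, then payload. */
int hyp_send(fsl_fifo_t *f, const int32_t *buf, unsigned count, unsigned dest) {
    if (f->count + count + 1 > FSL_DEPTH) return -1;  /* would not fit */
    fsl_put(f, make_header(dest, count));
    for (unsigned i = 0; i < count; i++)
        fsl_put(f, (uint32_t)buf[i]);
    return 0;
}

/* Receive one message into buf (capacity `count` words). Returns the
 * number of words received, or -1 on error; an oversized message is
 * left in the FIFO. */
int hyp_recv(fsl_fifo_t *f, int32_t *buf, unsigned count) {
    uint32_t hdr;
    if (fsl_get(f, &hdr) != 0) return -1;
    unsigned words = hdr & 0xFFFFu;
    if (words > count) return -1;
    for (unsigned i = 0; i < words; i++) {
        uint32_t w;
        if (fsl_get(f, &w) != 0) return -1;
        buf[i] = (int32_t)w;
    }
    return (int)words;
}
```

Because the FSL hides the SERDES link behind a FIFO, the same send/receive code would not need to change whether the peer sits on the same FPGA or across the serial link.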

Integration into a Programming Model • Replaced producers and consumers with a MicroBlaze processor • Several communication scenarios were tested

Conclusions • Final Results: • Reliable and sustainable • Abstracted at the software and hardware levels • 2074 FFs and 2244 LUTs required for SERDES logic only • Given a channel rate of 2.5 Gbps, maximum bidirectional throughput of 1.928 Gbps • Minimum packet trip time of 1.23 μs
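The throughput figure is easier to interpret once 8B/10B coding is accounted for: every 8 data bits occupy 10 bits on the line, so a 2.5 Gbps channel carries at most 2.0 Gbps of data. The sketch below assumes the 1.928 Gbps figure is per direction; the function names are illustrative.

```c
/* 8B/10B coding carries 8 data bits in every 10 line bits. */
double payload_capacity_gbps(double line_rate_gbps) {
    return line_rate_gbps * 8.0 / 10.0;
}

/* Fraction of the post-8B/10B capacity achieved by a measured rate. */
double link_efficiency(double measured_gbps, double line_rate_gbps) {
    return measured_gbps / payload_capacity_gbps(line_rate_gbps);
}

/* Time to serialize `bytes` of data at the post-8B/10B rate, in ns
 * (bits divided by Gbps gives nanoseconds). */
double serialization_ns(double bytes, double line_rate_gbps) {
    return bytes * 8.0 / payload_capacity_gbps(line_rate_gbps);
}
```

Under that assumption 1.928 Gbps is about 96.4% of the 2.0 Gbps data capacity; the slides do not break down the remaining overhead. A minimum-size 32-byte packet takes 128 ns to serialize, a small fraction of the 1.23 μs minimum trip time.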

Acknowledgements • Professor Régis Pomès, Chris Madill • Professor Paul Chow, Professor C.Y. Chen, Lesley Shannon, Arun Patel, Manuel Saldaña, David Chui, Sam Lee, Andrew House, Nathalie Chan, Lorne Applebaum, Patrick Akl

References • Y. Gu, T. VanCourt, M. C. Herbordt, "FPGA Acceleration of Molecular Dynamics Computations," to appear in Proceedings of Field Programmable Logic and Applications, August 2005.

Transmitter Packet Collision Handling • Packets are enclosed by 8B/10B control characters (K-characters) • The type of packet is distinguished by the K-characters used • Certain combinations of control characters cannot be nested • Clock correction has priority over acknowledgement • Acknowledgement cannot interrupt the end of a data packet • Clock correction must avoid the beginning and end of a data packet
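The collision rules on this slide can be expressed as a small arbitration function. This is a sketch under assumptions: the transmit "phases" are hypothetical, "clock correction has priority" is read as "when both are pending and allowed, clock correction goes first", and an acknowledgement is taken to be allowed everywhere except the end of a data packet, since the slides only forbid that case.

```c
#include <stdbool.h>

/* Hypothetical phases of the outgoing data-packet stream. */
typedef enum {
    PHASE_IDLE,        /* no data packet in progress */
    PHASE_DATA_START,  /* beginning of a data packet (start K-character) */
    PHASE_DATA_BODY,   /* payload words */
    PHASE_DATA_END     /* end of a data packet (end K-character) */
} tx_phase_t;

typedef enum { EMIT_DATA_OR_IDLE, EMIT_CLOCK_CORRECTION, EMIT_ACK } tx_choice_t;

/* Clock correction must avoid the beginning and end of a data packet. */
static bool cc_allowed(tx_phase_t p) {
    return p == PHASE_IDLE || p == PHASE_DATA_BODY;
}

/* An acknowledgement cannot interrupt the end of a data packet. */
static bool ack_allowed(tx_phase_t p) {
    return p != PHASE_DATA_END;
}

/* Choose what to place on the link in the current slot. Clock correction
 * wins over an acknowledgement when both are pending and allowed. */
tx_choice_t arbitrate(tx_phase_t phase, bool cc_pending, bool ack_pending) {
    if (cc_pending && cc_allowed(phase))
        return EMIT_CLOCK_CORRECTION;
    if (ack_pending && ack_allowed(phase))
        return EMIT_ACK;
    return EMIT_DATA_OR_IDLE;
}
```

Keeping the rules in two small predicates makes each slide bullet map to one line, so a forbidden nesting is easy to spot and change.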

Receiver Error Handling • All combinations of errors at the receiver are handled correctly • Data errors (CRC errors) • Disparity errors or invalid characters (soft errors) • Errors in framing (frame errors) • Channel failures (hard errors) • Lost acknowledgements/repeat packets • Receiver buffers full

Test Configuration A • Send data concurrently from three producers to three respective consumers • Producers write to FSL as fast as possible • Consumers read from FSL as fast as possible • Analyze best-case throughput results

Test Configuration B • Send data from a producer to a consumer • Delay a packet write from a producer until a packet has been completely received by the consumer on the same FPGA • The resulting communication loop measures the round-trip time (and therefore the one-way trip time)
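Turning a loop-back measurement into trip times is simple arithmetic. This sketch assumes a free-running cycle counter read at the start and end of many loop iterations, and symmetric forward and return paths so that the one-way time is half the round trip; the names and the example numbers are illustrative, not measurements.

```c
/* Average round-trip time in microseconds: cycles divided by the clock
 * frequency in MHz gives microseconds, spread over all iterations. */
double round_trip_us(unsigned long long cycles, unsigned long iterations,
                     double clock_mhz) {
    return (double)cycles / clock_mhz / (double)iterations;
}

/* One-way trip time, assuming both directions take equally long. */
double one_way_us(unsigned long long cycles, unsigned long iterations,
                  double clock_mhz) {
    return round_trip_us(cycles, iterations, clock_mhz) / 2.0;
}
```

For example, 250,000 cycles over 1,000 iterations at 100 MHz gives a 2.5 μs round trip and a 1.25 μs one-way trip time. Averaging over many iterations smooths out the cost of reading the counter.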
