A High-Speed Inter-Process Communication Architecture for FPGA-based Hardware Acceleration of Molecular Dynamics Presented by: Chris Comis September 23, 2005 Supervisor: Professor Paul Chow
Outline • Motivation • System-Level Overview • Protocol Development • Results • Integration into a Programming Model • Conclusions/Questions
What is Molecular Dynamics? • A method of calculating the time-evolution of molecular configurations • Useful in the analysis of protein folding • Many applications in rational drug design
MD is Computationally Challenging • Forces (i.e. F=ma) are calculated between an atom and all other atoms in the system • An O(n²) problem across 10,000+ atoms • Force calculations are performed at femtosecond timesteps • Interesting results may take several μs of simulation (10⁹+ timesteps required) • MD simulations are typically run on supercomputers
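The scale of the problem can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a naive all-pairs scheme in which each unordered pair of atoms is evaluated once per timestep; the function names are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct atom pairs evaluated per timestep in a naive
 * all-pairs force calculation: n * (n - 1) / 2.
 * ASSUMPTION: each unordered pair is evaluated once. */
uint64_t pair_interactions(uint64_t n_atoms)
{
    assert(n_atoms < (UINT64_C(1) << 32)); /* keep the product from overflowing */
    if (n_atoms < 2)
        return 0;
    return n_atoms * (n_atoms - 1) / 2;
}

/* Timesteps needed to cover sim_time_fs femtoseconds of simulated time
 * with a step of step_fs femtoseconds, rounded up. */
uint64_t timesteps_required(uint64_t sim_time_fs, uint64_t step_fs)
{
    assert(step_fs > 0);
    return (sim_time_fs + step_fs - 1) / step_fs;
}
```

With 10,000 atoms this gives roughly 5 × 10⁷ pair evaluations per timestep, and a single microsecond of simulated time at 1 fs steps already needs 10⁹ timesteps.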
An FPGA-based MD Accelerator • An ongoing collaborative project involves the development of an FPGA-based MD Accelerator • Advantages to an FPGA-based approach: • Massive parallel computation • Forces can be parallelized • Force computations can be accelerated ~88x • High-speed Serial I/O (SERDES) may be leveraged
Area of Focus • Develop communication protocol using high-speed SERDES links • Requirements: • Reliability • Light-weight • Minimal trip-time for small packets • Must be abstracted at the hardware and software levels
Outline • Motivation • System-Level Overview • Protocol Development • Results • Integration into a Programming Model • Conclusions/Questions
A Partial MD Simulator • Computation blocks can be hardware, or software executed on MicroBlaze soft processors • Software must be written using a programming model • Blocks → computation • Arrows → communication
System-Level Overview • The MD simulator is simplified to a Producer/Consumer model • The model is then adapted for SERDES development • Producer and consumer hardware blocks are implemented • An FSL (FIFO) is used as an abstracted method of data transport with SERDES logic • An OPB bus interface is added for register access of components • Deep FIFOs are added for logging high-speed data
Outline • Motivation • System-Level Overview • Protocol Development • Results • Integration into a Programming Model • Conclusions/Questions
Protocol Overview • A synchronous acknowledgement-based protocol was chosen • Simple and predictable • An inherent delay in waiting for acknowledgements • To mask this delay: • Multiple producers are connected to the SERDES interface • The link is time-multiplexed across multiple producers
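Why time-multiplexing masks the acknowledgement delay can be illustrated with a simple idealized model. This is a sketch under stated assumptions, not the design's actual timing: each producer sends one packet (taking `t_tx`), then waits `t_ack` for its acknowledgement before sending again, and acknowledgements cost nothing on the forward link. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Fraction of time the forward link carries data when `producers`
 * stop-and-wait senders share it.
 * ASSUMPTION: idealized model; one producer keeps the link busy for
 * t_tx out of every (t_tx + t_ack), and other producers fill the gap
 * until the link saturates at 1.0. */
double link_utilization(int producers, double t_tx, double t_ack)
{
    assert(producers > 0 && t_tx > 0.0 && t_ack >= 0.0);
    double u = producers * t_tx / (t_tx + t_ack);
    return u > 1.0 ? 1.0 : u;
}
```

In this model a single producer whose acknowledgement wait equals its transmit time uses only half the link, while two or more such producers can keep it fully occupied.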
Protocol Overview • All data has a word width of 4 bytes • Data packets: • Variable size (between 32 and 2016 bytes) • A 32-bit CRC is appended • Acknowledgements: • 8 bytes in size • Can interrupt transmission of data packets
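The packet constraints above can be sketched in C. The size limits and word width come from the slide; everything else is an assumption made for illustration. In particular, the slides do not name the CRC polynomial, so the common IEEE 802.3 CRC-32 is used here, and it is not stated whether the 32 to 2016 byte range includes the CRC word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    WORD_BYTES       = 4,    /* word width from the slide */
    MIN_PACKET_BYTES = 32,   /* smallest data packet */
    MAX_PACKET_BYTES = 2016  /* largest data packet */
};

/* Returns 1 if len is a legal data-packet size.
 * ASSUMPTION: a 4-byte word width implies lengths are multiples of 4. */
int packet_length_ok(size_t len)
{
    return len >= MIN_PACKET_BYTES && len <= MAX_PACKET_BYTES
        && len % WORD_BYTES == 0;
}

/* Bitwise CRC-32: reflected polynomial 0xEDB88320, initial value and
 * final XOR of 0xFFFFFFFF.
 * ASSUMPTION: the IEEE 802.3 polynomial; the slides do not specify one. */
uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, else zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

A receiver would recompute the CRC over the received payload and compare it with the appended word; a mismatch corresponds to the data (CRC) error case.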
Transmit Logic • Transmitter consists mainly of two components: • Dual-port buffers: • The start address of the packet is kept in case a resend is necessary • Scheduler: • Schedules ready packets in a round-robin fashion [Figure: data path from producer via FSL to scheduler of SERDES link]
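Round-robin selection among transmit buffers can be sketched as follows. This is a software illustration of the scheduling policy only, not the hardware scheduler itself; the producer count of three is an assumption borrowed from the verification test, and the function name is hypothetical.

```c
#include <assert.h>

#define NUM_PRODUCERS 3 /* ASSUMPTION: three producers, as in the verification test */

/* Pick the next producer with a ready packet, starting the search just
 * after the producer served last. Pass last_served = -1 before any
 * packet has been sent. Returns -1 if no packet is ready. */
int round_robin_next(const int ready[NUM_PRODUCERS], int last_served)
{
    assert(last_served >= -1 && last_served < NUM_PRODUCERS);
    for (int offset = 1; offset <= NUM_PRODUCERS; offset++) {
        int candidate = (last_served + offset) % NUM_PRODUCERS;
        if (ready[candidate])
            return candidate;
    }
    return -1;
}
```

Starting the search after the last-served producer means a producer that always has data ready cannot starve the others.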
Receive Logic • Receiver consists mainly of two components: • Dual-port buffers: • The start address of the packet is kept in case errors occur • Three-stage Dataflow Pipeline: • Stage 1: Determine if incoming data is properly formatted • Stage 2: Evaluate incoming data against all possible errors • Stage 3: Pass results to acknowledgement handler [Figure: data path from SERDES link to consumer via FSL]
Design Effort • Majority of design effort was in error handling: • Transmitter: • Determine which packet combinations corrupt the system • Establish a priority among conflicting packet types • Receiver: • Handle all possible combinations of transmission errors
Outline • Motivation • System-Level Overview • Protocol Development • Results • Integration into a Programming Model • Conclusions/Questions
Test Environment • All SERDES tests were performed across Xilinx Virtex-II Pro XC2VP7 and XC2VP30 FPGAs • Ribbon cables were used to transfer serial data between non-impedance-controlled connectors
Reliability and Sustainability • Verification test environment: • Send data concurrently from three producers to three respective consumers • Pseudo-random packet length • Consumers read from FSL at variable rates • Reliability: • Run this test under extremely poor line conditions • Sustainability: • Run this test under normal line conditions for a long period of time
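One way to produce pseudo-random packet lengths for such a test is a linear-feedback shift register, which maps naturally onto FPGA logic. The slides do not say how the lengths were generated, so the sketch below is an assumption throughout: a standard 16-bit Fibonacci LFSR with taps 16, 14, 13, 11, and hypothetical function names. The 32 to 2016 byte range and 4-byte word width come from the protocol slides.

```c
#include <assert.h>
#include <stdint.h>

/* One step of a 16-bit Fibonacci LFSR with taps at bits 16, 14, 13, 11.
 * ASSUMPTION: illustrative generator, not the one used in the tests. */
uint16_t lfsr_step(uint16_t state)
{
    assert(state != 0); /* the all-zero state never changes */
    uint16_t bit = (uint16_t)((state ^ (state >> 2) ^ (state >> 3)
                               ^ (state >> 5)) & 1u);
    return (uint16_t)((state >> 1) | (bit << 15));
}

/* Map an LFSR state to a packet length in bytes: a whole number of
 * 4-byte words between 32 and 2016 bytes inclusive. */
uint32_t packet_length_bytes(uint16_t state)
{
    const uint32_t min_words = 32 / 4;   /* 8 words */
    const uint32_t max_words = 2016 / 4; /* 504 words */
    return (min_words + state % (max_words - min_words + 1)) * 4;
}
```

The modulo mapping is slightly non-uniform, which is usually acceptable for stress testing but worth noting if the length distribution matters.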
Reliability • Reliability: 128-second Test Results
Sustainability • Sustainability: 8-hour Test Results
Comparison Against Other Communication Mechanisms • Two configurations are used • Configuration A: Saturate the channel with packets • Configuration B: Loop-back test • Compare against: • Simple FPGA-based 100BaseT Ethernet • TCP/IP FPGA-based 100BaseT Ethernet • TCP/IP Cluster-based Gigabit Ethernet
Area Consumption • Each SERDES Interface takes approximately 8% of a Xilinx XC2VP30 • Debug logic substantially increases area consumption: • FF usage increases 68% • LUT usage increases 43%
Outline • Motivation • System-Level Overview • Protocol Development • Results • Integration into a Programming Model • Conclusions/Questions
Integration into a Programming Model • Hardware abstraction: FSL • Software abstraction: An MPI-based Programming Model • Modified MPI_Send and MPI_Recv function calls:
    /* Repeatedly send 64 ints to rank 0, then receive 64 ints from rank 0 */
    while (1) {
        MPI_Send(data_outgoing, 64, MPI_INT, 0, 0, MPI_COMM_WORLD);
        MPI_Recv(data_incoming, 64, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }
Integration into a Programming Model • Replaced producers and consumers with a MicroBlaze processor • Several communication scenarios were tested
Outline • Motivation • System-Level Overview • Protocol Development • Results • Integration into a Programming Model • Conclusions/Questions
Conclusions • Final Results: • Reliable and sustainable • Abstracted at the software and hardware level • 2074 FFs and 2244 LUTs required for SERDES logic only • Given a channel rate of 2.5Gbps, maximum bidirectional throughput of 1.928Gbps • Minimum packet trip-time of 1.23μs
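The throughput figure can be put in context with the 8B/10B encoding used on the link, which carries 8 data bits in every 10 line bits. The sketch below is a rough calculation under an assumption: it compares the reported 1.928 Gbps against the post-encoding rate of a single 2.5 Gbps channel. The slides do not define how "bidirectional" throughput is counted, so the comparison is illustrative only, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Payload rate left after 8B/10B encoding. Rates are in Mbps so the
 * arithmetic stays in exact integers. */
uint32_t payload_rate_mbps(uint32_t line_rate_mbps)
{
    return (uint32_t)((uint64_t)line_rate_mbps * 8 / 10);
}

/* Achieved throughput in parts per thousand of a ceiling rate.
 * ASSUMPTION: the ceiling is one channel's post-encoding rate. */
uint32_t efficiency_permille(uint32_t achieved_mbps, uint32_t ceiling_mbps)
{
    assert(ceiling_mbps > 0);
    return (uint32_t)((uint64_t)achieved_mbps * 1000 / ceiling_mbps);
}
```

Under that assumption, a 2.5 Gbps channel offers 2.0 Gbps after encoding, and 1.928 Gbps corresponds to 96.4% of it, with the remainder going to framing, CRC, acknowledgements, and clock correction.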
Acknowledgements • Professor Régis Pomès, Chris Madill • Professor Paul Chow, Professor C.Y. Chen, Lesley Shannon, Arun Patel, Manuel Saldaña, David Chui, Sam Lee, Andrew House, Nathalie Chan, Lorne Applebaum, Patrick Akl
References • Y. Gu, T. VanCourt, M. C. Herbordt, "FPGA Acceleration of Molecular Dynamics Computations," to appear in Proceedings of Field Programmable Logic and Applications, August 2005.
Transmitter Packet Collision Handling • Packets are enclosed by 8B/10B control characters (K-characters) • The type of packet is distinguished by the K-characters used • Certain combinations of control characters cannot be nested • Clock correction has priority over acknowledgement • Acknowledgement cannot interrupt the end of a data packet • Clock correction must avoid the beginning and end of a data packet
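The priority rules above can be sketched as a small arbitration function. This is one possible reading of the slide, not the actual transmitter logic: the phase names, the enum values, and the decision to let an acknowledgement go ahead while clock correction is deferred at the start of a data packet are all assumptions.

```c
#include <assert.h>

/* Where the transmitter is within the current data packet. */
typedef enum { DATA_IDLE, DATA_START, DATA_BODY, DATA_END } data_phase_t;

/* What the transmitter puts on the link next. */
typedef enum { TX_DATA_OR_IDLE, TX_ACK, TX_CLOCK_CORRECTION } tx_choice_t;

tx_choice_t arbitrate(data_phase_t phase, int cc_pending, int ack_pending)
{
    assert(phase >= DATA_IDLE && phase <= DATA_END);

    /* Clock correction outranks acknowledgements, but must avoid the
     * beginning and end of a data packet. */
    if (cc_pending && phase != DATA_START && phase != DATA_END)
        return TX_CLOCK_CORRECTION;

    /* Acknowledgements may interrupt a data packet, except at its end.
     * ASSUMPTION: an acknowledgement may proceed while a deferred
     * clock correction waits at the start of a packet. */
    if (ack_pending && phase != DATA_END)
        return TX_ACK;

    /* Otherwise continue the data packet, or stay idle. */
    return TX_DATA_OR_IDLE;
}
```

Encoding the rules as a single pure function makes each forbidden combination easy to enumerate and test, which mirrors where the slides say most of the design effort went.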
Receiver Error Handling • All combinations of errors at the receiver are handled correctly • Data errors (CRC errors) • Disparity errors or invalid characters (soft errors) • Errors in framing (frame errors) • Channel failures (hard errors) • Lost acknowledgements/repeat packets • Receiver buffers full
Test Configuration A • Send data concurrently from three producers to three respective consumers • Producers write to FSL as fast as possible • Consumers read from FSL as fast as possible • Analyze best-case throughput results
Test Configuration B • Send data from a producer to a consumer • Delay a packet write from a producer until a packet has been completely received by the consumer on the same FPGA • The resulting communication loop determines the round-trip time (and therefore the one-way trip time)