
Scalable Framework for Heterogeneous Clustering of Commodity FPGAs

Jeremy Espenshade, May 7, 2009


Presentation Transcript


  1. Jeremy Espenshade May 7, 2009 Scalable Framework for Heterogeneous Clustering of Commodity FPGAs

  2. Motivation • High Performance Computing • Performing difficult computations in an acceptable time period • Example Areas of Interest: • Cryptanalysis • Bioinformatics • Molecular Dynamics • Image Processing • Specialized Hardware • Architectural differences can provide orders of magnitude speedup on suitable applications • GPGPUs: Visualization, Linear Algebra, etc through data parallelism • Cell Processor: Video Encoding, Linear Algebra, etc through data parallelism • FPGAs: DSP, Image Processing, Cryptography, etc through bit-level parallelism

  3. Outline • Background • Cluster Computing • FPGA Technology • Commercial FPGA Supercomputers • Proposed Framework • Requirements and Motivation • Hardware and Software Organization • HW/SW Interaction • Application Case Studies • DES Cryptanalysis • Step-by-Step Design Flow • Demonstration FPGA Cluster • Performance Comparison • Matrix Multiplication • Platform Performance Characterization • Conclusion and Future Work

  4. Cluster Computing • Historical monolithic supercomputers have given way to networks of smaller computers • “If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?” - Seymour Cray • Middleware technologies have made cluster construction and programming easy and efficient • Message Passing – MPI, PVM • Shared Memory – OpenMP • Remote Procedure Call – Java RMI, CORBA • Grid Organization – Condor, Globus

  5. Message Passing Interface • De facto standard API for inter-process communication over distributed memory • Language-independent library with point-to-point and collective operations • MPI_Send/MPI_Recv • MPI_Bcast/MPI_Reduce • MPI_Scatter/MPI_Gather • MPI_Barrier • OpenMPI • Open source implementation in native C

  6. Message Passing Interface • Creates a “Virtual Topology” of the computation environment using process ranks • Process Tree: MPI_Send(data, child), MPI_Recv(data, parent) • Process Ring: MPI_Send(data, (myrank + 1) % size), MPI_Recv(data, (myrank - 1 + size) % size) • Master/Slave: MPI_Bcast(data, to slaves), MPI_Reduce(data, to master) [Figure: example tree, ring, and master/slave topologies over ranks R0 to R3]
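The ring pattern on this slide needs some care in C, where `%` can yield a negative result when its left operand is negative (rank 0 minus 1). A minimal sketch of the neighbor computation follows; the helper names `ring_next` and `ring_prev` are hypothetical and are not part of MPI or the presentation.

```c
#include <assert.h>

/* Rank that receives our data in a ring of `size` processes. */
int ring_next(int myrank, int size) {
    assert(size > 0 && myrank >= 0 && myrank < size);
    return (myrank + 1) % size;
}

/* Rank we receive from. Adding `size` before the modulo keeps the
   result non-negative for rank 0. */
int ring_prev(int myrank, int size) {
    assert(size > 0 && myrank >= 0 && myrank < size);
    return (myrank - 1 + size) % size;
}
```

The returned values would be passed as the destination and source arguments of MPI_Send and MPI_Recv respectively.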

  7. FPGAs • Field-Programmable Gate Arrays • Devices whose function can be specified through a hardware description language (VHDL/Verilog) • Slower than custom ASICs but much more flexible • Large degree of fine-grained parallelism [Figure labels: basic logic block, configurable I/O block, interconnects]

  8. Configurable Logic Blocks • FPGAs realize computation over a network of CLBs • Each CLB contains: • Eight 6-Input Look-Up Tables • Eight Flip-Flops • Control Multiplexors • Arithmetic and Carry Logic • LUTs can be preconfigured to implement any 6-input logical function • ‘A = B ⊕ C ⊕ D ⊕ E ⊕ F ⊕ G’ can be calculated in a single cycle even over large operand lengths • Bit-Level Parallelism
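To illustrate why a 6-input LUT can implement any 6-input function, the sketch below models one LUT in software as a 64-entry truth table. The type and function names (`lut6_t`, `lut6_make_xor`, `lut6_eval`) are illustrative assumptions, not vendor APIs; real LUT contents are set by the configuration bitstream.

```c
#include <assert.h>
#include <stdint.h>

/* A 6-input LUT is a 64-entry truth table: bit i holds the output
   for input pattern i. */
typedef uint64_t lut6_t;

/* Build the truth table for the XOR of all six inputs. */
lut6_t lut6_make_xor(void) {
    lut6_t table = 0;
    for (unsigned pattern = 0; pattern < 64; pattern++) {
        unsigned parity = 0;
        for (unsigned bit = 0; bit < 6; bit++)
            parity ^= (pattern >> bit) & 1u;
        table |= (lut6_t)parity << pattern;
    }
    return table;
}

/* Evaluate the LUT: the six input bits simply index the table. */
unsigned lut6_eval(lut6_t table, unsigned inputs) {
    assert(inputs < 64);
    return (unsigned)((table >> inputs) & 1u);
}
```

Any other 6-input function differs only in the 64-bit table contents, which is why evaluation cost does not depend on the function chosen.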

  9. Xilinx Virtex-5 FXT • Hybrid FPGA • Hardwired PowerPC Cores • 2-15k CLBs • Arithmetic DSP Slices • Ethernet MAC Units

  10. FPGA Supercomputers • Cray XT5h • AMD Opteron and Xilinx Virtex-4 FPGAs on single blade • Cray SeaStar2+™ Interconnect • Custom API for RPUs http://www.cray.com/Assets/PDF/products/xt/CrayXT5hBrochure.pdf • SRC6/7 • Altera Stratix II FPGAs and Intel Xeon or AMD Opteron Processors • SRC HI-BAR® Interface • Carte® Programming Environment http://www.srccomp.com/products/src7_hardware.asp

  11. Framework Motivation • Motivating Concepts • FPGAs have great performance potential • Especially applications with high bit, instruction, and data level parallelism • Many FPGAs working together would allow even better parallelism exploitation • Increased data parallelism through multi-node partitioning • Task parallelism through independent heterogeneous nodes • FPGA cluster frameworks are currently limited to proprietary supercomputers • Commodity clusters will reduce the barrier to entry and promote FPGA integration and use • Occam’s Razor applies to programming environments - Simple is good.

  12. Framework Requirements • Easily Programmable • Common API for both parallel programming and hardware access • Modular hardware supported without modification • Hardware/Software Design Independent • Interface access independent of application and implementation • Minimal Framework Overhead • System should exhibit acceptable performance • Commodity technologies • Ethernet networking, Linux OS, open software • Scalable and Flexible • Additional FPGAs easily integrated • Heterogeneous nodes seamlessly supported • Extensible • Future improvements possible without harsh restrictions

  13. Physical Organization • Ethernet Network • Hard-wired MAC on chip [Figure: FPGA nodes, each with hardware units (HW), Compact Flash, and an Ethernet MAC, connected over 10/100/1000T Ethernet]

  14. Single-Node Software Environment • Embedded Linux OS • Root File System on Compact Flash (256MB) • BusyBox Utilities • Minimal Libraries • OpenMPI • OpenSSH: Certificate-based security for shell access • OpenSSL: TCP/IP security • Various support libraries (zlib, etc.) • Special Device Files in /dev • Hardware devices mapped to character devices • fopen, fwrite, fread, and fclose commands • Dynamic major number reported in /proc/devices • FILE * AllocateResource(char * base_name) • Lock-based Arbitration
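The slide names AllocateResource and lock-based arbitration but does not show how they work. The following is only a sketch of one way such a helper might behave, under loudly stated assumptions: device files are named `<base_name>_<n>`, arbitration uses an advisory `flock`, and the function name `allocate_resource_in` with its `dir` and `max_units` parameters is hypothetical. The framework's actual implementation may differ.

```c
#define _DEFAULT_SOURCE   /* exposes flock() and fdopen() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

/* Try <dir>/<base_name>_0 .. _<max_units-1> and return the first unit
   that can be locked exclusively, or NULL if none is free.
   The naming scheme and the use of flock are assumptions. */
FILE *allocate_resource_in(const char *dir, const char *base_name, int max_units) {
    assert(dir != NULL && base_name != NULL && max_units >= 0);
    for (int unit = 0; unit < max_units; unit++) {
        char path[256];
        int len = snprintf(path, sizeof path, "%s/%s_%d", dir, base_name, unit);
        if (len < 0 || (size_t)len >= sizeof path)
            continue;                  /* path too long, skip it */
        int fd = open(path, O_RDWR);
        if (fd < 0)
            continue;                  /* unit missing or inaccessible */
        if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
            close(fd);                 /* someone else holds this unit */
            continue;
        }
        FILE *dev = fdopen(fd, "r+");
        if (dev != NULL)
            return dev;                /* lock is dropped when dev is closed */
        close(fd);
    }
    return NULL;
}
```

Under these assumptions, a call such as `allocate_resource_in("/dev", "des_unit", 4)` would hand each process a different free unit, and closing the returned stream releases the unit for others.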

  15. Interaction Stack [Figure: layered interaction stack. Software Design: the MPI user application (MPI_Init, MPI_Send, MPI_Recv, MPI_Finalize) uses fopen, fwrite, fread, and fclose, which reach the kernel driver as open, write, read, and release. Constant Interface: the kernel driver (enable interrupts, reset FIFOs), the PLB, the interrupt controller, the write FIFO, and the read FIFO. Hardware Design: HW FIFO state machines and the application-specific hardware unit.]

  16. Driver Details • Xilinx provides minimal set of drivers • SysAce CF, Ethernet MAC (MII/GMII), etc • Custom FIFO Driver for HW abstraction • Boot time • Registers platform device and character driver • Maps physical memory address and IRQ to virtual space • Constructs address offsets for registers • Device Open • Resets hardware accelerator and FIFOs • Enable Interrupts • Interrupt Handling • Read waits on data in Read FIFO • Hardware generates interrupt

  17. Virtual Organization • Each hardware unit can behave as an independent node • Hardware units can perform different functions and similar functions at different speeds • Each FPGA can host as many units as fit in the device • Configurations with multiple hardware units per process or other exotic setups are also inherently supported [Figure: four FPGA nodes, each with a MAC and Compact Flash (CF), hosting processes P0 through P9]

  18. Application Case Study • The Data Encryption Standard (DES) is a widely used and longstanding block cipher • Due to insufficient cryptographic strength resulting from a 56-bit key, DES has been successfully broken by brute force exhaustive key searches in the past decade • Past approaches: • Distributed computing (a la Folding@Home) • Custom DES ASICs in a cluster • A hybrid of the above two approaches • Custom system of 120 Spartan-IIe FPGAs

  19. DES Algorithm

  20. System Development Process • Independent HW/SW Design • Hardware Design • Algorithm Implementation • Search keys as fast as possible • Software Design • Partition Key Space • Coordinate Results

  21. Embedded Platform • Xilinx Base System Builder • Generates PowerPC, DDR, Compact Flash and Ethernet Interfaces

  22. Hardware Device Creation • Xilinx Peripheral Creation Wizard • PLB Slave Interface • Read/Write FIFOs • Interrupt Controller • Software Reset • Software Accessible Registers • State Machine centric design • FIFO Access • Interface/Accelerator Interaction • Interrupt Generation

  23. DES Hardware Implementation • Key Guesser top level • Contains 2^X DES encryption engines • 18 stage pipeline (16 rounds plus 1 input, 1 output) • Initialized with known plaintext and ciphertext • 24 high bits of key expected as input • Each DES engine checks lowest (32 - X) bits with assigned middle bits based on component number • If key is found, return it, otherwise return zero after key space is checked [Key layout: High 24 bits | Mid X bits | Low (32 - X) bits]
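The key layout on this slide (24 high bits from software, X middle bits fixed per engine, and 32 - X low bits swept by each engine) can be sketched in software as follows. The function names `compose_key` and `keys_per_engine` are hypothetical, and the sketch only models the 56-bit key index; mapping that index onto a 64-bit DES key with parity bits is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assemble a 56-bit key candidate:
   [ high 24 bits | X engine-id bits | (32 - X) counter bits ] */
uint64_t compose_key(uint32_t high24, unsigned x, uint32_t engine, uint32_t low) {
    assert(x < 32);
    assert(high24 < (1u << 24));
    assert(engine < (1u << x));
    unsigned low_bits = 32 - x;
    uint64_t low_mask = ((uint64_t)1 << low_bits) - 1;
    assert(low <= low_mask);
    return ((uint64_t)high24 << 32) | ((uint64_t)engine << low_bits) | low;
}

/* Number of candidates each engine sweeps for one 24-bit indicator. */
uint64_t keys_per_engine(unsigned x) {
    assert(x < 32);
    return (uint64_t)1 << (32 - x);
}
```

With X = 3 there are eight engines, and together they cover all 2^32 keys that share a given 24-bit prefix.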

  24. FIFO Interface and State Machine [Figure: eight DES Encrypt engines numbered 000 through 111; inputs are the plain text, cipher text, and high 24 bits of the key; outputs are the correct guess, search complete, and result key signals]

  25. State Machine Design • Xilinx ISE • VHDL model of interface and DES guessing unit interaction • Simulate & synthesize • Timing and Resource Utilization [Figure: state machine. For each configuration parameter: receive configuration (Read Req, Read Ack). While key not found and still searching: start searching and wait for guessing results. For second key half: return result or failure notification (Write Req, Write Ack). Generate interrupt.]

  26. System Integration • XPS Interrupt Controller Connection • Processor Local Bus Connection (PLB) • Physical Address Assignment

  27. Integrated Bus Structure [Figure: DDR RAM, PowerPC, Ethernet MAC, and Compact Flash share the arbitrated Processor Local Bus (PLB) with hardware units HAU_0, HAU_1, …, HAU_N, each containing user logic]

  28. Device Tree Structure • Linux kernel build targets DTS file created as Xilinx library • Driver extracts memory addresses, IRQs, and name information • Example unit description:

      plb_des_1: plb-des@c9c20000 {
          compatible = "xlnx,plb-des-1.00.a";
          interrupt-parent = <&xps_intc_0>;
          interrupts = < 1 2 >;
          reg = < 0xc9c20000 0x10000 >;
          xlnx,family = "virtex5";
          xlnx,include-dphase-timer = <0x1>;
      };

  29. Deployment • Linux Kernel • Build targeting specific platform and DTS • Include device driver for hardware accelerators • make ARCH=powerpc CONFIG_FIFO_DRIVER=y • Generates .ELF programming file • Bit-stream Generation • Synthesize, Map, Place, and Route Design • Generates .BIT configuration file • Create a System ACE file • Merges .BIT and .ELF into .ACE file • Place .ACE file onto compact flash and boot

  30. Application Development • Master-Slave Structure • Master coordinates dynamic work queue • Slaves wait for work or stop condition • Program Flow • Master: • Send 24-bit key space indicator (KSI) to each slave • Wait for a response: • If key found, break out, report results and distribute stop conditions to all slaves • If key not found, send next KSI to slave • Slave: • Allocate and initialize hardware unit • Wait for work or stop condition • If new work arrives, send KSI to hardware and send back the result [Figure: master M connected to slaves S1, S2, …, Sn]

  31. Pseudo-Code Structure

      #include <stdio.h>
      #include "mpi.h"
      #include "fpga_util.h"

      int main(int argc, char *argv[]) {
          FILE *my_dev;
          int rank;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          if (rank == 0) { // Master process
              // For each slave
              MPI_Send(key_space_indicator, 1, MPI_INT, slave_rank, 0, MPI_COMM_WORLD);
              // While work remains in queue and key not found
              MPI_Recv(result, 3, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              MPI_Send(new_key_space_indicator, 1, MPI_INT, slave_rank, 0, MPI_COMM_WORLD);
              // Once key found
              printf("The answer is %d!\n", data_in);
          } else { // Slave processes
              my_dev = AllocateResource("des_unit");
              // Unbuffered I/O so each write reaches the hardware immediately
              setvbuf(my_dev, NULL, _IONBF, sizeof(int));
              // Until stop condition received
              MPI_Recv(KSI, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              fwrite(KSI, sizeof(int), 1, my_dev);
              fread(result, sizeof(int), 2, my_dev);
              MPI_Send(result, 3, MPI_INT, 0, 0, MPI_COMM_WORLD);
          }

          MPI_Finalize();
          return 0;
      }

  32. Testbed • Four Virtex-5 devices • PowerPC 440 Processor @ 400 MHz • Two ML507 and two ML510 boards • 256/512 MB DDR2 • One Virtex-4 device • PowerPC 405 Processor @ 300 MHz • ML410 board • 256 MB DDR2 • 100 MHz Processor Local Bus (PLB) • RIT Ethernet Network • DES Resource Usage • ML510: 3 Units each, ML507: 2 Units each, ML410: 1 Unit • 11 Units total over 5 FPGAs

  33. DES Application Scalability • Hardware DES unit can guess 8 keys/cycle*100MHz = 800 Million keys/sec • 11 Distributed Hardware Units => 8800 M keys/sec ideally • Actual performance is 8548.55 M keys/sec = 97.14%
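The figures on this slide can be checked with a short calculation. The exhaustive-search time in the last line is derived here from the measured rate and the 56-bit key size; it is not stated in the presentation.

```latex
\begin{align*}
R_{\text{unit}}  &= 8\ \tfrac{\text{keys}}{\text{cycle}} \times 100\ \text{MHz} = 800\ \text{M keys/s} \\
R_{\text{ideal}} &= 11 \times 800 = 8800\ \text{M keys/s} \\
\eta             &= \frac{8548.55}{8800} \approx 0.9714 \\
T_{\text{worst}} &= \frac{2^{56}}{8548.55 \times 10^{6}\ \text{keys/s}} \approx 8.43 \times 10^{6}\ \text{s} \approx 97.6\ \text{days}
\end{align*}
```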

  34. Performance Comparison • DES search application developed for cluster of 2.33 GHz Xeon processors with same program structure • Single node performance = 0.735 M Keys/sec • Scales to 7.722 M Keys/sec across 11 cores @ 95.4% efficiency
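As a worked comparison using only the throughput figures quoted in the slides, the cluster-level and per-unit speedups come out as follows; the first ratio is consistent with the "> 1100x" improvement cited in the conclusions.

```latex
\begin{align*}
S_{\text{cluster}} &= \frac{8548.55\ \text{M keys/s}}{7.722\ \text{M keys/s}} \approx 1107 \\
S_{\text{unit}}    &= \frac{800\ \text{M keys/s}}{0.735\ \text{M keys/s}} \approx 1088
\end{align*}
```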

  35. Matrix Multiplication • A standard test of computational capability is matrix multiplication • A*B=C • Highly data parallel • Each index of the result matrix can be computed independently: C[i][j] = A[i][] * B[][j] • Hardware Design • Store multiple rows of A and compute a running dot product for each row when receiving a column of B • Software Design • Statically partition work across available FPGAs and aggregate results
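The hardware scheme described on this slide (hold several rows of A, then update one running dot product per row as a column of B streams in) can be modeled in software as below. This is a sketch of the dataflow only; the function name `stream_column` and the row-major `int` layout are assumptions, not the accelerator's actual interface.

```c
#include <assert.h>
#include <stddef.h>

/* a_rows: n_rows x k matrix block, row-major (the rows held by the unit).
   b_col:  one column of B with k elements, consumed in arrival order.
   c_col:  receives n_rows results, one entry of C per stored row. */
void stream_column(const int *a_rows, size_t n_rows, size_t k,
                   const int *b_col, int *c_col) {
    assert(a_rows != NULL && b_col != NULL && c_col != NULL);
    for (size_t r = 0; r < n_rows; r++)
        c_col[r] = 0;
    /* Each arriving element of the B column updates every row's running
       sum; in hardware these row updates would happen in parallel. */
    for (size_t i = 0; i < k; i++)
        for (size_t r = 0; r < n_rows; r++)
            c_col[r] += a_rows[r * k + i] * b_col[i];
}
```

Calling the function once per column of B yields the corresponding column of C for the stored rows, so partitioning A by rows across FPGAs needs no communication between nodes.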

  36. Single Node Results • FPGAs fare poorly in comparison to Xeon GPP • Largest FPGA with most available hardware comparable to single core, but poor price/performance • With greater concurrency, the FPGA should have performed better. Why didn’t it?

  37. Analysis • Bandwidth is the limiting factor • Data is communicated word by word over a 32-bit arbitrated bus (PLB) • Processor independent DMA required for improved performance

  38. Scalability Results • Problem scales well across Virtex-5 devices • 83% across 4 FPGAs • Virtex-4 addition causes worse performance • Static Partitioning
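One plausible reading of why adding the slower Virtex-4 hurts under static partitioning is that every node receives an equal share, so the run finishes only when the slowest node does. The sketch below illustrates that reasoning with hypothetical relative speeds; the equal-share assumption and the function name `static_partition_time` are illustrative and are not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Completion time when total_work is split evenly across n nodes with
   the given relative speeds: the slowest node determines the finish. */
double static_partition_time(double total_work, const double *speeds, size_t n) {
    assert(speeds != NULL && n > 0);
    double share = total_work / (double)n;
    double slowest = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(speeds[i] > 0.0);
        double t = share / speeds[i];
        if (t > slowest)
            slowest = t;
    }
    return slowest;
}
```

For example, 100 units of work on four nodes of speed 1.0 takes 25 time units, while adding a fifth node of speed 0.5 raises the finish time to 40, even though total capacity grew.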

  39. Conclusions • Scalable framework for clustering of FPGAs developed and demonstrated • Flexible application development with decoupled HW/SW Co-design • Standard MPI programming interface and simple hardware interaction abstractions/APIs • Commodity hardware technologies and open software allow a low barrier to entry for further research and development • Application Case Studies • Well-suited applications like cryptanalysis perform admirably • Performance improvements of > 1100x • Price/Performance improvements of > 370x • Current bandwidth limitations are holding back other applications

  40. Future Work • Framework Infrastructure • Hardware Communication • DMA data transfer from PowerPC to HW • Fast HW->HW Interconnection on single FPGA • Dynamic Reconfiguration of Hardware Accelerators • Cluster Management • Job Submission and Resource Matching with Condor or similar • Monitoring with Ganglia or similar • Robustness • Fault Tolerance • Correct HW Usage Enforcement • New Applications • Performance study with more inter-process communication • Image Processing could be a good place to start • Neural Simulation Platform (Dmitri Yudanov, CE Dept) • Expressed interest from EE and Astrophysics departments • Design Flow Improvements • Integrated tool chain and improved deployment procedure

  41. Questions? • Thank you for listening • Contact: Jeremy Espenshade, jke2553@rit.edu, RIT Computer Engineering Hardware Design Lab • Primary Advisor: Dr. Marcin Lukowiak • References: • Espenshade, Jeremy. Scalable Framework for Heterogeneous Clustering of Commodity FPGAs. Master’s thesis, Rochester Institute of Technology, 2009. • Cray Inc. Cray XD1 Supercomputer Outscores Competition in HPC Challenge Benchmark Tests. Business Wire. Feb 15 2005. http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsArticle&ID=674199&highlight=. • Tarek El-Ghazawi, Esam El-Araby, Miaoqing Huang, Kris Gaj, Volodymyr Kindratenko, and Duncan Buell. The Promise of High-Performance Reconfigurable Computing. IEEE Computer Magazine, 41(2):69–76, 2008. • Xilinx Corp. Virtex-5 Multi-Platform FPGAs, 2009. http://www.xilinx.com/products/virtex5/.
