Scalable Framework for Heterogeneous Clustering of Commodity FPGAs
Jeremy Espenshade
May 7, 2009
Motivation • High Performance Computing • Performing difficult computations in an acceptable time period • Example Areas of Interest: • Cryptanalysis • Bioinformatics • Molecular Dynamics • Image Processing • Specialized Hardware • Architectural differences can provide orders-of-magnitude speedups on suitable applications • GPGPUs: Visualization, Linear Algebra, etc. through data parallelism • Cell Processor: Video Encoding, Linear Algebra, etc. through data parallelism • FPGAs: DSP, Image Processing, Cryptography, etc. through bit-level parallelism
Outline • Background • Cluster Computing • FPGA Technology • Commercial FPGA Supercomputers • Proposed Framework • Requirements and Motivation • Hardware and Software Organization • HW/SW Interaction • Application Case Studies • DES Cryptanalysis • Step-by-Step Design Flow • Demonstration FPGA Cluster • Performance Comparison • Matrix Multiplication • Platform Performance Characterization • Conclusion and Future Work
Cluster Computing • The monolithic supercomputers of the past have given way to networks of smaller computers • “If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?” - Seymour Cray • Middleware technologies have made cluster construction and programming easy and efficient • Message Passing – MPI, PVM • Shared Memory – OpenMP • Remote Procedure Call – Java RMI, CORBA • Grid Organization – Condor, Globus
Message Passing Interface • De facto standard API for inter-process communication over distributed memory • Language-independent library with point-to-point and collective operations • MPI_Send/MPI_Recv • MPI_Bcast/MPI_Reduce • MPI_Scatter/MPI_Gather • MPI_Barrier • OpenMPI • Open source implementation in native C
Message Passing Interface • Creates a “Virtual Topology” of the computation environment with process ranks • Process Tree: MPI_Send(data, child) / MPI_Recv(data, parent) • Process Ring: MPI_Send(data, (myrank+1) % size) / MPI_Recv(data, (myrank-1) % size) • Master/Slave: MPI_Bcast(data, to slaves) / MPI_Reduce(data, to master) [Diagram: ranks R0–R3 arranged as a tree, a ring, and a master/slave star]
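As a concrete illustration of the ring topology above, here is a minimal, self-contained MPI program in C in which each rank passes a token to its successor (a sketch for orientation, not code from the thesis):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, token;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Rank 0 starts the token, then waits for it to come back around. */
        token = 42;
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, (rank - 1 + size) % size, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Token returned to rank 0: %d\n", token);
    } else {
        /* Every other rank receives from its predecessor and forwards. */
        MPI_Recv(&token, 1, MPI_INT, (rank - 1 + size) % size, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}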
FPGAs • Field-Programmable Gate Arrays • Devices whose function can be specified through a hardware description language (VHDL/Verilog) • Slower than custom ASICs but much more flexible • Large degree of fine-grained parallelism [Diagram: FPGA fabric of basic logic blocks, configurable I/O blocks, and programmable interconnects]
Configurable Logic Blocks • FPGAs realize computation over a network of CLBs • Each CLB contains: • Eight 6-input Look-Up Tables (LUTs) • Eight Flip-Flops • Control Multiplexors • Arithmetic and Carry Logic • LUTs are configured at programming time to implement any 6-input logical function • ‘A = B ⊕ C ⊕ D ⊕ E ⊕ F ⊕ G’ can be calculated in a single cycle, even over large operand lengths • Bit-Level Parallelism
Xilinx Virtex-5 FXT • Hybrid FPGA • Hardwired PowerPC Cores • 2–15k CLBs • Arithmetic DSP Slices • Ethernet MAC Units
FPGA Supercomputers • Cray XT5h • AMD Opteron and Xilinx Virtex-4 FPGAs on a single blade • Cray SeaStar2+™ Interconnect • Custom API for RPUs http://www.cray.com/Assets/PDF/products/xt/CrayXT5hBrochure.pdf • SRC-6/7 • Altera Stratix II FPGAs and Intel Xeon or AMD Opteron processors • SRC HI-BAR® Interface • Carte® Programming Environment http://www.srccomp.com/products/src7_hardware.asp
Framework Motivation • Motivating Concepts • FPGAs have great performance potential • Especially for applications with high bit-, instruction-, and data-level parallelism • Many FPGAs working together would allow even greater exploitation of parallelism • Increased data parallelism through multi-node partitioning • Task parallelism through independent heterogeneous nodes • FPGA cluster frameworks are currently limited to proprietary supercomputers • Commodity clusters will reduce the barrier to entry and promote FPGA integration and use • Occam’s Razor applies to programming environments: simple is good.
Framework Requirements • Easily Programmable • Common API for both parallel programming and hardware access • Modular hardware supported without modification • Hardware/Software Design Independent • Interface access independent of application and implementation • Minimal Framework Overhead • System should exhibit acceptable performance • Commodity technologies • Ethernet networking, Linux OS, open software • Scalable and Flexible • Additional FPGAs easily integrated • Heterogeneous nodes seamlessly supported • Extensible • Future improvements possible without harsh restrictions
Physical Organization • Ethernet Network • Hard-wired MAC on chip [Diagram: FPGA nodes, each hosting hardware (HW) units, connected over a 10/100/1000T Ethernet network; each node boots from Compact Flash and communicates through its on-chip Ethernet MAC]
Single-Node Software Environment • Embedded Linux OS • Root File System on Compact Flash (256 MB) • BusyBox Utilities • Minimal Libraries • OpenMPI • OpenSSH: certificate-based security for shell access • OpenSSL: TCP/IP security • Various support libraries (zlib, etc.) • Special Device Files in /dev • Hardware devices mapped to character devices • Accessed through standard fopen, fwrite, fread, and fclose calls • Dynamic major number reported in /proc/devices • FILE * AllocateResource(char * base_name) • Lock-based Arbitration
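A minimal sketch of how AllocateResource might look in user space, assuming one /dev node per hardware unit and flock-based arbitration (the helper name comes from the slide; the body here is illustrative, not the thesis implementation):

#include <stdio.h>
#include <sys/file.h>

/* Probe /dev for nodes named base_name0, base_name1, ... and return a
 * FILE * for the first one we can lock exclusively. */
FILE *AllocateResource(char *base_name) {
    char path[64];
    for (int i = 0; i < 16; i++) {            /* assume at most 16 units */
        snprintf(path, sizeof(path), "/dev/%s%d", base_name, i);
        FILE *dev = fopen(path, "r+");
        if (dev == NULL)
            continue;                          /* node does not exist */
        if (flock(fileno(dev), LOCK_EX | LOCK_NB) == 0)
            return dev;                        /* lock acquired: unit is ours */
        fclose(dev);                           /* unit busy: try the next one */
    }
    return NULL;                               /* no free unit found */
}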
Interaction Stack • Software Design • MPI Application / User Application: MPI_Init, MPI_Send, MPI_Recv, MPI_Finalize • Standard I/O: fopen, fwrite, fread, fclose • Kernel Driver: open, write, read, release; enables interrupts, resets FIFOs • Hardware Design • Constant Interface: PLB connection, Interrupt Controller, Write FIFO, Read FIFO, FIFO State Machines • Application-Specific Hardware: HW Unit
Driver Details • Xilinx provides a minimal set of drivers • SysAce CF, Ethernet MAC (MII/GMII), etc. • Custom FIFO Driver for HW abstraction • Boot time • Registers platform device and character driver • Maps physical memory address and IRQ to virtual space • Constructs address offsets for registers • Device Open • Resets hardware accelerator and FIFOs • Enables interrupts • Interrupt Handling • Read blocks until data is available in the Read FIFO • Hardware generates an interrupt when results arrive, waking the reader
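The blocking-read pattern above can be sketched as follows; this is an illustrative fragment of a 2.6-era Linux character driver (names such as des_read, des_irq_handler, fifo_base, and READ_FIFO_DATA are hypothetical, not the thesis driver):

#include <linux/fs.h>
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/uaccess.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(read_wq);
static volatile int data_ready;
static void __iomem *fifo_base;        /* mapped at probe time */
#define READ_FIFO_DATA 0x08            /* hypothetical register offset */

/* IRQ handler: the FIFO hardware raises this when results arrive. */
static irqreturn_t des_irq_handler(int irq, void *dev_id)
{
    data_ready = 1;
    wake_up_interruptible(&read_wq);   /* wake any blocked reader */
    return IRQ_HANDLED;
}

/* read(): block until the interrupt handler signals data, then copy a
 * word out of the memory-mapped Read FIFO to user space. */
static ssize_t des_read(struct file *f, char __user *buf,
                        size_t count, loff_t *off)
{
    u32 word;
    if (wait_event_interruptible(read_wq, data_ready))
        return -ERESTARTSYS;
    data_ready = 0;
    word = ioread32(fifo_base + READ_FIFO_DATA);
    if (copy_to_user(buf, &word, sizeof(word)))
        return -EFAULT;
    return sizeof(word);
}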
Virtual Organization • Each hardware unit can behave as an independent node • Hardware units can perform different functions, or similar functions at different speeds • Each FPGA can host as many units as fit in the device • Configurations with multiple hardware units per process, or other exotic setups, are also inherently supported [Diagram: four FPGAs, each with its own Ethernet MAC and Compact Flash, hosting processes P0–P9 across varying numbers of hardware units]
Application Case Study • The Data Encryption Standard (DES) is a widely used and longstanding block cipher • Due to insufficient cryptographic strength resulting from its 56-bit key, DES has been successfully broken by brute-force exhaustive key searches in the past decade • Past approaches: • Distributed computing (à la Folding@Home) • Custom DES ASICs in a cluster • A hybrid of the above two approaches • A custom system of 120 Spartan-IIe FPGAs
System Development Process • Independent HW/SW Design • Hardware • Algorithm Implementation • Search keys as fast as possible • Software • Partition Key Space • Coordinate Results [Diagram: parallel hardware-design and software-design flows]
Embedded Platform • Xilinx Base System Builder • Generates PowerPC, DDR, Compact Flash and Ethernet Interfaces
Hardware Device Creation • Xilinx Peripheral Creation Wizard • PLB Slave Interface • Read/Write FIFOs • Interrupt Controller • Software Reset • Software-Accessible Registers • State-machine-centric design • FIFO Access • Interface/Accelerator Interaction • Interrupt Generation
DES Hardware Implementation • Key Guesser top level • Contains 2^X DES encryption engines • 18-stage pipeline (16 rounds plus 1 input, 1 output) • Initialized with known plaintext and ciphertext • High 24 bits of the key expected as input • Each DES engine checks the lowest (32 − X) bits, with its assigned middle bits based on component number • If the key is found, return it; otherwise return zero after the key space is checked [Key layout: high 24 bits | middle X bits | low 32 − X bits]
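To make the partitioning concrete: with X = 3 (the eight engines shown on the next slide), each engine sweeps 2^29 low bits per work unit, and a full 56-bit candidate key is assembled as in this illustrative snippet (field and function names are assumptions, not thesis code):

#include <stdint.h>

#define X 3  /* log2(number of DES engines); 8 engines as on the next slide */

/* Assemble a 56-bit DES key from the three fields in the slide's layout:
 * high 24 bits (from software), middle X bits (engine number), and
 * low (32 - X) bits (the engine's local search counter). */
uint64_t assemble_key(uint32_t high24, uint32_t engine, uint64_t low)
{
    return ((uint64_t)high24 << 32)         /* bits 55..32 */
         | ((uint64_t)engine << (32 - X))   /* bits 31..(32 - X) */
         | low;                             /* bits (31 - X)..0  */
}

/* Each engine therefore searches 2^(32-X) = 2^29 keys per 24-bit
 * key-space indicator, while software sweeps the 2^24 high values. */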
FIFO Interface and State Machine [Diagram: the FIFO interface delivers the plaintext, ciphertext, and high 24 key bits to eight DES Encrypt engines (000–111); the engines report a correct guess or a search-complete signal along with the result key]
State Machine Design • Xilinx ISE • VHDL model of interface and DES guessing unit interaction • Simulate & synthesize • Timing and resource utilization [State flow: receive configuration via Read Req/Read Ack for each configuration parameter, repeated for the second key half → start searching and wait for guessing results while the key is not found and the search continues → return result or failure notification via Write Req/Write Ack → generate interrupt]
System Integration • XPS Interrupt Controller Connection • Processor Local Bus Connection (PLB) • Physical Address Assignment
Integrated Bus Structure [Diagram: the PowerPC, DDR RAM, Ethernet MAC, and Compact Flash controllers share the arbitrated Processor Local Bus (PLB) with hardware accelerator units HAU_0 … HAU_N, each wrapping its own user logic]
Device Tree Structure • The Linux kernel build targets a DTS file generated as a Xilinx library • The driver extracts memory addresses, IRQs, and name information • Example unit description:

plb_des_1: plb-des@c9c20000 {
    compatible = "xlnx,plb-des-1.00.a";
    interrupt-parent = <&xps_intc_0>;
    interrupts = < 1 2 >;
    reg = < 0xc9c20000 0x10000 >;
    xlnx,family = "virtex5";
    xlnx,include-dphase-timer = <0x1>;
};
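How a driver pulls those properties out at probe time can be sketched as follows. of_address_to_resource and irq_of_parse_and_map are the standard PowerPC/OF helpers; the surrounding function is illustrative, and header locations vary across kernel versions (this follows the modern layout):

#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/of.h>
#include <linux/of_address.h>
#include <linux/of_irq.h>
#include <linux/ioport.h>

/* Given the device node for plb_des_1, recover the register window and
 * IRQ that the DTS entry above describes. */
static int extract_unit_info(struct device_node *np)
{
    struct resource res;
    unsigned int irq;

    if (of_address_to_resource(np, 0, &res))   /* reg = <0xc9c20000 0x10000> */
        return -ENODEV;
    irq = irq_of_parse_and_map(np, 0);          /* interrupts = <1 2> */

    pr_info("%s: regs 0x%llx-0x%llx, irq %u\n", np->name,
            (unsigned long long)res.start, (unsigned long long)res.end, irq);
    return 0;
}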
Deployment • Linux Kernel • Build targeting specific platform and DTS • Include device driver for hardware accelerators • make ARCH=powerpc CONFIG_FIFO_DRIVER=y • Generates .ELF programming file • Bit-stream Generation • Synthesize, Map, Place, and Route Design • Generates .BIT configuration file • Create a System ACE file • Merges .BIT and .ELF into .ACE file • Place .ACE file onto compact flash and boot
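For reference, System ACE generation is typically driven through the genace.tcl script shipped with the Xilinx tools; an illustrative invocation is shown below (board name and file paths are placeholders, and exact options vary by EDK version):

xmd -tcl genace.tcl -jprog -board ml507 -target ppc_hw \
    -hw implementation/system.bit -elf zImage.elf -ace system.ace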
Application Development • Master-Slave Structure • Master coordinates a dynamic work queue • Slaves wait for work or a stop condition • Program Flow • Master: • Send 24-bit key space indicator (KSI) to each slave • Wait for a response: • If key found, break out, report results, and distribute stop conditions to all slaves • If key not found, send the next KSI to the slave • Slave: • Allocate and initialize hardware unit • Wait for work or a stop condition • If new work arrives, send the KSI to the hardware and send back the result [Diagram: master M distributing work to slaves S1 … Sn]
Pseudo-Code Structure

#include <stdio.h>
#include "mpi.h"
#include "fpga_util.h"

int main(int argc, char *argv[]) {
    FILE *my_dev;
    int rank, slave_rank;                 /* slave_rank: loop variable (loops elided) */
    int key_space_indicator, new_key_space_indicator, KSI;
    int result[3];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) { /* Master process */
        // For each slave
        MPI_Send(&key_space_indicator, 1, MPI_INT, slave_rank, 0, MPI_COMM_WORLD);
        // While work remains in queue and key not found
        MPI_Recv(result, 3, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&new_key_space_indicator, 1, MPI_INT, slave_rank, 0, MPI_COMM_WORLD);
        // Once key found
        printf("The answer is %d!\n", result[0]);
    } else { /* Slave processes */
        my_dev = AllocateResource("des_unit");
        setvbuf(my_dev, NULL, _IONBF, 0);
        // Until stop condition received
        MPI_Recv(&KSI, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        fwrite(&KSI, sizeof(int), 1, my_dev);
        fread(result, sizeof(int), 2, my_dev);
        MPI_Send(result, 3, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
Testbed • Four Virtex-5 devices • PowerPC 440 Processor @ 400 MHz • Two ML507 and two ML510 boards • 256/512 MB DDR2 • One Virtex-4 device • PowerPC 405 Processor @ 300 MHz • ML410 board • 256 MB DDR2 • 100 MHz Processor Local Bus (PLB) • RIT Ethernet Network • DES Resource Usage • ML510: 3 Units each, ML507: 2 Units each, ML410: 1 Unit • 11 Units total over 5 FPGAs
DES Application Scalability • A hardware DES unit can guess 8 keys/cycle * 100 MHz = 800 million keys/sec • 11 distributed hardware units => 8,800 M keys/sec ideally • Actual performance is 8,548.55 M keys/sec = 97.14% of ideal
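Putting that rate in perspective (a back-of-envelope calculation from the numbers above, not a figure from the thesis): an exhaustive sweep of the full 56-bit key space takes 2^56 keys / 8.549e9 keys/sec ≈ 8.43e6 sec ≈ 97.6 days, with roughly half that (about 49 days) expected on average before the key is found.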
Performance Comparison • DES search application developed for a cluster of 2.33 GHz Xeon processors with the same program structure • Single-node performance = 0.735 M keys/sec • Scales to 7.722 M keys/sec across 11 cores @ 95.4% efficiency
Matrix Multiplication • A standard test of computational capability is matrix multiplication: A * B = C • Highly data parallel • Each index of the result matrix can be computed independently: C[i][j] is the dot product of row i of A with column j of B • Hardware Design • Store multiple rows of A and compute a running dot product for each row when receiving a column of B • Software Design • Statically partition work across available FPGAs and aggregate results
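A minimal sketch of the static software partitioning, assuming an N x N problem that divides evenly among ranks (illustrative only; the thesis offloads the inner products to the FPGA hardware rather than computing them on the CPU as done here):

#include <stdlib.h>
#include <mpi.h>

#define N 512  /* assumed matrix dimension, divisible by the rank count */

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                                  /* static, even split */
    float *A = malloc((size_t)rows * N * sizeof(float));  /* my rows of A */
    float *B = malloc((size_t)N * N * sizeof(float));     /* all of B */
    float *C = malloc((size_t)rows * N * sizeof(float));  /* my rows of C */
    float *fullA = rank ? NULL : malloc((size_t)N * N * sizeof(float));
    float *fullC = rank ? NULL : malloc((size_t)N * N * sizeof(float));

    /* (Initialization of fullA and B on rank 0 elided.) */
    MPI_Bcast(B, N * N, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Scatter(fullA, rows * N, MPI_FLOAT, A, rows * N, MPI_FLOAT,
                0, MPI_COMM_WORLD);

    for (int i = 0; i < rows; i++)          /* local block of C = A * B */
        for (int j = 0; j < N; j++) {
            float dot = 0.0f;
            for (int k = 0; k < N; k++)
                dot += A[i * N + k] * B[k * N + j];
            C[i * N + j] = dot;
        }

    /* Aggregate the row blocks of C back on the master. */
    MPI_Gather(C, rows * N, MPI_FLOAT, fullC, rows * N, MPI_FLOAT,
               0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}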
Single Node Results • FPGAs fare poorly in comparison to the Xeon GPP • The largest FPGA, with the most available hardware, is comparable to a single core, but at poor price/performance • With greater concurrency, the FPGA should have performed better. Why didn't it?
Analysis • Bandwidth is the limiting factor • Data is communicated word by word over a 32-bit arbitrated bus (PLB) • Processor-independent DMA is required for improved performance
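A rough upper bound, derived from the testbed parameters (32-bit PLB at 100 MHz), makes the point: 32 bits/cycle * 100 MHz = 400 MB/s theoretical peak. Under word-by-word programmed I/O, each transfer costs several bus cycles plus PowerPC instructions, so sustained throughput falls well below this peak, while a pipelined multiply-accumulate array could consume a new 32-bit operand every cycle. This is a back-of-envelope estimate, not a measurement from the thesis.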
Scalability Results • Problem scales well across Virtex-5 devices • 83% efficiency across 4 FPGAs • Adding the Virtex-4 node causes worse overall performance • With static partitioning, the slower Virtex-4 receives an equal share of the work and becomes the bottleneck
Conclusions • Scalable framework for clustering of FPGAs developed and demonstrated • Flexible application development with decoupled HW/SW co-design • Standard MPI programming interface and simple hardware interaction abstractions/APIs • Commodity hardware technologies and open software allow a low barrier to entry for further research and development • Application Case Studies • Well-suited applications like cryptanalysis perform admirably • Performance improvements of > 1100x • Price/performance improvements of > 370x • Current bandwidth limitations hold back other applications
Future Work • Framework Infrastructure • Hardware Communication • DMA data transfer from PowerPC to HW • Fast HW->HW Interconnection on single FPGA • Dynamic Reconfiguration of Hardware Accelerators • Cluster Management • Job Submission and Resource Matching with Condor or similar • Monitoring with Ganglia or similar • Robustness • Fault Tolerance • Correct HW Usage Enforcement • New Applications • Performance study with more inter-process communication • Image Processing could be a good place to start • Neural Simulation Platform (Dmitri Yudanov, CE Dept) • Expressed interest from EE and Astrophysics departments • Design Flow Improvements • Integrated tool chain and improved deployment procedure
Questions? • Thank you for listening

Contact: Jeremy Espenshade, jke2553@rit.edu
RIT Computer Engineering Hardware Design Lab
Primary Advisor: Dr. Marcin Lukowiak

References:
Espenshade, Jeremy. Scalable Framework for Heterogeneous Clustering of Commodity FPGAs. Master's thesis, Rochester Institute of Technology, 2009.
Cray Inc. Cray XD1 Supercomputer Outscores Competition in HPC Challenge Benchmark Tests. Business Wire, Feb 15, 2005. http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsArticle&ID=674199&highlight=.
Tarek El-Ghazawi, Esam El-Araby, Miaoqing Huang, Kris Gaj, Volodymyr Kindratenko, and Duncan Buell. The Promise of High-Performance Reconfigurable Computing. IEEE Computer Magazine, 41(2):69–76, 2008.
Xilinx Corp. Virtex-5 Multi-Platform FPGAs, 2009. http://www.xilinx.com/products/virtex5/.