Novo-G: Adaptively Custom Reconfigurable Supercomputer
Dr. Alan D. George, Professor of ECE, University of Florida
Dr. Herman Lam, Assoc. Professor of ECE, University of Florida
Abhijeet Lawande and Carlo Pascoe, Research Assistants, CHREC, University of Florida
December 3-4, 2009
High Performance Computing
• Uses supercomputers / distributed computers to solve advanced computation problems
• Where?
  • Computational Fluid Dynamics
  • Astrophysical Simulations
  • Climate Modeling
  • …
• How Big?
  • 100s of nodes, 1000s of processors
HPC Marketplace
• HPC practitioners often more reactive than proactive
  • Understandably conservative, risk-averse
  • Looking for quick fixes (not always best approach for long-term)
• Accelerators (e.g. GPU, Cell) popular @ SC09
  • But these consume more (energy) to get more (performance)
  • Performance promising for subset of apps (on fixed-logic spectrum)
  • Productivity a significant challenge (common in Age of Parallelism)
  • Sustainability a major concern (single devices approaching 300W!)
• But better solutions born from better methods
  • Goal: high performance, productivity, & sustainability
  • Change in paradigm, mindset, approach
  • “Every generation needs a new revolution” – Jefferson
• Smarter device and system architectures
  • Adaptive hardware parallelism, more (performance) with less (energy)
  • Better models & solutions apply more broadly than only HPC
Reconfigurable Computing
• Why RC?
  • Performance (parallelism)
  • Power
  • Price
• So what’s the problem?
  • New computing model: revolutionary, potent, complex
  • Adaptive hardware offers many challenges & opportunities
  • Still relatively new and immature field
  • Many open R&D issues, from prog. model to device arch.
Novo-G Concept
Goals
• Investigate, develop, evaluate, & showcase:
  • Most powerful RC machine ever fielded for research
  • Innovative suite of productivity tools for app development
  • Impactful set of scalable kernels/apps in key science areas
• Project & machine name: Novo-G
  • “Novo” is Latin: "to make anew, refresh, revive, change, alter," essence of RC
  • “G” for Genesis (first of a series of Novo machines) or Green
• Focus on experimental research challenges of RC spanning HPC to HPEC
Motivations
• Design productivity is foremost need/challenge for widespread use of RC
• Challenges accentuated as scale increases (devices, systems, apps)
• Powerful experimental testbed to support R&D addressing these challenges
Emphases
• Performance (system), Productivity (concepts/tools), Impact (apps)
Novo-G Machine
• Cluster of 24+1 servers (compute + head node)
• 96 Altera Stratix-III E260 FPGAs for app acceleration
  • Each w/ 768 18x18 multipliers, 254K logic elements, 204K registers, power <20W
  • e.g. per E260: 768 integer, 192 SPFP, or 85 DPFP multipliers @ ~300MHz (Altera FPC)
• FPGAs housed in four quad-FPGA PCIe x8 GiDEL boards
  • Embedded-style boards; supports both HPEC- & HPC-oriented research
  • 4.25GB memory attached to each app FPGA, 576GB total RAM in Novo-G
• 24 boards housed in 24 Linux compute servers + head node
  • 20Gb/s non-blocking DDR InfiniBand; Gigabit Ethernet
  • 26 (24+2) quad-core 2.26GHz Intel Nehalem Xeon processors w/ QPI
• Funded by U. Florida w/ generous help from Altera & GiDEL
• UPDATE: Novo-G will soon double in RC capacity, growing to 192 top-end FPGAs in 48 quad-FPGA boards
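The 576GB total can be cross-checked from the per-device figures quoted in this deck. The sketch below is only a back-of-the-envelope check: the split into FPGA-attached memory, compute-server DDR3, and head-node DDR3 is an assumption about how the total was counted, using the 6GB and 24GB server figures given on the following slides.

```python
# Back-of-the-envelope check of the "576GB total RAM" figure.
# All numbers come from the slides; the three-term breakdown is an assumption.
APP_FPGAS = 96
GB_PER_FPGA = 4.25            # 2 x 2GB DDR2 SODIMM + 256MB on-board DDR2
COMPUTE_SERVERS = 24
GB_PER_COMPUTE_SERVER = 6     # DDR3 per compute server
HEAD_NODE_GB = 24             # DDR3 in the head node

fpga_memory_gb = APP_FPGAS * GB_PER_FPGA
host_memory_gb = COMPUTE_SERVERS * GB_PER_COMPUTE_SERVER + HEAD_NODE_GB
total_gb = fpga_memory_gb + host_memory_gb

print(f"FPGA-attached: {fpga_memory_gb} GB, host: {host_memory_gb} GB, total: {total_gb} GB")
```

Under this breakdown the FPGA-attached memory dominates the total, which matches the deck's emphasis on the accelerator side of the machine.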
Novo-G Machine
• 1 head-node server with:
  • 1U rackmount chassis
  • Dual Xeon E5520 quad-core CPUs @ 2.26 GHz, 4MB Cache, 5.86 GT/s QPI
  • 24GB ECC DDR3, 1333 MHz
  • Integrated dual-GigE ports & video
  • ICH10R controller for 6 SATA drives
  • 3 x 1TB Enterprise SATA2 drives
• 24 compute servers, each with:
  • 4U rackmount chassis with 645W P/S
  • Intel Xeon E5520 quad-core CPU
  • 6GB ECC DDR3, 1333 MHz
  • Integrated dual-GigE ports & video
  • 2 GiDEL ProcStar-III PCIe x8 cards
  • Mellanox DDR InfiniBand PCIe card
  • 250GB SATA2 drive
• KVM/LCD unit for head node
• Not visible (IB & GigE switches, PDUs)
* Our cluster vendor is Ace Computers
Novo-G ProcStar-III Board (one of 48)
GiDEL ProcStar-III Board
• Board dimensions (figure labels): 312 mm × 120.8 mm
• PCIe x8 interface (4GB/s)
• 2×2GB = 4GB DDR2 RAM per FPGA, plus 256MB DDR2 per FPGA
• 25.6 GB/s inter-FPGA bandwidth; 110 bi-directional lines on each of the three inter-FPGA links
• JTAG for SignalTap debug
• Typical frequencies: 100-325MHz
• DMA channels: 32
• DDR2 module slots: 8
Altera Stratix-III E260 FPGA
• 254,400 Logic Elements
• 768 multipliers (18×18)
• 14,688Kbits of embedded memory
• 50% less power than Stratix-II
• 65nm technology
Novo-G Memory & Connectivity
[Figure: memory and connectivity diagram. Labels: head node (24 GB DDR3); compute nodes (6 GB DDR3 each); GigE and InfiniBand between nodes; PCI-Express x8 from compute node to board; FPGA1 through FPGA4, each with 2x 2GB DDR2 SODIMM and 256 MB DDR2; memory bus (667MHz); main bus.]
Novo-G Energy (each of 24 servers)
• Smith-Waterman application
• Quad-core E5520 Xeon CPU
• 2 GiDEL ProcStar-III boards
• 8 Stratix-III E260 FPGAs total
• 40GB (17×2+6) DDR2/3 RAM
After capacity doubled, total power of Novo-G @ max. load: 8kW
Novo-G Tools
• Commercial and open-source tools
  • Digital design tools: Altera, GiDEL, Aldec, Synopsys
  • Cores and libraries: Altera, GiDEL, et al.
  • High-level device design: Altera FP Compiler, Impulse-C, Mitrion-C, LabVIEW (2010)
  • High-level system design: MPI, UPC, SHMEM
  • Additional options in review (ROCCC, et al.)
• Variety of CHREC tools being ported & used for Novo-G
  • Strategic design & prediction: RCML, RCSE, RAT, CMD
  • High-level system design: SHMEM+, SCF
  • Hardware virtualization for fast PAR: IFET
  • App verification & performance analysis: ReCAP
  • Proposed OpenCL over CHREC-IF
  • Assorted kernel & app cores
[Figure: industry partner logos; RCML]
Impulse-C Platform Support Package
• Impulse-C
  • Allows software written in the Impulse-C programming language to run in Novo-G FPGAs
  • H/W – S/W partitioning approach
  • S/W processes compiled to executable using GCC
  • H/W processes converted to synthesizable VHDL/Verilog
• Platform Support Package (PSP)
  • Provides interface between Impulse-C generated H/W and S/W customized for Impulse-C application
  • Currently supports streams and registers
• Future Work:
  • Provide support for shared memory
  • Extend PSP to support multi-FPGA system
• Impulse-C apps on Novo-G
  • Smith-Waterman
  • Back-projection
  • European Option Pricing
[Figure: Novo-G node. Labels: Impulse-C code; Generate S/W (CPU S/W application, Impulse-C API, PSP); PCIe x8; Generate H/W (Impulse-generated H/W with streams and registers); Stratix-III E260 @ 125MHz.]
Impulse-C PSP
[Figure: build flow for H/W - S/W partitioned Impulse-C code.
Hardware side labels: Gidel - Impulse Interface (VHDL), from PSP; H/W process code (VHDL); ProcWizard project (PCAF); Quartus II Synthesis; Gidel wrapper; Bitfile (rbf).
Software side labels: S/W interface code (C/C++), from PSP; S/W process code (C); Header file; GCC Compile; Compiled API; GCC Compile; Executable.]
Mitrionics Virtual Processor (MVP)
• MVP
  • Provides abstraction layer between software and FPGA hardware
  • Allows software written in the Mitrion-C programming language to run in Novo-G FPGAs
  • Has unique architecture that adapts hardware in FPGAs to each program to maximize its performance
• Mitrion on Novo-G
  • Operational:
    • Hardware interface
    • Mithal API support for GiDEL
  • Currently working on:
    • Performance optimization
    • Additional functionality
  • Future work:
    • Expand API support for multiple FPGAs on single Novo-G node
    • Support for all 24 nodes and 192 FPGAs
    • Massively parallel
• Mitrion-C apps on Novo-G
  • AES app for SC09
    • Fully pipelined
    • Fully unrolled
    • Full performance to the theoretical limit of bandwidth
  • Planned apps / app areas
    • Bioinformatics
    • Information retrieval and search engines
    • Database acceleration
[Figure: Novo-G node. Labels: E5520 Nehalem quad-core Xeon; Mitrion Accelerated Application; Mitrion Host Abstraction Layer; PCIe x8; 128-bit I/O streams; 2GB Mem (×2); Hardware Interface; MVP; 4 64-bit In-Regs; 4 64-bit Out-Regs; Stratix-III E260 @ 125MHz.]
Sequence Alignment in Bioinformatics
Smith-Waterman (S-W) is an algorithm used to compute the optimal local sequence alignment of two character strings. Needleman-Wunsch (N-W) computes the optimal global sequence alignment. In biology, alignments are performed in search of sequence similarities under the assumption that they imply functional, structural, or evolutionary relationships between sequences and their sources. Contemporary implementations of optimal sequence alignment (whether global, local, or anything in between) are based on a computation-intensive dynamic programming algorithm that breaks down the process of alignment into a set of recursive computations.
Sequence Alignment in Bioinformatics
• Algorithms involve calculation of optimal alignment for all possible subsequences, then choosing the final sequence alignment from the set of sub-alignments.
• Equivalent to populating a score matrix and selecting the appropriate cell based on the type of alignment desired
Example of local alignment (S-W)
  Query Sequence = “ACGTATGC”
  Database Sequence = “ACGAACCCTTGC”
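To make the score-matrix view concrete, here is a minimal software sketch of Smith-Waterman applied to the two example sequences. The scoring values (match +2, mismatch -1, linear gap -1) are illustrative assumptions; the slide does not state which scoring scheme was used.

```python
def smith_waterman(query, database, match=2, mismatch=-1, gap=-1):
    """Fill the local-alignment score matrix and return (best_score, matrix).

    Scoring parameters are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(query) + 1, len(database) + 1
    # Row 0 and column 0 stay at zero: a local alignment may start anywhere.
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if query[i - 1] == database[j - 1] else mismatch
            score[i][j] = max(
                0,                                   # restart the alignment here
                score[i - 1][j - 1] + substitution,  # align the two characters
                score[i - 1][j] + gap,               # gap in the database sequence
                score[i][j - 1] + gap,               # gap in the query sequence
            )
            best = max(best, score[i][j])
    return best, score


best, matrix = smith_waterman("ACGTATGC", "ACGAACCCTTGC")
print("best local score:", best)
```

For local alignment the "appropriate cell" is the maximum anywhere in the matrix; for a global alignment it would be the bottom-right cell instead.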
Sequence Alignment in Bioinformatics
• For two sequences of length A and B, optimum alignment requires the calculation of A∙B scores, with serial implementations operating in O(A∙B) time and O(min{A,B}) space complexity.
• As the amount of sequence data grows exponentially, the need for faster sequence alignment has fuelled the development of hardware accelerators.
Hardware Approach: [figure]
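A common hardware approach, and the one named on the S-W slide in this deck, is a systolic array: every cell on the same anti-diagonal of the score matrix depends only on the two previous anti-diagonals, so one processing element per query character can update a whole anti-diagonal per step. The sketch below simulates that wavefront order in software. It is an illustration of the dependency structure under assumed scoring values, not the Novo-G design itself.

```python
def smith_waterman_wavefront(query, database, match=2, mismatch=-1, gap=-1):
    """Compute the S-W best score in anti-diagonal (wavefront) order.

    Returns (best_score, steps), where steps counts anti-diagonals, i.e. the
    number of time steps a systolic array with len(query) processing elements
    would need. Scoring values are illustrative assumptions.
    """
    rows, cols = len(query), len(database)
    score = [[0] * (cols + 1) for _ in range(rows + 1)]
    best = 0
    steps = 0
    for d in range(rows + cols - 1):
        # All cells with i + j == d are independent of each other:
        # in hardware, each would be handled by a different PE in the same clock.
        i_low = max(0, d - cols + 1)
        i_high = min(rows - 1, d)
        for i in range(i_low, i_high + 1):
            j = d - i
            substitution = match if query[i] == database[j] else mismatch
            score[i + 1][j + 1] = max(
                0,
                score[i][j] + substitution,  # from anti-diagonal d - 2
                score[i][j + 1] + gap,       # from anti-diagonal d - 1
                score[i + 1][j] + gap,       # from anti-diagonal d - 1
            )
            best = max(best, score[i + 1][j + 1])
        steps += 1
    return best, steps


best, steps = smith_waterman_wavefront("ACGTATGC", "ACGAACCCTTGC")
print("best:", best, "wavefront steps:", steps, "serial cell updates:", 8 * 12)
```

The step count is A + B - 1 rather than A∙B, which is where the parallel speedup comes from when enough processing elements are available.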
Novo-G Apps: Smith-Waterman (S-W)
• First completed app: S-W kernel for use in bio apps
  • Locally/optimally align DNA, RNA, or protein sequences
  • Identify regions of similarity; dominant & vital app in comp. biology
  • Optimal alignment ideal but often replaced with much faster heuristics
• Design: systolic array spanning 4 FPGAs per board
  • 512 PE/FPGA, 2048 PE/board, 1 board/server, 125MHz, see app-note
• Execution times for 34MB chromosome sequence aligned with 16K 128-character sequences
  • Novo-G achieves in ~12 seconds what takes a fast CPU core nearly 9 days!
• Speed of S-W on Novo-G comparable to two largest machines on NSF TeraGrid
  • After our 2x upgrade, as fast as both combined!
  • Yet, Novo-G is 100s of times lower in energy, cooling, cost, size, weight, etc. than TeraGrid
• Future Plans:
  • Use S-W kernel in SHRiMP application as replacement for BLAST heuristic
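The quoted timings can be sanity-checked against the array size. The sketch below is a rough estimate only and rests on several assumptions not stated on the slide: one character per byte with MB = 10^6, "16K" meaning 16 × 1024 sequences, all 24 servers participating in the ~12 second run, and each PE completing one cell update per clock.

```python
# Rough throughput check for the S-W figures on this slide.
# Every constant marked "assumed" is an interpretation, not a stated fact.
DB_CHARS = 34_000_000          # "34MB chromosome" (assumed 1 byte/char, MB = 10**6)
NUM_QUERIES = 16 * 1024        # "16K" sequences (assumed K = 1024)
QUERY_LEN = 128                # characters per query
NOVO_G_SECONDS = 12            # "~12 seconds"
CPU_SECONDS = 9 * 24 * 3600    # "nearly 9 days"

PES_PER_BOARD = 2048
BOARDS = 24                    # 1 board/server x 24 servers (assumed all in use)
CLOCK_HZ = 125_000_000

cell_updates = DB_CHARS * NUM_QUERIES * QUERY_LEN
achieved_cups = cell_updates / NOVO_G_SECONDS      # cell updates per second
peak_cups = PES_PER_BOARD * BOARDS * CLOCK_HZ      # assumed 1 cell/PE/clock
speedup = CPU_SECONDS / NOVO_G_SECONDS
utilization = achieved_cups / peak_cups

print(f"speedup over one CPU core: ~{speedup:,.0f}x")
print(f"achieved: {achieved_cups:.2e} CUPS, assumed peak: {peak_cups:.2e} CUPS")
print(f"implied array utilization: {utilization:.0%}")
```

Under these assumptions the implied throughput sits close to the assumed peak of the array, which is consistent with a fully streaming systolic design, but the exact figure depends on the rounded inputs.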
Novo-G Apps: Needleman-Wunsch (N-W)
• N-W kernel for use in ICBR’s ESPRIT application
  • Globally/optimally align DNA sequences, then compute edit distance
  • Edit distances used to group sequences into operational taxonomic units (OTUs)
  • OTUs grouped into tree; tree represents species richness and taxonomy
• Design: systolic array of PEs with I/O FIFOs for streaming
  • Current design: 250 PEs/FPGA, 1000 PEs/board, 2 boards/server, 125MHz
  • Design consumes only 68% of chip; number of PEs will be increased
  • Compared to S-W, overall design simpler in terms of control signals, but N-W PEs vastly more complex
  • Uses special encoding scheme that allows N-W to approach S-W performance
• USED IN ACTUAL METAGENOMICS RESEARCH!
• Future Plans:
  • Fully integrate N-W kernel into ESPRIT and create web app for use by scientific community
Execution times for distance calculation of 16,777,216 pairs of length 250. Note: red cells are extrapolated values, obtainable with larger data sets.
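As a software reference for what each pair computation involves, here is a minimal sketch of a global alignment with unit costs, which yields the classic edit (Levenshtein) distance. The unit cost model is an assumption; ESPRIT's actual distance definition and the kernel's encoding scheme are not described on the slide and may differ.

```python
def edit_distance(a, b):
    """Edit distance via global (Needleman-Wunsch style) dynamic programming.

    Uses unit costs for substitution, insertion, and deletion (an assumption)
    and keeps only two rows, so memory is O(min(len(a), len(b))).
    """
    if len(a) < len(b):
        a, b = b, a                      # make b the shorter sequence
    previous = list(range(len(b) + 1))   # aligning a prefix of b to the empty string
    for i in range(1, len(a) + 1):
        current = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            current[j] = min(
                previous[j - 1] + substitution,  # match or substitute
                previous[j] + 1,                 # delete from a
                current[j - 1] + 1,              # insert into a
            )
        previous = current
    # Global alignment: the answer is the bottom-right cell, not the maximum.
    return previous[len(b)]


print(edit_distance("ACGTATGC", "ACGAACCCTTGC"))
```

Unlike the local S-W case, the boundary row and column are initialized with growing gap costs and the result is read from the final cell, which is the main structural difference between the two kernels.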
Novo-G Apps: Real-Time Adaptive Filtering
• Use of ITL optimization in real-time adaptive filtering
  • Filter weights change with every sample through feedback by optimizing value of cost function
  • Current filters minimize mean squared error (MSE) cost functions; ITL cost functions minimize error entropy (EE), yielding better results
  • Minimizing EE equivalent to calculating gradient of information potential (IP): [equation]
• Increasing IP window size (i, j) results in smoother and faster convergence, but computation increases as O(n^2)
• Novo-G implementation advantages
  • S/W implementation (MATLAB) can't operate in RT for large window sizes
  • All summed exponential terms independent; H/W can compute in parallel
  • First design iteration can compute window size up to (50, 50) on 1 FPGA in a single clock cycle
  • Clock rate/speedup limited by sample frequency for RT filtering
Sample frequencies based on simulated time to calculate a single weight (computation of IP). Does not include FPGA transfer time.
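For orientation, the sketch below computes the quadratic information potential in the form commonly given in the information-theoretic learning (ITL) literature: a double sum of Gaussian kernels over all pairs of error samples in the window. The slide's own equation is not reproduced in this text, so this form, the kernel width `sigma`, and the function names are assumptions and may differ from the authors' formulation.

```python
import math


def gaussian_kernel(x, sigma):
    """Gaussian density with standard deviation sigma, evaluated at x."""
    return math.exp(-(x * x) / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def information_potential(errors, sigma=1.0):
    """Quadratic information potential of a window of error samples.

    Assumed textbook form: V = (1/N^2) * sum_i sum_j G(e_i - e_j), with the
    kernel width widened to sigma * sqrt(2). A larger V corresponds to a lower
    (Renyi quadratic) error entropy.
    """
    n = len(errors)
    width = sigma * math.sqrt(2.0)
    total = 0.0
    for e_i in errors:          # N terms
        for e_j in errors:      # N terms each: O(N^2) independent kernel evaluations
            total += gaussian_kernel(e_i - e_j, width)
    return total / (n * n)


tight = information_potential([0.10, 0.12, 0.09, 0.11])
spread = information_potential([-2.0, 0.5, 1.7, -0.9])
print(tight > spread)  # concentrated errors give a higher information potential
```

Every kernel evaluation in the double sum is independent of the others, which is the property the slide relies on when it says the summed exponential terms can be computed in parallel in hardware.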
Novo-G Apps: Filtered Back-Projection (FBP)
• BP for use in CT image reconstruction
  • 2D object is reconstructed from several 1-D projections
  • Projections obtained by bombarding object with X-ray beam from multiple angles
  • Each pixel on projected image represents total absorption of X-ray along path from source to detector
  • Mathematically, transformation from projection space into Cartesian coordinates
• Design: 512 pipelined processing engines per FPGA
  • Embarrassingly parallel w.r.t. computation of each pixel as well as projections for each
  • Processing engines iterate over all pixels and compute partial sum; final image formed in software
  • Software time complexity is O(n^3); hardware design reduces complexity to O(n^2)
  • H/W implementation uses 16-bit fixed-point arithmetic; results are visually indistinguishable from DPFP
• Design implemented in both Impulse-C and VHDL to compare performance and productivity
  • Performance loss of 1.29x but estimated productivity gain considerably greater
+ S/W baseline: C code executed with fixed point on Intel E5520 Nehalem quad-core Xeon @ 2.26GHz
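The triple loop behind the O(n^3) software cost can be sketched as follows. This is a simplified illustration only: it assumes parallel-beam geometry, nearest-neighbour detector lookup, and floating-point arithmetic, and it omits the filtering step; the actual Novo-G design uses 16-bit fixed point and its geometry is not given on the slide.

```python
import math


def back_project(projections, angles, n):
    """Unfiltered parallel-beam back-projection onto an n x n image.

    projections[k] is the 1-D detector reading taken at angles[k] (radians).
    Geometry and interpolation choices here are illustrative assumptions.
    """
    n_detectors = len(projections[0])
    image_centre = (n - 1) / 2.0
    detector_centre = (n_detectors - 1) / 2.0
    image = [[0.0] * n for _ in range(n)]
    for projection, theta in zip(projections, angles):   # one pass per projection
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        for row in range(n):                             # n rows
            y = row - image_centre
            for col in range(n):                         # n columns
                x = col - image_centre
                # Signed distance of this pixel from the centre ray at angle theta.
                t = x * cos_t + y * sin_t
                k = int(round(t + detector_centre))
                if 0 <= k < n_detectors:
                    image[row][col] += projection[k]     # accumulate partial sum
    return image


angles = [0.0, math.pi / 4, math.pi / 2]
flat = back_project([[1.0] * 8 for _ in angles], angles, 4)
print(flat[0])
```

Each pixel's accumulation is independent of every other pixel's, so the two inner loops are the natural place to spread work across parallel processing engines, leaving only the loop over projections to run sequentially.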