Reconfigurable Application Specific Computing
Steve Modica, RASC Product Manager
SGI Proprietary
Altix 350
[Block diagram; labels listed below]
• Memory: 4 channels of SDRAM (modules labeled PC2100 DDR SDRAM and PC2700 DDR SDRAM), 10.8 – 12.8 GB/s
• Processors: two Itanium2 on the front side bus, 6.4 GB/s
• SHUB: 2 channels of NUMAlink4, 12.8 GB/s
• PIC: 4 slots / 2 PCI-X busses, 2 GB/s PCI-X
• Base I/O: Ethernet, SCSI, disk
SGI Altix™ 3700 Bx2 Platform Introduction: CR-Brick Components
[Diagram: CR-Brick built from IP57 node boards. Labels per node board: two processors (P, Intel Madison 9M), an ASIC (A, Shub1.2), DDR1 SDRAM, and I/O. Other labels: routers, NL4 network, 6.4 GB/s, 6.4 GB/s full duplex, 2.4 GB/s full duplex.]
SGI Confidential
SGI Altix™ 3700 Bx2 Platform Introduction: Building Blocks
• CR-brick: CPU and memory (Itanium® 2)
• M-brick: Memory
• R-brick: Router interconnect
• IX-brick: Base I/O module
• PA-brick, PX-brick: PCI-X expansion
• D-brick2: Disk expansion
• SGI® Advanced Linux Environment with SGI ProPack
SGI Altix™ 3700 Bx2 Platform Introduction: System Topology Example
[Diagram: system topology showing Router Plane 1 and Router Plane 2.]
Reconfigurable Application Specific Computing: Accelerating Interaction
Speed up interactive analysis and modeling
• CPUs are often the bottleneck in computations
• Goal is to insert faster elements
Style 1: Traditional FPGAs
• Work with traditional FPGAs in PCI / PCI-X slots
• Nallatech, Clearspeed, Annapolis Micro et al.
• Development environments relatively advanced
• All driving to the same goal of “write in C, run on FPGA”
• Leverages other industry efforts
• Cray, PCs, Clusters
Style 2: Tightly coupled
• Athena: FPGA + memory for computation at high bandwidth
• Daytona: FPGA + spigots for fast network
• Both being prototyped by a few customers
Access to data is critical
Memory bandwidth is the key to success
[Diagram labels: Compute, IO, Graphics, Specialist Elements]
The 3 Single-Paradigm Architectures
• App-Specific: Graphics (GPU), Signals (DSP), Programmable (FPGA), other ASICs
• Scalar: Intel Itanium, SGI MIPS, IBM Power, Sun SPARC, HP PA
• Vector: Cray X1, NEC SX
Paradigms to Applications
[Chart: compute intensity (low to high) against data locality (low to high), with regions labeled Application-specific (twice), Scalar, and Vector.]
Architectural Challenges
• Ease of Use
  • Languages
  • Compilers
  • Debuggers
  • APIs
• Performance
  • Bandwidth to/from System
  • Scalability
Ease of Use
• Leverage 3rd Party Std Language Tools
  • Celoxica, Impulse Acceleration, Mitrion, Viva
  • In discussions with other HLL tool vendors
• Developed an FPGA aware version of GDB
  • Capable of debugging the FPGA and System Software
  • Capable of multiple CPUs and multiple FPGAs
• Developed RASC Abstraction Layer (RASCAL)
• Provide for HDL modules
  • Integrated environment with debugger
  • Highest performance
Contrasting ISVs
[Diagram: tools arranged along an axis running from Software to Hardware: Mitrion C, VIVA, Impulse-C, Handel-C, HDL.]
Ease of Use v. Efficiency
[Chart: efficiency (low to high) against ease of use (easy to difficult). Points are marked x; labeled points: VHDL, Verilog.]
ISV Features
• Handel-C
  • Runs on Windows only
  • Plans to port to Linux in June of 2005
  • Most efficient procedural language
• Starbridge VIVA
  • Extremely easy to learn, graphical, object-oriented
  • Develop on Windows only, execute anywhere
  • Easiest language to program, creates very efficient cores
  • Large library of packaged algorithm primitives
• Mitrion C
  • Runs natively on Altix
  • Utilizes a processor abstraction
  • Most useful debugging environment
• Impulse-C
  • Runs on Windows
  • Highly optimized for streaming applications
  • Fastest language to port legacy C code
Bitstream Generation using High Level Language Tools
[Flow diagram; stages listed below]
• HLL Design Entry (Handel-C, Impulse C, Mitrion C, Viva)
• RTL Generation and Integration with Core Services
• Design Verification: Behavioral Simulation (VCS, Modelsim)
• Design Synthesis (Synplify Pro, Amplify)
• Design Implementation (ISE)
• Static Timing Analysis (ISE Timing Analyzer)
• Metadata Processing (Python)
• Device Programming (RASC Abstraction Layer, Device Manager, Device Driver)
• Real-time Verification (gdb)
File types on the arrows: .c, .v, .vhd, .edf, .ncd, .pcf, .cfg, .bin
Hosts: Altix, IA-32 Linux machine
FPGA Aware Debugger
• Based on the open source GNU Debugger (GDB)
• Uses extensions to the current command set
• Can debug host application and FPGA
• Provides notification when the FPGA starts or stops
• Supplies information on FPGA characteristics
• Can “single-step” or “run N steps” of the algorithm
• Can HLL line step / step per C-line source
• Dumps data regarding the set of “registers” that are visible when the FPGA is active
Optimal Debugging Environment
Algorithm.c:
    tmp = a | b;
    d = tmp & c;
Debugger running in real time:
    (gdb) fpgastep
    (gdb) p/x $a
    $6 = 0x444433
    (gdb) p/x $b
    $7 = 0x111122
    (gdb) p/x $tmp
    $8 = 0x555533
    (gdb) fpgastep
    (gdb) p/x $tmp
    $9 = 0x555533
    (gdb) p/x $c
    $10 = 0x331222
    (gdb) p/x $d
    $11 = 0x111022
[Diagram: COP FPGA combining a and b into tmp, then tmp and c into d.]
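The register values printed in the debugger session can be checked by hand. The sketch below is not part of the original deck; it recomputes the two steps in Python using the values shown for `$a`, `$b`, and `$c`. The printed `$tmp` and `$d` correspond to an OR followed by an AND; applying AND first and OR second to the same inputs would give different values.

```python
# Inputs as printed by the (gdb) session on the slide.
a = 0x444433
b = 0x111122
c = 0x331222

# Step 1: combine a and b (bitwise OR reproduces the printed $tmp).
tmp = a | b
# Step 2: combine tmp and c (bitwise AND reproduces the printed $d).
d = tmp & c

# For comparison: the result of AND first, OR second on the same inputs.
alt_tmp = a & b
alt_d = alt_tmp | c

print(f"tmp = {tmp:#x}, d = {d:#x}")
print(f"AND-then-OR would give tmp = {alt_tmp:#x}, d = {alt_d:#x}")
```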
Application Programming Interface Overview
[Diagram: software stack; labels listed top to bottom]
• Open|Speedshop, Pro|Speedshop, Debugger (GDB), Download Utilities, Application
• Abstraction Layer Library
• Device Manager
• Download Driver, Algorithm Device Driver
• Co-Processor FPGA (RASC Hardware)
Layer labels: User Space, Linux Kernel, Hardware
Abstraction Layer: Algorithm API
The Abstraction Layer’s algorithm API mirrors the COP API with a few additions that enable wide scaling and deep scaling.
[Diagram labels: Application, Algorithm, Input Data, Output Data, COP (several instances)]
Verilog / VHDL Module Support
• Templates for Verilog
  • Fast start to algorithm coding
• Templates for VHDL
  • Fast start to algorithm coding
• Provide a system simulation stub
  • Allows either simulation debug or system debug
• Provide source code for core services
  • Allows the user to modify them to meet special needs
• Extractor tool supports GDB metadata
  • Application and FPGA debugging
Prototype Configuration
[Diagram labels: Altix 350, NUMAlink4, MOATB]
Performance
• Direct connection to NUMAlink4: 6.4 GB/s per connection
• Fast system-level reprogramming of FPGA: FPGA load at memory speeds
• Atomic memory operations: same set as system CPUs
• Hardware barriers: dynamic load balancing
• Configurations to 128 NUMA/FPGA nodes: scalability
MOATB Block Diagram
[Block diagram; labels listed below]
• Components: Algorithm FPGA, Loader FPGA, TIO, three 2MB QDR SRAM banks (each with Addr & Ctrl lines; bus widths labeled 36 and 72), NUMAlink connectors, SSP, PCI 66 MHz, Select Map programming interface
• NUMAlink: 12.8 GB/s
• SSP: 6.4 GB/s
• QDR SRAM: 9.6 GB/s (3 reads @ 1.6 GB/s, 3 writes @ 1.6 GB/s)
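The 9.6 GB/s QDR SRAM figure is the sum of the per-port rates on the diagram. As a quick check (not from the original deck), the sketch below adds up three read ports and three write ports at 1.6 GB/s each. Rates are held in MB/s so the arithmetic stays in integers, and the function name is illustrative only.

```python
# Per-port rate and port counts as labeled on the block diagram.
PORT_RATE_MB_S = 1600   # 1.6 GB/s per port
READ_PORTS = 3
WRITE_PORTS = 3

def aggregate_sram_bandwidth_mb_s(port_rate_mb_s, read_ports, write_ports):
    """Sum the bandwidth of all independent read and write ports."""
    return port_rate_mb_s * (read_ports + write_ports)

total_mb_s = aggregate_sram_bandwidth_mb_s(PORT_RATE_MB_S, READ_PORTS, WRITE_PORTS)
print(f"Aggregate QDR SRAM bandwidth: {total_mb_s / 1000} GB/s")
```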
System Configuration
[Block diagram: an Altix 350 (DDR SDRAM, two Itanium2, SHUB, PIC, PCI-X, Base I/O) attached over NUMAlink to a MOATB (TIO, SSP, Algorithm FPGA, Loader FPGA, PCI 66 MHz, Select Map programming interface, and three 2MB QDR SRAM banks labeled SRAM 0, SRAM 1, SRAM 2).]
MOATB Data Performance
SSP System Interface Performance
• Measured performance, MOATB with SSP test card bitstream
  • DMA Read => 2.548 GB/s
  • DMA Write => 2.607 GB/s
• Measured performance, MOATB with MBCS bitstream
  • DMA Read => 1.588 GB/s
  • DMA Write => 1.589 GB/s
  • Limited by 1.6 GB/s of external SSRAMs
MOATB Core Services
• Core clock frequency: 200 MHz
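To see how close the MBCS bitstream comes to the stated 1.6 GB/s SSRAM limit, the measured rates can be expressed as a fraction of that limit. This is a small illustrative calculation, not part of the original deck; the variable and function names are made up for the example.

```python
# Measured MBCS DMA rates and the external SSRAM limit, in MB/s.
SSRAM_LIMIT_MB_S = 1600
measured_mb_s = {"DMA Read": 1588, "DMA Write": 1589}

def fraction_of_limit(rate_mb_s, limit_mb_s):
    """Return the measured rate as a fraction of the limiting rate."""
    return rate_mb_s / limit_mb_s

utilization = {
    name: fraction_of_limit(rate, SSRAM_LIMIT_MB_S)
    for name, rate in measured_mb_s.items()
}

for name, frac in utilization.items():
    print(f"{name}: {frac:.2%} of the SSRAM limit")
```

Both directions land above 99% of the limit, which is consistent with the slide's statement that the external SSRAMs are the bottleneck for this bitstream.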
FPGA Architecture Overview
[Block diagram; labels listed below]
• Blocks: SSP, Core Services Block, Algorithm Block
• QDR-II SRAM Bank 0, Bank 1, Bank 2, each with one read port and one write port (read ports 0 to 2, write ports 0 to 2)
• Reads @ 1.6 GB/s, writes @ 1.6 GB/s
• 3.2 GB/s (label appears twice)
Algorithm Block as Submodule
[Interface diagram; signals listed below]
• Algorithm controller: alg_clk, alg_rst, do_step, step_flag, alg_done
• Debug port: debug0 through debug63
• SRAM controller (one bank shown)
  • Read: sram_rd_addr[17:0], sram_rd_cmd_vld, sram_rd_req, sram_rd_gnt, sram_rd_dvld, sram_rd_data
  • Write: sram_wr_addr[17:0], sram_wr_data[63:0], sram_wr_be[7:0], sram_wr_req, sram_wr_gnt, sram_wr_dvld
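The signal widths on this slide line up with figures quoted elsewhere in the deck. The sketch below (not from the original) derives the addressable bank size from the 18-bit address and 64-bit write data bus, and a per-port rate from the 200 MHz core clock. The one-word-per-clock transfer rate is an assumption made for illustration; the deck does not state it.

```python
ADDR_BITS = 18          # sram_rd_addr[17:0] / sram_wr_addr[17:0]
DATA_BITS = 64          # sram_wr_data[63:0]
BYTE_ENABLE_BITS = 8    # sram_wr_be[7:0]
CORE_CLOCK_HZ = 200_000_000  # core clock frequency from the performance slide

def bank_capacity_bytes(addr_bits, data_bits):
    """Bytes addressable with addr_bits of word address and data_bits per word."""
    return (1 << addr_bits) * (data_bits // 8)

def port_rate_bytes_per_s(data_bits, clock_hz):
    """Per-port rate, ASSUMING one full data word moves on every clock."""
    return (data_bits // 8) * clock_hz

capacity = bank_capacity_bytes(ADDR_BITS, DATA_BITS)
rate = port_rate_bytes_per_s(DATA_BITS, CORE_CLOCK_HZ)

print(f"Bank capacity: {capacity // (1024 * 1024)} MiB")
print(f"Per-port rate: {rate / 1e9} GB/s")
```

Under that assumption the results match the 2MB bank size and the 1.6 GB/s per-port rate on the block diagrams, and the eight byte-enable bits cover the eight bytes of a 64-bit word.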
MOATB Sample Application Performance
• Bit Manipulation (Crypto)
  • 79x 1.5GHz Itanium-2 (single MOATB)
  • 119x 1.5GHz Itanium-2 (dual MOATB)
• DOD Bit Matrix Multiply Benchmark
  • TBDx 1.5GHz Itanium-2 (single MOATB)
• Graphics Edge Detection
  • 42x 1.5GHz Itanium-2 (single MOATB)
  • (Demo at NAB)
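The single and dual MOATB figures for the bit manipulation benchmark give a rough sense of how the speedup scales with a second board. The calculation below is an illustration added here, not a result from the deck; "scaling efficiency" is defined in the code as the dual speedup divided by twice the single speedup.

```python
# Speedups relative to a 1.5 GHz Itanium 2, as quoted on the slide.
SINGLE_MOATB_SPEEDUP = 79
DUAL_MOATB_SPEEDUP = 119

def scaling_ratio(single, dual):
    """How much faster the dual configuration is than the single one."""
    return dual / single

def scaling_efficiency(single, multi, boards):
    """Fraction of ideal linear scaling achieved with `boards` boards."""
    return multi / (boards * single)

ratio = scaling_ratio(SINGLE_MOATB_SPEEDUP, DUAL_MOATB_SPEEDUP)
efficiency = scaling_efficiency(SINGLE_MOATB_SPEEDUP, DUAL_MOATB_SPEEDUP, 2)

print(f"Dual vs single: {ratio:.3f}x")
print(f"Scaling efficiency with two boards: {efficiency:.1%}")
```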
Reconfigurable Application Specific Processing
[Roadmap chart, 2004 to 2008; rows listed below]
• MOATB Proof of Concept: V2 - 6000
• Athena Computation Brick: V2 - 6000
• Abacus Computation Blade: V4 LX 200, V4 FX200, Virtex 5
• Daytona Ingest/Egress Blade: V2 Pro 100, V4 FX100, Virtex 5
• System Interface: NL4 / SSP, NL5 / SSP2
• Systems: Altix 3700/350, BX2, UV SHUB2
Athena Computation Blade
[Block diagram; labels listed below]
• Algorithm FPGA: Virtex2 6000 -6
• Loader FPGA
• TIO, SSP, NUMAlink connectors
• Four 2MB QDR SRAM banks
• PCI 66 MHz
Abacus Computation Blade
[Block diagram; labels listed below]
• Two V4LX200 FPGAs, each with a TIO and SSP
• NL4 links
• Loader with Selmap connections, PCI
• Multiple SSRAM banks
RASC 3U Chassis
[Chassis drawing labels: blade slots, TPS, power supply slots]
• 5.128” high x 17.39” wide
Investigations Underway
Additional 3rd Party Partnerships
• Pull in additional “Best in Industry Features”
• Help drive openFPGA.org direction
• Pull in IO and additional scalability features
New High Level Languages
• Matlab: working with a RASC partner to add the tool as a module generator
• Library support for Matlab*P
C-Code Improvement Tools
• FPGA aware Speedshop enhancements
• Source-to-source code optimizer targeted at 3rd party tools