
apeNEXT


Presentation Transcript


  1. apeNEXT. Piero Vicini (piero.vicini@roma1.infn.it), INFN Roma. SciParC workshop.

  2. APE keywords
  • Parallel system: massively parallel 3D array of computing nodes with periodic boundary conditions
  • Custom system: processor with extensive use of VLSI, native implementation of the complex a x b + c operation (complex numbers), large register file, VLIW microcode
  • Node interconnections: optimized for nearest-neighbor communication
  • Software tools: Apese, TAO, OS, machine simulator
  • Dense system: reliable and safe HW solution, custom mechanics for "wide" integration
  • Cheap system: 0.5 €/MFlops, very low maintenance cost

  3. The APE family: our line of home-made computers.

  4. APE ('88): 1 GFlops.

  5. APE100 (1993): 100 GFlops; PB (8 nodes): ~400 MFlops.

  6. APEmille – 1 TFlops
  • 2048 VLSI processing nodes (0.5 GFlops each; see the quick check below)
  • SIMD, synchronous communications
  • Fully integrated "host computer": 64 cPCI-based PCs
  • Machine hierarchy: computing node; "Processing Board" (PB) with 8 nodes, 4 GFlops; "Torre" with 32 PB, 128 GFlops
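A quick consistency check of the APEmille figures quoted above, assuming 0.5 GFlops per node (the value that makes the 1 TFlops total work out); this is a throwaway C snippet added for illustration, not part of the original slides:

```c
#include <stdio.h>

/* Sanity check of the APEmille figures, assuming 0.5 GFlops per node:
 * 8 nodes per Processing Board, 32 PB per "Torre", 2048 nodes in total. */
int main(void)
{
    double node_gflops  = 0.5;
    double pb_gflops    = 8 * node_gflops;      /*   4 GFlops per PB      */
    double torre_gflops = 32 * pb_gflops;       /* 128 GFlops per "Torre" */
    double total_gflops = 2048 * node_gflops;   /* 1024 GFlops ~ 1 TFlops */
    printf("PB: %.0f GF, Torre: %.0f GF, machine: %.0f GF\n",
           pb_gflops, torre_gflops, total_gflops);
    return 0;
}
```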

  7. APEmille installations
  • Bielefeld: 130 GF (2 crates)
  • Zeuthen: 520 GF (8 crates)
  • Milan: 130 GF (2 crates)
  • Bari: 65 GF (1 crate)
  • Trento: 65 GF (1 crate)
  • Pisa: 325 GF (5 crates)
  • Rome 1: 650 GF (10 crates)
  • Rome 2: 130 GF (2 crates)
  • Orsay: 16 GF (1/4 crate)
  • Swansea: 65 GF (1 crate)
  • Grand total: ~1966 GF

  8. The apeNEXT architecture
  (diagram: the 16 nodes of a PB, numbered 0-15, each with a J&T and DDR memory; X links on cables, Y and Z links on the backplane)
  • 3D mesh of computing nodes
  • Custom VLSI processor at 200 MHz (J&T)
  • 1.6 GFlops per node (complex "normal")
  • 256 MB (up to 1 GB) memory per node
  • First-neighbor communication network, "loosely synchronous" (see the indexing sketch below)
  • Y and Z links internal, X on cables
  • r = 8/16 => 200 MB/s per channel
  • Scalable from 25 GFlops to 6 TFlops
  • Processing Board: 4 x 2 x 2, ~26 GF
  • Crate (16 PB): 4 x 8 x 8, ~0.5 TF
  • Rack (32 PB): 8 x 8 x 8, ~1 TF
  • Large systems: (8*n) x 8 x 8
  • Linux PCs as host system
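To make the torus geometry concrete, here is a minimal C sketch, purely illustrative (the function name and linear node numbering are hypothetical, not the apeNEXT system software), of how a node on a crate-sized 4 x 8 x 8 mesh finds a first neighbor with periodic boundary conditions:

```c
#include <stdio.h>

/* Hypothetical sketch: map (x,y,z) coordinates on a crate-sized 4 x 8 x 8
 * torus to a linear node index, wrapping each axis (periodic boundaries). */
#define NX 4
#define NY 8
#define NZ 8

static int node_index(int x, int y, int z)
{
    x = (x + NX) % NX;      /* wrap onto the torus */
    y = (y + NY) % NY;
    z = (z + NZ) % NZ;
    return (x * NY + y) * NZ + z;
}

int main(void)
{
    int x = 0, y = 0, z = 0;   /* example node at the "corner" of the mesh */
    printf("X+ neighbour of (0,0,0): node %d\n", node_index(x + 1, y, z));
    printf("X- neighbour of (0,0,0): node %d\n", node_index(x - 1, y, z)); /* wraps to x = 3 */
    return 0;
}
```

The wrap-around in node_index is exactly the periodic boundary condition mentioned in the APE keywords slide: the X- neighbour of a node at x = 0 is the node at x = 3.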

  9. Design methodology
  • Incremental VHDL model of (almost) the whole system
  • Custom components (VLSI and/or FPGA) derived from the VHDL model via synthesis tools
  • Stand-alone simulation of component VHDL models plus simulation of the "global" VHDL model
  • Powerful test-bed for test-vector generation => first-time-right silicon
  • Software design environment:
  • Simplified but complete model of the HW-host interaction
  • Test environment for development of the compilation chain and OS
  • Performance (architecture) evaluation at design time

  10. Assembling apeNEXT… (figure: J&T ASIC, J&T module, PB, backplane, rack)

  11. Overview of the J&T architecture
  • Peak floating-point performance of about 1.6 GFlops
  • IEEE-compliant double precision
  • Integer arithmetic performance of about 400 MIPS
  • Link bandwidth of about 200 MByte/s in each direction, full duplex
  • 7 links: X+, X-, Y+, Y-, Z+, Z-, plus a "7th" link for I/O
  • Support for current-generation DDR memory
  • Memory bandwidth of 3.2 GByte/s, i.e. 400 Mword/s (see the bandwidth check below)
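The headline bandwidth figures follow from the bus widths and the 200 MHz clock quoted on the following slides (8-bit LVDS link channels, 128-bit local memory channel); a small C check, added here only as an illustration of the arithmetic:

```c
#include <stdio.h>

int main(void)
{
    const double clock_hz = 200e6;

    /* Link: 8-bit LVDS channel per direction at 200 MHz (slide 14). */
    double link_bytes_s = clock_hz * 8 / 8;     /* 200 MB/s per direction    */

    /* Memory: 128-bit local memory channel at 200 MHz (slide 15). */
    double mem_bytes_s  = clock_hz * 128 / 8;   /* 3.2 GB/s                  */
    double mem_words_s  = mem_bytes_s / 8;      /* 400 Mword/s (64-bit word) */

    printf("link: %.0f MB/s, memory: %.1f GB/s = %.0f Mword/s\n",
           link_bytes_s / 1e6, mem_bytes_s / 1e9, mem_words_s / 1e6);
    return 0;
}
```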

  12. J&T: top-level diagram (figure only).

  13. The J&T arithmetic box: 4 multipliers, 4 adder/subtractors
  • Pipelined complex "normal" operation a*b+c: 8 flops per cycle (see the sketch below)
  • At 200 MHz, fully pipelined: 1.6 GFlops
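A plain C rendering of the complex "normal" operation (a scalar sketch, not the VLIW microcode actually generated) shows where the 4 multiplications and 4 additions/subtractions, i.e. 8 flops per result, come from:

```c
#include <stdio.h>

typedef struct { double re, im; } cplx;

/* Complex "normal" a*b + c: 4 multiplications + 4 additions/subtractions,
 * i.e. 8 floating-point operations per result. */
static cplx cnormal(cplx a, cplx b, cplx c)
{
    cplx r;
    r.re = a.re * b.re - a.im * b.im + c.re;
    r.im = a.re * b.im + a.im * b.re + c.im;
    return r;
}

int main(void)
{
    cplx a = {1.0, 2.0}, b = {3.0, 4.0}, c = {0.5, -0.5};
    cplx r = cnormal(a, b, c);          /* (1+2i)(3+4i) + (0.5-0.5i) */
    printf("%g %+gi\n", r.re, r.im);    /* -4.5 +9.5i                */
    return 0;
}
```

One such result per cycle at 200 MHz gives the quoted 8 x 200e6 = 1.6 GFlops peak.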

  14. The J&T remote I/O
  • FIFO-based communication (see the toy model below)
  • LVDS signalling
  • 1.6 Gb/s per link (8 bit @ 200 MHz)
  • 6 (+1) independent links
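As a toy illustration of FIFO-based links (a pure software model with an arbitrarily chosen depth, not the hardware design or its programming interface), a sender can push words into a link FIFO and run ahead of the receiver by up to the FIFO depth, which is what makes the network of slide 8 "loosely synchronous":

```c
#include <stdio.h>

/* Toy model: each link is a FIFO of words. Depth chosen arbitrarily. */
#define FIFO_DEPTH 16

typedef struct {
    double buf[FIFO_DEPTH];
    int head, tail, count;
} link_fifo;

static int fifo_send(link_fifo *f, double w)   /* returns 0 if the FIFO is full  */
{
    if (f->count == FIFO_DEPTH) return 0;
    f->buf[f->tail] = w;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return 1;
}

static int fifo_recv(link_fifo *f, double *w)  /* returns 0 if the FIFO is empty */
{
    if (f->count == 0) return 0;
    *w = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}

int main(void)
{
    link_fifo xplus = {{0}, 0, 0, 0};
    double w;
    fifo_send(&xplus, 3.14);        /* neighbour pushes a word on the X+ link */
    if (fifo_recv(&xplus, &w))      /* local node pops it when it is ready    */
        printf("received %g\n", w);
    return 0;
}
```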

  15. J&T summary
  • CMOS 0.18 µm, 7 metal layers (ATMEL)
  • 200 MHz
  • Double-precision complex "normal" operation
  • 64-bit AGU
  • 8 Kword program cache
  • 128-bit local memory channel
  • 6+1 LVDS links at 200 MB/s
  • BGA package, 600 pins

  16. PB
  • Collaboration with NEURICAM spa
  • 16 nodes, 3D-interconnected
  • 4x2x2 topology: 26 GFlops, 4.6 GB memory
  • Light system: J&T module connectors, glue logic (10 MHz clock tree), global signal interconnection (FPGA), DC-DC converters (48 V to 3.3/2.5 V)
  • Dominant technologies:
  • LVDS: 1728 (16*6*2*9) differential signals at 200 Mb/s; 144 routed via cables, 576 via the backplane, on 12 controlled-impedance layers
  • High-speed differential connectors: Samtec QTS (J&T module), Erni ERMET-ZD (backplane)

  17. J&T module
  • J&T
  • 9 DDR-SDRAM chips, 256 Mbit (x16)
  • 6 LVDS links, up to 400 MB/s
  • Host fast I/O link (7th link)
  • I2C link (slow control network)
  • Dual power supply 2.5 V + 1.8 V, 7-10 W estimated
  • Dominant technologies: SSTL-II (memory interface), LVDS (network interface + I/O)

  18. NEXT backplane
  • 16 PB slots + root slot
  • Size 447 x 600 mm2
  • 4600 LVDS differential signals, point-to-point, up to 600 Mb/s
  • 16 controlled-impedance layers (out of 32)
  • Press-fit only
  • Erni/Tyco ERMET-ZD connectors
  • Providers: APW (primary), ERNI (2nd source)
  • Connector kit cost: 7 k€ (!)
  • PB insertion force: 80-150 kg (!)

  19. PB mechanics
  (diagram: top view of the PB frame, showing the air-flow channels, DC/DC converters, J&T modules and the board-to-board connector)
  PB constraints:
  • Power consumption: up to 340 W
  • PB-BP insertion force: 80-150 kg (!)
  • Fully populated PB weight: 4-5 kg
  • Detailed study of airflow
  • Custom design of card frame and insertion tool

  20. Rack mechanics
  • Problem:
  • PB weight: 4-5 kg
  • PB consumption: 340 W (est.)
  • 32 PB + 2 root boards
  • Power supply: < 48 V x 150 A per crate
  • Integrated host PCs
  • Forced-air cooling
  • Robust, expandable/modular, CE, EMC...
  • Solution:
  • 42U rack (h: 2.10 m): EMC-proof, efficient cable routing
  • 19"/1U slots for 9 rack-mounted "host PCs"
  • Hot-swap, modular power-supply cabinet
  • Custom design of "card cage" and "tie bar"
  • Custom design of the cooling system

  21. (figure-only slide)

  22. Host I/O architecture (diagram): I2C for bootstrap & control; 7th link (200 MB/s) for fast I/O.

  23. Host I/O interface
  (block diagram: QDR memory bank and controller, FIFOs, two 7th-link controllers, PCI master and PCI target controllers, I2C controller, and a PLDA PCI interface core inside an Altera APEX II)
  • PCI board, based on an Altera APEX II
  • Quad-Data-Rate memory (x32)
  • 7th link: 1 (2) bidirectional channels
  • I2C: 4 independent ports
  • PCI interface: 64 bit, 66 MHz
  • PCI master mode for the 7th link
  • PCI target mode for I2C

  24. Status and expected schedule
  • J&T ready to test in September '03
  • We will receive between 300 and 600 chips
  • We need 256 processors to assemble a crate!
  • We expect them to work!
  • The same team designed 7 ASICs of comparable complexity
  • Impressive full-detail simulations of multiple-J&T systems
  • The more one simulates, the less one has to test!
  • PB, J&T module, backplane and mechanics have been built and tested
  • Within days/weeks the first working apeNEXT computer should be operating
  • Mass production will follow as soon as possible, starting by the end of 2003
  • The INFN requirement is 8-12 TFlops of computing power!

  25. Software
  • TAO compilers and linker: READY
  • All existing APE programs will run with no changes
  • Physics codes have already been run on the simulator
  • Kernels of physics codes are used to benchmark the efficiency of the FP unit (an illustrative kernel is sketched below)
  • C compiler: gcc (2.93) and lcc have been retargeted
  • lcc works (almost): http://www.cs.princeton.edu/software/lcc/
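For illustration only, here is a hypothetical kernel of the kind used to exercise the FP unit, written in C99 with complex types; the actual APE benchmark kernels are not shown in the slides, and how the retargeted lcc/gcc map such a loop onto the native complex a*b+c operation is not specified either:

```c
#include <complex.h>
#include <stdio.h>

/* Hypothetical benchmark-style kernel (not one of the actual APE physics
 * codes): a complex multiply-accumulate loop, the pattern that should map
 * onto the native complex "normal" operation. */
static double complex dot(const double complex *a,
                          const double complex *b, int n)
{
    double complex acc = 0.0;
    for (int i = 0; i < n; i++)
        acc = a[i] * b[i] + acc;     /* one complex "normal" per iteration */
    return acc;
}

int main(void)
{
    double complex a[4] = {1 + 1*I, 2, 3*I, 1 - 1*I};
    double complex b[4] = {1, 1*I, 2, 2 + 1*I};
    double complex r = dot(a, b, 4);
    printf("%g %+gi\n", creal(r), cimag(r));   /* 4 +8i */
    return 0;
}
```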

  26. Project costs
  • Total development cost of 1700 k€
  • 1050 k€ for VLSI development
  • 550 k€ non-VLSI
  • Manpower involved: 20 man-years
  • Mass production cost: ~0.5 €/MFlops

  27. Future R&D activities
  • Computing node architecture:
  • Adaptable/reconfigurable computing node
  • Fat operators, short/custom FP data formats, multiple-node integration
  • Evaluation/integration of commercial processors in an APE system
  • Interconnection architecture and technologies:
  • Custom APE-like network
  • Interface to the host, interconnection of PCs
  • Mechanical assemblies (performance/volume, reliability): rack, cables, power distribution, etc.
  • Software:
  • Full support of standard languages (C): compiler, linker, ...
  • Distributed OS
  • Integration of APE systems in a "GRID" environment

  28. Conclusions
  • J&T in fab, ready in Summer '03 (300-600 chips)
  • Everything else is ready and tested!
  • If the tests are OK, mass production starts in 4Q03
  • All components are over-dimensioned (cooling, LVDS tested @ 400 Mb/s, on-board power supplies, ...), which makes a technology step possible with no extra design and relatively low test effort
  • Installation plans:
  • The INFN theoretical group requires 8-12 TFlops (10-15 cabinets) on delivery of a working machine
  • DESY is considering between 8 and 16 TFlops
  • Paris...

  29. APE in SciParC
  • APE is the current ("de facto") European computing platform for large-volume LQCD applications. But...
  • "Interdisciplinarity" is on our pathway (i.e. APE is not only QCD):
  • Fluid dynamics (lattice Boltzmann, weather forecasting)
  • Complex systems (spin glasses, real glasses, protein folding)
  • Neural networks
  • Seismic migration
  • Plasma physics (astrophysics, thermonuclear engines)
  • ...
  • So, in our opinion, it is strategic to build "general purpose" massively parallel computing platforms dedicated to large-scale computational problems coming from different fields of research.
  • The APE group can (and wants to) contribute to the development of such future machines.
