Hybrid Technology Petaflops System
[Figure: HTMT system stack: a superconducting section of 100 GHz processors (P0 ... Pn) with CRAM and an interconnect, a liquid-N2 region of SRAM buffers, an optical packet switch, DRAM, and optical storage.]
• New device technologies
• New component designs
• New subsystem architecture
• New system architecture
• New latency management paradigm and mechanisms
• New algorithms/applications
• New compile-time and runtime software
Complementing Technologies Yield Superior Power/Price/Performance
[Figure: basic silicon PIM macro; a single chip combining memory stacks, sense amps, decode, and node logic.]
• Superconductor RSFQ logic provides 100x performance
• Processor in Memory (PIM): high memory bandwidth and low power
• Data Vortex optical communication: very high bisection bandwidth with low latency
• Holographic storage: high capacity with low power at moderate speeds
DIVA PIM: Smart Memory for Irregular Data Structures and Dynamic Databases
• Processor in Memory
• merges memory and logic on a single chip
• exploits high internal memory bandwidth
• enables row-wide in-place memory operations
• reduces memory access latencies
• significant power reduction
• efficient fine-grain parallel processing
• DIVA PIM Project
• DARPA sponsored, $12.2M; USC ISI prime with Caltech ($2.4M over 4 years), Notre Dame, U. of Delaware
• greatly accelerates scientific computing on irregular data structures and commercial dynamic databases
• 0.25 µm, 256 Mbit part delivered 4th quarter 2000
• 4 processor/memory nodes
• Key innovation: multithreaded execution for high efficiency through latency management
• Active-message-driven object-oriented computation
• Direct PIM-to-PIM interaction without host processor intervention (see the sketch below)
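The active-message idea on this slide can be illustrated with a generic sketch. This is not the DIVA API; the parcel structure, handler name, and dispatch loop below are hypothetical, shown only to suggest how a message that carries its own handler can be executed where the data lives, without a host processor in the path.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical parcel: an active message carrying its handler and payload.
 * Illustrative only; not the DIVA message format. */
typedef struct parcel {
    void (*handler)(void *payload);  /* code to run on arrival            */
    char payload[64];                /* operands travel with the message  */
} parcel_t;

/* Example handler: a row-wide in-place update performed at the memory node. */
static void increment_row(void *payload)
{
    int *row = (int *)payload;
    for (int i = 0; i < 8; i++)
        row[i]++;
    printf("row[0] is now %d\n", row[0]);
}

/* A PIM node's dispatch loop: invoke each arriving parcel's handler directly. */
static void pim_dispatch(parcel_t *queue, int n)
{
    for (int i = 0; i < n; i++)
        queue[i].handler(queue[i].payload);
}

int main(void)
{
    parcel_t p;
    int row[8] = {0};

    p.handler = increment_row;
    memcpy(p.payload, row, sizeof(row));

    pim_dispatch(&p, 1);             /* simulate arrival at a PIM node */
    return 0;
}
```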
HTMT Percolation Model
[Figure: percolation pipeline between the cryogenic area and the SRAM-PIM/DRAM-PIM runtime system: DMA to CRAM, split-phase (start/done) synchronization to SRAM, a C-buffer, A-, I-, D-, and T-queues, and stages for parcel assembly & disassembly, parcel dispatch & dispensing, parcel invocation & termination, and re-use; DMA to DRAM-PIM.]
From Toys to Teraflops: Bridging the Beowulf Gap
Thomas Sterling
California Institute of Technology
NASA Jet Propulsion Laboratory
September 3, 1998
Death of Commercial High-End Parallel Computers?
• No market for high-end computers
• minimal growth in the last five years
• The Great Extinction: KSR, Alliant, TMC, Intel, CRI, CCC, Multiflow, MasPar, BBN, Convex, ...
• Must use COTS
• fabrication costs skyrocketing
• development lead times too short
• Federal agencies fleeing: NSF, DARPA, NIST, NIH
• No new good ideas
BEOWULF-CLASS SYSTEMS
• Cluster of PCs: Intel x86, DEC Alpha, Mac PowerPC
• Pure M2COTS (mass-market commodity off-the-shelf)
• Unix-like O/S with source: Linux, BSD, Solaris
• Message-passing programming model: PVM, MPI, BSP, homebrew remedies (see the minimal MPI sketch below)
• Single-user environments
• Large science and engineering applications
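As a concrete illustration of the message-passing model this slide lists (MPI is only one of the options named), here is a minimal SPMD program in C. The ring exchange is a generic example, not code from the talk: every node runs the same binary, discovers its rank, and exchanges a message with its neighbors.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal message-passing example: each rank sends its rank number to its
 * right-hand neighbor around a ring and prints what it receives. */
int main(int argc, char **argv)
{
    int rank, size, left, right, sendbuf, recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    right = (rank + 1) % size;            /* neighbor to send to      */
    left  = (rank - 1 + size) % size;     /* neighbor to receive from */
    sendbuf = rank;

    MPI_Sendrecv(&sendbuf, 1, MPI_INT, right, 0,
                 &recvbuf, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d of %d received %d from rank %d\n",
           rank, size, recvbuf, left);

    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc and launched with mpirun -np N, one process runs per processor across the cluster; the same pattern works unchanged whether the nodes are x86, Alpha, or PowerPC.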
Emergence of Beowulf Clusters
Focus Tasks for Beowulf R&D
• Applications
• Scalability to the high end
• Low-level enabling software technology
• Grendel: middleware for managing ensembles
• Technology transfer
Beowulf at Work
Beowulf Scalability
A 10 Gflops Beowulf
• Center for Advanced Computing Research, California Institute of Technology
• 172 Intel Pentium Pro microprocessors
Avalon architecture and price
The Background
Network Topology Scaling Latencies (µs)
Petaflops Clusters at POWR
David H. Bailey*, James Bieda, Remy Evard, Robert Clay, Al Geist, Carl Kesselman, David E. Keyes, Andrew Lumsdaine, James R. McGraw, Piyush Mehrotra, Daniel Savarese, Bob Voigt, Michael S. Warren
Critical System Software
• A cluster-node Unix-based OS (e.g., Linux or the like), scalable to 12,500+ nodes
• Fortran-90, C, and C++ compilers generating maximum-performance object code, usable under the Linux OS
• An efficient implementation of MPI, scalable to 12,500+ nodes
• System management and job management tools, usable for systems of this size
System Software Research Tasks
• Can a stripped-down Linux-like operating system be designed that is scalable to 12,500+ nodes?
• Can vendor compilers be utilized in a Linux node environment? If not, can high-performance Linux-compatible compilers be produced by third-party vendors, keyed to the needs of scientific computing?
• Can MPI be scaled to 12,500+ nodes? (An illustrative scaling sketch follows this slide.)
• Can system management and batch submission tools (e.g., PBS or LSF) be scaled to 12,500+ nodes?
• Can an effective performance management tool be produced for systems with 12,500+ nodes?
• Can an effective debugger be produced for systems with 12,500+ nodes? Can the debugger being specified by the Parallel Tools Consortium be adapted for these systems?
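The MPI-scaling question can be made concrete with a small example. This is not from the talk, just an illustration of the kind of pattern that separates an MPI code that scales to 12,500+ nodes from one that does not: gathering one value per node with point-to-point messages serializes O(P) receives at a single rank, while a collective such as MPI_Allreduce typically completes in O(log P) steps.

```c
#include <mpi.h>
#include <stdio.h>

/* Two ways to combine one value per node.  The naive version funnels P-1
 * messages through rank 0; the collective lets the library use a
 * logarithmic reduction tree, which matters at 12,500+ nodes. */
int main(int argc, char **argv)
{
    int rank, size;
    double local, naive_sum = 0.0, coll_sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    local = (double)rank;                 /* stand-in for a real partial result */

    /* Naive: everyone sends to rank 0 -- O(P) messages through one node. */
    if (rank == 0) {
        naive_sum = local;
        for (int src = 1; src < size; src++) {
            double v;
            MPI_Recv(&v, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            naive_sum += v;
        }
    } else {
        MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    /* Scalable: a single collective, typically O(log P) steps. */
    MPI_Allreduce(&local, &coll_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("naive sum = %g, allreduce sum = %g\n", naive_sum, coll_sum);

    MPI_Finalize();
    return 0;
}
```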
Technology Transfer
• Information-hungry neo-users
• how to implement
• how to maintain
• how to apply
• Web-based assembly and how-to information
• Red Hat CD-ROM including Extreme Linux
• Tutorials
• MIT Press book: "How to Build a Beowulf"
• DOE and NASA workshops
• JPC4: joint personal computer cluster computing conference
• so many talks
Godzilla Meets Bambi: NT versus Linux
• Not in competition; they complement each other
• Linux was not created by suits
• created by people who wanted to create it
• distributed by people who wanted to share it
• used by people who want to use it
• If Linux dies, it will not be killed by NT; it will be buried by Linux users
• Linux provides
• a Unix-like O/S, which has been the mainstream of scientific computing
• open source code
• low/no cost
Have to Run Big Problems on Big Machines?
• It's work, not peak flops: a user's throughput over the application cycle
• Big machines yield little slices, due to time and space sharing
• But consider data set memory requirements
• data set needs span a wide range, three orders of magnitude
• latency-tolerant algorithms enable out-of-core computation (see the sketch after this slide)
• What is the Beowulf breakpoint for price/performance?
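The out-of-core point can be illustrated with a short sketch: a data set much larger than main memory is processed in fixed-size blocks streamed from disk, so only one block is resident at a time. The file name and block size are arbitrary placeholders; this is a generic pattern, not code from the talk.

```c
#include <stdio.h>
#include <stdlib.h>

/* Out-of-core reduction: sum a file of doubles far larger than RAM by
 * streaming it through a fixed-size buffer.  Latency tolerance comes from
 * each block supplying enough work to cover the cost of the disk access. */
#define BLOCK_DOUBLES (1 << 20)          /* ~8 MB resident at a time */

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "bigdata.bin";  /* hypothetical file */
    FILE *f = fopen(path, "rb");
    if (!f) { perror(path); return 1; }

    double *block = malloc(BLOCK_DOUBLES * sizeof(double));
    if (!block) { fclose(f); return 1; }

    double sum = 0.0;
    size_t n;
    while ((n = fread(block, sizeof(double), BLOCK_DOUBLES, f)) > 0) {
        for (size_t i = 0; i < n; i++)   /* compute on the resident block */
            sum += block[i];
    }

    printf("sum = %g\n", sum);
    free(block);
    fclose(f);
    return 0;
}
```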
Alternative APIs
• Mostly MPI; also PVM
• custom messaging for performance
• BSP: SPMD, global name space, implicit messaging
• Hrunting: software-supported distributed shared memory
• EARTH (Guang Gao, U. of Delaware): software-supported multithreading
Grendel Suite
• Targets effective management of ensembles
• Embraces "NIH" (nothing in-house)
• Surrogate customer for the Beowulf community
• Borrows software products from research projects
• Capabilities required:
• communication layers
• numerical libraries
• program development tools
• scheduling and runtime
• debugging and availability
• external I/O
• secondary/mass storage
• general system administration
Towards the Future: What Can We Expect?
• 2 GFLOPS peak processors
• $1000 per processor
• 1 Gbps networking at < $250 per port
• new backplane performance, e.g. PCI++
• lightweight communications, < 10 µs latency
• optimized math libraries
• 1 Gbyte main memory per node
• 24 Gbyte disk storage per node
• de facto standardized middleware
Million $$ Teraflops Beowulf?
• Today: roughly $3M for a peak Tflops
• Before 2002: $1M for a peak Tflops (a back-of-envelope check follows this slide)
• Performance efficiency is the serious challenge
• System integration: does vendor support of massive parallelism have to mean massive markup?
• System administration: boring but necessary
• Maintenance without vendors; how?
• New kinds of vendors for support
• Heterogeneity will become a major aspect
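A back-of-envelope check of the $1M target, using only the processor speed, processor price, and network port price from the previous slide; the remaining per-node costs are left as slack, and this arithmetic is mine, not a costing from the talk:

\[
\frac{1\ \text{Tflops}}{2\ \text{Gflops per processor}} = 500\ \text{processors}
\]
\[
500 \times \$1000 \;+\; 500 \times \$250 \;=\; \$500\text{K} + \$125\text{K} \;=\; \$625\text{K}
\]

which leaves roughly $375K of a $1M budget for memory, disk, enclosures, and integration.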