Introduction to SDSC/NPACI Architectures NPACI Summer Computing Institute August, 2003 Donald Frederick frederik@sdsc.edu Scientific Computing Services Group SDSC
Shared and Distributed Memory Systems
[Diagram: distributed-memory CPUs, each with local memory (M), joined by a network; shared-memory CPUs connected to a single memory through a bus/crossbar]
• Multicomputer (distributed memory): each processor has its own local memory. Examples: CRAY T3E, IBM SP2, PC cluster
• Multiprocessor (shared memory): single address space; all processors have access to a pool of shared memory. Examples: SUN HPC, CRAY T90, NEC SX-6
• Methods of memory access: bus, crossbar
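The two models imply different programming styles. Below is a minimal sketch, assuming a generic C compiler and MPI library (illustrative code, not taken from any NPACI system): on a distributed-memory machine, data moves between processors only when it is explicitly sent and received.

```c
/* Minimal distributed-memory example: each MPI rank has its own private
   memory, so rank 1 only sees the value after an explicit message.
   Run with at least 2 ranks (launcher name varies by site). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            token = 42;   /* exists only in rank 0's local memory ... */
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* ... until it arrives over the interconnect */
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", token);
        }
    }
    MPI_Finalize();
    return 0;
}
```

On a shared-memory multiprocessor the same value could simply be read through the common address space, for example by threads sharing one variable.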
Hybrid (SMP Cluster) Systems
[Diagram: SMP nodes, each with several CPUs sharing memory through a local interconnect, joined by an off-node network]
• Hybrid architecture: processes share memory on-node; off-node they may/must use message passing, and may share off-node memory
• Examples: IBM SP Blue Horizon, SGI Origin, Compaq AlphaServer, TeraGrid cluster, SDSC DataStar
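On a hybrid system the two styles are often combined: threads share memory within a node while MPI carries data between nodes. A hypothetical sketch, assuming OpenMP support in the compiler and any MPI library (not code from Blue Horizon or DataStar):

```c
/* Hybrid MPI + OpenMP sketch: OpenMP threads cooperate through shared
   memory on-node; MPI combines the per-node results off-node. Only the
   master thread (outside the parallel region) makes MPI calls. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* On-node: each thread adds to a shared reduction variable. */
    double local = 0.0;
    #pragma omp parallel reduction(+:local)
    local += 1.0;

    /* Off-node: message passing sums the per-node partial results. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("threads across all nodes: %.0f\n", global);

    MPI_Finalize();
    return 0;
}
```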
System Interconnect Topologies
Information is sent among CPUs through a network. Ideally each processor would have a direct link to every other processor (a fully connected network), but such a network is very expensive and scales poorly, with cost growing as ~N². Instead, processors are arranged in some variation of a mesh, torus, hypercube, etc.
[Figures: 2-D Mesh, 2-D Torus, 3-D Hypercube]
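To make the scaling concrete, standard link counts for N processors (textbook formulas, not figures from the slide) are:

\[
\text{fully connected: } \frac{N(N-1)}{2} \sim O(N^2), \qquad
\text{2-D torus: } 2N, \qquad
\text{hypercube: } \frac{N}{2}\log_2 N .
\]

For N = 1,024 that is roughly 524,000 links fully connected, versus 2,048 for a 2-D torus and 5,120 for a hypercube.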
Network Terminology
• Network Latency: time taken to begin sending a message. Units: microseconds, milliseconds, etc. Smaller is better.
• Network Bandwidth: rate at which data is transferred from one point to another. Units: bytes/sec, MB/sec, etc. Larger is better.
• Both may vary with message size. For IBM Blue Horizon:
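Latency and bandwidth figures such as the Blue Horizon numbers referenced above are typically measured with a simple ping-pong test between two processors. A minimal MPI sketch (the 1 MB message size and repetition count are arbitrary illustrative choices, not the actual benchmark used at SDSC):

```c
/* Ping-pong timing sketch: rank 0 sends to rank 1 and waits for the echo.
   Half the round-trip time approximates one-way latency + size/bandwidth.
   Run with exactly 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define NBYTES (1 << 20)   /* 1 MB message */
#define REPS   100

int main(int argc, char **argv)
{
    static char buf[NBYTES];
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double oneway = (MPI_Wtime() - t0) / (2.0 * REPS);
    if (rank == 0)
        printf("one-way time %.1f us, bandwidth %.1f MB/s\n",
               oneway * 1e6, NBYTES / oneway / 1e6);

    MPI_Finalize();
    return 0;
}
```

Repeating the test over a range of message sizes shows the size dependence noted above: small messages are dominated by latency, large ones by bandwidth.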
Network Terminology Bus • Shared data path • Data requests require exclusive access • Complexity ~ O(N) • Not scalable – Bandwidth ~ O(1) Crossbar Switch • Non-blocking switching grid among network elements • Bandwidth ~ O(N) • Complexity ~ O(N*N)
Network Terminology
Multistage Interconnection Network (MIN)
• Hierarchy of switching stages – e.g., an Omega network connecting N CPUs to N memory banks uses log2(N) stages of small switches, so its depth grows only as ~O(log N)
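For scale (standard Omega-network counts, not from the slide): an N-input Omega network is built from log2(N) stages of N/2 two-by-two switches,

\[
\text{depth} = \log_2 N, \qquad \text{switches} = \frac{N}{2}\log_2 N ,
\]

so for N = 64 it needs 6 stages and 192 switches, versus 64 × 64 = 4,096 crosspoints for a full crossbar.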
Current SDSC/NPACI Compute Resource IBM Blue Horizon SP • 1,152 POWER3 II 375 MHz CPUs • Grouped in 8-way nodes with 4 GB RAM – 128 nodes • 5 TB GPFS file system • 1.7 TFlops peak • AIX 5.1L, PSSP 3.4 – 64-bit MPI, checkpoint/restart • Compilers – IBM, KAI – C/C++, Fortran; gnu – C/C++ • New interactive service – 15 4-way POWER3 nodes, 2 GB/node • Queues – up to 36 hours on up to 128 nodes • Queues – normal, high, low – each up to 128 nodes (except low) • Dedicated runs with longer times can be scheduled – contact frederik@sdsc.edu • Grid runs across multiple machines – NPACKage Grid-enabling software installed
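As a sanity check on the 1.7 TFlops figure, assuming 4 floating-point operations per clock per POWER3-II CPU (two fused multiply-add units):

\[
1{,}152 \text{ CPUs} \times 375\ \text{MHz} \times 4\ \text{flops/cycle} \approx 1.73\ \text{Tflops peak.}
\]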
Current SDSC Archival Resource High Performance Storage System (HPSS) • ~0.9 PB capacity – soon up to 6 PB • ~350 TB currently stored • ~20 million files • Data added at ~8 TB/month
Near-Term SDSC Compute Resources • TeraGrid Machine • ~4 Tflops computing power • IA-64 – 128 Madison 2-way nodes / 256 “Madison+” 2-way nodes • Configuration – 2 CPUs per node • Myrinet interconnect • Production – January 2004 • Some early access via AAP • ETF IBM POWER4 Machine • ~7 Tflops compute power • POWER4+ CPUs • Production – April 2004 • Federation switch/interconnect
SDSC TeraGrid Components • IBM Linux clusters • Open source software and community • Intel Itanium Processor Family™ nodes • IA-64 ISA – VLIW, ILP • Madison processors • Very high-speed network backplane • Bandwidth for rich interaction and tight coupling • Large-scale storage systems • Hundreds of terabytes for secondary storage • Grid middleware • Globus, data management, … • Next-generation applications • Beyond “traditional” supercomputing
[Diagram: TeraGrid overview – 13.6 TF, 6.8 TB memory, 79 TB internal disk, 576 TB network disk across four sites: NCSA (6+2 TF, 4 TB memory, 240 TB disk, 1000 TB UniTree), SDSC (4.1 TF, 2 TB memory, 500 TB SAN, HPSS, 1,176p IBM SP Blue Horizon at 1.7 TFLOPs, Sun F15K), ANL (1 TF, 0.25 TB memory, 25 TB disk, Chiba City, HR display & VR facilities), and Caltech (0.5 TF, 0.4 TB memory, 86 TB disk, HPSS). Sites connect through DTF core switch/routers (Cisco 65xx Catalyst with 256 Gb/s crossbar, Juniper M160) in Chicago and LA, with OC-3/OC-12/OC-48 links to vBNS, Abilene, MREN, Calren, ESnet, NTON, HSCC, and Starlight; Myrinet inside the clusters.]
SDSC “node” configured to be “best” site for data-oriented computing in the world
[Diagram: TeraGrid backbone (40 Gbps) linking NCSA (8 TF, 4 TB memory, 240 TB disk), Argonne (1 TF, 0.25 TB memory, 25 TB disk), Caltech (0.5 TF, 0.4 TB memory, 86 TB disk), and SDSC (4.1 TF, 2 TB memory, ~25 TB internal disk, ~500 TB network disk), plus SDSC's Blue Horizon IBM SP (1.7 TFLOPs), Sun F15K, 1000 TB HPSS, and a Myrinet Clos spine; external links to vBNS, Abilene, Calren, and ESnet.]
TeraGrid Wide Area Network
[Diagram: proposed DTF backbone linking Los Angeles and San Diego to Chicago, Urbana, and ANL over multiple 10 GbE (Qwest) and I-WIRE dark-fiber links, with OC-48 (2.5 Gb/s) Abilene connectivity.]
• Solid lines in place and/or available by October 2001
• Dashed I-WIRE lines planned for Summer 2002
SDSC local data architecture: a new approach for supercomputing with dual connectivity to all systems
[Diagram: Blue Horizon, the 4 TF Linux cluster, the Sun F15K, HPSS, and the database, data-mining, and vis engines are all connected both to a LAN (multiple GbE, TCP/IP) and to a SAN (2 Gb/s, SCSI; SCSI/IP or FC/IP; 30 MB/s per drive, 200 MB/s per controller), with 50 TB local disk, 50 TB FC GPFS disk, a 150 TB FC disk cache, 6 PB of silos and tape with 52 tape drives, and a 30 Gb/s WAN. The SDSC design is leveraged at other TG sites.]
Draft TeraGrid Data Architecture
[Diagram: each site (SDSC 4 TF, NCSA 6 TF, Argonne 1 TF, Caltech 0.5 TF) pairs a LAN with a SAN, joined by the 40 Gb/s WAN; SDSC holds 50 TB GPFS, a 150 TB cache with a cache manager, data/vis engines, and tape backups; NCSA holds 200 TB GPFS; other sites keep ~50 TB local disk. Every node can access every disk, with potential for Grid-wide backups. The TG data architecture is a work in progress.]
SDSC TeraGrid Data Management (Sun F15K) • 72 processors, 288 GB shared memory, 16 Fibre Channel SAN HBAs (>200 TB disk), 10 GbE • Many GB/s I/O capability • “Data Cop” for SDSC DTF effort • Owns shared datasets and manages shared file systems • Serves data to non-backbone sites • Receives incoming data • Production DTF database server • SW – Oracle, SRB, etc.
Future Systems
• DataStar – 7.9 Tflops POWER4 system
• 8 × 32-way Regatta H+ (p690) 1.7 GHz nodes
• 128 × 8-way p655+ 1.5 GHz nodes
• 3.2 TB total memory
• 80–100 TB GPFS disk (supports parallel I/O)
• Smooth transition from Blue Horizon to DataStar
• April 2004 production
• Early access to POWER4 system
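The 7.9 Tflops figure follows from the counts and clock rates above, again assuming 4 flops per cycle (two FMA units) per POWER4+ CPU:

\[
(8 \times 32) \times 1.7\ \text{GHz} \times 4 \ +\ (128 \times 8) \times 1.5\ \text{GHz} \times 4
\;\approx\; 1.74 + 6.14 \;\approx\; 7.9\ \text{Tflops peak.}
\]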
Comparison: DataStar and Blue Horizon

                    Blue Horizon             DataStar
Processor           POWER3-II 375 MHz        POWER4+ 1.7 / 1.5 GHz
Node type           8-way Nighthawk          32-way p690 & 8-way p655
Procs/nodes         1,152 / 144              1,024/128 & 256/8
Switch              Colony                   Federation
Peak speed (TF)     1.7                      7.8
Memory (TB)         0.6                      3+
GPFS (TB)           15                       100+
Status of DataStar • 8 P690, 32-way, 1.7GHz nodes are on the floor • 10 TB disk attached to each node (80 TB total) • No High Speed interconnect yet • Expect 128 P655 nodes & Federation switch Nov/Dec this year • P655 nodes and switch will be built simultaneously with ASCI Purple by IBM
DataStar Orientation
• 1,024 P655 processors (1.5 GHz, identical to ASCI Purple) will be available for batch runs
• P655 nodes will have 2 GB/processor
• P690 nodes (256 processors at 1.7 GHz) will be available for pre/post-processing and data-intensive apps
• P690 nodes have 4 GB/processor or 8 GB/processor
DataStar Networking • Every DataStar node (136) will be connected by 2 Gbps Fibre Channel to the Storage Area Network • All 8 P690 nodes are now connected by GbE • Eventually (Feb ’04) most (5) P690 nodes will have 10 GbE • All DataStar nodes will be on the Federation (2 GB/s) switch
DataStar Software • High Performance Computing – compilers, libraries, profilers, numerical libraries • Grid – NPACKage – contains Grid middleware, Globus, APST, etc. • Data Intensive Computing – Regatta nodes configured for DIC. DB2, Storage Resource Broker, SAS, Data Mining tools
Power4 Processor Features
• 64-bit Architecture
• Superscalar, dynamic scheduling
• Speculative superscalar execution
• Out-of-order execution, in-order completion
• Fetches 8 instructions per cycle; instructions are grouped for execution
• Sustains 5 issues plus 1 branch per clock, with up to 215 instructions in flight
• 2 LSU, 2 FXU, 2 FPU, 1 BXU, 1 CRLXU
• 8 Prefetching Streams
Processor Features (cont.)
• 80 General Purpose Registers, 72 Floating-Point Registers
• Rename registers for pipelining
• Aggressive Branch Prediction
• 4 KB or 16 MB Page Sizes
• 3-Level Cache
• 1,024-entry TLB
• Hardware Performance Counters
Processor Features: FPU
• 2 Floating-Point Multiply/Add (FMA) Units – 4 flops/CP
• 6-CP FMA pipeline
• 128-bit intermediate results (no rounding, default)
• IEEE arithmetic
• 32 Floating-Point Registers + 40 rename registers
• Hardware Square Root: 38 CPs; Divide: 32 CPs
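To see what the FMA units mean for code, consider a DAXPY-style loop (a generic illustration, not tied to any NPACI application):

```c
/* Each iteration is one fused multiply-add (2 flops). With two FMA units
   the theoretical peak is 4 flops per clock period, but because the FMA
   pipeline is 6 CPs deep, several independent iterations must be in
   flight at once; compilers typically unroll this loop to expose them. */
#include <stddef.h>

void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* one multiply + one add = one FMA */
}
```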
Processor Features: Cache
• L1: 32 KB data, 2-way assoc. (write-through); 64 KB instruction, direct-mapped
• L2: 1.44 MB (unified), 8-way assoc. (write-in)
• L3: 32 MB, 8-way assoc.
• Line sizes: 128 / 128 / 4×128 bytes for L1 / L2 / L3
Processor Features: Cache/Memory
[Diagram: registers – L1 (32 KB data / 64 KB instruction) – L2 (1.4 MB) – L3 (32 MB) – memory (8 GB per MCM, 13.86 GB/sec), with per-CP transfer widths of ~2 W between registers and L1 data (2 reads, 1 read & 1 write, or 1 write), ~4 W, and ~0.87 W between the outer levels]
• Approximate latencies: L1 ~4 CP, L2 ~14 CP, L3 ~100 CP, ~250 CP to memory
• Line sizes: L1/L2/L3 = 16 / 16 / 4×16 W
• W = word (64 bit), Int = integer (64 bit), CP = clock period
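Given the steep latency gradient above (~4 CP in L1 versus ~250 CP to memory), the standard remedy is to block (tile) loops so the working set stays in cache. A hypothetical sketch; the names are illustrative, and BS = 32 is chosen so that three 32 × 32 tiles of doubles (24 KB) fit in the 32 KB L1 data cache:

```c
/* Cache-blocked matrix multiply, C = C + A*B, row-major storage.
   For brevity, n is assumed to be a multiple of BS. */
#define BS 32

void mm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* operate on one BS x BS tile of each matrix at a time */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a = A[i*n + k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
}
```

In practice, tuned vendor libraries already apply this kind of blocking, which is one reason to prefer the numerical libraries mentioned later in this deck.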
Processor Features: Memory Fabric
[Diagram: POWER4 chip fabric – two processor cores (instruction fetch, load, store) feed a CIU switch into three L2 cache slices; a fabric controller connects the L2s to the L3 controller/directory, memory controller, GX (I/O) controller, chip-to-chip fabric buses (2:1), and MCM-to-MCM buses (2:1); service logic includes the SP controller, BIST engine, POR sequencer, performance monitor, error detect & logging, and trace & debug.]
Processor Features: Costs of New Features • Increased FPU & pipeline depth (dependencies hurt, uses more registers) • Reduced L1 cache size • Higher latency on higher level caches
Processor Features: Relative Performance
[Chart: relative performance factor]
Power4 Multi-Chip Module (MCM)
• 4-way SMP on a Multi-Chip Module (MCM)
• >41.6 GB/sec chip-to-chip and MCM-to-MCM interconnect
• Logically shared L2 and L3 within the MCM
• Distributed switch design (on chip) features:
  • Low latency of a bus-based system
  • High bandwidth of a switch-based system
• Fast I/O interface (GX bus)
• Dual-plane switch: two independent switch fabrics; each node has two adapters, one to each fabric
• Point-to-point bandwidth ~350 MB/sec; 14 µsec latency
• MPI on-node (shared memory) bandwidth ~1.5 GB/sec
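With the switch numbers quoted above, a simple latency-plus-bandwidth model shows when each term dominates (the 1 KB and 1 MB message sizes are arbitrary examples):

\[
T(m) \approx t_{\text{lat}} + \frac{m}{B}:\qquad
T(1\ \text{KB}) \approx 14\ \mu\text{s} + 2.9\ \mu\text{s} \approx 17\ \mu\text{s}, \qquad
T(1\ \text{MB}) \approx 14\ \mu\text{s} + 2.9\ \text{ms} \approx 2.9\ \text{ms},
\]

using the ~350 MB/sec point-to-point bandwidth. Small messages are latency-bound, so aggregating them pays off; large messages are bandwidth-bound.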
Power4 MCM
Four POWER4 chips assembled onto a Multi-Chip Module (MCM) (left) create a 4-way SMP building block for the Regatta HPC configuration. The die of a single chipset is magnified on the right – 170 million transistors.
Power4 MCM 125 watts / die x 4 HOT!!!
Power4 Node: Multiple MCMs
[Diagram: four MCMs, each with its own memory (MEM), combined in one node]
Power4 Node: Network of Buses
[Diagram: within an MCM, four CPUs (each with L1/L2) share an L3, a memory bus, memory, and an I/O bus – a 4-way building block with 8 GB of memory. Four MCMs joined by inter-MCM memory paths form a 16-way node with 32 GB of memory.]
MCM Memory Access: Local
[Diagram: one MCM – four chips, each with CPUs (C), L2, an L3 controller/directory, a 32 MB L3, 2 GB of memory, and an IOCC – illustrating a local memory access.]
MCM Memory Access
[Diagram: the same view extended to a full multi-MCM node – 16 chips, each with CPU cores (c), L2, IOCC, L3 controller/directory, L3, and memory – illustrating memory access across MCMs.]
Status of TeraGrid • 128 2-way Madison nodes in place • Myrinet upgrade in August • System testing underway • Software testing underway • 256 Madison+ 2-way nodes scheduled for October 2003 • Production – January 2004
SDSC Machine Room
• 0.5 PB disk
• 6 PB archive
• 1 GB/s disk-to-tape
• Optimized support for DB2 / Oracle
• Enabled for extremely large and rapid data handling
[Diagram: machine-room layout – Blue Horizon, two POWER4 systems (one hosting the DB), the 4 TF Linux cluster, Sun F15K, HPSS, database engine, data miner, and vis engine on a LAN (multiple GbE, TCP/IP) and a SAN (2 Gb/s, SCSI; SCSI/IP or FC/IP; 30 MB/s per drive, 200 MB/s per controller); 50 TB local disk, 100 TB FC GPFS disk, 400 TB FC disk cache, and 6 PB of silos and tape with 52 tape drives at 1 GB/sec disk to tape; 30 Gb/s WAN.]
NPACI Production Software NPACI Applications web-page: www.npaci.edu/Applications • Applications in variety of research areas: • Biomolecular Structure • Molecular Mechanics/Dynamics • Quantum Chemistry • Eng. Structural Analysis • Finite Element Methods • Fluid Dynamics • Numerical Libraries • Linear Algebra • Differential Equations • Graphics/Scientific Visualization • Grid Computing
Intro to NPACI Architectures - References • NPACI User Guides • www.npaci.edu/Documentation • POWER4 Info • POWER4 Processor Introduction and Tuning Guide http://publib-b.boulder.ibm.com • IA-64 Info • Sverre Jarp CERN - IT Division http://nicewww.cern.ch/~sverre/SJ.htm • Intel Tutorial www.intel.com/design/itanium/archSysSoftware/