Scalable Scientific Computing at Compaq
CAS 2001, Annecy, France, October 29 – November 1, 2001
Dr. Martin Walker, Compaq Computer EMEA, martin.walker@compaq.com
Agenda of the entertainment • From EV4 to EV7: four implementations of the Alpha microprocessor over ten years • Performance on a few applications, including numerical weather forecasting • The Terascale Computing System at the Pittsburgh Supercomputing Center • Marvel: the next (and last) AlphaServer • Grid Computing
Scientific basis for the vector-processor choice in the Earth Simulator project • Comparison of the Cray T3D and Cray Y-MP/C90: J. J. Hack et al., "Computational design of the NCAR community climate model", Parallel Computing 21 (1995) 1545-1569 • Fraction of peak performance achieved: 1-7% on the Cray T3D; 30% on the Cray Y-MP/C90 • The Cray T3D used the Alpha EV4 processor from 1992
Key ratios that determine sustained application performance (U.S. DoD/DoE)
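A sketch of the kind of ratio such studies track (the specific formulation below is an assumption of mine, not taken from the slide): the central quantity is the machine balance,

$$ B = \frac{\text{sustainable memory bandwidth (bytes/s)}}{\text{peak floating-point rate (flops/s)}}, $$

usually quoted alongside memory latency in CPU cycles and interconnect bandwidth per flop. A code with roughly one 8-byte load or store per flop, like SWEEP3D below, sustains a large fraction of peak only when B approaches 8 bytes/flop.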
Alpha EV6 architecture (pipeline block diagram): stages 0-6 cover fetch, map, queue, register read, execute, and data-cache access; 4 instructions fetched per cycle; 80 in-flight instructions plus 32 loads and 32 stores; branch predictors and next-line address prediction; 20-entry integer and 15-entry floating-point issue queues; two 80-entry integer register files and a 72-entry FP register file; FP add (with divide/square root) and FP multiply pipes; 64 KB 2-way set-associative L1 instruction and data caches; victim buffer and miss-address file.
Weather Forecasting Benchmark • LM = local model of the German Weather Service (DWD); current version is RAPS 2.0 • Grid size is 325 × 325 × 35; the predefined input set "dwd" was used for all benchmarks • The first forecast hour is timed (it contains more I/O than subsequent forecast hours) • Machines: Cray T3E/1200 (EV5/600 MHz) in Jülich, Germany; AlphaServer SC40 (EV67/667 MHz) in Marlboro, MA • Study performed by Pallas GmbH (www.pallas.com)
Performance comparisons • The Alpha EV67/667 MHz in the AlphaServer SC40 delivers about 3 times the performance of the EV5/600 MHz in the Cray T3E on the LM application • EV5 is running at about 6.7% of peak • EV67 is running at about 18.5% of peak
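As a quick consistency check, assuming a peak of two floating-point operations per cycle on both processors:

$$ \text{EV5: } 600\ \text{MHz} \times 2 = 1.2\ \text{GFLOPS peak}, \qquad 0.067 \times 1.2 \approx 80\ \text{MFLOPS sustained}; $$
$$ \text{EV67: } 667\ \text{MHz} \times 2 \approx 1.33\ \text{GFLOPS peak}, \qquad 0.185 \times 1.33 \approx 247\ \text{MFLOPS sustained}, $$

and 247/80 ≈ 3.1, consistent with the factor of about 3.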
Compilation Times • Cray T3E: flags -O3 -O aggress,unroll2,split1,pipeline2; compilation time 41 min 37 sec • Compaq EV6/500 MHz (EV67 is faster): flags -fast -O4; compilation time 5 min 15 sec • IBM SP3: flags -O4 -qmaxmem=-1; compilation time 40 min 19 sec (note: numeric_utilities.f90 had to be compiled with -O3 to avoid crashes)
SWEEP3D • 3D discrete ordinates (Sn) neutron transport • Implicit wavefront algorithm (see the sketch below) • Convergence to a stable solution • Target system: multitasked PVP / MPP • Vector-style code • High ratio of loads and stores to flops • Memory bandwidth and latency sensitive • Performance is sensitive to grid size
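A minimal 2-D sketch of the wavefront idea in C (illustrative only, not the actual SWEEP3D source, which sweeps in three dimensions over discrete ordinates): cells on the same anti-diagonal depend only on the previous wavefront, and each update does few flops per several memory references, which is why the code is bandwidth- and latency-sensitive.

    /* Minimal 2-D wavefront sketch (illustrative, not the SWEEP3D source).
       Cells on the anti-diagonal i + j = d depend only on their west and
       south neighbours, which lie on the previous wavefront, so cells on
       one wavefront could be updated concurrently. */
    #define NI 64
    #define NJ 64

    void wavefront_sweep(double phi[NI][NJ], const double src[NI][NJ])
    {
        for (int d = 2; d <= NI + NJ - 2; ++d) {   /* wavefront index */
            for (int i = 1; i < NI; ++i) {
                int j = d - i;
                if (j < 1 || j >= NJ)
                    continue;                       /* stay inside the grid */
                /* 3 flops against 4 memory references per cell */
                phi[i][j] = 0.5 * (phi[i - 1][j] + phi[i][j - 1]) + src[i][j];
            }
        }
    }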
Optimizations to SWEEP3D (the first two are sketched in C below) • Fuse inner loops: demote temporary vectors to scalars, reducing the load/store count • Separate loops with explicit values for "i2" = -1, 1: allows prefetch code to be generated • Move fixup code outside the loop: enables loop unrolling and pipelining
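A hedged sketch of loop fusion with scalar demotion (illustrative loop shapes, not the actual SWEEP3D loops):

    /* Before: two inner loops communicate through a temporary vector,
       costing a store and a reload of tmp[i] on every iteration. */
    void before(int n, const double *a, const double *b, double *out)
    {
        double tmp[1024];                  /* assumes n <= 1024 */
        for (int i = 0; i < n; ++i)
            tmp[i] = a[i] * b[i];
        for (int i = 0; i < n; ++i)
            out[i] = tmp[i] + a[i];
    }

    /* After: the loops are fused and the vector is demoted to a scalar,
       which the compiler keeps in a register, removing one store and one
       load per iteration. */
    void after(int n, const double *a, const double *b, double *out)
    {
        for (int i = 0; i < n; ++i) {
            double t = a[i] * b[i];        /* demoted temporary */
            out[i] = t + a[i];
        }
    }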
AlphaServer ES45 (EV68/1.001 GHz) block diagram: four Alpha 21264 CPUs, each with a private L2 cache on a 128-bit, 8.0 GB/s port; a crossbar switch (Typhoon chipset: quad controller plus eight data slices) joining CPUs, memory, and I/O over two 256-bit, 4.2 GB/s paths; 133 MHz SDRAM memory, 128 MB to 32 GB in four banks, each bank 64 bits wide (4.2 GB/s); I/O through 4x AGP and four PCI buses (32b @ 133 MHz, 512 MB/s; 64b @ 66 MHz, 256-512 MB/s each).
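The quoted 4.2 GB/s follows directly from the path width and the memory clock:

$$ 256\ \text{bits} \times 133\ \text{MHz} = 32\ \text{bytes} \times 133 \times 10^{6}\,\text{s}^{-1} \approx 4.26\ \text{GB/s}. $$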
Pittsburgh Supercomputing Center (PSC) • Cooperative effort of • Carnegie Mellon University • University of Pittsburgh • Westinghouse Electric • Offices in Mellon Institute • On CMU campus • Adjacent to UofP campus
Westinghouse Electric • Energy Center, Monroeville, PA • Major computing systems • High-speed network connections
Terascale Computing System at Pittsburgh Supercomputing Center • Sponsored by the U.S. National Science Foundation • Integrated into the PACI program (Partnerships for Advanced Computational Infrastructure) • Serving the "very high end" of academic computational science and engineering • The largest open facility in the world • PSC in collaboration with Compaq and with application scientists and engineers, applied mathematicians, computer scientists, and facilities staff • Compaq AlphaServer SC technology
System block diagram (control, servers, disks, switch, nodes) • 3040 CPUs • Tru64 UNIX • 3 TB memory • 41 TB disk • 152 CPU cabinets • 20 switch cabinets
ES45 nodes • 5 per cabinet • 3 local disks
Quadrics switches • Rail 0 • Rail 1
QSW switch chassis • Fully wired switch chassis • 1 of 42
Installation: from 0 to 3.465 TFLOPS in 29 days (latest: 4.059 TFLOPS on 3024 CPUs) • Deliveries & continual integration: • 44 nodes arrived at PSC on Saturday, 9-1-2001 • 50 nodes arrived on Friday, 9-7-2001 • 30 nodes arrived on Saturday, 9-8-2001 • 50 nodes arrived on Monday, 9-10-2001 • 180 nodes arrived on Wednesday, 9-12-2001 • 130 nodes arrived on Sunday, 9-16-2001 • 180 nodes arrived on Thursday, 9-20-2001 (all were to have shipped by 12 September!) • Federated switch cabled/operational by 9-23-01 • 760 nodes clustered by 9-24-01 • 3.465 TFLOPS Linpack by 9-29-01 • 4.059 TFLOPS in Dongarra's list dated Mon Oct 22 (67% of peak performance)
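The 67% figure checks out against theoretical peak, assuming two flops per cycle per EV68:

$$ 3024 \times 1.001\ \text{GHz} \times 2 \approx 6.05\ \text{TFLOPS peak}, \qquad 4.059 / 6.05 \approx 0.67. $$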
Alpha Microprocessor Summary • EV6: 0.35 µm, 600 MHz, 4-wide superscalar, out-of-order execution, high memory BW • EV67: 0.25 µm, up to 750 MHz • EV68: 0.18 µm, 1000 MHz • EV7: 0.18 µm, 1250 MHz, L2 cache on-chip, memory control on-chip, I/O control on-chip, cache-coherent inter-processor communication on-chip • EV79: 0.13 µm, ~1600 MHz
EV7 – The System is the Silicon…. SMP CPU interconnect used to be external logic… Now it’s on the chip • EV68 core with enhancements • Integrated L2 cache • 1.75 MB (ECC) • 20 GB/s cache bandwidth • Integrated memory controllers • Direct RAMbus (ECC) • 12.8 GB/s memory bandwidth • Optional RAID in memory • Integrated network interface • Direct processor-processor interconnects • 4 links - 25.6 GB/s aggregate bandwidth • ECC (single error correct, double error detect) • 3.2 GB/s I/O interface per processor
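Per link, the aggregate interconnect figure implies:

$$ 25.6\ \text{GB/s} \,/\, 4\ \text{links} = 6.4\ \text{GB/s per link}. $$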
EV7 – The System is the Silicon…. The electronics for cache-coherent communication are placed within the EV7 chip.
Alpha EV7 core (pipeline block diagram): the same stage 0-6 pipeline as the EV6 (fetch, map, queue, register read, execute, data-cache access; 4 instructions per cycle; 80 in-flight instructions plus 32 loads and 32 stores; 20-entry integer and 15-entry FP issue queues; two 80-entry integer register files and a 72-entry FP register file; 64 KB 2-way set-associative L1 instruction and data caches; branch predictors, victim buffer, miss-address file), now backed by an on-chip 1.75 MB 7-way set-associative L2 cache.
Virtual Page Size • Current virtual page size • 8K • 64K • 512K • 4M • New virtual page size (boot time selection) • 64K • 2M • 64M • 512M
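The point of larger pages is TLB reach (entries × page size). A rough illustration, assuming a hypothetical 128-entry data TLB (the entry count is not given on the slide):

$$ 128 \times 8\ \text{KB} = 1\ \text{MB of reach}, \qquad 128 \times 512\ \text{MB} = 64\ \text{GB of reach}. $$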
Performance • SPEC95 • SPECint95 75 • SPECfp95 160 • SPEC2000 • CINT2000 800 • CFP2000 1200 • 59% higher than EV68/1GHz
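Reading the 59% as a CFP2000 comparison (an assumption; the slide does not say which metric it qualifies), the implied EV68/1 GHz score is

$$ 1200 / 1.59 \approx 755. $$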
Building Block Approach to System Design • Key Components: • EV7 Processor • IO7 I/O Interface • Dual Processor Module • Systems Grow by Adding: • Processors • Memory • I/O
Two complementary views of the Grid • The hierarchy of understanding (Tony Hey, Director, UK eScience Core Programme): data are uninterpreted signals; information is data equipped with meaning; knowledge is information applied in practice to accomplish a task. The Internet is about information; the Grid is about knowledge. • Main technologies developed by man (Rick Stevens, ANL): writing captures knowledge; mathematics enables rigorous understanding and prediction; computing enables prediction of complex phenomena; the Grid enables intentional design of complex systems.
What is the Grid? "A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computing capabilities." (Ian Foster and Carl Kesselman, editors, The GRID: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, SF, 1999, 677 pp., ISBN 1-55860-8) • The Grid is an infrastructure that enables virtual communities to share distributed resources in pursuit of common goals • The Grid infrastructure consists of protocols, application programming interfaces, and software development kits that provide authentication, authorization, and resource location and access • Foster, Kesselman, Tuecke: "The Anatomy of the Grid: Enabling Scalable Virtual Organizations", http://www.globus.org/research/papers.html
Compaq and The Grid • Sponsor of the Global Grid Forum (www.globalgridforum.org) • Founding member of the New Productivity Initiative for Distributed Resource Management (www.newproductivity.org) • Industrial member of the GridLab consortium (www.gridlab.org) • 20 leading European and US institutions • Infrastructure, applications, testbed • Cactus “worm” demo at SC2001 (www.cactuscode.org) • Intra-Grid within Compaq firewall • Nodes in Annecy, Galway, Nashua, Marlboro, Tokyo • Globus, Cactus, GridLab infrastructure and applications • iPAQ Pocket PC (www.ipaqlinux.com)
Potential dangers for the Grid • Solution in search of a problem • Shell game for cheap (free) computing • Plethora of unsupported, incompatible, non-standard tools and interfaces
“Big Science” • As with the Internet, scientific computing will be the first to benefit from the Grid. Examples: • GriPhyN (US Grid Physics Network for Data-intensive Science) • Elementary particle physics, gravitational wave astronomy, optical astronomy (digital sky survey) • www.griphyn.org • DataGrid (led by CERN) • Analysis of data from scientific exploration • www.eu-datagrid.org • There are also compute-intensive applications that can benefit from the Grid
Final Thoughts: all this will not be easy • How good have we been, as a community, at making parallel computing easy and transparent? • There are still some things we can't do: predict the El Niño phenomenon correctly; model plate tectonics and Earth mantle convection; predict failure mechanisms in new materials • Validation and verification of numerical simulation are crying needs
Thank You! Please visit our HPTC Web Site http://www.compaq.com/hpc
Stability & Continuity for AlphaServer customers • Commitment to continue implementing the Alpha roadmap according to the current plan of record: EV68, EV7 & EV79; Marvel systems; Tru64 UNIX support • AlphaServer systems running Tru64 UNIX will be sold as long as customers demand them, at least several years after EV79 systems arrive in 2004, with support continuing for a minimum of 5 years beyond that
Microprocessor and System Roadmaps, 2001-2005 (roadmap diagram) • Alpha processors: EV68, then EV7, then EV79 • AlphaServers: the current EV68 product family (DS 1-2P, ES 1-4P, GS 1-32P) is followed by the EV7 "Marvel" family (2-8P systems from a 2P building block, 8-64P from an 8P building block) and an EV79 refresh of the same family • Itanium™ Processor Family: Itanium (1-4P), then the McKinley family, then Madison (2P, 4P, 8P, 8-64P, and blades), in a next-generation server family • ProLiant servers: 1-8P