Chip-Multiprocessors & You John Dennis dennis@ucar.edu March 16, 2007
Intel “Tera Chip” • 80-core chip • 1 Teraflop • 3.16 GHz / 0.95 V / 62 W • Process • 45 nm technology • High-K • 2D mesh network • Each processor has a 5-port router • Connects to “3D-memory”
Outline • Chip-Multiprocessor • Parallel I/O library (PIO) • Fun with Large Processor Counts • POP • CICE
Moore’s Law • Most things get twice as nice [every 18 months] • Transistor count • Processor speed • DRAM density • Historical result: • Solve a problem twice as large in the same time • Solve the same-size problem in half the time --> Inactivity leads to progress!
The advent of chip-multiprocessors: Moore’s Law gone bad!
New implications of Moore’s Law • Every 18 months: • # of cores per socket doubles • Memory density doubles • Clock speed may increase slightly • 18 months from now: • 8 cores per socket • Slight increase in clock speed (~15%) • Same memory per core!!
New implications of Moore’s Law (cont’d) • Inactivity leads to no progress! • Possible outcomes: • Same problem size / same parallelism • Solve the problem ~15% faster • Bigger problem size • Scalable memory? • More processors enable ~2x reduction in time to solution • Non-scalable memory? • May limit the number of processors that can be used • Waste 1/2 of the cores on a socket just to use the memory? • All components of an application must scale to benefit from Moore’s Law increases! The memory footprint problem will not solve itself!
Questions?
Parallel I/O library (PIO) John Dennis (dennis@ucar.edu) Ray Loy (rloy@mcs.anl.gov) March 16, 2007
Introduction • All component models need parallel I/O • Serial I/O is bad! • Increased memory requirement • Typically a negative impact on performance • Primary developers: [J. Dennis, R. Loy] • Necessary for POP BGW runs
Design goals • Provide parallel I/O for all component models • Encapsulate complexity into the library • Simple interface for component developers to implement
Design goals (cont’d) • Extensible for future I/O technology • Backward compatible (node=0) • Support for multiple formats • {sequential, direct} binary • netCDF • Preserve format of input/output files • Supports 1D, 2D and 3D arrays • Currently XY • Extensible to XZ or YZ
Terms and Concepts • PnetCDF [ANL] • High-performance I/O • Different interface • Stable • netCDF4 + HDF5 [NCSA] • Same interface • Needs the HDF5 library • Less stable • Lower performance • No support on Blue Gene
Terms and Concepts (cont’d) • Processor stride: • Allows matching a subset of MPI tasks used for I/O to the system hardware
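A minimal sketch of what a processor stride can mean in practice: pick every stride-th MPI rank as an I/O task and give those ranks their own communicator. This assumes mpi4py; the helper name io_subcomm is illustrative and not part of PIO's interface.

```python
from mpi4py import MPI

def io_subcomm(comm, stride):
    """Return a communicator holding every `stride`-th rank of `comm`
    (the designated I/O tasks); all other ranks get MPI.COMM_NULL."""
    rank = comm.Get_rank()
    color = 0 if rank % stride == 0 else MPI.UNDEFINED
    return comm.Split(color, rank)

iocomm = io_subcomm(MPI.COMM_WORLD, stride=4)
if iocomm != MPI.COMM_NULL:
    pass  # only these ranks issue the actual file reads/writes
```

Choosing the stride lets the number of writers be matched to, for example, the number of I/O nodes or file-system servers behind the compute partition.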
Terms and Concepts (cont’d) • IO decomp vs. COMP decomp • IO decomp == COMP decomp • MPI-IO + message aggregation • IO decomp != COMP decomp • Need a rearranger: MCT • No component-specific info in the library • Pair with existing communication technology • 1D arrays in the library; the component must flatten 2D and 3D arrays
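As a concrete illustration of the last point, a small numpy sketch of flattening a halo'd 2D block into the 1D buffer a PIO-style call would receive; the block size and halo width are made-up values.

```python
import numpy as np

nx_block, ny_block, halo = 100, 80, 2     # illustrative block size and ghost width
local2d = np.random.rand(ny_block + 2 * halo, nx_block + 2 * halo)  # block + halo

# Keep only the interior (owned) points and flatten them into the 1D
# buffer handed to the I/O library; halo points are excluded because
# they are owned -- and written -- by neighbouring blocks.
flat = local2d[halo:-halo, halo:-halo].ravel()
assert flat.size == nx_block * ny_block
```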
Component Model ‘issues’ • POP & CICE: • Missing blocks • Update of neighbors’ halos • Who writes the missing blocks? • Asymmetry between read/write • ‘Sub-block’ decompositions are not rectangular • CLM: • Decomposition not rectangular • Who writes the missing data?
What works • Binary I/O [direct] • Tested on POWER5, BGL • Rearrange w/ MCT + MPI-IO • MPI-IO with no rearrangement • netCDF • Rearrange with MCT [New] • Reduced memory • PnetCDF • Rearrange with MCT • No rearrangement • Tested on POWER5, BGL
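For context, a minimal sketch of the direct-binary MPI-IO pattern in the first bullet: every rank writes its contiguous chunk of the field with a collective call. This uses mpi4py and numpy and is not PIO's actual interface; the sizes and file name are illustrative.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

nlocal = 1024                               # doubles owned by this rank (illustrative)
data = np.full(nlocal, rank, dtype='f8')

fh = MPI.File.Open(comm, 'restart.bin',
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
offset = rank * nlocal * data.itemsize      # contiguous, rank-ordered file layout
fh.Write_at_all(offset, data)               # collective write across all ranks
fh.Close()
```

The same pattern with a rearrangement step in front (MCT in PIO's case) lets a subset of ranks do the writing while compute ranks keep their own decomposition.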
What works (cont’d) • Prototype added to POP2 • Reads restart and forcing files correctly • Writes binary restart files correctly • Necessary for BGW runs • Prototype implementation in HOMME [J. Edwards] • Writes netCDF history files correctly • POPIO benchmark • 2D array [3600x2400] (70 Mbyte) • Test code for correctness and performance • Tested on 30K BGL processors in Oct 06 • Performance • POWER5: 2-3x the serial I/O approach • BGL: mixed
Complexity / Remaining Issues • Multiple ways to express a decomposition • GDOF: global degrees of freedom --> (MCT, MPI-IO) • Subarrays: start + count (PnetCDF) • Limited expressiveness • Will not support ‘sub-block’ decompositions in POP & CICE, or CLM • Need a common language for the interface • Interface between component model and library
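A small sketch contrasting the two descriptions for a rectangular block. The global grid size is the POPIO benchmark size above; the start/count values and the 0-based, x-fastest GDOF numbering are illustrative assumptions, not PIO's actual convention.

```python
import numpy as np

nx, ny = 3600, 2400                 # global grid (POPIO benchmark size)
start = (1800, 600)                 # (x, y) offset of this rank's block -- illustrative
count = (100, 80)                   # block extent                       -- illustrative

# start + count is the PnetCDF-style subarray description; here is the
# same set of points as an explicit global-degree-of-freedom (GDOF) list:
xs, ys = np.meshgrid(np.arange(start[0], start[0] + count[0]),
                     np.arange(start[1], start[1] + count[1]), indexing='ij')
gdof = (ys * nx + xs).ravel()       # 0-based, x-fastest global indices

assert gdof.size == count[0] * count[1]
# A GDOF list can also express non-rectangular ('sub-block') regions,
# which start + count alone cannot.
```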
Conclusions • Working prototype • POP2 for binary I/O • HOMME for netCDF • PIO telecon: discuss progress every 2 weeks • Work in progress • Multiple efforts underway • Accepting help • http://swiki.ucar.edu/ccsm/93 • In the CCSM Subversion repository
Fun with Large Processor Counts: POP, CICE John Dennis dennis@ucar.edu March 16, 2007
Motivation • Can the Community Climate System Model (CCSM) be a petascale application? • Use 10-100K processors per simulation • Increasingly common access to large systems • ORNL Cray XT3/4: 20K [2-3 weeks] • ANL Blue Gene/P: 160K [Jan 2008] • TACC Sun: 55K [Jan 2008] • Petascale for the masses? • Lag time in the Top 500 list [4-5 years] • At NCAR before 2015
Outline • Chip-Multiprocessor • Parallel I/O library (PIO) • Fun with Large Processor Counts • POP • CICE
Status of POP • Access to 17K Cray XT4 processors • 12.5 years/day [current record] • 70% of time in the solver • Won BGW cycle allocation: “Eddy Stirring: The Missing Ingredient in Nailing Down Ocean Tracer Transport” [J. Dennis, F. Bryan, B. Fox-Kemper, M. Maltrud, J. McClean, S. Peacock] • 110 rack-days / 5.4M CPU hours • 20-year 0.1° POP simulation • Includes a suite of dye-like tracers • Simulate the eddy diffusivity tensor
Status of POP (cont’d) • Allocation will occur over ~7 days • Run in production on 30K processors • Needs parallel I/O to write history files • Start runs in 4-6 weeks
Outline • Chip-Multiprocessor • Parallel I/O library (PIO) • Fun with Large Processor Counts • POP • CICE
Status of CICE • Tested CICE @ 1/10° • 10K Cray XT4 processors • 40K IBM Blue Gene processors [BGW days] • Use weighted space-filling curves (wSFC) • erfc • Climatology
POP (gx1v3) + Space-filling curve
Space-filling curve partition for 8 processors
Weighted Space-filling curves • Estimate the work for each grid block: Work_i = w0 + P_i * w1, where w0 is the fixed work for all blocks, w1 is the work if the block contains sea-ice, and P_i is the probability that block i contains sea-ice • For our experiments: w0 = 2, w1 = 10
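A minimal sketch, under the stated weights, of how SFC-ordered blocks might be grouped into contiguous, roughly equal-work chunks; the greedy chunking and function name here are illustrative, not the actual CICE partitioner.

```python
def partition_sfc_blocks(sfc_blocks, p_ice, npes, w0=2.0, w1=10.0):
    """Split SFC-ordered blocks into npes contiguous groups of roughly
    equal estimated work, using Work_i = w0 + P_i * w1 (w0 = 2, w1 = 10
    as in the experiments above)."""
    work = [w0 + w1 * p_ice[b] for b in sfc_blocks]
    target = sum(work) / npes
    parts, current, acc = [], [], 0.0
    for blk, w in zip(sfc_blocks, work):
        current.append(blk)
        acc += w
        if acc >= target and len(parts) < npes - 1:
            parts.append(current)
            current, acc = [], 0.0
    parts.append(current)
    return parts

# Example: 16 blocks along the curve, the last four in the ice zone.
blocks = list(range(16))
p = {b: (1.0 if b >= 12 else 0.0) for b in blocks}
print([len(part) for part in partition_sfc_blocks(blocks, p, npes=4)])
# -> [9, 4, 2, 1]: ice-covered blocks are spread over more processors.
```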
Probability Function • Error function: P_i = erfc((μ − max(|lat_i|)) / σ), where lat_i is the maximum latitude in block i, μ is the mean sea-ice extent, and σ is the variance in sea-ice extent • μ_NH = 70°, μ_SH = 60°, σ = 5°
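A small sketch of this erfc-based weight using Python's math.erfc; the helper name and calling convention are illustrative.

```python
from math import erfc

def ice_work_probability(block_lats_deg, mu=70.0, sigma=5.0):
    """Relative likelihood that a block contains sea ice, following the
    weighting above: P_i = erfc((mu - max|lat_i|) / sigma).
    mu = 70 (NH) or 60 (SH) and sigma = 5 as on the slide."""
    lat_max = max(abs(lat) for lat in block_lats_deg)
    return erfc((mu - lat_max) / sigma)

# A block reaching 75N gets a much larger weight than one capped at 40N:
print(ice_work_probability([72.0, 75.0]))   # ~1.84
print(ice_work_probability([35.0, 40.0]))   # ~0 (effectively ice-free)
```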
1° CICE4 on 20 processors: large domains @ low latitudes, small domains @ high latitudes
0.1° CICE4 • Developed at LANL • Finite difference • Models sea-ice • Shares grid and infrastructure with POP • Reuse techniques from the POP work • Computational grid: [3600 x 2400 x 20] • Computational load imbalance creates problems: • ~15% of the grid has sea-ice • Use weighted space-filling curves? • Evaluate using a benchmark: • 1 day / initial run / 30-minute timestep / no forcing
CICE4 @ 0.1°
Load imbalance: Hudson Bay south of 70°. Timings for 1°, npes=160, μ_NH = 70°
Timings for 1°, npes=160, μ_NH = 55°
Better Probability Function • Climatological function: estimate P_i from the climatological maximum sea-ice extent [satellite observation] at the points of block i, where n_i is the number of points within block i with non-zero climatological sea-ice extent
Timings for 1°, npes=160, climate-based weighting • Reduces dynamics sub-cycling time by 28%!
Acknowledgements / Questions? • Thanks to: D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), J. Edwards (IBM), E. Hunke (LANL), B. Kadlec (CU), E. Jessup (CU), P. Jones (LANL), K. Lindsay (NCAR), W. Lipscomb (LANL), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), S. Weese (NCAR), P. Worley (ORNL) • Computer time: • Blue Gene/L time: NSF MRI Grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson) • Cray XT3/4 time: ORNL, Sandia
Partitioning with Space-filling Curves • Map 2D -> 1D • Variety of sizes: • Hilbert (Nb = 2^n) • Peano (Nb = 3^m) • Cinco (Nb = 5^p) • Hilbert-Peano (Nb = 2^n 3^m) • Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p) • Partition the 1D array
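For illustration, the textbook Hilbert-curve index-to-coordinate routine for the Nb = 2^n case; this is the standard algorithm, not the specific implementation used in POP/CICE.

```python
def hilbert_d2xy(n, d):
    """Map position d along the Hilbert curve to (x, y) on an n x n
    grid of blocks, with n a power of two."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate/reflect the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Visiting d = 0 .. n*n - 1 enumerates every block exactly once along a
# locality-preserving path, which is then cut into contiguous pieces.
order = [hilbert_d2xy(8, d) for d in range(8 * 8)]
```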
Scalable data structures • Common problem among applications • WRF • Serial I/O [fixed] • Duplication of lateral boundary values • POP & CICE • Serial I/O • CLM • Serial I/O • Duplication of grid info
Scalable data structures (cont’d) • CAM • Serial I/O • Lookup tables • CPL • Serial I/O • Duplication of grid info The memory footprint problem will not solve itself!
Remove Land blocks
Case Study: Memory use in CLM • CLM configuration: • 1x1.25 grid • No RTM • MAXPATCH_PFT = 4 • No CN, DGVM • Measure stack and heap on 32-512 BG/L processors
Memory use of CLM on BGL
Motivation (cont’d) • Multiple efforts underway • CAM scalability + high-resolution coupled simulation [A. Mirin] • Sequential coupler [M. Vertenstein, R. Jacob] • Single-executable coupler [J. Wolfe] • CCSM on Blue Gene [J. Wolfe, R. Loy, R. Jacob] • HOMME in CAM [J. Edwards]
Outline • Chip-Multiprocessor • Fun with Large Processor Counts • POP • CICE • CLM • Parallel I/O library (PIO)
Status of CLM • Work of T. Craig • Elimination of global memory • Reworking of decomposition algorithms • Addition of PIO • Short-term goal: • Participation in BGW days, June 07 • Investigating scalability at 1/10°
Status of CLM memory usage • May 1, 2006: • Memory usage increases with processor count • Can run 1x1.25 on 32-512 BGL processors • July 10, 2006: • Memory usage scales to an asymptote • Can run 1x1.25 on 32-2K BGL processors • ~350 persistent global arrays [24 Gbytes/proc @ 1/10 degree] • January 2007: • ~150 persistent global arrays [10.5 Gbytes/proc @ 1/10 degree] • 1/2 degree runs on 32-2K BGL processors • February 2007: • 18 persistent global arrays [1.2 Gbytes/proc @ 1/10 degree] • Target: • No persistent global arrays • 1/10 degree runs on a single rack of BGL
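A back-of-the-envelope check of the Gbytes/proc figures above, assuming each persistent global array is a double-precision field on the full 3600 x 2400 (0.1°) grid; that grid-size assumption is taken from the POP/CICE slides, not stated here for CLM.

```python
# Rough check of the per-process memory figures above, assuming each
# persistent global array is double precision on a 3600 x 2400 grid
# (the 0.1-degree POP/CICE size; an assumption for CLM).
nx, ny, bytes_per_value = 3600, 2400, 8
per_array_gb = nx * ny * bytes_per_value / 1e9     # ~0.069 GB per global array

for n_arrays in (350, 150, 18):
    print(f"{n_arrays:4d} arrays -> {n_arrays * per_array_gb:5.1f} GB/proc")
# ~24.2, ~10.4 and ~1.2 GB/proc -- consistent with the numbers quoted above.
```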
Proposed Petascale Experiment • Ensemble of 10 runs / 200 years • Petascale configuration: • CAM (30 km, L66) • POP @ 0.1° • 12.5 years / wall-clock day [17K Cray XT4 processors] • Sea-ice @ 0.1° • 42 years / wall-clock day [10K Cray XT3 processors] • Land model @ 0.1° • Sequential design (105 days per run) • 32K BGL / 10K XT3 processors • Concurrent design (33 days per run) • 120K BGL / 42K XT3 processors