NCCS User Forum 22 September 2009
Agenda Welcome & Introduction Phil Webster, CISTO Chief SSP Test Matt Koop, User Services Current System Status Fred Reitz, Operations Lead Discover Job Monitor Tyler Simon, User Services NCCS Compute Capabilities Dan Duffy, Lead Architect Analysis System Updates Tom Maxwell, Analysis Lead User Services Updates Bill Ward, User Services Lead PoDS Jules Kouatchou, SIVO Questions and Comments Phil Webster, CISTO Chief
Key Accomplishments • Incorporation of SCU5 processors into general queue pool • Capability to run large jobs (4000+ cores) on SCU5 • Analysis nodes placed in production • Migrated DMF from Dirac (Irix) to Palm (Linux)
New NCCS Staff Members • Lynn Parnell, Ph.D. Engineering Mechanics, High Performance Computing Lead • Matt Koop, Ph.D. Computer Science, User Services • Tom Maxwell, Ph.D. Physics, Analysis System Lead
Key Accomplishments
Discover/Analysis Environment • Added SCU5 (cluster totals: 10,840 compute CPUs, 110 TF) • Placed analysis nodes (dali01-dali06) in production status • Implemented storage area network (SAN) • Implemented GPFS multicluster feature • Upgraded GPFS • Implemented RDMA • Implemented InfiniBand token network
Discover/Data Portal • Implemented NFS mounts for select Discover data on the Data Portal
Data Portal • Migrated all users/applications to HP blade servers • Upgraded GPFS • Implemented GPFS multicluster feature • Implemented InfiniBand IP network • Upgraded SLES10 operating system to SP2
DMF • Migrated DMF from Irix to Linux
Other • Migrated non-compliant AUIDs • Transitioned SecurID operations from NCCS to ITCD • Enhanced NCCS network redundancy
Discover 2009 Daily Utilization Percentage (chart). Annotations: 2/4/09 SCU4 (544 cores added); 2/19/09 SCU4 (240 cores added); 2/27/09 SCU4 (1,280 cores added); 8/13/09 SCU5 (4,128 cores added)
Discover Daily Utilization Percentage by Group, May - August 2009 (chart). Annotation: 8/13/09 SCU5 (4,128 cores added)
Discover Total CPU Consumption, Past 12 Months (CPU Hours) (chart). Annotations: 9/4/08 SCU3 (2,064 cores added); 2/4/09 SCU4 (544 cores added); 2/19/09 SCU4 (240 cores added); 2/27/09 SCU4 (1,280 cores added); 8/13/09 SCU5 (4,128 cores added)
Discover Availability
Scheduled Maintenance, Jun-Aug:
• 10 Jun - 17 hrs 5 min - GPFS (token and subnets, 3.2.1-12)
• 24 Jun - 12 hrs - GPFS (RDMA, multicluster, SCU5 integration)
• 29 Jul - 12 hrs - GPFS 3.2.1-13, OFED 1.4, DDN firmware
• 30 Jul - 2 hrs 20 min - DDN controller replacement
• 19 Aug - 4 hrs - NASA AUID transition
Unscheduled Outages, Jun-Aug:
• 16 Jun - 3 hrs 35 min - nodes out of memory
• 24 Jun - 4 hrs 39 min - maintenance extension
• 6-7 Jul - 4 hrs 18 min - internal switch error
• 13 Jul - 2 hrs 59 min - GPFS error
• 14 Jul - 26 min - nodes out of memory
• 20 Jul - 2 hrs 2 min - GPFS error
• 29 Jul - 55 min - maintenance extension
• 19 Aug - 2 hrs 45 min - maintenance extension
Current Issues on Discover: Login Node Hangs • Symptom: Login nodes become unresponsive. • Impact: Users cannot log in. • Status: A fix is being developed and tested; the issue arose during installation of a critical security patch.
Current Issues on DMF: Post-Migration Clean-Up • Symptoms: Various. • Impact: Various. • Status: Issues are addressed as they are encountered and reported.
Future Enhancements • Discover Cluster • PBS V 10 • Additional storage • SLES10 SP2 • Data Portal • GDS OPeNDAP performance enhancements • Use of GPFS-CNFS for improved NFS mount availability
I/O Study Team • Dan Kokron • Bill Putman • Dan Duffy • Bill Ward • Tyler Simon • Matt Koop • Harper Pryor • Building on work by SIVO and GMAO (Brent Swartz)
Representative GEOS Output • Dan Kokron has generated many runs to characterize GEOS I/O • 720-core, quarter-degree GEOS run with YOTC-like history • Number of processes that write: 67 • Total amount of data: ~225 GB (written to multiple files) • Average write size: ~1.7 MB • Running in dnb33 • Using Nehalem cores (GPFS with RDMA) • Average bandwidth: timing the entire CFIO calls yields 3.8 MB/sec; timing just the NetCDF ncvpt calls yields 44.4 MB/sec • Why is this so slow?
Kernel Benchmarks • Used the open-source I/O kernel benchmarks xdd and iozone • Achieved over 1 GB/sec to all of the new nobackup file systems • Wrote two representative one-node C benchmarks • Using C writes, appending to files • Using NetCDF writes with chunking, appending to files • Ran these benchmarks writing exactly the same output as process 0 in the GEOS run • C writes: average bandwidth of around 900 MB/sec (consistent with the kernel benchmarks) • NetCDF writes: average bandwidth of around 600 MB/sec • So why is GEOS I/O running so slow?
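For reference, a minimal single-node write-bandwidth sketch in Python illustrating the shape of such a benchmark (the actual benchmarks described above were one-node C programs; the path, record size, and record count below are placeholders, not the NCCS benchmark parameters):

# Hypothetical single-node write-bandwidth sketch: append fixed-size records
# to a file and report MB/sec, mimicking the C benchmarks described above.
import os
import time

TARGET = "/discover/nobackup/username/iotest.dat"   # placeholder path
WRITE_SIZE = 1700000                                 # ~1.7 MB, the average GEOS write size
N_WRITES = 128                                       # number of appended records

buf = os.urandom(WRITE_SIZE)                         # one record's worth of data

start = time.time()
with open(TARGET, "ab") as f:                        # append, as the GEOS history writes do
    for _ in range(N_WRITES):
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())                             # force the data out to the file system
elapsed = time.time() - start

total_mb = WRITE_SIZE * N_WRITES / 1e6
print("wrote %.0f MB in %.2f s (%.1f MB/sec)" % (total_mb, elapsed, total_mb / elapsed))

Timing only the write loop, with a final fsync, keeps the measurement roughly comparable to the xdd/iozone kernel numbers.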
Effect of NetCDF Chunking • How does changing the NetCDF chunk size affect overall performance? • The table shows runs varying the chunk size, averaged over 10 runs per chunk size • Used the NetCDF kernel benchmark • The smallest chunk size reproduces the GEOS bandwidth • As best we can tell, this is roughly equivalent to the default chunk size • The best chunk size turned out to be about the size of the array being written (~3 MB) • References: • "NetCDF-4 Performance Report", Lee et al., June 2008 • NetCDF online tutorial: http://www.unidata.ucar.edu/software/netcdf/docs_beta/netcdf-tutorial.html • "Benchmarking I/O Performance with GEOSdas" and other Modeling Guru posts: https://modelingguru.nasa.gov/clearspace/message/5615#5615
Setting Chunk Size in GEOS • Dan K. ran several baseline runs to make sure we were measuring things correctly • Turned on chunking and set the chunk size equal to the write size (1080x721x1x1) • Dramatic improvement in ncvpt bandwidth • Why was the last run so slow? • Because we had a file system hang during that run
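As an illustration of the chunking idea (GEOS itself sets chunking through its CFIO/NetCDF interface; the file name, variable, and dimension layout below are hypothetical), a netCDF4-python sketch that makes the chunk size match the per-step write size:

# Hypothetical netCDF4-python sketch: chunk each variable so one chunk covers
# exactly one lat-lon slab, the unit written per level per time step (~3 MB).
import numpy as np
from netCDF4 import Dataset

ds = Dataset("history_sample.nc", "w", format="NETCDF4")   # placeholder file name
ds.createDimension("lon", 1080)
ds.createDimension("lat", 721)
ds.createDimension("lev", 72)
ds.createDimension("time", None)                           # unlimited dimension

# chunksizes follows the dimension order (time, lev, lat, lon)
var = ds.createVariable("T", "f4", ("time", "lev", "lat", "lon"),
                        chunksizes=(1, 1, 721, 1080))

slab = np.zeros((721, 1080), dtype="f4")                   # one 1080x721 write, ~3 MB
var[0, 0, :, :] = slab
ds.close()

With small default-style chunks, each ~3 MB write would be scattered across many chunks, which appears consistent with the slowdown seen in the GEOS runs above.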
What next? • Further explore chunk sizes in NetCDF • What is the best chunk size? • Do you set the chunk sizes for write performance or for read performance? • Once a file has been written with a set chunk size, it cannot be changed without rewriting the file. • Need to better understand the variability seen in the file system performance • Not uncommon to see a 2x or greater difference in performance from run to run • Turn the NetCDF kernel benchmark into a multi-node benchmark • Use this benchmark for testing system changes and potential new systems • Compare performance across NCCS and NAS systems • Write up results
Issue: Parallel Jobs > 1500 CPUs • Original problem: Many jobs wouldn’t run at > 1500 CPUs • Status at last Forum: Resolved using a different version of the DAPL library • Current Status: Now able to run at 4000+ CPUs using MVAPICH on SCU5
Issue: Getting Jobs into Execution • Long waits for queued jobs before launching • Reasons • SCALI=TRUE is restrictive • Per-user and per-project limits on the number of eligible jobs (use qstat -is to see them) • Scheduling policy: first fit on the job list, ordered by queue priority and queue time • User Services will be contacting users of SCALI=TRUE to assist them in migrating away from this feature
Future User Forums • NCCS User Forum schedule • 8 Dec 2009, 9 Mar 2010, 8 Jun 2010, 14 Sep 2010, and 7 Dec 2010 • All on Tuesdays • All 2:00-3:30 PM • All in Building 33, Room H114 • Published • On http://nccs.nasa.gov/ • On GSFC-CAL-NCCS-Users
Sustained System Performance • What is the overall system performance? • Many different benchmarks and peak numbers are available • Often unrealistic or not relevant • SSP refers to a set of benchmarks that evaluates performance on workloads representative of real use of the system • The SSP concept originated at NERSC (LBNL)
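As a rough illustration of how an SSP-style number can be computed (one common formulation, not necessarily the exact NERSC definition nor what the NCCS will adopt; the application names and rates below are invented):

# Sketch of an SSP-style composite metric: combine each benchmark's measured
# per-core rate with a geometric mean, then scale by the system's core count.
import math

benchmarks = {
    # name: (measured Gflop/s for the run, cores used) -- made-up numbers
    "app_a": (450.0, 512),
    "app_b": (980.0, 1024),
    "app_c": (300.0, 256),
}

per_core = [rate / cores for rate, cores in benchmarks.values()]
geo_mean = math.exp(sum(math.log(r) for r in per_core) / len(per_core))

system_cores = 10840   # Discover compute CPUs after SCU5 (from the accomplishments slide)
ssp = geo_mean * system_cores
print("SSP estimate: %.0f Gflop/s over %d cores" % (ssp, system_cores))

Re-running the same applications after every compiler, MPI, or OS change and tracking this one number over time is the monitoring idea described on the following slides.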
Performance Monitoring • Not just for evaluating a new system • Ever wonder if a system change has affected performance? • Often changes can be subtle and not detected with normal system validation tools • Silent corruption • Slowness • Find out immediately instead of after running the application and getting an error
Performance Monitoring (contd.) • Run real workloads (SSP) to determine performance changes over time • Quickly determine if something is broken or slow • Perform data verification • Run automatically on a regular basis as well as after system changes • e.g., a change to a compiler, MPI version, or OS update • (NERSC SSP example chart)
Meaningful Measurements • How you can help • We need your application and a representative dataset for your application • Ideally should take ~20-30 minutes to run at various processor counts • Your benefits • Changes to the system that affect your application will be noticed immediately • Data will be placed on NCCS website to show system performance over time
Discover Job Monitor • All data are presented as a current system snapshot, updated at 5-minute intervals • Displays system load as a percentage • Displays the number of running jobs and cores in use • Displays queued jobs and job wait times • Displays current qstat -a output • Interactive historical utilization chart • Message of the day • Displays the average number of cores per job • Job Monitor
Climate Data Analysis • Climate models are generating ever-increasing amounts of output data. • Larger datasets are making it increasingly cumbersome for scientists to perform analyses on their desktop computers. • Server-side analysis of climate model results is quickly becoming a necessity.
Parallelizing Application Scripts • Many data-processing shell scripts can be easily parallelized • MATLAB, IDL, etc. • Use task parallelism to process multiple files in parallel • Each file is processed on a separate core within a single dali node • Limit the load on dali (16 cores per node) • Max: 10 compute-intensive processes per node

Serial version:
while ( … )
    …                  # process another file
    run.grid.qd.s
    …
end

Parallel version:
while ( … )
    …                  # process another file
    run.grid.qd.s &    # the trailing & backgrounds each file's processing on its own core
    …
end
ParaView • Open-source, multi-platform visualization application • Developed by Kitware, Inc. (authors of VTK) • Designed to process large data sets • Built on parallel VTK • Client-server architecture: • Client: Qt-based desktop application • Data Server: MPI-based parallel application running on dali • Parallel streaming filters for data processing • Large library of existing filters • Highly extensible using plugins • Plugin development required for HDF, NetCDF, and OBS data • No existing climate-specific tools or algorithms • Data Server being integrated into ESG
ParaView Client • Qt desktop application that controls data access, processing, analysis, and visualization
Analysis Workflow Configuration • Configure a parallel streaming pipeline for data analysis
ParaView Applications (example images): polar vortex breakdown simulation; Golevka asteroid explosion simulation; 3D Rayleigh-Benard problem; cross-wind fire simulation
Climate Data Analysis Toolkit • Integrated environment for data processing, visualization, and analysis • Integrates numerous software modules in a Python shell • Open source, with a large and diverse set of contributors • Analysis environment for ESG, developed at LLNL
Data Manipulation • Exploits the NumPy Array and Masked Array • Adds persistent climate metadata • Exposes NumPy, SciPy, and RPy mathematical operations: clustering, FFT, image processing, linear algebra, interpolation, maximum entropy, optimization, signal processing, statistical functions, convolution, sparse matrices, regression, spatial algorithms
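For example, a small CDAT sketch of this kind of metadata-aware, masked-array manipulation on dali (the file and variable names are hypothetical, and exact calls may vary with CDAT version):

# Hypothetical CDAT sketch: read a variable, form anomalies, and compute a
# temporal standard deviation using cdms2, cdutil, and genutil.
import cdms2
import cdutil
import genutil

f = cdms2.open("tas_monthly.nc")               # placeholder NetCDF file
tas = f("tas")                                 # masked, metadata-aware variable (time, lat, lon)

clim = cdutil.averager(tas, axis="t")          # time mean
anom = tas - clim                              # anomaly field via masked-array arithmetic
sigma = genutil.statistics.std(tas, axis="t")  # temporal standard deviation

print(anom.shape, sigma.shape)
f.close()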
Grid Support • Spherical Coordinate Remapping and Interpolation Package • remapping and interpolation between grids on a sphere • Map between any pair of lat-long grids • GridSpec • Standard description of earth system model grids • To be implemented in NetCDF CF convention • Implemented in CMOR • MoDAVE • Grid visualization
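A hedged sketch of the lat-lon regridding path through CDAT (the source file, variable, and target resolution below are illustrative only):

# Hypothetical regridding sketch: interpolate a variable from its native
# lat-lon grid onto a uniform 2-degree grid using cdms2's regrid support.
import cdms2

f = cdms2.open("model_output.nc")              # placeholder file
ta = f("ta")                                   # variable on the model's native grid

# createUniformGrid(startLat, nlat, deltaLat, startLon, nlon, deltaLon)
target = cdms2.createUniformGrid(-89.0, 90, 2.0, 0.0, 180, 2.0)

ta_2deg = ta.regrid(target)                    # interpolate onto the target grid
print(ta_2deg.shape)
f.close()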
Climate Analysis • Genutil & Cdutil (PCMDI) • General utilities for climate data analysis • Statistics, array & color manipulation, selection, etc. • Climate Utilities • Time extraction, averages, bounds, interpolation • Masking/regridding, region extraction • PyClimate • Toolset for analyzing climate variability • Empirical Orthogonal Function (EOF) analysis • Analysis of coupled data sets • Singular Value Decomposition (SVD) • Canonical Correlation Analysis (CCA) • Linear digital filters • Kernel-based probability density function estimation
CDAT Climate Diagnostics • Provides a common environment for climate research • Uniform diagnostics for model evaluation and comparison • Examples: Taylor diagram, thermodynamic plot, performance portrait plot, Wheeler-Kiladis analysis
Contributed Packages • PyGrADS (potential) • AsciiData • BinaryIO • ComparisonStatistics • CssGrid • DsGrid • Egenix • EOF • EzTemplate • HDF5Tools • IOAPITools • Ipython • Lmoments • MSU • NatGrid • ORT • PyLoapi • PynCl • RegridPack • ShGrid • SP • SpanLib • SpherePack • Trends • Twisted • ZonalMeans • ZopeInterface
Visualization • Visualization and Control System (VCS) • Standard CDAT 1D and 2D graphics package • Integrated Contributed 2D Packages • Xmgrace • Matplotlib • IaGraph • Integrated Contributed 3D packages • ViSUS • VTK • NcVTK • MoDAVE
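For instance, a small VCS sketch of plotting a 2D field from the CDAT shell (the file and variable names are hypothetical):

# Hypothetical VCS sketch: plot the first time step of a field as a
# filled-contour (isofill) map and save it to a PNG.
import cdms2
import vcs

f = cdms2.open("tas_monthly.nc")                 # placeholder file
tas0 = f("tas", time=slice(0, 1), squeeze=1)     # first time step as a 2D lat-lon field

canvas = vcs.init()                              # open a VCS canvas
canvas.plot(tas0, "default", "isofill")          # default template, isofill graphics method
canvas.png("tas_first_timestep")                 # write the image
f.close()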
Visual Climate Data Analysis Tools (VCDAT) • CDAT GUI that facilitates: • Data access • Data processing & analysis • Data visualization • Accepts Python input • Commands and scripts • Saves state • Converts keystrokes to Python • Online help