Tools at Scale - Requirements and Experience
Mary Zosel, LLNL ASCI / PSE
ASCI Simulation Development Environment Tools Project
Prepared for SciComp 2000, La Jolla, CA, Aug 14-16, 2000
UCRL: VG - 139702
Presentation Outline: • Overview of Systems • Requirements for Scale • Experience/Progress in debugging and tuning
ASCI WHITE • 8192 P3 CPUs • NightHawk 2 nodes • Colony Switch • 12.3 TF peak • 160 TB disk • 28 tractor trailers • Classified Network • Full system at IBM; 120 nodes in new home at LLNL - remainder due late Aug.
White joins these IBM platforms at LLNL • 128 cpu - SNOW - (8-way P3 NH 1 nodes - Colony) • Experimental software development platform - Unclassified • 1344 cpu - BLUE - (4-way 604e silver nodes / TB3MX) • Production unclassified platform • 16 cpu - BABY - (4-way 604e silver nodes / TB3MX) • Experimental development platform - first stop for new system software • 64 cpu - ER - (4-way 604e silver nodes / TB3MX) • Backup production system “parts” - and experimental software • 5856 cpu - SKY (3 sectors of 488 silver nodes - connected with TB3MX and 6 HPGN IP routers) - Classified production system. • When White is complete, ~2/3 of SKY will become the unclassified production system
Why the big machines? • The purpose of ASCI is new 3-D codes for use in place of testing for Stockpile Certification. • The ASCI program plan calls for a series of application milepost demonstrations of increasingly complex calculations, which require the very large platforms. • Last year - 1000 cpu requirement • This year - 1500 cpu requirement • Next year - ~4000 cpu requirement • Tri-lab resource -> multiple code teams with large scale requirements
What does this imply for the development environment? Pressure … Stress … Pressure • Deadlines: multiple code teams working against time • Long Calculations: need to understand and optimize the time requirements of each component to plan for production runs • Large Scale: easy to push past the knee of scalability - and past the Troutbeck US limit of 1024 tasks • Large Memory: n**2 buffer management schemes hurt (see the sketch below) • Access Contention: not easy to get large test runs - especially for tool work
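A minimal sketch (not from the original slides) of the n**2 buffer-management pattern warned about above: if every MPI task keeps a dedicated buffer for every other task, per-task memory grows linearly with the task count and total memory across the machine grows as its square. The buffer size and layout here are illustrative assumptions.

```c
/* Hypothetical illustration of per-peer buffer growth; the sizes are made up. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define PER_PEER_BYTES (64 * 1024)   /* assumed per-peer buffer size */

int main(int argc, char **argv)
{
    int rank, ntasks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* One buffer per peer: harmless at 64 tasks, painful at 4000+. */
    char **peer_buf = malloc(ntasks * sizeof(char *));
    for (int p = 0; p < ntasks; p++)
        peer_buf[p] = malloc(PER_PEER_BYTES);

    if (rank == 0)
        printf("per-task buffer memory: %.1f MB at %d tasks\n",
               ntasks * (double)PER_PEER_BYTES / (1024.0 * 1024.0), ntasks);

    for (int p = 0; p < ntasks; p++)
        free(peer_buf[p]);
    free(peer_buf);
    MPI_Finalize();
    return 0;
}
```

At a few dozen tasks the per-peer bookkeeping is invisible; at several thousand tasks it can dominate the memory budget, which is why codes aimed at the largest platforms move toward fixed-size pools or on-demand buffers.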
What Tools are in use? Staying with standards helps make tools usable • Languages/Compilers: C, C++, Fortran from both IBM and KAI • Runtime: OpenMP and MPI (see the hybrid sketch below) • Production codes not using PVM, shmem, direct LAPI use, etc., and direct use of pthreads is very limited • Debugging / Tuning: TotalView, LCF, Great Circle, ZeroFault, Guide, Vampir, xprofiler, pmapi / papi, and hopefully new IBM tools
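For reference, a minimal hybrid MPI + OpenMP sketch in C (not from the original slides) of the standards-based model the production codes follow: MPI tasks between nodes, OpenMP threads within a node. Compiler flags and thread counts are assumptions; the actual codes use the IBM and KAI compilers listed above.

```c
/* Minimal hybrid MPI + OpenMP example; each thread reports its place
 * in the task/thread hierarchy.  Build with an MPI C compiler and the
 * compiler's OpenMP flag (flags vary by vendor).                      */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, ntasks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    #pragma omp parallel
    {
        printf("task %d of %d, thread %d of %d\n",
               rank, ntasks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```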
Debugging --- LLNL Experience • Users DO want to use the debugger with large # cpus • There have been lots of frustrations - but there is progress and expectation of further improvements • Slow to attach / start … what was hours is now minutes • Experience / education helps avoid some problems ... • Need large memory settings in ld • Now have MP_SYNC_ON_CONNECT off by default • Set startup timeouts (MP_TIMEOUT) • “Sluggish but tolerable” describes a recent 512 cpu session • Local feature development aimed at scale ... • Subsetting, collapse, shortcuts, filtering, … both CLI and X versions • Etnus continuing to address scalability
[TotalView screenshots: root window collapsed, showing task 4 in a different state, and the same root window opened to show all tasks]
[TotalView screenshot: cycling through message state - example of the thumb-screw control on the message window]
Performance … status quo is less promising • MPI scale is an issue - OpenMP reduces the problem • Understanding thread performance is an issue • Users DO want to use the tools - this is new • They need estimates for their large code runs … • Is my job running or hung? (see the timing sketch below) • Tools aren’t yet ready for scale - including size-of-code scaling • Several tools do not support threads • Problems often not in the user’s code
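A hedged sketch (not from the slides) of the kind of lightweight timing users add to estimate phase costs before a large run and to see whether a job is still making progress: time a phase with MPI_Wtime and reduce the minimum and maximum across tasks so load imbalance is visible. The phase function is a hypothetical stand-in for one application component.

```c
/* Illustrative phase timing with MPI_Wtime and a min/max reduction. */
#include <mpi.h>
#include <stdio.h>

/* do_phase() stands in for one component of the application. */
static void do_phase(void) { /* ... application work ... */ }

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    do_phase();
    double dt = MPI_Wtime() - t0;

    double tmin, tmax;
    MPI_Reduce(&dt, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&dt, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    /* Periodic prints like this also show that the job is alive, not hung. */
    if (rank == 0)
        printf("phase time: min %.2f s  max %.2f s\n", tmin, tmax);

    MPI_Finalize();
    return 0;
}
```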
List of sample problems - User observes that …
• … as the number of tasks grows, the code becomes relatively slower and slower. The sum of the CPU time and the system time doesn't add up to the wall-clock time – and this missing time is the component growing the fastest. [Diagnosis – bad adaptor software configuration was causing excessive fragmentation and retransmission of MPI messages]
• … unexplained code slow-down from previous runs, and nothing in the code has changed. [Diagnosis – orphaned processes on one node slowed down the entire code]
• … the threaded version of the code is much slower than straight MPI. [Diagnosis – the code had many small malloc calls and was serializing through the malloc code; see the sketch after this list]
• … a certain part of the code takes 10 seconds to run while the problem is small – and then, after a call to a memory-intensive routine, the same portion of code takes 18 seconds to run. [Diagnosis – not sure, but believed to be memory heap fragmentation causing paging]
• … the job runs faster on Blue (604e system) than it does on Snow (P3 system). [Diagnosis – not yet known; wonder about the flow-control default setting]
• … a non-blocking message-test code is taking up to 15 times longer to run on Snow than it does on Blue. [Diagnosis – not yet known; flow control setting doesn’t help]
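To illustrate the malloc-serialization diagnosis above, here is a hedged OpenMP sketch (not the actual LLNL application code): the first loop pays for a tiny malloc/free per item under the allocator's internal lock, while the second allocates one scratch buffer per thread up front and stays out of malloc in the hot loop. Sizes and counts are made up for illustration.

```c
/* Contrast: per-item malloc under contention vs. per-thread preallocation. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N_ITEMS    100000
#define ITEM_BYTES 64

int main(void)
{
    /* Slow pattern: a tiny malloc/free per item; all threads contend
     * for the allocator lock. */
    #pragma omp parallel for
    for (int i = 0; i < N_ITEMS; i++) {
        char *tmp = malloc(ITEM_BYTES);
        tmp[0] = (char)i;                 /* stand-in for real work */
        free(tmp);
    }

    /* Faster pattern: one allocation per thread, reused across items. */
    #pragma omp parallel
    {
        char *scratch = malloc(ITEM_BYTES);   /* per-thread scratch buffer */
        #pragma omp for
        for (int i = 0; i < N_ITEMS; i++)
            scratch[0] = (char)i;             /* stand-in for real work */
        free(scratch);
    }

    printf("done\n");
    return 0;
}
```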
What are we doing about this? • PathForward contracts: KAI/Pallas, Etnus, MSTI • Infrastructure development: to facilitate new tools / probes • supports click-back to source • currently QT on DPCL … future??? • Probe components: memory usage, MPI classification • Lightweight CoreFile … and Performance Monitors • External observation … Monitor, PS, VMSTAT … • Testing new IBM beta tools • Sys admins starting a performance regression database
[Diagram: Tool work in progress - tool infrastructure supporting MPI classification, performance monitor, and memory performance tools]
The faster I go, the behinder I get … we ARE making progress, but the problems are getting harder and coming in faster ...
It's a Team Effort:
Rich Zwakenberg - debugging
Karen Warren
Bor Chan
John May - performance tools
Jeff Vetter
John Gyllenhaal
Chris Chambreau
Mike McCracken
John Engle - compiler support
Linda Stanberry - MPI related
Bronis deSupinski
Susan Post - system testing
Brian Carnes - general
Mary Zosel
Scott Taylor - emeritus
John Ranelletti