
Programming weather, climate, and earth-system models on heterogeneous multi-core platforms Conference, Sept 7 & 8. Allinea DDT 3.0 for Debugging: Challenge for Weather, Climate and Earth-systems models. David Maples, Allinea Software Inc, dmaples@allinea.com, www.allinea.com


Presentation Transcript


  1. Programming weather, climate, and earth-system models on heterogeneous multi-core platforms Conference, Sept 7 & 8 • Allinea DDT 3.0 for Debugging: Challenge for Weather, Climate and Earth-systems models • David Maples, Allinea Software Inc • dmaples@allinea.com • www.allinea.com

  2. HPC World • High Performance Computing needs ever-increasing compute power • Performance improvements will come from: • Concurrency and multi-core architectures • Optimized software • Writing or migrating software for concurrency is more complex and requires different tools and skills • [Chart: Systems in Top 500 with 8k - 32k cores and with 32k+ cores, by year, 2006 to 2010 (June & November lists)]

  3. New Market Drivers • Most ISV codes do not scale • High programming costs are delaying GPU usage • Development tools are a vital part of the solution • “Software has become the #1 roadblock … Many applications will need a major redesign” (IDC HPC Update, June 2010)

  4. Allinea Software • HPC tools company since 2001 • Allinea DDT: Scalable parallel debugger • Allinea OPT: Optimization tool for MPI and non-MPI • Large U.S. and large European customer base • 12 of the top 20 systems in EMEA run Allinea DDT • Most scalable and cost-effective debugger for CUDA • Users debugging at all scales from 1 to 100,000 cores and beyond, but it's also easy to use on small clusters! • World's only Petascale debugger!

  5. Clients and Partners Aviation and Defence Climate and Weather Energy Electronic Design Automation • Academic • Over 200 universities

  6. Allinea Clients in Climate • Weather and Climate are a great fit • HLRS, our first user in Germany in 2004 • NERSC • Met Office (UK) • Proudman (UK) • Irish Centre for High-End Computing (ICHEC) • British Geological Survey (BGS) (UK) • IFREMER (France) • Meteo France • NOAA (USA, Cray Linux) • Mercator France • US Navy – Fleet Numeric • BoM (Australia) • Royal Meteorological Institute of Belgium (IRM)

  7. Collaborations Partnership to develop Petascale debugger with NVIDIA support Partnership to develop Petascale/ Exascale tools and standards Partnership on Full Scale debugging on IBM Blue Gene /P & /Q Allinea DDT is “Debugger of Choice” on NERSC 5 and NERSC 6 and first implementation on CRAY XE6 Partnership with CEA French Atomic Energy Authority on scalable programming and CUDA Partnership on Keeneland project to help solving software challenges introduced by mixed architectures

  8. Allinea Software Collaborations: Technical Collaboration Results (examples) • Cray: Scalability (most scalable debugger for Cray, developed on Jaguar); Fast Track support (rapid debugging exclusive from Cray); UPC and CAF support; Cray User Group; Titan debugger development collaboration; in-house expertise on Allinea software • SGI: UPC and CAF support for the SGI compiler; SGI training for Allinea users; in-house expertise on Allinea software • IBM: Enhanced Blue Gene support for scalable debugging for BG/P and future systems • NVIDIA: Allinea DDT with CUDA support, shipped commercially since April 2010

  9. What is the value to your work? • Scalability, ease of use and an intuitive GUI • Allinea DDT extended capabilities • Allinea joint development deliverables are included in the standard product • Allinea collaboration with you to build new capabilities for your market • Allinea support for current and future architectures • Large group of DDT users in Weather/Climate

  10. Allinea DDT - Key capabilities for WC&E

  11. Use a Parallel Debugger • Many benefits to graphical parallel debuggers: large feature sets for common bugs; richness of user interface and real control of processes • Historically, all parallel debuggers hit scale problems: bottleneck at the front-end (direct GUI → nodes architectures); performance linear in the number of processes; human factors limit (mouse fatigue and brain overload) • Are tools ready for the task? • Allinea DDT has changed the game

  12. Achievements • Allinea DDT: first debugger with MPI and CUDA debugging • Simplifying hybrid debugging • Strong partnership with NVIDIA enables support for the latest toolkit • New Allinea DDT releases support new capabilities: June 2010: NVIDIA Toolkit 3.0, DDT 2.6; December 2010: NVIDIA Toolkit 3.1 and 3.2, DDT 2.6.1; April 2011: scalability and more, DDT 3.0 • Allinea DDT smashes the Petascale barrier: 220,000-core debugging delivered to Oak Ridge National Laboratory • Full set of core capabilities with global ~100ms timings

  13. Allinea DDT 3.0 • Petascale Architecture: Common collective process operations complete in a fraction of a second, even at over 200,000 cores! • Smart Highlighting: Automated display of the differences between processes and of changes in variable values • Visualization: New distributed multi-dimensional array viewer with filtering • Faster C++ debugging: Automatic display of STL, Boost and Qt variables • Cross Process Comparison: Improved scalable cross process comparison • Attaching to Jobs: Improved Attach window lets you easily find and select MPI jobs and attach to subsets • HMPP Support: DDT 3.0 introduces support for CAPS HMPP • Tracepoints: Intelligent logging and merging of variable history during program execution

  14. DDT in a nutshell • Scalar features: advanced C++ and STL; Fortran 90, 95 and 2003 (modules, allocatable data, pointers, derived types); memory debugging • Multithreading & OpenMP features: step, breakpoint etc. on one or all threads • MPI features: easy to manage groups; control processes by groups; compare data; visualize message queues

  15. DDT Platforms

  16. CAPS HMPP Support • Automatic detection of HMPP code fragments and set breakpoints before/after kernel • Step-over a kernel • Ignore HMPP wrapper layers • Suppress stack of HMPP internals to report only user code and high-level name of HMPP fragment • Obtain error codes (if possible) from HMPP kernels

  17. Handling Regular Bugs • Immediate stop on crash: segmentation fault or other memory problems; abort, exit, error handlers; CUDA errors • Scalable handling of error messages • Leaps to the problem: source code highlighted; affected processes shown; process stacks displayed clearly in parallel

  18. Finding the cause Full class/structure browsing Locals and Current line(s) Show variables relevant to current position Drag in the source code to see more C, C++, F90: object members, static members and derived types Automatic comparison and change detection Scalable and fast

  19. Smart Highlighting • Compare variables across processes and instantly detect changes: • Blue: Value change • Green: Different value on other process(es) • Fast and scalable! • Full class/structure browsing • Local variables and current line(s) • Show variables relevant to current position • Drag in the source code to see more • C, C++, F90: object members, static members, derived types

  20. Finding rogue processes • Easy to find where differences are: cross process comparison of data • Fetches values from every process, compares and then groups by value • Summary of NaN, Inf and statistics • Easy to spot rogues • Use to group processes • Define a process group and control en masse

  21. Cross Process Comparison • Analyse expressions calculated on each process in the current process group • Cross process comparison of data • Fetches values from every process, compares and then groups by value • Summary of NaNs, Infs and statistics • Easy to spot rogue processes! • Use to group processes • Define a process group
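
The comparison described on this slide (fetch a value from every process, group processes by value, summarise NaNs and Infs) can be sketched in a few lines of Python. This is only an illustration of the idea, not DDT's implementation: the function name `compare_across_processes` and the `values` mapping (rank to value) are hypothetical stand-ins for data a debugger would gather from the running job.

```python
import math
from collections import defaultdict


def compare_across_processes(values):
    """Group ranks by the value they hold and summarise NaN/Inf occurrences.

    `values` maps rank -> float. It is a hypothetical stand-in for the
    per-process results of evaluating one expression on every process.
    """
    groups = defaultdict(list)   # finite value -> ranks holding that value
    nan_ranks, inf_ranks = [], []
    for rank in sorted(values):
        v = values[rank]
        if math.isnan(v):
            nan_ranks.append(rank)
        elif math.isinf(v):
            inf_ranks.append(rank)
        else:
            groups[v].append(rank)

    # Simple statistics over the finite values only.
    finite = [v for v in values.values() if math.isfinite(v)]
    stats = None
    if finite:
        stats = {
            "min": min(finite),
            "max": max(finite),
            "mean": sum(finite) / len(finite),
        }
    return {"groups": dict(groups), "nan": nan_ranks, "inf": inf_ranks, "stats": stats}
```

A rank that ends up alone in its group, or in the NaN/Inf lists, is a candidate "rogue" process; the rank lists in the result could then be used to define a process group.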

  22. Visualization 3-D Visualization of distributed data using the Multi-Dimensional Array viewer • Large Array Support • Browse arrays • 1, 2, 3… dimensions • Table view • Filtering • Look for an outlying value • Export • Save to a spreadsheet • View arrays from multiple processes • Search through terabytes for rogue data in parallel

  23. Tracepoints • Intelligent logging and merging of variable history during execution • “Scalable printf”: • No need to recompile your program • Merging helps prevent information overload: Network traffic and user interface • Add conditions to filter output • Allows you to view both the data and the lines of code your program is executing without stopping • View program flow and state quickly over multiple iterations • Save output for offline analysis – Free up system resources
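
To illustrate why merging prevents information overload, the sketch below collapses identical tracepoint records from many processes into a single line tagged with a compact rank range. It is a minimal illustration under assumed names: the record layout `(rank, location, message)` and the functions `compress_ranks` and `merge_tracepoints` are hypothetical and do not reflect DDT's internal format.

```python
def compress_ranks(ranks):
    """Render a set of ranks compactly, e.g. [0, 1, 2, 5, 7, 8] -> '0-2,5,7-8'."""
    ranks = sorted(set(ranks))
    if not ranks:
        return ""
    parts = []
    start = prev = ranks[0]
    for r in ranks[1:]:
        if r == prev + 1:          # still inside a consecutive run
            prev = r
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = r
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def merge_tracepoints(records):
    """Merge records that share the same source location and message.

    `records` is an iterable of (rank, location, message) tuples
    (a hypothetical layout). Output order follows first appearance.
    """
    merged = {}
    for rank, location, message in records:
        merged.setdefault((location, message), []).append(rank)
    return [
        f"[{compress_ranks(ranks)}] {location}: {message}"
        for (location, message), ranks in merged.items()
    ]
```

With 100,000 processes reporting the same value at the same line, the output stays at one line instead of 100,000, which is the sense in which the slide calls tracepoints a "scalable printf".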

  24. Improved C++ debugging • Faster startup when debugging C++ codes • Much improved performance for heavily templated code • Edit Type Feature • Helps viewing polymorphic types • Automatic display of STL, Boost and Qt containers • Easily view the contents of the data structure • Easily de-reference pointers

  25. Attaching to Jobs • Improved Attach window allows you to easily find and select MPI jobs and attach to running processes • Clicking the Attach to a Running Program button on the Welcome Screen will show DDT's Attach Window: • List of automatically detected MPI jobs: No need to select individual processes • Or you can manually select from a list of processes if required

  26. Memory Debugging • Find memory leaks • Or stop on read/write beyond end of array

  27. Debugging at Scale

  28. Problems at Scale • Increasing job sizes lead to unanticipated errors • Regular bugs: data issues from larger data sets (e.g. garbage in..., overflow); logic issues and control flow • Increasing probability of independent random error: memory errors/exhaustion (“random” bugs!); system problems (MPI and operating system) • Pushing coded boundaries: algorithmic (performance); hard-wired limits (“magic numbers”) • Unknown unknowns ....

  29. Strategies for bug fixing I • Improved coding standards (unit tests, assertions): good practice, but coverage is rarely perfect; random/system issues are often missed; combines well with debuggers, which find why a failure occurs, not just a pass/fail • Logging (printf and write): works if you have good intuition into the problem; edit code, insert print, recompile and re-run; slow and iterative; post-mortem analysis only; hard to establish the real order of output from multiple processes; rapid growth in log output size; unscalable

  30. Strategies for bug fixing II Reproduce at a smaller scale Attempt to make problem happen on fewer nodes Often requires reduced data set – the large one may not fit Smaller data set may not trigger the problem Does the bug even exist on smaller problems? Didn't you already try the code at small scale? Is it a system issue – eg. an MPI problem? Is probability stacking up against you? Unlikely to spot on smaller runs – without many many runs But near guaranteed to see it on a many-thousand core run What can a parallel debugger do to help? Debug at the scale of the problem - Now.
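
The "probability stacking up against you" point can be made concrete. Assuming each process independently hits a given rare error with probability p during a run (the independence and the example value of p are assumptions for illustration, not figures from the talk), the chance that at least one of N processes hits it is 1 - (1 - p)^N:

```python
def failure_probability(p_per_process, n_processes):
    """Probability that at least one of n independent processes hits the error.

    Assumes every process fails independently with the same probability
    `p_per_process` per run; real failures are often correlated.
    """
    if not 0.0 <= p_per_process <= 1.0:
        raise ValueError("p_per_process must be between 0 and 1")
    if n_processes < 0:
        raise ValueError("n_processes must be non-negative")
    return 1.0 - (1.0 - p_per_process) ** n_processes


# Illustrative value only: a 1-in-10,000 chance per process per run.
for n in (16, 1024, 100_000):
    print(n, failure_probability(1e-4, n))
```

Under these assumptions the error is unlikely on a 16-process test run but close to certain on a run with 100,000 processes, which is why reproducing at a smaller scale can fail to show the bug at all.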

  31. Scalable Process Control Parallel Stack View Finds rogue processes quickly Identify classes of process behaviour Rapid grouping of processes Control Processes by Groups Set breakpoints, step, play, stop for groups Scalable groups view: compact group display
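
A parallel stack view can be thought of as a prefix tree of call stacks, where each node records which processes are currently in that frame. The sketch below shows the general idea only; the names `build_stack_tree` and `render`, and the input layout (rank mapped to a list of frames, outermost first), are hypothetical and not taken from DDT.

```python
def build_stack_tree(stacks):
    """Merge per-process call stacks into one prefix tree.

    `stacks` maps rank -> list of frame names, outermost frame first
    (a hypothetical layout). Each node stores the ranks passing through it.
    """
    root = {"ranks": [], "children": {}}
    for rank in sorted(stacks):
        node = root
        node["ranks"].append(rank)
        for frame in stacks[rank]:
            node = node["children"].setdefault(frame, {"ranks": [], "children": {}})
            node["ranks"].append(rank)
    return root


def render(node, depth=0):
    """Return indented text lines, one per frame, with the process count."""
    lines = []
    for frame, child in node["children"].items():
        lines.append(f"{'  ' * depth}{frame} [{len(child['ranks'])}]")
        lines.extend(render(child, depth + 1))
    return lines
```

A branch with a count of one among thousands of processes points at a rogue process, and the ranks stored on any node give a ready-made process group to control together.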

  32. Petascale Architecture • Developed due to collaborations with ORNL on Jaguar Cray XT, ANL and CEA • Logarithmic performance due to new tree architecture • Many operations are now faster at 220,000 than previously at 1000 cores • ~1/10th of a second to step and gather all stacks at 220,000 cores • A massive performance revolution for every user’s benefit!
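
The difference between a direct front-end-to-nodes design and a tree can be illustrated with a small calculation. In a direct design the front end handles every process itself, so work grows linearly with the process count; in a tree, a request fans out level by level, so the number of levels grows logarithmically. The fan-out of 32 used below is an arbitrary assumption for illustration; the slides do not state the fan-out DDT uses.

```python
def tree_depth(n_processes, fanout):
    """Levels needed for a tree with the given fan-out to reach n_processes leaves.

    Uses integer arithmetic to avoid floating-point rounding in logarithms.
    """
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    depth, reach = 0, 1
    while reach < n_processes:
        reach *= fanout
        depth += 1
    return depth


FANOUT = 32  # hypothetical value, chosen only for this example
for n in (1_000, 220_000):
    # Direct design: the front end contacts n processes one by one.
    # Tree design: a broadcast or gather crosses tree_depth(n, FANOUT) levels.
    print(n, "direct contacts:", n, "tree levels:", tree_depth(n, FANOUT))
```

With this assumed fan-out, going from 1,000 to 220,000 processes adds only two levels to the tree, which is consistent with the slide's claim of logarithmic performance.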

  33. Debugging GPU Applications

  34. CUDA Debugging Options Old world “printf” NVIDIA SDK 3.0 allows this but with limitations Fake it – Run the kernel on the host x86_64 processor Languages often support targeting host CPU instead of GPU Different numeric precision – different answer? Different scheduling – different answer? A reasonable option for some bugs Or run on the GPU with Allinea DDT...

  35. GPUs Made Easy View all threads in parallel stack view At one glance, see all GPU and CPU threads together Links with thread selection Pick a tree node to select one of the CUDA threads at that location Full MPI support See GPU and CPU threads from multiple nodes

  36. Debugging CPU and GPU concurrently Browse source, examine variables, control processes and threads Set breakpoints Automatically stop on kernel launch Stop at a line of CUDA code Kernels stop when breakpoint reached Hover the mouse for more information Step a warp - 32 CUDA threads Debugging Kernels
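
Stepping "a warp" means the 32 CUDA threads of one warp advance together. As a rough sketch of which threads share a warp, the Python below applies the usual CUDA linearisation of a thread index within a block (x fastest, then y, then z) and divides by the warp size. The function name `warp_of` is hypothetical; the warp size of 32 matches the figure on the slide.

```python
WARP_SIZE = 32  # threads per warp, as stated on the slide


def warp_of(thread_idx, block_dim):
    """Return (warp index within the block, lane within the warp).

    `thread_idx` and `block_dim` are (x, y, z) tuples, mirroring CUDA's
    threadIdx and blockDim. Threads are linearised with x varying fastest.
    """
    x, y, z = thread_idx
    dx, dy, dz = block_dim
    if not (0 <= x < dx and 0 <= y < dy and 0 <= z < dz):
        raise ValueError("thread index lies outside the block")
    linear = x + y * dx + z * dx * dy
    return linear // WARP_SIZE, linear % WARP_SIZE
```

Two threads with the same warp index in the same block would be stepped together by a warp-level step; threads in different warps may be at different source lines.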

  37. Examine Thread Data At a glance display of variables Expressions, local variables, and current line Also possible to edit values Displays the memory types shared, parameter, constant, register, …

  38. DDT CUDA Status NVIDIA SDK 3.1, SDK 3.2 Allinea DDT 3.0 Multi-device support Fermi and Tesla support CUDA Memcheck support for memory errors MPI and CUDA support for GPU clusters Breakpoints, thread control, and data evaluation Stop on kernel launch

  39. Summary Debuggers are the right tools to fix bugs quickly Other methods have limited success and issues at scale Allinea DDT scales in both performance and interface Breaking all records and making problems manageable Be sure to get DDT release 3.0! Allinea DDT supports NVIDIA CUDA with the ability to debug code running on both the CPU and GPU NVIDIA SDK 3.1, SDK 3.2 and available for SDK 4.0 Contact support@allinea.com or sales@allinea.com

  40. Thank You David Maples Allinea Software Inc. 2033 Gateway Pl. San Jose, Ca. 408 884 0282 dmaples@allinea.com
