Moving Complex Apps To Take Advantage of Complex Hardware

Salishan • 4/24/2014 • Ian Karlin


Presentation Transcript


  1. Moving Complex Apps To Take Advantage of Complex Hardware Salishan • 4/24/2014 • Ian Karlin

  2. ASC Codes Last Many HW Generations
     • Tuning large complex applications for each hardware generation is impractical
     • Performance
     • Productivity
     • Code Base Size
     Solutions must be general, adaptable to the future and maintainable

  3. Power Efficiency is Driving Computer Designs
     [Figure: Power vs. frequency for Intel Ivy Bridge]
     What drives these designs?
     • Handheld mobile device weight and battery life
     • Exascale power goals
     • Power cost
     Lower power reduces performance and reliability

  4. Reliability of Systems Will Decrease
     Chips operating near threshold voltage encounter
     • More transient errors
     • More hard errors
     Checkpoint restart is our current reliability mechanism

  5. Advancing Capability at Reduced Power Requires More Complexity
     Complex power saving features
     • SIMD and SIMT
     • Multi-Level memory systems
     • Heterogeneous systems
     [Figure: node diagram with labels Memory, NVRAM, Processing, GPU, Multi-Core CPU, In-Package Memory]
     Exploiting these features is difficult

  6. Currently, We Do Not Use These Features
     • No production GPU or Xeon Phi code
     • GPU and Xeon Phi optimizations are different
     • No production codes explicitly manage on-node data motion
     • Less than 10% of our FLOPs use SIMD units, even with the best compilers
     • Architecture dependent data layouts may hinder the compiler
     Mechanisms are needed to isolate architecture specific code

  7. What Would Continuing on Today’s Path Look Like?
     • We add directives to existing codes where portable
     • Multi-level memory handled by OS, runtime or used as a cache
     • We continue to get little SIMD and probably a bit better SIMT parallelism
     Overall performance improvement is incremental at best

  8. Can We Get Today’s Codes Where We Need To Be Tomorrow?
     • Are our algorithms well suited for future machines?
     • Can we rewrite our data structures to match future machines?
     We will address these questions in the next few slides

  9. We Can Manage Locality and Reduce Data Motion
     • Loop fusion
     • Make each operator a single sweep over a mesh
     • Data structure reorganization
     • Reduce mallocs or use better libraries
     [Figure: LULESH on BG/Q]
     However, better implementations only get us 2-3x
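A minimal sketch of the loop fusion idea, assuming a hypothetical nodal velocity update; the function and array names (`update_unfused`, `update_fused`, `force`, `mass`, `accel`, `vel`) are illustrative and do not come from the talk or from LULESH. The unfused version sweeps the mesh twice and round-trips an intermediate array through memory, while the fused version does a single sweep and keeps the intermediate value in a register.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two sweeps over the mesh. The intermediate array `accel`
// is written to memory in the first loop and read back in the second.
void update_unfused(const std::vector<double>& force,
                    const std::vector<double>& mass,
                    std::vector<double>& accel,
                    std::vector<double>& vel,
                    double dt) {
    const std::size_t n = vel.size();
    assert(force.size() == n && mass.size() == n && accel.size() == n);
    for (std::size_t i = 0; i < n; ++i) accel[i] = force[i] / mass[i];
    for (std::size_t i = 0; i < n; ++i) vel[i] += dt * accel[i];
}

// Fused: one sweep over the mesh. The acceleration never leaves a
// register, so the `accel` array and its memory traffic disappear.
void update_fused(const std::vector<double>& force,
                  const std::vector<double>& mass,
                  std::vector<double>& vel,
                  double dt) {
    const std::size_t n = vel.size();
    assert(force.size() == n && mass.size() == n);
    for (std::size_t i = 0; i < n; ++i) {
        const double a = force[i] / mass[i];
        vel[i] += dt * a;
    }
}
```

Under these assumptions the fused loop moves three doubles per node (two loads plus the load and store of `vel`) instead of touching `accel` twice as well, which is the kind of data-motion reduction the slide refers to.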

  10. We Can Reduce Serial Sections
     • Throughput optimized processors execute serial sections slowly
     • Design codes with limited serial sections
     • Better runtime support is needed to reduce serial overhead
     • OpenMP
     • Malloc Libraries
     Use latency optimized processor for what remains
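The cost of serial sections on throughput-optimized processors can be illustrated with Amdahl's law. This is a rough sketch, not a model from the talk: the function name `amdahl_speedup` and the `serial_slowdown` parameter (how much slower a throughput core runs serial code than a latency-optimized core) are assumptions made for illustration.

```cpp
#include <cassert>

// Speedup relative to running everything on one latency-optimized core.
//   serial_fraction  : fraction of the original run time that is serial (0..1)
//   parallel_speedup : speedup achieved on the parallel part
//   serial_slowdown  : factor by which the serial part gets slower on a
//                      throughput-optimized core (1.0 = no slowdown)
double amdahl_speedup(double serial_fraction,
                      double parallel_speedup,
                      double serial_slowdown) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(parallel_speedup > 0.0 && serial_slowdown > 0.0);
    const double serial_time = serial_fraction * serial_slowdown;
    const double parallel_time = (1.0 - serial_fraction) / parallel_speedup;
    return 1.0 / (serial_time + parallel_time);
}
```

In this model, a code that is 10% serial and gets a 10x speedup on its parallel part reaches about 5.3x overall if the serial part runs at full speed, but only about 2x if the serial part runs 4x slower. That gap is the argument for keeping a latency-optimized processor for what remains.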

  11. We Can Vectorize Better
     • More parallelism exists in current algorithms than we exploit today
     • Code changes are required to express parallelism more clearly
     • SIMT or SIMD with HW Gather/Scatter are easier to exploit
     [Figure: LULESH on Sandy Bridge]
     Bandwidth constraints will eventually limit us
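One common code change that exposes parallelism more clearly to the compiler is a data layout switch from array-of-structures (AoS) to structure-of-arrays (SoA). The sketch below is an assumption-laden illustration, not code from the talk: the types `NodeAoS` and `NodesSoA` and the helper functions are hypothetical. With SoA, the loop over `x` is unit stride and has no pointer-chasing, which compilers typically vectorize without gather/scatter support.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures: the x values of consecutive nodes are 3 doubles apart.
struct NodeAoS { double x, y, z; };

// Structure-of-arrays: each coordinate is contiguous in memory.
struct NodesSoA {
    std::vector<double> x, y, z;
};

// Convert an AoS mesh to SoA (hypothetical helper for this example).
NodesSoA to_soa(const std::vector<NodeAoS>& aos) {
    NodesSoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    for (const NodeAoS& n : aos) {
        soa.x.push_back(n.x);
        soa.y.push_back(n.y);
        soa.z.push_back(n.z);
    }
    return soa;
}

// AoS update: strided access to x.
void shift_x_aos(std::vector<NodeAoS>& nodes, double dx) {
    for (std::size_t i = 0; i < nodes.size(); ++i) nodes[i].x += dx;
}

// SoA update: unit-stride access to x, a simple candidate for SIMD.
void shift_x_soa(NodesSoA& nodes, double dx) {
    const std::size_t n = nodes.x.size();
    assert(nodes.y.size() == n && nodes.z.size() == n);
    double* x = nodes.x.data();
    for (std::size_t i = 0; i < n; ++i) x[i] += dx;
}
```

Which layout is best can depend on the architecture, which is one reason such choices benefit from being isolated behind an abstraction rather than hard-coded throughout an application.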

  12. However, There Are Fundamental Data Motion Requirements
     Many of today’s apps need 0.5-2 bytes for every FLOP performed.

  13. Future Machines Cannot Move Data Fast Enough For Current Algorithms to Exploit All Resources
     [Figure label: Excess FLOPs]
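The "excess FLOPs" argument can be made concrete with a simple bandwidth-bound estimate in the style of a roofline model. This is a sketch under stated assumptions: the function names are hypothetical, and the machine numbers in the usage note below are invented for illustration, not taken from the talk. Only the 0.5-2 bytes per FLOP application range comes from the slides.

```cpp
#include <algorithm>
#include <cassert>

// Machine balance: bytes the memory system can deliver per peak FLOP.
double machine_balance(double bandwidth_gbytes_per_s, double peak_gflops) {
    assert(bandwidth_gbytes_per_s > 0.0 && peak_gflops > 0.0);
    return bandwidth_gbytes_per_s / peak_gflops;
}

// Upper bound on the fraction of peak FLOP rate an application can reach
// when it needs `app_bytes_per_flop` and the machine supplies
// `machine_bytes_per_flop`. Values below 1.0 mean some FLOPs sit idle.
double attainable_fraction(double app_bytes_per_flop,
                           double machine_bytes_per_flop) {
    assert(app_bytes_per_flop > 0.0 && machine_bytes_per_flop > 0.0);
    return std::min(1.0, machine_bytes_per_flop / app_bytes_per_flop);
}
```

For a hypothetical node with 100 GB/s of memory bandwidth and 1000 GFLOP/s of peak compute, the machine balance is 0.1 bytes per FLOP. An application needing 0.5 bytes per FLOP could then use at most 20% of peak, and one needing 2 bytes per FLOP at most 5%.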

  14. High-order Algorithms Can Bridge the Gap
     [Figure: B to F Requirement vs. Algorithmic Order]
     • More FLOPs per byte
     • Small dense operations
     • More accurate
     • Potentially more robust and better symmetry preservation

  15. They Present New Questions
     • How do you use the FLOPs efficiently?
     • What does high-order accuracy mean when there is a shock?
     • Can you couple all the physics we need at high-order?
     We are working to answer these, but whether we use new algorithms or our current ones there is a pervasive challenge…

  16. How to Retarget Large Applications in a Manageable Way?

  17. Are Optimizations Portable Across Architectures?
     Mechanisms are needed to isolate non-portable optimizations

  18. You Saw One Approach Earlier This Week
     • RAJA, Kokkos and Thrust allow portable abstractions in today’s codes
     There Are Other Attractive Research Approaches For The Future
     • Charm++
     • Liszt
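The general pattern behind such portable abstractions can be sketched as a loop template parameterized by an execution policy. The code below is a hypothetical illustration only: `seq_exec`, `simd_exec`, `forall` and `scale` are names invented for this example and are not the actual RAJA, Kokkos or Thrust APIs. The `#pragma omp simd` hint assumes a compiler with OpenMP 4.0 SIMD support; without it the pragma is ignored and the loop runs sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Execution policy tags (hypothetical). Architecture-specific choices
// live in the forall overloads, not in application loops.
struct seq_exec {};
struct simd_exec {};

// Plain sequential traversal.
template <typename Body>
void forall(seq_exec, std::size_t begin, std::size_t end, Body&& body) {
    assert(begin <= end);
    for (std::size_t i = begin; i < end; ++i) body(i);
}

// Traversal with a vectorization hint for the compiler.
template <typename Body>
void forall(simd_exec, std::size_t begin, std::size_t end, Body&& body) {
    assert(begin <= end);
#pragma omp simd
    for (std::size_t i = begin; i < end; ++i) body(i);
}

// Application code: the loop body is written once, and the policy is
// the only thing that changes when retargeting.
template <typename Policy>
void scale(Policy policy, std::vector<double>& x, double s) {
    forall(policy, 0, x.size(), [&](std::size_t i) { x[i] *= s; });
}
```

Calling `scale(seq_exec{}, x, 2.0)` or `scale(simd_exec{}, x, 2.0)` runs the same loop body under a different policy. Adding a new backend in this design means adding one `forall` overload rather than editing every loop in a large application.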

  19. Ultimately We Need to Make Performance a First Class Citizen
     [Figure labels: Architectures, Programming Models, Algorithms, Today’s, RAJA, High Order]
