Moving Complex Apps To Take Advantage of Complex Hardware

Salishan • 4/24/2014 • Ian Karlin

ASC Codes Last Many HW Generations
• Tuning large complex applications for each hardware generation is impractical
• Performance
• Productivity
• Code Base Size
Solutions must be general, adaptable to the future and maintainable

Power Efficiency is Driving Computer Designs
[Figure: Power vs. frequency for Intel Ivy Bridge]
What drives these designs?
• Handheld mobile device weight and battery life
• Exascale power goals
• Power cost
Lower power reduces performance and reliability

Reliability of Systems Will Decrease
Chips operating near threshold voltage encounter
• More transient errors
• More hard errors
Checkpoint restart is our current reliability mechanism

Advancing Capability at Reduced Power Requires More Complexity
Complex power saving features
• SIMD and SIMT
• Multi-level memory systems
• Heterogeneous systems
[Figure: node diagram with labels Memory, NVRAM, Processing, GPU, Multi-Core CPU, In-Package Memory]
Exploiting these features is difficult

Currently, We Do Not Use These Features
• No production GPU or Xeon Phi code
• GPU and Xeon Phi optimizations are different
• No production codes explicitly manage on-node data motion
• Less than 10% of our FLOPs use SIMD units, even with the best compilers
• Architecture-dependent data layouts may hinder the compiler
Mechanisms are needed to isolate architecture-specific code

What Would Continuing on Today’s Path Look Like?
• We add directives to existing codes where portable
• Multi-level memory is handled by the OS or runtime, or used as a cache
• We continue to get little SIMD and probably a bit better SIMT parallelism
Overall performance improvement is incremental at best

Can We Get Today’s Codes Where We Need To Be Tomorrow?
• Are our algorithms well suited for future machines?
• Can we rewrite our data structures to match future machines?
We will address these questions in the next few slides

We Can Manage Locality and Reduce Data Motion
• Loop fusion
• Make each operator a single sweep over a mesh
• Data structure reorganization
• Reduce mallocs or use better libraries
[Figure: LULESH on BG/Q]
However, better implementations only get us 2-3x
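
As an illustration of the loop fusion bullet, here is a minimal C++ sketch. The function names and the toy arithmetic are hypothetical and are not taken from LULESH or any production code. The unfused version sweeps the arrays twice and round-trips an intermediate array through memory; the fused version does one sweep and keeps the intermediate in a register.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two sweeps over the data. `tmp` is written by the first loop
// and read back by the second, which adds memory traffic.
inline void update_unfused(const std::vector<double>& a,
                           const std::vector<double>& b,
                           std::vector<double>& tmp,
                           std::vector<double>& out) {
    assert(a.size() == b.size() && a.size() == tmp.size() && a.size() == out.size());
    for (std::size_t i = 0; i < a.size(); ++i) tmp[i] = a[i] + b[i];
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = 2.0 * tmp[i];
}

// Fused: a single sweep. The intermediate value lives in a local variable,
// so no temporary array is stored or reloaded.
inline void update_fused(const std::vector<double>& a,
                         const std::vector<double>& b,
                         std::vector<double>& out) {
    assert(a.size() == b.size() && a.size() == out.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double t = a[i] + b[i];
        out[i] = 2.0 * t;
    }
}
```

Both versions compute the same result; the difference is only in how many times the data is streamed through the memory system.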

We Can Reduce Serial Sections
• Throughput-optimized processors execute serial sections slowly
• Design codes with limited serial sections
• Better runtime support is needed to reduce serial overhead
• OpenMP
• Malloc libraries
Use a latency-optimized processor for what remains
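
To make the cost of serial sections concrete, the following sketch applies an Amdahl-style estimate. It is not a model from this talk: the function name and all three parameters are assumptions chosen for illustration.

```cpp
#include <cassert>

// Amdahl-style estimate of speedup relative to a baseline core.
//   serial_fraction : share of baseline runtime that is serial (0..1)
//   parallel_speedup: speedup of the parallel part on the throughput processor
//   serial_slowdown : factor by which serial code runs slower on it
// All parameters are illustrative assumptions, not measurements.
inline double estimated_speedup(double serial_fraction,
                                double parallel_speedup,
                                double serial_slowdown) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(parallel_speedup > 0.0 && serial_slowdown > 0.0);
    const double serial_time = serial_fraction * serial_slowdown;
    const double parallel_time = (1.0 - serial_fraction) / parallel_speedup;
    return 1.0 / (serial_time + parallel_time);
}
```

Under these assumptions, a code with 10% serial work and a 10x parallel speedup reaches about 5.3x if serial code runs at full speed, but only about 2.0x if serial code runs 4x slower. That gap is the argument for running the remaining serial work on a latency-optimized processor.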

We Can Vectorize Better
• More parallelism exists in current algorithms than we exploit today
• Code changes are required to express parallelism more clearly
• SIMT or SIMD with HW gather/scatter are easier to exploit
[Figure: LULESH on Sandy Bridge]
Bandwidth constraints will eventually limit us
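
One common kind of code change that exposes parallelism more clearly is a data layout change. The sketch below contrasts an array-of-structures layout with a structure-of-arrays layout. The types and functions are hypothetical examples, not code from LULESH.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures: the x values of consecutive nodes are 3 doubles apart.
struct NodeAoS { double x, y, z; };

// Structure of arrays: the x values of consecutive nodes are contiguous.
struct NodesSoA { std::vector<double> x, y, z; };

// Strided access: a vector unit needs gather/scatter or shuffles to load x.
inline void scale_x_aos(std::vector<NodeAoS>& nodes, double s) {
    for (std::size_t i = 0; i < nodes.size(); ++i) nodes[i].x *= s;
}

// Unit-stride access: a simple loop over one contiguous array, which
// compilers can typically map directly onto SIMD loads and stores.
inline void scale_x_soa(NodesSoA& nodes, double s) {
    assert(nodes.x.size() == nodes.y.size() && nodes.x.size() == nodes.z.size());
    double* x = nodes.x.data();
    const std::size_t n = nodes.x.size();
    for (std::size_t i = 0; i < n; ++i) x[i] *= s;
}
```

Whether the structure-of-arrays form is actually faster depends on the compiler, the hardware, and how the other fields are used, so it should be measured per kernel.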

However, There Are Fundamental Data Motion Requirements
Many of today’s apps need 0.5-2 bytes for every FLOP performed.

Future Machines Cannot Move Data Fast Enough For Current Algorithms to Exploit All Resources
[Figure annotation: Excess FLOPs]
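
The "excess FLOPs" idea can be sketched with a simple bandwidth-bound estimate. The function name is hypothetical and any machine balance passed to it is an assumed input, not a figure from this talk. The model assumes performance is limited only by memory bandwidth or by peak FLOP rate.

```cpp
#include <algorithm>
#include <cassert>

// Fraction of peak FLOP rate a code can sustain when limited by memory
// bandwidth.
//   machine_bytes_per_flop: bytes the memory system delivers per peak FLOP
//   app_bytes_per_flop    : bytes the application needs per FLOP
// Both inputs are assumptions supplied by the caller.
inline double usable_flop_fraction(double machine_bytes_per_flop,
                                   double app_bytes_per_flop) {
    assert(machine_bytes_per_flop > 0.0 && app_bytes_per_flop > 0.0);
    return std::min(1.0, machine_bytes_per_flop / app_bytes_per_flop);
}
```

For example, an application that needs 1 byte per FLOP on a hypothetical machine that supplies 0.1 bytes per FLOP can use at most 10% of peak under this model; the remaining FLOPs are excess.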

High-order Algorithms Can Bridge the Gap
[Figure: B to F Requirement vs. Algorithmic Order]
• More FLOPs per byte
• Small dense operations
• More accurate
• Potentially more robust and better symmetry preservation
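
A rough way to see why small dense operations lower the bytes-to-FLOPs requirement is sketched below. This is a simplified model, not the data behind the figure. It assumes a 3D element of order p with n = (p+1)^3 values, a dense n x n operator that is shared by all elements and stays in cache, and double-precision data.

```cpp
#include <cassert>

// Bytes moved per FLOP for one dense element-local operator application,
// under the assumptions stated above (hypothetical model for illustration).
inline double dense_bytes_per_flop(int order) {
    assert(order >= 1);
    const double n = static_cast<double>((order + 1) * (order + 1) * (order + 1));
    const double flops = 2.0 * n * n;    // dense n x n matrix-vector product
    const double bytes = 2.0 * 8.0 * n;  // read n doubles and write n doubles
    return bytes / flops;                // simplifies to 8 / n
}
```

In this model the requirement falls from 1 byte per FLOP at order 1 to 0.125 bytes per FLOP at order 3. Real high-order codes will differ, since operators are often not fully dense and not always cache resident.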

They Present New Questions
• How do you use the FLOPs efficiently?
• What does high-order accuracy mean when there is a shock?
• Can you couple all the physics we need at high order?
We are working to answer these, but whether we use new algorithms or our current ones there is a pervasive challenge…

How to Retarget Large Applications in a Manageable Way?

Are Optimizations Portable Across Architectures?
Mechanisms are needed to isolate non-portable optimizations

You Saw One Approach Earlier This Week
• RAJA, Kokkos and Thrust allow portable abstractions in today’s codes
There Are Other Attractive Research Approaches For The Future
• Charm++
• Liszt
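
The general idea behind a portable loop abstraction can be sketched in a few lines of C++. This is not the RAJA, Kokkos or Thrust API. Every name below (`SequentialPolicy`, `BlockedPolicy`, `for_each_index`, `daxpy`) is hypothetical. The point is that the loop body is written once and a policy type selects how the iteration space is traversed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical execution policies: tag types that select a traversal.
struct SequentialPolicy {};
template <std::size_t Block> struct BlockedPolicy {};

// Plain in-order traversal.
template <typename Body>
void for_each_index(SequentialPolicy, std::size_t begin, std::size_t end, Body&& body) {
    for (std::size_t i = begin; i < end; ++i) body(i);
}

// Traversal in fixed-size blocks, standing in for an architecture-specific
// strategy (tiling, threading, offload) hidden behind the same interface.
template <std::size_t Block, typename Body>
void for_each_index(BlockedPolicy<Block>, std::size_t begin, std::size_t end, Body&& body) {
    static_assert(Block > 0, "block size must be positive");
    for (std::size_t b = begin; b < end; b += Block) {
        const std::size_t stop = std::min(end, b + Block);
        for (std::size_t i = b; i < stop; ++i) body(i);
    }
}

// Application code: written once, independent of the policy.
template <typename Policy>
void daxpy(Policy policy, double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    for_each_index(policy, 0, y.size(), [&](std::size_t i) { y[i] += a * x[i]; });
}
```

Calling `daxpy(SequentialPolicy{}, ...)` or `daxpy(BlockedPolicy<2>{}, ...)` gives the same result; only the traversal changes. Architecture-specific choices stay inside the policy implementations, which is one way to isolate non-portable optimizations.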

Ultimately We Need to Make Performance a First Class Citizen
[Figure: diagram with labels Architectures, Programming Models, Algorithms, Today’s, RAJA, High Order]
