Leveraging Hierarchy: Is this our Undiscovered Country? John T. Daly
Undiscovered Country: Cost vs. Risk?
[Figure: Log(Performance) vs. Time, with technology generations of ~15 years each — Vector (latency hiding), Parallel IN and Parallel OUT (concurrency), and Data Movement leading toward Exascale?]
Advanced Computing Systems (ACS)
• HPC capability doubles every 14 months, but data doubles every 9 months
• Innovative solutions are required to bridge the gap
• Partner with industry, academia and national labs to develop technology enablers for next-generation computing
• Generate a steady stream of capability; no "end goal" for scaling
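A back-of-the-envelope illustration (my own arithmetic, not from the slide) of how quickly those two doubling rates diverge over a five-year (60-month) window:

```latex
\frac{\text{data growth}}{\text{capability growth}}
  \;=\; \frac{2^{60/9}}{2^{60/14}}
  \;\approx\; \frac{101}{19.5}
  \;\approx\; 5\times
```

At those rates the data-to-capability gap widens roughly fivefold every five years, which is the gap the innovative solutions above are meant to bridge.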
ACS: Bridge to research community
[Diagram: participatory research and mirroring between the agency compute mission and the broader community — mission problems flow out as technical challenges to universities, national labs, government and industry, and technical solutions flow back as CEC mission capability]
ACS: technical thrusts + end-to-end
• Our HPC stakeholders
  • System integrator optimizes power, performance and reliability for a set number of dollars
  • System user optimizes usability, dependability and time-to-solution for a set number of deliverables
• Point solutions in six technical thrusts: power efficiency, chip I/O, interconnects, productivity, file I/O and resilience
• Innovative end-to-end solutions
  • AMOEBA: chip-level data movement and packaging
  • MYRIAD(?): system-level modeling and simulation
Extreme is not necessarily "balanced"
• Traditional HPC is an important part of ACS, but not the only part
• Dynamic design space drives the need for simulation and an abstract machine model
• Goal: scientific understanding in HPC
[Diagram: the six technical thrusts (productivity, interconnect, file I/O & storage, chip I/O, resilience, power efficiency) grouped into "traditional HPC and ACS too" versus "also ACS, but maybe not traditional HPC"]
Future "convergence"?
• Today
  • Predictive science starts with an initial model and runs a numerical experiment to generate lots of data
  • Data analytics starts with lots of data and extracts features or information that characterize the data
• Tomorrow
  • Predictive science uses in situ data analytics to reduce data storage and post-processing requirements
  • Data analytics uses in situ predictive science to ask "what ought this data to look like?"
Advancing Intelligence Through Science
Energy is the next shared resource
[Diagram: power efficiency, resilience, productivity, chip I/O, interconnect and file I/O all drawing on energy as the shared resource]
• Off-node communication is over budget
• Off-chip communication is over budget
Source: DOE Architectures and Technology for Extreme Scale Computing, San Diego, CA
Data is the challenge of scale
• Energy, performance and data-integrity tapers are a function of the distance between the data and the processor
• Data locality is key to computing at scale: it is what optimizes right answers per Joule per second
• Spatial locality allows me to grab more data in a single memory transaction
• Temporal locality allows me to use the same data multiple times before I have to move it
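A minimal sketch of what the two kinds of locality look like in code (my illustration, assuming a simple dense-array workload in C; the array and tile sizes are arbitrary choices):

```c
#include <stddef.h>

#define N    4096
#define TILE 64

/* Spatial locality: a stride-1 traversal makes every memory transaction
 * deliver a full line of useful data. */
double sum_all(const double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)   /* contiguous in memory */
            s += a[i][j];
    return s;
}

/* Temporal locality: blocking the matrix multiply so each TILE x TILE
 * block of b is reused for many rows of a while it is still close to the
 * processor, instead of being re-fetched from far away.
 * Assumes c is zero-initialized and N is a multiple of TILE. */
void blocked_matmul(double c[N][N], const double a[N][N], const double b[N][N]) {
    for (size_t kk = 0; kk < N; kk += TILE)
        for (size_t jj = 0; jj < N; jj += TILE)
            for (size_t i = 0; i < N; i++)
                for (size_t k = kk; k < kk + TILE; k++)
                    for (size_t j = jj; j < jj + TILE; j++)
                        c[i][j] += a[i][k] * b[k][j];  /* b block reused for every i */
}
```

The first loop nest buys bandwidth per transaction (spatial); the blocked loop nest buys reuse per fetch (temporal), which is exactly the "right answers per Joule per second" lever.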
A role for NV in the hierarchy http://www.bit-tech.net/hardware/memory/2007/11/15/the_secrets_of_pc_memory_part_1/3
Node architecture = "shops" of data
• Byte/word-addressable memory up and down the stack, block-synchronous between stacks
• Control is a data aggregator (e.g., gather/scatter)
[Diagram: two memory stacks, each with a processor/control at the base and layers of RAM and RAM/NVRAM above, each layer with its own control logic]
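To make the aggregator idea concrete, here is a hedged sketch (my illustration, not the slide's design) of the gather/scatter pattern that memory-side control logic would service on behalf of the processor:

```c
#include <stddef.h>

/* Gather: pull scattered elements into a dense buffer as one logical
 * request — the kind of operation a memory-side controller could complete
 * without shipping every intermediate line up to the processor. */
void gather(double *dst, const double *src, const size_t *idx, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}

/* Scatter: write a dense buffer back out to scattered locations. */
void scatter(double *dst, const double *src, const size_t *idx, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[idx[i]] = src[i];
}
```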
Exploiting Spatial Locality
• Fractal memory
  • Create a virtual mapping of data lines to space-filling curves (e.g., Jin and Mellor-Crummey, "Using Space-filling Curves for Computation Reordering"); see the sketch after this list
  • Use memory-control logic to resolve mappings
  • Dynamic mapping by the user via a PM interface
• Move work to data
  • Adaptive mesh refinement is a refine operation spawned at another memory component
  • Map memory references back to the processor
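As one concrete illustration of a space-filling-curve mapping, the sketch below computes a Morton (Z-order) index from 2D block coordinates. The choice of curve and the 16-bit coordinate range are my assumptions; real fractal-memory control logic would resolve its own mapping.

```c
#include <stdint.h>

/* Spread the low 16 bits of x apart with zero bits (Morton helper). */
static uint32_t part1by1(uint32_t x) {
    x &= 0x0000FFFF;
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

/* Map a 2D block coordinate (x, y) to its Morton (Z-order) index, so that
 * blocks that are neighbors in 2D tend to land near each other in memory. */
uint32_t morton2d(uint32_t x, uint32_t y) {
    return part1by1(x) | (part1by1(y) << 1);
}
```

A controller, or a user-level PM layer, could lay out data lines in this order so that neighboring mesh blocks share memory transactions.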
Exploiting Temporal Locality
• Global one-sided memory model
  • Different processors updating the same values in a PDE solver creates race conditions
  • You're going to get the wrong answer anyway, so checkpoint asynchronously and use QMU
  • Inherently resilient algorithms that avoid global synchronization
• Reconfigurable hierarchy: "cache" vs. "scratch pad" (a small sketch of the difference follows this list)
  • "Cache" is seamless and easy to use, but sometimes I'd like to be able to bypass it
  • "Scratch pad" avoids duplicating memory and can be higher performing, but it is harder to use
  • Is SSD going to work like "cache" or "scratch pad"?
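A minimal sketch of the cache vs. scratch-pad trade-off, assuming a hypothetical software-managed scratch buffer (the `scratch` array and tile size below are stand-ins I invented for illustration):

```c
#include <string.h>

#define TILE 256

/* Cache-style: just touch the data and let the hardware decide what stays
 * close to the processor — seamless, but not bypassable. */
double cache_style_sum(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Scratch-pad style: explicitly stage a tile into program-managed storage,
 * reuse it several times, then move on.  Managing the staging by hand is
 * the price of avoiding a duplicated cached copy. */
double scratch_style_reuse(const double *a, size_t n, int passes) {
    static double scratch[TILE];                    /* hypothetical scratch pad */
    double s = 0.0;
    for (size_t off = 0; off + TILE <= n; off += TILE) {
        memcpy(scratch, a + off, sizeof scratch);   /* explicit staging */
        for (int p = 0; p < passes; p++)            /* temporal reuse */
            for (size_t i = 0; i < TILE; i++)
                s += scratch[i] * (p + 1);
    }
    return s;
}
```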
Motivating example: Exa-sorting
• Many linear solution methods are already robust against errors and data race conditions (e.g., multigrid methods)
• What about an application like sorting?
  • A gradient-descent approach is robust under errors* and can be parallelized asynchronously (see the hedged sketch below)
  • Suggests the possibility of research into asynchronous parallel minimization approaches for other classes of problems
• How about non-linear solvers?
  • Analogy in minimization of the objective function via solution of the adjoint problem?
  • What about chaotic systems?
* Joseph Sloan, David Kesler, Rakesh Kumar, and Ali Rahimi, "A Numerical Optimization-based Methodology for Application Robustification: Transforming Applications for Error Tolerance," DSN 2010, Chicago, July 2010.
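The sketch below is my own toy illustration of the idea, not the method from Sloan et al.: sorting recast as iterative local minimization of an "unsortedness" energy (the number of adjacent inversions). A corrupted or lost swap is not fatal, because later sweeps keep driving the energy down until it reaches zero.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy error-tolerant sort: repeatedly sweep the array and swap adjacent
 * out-of-order pairs, i.e. take local steps that reduce the number of
 * adjacent inversions.  If an intermediate swap is corrupted, subsequent
 * sweeps re-fix it; the loop only terminates when the energy is zero,
 * i.e. the array is sorted. */
void relaxation_sort(double *a, size_t n) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t i = 0; i + 1 < n; i++) {
            if (a[i] > a[i + 1]) {          /* local energy-reducing move */
                double t = a[i];
                a[i] = a[i + 1];
                a[i + 1] = t;
                changed = true;
            }
        }
    }
}
```

The same local moves could in principle be applied asynchronously by many processors over disjoint or even overlapping regions, which is the connection to asynchronous parallel minimization the slide is gesturing at.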
From the user/developer perspective
• Domain-specific language to serve as a portable wrapper for the domain user and SME
• Support for a globally addressable memory space
• Easy one-sided and two-sided, synchronous and asynchronous access to remote data (one existing example of one-sided access is sketched below)
• Intuitive mechanism for lightweight thread creation and remote task invocation
• Application control over dynamically reconfigurable memory (hardware cache, software cache and software scratch) at each level of the memory hierarchy (chip, node and storage)
• Tools for monitoring memory and energy utilization, so I know when I'm swapping to DIMM!
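As one existing example of one-sided access to remote data (not the programming model the slide proposes), MPI remote memory access lets one rank read another rank's exposed memory without the target posting a receive. A minimal sketch, meant to be run with at least two ranks:

```c
#include <mpi.h>
#include <stdio.h>

/* One-sided access with MPI RMA: rank 0 reads a value directly out of
 * rank 1's window; rank 1 never issues a matching receive. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 100.0 * rank;            /* data each rank exposes */
    double remote = 0.0;
    MPI_Win win;
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0)                          /* fetch rank 1's value, one-sided */
        MPI_Get(&remote, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    if (rank == 0)
        printf("rank 0 read %.1f from rank 1\n", remote);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```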
Conclusions
• Exascale arrives at the end of the technology generation bridging concurrency to data: risk or opportunity?
• Traditional algorithms + architectures are too expensive in power, performance and reliability if data leaves cache
• Rethinking computation may yield large ROI
  • models of computation
  • "balanced architecture"
  • predictive science vs. data analytics
• Required to facilitate new approaches
  • programming models and tools
  • simulation and modeling framework
  • vendor partnerships and technology investment