Efficient HPC Data Motion via Scratchpad Memory Kayla Seager, Ananta Tiwari, Michael Laurenzano, Joshua Peraza, Pietro Cicotti, Laura Carrington
Question 1: Do HPC workloads benefit from software-managed scratchpads? YES! If so, how will we manage them?
Outline • Motivation • Scratchpad Background • Simulation Framework and Methodology • Initial Study • Current Direction
Problem: HPC Power Wall • Can't scale old systems • Power wall already reached by petaflop systems • Must redesign for power savings • Efficiency must increase by 2x (Source: Exascale Report, Kogge, 2008)
How to Get Energy Savings • Redesign hardware • Simpler hardware • Transfer complexity to software • Minimize expensive data movement • Memory is slower • More cores = more contention • HPC codes have large working set sizes
What is a Scratchpad? • Scratchpad memory (SPM) • Local memory (like a cache) • SPM: software-allocated memory • Simpler hardware • [Figure: a cache (tag array, memory array, decoder) vs. an SPM (memory array, decoder only)]
Scratchpad Allocation • Dynamic • Move block of code • Iterate over code • Move another block • Static: Move block of code once • Strategies • Knapsack • Graph coloring (the register allocation problem)
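The knapsack strategy for static allocation can be sketched as a 0/1 knapsack: each candidate block has a size and an estimated benefit, and the allocator picks the subset that fits in the scratchpad with the highest total benefit. The slides do not give a cost model, so the block names, sizes, and benefit values here are hypothetical.

```python
def knapsack_allocate(blocks, capacity):
    """Choose blocks to place in the scratchpad (static allocation).

    blocks   -- list of (name, size, benefit) tuples; all values hypothetical
    capacity -- scratchpad capacity, in the same units as size
    Returns (total_benefit, sorted list of chosen block names).
    """
    # best[c] = (benefit, chosen names) achievable with capacity c
    best = [(0, [])] * (capacity + 1)
    for name, size, benefit in blocks:
        # Walk capacities downward so each block is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + benefit
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + [name])
    total, chosen = best[capacity]
    return total, sorted(chosen)


# Hypothetical blocks: (name, size in words, estimated benefit)
print(knapsack_allocate([("A", 4, 10), ("B", 3, 7), ("C", 2, 5)], 5))
```

With a capacity of 5, blocks B and C together beat block A alone, which is the kind of trade-off a greedy "largest benefit first" choice would miss.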
The Idea: Less Data Movement • Scratchpad saves energy • Allocation burden now on software • Less complexity in hardware • Move only what you use • Uses temporal locality • Cache • Spatial locality can fail: superfluous data movement (spatial locality is built into cache design; note the 8-word line size in most architectures) • [Figure: words A B C D E moved into cache]
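A rough sketch of why spatial locality can fail: a cache fetches a whole line for every word it misses on, while a scratchpad can be loaded with only the words the code uses. The function below assumes a cold start and no evictions, so each line or word is fetched exactly once; the function name and the example addresses are illustrative only.

```python
def words_moved(addresses, line_words=8):
    """Compare words moved by a cold cache vs. a scratchpad.

    addresses  -- word addresses the code actually touches
    line_words -- words per cache line (8, as noted on the slide)
    Assumes every line and word is fetched exactly once (no evictions).
    Returns (cache_words, spm_words).
    """
    used = set(addresses)
    lines = {a // line_words for a in used}
    cache_words = len(lines) * line_words  # whole lines are fetched
    spm_words = len(used)                  # only the words used
    return cache_words, spm_words


# One word used per line: the cache moves 8x more data than needed.
print(words_moved([0, 8, 16, 24]))
# Dense access: both move the same amount.
print(words_moved(range(8)))
```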
Implication of Scratchpads • Current use: Embedded Systems • Smaller working set size • Predictable code • GPUs • Coding overhead • Issue: HPC codes • Large unpredictable codes • How to generalize codes? • How to make it practical and efficient?
Question 2: Are there computation patterns that benefit most from SPM?
Why idioms? • Pattern of computation/memory access • Characterize Application Data Movement • Metric to compare different scientific codes (good coverage) • Easy to port HPC Code
The Methodology • Idiom characterization study: which idioms favor SPM and which favor cache • Find idioms in HPC codes • Port SPM-favorable idioms in HPC codes to scratchpad
Tool: PEBIL • Binary instrumentation tool • Executable binary => identify basic blocks => cache simulation • Cache simulator built on top of PEBIL • User-defined cache structures • Profiles executables (hit/miss) • [Figure: Stage 1: PEBIL divides the executable binary into basic blocks (Block1, Block2, ...); Stage 2: the blocks run through the cache simulator, and the output lists {#hits} and {#misses} per block]
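The per-block hit/miss profile described above can be illustrated with a small set-associative LRU cache model. This is a standalone sketch, not PEBIL's simulator or its API; the trace format, function name, and default cache geometry are assumptions made for the example.

```python
from collections import OrderedDict, defaultdict


def simulate(trace, num_sets=64, ways=4, line_bytes=64):
    """Count cache hits and misses per basic block.

    trace -- iterable of (block_id, byte_address) pairs, in execution order
    The cache geometry parameters are illustrative defaults.
    Returns {block_id: [hits, misses]}.
    """
    sets = [OrderedDict() for _ in range(num_sets)]  # per-set LRU order
    stats = defaultdict(lambda: [0, 0])
    for block_id, addr in trace:
        line = addr // line_bytes
        s = sets[line % num_sets]
        if line in s:
            s.move_to_end(line)        # mark as most recently used
            stats[block_id][0] += 1
        else:
            if len(s) >= ways:
                s.popitem(last=False)  # evict the least recently used line
            s[line] = True
            stats[block_id][1] += 1
    return dict(stats)


# Hypothetical trace: two basic blocks touching three addresses.
trace = [("Block1", 0), ("Block1", 8), ("Block2", 64), ("Block2", 0)]
print(simulate(trace))
```

Changing `num_sets`, `ways`, or `line_bytes` plays the role of a user-defined cache structure in this toy model.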
Cache/SPM only • [Figure: Stage 1 divides the executable binary into Block1 and Block2; Stage 2 simulates both blocks on the cache and both blocks on the SPM]
Hybrid System • [Figure: Stage 1 divides the executable binary into Block1 and Block2; Stage 2 simulates a hybrid in which each block is mapped to either the SPM or the cache]
Tool: PIR (find idioms in HPC codes) • Used to automatically identify idioms in large-scale HPC applications • Input: Idioms.txt • Idioms are defined using a pattern language • Output: • Idioms matched to source line number • [Figure: example matches: Gather at Loop1, Transpose at Loop2]
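For readers unfamiliar with the two idioms named in the example output, the loops below show their general shape. They are written in Python for illustration and do not reproduce PIR's pattern language or the contents of Idioms.txt.

```python
def gather(src, index):
    """Gather idiom: reads are directed by an index array (dst[i] = src[index[i]])."""
    return [src[j] for j in index]


def transpose(matrix):
    """Transpose idiom: rows are written from columns (out[j][i] = matrix[i][j])."""
    rows, cols = len(matrix), len(matrix[0])
    out = [[None] * rows for _ in range(cols)]
    for i in range(rows):
        for j in range(cols):
            out[j][i] = matrix[i][j]
    return out


print(gather([10, 20, 30, 40], [3, 0, 2]))
print(transpose([[1, 2, 3], [4, 5, 6]]))
```

In the gather loop the addresses read from `src` depend on the contents of `index`, so neighbouring iterations need not touch neighbouring memory.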
Under the hood: HPC Results • Fundamental question: Is there a benefit of SPM for HPC codes? • Simulate full apps on cache and SPM • Use simple heuristic to define the mappings • Simulate on hybrid • Pitfalls: • Sometimes SPM moves more than cache: LRU
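The slides do not say which heuristic defines the hybrid mapping. One plausible choice, shown here purely as an assumption, is to send each basic block to whichever memory moved less data for it in the cache-only and SPM-only simulations. The function name and the byte counts are hypothetical.

```python
def hybrid_mapping(cache_bytes, spm_bytes):
    """Assign each basic block to the memory that moved less data.

    cache_bytes, spm_bytes -- {block_id: bytes moved} from the cache-only
    and SPM-only simulations (hypothetical inputs).
    Ties go to the cache. This is an assumed heuristic, not the authors'.
    """
    return {
        block: "SPM" if spm_bytes[block] < cache_bytes[block] else "cache"
        for block in cache_bytes
    }


cache_only = {"Block1": 640, "Block2": 128}
spm_only = {"Block1": 320, "Block2": 256}
print(hybrid_mapping(cache_only, spm_only))
```

Block2 stays on the cache here, which reflects the pitfall noted above: the SPM does not always move less data.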
Metrics • Data Moved = (Cache Misses) × (Cache Line Size) • Data Movement Ratio = (SPM Data Movement) / (Cache Data Movement)
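The two metrics translate directly into code. The function names and the sample numbers are made up for illustration; only the formulas come from the slide.

```python
def data_moved(cache_misses, line_size_bytes):
    """Data Moved = (Cache Misses) * (Cache Line Size)."""
    return cache_misses * line_size_bytes


def data_movement_ratio(spm_bytes, cache_bytes):
    """Data Movement Ratio = SPM data movement / cache data movement."""
    return spm_bytes / cache_bytes


# Hypothetical numbers: 1000 misses on 64-byte lines, 48000 bytes moved by SPM.
cache_bytes = data_moved(1000, 64)
print(cache_bytes, data_movement_ratio(48000, cache_bytes))
```

A ratio below 1 means the scratchpad moved less data than the cache; a ratio above 1 means it moved more.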
HPC Applications • Graph500 • Construct and traverse weighted undirected graph • HYCOM • Ocean model: hybrid isopycnal-sigma-pressure, generalized coordinate • SMG2000 • Parallel semi-coarsening multigrid solver • Sequoia Benchmarks • SPHOT • Monte Carlo photon transport code • UMT • Unstructured-mesh deterministic radiation transport code • AMG2006 • Algebraic multigrid linear system solver for unstructured meshes
Question 1: Do HPC workloads benefit from software-managed scratchpads? YES!
Using Methodology for HYCOM • Gather idiom: prefers SPM • Find gather in HYCOM: 33 instances • Port idiom blocks: hybrid structure • Port gather basic blocks to SPM • Rest on cache • Result: HYCOM (ocean modeling code) savings: 20% in data motion
Real SPM for PEBIL? • Extension of PEBIL Simulator • Fully associative cache • Rethink replacement policy • Dynamic Allocation Scheme • Idioms determine loops for allocation • Reuse distance library • Track how often used • Track distance of use • [Figure: access sequence A B C A; reuse distance of A = 2]
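The reuse distance in the A B C A example is the number of distinct addresses touched between two accesses to the same address. The sketch below is a simple quadratic-time version for illustration; it is not the reuse distance library mentioned on the slide, and the function name is an assumption.

```python
def reuse_distances(trace):
    """Return the reuse distance of each access in trace (a list).

    The reuse distance is the number of distinct addresses touched since
    the previous access to the same address; first accesses get None.
    """
    last_seen = {}
    result = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            between = set(trace[last_seen[addr] + 1 : i])
            result.append(len(between))
        else:
            result.append(None)
        last_seen[addr] = i
    return result


# The slide's example: B and C sit between the two accesses to A.
print(reuse_distances(["A", "B", "C", "A"]))
```

An address with a short reuse distance and many reuses is a natural candidate to keep resident in the scratchpad.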
Results Summary • SPM • Simpler Hardware • Efficient Data Movement • Developed Methodology for SPM • Idiom characterization • Idiom identification in HPC codes • Port SPM hotspots • 20% Data Movement Savings for HYCOM • Scratchpad shows potential • Good when spatial locality fails • HPC applications • SPM only: Average 22% Data Movement Saved • Hybrid: Average 39% Max 69% Data Movement Saved • 4x Improvement for Gather idiom • Current work on creating SPM for PEBIL
Acknowledgements • PMaC team • Laura Carrington • Ananta Tiwari • Michael Laurenzano • Pietro Cicotti • Mitesh Meswani • Dedicated to: Allan Snavely
Idioms: Strided Access i=i+stride
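A minimal loop showing the strided access idiom, in which the index advances by a fixed stride on every iteration. The function name and the choice to sum the visited elements are illustrative assumptions.

```python
def strided_sum(data, stride):
    """Sum every stride-th element, advancing the index as i = i + stride."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    total = 0
    i = 0
    while i < len(data):
        total += data[i]
        i = i + stride
    return total


# Visits indices 0, 3, 6, 9 of a 10-element array.
print(strided_sum(list(range(10)), 3))
```

When the stride is at least as large as the cache line, every access lands on a different line, so most of each fetched line goes unused.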
Looking Forward • Idiom-Driven Allocation • PIR determines loops for allocation • Pre-allocated array for SPM • Pointers to loops: trigger replacement • Mimic dynamic compiler replacement policy