Characterizing a New Class of Threads in Scientific Applications for High End Supercomputers • Arun Rodrigues, Richard Murphy, Peter Kogge, Keith Underwood • Presentation by Todd Gamblin for the 11/15/2004 RENCI meeting
PIM: Processing in Memory • Put general-purpose logic and memory on the same chip • Multiple cores with a common shared memory • Long explored in supercomputing projects • Lower latency and higher bandwidth than off-chip memory access
Simultaneous multithreading (SMT) at the chip level • Multiple processes can execute simultaneously on a single chip without a context switch • Two models (toy simulation below): • The processor alternates issue between processes; only one process’s instructions issue in a given clock cycle (superthreading) • The processor issues instructions from multiple processes in the same cycle (hyperthreading) • Currently supported to varying degrees on the Pentium 4, Power4, and Power5
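To make the two models concrete, here is a toy issue-slot simulation. This is my own illustration, not from the paper or any real pipeline: the 4-slot issue width and the round-robin policies are assumptions chosen only to show the difference between filling a cycle from one thread and mixing threads within a cycle.

```python
# Toy illustration (not a real pipeline model) of the two SMT issue models:
# each cycle has 4 issue slots; "superthreading" fills a whole cycle from one
# thread at a time, "hyperthreading" mixes ready threads within one cycle.
from itertools import cycle

def superthread(threads, slots=4):
    """Alternate whole issue cycles between threads."""
    order = cycle(range(len(threads)))
    while any(threads):
        t = next(order)
        issued, threads[t] = threads[t][:slots], threads[t][slots:]
        if issued:                      # skip threads with no work left
            yield issued

def hyperthread(threads, slots=4):
    """Fill each cycle's slots round-robin from all threads with work."""
    t = 0
    while any(threads):
        issued = []
        while len(issued) < slots and any(threads):
            if threads[t]:
                issued.append(threads[t].pop(0))
            t = (t + 1) % len(threads)
        yield issued

if __name__ == "__main__":
    a = [f"A{i}" for i in range(6)]
    b = [f"B{i}" for i in range(6)]
    print(list(superthread([a[:], b[:]])))  # [[A0..A3], [B0..B3], [A4,A5], [B4,B5]]
    print(list(hyperthread([a[:], b[:]])))  # [[A0,B0,A1,B1], [A2,B2,A3,B3], ...]
```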
Threadlets • Rodrigues et al. propose a finer granularity: threadlets • Threadlets live within basic blocks (i.e., between branches) • “Perhaps a half-dozen” instructions each • Lightweight fork, join, and synch mechanisms • Fine-grained threads have been used before, in the Cray Cascade and MTA projects • Large numbers of lightweight threads executing concurrently • Fork, join, and synch provided by an extra bit per word in memory: a Full/Empty Bit (FEB) marking each word as produced or consumed (sketch below)
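Since the slides only name the FEB mechanism, here is a minimal software sketch of the produced/consumed protocol, assuming standard full/empty semantics as on the Cray MTA. The FEBCell class and its method names are my own; a Python condition variable stands in for the hardware bit.

```python
# A minimal sketch of Full/Empty Bit (FEB) semantics, modeled in software.
# On the MTA the bit lives in hardware next to each memory word; here a
# FEBCell just illustrates the produced/consumed protocol a threadlet
# would use to synch with another.
import threading

class FEBCell:
    def __init__(self):
        self.full = False          # the "extra bit per word"
        self.value = None
        self.cond = threading.Condition()

    def write_full(self, value):
        """Producer: wait until the word is empty, store, mark full."""
        with self.cond:
            while self.full:
                self.cond.wait()
            self.value, self.full = value, True
            self.cond.notify_all()

    def read_empty(self):
        """Consumer: wait until the word is full, load, mark empty."""
        with self.cond:
            while not self.full:
                self.cond.wait()
            self.full = False
            self.cond.notify_all()
            return self.value

if __name__ == "__main__":
    cell = FEBCell()
    consumer = threading.Thread(
        target=lambda: print("consumed:", cell.read_empty()))
    consumer.start()
    cell.write_full(42)            # producer fills the word; consumer unblocks
    consumer.join()
```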
What’s the difference? • Processor-level equivalent of lightweight vs. heavyweight processes (think back to OS) • Threadlets: light • Share the same registers • The only state per threadlet is a unique PC and status bits • Processes: heavy • Need replicated register-renaming logic that shares a pool of general-purpose and floating-point registers • Register namespaces for processes are entirely separate • Threadlets sharing registers can now have producer/consumer relationships • This requires low-level synchronization (fork, join, synch) • Could be implemented with an FEB per register or per memory word (as on the Cray MTA), but the paper doesn’t implement anything, so no specifics are given
Extracting threadlets from code • Each basic block in the code is transformed into: • One master thread • Multiple smaller threadlets • Master spawns threadlets opportunistically
So where exactly do I put this instruction? • Given a fixed threadlet size, the algorithm examines each instruction and tries to assign it to: • A threadlet already containing instructions it depends on (minimizes synchronization) • A threadlet with fewer already-assigned instructions (balances load) • It computes a “score” per threadlet from these criteria and assigns the instruction to the threadlet with the highest score • It also tries to keep synchs far apart to reduce waiting: producers stay near the top of a threadlet, consumers near the bottom (a sketch of this heuristic follows below)
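The slides describe the heuristic only in prose, so the sketch below fills in one plausible scoring function. MAX_SIZE, the 2:1 weighting of dependence affinity over load balance, and the deps representation are all illustrative assumptions, not the paper's actual algorithm.

```python
# Sketch of the greedy threadlet-assignment heuristic described above. The
# exact scoring function is not given in the slides, so the affinity/balance
# terms, the 2:1 weighting, and MAX_SIZE are illustrative assumptions.
MAX_SIZE = 6                           # "perhaps a half-dozen" instructions

def assign_threadlets(instructions, deps, num_threadlets):
    """instructions: ids in program order; deps[i]: ids instruction i depends on."""
    assert num_threadlets * MAX_SIZE >= len(instructions), "not enough room"
    threadlets = [[] for _ in range(num_threadlets)]
    home = {}                          # instruction id -> threadlet index
    for ins in instructions:
        best, best_score = None, None
        for t, members in enumerate(threadlets):
            if len(members) >= MAX_SIZE:
                continue               # threadlet is full
            # favor the threadlet already holding this instruction's producers
            affinity = sum(1 for d in deps[ins] if home.get(d) == t)
            # favor emptier threadlets to balance the load
            balance = MAX_SIZE - len(members)
            score = 2 * affinity + balance
            if best_score is None or score > best_score:
                best, best_score = t, score
        threadlets[best].append(ins)   # producers land above later consumers
        home[ins] = best
    return threadlets

if __name__ == "__main__":
    # Toy basic block: instructions 2 and 3 both consume instruction 0's result
    deps = {0: set(), 1: set(), 2: {0}, 3: {0, 1}}
    print(assign_threadlets([0, 1, 2, 3], deps, num_threadlets=2))
    # -> [[0, 2], [1, 3]]: consumers co-located with producers, load balanced
```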
Why is this good for scientific applications? • Let’s find out… • The authors ran traces and constructed threadlets from: • LAMMPS: classical molecular dynamics • CTH: multi-material, large-deformation, strong shock wave solid mechanics • ITS: Monte Carlo radiation transport problems • sPPM: 3D gas dynamics
Conclusions from traces • Some observations: • These apps tend to have very large basic blocks • Typically averaging from 9 to over 20 instructions • Typical applications are much smaller (the paper says “a few”… does that mean 3-5ish?) • They access very large amounts of data • References span thousands of pages • 40% of instructions access memory • Dependency graph widths averaged around 3-4 for entire apps, after control dependencies are added (one way to measure this is sketched below) • Basic-block dependency graph widths are in the 1-5 range • There are usually multiple consumers per produced value • So we have: • Room to make these threadlets • Good reason to want to avoid waiting on memory • Available parallelism • Less synchronization needed than we might think
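For context on the width numbers above: one common way to measure dependence-graph width is to levelize the DAG by longest path from a producer and count instructions per level. This is a plausible reconstruction of such a measurement, not necessarily the authors' methodology; deps uses the same representation as the assignment sketch above.

```python
# Levelization sketch for measuring dependence-graph "width": an instruction's
# level is 1 + the deepest level among its producers, and the width at a level
# is how many instructions could in principle execute in parallel there. A
# plausible way to get numbers like the 3-4 average quoted above, not
# necessarily the authors' exact method.
def graph_widths(deps):
    """deps[i]: set of instruction ids that i depends on -> (avg, max) width."""
    level = {}
    def depth(i):                      # longest dependence chain ending at i
        if i not in level:
            level[i] = 1 + max((depth(d) for d in deps[i]), default=0)
        return level[i]
    for i in deps:
        depth(i)
    counts = {}                        # level -> number of instructions
    for lvl in level.values():
        counts[lvl] = counts.get(lvl, 0) + 1
    widths = list(counts.values())
    return sum(widths) / len(widths), max(widths)

if __name__ == "__main__":
    deps = {0: set(), 1: set(), 2: {0}, 3: {0, 1}}   # same toy block as before
    print(graph_widths(deps))          # levels {0,1} then {2,3} -> (2.0, 2)
```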
Strengths • Could be used to exploit parallelism on PIM architectures • The paper mentions (in passing) the possibility of migrating threadlets among PIM units into the “vicinity” of the data • e.g., placing a threadlet based on where it references memory • Provides a lot of data for gauging whether this sort of thing is worthwhile
Weaknesses • The paper is all data, with no tests or real conclusions • It claims these are “early results” • They’re talking about threadlets within basic blocks: ~6 instructions each • A typical processor today: • Has multiple ALUs and FPUs • Issues out of order • Has a dynamic issue window on the order of 100-128 instructions (Power4, Pentium 4) • Typical memory latency is in the 100-300 ns range (for PowerPC and Pentium chips today… not sure about supercomputers) • How are threadlets going to improve anything? • Parallelism is extracted statically: the algorithm can’t see past branches • Are these 6-instruction threads going to fill gaps of so many (200-600+) clock cycles? • Is this new parallelism, beyond what the processor already extracts? • PIM processing units may be simpler than today’s cores, so maybe yes
References • A. Rodrigues, R. Murphy, P. Kogge, and K. Underwood. Characterizing a New Class of Threads in Scientific Applications for High End Supercomputers. In Proceedings of ICS ’04. • Peter M. Kogge. Processing-In-Memory: An Enabling Technology for Scalable Petaflops Computing. Presentation slides. http://www.cacr.caltech.edu/powr98/presentations/koggepowr/sld001.htm • Ars Technica: Introduction to Multithreading, Superthreading and Hyperthreading. http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars/