Characterizing a New Class of Threads in Scientific Applications for High End Supercomputers • Arun Rodrigues, Richard Murphy, Peter Kogge, Keith Underwood • Presentation by Todd Gamblin for the 11/15/2004 RENCI meeting
PIM: Processing in Memory • Put general-purpose logic and memory on the same chip • Multiple cores with a common shared memory • Long explored in supercomputing projects • Lower latency and higher bandwidth than off-chip memory access
Simultaneous multithreading (SMT) at the chip level • Multiple processes can execute simultaneously on a single chip without a context switch • Two models (toy simulation below): • The processor alternates issue between processes; only one process’s instructions issue in a given clock cycle (superthreading) • The processor issues instructions from multiple processes in the same cycle (hyperthreading) • Currently supported to varying degrees on the Pentium 4, Power4, and Power5
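To make the two models concrete, here is a toy issue-slot simulation. This is my own illustration, not from the paper or any real pipeline: the 4-slot issue width and the round-robin policies are assumptions chosen only to show the difference between filling a cycle from one thread and mixing threads within a cycle.

```python
# Toy illustration (not a real pipeline model) of the two SMT issue models:
# each cycle has 4 issue slots; "superthreading" fills a whole cycle from one
# thread at a time, "hyperthreading" mixes ready threads within one cycle.
from itertools import cycle

def superthread(threads, slots=4):
    """Alternate whole issue cycles between threads."""
    order = cycle(range(len(threads)))
    while any(threads):
        t = next(order)
        issued, threads[t] = threads[t][:slots], threads[t][slots:]
        if issued:                      # skip threads with no work left
            yield issued

def hyperthread(threads, slots=4):
    """Fill each cycle's slots round-robin from all threads with work."""
    t = 0
    while any(threads):
        issued = []
        while len(issued) < slots and any(threads):
            if threads[t]:
                issued.append(threads[t].pop(0))
            t = (t + 1) % len(threads)
        yield issued

if __name__ == "__main__":
    a = [f"A{i}" for i in range(6)]
    b = [f"B{i}" for i in range(6)]
    print(list(superthread([a[:], b[:]])))  # [[A0..A3], [B0..B3], [A4,A5], [B4,B5]]
    print(list(hyperthread([a[:], b[:]])))  # [[A0,B0,A1,B1], [A2,B2,A3,B3], ...]
```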
Threadlets • Rodrigues et al. propose a finer granularity: threadlets • Threadlets live within basic blocks (i.e., between branches) • “Perhaps a half-dozen” instructions each • Lightweight fork, join, and synch mechanisms • Fine-grained threads have been used before, in the Cray Cascade and MTA projects • Large numbers of lightweight threads executing concurrently • Fork, join, and synch provided by an extra bit per word in memory: a Full/Empty Bit (FEB) marking each word as produced or consumed (sketch below)
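Since the slides only name the FEB mechanism, here is a minimal software sketch of the produced/consumed protocol, assuming standard full/empty semantics as on the Cray MTA. The FEBCell class and its method names are my own; a Python condition variable stands in for the hardware bit.

```python
# A minimal sketch of Full/Empty Bit (FEB) semantics, modeled in software.
# On the MTA the bit lives in hardware next to each memory word; here a
# FEBCell just illustrates the produced/consumed protocol a threadlet
# would use to synch with another.
import threading

class FEBCell:
    def __init__(self):
        self.full = False          # the "extra bit per word"
        self.value = None
        self.cond = threading.Condition()

    def write_full(self, value):
        """Producer: wait until the word is empty, store, mark full."""
        with self.cond:
            while self.full:
                self.cond.wait()
            self.value, self.full = value, True
            self.cond.notify_all()

    def read_empty(self):
        """Consumer: wait until the word is full, load, mark empty."""
        with self.cond:
            while not self.full:
                self.cond.wait()
            self.full = False
            self.cond.notify_all()
            return self.value

if __name__ == "__main__":
    cell = FEBCell()
    consumer = threading.Thread(
        target=lambda: print("consumed:", cell.read_empty()))
    consumer.start()
    cell.write_full(42)            # producer fills the word; consumer unblocks
    consumer.join()
```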
What’s the difference? • Processor-level equivalent of lightweight vs. heavyweight processes (think back to OS) • Threadlets: light • Share the same registers • The only state per threadlet is a unique PC and status bits • Processes: heavy • Need replicated register-renaming logic that shares a pool of general-purpose and floating-point registers • Register namespaces for processes are entirely separate • Threadlets sharing registers can now have producer/consumer relationships • This requires low-level synchronization (fork, join, synch) • Could be implemented with an FEB per register or per memory word (as on the Cray MTA), but the paper doesn’t implement anything, so no specifics are given
Extracting threadlets from code • Each basic block in the code is transformed into: • One master thread • Multiple smaller threadlets • Master spawns threadlets opportunistically
So where exactly do I put this instruction? • Given a fixed threadlet size, the algorithm examines each instruction and tries to assign it to: • A threadlet already containing instructions it depends on (minimizes synchronization) • A threadlet with fewer already-assigned instructions (balances load) • It computes a “score” per threadlet from these criteria and assigns the instruction to the threadlet with the highest score • It also tries to keep synchs far apart to reduce waiting: producers stay near the top of a threadlet, consumers near the bottom (a sketch of this heuristic follows below)
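The slides describe the heuristic only in prose, so the sketch below fills in one plausible scoring function. MAX_SIZE, the 2:1 weighting of dependence affinity over load balance, and the deps representation are all illustrative assumptions, not the paper's actual algorithm.

```python
# Sketch of the greedy threadlet-assignment heuristic described above. The
# exact scoring function is not given in the slides, so the affinity/balance
# terms, the 2:1 weighting, and MAX_SIZE are illustrative assumptions.
MAX_SIZE = 6                           # "perhaps a half-dozen" instructions

def assign_threadlets(instructions, deps, num_threadlets):
    """instructions: ids in program order; deps[i]: ids instruction i depends on."""
    assert num_threadlets * MAX_SIZE >= len(instructions), "not enough room"
    threadlets = [[] for _ in range(num_threadlets)]
    home = {}                          # instruction id -> threadlet index
    for ins in instructions:
        best, best_score = None, None
        for t, members in enumerate(threadlets):
            if len(members) >= MAX_SIZE:
                continue               # threadlet is full
            # favor the threadlet already holding this instruction's producers
            affinity = sum(1 for d in deps[ins] if home.get(d) == t)
            # favor emptier threadlets to balance the load
            balance = MAX_SIZE - len(members)
            score = 2 * affinity + balance
            if best_score is None or score > best_score:
                best, best_score = t, score
        threadlets[best].append(ins)   # producers land above later consumers
        home[ins] = best
    return threadlets

if __name__ == "__main__":
    # Toy basic block: instructions 2 and 3 both consume instruction 0's result
    deps = {0: set(), 1: set(), 2: {0}, 3: {0, 1}}
    print(assign_threadlets([0, 1, 2, 3], deps, num_threadlets=2))
    # -> [[0, 2], [1, 3]]: consumers co-located with producers, load balanced
```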
Why is this good for scientific applications? • Let’s find out… • The authors ran traces and constructed threadlets from: • LAMMPS: classical molecular dynamics • CTH: multi-material, large-deformation, strong shock wave solid mechanics • ITS: Monte Carlo radiation transport problems • sPPM: 3D gas dynamics
Conclusions from traces • Some observations: • These apps tend to have very large basic blocks • Typically averaging from 9 to over 20 instructions • Typical applications are much smaller (the paper says “a few”… does that mean 3-5ish?) • They access very large amounts of data • References span thousands of pages • 40% of instructions access memory • Dependency graph widths averaged around 3-4 for entire apps, after control dependencies are added (one way to measure this is sketched below) • Basic-block dependency graph widths are in the 1-5 range • There are usually multiple consumers per produced value • So we have: • Room to make these threadlets • Good reason to want to avoid waiting on memory • Available parallelism • Less synchronization needed than we might think
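For context on the width numbers above: one common way to measure dependence-graph width is to levelize the DAG by longest path from a producer and count instructions per level. This is a plausible reconstruction of such a measurement, not necessarily the authors' methodology; deps uses the same representation as the assignment sketch above.

```python
# Levelization sketch for measuring dependence-graph "width": an instruction's
# level is 1 + the deepest level among its producers, and the width at a level
# is how many instructions could in principle execute in parallel there. A
# plausible way to get numbers like the 3-4 average quoted above, not
# necessarily the authors' exact method.
def graph_widths(deps):
    """deps[i]: set of instruction ids that i depends on -> (avg, max) width."""
    level = {}
    def depth(i):                      # longest dependence chain ending at i
        if i not in level:
            level[i] = 1 + max((depth(d) for d in deps[i]), default=0)
        return level[i]
    for i in deps:
        depth(i)
    counts = {}                        # level -> number of instructions
    for lvl in level.values():
        counts[lvl] = counts.get(lvl, 0) + 1
    widths = list(counts.values())
    return sum(widths) / len(widths), max(widths)

if __name__ == "__main__":
    deps = {0: set(), 1: set(), 2: {0}, 3: {0, 1}}   # same toy block as before
    print(graph_widths(deps))          # levels {0,1} then {2,3} -> (2.0, 2)
```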
Strengths • Could be used to exploit parallelism on PIM architectures • The paper mentions (in passing) the possibility of migrating threadlets among PIM units into the “vicinity” of the data • e.g., placing a threadlet based on where it references memory • Provides a lot of data for gauging whether this sort of thing is worthwhile
Weaknesses • The paper is all data, with no tests or real conclusions • It claims these are “early results” • They’re talking about threadlets within basic blocks: ~6 instructions each • A typical processor today: • Has multiple ALUs and FPUs • Issues out of order • Has a dynamic issue window on the order of 100-128 instructions (Power4, Pentium 4) • Typical memory latency is in the 100-300 ns range (for PowerPC and Pentium chips today… not sure about supercomputers) • How are threadlets going to improve anything? • Parallelism is extracted statically: the algorithm can’t see past branches • Are these 6-instruction threads going to fill gaps of so many (200-600+) clock cycles? • Is this new parallelism, beyond what the processor already extracts? • PIM processing units may be simpler than today’s cores, so maybe yes
References • A. Rodrigues, R. Murphy, P. Kogge, and K. Underwood. Characterizing a New Class of Threads in Scientific Applications for High End Supercomputers. In Proceedings of ICS ’04. • Peter M. Kogge. Processing-In-Memory: An Enabling Technology for Scalable Petaflops Computing. Presentation slides. http://www.cacr.caltech.edu/powr98/presentations/koggepowr/sld001.htm • Ars Technica: Introduction to Multithreading, Superthreading and Hyperthreading. http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars/