
Characterizing a New Class of Threads in Science Applications for High End Supercomputing



  1. Characterizing a New Class of Threads in Science Applications for High End Supercomputing
  Arun Rodrigues, Richard Murphy, Peter Kogge, Keith Underwood
  Presentation by Todd Gamblin for the 11/15/2004 RENCI meeting

  2. PIM: Processing in Memory
  • Put general-purpose logic and memory on the same chip
  • Multiple cores with a common shared memory
  • Has long been used in supercomputing projects
  • Lower latency, higher bandwidth

  3. Simultaneous multithreading (SMT) at the chip level
  • Multiple processes can execute simultaneously without a context switch on a single chip
  • Two models:
    • Processor issues alternately from multiple processes; only one process's instructions are issued per clock (superthreading)
    • Processor issues instructions from multiple processes simultaneously (hyperthreading)
  • Currently supported to varying degrees on the Pentium 4, Power4, and Power5

  4. Threadlets
  • Rodrigues, et al. propose a finer granularity: threadlets
    • Threadlets exist within basic blocks (i.e., between branches)
    • "perhaps a half-dozen" instructions each
    • Lightweight fork, join, and synch mechanisms
  • The concept of fine-grained threads was used before in the Cray Cascade and MTA projects
    • Large numbers of lightweight threads executing concurrently
    • Fork, join, and synch provided by an extra bit per word in memory
    • Full/Empty Bit (FEB) marks each word as produced/consumed
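The FEB synchronization described above can be sketched in software. The class and method names below are our own illustration, not anything from the paper; a hardware FEB would typically also clear the bit when a value is consumed, which this minimal sketch omits:

```python
import threading

class FEBWord:
    """A memory word with a Full/Empty Bit: a read blocks until a
    producer has written the word (minimal software sketch)."""
    def __init__(self):
        self._full = threading.Event()  # the "bit": set means full
        self._value = None

    def write(self, value):
        """Producer side: store the value and mark the word full."""
        self._value = value
        self._full.set()        # wakes any blocked consumers

    def read(self):
        """Consumer side: block while the word is empty."""
        self._full.wait()
        return self._value

# A producer threadlet and a consumer synchronizing on one word:
word = FEBWord()
producer = threading.Thread(target=lambda: word.write(42))
producer.start()
result = word.read()   # blocks until the producer writes
producer.join()
print(result)          # prints 42
```

This is the producer/consumer pattern the Cray MTA exposed per memory word; here one `threading.Event` stands in for the hardware bit.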

  5. What’s the difference?
  • Processor equivalent of lightweight vs. heavyweight processes (think back to OS)
  • Threadlets: light
    • Share the same registers
    • Only state per threadlet is a unique PC and status bits
  • Processes: heavy
    • Need replicated renaming logic and share a pool of general-purpose and floating-point registers
    • Register namespaces for processes are entirely separate
  • Can now have producer/consumer relationships between threadlets sharing registers
    • Need low-level synchronization (fork, join, synch)
    • Implemented with an FEB per register or per memory word (as with Cray), but the paper doesn’t implement anything, so no specifics here

  6. Extracting threadlets from code
  • Each basic block in the code is transformed into:
    • One master thread
    • Multiple smaller threadlets
  • The master spawns threadlets opportunistically

  7. So where exactly do I put this instruction?
  • Given a set threadlet size, the algorithm examines each instruction and tries to assign it to:
    • A threadlet already containing instructions it depends on (minimizes synchronization)
    • A threadlet with fewer already-assigned instructions (balances load)
  • Computes a “score” per threadlet from these criteria and assigns the instruction to the one with the highest score
  • Also tries to keep synchs far apart, to reduce waiting: keep producers at the top of a threadlet, consumers at the bottom
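The assignment heuristic above can be sketched roughly as follows. The weights and the exact score formula are illustrative assumptions on our part; the slide only names the two criteria (dependence locality and load balance):

```python
def assign_instructions(instructions, deps, num_threadlets,
                        w_dep=2.0, w_load=1.0):
    """Greedy threadlet assignment sketch (weights are assumptions).

    instructions: ids in program order
    deps: {instruction id: set of producer instruction ids}
    Scores each candidate threadlet: reward holding this
    instruction's producers, penalize already-assigned load.
    """
    threadlets = [[] for _ in range(num_threadlets)]
    placed = {}  # instruction id -> threadlet index
    for instr in instructions:
        def score(t):
            producers_here = sum(1 for p in deps.get(instr, ())
                                 if placed.get(p) == t)
            return w_dep * producers_here - w_load * len(threadlets[t])
        best = max(range(num_threadlets), key=score)
        threadlets[best].append(instr)
        placed[instr] = best
    return threadlets

# Two independent producer/consumer chains split across 2 threadlets:
print(assign_instructions(["a", "b", "c", "d"],
                          {"b": {"a"}, "d": {"c"}}, 2))
# prints [['a', 'b'], ['c', 'd']]
```

Keeping "b" next to its producer "a" avoids a cross-threadlet synch, which is exactly the trade-off the score encodes.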

  8. Why is this good for scientific applications?
  • Let’s find out…
  • Ran traces and constructed threadlets from:
    • LAMMPS: classical molecular dynamics
    • CTH: multi-material, large-deformation, strong-shock-wave solid mechanics
    • ITS: Monte Carlo radiation problems
    • sPPM: 3D gas dynamics

  9. Conclusions from traces
  • Some observations made:
    • These apps tend to have very large basic blocks
      • Typically from 9 to over 20 instructions on average
      • Typical applications are much smaller (“a few”… does that mean 3-5ish?)
    • They tend to access very large amounts of data
      • References span thousands of pages
      • 40% of instructions access memory
    • Dependency graph widths averaged around 3-4 for entire apps, after control dependencies are added
      • Basic-block dependency graph widths are in the 1-5 range
      • Usually there are multiple consumers per produced value
  • So, we have:
    • Room to make these threadlets
    • Good reason to want to avoid waiting on memory
    • Available parallelism
    • Less synchronization than we think
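One common proxy for the "dependency graph width" these traces measure is the largest number of instructions sitting at the same dependence depth, i.e., how many could in principle run at once. This sketch is our illustration of that proxy, not necessarily the paper's exact metric:

```python
def dependence_width(deps, instructions):
    """Width proxy: max count of instructions at one dependence depth.

    deps: {instruction id: set of producer ids}
    instructions: ids in program order (producers before consumers)
    """
    depth = {}
    for instr in instructions:
        preds = deps.get(instr, ())
        # Depth is one more than the deepest producer (0 if none).
        depth[instr] = 1 + max((depth[p] for p in preds), default=-1)
    levels = {}
    for d in depth.values():
        levels[d] = levels.get(d, 0) + 1
    return max(levels.values())

# "a" and "b" are independent, "c" consumes both, "d" consumes "c":
print(dependence_width({"c": {"a", "b"}, "d": {"c"}},
                       ["a", "b", "c", "d"]))
# prints 2
```

A width of 3-4 per application, as the traces report, would mean a handful of threadlets could keep busy concurrently, which is the parallelism the proposal banks on.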

  10. Strengths
  • Could possibly use this to exploit parallelism on PIM architectures
  • Paper mentions (in passing) the possibility of migrating threadlets on PIM units to the “vicinity” of the data
    • e.g., change the threadlet based on where it references memory
  • Provides a lot of data for people to gauge whether this sort of thing is worthwhile

  11. Weaknesses
  • Paper is all data, with no tests or real conclusions; claims that these are “early results”
  • They’re talking about threadlets within basic blocks: roughly 6 instructions each
  • A typical processor today:
    • Has multiple ALUs and FPUs
    • Issues out of order
    • Has a dynamic issue window anywhere from 100 to 128 instructions (Power4, Pentium 4)
    • Typical memory latency is in the 100-300 ns range (for PowerPC and Pentium chips today… not sure about supercomputers)
  • How are threadlets going to improve anything?
    • Parallelism is extracted statically, so it can’t see past branches
    • Are these 6-instruction threads going to fill gaps of so many (200-600+) clock cycles?
    • Is this new parallelism beyond that already extracted by the processor?
    • It could be that PIM processing units are simpler than this, so maybe yes

  12. References
  • A. Rodrigues, R. Murphy, P. Kogge, and K. Underwood. Characterizing a New Class of Threads in Scientific Applications for High End Supercomputers. In Proceedings of ICS’04.
  • Peter M. Kogge. Processing-In-Memory: An Enabling Technology for Scalable Petaflops Computing. Presentation slides. http://www.cacr.caltech.edu/powr98/presentations/koggepowr/sld001.htm
  • Ars Technica: Introduction to Multithreading, Superthreading and Hyperthreading. http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars/
