
Jake Adriaens jtadriaens@wisc Dan Gibson gibson@cs.wisc

CS 838: Pervasive Parallelism Profiling and Parallelization of the Multifacet GEMS Simulation Infrastructure. Instructor: Mark D. Hill. Jake Adriaens jtadriaens@wisc.edu Dan Gibson gibson@cs.wisc.edu. Problem. Simulation is (really) slow! Simics alone runs at ~ 5 MIPS (fast!)


Presentation Transcript


  1. CS 838: Pervasive Parallelism. Profiling and Parallelization of the Multifacet GEMS Simulation Infrastructure. Instructor: Mark D. Hill. Jake Adriaens (jtadriaens@wisc.edu), Dan Gibson (gibson@cs.wisc.edu)

  2. Problem
     • Simulation is (really) slow!
       • Simics alone runs at ~5 MIPS (fast!)
       • Add Ruby: ~50 KIPS
       • Add Opal: ~20 KIPS
     • Fast simulations lead to faster evaluation of new ideas.
       • Running many simulations in parallel (via Condor, for instance) is great for shrinking error bars, but less useful for development.
     • Fast simulations are useful for educational purposes
       • Remember how long it took to simulate HW 5, HW 6?
     • Simulations of long-running commercial workloads can take hours or DAYS, even on top-of-the-line hardware

  3. More Motivation – Why Parallelize?
     Chips currently look like this:
     [Figure: Dual-Core AMD Opteron die photo, labeled "A couple of cores", "Memory & I/O Control", and "On-Chip Cache". From: Microprocessor Report: Best Servers of 2004]

  4. More Motivation – Why Parallelize?
     Soon, chips may look like this:
     [Figure: eight blocks labeled CORE and four labeled $ BANK, joined by an Interconnect]
     • More cores!
     • Many more threads
     • The free lunch is over: to get speedup out of multithreaded processors, programmers must implement parallel programs. (for now)

  5. Summary
     • Good News: Found parallelism in GEMS
       • Ruby’s event queue often contains independent events
       • Opal has some implicit parallelism, as it simulates many logically independent processors
     • Bad News: Speedup potential is limited
       • In most cases, execution within Simics dominates execution time
       • Amdahl’s Law suggests parallelization of GEMS will yield only small increases in performance
     • Good News: Discovered inefficiencies
       • The way GEMS uses Simics greatly affects Simics performance
       • Isolated troublesome API calls and stalled processor effects
     • Bad News: Simics isn’t very thread-friendly
       • No thread-safe functionality
       • Calling the Simics API requires a (costly) thread switch!

  6. Summary
     • More Bad News: Parallelization of Ruby was not (entirely) successful
       • Demonstrated little/no performance gain
       • Suffers from deadlock
         • We have a good excuse for this…
       • Nondeterministic
         • Fixable, minor effect
       • Assumptions of non-concurrent execution
         • Ready()/Operate() pairs

  7. What Next?
     • Overview of Simics/Ruby/Opal
       • Lengthy example
     • Profiling Experiments
       • Description of profiling experiments
       • Results
     • Effects Ruby / Opal have on Simics
       • “Null” module experiments
     • Parallel Ruby
       • …and its catastrophic failure
     • Observations
     • Conclusions

  8. Simics / Ruby / Opal Overview - 1
     [Figure: GEMS block diagram. Labels: Random Tester, Opal (Detailed Processor Model), Simics, Deterministic, Contended locks, Trace file, Microbenchmarks, Simics loadable modules, events E1-E5.]
     Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems

  9. Opal + Ruby + Simics Operation
     [Figure: Opal (Detailed Processor Model), events E1-E5, and Simics. Callouts: Install Module, Start Sim, Install Module, Instruction Fetches, I-Fetch Complete. Pipeline markers: F F F F.]
     Instruction listing shown on the slide:
       loop:     add R2 R2 R3
                 beqz R2 loop
                 add R1 R2 R3
                 sub R7 R8 R9
                 ld R2 A
                 ld R8 B
                 beq R2 R8 eq
                 call my_func1
       eq:       ld R2 A
                 beq R2 R4 eq
                 call my_func2
       my_func1: ld R8 C
                 ret
     Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems

  10. Opal + Ruby + Simics Operation
      [Figure: same layout and instruction listing as slide 9. Callouts: API Calls for Decoding, Instruction Fetches. Pipeline markers: D D D D, F F F F.]

  11. Opal + Ruby + Simics Operation
      [Figure: same layout and instruction listing as slide 9. Callout: Step 1 Instr. Pipeline markers: C X X W X X S X D D D D X S S X D D D D F F F F.]

  12. Opal + Ruby + Simics Operation
      [Figure: same layout and instruction listing as slide 9. Callouts: Step 3 Instrs., ld A, ld B. Pipeline markers: C C C M M S W X S X X.]

  13. Opal + Ruby + Simics Operation
      [Figure: same layout and instruction listing as slide 9. Callouts: A=1, B=1, ld C. Pipeline markers: S S S W S S M W.]

  14. Opal + Ruby + Simics Operation
      [Figure: same layout and instruction listing as slide 9. Callouts: Step 4 Instrs., I-Fetch, call. Pipeline markers: C C C C X F.]
      Simple, Right?

  15. Finding Parallelism
      • Lots of parallelism opportunities in the example!
        • Ruby/Opal (as described) could be run by separate threads!
      • Ruby is a discrete event simulator…
        • Can we apply Fujimoto’s PDES strategies directly?
      • Places we found parallelism:
        • Ruby’s Event Queue (Experiment 1)
        • Opal in general, on a per-processor basis (Experiment 2)
        • Modular structure (not explored)
      • But how much speedup can we gain through parallelism?

  16. Experiment 1: Ruby’s Event Queue
      • Ruby is already a discrete event simulator (DES)
        • Making it a parallel DES (PDES) à la Fujimoto might be a way to speed things up!
        • Already has implicit lookahead of 1, due to existing event scheduling constraints.
      • How many events are available for processing in a given cycle of the event queue?
        • Too few could limit lookahead properties
      • How long does a typical event execute?
        • Short events could make the queue itself a bottleneck
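The two questions on this slide (how many events are available per cycle, and how short they are) can be explored on a toy event queue. The sketch below is not Ruby's event queue; `EventQueue`, `schedule`, and `make_chain` are hypothetical names, and it assumes that "lookahead of 1" means an event can only be scheduled for a later cycle, so all events popped for the same cycle are independent of anything scheduled during that cycle.

```python
import heapq
from collections import Counter


class EventQueue:
    """Toy discrete event queue; entries are (wakeup_cycle, seq, callback)."""

    def __init__(self):
        self._heap = []
        self._seq = 0                      # tie-breaker so callbacks are never compared
        self.now = 0
        self.events_per_cycle = Counter()  # profiling data: cycle -> event count

    def schedule(self, delay, callback):
        # Assumed constraint: events land at least one cycle in the future.
        if delay < 1:
            raise ValueError("events must be scheduled at least one cycle ahead")
        heapq.heappush(self._heap, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._heap:
            self.now = self._heap[0][0]
            ready = []
            # Everything with the same wakeup cycle is available together.
            while self._heap and self._heap[0][0] == self.now:
                ready.append(heapq.heappop(self._heap))
            self.events_per_cycle[self.now] = len(ready)
            for _, _, callback in ready:   # a PDES would hand these to workers
                callback(self)


def make_chain(remaining):
    """Hypothetical event that reschedules itself `remaining` more times."""
    def fire(queue):
        if remaining > 0:
            queue.schedule(1, make_chain(remaining - 1))
    return fire


q = EventQueue()
for _ in range(3):
    q.schedule(1, make_chain(2))
q.run()
```

After the run, `q.events_per_cycle` holds the per-cycle event counts, which is the kind of histogram Experiment 1 collects. If the counts are small or the callbacks are very short, the shared queue itself becomes the contended resource.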

  17. Results 1 – Ruby’s Event Queue
      [Figure: two charts, "Event Counts" and "Event Duration"; axis: Percentage of All Events]
      Simics Time = SimTime – RubyTime = ~80%
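To see why the summary invokes Amdahl's Law, the bound can be computed directly. This is a sketch under one assumption: that the ~80% figure is the share of total simulation time spent in Simics and that this share stays sequential, leaving roughly 20% (the Ruby share) available to parallelize. The function names are illustrative.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Overall speedup when only `parallel_fraction` of runtime scales with workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def amdahl_limit(parallel_fraction):
    """Upper bound on speedup with unlimited workers."""
    return 1.0 / (1.0 - parallel_fraction)


# Assumed split: ~80% sequential (Simics), ~20% parallelizable (Ruby).
speedup_4_threads = amdahl_speedup(0.2, 4)
speedup_limit = amdahl_limit(0.2)
print(f"4 threads: {speedup_4_threads:.3f}x, limit: {speedup_limit:.2f}x")
```

Under that assumption, even a perfectly parallel Ruby cannot make the whole simulation more than 1.25x faster, and four threads would give about 1.18x.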

  18. Experiment 2 – Opal’s Per-Processor Parallelism
      • Opal simulates multiple logically independent processors
        • Simulated processor independence => Parallelism
      • Use one thread per simulated processor?
        • Raises work imbalance issues
        • In practice, the work imbalance is tolerable
      • Processors are only logically independent
        • A common sequential bottleneck is shared between all Opal processors: Simics

  19. Experiment 2 – Opal’s Per-Processor Parallelism
      [Figure: pie chart of execution time in an Opal+Simics simulation. Slices: Opal, SIM_continue API call, SIM_read_phys_memory API call, SIM_break_simulation API call, all other API calls.]
      Best parallel Opal speedup <= 40%!

  20. Experiment 2 – Opal’s Per-Processor Parallelism
      • Why is SIM_continue so slow?
        • Opal uses SIM_continue to logically progress the simulation by a small number (1-4) of instructions at a time.
        • SIM_continue performs extensive start-up and tear-down optimization, expecting large (10,000+) step sizes
        • Increasing Opal’s stepping size decreases total SIM_continue time significantly, but makes fine-grained simulation difficult
      • Why is SIM_read_phys_memory so slow?
        • One call to SIM_read_phys_memory takes ~1 µs of execution time
        • Reads from a proprietary-format compressed file
        • Used by Opal once for every load instruction
        • Loads are quite frequent!
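The step-size argument can be made concrete with a toy cost model: each stepping call pays a fixed start-up and tear-down overhead plus a per-instruction cost. The function `total_cost` and every number below are hypothetical placeholders chosen for illustration, not measurements of SIM_continue.

```python
import math


def total_cost(instructions, step_size, call_overhead, per_instruction):
    """Cost of simulating `instructions` in chunks of `step_size`.

    call_overhead   -- fixed cost paid on every stepping call (assumed)
    per_instruction -- cost of simulating one instruction (assumed)
    """
    calls = math.ceil(instructions / step_size)
    return calls * call_overhead + instructions * per_instruction


# Hypothetical units: overhead is 9x the cost of one instruction.
cost_step_1 = total_cost(1000, 1, 9.0, 1.0)
cost_step_10 = total_cost(1000, 10, 9.0, 1.0)
print(cost_step_1, cost_step_10, cost_step_1 / cost_step_10)
```

In this model the fixed overhead dominates at a step size of 1 and is amortized away as the step grows, which matches the qualitative behavior the slide describes. The price is the one the slide names: larger steps give the timing model fewer chances to intervene.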

  21. Experiment 3 – Simics API Calls
      • Can there be more bad news?
        • Yes.
      • How does Simics react to alien threads using its API?
      Our thread’s output, having just returned from an API call:
        Thread 5 returned from Simics.
      Something our thread does crashes one of the Simics threads!
        patch PC: 0x1034e68 0x1034e64
        *** ASSERTION ERROR: in line 7530 of file 'v9_service_routines_1.c' with RCSID '@(#) $Id: v9.sg,v 33.0.2.31 2004/10/08 12:23:07 am Exp $'
        Please report this. Simics will now self-signal an abort.
        patch NPC: 0x1034e6c 0x1034e68
        *** Simics getting shaky, switching to 'safe' mode.
        *** Simics (thread 31) received an abort signal, probably an assertion.
        *** thread 31 exiting.

  22. Experiment 3 – Simics API Calls
      • Simics forbids calling the Simics API from alien threads
      • SIM_thread_safe_callback is the only mechanism for using the interface from other threads
        • Slow (see table)
        • Non-blocking
        • Must have released “Main Simics Thread” (MST)

  23. Intermediate Conclusions
      • Interactions with Simics limit our ability to exploit parallelism in Ruby and Opal
      • Simics is fast without Ruby and/or Opal
      • Ruby and Opal in isolation are reasonably fast
      • Ruby and Opal cause slowdowns in Simics
        • The interactions between the GEMS modules and Simics result in performance loss

  24. Experiment 4 – “NULL” Modules
      • To study Simics slowdown, we use “NULL” modules:
        • Empty, trivial modules that use interfaces similar to Ruby and Opal
        • Modules contribute very little to runtime directly
        • Effectively isolates Simics performance from module performance
      • NullRUBY( X )
        • A simple memory timing model, using the same interface as Ruby
        • Models a memory with a constant latency (X cycles per access)
      • NullOPAL( IPC )
        • A trivial processor model, using an interface similar to Opal’s
        • Steps Simics (with SIM_continue) by IPC instructions per cycle
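The shape of the two null modules can be sketched in a few lines. This is an illustration only: the class names, the `access` and `run` methods, and the `step_fn` callable (a stand-in for SIM_continue) are all hypothetical and do not reproduce the real Ruby, Opal, or Simics interfaces.

```python
class NullRuby:
    """Hypothetical constant-latency memory timing model, like NullRUBY(X)."""

    def __init__(self, latency_cycles):
        self.latency_cycles = latency_cycles
        self.accesses = 0

    def access(self, address, now):
        """Return the cycle at which a request issued at `now` completes."""
        self.accesses += 1
        return now + self.latency_cycles   # same latency for every address


class NullOpal:
    """Hypothetical trivial processor model, like NullOPAL(IPC)."""

    def __init__(self, ipc, step_fn):
        self.ipc = ipc
        self.step_fn = step_fn             # stand-in for stepping the simulator

    def run(self, cycles):
        for _ in range(cycles):
            self.step_fn(self.ipc)         # advance by IPC instructions per cycle


memory = NullRuby(latency_cycles=200)
done_at = memory.access(0x1000, now=5)

steps = []                                 # records each stepping request
NullOpal(ipc=2, step_fn=steps.append).run(cycles=3)
```

Because both modules do almost no work of their own, any slowdown measured with them installed can be attributed to how the simulator reacts to the interface being used, which is the isolation the experiment relies on.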

  25. Experiment 4 – “NULL” Modules
      NullRUBY(0) increases execution time by 2x-3x on average, even though it is logically equivalent to having no timing model installed.

  26. Experiment 4 – “NULL” Modules
      Runtime increases ~linearly (or greater) as memory latency increases. Processors stalled on memory requests are costly to simulate!

  27. Experiment 4 – “NULL” Modules
      Using SIM_continue with a stepping quantum of 10 is 3x-7x faster than the Opal default of 1!

  28. Experiment 4 – “NULL” Modules
      Ruby (with simulated memory latency of 300 cycles) slows Simics about as much as NullRUBY(200)

  29. Experiment 4 – “NULL” Modules
      In agreement with the pie chart, the runtime of SIM_continue accounts for about half of the Opal+Simics runtime

  30. “NULL” Module Observations
      • Simulations are slow because of interactions between Simics, Ruby, and Opal
        • T(Simics+Modules) != T(Simics) + T(Modules)
      • Little or no speedup is possible from parallelizing Ruby and/or Opal with the current Simics interfaces
      • Suggested improvements dramatically affect fidelity of simulations
        • Increasing Opal’s step size reduces accuracy
        • Optimizing Simics memory stall time requires coarse-grain simulation

  31. Parallelizing Ruby
      • Despite overwhelming likelihood of failure, parallelize anyway!
      • Obstacles:
        • Assumptions of non-concurrency
        • Portions of Ruby are auto-generated
        • Simics threading hurdles
        • 48,059 lines of C++ in 312 separate files.

  32. Parallelizing Ruby
      • Final implementation suffers from frequent deadlock
        • Fine-grained locking leads to many deadlock opportunities
        • Can’t always acquire locks in same order:
          • Lock ordering by meaning of protected object: Locks have different semantic meanings for different logical events (input vs. output queues)
          • Lock ordering by address of the lock: May need to acquire a lock in order to determine which locks are needed
          • Lock ordering by simulated chip topology: Need knowledge of “where” a particular event is occurring in simulated chip
      • Coarse-grained locking has worse performance than a single thread
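For reference, the "ordering by address" strategy looks like the sketch below when the full lock set is known up front. It is a generic illustration, not the authors' implementation; `acquire_in_order` and `demo` are hypothetical names, and `id()` is used as the ordering key (in CPython it happens to be the object's address).

```python
import threading
from contextlib import contextmanager


@contextmanager
def acquire_in_order(*locks):
    """Acquire every lock in one global order, release in reverse order."""
    # Deduplicate, then sort by id() so all threads agree on the order.
    ordered = sorted({id(lock): lock for lock in locks}.values(), key=id)
    for lock in ordered:
        lock.acquire()
    try:
        yield
    finally:
        for lock in reversed(ordered):
            lock.release()


def demo(iterations=1000):
    input_lock, output_lock = threading.Lock(), threading.Lock()
    state = {"moved": 0}

    def worker(first, second):
        for _ in range(iterations):
            # Callers name the locks in opposite orders; the helper reorders them.
            with acquire_in_order(first, second):
                state["moved"] += 1

    threads = [
        threading.Thread(target=worker, args=(input_lock, output_lock)),
        threading.Thread(target=worker, args=(output_lock, input_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=30)
    return state["moved"], any(t.is_alive() for t in threads)


moved, stuck = demo()
```

The limitation named on the slide still applies: this only works when the complete set of locks is known before the first one is taken. If a thread must hold one lock to discover which others it needs, it cannot sort them in advance.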

  33. Parallelizing Ruby
      • Occasionally (for very short simulations), no deadlock occurs (solution: coarse-grain locks)
      • Some non-determinism, but results are actually quite close to the sequential version
      • Almost no speedup

  34. Parallelizing Ruby
      • Other challenges:
        • Ready()/Operate() pairs violate object-encapsulated synchronization
          • Ready() status may change between calls of Ready() and Operate()
        • Fine-grained locking with object-encapsulated synchronization greatly simplified by Solaris-only lock recursion
          • x86-64 pthread libraries on main simulation machines do not support lock recursion
        • Unidentified sharing leads to difficult races
        • Interactions with Simics require extreme synchronization
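The Ready()/Operate() problem is a check-then-act race: each call can lock internally, yet another thread may drain the object between the two calls. One common remedy, sketched here with a hypothetical `ToyBuffer` class rather than any real Ruby class, is to expose a single combined operation that holds the object's lock across both steps. The sketch uses a recursive lock (`threading.RLock`) so the combined method can reuse the individually locked methods, which is the convenience the slide attributes to lock recursion.

```python
import threading
from collections import deque


class ToyBuffer:
    """Hypothetical buffer with object-encapsulated synchronization."""

    def __init__(self):
        self._lock = threading.RLock()   # recursive: the owner may re-acquire
        self._items = deque()

    def enqueue(self, message):
        with self._lock:
            self._items.append(message)

    def ready(self):
        with self._lock:
            return bool(self._items)

    def operate(self):
        with self._lock:
            return self._items.popleft()  # raises IndexError if drained meanwhile

    def try_operate(self):
        """Check and act under one lock hold, so the check cannot go stale."""
        with self._lock:
            if self.ready():              # nested acquire works with an RLock
                return True, self.operate()
            return False, None


buffer = ToyBuffer()
for i in range(1000):
    buffer.enqueue(i)

taken = []
taken_lock = threading.Lock()


def consumer():
    while True:
        ok, message = buffer.try_operate()
        if not ok:
            return
        with taken_lock:
            taken.append(message)


consumers = [threading.Thread(target=consumer) for _ in range(4)]
for t in consumers:
    t.start()
for t in consumers:
    t.join()
```

With separate `ready()` and `operate()` calls, two consumers could both see `ready()` return true for the last message and one would fail in `operate()`. With a non-recursive lock, `try_operate` would instead have to duplicate the bodies of the other two methods.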

  35. Closing Remarks
      • Improvements must be made to Ruby/Simics and Opal/Simics interfaces
      • Parallelization of Ruby requires a substantial re-write of Ruby’s event queue and associated classes
        • Incorporate knowledge of network topology to provide a lock acquisition order
        • Replace “event” abstraction with “active object” abstraction, which is race-free.
      • Parallel programming is hard
        • Chip manufacturers should be worried
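The "active object" idea mentioned in the closing remarks can be sketched as an object that owns a private thread and a mailbox: other threads only post messages, and only the owning thread touches the object's state. `ActiveCounter` below is a hypothetical, minimal illustration of the pattern, not a proposed Ruby design.

```python
import queue
import threading


class ActiveCounter:
    """Hypothetical active object: state is touched only by its own thread."""

    _STOP = object()                      # sentinel message that ends the loop

    def __init__(self):
        self._inbox = queue.Queue()       # thread-safe mailbox
        self.total = 0                    # private state, no lock needed
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, amount):
        """Called from any thread; only enqueues, never mutates state."""
        self._inbox.put(amount)

    def stop(self):
        self._inbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._inbox.get()
            if message is self._STOP:
                break
            self.total += message         # processed one message at a time


counter = ActiveCounter()


def producer():
    for _ in range(250):
        counter.send(1)


producers = [threading.Thread(target=producer) for _ in range(4)]
for t in producers:
    t.start()
for t in producers:
    t.join()
counter.stop()
```

Races on the object's own state are avoided by construction, since messages are handled strictly one at a time. Ordering between different active objects is a separate question that the pattern alone does not settle.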

  36. The End
      [Figure: Opal (Detailed Processor Model) and Simics, with question marks between them]
      Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems