
Performance analysis of a Pose application -- BigNetSim

Nilesh Choudhury. Performance analysis of a Pose application -- BigNetSim. BigNetSim is a parallel simulator for performance prediction of parallel machines. It has two components: processor performance modelling and interconnection network modelling.


Presentation Transcript


  1. Nilesh Choudhury Performance analysis of a Pose application -- BigNetSim

  2. BigNetSim • A parallel simulator for performance prediction of parallel machines • Two components: • Processor performance modelling • Interconnection Network modelling • The two components could be used individually or in synergy. • Two modes of operation: • Direct Simulation (on-line mode) • Trace-driven simulation (off-line mode)

  3. Architecture of BigNetSim

  4. Which part is most relevant to POSE performance? • The Interconnection Network simulation

  5. Network Simulator Modelling • Very detailed model: • Switch modelled as: • Collection of Input/Output ports • Arbitration strategies to serve incoming requests fairly • Detailed Virtual Channel selection strategies • Input VC and Output VC • Switch delays and arbitration costs modelled here • Switch load and contention measures computed and updated to assist adaptive routing strategies and fault-tolerant routing • Virtual cut-through routing; store-and-forward routing • Number of posers per switch = # ports!

  6. Network Simulator Modelling • Network Interface Card (NIC) • 'Send NIC' packetizes and sends a message at the CPU's request • 'Recv NIC' unpacks, reassembles and delivers a message to the CPU on receiving incoming packets • Network card send and receive latencies modelled here • Number of posers per NIC = 2 • Channel • Doesn't need to be very sophisticated • Models a simple channel delay: receives a packet from a switch/NIC and delivers it to the switch/NIC it is connected to • Number of posers per channel = 1
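
The send/receive NIC behaviour described on this slide can be illustrated with a minimal sketch. It is not BigNetSim code: the `Packet` fields and the `packetize` and `reassemble` helpers are hypothetical names chosen for illustration, and latencies are left out.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    msg_id: int     # message this packet belongs to (hypothetical field)
    seq: int        # position of this packet within the message
    total: int      # number of packets in the whole message
    payload: bytes


def packetize(msg_id, data, packet_size):
    """Send-NIC side: split a message into fixed-size packets."""
    if packet_size <= 0:
        raise ValueError("packet_size must be positive")
    chunks = [data[i:i + packet_size] for i in range(0, len(data), packet_size)]
    if not chunks:
        chunks = [b""]  # an empty message still travels as one packet
    return [Packet(msg_id, seq, len(chunks), chunk)
            for seq, chunk in enumerate(chunks)]


def reassemble(packets):
    """Recv-NIC side: return the message once all packets have arrived, else None."""
    ordered = sorted(packets, key=lambda p: p.seq)
    if not ordered or len(ordered) != ordered[0].total:
        return None
    return b"".join(p.payload for p in ordered)
```

A 600-byte message with a 256-byte packet size yields three packets, and the receive side can rebuild the message even if the packets arrive out of order.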

  7. Topologies and Routing Algorithms • Topology and Routing strategies provide functions which the network uses • Extremely modular design • Write your own routing strategies • Write your own topology • We have some available: • KaryNcube; KaryNmesh; KaryNtree; Nmesh; fattree; hypercube and some hybrid variations

  8. Routing Algorithms • Minimal deadlock-free; Non-minimal and Fault-tolerant variations • K-ary-N-mesh / N-mesh • Direction Ordered; • Planar Routing; • Static Direction Reversal Routing • Optimally Fully Adaptive Routing (modified too) • K-ary-N-tree • UpDown (modified, non-minimal) • HyperCube • Hamming • P-Cube (modified too)
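
As an illustration of minimal routing on a hypercube, the sketch below corrects the differing address bits one dimension at a time, a scheme commonly known as e-cube routing. Whether the "Hamming" strategy listed on this slide works exactly this way is an assumption; `hypercube_route` is a hypothetical name, not a BigNetSim function.

```python
def hypercube_route(src, dst, dims):
    """Return the list of nodes visited from src to dst in a dims-dimensional hypercube.

    Each hop flips one bit in which the current node differs from dst,
    lowest dimension first, so the hop count equals the Hamming distance.
    """
    size = 1 << dims
    if not (0 <= src < size and 0 <= dst < size):
        raise ValueError("node id out of range for this hypercube")
    path = [src]
    node = src
    for bit in range(dims):
        mask = 1 << bit
        if (node ^ dst) & mask:
            node ^= mask          # move along this dimension
            path.append(node)
    return path
```

Fixing the order in which dimensions are corrected is what makes this kind of routing deadlock-free on a hypercube; adaptive variants relax that order at the cost of extra virtual channels.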

  9. Input/Output VC selection • Input Virtual Channel Selection • RoundRobin; • Shortest Length Queue • Output Buffer length • Output Virtual Channel Selection • Max. available buffer length • Max. available buffer bubble VC • Output Buffer length
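
Two of the selection policies named on this slide can be sketched as follows. This is a hedged illustration only: `RoundRobinSelector` and `max_free_buffer` are hypothetical names, and the data layout (one queue per input VC, one free-slot count per output VC) is an assumption.

```python
class RoundRobinSelector:
    """Input VC selection: serve non-empty virtual channels in rotating order."""

    def __init__(self, num_vcs):
        self.num_vcs = num_vcs
        self.next_vc = 0

    def select(self, queues):
        """queues[vc] holds the packets waiting on input VC vc; returns a VC or None."""
        for offset in range(self.num_vcs):
            vc = (self.next_vc + offset) % self.num_vcs
            if queues[vc]:
                self.next_vc = (vc + 1) % self.num_vcs  # resume after the served VC
                return vc
        return None  # nothing is waiting on any VC


def max_free_buffer(free_slots):
    """Output VC selection: pick the VC with the most available buffer space."""
    return max(range(len(free_slots)), key=lambda vc: free_slots[vc])
```

Round-robin favours fairness among inputs, while choosing the output VC with the most free buffer tends to spread load and reduce blocking.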

  10. Building up a machine • Involves selecting the processor capabilities • Selecting the Interconnection network • Available set of topologies, routing algorithms, virtual channel selection strategies • Easy to build an interconnection network closely modelling the target machine • All these modules are easily extensible, so you can create and plug in a new topology, routing algorithm, etc. • Some preconfigured machines include: • Bluegene; RedStorm; lemieux; etc. • Generalized hypercube, fat-tree, torus and mesh architectures

  11. Hardware support for Collectives • You could also model a network with hardware collectives for multicast, reduction and broadcast • Collective Manager is interfaced with the basic network units • You need to define the collective manager operations for the corresponding topology • Already available for: • Hypercube; fattree; densegraph and hybrid variations

  12. Network configuration Parameters • In addition to routing algorithm, topology, virtual channel selection, switch size (number of ports) and number of virtual channels per port, you can configure: • Size of network • Channel bandwidth; Switch bandwidth • NIC send/recv packet latencies • NIC packetization costs • Switch buffer size • Size of a single packet • Delays in various components • DMA delay; Processor send overhead; etc.

  13. How does this modelling translate into the POSE framework? • If we model the following machine: • 'n' nodes; • 's' switches; • 'p' ports per switch • There are: • 4*n + 2*p*s posers. • A proc, co-proc, send-NIC, recv-NIC per node • 'p' ports and 'p' channels per switch

  14. An example • Suppose we model a 2048 node bluegene network connected as a 3D torus: • n=2048; • s=2048; • p=6; • Total number of posers = 4*2048+2*6*2048 = 16*2048 = 32,768 posers. • Ample virtualization to run this simulation on 100 processors.
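
The poser count from the two slides above can be checked with a few lines of Python. The function name `count_posers` is hypothetical; the formula is the one stated in the slides (four posers per node, plus one port poser and one channel poser per switch port).

```python
def count_posers(nodes, switches, ports_per_switch):
    """Total posers: 4 per node, plus p ports and p channels per switch."""
    posers_per_node = 4                        # proc, co-proc, send-NIC, recv-NIC
    posers_per_switch = 2 * ports_per_switch   # p port posers + p channel posers
    return posers_per_node * nodes + posers_per_switch * switches


total = count_posers(nodes=2048, switches=2048, ports_per_switch=6)
print(total)        # 32768
print(total / 100)  # 327.68 posers per processor on 100 processors
```

At roughly 328 posers per processor, each processor has many independent objects to schedule, which is the "ample virtualization" the slide refers to.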

  15. Factors related to Performance • Number of GVT synchronizations: • Gives insight into the amount of parallelism within the threshold controlled by the simulation • Large number of syncs – possibly little work within allowable limits • Phase Time – real time elapsed between consecutive GVT synchronizations • Indicates the amount of parallelism • Rollback fraction • Proportion of time spent undoing speculative work • A high value implies too many strict dependencies in the simulation

  16. contd... • Communication fraction: • Fraction of total time spent communicating • Simulation dependencies: • Posers should be distributed across processors so as to minimize dependencies • Simulation strategy to use: • Optimistic; Adept; etc. • Control the amount of throttling – speculative window • Speedup relative to sequential simulation: • Sequential simulation is faster as it gets rid of all synchronization, provided it fits in memory
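
The metrics listed on these two slides are simple ratios once the raw timings are known. The sketch below assumes a hypothetical `RunStats` record of per-run measurements; it does not reflect how POSE actually collects or reports its statistics.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    total_time_s: float     # wall-clock time of the run (hypothetical field)
    rollback_time_s: float  # time spent undoing speculative work
    comm_time_s: float      # time spent communicating
    gvt_syncs: int          # number of GVT synchronizations


def summarize(stats):
    """Derive rollback fraction, communication fraction and mean phase time."""
    return {
        "rollback_fraction": stats.rollback_time_s / stats.total_time_s,
        "comm_fraction": stats.comm_time_s / stats.total_time_s,
        # Mean real time between consecutive GVT synchronizations.
        "phase_time_s": stats.total_time_s / stats.gvt_syncs,
    }
```

For example, a 10-second run with 1.5 seconds of rollback work and 2000 GVT synchronizations has a rollback fraction of 15% and a mean phase time of 5 ms.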

  17. DetailedSim – performance case study • DetailedSim (with switch modelled as a single poser) • running on 16 processors • simulating a 2048-node hypercube network • random traffic generated at each processing node • Speculation still within reasonable limits (<20%) • Phase time very small (<5ms)

  18. contd... • Poor real speedup • Breakeven with sequential at 12 procs • Increasing number of processors worsened the problem • Synchronization more expensive • Did not scale

  19. Identify the problem • Large switch poser • Trying to do a lot of activities • Hence had a very complex state • Handles a disproportionately large number of events • Faces large number of rollbacks • Leading to frequent synchronizations • Not allowing the GVT to advance • Large state size caused each check-point to be expensive • Large number of events meant frequently check-pointing its state

  20. The Solution • Decompose switch into fine-grained posers • Ports are logical parallel entities in a switch. • Refactor switch into a number of ports • Smaller state; infrequent events • Meticulously refactor, so as not to increase the number of events • Output Buffered switches were refactored • Input Buffered switches need a complex arbitration mechanism involving a central switch state

  21. Improved Results • Phase time up • # GVT iterations down • Rollback fraction ok • Simulation time halved • We still had a problem: • Could not scale!! • Expedited GVT calculation • First idle processor triggers a GVT calculation, and everyone has an updated GVT, not waiting for the phase to finish • GVT computation gets highest priority, if any processor is idle
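
For reference, the quantity being computed here is the global virtual time: in the standard Time Warp definition, the minimum over every processor's local virtual time and the timestamps of all messages still in transit. The sketch below shows only that definition; `compute_gvt` is a hypothetical name and this is not POSE's distributed GVT algorithm.

```python
def compute_gvt(local_virtual_times, in_transit_timestamps):
    """GVT = smallest timestamp that any future rollback could still reach."""
    candidates = list(local_virtual_times) + list(in_transit_timestamps)
    if not candidates:
        raise ValueError("need at least one local virtual time")
    return min(candidates)


# Three processors, one message with timestamp 90 still on the wire.
print(compute_gvt([120, 95, 130], [90]))  # 90
```

State saved for events older than the GVT can never be rolled back to, so a fresher GVT lets every processor reclaim checkpoint memory and commit work sooner.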

  22. Load Imbalances • transient load imbalance went down • # GVT computations up • Improved scaling • But, small cyclic imbalance • Application specific dependencies • Distribute posers to minimize simulation dependencies • Partition input problem randomly

  23. Communication load • Important consideration for fine-grained simulation is communication • partition along the min-cut of the application communication graph • decreases communication • might increase inherent application dependencies among various partitions
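
To make the min-cut idea concrete, the sketch below scores a given placement of posers by the total weight of communication edges that cross processor boundaries. It only evaluates a partition, it does not compute a min-cut; the function name, the toy graph and the edge weights are all made up for illustration.

```python
def cut_weight(edges, assignment):
    """Sum the weights of edges whose endpoints sit on different processors.

    edges: iterable of (poser_u, poser_v, weight)
    assignment: dict mapping poser id -> processor id
    """
    return sum(w for u, v, w in edges if assignment[u] != assignment[v])


# Toy communication graph: posers 0-1 and 2-3 talk heavily, 1-2 talk lightly.
edges = [(0, 1, 10), (2, 3, 10), (1, 2, 1)]

along_min_cut = {0: 0, 1: 0, 2: 1, 3: 1}  # heavy pairs kept together
interleaved = {0: 0, 1: 1, 2: 0, 3: 1}    # heavy pairs split apart

print(cut_weight(edges, along_min_cut))  # 1
print(cut_weight(edges, interleaved))    # 21
```

A lower cut weight means less inter-processor traffic, but as the slide notes, grouping tightly coupled posers can concentrate dependencies within and between partitions.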

  24. Performance results • Hypercube networks • Run on Turing • Reached over 2.5 million events/sec on 128 processors

  25. Communication Challenges • An 8192-node hypercube network across 128 procs • Fits in memory comfortably • Communication – 50MB/s per processor • Small messages (msg size ~250 bytes) • Myrinet just about handles this • A step further: • 16384-node hypercube on 128 procs • Still fits in memory • Myrinet starts dropping packets at an alarming rate • NIC freezes • Runs out of execution time
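
The figures on this slide imply a very high message rate, which a quick calculation makes visible. The sketch assumes 1 MB = 10^6 bytes and treats 250 bytes as the exact message size; both are simplifications of the approximate numbers given in the slide.

```python
def messages_per_second(bandwidth_bytes_per_s, message_size_bytes):
    """Message rate implied by a sustained bandwidth and a fixed message size."""
    return bandwidth_bytes_per_s / message_size_bytes


per_proc = messages_per_second(50e6, 250)  # 50 MB/s, ~250-byte messages
print(per_proc)        # 200000.0 messages/s per processor
print(per_proc * 128)  # 25600000.0 messages/s across 128 processors
```

At around 200,000 small messages per second per processor, per-message overhead rather than raw bandwidth is the likely bottleneck, which is consistent with the network struggling when the problem size doubles.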

  26. Conclusion • Virtualization and fine decomposition coupled with adaptive synchronization strategies help to address the challenges of large-scale fine-grained PDES • Excellent problem-size and self scaling • Careful decomposition of complex objects required • Modelling posers correctly is essential for the simulation to have good performance and scale

  27. Download Charm++ / POSE • Charm++ / POSE / BigNetSim are all freely downloadable at http://charm.cs.uiuc.edu/ • For more information on the research projects: http://charm.cs.uiuc.edu/research/ • POSE: http://charm.cs.uiuc.edu/research/pose • BigNetSim: http://charm.cs.uiuc.edu/research/BigNetSim
