Nilesh Choudhury Performance analysis of a Pose application -- BigNetSim
BigNetSim • A parallel simulator for performance prediction of parallel machines • Two components: • Processor performance modelling • Interconnection Network modelling • The two components could be used individually or in synergy. • Two modes of operation: • Direct Simulation (on-line mode) • Trace-driven simulation (off-line mode)
Which part is most relevant to POSE performance? • The Interconnection Network simulation
Network Simulator Modelling • Very detailed model: • Switch modelled as: • Collection of Input/Output ports • Arbitration strategies to serve incoming requests fairly • Detailed Virtual Channel selection strategies • Input VC and Output VC • Switch delays and arbitration costs modelled here • Switch load and contention measures computed and updated to assist adaptive routing strategies and fault-tolerant routing • Virtual cut-through routing; store-and-forward routing • Number of posers per switch = # ports!
Network Simulator Modelling • Network Interface Card (NIC) • 'Send NIC' packetizes and sends a message at the CPU's request • 'Recv NIC' unpacks, reassembles and delivers a message to the CPU on receiving incoming packets • Network card send and receive latencies modelled here • Number of posers per NIC = 2 • Channel • Doesn't need to be very sophisticated • Models a simple channel delay: receives a packet from a switch/NIC and delivers it to the corresponding switch/NIC it is connected to • Number of posers per channel = 1
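The packetize/reassemble split performed by the send and recv NICs can be sketched as follows. This is a minimal illustration; the function names, packet size, and header fields are assumptions, not BigNetSim's actual API:

```python
PACKET_SIZE = 256  # assumed fixed packet payload size in bytes

def packetize(msg_id, payload):
    """'Send NIC' role: split a message into fixed-size packets,
    each tagged with (message id, sequence number, total packets)."""
    chunks = [payload[i:i + PACKET_SIZE]
              for i in range(0, len(payload), PACKET_SIZE)]
    total = len(chunks)
    return [(msg_id, seq, total, chunk) for seq, chunk in enumerate(chunks)]

def reassemble(packets):
    """'Recv NIC' role: reorder incoming packets and rebuild the message."""
    packets = sorted(packets, key=lambda p: p[1])  # sort by sequence number
    assert len(packets) == packets[0][2], "message incomplete"
    return b"".join(p[3] for p in packets)
```

In the simulator these two roles are separate posers, so send-side packetization and receive-side reassembly advance independently in virtual time.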
Topologies and Routing Algorithms • Topology and Routing strategies provide functions which the network uses • Extremely modular design • Write your own routing strategies • Write your own topology • We have some available: • KaryNcube; KaryNmesh; KaryNtree; Nmesh; fattree; hypercube and some hybrid variations
Routing Algorithms • Minimal deadlock-free; Non-minimal and Fault-tolerant variations • K-ary-N-mesh / N-mesh • Direction Ordered; • Planar Routing; • Static Direction Reversal Routing • Optimally Fully Adaptive Routing (modified too) • K-ary-N-tree • UpDown (modified, non-minimal) • HyperCube • Hamming • P-Cube (modified too)
Input/Output VC selection • Input Virtual Channel Selection • RoundRobin; • Shortest Length Queue • Output Buffer length • Output Virtual Channel Selection • Max. available buffer length • Max. available buffer bubble VC • Output Buffer length
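Two of the selection strategies above can be sketched in a few lines; the function names and the free-space representation are illustrative assumptions:

```python
from itertools import cycle

def round_robin_selector(num_vcs):
    """Input-VC selection: serve virtual channels in cyclic order."""
    order = cycle(range(num_vcs))
    return lambda: next(order)

def max_available_buffer(vc_free_space):
    """Output-VC selection: pick the VC with the most free buffer space.
    vc_free_space[i] = free bytes in virtual channel i's output buffer."""
    return max(range(len(vc_free_space)), key=lambda vc: vc_free_space[vc])
```

Shortest-length-queue input selection is the mirror image of `max_available_buffer`: pick the input VC whose pending queue is shortest.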
Building up a machine • Involves selecting the processor capabilities • Selecting the Interconnection network • Available set of topologies, routing algorithms, virtual channel selection strategies • Easy to build an interconnection network closely modelling the target machine • All these modules are easily extendable to create and plug in a new topology, routing algorithm, etc • Some preconfigured machines include: • Bluegene; RedStorm; Lemieux; etc • Generalized hypercube, fat-tree, torus and mesh architectures
Hardware support for Collectives • You could also model a network with hardware collectives for multicast, reduction and broadcast • Collective Manager is interfaced with the basic network units • You need to define the collective manager operations for the corresponding topology • Already available for: • Hypercube; fattree; densegraph and hybrid variations
Network configuration Parameters • Apart from Routing algorithm; Topology; virtual channel selection; switch size (number of ports); number of virtual channels associated to a port • Size of network • Channel bandwidth; Switch bandwidth • NIC send/recv packet latencies • NIC packetization costs • Switch buffer size • Size of a single packet • Delays in various components • DMA delay; Processor send overhead; etc
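Gathering the parameters above into one place, a machine description looks roughly like the following. The key names and values here are illustrative assumptions, not BigNetSim's actual configuration format:

```python
# Hypothetical network configuration for a BlueGene-like 3D torus;
# every parameter below corresponds to a knob listed on this slide.
network_config = {
    "topology": "KaryNcube",              # or KaryNmesh, fattree, ...
    "routing": "direction_ordered",       # routing algorithm
    "vc_selection": "max_available_buffer",
    "num_nodes": 2048,                    # size of network
    "switch_ports": 6,                    # number of ports per switch
    "vcs_per_port": 4,                    # virtual channels per port
    "channel_bandwidth_GBps": 1.4,
    "nic_send_latency_us": 1.0,           # NIC send packet latency
    "nic_recv_latency_us": 1.0,           # NIC recv packet latency
    "switch_buffer_bytes": 4096,
    "packet_size_bytes": 256,
    "dma_delay_us": 0.5,                  # delays in various components
}
```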
How does this modelling translate into the POSE framework? • If we model the following machine: • 'n' nodes; • 's' switches; • 'p' ports per switch • There are: • 4*n + 2*p*s posers. • A proc, co-proc, send-NIC, recv-NIC per node • 'p' ports and 'p' channels per switch
An example • Suppose we model a 2048-node Bluegene network connected as a 3D torus: • n=2048; • s=2048; • p=6; • Total number of posers = 4*2048 + 2*6*2048 = 16*2048 = 32,768 posers. • Ample virtualization to run this simulation on 100 processors.
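The poser count above follows directly from the formula on the previous slide; a quick sanity check:

```python
def poser_count(n_nodes, n_switches, ports_per_switch):
    """Total posers = 4*n + 2*p*s:
    4 per node (proc, co-proc, send-NIC, recv-NIC),
    2 per switch port (the port poser plus its channel poser)."""
    return 4 * n_nodes + 2 * ports_per_switch * n_switches

print(poser_count(2048, 2048, 6))  # → 32768
```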
Factors related to Performance • Number of GVT synchronizations: • Gives insight into the amount of parallelism within the threshold controlled by the simulation • Large number of syncs – possibly little work within allowable limits • Phase Time – real time elapsed between consecutive GVT synchronizations • Indicates the amount of parallelism • Rollback fraction • Proportion of time spent undoing speculative work • High fraction implies too many strict dependencies in the simulation
contd... • Communication fraction: • Fraction of total time spent communicating • Simulation dependencies: • Posers should be distributed on processors such that it minimizes dependency • Simulation strategy to use: • Optimistic; Adept; etc • Control the amount of throttling – speculative window • Speedup with sequential simulation: • Sequential simulation is faster as it gets rid of all synchronization, provided it fits in memory
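The two time-based metrics from these slides reduce to simple ratios; the function names and arguments below are illustrative assumptions:

```python
def rollback_fraction(forward_time, rollback_time):
    """Proportion of execution spent undoing speculative work;
    a high value suggests too many strict dependencies."""
    total = forward_time + rollback_time
    return rollback_time / total if total else 0.0

def communication_fraction(comm_time, total_time):
    """Fraction of total wall-clock time spent communicating."""
    return comm_time / total_time if total_time else 0.0
```

Keeping the rollback fraction low is what the speculative-window throttling mentioned above is tuning for.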
DetailedSim – performance case study • DetailedSim (with switch modelled as a single poser) • running on 16 processors • simulating a 2048 node hypercube network • random traffic generated at each processing node • Speculation still within reasonable limits (<20%) • Phase time very small (<5ms)
contd... • Poor real speedup • Breakeven with sequential at 12 procs • Increasing number of processors worsened the problem • Synchronization more expensive • Did not scale
Identify the problem • Large switch poser • Trying to do a lot of activities • Hence had a very complex state • Handles a disproportionately large number of events • Faces a large number of rollbacks • Leading to frequent synchronizations • Not allowing the GVT to advance • Large state size made each checkpoint expensive • Large number of events meant frequently checkpointing its state
The Solution • Decompose the switch into fine-grained posers • Ports are logical parallel entities in a switch • Refactor the switch into a number of port posers • Smaller state; infrequent events • Meticulously refactor, so as not to increase the number of events • Output-buffered switches were refactored • Input-buffered switches need a complex arbitration mechanism involving a central switch state
Improved Results • Phase time up • # GVT iterations down • Rollback fraction ok • Simulation time halved • We still had a problem: • Could not scale! • Expedited GVT calculation • First idle processor triggers a GVT calculation, and everyone gets an updated GVT without waiting for the phase to finish • GVT computation gets highest priority if any processor is idle
Load Imbalances • Transient load imbalance went down • # GVT computations up • Improved scaling • But, small cyclic imbalance • Application-specific dependencies • Distribute posers to minimize simulation dependencies • Partition input problem randomly
Communication load • An important consideration for fine-grained simulation is communication • partition along the min-cut of the application communication graph • decreases communication • might increase inherent application dependencies among the partitions
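The quantity that min-cut partitioning reduces can be sketched as follows; the poser communication graph and processor assignment here are hypothetical:

```python
def cut_volume(comm_edges, assignment):
    """Total message volume crossing processor boundaries for a given
    poser-to-processor assignment — the cost min-cut partitioning minimizes."""
    return sum(w for u, v, w in comm_edges if assignment[u] != assignment[v])

# Hypothetical 4-poser communication graph: (src, dst, bytes exchanged)
edges = [(0, 1, 100), (1, 2, 10), (2, 3, 100), (0, 3, 10)]

# Placing the heavily-communicating pairs together cuts only the light edges:
print(cut_volume(edges, {0: 0, 1: 0, 2: 1, 3: 1}))  # → 20
```

The trade-off noted above shows up here too: the min-cut placement may co-locate posers with strict event dependencies, serializing work within a partition.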
Performance results • Hypercube networks • Run on Turing • Reached over 2.5 million events/sec on 128 processors
Communication Challenges • An 8192-node hypercube network across 128 procs • Fits in memory comfortably • Communication – 50MB/s per processor • Small messages (msg size ~250 bytes) • Myrinet just about handles this • A step further: • 16384-node hypercube on 128 procs • Still fits in memory • Myrinet starts dropping packets at an alarming rate • NIC freezes • Runs out of execution time
Conclusion • Virtualization and fine decomposition coupled with adaptive synchronization strategies help to address the challenges of large-scale fine-grained PDES • Excellent problem-size and self scaling • Careful decomposition of complex objects required • Modelling posers correctly is essential for the simulation to have good performance and scale
Download Charm / POSE • Charm++ / POSE / BigNetSim all freely downloadable at http://charm.cs.uiuc.edu/ • For more information on the research projects http://charm.cs.uiuc.edu/research/ • POSE: http://charm.cs.uiuc.edu/research/pose • BigNetSim: http://charm.cs.uiuc.edu/research/BigNetSim