Dynamic Load Balancing for VORPAL

Dynamic Load Balancing for VORPAL. Viktor Przebinda, Center for Integrated Plasma Studies. What is VORPAL? A parallel simulation package that models laser-plasma interactions using the PIC or fluid model. Built on an object-oriented framework. Over 1,300 pages (68,000 lines) of C++ code.

Presentation Transcript


  1. Dynamic Load Balancing for VORPAL. Viktor Przebinda, Center for Integrated Plasma Studies.

  2. What is VORPAL? • A parallel simulation package that models laser-plasma interactions using the particle-in-cell (PIC) or fluid model. • Built on an object-oriented framework. • Over 1,300 pages (68,000 lines) of C++ code.

  3. Goal: Implement Dynamic Load Balancing for VORPAL • Dynamic: Automatically adjusts decomposition appropriately at runtime. • Fully compatible with all simulations. • Efficient: Minimizes overhead. • User friendly: Requires no special configuration.

  4. Grids define how regions are to be updated. [Figure labels: physical region; extended region, one or two cells wide.]

  5. Extended regions simplify parallel synchronization. • Extended regions contain data in physical regions of other CPUs. [Figure: CPU 0 and CPU 1, each with a physical region and an extended region.]
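The idea on this slide can be sketched in one dimension. The following is a minimal illustration, not VORPAL code: the `Slab` type and `exchangeExtended` function are hypothetical names, and the two "CPUs" are plain objects in one process, so each copy stands in for what would be a message between processors in a parallel run.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D sub-domain. Layout of `cells`:
// [left extended | physical cells ... | right extended]
struct Slab {
    std::size_t ghost;          // width of each extended region, in cells
    std::vector<double> cells;  // physical cells plus both extended regions

    Slab(std::size_t physical, std::size_t ghostWidth, double fill)
        : ghost(ghostWidth), cells(physical + 2 * ghostWidth, fill) {}

    std::size_t physicalSize() const { return cells.size() - 2 * ghost; }
};

// Fill the extended regions that face each other across the shared boundary:
// the right-most physical cells of `left` go into the left extended region
// of `right`, and the left-most physical cells of `right` go into the right
// extended region of `left`.
void exchangeExtended(Slab& left, Slab& right) {
    assert(left.ghost == right.ghost);
    const std::size_t g = left.ghost;
    const std::size_t pl = left.physicalSize();
    assert(pl >= g && right.physicalSize() >= g);
    for (std::size_t k = 0; k < g; ++k) {
        right.cells[k] = left.cells[pl + k];          // left physical -> right extended
        left.cells[g + pl + k] = right.cells[g + k];  // right physical -> left extended
    }
}
```

After the exchange, each slab can update its physical cells using only local memory, which is the simplification the slide refers to.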

  6. Allocated region provides a simple over-allocation scheme. [Figure labels: physical region, extended region, allocated region.]

  7. Working with the current framework introduces restrictions. • Boundaries can be adjusted by at most one cell in each direction. • Decomposition description is restricted to prevent irregular domains. [Figure: an impossible and a normal decomposition, each drawn along direction 0 and direction 1.]

  8. Two strategies exist to accommodate decomposition adjustment. • Over-allocate all fields to accommodate future decomposition adjustment. • Resize and copy all field data as needed for each decomposition adjustment.

  9. Over-allocation • Advantage: minimal overhead. • Disadvantage: decreases cache hits, resulting in lower efficiency. [Figure labels: region in use, overflow.]

  10. Over-allocation introduces minimal overhead. [Figure labels: region in use, overflow.]

  11. Over-allocation in the direction of maximum stride is optimal. [Figure labels: region in use, overflow.]
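One way to see why the direction of maximum stride is the cheap one to over-allocate is to look at the index arithmetic. The sketch below assumes a row-major 2D layout in which direction 0 has the largest stride; `Field2D` and its members are hypothetical names, not VORPAL's classes. With the overflow placed along direction 0, the region in use always occupies one contiguous block at the start of the buffer, and growing it changes no existing offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2D field, row-major: offset(i, j) = i * ny + j, so
// direction 0 (index i) has the largest stride.
struct Field2D {
    std::size_t allocNx;  // allocated extent in direction 0 (in use + overflow)
    std::size_t ny;       // extent in direction 1 (no overflow)
    std::size_t usedNx;   // extent currently in use in direction 0
    std::vector<double> data;

    Field2D(std::size_t usedNx_, std::size_t ny_, std::size_t overflow)
        : allocNx(usedNx_ + overflow), ny(ny_), usedNx(usedNx_),
          data(allocNx * ny, 0.0) {}

    std::size_t offset(std::size_t i, std::size_t j) const {
        assert(i < allocNx && j < ny);
        return i * ny + j;
    }

    // Grow the region in use by one cell in direction 0 without touching
    // memory. Returns false once the overflow is exhausted.
    bool grow() {
        if (usedNx == allocNx) return false;
        ++usedNx;
        return true;
    }

    // One past the largest offset touched by the region in use. Because the
    // overflow sits in the max-stride direction, the region in use is exactly
    // the contiguous range [0, usedSpan()).
    std::size_t usedSpan() const { return usedNx * ny; }
};
```

Had the overflow been placed along direction 1 instead, every row would carry unused cells at its end, and the region in use would be interleaved with gaps.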

  12. Resizing • A new memory block is requested and values are copied over. • Advantage: does not affect efficiency. • Disadvantage: large overhead.

  13. Optimal performance is achieved through both methods. • A new memory block is requested and values are copied over. • Memory is over-allocated to prevent future resizes.

  14. Decomposition adjustment occurs at the end of the update. • CPU time is measured each iteration. [Figure labels: update, load balancing.]

  15. DLB is achieved in eight steps. • All nodes send processing times to node zero. • Node zero decides whether to perform an adjustment. • Node zero constructs an adjustment description and sends it to all other nodes. • All nodes apply the adjustment and reconfigure the grid. • All fields resize as necessary. • Field iterators are rebuilt. • All messengers are rebuilt. • Fields and particles are synchronized.

  16. 1. Processing times are collected to aid in adjustment decisions. • Each node measures the virtual time it took to perform its last update. Virtual time excludes time spent blocked on I/O. • This amount is sent to node zero. • The process waits for a reply from node zero.
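As an illustration of measuring processor time rather than wall-clock time, the sketch below uses the standard `std::clock` function. This is an assumption about how such a measurement could be taken, not a description of VORPAL's timer; `cpuSeconds` is a hypothetical helper. On POSIX systems `std::clock` reports processor time consumed by the process, so time spent blocked is not counted; some platforms implement it differently.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper: run `work` and return the processor time it consumed,
// in seconds, as reported by std::clock.
template <typename Work>
double cpuSeconds(Work work) {
    const std::clock_t start = std::clock();
    work();
    const std::clock_t stop = std::clock();
    // std::clock returns (clock_t)-1 when processor time is unavailable.
    assert(start != static_cast<std::clock_t>(-1));
    assert(stop != static_cast<std::clock_t>(-1));
    return static_cast<double>(stop - start) / CLOCKS_PER_SEC;
}
```

Each node would wrap its update in such a measurement and send the resulting number to node zero.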

  17. 2. Adjustment decision is made based on idle time. • Given the cost of performing a load balance, VORPAL does so only if any node was idle for more than 10% of the last time step.
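The decision rule can be sketched as follows. The slide gives only the 10% threshold; how idle time is derived from the reported times is an assumption here: a node is taken to be idle while it waits for the slowest node, so its idle fraction is (slowest - own time) / slowest. `shouldRebalance` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Decide on node zero whether an adjustment is worth its cost.
// Assumption: the time step lasts as long as the slowest node's update, and
// every other node is idle for the remainder of that time.
bool shouldRebalance(const std::vector<double>& updateTimes,
                     double idleThreshold = 0.10) {
    assert(!updateTimes.empty());
    const double slowest =
        *std::max_element(updateTimes.begin(), updateTimes.end());
    if (slowest <= 0.0) return false;  // nothing measured, nothing to balance
    const double fastest =
        *std::min_element(updateTimes.begin(), updateTimes.end());
    // The fastest node is the one that idles the longest.
    const double maxIdleFraction = (slowest - fastest) / slowest;
    return maxIdleFraction > idleThreshold;
}
```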

  18. 3. Adjustment is computed to eliminate bottlenecks. • Using a greedy algorithm, node zero constructs adjustment information for all processors. • This is sent to all nodes.

  19. Boundaries are shrunk around processors with high load. [Figure: CPU0, CPU1, CPU2 and CPU3 laid out along direction 0 and direction 1, with a CPU load scale from lowest to highest.]
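The slides do not spell out the greedy algorithm, so the following is only one plausible reading, restricted to a 1D decomposition: processors are visited from most to least loaded, and each shared boundary is moved one cell toward the more heavily loaded neighbor, which shrinks that neighbor's region. The one-cell limit comes from the framework restriction stated earlier; the function name, the `minWidth` guard and the data layout are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Processor p owns cells [bounds[p], bounds[p + 1]) along one direction.
// Returns a shift in {-1, 0, +1} for every entry of `bounds`; the two outer
// boundaries of the global domain never move.
std::vector<int> greedyAdjustment(const std::vector<int>& bounds,
                                  const std::vector<double>& load,
                                  int minWidth = 1) {
    const std::size_t P = load.size();
    assert(P >= 1 && bounds.size() == P + 1);

    std::vector<int> shift(P + 1, 0);
    std::vector<int> width(P);
    for (std::size_t p = 0; p < P; ++p) width[p] = bounds[p + 1] - bounds[p];

    // Greedy order: handle the most heavily loaded processors first.
    std::vector<std::size_t> order(P);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&load](std::size_t a, std::size_t b) {
                         return load[a] > load[b];
                     });

    for (std::size_t p : order) {
        // Left boundary of p: moving it right by one cell shrinks p.
        if (p > 0 && load[p - 1] < load[p] && width[p] > minWidth) {
            shift[p] = +1;
            --width[p];
            ++width[p - 1];
        }
        // Right boundary of p: moving it left by one cell shrinks p.
        if (p + 1 < P && load[p + 1] < load[p] && width[p] > minWidth) {
            shift[p + 1] = -1;
            --width[p];
            ++width[p + 1];
        }
    }
    return shift;
}
```

Because a boundary only moves when one neighbor is strictly more loaded than the other, each boundary is decided exactly once, from the heavier side.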

  20. 4. Decomposition object is modified on each processor. • Each node applies the adjustment. • The local grid is adjusted to match the new size. • The allocated region is modified if it cannot support the new size.

  21. 5. Fields resize if the allocated region has changed. • All fields check the allocated region to see if it has grown. • If so, the field allocates additional memory to accommodate 25 more cells in the direction of growth.
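The resize-with-padding policy can be sketched in one dimension. The padding of 25 cells is from the slide; reading it as "allocate the required size plus 25" is an assumption, and `Field1D` and `ensureCapacity` are hypothetical names rather than VORPAL's interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kGrowthPadding = 25;  // extra cells, as on the slide

// Hypothetical 1D field: storage.size() is the allocated extent, `used` is
// the number of cells currently in use.
struct Field1D {
    std::size_t used;
    std::vector<double> storage;

    Field1D(std::size_t n, double fill) : used(n), storage(n, fill) {}

    // Make room for `required` cells. A new block is requested only when the
    // allocated extent is too small; it is padded so that the next few
    // one-cell adjustments need no further copies.
    // Returns true if the data was moved to a new block.
    bool ensureCapacity(std::size_t required) {
        bool resized = false;
        if (required > storage.size()) {
            std::vector<double> bigger(required + kGrowthPadding, 0.0);
            std::copy(storage.begin(), storage.begin() + used, bigger.begin());
            storage.swap(bigger);
            resized = true;
        }
        used = required;
        return resized;
    }
};
```

Any pointers into the old block become invalid after a resize, which is why the next step rebuilds the field iterators.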

  22. 6. Outdated information in field iterators is rebuilt. • Pointers to specific memory locations may have changed if a resize was performed. • Physical and extended regions may have changed size.

  23. 7. Outdated message-passing objects are rebuilt. • Intersecting regions may have changed, so they must be reconstructed.

  24. 8. New boundaries must be synchronized with neighbors. • Field data on physical boundaries is sent to neighboring processors and extended regions are filled from neighbors. • Particles that may have crossed outside the boundary of the current node are sent to neighboring nodes. • Unfortunately, since there is nothing to do while synchronization takes place, an enormous overhead is seen at this step.

  25. When to use load balancing. • When running a PIC simulation. • When plasma concentration is expected to change. • When decomposition is along the zero direction. • When a large number of time steps is used.

  26. Case study: DLB can beat the best static decomposition by 23%. [Figure: CPU0 and CPU1 with the boundary at the midpoint; particles loaded into the right region.]

  27. Sliding window moves particles left to CPU0. [Figure: CPU0 and CPU1; particles shifted into the left region.]

  28. Standard run shows large differences in CPU use.

  29. Load Balancing ensures uptimes on each node are equal.

  30. Conclusion • Load balancing performs desired functions. • Overhead involved in message passing is quite significant, somewhat limiting usefulness.
