Performance Implications of Global Virtual Time Algorithms on a Knights Landing Processor

Ali Eker, Barry Williams, Nitesh Mishra, Dushyant Thakur, Kenneth Chiu, Dmitry Ponomarev, Nael Abu-Ghazaleh
Department of Computer Science, State University of New York at Binghamton
Department of Computer Science, University of California, Riverside
Presented at the IEEE/ACM Conference on Distributed Systems and Real Time Applications (DS-RT), October 16th, 2018

Overview
• Global Virtual Time (GVT) Algorithms
  • Synchronous GVT – Barrier Based
  • Asynchronous GVT – Mattern's GVT Algorithm
  • Asynchronous GVT – Wait-Free GVT Algorithm
• Performance
  • On a 12-core Xeon
  • On Intel's Knights Landing Processors
    • Balanced Loading & Fast Event Processing
    • Balanced Loading & Slower Event Processing
    • Imbalanced in terms of Communication
    • Imbalanced in terms of Event Processing

Presentation Agenda
• Overview of PDES
• Intel Knights Landing Architecture
• Test Environment
• Analysis of Global Virtual Time Algorithms
• Performance Implications and Results
• Conclusions
• Future Work

Parallel Discrete Event Simulation (PDES)
• Logical Processes (LPs) run in parallel
• LPs communicate via time-stamped event messages that advance the simulation in discrete steps
• Causality order must be preserved between events
• Conservative Simulation: avoid straggler events completely
• Optimistic Simulation: handle straggler events by rolling back
  • Checkpointing
  • Reverse computation
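
The rollback idea on this slide can be illustrated with a minimal sketch of reverse computation. The `CounterLP` class, its integer state, and its method names are hypothetical and chosen only for illustration; they are not taken from ROSS or ROSS-MT. Each event has a forward handler and an exact inverse, so a straggler can be handled by undoing later events, processing the straggler, and re-executing what was undone.

```python
class CounterLP:
    """Toy logical process whose state is a single counter (hypothetical)."""

    def __init__(self):
        self.lvt = 0.0          # local virtual time
        self.count = 0          # simulation state
        self.processed = []     # (timestamp, increment, previous LVT)

    def forward(self, ts, inc):
        # Forward handler: remember just enough to undo the event later.
        self.processed.append((ts, inc, self.lvt))
        self.count += inc
        self.lvt = ts

    def reverse_last(self):
        # Reverse handler: apply the exact inverse of the forward handler.
        ts, inc, prev_lvt = self.processed.pop()
        self.count -= inc
        self.lvt = prev_lvt
        return ts, inc

    def receive(self, ts, inc):
        # A straggler has a timestamp older than events already processed.
        undone = []
        while self.processed and self.processed[-1][0] > ts:
            undone.append(self.reverse_last())
        self.forward(ts, inc)
        # Re-execute the rolled-back events in timestamp order.
        for u_ts, u_inc in sorted(undone):
            self.forward(u_ts, u_inc)


lp = CounterLP()
lp.receive(10, 1)
lp.receive(30, 5)
lp.receive(20, 2)   # straggler: forces a rollback of the event at time 30
```

Reverse computation stores no state snapshots, which is the trade-off against checkpointing: it saves memory but requires every handler to be invertible.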

ROSS-MT
• Multi-threaded version of ROSS
• Optimistic simulation is used to increase performance
• Reverse computation is used to handle causality violations by rolling back
• The classical PHOLD benchmark is implemented as the workload
  • Modified to create different models
    • Balanced
    • Imbalanced in communication load
    • Imbalanced in processing delay
C. Carothers, D. Bauer, and S. Pearce, “ROSS: A high-performance, low memory, modular Time Warp system,” in Proc. of the 11th Workshop on Parallel and Distributed Simulation (PADS), 2000.

Intel Knights Landing Architecture
• Second generation of Intel's Many Integrated Core (MIC) architecture
• Standalone processor with up to 72 cores, each with 4 hardware threads
• Branch prediction and out-of-order execution logic in each core
• 2 Vector Processing Units per core
• Tiles with 1 MB L2 cache
• 16 GB on-package fast RAM (MCDRAM) + 96 GB off-package DDR memory

Test Platforms

Test Parameters

Global Virtual Time (GVT)
[Figure: four PEs with LVTs PE 0 = 28, PE 1 = 37, PE 2 = 32, PE 3 = 24, and two in-transit event messages with time stamps 22 and 29. GVT = min(22, 24) = 22.]

GVT Algorithms
• Synchronous GVT
  • Barrier-based implementation
  • pthread barrier
• Asynchronous GVT
  • Mattern's GVT Algorithm
    • Modified for shared memory
  • Wait-Free GVT Algorithm
    • Proposed for shared memory
• GVT is the minimum of:
  • the minimum local virtual time (LVT) among all LPs, and
  • the minimum time stamp among all in-transit messages
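
The definition of GVT above can be written as a short function. This is a minimal sketch; the function name `compute_gvt` and its arguments are hypothetical, and the values are the ones shown on the Global Virtual Time slide (LVTs 28, 37, 32, 24 and in-transit time stamps 22 and 29).

```python
def compute_gvt(lvts, in_transit_timestamps):
    """Return the smaller of the minimum LVT and the minimum time stamp
    among in-transit messages (hypothetical helper)."""
    candidates = [min(lvts)]
    if in_transit_timestamps:            # there may be no messages in flight
        candidates.append(min(in_transit_timestamps))
    return min(candidates)


lvts = {"PE 0": 28, "PE 1": 37, "PE 2": 32, "PE 3": 24}
in_transit = [22, 29]

# min LVT is 24, min in-transit time stamp is 22, so GVT = min(22, 24)
gvt = compute_gvt(lvts.values(), in_transit)
```

The in-transit term matters here: using only the LVTs would give 24, and a message stamped 22 could still arrive afterwards and force a rollback below the reported GVT.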

Synchronous GVT: Barrier Based
• “Stop-synchronize-and-go” model
  • Stop processing
  • Wait for all in-transit messages to arrive
  • Compute GVT as the minimum of all LVTs
  • Resume processing
• Problem: inefficient when threads arrive at different times
  • Faster threads wait by spinning or are scheduled out
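
The stop-synchronize-and-go steps can be sketched with a thread barrier. This is an illustrative sketch only: Python's `threading.Barrier` stands in for the pthread barrier, the function `barrier_gvt` is hypothetical, and the sketch assumes that no messages are in transit once every thread has reached the barrier.

```python
import threading


def barrier_gvt(lvts):
    """Each thread publishes its LVT, waits at a barrier, then computes
    the same minimum. Returns the GVT seen by each thread."""
    n = len(lvts)
    published = [None] * n
    results = [None] * n
    barrier = threading.Barrier(n)

    def worker(pe):
        published[pe] = lvts[pe]        # stop processing and publish the LVT
        barrier.wait()                  # wait until every PE has published
        results[pe] = min(published)    # every PE computes the same minimum
        barrier.wait()                  # nobody resumes until all have read

    threads = [threading.Thread(target=worker, args=(pe,)) for pe in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


per_thread_gvt = barrier_gvt([28, 37, 32, 24])
```

Both `barrier.wait()` calls block the early arrivals, which is the inefficiency named on the slide: the round takes as long as the slowest thread.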

[Figure: timeline of PE 0 through PE 3 under the barrier-based scheme; labels: GVT, Wait & Synchronize.]

Asynchronous GVT
• Addresses the inefficiency of synchronous algorithms
• “In-line” model
  • GVT is computed in the background without stopping the simulation
• Higher computational overhead for thread synchronization
  • Locking or atomic operations are needed
• Mattern's GVT Algorithm
• Wait-Free GVT Algorithm

Mattern's GVT Algorithm
• Designed for distributed systems, modified here for shared memory
• Instead of circulating the control message through a ring, a global shared control structure is implemented
• A tree-based lock structure protects the control message
• LPs become white or red according to their position in the GVT computation
• In the red phase, LPs check the control message and start recording the messages they send
• Problem: computational overhead and locking delays

[Figure: example of Mattern's GVT on PE 0 through PE 4. The shared control structure holds three fields (Message Count, LVT min, Red Message min); PEs apply message-count deltas such as +1, -1, and -2. GVT = min(LVT_min, Red_Msg_min).]
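
The shared control structure can be sketched as follows. This is a simplified illustration, not the authors' implementation: the `Control` class and its `report` and `gvt` methods are hypothetical, a single lock replaces the tree-based lock structure, and each PE is assumed to report once per round. The three fields mirror the ones named in the figure, and the GVT is only released once the count of outstanding white messages has returned to zero.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Control:
    msg_count: int = 0                  # white messages sent minus received
    lvt_min: float = float("inf")
    red_msg_min: float = float("inf")
    lock: threading.Lock = field(default_factory=threading.Lock)

    def report(self, white_sent, white_received, lvt, red_min):
        # A PE folds its local counters and minima into the shared structure.
        with self.lock:
            self.msg_count += white_sent - white_received
            self.lvt_min = min(self.lvt_min, lvt)
            self.red_msg_min = min(self.red_msg_min, red_min)

    def gvt(self):
        # GVT is valid only when every white message has been received.
        with self.lock:
            if self.msg_count != 0:
                return None
            return min(self.lvt_min, self.red_msg_min)


control = Control()
control.report(white_sent=1, white_received=0, lvt=28, red_min=float("inf"))
pending = control.gvt()     # one white message is still unaccounted for
control.report(white_sent=0, white_received=1, lvt=24, red_min=22)
gvt = control.gvt()         # min(LVT_min, Red_Msg_min)
```

Every `report` call takes the lock, which gives a feel for the locking delays listed as the drawback of this approach.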

Wait-Free GVT Algorithm
• Takes advantage of shared memory, which Mattern's algorithm does not
  • LPs can read messages instantly, so there are no in-transit messages
  • Events are written to the destination buffers as soon as they are sent
• 5 phases: A, Send, B, Aware, End, controlled by atomic operations
• Addresses the computational expense of Mattern's algorithm
  • Simple design
  • No locking needed; atomic operations are used instead
A. Pellegrini and F. Quaglia, “Wait-free global virtual time computation in shared memory TimeWarp systems,” in Proc. of the 26th IEEE International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 9-16, 2014.

[Figure: wait-free GVT timeline for PE 0 through PE 3 across Phase A, Phase Send, Phase B, Phase Aware, and Phase End; each PE records min A and min B. GVT = min(global min A, global min B).]
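
The reduction shown in the figure, GVT = min(global min A, global min B), can be sketched with simple phase bookkeeping. This is a single-threaded illustration under stated assumptions, not the published algorithm: the `WaitFreeRound` class and its method are hypothetical, plain integers stand in for the atomic counters, and the work done in the Send, Aware, and End phases is not modeled.

```python
PHASES = ("A", "Send", "B", "Aware", "End")


class WaitFreeRound:
    """Bookkeeping for one GVT round (hypothetical sketch)."""

    def __init__(self, n_threads):
        self.n = n_threads
        # One counter per phase; in a real system these would be atomics.
        self.done = {phase: 0 for phase in PHASES}
        self.min_a = [float("inf")] * n_threads
        self.min_b = [float("inf")] * n_threads
        self.gvt = None

    def finish_phase(self, pe, phase, local_min=None):
        if phase == "A":
            self.min_a[pe] = local_min   # local minimum sampled in Phase A
        elif phase == "B":
            self.min_b[pe] = local_min   # local minimum sampled in Phase B
        self.done[phase] += 1            # stand-in for an atomic increment
        if phase == "B" and self.done["B"] == self.n:
            # All threads have sampled twice: reduce both sets of minima.
            self.gvt = min(min(self.min_a), min(self.min_b))


round_ = WaitFreeRound(n_threads=2)
round_.finish_phase(0, "A", local_min=28)
round_.finish_phase(1, "A", local_min=24)
for pe in (0, 1):
    round_.finish_phase(pe, "Send")
round_.finish_phase(0, "B", local_min=30)
before = round_.gvt                      # not yet available
round_.finish_phase(1, "B", local_min=22)
```

No thread ever blocks in this scheme: a thread that finishes a phase early simply keeps processing events and checks the counters again later.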

Performance Implications of GVT Algorithms
• On a 12-core Xeon - up to 24 threads
  • Changing Event Processing Delays
• On Intel's Knights Landing Processor - up to 250 threads
  • Balanced Loading & Fast Event Processing
  • Balanced Loading & Slower Event Processing
  • Imbalanced in terms of Communication
  • Imbalanced in terms of Event Processing

1) 12-Core Xeon
• 10% & 100% remote
• Balanced loading
• EPG = 0
  • Communication dominated, little event processing
• Wait-Free is 30% faster than Mattern, 49% faster than Barrier
• Observation: expected, given the advantages of asynchronous GVT computation

• 50% remote
• Balanced loading
• 100 EPG & 500 EPG
  • Event processing dominates over communication
• Wait-Free is 19% faster than Barrier at 100 EPG, 12% faster at 500 EPG
• Observation: asynchronous algorithms are a clear win, but the choice of GVT algorithm becomes less critical as event processing delays grow.

2) Balanced Loading & Fast Event Processing
• 50% & 100% remote
• Balanced loading
• EPG = 0
• Wait-Free is 30% faster than Barrier
• Mattern is 21% faster than Barrier
• Observation: asynchronous algorithms still perform better, but the margin is smaller for Mattern, while Wait-Free remains a clear win.
  • Computational and locking overhead of Mattern's GVT
  • Synchronous nature of Barrier GVT

3) Balanced Loading & Slower Event Processing
• 10% & 100% remote
• Balanced loading
• EPG = 100
• Wait-Free is 29% faster than Mattern, 31% faster than Barrier
• Observation: event processing granularity has less effect on the KNL architecture than it had on the Xeon processor.

4) Imbalanced in terms of Communication
• Some threads receive more messages than others
• 10% remote
• Barrier is 25% faster than Wait-Free, 30% faster than Mattern
• Observation: Barrier outperforms the asynchronous algorithms
  • Barrier is more efficient and performs fewer rollbacks
  • The synchronous nature of Barrier reduces the disparity between LPs
  • Asynchronous algorithms need more memory for slower threads

5) Imbalanced in terms of Event Processing
• EPG values vary per LP
• 100% remote
• Wait-Free is 47% faster than Barrier
• Observation: asynchronous algorithms suffer when the model is imbalanced in communication, not when it is imbalanced in event processing

Conclusions
• The choice of GVT algorithm significantly impacts the performance of PDES.
• Higher event processing granularity reduces the impact of the GVT algorithm.
• For balanced models, asynchronous algorithms significantly outperform synchronous algorithms.
• Key Takeaway: for models imbalanced in communication, the synchronous nature of the Barrier implementation limits the disparity between LPs, so it performs better than the asynchronous algorithms.

Future Work
• Hybrid synchronous / asynchronous GVT algorithms
  • Can adjust based on the changing simulation models
• Scaling beyond a single node
  • GVT analysis in a cluster of KNLs

Discussion
Questions / Comments
