Exploring Multi-Grained Parallelism in Compute-Intensive DEVS Simulations

Qi (Jacky) Liu and Gabriel Wainer
Department of Systems and Computer Engineering
Carleton University
Ottawa, Canada

Outline
• Motivation & Background
• Fine-Grained Event Parallelism
• Event Processing Kernel
• Parallel DEVS Simulation on Cell
• Experimental Results
• Conclusion & Future Work

Motivation
• Accelerate general-purpose DEVS-based simulations on heterogeneous CMP architectures like the Cell processor
• Develop new parallelization strategies based on the fine-grained event-level parallelism inherent in the simulation process
• Exploit multi-grained parallelism simultaneously at different levels of the system
• Allow general users to gain performance transparently, without being distracted by multicore programming details
• Provide some generalizable methods & insight for PDES on emerging CMP architectures

Cell Processor Overview
• Nine-core heterogeneous CMP with two distinct ISAs
• Software-managed LS with explicitly-addressed DMA transfer
• Low-latency EIB channels
  – 32-bit mailbox & signal messages

Parallel DEVS (P-DEVS) Formalism
• Discrete-EVent System Specification (DEVS)
• Cell-DEVS Formalism

Layered View of M&S

Parallel Simulation with CD++
• Structured Simulation Process
• Flat LP Structure
• Messages
  – (I) LP and model initialization
  – (@) model output
  – (*) model state transition
  – (D) model synchronization
  – (X) model input data
  – (Y) model output data

Fine-Grained Event Parallelism
• Event-embarrassing parallelism
  – Independent events within a step
  – Executed in an arbitrary order
• Event-streaming parallelism
  – Causally-related events between consecutive steps
  – Executed in a pipelined fashion
• Phase-changing events
  – Exchanged between NC & FC
  – Natural fork & join points
• Data-flow oriented parallelization
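The two forms of parallelism on this slide can be illustrated with a small Python sketch. It is not the CD++/Cell implementation: `transition`, `run_step`, and `pipeline` are hypothetical names, a thread pool stands in for the SPEs, and a collector thread stands in for the second pipeline stage.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue
from threading import Thread

def transition(state):
    """Hypothetical state transition; stands in for a compute-intensive rule."""
    return state + 1

def run_step(states, pool):
    """Event-embarrassing parallelism: events within one step are independent,
    so workers may execute them in any order. map() returns results in input order."""
    return list(pool.map(transition, states))

def pipeline(initial_states, steps):
    """Event-streaming parallelism: stage 1 computes step k+1 while stage 2
    (a collector thread) is still consuming the results of step k."""
    handoff = Queue(maxsize=1)          # single-slot hand-off between the stages
    collected = []

    def stage2():
        while True:
            item = handoff.get()
            if item is None:            # sentinel: no more steps
                break
            collected.append(sum(item))  # stand-in for output/synchronization work

    consumer = Thread(target=stage2)
    consumer.start()
    states = list(initial_states)
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(steps):
            states = run_step(states, pool)  # new list each step, safe to hand off
            handoff.put(states)
    handoff.put(None)
    consumer.join()
    return states, collected

final_states, step_sums = pipeline([0, 10, 20], steps=3)
```

The single-slot queue is the join point between steps: stage 1 can run at most one step ahead of stage 2, which mirrors a two-stage pipeline.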

Event Processing Kernel
• Hydrological Watershed Simulation
  – 320×320×2 with 204,800 Simulators
  – Compute-intensive state transitions
  – Over 300 million events across 663 phases
  – Cell-DEVS model defined in CD++ spec. lang.
• Simulation Profile on the PPE
  – SEK: concurrent exec. across SPEs, 98.02% (event-embarrassing parallelism)
  – Pipelined exec. between PPE & SPEs, 1.15% (event-streaming parallelism)
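The figures on this slide support a quick back-of-the-envelope check. The sketch below only rearranges numbers from the slide. It reads the 98.02% as the fraction of PPE run time spent in the SEK and treats that fraction as the only part that gets accelerated, which is a simplifying assumption. `overall_speedup` is a hypothetical helper applying Amdahl's law.

```python
# Model size: a 320 x 320 x 2 cell space, one Simulator per cell.
simulators = 320 * 320 * 2

# Average events per phase (a lower bound, since the slide says "over 300 million").
events_per_phase = 300_000_000 / 663

def overall_speedup(p, s):
    """Amdahl's law: fraction p of the run time is accelerated by a factor s,
    and the remaining (1 - p) is unchanged."""
    return 1.0 / ((1.0 - p) + p / s)

p = 0.9802                          # profile share attributed to the SEK
ceiling = 1.0 / (1.0 - p)           # limit as the kernel time goes to zero
example = overall_speedup(p, 8)     # if the kernel alone ran 8x faster

print(simulators, round(events_per_phase), round(ceiling, 1), round(example, 2))
```

Under these assumptions the ceiling is roughly 50×, so the remaining 1.98% of run time becomes the limiting factor once the kernel is heavily accelerated.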

Parallel DEVS Simulation on Cell – Overview
• Event-streaming parallelism (two-stage pipeline)
• Compute-I/O parallelism
• Thread parallelism
• Vector parallelism (SPE SIMD)
• Event-embarrassing parallelism
• Data-streaming parallelism (double-buffered DMA at three layers)
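Double buffering, named in the data-streaming layer, overlaps the transfer of the next block with computation on the current one. The Python sketch below shows the idea with a background thread in place of DMA. `fetch`, `compute`, and `process_stream` are hypothetical stand-ins, and nothing here reflects actual Cell SDK calls.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(block_id):
    """Hypothetical stand-in for a DMA 'get' of one block into a local buffer."""
    return [block_id * 10 + i for i in range(4)]

def compute(buffer):
    """Hypothetical stand-in for processing a buffer that is already local."""
    return sum(buffer)

def process_stream(num_blocks):
    """Process blocks in order while the next block is fetched in the background."""
    results = []
    if num_blocks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, 0)                      # prefetch the first block
        for block_id in range(num_blocks):
            current = pending.result()                      # wait for the in-flight transfer
            if block_id + 1 < num_blocks:
                pending = dma.submit(fetch, block_id + 1)   # start the next transfer
            results.append(compute(current))                # overlaps with that transfer
    return results

totals = process_stream(3)
```

Only two buffers are live at any time (`current` and the one behind `pending`), which is what lets the scheme fit in a small local store.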

Parallel DEVS Simulation on Cell – LP Virtualization
• Purpose
  – Map active Simulators to a limited group of SPE threads
  – Fit into the small on-chip LS
  – Assign each SPE a reusable task operating on a stream of data
  – Facilitate fine-grained dynamic load-balancing between SPEs
• Solution
  – Turn Simulators (and associated atomic models) into virtual LPs
  – Separate event-processing logic (wrapped in SPE threads) from state data (maintained in main memory buffers)
  – Match the states of active Simulators to available SPE threads dynamically at each virtual time (SEK job scheduling)
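The separation of logic from state described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `event_logic` plays the role of the reusable task, `state_table` plays the role of the main-memory state buffers, and a small thread pool plays the role of the SPE threads. All of these names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def event_logic(state):
    """Reusable task shared by all workers; it holds no per-Simulator data."""
    return {"id": state["id"], "value": state["value"] * 2, "active": False}

def run_virtual_time(state_table, num_workers=2):
    """For one virtual time, hand the states of active Simulators to whichever
    worker is free, then write the new states back to the table."""
    active_ids = [sid for sid, s in state_table.items() if s["active"]]
    jobs = [state_table[sid] for sid in active_ids]
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        for new_state in workers.map(event_logic, jobs):
            state_table[new_state["id"]] = new_state
    return active_ids

state_table = {
    0: {"id": 0, "value": 1, "active": True},
    1: {"id": 1, "value": 5, "active": False},   # idle: never touches a worker
    2: {"id": 2, "value": 7, "active": True},
}
processed = run_virtual_time(state_table)
```

Because workers carry no Simulator identity, any number of Simulators can share a fixed number of threads, and only the active ones cost anything at a given virtual time.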

Parallel DEVS Simulation on Cell – More Details
• Virtual Simulator State Mgmt.
• Decentralized Event Mgmt.

Parallel DEVS Simulation on Cell – More Details
• Rule Evaluation on SPEs
• SEK Job Scheduling

Platform and Configuration
• IBM BladeCenter QS22
  – 3.2GHz PowerXCell 8i × 2
  – 32GB RAM
  – Red Hat Enterprise Linux 5.2
  – IBM SDK for Multicore Acceleration 3.1
• Parallel DEVS simulator on Cell: CD++/Cell
• SEK job scheduling policy: round-robin or shortest-queue-first
• CD++ event-logging turned off to minimize the impact of file I/O
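The two scheduling policies named above can be contrasted with a toy example. The slides do not define the policies in detail, so this sketch makes an assumption: "shortest" is taken to mean the queue with the least pending work, measured by a per-job cost. `round_robin` and `shortest_queue_first` are hypothetical function names.

```python
from itertools import cycle

def round_robin(jobs, num_queues):
    """Assign jobs to queues in fixed rotation, ignoring how busy each queue is."""
    queues = [[] for _ in range(num_queues)]
    for job, q in zip(jobs, cycle(range(num_queues))):
        queues[q].append(job)
    return queues

def shortest_queue_first(jobs, num_queues):
    """Assign each job to the queue with the least pending work so far
    (assumption: 'shortest' means lowest total cost, not fewest jobs)."""
    queues = [[] for _ in range(num_queues)]
    load = [0] * num_queues
    for job in jobs:
        q = load.index(min(load))    # ties go to the lowest-numbered queue
        queues[q].append(job)
        load[q] += job
    return queues

costs = [5, 1, 1, 1]                 # one expensive job followed by cheap ones
rr = round_robin(costs, 2)
sqf = shortest_queue_first(costs, 2)
```

With uneven job costs, rotation can stack work behind an expensive job, while the load-aware policy steers later jobs to the idle queue.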

Total Simulation Time with Watershed Model
• Performance gain with just one SPE: 5.84×
  – OO C++ code on PPE vs. SIMD-aware C code on SPEs
  – Memory latency & cache miss vs. data locality & double-buffered DMA
  – Low-level optimizations on SPEs (LS data alignment, call stack usage, branch minimization, loop unrolling, in-line substitution, pipelined event execution)
• Overall performance with 8 SPEs: 33.06×

Speedups over (PPE with 1 SPE) Version
• Speedup grows more slowly as SPEs are added
  – Higher overhead for SEK job scheduling and orchestration
  – Increased DMA contention & channel stalls
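The sub-linear scaling can be quantified from the two totals reported on the previous slide. This sketch only divides those numbers, assuming both are measured against the same PPE-only baseline. The chart values for intermediate SPE counts are not in the transcript and are not reconstructed here.

```python
speedup_1_spe = 5.84     # PPE with 1 SPE vs. PPE only (from the slides)
speedup_8_spe = 33.06    # PPE with 8 SPEs vs. PPE only (from the slides)

relative_speedup = speedup_8_spe / speedup_1_spe   # 8 SPEs relative to 1 SPE
efficiency = relative_speedup / 8                  # fraction of ideal 8x scaling

print(f"{relative_speedup:.2f}x over one SPE, {efficiency:.0%} parallel efficiency")
```

Going from one SPE to eight therefore yields about 5.66× rather than 8×, consistent with the scheduling overhead and DMA contention listed above.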

Conclusion
• Formalism-Based Design Methodology
  – Facilitate model reuse & portability
  – Reduce validation & verification cost
• Performance-Centric Approach
  – Accelerate event processing for compute-intensive DEVS models
  – Minimize communication & synchronization overhead
  – Achieve fine-grained dynamic load balancing
• New Parallelization Strategy for PDES
  – Exploit fine-grained event parallelism from a data-flow perspective
  – Combine multi-grained parallelism at different system levels
  – Break LP boundaries with LP virtualization
• Insight for PDES on Heterogeneous CMP Architectures
  – Match workload characteristics to functional specialization of cores
  – Address data locality, memory latency, & code optimization issues

Future Work
• Porting different types of models to Cell for performance testing
• Transparency
  – Minimal knowledge (and learning curve) required of users
• Integrating with existing conservative/optimistic approaches
  – Combine with cluster-level LP-based conservative simulation, using both synchronous & asynchronous algorithms
  – Combine with cluster-level Time Warp optimistic simulation, using Lightweight Time Warp (DS-RT 2008, PADS 2009)
• Testing on large-scale hybrid supercomputers
• Using Cell processor in new ways

Questions?
This research was supported in part by the MITACS Accelerate Ontario program, Canada, and by the IBM T. J. Watson Research Center, NY.
liuqi@sce.carleton.ca
http://www.sce.carleton.ca/~liuqi/
ARS Lab: http://cell-devs.sce.carleton.ca/ars/

Some Applications
• Defense & Emergency Planning
  – Battlefield Simulations
  – Crowd Behavior & Evacuation Analysis

Some Applications
• Biomedical & Environmental Analysis
  – Deformable Membrane
  – Presynaptic Nerve
  – Krebs Cycle in living organisms
  – Forest fire propagation
  – Watershed formation
