This presentation analyzes the performance of a molecular modeling application, StreamMD, on the Merrimac streaming architecture. It reviews parallel architectures through Flynn's taxonomy (SIMD, MIMD, and MISD systems), vector machines, MPPs, clusters, modern VLSI technology trends, and the stream processor concept. It then describes the Merrimac processor's architecture, stream programming model, memory system, and interconnection network, and applies them to molecular dynamics simulation, which explores kinetic and thermodynamic properties at the atomic level using GROMACS. Latency tolerance and high-performance computing trends are also discussed.
Analysis and Performance Results of a Molecular Modeling Application on Merrimac. Mattan Erez, Jung Ho Ahn, Ankit Garg, William J. Dally, Eric Darve (Stanford Univ.). Presented by Jiahua He
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Parallel Architectures • Flynn's taxonomy • SISD (sequential machine), SIMD, MIMD, MISD (no commercial system) • SIMD • Processor-array machines • Single-processor vector machines • MIMD • PVP, SMP, DSM, MPP, Cluster
Processor-Array Machine • [Diagram: a control unit fetches, decodes, and broadcasts a single stream of instructions to an array of processing elements] • Control processor issues instructions • All processors in the processor array execute the instructions in lock-step • Distributed memory • Need permutation if data are not aligned
Vector Machine • [Diagram: vr3 = vr1 + vr2, logically performing #elts adds in parallel] • A processor can do element-wise operations on entire vectors with a single instruction (see the C sketch below) • Dominated the high-performance computer market for about 15 years • Overtaken by MPP in the 90s • Re-emerged in recent years (Earth Simulator and Cray X1)
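A minimal C sketch of the element-wise operation that a vector machine executes with a single vector instruction; vadd is an illustrative name, not from the talk.

#include <stddef.h>

/* Element-wise vector add: a vector machine performs this whole loop with
 * a single instruction (logically #elts adds in parallel), whereas a
 * scalar processor issues one add per iteration. */
void vadd(const double *vr1, const double *vr2, double *vr3, size_t n)
{
    for (size_t i = 0; i < n; i++)
        vr3[i] = vr1[i] + vr2[i];
}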
MPP and Cluster • [Diagram: nodes P0 … Pn, each with its own memory and network interface, connected by an interconnect] • Distributed memory • Each processor/node has its own private memory • Nodes may be SMPs • MIMD • Nodes execute different instructions asynchronously • Nodes communicate and synchronize through the interconnection network
Earth Simulator • Vector machine re-emerges • Rmax of about 36 TFLOPS, greater than the combined Rmax of the next 10 systems • Vector machines focused on powerful processors • MPPs and clusters focused on large-scale "clustering" • Trend: merge the two approaches
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Modern VLSI Technology • Arithmetic is cheap • 100s of GFLOPS/chip today • TFLOPS in 2010 • Bandwidth is expensive • General purpose processor architectures have not adapted to this change
Stream Processor • One control unit and 100s of FPUs • In a 90nm fabrication process, a 64-bit FPU takes about 0.5 mm² and 50 pJ per operation • Deep register hierarchy with high local bandwidth • Matches bandwidth demands to technology limits • Stream: a sequence of data objects • Exposes large amounts of data parallelism • Keeps 100s of FPUs per processor busy • Hides long latencies of memory operations
Stream Processor (cont'd) • Exposes multiple levels of locality • Short-term producer-consumer locality (LRF) • Long-term producer-consumer locality (SRF) • Cannot be exploited by caches: no reuse, no spatial locality • Scalable • 128 GFLOPS processor • 16-node 2 TFLOPS single-board workstation • 16,384-node 2 PFLOPS supercomputer with 16 cabinets
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Merrimac Processor • Scalar core (1) • Perform control code and issue stream instructions • Arithmetic clusters (16) • 64-bit multiply-accumulate (MADD) FPUs (4) • Execute the same VLIW instruction • Local register file (LRF) per FPU (192 words) • Short term producer-consumer locality in a kernel • Stream register file (SRF) per cluster (8K words) • Long term producer-consumer locality across kernels • Staging area for memory data transfer to hide latencies
Stream Programming Model • [Diagram: streams flow from kernel1 into kernel2] • Cast the computation as a collection of streams passing through a series of computational kernels (see the C sketch below) • Data parallelism • Across stream elements • Task parallelism • Across kernels
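A minimal plain-C sketch of the idea, not Merrimac's actual stream language: kernel1 and kernel2 are illustrative placeholders, and the point is that the intermediate value produced by one kernel is consumed immediately by the next, so it stays in registers (producer-consumer locality) instead of being written back to memory.

#include <stddef.h>

static double kernel1(double x) { return x * x; }     /* illustrative kernel */
static double kernel2(double y) { return y + 1.0; }   /* illustrative kernel */

void run_pipeline(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {      /* data parallelism across stream elements */
        double t = kernel1(in[i]);        /* short-term locality: t never hits memory */
        out[i]   = kernel2(t);
    }
}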
Memory System • A stream memory instruction transfers an entire stream • Address generators (2) • 8 single-word addresses every cycle • Strided access or gather/scatter patterns • Cache (128K words, 64GB/s) • Directly interfaces with external DRAM and the network • External DRAM (2GB, 38.4GB/s) • Single-word remote memory access • Scatter-add operation
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Molecular Dynamics • Explores kinetic and thermodynamic properties of a molecular system by simulating atomic models • [Images: a single water molecule; a protein surrounded by water molecules] • GROMACS: fastest MD code available • Cut-off distance approximation • Neighbor list (neighbors within rc); see the sketch below
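An illustrative neighbor-list construction under the cut-off approximation (a sketch, not GROMACS code): only molecules within the cut-off distance rc of a central molecule are recorded, and only those pairs are later evaluated by the force kernel. The function name and the flat 3-coordinate position layout are assumptions.

#include <stddef.h>

/* pos: n x 3 array in flat layout; neighbors: preallocated index buffer.
 * Returns the number of neighbors of `center` that lie within rc. */
size_t build_neighbor_list(const double *pos, size_t n, size_t center,
                           double rc, size_t *neighbors)
{
    size_t count = 0;
    for (size_t j = 0; j < n; j++) {
        if (j == center) continue;
        double dx = pos[3*center+0] - pos[3*j+0];
        double dy = pos[3*center+1] - pos[3*j+1];
        double dz = pos[3*center+2] - pos[3*j+2];
        if (dx*dx + dy*dy + dz*dz < rc*rc)   /* within cut-off distance */
            neighbors[count++] = j;
    }
    return count;
}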
StreamMD • Single kernel: non-bonded interactions between all atom pairs of a central molecule and one of its neighbors • Pseudo code (a scalar C rendering follows below):
c_positions = gather(positions, i_central);
n_positions = gather(positions, i_neighbor);
partial_forces = compute_force(c_positions, n_positions);
forces = scatter_add(partial_forces, i_forces);
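A possible scalar C rendering of the pseudo code above (a sketch, not the Merrimac stream code): gather the two endpoints of each interaction, compute a partial force, and scatter-add it into the force array. The pair force is a placeholder inverse-square term rather than the GROMACS water-water interaction, and the scatter goes directly through the molecule indices instead of a separate i_forces stream.

#include <math.h>
#include <stddef.h>

void stream_md_step(const double *positions,    /* n x 3, flat layout      */
                    const size_t *i_central,    /* m interaction pairs     */
                    const size_t *i_neighbor,
                    double *forces,             /* n x 3, accumulated      */
                    size_t m)
{
    for (size_t k = 0; k < m; k++) {
        size_t c = i_central[k], j = i_neighbor[k];
        double d[3], r2 = 0.0;
        for (int a = 0; a < 3; a++) {           /* gather both positions   */
            d[a] = positions[3*c+a] - positions[3*j+a];
            r2  += d[a]*d[a];
        }
        double s = 1.0 / (r2 * sqrt(r2));       /* placeholder 1/r^3 scale */
        for (int a = 0; a < 3; a++) {           /* scatter-add the pair    */
            forces[3*c+a] += s * d[a];
            forces[3*j+a] -= s * d[a];
        }
    }
}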
Latency Tolerance • Pipeline the requests • Amortizes the long initial latency • By issuing a memory op on a long stream • Hide memory ops behind computations • Memory ops and kernel computations execute concurrently • Strip-mining • Large data set is split into smaller strips (see the sketch below) • Outer loop (done manually)
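A sketch of strip-mining in plain C, assuming the stream_md_step routine from the previous sketch and an illustrative strip size. On Merrimac the memory transfers for one strip overlap the kernel computation of another strip; the scalar loop below only shows the outer-loop structure, not that overlap.

#include <stddef.h>

#define STRIP 4096   /* illustrative strip size chosen to fit the SRF */

void stream_md_step(const double *positions, const size_t *i_central,
                    const size_t *i_neighbor, double *forces, size_t m);

void stream_md_strips(const double *positions, const size_t *i_central,
                      const size_t *i_neighbor, double *forces, size_t m)
{
    for (size_t base = 0; base < m; base += STRIP) {   /* manual outer loop */
        size_t len = (m - base < STRIP) ? (m - base) : STRIP;
        stream_md_step(positions, i_central + base, i_neighbor + base,
                       forces, len);
    }
}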
Parallelism • 4 variants to exploit parallelism • Also implemented on Pentium 4 for comparison
“Expanded” Variant • Simplest version • Fully expand the interaction list • For each cluster per iteration • Read 2 interacting molecules • Produce 2 partial forces
"Fixed" Variant • Fixed-length neighbor list of length L • For each cluster • Read a central molecule once every L iterations • Read a neighbor molecule each iteration • Partial forces of the central molecule are reduced within the cluster • Repeat the central molecule in i_central • Add dummy_neighbor entries in i_neighbor if needed (see the sketch below)
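A sketch of how the fixed-length index streams might be built: each central molecule is repeated L times in i_central, and its neighbor list is padded with a dummy index so every list has exactly length L. The function name, the DUMMY sentinel, and the L-entries-per-molecule storage layout are illustrative assumptions.

#include <stddef.h>

/* nbr: neighbor indices stored L entries per central molecule;
 * nbr_count[c] <= L is the real neighbor count of molecule c. */
void build_fixed_lists(const size_t *nbr, const size_t *nbr_count,
                       size_t n_central, size_t L, size_t DUMMY,
                       size_t *i_central, size_t *i_neighbor)
{
    size_t out = 0;
    for (size_t c = 0; c < n_central; c++) {
        for (size_t k = 0; k < L; k++, out++) {
            i_central[out]  = c;                       /* repeat the central molecule */
            i_neighbor[out] = (k < nbr_count[c])
                              ? nbr[c * L + k]         /* real neighbor               */
                              : DUMMY;                 /* padding interaction         */
        }
    }
}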
“Variable” Variant • Variable-length neighbor list • Process inputs and produce outputs at a different rate for each cluster • Merrimac’s inter-cluster communication • Conditional streams mechanism • Indexable SRF • Instructions to read new central position and write partial forces are issued on every iteration but with a condition • Slight overhead of unexecuted instructions
“Duplicated” Variant • Fixed-length neighbor list • Duplicate all interaction calculations • Reduce complete force for central molecule within cluster • No partial force for neighbor molecule is written out
Locality • Only short-term producer-consumer locality within a single kernel • Computing partial forces • Internal reduction of forces within a cluster • Computation/bandwidth trade-off • Extra computation for interactions with dummy molecules: "fixed" variant • Extreme case: "duplicated" variant • Need more sophisticated schemes (discussed later)
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Experiment Setup • Single-node experiments • 900 water-molecule system • Cycle-accurate simulator of Merrimac • 4 variants of StreamMD • Pentium 4 version • Latest version of GROMACS • Fully hand optimized • Single precision SSE
Latency Tolerance • Snippet of the execution of “duplicated” variant • Left column • Kernel computations • Right column • Memory operations • Perfect overlap of memory and computation
Locality • Arithmetic intensities • Those of "fixed" and "variable" depend on the data set • Small differences: the compiler efficiently utilizes the register hierarchy • Reference percentages • Nearly all references go to the LRFs • Small differences: the SRF is used just as a staging area for memory
Performance • "variable" outperforms "expanded" by 84%, "fixed" by 26%, "duplicated" by 119%, and the Pentium 4 by a factor of 13.2 • Its 38.8 GFLOPS is 50% of the optimal solution
Automatic Optimizations • Communication scheduling • The SRF decouples memory from computation • Loop unrolling and software pipelining (illustrated below) • Improve the execution rate by 83% • Stream scheduling • The SRF is software-managed • Captures long-term producer-consumer locality by intelligent eviction
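A hand-written illustration of loop unrolling (the Merrimac compiler applies it automatically, together with software pipelining): the 4x-unrolled body exposes several independent multiply-adds per iteration, which helps fill the FPU pipelines and gives the scheduler room to overlap operations. saxpy_unrolled is an illustrative example, not code from the paper.

#include <stddef.h>

void saxpy_unrolled(double a, const double *x, double *y, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {      /* body unrolled by 4 */
        y[i]   += a * x[i];
        y[i+1] += a * x[i+1];
        y[i+2] += a * x[i+2];
        y[i+3] += a * x[i+3];
    }
    for (; i < n; i++)                /* remainder loop */
        y[i] += a * x[i];
}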
Computation/bandwidth Trade-off • Blocking technique (see the model sketch below) • Group molecules into cubic clusters of size r³ • Pave the sphere of cut-off radius rc with cubic clusters • Memory bandwidth requirement scales as O(r⁻³) • Extra computation between rc and rc + 2√3·r • Minimum occurs at about 3 molecules per cluster (1.4³)
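A back-of-the-envelope model of the trade-off stated above, assuming memory traffic scales as 1/r³ and the extra computation is proportional to the volume of the enlarged sphere of radius rc + 2√3·r relative to the cut-off sphere. This is an illustration of the scaling argument, not the paper's cost model.

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double rc = 1.0;                              /* cut-off radius (normalized)  */
    for (int i = 1; i <= 10; i++) {
        double r       = 0.2 * i;                       /* cubic cluster edge length    */
        double traffic = 1.0 / (r * r * r);             /* relative memory traffic      */
        double reach   = rc + 2.0 * sqrt(3.0) * r;      /* enlarged interaction radius  */
        double extra   = pow(reach / rc, 3.0) - 1.0;    /* relative extra computation   */
        printf("r=%.1f  traffic~%6.2f  extra_compute~%7.2f\n", r, traffic, extra);
    }
    return 0;
}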
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Conclusions • Reviewed the architecture and organization of Merrimac • Presented the StreamMD application, implemented 4 variants, and evaluated their performance • Compared Merrimac's suitability for a molecular dynamics application against a conventional Pentium 4 processor
Special Applications? • Merrimac is tuned for scientific applications • Programming model • A collection of streams passes through a series of computational kernels • Needs large data-level parallelism to utilize the FPUs • Task parallelism can only be exploited across nodes because of the SIMD organization
Easy to Program? • Effective automatic compilation • Communication scheduling and stream scheduling (shown earlier) • Highly optimized code for conventional processors is often written in assembly • Performance of the different StreamMD variants varies only by a factor of two (shown earlier)
Compare with Supercomputers? • Comparing only with a Pentium 4 seems unconvincing • MDGRAPE-3 of the Protein Explorer can achieve 165 GFLOPS out of a 200 GFLOPS peak • But it is a special-purpose design • What about vector machines? • Lack of standard benchmarks