This presentation analyzes the performance of a molecular modeling application, StreamMD, on the Merrimac streaming architecture. It reviews parallel architectures through Flynn's taxonomy (SIMD, MIMD, and MISD systems), vector machines, MPPs, clusters, modern VLSI technology trends, and the stream processor concept. It then describes the Merrimac processor's architecture, stream programming model, memory system, and interconnection network, and applies them to molecular dynamics simulation, which explores kinetic and thermodynamic properties at the atomic level using GROMACS. Latency tolerance and high-performance computing trends are also discussed.
Analysis and Performance Results of a Molecular Modeling Application on Merrimac. Mattan Erez, Jung Ho Ahn, Ankit Garg, William J. Dally, Eric Darve (Stanford Univ.). Presented by Jiahua He
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Parallel Architectures • Flynn's taxonomy • SISD (sequential machine), SIMD, MIMD, MISD (no commercial system) • SIMD • Processor-array machines • Single-processor vector machines • MIMD • PVP, SMP, DSM, MPP, Cluster
Processor-Array Machine • [Diagram: a control unit fetches, decodes, and broadcasts a single stream of instructions to an array of processing elements] • Control processor issues instructions • All processors in the processor array execute the instructions in lock-step • Distributed memory • Need permutation if data are not aligned
Vector Machine • [Diagram: vr3 = vr1 + vr2, logically performing #elts adds in parallel] • A processor can do element-wise operations on entire vectors with a single instruction (see the C sketch below) • Dominated the high-performance computer market for about 15 years • Overtaken by MPP in the 90s • Re-emerged in recent years (Earth Simulator and Cray X1)
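A minimal C sketch of the element-wise operation that a vector machine executes with a single vector instruction; vadd is an illustrative name, not from the talk.

#include <stddef.h>

/* Element-wise vector add: a vector machine performs this whole loop with
 * a single instruction (logically #elts adds in parallel), whereas a
 * scalar processor issues one add per iteration. */
void vadd(const double *vr1, const double *vr2, double *vr3, size_t n)
{
    for (size_t i = 0; i < n; i++)
        vr3[i] = vr1[i] + vr2[i];
}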
MPP and Cluster • [Diagram: nodes P0 … Pn, each with its own memory and network interface, connected by an interconnect] • Distributed memory • Each processor/node has its own private memory • Nodes may be SMPs • MIMD • Nodes execute different instructions asynchronously • Nodes communicate and synchronize through the interconnection network
Earth Simulator • Vector machine re-emerges • Rmax of about 36 TFLOPS, greater than the combined Rmax of the next 10 systems • Vector machines focused on powerful processors • MPPs and clusters focused on large-scale "clustering" • Trend: merge the two approaches
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Modern VLSI Technology • Arithmetic is cheap • 100s of GFLOPS/chip today • TFLOPS in 2010 • Bandwidth is expensive • General purpose processor architectures have not adapted to this change
Stream Processor • One control unit and 100s of FPUs • In a 90nm fabrication process, a 64-bit FPU takes about 0.5 mm² and 50 pJ per operation • Deep register hierarchy with high local bandwidth • Matches bandwidth demands to technology limits • Stream: a sequence of data objects • Exposes large amounts of data parallelism • Keeps 100s of FPUs per processor busy • Hides long latencies of memory operations
Stream Processor (cont'd) • Exposes multiple levels of locality • Short-term producer-consumer locality (LRF) • Long-term producer-consumer locality (SRF) • Cannot be exploited by caches: no reuse, no spatial locality • Scalable • 128 GFLOPS processor • 16-node 2 TFLOPS single-board workstation • 16,384-node 2 PFLOPS supercomputer with 16 cabinets
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Merrimac Processor • Scalar core (1) • Perform control code and issue stream instructions • Arithmetic clusters (16) • 64-bit multiply-accumulate (MADD) FPUs (4) • Execute the same VLIW instruction • Local register file (LRF) per FPU (192 words) • Short term producer-consumer locality in a kernel • Stream register file (SRF) per cluster (8K words) • Long term producer-consumer locality across kernels • Staging area for memory data transfer to hide latencies
Stream Programming Model • [Diagram: streams flow from kernel1 into kernel2] • Cast the computation as a collection of streams passing through a series of computational kernels (see the C sketch below) • Data parallelism • Across stream elements • Task parallelism • Across kernels
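A minimal plain-C sketch of the idea, not Merrimac's actual stream language: kernel1 and kernel2 are illustrative placeholders, and the point is that the intermediate value produced by one kernel is consumed immediately by the next, so it stays in registers (producer-consumer locality) instead of being written back to memory.

#include <stddef.h>

static double kernel1(double x) { return x * x; }     /* illustrative kernel */
static double kernel2(double y) { return y + 1.0; }   /* illustrative kernel */

void run_pipeline(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {      /* data parallelism across stream elements */
        double t = kernel1(in[i]);        /* short-term locality: t never hits memory */
        out[i]   = kernel2(t);
    }
}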
Memory System • A stream memory instruction transfers an entire stream • Address generators (2) • 8 single-word addresses every cycle • Strided access or gather/scatter patterns • Cache (128K words, 64GB/s) • Directly interfaces with external DRAM and the network • External DRAM (2GB, 38.4GB/s) • Single-word remote memory access • Scatter-add operation
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Molecular Dynamics • Explores kinetic and thermodynamic properties of a molecular system by simulating atomic models • [Images: a single water molecule; a protein surrounded by water molecules] • GROMACS: fastest MD code available • Cut-off distance approximation • Neighbor list (neighbors within rc); see the sketch below
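An illustrative neighbor-list construction under the cut-off approximation (a sketch, not GROMACS code): only molecules within the cut-off distance rc of a central molecule are recorded, and only those pairs are later evaluated by the force kernel. The function name and the flat 3-coordinate position layout are assumptions.

#include <stddef.h>

/* pos: n x 3 array in flat layout; neighbors: preallocated index buffer.
 * Returns the number of neighbors of `center` that lie within rc. */
size_t build_neighbor_list(const double *pos, size_t n, size_t center,
                           double rc, size_t *neighbors)
{
    size_t count = 0;
    for (size_t j = 0; j < n; j++) {
        if (j == center) continue;
        double dx = pos[3*center+0] - pos[3*j+0];
        double dy = pos[3*center+1] - pos[3*j+1];
        double dz = pos[3*center+2] - pos[3*j+2];
        if (dx*dx + dy*dy + dz*dz < rc*rc)   /* within cut-off distance */
            neighbors[count++] = j;
    }
    return count;
}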
StreamMD • Single kernel: non-bonded interactions between all atom pairs of a central molecule and one of its neighbors • Pseudo code (a scalar C rendering follows below):
c_positions = gather(positions, i_central);
n_positions = gather(positions, i_neighbor);
partial_forces = compute_force(c_positions, n_positions);
forces = scatter_add(partial_forces, i_forces);
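A possible scalar C rendering of the pseudo code above (a sketch, not the Merrimac stream code): gather the two endpoints of each interaction, compute a partial force, and scatter-add it into the force array. The pair force is a placeholder inverse-square term rather than the GROMACS water-water interaction, and the scatter goes directly through the molecule indices instead of a separate i_forces stream.

#include <math.h>
#include <stddef.h>

void stream_md_step(const double *positions,    /* n x 3, flat layout      */
                    const size_t *i_central,    /* m interaction pairs     */
                    const size_t *i_neighbor,
                    double *forces,             /* n x 3, accumulated      */
                    size_t m)
{
    for (size_t k = 0; k < m; k++) {
        size_t c = i_central[k], j = i_neighbor[k];
        double d[3], r2 = 0.0;
        for (int a = 0; a < 3; a++) {           /* gather both positions   */
            d[a] = positions[3*c+a] - positions[3*j+a];
            r2  += d[a]*d[a];
        }
        double s = 1.0 / (r2 * sqrt(r2));       /* placeholder 1/r^3 scale */
        for (int a = 0; a < 3; a++) {           /* scatter-add the pair    */
            forces[3*c+a] += s * d[a];
            forces[3*j+a] -= s * d[a];
        }
    }
}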
Latency Tolerance • Pipeline the requests • Amortizes the long initial latency • By issuing a memory op on a long stream • Hide memory ops behind computations • Memory ops and kernel computations execute concurrently • Strip-mining • Large data set is split into smaller strips (see the sketch below) • Outer loop (done manually)
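A sketch of strip-mining in plain C, assuming the stream_md_step routine from the previous sketch and an illustrative strip size. On Merrimac the memory transfers for one strip overlap the kernel computation of another strip; the scalar loop below only shows the outer-loop structure, not that overlap.

#include <stddef.h>

#define STRIP 4096   /* illustrative strip size chosen to fit the SRF */

void stream_md_step(const double *positions, const size_t *i_central,
                    const size_t *i_neighbor, double *forces, size_t m);

void stream_md_strips(const double *positions, const size_t *i_central,
                      const size_t *i_neighbor, double *forces, size_t m)
{
    for (size_t base = 0; base < m; base += STRIP) {   /* manual outer loop */
        size_t len = (m - base < STRIP) ? (m - base) : STRIP;
        stream_md_step(positions, i_central + base, i_neighbor + base,
                       forces, len);
    }
}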
Parallelism • 4 variants to exploit parallelism • Also implemented on Pentium 4 for comparison
“Expanded” Variant • Simplest version • Fully expand the interaction list • For each cluster per iteration • Read 2 interacting molecules • Produce 2 partial forces
"Fixed" Variant • Fixed-length neighbor list of length L • For each cluster • Read a central molecule once every L iterations • Read a neighbor molecule each iteration • Partial forces of the central molecule are reduced within the cluster • Repeat the central molecule in i_central • Add dummy_neighbor entries in i_neighbor if needed (see the sketch below)
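A sketch of how the fixed-length index streams might be built: each central molecule is repeated L times in i_central, and its neighbor list is padded with a dummy index so every list has exactly length L. The function name, the DUMMY sentinel, and the L-entries-per-molecule storage layout are illustrative assumptions.

#include <stddef.h>

/* nbr: neighbor indices stored L entries per central molecule;
 * nbr_count[c] <= L is the real neighbor count of molecule c. */
void build_fixed_lists(const size_t *nbr, const size_t *nbr_count,
                       size_t n_central, size_t L, size_t DUMMY,
                       size_t *i_central, size_t *i_neighbor)
{
    size_t out = 0;
    for (size_t c = 0; c < n_central; c++) {
        for (size_t k = 0; k < L; k++, out++) {
            i_central[out]  = c;                       /* repeat the central molecule */
            i_neighbor[out] = (k < nbr_count[c])
                              ? nbr[c * L + k]         /* real neighbor               */
                              : DUMMY;                 /* padding interaction         */
        }
    }
}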
“Variable” Variant • Variable-length neighbor list • Process inputs and produce outputs at a different rate for each cluster • Merrimac’s inter-cluster communication • Conditional streams mechanism • Indexable SRF • Instructions to read new central position and write partial forces are issued on every iteration but with a condition • Slight overhead of unexecuted instructions
“Duplicated” Variant • Fixed-length neighbor list • Duplicate all interaction calculations • Reduce complete force for central molecule within cluster • No partial force for neighbor molecule is written out
Locality • Only short-term producer-consumer locality within a single kernel • Computing partial forces • Internal reduction of forces within a cluster • Computation/bandwidth trade-off • Extra computation for interactions with dummy molecules: "fixed" variant • Extreme case: "duplicated" variant • Need more sophisticated schemes (discussed later)
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Experiment Setup • Single-node experiments • 900 water-molecule system • Cycle-accurate simulator of Merrimac • 4 variants of StreamMD • Pentium 4 version • Latest version of GROMACS • Fully hand optimized • Single precision SSE
Latency Tolerance • Snippet of the execution of “duplicated” variant • Left column • Kernel computations • Right column • Memory operations • Perfect overlap of memory and computation
Locality • Arithmetic intensities • Those of "fixed" and "variable" depend on the data set • Small differences: the compiler efficiently utilizes the register hierarchy • Reference percentages • Nearly all references go to the LRFs • Small differences: the SRF is used just as a staging area for memory
Performance • "variable" outperforms "expanded" by 84%, "fixed" by 26%, "duplicated" by 119%, and the Pentium 4 by a factor of 13.2 • Its 38.8 GFLOPS is 50% of the optimal solution
Automatic Optimizations • Communication scheduling • The SRF decouples memory from computation • Loop unrolling and software pipelining (illustrated below) • Improve the execution rate by 83% • Stream scheduling • The SRF is software-managed • Captures long-term producer-consumer locality by intelligent eviction
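A hand-written illustration of loop unrolling (the Merrimac compiler applies it automatically, together with software pipelining): the 4x-unrolled body exposes several independent multiply-adds per iteration, which helps fill the FPU pipelines and gives the scheduler room to overlap operations. saxpy_unrolled is an illustrative example, not code from the paper.

#include <stddef.h>

void saxpy_unrolled(double a, const double *x, double *y, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {      /* body unrolled by 4 */
        y[i]   += a * x[i];
        y[i+1] += a * x[i+1];
        y[i+2] += a * x[i+2];
        y[i+3] += a * x[i+3];
    }
    for (; i < n; i++)                /* remainder loop */
        y[i] += a * x[i];
}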
Computation/bandwidth Trade-off • Blocking technique (see the model sketch below) • Group molecules into cubic clusters of size r³ • Pave the sphere of cut-off radius rc with cubic clusters • Memory bandwidth requirement scales as O(r⁻³) • Extra computation between rc and rc + 2√3·r • Minimum occurs at about 3 molecules per cluster (1.4³)
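A back-of-the-envelope model of the trade-off stated above, assuming memory traffic scales as 1/r³ and the extra computation is proportional to the volume of the enlarged sphere of radius rc + 2√3·r relative to the cut-off sphere. This is an illustration of the scaling argument, not the paper's cost model.

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double rc = 1.0;                              /* cut-off radius (normalized)  */
    for (int i = 1; i <= 10; i++) {
        double r       = 0.2 * i;                       /* cubic cluster edge length    */
        double traffic = 1.0 / (r * r * r);             /* relative memory traffic      */
        double reach   = rc + 2.0 * sqrt(3.0) * r;      /* enlarged interaction radius  */
        double extra   = pow(reach / rc, 3.0) - 1.0;    /* relative extra computation   */
        printf("r=%.1f  traffic~%6.2f  extra_compute~%7.2f\n", r, traffic, extra);
    }
    return 0;
}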
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Conclusions • Reviewed the architecture and organization of Merrimac • Presented the StreamMD application, implemented 4 variants, and evaluated their performance • Compared Merrimac's suitability for a molecular dynamics application against a conventional Pentium 4 processor
Special Applications? • Merrimac is tuned for scientific applications • Programming model • A collection of streams passes through a series of computational kernels • Needs large data-level parallelism to utilize the FPUs • Task parallelism can only be exploited across nodes because of the SIMD organization
Easy to Program? • Effective automatic compilation • Communication scheduling and stream scheduling (shown earlier) • Highly optimized code for conventional processors is often written in assembly • Performance of the different StreamMD variants varies only by a factor of two (shown earlier)
Compare with Supercomputers? • Comparing only with a Pentium 4 seems unconvincing • MDGRAPE-3 of the Protein Explorer can achieve 165 GFLOPS out of a 200 GFLOPS peak • But it is a special-purpose design • What about vector machines? • Lack of standard benchmarks