Analysis and Performance Results of a Molecular Modeling Application on Merrimac Mattan Erez, Jung Ho Ahn, Ankit Garg, William J. Dally, Eric Darve (Stanford Univ.) Presented by Jiahua He
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Parallel Architectures • Flynn's taxonomy • SISD (sequential machine), SIMD, MIMD, MISD (no commercial system) • SIMD • Processor-array machine • Single-processor vector machine • MIMD • PVP, SMP, DSM, MPP, Cluster
Processor-Array Machine • [Figure: a control unit fetches and decodes a single stream of instructions and broadcasts it to an array of processing elements] • Control processor issues instructions • All processors in the processor array execute the instructions in lock-step • Distributed memory • Need permutation if data is not aligned
Vector Machine • [Figure: vector registers vr1 + vr2 → vr3; logically performs #elts adds in parallel] • A processor can do element-wise operations on entire vectors with a single instruction • Dominated the high-performance computer market for about 15 years • Overtaken by MPPs in the 1990s • Re-emerged in recent years (Earth Simulator and Cray X1)
MPP and Cluster • [Figure: nodes P0 … Pn, each with private memory and a network interface (NI), connected by an interconnect] • Distributed memory • Each processor/node has its own private memory • Nodes may be SMPs • MIMD • Nodes execute different instructions asynchronously • Nodes communicate and synchronize via the interconnection network
Earth Simulator • Vector machine re-emerges • Rmax of ~36 TFLOPS, greater than the combined Rmax of the next 10 systems • Vector machines focused on powerful processors • MPPs and clusters focused on large-scale "clustering" • Trend: merge the two approaches
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Modern VLSI Technology • Arithmetic is cheap • 100s of GFLOPS per chip today • TFLOPS by 2010 • Bandwidth is expensive • General-purpose processor architectures have not adapted to this change
Stream Processor • One control unit and 100s of FPUs • In a 90nm fabrication process, a 64-bit FPU takes about 0.5mm² and 50pJ per operation • Deep register hierarchy with high local bandwidth • Match bandwidth demands to technology limits • Stream: sequence of data objects • Expose large amounts of data parallelism • Keep 100s of FPUs per processor busy • Hide long latencies of memory operations
Stream Processor (cont’d) • Expose multiple levels of locality • Short-term producer-consumer locality (LRF) • Long-term producer-consumer locality (SRF) • Cannot be exploited by caches – no reuse, no spatial locality • Scalable • 128 GFLOPS processor • 16-node, 2 TFLOPS single-board workstation • 16,384-node, 2 PFLOPS supercomputer with 16 cabinets
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Merrimac Processor • Scalar core (1) • Runs control code and issues stream instructions • Arithmetic clusters (16) • 64-bit multiply-accumulate (MADD) FPUs (4) per cluster • Execute the same VLIW instruction • Local register file (LRF) per FPU (192 words) • Short-term producer-consumer locality within a kernel • Stream register file (SRF) per cluster (8K words) • Long-term producer-consumer locality across kernels • Staging area for memory transfers to hide latency
Stream Programming Model • [Figure: streams flowing through kernel1 and kernel2] • Cast the computation as a collection of streams passing through a series of computational kernels • Data parallelism • Across stream elements • Task parallelism • Across kernels • A minimal sketch follows.
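A minimal sketch of this model in plain C++ (the stream type and kernel bodies are illustrative stand-ins, not Merrimac's actual StreamC/KernelC interface): each kernel consumes an input stream element by element and produces an output stream, and the intermediate stream between the two kernels is exactly the producer-consumer locality the SRF is built to capture.

```cpp
#include <cstdio>
#include <vector>

// Illustrative stand-in for a stream: an ordered sequence of records.
using Stream = std::vector<double>;

// kernel1: one output element per input element, fully data-parallel,
// so 100s of FPUs can each work on a different stream element.
Stream kernel1(const Stream& in) {
    Stream out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] * in[i];
    return out;
}

// kernel2: consumes kernel1's output; on Merrimac the intermediate
// stream can stay in the SRF instead of traveling to DRAM
// (long-term producer-consumer locality).
Stream kernel2(const Stream& in) {
    Stream out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] + 1.0;
    return out;
}

int main() {
    Stream input = {1.0, 2.0, 3.0};
    Stream result = kernel2(kernel1(input));  // streams pass through kernels
    for (double v : result) std::printf("%g\n", v);
    return 0;
}
```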
Memory System • A single stream memory instruction transfers an entire stream • Address generators (2) • 8 single-word addresses every cycle • Strided or gather/scatter access patterns • Cache (128K words, 64GB/s) • Directly interfaces with external DRAM and the network • External DRAM (2GB, 38.4GB/s) • Single-word remote memory access • Scatter-add operation
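A sequential C++ reference for what scatter-add computes (on Merrimac this is a single hardware memory operation with atomic accumulation; this sketch only pins down the semantics):

```cpp
#include <cstddef>
#include <vector>

// scatter_add: out[idx[i]] += val[i] for every stream element i.
// On Merrimac this is one hardware memory operation; here it is a
// plain sequential loop with the same net effect. Colliding indices
// accumulate rather than overwrite.
void scatter_add(std::vector<double>& out,
                 const std::vector<std::size_t>& idx,
                 const std::vector<double>& val) {
    for (std::size_t i = 0; i < idx.size(); ++i)
        out[idx[i]] += val[i];
}
```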
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Molecular Dynamics • Explore kinetic and thermodynamic properties of molecular systems by simulating atomic models • [Figure: a water molecule; a protein surrounded by water molecules] • GROMACS: fastest MD code available • Cut-off distance approximation • Neighbor list (neighbors within rc)
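A hedged sketch of cut-off neighbor-list construction in C++, using the quadratic all-pairs scan for clarity; production codes such as GROMACS accelerate this step with cell lists:

```cpp
#include <vector>

struct Vec3 { double x, y, z; };

// Build, for each molecule i, the list of molecules within cutoff rc.
// O(n^2) scan for clarity; real MD codes use cell lists instead.
std::vector<std::vector<int>>
build_neighbor_lists(const std::vector<Vec3>& pos, double rc) {
    const double rc2 = rc * rc;
    std::vector<std::vector<int>> nbrs(pos.size());
    for (int i = 0; i < (int)pos.size(); ++i)
        for (int j = 0; j < (int)pos.size(); ++j) {
            if (i == j) continue;
            double dx = pos[i].x - pos[j].x;
            double dy = pos[i].y - pos[j].y;
            double dz = pos[i].z - pos[j].z;
            if (dx * dx + dy * dy + dz * dz < rc2)
                nbrs[i].push_back(j);
        }
    return nbrs;
}
```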
StreamMD • Single kernel: non-bonded interaction between all atom pairs of a molecule and one of its neighbors • Pseudocode:
c_positions = gather(positions, i_central);
n_positions = gather(positions, i_neighbor);
partial_forces = compute_force(c_positions, n_positions);
forces = scatter_add(partial_forces, i_forces);
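For illustration, a self-contained C++ sketch of the pairwise math a compute_force kernel evaluates per interaction; c6, c12, and qq are illustrative Lennard-Jones and Coulomb parameters, and the real GROMACS water-water kernel evaluates nine such atom pairs per molecule pair (with the LJ term on the oxygen-oxygen pair only):

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };

// Force on the central atom from one neighbor: Lennard-Jones plus
// Coulomb along the separation vector d.
//   V(r) = c12/r^12 - c6/r^6 + qq/r
// The dV/dr terms are folded into a scalar multiplier on d.
Vec3 compute_force(Vec3 ci, Vec3 ni, double c6, double c12, double qq) {
    Vec3 d{ci.x - ni.x, ci.y - ni.y, ci.z - ni.z};
    double r2    = d.x * d.x + d.y * d.y + d.z * d.z;
    double rinv2 = 1.0 / r2;
    double rinv6 = rinv2 * rinv2 * rinv2;
    double flj   = (12.0 * c12 * rinv6 * rinv6 - 6.0 * c6 * rinv6) * rinv2;
    double fcoul = qq * std::sqrt(rinv2) * rinv2;
    double f     = flj + fcoul;
    return Vec3{f * d.x, f * d.y, f * d.z};
}
```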
Latency Tolerance • Pipeline the requests • Amortize the long initial latency • By issuing a memory op on a long stream • Hide memory ops with computations • Concurrently execute memory ops and kernel computations • Strip-mining • Break a large data set into smaller strips • Outer loop (done manually), as sketched below
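A minimal sketch of strip-mining in C++ (the strip size is an assumption standing in for the SRF capacity): the manually written outer loop walks the data set strip by strip, which is what lets the load of strip k+1 overlap the kernel running on strip k.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Process a large data set in fixed-size strips that fit in the SRF.
// On Merrimac the memory transfer of the next strip overlaps the
// kernel on the current one; this sequential sketch shows only the
// strip structure itself.
void process_stripmined(const std::vector<double>& data,
                        std::vector<double>& out,
                        std::size_t strip_size /* ~ SRF capacity */) {
    out.resize(data.size());
    for (std::size_t base = 0; base < data.size(); base += strip_size) {
        std::size_t end = std::min(base + strip_size, data.size());
        // "kernel" over one strip of the stream
        for (std::size_t i = base; i < end; ++i)
            out[i] = 2.0 * data[i];
    }
}
```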
Parallelism • 4 variants to exploit parallelism • Also implemented on Pentium 4 for comparison
“Expanded” Variant • Simplest version • Fully expand the interaction list • For each cluster per iteration • Read 2 interacting molecules • Produce 2 partial forces
“Fixed” Variant • Fixed-length neighbor list of length L • For each cluster • Read a central molecule once every L iterations • Read a neighbor molecule each iteration • Partial forces of the central molecule are reduced within the cluster • Repeat the central molecule in i_central • Add dummy_neighbor entries to i_neighbor if needed, as sketched below
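A sketch, under assumed names, of the index-stream construction this variant implies: every neighbor list is padded to a multiple of L with a dummy neighbor, so each cluster reads one central molecule per L iterations and all clusters consume inputs at the same rate.

```cpp
#include <algorithm>
#include <vector>

// Pad each neighbor list up to a multiple of L with a dummy neighbor.
// The central index is pushed once per L-long segment; interactions
// with the dummy are computed and then discarded.
void build_fixed_streams(const std::vector<std::vector<int>>& nbrs,
                         int L, int dummy_neighbor,
                         std::vector<int>& i_central,
                         std::vector<int>& i_neighbor) {
    for (int c = 0; c < (int)nbrs.size(); ++c) {
        const auto& list = nbrs[c];
        int segments = std::max(1, ((int)list.size() + L - 1) / L);
        for (int s = 0; s < segments; ++s) {
            i_central.push_back(c);  // one central read per L iterations
            for (int k = 0; k < L; ++k) {
                int idx = s * L + k;
                i_neighbor.push_back(idx < (int)list.size()
                                         ? list[idx] : dummy_neighbor);
            }
        }
    }
}
```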
“Variable” Variant • Variable-length neighbor list • Each cluster processes inputs and produces outputs at its own rate • Merrimac’s inter-cluster communication • Conditional streams mechanism • Indexable SRF • Instructions to read a new central position and write partial forces are issued on every iteration, but under a condition • Slight overhead from unexecuted instructions
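A scalar sketch of the conditional-stream idea (the flag and field names are illustrative): the read of the next central position is issued on every iteration but only consumes an element when its condition is set, which is how clusters process variable-length lists at different rates without padding.

```cpp
#include <vector>

struct Interaction { bool new_central; int neighbor; };

// Conditional-stream sketch: the "read central" operation appears in
// every iteration, but the condition decides whether it actually
// consumes the next element, so no dummy padding is needed.
void process_variable(const std::vector<Interaction>& work,
                      const std::vector<double>& centrals,
                      std::vector<double>& out) {
    std::size_t c_idx = 0;
    double central = 0.0;
    for (const Interaction& it : work) {
        if (it.new_central)               // conditional read of next central
            central = centrals[c_idx++];
        out.push_back(central + it.neighbor);  // stand-in for force math
    }
}
```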
“Duplicated” Variant • Fixed-length neighbor list • Duplicate all interaction calculations • Reduce complete force for central molecule within cluster • No partial force for neighbor molecule is written out
Locality • Only short-term producer-consumer locality within a single kernel • Computing partial forces • Internal reduction of forces within a cluster • Computation/bandwidth trade-off • Extra computation for interactions with dummy molecules: “fixed” variant • Extreme case: “duplicated” variant • Need more sophisticated schemes (discussed later)
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Experiment Setup • Single-node experiments • 900 water-molecule system • Cycle-accurate simulator of Merrimac • 4 variants of StreamMD • Pentium 4 version • Latest version of GROMACS • Fully hand optimized • Single precision SSE
Latency Tolerance • [Figure: snippet of the execution of the “duplicated” variant; the left column shows kernel computations, the right column memory operations] • Perfect overlap of memory and computation
Locality • Arithmetic intensities • “fixed” and “variable” depend on the data set • Small differences across variants: the compiler efficiently utilizes the register hierarchy • Reference percentages • Nearly all references go to the LRFs • Small differences across variants: the SRF is used mostly as a staging area for memory
Performance • “variable” outperforms “expanded” by 84%, “fixed” by 26%, “duplicated” by 119%, and the Pentium 4 by a factor of 13.2 • 38.8 GFLOPS is 50% of the optimal solution
Automatic Optimizations • Communication scheduling • SRF decouples memory from computation • Loop unrolling and software pipelining • Improve execution rate by 83% • Stream scheduling • SRF is software-managed • Capture long-term producer-consumer locality through intelligent eviction
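A minimal illustration of loop unrolling (factor 4, chosen arbitrarily): the four independent statements per iteration give the VLIW scheduler more operations to overlap, which is the kind of instruction-level slack that loop unrolling and software pipelining create.

```cpp
#include <cstddef>

// 4x unrolled element-wise kernel: four independent updates per
// iteration let the scheduler overlap their loads, multiplies, and
// adds instead of stalling on one dependence chain.
void axpy_unrolled(double a, const double* x, double* y, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; ++i)  // remainder loop
        y[i] += a * x[i];
}
```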
Computation/Bandwidth Trade-off • Blocking technique • Group molecules into cubic clusters of size r³ • Pave the cut-off sphere of radius rc with cubic clusters • Memory bandwidth requirement scales as O(r⁻³) • Extra computation between rc and rc + 2√3·r • Minimum occurs at about 3 molecules per cluster (1.4³)
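A quick numeric sketch of this trade-off (the formula is reconstructed from the slide's geometry and the constants are assumptions): computing all pairs out to rc + 2√3·r inflates the work by ((rc + 2√3·r)/rc)³ relative to the exact cut-off sphere, while per-interaction memory traffic falls roughly as 1/r³.

```cpp
#include <cmath>
#include <cstdio>

// Extra-work factor for cluster edge r and cut-off radius rc: all
// pairs out to rc + 2*sqrt(3)*r are computed, so the computed volume
// grows by ((rc + 2*sqrt(3)*r) / rc)^3 versus the exact cut-off
// sphere, while memory traffic falls roughly as 1/r^3.
double extra_work_factor(double rc, double r) {
    double reff = rc + 2.0 * std::sqrt(3.0) * r;
    return std::pow(reff / rc, 3.0);
}

int main() {
    const double rc = 1.0;  // cut-off radius (normalized units)
    for (int k = 1; k <= 5; ++k) {
        double r = 0.1 * k;
        std::printf("r=%.1f  extra work x%.2f  relative bandwidth %.3f\n",
                    r, extra_work_factor(rc, r), 1.0 / (r * r * r));
    }
    return 0;
}
```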
Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions
Conclusions • Reviewed the architecture and organization of Merrimac • Presented the StreamMD application, implemented 4 variants, and evaluated their performance • Compared Merrimac’s suitability for molecular dynamics applications against a conventional Pentium 4 processor
Special Applications? • Merrimac is tuned for scientific applications • Programming model • A collection of streams passes through a series of computational kernels • Needs large data-level parallelism to utilize the FPUs • Task parallelism can only be exploited across nodes because of the SIMD execution model
Easy to Program? • Effective automatic compilation • Communication scheduling and stream scheduling (shown earlier) • Highly optimized code for conventional processors is often written in assembly • Performance of the different StreamMD variants varies only by a factor of 2 (shown earlier)
Compare with Supercomputers? • Comparing only against a Pentium 4 seems unconvincing • MDGRAPE-3 of the Protein Explorer can achieve 165 GFLOPS out of a 200 GFLOPS peak • But it is a special-purpose design • What about vector machines? • Lack of standard benchmarks