Memory Architectures for Protein Folding: MD on a million PIM processors
Fort Lauderdale, May 2003
Overview
• EIA-0081307: "ITR: Intelligent Memory Architectures and Algorithms to Crack the Protein Folding Problem"
• PIs:
  • Josep Torrellas and Laxmikant Kale (University of Illinois)
  • Mark Tuckerman (New York University)
  • Michael Klein (University of Pennsylvania)
• Also associated: Glenn Martyna (IBM)
• Period: 8/00 - 7/03
Project Description
• Multidisciplinary project in computer architecture, software, and computational biology
• Goals:
  • Design improved algorithms to help solve the protein folding problem
  • Design the architecture and software of general-purpose parallel machines that speed up the solution of the problem
Some Recent Progress: Ideas
• Developed REPSWA (Reference Potential Spatial Warping Algorithm)
  • A novel algorithm for accelerating conformational sampling in molecular dynamics, a key element in protein folding
  • Based on a "spatial warping" variable transformation
  • The transformation is designed to shrink barrier regions on the energy landscape and grow attractive basins without altering the equilibrium properties of the system
  • Result: large gains in sampling efficiency
• "Using novel variable transformations to enhance conformational sampling in molecular dynamics," Z. Zhu, M. E. Tuckerman, S. O. Samuelson and G. J. Martyna, Phys. Rev. Lett. 88, 100201 (2002).
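For context, the general principle behind such variable transformations can be sketched as follows. This is the standard Jacobian-weighting argument, not the specific REPSWA transformation (which is defined in the cited paper): sampling in warped variables u, with the Jacobian absorbed into the effective potential, leaves equilibrium averages unchanged.

```latex
\langle A \rangle
  = \frac{\int d\mathbf{x}\, A(\mathbf{x})\, e^{-\beta V(\mathbf{x})}}
         {\int d\mathbf{x}\, e^{-\beta V(\mathbf{x})}}
  = \frac{\int d\mathbf{u}\, J(\mathbf{u})\, A(f(\mathbf{u}))\, e^{-\beta V(f(\mathbf{u}))}}
         {\int d\mathbf{u}\, J(\mathbf{u})\, e^{-\beta V(f(\mathbf{u}))}},
\qquad
\mathbf{x} = f(\mathbf{u}),\quad
J(\mathbf{u}) = \left|\det\frac{\partial f}{\partial \mathbf{u}}\right|
```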
Some Recent Progress: Tools
• Developed LeanMD, a parallel molecular dynamics program that targets very large scale parallel machines
  • Research-quality program based on the Charm++ parallel object-oriented language
  • A descendant of NAMD (another parallel molecular dynamics application), which achieved unprecedented speedups on thousands of processors
• LeanMD is designed to run on next-generation parallel machines with tens of thousands or even millions of processors, such as Blue Gene/L or Blue Gene/C
• Requires a new parallelization strategy that breaks the simulation problem up in a more fine-grained manner, generating enough parallelism to effectively distribute work across a million processors (see the sketch below)
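As an illustration of what "fine-grained" means here, the sketch below (plain C++ with assumed example parameters, not LeanMD's actual code) counts the work units produced by a cell-based spatial decomposition: the box is divided into cells of one cutoff length per side, and every cell and neighboring cell pair becomes an independently schedulable unit.

```cpp
#include <cstdio>

int main() {
    // Assumed example parameters (not taken from the slides).
    const double boxLength = 200.0;  // simulation box edge length
    const double cutoff    = 12.0;   // interaction cutoff
    const int    n         = static_cast<int>(boxLength / cutoff);  // cells per dimension

    const long cells = static_cast<long>(n) * n * n;
    // With periodic boundaries, each cell interacts with itself and with its 26
    // neighbors; counting each pair once gives 13 pair-compute units per cell.
    const long selfComputes = cells;
    const long pairComputes = cells * 13;

    std::printf("cells: %ld, work units: %ld\n", cells, selfComputes + pairComputes);
    // Tens of thousands of independent work units arise from even a modest box,
    // which is what makes distributing work over very many processors possible.
    return 0;
}
```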
Some Recent Progress: Tools
• Developed a high-performance communication library for collective communication operations
  • AlltoAll personalized communication, AlltoAll multicast, and AllReduce
• These operations can be complex and time-consuming on large parallel machines
  • Especially costly for applications with all-to-all patterns, such as 3-D FFT and sorting
• The library optimizes collective communication operations by combining messages over an imposed virtual topology
• The overhead of AlltoAll communication for 76-byte message exchanges among 2058 processors is in the low tens of milliseconds (see the sketch below)
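A small sketch of why message combining over a virtual topology helps (assumed 2D-mesh routing and back-of-the-envelope counts; not the library's actual implementation): instead of sending one tiny message to every peer, each processor sends a few combined messages along rows and then columns of the virtual mesh.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const int P = 2058;        // processors, as in the measurement above
    const int payload = 76;    // bytes destined for each peer

    // Direct all-to-all: every processor sends one small message to each peer.
    const long directMsgs = P - 1;

    // Combining over a virtual 2D mesh: send along rows, then columns,
    // bundling all data headed for the same row/column into one message.
    const int  side            = static_cast<int>(std::ceil(std::sqrt(static_cast<double>(P))));
    const long meshMsgs        = 2L * (side - 1);
    const long meshBytesPerMsg = static_cast<long>(side) * payload;  // combined payload

    std::printf("direct:  %ld messages of %d bytes per processor\n", directMsgs, payload);
    std::printf("2D mesh: %ld messages of ~%ld bytes per processor\n", meshMsgs, meshBytesPerMsg);
    // Far fewer (larger) messages per processor, which hides the per-message
    // software overhead that dominates for small payloads on large machines.
    return 0;
}
```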
Some Recent Progress: People
• The following graduate student researchers have been supported:
  • Sameer Kumar (University of Illinois)
  • Gengbin Zheng (University of Illinois)
  • Jun Nakano (University of Illinois)
  • Zhongwei Zhu (New York University)
Overview
• Rest of the talk:
  • Objective: Develop a molecular dynamics program that will run effectively on a million processors, each with a low memory-to-processor ratio
  • Method:
    • Use the parallel objects methodology
    • Develop an emulator/simulator that allows one to run full-fledged programs on a simulated architecture
• Presenting today:
  • Simulator details
  • LeanMD simulation on BG/L and BG/C
Performance Prediction on Large Machines
• Problem:
  • How to predict the performance of applications on future machines?
  • How to do performance tuning without continuous access to a large machine?
• Solution:
  • Leverage virtualization
  • Develop a machine emulator
  • Simulator: accurate time modeling
  • Run a program on "100,000 processors" using only hundreds of processors
Blue Gene Emulator: functional view
[Diagram: per-node worker threads and communication threads, each with an inBuff and a CorrectionQ, affinity and non-affinity message queues, and the Converse scheduler with its Converse Q]
Emulator to Simulator
• Emulator: study the programming model and application development
• Simulator: adds performance prediction capability
  • Models communication latency based on a network model
  • Does not model on-chip memory access or network contention
• Parallel performance is hard to model
  • Communication subsystem
  • Out-of-order messages
  • Communication/computation overlap
  • Event dependencies
• Parallel discrete event simulation
  • The emulation program executes in parallel, with event time stamp correction
  • Exploits the inherent determinacy of the application
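To make "communication latency based on a network model" concrete, here is a minimal sketch of a linear cost model (all parameter values are assumptions for illustration, not the simulator's actual model): a fixed per-message software overhead, a per-hop delay on the interconnect, and serialization time proportional to message size.

```cpp
#include <cstdio>

// Hypothetical linear network model: software overhead, per-hop wire delay,
// and serialization time proportional to message size.
double predictLatencyUs(long bytes, int hops) {
    const double alphaUs       = 5.0;    // per-message software overhead (assumed)
    const double perHopUs      = 0.1;    // per-hop delay on the torus (assumed)
    const double bytesPerUsec  = 350.0;  // link bandwidth (assumed)
    return alphaUs + hops * perHopUs + bytes / bytesPerUsec;
}

int main() {
    // Example: a 1 KB message crossing 16 hops of the interconnect.
    std::printf("predicted latency: %.2f us\n", predictLatencyUs(1024, 16));
    return 0;
}
```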
How to simulate?
• Time stamping events
  • Per-thread timer (sharing one physical timer)
  • Time stamp messages
  • Calculate communication latency based on the network model
• Parallel event simulation
  • When a message is sent out, calculate the predicted arrival time at the destination Blue Gene processor
  • When a message is received, update the current time as: currTime = max(currTime, recvTime)
  • Time stamp correction (see the sketch below)
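A compact sketch of the per-thread virtual clock described above (hypothetical names; a simplification of the actual simulator): computation advances the thread's virtual time, and a message receive pushes it forward to the predicted arrival time if that is later.

```cpp
#include <algorithm>
#include <cstdio>

// Hypothetical per-thread virtual clock for one simulated processor thread.
struct SimThread {
    double currTime = 0.0;  // virtual time of this simulated thread (us)

    // Called when the thread finishes a block of work that took 'elapsed'
    // microseconds, as measured by the shared physical timer.
    void charge(double elapsed) { currTime += elapsed; }

    // Called when a message arrives; recvTime is the sender's send time plus
    // the latency predicted by the network model.
    void receive(double recvTime) {
        currTime = std::max(currTime, recvTime);
    }
};

int main() {
    SimThread t;
    t.charge(3.0);     // 3 us of computation
    t.receive(10.5);   // message predicted to arrive at t = 10.5 us
    t.charge(2.0);
    std::printf("virtual time: %.1f us\n", t.currTime);  // 12.5
    return 0;
}
```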
Parallel correction algorithm
• Sort message execution by receive time
• Adjust time stamps when needed
  • Use a correction message to announce the change in an event's startTime
  • Send out correction messages along the path the original message took
• Events already in the timeline may have to move (see the sketch below)
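A simplified sketch of the correction step (hypothetical structures; the real algorithm runs in parallel across the emulating processors and propagates corrections along each message's send path): the timeline is kept sorted by receive time, and each event starts no earlier than its receive time or the end of the previous event, with any change in start time triggering further correction messages.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Event {
    int    id;
    double recvTime;   // when the message becomes available
    double duration;   // execution time charged to this event
    double startTime;  // assigned by the correction pass
};

// Recompute start times after a receive time changed.
void correctTimeline(std::vector<Event>& timeline) {
    std::sort(timeline.begin(), timeline.end(),
              [](const Event& a, const Event& b) { return a.recvTime < b.recvTime; });
    double prevEnd = 0.0;
    for (Event& e : timeline) {
        double newStart = std::max(e.recvTime, prevEnd);
        if (newStart != e.startTime) {
            e.startTime = newStart;
            // In the real simulator, a correction message would now be sent to
            // every destination this event sent a message to.
        }
        prevEnd = e.startTime + e.duration;
    }
}

int main() {
    std::vector<Event> tl = {
        {1, 0.0, 2.0, 0.0}, {2, 1.0, 2.0, 2.0}, {3, 9.0, 1.0, 9.0}
    };
    tl[2].recvTime = 3.0;  // a correction message lowers M3's receive time
    correctTimeline(tl);
    for (const Event& e : tl)
        std::printf("M%d starts at %.1f\n", e.id, e.startTime);
    return 0;
}
```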
Timestamps Correction
[Animation: execution timelines of messages M1-M8 ordered by receive time; M8 is moved earlier in the timeline, and correction messages for M4 reposition the later events]
Validation
[Plot: predicted time vs. latency factor]
LeanMD
• LeanMD is a molecular dynamics simulation application written in Charm++
• The next generation of NAMD, the Gordon Bell Award winner at SC2002
• Requires a new parallelization strategy
  • Break up the problem in a more fine-grained manner to effectively distribute work across an extremely large number of processors
LeanMD Performance Analysis
[Note: graphs need to be readable; one to a page is fine, but with larger fonts and thicker lines]