Folding @ Home - Distributed Parallel Protein Folding
Chris Garlock
Protein Folding - Why is it important?
• Proteins are biological nano-machines that play a part in all of our bodies' functions.
• Protein folding is the process all proteins undergo to assemble into their native structure.
• Folding proteins tend toward states of low free energy.
• Sometimes proteins misfold, and misfolded proteins can clump together (aggregate), which can cause serious health problems:
  • Alzheimer's disease
  • Cystic fibrosis
  • Mad cow disease
  • Several types of cancer
How is protein folding simulated on a computer?
• Atomic-level simulations
• Newtonian mechanics (lots of numerical integration)
• Simulating even a small amount of time at the molecular level takes a large amount of computational time, because the time step must be sufficiently small.
• Proteins can fold in many ways, so a statistical model is needed to represent all of the possibilities accurately.
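To make "lots of numerical integration" concrete, here is a minimal sketch of one common integration scheme, velocity Verlet, applied to a single particle on a spring. The slides do not say which integrator F@H uses, so the scheme, the toy force, and every name below (`velocity_verlet`, `spring_force`, `total_energy`) are assumptions chosen for illustration only.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equation for one particle in 1D.

    Returns a list of (position, velocity) pairs, one per time step,
    including the initial state.
    """
    trajectory = [(x, v)]
    a = force(x) / mass
    for _ in range(n_steps):
        # Advance the position using the current velocity and acceleration.
        x = x + v * dt + 0.5 * a * dt * dt
        # Recompute the acceleration at the new position.
        a_new = force(x) / mass
        # Advance the velocity using the average of old and new acceleration.
        v = v + 0.5 * (a + a_new) * dt
        a = a_new
        trajectory.append((x, v))
    return trajectory


def spring_force(x, k=1.0):
    """Toy force (a harmonic spring); a real simulation sums forces over all atoms."""
    return -k * x


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the toy system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


# 1000 steps of size 0.01 cover only 10 time units of motion.
traj = velocity_verlet(x=1.0, v=0.0, force=spring_force, mass=1.0, dt=0.01, n_steps=1000)
print("energy at start:", total_energy(*traj[0]))
print("energy at end:  ", total_energy(*traj[-1]))
```

The total energy should stay close to its starting value when the step is small enough. Each step only advances the system by `dt`, which is why long molecular timescales need so many steps.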
Markov State Models
• Protein folding is stochastic in nature:
  • Its probability distribution can be analyzed statistically, but individual outcomes cannot be predicted precisely.
• A Markov State Model (MSM) describes all of the conformations (shapes) a protein explores as a set of states (i.e. distinct structures) and the transition rates between them.
• MSMs facilitate parallelization by allowing statistical aggregation of short, independent simulation trajectories, replacing the need for unrealistically long trajectories.
• Through adaptive sampling, MSMs allow us to increase simulation efficiency.
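As a rough illustration of how short, independent trajectories can be aggregated into one model, the sketch below counts observed state-to-state transitions and normalizes each row into probabilities. It assumes the trajectories have already been assigned to integer state labels; the function name `estimate_transition_matrix` and the sample data are hypothetical, and real MSM construction involves more than this simple counting estimate.

```python
def estimate_transition_matrix(trajectories, n_states):
    """Estimate transition probabilities from several short state sequences.

    Transitions are counted within each trajectory only, so independent
    trajectories are pooled without pretending they are one long run.
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for current, following in zip(traj, traj[1:]):
            counts[current][following] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no observations are left as zeros rather than guessed.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Three short, made-up trajectories over states 0, 1 and 2.
trajectories = [
    [0, 0, 1, 1],
    [1, 2, 2, 2],
    [0, 1, 2, 2],
]
T = estimate_transition_matrix(trajectories, n_states=3)
for row in T:
    print(row)
```

No single trajectory here needs to cover the full path from state 0 to state 2. The pooled counts still connect the states.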
What is Adaptive Sampling?
• The conventional approach to all-atom molecular dynamics has two steps:
  • First, run a set of simulations.
  • Second, analyze the resulting data.
• The Markov State Model/Adaptive Sampling approach interleaves these two steps.
  • Instead of building the model after the data has been collected, the model is built on the fly as data is generated.
  • The current state of the model is used to inform the progress of further simulations.
• Imagine exploring a maze. You have no map, but you do have a GPS that displays the parts of the maze you have explored.
  • The conventional approach is akin to sticking the GPS in your pocket and walking around blindly, bumping off walls until you are exhausted, and only then taking out the GPS to see the structure of the maze. You will probably notice that you wasted lots of time stuck in parts of the maze.
  • The Adaptive Sampling approach is like watching the GPS as you walk. You can identify when you are stuck in a certain part of the maze and avoid re-exploring parts that you are confident you have fully explored.
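One simple way to "watch the GPS" is to start new simulations from the states that have been visited least so far. The sketch below shows that idea only; the least-visited criterion is an assumption made for illustration, not necessarily the rule F@H applies, and `pick_start_states` is a hypothetical name.

```python
from collections import Counter


def pick_start_states(trajectories, n_new):
    """Return the n_new least-visited states seen so far.

    Ties are broken by state label so the result is deterministic.
    Only states that have already been discovered can be selected.
    """
    visits = Counter(state for traj in trajectories for state in traj)
    ranked = sorted(visits, key=lambda state: (visits[state], state))
    return ranked[:n_new]


# Made-up data: state 0 is heavily sampled, state 2 was seen only once.
observed = [
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 2, 0],
]
seeds = pick_start_states(observed, n_new=2)
print("start the next simulations from states:", seeds)
```

In an adaptive loop, the selection would be repeated after each batch of new data, so effort keeps moving away from regions that are already well explored.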
How to parallelize folding simulations
1. To start a simulation project, first choose some initial conformations (protein shapes).
2. Each conformation becomes the starting point for a set of simulations, which together are called a run.
3. Within each run we launch many different trajectories, each called a clone.
   • All clones in a run start from the same conformation but with different initial velocities for the atoms involved.
4. Because each clone takes a large amount of time to execute, clones are further divided into generations. The generations of a clone have to be run serially.
5. Some clones may find additional conformations (states of equilibrium), in which case new runs are started from those conformations.
6. Repeat steps 2-5 until the Markov State Model is complete.
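The run, clone, and generation hierarchy can be pictured as a queue of independent work units. The sketch below is only a bookkeeping illustration under that reading of the slide; the `WorkUnit` class and both helper functions are hypothetical and do not describe F@H's actual server code.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class WorkUnit:
    run: int         # which starting conformation
    clone: int       # which set of initial velocities
    generation: int  # which consecutive segment of that clone's trajectory


def initial_work_units(n_runs, clones_per_run):
    """Generation 0 of every clone is independent, so all can be dispatched at once."""
    return [
        WorkUnit(run, clone, 0)
        for run, clone in product(range(n_runs), range(clones_per_run))
    ]


def next_generation(finished):
    """A clone's next generation can only be issued after the previous one returns."""
    return WorkUnit(finished.run, finished.clone, finished.generation + 1)


queue = initial_work_units(n_runs=3, clones_per_run=5)
print(len(queue), "work units can run in parallel")
print("after", queue[0], "comes", next_generation(queue[0]))
```

The parallelism comes from runs and clones. Generations add no parallelism, they only cut one long trajectory into pieces short enough to hand out.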
[Diagram: growth of a simulation project. An initial conformation A is generated and a run is started for A (comprising 5 clones). That run discovers additional conformations, so runs are started for B and C. Those runs discover additional conformations and new pathways, so runs are started for D and E. A misfold condition is discovered.]
How long do simulations take to run?
• The time it takes to run one of F@H's simulations depends on a variety of factors:
  • Number of amino acids
    • F@H's simulations run in polynomial time in the number of amino acids.
  • Resolution: higher resolution means more conformations in a model.
    • Resolution is adjusted by varying the definition of what it means for two conformations to interconvert (switch between each other).
    • A high-resolution model requires two conformations to interconvert on a nanosecond timescale for them to be grouped into the same state. High-resolution models are appropriate for making quantitative comparisons with experiments done in the lab.
    • A more coarse-grained, low-resolution model only requires two conformations to interconvert on a microsecond timescale. This yields fewer, larger states that are more suitable for human insight.
• Because the number of threads we can put to effective use is limited, simulations for several different proteins are run simultaneously.
• Building one MSM can take up to a year, but simulations are given staggered starts, so a new simulation finishes about once or twice a week.
How does the MSM approach achieve linear speedup?
• Probability density for a protein moving from one conformation to another at time t: $T_{serial}(t) = k e^{-kt}$, where k is the folding rate.
• Traditionally this means you must simulate on the order of 1/k of real time to have a reasonable likelihood of capturing a folding event.
• Instead of simulating a single protein molecule, we simulate N molecules in parallel and wait for the first simulation to fold into a new conformation.
• This gives $T_{parallel}(t) = Nk e^{-Nkt}$.
• The parallel folding rate is N times faster than the folding rate for a single simulation.
• What would normally take 30 years on a single CPU could be simulated in roughly 10 days using 1000 CPUs.
• Complications to overcome when we have to capture several folding events:
  • Because of the low-energy nature of conformations, we can waste lots of time lingering in intermediate states.
  • The solution is to identify intermediate states and speed the transition through those states.
  • We also need the ability to sample trapped states without unproductively lingering in them.
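The factor-of-N claim can be checked numerically. If each simulation's waiting time to fold is exponentially distributed with rate k, then the earliest of N independent waiting times is exponentially distributed with rate Nk, so its mean is 1/(Nk). The sketch below estimates both means by random sampling; the function name, the trial count, and the choices k = 1 and N = 10 are arbitrary values picked for illustration.

```python
import random


def mean_first_fold_time(n_parallel, k, n_trials, seed=0):
    """Average waiting time until the first of n_parallel simulations folds.

    Each simulation's folding time is drawn from an exponential
    distribution with rate k, matching the single-rate model above.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        total += min(rng.expovariate(k) for _ in range(n_parallel))
    return total / n_trials


k = 1.0
serial = mean_first_fold_time(n_parallel=1, k=k, n_trials=20000)
parallel = mean_first_fold_time(n_parallel=10, k=k, n_trials=20000)
speedup = serial / parallel
print("mean wait, 1 simulation:  ", serial)
print("mean wait, 10 simulations:", parallel)
print("estimated speedup:        ", speedup)
```

The estimated speedup should land near 10, the number of parallel simulations. This covers a single folding event only; the complications listed above arise once several events in sequence have to be captured.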