Dynameomics
Valerie Daggett
Bioengineering Department, Biomedical and Health Informatics, University of Washington, Seattle, WA
• Background
• HT MD – Target Selection – Database Mining
• Native DB
• Reference unfolded peptide DB
• Mining
• Unfolding Protein DB
• Prion Protein and amyloid DB
Central dogma of biology
• Genomes: DNA → (transcription) → RNA → (translation) → Protein → Life
• DNA, information to make protein: …AAAGTCCAGGCAGAATATAATTCTATAAAG GGAACTCCTTCAGAGGCTGAAATCTTT…
• RNA, template to make protein
• Protein, function and phenotype: …LEVVAATPTSLLISWDAPAVTVRYYTYGETGGNSPVQEFTVPGS…
• Motion critical
Dynamic cleft discovered through MD Cytochrome b5 Storch et al., Biochem, 1995, 1999a,b, 2000
Protein folding embedded
• Genomes: DNA → (transcription) → RNA → (translation) → Protein → Life
• Protein un/folding problem: D (denatured, biologically inactive) vs. N (native, biologically active)
• Unknown (?): the process or pathway connecting D and N
Unfolding pathway of CI2 in water [Simulation contains 500,000 structures]
• 373 K: N (94 ns), TS (21 ns), D (30 ns), D (94 ns)
• MD unfolding process in good agreement with experiment
• TS in quantitative agreement with experiment (a prediction)
• Residual structure in D verified experimentally
• Atomic-level characterization of transition, intermediate and denatured state ensembles
Daggett and Fersht, TIBS, PNAS, +
Conformational ensembles in folding: N, TS, D
• 100 simulations
Day and Daggett, PNAS, 2005
Refolding by quenching TS
[Figure: Cα RMSD (Å, 0 to 8) vs. time (ns, 0 to 3) for the native control (N), the TS and 'D']
• Brute force MD can refold proteins from the TS
• Plan: predict TS structures, perform MD simulations and solve protein folding problem
• But we need info to predict TS (TS easier than D)
DeJong et al., JMB 2002
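The Cα RMSD plotted on this slide can be computed per frame against a reference structure. The following is a minimal sketch, assuming coordinates are available as NumPy arrays of shape (N, 3) and that frames are superimposed on the reference with the Kabsch algorithm before the deviation is measured. The slides do not say which superposition method was used, and the function names here are illustrative only.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both sets on their centroids.
    P_c = P - P.mean(axis=0)
    Q_c = Q - Q.mean(axis=0)
    # Covariance matrix and its SVD give the optimal rotation (Kabsch).
    H = P_c.T @ Q_c
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = P_c @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Q_c) ** 2, axis=1))))

def rmsd_trace(frames, reference):
    """RMSD of every frame in a trajectory against one reference structure."""
    return [kabsch_rmsd(frame, reference) for frame in frames]
```

Plotting `rmsd_trace(frames, native)` against time would give a curve of the kind shown for the N, TS and 'D' trajectories.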
Reversible folding and unfolding
[Figure: crystal structure (Xtal) and snapshots at 5 ns, 25.6 ns and 200 ns, with residues A16, I20, L49 and I57 labeled and distances of 4.8 Å, 4.0 Å and 8.9 Å marked]
• 348 K in water, the Tm of the protein
• And, refolding = unfolding: detailed pathway reversed
• A16/I20 orientation maintained
Day and Daggett, JMB, 2007
McCully et al., Biochem in press (EnHD)
Reverse central dogma of biology
• Determine pathways for many proteins, ascertain general features
• D (denatured, biologically inactive) vs. N (native, biologically active); unknown (?): the process or pathway
• Decode genomes: Protein → RNA → DNA
Proteins • Proteins are life’s machines, tools and structures • Many jobs, many shapes, many sizes
Dynameomics
Goals:
1. Perform HT MD simulations of representatives of all folds (41,000 structures in PDB → 1130 fold families)
2. Construct a novel relational/multidimensional database to house these data and facilitate discovery
• Native state: information relevant to disease and drug design targets, SNPs
• Unfolding: disease and solution to protein folding problem
[Diagram labels: NERSC, DOE, Unix, The Wall, Windows, Athena @ MS]
Beck et al., Prot Eng Des Sel, 2008
Fold space
[Figure: population (0 to 700) vs. fold rank (0 to 200), and coverage (0 to 1.0) vs. fold rank (0 to 1000)]
• 30 folds represent ~50% of known protein structures
• Divide protein structures into folds
• Consensus of SCOP, CATH and Dali
• Rank folds based on population
• Choose a representative protein from each fold
Day et al., Prot. Sci., 2003
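The ranking and coverage idea on this slide can be sketched as follows: sort folds by population, accumulate the fraction of structures covered, and read off how many top-ranked folds are needed to reach a given coverage. The function names and the toy populations in the usage note are illustrative and are not taken from the Dynameomics data.

```python
from itertools import accumulate

def coverage_by_rank(populations):
    """Cumulative fraction of structures covered by the top-k folds, for k = 1..n."""
    ranked = sorted(populations, reverse=True)  # rank 1 = most populated fold
    total = sum(ranked)
    return [running / total for running in accumulate(ranked)]

def folds_needed(populations, target=0.5):
    """Smallest number of top-ranked folds whose coverage reaches the target."""
    for rank, covered in enumerate(coverage_by_rank(populations), start=1):
        if covered >= target:
            return rank
    return len(populations)
```

For a toy set of populations such as `[50, 30, 10, 5, 5]`, `folds_needed(..., target=0.85)` returns 3, because the top three folds cover 90% of the structures.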
Target selection
• Selection criteria
  • Structure quality
  • Protein size
  • Experimental data available
  • Biomedical relevance
  • First globular, then membrane proteins
• Example: CheY [PDB:3chy], Rank 2, population 424
Amanda Jonsson
Targets with biomedical relevance
• Proteins: amyloid-β precursor protein, HIV-1 protease, glutathione S-transferase, triosephosphate isomerase, MAP30, serum amyloid P component
• Associated conditions: Alzheimer’s disease, HIV, chemotherapy resistance, amyloidosis, HIV and cancer, neurodegeneration
Top 30 folds
• Represent ~50% of all known protein structures
• Data and metadata for ‘Top 30’ at www.dynameomics.org
Dynameomics protocol
• One 298 K native state simulation (21-60 ns, <26 ns>)
• At least three 310 K native simulations (some)
• At least five 498 K unfolding simulations
  • Two long simulations (at least 31 ns, <36 ns>)
  • At least three short simulations (2 ns, <14 ns>)
• (5 simulations ~ 100 simulations)
• Trade-off: sampling of different folds and different sequences, as opposed to more thorough sampling of an individual protein (~400 simulations of PrP)
Validation of Trajectories
• Computational checks: energy conservation
• Native state: NOEs, S² order parameters from NMR relaxation experiments, etc.
• Unfolding process: Φ-values, residual structure in denatured state, intermediates
David Beck
Native State Simulations: Ubiquitin
• NOEs satisfied (2727 total): MD 95.2%, crystal structure (XTAL) 94.4%
• Proton chemical shifts: R = 0.98
Comparison with available NMR • The 27 proteins with available data (by PDB code) are: 1aa3, 1c06, 1d1r, 1gle, 1kjs, 2ife, 3gcc, 1bf0, 1cmz, 1cok, 1cz4, 1d1n, 1d8v, 1enh, 1fad, 1fvl, 1fzt, 1ght, 1i11, 1iyu, 11dl, 1mut, 1sso, 1tfb, 1ubq, 1uxc, 3chy. • Proton chemical shifts from MD structures were calculated with SHIFTS (Osapay and Case, 1991). The 15 proteins with data available (by PDB code): 1mjc, 1hcc, 1ubq, 1baz, 1cz4, 1a2p, 1e65, 1ill, 3chy, 1ght, 1cmz, 1gpr, 1byl, 1fzt, 1b10.
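Agreement between calculated and experimental chemical shifts is reported on these slides as a correlation coefficient R. The following is a minimal sketch of that comparison, assuming R is the ordinary Pearson correlation between paired calculated and experimental values. The slides do not define R explicitly, and the function name is illustrative.

```python
import math

def pearson_r(calculated, experimental):
    """Pearson correlation between paired calculated and experimental values."""
    n = len(calculated)
    if n != len(experimental) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_c = sum(calculated) / n
    mean_e = sum(experimental) / n
    # Covariance and variances, all up to the same factor of 1/n.
    cov = sum((c - mean_c) * (e - mean_e) for c, e in zip(calculated, experimental))
    var_c = sum((c - mean_c) ** 2 for c in calculated)
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    return cov / math.sqrt(var_c * var_e)
```

In practice the calculated values would come from a shift predictor applied to MD structures and averaged over the trajectory, and the experimental values from the corresponding NMR assignments.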
Dynameomics status
• Dataset includes over 500 proteins and nearly 4000 simulations, for a total of >60 µs of simulation time and >65M structures
• >64 TB
• Not including 637 amyloid simulations
Comprehensive data/metadata In theory, build a warehouse Andrew Simms
Build a data warehouse (not so easy)
• The data set is large… (~6 months to load protein coordinates)
  • Storing protein data only, no solvent data
  • Only a single simulation per table (10M – 90M rows)
  • 4000 simulations × 10 analyses right now (40K tables)
  • And we are growing at a rate of ~2000 simulations per year (10K tables)
• Approach for scaling...
  • Multiple servers
  • Multiple databases per server
  • 100 targets per database
Andrew Simms
Simms et al., Prot Eng Des Sel, 2008
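The scaling approach on this slide (multiple servers, multiple databases per server, 100 targets per database) can be sketched as a simple placement rule. The figure of 100 targets per database comes from the slide. The number of databases per server, the zero-based target index and the function names are assumptions made only for illustration, not the layout actually used.

```python
def locate_target(target_index, targets_per_db=100, dbs_per_server=4):
    """Map a zero-based target index to a (server, database) pair.

    targets_per_db=100 follows the slide; dbs_per_server=4 is a hypothetical value.
    """
    db_global = target_index // targets_per_db  # consecutive blocks of targets share a database
    return db_global // dbs_per_server, db_global % dbs_per_server

def table_count(simulations, analyses_per_simulation):
    """One table per (simulation, analysis) pair, as described on the slide."""
    return simulations * analyses_per_simulation
```

With these assumptions, targets 0 to 99 share database 0 on server 0, target 100 starts database 1, and `table_count(4000, 10)` reproduces the 40K tables quoted on the slide.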
Though our data set may be large, our requirements are typical in the scientific world
• Large, complex and often multidimensional data sets
• Analytical rather than transactional processing
• Need for performance and storage efficiency
• Multi-D cubes for complex data analysis
• On-line analytical processing (OLAP)
• MOLAP: multidimensional OLAP
Catherine Kehl
Molecular Dynamics
• MD provides atomic resolution of native dynamics
[Figure: 3chy, waters and hydrogens hidden]
[Figure: native state simulation of 3chy at 298 K, Asp 57]
Native-state dynamics: helix motion
[Figure: CheY at 298 K; helix angle (degrees) and its standard deviation for helix pairs α2:α3 and α3:α4; snapshots at 0, 5, 10, 15 and 20 ns with helices α2 to α5 labeled]
• α2 and α3 dynamic; α4 and α5 a stable structural scaffold
CheY – Binding partners
• Structures of CheY complexes (CheY-CheZ, CheY-CheA, CheY-FliM) show binding to α4 and α5
[Figure: α4:α5 distances between ends of helices; 20 ns snapshot with helices α2 to α5 labeled]
• Functionally important face of protein stable
• Asp 57, phosphorylation
• Motion in α2 and α3 does not disrupt function, entropy sink?
Rudesh Toofanny
Catechol O-methyltransferase (COMT) and CheY
• Both proteins: Rank 2 Rossmann fold
• COMT polymorphism: Val108 → Met
• 108M: increased risk for diseases such as breast cancer and OCD
• Improved memory
• α6 and α7 mobile in COMT, too
• In 108M, movement of α6 propagates 16 Å and disrupts the active site
[Figure: MD snapshots of 108M at 15 ns and 30 ns]
• Importance of characterizing dynamics
Rutherford et al., Biochem. 2006
SNP-induced changes in COMT
[Figure: 108V and 108M structures with helices α6, α7 and α8 labeled; panels marked native-like and intermediate]
• Mutation to Met leads to loosening of the active site
• Followed up with CD, NMR, crystallography, fluorescence
Rutherford et al., BBA, JMB, JMB, Biochem, 2008
SNP leads to broader conformational ensemble at 310 K
[Figure: Cα-RMSD distributions (Å) for 108V COMT and 108M COMT: starting structure, 25 °C, 37 °C, 50 °C]
SNP-omics
• COMT: SNP leads to subtle differences in packing near the mutation site that propagate to the active site
• Similar behavior now seen in 4 other members of this methyltransferase family (fold rank 2)
• Effects NOT apparent in static structures
• Large-scale effort to investigate dynamic effects of SNPs, starting with 80 proteins: Dynameomics protocol plus multiple 310 K simulations
SLIRP
• Structural Library of Intrinsic Residue Propensities (SLIRP) to determine structural propensities for design
• GGXGG peptides in water at 298 K and 498 K, and in 8 M urea at 298 K (multiple simulations, 100 ns)
• Unbiased coil library, main chain and side chain, exhaustive sampling
• Dynamic protein side chain rotamer library
  • Rotamer populations, improved over static ones from crystal structures
  • S²axis, waiting times between rotamers
“Random Coil” Peptides: Ala
[Figure: Φ (°) distributions for GGAGG, Protein-MD and Protein-PDB; populations labeled 16%, 26%, 4%, 26%, 24%]
• Chemical shifts of HN, Hα, Hβ, NH, Cα, Cβ and C′ for GGAGG are very close to the corresponding experimentally derived values (R = 0.999 over 28 points, 7 atoms × 4 independent simulations).
Chemical shifts for GGXGG: MD and Expt Predictions calculated with ShiftX v1.0 (Neal et al., 2003, J Biomol NMR) Experimental data taken from Schwarzinger et al., J. Biomol NMR, 2000
“Random Coil” Peptides vs. Protein: Ala
[Figure: Φ (°) distributions for GGAGG, Protein-MD and Protein-PDB]
• Ala in protein MD distributions (188 proteins) similar to PDB
• Ala in GGXGG different
• GGAGG vs experimental helix propensities, R = 0.28
• Protein MD vs helix propensities, R = 0.92
• Host-guest studies reflect the host more than the guest
Mining the database
• SLIRP to determine structural propensities for design
• Dynamic area conserved in members of a protein family: in one case it is critical for biological function, and in another a mutation in the region leads to disease
• Inflexible region across 188 proteins: identified novel structural elements associated with loop structure (antifreeze)
Rudesh Toofanny, Noah Benson
Solving the protein folding problem?
[Diagram: unfolding, N → TS → D; refolding, marked with question marks, among D, TS and N]
• Data mining of the Dynameomics database for information to predict TS structures
• Bootstrapping to native state prediction by refolding from predicted transition state structures
Dustin Schaeffer
Contact analysis
• Determined contact probabilities by amino acid type and by sequence separation between the amino acids, from mining of the Dynameomics DB
[Figure: contact probabilities for contacts i → i+x as a function of residue type 1, residue type 2 and residue separation (i→i+1, i→i+2, i→i+3); example: Leu-Leu, i → i+3]
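A contact-probability table of this kind can be tallied from simulation frames roughly as follows. This is a sketch under stated assumptions: one representative point per residue, a single distance cutoff (the 6.0 Å default is hypothetical, not the criterion used for the Dynameomics database), and illustrative function and parameter names.

```python
import math
from collections import defaultdict

def contact_probabilities(sequence, frames, cutoff=6.0, max_sep=4):
    """Fraction of observations in which residues i and i+sep are in contact.

    sequence: one-letter residue types, one per residue.
    frames:   list of frames; each frame is a list of (x, y, z) points, one per residue.
    Returns a dict keyed by (residue type 1, residue type 2, separation).
    """
    contacts = defaultdict(int)
    observations = defaultdict(int)
    n = len(sequence)
    for coords in frames:
        for i in range(n):
            for sep in range(1, max_sep + 1):
                j = i + sep
                if j >= n:
                    break
                key = (sequence[i], sequence[j], sep)
                observations[key] += 1
                if math.dist(coords[i], coords[j]) <= cutoff:
                    contacts[key] += 1
    return {key: contacts[key] / observations[key] for key in observations}
```

Looking up `("L", "L", 3)` in the result would then give the Leu-Leu, i → i+3 probability highlighted on the slide, pooled over whatever frames were supplied.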
Coordinates from contacts
• Most probable contacts → distance geometry (DG) → protein structure
• A set of distances for a particular sequence can be converted into coordinates by singular value decomposition (SVD) of a distance matrix: distance geometry
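The distance-to-coordinates step can be illustrated with the classical metric-matrix embedding. This is a sketch, not necessarily the procedure used in the talk: it assumes a complete and exact distance matrix, and it uses an eigendecomposition of the centered Gram matrix, which for a symmetric positive semidefinite matrix plays the same role as the SVD named on the slide. Function names are illustrative.

```python
import numpy as np

def pairwise_distances(X):
    """All-against-all Euclidean distances for an (N, d) coordinate array."""
    X = np.asarray(X, dtype=float)
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

def embed_from_distances(D, dim=3):
    """Recover coordinates (up to rotation, reflection and translation) from distances."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Double-center the squared distances to obtain the Gram (metric) matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (D ** 2) @ J
    # Keep the 'dim' largest eigenpairs; clip tiny negative eigenvalues from round-off.
    w, V = np.linalg.eigh(G)
    top = np.argsort(w)[::-1][:dim]
    return V[:, top] * np.sqrt(np.clip(w[top], 0.0, None))
```

Distances predicted from contact statistics are generally incomplete and mutually inconsistent, so a real application would need bounds smoothing or refinement on top of this step.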
TS predictions for Fyn SH3
• Prediction from mined data via distance geometry (too compact)
• MD-generated TS ensemble
• RMSD = 3.8 ± 0.37 Å
Solving the folding problem with MD
• Sequence → (DB info + DG) → TS structure → (MD) → N structure
• We have TSs for 80% of known protein structures
• We have refolded from TS
• High-throughput structure prediction should be possible by refolding from transition states
Dynameomics Conclusions
• Native state simulations to probe protein function, for drug design, SNP-omics
• Unfolding simulations for structure prediction, protein design/redesign, unfolding diseases
• SLIRP, the Structural Library of Intrinsic Residue Propensities: intrinsic mainchain conformations, dynamic side chain rotamer library, coil library
• Dynameomics.org
Noah Benson