BioPilot: Data-Intensive Computing for Complex Biological Systems

BioPilot: Data-Intensive Computingfor Complex Biological Systems Pacific Northwest National LaboratoryOak Ridge National Laboratory Sponsored bythe Department of EnergyOffice of Science Program Managers:Drs. Gary Johnson and Anil Dean

Goals for enabling data intensivecomputing in biology Biological research is becoming a high-throughput, data-intensive science, requiring development and evaluation of new methods and algorithms forlarge-scale computational analysis, modeling, and simulation Computational algorithms for proteomics • Improve the efficiency and reliability of protein identification and quantification. • Peptide identification using MS/MS using pre-computed spectral databases • Combinatorial peptide search with mutations, post- translational modifications and cross-linked constructs. Inference, modeling, and simulation of biological networks • Reconstruction of cellular network topologies using high-throughput biological data • Stochastic and differential equation-based simulations of the dynamics of cellular networks. Integration of bioinformatics and biomolecular modeling and simulation • Structure, function, and dynamics of complex biomolecular system in appropriate environments. • Comparative analysis of large-scale molecular simulation trajectories • Event recognition in biomolecular simulations • Multi-level modeling of protein models and their conformational space.

NWCHM Integral API ENERGY QMD Classical Force Field DFT Geometry GRADIENT QM/MM Comparative Trajectory Analysis Tools SCF: RHF UHF ROHF Basis Sets OPTIMIZE ET MP2: RHF UHF PEigS QHOP DYNAMICS MP3: RHF UHF pFFT THERMODYNAMICS MP4: RHF UHF LAPACK INPUT ESP RI-MP2 BLAS PROPERTY VIB CCSD(T): RHF MA PREPARE CASSCF/GVB Global Arrays ANALYZE MCSCF ecce MR-CI-PT ChemIO CI: Columbus Full Selected Comparative molecular trajectoryanalysis with BioSimGrid The Scientific Challenge: • Routine protein simulations generate TB trajectories • Routine analysis tools (ED) require multiple passes • Correlation analyses have random access patterns • Comparative analyses require multiple/distributed trajectory access Integrated molecular simulationand comparative moleculartrajectory analysis: The analysis of molecular dynamicsTrajectories presents a large,data-intensive analysis problem: • Enzymatic reaction mechanisms • Molecular machines • Molecular basis of signal and material transport T.P.Straatsma, Computational Biology and Bioinformatics

Building accurate protein structures • Homology models can be built from related protein structures, but even with proper alignment, usually have about 4A RMS error–too large to permit meaningful computational simulation/molecular dynamics.Goal to reduce the errors in initial models d • We multi-align the group of neighbors using sequence and structure information to find the most stable parts of the domain. Extracted stable core structure is 2030% closer in terms of RMSD to the target than to any of the original templates • More flexible parts of target are modeled locally by choosing most sequentially similar loops from the library of local segments found in all the homologues • A genetic algorithm conformation search strategy using Cray XT3 uses backbone and sidechain angles as parameters.Each GA step is evaluated by minimization A computed template compared to the target, improved from 3.4A initially to 2.3A E. Uberbacher, Biological and Environmental Sciences Division

High-resolution protein docking • High-throughput protein interaction data generation provides many interacting proteins. However, the best docking programs (Glide, GOLD) recognize proper docking site and position between known docking partners a small fraction of the time • We take advantage of a reduced search space for contacting protein surfaces but sample all relative conformations on a grid • Surface normals are anti-aligned with those on the other protein surface in the docking procedure. Complexity of the search is r * N1 * N2 with steric clashes filtered Surface normals for human Ras-related protein Rap-1A [PDB:1C1Y, shown in CPK (orange) and RAF kinase (cyan)] • Bayesian scoring evaluates each configuration based on specific residue pair distances at each interface: Score=sum of Log ratios for interface/decoys contact frequencies over all contacts in the patch • Scoring scheme is effective at separating false configurations from true. It recognizes 1/3 of correct configurations as the top choice and a second third within the top three choices. Blue curve decoys, yellow curve true A. Gorin, Computer Science and Mathematics Division

R R R R NH3-HCa-CO- NH-HCa-CO- NH-HCa-CO- NH-HCa-COOH Predictive proteome analysis using advanced computation and physical models Data sizes: 30,000 to 200,000 spectra from a single experiment to becompared with as many as millions of theoretical spectra Scientific Challenge: • Computational tools identify only 1020%of spectra from proteomics analysis: • Fragmentation patterns are highly sequence-specific • Generic patterns result in low identification rate • Physical models not used in current analyses becauseof computational expense W.R.Cannon, Computational Biology and Bioinformatics

In-depth analysis of MS proteomicsdata flows Mass-spectrometry-based proteomics is one of the richest sources of the biological information, but the data flows are enormous (100,000s of samples each consisting of 100,000s of spectra) Computationally, both advanced graph algorithms and memory-based indexes (> 1 TB are required forin-depth analysis of the spectra) Two principally new capabilities demonstrated • Quantity of identified peptides is 100% increased • Among highly reliable identifications, increase is many fold A. Gorin, Computer Science and Mathematics Division

6 5 4 Queries per (proc*minute) 3 work factor ideal 2 1 0 0 50 100 1000 10000 number of processors 120 80 scaling factor MPP2 ALTIX HTCBLAST 40 0 0 50 100 150 number of processors ScalaBLAST Scientific Challenge: • Standalone BLAST application would take >1 year for IMG (>1.6 million proteins) vs IMG and >3 years for IMG vs nr • “PERL script” approach breaks even modest clusters becauseof poor memory management Good scaling on both shared and distributed memory architectures C.S.Oehmen, Computational Biology and Bioinformatics

1500 1250 1000 average running time (seconds) 750 500 250 0 1 2 3 4 5 6 7 8 9 10 11 12 number of phosphorylation binding site Stochastic simulations of biological networks Scientific Challenge: • Speedups ~40-200 • To model microbial communities, the smallest realistic model would include > 109 cells x 100 reactions ~ 1011 reactions • Currently can handle ~ 104106 reactions on a single-processor machine Probability-weighted dynamic Monte Carlo method • J. Phys. Chem. B 105, 11026, 2001 • Speedup over exact Gillespie SSA ~25 Multinomial Tau-leaping method H. Resat, Computational Biology and Bioinformatics

2006 Feb 2; 439(7076):608 Uncovering mechanismsof transcriptional control Stochastic fluctuations in molecular populations have consequences for gene regulation. Our theoretical and simulation studies predicted that noise frequency content is determined by underlying gene circuits, leading to mapping between gene circuit structure and noise frequency range. Predictions are validated experimentally Model of shift of frequency range distribution shape due to negative feedback. Red bars (e) represent unregulated circuit distribution;blue bars represent distributionfor circuit with negative auto-regulation Dashed box and arrow show shiftof the low-frequency trajectoriesto center of distribution whilehigher-frequency trajectoriesare unaffected N. Samatova, Computer Science and Mathematics Division

Contacts Dr. Anil DeanProgram Manager (301) 903-1465 e-mail: deane@mics.doe.gov Dr. Gary Johnson Acting Program Manager (301) 903-5800 e-mail: garyj@er.doe.gov 11 Presenter_date

BioPilot: Data-Intensive Computing for Complex Biological Systems

BioPilot: Data-Intensive Computing for Complex Biological Systems

Presentation Transcript

High Performance Cluster Computing Architectures and Systems

Database Systems Research on Data Mining

Welcome to Intensive Reading!

Lecture #1: Embedded Computing Systems - An Overview

Data-Intensive Scalable Science: Beyond MapReduce

May 2 , 2013 Judy Qiu xqiu@indiana.edu http :// S A L S A hpc.indiana.edu School of Informatics and Computing Indian

Big Data and Cloud Computing: New Wine or just New Bottles?

Big Data and Cloud Computing: New Wine or just New Bottles?

Cloud Computing and Big Data Processing

CPS216: Data-Intensive Computing Systems Query Execution (Sort and Join operators)

Complex Adaptive Systems of Systems (CASoS) Modeling and Engineering

Grid Computing

Lecture 3 CUDA Programming 1

COMPUTING (Higher)

Biological P athways

High Performance Cluster Computing Architectures and Systems

DISTRIBUTED COMPUTING

Lecture VI: Adaptive Systems

Agent Dynamics in Complex Multilevel Systems of Systems of Systems Jeffrey Johnson

Studies in Big Data 4 Weng-Long Chang Athanasios V. Vasilakos Molecular Computing

Ecology of Populations

ETM 555 Supplementary Lecture Notes Version 5. / 2011 Contents: