200 likes | 449 Views
In silico biology: computational path toward holistic understanding of living cells. Andrey A. Gorin Computer Science and Mathematics Oak Ridge National Laboratory agor@ornl.gov. Dynamics Kinetics. Function. Models. Structure. Comparative Proteomics. Genes. Comparative Genomics.
E N D
In silico biology: computational path toward holistic understanding of living cells Andrey A. Gorin Computer Science and Mathematics Oak Ridge National Laboratory agor@ornl.gov
Dynamics Kinetics Function Models Structure Comparative Proteomics Genes Comparative Genomics Molecular Simulation NetworkSimulation Structure Prediction 1000 TB 0 TB 1 TB 10 TB 100 TB Regulatory Region Coding Region P O GenA GenB GenC Motivation – Predictive Biology Developments in Biological Sciences: • Experimental: From Reduction to Systems Science • Computational: From Validation to Prediction Development in Technologies: • High Throughput Experimentation • High Performance Computing Uniqueness of Biology: • First Principles Approach is Impossible/Impractical • Enormous Multitude of Scales • Descriptive Models from Diverse Data • Large Uncertainties in Data
Scientific Drivers • Bio-remediation • Environmental restoration using microbial processes requires study of: • Microbial attachment to mineral environments • Uptake of contaminant ions by microbes and metal reduction at the microbial membrane • Conversion of toxic chemicals by microbial systems • Bio-energy • Development of integrated experimental an computational approaches for: • Feedstock optimization with the goal of better cellulose deconstruction by special bacterial systems • Understanding of microbial communities in single batch processes (harsh environments) • Enzyme or regulatory circuit design to increase desirable output
Exploring Protein Dimension of Bio Universe Entire proteome is analyzed in a few hours 1 out of 105-1011 must be selected as the correct peptide Mass spectrometry process
0.5 1. 0.8 Chain Confidence Database FPVW1 >KKRRHA…LKAAKHREVFKR FPVW2 >KAH….……….. FPVW3 ………. Random Match Model Memory Indices De novo Platform: Probability Profile Method We made several advancements in the understanding of the “mass spec” mathematics. Taken together they lead to conceptually novel platform. Peak Assignment m1 m2 m3 m4 … [510] VDDLSSLT [305]
Results: Capabilities in Mass Spectrometry Advances in the mathematical understanding and dramatic acceleration of fundamental operations lead to principally new capabilities • Output gains, and especially in highly confident identifications. Our method gives several times more of highly confident identifications • Capability to detect unexpected phenomena in the samples. We have found novel biological phenomena in the legacy data and uncovered mistakes in the data sets regarded as benchmarks. Deamidations (6 spectra) IHPFAQTQSLVYPFPGPIPN IHPFAQTQSLVYPFPGPIPD Incorrect peptides (7 spectra) VIPAADLSQQISTAGTEASGTGNMK->VIPAADLSEQISTAGTEASGTGNMK … Disulphide bond (1 spectra) AAANFFSASCVPCADQSSFPK De novo methods can improve even manually verified benchmark data sets obtained by the existing technology.
Model-dependent Science Questions Many fundamental questions in systems biology are hampered by the lack of reliable predictive models. Network Models Reliant: Structural Models Reliant: • What are biochemical or regulatory functions of the proteins that are shown to be important for hydrogen production? • What are the hot spots of cellulases that could impair their binding? • What are the components of hydrogen producing protein assemblies? • … • What biochemical processes in a microbe are related to its traits (hydrogen or ethanol producer, ethanol resistant)? • How does a bacteria degrade lignin or cellulose? • What are the mechanisms behind the conversion of toxic waste to nontoxic substances by bacteria? • …
NMR X-ray Genomics Neutron Scattering Imaging Structural Models 3-d Structure Protein Docking Protein-RNA Protein-DNA Protein-Ligand Interactomics Genomics Metabolomics Quantitative Proteomics Transcriptomics Network/Pathway Models Metabolic Regulatory SignalingProtein Interaction Predictive Model Building
Protein Complexes in Genomics:GTL GTL is focused on protein interactions that make life work Express Bait Protein Is the interaction real or an artifact? What is the structure of the protein complex? What is its function? What is its dynamic mechanism? Can we answer these questions at scale? Exogenous / Endogenous Pulldown Mass Spectrometry Analysis Putative Interacting Proteins at High Throughput Xray Diffraction
Statistical potentials Modeling Protein Structures and Complexes Combinatorial and optimization techniques are applied for two areas: development of knowledge based potentials and analysis of ultra large structural sets. Discovery of protein complexes ? Geometry and bioinfo libraries Shared memory indices Protein folding Parallel implemenations Ligand binding Graph algorithms
Finding Common Motifs ROSETTA Monte Carlo protein folding Known structures Knowledge-based Energy Tables 100 GB Search, Optimization, Enumeration Energy Optimization 3 GB – 5 TB Finding MaximalCliques Merging & Scoring Decoy structures Cliques of structures Search, Optimization, Enumeration 103-105 (104~50TB) Search 10 GB – 500 TB Native Structures Computational Algorithms in Structure Modeling Multitude of combinatorial optimization problems with different data access patterns. Example: Ab Initio Prediction of Protein 3-d Structure
40 decoys with the theoretical probability near 0.8 have on average 60% of native contacts Up to 1000 threads Results: Protein Docking • Full implementation for Bayesian potential energy functions for protein docking • The size of the docking benchmarking set (~1,200) is unprecedented in the field. High quality results were obtained for over 70% of all tested complexes (native in top 5), indicating that the method is very efficient. • Successfully modeled protein complexes using the structures from other organisms • Docking calculations are scalable to 1000s processor due to surface patch approach for parallelization
Future:Protein Docking • Develop further Protein Interface Server (PINS) database (pins.ornl.gov). • Develop theory and computational implementation for docking potentials taking into account orientation of the contacting residues (Bayesian). Petascale implementations for docking platforms. • Improve mathematical methods for prediction validation. • Analyze and annotate PINS interfaces related to the metabolic functions involving carbon-processing pathways in cellulose-degrading bacteria. Construct predictions for the organisms important for Bioenergy Centers in cases when full or partial genome sequence are available.
NMR X-ray Genomics Neutron Scattering Imaging Structural Models 3-d Structure Protein Docking Protein-RNA Protein-DNA Protein-Ligand Interactomics Genomics Metabolomics Quantitative Proteomics Transcriptomics Network/Pathway Models Metabolic Regulatory SignalingProtein Interaction Predictive Model Building
LDRDD07-014Modeling Cellular Mechanisms for Efficient Bioethanol Production through Petascale Analysis of Biological Networks Andrey Gorin, Nabeela Ahmad, Andrew Bordner, Robert Day, Jessie Gu, Guruprasad Kora, Chongle Pan, Byung-Hoon Park, Nagiza Samatova, Edward Uberbacher, Cray. Inc Oak Ridge National Laboratory, FY2007-FY2008
Future: Graph Algorithms for Networks • Analysis of graphs reflecting physical interactions in structures • Very wide set of problems, but with a uniform set of “translation” rules • Enormously large graphs • Directly connected to modeling petascale applications • Developing graph representations for real cellular subsystems. Nodes: genes, proteins, DNA elements, metabolites. Edges: translation, positive regulation, production of metabolite • Flexible representations Integration of many tools to annotate 5’ regions Promotor alignment Models to predict transcription patterns based on promotor models
Future: Mining Networks for Bioenergy • Switch grass paralogs/homologs has to be identified in collections of >250,000 partial gene data (EST) from rice, maize, sorghum genomes • It is expected that 6000 Populus lines will be passed through cell wall phenotyping pipeline (expression arrays, proteomics data, etc) • Huge data sets are already (e.g. Obauashi et al. (2007) 1,388 expression arrays) (From Bioenergy Center proposal)
Future: Taking It to Petascale • Port to NLCF platforms • Scale Maximal Clique Enumeration from 100s to 1000s processors • Introduce similar implementations for other codes in pGraph and BioGraphE • Optimize performance on NLCF architectures: • Deliver efficient MPI-based multi-core implementations • Minimize thread locking/unlocking overheads • Exploit data locality in job redistribution strategies • Minimize I/O overheads via buffering, data compression, hierarchical parallel I/O.
Potential Areas of Joint Work Mathematics • Statistics of pattern recognition problems • Combinatorial optimization methods beyond dynamic programming • Bayesian system with strong feature correlations Computer science • Graph algorithms working for VERY large graphs 104-105 nodes and across many graphs (103) in parallel • Flexible software systems for representation transcription regulation networks in the bacterial cells (4 genes) • Petascale implementation strategies: MPI implementations, load balancing, etc Biological Applications: • Novel applications in mass-spectrometry (e.g. gene finding in new genomes, special gene expression in stem cells, mass-spec of cross-linked systems) • Expression array analysis and metabolic pathway construction for bacteria involved in bioenergy processing