
Ioerger Lab – Bioinformatics Research


rpark


Presentation Transcript


  1. Ioerger Lab – Bioinformatics Research • Pattern recognition/machine learning • issues of representation • effect of feature extraction, weighting, and interaction on performance of induction algorithm • Applications in Structural Biology • molecular basis of biology: protein structures • predicting structures • tools for solving structures (X-ray crystallography, NMR) • stability, folding, packing, motions • drug design (small-molecule inhibitors) • large datasets exist – exploit them – find the patterns

  2. TEXTAL - Automated Crystallographic Protein Structure Determination Using Pattern Recognition • Principal Investigators: Thomas Ioerger (Dept. Computer Science), James Sacchettini (Dept. Biochem/Biophys) • Other contributors: Tod D. Romo, Kreshna Gopal, Erik McKee, Lalji Kanbi, Reetal Pai & Jacob Smith • Funding: National Institutes of Health • Texas A&M University

  3. X-ray crystallography • Most widely used method for protein modeling • Steps: • Grow crystal • Collect diffraction data • Generate electron density map (Fourier transform) • Interpret map i.e. infer atomic coordinates • Refine structure • Model-building • Currently: crystallographers • Challenges: noise, resolution • Goal: automation
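The "generate electron density map (Fourier transform)" step can be illustrated with a toy one-dimensional Fourier synthesis. This is only a sketch under simplifying assumptions (point atoms, a 1D unit cell, known phases, and hypothetical names such as `structure_factors` and `h_max`); real crystallographic maps are three-dimensional and the phases must be estimated. Truncating the sum at `h_max` mimics limited resolution.

```python
import cmath
import math

def structure_factors(atoms, h_max):
    """F(h) = sum_j Z_j * exp(2*pi*i*h*x_j) for point atoms at fractional x_j with weight Z_j."""
    return {h: sum(z * cmath.exp(2j * math.pi * h * x) for x, z in atoms)
            for h in range(-h_max, h_max + 1)}

def density(factors, x):
    """Fourier synthesis: rho(x) = sum_h F(h) * exp(-2*pi*i*h*x) (real part)."""
    return sum(f * cmath.exp(-2j * math.pi * h * x) for h, f in factors.items()).real

# Two hypothetical atoms in a 1D unit cell: (fractional position, scattering weight).
atoms = [(0.20, 6.0), (0.65, 8.0)]
factors = structure_factors(atoms, h_max=10)   # truncated series ~ limited resolution

grid = [i / 100 for i in range(100)]
rho = [density(factors, x) for x in grid]
peak = grid[rho.index(max(rho))]               # position of the strongest density peak
print(peak)
```

The strongest peak of the reconstructed density falls at the position of the heavier atom; lowering `h_max` broadens the peaks, which is one way to see why lower-resolution maps are harder to interpret.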


  5. Overview of TEXTAL • Automated model-building program • Can we automate the kind of visual processing of patterns that crystallographers use? • Intelligent methods to interpret density, despite noise • Exploit knowledge about typical protein structure • Focus on medium-resolution maps • optimized for 2.8 Å (actually, 2.6-3.2 Å is fine) • typical for MAD data (useful for high-throughput) • other programs exist for higher-res data (ARP/wARP) • Input: electron density map (or structure factors) → TEXTAL → Output: protein model (may need refinement)

  6. Crystal → Collect data → Diffraction data → Electron density map • CAPRA: models backbone • SCALE MAP → TRACE MAP → CALCULATE FEATURES → PREDICT Cα’s → BUILD CHAINS → PATCH & STITCH CHAINS → REFINE CHAINS • output: Model of backbone • LOOKUP: model side chains • output: Model of backbone & side chains • POST-PROCESSING • SEQUENCE ALIGNMENT → REAL SPACE REFINEMENT • output: Corrected & refined model

  7. F=<1.72,-0.39,1.04,1.55...> F=<1.58,0.18,1.09,-0.25...> F=<0.90,0.65,-1.40,0.87...> F=<1.79,-0.43,0.88,1.52...>

  8. Examples of Numeric Density Features • Distance from center-of-sphere to center-of-mass • Moments of inertia - relative dispersion along orthogonal axes • Geometric features like “Spoke angles” • Local variance and other statistics Features are designed to be rotation-invariant, i.e. same values for region in any orientation/frame-of-reference. TEXTAL uses 19 distinct numeric features to represent the pattern of density in a region, each calculated over 4 different radii, for a total of 76 features.
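Rotation invariance can be checked numerically. The sketch below is not TEXTAL's feature code: it computes three simplified stand-ins (distance from the sphere center to the density-weighted center of mass, a density-weighted spread about that center of mass, and the variance of the density values) on a handful of hypothetical sample points, and compares the values before and after rotating the region.

```python
import math

def features(points):
    """points: list of (x, y, z, rho), coordinates relative to the sphere center."""
    total = sum(p[3] for p in points)
    # Density-weighted center of mass and its distance from the sphere center.
    com = [sum(p[i] * p[3] for p in points) / total for i in range(3)]
    com_dist = math.sqrt(sum(c * c for c in com))
    # Density-weighted mean squared distance from the center of mass
    # (a scalar proxy for the moments of inertia).
    spread = sum(p[3] * sum((p[i] - com[i]) ** 2 for i in range(3))
                 for p in points) / total
    # Variance of the density values themselves.
    mean_rho = total / len(points)
    variance = sum((p[3] - mean_rho) ** 2 for p in points) / len(points)
    return (com_dist, spread, variance)

def rotate_z(points, angle):
    """Rotate every point about the z axis; density values are unchanged."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z, rho) for x, y, z, rho in points]

# Hypothetical density samples inside one spherical region.
region = [(1.0, 0.0, 0.5, 2.0), (0.0, 1.5, -0.5, 1.0),
          (-1.0, -0.5, 0.0, 0.5), (0.5, 0.5, 1.0, 1.5)]
original = features(region)
rotated = features(rotate_z(region, 1.1))
print(original, rotated)
```

Because every quantity depends only on distances and density values, the two feature tuples agree up to floating-point error, which is the property that lets regions be matched regardless of orientation.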

  9. The LOOKUP Process • Region in map to be interpreted is matched against a database of known maps • Two-step filter: 1) by features 2) by density correlation • Find optimal rotation • “2-norm”: weighted Euclidean distance metric for retrieving matches
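The two-step filter on this slide can be sketched as follows, under stated assumptions: each database entry is a hypothetical dict with a feature vector and a flattened density sample, the feature step keeps the `k` nearest entries by weighted Euclidean distance, and the density step picks the shortlisted entry with the highest Pearson correlation. The search for the optimal rotation is omitted, so this is an illustration of the filtering idea rather than the LOOKUP implementation.

```python
import math

def weighted_distance(x, y, w):
    """Weighted Euclidean distance: sqrt(sum_i w_i * (x_i - y_i)^2)."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

def correlation(a, b):
    """Pearson correlation between two equally sampled density vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(va * vb)

def lookup(query, database, weights, k):
    # Step 1: cheap filter by feature distance, keep the k closest entries.
    shortlist = sorted(
        database,
        key=lambda e: weighted_distance(query["features"], e["features"], weights),
    )[:k]
    # Step 2: more expensive comparison by density correlation.
    return max(shortlist, key=lambda e: correlation(query["density"], e["density"]))

# Hypothetical query region and database entries.
query = {"features": [1.0, 0.0], "density": [1.0, 2.0, 3.0, 4.0]}
database = [
    {"name": "A", "features": [1.1, 0.1], "density": [4.0, 3.0, 2.0, 1.0]},
    {"name": "B", "features": [0.9, -0.2], "density": [1.0, 2.0, 3.0, 5.0]},
    {"name": "C", "features": [5.0, 5.0], "density": [1.0, 2.0, 3.0, 4.0]},
]
best = lookup(query, database, weights=[1.0, 1.0], k=2)
print(best["name"])
```

Entry C has identical density but is discarded by the feature filter, which shows why the quality of the feature weights matters: the second step can only choose among what the first step lets through.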

  10. SLIDER: Feature-weighting algorithm • Euclidean distance metric used for retrieval: • relevant features – good, irrelevant features – bad • Goal: find the optimal weight vector w that generates the highest probability of hits (matches) in the top K candidates from the database • Concept of Slider: • adjust feature weights so that the most matches are ranked higher than mismatches • Slider Algorithm(w, F, {Ri}, matches, mismatches): • choose feature f ∈ F at random • for each <Ri,Rj,Rk>, Rj ∈ matches(Ri), Rk ∈ mismatches(Ri): compute cross-over point λi where dist’(Ri,Rj) = dist’(Ri,Rk), with dist’(X,Y) = λ(Xf−Yf)² + (1−λ)·dist\f(X,Y) • pick the λ that is the best compromise among the λi, i.e. ranks most matches above mismatches • update weight vector: w’ ← update(w,f,λ), wf’ = λ • repeat until convergence
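One iteration of the Slider idea for a single feature f might look like the sketch below. It is a simplified reading of the slide, not the published algorithm: `dist_without` stands in for dist\f, the "best compromise" is taken to be the midpoint between cross-over points that orders the most triples correctly, and the update simply sets the weight of f to that λ without renormalizing the other weights. All function names and the example triples are hypothetical.

```python
def dist_without(x, y, w, f):
    """Stand-in for dist\\f: weighted squared distance over all features except f."""
    return sum(w[i] * (x[i] - y[i]) ** 2 for i in range(len(x)) if i != f)

def mixed_dist(x, y, w, f, lam):
    """dist'(X, Y) = lam * (X_f - Y_f)^2 + (1 - lam) * dist\\f(X, Y)."""
    return lam * (x[f] - y[f]) ** 2 + (1 - lam) * dist_without(x, y, w, f)

def crossover(ri, rj, rk, w, f):
    """lam at which match Rj and mismatch Rk are equally far from Ri, if inside (0, 1)."""
    a = (ri[f] - rj[f]) ** 2 - (ri[f] - rk[f]) ** 2
    b = dist_without(ri, rj, w, f) - dist_without(ri, rk, w, f)
    if b == a:
        return None
    lam = b / (b - a)          # solves lam * a + (1 - lam) * b = 0
    return lam if 0.0 < lam < 1.0 else None

def slider_step(triples, w, f):
    cuts = {0.0, 1.0}
    for ri, rj, rk in triples:
        lam = crossover(ri, rj, rk, w, f)
        if lam is not None:
            cuts.add(lam)
    cuts = sorted(cuts)
    # Rankings only change at cross-over points, so test one value per interval.
    candidates = [(lo + hi) / 2 for lo, hi in zip(cuts, cuts[1:])] + [0.0, 1.0]

    def score(lam):
        return sum(mixed_dist(ri, rj, w, f, lam) < mixed_dist(ri, rk, w, f, lam)
                   for ri, rj, rk in triples)

    best = max(candidates, key=score)
    new_w = list(w)
    new_w[f] = best            # w_f' = lam
    return new_w, score(best)

# Hypothetical triples (Ri, match Rj, mismatch Rk); feature 0 is informative, feature 1 is not.
triples = [
    ((0.0, 0.0), (0.1, 1.0), (1.0, 0.1)),
    ((0.0, 0.0), (0.2, 0.5), (0.6, 0.3)),
]
new_w, n_correct = slider_step(triples, w=[1.0, 1.0], f=0)
print(new_w, n_correct)
```

In the full procedure this step would be repeated with randomly chosen features until the weights stop changing.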

  11. Quality of TEXTAL models • Typically builds >80% of the protein atoms • Accuracy of coordinates: ~1Å error (RMSD) • Depends on resolution and quality of map
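For reference, the RMSD figure quoted here is the root-mean-square distance between corresponding atoms. A minimal sketch, assuming the model and reference coordinates are already paired atom-by-atom and expressed in the same frame (no superposition is performed), with made-up coordinates in Å:

```python
import math

def rmsd(model, reference):
    """Root-mean-square deviation between paired (x, y, z) coordinates."""
    if len(model) != len(reference):
        raise ValueError("coordinate lists must have the same length")
    squared = sum((a - b) ** 2
                  for p, q in zip(model, reference)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(model))

# Hypothetical coordinates: every model atom is displaced by exactly 1 Å.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
model = [(1.0, 0.0, 0.0), (1.5, 1.0, 0.0), (1.5, 1.5, 1.0)]
error = rmsd(model, reference)
print(error)
```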

  12. Closeup of β-strand (TEXTAL model in green)

  13. Deployment • September 2004: Linux and OSX distributions • Can be downloaded from http://textal.tamu.edu • 40 trial licenses granted so far • June 2002: WebTex (http://textal.tamu.edu) • Till May 2005: TB Structural Genomics Consortium members only • Recently open to the public • users upload data; processed on server; can download results • 120 users from 70 institutions in 20 countries • July 2003: Model building component of PHENIX • Python-based Hierarchical ENvironment for Integrated Xtallography • Consortium members: • Lawrence Berkeley National Lab • University of Cambridge • Los Alamos National Lab • Texas A&M University

  14. Intelligent Methods for Drug Design • structure-based: • given protein structure, predict ligands that might bind active site • other methods: • QSAR, high-throughput/combi-chem, manual design using 3D • Virtual Screening • docking algorithm + large library of chemical structures • sort compounds by interaction energy • purchase top-ranked hits and assay in lab • looking for μM inhibitors (leads that can be refined) • goal: enrichment to ~5% hit rate
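The ranking and enrichment steps can be sketched in a few lines. The compound names, energies, and assay outcomes below are invented for illustration, and lower (more negative) interaction energy is assumed to mean better predicted binding.

```python
def screen(library, top_n):
    """library: list of (name, energy, is_active). Returns selection, hit rate, enrichment."""
    ranked = sorted(library, key=lambda c: c[1])        # most favorable energy first
    selected = ranked[:top_n]                           # compounds to purchase and assay
    hit_rate = sum(1 for c in selected if c[2]) / len(selected)
    base_rate = sum(1 for c in library if c[2]) / len(library)
    return [c[0] for c in selected], hit_rate, hit_rate / base_rate

# Hypothetical docking scores and assay outcomes.
library = [
    ("c1", -9.5, True), ("c2", -3.0, False), ("c3", -8.1, False), ("c4", -7.7, True),
    ("c5", -2.2, False), ("c6", -1.0, False), ("c7", -4.4, False), ("c8", -0.5, False),
]
names, hit_rate, enrichment = screen(library, top_n=4)
print(names, hit_rate, enrichment)
```

Here the top-ranked half of the library contains both actives, so the hit rate among selected compounds is twice the hit rate of the library as a whole.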

  15. Virtual Screening • diversity • ZINC database: ~2.6 million compounds • purchasable; satisfy Lipinski’s rules • docking algorithms: • FlexX, DOCK, GOLD, AutoDock, ICM... • search for position and conformation of ligand • scoring function • electrostatic + steric + desolvation • entropy effects? • major open issues: • active site flexibility, charge state, waters, co-factors • works best with co-crystal structures (already bound)
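Lipinski's rules are commonly stated as: molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors. A minimal filter is sketched below, assuming the four properties have already been computed elsewhere and requiring all four rules to hold (some variants tolerate one violation). The `Compound` class and the example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Compound:
    name: str
    mol_weight: float   # Da
    log_p: float        # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors

def passes_lipinski(c):
    """True if the compound satisfies all four rule-of-five limits."""
    return (c.mol_weight <= 500 and c.log_p <= 5
            and c.h_donors <= 5 and c.h_acceptors <= 10)

# Hypothetical precomputed properties for two library compounds.
candidates = [
    Compound("cmpd_a", 342.4, 2.1, 2, 5),
    Compound("cmpd_b", 712.9, 6.3, 4, 12),
]
drug_like = [c.name for c in candidates if passes_lipinski(c)]
print(drug_like)
```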

  16. Grid at Texas A&M • gridmaster.tamu.edu sends out DOCK binaries + receptor files + 20 ligands at a time • ~1600 computers in student labs on TAMU campus (Open-Access Labs), e.g. West Campus Library, Blocker, Zachary • typical configuration: 2.8 GHz dual-core Pentium CPUs running Windows XP • GridMP software by United Devices (Austin, TX)
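Splitting a ligand library into work units of 20 is straightforward. The sketch below only shows the batching; the ligand identifiers are placeholders and nothing here reflects the actual GridMP interface.

```python
def work_units(ligands, batch_size=20):
    """Yield consecutive batches of ligands; the last batch may be smaller."""
    for start in range(0, len(ligands), batch_size):
        yield ligands[start:start + batch_size]

# Placeholder ligand identifiers.
ligands = [f"ligand_{i:04d}" for i in range(45)]
units = list(work_units(ligands))
print([len(u) for u in units])
```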

  17. Data Mining of Results • promiscuous binders • clusters of related compounds • patterns of contacts within active site • hydrogen-bonding interactions • adjust weights of scoring function for unique properties of each site • open/closed, hydrophobic/charged... • ideas for active site variations • development of pharmacophore search patterns
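As one example of this kind of mining, promiscuous binders can be flagged by counting how many different targets list a compound among their top-ranked hits. The target labels, compound identifiers, and threshold below are hypothetical.

```python
from collections import Counter

def promiscuous(top_hits_by_target, min_targets):
    """Return compounds that appear in the top hits of at least min_targets targets."""
    counts = Counter(compound
                     for hits in top_hits_by_target.values()
                     for compound in set(hits))     # count each target at most once
    return sorted(c for c, n in counts.items() if n >= min_targets)

# Hypothetical top-ranked hits from three separate screens.
top_hits = {
    "target_a": ["z1", "z2", "z3"],
    "target_b": ["z2", "z4"],
    "target_c": ["z2", "z3", "z5"],
}
flagged = promiscuous(top_hits, min_targets=3)
print(flagged)
```

Compounds flagged this way score well against unrelated sites and are less likely to be specific inhibitors of any one of them.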

  18. Current Screens in Sacchettini Lab • proteins related to tuberculosis (Mycobacterium) • focus on unique pathways involved in dormancy/starvation • glyoxylate shunt – slow-growth metabolic pathway • cell-wall biosynthesis (unique mycolic acid layer in TB) • biosynthesis of amino acids/co-factors that humans get from diet • isocitrate lyase • malate synthase • PcaA: mycolic acid cyclopropane synthase • ACPS: acyl-carrier protein synthase • InhA: enoyl-ACP reductase (target of isoniazid) • KasB: fatty-acid synthase • BioA: biotin (co-factor) synthase • PGDH: phosphoglycerate dehydrogenase (serine biosynthesis) • Related proteins in malaria, SARS, Shigella

  19. Conclusions • Many opportunities for research in Structural Bioinformatics • large datasets • significant problems • Provides challenges for machine learning • drives development of novel methods, especially for dealing with noise, sampling biases, extraction of features... • Requires inherently interdisciplinary approach • training in biochemistry; knowledge of molecular interactions • understanding chemical intuition; use of visualization tools • insights about strengths and limitations of existing methods • Requires collaboration to construct appropriate representations to enable learning algorithms to find patterns • translate expectations about what is relevant, dependencies, smoothing, sources of noise...
