
CyberBridges Protein Pattern Discovery

Tom Milledge, Giri Narasimhan. Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, FIU.


Presentation Transcript


  1. CyberBridges Protein Pattern Discovery. Tom Milledge, Giri Narasimhan. Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, FIU.

  2. Protein Pattern Discovery: Introduction. Goals: • Implement unsupervised pattern discovery tools for protein structure data using the geometric hashing technique • Create a database of protein structure patterns • Create multiple 3-D structural alignments • Identify functional regions in proteins.

  3. Molecular Biology Primer. DNA (Gene 1, Gene 2, Gene 3) → RNA → Protein. Proteins: Hemoglobin, Immunoglobulin, Keratin, Melanin, Insulin, etc.

  4. Where does protein structure information come from? PDB (Protein Data Bank): a repository of 3-D protein structures.

  5. Representing substructures as triangles. The largest common substructure in the query and target proteins is made up of many linked triangles. One triangle (3 atoms) is stored as a record: Length1 = 9.5, Length2 = 7.05, Length3 = 7.01; ID1 = 217, ID2 = 231, ID3 = 238.
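
A triangle record like the one on this slide can be sketched in a few lines. This is only an illustration: the function name `triangle_records`, the coordinate dictionary, and the order in which the three side lengths are listed are assumptions, since the slides do not say how lengths are paired with atom IDs or whether any distance cutoff is applied.

```python
import math
from itertools import combinations

def triangle_records(atoms):
    """Yield one (lengths, atom_ids) record per triple of atoms.

    atoms: dict mapping atom ID -> (x, y, z) coordinates.
    The ordering of the three side lengths is an assumption of this sketch.
    """
    for a, b, c in combinations(sorted(atoms), 3):
        lengths = (
            math.dist(atoms[a], atoms[b]),  # side a-b
            math.dist(atoms[b], atoms[c]),  # side b-c
            math.dist(atoms[a], atoms[c]),  # side a-c
        )
        yield lengths, (a, b, c)

# Hypothetical coordinates; the atom IDs reuse the ones shown on the slide.
atoms = {217: (0.0, 0.0, 0.0), 231: (3.0, 0.0, 0.0), 238: (0.0, 4.0, 0.0)}
records = list(triangle_records(atoms))
```

Enumerating every triple grows cubically with the number of atoms, so a practical version would likely restrict itself to atoms within some distance of each other; that choice is not described in the slides.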

  6. Basic steps for triangle-based geometric hashing • Preprocessing phase: extract triangle information from target (model) proteins and store it in a hash table • Searching phase: for any given query protein, find the matching triangles in the hash table • Extension phase: find the largest matching substructures

  7. Preprocessing phase: Create the hash table. Read PDB data → Extract triangles (example side lengths: 7.01, 7.06, 9.49) → Generate a hash key (based on the three lengths and bin-size parameters; example key: 035035047) and enter the record into a hash table.
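
The key generation on this slide can be sketched as follows. The bin size of 0.2 and the three-digit zero padding are assumptions inferred from the slide's example (7.01, 7.06, 9.49 giving 035035047); sorting the bin indices so the key does not depend on side order is also an assumption, and the names `hash_key` and `add_triangle` are hypothetical.

```python
def hash_key(lengths, bin_size=0.2, digits=3):
    """Quantize three side lengths into bins and join them into a string key."""
    # Truncate each length to its bin index, then sort so that the key is
    # independent of the order in which the sides were listed (assumption).
    bins = sorted(int(length / bin_size) for length in lengths)
    return "".join(f"{b:0{digits}d}" for b in bins)

def add_triangle(table, lengths, atom_ids):
    """Enter a triangle record into the hash table under its key."""
    table.setdefault(hash_key(lengths), []).append(atom_ids)

table = {}
add_triangle(table, (7.01, 7.06, 9.49), (217, 231, 238))
print(hash_key((7.01, 7.06, 9.49)))
```

Under these assumed parameters, the record from slide 5 (9.5, 7.05, 7.01) lands in the same bins and so produces the same key, which is how near-identical triangles end up in one bucket. Lengths that straddle a bin boundary would get different keys; the slides do not say how that case is handled.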

  8. Search phase: finding the matches. The hash table is split across cluster nodes by protein, with protein attribute information stored in a separate table. This data is accessed via the atom ID foreign keys stored in the hash table record. 1. Decompose Query Protein: At the beginning of the search, the query protein is decomposed into triangles, with the attribute information stored in a separate table. The query protein data is then copied to all nodes. 2. Initial Search: The initial search entails matching the query triangles with the database of (target) triangles. The results are added to a new hash table containing all the target matches. The results table includes the query atom IDs for the substructure building phase.
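
The initial search can be sketched on a single machine as a lookup of each query triangle's key in the target table. This is a minimal sketch, not the cluster implementation the slide describes: the distribution across nodes and the separate attribute table are left out, and `hash_key`, `build_table`, `initial_search`, the bin size, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict

def hash_key(lengths, bin_size=0.2, digits=3):
    """Assumed key scheme: sorted, zero-padded bin indices of the side lengths."""
    bins = sorted(int(length / bin_size) for length in lengths)
    return "".join(f"{b:0{digits}d}" for b in bins)

def build_table(target_triangles):
    """Preprocessing: index (protein_id, lengths, atom_ids) records by key."""
    table = defaultdict(list)
    for protein_id, lengths, atom_ids in target_triangles:
        table[hash_key(lengths)].append((protein_id, atom_ids))
    return table

def initial_search(query_triangles, table):
    """Return a results table: target protein -> list of (query_ids, target_ids)."""
    hits = defaultdict(list)
    for lengths, query_ids in query_triangles:
        for protein_id, target_ids in table.get(hash_key(lengths), []):
            # Keep the query atom IDs so the extension phase can use them.
            hits[protein_id].append((query_ids, target_ids))
    return hits

# Hypothetical target and query data.
targets = [
    ("T1", (7.01, 7.06, 9.49), (217, 231, 238)),
    ("T2", (5.0, 6.0, 7.0), (10, 11, 12)),
]
query = [((9.5, 7.05, 7.01), (1, 2, 3))]
hits = initial_search(query, build_table(targets))
```

Because the key sorts the bin indices, a hit says only that the two triangles have similar side lengths; which query atom corresponds to which target atom still has to be resolved afterwards.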

  9. Extension phase: building the substructures. Start from a list of triangle hits → Build an adjacency structure (every vertex of the tree is a triangle) → Use a graph searching algorithm to find larger substructures → Measure structural similarity (RMSD*) between every substructure in the query protein and every substructure in the model protein → Output common substructure pairs. *RMSD: root mean square distance
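
One way to picture the extension step is a breadth-first search over triangle hits, where two hits are treated as adjacent when their query triangles share an edge (two atoms). The sketch below is an assumption-laden illustration rather than the authors' algorithm: it assumes each hit already pairs query and target atoms position by position, it accepts a neighbor only if the combined atom mapping stays one-to-one, and it omits the RMSD check. The names `shares_edge`, `consistent`, and `grow` are hypothetical.

```python
from collections import deque

def shares_edge(hit_a, hit_b):
    """Two hits are adjacent if their query triangles share at least two atoms."""
    return len(set(hit_a[0]) & set(hit_b[0])) >= 2

def consistent(mapping, candidate):
    """True if candidate (query -> target) agrees with mapping and stays one-to-one."""
    used_targets = set(mapping.values())
    for q, t in candidate.items():
        if q in mapping:
            if mapping[q] != t:
                return False
        elif t in used_targets:
            return False
    return True

def grow(seed, hits):
    """Breadth-first extension from one hit; returns a query -> target atom mapping."""
    mapping = dict(zip(*hits[seed]))
    used, queue = {seed}, deque([seed])
    while queue:
        current = queue.popleft()
        for j, hit in enumerate(hits):
            if j in used or not shares_edge(hits[current], hit):
                continue
            candidate = dict(zip(*hit))
            if consistent(mapping, candidate):
                mapping.update(candidate)
                used.add(j)
                queue.append(j)
    return mapping

# Hypothetical hits: (query atom IDs, target atom IDs), paired by position.
hits = [
    ((1, 2, 3), (11, 12, 13)),
    ((2, 3, 4), (12, 13, 14)),
    ((3, 4, 5), (13, 14, 15)),
    ((7, 8, 9), (21, 22, 23)),   # isolated hit
    ((2, 3, 6), (12, 99, 16)),   # conflicts with atom 3 -> 13
]
best = max((grow(i, hits) for i in range(len(hits))), key=len)
```

Here the three chained hits merge into a five-atom correspondence, while the isolated and conflicting hits stay at three atoms each. A fuller version would also score each candidate extension by RMSD before accepting it.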

  10. Case study: Dehydrogenase superfamily. 1B3R: Hydrolase (Rat); 1CJC: Reductase (Cow); 1CF2: Dehydrogenase (Bacteria).

  11. Dehydrogenases: Shared structural element. Recurring substructure in 1B3R, 1CF2, and 1CJC.

  12. Dehydrogenases: building the common substructure. A triangle from the query protein (green) matches a triangle from the target protein (pink). Other overlapping triangle matches are extended from the initial triangle to find the largest common substructure. RMSD (root mean square distance) is measured at each extension step to ensure the validity of the larger match; an RMSD of less than 1.0 Angstrom indicates a good match. Values shown: RMSD: 0.32 Angstroms; RMSD: 0.66 Angstroms.
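
The slides do not say how RMSD is computed. As a stand-in, the sketch below uses a distance-based RMSD: it compares the internal pairwise distances of two matched atom sets, so no superposition is needed. This is a swapped-in technique chosen to keep the example dependency-free, not necessarily the measure the authors used; `distance_rmsd` and `is_good_match` are hypothetical names, and only the 1.0 Angstrom threshold comes from the slide.

```python
import math
from itertools import combinations

def distance_rmsd(coords_a, coords_b):
    """Root mean square difference between corresponding internal distances.

    coords_a and coords_b are equal-length lists of (x, y, z) tuples in which
    the i-th atom of one set is matched to the i-th atom of the other.
    """
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("need two matched coordinate lists with at least 2 atoms")
    diffs = [
        math.dist(coords_a[i], coords_a[j]) - math.dist(coords_b[i], coords_b[j])
        for i, j in combinations(range(len(coords_a)), 2)
    ]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def is_good_match(coords_a, coords_b, threshold=1.0):
    """Apply the slide's rule of thumb: below 1.0 Angstrom is a good match."""
    return distance_rmsd(coords_a, coords_b) < threshold

# Hypothetical coordinates: a triangle, a translated copy, and a stretched copy.
query = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
shifted = [(10.0, 10.0, 10.0), (13.0, 10.0, 10.0), (10.0, 14.0, 10.0)]
stretched = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
print(is_good_match(query, shifted), is_good_match(query, stretched))
```

Calling this after every extension step, and rejecting an extension that pushes the value past the threshold, would mirror the per-step validation the slide describes.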

  13. Results: Zinc finger protein family. DNA-binding substructure; zinc-binding substructure. 10 positions, RMSD: 0.46 Angstroms; 4 positions, RMSD: 0.35 Angstroms.

  14. Conclusions and Future Work. Geometric hashing of proteins shows promise as an important technique with a very good fit to many parallel architectures. Areas of future work include: • Molecular docking: Identify potential drugs that are least likely to cause side effects. • Function prediction: Create a database of conserved substructures that indicate a specific protein function. • Structure prediction: Use sequence patterns with structural templates to predict the structure of new sequences.
