CyberBridges Protein Pattern Discovery

Tom Milledge, Giri Narasimhan
Bioinformatics Research Group (BioRG)
School of Computing and Information Sciences, FIU

Protein Pattern Discovery: Introduction
Goals:
• Implement unsupervised pattern discovery tools for protein structure data using the geometric hashing technique
• Create a database of protein structure patterns
• Create multiple 3-D structural alignments
• Identify functional regions in proteins

Molecular Biology Primer
[Figure: Gene 1, Gene 2, and Gene 3 on DNA; DNA → RNA → Protein]
Proteins: Hemoglobin, Immunoglobulin, Keratin, Melanin, Insulin, etc.

Where does protein structure information come from?
PDB (Protein Data Bank): a repository of 3-D protein structures

Representing substructures as triangles
• Largest common substructure (many linked triangles) in query and target proteins
• One triangle (3 atoms):

Length1  Length2  Length3  ID1  ID2  ID3
9.5      7.05     7.01     217  231  238
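The triangle record above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_triangles`, the `max_side` cutoff, the ordering of lengths (descending) and IDs (ascending), and the sample coordinates are all assumptions made for the example.

```python
import math
from itertools import combinations

def extract_triangles(atoms, max_side=12.0):
    """Return one (Length1, Length2, Length3, ID1, ID2, ID3) record per atom triple.

    atoms: dict mapping atom ID -> (x, y, z) coordinates in Angstroms.
    max_side: hypothetical cutoff; triples with a longer side are skipped.
    """
    records = []
    # Sorting by atom ID makes ID1 < ID2 < ID3 in every record.
    for (id1, p1), (id2, p2), (id3, p3) in combinations(sorted(atoms.items()), 3):
        sides = (math.dist(p1, p2), math.dist(p2, p3), math.dist(p1, p3))
        if max(sides) <= max_side:
            lengths = tuple(sorted(sides, reverse=True))  # longest side first
            records.append(lengths + (id1, id2, id3))
    return records

# Made-up coordinates forming a 3-4-5 right triangle; the atom IDs are
# borrowed from the slide only to show the record layout.
atoms = {217: (0.0, 0.0, 0.0), 231: (3.0, 0.0, 0.0), 238: (0.0, 4.0, 0.0)}
records = extract_triangles(atoms)
```

Enumerating every triple grows cubically with the number of atoms, which is why the sketch includes a side-length cutoff; the slides do not say how the original tool limits the triangle set.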

Basic steps for triangle-based geometric hashing
• Preprocessing phase
  • Extract triangle information from target (model) proteins and store it in a hash table
• Searching phase
  • For any given query protein, find the matching triangles in the hash table
• Extension phase
  • Find the largest matching substructures

Preprocessing phase: Create the hash table
1. Read PDB data
2. Extract triangles (example side lengths: 7.01, 7.06, 9.49)
3. Generate a hash key (based on the three lengths and bin-size parameters; example hash key: 035035047) and enter the record into a hash table
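One way to turn three side lengths into a key like the one on this slide is to bin each length and concatenate the zero-padded bin indices. The sketch below is an assumption-laden illustration: the slides do not state the bin size, but a bin size of 0.2 Angstroms with rounding down happens to reproduce the example key 035035047 from the lengths 7.01, 7.06, and 9.49. The names `hash_key`, `add_triangle`, and `BIN_SIZE`, and the protein label `"target_A"`, are hypothetical.

```python
import math
from collections import defaultdict

BIN_SIZE = 0.2  # assumed value; chosen because it reproduces the slide's example key

def hash_key(lengths, bin_size=BIN_SIZE):
    """Bin each side length, sort the bin indices, and join them as 3-digit fields."""
    bins = sorted(math.floor(length / bin_size) for length in lengths)
    return "".join(f"{b:03d}" for b in bins)

def add_triangle(table, protein_id, lengths, atom_ids):
    """Enter one triangle record into the hash table under its key."""
    table[hash_key(lengths)].append((protein_id, atom_ids))

table = defaultdict(list)
add_triangle(table, "target_A", (7.01, 7.06, 9.49), (217, 231, 238))
example_key = hash_key((7.01, 7.06, 9.49))
```

Sorting the bin indices makes the key independent of the order in which the three sides are listed, so the same triangle always lands in the same bucket.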

Search phase: finding the matches
1. Decompose Query Protein: At the beginning of the search, the query protein is decomposed into triangles, with the attribute information stored in a separate table. The query protein data is then copied to all nodes.
2. Initial Search: The hash table is split across cluster nodes by protein, with protein attribute information stored in a separate table. This data is accessed via the atom ID foreign keys stored in each hash table record. The initial search matches the query triangles against the database of target triangles. The results are added to a new hash table containing all the target matches. The results table includes the query atom IDs, which are needed for the substructure-building phase.
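The initial search can be sketched as a lookup of each query triangle's key in the target hash table. This is a single-process illustration only: the distribution of the table across cluster nodes is not modeled, and the key scheme (0.2 Angstrom bins), the function names, the protein labels, and the sample triangles are assumptions rather than details from the slides.

```python
import math
from collections import defaultdict

def hash_key(lengths, bin_size=0.2):
    """Assumed key scheme: sorted, zero-padded bin indices of the three sides."""
    bins = sorted(math.floor(length / bin_size) for length in lengths)
    return "".join(f"{b:03d}" for b in bins)

def build_table(target_triangles):
    """target_triangles: iterable of (protein_id, lengths, atom_ids)."""
    table = defaultdict(list)
    for protein_id, lengths, atom_ids in target_triangles:
        table[hash_key(lengths)].append((protein_id, atom_ids))
    return table

def search(table, query_triangles):
    """Return a results table: protein_id -> list of (query_atom_ids, target_atom_ids).

    The query atom IDs are kept with each hit because the extension phase needs them.
    """
    hits = defaultdict(list)
    for lengths, query_ids in query_triangles:
        for protein_id, target_ids in table.get(hash_key(lengths), []):
            hits[protein_id].append((query_ids, target_ids))
    return hits

# Hypothetical target and query triangles.
targets = [
    ("target_A", (9.49, 7.06, 7.01), (10, 20, 30)),
    ("target_B", (12.3, 5.1, 4.3), (1, 2, 3)),
]
query = [((9.5, 7.05, 7.01), (217, 231, 238))]
hits = search(build_table(targets), query)
```

Because matching is by exact bin, two nearly identical lengths that fall on opposite sides of a bin boundary would not match in this sketch; the slides do not say how the original tool handles that case.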

Extension phase: building the substructures
• Start from a list of triangle hits
• Build an adjacency structure (every vertex of the tree is a triangle)
• Use a graph searching algorithm to find larger substructures
• Measure structural similarity (RMSD*) between every substructure in the query protein and every substructure in the model protein
• Output common substructure pairs
*RMSD: root mean square distance
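The adjacency and graph-search steps might look like the following sketch. The slides do not define when two triangle hits count as linked or which graph search is used, so both are assumptions here: two hits are treated as adjacent when their triangles share at least two atoms in the query and at least two atoms in the target, and a breadth-first search collects the largest connected group. All names and sample hits are hypothetical.

```python
from collections import deque

def linked(hit_a, hit_b, min_shared=2):
    """Assumed adjacency rule: the triangles share an edge in both proteins."""
    (query_a, target_a), (query_b, target_b) = hit_a, hit_b
    return (len(set(query_a) & set(query_b)) >= min_shared
            and len(set(target_a) & set(target_b)) >= min_shared)

def largest_substructure(hits):
    """hits: list of (query_atom_ids, target_atom_ids). Returns the largest linked group."""
    n = len(hits)
    # Adjacency structure: each vertex is one triangle hit.
    adjacency = {i: [j for j in range(n) if j != i and linked(hits[i], hits[j])]
                 for i in range(n)}
    seen, best = set(), []
    for start in range(n):
        if start in seen:
            continue
        component, queue = [], deque([start])
        seen.add(start)
        while queue:  # breadth-first search from this hit
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        if len(component) > len(best):
            best = component
    return [hits[i] for i in best]

# Three chained hits plus one isolated hit (made-up atom IDs).
hits = [
    ((1, 2, 3), (11, 12, 13)),
    ((2, 3, 4), (12, 13, 14)),
    ((3, 4, 5), (13, 14, 15)),
    ((7, 8, 9), (21, 22, 23)),
]
largest = largest_substructure(hits)
```

This sketch does not check that the atom-to-atom correspondence stays consistent across linked triangles or that the grown substructure still superimposes well; the RMSD measurement listed above would serve as that check.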

Case study: Dehydrogenase superfamily
• 1B3R: Hydrolase (Rat)
• 1CJC: Reductase (Cow)
• 1CF2: Dehydrogenase (Bacteria)

Dehydrogenases: Shared structural element
[Figure: recurring substructure shown in 1B3R, 1CF2, and 1CJC]

Dehydrogenases: building the common substructure
• Triangle from query protein (green) matches triangle from target protein (pink)
• Other overlapping triangle matches are extended from the initial triangle to find the largest common substructure
• RMSD (root mean square distance) less than 1.0 Angstrom indicates a good match
• RMSD is measured at each extension step to ensure validity of the larger match
[Figure labels: RMSD: 0.32 Angstroms; RMSD: 0.66 Angstroms]
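The idea of re-checking RMSD at every extension step can be sketched as below. The slides do not say how RMSD is computed; coordinate RMSD normally requires an optimal superposition first, so this sketch swaps in distance RMSD (the root mean square difference between corresponding intra-structure distances), which needs no superposition. The 1.0 Angstrom threshold comes from the slide; the function names, the greedy acceptance rule, and the coordinates are assumptions.

```python
import math
from itertools import combinations

def distance_rmsd(query_coords, target_coords):
    """Stand-in for RMSD: compare all pairwise distances within each structure."""
    pairs = list(combinations(range(len(query_coords)), 2))
    total = sum(
        (math.dist(query_coords[i], query_coords[j])
         - math.dist(target_coords[i], target_coords[j])) ** 2
        for i, j in pairs
    )
    return math.sqrt(total / len(pairs))

def extend_match(seed, candidates, threshold=1.0):
    """Grow a match one atom pair at a time, keeping a pair only if RMSD stays low.

    seed, candidates: lists of (query_xyz, target_xyz) atom pairs.
    """
    matched = list(seed)
    for pair in candidates:
        trial = matched + [pair]
        query_coords = [q for q, _ in trial]
        target_coords = [t for _, t in trial]
        if distance_rmsd(query_coords, target_coords) < threshold:
            matched = trial  # the larger match is still valid
    return matched

# Made-up coordinates: the target is the query shifted by 10 Angstroms along x.
seed = [((0.0, 0.0, 0.0), (10.0, 0.0, 0.0)),
        ((3.0, 0.0, 0.0), (13.0, 0.0, 0.0)),
        ((0.0, 4.0, 0.0), (10.0, 4.0, 0.0))]
candidates = [((0.0, 0.0, 5.0), (10.0, 0.0, 5.0)),    # consistent with the shift
              ((6.0, 0.0, 0.0), (10.0, 0.0, 20.0))]   # inconsistent, should be rejected
extended = extend_match(seed, candidates)
```

Distance RMSD cannot distinguish a structure from its mirror image, so a superposition-based RMSD would be the stricter choice where that matters.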

Results: Zinc finger protein family
[Figure: DNA-binding substructure; Zinc-binding substructure]
[Figure labels: 10 positions, RMSD: 0.46 Angstroms; 4 positions, RMSD: 0.35 Angstroms]

Conclusions and Future Work
Geometric hashing of proteins shows promise as an important technique with a very good fit to many parallel architectures. Areas of future work include:
• Molecular docking: Identify potential drugs that are least likely to cause side effects.
• Function prediction: Create a database of conserved substructures that indicate a specific protein function.
• Structure prediction: Use sequence patterns with structural templates to predict the structure of new sequences.
