1 / 16

Similarity Search in Protein Databases

Similarity Search in Protein Databases. David Hoksza , hoksza@ksi.mff.cuni.cz Supervisor: Tom áš Skopal, skopal @ ksi.mff.cuni.cz KSI MFF UK. Thesis Contributions. Protein sequence similarity metric indexing approach sequential approach

justus
Download Presentation

Similarity Search in Protein Databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Similarity Search in Protein Databases David Hoksza, hoksza@ksi.mff.cuni.cz Supervisor: Tomáš Skopal, skopal@ksi.mff.cuni.cz KSI MFF UK

  2. Thesis Contributions • Protein sequence similarity • metric indexing approach • sequential approach • speedup within existing similarity model • Protein structure similarity • nonalignment-based similarity approach • alignment-based similarity approach • similarity modelling itself

  3. Protein structure & Motivation • Transportation, building, signalling, catabolism, ... • Molecule consisting of 20 types of amino acids (AA) • Central dogma of molecular biology • DNA → RNA → protein • Proteins’ 3D interactions secure biological function • protein structure similarity → biological function similarity • protein sequence similarity → protein structure similarity • Source: http://en.wikipedia.org/wiki/File:Main_protein_structure_levels_en.svg

  4. Protein Sequence Similarity • Edit distance • minimal number of editing operations (insert/update/delete) needed for converting one sequence to the other • alignment • Weighted edit distance • scoring matrix (PAM, BLOSSUM, …) • amino acids’ mutation probability • affine gap penalty system • global alignment • dynamic programming - Needleman-Wunsch (NW) algorithm • local alignment • dynamic programming - Smith-Waterman (SW) algorithm • the locality leads to nonmetric measures

  5. Metric Indexing Approach • Using metric access methods (MAMs) • organize objects into regions for effective filtering • require metric distance function • Distance function • E-value with SW • standard in biological databases • statistical relevance • dealing with • reflexivity, symmetry, triangular inequality (Trigen) D. Hoksza, T. Skopal. Index-Based Approach to Similarity Search in Protein and Nucleotide Databases. DATESO 2007

  6. Metric Indexing Approach • Swissprot database • 3,000 DB sequences • 100 query sequences D. Hoksza, T. Skopal. Index-Based Approach to Similarity Search in Protein and Nucleotide Databases. DATESO 2007

  7. Sequence-Based Approach • Speeding up distance computation itself • Skipping common parts in the DP matrix • context dependency problem • prefixes • database sort • skipping common prefixes • prefix ratio (PR) = speedup • suffixes • database division D. Hoksza. Improved Alignment of Protein Sequences Based on Common Parts, ISBRA 2008, LNCS

  8. Sequence-Based Approach • Uniprot database D. Hoksza. Improved Alignment of Protein Sequences Based on Common Parts, ISBRA 2008, LNCS

  9. Protein Structure Similarity • Nonalignment-based approach • indexing • feature extraction • simple or ad-hoc similarity measure • Alignment-based approach • sequential scan • feature extraction • alignment • superposition (3D transformation) • similarity measure • RMSD, TM-score • Quality criterion • classification accuracy (SCOP hierarchical classification)

  10. Density-Based Feature Extraction • Features • n-dimensional vectors of real numbers • AA ≈ viewpoint → VPT (viewpoint tag) • sDens • density of AAs in rings of predefined width • sRad • widths of rings containing predefined percentage ofAAs D. Hoksza. DDPIn: Distance and Density-Based Protein Indexing, CIBCB 2009, IEEE

  11. Nonalignment-Based Approach • One-step search • database creation • AAs → feature vectors • indexing using weighted L2 metric and MAM • querying • AAs → feature vectors • feature vector → query object • results’ merging • SCOP classification • Two-step search • 2 one-step searches • results’ comparison • rescoring using Smith-Waterman • SCOP classification D. Hoksza. DDPIn: Distance and Density-Based Protein Indexing, CIBCB 2009, IEEE

  12. Alignment-Based Approach D. Hoksza, J. Galgonek. Density-Based Classification of Protein Structures Using Iterative TM-score, BIBMW 2009, IEEE D. Hoksza, J. Galgonek. Alignment-Based Extension to DDPIn Feature Extraction, IJCB, ACTA Press, 2010

  13. Alignment-Based Approach • Feature extraction • AAs → feature vectors • density-based feature extraction • Alignment (amino acid matching) • Smith-Waterman alignment • distance between feature vectors → scoring matrix • modified variable gap penalty system • Superposition + scoring • RMSD • TM-score • reducing number of initial states • iterative dynamic programming with belt-based restriction D. Hoksza, J. Galgonek. Density-Based Classification of Protein Structures Using Iterative TM-score, BIBMW 2009, IEEE D. Hoksza, J. Galgonek. Alignment-Based Extension to DDPIn Feature Extraction, IJCB, ACTA Press, 2010

  14. Alignment-Based Approach - Indexing • Indexability measures • intrinsic dimensionality • objects’ distribution • ball overlap factor (BOF) • ball regions’ separation • T-error • nontriangular triplets • Semimetrization • reranking • Metrization • Trigen J. Galgonek, D. Hoksza. On the Effectiveness of Distances Measuring Protein Structure Similarity, SISAP 2009, IEEE

  15. Conclusion • Protein sequence search • indexing • data domain exploration • speeding distance computation • 20% speedup • expected speedup growth • Protein structure search • nonalignment-based • best accuracy on all SCOP levels • alignment-based • best accuracy on superfamily and fold SCOP levels • indexing possibilities • comparison with nonalignment-based

  16. Publications D. Hoksza, T. Skopal Index-Based Approach to Similarity Search in Protein and Nucleotide Databases. DATESO 2007 D. Hoksza Improved Alignment of Protein Sequences Based on Common Parts ISBRA 2008, LNCS D. Hoksza DDPIn: Distance and Density-Based Protein Indexing CIBCB 2009, IEEE D. Hoksza , J. Galgonek Density-Based Classification of Protein Structures Using Iterative TM-score BIBMW 2009, IEEE J. Galgonek, D. Hoksza On the Effectiveness of Distances Measuring Protein Structure Similarity SISAP 2009, IEEE D. Hoksza , J. Galgonek Alignment-Based Extension to DDPIn Feature Extraction IJCB, ACTA Press, 2010

More Related