SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Two methods to predict domain boundary sequence positions from sequence information alone: SnapDRAGON (based on protein 3D structure prediction) and DOMAINATION (based on PSI-BLAST). An example of two different bioinformatics approaches to the same problem.

Presentation Transcript


  1. Two methods to predict domain boundary sequence positions from sequence information alone • SnapDRAGON: based on protein 3D structure prediction • DOMAINATION: based on PSI-BLAST • An example of two different bioinformatics approaches to the same problem

  2. SnapDRAGON: combining protein secondary and tertiary structure prediction to predict structural domains in sequence data • Richard A. George • Jaap Heringa • George, R.A. & Heringa, J. (2002) J. Mol. Biol. 316, 839-851.

  3. Protein structure evolution Insertion/deletion of secondary structural elements can ‘easily’ be done at loop sites

  4. Flavodoxin family - TOPS diagrams (Flores et al., 1994) [Figure: TOPS diagrams with numbered secondary structure elements]

  5. Protein structure evolution Insertion/deletion of structural domains can ‘easily’ be done at loop sites [Figure: structure with N- and C-termini labelled]

  6. A domain is a: • Compact, semi-independent unit (Richardson, 1981). • Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973). • Recurring functional and evolutionary module (Bork, 1992). • “Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).

  7. The DEATH Domain • Present in a variety of Eukaryotic proteins involved with cell death. • Six helices enclose a tightly packed hydrophobic core. • Some DEATH domains form homotypic and heterotypic dimers. http://www.mshri.on.ca/pawson

  8. Delineating domains is essential for: • Obtaining high resolution structures (x-ray, NMR) • Sequence analysis • Multiple sequence alignment methods • Prediction algorithms (SS, Class, secondary/tertiary structure) • Fold recognition and threading • Elucidating the evolution, structure and function of a protein family (e.g. ‘Rosetta Stone’ method) • Structural/functional genomics • Cross genome comparative analysis

  9. Structural domain organisation can be nasty… Pyruvate kinase (phosphotransferase) • β barrel regulatory domain • α/β barrel catalytic substrate binding domain • α/β nucleotide binding domain • 1 continuous + 2 discontinuous domains

  10. Protein structure hierarchical levels • PRIMARY STRUCTURE (amino acid sequence): VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH • SECONDARY STRUCTURE (helices, strands) • TERTIARY STRUCTURE (fold) • QUATERNARY STRUCTURE

  14. Domain prediction using DRAGON: Distance Regularisation Algorithm for Geometry OptimisatioN (Aszodi & Taylor, 1994) • Folds proteins based on the requirement that (conserved) hydrophobic residues cluster together. • First constructs a random high-dimensional Cα distance matrix. • Distance geometry is used to find the 3D conformation corresponding to a prescribed target matrix of desired distances between residues.

  15. The DRAGON target matrix is inferred from: • A multiple sequence alignment of a protein (old) • Conserved hydrophobicity • Secondary structure information (SnapDRAGON), predicted by PREDATOR (Frishman & Argos, 1996); strands are entered as distance constraints from the N-terminal Cα to the C-terminal Cα.

  16. [Figure: input data (multiple alignment, predicted secondary structure, e.g. CCHHHCCEEE) → target matrix and Cα distance matrix → 100 randomised initial matrices → 100 predictions] • The Cα distance matrix is divided into smaller clusters. • Separately, each cluster is embedded into a local centroid. • The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures.

  17. SnapDRAGON [Figure: multiple alignment and predicted secondary structure (e.g. CCHHHCCEEE) → folds generated by DRAGON → boundary recognition → summed and smoothed boundaries]

  18. SnapDRAGON • Domains in the model structures are assigned using the method of Taylor (1997) • Domain boundary positions of each model are recorded against the sequence • Boundaries are summed and smoothed (biased window protocol)
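
A minimal sketch of the summing step, assuming each DRAGON model has already been reduced to a list of predicted boundary positions (0-based residue indices). The function name and the input layout are hypothetical and are not taken from the SnapDRAGON implementation; smoothing is left out here.

```python
from collections import Counter
from typing import List, Sequence


def sum_boundaries(model_boundaries: Sequence[Sequence[int]], length: int) -> List[int]:
    """Count, per residue, how many models place a domain boundary there.

    model_boundaries: one list of boundary positions per 3D model (assumed input).
    length: number of residues in the query sequence.
    """
    votes = Counter(pos for boundaries in model_boundaries for pos in boundaries)
    # Positions outside the sequence are simply never read back.
    return [votes.get(i, 0) for i in range(length)]


# Three hypothetical models of a 10-residue sequence.
profile = sum_boundaries([[3], [3, 7], [4]], length=10)
print(profile)  # [0, 0, 0, 2, 1, 0, 0, 1, 0, 0]
```

Positions supported by many models stand out as peaks in the summed profile, which is the consistency signal the method relies on.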

  19. Prediction assessment • Test set of 414 multiple alignments: 183 single-domain and 231 multiple-domain proteins. • Sequence searches using PSI-BLAST (Altschul et al., 1997) followed by redundancy filtering using OBSTRUCT (Heringa et al., 1992) and alignment by PRALINE (Heringa, 1999) • Boundary predictions are compared to the region of the protein connecting two domains (min 10 residues)

  20. Average prediction results per protein • Coverage is the % of linkers predicted: TP/(TP+FN) • Success is the % of correct predictions made: TP/(TP+FP)
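
The two measures can be written out as a short sketch. The function names and the example counts are illustrative assumptions; the counts are not results from the presentation.

```python
def coverage(tp: int, fn: int) -> float:
    """Percentage of true linkers that were predicted: TP / (TP + FN)."""
    total = tp + fn
    return 100.0 * tp / total if total else 0.0


def success(tp: int, fp: int) -> float:
    """Percentage of predictions that were correct: TP / (TP + FP)."""
    total = tp + fp
    return 100.0 * tp / total if total else 0.0


# Hypothetical counts for a single protein.
print(coverage(tp=3, fn=1))  # 75.0
print(success(tp=3, fp=3))   # 50.0
```

Coverage corresponds to what is usually called recall (sensitivity) and success to precision.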

  21. SnapDRAGON • Is very slow (can take hours for proteins > 400 aa) – cluster computing implementation • Uses consistency in the absence of a standard of truth • Goes from primary + secondary to tertiary structure to ‘just’ chop protein sequences • SnapDRAGON webserver is underway

  22. DOMAINATION: integrating protein sequence database searching and domain recognition • Richard A. George • Protein domain identification and improved sequence searching using PSI-BLAST (George & Heringa, Prot. Struct. Func. Genet., in press; 2002)

  23. Domaination • Current iterative homology search methods do not take into account that: • Domains may have different ‘rates of evolution’. • Common conserved domains, such as the tyrosine kinase domain, can obscure weak but relevant matches to other domain types • Premature convergence (false negatives) • Matrix migration / Profile wander (false positives).

  24. PSI-BLAST • Query sequence is first scanned for the presence of so-called low-complexity regions (Wootton and Federhen, 1996), i.e. regions with a biased composition (e.g. TM regions or coiled coils) likely to lead to spurious hits, which are excluded from alignment. • Initially operates on a single query sequence by performing a gapped BLAST search • Then takes significant local alignments found, constructs a ‘multiple alignment’ and abstracts a position specific scoring matrix (PSSM) from this alignment. • Rescans the database in a subsequent round to find more homologous sequences. Iteration continues until the user decides to stop or the search converges.

  25. PSI-BLAST iteration [Figure: query sequence Q → gapped BLAST search → database hits → PSSM → gapped BLAST search with the PSSM → database hits]

  26. DOMAINATION Chop and Join Domains

  27. Post-processing low complexity • Remove local fragments with > 15% low complexity (LC)
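
A minimal sketch of this filter, assuming low-complexity residues in each local fragment have already been masked with the character "X" (as low-complexity filters commonly do). The function names and the masking convention are assumptions, not details given in the presentation.

```python
from typing import List


def lc_fraction(masked_fragment: str, mask_char: str = "X") -> float:
    """Fraction of residues in a fragment that are masked as low complexity."""
    if not masked_fragment:
        return 0.0
    return masked_fragment.count(mask_char) / len(masked_fragment)


def drop_low_complexity(fragments: List[str], max_lc: float = 0.15) -> List[str]:
    """Keep only fragments whose low-complexity fraction does not exceed max_lc."""
    return [f for f in fragments if lc_fraction(f) <= max_lc]


# Hypothetical fragments with 0%, 30% and 10% masked residues.
kept = drop_low_complexity(["ACDEFGHIKL", "ACXXXGHIKL", "AXDEFGHIKL"])
print(kept)  # ['ACDEFGHIKL', 'AXDEFGHIKL']
```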

  28. Identifying domain boundaries • Sum N- and C-termini of gapped local alignments • True N- and C-termini are counted twice (within 10 residues) • Boundaries are smoothed using two windows (15 residues long) • Combine scores using biased protocol: if N_i × C_i = 0 then S_i = N_i + C_i, else S_i = N_i + C_i + (N_i × C_i)/(N_i + C_i)
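
The boundary scoring can be sketched as follows. This is a simplified illustration under stated assumptions: alignments are given as 0-based inclusive (start, end) positions on the query, smoothing is taken to be a plain moving average (the slide does not say which smoothing is used), and the double counting of true termini is not modelled. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def count_termini(alignments: Sequence[Tuple[int, int]], length: int):
    """Per-residue counts of alignment N-termini (starts) and C-termini (ends)."""
    n_counts = [0.0] * length
    c_counts = [0.0] * length
    for start, end in alignments:
        n_counts[start] += 1
        c_counts[end] += 1
    return n_counts, c_counts


def smooth(profile: Sequence[float], window: int = 15) -> List[float]:
    """Moving average over a centred window, truncated at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(profile)):
        lo, hi = max(0, i - half), min(len(profile), i + half + 1)
        out.append(sum(profile[lo:hi]) / (hi - lo))
    return out


def combine(n_scores: Sequence[float], c_scores: Sequence[float]) -> List[float]:
    """Biased protocol: add a bonus where N- and C-terminal signals coincide."""
    scores = []
    for ni, ci in zip(n_scores, c_scores):
        if ni * ci == 0:
            scores.append(ni + ci)
        else:
            scores.append(ni + ci + (ni * ci) / (ni + ci))
    return scores


# Hypothetical local alignments on a 60-residue query.
n_counts, c_counts = count_termini([(0, 29), (30, 59), (28, 59)], length=60)
scores = combine(smooth(n_counts), smooth(c_counts))
print(max(range(60), key=scores.__getitem__))  # position with the strongest signal
```

The extra term rewards positions where one set of alignments ends and another begins, which is what a domain boundary is expected to look like.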

  29. Identifying domain deletions • Deletions in the query (or insertions in the DB sequences) are identified when two adjacent segments in the query align to the same DB sequences (>70% overlap), which have a region of >35 residues not aligned to the query (remove N- and C-termini). [Figure: DB sequence aligned against the query]
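
A simplified sketch of the deletion test for one pair of local alignments to the same DB sequence. The `LocalAlignment` type, the function name and the `max_query_gap` tolerance for "adjacent" are assumptions introduced for illustration; the >70% overlap criterion across DB sequences is not modelled.

```python
from typing import NamedTuple


class LocalAlignment(NamedTuple):
    """Inclusive coordinates of one local alignment (hypothetical layout)."""
    q_start: int
    q_end: int
    db_start: int
    db_end: int


def is_query_deletion(a: LocalAlignment, b: LocalAlignment,
                      min_db_gap: int = 35, max_query_gap: int = 5) -> bool:
    """True if two adjacent query segments flank a long unaligned DB region."""
    first, second = sorted((a, b), key=lambda aln: aln.q_start)
    query_gap = second.q_start - first.q_end - 1
    db_gap = second.db_start - first.db_end - 1
    return 0 <= query_gap <= max_query_gap and db_gap > min_db_gap


# The DB sequence carries 49 extra residues between the two aligned segments.
left = LocalAlignment(1, 100, 1, 100)
right = LocalAlignment(101, 200, 150, 250)
print(is_query_deletion(left, right))  # True
```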

  30. Identifying domain permutations • A domain shuffling event is declared when two local alignments (>35 residues) within a single DB sequence match two separate segments in the query (>70% overlap), but have a different sequential order. [Figure: segments a, b in the query appear in the order b, a in the DB sequence]
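
The order comparison behind this rule can be sketched for a pair of local alignments to one DB sequence. The `LocalAlignment` type and function name are hypothetical, and the >70% overlap criterion is left out of this simplified version.

```python
from typing import NamedTuple


class LocalAlignment(NamedTuple):
    """Inclusive coordinates of one local alignment (hypothetical layout)."""
    q_start: int
    q_end: int
    db_start: int
    db_end: int


def is_permuted(a: LocalAlignment, b: LocalAlignment, min_len: int = 35) -> bool:
    """True if two long, separate alignments occur in opposite order in query and DB."""
    for aln in (a, b):
        if aln.q_end - aln.q_start + 1 <= min_len:
            return False
    q_disjoint = a.q_end < b.q_start or b.q_end < a.q_start
    db_disjoint = a.db_end < b.db_start or b.db_end < a.db_start
    if not (q_disjoint and db_disjoint):
        return False
    # Shuffled if the segment that comes first in the query comes second in the DB.
    return (a.q_start < b.q_start) != (a.db_start < b.db_start)


seg_a = LocalAlignment(1, 50, 101, 150)
seg_b = LocalAlignment(60, 110, 1, 51)
print(is_permuted(seg_a, seg_b))  # True
```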

  31. Identifying continuous and discontinuous domains • Each segment is assigned an independence score (In). • If In > 10% the segment is assigned as a continuous domain. • An association score is calculated between non-adjacent fragments by assessing the shared sequence hits to the segments. If score > 50% then segments are considered as discontinuous domains and joined.
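
One possible reading of the association score, shown as a sketch. The slide does not give the exact formula, so the normalisation used here (shared hits divided by the size of the smaller hit set) is an assumption, as are the function names.

```python
from typing import Set


def association_score(hits_a: Set[str], hits_b: Set[str]) -> float:
    """Percentage of database hits shared by two query segments.

    Assumed normalisation: shared hits relative to the smaller hit set.
    """
    if not hits_a or not hits_b:
        return 0.0
    shared = len(hits_a & hits_b)
    return 100.0 * shared / min(len(hits_a), len(hits_b))


def should_join(hits_a: Set[str], hits_b: Set[str], threshold: float = 50.0) -> bool:
    """Join two non-adjacent segments into a discontinuous domain if score > threshold."""
    return association_score(hits_a, hits_b) > threshold


# Hypothetical hit identifiers for two non-adjacent segments.
print(should_join({"s1", "s2", "s3", "s4"}, {"s2", "s3", "s9"}))  # True
```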

  32. Create domain profiles • A representative set of the database sequence fragments that overlap a putative domain is selected for alignment using OBSTRUCT (Heringa et al. 1992), with > 20% and < 60% sequence identity (including the query seq). • A multiple sequence alignment is generated using PRALINE (Heringa 1999). • Each domain multiple alignment is used as a profile in further database searches using PSI-BLAST (Altschul et al. 1997). • The whole process is iterated until no new domains are identified.

  33. Domain boundary prediction accuracy • Set of 452 multidomain proteins • 56% of proteins were correctly predicted to have more than one domain • 42% of predictions are within 20 residues of a true boundary • 49.9% (44.6%) correct boundary predictions per protein

  34. 23.3% of all linkers found in 452 multidomain proteins. Not a surprise since: • Structural domain boundaries will not always coincide with sequence domain boundaries • Proteins must have some domain shuffling • For discontinuous proteins 34.2% of linkers were identified • 30% of discontinuous domains were successfully joined

  35. Change in domain prediction accuracy using various PSI-BLAST E-value cut-offs

  36. Benchmarking versus PSI-BLAST • A set of 452 non-homologous multidomain protein structures. • Each protein was delineated into its structural domains. Database searches of the individual domains were used as a standard of truth. • We then tested to what extent PSI-BLAST and DOMAINATION, when run on the full-length protein sequences, can capture the sequences found by the reference PSI-BLAST searches using the individual domains.

  37. Two sets based on individual domain searches: • Reference set 1: consists of database sequences for which PSI-BLAST finds all domains contained in the corresponding full length query. • Reference set 2: consists of database sequences found by searching with one or more of the domain sequences • Therefore set 2 contains many more sequences than set 1 [Figure: query and DB sequences illustrating Ref set 1 and Ref set 2]
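
In set terms, reference set 1 is the intersection of the per-domain hit lists and reference set 2 is their union. A minimal sketch, assuming the hits of each individual domain search are available as sets of sequence identifiers (the identifiers and function name are hypothetical):

```python
from typing import Dict, Set, Tuple


def reference_sets(domain_hits: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Build both reference sets from per-domain search results.

    Set 1: sequences hit by every domain of the query (intersection).
    Set 2: sequences hit by at least one domain (union).
    """
    hit_sets = [set(hits) for hits in domain_hits.values()]
    if not hit_sets:
        return set(), set()
    return set.intersection(*hit_sets), set.union(*hit_sets)


ref1, ref2 = reference_sets({
    "domain1": {"P1", "P2", "P3"},
    "domain2": {"P2", "P3", "P4"},
})
print(sorted(ref1))  # ['P2', 'P3']
print(sorted(ref2))  # ['P1', 'P2', 'P3', 'P4']
```

Because an intersection can never be larger than the corresponding union, set 2 is always at least as large as set 1.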

  38. Sequences found over Reference sets 1 and 2

  39. Reference set 1: • PSI-BLAST finds 97.9% of sequences • DOMAINATION finds 99.1% of sequences. Reference set 2: • PSI-BLAST finds 83.2% of sequences • DOMAINATION finds 90.6% of sequences

  40. Sequences found over Reference sets 1 and 2 from 15 Smart sequences

  41. SSEARCH significance test • Verify the statistical significance of database sequences found by relating them to the original query sequence. • SSEARCH (Pearson & Lipman 1988). Calculates an E-value for each generated local alignment. • This filter will lose distant homologies. • Use the 452 proteins with known structure.

  42. Significant sequences found in database searches At an E-value cut-off of 0.1 the performance of DOMAINATION searches with the full-length proteins is 15% better than PSI-BLAST

  43. Summary • Domains are recurring evolutionary units: by collecting the N- and C- termini of local alignments we can identify domain boundaries. • By finding domains we can significantly improve database search methods • SnapDRAGON is more sensitive than DOMAINATION but at high computational cost
