1 / 42

Predicting Function (& location & post- tln modifications ) from Protein Sequences

Predicting Function (& location & post- tln modifications ) from Protein Sequences. June 24, 2014. Outline. Usefulness of protein domain analysis Types of protein domain databases Interpro scan of multiple domain DB Using the SMART database Predicting post-translational modifications.

krista
Download Presentation

Predicting Function (& location & post- tln modifications ) from Protein Sequences

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Predicting Function (& location & post-tln modifications) from Protein Sequences June 24, 2014

  2. Outline • Usefulness of protein domain analysis • Types of protein domain databases • Interpro scan of multiple domain DB • Using the SMART database • Predicting post-translational modifications

  3. When annotation is NOT enough • You’ve got a list of genes, most of which have been annotated with gene ontology and a potential protein function • Why would you want to go on and look more specifically at the protein domains?

  4. Limitations of annotation • Even in a model organism with large amount of resources, most genes are still annotated by similarity • Often, the name given is based on the BEST match to a particular domain or known protein • But…

  5. Limitations of BLAST • Likelihood of finding a homolog to a sequence: • >80% bacteria • >70% yeast • ~60% animal • Rest are truly novel sequences • ~900/6500 proteins in yeast without a known function • NAME: Similar to yeast protein YAL7400 not very informative

  6. Limitations of similarity • Proteins with more than one domain cause problems. • Numerous matches to one domain can mask matches to other domains. • Increased size of protein databases • Number related sequences rises and less related sequence hits may be lost • Low-complexity regions can mask domain matches

  7. Proteins are modular • Individual domains can and often do fold independently of other domains within the same protein • Domains can function as an independent unit (or truncation experiments would never work) • Thus identity of ALL protein domains within a sequence can provide further clues about their function

  8. Protein with>1 domain The name: protein kinase receptor UFO doesn’t necessarily tell you that this protein also contains IgG and fibronectin domains or that it has a transmembrane domain

  9. Domains are not always functional • If a critical residue is missing in an active site, it’s not likely to be functional • A similarity score won’t pick that up

  10. Protein signature databases • Identify domains or classify proteins into families to allow inference of function • Approaches include: • regular expressions and profiles • position-specific scoring matrix-based fingerprints • automated sequence clustering • Hidden Markov Models (HMMs)

  11. PROSITE • Regular expression patterns describing functional motifs M-x-G-x(3)-[IV]2-x(2)-{FWY} • Enzyme catalytic sites • Prosthetic group attachment sites • Ligand or metal binding sites • Either matches or not • Some families/domains defined by co-occurrence

  12. Citrate synthase G-[FYAV]-[GA]-H-x-[IV]-x(1,2)-[RKTQ]-x(2)-[DV]-[PS]-R

  13. PRINTS • Similar to PROSITE patterns • Multiple-motif approach using either identity or weight-matrix as basis • Groups of conserved motif provide diagnostic protein family signatures • Can be created at super-family, family and sub-family level http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php

  14. Profile-HMMs • Models generated from alignments of many homologues then counting frequency of occurrence for each amino acid in each column of the alignment (profile). • Profile-HMMs used to create probabilities of occurrence against background evolutionary model that accounts for possible substitutions. • Provides convenient and powerful way of identifying homology between sequences. • Find domains in sequences that would never be found by BLAST alone

  15. HMM domain databases • Pfam • Classify novel sequences into protein domain profiles • Most comprehensive; >13,000 protein families (v26) • SMART • Signaling, extracellular and chromatin proteins • Identification of catalytic site conservation for enzymes • TIGRFAMs • Families of proteins from prokaryotes • PANTHER • Classification based on function using literature evidence

  16. PFAM • >14,800 manually curated profiles • Can use the profile to search a genome for matches

  17. Can submit a protein to PFAM • Limited to single protein submission • Output gives you an e-value that estimates the likelihood that the domain is there • Up to you to determine if domain is functional

  18. Keyword search

  19. PFAM Summary

  20. PFAM Domain Organization

  21. PFAM Interactions

  22. SMART database • SMART: Simple Modular Architecture Research Tool • Focus on signaling, extracellular and chromatin-associated proteins • Has only 1150 domains • Use? • I have several kinase domains in my protein list and want to know which ones are functional. • What other domains are found in signaling proteins?

  23. Search for matches Uniprot or Ensemble Protein Accession number Protein sequence Add other searches

  24. SMART Output Mouse over for information Prediction of FUNCTIONAL catalytic activity

  25. Can browse the domains

  26. InterPro Scan • Combines search methods from several protein databases • Uses tools provided by member databases • Uses threshold scores for profiles & motifs • Interpro convenient means of deriving a consensus among signature methods

  27. InterProScan – source DBs

  28. Advantage of InterProScan • Interpro integrates the different databases to create a protein family signature. • Pfam/SMART/PANTHER/Gene3D & TIGR-FAM will find domain families • PROSITE can find very specific signature patterns • PRINTS can distinguish related members of same protein family Cannot change the statistical cut-off for what is considered a significant match

  29. InterProScan Families

  30. Links from InterPro family

  31. Are 2 proteins homologs? • S. cerevisiae Ste3 is a GPCR pheromone receptor • Similarity to C. neoformans protein: • 26% identical, 45% similar, E-value 10-22

  32. Result of InterProScan

  33. F35E12.5 from Ex5 against NCBI CDD: F35E12.5 from Ex5 search with InterProScan: Different results may reflect different E-value cut-offs for significance

  34. Prediction algorithms need negative controls too.

  35. Function from sequence • Membrane bound or secreted? • GPI anchored? • Cellular localization? • Post-translational modification sites?

  36. CBS prediction services • Protein sorting • SignalP, TargetP, others • Post-translational modification • Acetylation, phosphorylation, glycosylation • Immunological features • Epitopes, MHC allele binding, ect • Protein function & structure • Transmembrane domains, co-evolving positions

  37. Transmembrane domain prediction

  38. Phosphorylation prediction

  39. O-glycosylation

  40. EMBOSS Open source software for molecular biology • Predict antigenic sites • Useful if want to design a peptide antibody • Look for specific motifs, even degenerate • Known phosphorylation motifs • Find motifs in multiple sequences with one submission • Get stats on proteins/nucleic acid sequences • Sequence manipulation of all kinds

  41. Today in lab • Tutorial on protein information sites • From a sublist generated in Ex6 using DAVID, generate a list of protein IDs and obtain the sequences • Obtain protein accession numbers for the cluster • Submit to SMART database to characterize/analyze the domains • Pick 2 proteins to do additional predictions with InterProScan & develop “virtual” biochemical tools

More Related