
Protein families, domains and motifs for functional prediction

Protein families, domains and motifs for functional prediction. June 27, 2019. Outline. Usefulness of protein domain analysis Types of protein domain databases SMART, HMMER and Interpro protein domain database Uniprot protein annotation Predicting post-translational modifications.

Presentation Transcript


  1. Protein families, domains and motifs for functional prediction June 27, 2019

  2. Outline • Usefulness of protein domain analysis • Types of protein domain databases • SMART, HMMER and InterPro protein domain databases • UniProt protein annotation • Predicting post-translational modifications

  3. Protein families • Groups of homologous sequences (within and across species) that share similar functions and domains • Examples: • Carbonic anhydrases (14 in humans) • Chitin synthases (8 in C. neoformans) • Ser/Thr kinases

  4. Protein domains • Conserved part of a protein sequence that can evolve, function and exist independently of the rest of the protein chain • Often independently stable and folded • Can recombine or evolve from gene duplications into proteins with different combinations of domains

  5. Protein motifs • Short linear peptide sequences that serve a specific function for the protein, but will not be stable or fold independently of the rest of the chain • Protein-protein interaction, ligand interactions, cleavage sites, targeting • Examples: • 14-3-3: Interaction with kinases • KELCH: ubiquitin targeting • SUMO: site recognized for modification by SUMO • Often found within intrinsically disordered regions

  6. Predicting function for unknown proteins • Do they belong (by sequence homology) to a protein family? • Do they contain known protein domains? • Do they have motifs that suggest a specific function?

  7. Limitations of annotation • Even in a model organism with a large amount of resources, most genes are still annotated by similarity • Often, the name given is based on the BEST match to a particular domain or known protein • But…

  8. Limitations of BLAST • Likelihood of finding a homolog to a sequence: • >80% bacteria • >70% yeast • ~60% animal • Rest are truly novel sequences • ~900/6500 proteins in yeast without a known function • A name such as "Similar to yeast protein YAL7400" is not very informative

  9. Limitations of similarity • Proteins with more than one domain cause problems • Numerous matches to one domain can mask matches to other domains • Increased size of protein databases • As the number of related sequences rises, hits to less-related sequences may be lost • Low-complexity regions can mask domain matches
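
To make the low-complexity point concrete, here is a minimal sketch that flags windows of a sequence with low Shannon entropy. It is an illustration only, not the filter that BLAST or any of the domain databases actually use. The function names, the window size, the threshold and the toy sequence are all assumptions chosen for the example.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def low_complexity_windows(seq, size=12, threshold=2.2):
    """Return 0-based start indices of windows whose entropy is below threshold.

    size and threshold are arbitrary illustrative values, not tuned defaults.
    """
    return [i for i in range(len(seq) - size + 1)
            if shannon_entropy(seq[i:i + size]) < threshold]

# Hypothetical sequence: 12 distinct residues followed by a poly-Q run.
toy = "ACDEFGHIKLMN" + "Q" * 12
hits = low_complexity_windows(toy)
print(hits)
```

Windows that overlap the poly-Q run are flagged, while the diverse N-terminal window is not. Regions flagged this way could be masked before a similarity search.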

  10. Proteins are modular • Individual domains can and often do fold independently of other domains within the same protein • Domains can function as an independent unit (or truncation experiments would never work) • Thus identity of ALL protein domains within a sequence can provide further clues about their function

  11. Proteins can have >1 domain • The name "protein kinase receptor UFO" doesn’t necessarily tell you that this protein also contains immunoglobulin-like (Ig) and fibronectin domains, or that it has a transmembrane domain

  12. Domains are not always functional • If a critical residue is missing in an active site, it’s not likely to be functional • A similarity score won’t pick that up
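
A similarity score cannot catch a missing active-site residue, but a direct check can. The sketch below assumes you already know, for example from an alignment to a characterized family member, which positions in your sequence should hold the critical residues. The function name, the positions and the toy sequences are hypothetical.

```python
def missing_critical_residues(sequence, required):
    """Report critical positions that do not hold an allowed residue.

    required maps 1-based sequence positions to a string (or set) of
    residues allowed at that position. Returns (position, found) pairs;
    "-" means the position lies outside the sequence.
    """
    problems = []
    for pos, allowed in sorted(required.items()):
        found = sequence[pos - 1] if 0 < pos <= len(sequence) else "-"
        if found not in allowed:
            problems.append((pos, found))
    return problems

# Hypothetical example: position 3 must be D and position 5 must be H.
required = {3: "D", 5: "H"}
print(missing_critical_residues("MKDSH", required))  # intact site
print(missing_critical_residues("MKASH", required))  # D3 replaced by A
```

An empty list means every listed position carries an allowed residue. That is necessary for function but does not prove it.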

  13. Protein signature databases • Identify domains or classify proteins into families to allow inference of function • Approaches include: • regular expressions and profiles • position-specific scoring matrix-based fingerprints • automated sequence clustering • Hidden Markov Models (HMMs)

  14. PROSITE • Regular expression patterns describing functional motifs, e.g. M-x-G-x(3)-[IV](2)-x(2)-{FWY} • Enzyme catalytic sites • Prosthetic group attachment sites • Ligand or metal binding sites • Either matches or not • Some families/domains defined by co-occurrence

  15. Citrate synthase G-[FYAV]-[GA]-H-x-[IV]-x(1,2)-[RKTQ]-x(2)-[DV]-[PS]-R
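
PROSITE patterns map almost directly onto ordinary regular expressions, which is why a pattern "either matches or not". The sketch below translates the basic syntax: `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues and `(n)` or `(n,m)` for repeats. It deliberately ignores the `<` and `>` terminus anchors. The function name and the test peptide are made up for illustration.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as (3) or (1,2).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{"):           # excluded residues
            core = "[^" + core[1:-1] + "]"
        # [..] groups and single letters carry over unchanged.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

CITRATE_SYNTHASE = "G-[FYAV]-[GA]-H-x-[IV]-x(1,2)-[RKTQ]-x(2)-[DV]-[PS]-R"
regex = prosite_to_regex(CITRATE_SYNTHASE)
print(regex)

# Hypothetical peptide built to satisfy the pattern, not a real protein.
match = re.search(regex, "MKGFGHAIARAADPRLL")
print(match.group(0) if match else "no match")
```

Changing a single required residue in the peptide, for instance the H, makes the match disappear entirely. A pattern gives no partial score.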

  16. Profile-HMMs • Models generated from alignments of many homologues then counting frequency of occurrence for each amino acid in each column of the alignment (profile). • Profile-HMMs used to create probabilities of occurrence against background evolutionary model that accounts for possible substitutions. • Provides convenient and powerful way of identifying homology between sequences. • Find domains in sequences that would never be found by BLAST alone
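
The column-counting step can be sketched as follows. This builds only a simple log-odds profile against a uniform background. It has none of the insert and delete states or the substitution-aware priors of a real profile-HMM, so treat it as a simplified stand-in. The alignment, the pseudocount and the function names are assumptions for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Per-column log2-odds scores versus a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        # Count residues in this column, skipping gap characters.
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score(profile, segment):
    """Sum the column scores for an ungapped segment of the profile's length."""
    return sum(col[aa] for col, aa in zip(profile, segment))

# Hypothetical ungapped alignment of four short homologous segments.
toy_alignment = ["GDSGGP", "GDSGGP", "GDSGSP", "GNSGGP"]
profile = build_profile(toy_alignment)
print(score(profile, "GDSGGP"), score(profile, "AAAAAA"))
```

A segment resembling the alignment scores well above an unrelated one, and a near miss such as GNSGSP still scores positively. That graded behaviour is what lets profile methods find relatives that a strict pattern would reject.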

  17. HMM domain databases • PFAM • Classify novel sequences into protein domain profiles • Most comprehensive; >16,000 protein families (v31) • Now working with Uniprot • SMART • Signaling, extracellular and chromatin proteins • Identification of catalytic site conservation for enzymes • TIGRFAMs • Families of proteins from prokaryotes • PANTHER • Classification based on function using literature evidence

  18. PFAM • Manually curated profiles • E-value: a statistical measure of the likelihood that an alignment occurred by chance alone • Does not indicate functionality

  19. PFAM Summary

  20. PFAM Domain Organization

  21. PFAM “parts” definitions 1) Domain: A collection of related sequence regions that form a distinct structural unit. 2) Family: A collection of related sequence regions that may contain one or more domains, but where there is insufficient evidence to support subdivision. 3) Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present. 4) Motif: A short unit that carries a distinct role, for example in metal binding. 5) Coiled-coil: Regions of a protein that form alpha-helices that align against each other to form a distinctive structure called a coiled-coil. 6) Disordered: Regions on proteins that are inherently disordered but have sequence conservation.

  22. SMART database • SMART: Simple Modular Architecture Research Tool • Focus on signaling, extracellular and chromatin-associated proteins • Curated models for >1300 domains • Use? • I have several kinase domains in my protein list and want to know which ones are functional. • What other signaling proteins are in my list? • What other domains are found in signaling proteins?

  23. SMART: Search interface • UniProt or Ensembl protein accession number • Add other searches

  24. SMART Output

  25. HMMER • Fast, sensitive protein homology searches using HMMs • Results include taxonomic distribution of matches • PFAM domains • Transmembrane, coiled-coil, signal peptide and intrinsically disordered regions • Typically faster* than BLAST or InterProScan * PHMMER search speeds vary depending on usage

  26. ADCY3 PHMMER search

  27. InterProScan • Combines search methods from several protein databases • Uses tools provided by member databases • Uses threshold scores for profiles & motifs • InterPro is a convenient means of deriving a consensus among signature methods • InterPro records are integrated with UniProt • Slow to return search results • Can bookmark it and come back to it later

  28. ADCY3 InterPro match • Different domain databases: • PF -> PFAM • PS -> PROSITE • PR -> PRINTS • SM -> SMART • Groups different domains into related signatures
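
The prefix convention makes it easy to sort a list of signature accessions by source database. The sketch below covers only the four prefixes listed on the slide. The function names and the accession numbers in the example are placeholders, not real InterPro results.

```python
# Prefixes as listed on the slide; other member databases are not covered.
SIGNATURE_PREFIXES = {"PF": "PFAM", "PS": "PROSITE", "PR": "PRINTS", "SM": "SMART"}

def source_database(accession):
    """Look up the member database from the two-letter accession prefix."""
    return SIGNATURE_PREFIXES.get(accession[:2].upper(), "unknown")

def group_by_database(accessions):
    """Group accessions by source database, preserving input order."""
    groups = {}
    for acc in accessions:
        groups.setdefault(source_database(acc), []).append(acc)
    return groups

# Placeholder accessions for illustration only.
print(group_by_database(["PF99999", "SM99999", "PS99999", "PF99998", "XX00001"]))
```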

  29. Adenylyl cyclase class-3/4/guanylyl cyclase

  30. Function from sequence • Membrane bound or secreted? • GPI anchored? • Cellular localization? • Post-translational modification sites?

  31. CBS prediction services • Protein sorting • SignalP, TargetP, others • Post-translational modification • Acetylation, phosphorylation, glycosylation • Immunological features • Epitopes, MHC allele binding, etc. • Protein function & structure • Transmembrane domains, co-evolving positions

  32. Transmembrane domain prediction

  33. Phosphorylation prediction

  34. O-glycosylation

  35. EMBOSS • Open source software for molecular biology • Predict antigenic sites • Useful if you want to design a peptide antibody • Look for specific motifs, even degenerate ones • Known phosphorylation motifs • Find motifs in multiple sequences with one submission • Get stats on protein/nucleic acid sequences • Sequence manipulation of all kinds
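
The idea of scanning several sequences for a degenerate motif in one pass can be sketched in plain Python. This is not EMBOSS code and makes no claim about how its tools are implemented. The function name and the toy sequences are hypothetical, and the motif `RR.[ST]` is used as a commonly cited basophilic kinase consensus (R-R-x-S/T), here purely as an example.

```python
import re

def find_motif(sequences, regex):
    """Map each sequence name to a list of (1-based start, matched text).

    A lookahead is used so that overlapping occurrences are all reported.
    """
    pattern = re.compile("(?=(" + regex + "))")
    return {name: [(m.start() + 1, m.group(1)) for m in pattern.finditer(seq)]
            for name, seq in sequences.items()}

# Hypothetical sequences for illustration.
toy = {"protA": "MKRRASLLRRGTP", "protB": "MAAAGGG"}
hits = find_motif(toy, "RR.[ST]")
for name, found in hits.items():
    print(name, found)
```

A motif hit of this kind only says the site is possible. Dedicated predictors such as those on the CBS server weigh sequence context as well.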

  36. Today in lab • Tutorials on protein information sites • Create two sublists with DAVID • Obtain protein accession numbers for the cluster using Uniprot • Submit to SMART database to characterize/analyze the domains for signaling proteins • Pick 2 proteins to do additional predictions
