Classifying the protein universe

Synapse-Associated Protein 97. Classifying the protein universe . Wu et al, 2002. EMBO J 19:5740-5751. Domain Analysis and Protein Families. Introduction What are protein families? Motifs and Profiles The modular architecture of proteins Domain Properties and Classification.

palmer

Presentation Transcript


  1. Synapse-Associated Protein 97 Classifying the protein universe Wu et al, 2002. EMBO J 19:5740-5751

  2. Domain Analysis and Protein Families • Introduction • What are protein families? • Motifs and Profiles • The modular architecture of proteins • Domain Properties and Classification

  3. Protein Families • Protein families are defined by homology • In a family, everyone is related to everyone • Everybody in a family shares a common ancestor • [Figure: Protein family 1, Protein family 2]

  4. Homology versus Similarity • Homologous proteins share common ancestry and (usually) have similar 3D structures • Superfamily: Trypsin-like Serine Proteases • 1chg and 1sgt: 31% identity, 43% similarity • Here we can infer homology from similarity! • [Figure: structures of 1chg and 1sgt]

  5. Homology versus Similarity • But homologous proteins may not share sequence similarity • Superfamily: Trypsin-like Serine Proteases • 1chg and 1sgc: 15% identity, 25% similarity • We cannot infer similarity from homology • [Figure: structures of 1chg and 1sgc]

  6. Homology versus Similarity • Similar sequences may not have structural similarity • 1chg and 2baa: 30% similarity, 140/245 aa • We cannot assume homology from similarity! • [Figure: structures of 1chg and 2baa]

  7. Homology versus Similarity • Summary • Sequences can be similar without being homologous • Sequences can be homologous without being similar • [Diagram labels: Families ??, Evolution / Homology, BLAST, Similarity]
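
The identity percentages quoted on the preceding slides can be reproduced in spirit with a few lines of Python. This is a minimal sketch, assuming the two sequences are already aligned to the same length with "-" as the gap character, and assuming identity is counted over columns where neither sequence has a gap (other conventions for the denominator exist). The function name `percent_identity` and the toy sequences are hypothetical, not taken from the slides.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences have a residue.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


# Toy alignment (hypothetical sequences): 4 identical residues out of
# 5 ungapped columns.
identity = percent_identity("ACD-EF", "ACEKEF")
```

A high value from this function is evidence of similarity only; as the summary slide stresses, it does not by itself establish homology.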

  8. Domain Analysis and Protein Families • Introduction • What are protein families? • Motifs and Profiles • The modular architecture of proteins • Domain Properties and Classification

  9. Technique to identify protein family • Search for profiles/motifs of biological significance that categorize a protein into a family • Pattern (motif) - a deterministic syntax that describes multiple combinations of possible residues within a protein string • Profile - probabilistic generalizations that assign to every segment position, a probability that each of the 20 aa will occur • Intermediate sequence search - link many profile searches

  10. Automated Motif Discovery • Given a set of sequences: • GIBBS Sampler • http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=protein • MEME - motif-based sequence analysis tools • http://meme.sdsc.edu/meme/ • PRATT - tool to discover patterns that are conserved in a set of protein sequences • http://kr.expasy.org/tools/pratt/ • http://www.ebi.ac.uk/pratt (advanced tool) • TEIRESIAS • http://cbcsrv.watson.ibm.com/Tspd.html • Combinatorial output

  11. Motif Description of a Protein Family • Regular expressions: • Consensus: ........C.............S...L..I..DRY..I.......................W... (alternative residues at the variable positions: I, E, W, V) • Pattern: / C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10} – x{12} W /
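
The pattern on this slide maps directly onto a regular expression in most programming languages. The sketch below uses Python's standard `re` module and assumes that `x{n}` means "any n residues" and that `x{10} – x{12}` means "between 10 and 12 residues". The helper name `find_motif` and the test sequence `TOY` are hypothetical; `TOY` is built by hand to satisfy the pattern and is not a real protein.

```python
import re

# Slide pattern translated to Python regex syntax:
# x{n} becomes .{n}, and the "x{10} - x{12}" range becomes .{10,12}.
MOTIF = re.compile(
    r"C.{13}S.{3}[LI].{2}I.{2}[DE]R[YW].{2}[IV].{10,12}W"
)


def find_motif(sequence):
    """Return (start, end) of the first motif match, or None if absent."""
    match = MOTIF.search(sequence)
    return (match.start(), match.end()) if match else None


# Hypothetical sequence constructed to contain the motif after a
# two-residue prefix; "A" is used as filler at the x positions.
TOY = ("MK" + "C" + "A" * 13 + "S" + "A" * 3 + "L" + "A" * 2 + "I"
       + "A" * 2 + "DRY" + "A" * 2 + "I" + "A" * 11 + "W" + "GG")

hit = find_motif(TOY)
```

Because a pattern like this is deterministic, a single substitution at a fixed position (for example the R) makes the match fail outright, which is one motivation for the probabilistic profiles on the following slides.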

  12. Automated Profile Generation • Any multiple alignment is a profile! • PSI-BLAST • Algorithm: (1) Start from a single query sequence (2) Perform BLAST search (3) Build profile of neighbours (4) Repeat from 2 … • Very sensitive method for database search

  13. PSI-BLAST: Position Specific Iterative BLAST • PSI-BLAST profile models only positions in the query sequence • [Diagram labels: Query, Profile1, Profile2, ..., After n iterations, Threshold for inclusion in profile]
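
The iterate-and-rebuild idea behind PSI-BLAST can be illustrated with a toy model. This is only a sketch under strong simplifying assumptions: all sequences are ungapped and of equal length, the "search" is a plain log-odds score against a uniform background, and the pseudocount and threshold values are arbitrary. It is not the BLAST algorithm, and the names `build_profile`, `log_odds` and `iterative_search` are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(sequences, pseudocount=1.0):
    """Per-column residue probabilities from equal-length sequences."""
    profile = []
    for column in zip(*sequences):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({aa: (counts[aa] + pseudocount) / total
                        for aa in ALPHABET})
    return profile


def log_odds(profile, sequence):
    """Score a sequence against the profile, uniform background assumed."""
    return sum(math.log(column[aa] * len(ALPHABET))
               for column, aa in zip(profile, sequence))


def iterative_search(query, database, threshold, max_iter=5):
    """Repeat: build profile, collect hits above threshold, rebuild."""
    included = {query}
    for _ in range(max_iter):
        profile = build_profile(sorted(included))
        hits = {s for s in database if log_odds(profile, s) >= threshold}
        new_included = {query} | hits
        if new_included == included:  # converged: no change in members
            break
        included = new_included
    return included


# Hypothetical toy database of 5-residue sequences.
DATABASE = ["ACDEY", "ACWEY", "AKWEY", "GGGGG"]
family = iterative_search("ACDEF", DATABASE, threshold=2.0)
```

In this toy run a single round finds only the closest neighbour, while later rounds pull in more distant sequences through the intermediate hits. The same mechanism is what makes the method sensitive, and also what lets a false positive contaminate the profile if the inclusion threshold is too permissive.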

  14. HMMs • Hidden Markov Models are statistical methods that consider all the possible combinations of matches, mismatches, and gaps to generate a consensus (Higgins, 2000)

  15. Using HMMs • You can use an HMM to create a profile model/PSSM (Position Specific Scoring Matrix) • To create one, you need a multiple alignment • The more sequences in the multiple alignment, the better the resulting HMM will generally be • After creating the HMM, you can search a database with it (e.g. PFAM)

  16. HMM libraries • PFAM • http://pfam.sanger.ac.uk • The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). • Pfam-A entries are high quality, manually curated families. • Pfam-B entries are generated automatically.

  17. GTG • Graph clustering algorithm in which all known protein sequences simultaneously self-organize into hypothetical multiple sequence alignments • Eliminates noise • Enables fast sequence database searching methods which are superior to profile-profile comparison at recognizing distant homologues

  18. GTG steps • Generate alignment trace graph • Nodes = residues • Edges = aligned in PSI-Blast library • Unweighted • Edge weighting • Using consistency • Clustering • Driven by consistency • Single site occupancy rule • Post-processing • Generate non-redundant set of inter-cluster edges • Identify sub-trees with conserved residues

  19. Alignment trace graph • Graph representation of input pairwise alignment data • Vertices = residues • Edges = aligned in a pairwise alignment from input library • [Figure: residues of Protein 1 to Protein 5 drawn as vertices]

  20. Consistency = neighbour overlap • For an edge between vertices i and j: Weight = intersection / union of the neighbour sets of i and j
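
The intersection-over-union weight on this slide can be computed directly from neighbour sets. The sketch below assumes the graph is stored as a dictionary mapping each vertex to the set of vertices it is aligned with, that vertices are (protein, position) pairs, and that neighbour sets are used as-is (so i and j themselves may appear in the union). These representation choices, the function name `consistency` and the toy graph are assumptions, not details confirmed by the slides.

```python
def consistency(graph, i, j):
    """Edge weight = |N(i) & N(j)| / |N(i) | N(j)| (Jaccard overlap)."""
    neighbours_i = graph[i]
    neighbours_j = graph[j]
    union = neighbours_i | neighbours_j
    if not union:
        return 0.0
    return len(neighbours_i & neighbours_j) / len(union)


# Hypothetical alignment trace graph: each vertex is (protein, position),
# each edge means the two residues were aligned in some pairwise alignment.
GRAPH = {
    ("P1", 10): {("P2", 12), ("P3", 9)},
    ("P2", 12): {("P1", 10), ("P3", 9), ("P4", 7)},
    ("P3", 9): {("P1", 10), ("P2", 12)},
    ("P4", 7): {("P2", 12)},
}

# One shared neighbour, ("P3", 9), out of four distinct neighbours in total.
weight = consistency(GRAPH, ("P1", 10), ("P2", 12))
```

Edges supported by many shared neighbours get weights near 1, while an edge between two residues that agree with almost no third sequence gets a weight near 0 and can be treated as noise.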

  21. GTG – global trace graph • Input: PSI-Blast all versus all alignments in NRDB40 • Output: superalignment of all proteins • Applications • Pairwise alignment of query and target sequences • Transitive sequence database searching (fast) • Tracking conserved residues (feature space)

  22. Alignment trace graph • Edge weight = consistency (fraction of common neighbours) • Cluster ≈ hypothetical column of multiple alignment (single site occupancy) • [Figure: residues of Protein 1 to Protein 5 grouped into Cluster 1 and Cluster 2]

  23. ‘Motif tracking’ • Each vertex is labelled with source protein and position in sequence • Motifs are subtrees enriched in one particular amino acid type • [Figure: tree with edges labelled by consistency; vertex labels A, H, G, A, A, A, K, K, K, K, K, K, K, K, A]

  24. Remote homolog detection based on GTG alignment score • GTG clustering is informative: it detects as many remote homologs as threading methods

  25. GTG summary • Super-families form elongated clusters in “protein space” • Profile models fluctuations around an equilibrium point • Consistency ~ path model • Exploits multiple profile models • Discriminative in database searching • Global trace graph data structure • Feature space for pattern discovery • http://ekhidna.biocenter.helsinki.fi/gtg/start

  26. Relationships between families • Pfam clans • A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM • Superfamily • http://supfam.cs.bris.ac.uk/SUPERFAMILY/hmm.html • The sequence search method uses a library (covering all proteins of known structure) consisting of 1776 SCOP superfamilies from classes a to g. Each superfamily is represented by a group of hidden Markov models. • Pfam-squared • Based on GTG comparisons of representative sequences from each PFAM-A family against all PFAM-A families. • Rules of thumb: motif score > 1000 means probably related, motif score > 500 means possibly related, score < 500 means dubious

  27. Benchmarking a motif/profile • You have a description of a protein family, and you do a database search… • Are all hits truly members of your protein family? • Benchmarking: TP: true positive, TN: true negative, FP: false positive, FN: false negative • [Diagram: Result versus Dataset; legend: family member, not a family member, unknown]

  28. Benchmarking a motif/profile • Precision / Selectivity • Precision = TP / (TP + FP) • Sensitivity / Recall • Sensitivity = TP / (TP + FN) • Balancing both: • Precision ~ 1, Recall ~ 0: easy but useless • Precision ~ 0, Recall ~ 1: easy but useless • Precision ~ 1, Recall ~ 1: perfect but very difficult
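
The two formulas on this slide are easy to evaluate once the search hits and the true family members are available as sets. This is a minimal sketch, assuming every database entry has a known label (the "unknown" category from the previous slide is ignored); the function name `precision_recall` and the identifiers in the example are hypothetical.

```python
def precision_recall(hits, family_members):
    """Return (precision, recall) for a database search result."""
    hits = set(hits)
    family_members = set(family_members)
    tp = len(hits & family_members)   # hits that are true members
    fp = len(hits - family_members)   # hits that are not members
    fn = len(family_members - hits)   # members the search missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical search: 4 hits, 3 of them correct, out of 5 true members.
precision, recall = precision_recall(
    hits={"a", "b", "c", "x"},
    family_members={"a", "b", "c", "d", "e"},
)
```

Returning every database entry as a hit drives recall to 1 while precision collapses, and returning a single safe hit does the opposite, which is why the slide calls both extremes easy but useless.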

  29. Domain Analysis and Protein Families • Introduction • What are protein families? • Motifs and Profiles • The modular architecture of proteins • Domain Properties and Classification

  30. The Modular Architecture of Proteins • BLAST search of a multi-domain protein • [Figure labels: Triosephosphate isomerase, Phosphoglycerate kinase]

  31. What are domains? • Functional - from experiments: example: Decay Accelerating Factor (DAF) or CD55 • Has six domains (units): • 4x Sushi domain (complement regulation) • 1x ST-rich ‘stalk’ • 1x GPI anchor (membrane attachment) • PDB entry 1ojy (sushi domains only) • P Williams et al (2003) Mapping CD55 Function. J Biol Chem 278(12): 10691-10696

  32. There is only so much we can conclude… • Classifying domains to aid structure prediction • predict structural domains and molecular function of the domain • Classifying complete sequences • predicting molecular function of proteins, large scale annotation • Majority of proteins are multi-domain proteins

  33. What are domains? • Mobile module • [Figure: the same module appearing in Protein 1, Protein 2, Protein 3 and Protein 4]

  34. Domains are... • Parts of protein sequences that can evolve, function, and exist independently of the rest of the protein chain • Each domain forms a compact three-dimensional structure and often can be independently stable and folded

  35. Domains are... • ...evolutionary building blocks: • Families of evolutionarily-related sequence segments • Domain assignment often coupled with classification • To be precise, • we say: “protein family” • we mean: “protein domain family”

  36. Example: global alignment • Phthalate dioxygenase reductase (PDR_BURCE) • Toluene-4-monooxygenase electron transfer component (TMOF_PSEME) • Global alignment fails! Only aligns largest domain.

  37. Sometimes domain structures are quite complex • PGBM_HUMAN: “Basement membrane-specific heparan sulphate proteoglycan core protein precursor” • 45 domains of 7 different types, according to PROSITE • http://pfam.sanger.ac.uk/protein/PGBM_HUMAN • http://au.expasy.org/cgi-bin/prosite/ScanView.cgi?scanfile=530255511812.scan.gz • http://www.glycoforum.gr.jp/science/word/proteoglycan/PGA09E.html

  38. Properties of domains • Most domains are approximately 75 – 200 residues long

  39. Properties of domains • Very short domains, less than 40 residues, are often stabilised by metal ions or disulfide bonds. • Larger domains, greater than 300 residues, are likely to consist of multiple hydrophobic cores

  40. So, you have a sequence... • ...look it up in existing database • INTERPROSCAN: http://www.ebi.ac.uk/Tools/InterProScan/ • PSI-BLAST: http://www.ncbi.nlm.nih.gov/BLAST • GTG: http://ekhidna.biocenter.helsinki.fi/gtg/start • ...search against existing family descriptions • PFAM: http://pfam.sanger.ac.uk/ • SUPERFAMILY: http://supfam.org/SUPERFAMILY/
