Classifying the protein universe

Synapse-Associated Protein 97 Classifying the protein universe Ashwin Sivakumar Wu et al, 2002. EMBO J 19:5740-5751

Domain Analysis and Protein Families • Introduction • What are protein families? • Protein families • Description & Definition • Motifs and Profiles • The modular architecture of proteins • Domain Properties and Classification

Protein family 1 Protein family 2 Protein Families • Protein families are defined by homology: • In a family, everyone is related to everyone • Everybody in a family shares a common ancestor:

1chg 1sgt 1chg 1sgt Homology versus Similarity • Homologous proteins have similar 3D structures and (usually) share common ancestry: • 1chg and 1sgt  31% identity, 43% similarity • We can infer homology from similarity! Superfamily: Trypsin-like Serine Proteases

1chg 1sgc 1chg 1sgc Homology versus Similarity • But Homologous proteins may not share sequence similarity: Superfamily: Trypsin-like Serine Proteases 1chg and 1sgc  15% identity, 25% similarity We cannot infer similarity from homology

1chg 2baa 1chg 2baa Homology versus Similarity • Similar sequences may not have structural similarity: 1chg and 2baa  30% similarity, 140/245 aa We cannot assume homology from similarity!

Homology versus Similarity • Summary • Sequences can be similar without being homologous • Sequences can be homologous without being similar Families ?? Evolution / Homology BLAST Similarity

Description of a Protein Family • Let’s assume we know some members of a protein family • What is common to them all? • Multiple alignment!

Describing Sequences in a Protein Family • As a motif or rule • describes essential features of the protein family • catalytic residues, important structural residues • As a profile • describes variability in the family alignment

Techniques for searching sequence databases to Some common strategies to uncover common domains/motifs of biological significance that categorize a protein into a family • Pattern - a deterministic syntax that describes multiple combinations of possible residues within a protein string • Profile - probabilistic generalizations that assign to every segment position, a probability that each of the 20 aa will occur

Consensus - mathematical probability that a particular amino acid will be located at a given position. • Probabilistic pattern constructed from a MSA. Opportunity to assign penalties for insertions and deletions • PSSM - (Position Specific Scoring Matrix) – Represents the sequence profile in tabular form – Columns of weights for every aa corresponding to each column of a MSA.

HMMs • Hidden Markov Models are Statistical methods that consider all the possible combinations of matches, mismatches, and gaps to generate a consensus (Higgins, 2000) • •Sequence ordering and alignments are not necessary at the onset (but in many cases alignments are recommended) • More the number of sequences better the models. • One can Generate a model (profile/PSSM), then search a database with it (Eg: PFAM)

Motif Description of a Protein Family • Regular expressions: ........C.............S...L..I..DRY..I.......................W... I E W V / C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10} – x{12} W / x = [AC-IK-NP-TVWY]

Motif Description of a Protein Family • Database: PROSITE “PROSITE is a database of protein families and domains. It is based on the observation that, while there is a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families. Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor. It is apparent, when studying protein sequence families, that some regions have been better conserved than others during evolution. These regions are generally important for the function of a protein and/or for the maintenance of its three-dimensional structure. By analyzing the constant and variable properties of such groups of similar sequences, it is possible to derive a signature for a protein family or domain, which distinguishes its members from all other unrelated proteins.” http://au.expasy.org/prosite/prosite_details.html

Automated Motif Discovery • Given a set of sequences: • GIBBS Sampler • http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=protein • MEME • http://meme.sdsc.edu/meme/ PRATT • http://www.ebi.ac.uk/pratt • TEIRESIAS • http://cbcsrv.watson.ibm.com/Tspd.html

Automated Profile Generation • Any multiple alignment is a profile! • PSIBLAST • Algorithm: • Start from a single query sequence • Perform BLAST search • Build profile of neighbours • Repeat from 2 … • Very sensitive method for database search

PSI-Blast • Starts with a sequence, BLAST it, • align select results to query sequence, estimate a profile with the MSA, search database with the profile - constructs PSSM • Iterate until process stabilizes • Focus here is on domains, not entire sequences • Greatly improves sensitivity

Profile2 After n iterations Query Profile1 ... Threshold for inclusion in profile PSIBLAST • Position Specific Iterative Blast

Benchmarking a motif/profile • You have a description of a protein family, and you do a database search… • Are all hits truly members of your protein family? • Benchmarking: TP: true positive TN: true negative FP: false positive FN: false negative Result family member Dataset not a family member unknown

Benchmarking a motif/profile • Precision / Selectivity • Precision = TP / (TP + FP) • Sensitivity / Recall • Sensitivity = TP / (TP + FN) • Balancing both: • Precision ~ 1, Recall ~ 0: easy but useless • Precision ~ 0, Recall ~ 1: easy but useless • Precision ~ 1, Recall ~ 1: perfect but very difficult

Triosephosphate isomerase Phosphoglycerate kinase The Modular Architecture of Proteins • BLAST search of a multi-domain protein

What are domains? • Functional - from experiments: example: Decay Accelerating Factor (DAF) or CD55 • Has six domains (units): • 4x Sushi domain (complement regulation) • 1x ST-rich ‘stalk’ • 1x GPI anchor (membrane attachment) • PDB entry 1ojy (sushi domains only) P Williams et al (2003) Mapping CD55 Function. J Biol Chem 278(12): 10691-10696

There is only so much we can conclude… • Classifying domains [To aid structure prediction (predict structural domains, molecular function of the domain)] • Classifying complete sequences (predicting molecular function of proteins, large scale annotation) • Majority of proteins are multi-domain proteins.

What are domains? • Structural - from structures: MKTQVAIIGAGPSGLLLGQLLHKAGIDNVILERQTPDYVLGRIRAGVLEQGMVDLLREAGVDRRMARDGLVHEGVEIAFAGQRRRIDLKRLSGGKTVTVYGQTEVTRDLMEAREACGATTVYQAAEVRLHDLQGERPYVTFERDGERLRLDCDYIAGCDGFHGISRQSIPAERLKVFERVYPFGWLGLLADTPPVSHELIYANHPRGFALCSQRSATRSRYYVQVPLTEKVEDWSDERFWTELKARLPAEVAEKLVTGPSLEKSIAPLRSFVVEPMQHGRLFLAGDAAHIVPPTGAKGLNLAASDVSTLYRLLLKAYREGRGELLERYSAICLRRIWKAERFSWWMTSVLHRFPDTDAFSQRIQQTELEYYLGSEAGLATIAENYVGLPYEEIE Are these domains? Yes - structural domains! 1phh M A Marti-Renom (2003) Identification of Structural Domains in Proteins. DIMACS, Rutgers University, Piscataway, NJ, Feb 27 2003.

What are domains? • Mobile – Sequence Domains: Protein 1 Protein 2 Protein 3 Protein 4 Mobile module

Domains are... • ...evolutionary building blocks: • Families of evolutionarily-related sequence segments • Domain assignment often coupled with classification • With one or more of the following properties: • Globular • Independently foldable • Recurrence in different contexts • To be precise, • we say: “protein family” • we mean: “protein domainfamily”

Example: global alignment • Phthalate dioxygenase reductase (PDR_BURCE) • Toluene - 4 -monooxygenase electron transfer component (TMOF_PSEME) Global alignment fails! Only aligns largest domain.

Sometimes even more complex! PGBM_HUMAN:“Basement membrane-specific heparan sulphate proteoglycan core protein precursor” 980 1960 2940 3920 4391 45 domains of 9 different type, according to PFam http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=P98160 http://www.glycoforum.gr.jp/science/word/proteoglycan/PGA09E.html

Categories of Domain Definitions Structure(discontinuous domains) Sequence(continuous domains) PFAM SCOP Curated SMART CATH PROSITE PRINTS ADDA DALI PUU DETEKTIVE DOMAINPARSER 1 & 2 DIAL STRUDL DOMAK DOMO TRIBE-MCL GENERAGE SYSTERS PROTOMAP Automatic

Pfam-Protein family database • Families of HMM profiles built from hand curated multiple alignments. (Pfam A) • Pfam A covers 7973 protein families. • You can search your sequence against these profiles to decipher family membership for your sequence. 7973

Sequence Space Graph • Why we need to consider domains: Sequence Alignment Topology: • 80% of all sequences in one giant component • 10% smaller groups • 10% in singletons

Distant relatives Automatic domain definitions • Rely on alignment information • Alignment information is unreliable • Incomplete sequences (fragments) • Spurious alignments • Conserved motifs in mostly disordered region • How to remove the noise? UREA_CANEN: three domain protein

Sequence Space Graph: • Where to cut connections? • What is real, what is noise? • Precision vs Sensitivity…

ADDA • HolmGroup in-house database! • http://ekhidna.biocenter.helsinki.fi:9801/sqgraph/pairsdb • Classification of non-redundant sequences • 100% level: 1562243 sequences, 2697368 domains • 40% level: 479740 sequences, 827925 domains • PFAM-A benchmark • Sensitivity: 87% (average unification in single cluster) • Selectivity: 98% (average purity of cluster) • Coverage: 100% (all known proteins) [ Pfam ~50% ]

Example: ABC transporter PFAM PRODOM DOMO ADDA UniProt id: CFTR_BOVIN

Properties of domains • Most domains: size approx 75 – 200 residues

So, you have a sequence... • ...look it up in existing database • SRS: http://srs.ebi.ac.uk • INTERPRO: http://www.ebi.ac.uk/interpro • ...search against existing family descriptions • PFAM: http://www.sanger.ac.uk/Software/Pfam • SMART: http://smart.embl-heidelberg.de • PRINTS: http://bioinf.man.ac.uk/dbbrowser/PRINTS • PROSITE: http://us.expasy.org/prosite • ...look it up in ADDA

Manually Curated Protein Family Databases • PFAM (Hidden Markov Models) • http://www.sanger.ac.uk/Software/Pfam • SMART (Hidden Markov Models) • http://smart.embl-heidelberg.de • PROSITE (Regular Expressions, Profiles) • http://au.expasy.org/prosite • PRINTS (combination of Profiles) • http://bioinf.man.ac.uk/dbbrowser/PRINTS

Why a multiple alignment? • With a multiple alignment, we can • guess which residues are “important” • secondary structure prediction • transmembrane segments prediction • homology modelling • guide to wet-lab EXPERIMENTATION! • build a motif/profile and find more family members • build phylogenetic trees Multiple Alignments are THE central object in protein sequence analysis!

From sequence to function… 3-motif resource The server seems to be down today! Methylmalanoyl CoA DecarboxylasePattern [ILV]-x(3)-E-x(7)-V-[GA]-x-[IVL]-x-L-N-R-Pmapped on the structure of 1DUB. Ball representation in pink shows the potential ligands and its binding pockets. The balls in blue represent the residues making up the motif on the known structure.

Classifying the protein universe