Developing and Using Special Purpose Hidden Markov Model Databases Martin Gollery Associate Director of Bioinformatics University of Nevada, Reno Mgollery@unr.edu
Today’s Tutorial • Instructor: Martin Gollery • Associate Director of Bioinformatics, University of Nevada, Reno • Consultant to several organizations • Formerly with TimeLogic • Developed several HMM databases
Hidden Markov Models • What HMMs are • Which HMM programs are commonly used • What HMM databases are available • Why you would use one DB over another • Integrated resources: InterPro and more • How you can build your own HMM DB • Problems with building your own • Live demonstration
Hidden Markov Models: What are they, anyway? • Statistical description of a protein family's consensus sequence • Conserved regions receive highest scores • Can be seen as a Finite State Machine
Representation of Family Members • yciH KDGII • ZyciH KDGVI • VCA0570 KDGDI • HI1225 KNGII • sll0546 KEDCV
Representation of gaps in Family Members • yciH KDGII • ZyciH KDGVI • VCA0570 KDGDI • HI1225 KNGII • sll0546 KED-V
For maximum sensitivity, no residue at any position should have a zero probability, even if it was not seen in the training data.
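A minimal sketch of how unseen residues can be kept above zero. It uses the second alignment column of the five family members shown above (D, D, D, N, E) and simple add-one pseudocounts. The add-one scheme is an assumption chosen for brevity; HMM-building programs typically use more refined priors, such as Dirichlet mixtures.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_probabilities(column, pseudocount=1.0):
    """Estimate residue probabilities for one alignment column.

    Adding a pseudocount to every residue keeps unseen residues above zero.
    """
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


# Second column of KDGII, KDGVI, KDGDI, KNGII, KEDCV
probs = column_probabilities("DDDNE")
print(f"P(D) = {probs['D']:.2f}")  # observed three times
print(f"P(W) = {probs['W']:.2f}")  # never observed, still non-zero
```

With a pseudocount of zero, every residue outside D, N and E would get probability zero, and a single unexpected residue would rule out an otherwise good match.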
Start with an MSA… • CLUSTAL W (1.7) multiple sequence alignment • yciH KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG • ZyciH KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG • VCA0570 KDGDIEIQGDVRDQLKTLLESKGHKVKLAGG • HI1225 KNGIIEIQGEKRDLLKQLLEQKGFKVKLSGG • sll0546 KEDCVEIQGDQREKILAYLLKQGYKAKISGG • PA4840 KDGVVEIQGEHVELLIDELLKRGFKAKKSGG • AF0914 KNGVIELQGNHVNRVKELLIKKGFNPERIKT • *:. :*:**: : : * :* : :
Hidden Markov Models • HMMER2.0 • NAME example2 • DESC Small example for demonstration purposes • LENG 31 • ALPH Amino • COM hmmbuild example2 example2.aln • NSEQ 7 • DATE Wed Jan 08 13:33:06 2003 • HMM A C D E F G H I K … • 1 -3217 -3413 -3082 -2664 -4291 -3257 -2104 -4231 3883… • 2 -1938 -3859 2747 1592 -4024 -1857 -1206 -3953 -1455… • 3 -2160 -3144 1834 -953 -4284 3247 -2013 -4362 -2365… • 4 -1255 2750 436 -2789 -1273 -2972 -2049 1510 -2543… • 5 -2035 -1558 -4660 -4320 -2085 -4409 -4229 3081 -4224… • 6 -3264 -3765 -1447 3822 -4535 -2948 -2636 -4814 -2810… • 7 -2423 -1951 -4843 -4395 -1156 -4544 -3680 3291 -4151… • 8 -3220 -3396 -2530 -2667 -3851 -3171 -2735 -4442 -2277… • 9 -3196 -3194 -3915 -4259 -4867 3789 -4005 -5414 -4591… • 10 -1923 -3837 2743 2134 -4005 -1854 -1196 -3929 -1434… • 11 -999 -2164 -952 -353 -2483 -1909 3321 -2139 1730… • 12 -1629 -1909 -2827 -2102 -2279 -2588 -1442 -1012 -488…
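Emission Probabilities • What is the likelihood that sequence X was emitted by HMM Y? • Likelihood is calculated by multiplying the emission probability of each residue at each position by each of the transition probabilities along the path • With log-odds scores, as stored in the model file, the multiplication becomes a sum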
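A rough sketch of the scoring idea, restricted to an ungapped path through match states. The emission scores are the first three model positions from the example model file shown earlier (only the residue columns A to K that appear on that slide). The match-to-match transition score is a hypothetical placeholder, and the integers are assumed to be log-odds scores, so summing them corresponds to multiplying the underlying probabilities.

```python
# Residue columns as listed on the example model slide (truncated there after K).
RESIDUES = "ACDEFGHIK"

# Match-state emission scores for model positions 1-3, copied from the slide.
EMISSIONS = [
    [-3217, -3413, -3082, -2664, -4291, -3257, -2104, -4231, 3883],
    [-1938, -3859, 2747, 1592, -4024, -1857, -1206, -3953, -1455],
    [-2160, -3144, 1834, -953, -4284, 3247, -2013, -4362, -2365],
]

# Hypothetical match->match transition score; real values come from the model file.
MATCH_TO_MATCH = -10


def score_ungapped(seq):
    """Sum emission and transition scores along the match-only path."""
    if len(seq) != len(EMISSIONS):
        raise ValueError("sequence length must equal the number of model positions")
    total = 0
    for pos, residue in enumerate(seq):
        total += EMISSIONS[pos][RESIDUES.index(residue)]
        if pos > 0:
            total += MATCH_TO_MATCH  # one transition between consecutive positions
    return total


for candidate in ("KDG", "KED", "AAA"):
    print(candidate, score_ungapped(candidate))
```

A full implementation would also consider insert and delete states and pick (or sum over) the possible paths; this sketch only shows why conserved positions dominate the total.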
HMMs vs. BLAST • Position-specific scoring vs. a general substitution matrix • Example: • dDGVIvIddDKRDLLKSLiEAKkMKVKLAGG • KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG • The pair has 80% BLAST similarity, but BLAST misses that the mismatches (lower case) fall in highly conserved regions • HMM scoring emphasizes important locations • Clearer score cutoffs • However, it is MUCH slower!
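The 80% figure can be checked with a short sketch. It computes plain percent identity, which is an assumption standing in for BLAST's similarity measure, and treats the lower-case letters in the first sequence as the mismatched positions.

```python
def compare(seq_a, seq_b):
    """Return (percent identity, 1-based mismatch positions) for equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    pairs = zip(seq_a.upper(), seq_b.upper())
    mismatches = [i + 1 for i, (a, b) in enumerate(pairs) if a != b]
    identity = 100.0 * (len(seq_a) - len(mismatches)) / len(seq_a)
    return identity, mismatches


identity, mismatches = compare(
    "dDGVIvIddDKRDLLKSLiEAKkMKVKLAGG",
    "KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG",
)
print(f"{identity:.1f}% identity, mismatches at positions {mismatches}")
```

A general matrix weighs every mismatch position the same way, whereas a position-specific model can penalize a mismatch at a conserved column far more heavily.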
HMM programs • HMMer - Sean Eddy, Wash U • SAM - Haussler, UCSC • Wise tools - Birney, EBI • SledgeHMMer - Subramaniam, SDSC • Meta-MEME - Noble & Bailey • PSI-BLAST - NCBI • SPSpfam - Southwest Parallel Software • Ldhmmer - Logical Depth • DeCypherHMM - TimeLogic
What exactly do you want? • Are you searching thousands of sequences with one or a few models? • Use hmmsearch • Searching a few sequences with thousands of models? • Use hmmpfam • Thousands of sequences vs. Thousands of models? • Use an accelerator, if you do it very often
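The rule of thumb above can be written down as a small helper. This is only a sketch: the cutoff for "thousands" (`many`) and the file names are hypothetical, and the argument order (HMM file first, then sequence file) is the one used by the HMMER2 command-line programs.

```python
def choose_program(n_sequences, n_models, many=1000):
    """Pick a search strategy following the rule of thumb above.

    `many` is a hypothetical cutoff for "thousands"; tune it to your setup.
    """
    if n_sequences >= many and n_models >= many:
        return "accelerator"
    if n_models >= n_sequences:
        return "hmmpfam"  # few sequences against a database of models
    return "hmmsearch"  # few models against a database of sequences


def build_command(program, hmm_file, seq_file):
    """Build an argument list; both programs take the HMM file first."""
    if program not in ("hmmsearch", "hmmpfam"):
        raise ValueError(f"no command line for {program!r}")
    return [program, hmm_file, seq_file]


# Hypothetical file names, for illustration only.
program = choose_program(n_sequences=3, n_models=7868)
print(build_command(program, "Pfam_ls", "query.fasta"))
```

The returned list can be handed to `subprocess.run` on a machine where the programs are installed.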
HMM databases • PFAM • TIGRFAM • Superfamily • SMART • Panther • PRED-GPCR
HMM databases at the CFB • COGfam • KinFam • HydroHMMer • NVfam-pro • NVfam-arc • NVfam-fun • NVfam-pln
PFAM • From Sanger, WashU, KI, INRA • Version 17 has 7868 families • Most widely used HMM database • Good annotation team
PFAM • PFAM-A is hand curated • From high quality multiple Alignments • PFAM-B is built automatically from ProDom • Generated using the Domainer algorithm • ProDom is built from SP/TREMBL
PFAM • Pfam-ls = global alignments with respect to the model • Pfam-fs = local alignments, so that matches may include only part of the model • Both the -ls and -fs versions are local with respect to the sequence
PFAM • Note ‘type’ annotation • Labeled TP • Family • Domain • Repeat • Motif
TIGRFAMs • Available at (www.tigr.org/TIGRFAMs/) • Organized by functional role • Equivalogs: a set of homologous proteins that are conserved with respect to function since their last common ancestor • Equivalog domains: domains of conserved function
TIGRFAMs • 2453 models in release 4.1 • Complementary to PFAM, so run both • Part of the Comprehensive Microbial Resource (CMR)
TIGRFAMs TIGRfam and PFAM alignments for Pyruvate carboxylase. The thin line represents the sequence. The bars represent hit regions.
SuperFamily • By Julian Gough, formerly MRC, now Riken GSC • www.supfam.org • Provides structural (and hence implied functional) assignments to protein sequences at the superfamily level • Built from SCOP (Structural Classification of Proteins) database, which is built from PDB • Available in HMMer, SAM, and PSI-BLAST formats
SuperFamily • 1447 SCOP Superfamilies • Each represented by a group of HMMs • Over 8500 models total • Table provides comparison to GO, Interpro, PFAM
SMART • Simple Modular Architecture Research Tool • Version 3.4 contains 654 HMMs • Emphasis on mobile eukaryotic domains • smart.embl-heidelberg.de • Annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues
SMART • Use for signaling domains or extracellular domains • Normal and Genomic mode
PRED-GPCR • Papasaikas et al, U of Athens • 265 HMMs in 67 GPCR families • Based on TiPs Pharmacological classification. • Filters with CAST • signatures regularly updated • Entire system redone each year
Panther • Protein ANalysis THrough Evolutionary Relationships • Family and subfamily: families are evolutionarily related proteins; subfamilies are related proteins with the same function • Molecular function: the function of the protein by itself or with directly interacting proteins at a biochemical level, e.g. a protein kinase • Biological process: the function of the protein in the context of a larger network of proteins that interact to accomplish a process at the level of the cell or organism, e.g. mitosis. • Pathway: similar to biological process, but a pathway also explicitly specifies the relationships between the interacting molecules.
Panther • (Thomas et al., Genome Research 2003; Mi et al. NAR 2005) • 6683 protein families • 31,705 functionally distinct protein subfamilies.
Panther • Due to the size, searches could be slow • First, BLAST against consensus seqs • Then, search against models represented by those hits • With an accelerator, you don’t have to do that…
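The two-stage strategy can be sketched as follows. Everything here is a toy stand-in: the consensus strings are invented for illustration, a shared 3-mer test replaces the BLAST step, and counting identical positions replaces the real model search. Only the control flow (cheap filter first, expensive search on the survivors) reflects the approach described above.

```python
# Toy consensus sequences keyed by family name (hypothetical data).
CONSENSUS = {
    "famA": "KDGVIEIQG",
    "famB": "MSTNPKPQR",
    "famC": "KDGDIELQG",
}


def shared_kmer_prefilter(query, k=3):
    """Cheap first pass: keep families whose consensus shares a k-mer with the query."""
    kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    return [
        name
        for name, cons in CONSENSUS.items()
        if any(cons[i:i + k] in kmers for i in range(len(cons) - k + 1))
    ]


def toy_model_score(query, model):
    """Expensive second pass stand-in: count identical aligned positions."""
    return sum(a == b for a, b in zip(query, model))


def two_stage_search(query, prefilter, score_model, models):
    """Score the query only against models whose family passed the prefilter."""
    candidates = [name for name in prefilter(query) if name in models]
    return {name: score_model(query, models[name]) for name in candidates}


results = two_stage_search("KDGVIEIQGDKRD", shared_kmer_prefilter, toy_model_score, CONSENSUS)
print(results)
```

The saving comes from the second stage touching only a fraction of the models; the risk is that a family the prefilter misses is never searched at all.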
Panther • So, how does it perform? • I took 3451 Arabidopsis proteins with no hit to PFAM, Superfamily, SMART or TIGRfam • Ran them against Panther • Found 160 significant hits!
COG-HMMs • Clusters of Orthologous Groups of proteins • www.ncbi.nlm.nih.gov/cog/ • Each COG is from at least 3 lineages • Ancient conserved domain • 4873 alignments available • Alignments from NCBI, HMMs from me at mgollery@unr.edu
CDD • Conserved Domain Database (NCBI) • PSI-BLAST profiles are similar to HMMs • 10991 PSSMs - SMART + COG + KOG + Pfam + CD • Runs with RPS-BLAST • Much faster searches
KinFam • KinFam models represent 53 different classes of PKs (protein kinases) • Assigns Kinase Class and Group • Based on Hanks’ classification scheme • Database is small, so searches are fast
KinFam • Categorizes Kinase data • Available for download from bioinformatics.unr.edu
RANK  SCORE   QF  TARGET|ACCESSION  E_VALUE   DESCRIPTION
1     852.93  1   KinFam||ptkgrp15  9.3e-256  Fibroblast GF recept
2     479.14  1   KinFam||ptkgrp14  3.1e-143  Platelet derived GF
3     423.33  1   KinFam||ptkother  1.9e-126  Other membrane-span
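A hit table like the one above is easy to post-process. The sketch below assumes whitespace-separated columns in exactly the order shown (RANK, SCORE, QF, TARGET|ACCESSION, E_VALUE, DESCRIPTION); the `Hit` type and `parse_hits` function are hypothetical names, not part of any search tool.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    rank: int
    score: float
    target: str
    accession: str
    e_value: float
    description: str


REPORT = """\
RANK SCORE QF TARGET|ACCESSION E_VALUE DESCRIPTION
1 852.93 1 KinFam||ptkgrp15 9.3e-256 Fibroblast GF recept
2 479.14 1 KinFam||ptkgrp14 3.1e-143 Platelet derived GF
3 423.33 1 KinFam||ptkother 1.9e-126 Other membrane-span
"""


def parse_hits(report):
    """Parse data rows; the header and any malformed lines are skipped."""
    hits = []
    for line in report.splitlines():
        fields = line.split(None, 5)  # the description may contain spaces
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        rank, score, _qf, target_acc, e_value, description = fields
        parts = target_acc.split("|")
        hits.append(
            Hit(int(rank), float(score), parts[0], parts[-1], float(e_value), description)
        )
    return hits


hits = parse_hits(REPORT)
best = min(hits, key=lambda h: h.e_value)
print(best.accession, best.description)
```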
HydroHMMer • HydroHMMer finds LEAs and other hydrophilin classes • Small target size makes for very fast searches
NVFAMs • HMMs reflect the training data • Specific training sets provide better results • So, use archaeal data to study Archaea, fungal data to study fungi, etc. • Designed for use with PFAM, not stand-alone • Recent redesign, name change
NVFAMs • NVFAM-pro used to study E. faecalis • Demonstrated higher scores, better alignments • However, PFAM had more total hits • P. falciparum used as negative control • PFAM showed better scores and alignments, as predicted • Automated design by Garrett Taylor; scripts are available! • Contact me for input, collaboration, or help to build your own
Which database to use? One comparison test (your results may vary…) • Compare 563 I. pini sequences to COGhmm, PFAM, PFAMfrag, SMART, TIGRfam, TIGRfamfrag, Superfamily • COGs: 9 • PFAM: 22 • PFAMfrag: 57 • SMART: 4 • Superfamily: 30 • TIGRfam: 6 • TIGRfamfrag: 12
Integrated Resources • InterProscan • MAGPIE • PANAL • Make your own!
InterPro • Database built from PFAM, Prints, Prosite, SuperFamily, ProDom, SMART, TIGRFAMs, PANTHER, PIRsf, Gene3D & SP/TrEMBL • Version 10.0 • Nearly 12,000 entries • http://www.ebi.ac.uk/interpro/ • InterProScan can be installed locally
InterProScan • Splits up big jobs and reassembles them • Works with SGE, PBS, LSF • A free analysis pipeline! • Provides GO mappings • Written in Perl, so it’s easy to modify • Average 4 min. per NT sequence per CPU
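The split-and-reassemble idea can be illustrated with a short sketch. This is not InterProScan's own code: the function names are hypothetical, the FASTA records are the short family-member fragments from an earlier slide, and the chunk size is arbitrary. Each chunk would be submitted as a separate cluster job, and the per-chunk results concatenated in the original order.

```python
FASTA = """\
>yciH
KDGII
>ZyciH
KDGVI
>VCA0570
KDGDI
>HI1225
KNGII
>sll0546
KEDCV
"""


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records


def split_into_chunks(records, chunk_size):
    """Split records into consecutive chunks, one per job."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def reassemble(chunk_results):
    """Concatenate per-chunk results back into one list, preserving order."""
    return [item for chunk in chunk_results for item in chunk]


records = read_fasta(FASTA)
chunks = split_into_chunks(records, chunk_size=2)
print([len(chunk) for chunk in chunks])
```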
InterPro • InterPro release 10.0 contains 11972 entries, representing 3079 domains, 8597 families, 228 repeats, 27 active sites, 21 binding sites and 20 post-translational modification sites. • Overall, there are 7521179 InterPro hits from 1466570 UniProt protein sequences. • A complete list is available from the ftp site.
DATABASE             VERSION  ENTRIES
SWISS-PROT           46.5     180652
PRINTS               37.0     1850
TrEMBL               29.5     1689375
Pfam                 17.0     7868
PROSITE patterns     18.45    1800
PROSITE preprofiles  N/A      120
ProDom               2004.1   1522
InterPro             10.0     11972
SMART                4.0      663
TIGRFAMs             4.1      2454
PIRSF                2.52     962
PANTHER              5.0      438
SUPERFAMILY          1.65     1160
Gene3D               3.0      117
GO Classification    N/A      18705
Modifying InterProScan • Two ways to add your own HMM database to InterProScan: • Modify the Perl scripts • Concatenate your models onto PFAM • Similarly, if you are looking for a specific target, delete all the other models to speed up searches
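A minimal sketch of the concatenate-and-trim approach, assuming ASCII HMMER2 flat files in which each model starts with a `HMMER2.0` line, carries a `NAME` line, and ends with `//`. The function names and the toy model text are hypothetical, and any InterProScan-side configuration is outside the scope of this sketch.

```python
def split_models(text):
    """Split flat-file text into one string per model (each ends with '//')."""
    records, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "//":
            records.append("\n".join(current) + "\n")
            current = []
    return records


def model_name(record):
    """Return the value of the NAME line, or None if it is missing."""
    for line in record.splitlines():
        if line.startswith("NAME"):
            return line.split(None, 1)[1].strip()
    return None


def combine(*texts):
    """Concatenate several model files into one database."""
    return "".join(record for text in texts for record in split_models(text))


def keep_only(text, wanted):
    """Drop every model whose name is not in `wanted`."""
    return "".join(r for r in split_models(text) if model_name(r) in wanted)


# Toy stand-ins for real model files (hypothetical content).
PFAM_TEXT = "HMMER2.0\nNAME  toyA\nLENG  3\n//\nHMMER2.0\nNAME  toyB\nLENG  4\n//\n"
MY_TEXT = "HMMER2.0\nNAME  myKinase\nLENG  5\n//\n"

combined = combine(PFAM_TEXT, MY_TEXT)
print([model_name(r) for r in split_models(combined)])
```

In practice the texts would be read from and written back to files; working on strings keeps the sketch self-contained.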