Understanding Pfam: Protein Analysis with Multiple Alignments & HMM Profiles

Pfam: multiple sequence alignments and HMM-profiles of protein domains Xianhui Li 03-02-2004

Outline
• What is Pfam?
• What is a hidden Markov model (the methodology underlying Pfam)?
• How to use Pfam, with sample output

Pfam
• Pfam is a database of multiple alignments of protein domains or conserved protein regions.
• The alignments represent evolutionarily conserved structure, which has implications for the protein's function.
• Profile hidden Markov models (profile HMMs) built from the Pfam alignments can be very useful for automatically recognizing that a new protein belongs to an existing protein family, even if the homology is weak.

Overview of Pfam Database
• Pfam-A contains curated families, each with an associated profile HMM that can be used for alignment and database searching
• Annotation: contains several compulsory fields
• Seed alignment: a manually verified multiple alignment of a representative set of sequences
• HMM profile: turns a multiple sequence alignment into a position-specific scoring system
• Full alignment: generated automatically from the seed HMM profile by searching Swissprot for all detectable members and aligning them to the HMM profile
• Pfam-B families are clustered automatically, allowing Pfam to be comprehensive
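To make "position-specific scoring system" concrete, here is a minimal sketch that turns a toy alignment into per-column log-odds scores. It is not how Pfam or HMMER build profiles: the alignment `SEED`, the five-letter alphabet, the uniform background, and the pseudocount are all assumptions chosen for illustration, and transitions (insertions and deletions) are ignored entirely.

```python
import math
from collections import Counter

# Hypothetical toy seed alignment (not real Pfam data); '-' marks a gap.
SEED = ["ACD-", "ACDE", "GCDE", "ACE-"]

def column_scores(alignment, alphabet="ACDEG", pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    scores = []
    for column in zip(*alignment):
        counts = Counter(r for r in column if r != "-")
        total = sum(counts.values()) + pseudocount * len(alphabet)
        scores.append({
            r: math.log2(((counts[r] + pseudocount) / total) / background)
            for r in alphabet
        })
    return scores

def score_sequence(seq, scores):
    """Sum the column-specific scores of an ungapped sequence."""
    return sum(col[r] for r, col in zip(seq, scores))

scores = column_scores(SEED)
print(score_sequence("ACDE", scores))  # resembles the family: positive score
print(score_sequence("GGGG", scores))  # does not: negative score
```

The same residue gets a different score in each column, which is what separates a profile from a single substitution matrix.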

Pfam Sequence Database Coverage
[Chart labels: residue, sequence]
Data shown is from Pfam v2.0 (1998), with 527 families. The current version, Pfam 12.0 (January 2004), contains alignments and models for 7316 protein families, based on the Swissprot 42.5 and SP-TrEMBL 25.6 protein sequence databases.

Markov Model
[Diagram: states Begin, Emit 1, Emit 2, Emit 3, Emit 4, End]
• Simplest example: each state emits (or, equivalently, recognizes) a particular element with probability 1.
• Example sequences: 1234, 234, 14, 121214, 2123334
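Because every state emits its own element with probability 1, a sequence spells out its state path directly. The arrows of the diagram are not reproduced in these notes, so the sketch below does not claim to be the original model: it only collects the transitions that the five example sequences require, and then checks other sequences against that set. The function names are hypothetical.

```python
EXAMPLES = ["1234", "234", "14", "121214", "2123334"]

def observed_transitions(sequences):
    """Collect every (from_state, to_state) step used by the sequences."""
    edges = set()
    for seq in sequences:
        path = ["Begin", *seq, "End"]
        edges.update(zip(path, path[1:]))
    return edges

def accepts(seq, edges):
    """True if every step of the sequence follows a known transition."""
    path = ["Begin", *seq, "End"]
    return all(step in edges for step in zip(path, path[1:]))

edges = observed_transitions(EXAMPLES)
print(sorted(edges))
print(accepts("12334", edges))  # True: uses only transitions seen above
print(accepts("4321", edges))   # False: no Begin -> 4 transition was seen
```

The repeated 3 in 2123334 shows that state 3 must be able to loop back to itself.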

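Probabilistic Emission
[Diagram: Begin and End plus four states with emission probabilities A (0.8) / B (0.2), C (0.1) / D (0.9), B (0.7) / C (0.3), and C (0.6) / A (0.4); transition probabilities 0.9, 0.5, 1.0, 0.8, 0.1, 0.25, 0.75, 0.5, 0.2]
• If we let the states define a set of emission probabilities for elements, we can no longer be sure which state we are in given a particular element of a sequence. For example, was BCCD produced by one state path or by another?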

Hidden Markov Models (HMM)
[Diagram: same model as on the previous slide]
• Emission uncertainty means the sequence doesn't identify a unique path. The states are “hidden”.
• The probability of a sequence is the sum of the probabilities of all paths that can produce it:
  p(bccd) = 0.5 * 0.2 * 0.1 * 0.3 * 0.75 * 0.6 * 0.8 * 0.9
          + 0.5 * 0.7 * 0.75 * 0.6 * 0.2 * 0.6 * 0.8 * 0.9
          = 0.000972 + 0.013608
          = 0.01458
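The sum over paths can be computed without listing the paths by using the forward algorithm. The sketch below is a reconstruction under stated assumptions: the state names S1 to S4 are hypothetical, the transitions are inferred from the factors in the worked calculation, and the targets of the 0.9 and 0.25 arrows are guesses (neither contributes to p(bccd) in this layout, since S2 leads only to End).

```python
# Emission probabilities per state, taken from the slide.
EMIT = {
    "S1": {"A": 0.8, "B": 0.2},
    "S2": {"C": 0.1, "D": 0.9},
    "S3": {"B": 0.7, "C": 0.3},
    "S4": {"C": 0.6, "A": 0.4},
}

# Transition probabilities. Reconstructed, not copied from the diagram.
TRANS = {
    "Begin": {"S1": 0.5, "S3": 0.5},
    "S1": {"S2": 0.9, "S3": 0.1},    # target of the 0.9 arrow is assumed
    "S2": {"End": 1.0},
    "S3": {"S4": 0.75, "S2": 0.25},  # target of the 0.25 arrow is assumed
    "S4": {"S4": 0.2, "S2": 0.8},
}

def sequence_probability(seq):
    """Forward algorithm: total probability of seq over all state paths."""
    # forward[state] = probability of emitting the prefix and being in state
    forward = {"Begin": 1.0}
    for symbol in seq:
        nxt = {}
        for state, prob in forward.items():
            for target, t in TRANS.get(state, {}).items():
                e = EMIT.get(target, {}).get(symbol, 0.0)
                if e > 0.0:
                    nxt[target] = nxt.get(target, 0.0) + prob * t * e
        forward = nxt
    # A complete path must finish with a transition into End.
    return sum(p * TRANS.get(s, {}).get("End", 0.0) for s, p in forward.items())

print(sequence_probability("BCCD"))  # approximately 0.01458
```

Under these assumptions the result agrees with the two-path sum on the slide, up to floating-point rounding.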

HMMs for homology
[Diagram: start, match, insert, delete, and end states]
• Homology model: ancestral residue (match) states, insertion states, deletion states.
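The three kinds of state can be read off an alignment. The sketch below uses a toy alignment and a common rule of thumb, both assumptions not taken from these slides: a column is treated as a match column when at least half of the sequences have a residue there. Each sequence then follows a path of match (M), insert (I), and delete (D) states.

```python
# Hypothetical toy alignment; '-' marks a gap.
ALIGNMENT = ["AC-DE", "A--DE", "ACGDE", "AC-D-"]

def match_columns(alignment, min_occupancy=0.5):
    """Flag columns where enough sequences have a residue (assumed rule)."""
    n = len(alignment)
    return [
        sum(r != "-" for r in column) / n >= min_occupancy
        for column in zip(*alignment)
    ]

def state_path(row, is_match):
    """Translate one aligned sequence into its M/I/D state path."""
    path = []
    for residue, match in zip(row, is_match):
        if match:
            # Residue in a match column: match state; gap: delete state.
            path.append("M" if residue != "-" else "D")
        elif residue != "-":
            # Residue in a sparsely occupied column: insert state.
            path.append("I")
    return "".join(path)

is_match = match_columns(ALIGNMENT)
for row in ALIGNMENT:
    print(row, state_path(row, is_match))
```

Gaps in insert columns produce no state at all, which is why the paths can differ in length.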

Profile HMM

Searching Pfam
• Web site: lets users search query protein sequences against one, a few, or all Pfam HMMs.
  http://www.sanger.ac.uk/Pfam
  http://genome.wustl.edu/Pfam
  http://www.cgr.ki.se/Pfam
• Software: users can search locally with the Pfam profile HMMs using the freely available HMMER software package at: http://genome.wustl.edu/eddy/hmmer.html#hmmer

Sample Pfam Query Results

Acknowledgements
• Some slides adapted from lectures by Larry Hunter at University of Colorado Health Sciences Center
• Altmann Lab for critical comments
