sequence analysis


3. Multiple Sequence Alignments. Global multiple sequence alignment: deals with the entire length of the homologous sequences. Abstraction and representation of multiple sequence alignments: character based; numeric. Local multiple sequence alignment (generally called pattern identification): deals with a segment (most often without gaps) from the sequences; sequences need not be homologous over their entire length.

4. Focus: What to Remember. How to abstract an alignment or motif. What sequence or structural elements are likely to make a good diagnostic pattern or motif. How to discover motifs without previously aligning the sequences. How to recognize a good pattern. How to search for a good pattern.

5. What are motifs? Why look for them? Motifs are well-conserved regions of sequence, generally organized around one or two very highly conserved residues. The high residue conservation arises because a mutation within the motif is likely to produce a defective protein. Thus motifs are likely to be important for structure, function, or both. Useful for finding sequences that are distant family members. Useful for editing multiple sequence alignments.

6. Heterotrimeric G proteins: alpha, beta and gamma subunits

7. Cellular signaling and G proteins

8. G protein structure

9. Multiple Sequence Alignment: Fungal G-alpha subunits

10. Types of Residues. Highly conserved: probably structure or function. Highly mutable: "filler". Non-randomly mutated: recognition sites, substrate specificity.

11. Abstracting and Representing Multiple Sequence Alignments. Consensus sequence: the residue most common at each position of the alignment. Composite sequence (set representation): all residues present at each position in the alignment. Composition matrix: a table showing how many of each residue are present at each position.
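The three representations above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the slides: the function names, the PROSITE-like bracket notation for the composite sequence, and the toy alignment are all assumptions made for the example.

```python
from collections import Counter

def composition_matrix(alignment):
    """Count each residue at each column of an equal-length alignment."""
    return [Counter(column) for column in zip(*alignment)]

def consensus(alignment):
    """Most common residue at each column."""
    return "".join(c.most_common(1)[0][0] for c in composition_matrix(alignment))

def composite(alignment):
    """Set representation: every residue seen at each column (bracketed if >1)."""
    parts = []
    for counts in composition_matrix(alignment):
        residues = "".join(sorted(counts))
        parts.append(residues if len(residues) == 1 else f"[{residues}]")
    return "".join(parts)

# Hypothetical ungapped toy alignment (not taken from the slides).
toy = ["GAGES", "GSGKS", "GAGKT", "GTGKS"]
print(consensus(toy))              # GAGKS
print(composite(toy))              # G[AST]G[EK][ST]
print(composition_matrix(toy)[1])  # residue counts for the second column
```

The consensus discards everything but the majority residue, the composite keeps which residues occur but not how often, and the composition matrix keeps the counts that later scoring matrices are built from.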

13. Representing and Abstracting Multiple Sequence Alignments

14. Consensus sequences from motifs

15. Composition Matrix

16. Representations. Position-Specific Scoring Matrix (PSSM): based on log-odds scores; searched using dynamic programming (usually Smith-Waterman); gap model probably still needs work; can be developed from any number of sequences. Hidden Markov Model (HMM): fully probabilistic; uses the maximum likelihood method; gap model is integral to the entire model; usually requires a minimum of 50-100 sequences to get a good model.

17. Position-Specific Scoring Matrix. Idea: perform a large, high-quality multiple sequence alignment and compute a log-odds score for each position. Problem: assuming a normal amino acid distribution, this would require a minimum of 200 sequences, and probably more on the order of 500-1000; there are also implementation problems when qij = 0, because the logarithm is then undefined. Solution: develop a method that uses the known data and a belief about the missing observations (based on the mutational data at the core of the PAM or BLOSUM series of matrices).
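To make the qij = 0 problem concrete, here is a minimal sketch of a log-odds matrix with pseudo-counts. It deliberately uses the simplest possible prior (pseudo-counts proportional to the background frequency), which is an assumption for illustration and not the PAM/BLOSUM-based scheme the slide refers to; the function name, the `pseudocount` parameter, and the DNA toy data are likewise hypothetical.

```python
import math
from collections import Counter

def log_odds_pssm(alignment, background, pseudocount=1.0):
    """Return one {residue: log2(q_ij / p_i)} dict per alignment column.

    q_ij is estimated as (count + pseudocount * p_i) / (N + pseudocount),
    so a residue never observed in a column still gets a finite score.
    """
    n_seqs = len(alignment)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for residue, p in background.items():
            q = (counts[residue] + pseudocount * p) / (n_seqs + pseudocount)
            scores[residue] = math.log2(q / p)
        pssm.append(scores)
    return pssm

# Hypothetical DNA example with a uniform background.
background = {b: 0.25 for b in "ACGT"}
toy = ["TATA", "TATT", "TACA", "TATA"]
pssm = log_odds_pssm(toy, background)
print(pssm[0]["T"])  # positive: T is over-represented in column 1
print(pssm[0]["A"])  # negative but finite, although A was never observed
```

Without the pseudo-count term, the unobserved residues in column 1 would have q = 0 and the logarithm would fail; substitution-matrix or Dirichlet-mixture priors replace the uniform belief used here with a more informed one.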

18. Log-Odds Scoring Matrix

19. Position-Specific Scoring Matrices. Weighted average similarity matrix. Bayesian approach: mutational frequencies; Dirichlet mixtures. Evolutionary approach.

20. Weighted average similarity matrix

21. Gribskov Profile Gap Penalty. Have a preset maximum open/extend gap penalty (Open = -10, Extend = -1). For each position in the profile, define a multiplier to reduce the gap penalties. Multiplier is 100 for positions in which there are no insertions. Based on the maximum length of the gap (LGap) across all sequences in the multiple sequence alignment. Equation: multiplier = Gmax / (1.0 + Ginc × LGap), where Gmax = maximum multiplier (default = 33.3) and Ginc = rate at which the multiplier changes (default = 0.1).
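The multiplier equation is easy to evaluate directly. The sketch below only implements the formula as written on the slide, with the stated defaults; the function and argument names are made up for the example, and how the multiplier is then applied to the open and extend penalties is not specified here.

```python
def gap_multiplier(l_gap, g_max=33.3, g_inc=0.1):
    """multiplier = Gmax / (1.0 + Ginc * LGap), using the slide's defaults."""
    return g_max / (1.0 + g_inc * l_gap)

# The multiplier shrinks as the longest observed gap at a position grows.
for l_gap in (0, 5, 10):
    print(l_gap, round(gap_multiplier(l_gap), 2))
```

With the defaults, a maximum gap length of 10 halves the multiplier relative to a gap length of 0.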

22. Bayesian: Mutational Frequencies. Uses predicted counts (pseudo-counts), based on observed replacement frequencies, to create a weighted average of the probability of finding residue "i" in the "position".

23. Henikoff Algorithm

24. Bayesian: Dirichlet mixtures. K. Sjolander, K. Karplus, M. Brown, R. Hughey, A. Krogh, S. Mian, D. Haussler. 1996. Dirichlet mixtures: a method for improved detection of weak but significant homology. CABIOS 12:327-345. http://www.cse.ucsc.edu/research/compbio/dirichlets/index.html Assumes that a selection of residues should be treated as a group. Rigorous statistical treatment of the pseudo-count problem.

25. Evolutionary Based Method. M. Gribskov, S. Veretnik. 1996. Identification of Sequence Patterns with Profile Analysis. Methods in Enzymology 266:198-212. Evolution-based method: determines the probability of which residue was the true ancestor; mixes a selection of 20 different matrices based on the above probabilities; directly computes scores based on the mixture.

26. MakePSSM. For both DNA and proteins. Currently implemented: Gribskov's "Average" and Bayesian approaches (both mutational and Dirichlet frequencies). A large number of matrices and frequencies available (BLOSUM and PAM). Numerous gap models: Gribskov gap model; linear weighting depending on number of gaps; based on extreme values of log-odds scores. Future additions: Gribskov's evolutionary approach.

27. MakePSSM. Ways to define: use the mutational frequencies that underlie the BLOSUM or PAM matrices (Henikoff approach); use Dirichlet mixtures (nine component).

28. PSSM for fungal G-alpha motif 1

29. Searching with a PSSM. Most approaches use the dynamic programming algorithm, usually the Smith-Waterman variant. Excellent method for finding distantly related sequences. The gap model is affine, with the open and extend gap penalties a function of their position in the alignment. Gribskov's has a complicated form; Henikoff's did not have a gap model. Can be used to locate a motif in an alignment and then edit the alignment.
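As a much simpler stand-in for the gapped dynamic-programming search described above, the sketch below slides a PSSM along a sequence and scores every ungapped window. This is explicitly not Smith-Waterman and has no gap model; it only illustrates how position-specific scores are summed. The `scan` function and the three-column DNA matrix are hypothetical.

```python
def scan(sequence, pssm):
    """Score every ungapped window of len(pssm); return (best_score, start)."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Each column of the PSSM scores the residue aligned to it.
        score = sum(col[res] for col, res in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best

# Hypothetical 3-column DNA PSSM (log-odds-like scores) favouring "TAT".
pssm = [
    {"A": -2, "C": -2, "G": -2, "T": 2},
    {"A": 2, "C": -2, "G": -2, "T": -2},
    {"A": -2, "C": -2, "G": -2, "T": 2},
]
print(scan("GGCTATCC", pssm))  # (6, 3)
```

A gapped search replaces the single pass over windows with a dynamic-programming table in which the open and extend penalties can differ at each profile position.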

30. Hidden Markov Model (HMM). HMMs are a fully probabilistic model of a family of homologous sequences. HMMs are not a specific method of creating a multiple sequence alignment. After the HMM is created, an alignment can be generated from the HMM and any sequence from the same homologous family. There are several algorithms for creating the HMM from a set of homologous sequences; each will yield a different HMM and hence different alignments. An HMM can be calculated from a good alignment created by other means. Creating HMMs requires many sequences (>100).

31. HMM: Description. An HMM has 3 different kinds of states; a state is a probability model that specifies how frequently different types of sequence residues are found at a specific position of a family of sequences. Main states: probability of a specific sequence residue. Deletion state: probability of no sequence residue. Insertion state: probability of adding extra sequence residues. The states are connected by transition probabilities that determine how frequently each of the 3 states at a given location in the description of the sequence family is entered from the previous state.

32. Alignment → HMM

33. HMM: Diagram

35. HMM: Building an HMM. Several methods are in use for training an HMM. Some training algorithms are similar to the use of dynamic programming in the progressive pairwise method. So far, all are, like the progressive pairwise alignment method, prone to being trapped in local minima different from the correct alignment. Using these programs effectively requires some experience and expertise. Best current practice may be to take a carefully crafted alignment and use it to create an HMM for use in database searches and other statistical applications.

36. Classification Libraries/Patterns. PROSITE: composite sequence. PRINTS: PSSM. Blocks: Henikoff-style PSSM. Pfam: hidden Markov models.

37. Local Multiple Sequence Alignment. Modern programs combine two theoretical methods derived from statistics. EM (Expectation-Maximization) deals with "missing data": we know the sequences but don't know where the patterns or motifs are within them. Stochastic sampling reduces the volume of alignment space that must be searched. Number of possible pattern starting points ??? Sequence Lengths

38. Expectation Maximization (EM). Used to identify conserved domains. Uses sequences that have a common sequence pattern not easily recognized by eye. Iterates two steps: (1) calculate the probability of finding the site at any position in the sequences; (2) use the new counts estimated in step 1 to update the previous set.
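The two-step iteration can be sketched for the simplest case: one ungapped motif of fixed width, exactly one occurrence per sequence, and a uniform background. Everything specific here is an assumption for illustration (the `em_oops` name, the `start_site` initial guess, the pseudo-count value, the DNA toy data); real programs such as MEME are considerably more elaborate.

```python
import math

ALPHABET = "ACGT"

def em_oops(seqs, width, start_site, iterations=20, pseudo=0.1):
    """EM for one ungapped motif, assuming one occurrence per sequence."""
    bg = 1.0 / len(ALPHABET)
    # Initialise the motif model from one guessed site, softened by pseudo-counts.
    model = [{a: (pseudo + (a == ch)) / (1 + pseudo * len(ALPHABET))
              for a in ALPHABET} for ch in start_site]
    for _ in range(iterations):
        counts = [{a: pseudo for a in ALPHABET} for _ in range(width)]
        posteriors = []
        for seq in seqs:
            # Step 1: probability that the site starts at each position.
            like = [math.prod(model[j][seq[s + j]] / bg for j in range(width))
                    for s in range(len(seq) - width + 1)]
            total = sum(like)
            post = [x / total for x in like]
            posteriors.append(post)
            # Accumulate expected residue counts weighted by those probabilities.
            for s, w in enumerate(post):
                for j in range(width):
                    counts[j][seq[s + j]] += w
        # Step 2: the new expected counts replace the previous motif model.
        model = [{a: c[a] / sum(c.values()) for a in ALPHABET} for c in counts]
    sites = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    return model, sites

# Hypothetical sequences, each containing one copy of TATA.
toy = ["GGTATAGC", "CCTATACG", "GCCGTATA", "TATACGGC"]
model, sites = em_oops(toy, width=4, start_site="TATA")
print(sites)  # [2, 2, 4, 0]
```

Because EM only climbs to the nearest local optimum, the result depends on the initial guess; this is why practical tools try many starting points.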

39. Expectation-Maximization. A good motif or pattern is easy to recognize: it has a high information content (entropy). p_i: fraction of residue i in the sequences. q_i,j: fraction of residue i at position j of the pattern. i is the sequence residue type index; j is the index of the position within the pattern.
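The formula from this slide is not preserved in the transcript. A common definition that matches the symbols defined above is the relative entropy, I = Σ_j Σ_i q_i,j log2(q_i,j / p_i); the sketch below assumes that form, and the function name and toy frequencies are hypothetical.

```python
import math

def information_content(q, p):
    """Sum over positions j and residues i of q[j][i] * log2(q[j][i] / p[i])."""
    return sum(q_ji * math.log2(q_ji / p[i])
               for column in q
               for i, q_ji in column.items()
               if q_ji > 0)  # terms with q = 0 contribute nothing

p = {b: 0.25 for b in "ACGT"}                            # uniform background
conserved = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}]   # perfectly conserved column
random_like = [dict(p)]                                   # column identical to background
print(information_content(conserved, p))    # 2.0
print(information_content(random_like, p))  # 0.0
```

Under this definition a pattern scores highly only when its columns differ strongly from the background composition, which is what makes it recognizable.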

40. Stochastic Sampling. Also known as the Gibbs sampler. There are too many possible motifs to calculate the information for all of them. Exploits the "memory" of empirical position-specific scoring matrix motif representations (profiles): a sequence segment that was part of the pattern used to calculate an empirical log-odds position-specific matrix representation of the pattern will generally have a higher than "average" or expected score when scored using the matrix.

41. Stochastic Sampling: Example



44. Stochastic Sampling: Example. Score every segment of the left-out sequence. Use the scores for each segment to randomly select one of the segments. Choose a new sequence to leave out of the data and repeat the process with the already defined sequence segments.
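The leave-one-out loop described in this example can be sketched as follows. This is a bare-bones illustration under several assumptions (uniform background, fixed motif width, one site per sequence, fixed number of sweeps); the `gibbs_motif` name, its parameters, and the toy sequences are hypothetical, and published samplers add refinements such as phase-shift moves.

```python
import math
import random
from collections import Counter

def gibbs_motif(seqs, width, sweeps=200, pseudo=0.5, seed=0, alphabet="ACGT"):
    """Gibbs-style sampler: one motif start per sequence, resampled in turn."""
    rng = random.Random(seed)
    bg = 1.0 / len(alphabet)
    # Begin with a random segment in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for left_out, seq in enumerate(seqs):
            # Build position-specific counts from all the other sequences.
            others = [s[p:p + width]
                      for k, (s, p) in enumerate(zip(seqs, starts)) if k != left_out]
            cols = [Counter(col) for col in zip(*others)]
            denom = len(others) + pseudo * len(alphabet)
            # Score every segment of the left-out sequence (likelihood ratio).
            weights = [
                math.prod((cols[j][seq[s + j]] + pseudo) / denom / bg
                          for j in range(width))
                for s in range(len(seq) - width + 1)
            ]
            # Randomly select a segment with probability proportional to its score.
            starts[left_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Hypothetical sequences, each containing one copy of TATA.
toy = ["GGTATAGC", "CCTATACG", "GCCGTATA", "TATACGGC"]
starts = gibbs_motif(toy, width=4)
print([s[p:p + 4] for s, p in zip(toy, starts)])
```

Sampling in proportion to the score, rather than always taking the best segment, is what lets the procedure escape poor early choices; the output is therefore a sample, and different seeds can give different segments.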

45. Refine the Pattern: Picking the next word


47. MEME: Multiple EM for Motif Elicitation. Will locate one or more ungapped patterns (motifs) in a set of sequences. A search is conducted for a range of possible motif widths, and the EM algorithm is used to find the best estimate for the width of the motif. Models: OOPS (one occurrence per sequence); ZOOPS (zero or one occurrence per sequence); TCM (any number of occurrences per sequence). Can use prior knowledge about possible motifs. Will produce a PSSM.

48. Which Representation Should I Use? The Gribskov profile is the simplest (has the fewest parameters to fit to the available data) and requires the least data to adequately model these parameters. HMMs are the most complex and require the most data to adequately model the parameters. PSSMs are intermediate. HMMs and PSSMs tend to "look" like the background when you don't have enough data to adequately model the parameters.

49. Sequence Logos. Use the information content of a PSSM to make a graphical representation. http://weblogo.berkeley.edu Crooks GE, Hon G, Chandonia JM, Brenner SE. 2004. WebLogo: A sequence logo generator. Genome Research 14:1188-1190. Schneider TD, Stephens RM. 1990. Sequence Logos: A New Way to Display Consensus Sequences. Nucleic Acids Res. 18:6097-6100.
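The quantity a logo draws can be computed without any plotting. In the usual formulation, each column has total height log2(alphabet size) minus its Shannon entropy, and each letter gets a share proportional to its frequency. The sketch below assumes that formulation and omits the small-sample correction; the function name and toy columns are hypothetical.

```python
import math

def logo_heights(freqs, alphabet_size=4):
    """Per-column letter heights in bits: f * (log2(alphabet_size) - H)."""
    heights = []
    for column in freqs:
        # Shannon entropy of the column (zero-frequency residues are skipped).
        entropy = -sum(f * math.log2(f) for f in column.values() if f > 0)
        info = math.log2(alphabet_size) - entropy
        heights.append({res: f * info for res, f in column.items() if f > 0})
    return heights

# Hypothetical DNA columns: one fully conserved, one split between two bases.
cols = [{"A": 1.0}, {"A": 0.5, "T": 0.5}]
print(logo_heights(cols))  # [{'A': 2.0}, {'A': 0.5, 'T': 0.5}]
```

For proteins the same calculation applies with an alphabet size of 20, giving a maximum column height of log2(20) bits.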

50. MEME motif patterns for fungal G-alphas

51. Fungal G-alpha motifs - structure view

52. MAST: Motif Alignment and Search Tool. Searches through databases to identify motifs in other proteins. Helps in finding distantly related sequences. Helps in finding additional information about the motifs.

53. MetaMEME. Uses the EM and HMM methods to find motifs. Starts with the EM method of MEME to find most of the motifs. A simplified HMM is produced using the MEME results as prior information. MAST is used to determine the most probable order and spacing of the patterns. The above information and a set of modified Dirichlet mixtures are used to train the HMM. The HMM can then be used for database searches.

54. PSI-BLAST. Searches a database with BLAST, using the query to get a group of sequences that will be used to create a PSSM. Iterates the search using the PSSM. The user selects the sequences to be included for each iteration. Helps find distantly related sequences. The user may input their own PSSM instead of a query sequence. Care is needed in selecting sequences for PSSM creation or for iterations: inclusion of wrong sequences may quickly create artifacts and end up with an incorrect set of sequences.

55. Glutathione S-Transferase. Detoxifies organic chemicals containing halogen or double bonds by addition of glutathione. Subsequent processing pathway leads to excretion. The catalytic residue (thiol) is from glutathione. Only the cytoplasmic form is presented here. Classified into six groups, initially based on Swiss-Prot database annotation. Exact number of groups is still subject to debate. Found in bacteria and all kinds of eukaryotes. 126 sequences from the Swiss-Prot database.

56. MEME ZOOPS Motifs for GST

57. MEME ZOOPS Motifs -- Rat Mu-1

58. Specialized for Nucleic Acids. AlignACE: Roth, F.P., Hughes, J.D., Estep, P.W., and Church, G.M. 1998. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotech. 16:939-945. BioProspector: Liu, X., Brutlag, D.L., and Liu, J.S. 2001. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput. 2001:127-38. Up to tetranucleotide (3rd order Markov model) background. Gibbs Recursive Sampler: Thompson, W., Rouchka, E.C., and Lawrence, C.E. 2003. Gibbs Recursive Sampler: finding transcription factor binding sites. Nucleic Acids Res. 31:3580-3585. Allows palindromes and spacers in the model.
