Sequence Analysis - Overview: Multiple Sequence Alignments
3. Multiple Sequence Alignments Global Multiple Sequence Alignment
Deals with the entire length of the homologous sequences
Abstraction and Representation of Multiple Sequence Alignments
Character based
Numeric
Local Multiple Sequence Alignment (generally called pattern identification)
Deals with a segment (most often without gaps) from the sequences
Sequences need not be homologous over their entire length
4. Focus: What to Remember How to abstract an alignment or motif
What sequence or structural elements are likely to be found as a good diagnostic pattern or motif
Discovering motifs without previously aligning the sequences
How do you recognize a good pattern?
How do you search for a good pattern?
5. What are motifs? Why look for them? Motifs are well conserved regions of sequence generally organized around one or two very highly conserved residues.
The residues are highly conserved because a mutation within the motif is likely to produce a defective protein.
Thus motifs are likely to be important for structure, function, or both.
Useful for finding sequences that are distant family members
Useful for editing multiple sequence alignments
6. Heterotrimeric G-proteins: alpha, beta and gamma subunits
7. Cellular signaling and G-proteins
8. G-protein structure
9. Multiple Sequence Alignment: Fungal G-alpha subunits
10. Types of Residues Highly Conserved
probably structure or function
Highly Mutable
filler
Non-Randomly Mutated
recognition sites, substrate specificity
11. Abstracting and Representing Multiple Sequence Alignments Consensus Sequence
Residue most common at each position of the alignment
Composite Sequence (set representation)
All residues present at each position in the alignment
Composition Matrix
Table showing how many of each residue are present at each position
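The three abstractions above can be sketched in a few lines of Python. This is only an illustration on a made-up, ungapped toy alignment; the function names and the data are assumptions, not part of the original slides.

```python
from collections import Counter

# Toy ungapped alignment (hypothetical data, one string per sequence).
alignment = ["GDSGK", "GESGK", "GDSGR", "GQSGK"]

def columns(aln):
    """Return each alignment column as a string."""
    return ["".join(col) for col in zip(*aln)]

def consensus(aln):
    """Most common residue at each position (ties: first residue seen)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in columns(aln))

def composite(aln):
    """Set representation: all residues present at each position."""
    return [sorted(set(col)) for col in columns(aln)]

def composition_matrix(aln):
    """One Counter per position: residue -> number of occurrences."""
    return [Counter(col) for col in columns(aln)]

print(consensus(alignment))
print(composite(alignment))
print([dict(c) for c in composition_matrix(alignment)])
```

The consensus keeps one residue per position, the composite keeps the set, and the composition matrix keeps the counts, so each representation retains strictly more information than the previous one.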
13. Representing and Abstracting Multiple Sequence Alignments
14. Consensus sequences from motifs
15. Composition Matrix
16. Representations Position-Specific Scoring Matrix
Based on log-odds scores
Uses dynamic programming (usually Smith-Waterman)
Gap model probably still needs work
PSSM can be developed from any number of sequences
Hidden Markov Model
Fully probabilistic
Uses maximum likelihood method
Gap model is integral to entire model
HMM usually requires a minimum of 50-100 sequences to get a good model
17. Position-Specific Scoring Matrix Idea:
Perform a large, high quality multiple sequence alignment
Compute a log-odds score for each position
Problem:
Assuming a normal amino acid distribution, this would require a minimum of 200 sequences, and probably more on the order of 500-1000.
Implementation problems when q_ij = 0 (the log-odds score is then undefined)
Solution:
Develop a method that uses the known data and a belief about the missing observations (based on the mutational data at the core of the PAM or BLOSUM series of matrices)
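A minimal sketch of the idea, assuming the simplest possible "belief": a small pseudo-count distributed according to the background frequencies. This is a stand-in for the PAM/BLOSUM-based beliefs mentioned above, not the method from the slides; the function name, the toy DNA alignment, and the pseudo-count value are all assumptions.

```python
import math
from collections import Counter

def log_odds_pssm(aln, background, pseudocount=1.0):
    """Return one {residue: log2(q_ij / p_i)} dict per alignment column.

    q_ij is estimated as (count + pseudocount * p_i) / (N + pseudocount),
    so a residue never observed in a column still has q_ij > 0.
    """
    n_seqs = len(aln)
    pssm = []
    for col in zip(*aln):
        counts = Counter(col)
        scores = {}
        for res, p in background.items():
            q = (counts[res] + pseudocount * p) / (n_seqs + pseudocount)
            scores[res] = math.log2(q / p)
        pssm.append(scores)
    return pssm

# Toy DNA example with a uniform background (hypothetical data).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
alignment = ["ACGT", "ACGA", "ACTT", "ACGT"]
pssm = log_odds_pssm(alignment, background)
for j, scores in enumerate(pssm):
    print(j, {r: round(s, 2) for r, s in scores.items()})
```

Without the pseudo-count, every unobserved residue would give log2(0); with it, unobserved residues get a finite negative score.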
18. Log-Odds Scoring Matrix
19. Position Specific Scoring Matrices Weighted Average Similarity Matrix
Bayesian approach
Mutational Frequencies
Dirichlet mixtures
Evolutionary Approach
20. Weighted average similarity matrix
21. Gribskov Profile Gap Penalty Have a preset maximum open/extend gap penalty
Open = -10, Extend = -1
For each position in the profile, define a multiplier to reduce the gap penalties.
Multiplier is 100 for positions in which there are no insertions
Based on the maximum length of the gap (LGap) across all sequences in the multiple sequence alignment
Equation:
multiplier = Gmax / (1.0 + Ginc × LGap)
where: Gmax = maximum multiplier (Default = 33.3)
Ginc = rate at which the multiplier changes (Default = 0.1)
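The multiplier rule can be written as a short function. This is a sketch under two assumptions that the slide does not state explicitly: that the multiplier is read as a percentage of the preset penalty, and that positions with no insertions (LGap = 0) bypass the equation and get 100. The function and parameter names are hypothetical.

```python
def gap_multiplier(l_gap, g_max=33.3, g_inc=0.1, no_gap_value=100.0):
    """Per-position multiplier for the open/extend gap penalties.

    l_gap is the longest gap seen at this position across the alignment.
    Positions with no insertions (l_gap == 0) get no_gap_value.
    """
    if l_gap == 0:
        return no_gap_value
    return g_max / (1.0 + g_inc * l_gap)

def position_penalties(l_gap, gap_open=-10.0, gap_extend=-1.0):
    """Scale the preset penalties, treating the multiplier as a percentage
    (an assumption made for this sketch)."""
    m = gap_multiplier(l_gap) / 100.0
    return gap_open * m, gap_extend * m

for l_gap in (0, 1, 5, 20):
    print(l_gap, round(gap_multiplier(l_gap), 2), position_penalties(l_gap))
```

The longer the gaps already observed at a position, the smaller the multiplier, so opening a gap there costs less.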
22. Bayesian Mutational Frequencies Uses predicted counts (pseudo-counts), based on observed replacement frequencies, to create a weighted average of the probability of finding residue i at the position
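One way to picture this weighted average is sketched below. It assumes the pseudo-counts for residue a are obtained by spreading a total pseudo-count weight B over the residues that the observed residues are known to be replaced by, in proportion to pair frequencies q_ia. The three-letter alphabet, the pair frequencies, and the value of B are invented for illustration and are not taken from any real substitution matrix.

```python
def weighted_average_probs(counts, pair_freq, total_pseudo):
    """Blend observed counts with replacement-based pseudo-counts.

    counts:       {residue: observed count in the column}
    pair_freq:    {i: {a: q_ia}} symmetric pair frequencies (hypothetical)
    total_pseudo: total pseudo-count weight B (free parameter in this sketch)
    """
    alphabet = list(pair_freq)
    n_total = sum(counts.get(a, 0) for a in alphabet)
    # Conditional replacement probability P(a | i) = q_ia / sum_k q_ik.
    row_sum = {i: sum(pair_freq[i].values()) for i in alphabet}
    pseudo = {
        a: total_pseudo * sum(
            (counts.get(i, 0) / n_total) * (pair_freq[i][a] / row_sum[i])
            for i in alphabet
        )
        for a in alphabet
    }
    return {
        a: (counts.get(a, 0) + pseudo[a]) / (n_total + total_pseudo)
        for a in alphabet
    }

# Invented pair frequencies: I and L replace each other often, D rarely.
pair_freq = {
    "I": {"I": 0.20, "L": 0.10, "D": 0.02},
    "L": {"I": 0.10, "L": 0.25, "D": 0.03},
    "D": {"I": 0.02, "L": 0.03, "D": 0.25},
}
probs = weighted_average_probs({"I": 8}, pair_freq, total_pseudo=2.0)
print({a: round(p, 4) for a, p in probs.items()})
```

Although only I was observed, L ends up with a higher probability than D because I is replaced by L more often than by D in the assumed frequencies.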
23. Henikoff Algorithm
24. Bayesian Dirichlet mixtures K. Sjolander, K. Karplus, M. Brown, R. Hughey, A. Krogh, S. Mian, D. Haussler, 1996. Dirichlet mixtures: a method for improved detection of weak but significant homology. CABIOS 12:327-345.
http://www.cse.ucsc.edu/research/compbio/dirichlets/index.html
Assumes that a selection of residues should be treated as a group
Rigorous statistical treatment of the pseudo-count problem
25. Evolutionary Based Method M. Gribskov, S. Veretnik. 1996. Identification of Sequence Patterns with Profile Analysis.
Methods in Enzymology 266:198-212.
Evolution-based method
Determines the probability of which residue was the true ancestor
Mixes a selection of 20 different matrices based on the above probabilities
Directly computes scores based on the mixture
26. MakePSSM For both DNA and Proteins
Currently implemented: Gribskov's Average and Bayesian Approaches (both mutational and Dirichlet frequencies)
A large number of Matrices and Frequencies available
BLOSUM and PAM
Numerous Gap Models
Gribskov Gap Model
Linear weighting depending on number of Gaps
Based on extreme values of Log-Odds scores
Future Additions:
Gribskov's evolutionary approach
27. MakePSSM Ways to define: Use the mutational frequencies that underlie the BLOSUM or PAM matrices (Henikoff approach).
Use Dirichlet mixtures (nine component).
28. PSSM for fungal G-alpha motif 1
29. Searching with a PSSM Most approaches use the dynamic programming algorithm, usually the Smith-Waterman variant
Excellent method for finding distantly related sequences
Gap model is affine, with the open and extend gap penalties a function of the position in the alignment. Gribskov's has a complicated form; the Henikoffs' did not have a gap model
Can be used to locate a motif in an alignment and then edit the alignment
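As a simplified illustration of searching with a PSSM, the sketch below slides the matrix along a sequence without gaps and scores every window. This is deliberately not the Smith-Waterman search described above (there is no gap model at all); the matrix values, the default score for unknown residues, and the function name are hypothetical.

```python
def scan(sequence, pssm, default=-4.0):
    """Score every ungapped window; return (start, score) pairs, best first."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        hits.append((start, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Hypothetical 3-column log-odds PSSM for a toy DNA motif "ACG".
pssm = [
    {"A": 1.8, "C": -2.0, "G": -2.0, "T": -2.0},
    {"A": -2.0, "C": 1.8, "G": -2.0, "T": -2.0},
    {"A": -2.0, "C": -2.0, "G": 1.8, "T": -2.0},
]
hits = scan("TTACGTT", pssm)
best_start, best_score = hits[0]
print(best_start, round(best_score, 2))
```

A gapped search would replace the fixed window with dynamic programming and position-dependent gap penalties; the per-position scoring step stays the same.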
30. Hidden Markov Model (HMM) HMMs are a fully probabilistic model of a family of homologous sequences
HMMs are not a specific method of creating a multiple sequence alignment
After the HMM is created an alignment can be generated from the HMM and any sequence from the same homologous family
There are several algorithms for creating the HMM from a set of homologous sequences - each will yield a different HMM and hence different alignments
An HMM can be calculated from a good alignment created by other means
Creating HMMs requires many sequences (>100)
31. HMM: Description HMM has 3 different kinds of states - a state is a probability model that specifies how frequently different types of sequence residues are found at a specific position of a family of sequences
Main States: probability of a specific sequence residue
Deletion State: probability of no sequence residue
Insertion State: probability of adding extra sequence residues
The states are connected by transition probabilities that determine how frequently you go to the 3 states corresponding to the location in the description of the sequence family from the previous state.
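The three kinds of states and their transition probabilities can be pictured with a very small hand-built model. The sketch below only evaluates the probability of one given path through the model; it does no training and no alignment. Every number is invented, the state names (M1, I1, D1, ...) are assumptions, and a few transitions that a full profile HMM would have are left out for brevity.

```python
import math

# Two-position toy model over DNA. All probabilities are invented.
match_emissions = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},  # main state M1
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},  # main state M2
]
insert_emission = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

transitions = {
    ("begin", "M1"): 0.9, ("begin", "D1"): 0.1,
    ("M1", "M2"): 0.8, ("M1", "I1"): 0.1, ("M1", "D2"): 0.1,
    ("I1", "M2"): 0.6, ("I1", "I1"): 0.3, ("I1", "D2"): 0.1,
    ("D1", "M2"): 0.9, ("D1", "D2"): 0.1,
    ("M2", "end"): 1.0, ("D2", "end"): 1.0,
}

def path_log_prob(path):
    """Log-probability of a path given as [(state, residue or None), ...].

    Main (M) and insertion (I) states emit a residue; deletion (D) states
    emit nothing, so their residue is None.
    """
    logp = 0.0
    prev = "begin"
    for state, residue in path:
        logp += math.log(transitions[(prev, state)])
        if state.startswith("M"):
            logp += math.log(match_emissions[int(state[1:]) - 1][residue])
        elif state.startswith("I"):
            logp += math.log(insert_emission[residue])
        prev = state
    return logp + math.log(transitions[(prev, "end")])

p_match = math.exp(path_log_prob([("M1", "A"), ("M2", "G")]))
p_insert = math.exp(path_log_prob([("M1", "A"), ("I1", "T"), ("M2", "G")]))
p_delete = math.exp(path_log_prob([("D1", None), ("M2", "G")]))
print(p_match, p_insert, p_delete)
```

Aligning a sequence to the model amounts to finding the most probable such path, which is what makes the gap model integral to the HMM.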
32. Alignment → HMM
33. HMM: Diagram
35. HMM: Building an HMM Several methods are in use for training an HMM
Some training algorithms are similar to the use of dynamic programming in the progressive pairwise method
So far, all are, like the progressive pairwise alignment method, prone to being trapped in local minima different from the correct alignment
Requires some experience and expertise to use these programs effectively
Best current practice may be to take a carefully crafted alignment and use it to create an HMM for use in database searches and other statistical applications
36. Classification Libraries/Patterns ProSite: composite sequence
Prints: PSSM
Blocks: Henikoff-style PSSM
Pfam: Hidden Markov Models
37. Local Multiple Sequence Alignment Modern programs combine two theoretical methods derived from statistics
EM (Expectation-Maximization) to deal with missing data
We know the sequences but don't know where the patterns or motifs are within them
Stochastic Sampling to reduce the volume of alignment space that must be searched
Number of possible pattern starting points ≈ product of the sequence lengths
38. Expectation Maximization (EM) Used to identify conserved domains
Uses sequences that have a common sequence pattern not easily recognized by eye
Iterates two steps:
Calculates the probability of finding the site at any position in the sequences
New counts estimated in step 1 are used to update the previous set
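The two steps can be sketched for the simplest case of exactly one motif occurrence per sequence. This is an illustrative toy, not the MEME implementation: the starting guess (`seed`) is chosen by hand, the pseudo-count is arbitrary, the sequences are made up, and all names are assumptions.

```python
import math

def em_oops(seqs, seed, iterations=20, pseudo=0.1, alphabet="ACGT"):
    """EM for one ungapped motif, one occurrence per sequence.

    seed is an initial guess at the motif; its length sets the width.
    Returns the motif model and per-sequence start-position posteriors.
    """
    width = len(seed)
    total = sum(len(s) for s in seqs)
    bg = {a: sum(s.count(a) for s in seqs) / total for a in alphabet}
    hi = 0.7
    lo = (1.0 - hi) / (len(alphabet) - 1)
    model = [{a: (hi if a == seed[j] else lo) for a in alphabet}
             for j in range(width)]
    posteriors = []
    for _ in range(iterations):
        # Step 1: probability of the site starting at each position.
        posteriors = []
        for s in seqs:
            ratios = [
                math.prod(model[j][s[i + j]] / bg[s[i + j]]
                          for j in range(width))
                for i in range(len(s) - width + 1)
            ]
            z = sum(ratios)
            posteriors.append([r / z for r in ratios])
        # Step 2: expected counts from step 1 replace the previous model.
        counts = [{a: pseudo for a in alphabet} for _ in range(width)]
        for s, post in zip(seqs, posteriors):
            for i, w in enumerate(post):
                for j in range(width):
                    counts[j][s[i + j]] += w
        model = [{a: c[a] / sum(c.values()) for a in alphabet}
                 for c in counts]
    return model, posteriors

# Toy sequences with "TGACT" planted at positions 2, 1 and 5.
seqs = ["CCTGACTGGA", "ATGACTCCGA", "GGCATTGACT"]
model, posteriors = em_oops(seqs, seed=seqs[0][2:7])
starts = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
motif = "".join(max(col, key=col.get) for col in model)
print(starts, motif)
```

In practice many starting guesses are tried, because EM only climbs to the nearest local optimum from wherever it starts.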
39. Expectation-Maximization A good motif or pattern is easy to recognize
It has a high information content (relative entropy)
p_i: fraction of residue i in the sequences
q_i,j: fraction of residue i at position j of the pattern
i is the sequence residue type index
j is the index of the position within the pattern
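The formula itself is not present in the slide text. A common definition that is consistent with the symbols defined above is the relative entropy, the sum over positions j and residue types i of q_i,j × log2(q_i,j / p_i). The sketch below assumes that definition; the function name and the toy frequencies are hypothetical.

```python
import math

def information_content(q, p):
    """Relative entropy in bits of pattern frequencies q against background p.

    q: list of {residue: q_ij} dicts, one per pattern position j
    p: {residue: p_i} background fractions
    """
    total = 0.0
    for column in q:
        for residue, q_ij in column.items():
            if q_ij > 0:  # the term 0 * log(0) is taken as 0
                total += q_ij * math.log2(q_ij / p[residue])
    return total

p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(information_content([conserved], p))
print(information_content([uniform], p))
print(information_content([conserved, uniform], p))
```

A perfectly conserved position contributes the maximum, and a position that looks like the background contributes nothing, which is why a good pattern stands out.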
40. Stochastic Sampling Also known as the Gibbs sampler
Too many possible motifs to calculate the information for all of them
Exploit the memory of empirical position-specific scoring matrix motif representations (profiles)
A sequence segment that was used to build an empirical log-odds PSSM for the pattern will generally score higher than average (or than expected) when scored with that matrix
41. Stochastic Sampling: Example
42. Stochastic Sampling: Example
43. Stochastic Sampling: Example
44. Stochastic Sampling: Example Score every segment of the left-out sequence
Use the scores for each segment to randomly select one of the segments
Choose a new sequence to leave out of the data and repeat the process with the already defined sequence segments
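The leave-one-out loop can be sketched as follows. This is a bare-bones illustration, assuming one segment per sequence, random initial segments, selection with probability proportional to the likelihood ratio, and sequences left out in turn; the sweep count, pseudo-count, and names are arbitrary choices, and the toy sequences are invented.

```python
import math
import random
from collections import Counter

def gibbs_motif(seqs, width, sweeps=200, pseudo=0.5, alphabet="ACGT", seed=0):
    """Return one sampled segment start per sequence."""
    rng = random.Random(seed)
    total = sum(len(s) for s in seqs)
    bg = {a: (sum(s.count(a) for s in seqs) + 1) / (total + len(alphabet))
          for a in alphabet}
    # Start from a random segment in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for left_out in range(len(seqs)):
            # Build the profile from the segments of all other sequences.
            others = [seqs[k][starts[k]:starts[k] + width]
                      for k in range(len(seqs)) if k != left_out]
            denom = len(others) + pseudo * len(alphabet)
            profile = []
            for col in zip(*others):
                counts = Counter(col)
                profile.append({a: (counts[a] + pseudo) / denom
                                for a in alphabet})
            # Score every segment of the left-out sequence.
            s = seqs[left_out]
            weights = [
                math.prod(profile[j][s[i + j]] / bg[s[i + j]]
                          for j in range(width))
                for i in range(len(s) - width + 1)
            ]
            # Randomly select one segment, weighted by its score.
            starts[left_out] = rng.choices(range(len(weights)),
                                           weights=weights)[0]
    return starts

seqs = ["CCTGACTGGA", "ATGACTCCGA", "GGCATTGACT"]
starts = gibbs_motif(seqs, width=5)
print(starts, [s[i:i + 5] for s, i in zip(seqs, starts)])
```

Because the selection is random rather than greedy, the sampler can leave a poor local optimum, which is the main practical difference from the EM approach.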
45. Refine the Pattern: Picking the next word
46. Stochastic Sampling: Example
47. MEME Multiple EM for Motif Elicitation (MEME)
Will locate one or more ungapped patterns (motifs) in a set of sequences
A search is conducted for a range of possible motif widths and the EM algorithm is used to find the best estimate for the width of the motif
OOPS (one occurrence per sequence)
ZOOPS (zero or one occurrence per sequence)
TCM (any number of occurrences per sequence)
Can use prior knowledge about possible motifs
Will produce a PSSM
48. Which Representation Should I Use Gribskov Profile is the simplest (has the fewest parameters to fit to the available data) and requires the least data to adequately model these parameters.
HMMs are the most complex and require the most data to adequately model the parameters.
PSSMs are intermediate.
HMMs and PSSMs tend to look like the background when you don't have enough data to adequately model the parameters.
49. Sequence Logos Use information content of PSSM to make a graphical representation
http://weblogo.berkeley.edu
Crooks GE, Hon G, Chandonia JM, Brenner SE. 2004. WebLogo: A sequence logo generator. Genome Research 14:1188-1190
Schneider TD, Stephens RM. 1990. Sequence Logos: A New Way to Display Consensus Sequences. Nucleic Acids Res. 18:6097-6100
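The letter heights in a logo can be sketched from per-position frequencies. The sketch assumes the usual convention that the stack height at a position is log2(alphabet size) minus the Shannon entropy of that position, and that each letter's height is its frequency times the stack height. It leaves out the small-sample correction, and the function name and example frequencies are hypothetical.

```python
import math

def logo_heights(column_freqs, alphabet_size=4):
    """Letter heights in bits for one logo position.

    column_freqs: {residue: frequency at this position}
    """
    entropy = -sum(f * math.log2(f) for f in column_freqs.values() if f > 0)
    info = math.log2(alphabet_size) - entropy  # total stack height
    return {res: f * info for res, f in column_freqs.items()}

print(logo_heights({"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}))
print(logo_heights({"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0}))
print(logo_heights({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}))
```

For proteins the same calculation applies with alphabet_size=20, giving a maximum stack height of about 4.32 bits.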
50. MEME motif patterns for fungal G-alphas
51. Fungal G-alpha motifs - structure view
52. MAST Motif Alignment and Search Tool
Searches through databases to identify motifs in other proteins
Helps in finding distantly related sequences
Helps in finding additional information about the motifs
53. MetaMEME Uses the EM and HMM methods to find motifs
Starts with the EM method of MEME to find most of the motifs
A simplified HMM is produced using the MEME results as prior information
MAST is used to determine the most probable order and spacing of the patterns
The above information and a set of modified Dirichlet mixtures are used to train the HMM
The HMM can then be used for database searches
54. PSI-BLAST Searches database with BLAST using query to get a group of sequences that will be used to create a PSSM
Iterates the search using the PSSM
User selects the sequences to be included for iteration
Helps find distantly related sequences
User may input own PSSM instead of a query sequence
Must be careful about sequence selection for PSSM creation or for iterations
Inclusion of wrong sequences may quickly create artifacts and end up with an incorrect set of sequences
55. Glutathione S-Transferase Detoxifies organic chemicals containing halogen or double bonds by addition of Glutathione.
Subsequent processing pathway leads to excretion.
The catalytic residue (thiol) is from Glutathione.
Only the cytoplasmic form is presented here.
Classified into six groups, initially based on Swiss-Prot database annotation. Exact number of groups is still subject to debate.
Found in bacteria and all kinds of eukaryotes.
126 sequences from the Swiss-Prot database.
56. MEME ZOOPS Motifs for GST
57. MEME ZOOPS Motifs -- Rat Mu-1
58. Specialized for Nucleic Acids AlignACE
Roth, F.P., Hughes, J.D., Estep, P.W., and Church, G.M. 1998. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotech. 16:939-945.
BioProspector
Liu, X., Brutlag, D.L., and Liu, J.S. 2001. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput. 2001:127-138.
Up to tetranucleotide (3rd order Markov Model) background
Gibbs Recursive Sampler
Thompson, W., Rouchka, E.C., and Lawrence, C.E. 2003. Gibbs Recursive Sampler: finding transcription factor binding sites. Nucleic Acids Res. 31:3580-3585.
Allows palindromes and spacers in the model.