BCB 444/544 Lecture 13 Star Alignment & Clustal (for MSA) Perhaps: Profiles & Hidden Markov Models (HMMs) #13_Sept19 BCB 444/544 F07 ISU Dobbs #13- Star Alignment; HMMs

Required Reading (before lecture)
√ Mon Sept 17 - Lecture 12: Position Specific Scoring Matrices & PSI-BLAST
  • Chp 6 - pp 75-78 (but not HMMs)
Wed Sept 19 - Lecture 13 (not covered on Exam 1): Profiles & Hidden Markov Models
  • Chp 6 - pp 79-84
  • Eddy: What is a hidden Markov Model? 2004 Nature Biotechnol 22:1315
    http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html
Fri Sept 21 - EXAM 1

Assignments & Announcements
√ Sun Sept 16 - Study Guide for Exam 1 was posted
√ Mon Sept 17 - Answers to HW#2 were posted
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
  • Lectures 2-12 (thru Mon Sept 17)
  • Labs 1-4
  • HW2
  • All assigned reading: Chps 2-6 (but not HMMs); Eddy: What is Dynamic Programming?

Chp 5 - Multiple Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 5 Multiple Sequence Alignment
  • √ Scoring Function
  • √ Exhaustive Algorithms
  • Heuristic Algorithms
    • Star Alignment
    • Clustal
  • √ Practical Issues
First, review MSA scoring briefly, then back to Star Alignment & ClustalW

Scoring an Alignment - in Lecture 12, so will be covered on Exam 1
[Figure: example multiple alignment, with the gap penalty and the ith column m_i of alignment m labeled]
  • In practice, simple scoring functions are used
  • Usually, columns are scored independently: each column m_i (the ith column of alignment m) gets its own score S(m_i)

Sum of Pairs (SP) Score
  • SP = sum of pairs = sum of scores of all possible pairs of sequences in an MSA, based on a particular scoring matrix
  • Compute for each column m_i: $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$, where $m_i^k$ and $m_i^l$ are the residues in rows k and l of column m_i, and s is a PAM or BLOSUM score
[Figure: example alignment with one column m_i and a pair of its residues highlighted]

Example: Calculating SP Score (I added more colors to this slide)
Alignment M, with columns m1, m2, m3:
  F - G
  F - G
  F Y D
Gap penalty = -8; s(-,-) = 0; scoring matrix: BLOSUM 60
S(m) = S(m1) + S(m2) + S(m3)
     = 3s(F,F) + 2s(-,Y) + s(-,-) + s(G,G) + 2s(G,D)
     = 15 - 16 + 0 + 4 - 6
     = -3
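
The SP calculation above can be reproduced in a few lines. This is a minimal sketch: it assumes the three rows of M are F-G, F-G and FYD (which yields exactly the column terms on the slide), and the pair scores in `PAIR_SCORES` are the values implied by the slide's arithmetic (15/3, 4, -6/2), not values looked up in a BLOSUM table. The function names are illustrative.

```python
from itertools import combinations

GAP = "-"
GAP_PENALTY = -8  # gap penalty stated on the slide

# Pair scores implied by the slide's arithmetic (assumption: not read from a BLOSUM table).
PAIR_SCORES = {("F", "F"): 5, ("G", "G"): 4, ("D", "G"): -3}


def pair_score(a, b):
    """Score one pair of symbols taken from the same column."""
    if a == GAP and b == GAP:
        return 0  # s(-,-) = 0
    if a == GAP or b == GAP:
        return GAP_PENALTY
    return PAIR_SCORES[tuple(sorted((a, b)))]


def column_score(column):
    """S(m_i): sum of s over all pairs of rows k < l in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_score(rows):
    """S(m): sum of the independent column scores."""
    return sum(column_score(col) for col in zip(*rows))


alignment = ["F-G", "F-G", "FYD"]
print(sp_score(alignment))  # -3
```
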

Algorithms & Software for MSA? #1
Exhaustive Methods
  • √ Multidimensional dynamic programming (DP)
  • Divide-and-Conquer Alignment (DCA) - "semi-exhaustive"; web-based version available - see textbook
Full DP Optimal Global Alignment? Prohibitive in both time & space requirements for more than 10 sequences!!
Heuristic Methods
  • Progressive (Star Alignment, Clustal)
  • Iterative
  • Block-based

Dynamic Programming for MSA
  • As with pairwise alignments, MSAs can be computed by dynamic programming*
  *(if you're not in a rush!)
[Figure: dynamic programming matrices labeled 2D and 3D]

Generalized Needleman-Wunsch Algorithm (3D)
Given 3 sequences x, y, and z, the main iteration loop is:
$$
S(i,j,k) = \max \begin{cases}
S(i-1,\, j-1,\, k-1) + \delta(x_i, y_j, z_k) \\
S(i-1,\, j-1,\, k) + \delta(x_i, y_j, -) \\
S(i-1,\, j,\, k-1) + \delta(x_i, -, z_k) \\
S(i-1,\, j,\, k) + \delta(x_i, -, -) \\
S(i,\, j-1,\, k-1) + \delta(-, y_j, z_k) \\
S(i,\, j-1,\, k) + \delta(-, y_j, -) \\
S(i,\, j,\, k-1) + \delta(-, -, z_k)
\end{cases}
$$
where $\delta(a, b, c)$ is the score of one alignment column containing the symbols a, b, c.
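
The recurrence can be sketched directly as a triple loop over a 3D table. This is a minimal illustration, not the lecture's implementation: the column score is assumed to be a sum-of-pairs score with hypothetical parameters (`match=1`, `mismatch=-1`, `gap=-2`), and only the optimal score is returned (no traceback).

```python
from itertools import product


def sp_column(a, b, c, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column of three symbols ('-' is a gap)."""
    def s(p, q):
        if p == "-" and q == "-":
            return 0
        if p == "-" or q == "-":
            return gap
        return match if p == q else mismatch
    return s(a, b) + s(a, c) + s(b, c)


def align3_score(x, y, z):
    """Optimal global SP score of three sequences via 3D dynamic programming."""
    nx, ny, nz = len(x), len(y), len(z)
    neg_inf = float("-inf")
    S = [[[neg_inf] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]
    S[0][0][0] = 0
    # The 7 predecessors of a cell: every non-zero (di, dj, dk) in {0,1}^3.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(nx + 1):
        for j in range(ny + 1):
            for k in range(nz + 1):
                if i == j == k == 0:
                    continue
                best = neg_inf
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    # A step of 1 consumes a residue; a step of 0 places a gap.
                    a = x[i - 1] if di else "-"
                    b = y[j - 1] if dj else "-"
                    c = z[k - 1] if dk else "-"
                    best = max(best, S[pi][pj][pk] + sp_column(a, b, c))
                S[i][j][k] = best
    return S[nx][ny][nz]


print(align3_score("AC", "AC", "AC"))
```

The table has (n+1)^3 cells and each cell inspects 7 predecessors, which is the k = 3 case of the general cost discussed on the next slide.
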

What Happens to Computational Complexity?
Given k sequences of length n:
  • Space for matrix: $O(n^k)$
  • Neighbors/cell: $2^k - 1$
  • Time to compute SP score: $O(k^2)$
  • Overall runtime: $O(k^2 2^k n^k)$
  • Wow!!!

What's so bad about those exponents?
Example: Running Time of DP for MSA
  • Overall runtime: $O(k^2 2^k n^k)$
  • Sequences? Globins are only ≈150 aa!!
But: There are fast heuristics

Progressive Alignment
Multiple alignment by adding sequences
Heuristic procedure:
  • Align most similar sequences first
  • Add sequences progressively
Often: use guide tree to determine order of alignments
2 Examples: Star Alignment, ClustalW
[Figure: sequences 1-4 added one at a time to a growing alignment]

Guide Trees
Binary tree:
  • Leaves correspond to sequences
  • Internal nodes represent alignments
  • Root corresponds to final MSA
[Figure: guide tree with leaves ATC, ATG, TCG, TCC; the root holds the final MSA ATC-, ATG-, -TCG, -TCC]

Star Alignment - skipped on Monday: will NOT be covered on Exam 1
Back to 2 examples of progressive alignment heuristics for MSA:
  • Star Alignment
  • Clustal

Star Alignment
  • Fast heuristic to compute MSA
  • Good approximation of optimal MSA, if scoring scheme satisfies triangle inequality
Algorithm:
  1. Compute pairwise similarities
  2. Select center $s_c$ that maximizes $\sum_{i \neq c} S(s_c, s_i)$
  3. Add sequences in decreasing order of similarity to center $s_c$
  4. Produce a multiple alignment M such that, for every i, the induced pairwise alignment of $s_c$ and $s_i$ is the same as the optimal alignment of $s_c$ and $s_i$

Step 2 - Select center $s_c$ that maximizes $\sum_{i \neq c} S(s_c, s_i)$
[Figure: aligned sequences FGGHL-GF, F-GHLPGF, FGGHP-FG, shown with the sequence FGGHL-GF]
Does that function look familiar?
  • Recall: Consensus sequence = single sequence (more accurately, a "model") that represents the most common residue of each column in an MSA
  • Steiner consensus sequence or string: Given sequences $s_1, \ldots, s_k$, find a sequence $s^*$ that maximizes $\sum_i S(s^*, s_i)$
  • "String" equivalent of arithmetic mean: the consensus sequence is the string that minimizes the sum of edit distances to members of a family of strings (thus maximizing the similarity score)

Step 3 - Add sequences in decreasing order of similarity to center $s_c$
Sequences: s1: MPE, s2: MKE, s3: MSKE, s4: SKE (s2 is the center of the star)
Pairwise alignments with the center:
  s1  MPE        s3  MSKE        s2  MKE
      | |            | ||             ||
  s2  MKE        s2  M-KE        s4  SKE
Building the multiple alignment:
  s2 + s3:        + s1:          + s4:
    MSKE            M-PE           S-KE
    M-KE            MSKE           M-PE
                    M-KE           MSKE
                                   M-KE

Step 4 - Produce a multiple alignment M such that, for every i, the induced pairwise alignment of $s_c$ and $s_i$ is the same as the optimal alignment of $s_c$ and $s_i$
Optimal pairwise alignments with the center:
  Sc  AA--CCTT        Sc  A-ACC-TT
  S1  AATGCC--        S2  AGACCGT-
Resulting multiple alignment:
  S1  A-ATGCC---
  Sc  A-A--CC-TT
  S2  AGA--CCGT-
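
Steps 1-4 of star alignment can be sketched end to end as follows. This is a minimal illustration under stated assumptions: pairwise alignments use a plain Needleman-Wunsch with hypothetical scores (`match=1`, `mismatch=-1`, `gap=-1`) rather than a substitution matrix, ties for the center are broken by input order, and the merge follows the "once a gap, always a gap" rule so that every induced center/sequence alignment stays the same as its pairwise alignment.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    sub = lambda i, j: match if a[i - 1] == b[j - 1] else mismatch
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + sub(i, j), S[i - 1][j] + gap, S[i][j - 1] + gap)
    i, j, out_a, out_b = n, m, [], []
    while i > 0 or j > 0:  # traceback, preferring the diagonal move
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return S[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def merge(msa, c_aln, s_aln):
    """Add s_aln to msa (row 0 = center) without altering existing columns."""
    center_row, new_rows, new_seq = msa[0], [[] for _ in msa], []
    p = q = 0  # p walks the MSA columns, q walks the pairwise alignment
    while p < len(center_row) or q < len(c_aln):
        if p < len(center_row) and center_row[p] == "-":
            for row, new in zip(msa, new_rows):   # existing gap column: keep it
                new.append(row[p])
            new_seq.append("-"); p += 1
        elif q < len(c_aln) and c_aln[q] == "-":
            for new in new_rows:                  # new insertion: open a gap column
                new.append("-")
            new_seq.append(s_aln[q]); q += 1
        else:
            for row, new in zip(msa, new_rows):   # both at the same center residue
                new.append(row[p])
            new_seq.append(s_aln[q]); p += 1; q += 1
    return ["".join(r) for r in new_rows] + ["".join(new_seq)]


def star_align(seqs):
    """Return (order of sequence indices, aligned rows in that order)."""
    k = len(seqs)
    pair = {(i, j): needleman_wunsch(seqs[i], seqs[j])
            for i in range(k) for j in range(k) if i != j}
    center = max(range(k), key=lambda c: sum(pair[c, i][0] for i in range(k) if i != c))
    others = sorted((i for i in range(k) if i != center), key=lambda i: -pair[center, i][0])
    msa = [seqs[center]]
    for i in others:
        _, c_aln, s_aln = pair[center, i]
        msa = merge(msa, c_aln, s_aln)
    return [center] + others, msa


order, msa = star_align(["MPE", "MKE", "MSKE", "SKE"])
for idx, row in zip(order, msa):
    print(f"s{idx + 1}  {row}")
```

With the four sequences from the Step 3 slide, this sketch picks s2 as the center and adds s3, s1, s4 in that order. With other scoring parameters the center or the order may differ.
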

Complexity of Star Alignment?
Given k sequences of length n, and an upper bound l for alignment length, we need:
  • $O(k^2 n^2)$ to compute the alignments
  • $O(k^2)$ to compute the center
  • $O(k^2 l)$ to build the multiple alignment
Overall: $O(k^2 n^2)$
Duh - Is this really much better than $O(k^2 2^k n^k)$? YES! Remember: k = # of sequences, n = length of sequences
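
To get a feel for the gap between the two bounds, the expressions can simply be evaluated. This sketch treats the big-O expressions as raw operation counts (constant factors ignored, which is an assumption) and uses n = 150, the globin length mentioned earlier; the function names are illustrative.

```python
def dp_msa_operations(k, n):
    """Upper-bound operation count k^2 * 2^k * n^k for exhaustive DP."""
    return k**2 * 2**k * n**k


def star_operations(k, n):
    """Dominant term k^2 * n^2 for star alignment."""
    return k**2 * n**2


for k in (2, 3, 4, 5, 10):
    dp, star = dp_msa_operations(k, 150), star_operations(k, 150)
    print(f"k={k:2d}  DP ~ {dp:.2e}  star ~ {star:.2e}  ratio ~ {dp / star:.1e}")
```

The star-alignment count grows only quadratically in k, while the exhaustive count gains roughly another factor of 2n for every added sequence.
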

CLUSTAL: Overview
Pairwise Alignments → Distance Matrix → Guide Tree → Progressive Alignment
  • Compute pairwise alignments (DP)
  • Convert similarities into distances
    • Distance between a pair = # of mismatched positions in alignment, divided by total # of matched positions
  • Build guide tree from distances by Neighbor Joining
  • Align with respect to guide tree
[Figure: all pairwise alignments of sequences 1-4 (1+2, 1+3, 1+4, 2+3, 2+4, 3+4), the resulting distance matrix, the guide tree, and the progressive alignment]
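
The similarity-to-distance conversion can be sketched as below. The slide's wording leaves the denominator open to interpretation; this sketch assumes "matched positions" means columns in which neither sequence has a gap. The function names are illustrative, and the aligned pairs would normally come from the pairwise DP step.

```python
def alignment_distance(aln_a, aln_b):
    """Mismatches divided by the number of gap-free aligned columns."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not aligned:
        raise ValueError("no gap-free columns to compare")
    mismatches = sum(1 for a, b in aligned if a != b)
    return mismatches / len(aligned)


def distance_matrix(pairwise):
    """Build a symmetric nested-dict distance matrix.

    `pairwise` maps (i, j) to a tuple (aligned_i, aligned_j).
    """
    dist = {}
    for (i, j), (aln_i, aln_j) in pairwise.items():
        d = alignment_distance(aln_i, aln_j)
        dist.setdefault(i, {})[j] = d
        dist.setdefault(j, {})[i] = d
    return dist


print(alignment_distance("MKE", "MPE"))    # one mismatch in three gap-free columns
print(alignment_distance("M-KE", "MSKE"))  # no mismatches outside the gap column
```

A matrix built this way is the input to the Neighbor Joining step that produces the guide tree.
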

CLUSTAL: Example
[Figure: worked example with sequences 1-5]

One "small" problem? Finding the Guide Tree
Goal: Given k sequences and their pairwise distances, find a tree such that all distances correspond to path lengths between leaves
Problem: Such a tree might not exist!
[Figure: distance matrix and guide tree for sequences 1-5]

CLUSTAL W Tree
Tree calculated from an alignment of >1100 ring finger domains, using ClustalW 1.83

Algorithms & Software for MSA? #2
√ Exhaustive Methods
  • Multidimensional dynamic programming (DP)
  • Divide-and-Conquer Alignment (DCA) - "semi-exhaustive"; web-based version available - see textbook
Full DP Optimal Global Alignment? Prohibitive in both time & space requirements for more than 10 sequences!!
Heuristic Methods
  • √ Progressive (Star Alignment, Clustal)
  • Iterative
  • Block-based

Algorithms & Software for MSA? #3 (will NOT be covered on Exam 1)
Heuristic Methods - continued
  • Progressive alignments (Star Alignment, Clustal)
    • Others: T-Coffee, DbClustal - see text: can be better than Clustal
    • Match closely-related sequences first using a guide tree
  • Partial order alignments (POA)
    • Doesn't rely on guide tree; adds sequences in order given
  • PRALINE
    • Preprocesses input sequences by building profiles for each
  • Iterative methods
    • Idea: optimal solution can be found by repeatedly modifying existing suboptimal solutions (eg: PRRN)
  • Block-based Alignment
    • Multiple re-building attempts to find best alignment (eg: DIALIGN2 & Match-Box)
  • Local alignments
    • Profiles, Blocks, Patterns - more on these soon!

Chp 6 - Profiles & Hidden Markov Models
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 6 Profiles & HMMs
  • √ Position Specific Scoring Matrices (PSSMs)
  • √ PSI-BLAST
First, review above briefly, then:
  • Profiles
  • Markov Models & Hidden Markov Models

PSI-BLAST (Covered in Lecture 12, so will be covered on Exam 1)
  • Position Specific Iterated BLAST
  • Intuition: substitution matrices should be "sensitive" to protein context
    • e.g., larger penalty for Ala→Gly substitution if in a helix rather than in a loop
  • Basic idea:
    • Use BLAST with high stringency to generate a set of closely related sequences
    • Align those sequences to create a new substitution matrix for each position
    • Use this matrix (iteratively) to find additional sequences

PSI-BLAST Pseudocode
(PSSM = Position-Specific Scoring Matrix)
  Convert query to PSSM (or a Profile)
  do {
      BLAST database with PSSM
      Stop if no new homologs are found
      Add new homologs to PSSM
  }
  Print current set of homologs
Slide callout: "This step requires a user-defined threshold" (in PSI-BLAST, an E-value cutoff decides which hits are added to the PSSM)
Note: Xiong textbook distinguishes between PSSMs (which have no gaps) & Profiles (which can include gaps). Thus, based on these definitions, PSI-BLAST uses a Profile to iteratively add new homologs; other authors refer to the pattern used by PSI-BLAST as a PSSM.

What is a PSSM? Position-Specific Scoring Matrix (I added more text to this slide)
A PSSM is:
  • a representation of a motif
  • an n by m matrix, where n is size of alphabet & m is length of sequence
  • a matrix of scores in which entry at (i, j) is the score assigned by the PSSM to letter i at the jth position
[Figure: PSSM for an 8-residue sequence over the 20-letter alphabet; “K” at position 3 gets a score of 2]
Xiong: PSSM = table that contains probability information re: residues at each position of an ungapped MSA
Also sometimes called: Position Weight Matrix (PWM)
Note: Assumes positions are independent

Assigning a "Match" Score with a PSSM
PSSM assigns sequence NMFWAFGH a score of:
0 + (-2) + (-3) + (-2) + (-1) + 6 + 6 + 8 = 12
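
Scoring a sequence with a PSSM is a per-position lookup followed by a sum, which works because positions are assumed independent. The sketch below stores the PSSM as one dictionary per position. Only the eight entries used by the slide's example are filled in; the rest of the slide's matrix is not reproduced here, so this `PSSM` is a partial, illustrative stand-in.

```python
# One dict per position; only the entries needed for NMFWAFGH are included
# (assumption: the remaining 19 scores per position are omitted for brevity).
PSSM = [
    {"N": 0},
    {"M": -2},
    {"F": -3},
    {"W": -2},
    {"A": -1},
    {"F": 6},
    {"G": 6},
    {"H": 8},
]


def pssm_score(pssm, sequence):
    """Sum the position-specific scores of each residue in `sequence`."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence length must equal the number of PSSM positions")
    return sum(column[residue] for column, residue in zip(pssm, sequence))


print(pssm_score(PSSM, "NMFWAFGH"))  # 12
```
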

Creating a PSSM from 1 Sequence
  • Start from a query sequence of length L (example: RNRGQFGH)
  • For each position, take the scores for the query residue (e.g., R) from the BLOSUM62 matrix (20 by 20)
  • Result: a 20 by L matrix

Creating a PSSM from Multiple Sequences
  1. Discard columns that contain gaps in query sequence
  2. Compute relative sequence weights
  3. Compute PSSM entries, taking into account:
    • Observed residues in column
    • Sequence weights
    • Substitution matrix

1 - Discard Columns with Gaps in Query
Before:
  EEFG----SVDGLVNNA
  QKYG----RLDVMINNA
  RRLG----TLNVLVNNA
  GGIG----PVD-LVNNA
  KALG----GFNVIVNNA
  ARFG----KID-LIPNA
  FEPEGPEKGMWGLVNNA
  AQLK----TVDVLINGA
After:
  EEFGSVDGLVNNA
  QKYGRLDVMINNA
  RRLGTLNVLVNNA
  GGIGPVD-LVNNA
  KALGGFNVIVNNA
  ARFGKID-LIPNA
  FEPEGMWGLVNNA
  AQLKTVDVLINGA

2 - Compute Sequence Weights (Info re: weights was added to this slide)
  • Smaller weights are assigned to redundant sequences
  • Larger weights are assigned to unique sequences

  Sequence        Weight
  EEFGSVDGLVNNA   1.2
  QKYGRLDVMINNA   1.2
  RRLGTLNVLVNNA   0.8
  GGIGPVDLLVNNA   0.8
  KALGGFNVIVNNA   1.1
  ARFGKIDTLIPNA   0.9
  FEPEGMWGLVNNA   1.1
  AQLKTVDVLINGA   1.3

  • How are weights determined? Based on branch lengths in the guide tree; the value for each sequence is then used to multiply raw alignment scores
  • Goal of weighting? To decrease matching scores of frequent characters in the MSA & increase scores of infrequent characters

3 - Compute PSSM Entries (simplified version) (This slide was modified)
PSSM column = observed residues / background frequencies
Observed residues (one alignment column): E Q R G K A F A
Background frequencies (usually derived from large sequence database):
  A 0.085   C 0.019   D 0.054   E 0.065   F 0.040
  G 0.072   H 0.023   I 0.058   K 0.056   L 0.096
  M 0.024   P 0.053   Q 0.042   R 0.054   S 0.072
  T 0.063   V 0.073   W 0.016   Y 0.034

PSSM Entries = Log-Odds Scores (This slide was modified)
  • Estimate probability of observing each residue: $P(A \mid M)$, the probability of A given M, where M is the PSSM (foreground) model; based on the observed frequency of residue “A”
  • Divide by background probability of observing each residue: $P(A \mid B)$, the probability of A given B, where B is the background model
  • Take log so that scores can be added (rather than multiplied)
Score for residue A: $\log \dfrac{P(A \mid M)}{P(A \mid B)}$
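
The log-odds recipe can be sketched for a single column as follows. This is a simplified illustration: it uses the background frequencies and the observed column (E Q R G K A F A) from the preceding slide, ignores sequence weights, uses log base 2, and adds a background-proportional pseudocount so that unobserved residues get a finite score. The base, the pseudocount scheme and the function name are assumptions, not taken from the slides.

```python
import math
from collections import Counter

# Background frequencies as listed on the slide.
BACKGROUND = {
    "A": 0.085, "C": 0.019, "D": 0.054, "E": 0.065, "F": 0.040,
    "G": 0.072, "H": 0.023, "I": 0.058, "K": 0.056, "L": 0.096,
    "M": 0.024, "P": 0.053, "Q": 0.042, "R": 0.054, "S": 0.072,
    "T": 0.063, "V": 0.073, "W": 0.016, "Y": 0.034,
}


def log_odds_column(observed, background=BACKGROUND, pseudocount=1.0):
    """Return {residue: log2(P(residue | M) / P(residue | B))} for one column."""
    counts = Counter(observed)
    total = len(observed) + pseudocount
    scores = {}
    for residue, bg in background.items():
        # Foreground estimate: observed count plus a pseudocount share
        # proportional to the background frequency (assumed scheme).
        p_model = (counts[residue] + pseudocount * bg) / total
        scores[residue] = math.log2(p_model / bg)
    return scores


scores = log_odds_column("EQRGKAFA")
for residue in "AEW":
    print(residue, round(scores[residue], 2))
```

Residues seen more often than the background predicts get positive scores; residues never seen in the column get negative ones.
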

Why (not) PSI-BLAST?
  • PSI-BLAST weights sequences according to observed diversity specific to the family under investigation
  • Advantage: If the sequences used to construct PSSMs are all homologous, sensitivity for a given level of specificity improves significantly
  • Disadvantage: If any non-homologous sequences are included in PSSMs, the PSSMs become “corrupted” and "pull in" additional non-homologous sequences, resulting in false positive hits

How to Use PSI-BLAST Effectively
  • Set initial thresholds high
  • Inspect each iteration's result for suspicious sequences (When in doubt, leave it out!)
  • Do several iterations (~5), or until no new sequences are found
  • Make initial search very broad
    • First, use NR (large, inclusive database) with up to 5 iterations to set PSSM
    • Then use that PSSM to search in a more restricted domain, if possible
  • Be particularly cautious about matches to sequences with highly biased amino acid content

Summary: DP, BLAST & PSI-BLAST
  • Dynamic programming is O(NM) for pairwise alignment
  • BLAST is O(M)
    • BLAST produces an index of words in the query sequence that allows fast matching to the database
    • At NCBI, target databases are also pre-indexed to indicate positions in all database sequences that match each possible search word above some score threshold
  • PSI-BLAST iterates BLAST, adding new homologs at each iteration

Applications of MSA
  • Building phylogenetic trees
  • Finding conserved patterns:
    • Regulatory motifs (TF binding sites)
    • Splice sites
    • Protein domains
  • Identifying and characterizing protein families
    • Find out which protein domains have same function
  • Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms)
  • DNA fragment assembly (in genomic sequencing)

Application: Discover Conserved Patterns
Is there a conserved cis-acting regulatory sequence?
Rationale: if sequences are homologous (derived from a common ancestor), they may be structurally/functionally equivalent
[Figure: sequence logo of the TATA box, a transcriptional promoter element]

Sequence Motifs (Patterns)
Other types of representations?
  • √ Consensus Sequence
  • √ PSSM - Position-Specific Scoring Matrix
  • √ Sequence Logo - "enhanced" consensus sequence, in which symbol size reflects information entropy
    • Information entropy??? "In information theory, the Shannon entropy or information entropy is a measure of the [decrease in] uncertainty associated with a random variable. Entropy quantifies information in a piece of data." - Wikipedia
    • Check out this fun website: Tom Schneider, NCIF
      http://www.ccrnp.ncifcrf.gov/~toms/glossary.html#sequence_logo
  • Profile
  • HMM - Hidden Markov Model
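
Shannon entropy for one alignment column can be computed directly from the residue frequencies, H = -Σ p log2 p. The sketch below is illustrative only: it treats the observed frequencies as probabilities (no small-sample correction, no sequence weights) and the function name is hypothetical. A perfectly conserved column has entropy 0, and a more variable column has higher entropy.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residue distribution in one column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(column_entropy("AAAAAAAA"))  # fully conserved column
print(column_entropy("EQRGKAFA"))  # variable column
print(column_entropy("TTTTTATT"))  # mostly conserved column
```
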