
BCB 444/544

BCB 444/544. Lecture 12 Multiple Sequence Alignment (MSA) PSSMs & Psi-BLAST #12_Sept17. Required Reading ( before lecture). √ Mon Sept 17 - Lecture 12 Position Specific Scoring Matrices & PSI-BLAST Chp 6 - pp 75-78 (but not HMMs)


Presentation Transcript


  1. BCB 444/544 Lecture 12 Multiple Sequence Alignment (MSA) PSSMs & Psi-BLAST #12_Sept17 BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  2. Required Reading (before lecture) √ Mon Sept 17 - Lecture 12 Position Specific Scoring Matrices & PSI-BLAST • Chp 6 - pp 75-78 (but not HMMs) Wed Sept 19 - Lecture 13 (not covered on Exam 1) Hidden Markov Models • Chp 6 - pp 79-84 • Eddy: What is a hidden Markov Model? 2004 Nature Biotechnol 22:1315 http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html Fri Sept 21 - EXAM 1

  3. Assignments & Announcements Sun Sept 16 - Study Guide for Exam 1 was posted Mon Sept 17 - Answers to HW#2 will be posted ~ Noon Thu Sept 20 - Lab = Optional Review Session for Exam Fri Sept 21 - Exam 1 - Will cover: • Lectures 2-12 (thru Mon Sept 17) • Labs 1-4 • HW2 • All assigned reading: Chps 2-6 (but not HMMs) Eddy: What is Dynamic Programming?

  4. Chp 5- Multiple Sequence Alignment SECTION II SEQUENCE ALIGNMENT Xiong: Chp 5 Multiple Sequence Alignment • Scoring Function • Exhaustive Algorithms • Heuristic Algorithms • Practical Issues BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  5. Multiple Sequence Alignments Credits for slides: Caragea & Brown, 2007; Fernandez-Baca, Heber & Hunter

  6. Overview • What is a multiple sequence alignment (MSA)? • Where/why do we need MSA? • What is a good MSA? • Algorithms to compute a MSA BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  7. Multiple Sequence Alignment • Generalize pairwise alignment of sequences to include > 2 homologous sequences • Analyzing more than 2 sequences gives us much more information: • Which amino acids are required? Correlated? • Evolutionary/phylogenetic relationships • Similar to PSI-BLAST idea (not yet covered in lecture): use a set of homologous sequences to provide more "sensitivity" BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  8. Definition: MSA Given a set of sequences, a multiple sequence alignment is an assignment of gap characters such that • the resulting sequences have the same length • no column contains only gaps. Examples: • ATT-GC / ATTTGC / ATTTG: NO (rows differ in length) • AT-TGC / ATTTGC / ATTTG-: YES • AT-T-GC / ATTT-GC / ATTT-G-: NO (column 5 contains only gaps)
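
The two conditions in the definition can be checked mechanically. The following is a minimal sketch, assuming gaps are written as "-" and each aligned sequence is a plain string; the function name `is_valid_msa` is made up for illustration.

```python
def is_valid_msa(rows):
    """Return True if rows satisfy both conditions of the MSA definition."""
    if not rows:
        return False
    # Condition 1: all gapped sequences have the same length.
    if len({len(r) for r in rows}) != 1:
        return False
    # Condition 2: no column consists only of gap characters.
    return all(any(ch != "-" for ch in col) for col in zip(*rows))


print(is_valid_msa(["ATT-GC", "ATTTGC", "ATTTG"]))       # rows differ in length
print(is_valid_msa(["AT-TGC", "ATTTGC", "ATTTG-"]))      # satisfies both conditions
print(is_valid_msa(["AT-T-GC", "ATTT-GC", "ATTT-G-"]))   # column 5 is all gaps
```

The three calls correspond to the three candidate alignments on the slide.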

  9. Displaying MSAs: using CLUSTAL W RED: AVFPMILW (small) BLUE: DE (acidic, negative chg) MAGENTA: RHK (basic, positive chg) GREEN: STYHCNGQ (hydroxyl + amine + basic) * entirely conserved column : all residues have ~ same size AND hydropathy . all residues have ~ same size OR hydropathy

  10. What is a Consensus Sequence? A single sequence that represents the most common residue of each column in an MSA. Example: FGGHL-GF / F-GHLPGF / FGGHP-FG give the consensus FGGHL-GF. Steiner consensus sequence: given sequences s_1,…, s_k, find a sequence s* that maximizes Σ_i S(s*, s_i)
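
A column-wise majority vote is the simplest way to obtain a consensus of this kind. The sketch below assumes gaps count as ordinary characters and that ties go to the character seen first in the column. It is not the Steiner consensus, which optimizes a score over all candidate sequences.

```python
from collections import Counter


def consensus(rows):
    """Most common character in each column (first-seen character wins ties)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


msa = ["FGGHL-GF", "F-GHLPGF", "FGGHP-FG"]
print(consensus(msa))
```

For the three aligned sequences on the slide, this reproduces the consensus shown there.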

  11. Applications of MSA • Building phylogenetic trees • Finding conserved patterns, e.g.: • Regulatory motifs (TF binding sites) • Splice sites • Protein domains • Identifying and characterizing protein families • Find out which protein domains have same function • Finding SNPs(single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms) • DNA fragment assembly (in genomic sequencing) BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  12. Application: Recover Phylogenetic Tree What was series of events that led to current species? NYLS NFLS NYLS BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  13. Application: Discover Conserved Patterns Is there a conserved cis-acting regulatory sequence? Rationale: if they are homologous (derived from a common ancestor), they may be structurally equivalent TATA box = transcriptional promoter element BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  14. Goal: Characterize Protein Families Which parts of globin sequences are most highly conserved? BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  15. Databases of Multiple Alignments • Pfam (Protein Domain Families data base) • Contains alignments and HMMs of protein families • InterPro • Integrates: Prosite, Prints, ProDom, Pfam, and SMART • BLOCKS • Segments of highly conserved multiple alignments • Hovergen(Homologous Vertebrate Genes Database) • COGs (Clusters of Orthologous Groups) • BaliBASE (Benchmark alignments database) BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  16. Scoring an Alignment NYLS NFLS NYLS Goal: Align homologous positions. But: Without knowledge of the phylogenetic tree, this is very hard (sometimes impossible) to achieve!

  17. Scoring an Alignment [Figure: example alignment with individual columns highlighted] In practice, simple scoring functions are used: usually, columns are scored independently, i.e. S(m) = G + Σ_i S(m_i), where G is the gap penalty and m_i is the ith column of alignment m

  18. Sum of Pairs (SP) Score [Figure: example alignment with column m_i highlighted] • SP = sum of scores of all possible pairs of sequences in an MSA, based on a particular scoring matrix • Compute for each column: S(m_i) = Σ_{k<l} s(m_i^k, m_i^l), where m_i^l is the residue of sequence l in column i and s is a PAM or BLOSUM score

  19. How Score Gaps in MSAs? Want to align gaps with each other over all sequences. A gap in a pairwise alignment that “matches” a gap in another pairwise alignment should cost less than introducing a totally new gap. • Possible that a new gap could be made to “match” an older one by adjusting older pairwise alignment • Change gap penalty near conserved domains of various kinds (e.g. secondary structure elements, hydrophobic regions) BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  20. Example: SP Score m = F-G / F-G / FYD (columns: F,F,F; -,-,Y; G,G,D) Gap penalty: -8, s(-,-) = 0, BLOSUM 60. S(m) = S(m1) + S(m2) + S(m3) = 3s(F,F) + 2s(-,Y) + s(-,-) + s(G,G) + 2s(G,D) = 15 - 16 + 0 + 4 - 6 = -3
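
The worked example can be reproduced with a few lines of code. This is a sketch only: the pair scores s(F,F) = 5, s(G,G) = 4 and s(G,D) = -3 are read off the slide's arithmetic rather than taken from a full substitution matrix, and the names `PAIR_SCORES`, `pair_score` and `sp_score` are made up for illustration.

```python
from itertools import combinations

GAP_PENALTY = -8  # residue aligned with a gap, as on the slide

# Pair scores inferred from the slide's arithmetic (assumed symmetric).
PAIR_SCORES = {("F", "F"): 5, ("G", "G"): 4, ("D", "G"): -3}


def pair_score(a, b):
    """Score one pair of characters taken from the same column."""
    if a == "-" and b == "-":
        return 0  # s(-,-) = 0
    if a == "-" or b == "-":
        return GAP_PENALTY
    return PAIR_SCORES[tuple(sorted((a, b)))]


def sp_score(rows):
    """Sum-of-pairs score: all sequence pairs, summed over all columns."""
    return sum(
        pair_score(a, b)
        for col in zip(*rows)
        for a, b in combinations(col, 2)
    )


print(sp_score(["F-G", "F-G", "FYD"]))  # -3, matching the slide
```

Columns are scored independently and then added, which is the simplification the earlier scoring slides describe.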

  21. Overcoming problems with SP scoring • Use weights to incorporate evolution in sum of pairs scoring: • Some pairwise alignments are more important than others • e.g., more important to have a good alignment between mouse & human sequences than between mouse & bird • Assign different weights to different pairwise alignments • Weight decreases with evolutionary distance BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  22. How Compute a Multiple Alignment? Algorithms for MSA: • Multidimensional dynamic programming • Optimal global alignment (time & space intensive!!!) • Progressive alignments (Star alignment, ClustalW) • Match closely-related sequences first using a guide tree • Iterative methods • Combined local alignments (Dialign) • Multiple re-building attempts to find best alignment • Partial order alignment (POA) • Local alignments • Profiles, Blocks, Patterns BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  23. Dynamic Programming for MSA • As with pairwise alignments, multiple sequence alignments can be computed by dynamic programming [Figure: 2D matrix F for two sequences vs. 3D matrix for three sequences]

  24. Generalized Needleman-Wunsch Algorithm 3D Given 3 sequences x, y, and z: Main iteration loop: F(i,j,k) = max ( F(i-1, j-1, k-1) + S(xi, yj, zk), F(i-1, j-1, k ) + S(xi, yj, - ), F(i-1, j , k-1) + S(xi, -, zk), F(i-1, j , k ) + S(xi, -, - ), F(i , j-1, k-1) + S( -, yj, zk), F(i , j-1, k ) + S( -, yj, -), F(i , j , k-1) + S( -, -, zk) ) BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST
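
The seven-way recurrence can be written compactly by enumerating every non-zero 0/1 step vector. The following is a minimal sketch that returns only the optimal score (no traceback). The match/mismatch/gap values and the sum-of-pairs column score are assumptions chosen for illustration, not values from the lecture.

```python
from itertools import combinations, product


def column_score(col, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one alignment column, with s(-,-) = 0."""
    total = 0
    for a, b in combinations(col, 2):
        if a == "-" and b == "-":
            continue
        total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total


def align3_score(x, y, z):
    """Optimal global score for three sequences via the 3D recurrence."""
    seqs = (x, y, z)
    lengths = [len(s) for s in seqs]
    # Seven predecessor moves: every non-zero 0/1 step vector (di, dj, dk).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    # Lexicographic order guarantees predecessors are filled in first.
    for cell in product(*(range(n + 1) for n in lengths)):
        if cell == (0, 0, 0):
            continue
        best = None
        for m in moves:
            prev = tuple(c - d for c, d in zip(cell, m))
            if min(prev) < 0:
                continue
            # A step of 1 consumes a residue; a step of 0 emits a gap.
            col = tuple(s[c - 1] if d else "-" for s, c, d in zip(seqs, cell, m))
            cand = F[prev] + column_score(col)
            if best is None or cand > best:
                best = cand
        F[cell] = best
    return F[tuple(lengths)]


print(align3_score("AT", "AT", "AT"))
```

The dictionary `F` holds one entry per cell of the 3D matrix, so memory grows with the product of the three sequence lengths.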

  25. What Happens to Computational Complexity? Given k sequences of length n: • Space for matrix: O(n^k) • Neighbors/cell: 2^k - 1 • Time to compute SP score: O(k^2) • Overall runtime: O(k^2 2^k n^k) • Ouch!!!

  26. What's so bad about those exponents? An example: Running Time of DP • Overall runtime: O(k^2 2^k n^k) Sequences: globins (~150 aa) But: There are fast heuristics.
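
To get a feel for the growth, the operation count k^2 · 2^k · n^k can be evaluated directly for globin-sized sequences. This is a rough sketch that ignores constant factors; `msa_dp_cost` is a made-up helper name, and the printed numbers are operation counts, not measured running times.

```python
def msa_dp_cost(k, n):
    """Rough operation count k^2 * 2^k * n^k for exact MSA by dynamic programming."""
    return k**2 * 2**k * n**k


# Sequences of about 150 residues, as in the globin example.
for k in range(2, 7):
    print(k, f"{msa_dp_cost(k, 150):.2e}")
```

Each additional sequence multiplies the count by roughly 2n, which is why exact dynamic programming is limited to a handful of sequences.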

  27. Progressive Alignment Multiple Alignment by adding sequences 1 2 3 4 Heuristic procedure: • Align most similar sequences first • Add sequences progressively Often: use guide tree to determine order of alignments Examples: Star alignment ClustalW BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  28. Guide Tree Binary tree • Leaves correspond to sequences • Internal nodes represent alignments • Root corresponds to final MSA [Figure: guide tree with leaves ATC, ATG, TCG, TCC; internal nodes ATC/ATG and TCG/TCC; root alignment -TCG / -TCC / ATC- / ATG-]

  29. Star Alignment - will skip for now, come back to this on Wed. Star alignment will NOT be covered on Exam 1

  30. Chp6 - Profiles & Hidden Markov Models SECTION II SEQUENCE ALIGNMENT Xiong: Chp 6 Profiles & HMMs • Position Specific Scoring Matrices (PSSMs) • PSI-BLAST • Profiles • Markov Model & Hidden Markov Model BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  31. PSI Blast • Position Specific Iterated BLAST • Intuition: substitution matrices should be specific to a particular site: penalize alanine→glycine more in a helix • Basic idea: • Use BLAST with high stringency to get a set of closely related sequences • Align those sequences to create a new substitution matrix for each position • Then use that matrix (iteratively) to find additional sequences BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  32. Psi-BLAST Query PSSM Multiple alignment Sequence database BLAST BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  33. PSI-BLAST pseudocode Convert query to PSSM do { BLAST database with PSSM Stop if no new homologs are found Add new homologs to PSSM } Print current set of homologs BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  34. PSI-BLAST pseudocode Position-specific scoring matrix Convert query to PSSM do { BLAST database with PSSM Stop if no new homologs are found Add new homologs to PSSM } Print current set of homologs BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  35. PSI-BLAST pseudocode Convert query to PSSM do { BLAST database with PSSM Stop if no new homologs are found Add new homologs to PSSM } Print current set of homologs This step requires a user-defined threshold BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST
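
The pseudocode translates almost line for line into a loop. In the sketch below, `search` and `build_pssm` are hypothetical callables supplied by the caller; they stand in for the BLAST search (including the user-defined inclusion threshold) and the PSSM construction, and do not correspond to any real BLAST API. The toy data at the bottom only exercises the control flow.

```python
def psi_blast(query, search, build_pssm, max_iterations=5):
    """Iterate search -> rebuild PSSM until no new homologs appear."""
    homologs = set()
    pssm = build_pssm(query, homologs)        # Convert query to PSSM
    for _ in range(max_iterations):
        hits = set(search(pssm))              # BLAST database with PSSM
        new = hits - homologs
        if not new:                           # Stop if no new homologs are found
            break
        homologs |= new                       # Add new homologs to PSSM
        pssm = build_pssm(query, homologs)
    return homologs                           # Current set of homologs


# Toy stand-ins: the "PSSM" is just the set of sequences seen so far, and a
# search returns everything listed as similar to any member of that set.
similar = {"q": {"a"}, "a": {"b"}, "b": {"b"}}


def toy_build(query, homologs):
    return {query} | set(homologs)


def toy_search(pssm):
    return set().union(*(similar.get(s, set()) for s in pssm))


print(sorted(psi_blast("q", toy_search, toy_build)))
```

The iteration cap is an added safeguard so the loop terminates even if every round keeps finding new hits.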

  36. Position-specific scoring matrix - PSSM • A PSSM is an n by m matrix, where n is the size of alphabet, and m is length of sequence • Entry at (i, j) is score assigned by PSSM to letter i at the jth position BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  37. Position-specific scoring matrix • A PSSM is an n by m matrix, where n is the size of the alphabet, and m is the length of the sequence. • The entry at (i, j) is the score assigned by the PSSM to letter i at the jth position. “K” at position 3 gets a score of 2 BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  38. Position-specific scoring matrix This PSSM assigns sequence NMFWAFGH a score of: 0 + -2 + -3 + -2 + -1 + 6 + 6 + 8 = 12 BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  39. Position-specific scoring matrix • What score does this PSSM assign to KRPGHFLA? 2 + 0 + -2 + 6 + 0 + 6 + -4 + -2 = 6 BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST
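
Scoring a sequence against a PSSM is a lookup-and-sum over positions. The matrix from the slides is not reproduced in this transcript, so the sketch below uses a hypothetical 3-position PSSM over a reduced alphabet; the values in `toy_pssm` are made up.

```python
def pssm_score(pssm, seq):
    """Sum the position-specific scores pssm[letter][j] along seq."""
    if any(len(row) != len(seq) for row in pssm.values()):
        raise ValueError("sequence length must equal the PSSM width")
    return sum(pssm[letter][j] for j, letter in enumerate(seq))


# Hypothetical PSSM: one row per letter, one column per position.
toy_pssm = {
    "K": [2, -1, 0],
    "R": [1, 0, -2],
    "G": [-3, 6, 4],
}

print(pssm_score(toy_pssm, "KGG"))  # 2 + 6 + 4 = 12
```

Storing one row per letter mirrors the "n by m" layout described above, with n the alphabet size and m the sequence length.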

  40. Position-specific iterated BLAST ? Query PSSM Multiple alignment Sequence database BLAST BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  41. Creating a PSSM from 1 sequence: for each residue of the query (e.g. R in RNRGQFGH, length L), take the corresponding row of the 20 by 20 BLOSUM62 matrix; together these rows give a 20 by L PSSM
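
A sketch of this construction: column j of the PSSM is simply the substitution-matrix row for the jth query residue. To stay self-contained, the example uses a three-letter alphabet with a small illustrative score table; treat the values in `toy_subst` as placeholders rather than an authoritative copy of BLOSUM62. The query "RNRG" is a shortened stand-in for the slide's example.

```python
def pssm_from_query(query, subst, alphabet):
    """Build an alphabet-by-L PSSM from a single query sequence."""
    # Keys of `subst` are frozensets, so s(a, b) and s(b, a) share one entry.
    return {a: [subst[frozenset((a, q))] for q in query] for a in alphabet}


# Illustrative symmetric scores over a reduced alphabet (placeholder values).
toy_subst = {
    frozenset("R"): 5, frozenset("N"): 6, frozenset("G"): 6,
    frozenset("RN"): 0, frozenset("RG"): -2, frozenset("NG"): 0,
}

pssm = pssm_from_query("RNRG", toy_subst, "RNG")
for letter, row in pssm.items():
    print(letter, row)
```

With the full 20-letter alphabet and a length-L query, the same function would yield the 20 by L matrix described above.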

  42. Position-specific iterated BLAST ? Query PSSM Multiple alignment Sequence database BLAST BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  43. Creating a PSSM from multiple sequences • Discard columns that contain gaps in query • For each column C • Compute relative sequence weights • Compute PSSM entries, taking into account • Observed residues in this column • Sequence weights • Substitution matrix BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  44. Discard query gap columns. Before: EEFG----SVDGLVNNA QKYG----RLDVMINNA RRLG----TLNVLVNNA GGIG----PVD-LVNNA KALG----GFNVIVNNA ARFG----KID-LIPNA FEPEGPEKGMWGLVNNA AQLK----TVDVLINGA After: EEFGSVDGLVNNA QKYGRLDVMINNA RRLGTLNVLVNNA GGIGPVD-LVNNA KALGGFNVIVNNA ARFGKID-LIPNA FEPEGMWGLVNNA AQLKTVDVLINGA
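
Dropping the query-gap columns is a simple column filter. The sketch below assumes the query is the first row of the alignment (the transcript does not say which row is the query) and uses three of the slide's sequences.

```python
def drop_query_gap_columns(rows, query_index=0):
    """Remove every column in which the query row has a gap."""
    query = rows[query_index]
    keep = [j for j, ch in enumerate(query) if ch != "-"]
    return ["".join(row[j] for j in keep) for row in rows]


aligned = [
    "EEFG----SVDGLVNNA",   # assumed query
    "FEPEGPEKGMWGLVNNA",   # has an insertion relative to the query
    "GGIG----PVD-LVNNA",
]
for row in drop_query_gap_columns(aligned):
    print(row)
```

Gaps in the other sequences are kept; only the columns where the query itself has a gap are removed, so the PSSM ends up with one column per query residue.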

  45. Compute sequence weights • Low weights are assigned to redundant sequences • High weights are assigned to unique sequences. Weights: EEFGSVDGLVNNA 1.2; QKYGRLDVMINNA 1.2; RRLGTLNVLVNNA 0.8; GGIGPVDLLVNNA 0.8; KALGGFNVIVNNA 1.1; ARFGKIDTLIPNA 0.9; FEPEGMWGLVNNA 1.1; AQLKTVDVLINGA 1.3

  46. Compute PSSM entries (simplified version) Observed residues in the column (E Q R G K A F A) + background frequencies = PSSM column. Background frequencies: A 0.085, C 0.019, D 0.054, E 0.065, F 0.040, G 0.072, H 0.023, I 0.058, K 0.056, L 0.096, M 0.024, P 0.053, Q 0.042, R 0.054, S 0.072, T 0.063, V 0.073, W 0.016, Y 0.034. These are usually derived from a large sequence database

  47. Log-odds score • Estimate the probability of observing each residue • Divide by the background probability of observing the same residue • Take log so scores will be additive BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST

  48. Log-odds score Residue was generated by foreground model (i.e., the PSSM) Residue “A” is observed • Estimate the probability of observing each residue • Divide by the background probability of observing the same residue • Take log so scores will be additive Residue was generated by the background model (i.e., randomly selected) BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST
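
The three steps (estimate, divide by background, take the log) can be sketched for a single column as follows. The smoothing scheme here, weighted counts plus background-proportional pseudocounts, is an assumed simplification and not necessarily what PSI-BLAST itself does. The reduced alphabet and its background frequencies are made up so that they sum to 1.

```python
import math


def log_odds_column(residues, weights, background, pseudocount=1.0):
    """Log2-odds score of each letter for one alignment column."""
    counts = {a: 0.0 for a in background}
    for r, w in zip(residues, weights):
        counts[r] += w                      # weighted observed counts
    total = sum(weights) + pseudocount
    scores = {}
    for a, bg in background.items():
        foreground = (counts[a] + pseudocount * bg) / total  # estimated probability
        scores[a] = math.log2(foreground / bg)               # divide by background, take log
    return scores


# Hypothetical reduced alphabet with background frequencies summing to 1.
toy_background = {"A": 0.4, "E": 0.3, "K": 0.3}
scores = log_odds_column(["E", "E", "K", "A"], [1, 1, 1, 1], toy_background)
for letter, s in scores.items():
    print(letter, round(s, 3))
```

A positive score means the letter is seen more often in this column than the background predicts; a negative score means it is seen less often. Because the scores are logarithms, summing them along a sequence corresponds to multiplying the per-position odds.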

  49. Why (not) PSI-BLAST • Weights sequences according to observed diversity specific to the family of interest • Advantage: If the sequences used to construct Position Specific Scoring Matrices (PSSMs) are all homologous, sensitivity at a given specificity improves significantly • Disadvantage: If any non-homologous sequences are included in the PSSMs, the PSSMs are “corrupted.” They then "pull in" additional non-homologous sequences and become worse than generic substitution matrices

  50. How to use PSI BLAST • Set initial thresholds high • Inspect each iteration's result for suspicious sequences • Do several iterations (~5), or until no new sequences are found • Even if only looking for a small set of sequences, make initial search very broad • First, use NR (large, inclusive database) with up to 5 iterations to set PSSM • Then use that PSSM to search in restricted domain BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST
