
BCB 444/544


weylin

Presentation Transcript


  1. BCB 444/544 Lecture 11 First BLAST vs FASTA Plus some Gene Jargon Multiple Sequence Alignment (MSA) #11_Sept14 BCB 444/544 F07 ISU Dobbs #11 - Multiple Sequence Alignment

  2. Required Reading (before lecture) √ Mon Sept 10 - for Lecture 9/10: BLAST variations; BLAST vs FASTA, SW • Chp 4 - pp 51-62 √ Wed Sept 12 - for Lecture 11 & Lab 4: Multiple Sequence Alignment (MSA) • Chp 5 - pp 63-74 Fri Sept 14 - for Lecture 12: Position Specific Scoring Matrices & Profiles • Chp 6 - pp 75-78 (but not HMMs) • Good Additional Resource re: Sequence Alignment? • Wikipedia: http://en.wikipedia.org/wiki/Sequence_alignment

  3. Assignments & Announcements - #1 Revised Grading Policy has been sent via email. Please review! √ Mon Sept 10 - Lab 3 Exercise due 5 PM, to: terrible@iastate.edu ? Thu Sept 13 - Graded Labs 2 & 3 will be returned at beginning of Lab 4 Fri Sept 14 - HW#2 due by 5 PM (106 MBB); Study Guide for Exam 1 will be posted by 5 PM

  4. Review: Gene Jargon #1 (for HW2, 1c) Exons = "protein-encoding" (or "kept" parts) of eukaryotic genes vs Introns = "intervening sequences" = segments of eukaryotic genes that "interrupt" exons • Introns are transcribed into pre-RNA • but are later removed by RNA processing • & do not appear in mature mRNA • so are not translated into protein

  5. Assignments & Announcements - #2 Mon Sept 17 - Answers to HW#2 will be posted by 5 PM Thu Sept 20 - Lab = Optional Review Session for Exam Fri Sept 21 - Exam 1 - Will cover: • Lectures 2-12 (thru Mon Sept 17) • Labs 1-4 • HW2 • All assigned reading: Chps 2-6 (but not HMMs); Eddy: What is Dynamic Programming

  6. Chp 4 - Database Similarity Searching SECTION II SEQUENCE ALIGNMENT Xiong: Chp 4 Database Similarity Searching • √ Unique Requirements of Database Searching • √ Heuristic Database Searching • √ Basic Local Alignment Search Tool (BLAST) • FASTA • Comparison of FASTA and BLAST • Database Searching with Smith-Waterman Method

  7. Why search a database? • Given a newly discovered gene, • Does it occur in other species? • Is its function known in another species? • Given a newly sequenced genome, which regions align with genomes of other organisms? • Identification of potential genes • Identification of other functional parts of chromosomes • Find members of a multigene family

  8. FASTA and BLAST • Both FASTA and BLAST are based on heuristics • Tradeoff: Sensitivity vs Speed • DP is slower, but more sensitive • FASTA • user defines value for k = word length • Slower, but more sensitive than BLAST at lower values of k (preferred for searches involving a very short query sequence) • BLAST family • Family of different algorithms optimized for particular types of queries, such as searching for distantly related sequence matches • BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy

  9. BLAST algorithms can generate both "global" and "local" alignments [Figure: global alignment vs. local alignment]

  10. BLAST - a Family of Programs: Different BLAST "flavors" • BLASTP - protein sequence query against protein DB • BLASTN - DNA/RNA seq query against DNA DB (GenBank) • BLASTX - 6-frame translated DNA seq query against protein DB • TBLASTN - protein query against 6-frame DNA translation • TBLASTX - 6-frame DNA query to 6-frame DNA translation • PSI-BLAST - protein "profile" query against protein DB • PHI-BLAST - protein pattern against protein DB • Newest: MEGA-BLAST - optimized for highly similar sequences Which tool should you use? http://www.ncbi.nlm.nih.gov/blast/producttable.shtml

  11. Detailed Steps in BLAST algorithm: (1) Remove low-complexity regions (LCRs) (2) Make a list (dictionary): all words of length 3 aa or 11 nt (3) Augment list to include similar words (4) Store list in a search tree (data structure) (5) Scan database for occurrences of words in search tree (6) Connect nearby occurrences (7) Extend matches (words) in both directions (8) Prune list of matches using a score threshold (9) Evaluate significance of each remaining match (10) Perform Smith-Waterman to get alignment

  12. 1: Filter low-complexity regions (LCRs) This slide has been changed! • Low complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology. • Low complexity sequences can yield false positives. • Screen them out of your query sequences! When appropriate! • K = computational complexity; varies from 0 (very low complexity) to 1 (high complexity): K = (1/L) x log_N( L! / Π n_i! ), where N = alphabet size (4 or 20), L = window length (usually 12), n_i = frequency (count) of ith letter in the window • e.g., for GGGG: L! = 4! = 4x3x2x1 = 24; nG = 4, nT = nA = nC = 0; Π n_i! = 4! x 0! x 0! x 0! = 24; K = 1/4 log_4(24/24) = 0 • For CGTA: K = 1/4 log_4(24/1) = 0.57
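The two worked values on this slide can be checked with a few lines of Python. This is only a sketch of the complexity measure as it is used in the examples (K = 1/L x log_N(L!/Π n_i!)); the function name `complexity` is a hypothetical helper, not part of any BLAST distribution.

```python
from collections import Counter
from math import factorial, log


def complexity(window: str, alphabet_size: int) -> float:
    """Return K = (1/L) * log_N(L! / prod(n_i!)) for one sequence window."""
    length = len(window)
    denominator = 1
    # Letters that do not occur contribute 0! = 1, so only observed letters matter.
    for count in Counter(window).values():
        denominator *= factorial(count)
    # The multinomial coefficient is always an integer, so // is exact.
    multinomial = factorial(length) // denominator
    return log(multinomial, alphabet_size) / length


print(round(complexity("GGGG", 4), 2))  # 0.0
print(round(complexity("CGTA", 4), 2))  # 0.57
```

A window made of a single repeated letter scores 0, and the score rises as the letters become more evenly mixed.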

  13. 2: List all words in query YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ YGG GGF GFM FMT MTS TSE SEK …
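Listing the query words is a sliding window over the sequence. A minimal sketch, assuming word length w = 3 for protein queries; `query_words` is a hypothetical helper name and positions are reported 1-based to match the slides.

```python
QUERY = "YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ"


def query_words(query: str, w: int = 3) -> list:
    """Return (1-based position, word) for every overlapping w-letter word."""
    return [(i + 1, query[i:i + w]) for i in range(len(query) - w + 1)]


words = query_words(QUERY)
print([word for _, word in words[:7]])
# ['YGG', 'GGF', 'GFM', 'FMT', 'MTS', 'TSE', 'SEK']
```

A query of length n yields n - w + 1 overlapping words.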

  14. 3: Augment word list YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ YGG GGF GFM FMT MTS TSE SEK … AAA AAB AAC … YYY 20^3 = 8000 possible matches

  15. 3: Augment word list BLOSUM62 scores: GGF vs AAA: 0 + 0 + (-2) = -2 (non-match); GGF vs GGY: 6 + 6 + 3 = 15 (match). A user-specified threshold, T, determines which 3-letter words are considered matches and non-matches

  16. 3: Augment word list YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ YGG GGF GFM FMT MTS TSE SEK … GGI GGL GGM GGF GGW GGY …

  17. 3: Augment word list Observation: Selecting only words with score > T greatly reduces the number of possible matches; otherwise, there are 20^3 = 8000 for 3-letter words from amino acid sequences!

  18. Example: Find all words that match EAM with a score greater than or equal to 11 (BLOSUM62):

        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
     A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
     R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
     N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
     D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
     C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
     Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
     E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
     G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
     H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
     I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
     L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
     K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
     M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
     F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
     P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
     S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
     T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
     W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
     Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
     V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

     EAM 5 + 4 + 5 = 14
     DAM 2 + 4 + 5 = 11
     QAM 2 + 4 + 5 = 11
     ESM 5 + 1 + 5 = 11
     EAL 5 + 4 + 2 = 11
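The EAM example can be reproduced by brute force over all 20^3 candidate words. The sketch below is illustrative only: it hard-codes just the three BLOSUM62 rows needed for E, A and M (copied from the slide's matrix), and `neighborhood` is a hypothetical helper name. Real BLAST builds this list more efficiently than by full enumeration.

```python
from itertools import product

AMINO = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for the residues of the query word EAM, in AMINO column order.
ROWS = {
    "E": [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2],
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    "M": [-1, -1, -2, -3, -1, 0, -2, -3, -2, 1, 2, -1, 5, 0, -2, -1, -1, -1, -1, 1],
}
SCORE = {residue: dict(zip(AMINO, row)) for residue, row in ROWS.items()}


def neighborhood(word: str, threshold: int) -> dict:
    """Map every candidate word scoring >= threshold against `word` to its score."""
    hits = {}
    for candidate in product(AMINO, repeat=len(word)):
        total = sum(SCORE[q][c] for q, c in zip(word, candidate))
        if total >= threshold:
            hits["".join(candidate)] = total
    return hits


print(sorted(neighborhood("EAM", 11).items()))
# [('DAM', 11), ('EAL', 11), ('EAM', 14), ('ESM', 11), ('QAM', 11)]
print(len(neighborhood("EAM", 10)))  # 12
```

With T = 11 only five of the 8000 candidates survive. Lowering the threshold to 10 gives twelve words, which is the size of the word list used in the later search-tree example.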

  19. 4: Store words in search tree. The augmented list of query words is stored in a search tree, which answers questions such as “Does this query contain GGF?” with “Yes, at position 2.”

  20. Search tree [Figure: root → G → G → {F, L, M, W, Y}; leaves GGF GGL GGM GGW GGY]

  21. Example: Put this word list into a search tree: DAM QAM EAM KAM ECM EGM ESM ETM EVM EAI EAL EAV [Figure: search tree with first-level branches D, Q, E, K; second level A (under D, Q, K) and A, C, G, S, T, V (under E); third level M, plus I, L, V under E-A]
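One common way to realize such a search tree is a trie built from nested dictionaries. The sketch below is an assumption about the data structure, not a description of BLAST's actual implementation; `build_trie`, `contains` and the `"$"` end marker are hypothetical names.

```python
WORDS = ["DAM", "QAM", "EAM", "KAM", "ECM", "EGM",
         "ESM", "ETM", "EVM", "EAI", "EAL", "EAV"]


def build_trie(words):
    """Build a nested-dict trie; the key '$' marks the end of a stored word."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$"] = word
    return root


def contains(trie, word):
    """Walk the trie letter by letter; True only if the whole word is stored."""
    node = trie
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return "$" in node


trie = build_trie(WORDS)
print(sorted(trie))            # ['D', 'E', 'K', 'Q']
print(contains(trie, "EAL"))   # True
print(contains(trie, "EAK"))   # False
```

Words that share a prefix (EAM, EAI, EAL, EAV) share a path, so a lookup costs one step per letter regardless of how many words are stored.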

  22. 5: Scan the database sequences [Figure: dot plot of word hits; axes: database sequence vs. query sequence]

  23. Example: Scan this "database" for occurrences of your words: MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEAIGVKYLQVQHGSNVNIHRLVEGNVKAMENA E A M P Q L S V D A M
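The scan can be sketched as a sliding window over the database string. For brevity this example looks words up in a Python set rather than a search tree (a simplification); `scan` is a hypothetical helper and positions are 0-based.

```python
DATABASE = ("MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEAIGVKYLQ"
            "VQHGSNVNIHRLVEGNVKAMENA")
WORDS = {"DAM", "QAM", "EAM", "KAM", "ECM", "EGM",
         "ESM", "ETM", "EVM", "EAI", "EAL", "EAV"}


def scan(database, words, w=3):
    """Return (0-based position, word) for each database window found in `words`."""
    return [(i, database[i:i + w])
            for i in range(len(database) - w + 1)
            if database[i:i + w] in words]


hits = scan(DATABASE, WORDS)
for position, word in hits:
    print(position, word)
```

On this toy database the scan reports three word hits: DAM, EAI and KAM.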

  24. 6: Connect nearby occurrences (diagonal matches in Gapped BLAST). Two dots are connected iff they are less than A letters apart & are on the same diagonal [Figure: dot plot; axes: database sequence vs. query sequence]
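Two hits lie on the same diagonal when the difference between their database and query positions is equal. The following sketch groups hits by that difference and connects neighbours closer than A letters; the hit coordinates and the value of A are made up for illustration, and `connect_hits` is a hypothetical helper.

```python
from collections import defaultdict


def connect_hits(hits, max_gap):
    """Pair up consecutive hits on the same diagonal that are < max_gap apart.

    Each hit is (query_position, database_position).
    """
    by_diagonal = defaultdict(list)
    for q_pos, d_pos in hits:
        by_diagonal[d_pos - q_pos].append(q_pos)

    pairs = []
    for diagonal, q_positions in by_diagonal.items():
        q_positions.sort()
        for first, second in zip(q_positions, q_positions[1:]):
            if second - first < max_gap:
                pairs.append(((first, first + diagonal),
                              (second, second + diagonal)))
    return pairs


# Hypothetical hits: three on diagonal 10, one on diagonal 18.
example_hits = [(0, 10), (5, 15), (30, 40), (2, 20)]
print(connect_hits(example_hits, max_gap=10))
# [((0, 10), (5, 15))]
```

Only the first two hits are connected: the third is on the same diagonal but too far away, and the fourth is on a different diagonal.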

  25. 7: Extend matches in both directions Scan DB

  26. 7: Extend matches, calculating score at each step. Query sequence: L P P Q G L L; Database sequence: M P P E G L L. BLOSUM62 scores for <word>: 7 2 6, word score = 15. Extending left and right (<--- --->): 2 7 7 2 6 4 4, HSP SCORE = 32 (High Scoring Pair) • Each match is extended to left & right until a negative BLOSUM62 score is encountered • Extension step typically accounts for > 90% of execution time
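The extension rule stated on this slide (keep extending until a negative score is met) can be sketched as below. Only the BLOSUM62 entries needed for this one example are included, and `extend` is a hypothetical helper. Note that this follows the slide's simplified rule; production BLAST uses a drop-off (X-drop) criterion instead.

```python
# BLOSUM62 entries needed for this example only (values from the matrix above).
PAIR_SCORE = {("L", "M"): 2, ("P", "P"): 7, ("Q", "E"): 2,
              ("G", "G"): 6, ("L", "L"): 4}


def score(a, b):
    """Symmetric lookup; raises KeyError for pairs outside the toy table."""
    return PAIR_SCORE[(a, b)] if (a, b) in PAIR_SCORE else PAIR_SCORE[(b, a)]


def extend(query, database, q_start, d_start, w):
    """Extend an ungapped word hit left and right while pair scores stay >= 0."""
    total = sum(score(query[q_start + i], database[d_start + i]) for i in range(w))
    left_q, left_d = q_start, d_start
    while left_q > 0 and left_d > 0:
        s = score(query[left_q - 1], database[left_d - 1])
        if s < 0:
            break
        total, left_q, left_d = total + s, left_q - 1, left_d - 1
    right_q, right_d = q_start + w, d_start + w
    while right_q < len(query) and right_d < len(database):
        s = score(query[right_q], database[right_d])
        if s < 0:
            break
        total, right_q, right_d = total + s, right_q + 1, right_d + 1
    return total, query[left_q:right_q], database[left_d:right_d]


print(extend("LPPQGLL", "MPPEGLL", q_start=2, d_start=2, w=3))
# (32, 'LPPQGLL', 'MPPEGLL')
```

The seed word PQG/PEG scores 15, and extending two residues in each direction adds 9 on the left and 8 on the right, giving the HSP score of 32.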

  27. 8: Prune matches • Discard all matches that score below defined threshold

  28. 9: Evaluate significance This slide has been changed! • BLAST uses an analytical statistical significance calculation RECALL: • E-value: E = m x n x P, where m = total number of residues in database, n = number of residues in query sequence, P = probability that an HSP is result of random chance; lower E-value = less likely to result from random chance, thus higher significance • Bit Score: S' = normalized score, to account for differences in size of database (m) & sequence length (n); note that bit score is linearly related to raw alignment score, so higher S' means alignment has higher significance: S' = (λ x S - ln K) / ln 2, where λ = Gumbel distribution constant, S = raw alignment score, K = constant associated with scoring matrix • For more details - see text & BLAST tutorial
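The bit-score formula is easy to evaluate numerically. In the sketch below the values of λ and K, the database size and the query length are illustrative assumptions, not values taken from the lecture. The E-value line uses the standard Karlin-Altschul relation E = m x n x 2^(-S'), which goes slightly beyond what the slide writes out.

```python
from math import exp, isclose, log

LAMBDA = 0.318   # assumed, illustrative value of the Gumbel constant
K = 0.134        # assumed, illustrative value of the matrix constant


def bit_score(raw_score, lam=LAMBDA, k=K):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - log(k)) / log(2)


def e_value(bits, m, n):
    """E = m * n * 2**(-S'): expected number of chance HSPs with at least this score."""
    return m * n * 2.0 ** (-bits)


m, n = 1_000_000, 31          # hypothetical database and query sizes (residues)
for raw in (32, 40, 60):
    bits = bit_score(raw)
    print(raw, round(bits, 1), e_value(bits, m, n))
```

Because S' is linear in S, each extra bit of score halves the E-value for a fixed search space.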

  29. 10: Use Smith-Waterman algorithm (DP) to generate alignment • ONLY significant matches are re-analyzed using Smith-Waterman DP algorithm. • Alignments reported by BLAST are produced by dynamic programming

  30. BLAST: What is a "Hit"? • A hit is a w-length word in database that aligns with a word from query sequence with score > T • BLAST looks for hits instead of exact matches • Allows word size to be kept larger for speed, without sacrificing sensitivity • Typically, w = 3-5 for amino acids, w = 11-12 for DNA • T is the most critical parameter: • ↑T ↓ “background” hits (faster) • ↓T ↑ ability to detect more distant relationships (at cost of increased noise)

  31. Tips for BLAST Similarity Searches • If you don’t know, use default parameters first • Try several programs & several parameter settings • If possible, search on protein sequence level • Scoring matrices: PAM1 / BLOSUM80: if expect/want less divergent proteins; PAM120 / BLOSUM62: "average" proteins; PAM250 / BLOSUM45: if need to find more divergent proteins • Proteins: >25-30% identity (and >100 aa) -> likely related; 15-25% identity -> twilight zone; <15% identity -> likely unrelated

  32. Practical Issues Searching on DNA or protein level? In general, protein-encoding DNA should be translated! • DNA yields more random matches: • 25% for DNA vs. 5% for proteins • DNA databases are larger and grow faster • Selection (generally) acts on protein level • Synonymous mutations are usually neutral • DNA sequence similarity decays faster

  33. BLAST vs FASTA • Seeding: • BLAST integrates scoring matrix into first phase • FASTA requires exact matches (uses hashing) • BLAST increases search speed by finding fewer, but better, words during initial screening phase • FASTA uses shorter word sizes - so can be more sensitive • Results: • BLAST can return multiple best scoring alignments • FASTA returns only one final alignment

  34. BLAST & FASTA References • FASTA - developed first • Pearson & Lipman (1988) 
Improved Tools for Biological Sequence Comparison.
PNAS 85:2444-2448 • BLAST • Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) • Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman (1997) 
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. 
Nucleic Acids Res. 25:3389-402

  35. BLAST Notes - & DP Alternatives • BLAST uses heuristics: it may miss some good matches • But, it’s fast: 50-100X faster than Smith-Waterman (SW) DP • Large impact: • NCBI’s BLAST server handles more than 100,000 queries/day • Most used bioinformatics program in the world! • But - Xiong says: "It has been estimated that for some families of protein sequences BLAST can miss 30% of truly significant matches." • Increased availability of parallel processing has made DP-based approaches feasible: • 2 DP-based web servers: both more sensitive than BLAST • Scan Protein Sequence: http://www.ebi.ac.uk/scanps/index.html Implements modified SW optimized for parallel processing • ParAlign: www.paralign.org - parallel SW or heuristics

  36. NCBI - BLAST Programs Glossary & Tutorials BLAST • http://www.ncbi.nlm.nih.gov/BLAST/ • http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html • http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

  37. Chp 5 - Multiple Sequence Alignment SECTION II SEQUENCE ALIGNMENT Xiong: Chp 5 Multiple Sequence Alignment • Scoring Function • Exhaustive Algorithms • Heuristic Algorithms • Practical Issues

  38. Multiple Sequence Alignments Credits for slides: Caragea & Brown, 2007; Fernandez-Baca, Heber & Hunter

  39. Overview • What is a multiple sequence alignment (MSA)? • Where/why do we need MSA? • What is a good MSA? • Algorithms to compute a MSA

  40. Multiple Sequence Alignment • Generalize pairwise alignment of sequences to include > 2 homologous sequences • Analyzing more than 2 sequences gives us much more information: • Which amino acids are required? Correlated? • Evolutionary/phylogenetic relationships • Similar to PSI-BLAST idea (not yet covered in lecture): use a set of homologous sequences to provide more "sensitivity"

  41. What is a MSA? Three candidate alignments of the same sequences:

     ATT-GC       AT-TGC       AT-T-GC
     ATTTGC       ATTTGC       ATTT-GC
     ATTTG        ATTTG-       ATTT-G-
     Not a MSA    MSA          Not a MSA

     Why?

  42. Definition: MSA Given a set of sequences, a multiple sequence alignment is an assignment of gap characters, such that • resulting sequences have same length • no column contains only gaps

     ATT-GC       AT-TGC       AT-T-GC
     ATTTGC       ATTTGC       ATTT-GC
     ATTTG        ATTTG-       ATTT-G-
     NO           YES          NO
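The two conditions in this definition translate directly into a small check. A minimal sketch, assuming "-" is the gap character; `is_msa` is a hypothetical helper name.

```python
def is_msa(rows):
    """True if all rows have equal length and no column consists only of gaps."""
    if len({len(row) for row in rows}) != 1:
        return False
    return all(any(ch != "-" for ch in column) for column in zip(*rows))


print(is_msa(["ATT-GC", "ATTTGC", "ATTTG"]))       # False: rows differ in length
print(is_msa(["AT-TGC", "ATTTGC", "ATTTG-"]))      # True
print(is_msa(["AT-T-GC", "ATTT-GC", "ATTT-G-"]))   # False: fifth column is all gaps
```

The first candidate fails the equal-length condition and the third fails the no-all-gap-column condition, which matches the NO / YES / NO verdicts on the slide.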

  43. Displaying MSAs: using CLUSTAL W • RED: AVFPMILW (small) • BLUE: DE (acidic, negative chg) • MAGENTA: RHK (basic, positive chg) • GREEN: STYHCNGQ (hydroxyl + amine + basic) • "*" = entirely conserved column • ":" = all residues have ~ same size AND hydropathy • "." = all residues have ~ same size OR hydropathy

  44. What is a Consensus Sequence? A single sequence that represents the most common residue of each column in a MSA. Example: FGGHL-GF / F-GHLPGF / FGGHP-FG; consensus: FGGHL-GF. Steiner consensus sequence: Given sequences s1,…, sk, find a sequence s* that maximizes Σi S(s*,si)
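The simple "most common residue per column" consensus can be computed with a counter over each column. This sketch implements only that majority rule, not the Steiner consensus, which needs a scoring function S and a search over candidate sequences. `consensus` is a hypothetical helper, and ties are broken by first occurrence in the column, which is an arbitrary choice.

```python
from collections import Counter

ALIGNMENT = ["FGGHL-GF",
             "F-GHLPGF",
             "FGGHP-FG"]


def consensus(rows):
    """Return the most common character of each column (gaps count as characters)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*rows))


print(consensus(ALIGNMENT))  # FGGHL-GF
```

Here every column has a clear majority, so the tie-breaking rule is never exercised.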

  45. Applications of MSA • Building phylogenetic trees • Finding conserved patterns, e.g.: • Regulatory motifs (TF binding sites) • Splice sites • Protein domains • Identifying and characterizing protein families • Find out which protein domains have same function • Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms) • DNA fragment assembly (in genomic sequencing)

  46. Application: Recover Phylogenetic Tree What was the series of events that led to current species? NYLS NFLS NYLS

  47. Application: Discover Conserved Patterns Is there a conserved cis-acting regulatory sequence? Rationale: if they are homologous (derived from a common ancestor), they may be structurally equivalent TATA box = transcriptional promoter element

  48. Goal: Characterize Protein Families Which parts of globin sequences are most highly conserved?