BCB 444/544

BCB 444/544 Lecture 11
First: BLAST vs FASTA, plus some Gene Jargon
Multiple Sequence Alignment (MSA)
#11_Sept14
BCB 444/544 F07 ISU Dobbs #11 - Multiple Sequence Alignment

Required Reading (before lecture)
√ Mon Sept 10 - for Lecture 9/10: BLAST variations; BLAST vs FASTA, SW
• Chp 4 - pp 51-62
√ Wed Sept 12 - for Lecture 11 & Lab 4: Multiple Sequence Alignment (MSA)
• Chp 5 - pp 63-74
Fri Sept 14 - for Lecture 12: Position Specific Scoring Matrices & Profiles
• Chp 6 - pp 75-78 (but not HMMs)
Good additional resource on sequence alignment:
• Wikipedia: http://en.wikipedia.org/wiki/Sequence_alignment

Assignments & Announcements - #1
Revised Grading Policy has been sent via email. Please review!
√ Mon Sept 10 - Lab 3 Exercise due 5 PM, to: terrible@iastate.edu
? Thu Sept 13 - Graded Labs 2 & 3 will be returned at beginning of Lab 4
Fri Sept 14 - HW#2 due by 5 PM (106 MBB)
Study Guide for Exam 1 will be posted by 5 PM

Review: Gene Jargon #1 (for HW2, 1c)
Exons = "protein-encoding" (or "kept") parts of eukaryotic genes
vs
Introns = "intervening sequences" = segments of eukaryotic genes that "interrupt" exons
• Introns are transcribed into pre-RNA
• but are later removed by RNA processing
• & do not appear in mature mRNA
• so are not translated into protein

Assignments & Announcements - #2
Mon Sept 17 - Answers to HW#2 will be posted by 5 PM
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
• Lectures 2-12 (thru Mon Sept 17)
• Labs 1-4
• HW2
• All assigned reading: Chps 2-6 (but not HMMs); Eddy: What is Dynamic Programming

Chp 4 - Database Similarity Searching
SECTION II: SEQUENCE ALIGNMENT
Xiong: Chp 4 Database Similarity Searching
• √ Unique Requirements of Database Searching
• √ Heuristic Database Searching
• √ Basic Local Alignment Search Tool (BLAST)
• FASTA
• Comparison of FASTA and BLAST
• Database Searching with Smith-Waterman Method

Why search a database?
• Given a newly discovered gene:
  • Does it occur in other species?
  • Is its function known in another species?
• Given a newly sequenced genome, which regions align with genomes of other organisms?
  • Identification of potential genes
  • Identification of other functional parts of chromosomes
• Find members of a multigene family

FASTA and BLAST
• Both FASTA and BLAST are based on heuristics
• Tradeoff: Sensitivity vs Speed
  • DP is slower, but more sensitive
• FASTA
  • User defines value for k = word length
  • Slower, but more sensitive than BLAST at lower values of k (preferred for searches involving a very short query sequence)
• BLAST family
  • Family of different algorithms optimized for particular types of queries, such as searching for distantly related sequence matches
  • BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy

BLAST algorithms can generate both "global" and "local" alignments
[Figure panels: Global alignment; Local alignment]

BLAST - a Family of Programs: Different BLAST "flavors"
• BLASTP - protein sequence query against protein DB
• BLASTN - DNA/RNA seq query against DNA DB (GenBank)
• BLASTX - 6-frame translated DNA seq query against protein DB
• TBLASTN - protein query against 6-frame DNA translation
• TBLASTX - 6-frame DNA query to 6-frame DNA translation
• PSI-BLAST - protein "profile" query against protein DB
• PHI-BLAST - protein pattern against protein DB
• Newest: MEGA-BLAST - optimized for highly similar sequences
Which tool should you use? http://www.ncbi.nlm.nih.gov/blast/producttable.shtml

Detailed Steps in BLAST algorithm
1. Remove low-complexity regions (LCRs)
2. Make a list (dictionary): all words of length 3 aa or 11 nt
3. Augment list to include similar words
4. Store list in a search tree (data structure)
5. Scan database for occurrences of words in search tree
6. Connect nearby occurrences
7. Extend matches (words) in both directions
8. Prune list of matches using a score threshold
9. Evaluate significance of each remaining match
10. Perform Smith-Waterman to get alignment

1: Filter low-complexity regions (LCRs)
(This slide has been changed!)
• Low complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology.
• Low complexity sequences can yield false positives.
• Screen them out of your query sequences! When appropriate!
K = compositional complexity; varies from 0 (very low complexity) to 1 (high complexity):
$K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_i n_i!} \right)$
where:
N = alphabet size (4 or 20)
L = window length (usually 12)
n_i = frequency (count) of the ith letter in the window
• e.g., for GGGG (N = 4, L = 4):
  • L! = 4! = 4x3x2x1 = 24
  • nG = 4, nT = nA = nC = 0
  • Π ni! = 4! x 0! x 0! x 0! = 24
  • K = 1/4 log4(24/24) = 0
• For CGTA: K = 1/4 log4(24/1) = 0.57
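The two worked values on this slide can be reproduced with a few lines of Python. This is only a sketch of the complexity formula as reconstructed from the slide's labels; the function name `complexity` is made up for illustration, and real low-complexity filters slide this calculation along the sequence window by window.

```python
from collections import Counter
from math import factorial, log


def complexity(window: str, alphabet_size: int) -> float:
    """K = (1/L) * log_N( L! / product of n_i! ) for a single window."""
    length = len(window)
    denominator = 1
    for count in Counter(window).values():
        denominator *= factorial(count)  # letters that are absent contribute 0! = 1
    arrangements = factorial(length) // denominator
    return log(arrangements, alphabet_size) / length


print(round(complexity("GGGG", 4), 2))  # homopolymer window
print(round(complexity("CGTA", 4), 2))  # every letter different
```

The first call corresponds to the GGGG example (K = 0) and the second to the CGTA example (K about 0.57).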

2: List all words in query
YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
YGG
 GGF
  GFM
   FMT
    MTS
     TSE
      SEK
       …
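Listing the query words is a sliding window of width w over the query. A minimal sketch (the helper name `query_words` is an assumption for illustration, not part of BLAST itself):

```python
def query_words(query: str, w: int = 3) -> list[str]:
    """Return every overlapping word of length w, in query order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


words = query_words("YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ")
print(words[:7])   # the words shown on the slide
print(len(words))  # a query of length n yields n - w + 1 words
```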

3: Augment word list
YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
YGG GGF GFM FMT MTS TSE SEK …
Candidate words: AAA AAB AAC … YYY
20^3 = 8000 possible matches

3: Augment word list
BLOSUM62 scores:
G G F
A A A
0 + 0 + -2 = -2   Non-match
G G F
G G Y
6 + 6 + 3 = 15   Match
A user-specified threshold, T, determines which 3-letter words are considered matches and non-matches

3: Augment word list
YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
YGG GGF GFM FMT MTS TSE SEK …
Words similar to GGF: GGI GGL GGM GGF GGW GGY …

3: Augment word list
Observation: Selecting only words with score > T greatly reduces the number of possible matches; otherwise, there would be 20^3 = 8000 candidates for 3-letter words from amino acid sequences!

Example
Find all words that match EAM with a score greater than or equal to 11

BLOSUM62:
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C   0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H  -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

EAM  5 + 4 + 5 = 14
DAM  2 + 4 + 5 = 11
QAM  2 + 4 + 5 = 11
ESM  5 + 1 + 5 = 11
EAL  5 + 4 + 2 = 11
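The EAM example can be checked by brute force: score all 20^3 candidate words against the query word and keep those at or above the threshold. The sketch below embeds the BLOSUM62 values from the slide; the names `word_score` and `neighborhood` are illustrative, and a production implementation would prune candidates rather than enumerate all 8000.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # row/column order of the table above
BLOSUM62_ROWS = """
4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0
-1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3
-2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3
-2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3
0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
-1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2
-1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2
0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3
-2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3
-1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3
-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1
-1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2
-1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1
-2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2
1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2
0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3
-2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1
0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
"""

# Parse the text table into a {(row_letter, column_letter): score} lookup.
BLOSUM62 = {}
for a, row in zip(AMINO_ACIDS, BLOSUM62_ROWS.strip().splitlines()):
    for b, value in zip(AMINO_ACIDS, row.split()):
        BLOSUM62[a, b] = int(value)


def word_score(word: str, candidate: str) -> int:
    """Sum of position-by-position BLOSUM62 scores (no gaps)."""
    return sum(BLOSUM62[a, b] for a, b in zip(word, candidate))


def neighborhood(word: str, threshold: int) -> list[str]:
    """All same-length words scoring >= threshold against `word`."""
    candidates = ("".join(c) for c in product(AMINO_ACIDS, repeat=len(word)))
    return sorted(c for c in candidates if word_score(word, c) >= threshold)


print(neighborhood("EAM", 11))
```

Lowering the threshold from 11 to 10 grows the list from 5 words to 12, which illustrates how T trades sensitivity against the number of words that must be tracked.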

4: Store words in search tree
Augmented list of query words → Search tree
“Does this query contain GGF?”
“Yes, at position 2.”

Search tree
G → G → {F, L, M, W, Y}
Leaves: GGF GGL GGM GGW GGY

Example
Put this word list into a search tree:
DAM QAM EAM KAM ECM EGM ESM ETM EVM EAI EAL EAV
Resulting tree:
D → A → M
Q → A → M
E → A → {I, L, M, V}
E → {C, G, S, T, V} → M
K → A → M
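One simple way to realize the search tree is a trie built from nested dictionaries, with each complete word storing the query positions where it occurs. This is a sketch under that assumption; the `"$"` end-of-word marker and the function names are illustrative choices, and BLAST's actual data structure is more compact.

```python
def build_trie(words_with_positions):
    """Build a nested-dict trie; each complete word stores its query positions."""
    root = {}
    for word, position in words_with_positions:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node.setdefault("$", []).append(position)  # "$" marks end of a word
    return root


def lookup(trie, word):
    """Return the query positions stored for `word`, or [] if it is absent."""
    node = trie
    for letter in word:
        if letter not in node:
            return []
        node = node[letter]
    return node.get("$", [])


query = "YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ"
# 1-based positions, matching the "Yes, at position 2" convention on the slide.
trie = build_trie((query[i:i + 3], i + 1) for i in range(len(query) - 2))

print(lookup(trie, "GGF"))  # does this query contain GGF, and where?
print(lookup(trie, "AAA"))  # a word that is not in the query
```

Each lookup costs one dictionary step per letter, independent of how many words are stored, which is what makes scanning a large database practical.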

5: Scan the database sequences
[Figure: query sequence plotted against database sequence, with word occurrences marked]

Example
Scan this "database" for occurrences of your words:
MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEAIGVKYLQVQHGSNVNIHRLVEGNVKAMENA
E A M
P Q L S V D A M
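The scan can be sketched as a single pass over the database string, testing each 3-letter window for membership in the word list. Here the word list is taken from the search-tree example slide and a plain Python set stands in for the search tree; `scan` is a hypothetical helper name.

```python
DATABASE = ("MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEAIGVKYLQ"
            "VQHGSNVNIHRLVEGNVKAMENA")
WORDS = {"DAM", "QAM", "EAM", "KAM", "ECM", "EGM", "ESM", "ETM", "EVM",
         "EAI", "EAL", "EAV"}


def scan(database: str, words: set[str], w: int = 3) -> list[tuple[int, str]]:
    """Return (0-based offset, word) for every database window found in `words`."""
    hits = []
    for i in range(len(database) - w + 1):
        window = database[i:i + w]
        if window in words:
            hits.append((i, window))
    return hits


for offset, word in scan(DATABASE, WORDS):
    print(offset, word)
```

On this database the scan reports three hits (DAM, EAI and KAM); none of them is the exact query word EAM, which is why augmenting the word list matters.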

6: Connect nearby occurrences (diagonal matches in Gapped BLAST)
Two dots are connected if and only if they are less than A letters apart & are on the same diagonal
[Figure: query sequence plotted against database sequence, with connected dots]
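Two hits lie on the same diagonal when the difference between database position and query position is equal, so grouping hits by that difference makes the "nearby and on a diagonal" test straightforward. The sketch below uses made-up hit coordinates and a hypothetical `connect_hits` helper; `max_gap` plays the role of A.

```python
from collections import defaultdict


def connect_hits(hits, max_gap):
    """Pair up consecutive hits that share a diagonal and are < max_gap apart.

    hits: iterable of (query_pos, db_pos) tuples.
    """
    by_diagonal = defaultdict(list)
    for query_pos, db_pos in hits:
        by_diagonal[db_pos - query_pos].append((query_pos, db_pos))

    connected = []
    for diagonal_hits in by_diagonal.values():
        diagonal_hits.sort()
        for first, second in zip(diagonal_hits, diagonal_hits[1:]):
            if second[0] - first[0] < max_gap:
                connected.append((first, second))
    return connected


# Hypothetical hits: three on diagonal 8, one on diagonal 25.
example_hits = [(2, 10), (9, 17), (5, 30), (40, 48)]
print(connect_hits(example_hits, max_gap=10))
```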

7: Extend matches in both directions
Scan DB

7: Extend matches, calculating score at each step
Query sequence:      L P P Q G L L
Database sequence:   M P P E G L L
BLOSUM62 scores:
<word>                   7 2 6         word score = 15
<--- extension --->  2 7 7 2 6 4 4     HSP SCORE = 32 (High Scoring Pair)
• Each match is extended to left & right until a negative BLOSUM62 score is encountered
• Extension step typically accounts for > 90% of execution time
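The extension rule as stated on this slide (stop at the first negative score) can be sketched directly. Only the BLOSUM62 pairs needed for the example are included in the lookup; the A/W flanking residues in the second call are hypothetical additions to show the stopping condition. Note this follows the slide's simplified rule, whereas BLAST itself stops when the running score drops a set amount below its best value.

```python
# BLOSUM62 entries needed for this example only.
SCORES = {("L", "M"): 2, ("P", "P"): 7, ("Q", "E"): 2,
          ("G", "G"): 6, ("L", "L"): 4, ("A", "W"): -3}


def score(a: str, b: str) -> int:
    """Symmetric lookup in the partial score table."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def extend(query: str, db: str, q_start: int, d_start: int, w: int = 3):
    """Extend a word hit left and right until a negative score is met.

    Returns (hsp_score, query_start, query_end) with a half-open interval.
    """
    total = sum(score(query[q_start + i], db[d_start + i]) for i in range(w))

    q, d = q_start + w, d_start + w          # extend to the right
    while q < len(query) and d < len(db) and score(query[q], db[d]) >= 0:
        total += score(query[q], db[d])
        q, d = q + 1, d + 1
    right_end = q

    q, d = q_start - 1, d_start - 1          # extend to the left
    while q >= 0 and d >= 0 and score(query[q], db[d]) >= 0:
        total += score(query[q], db[d])
        q, d = q - 1, d - 1
    return total, q + 1, right_end


print(extend("LPPQGLL", "MPPEGLL", 2, 2))    # word PQG vs PEG, as on the slide
print(extend("ALPPQGLL", "WMPPEGLL", 3, 3))  # hypothetical flank A/W scores -3
```

The first call reproduces the slide's HSP score of 32; the second stops one residue short of the left end because the A/W pair scores negative.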

8: Prune matches
• Discard all matches that score below the defined threshold

9: Evaluate significance
(This slide has been changed!)
• BLAST uses an analytical statistical significance calculation
RECALL:
• E-value: E = m x n x P
  m = total number of residues in database
  n = number of residues in query sequence
  P = probability that an HSP is the result of random chance
  Lower E-value: less likely to result from random chance, thus higher significance
• Bit score: S' = normalized score that is independent of database size (m) & query length (n), so it can be compared across searches. Bit score is linearly related to the raw alignment score, so a higher S' means the alignment has higher significance:
  $S' = (\lambda S - \ln K) / \ln 2$
  where:
  λ = Gumbel distribution constant
  S = raw alignment score
  K = constant associated with scoring matrix
For more details - see text & BLAST tutorial
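A small numeric sketch of these two quantities follows. It assumes the standard Karlin-Altschul relationship E = m x n x 2^(-S'), i.e. it takes the slide's P to be 2^(-S'); the λ and K values are illustrative placeholders, since the real constants depend on the scoring matrix and gap penalties in use.

```python
from math import log


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - log(k)) / log(2)


def e_value(bits: float, m: int, n: int) -> float:
    """Expected number of chance HSPs with at least this bit score."""
    return m * n * 2 ** (-bits)


LAMBDA, K = 0.318, 0.134   # assumed example constants, not read from the slide
bits = bit_score(32, LAMBDA, K)
print(round(bits, 1))
print(e_value(bits, m=1_000_000, n=100))
```

Because S' is linear in S, each additional bit halves the E-value for a fixed database and query size.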

10: Use Smith-Waterman algorithm (DP) to generate alignment
• ONLY significant matches are re-analyzed using the Smith-Waterman DP algorithm
• Alignments reported by BLAST are produced by dynamic programming

BLAST: What is a "Hit"?
• A hit is a w-length word in the database that aligns with a word from the query sequence with score > T
• BLAST looks for hits instead of exact matches
  • Allows word size to be kept larger for speed, without sacrificing sensitivity
• Typically, w = 3-5 for amino acids, w = 11-12 for DNA
• T is the most critical parameter:
  • ↑T: fewer "background" hits (faster)
  • ↓T: greater ability to detect more distant relationships (at cost of increased noise)

Tips for BLAST Similarity Searches
• If you don't know, use default parameters first
• Try several programs & several parameter settings
• If possible, search on protein sequence level
• Scoring matrices:
  • PAM1 / BLOSUM80: if you expect/want less divergent proteins
  • PAM120 / BLOSUM62: "average" proteins
  • PAM250 / BLOSUM45: if you need to find more divergent proteins
• Proteins:
  • >25-30% identity (and >100 aa) -> likely related
  • 15-25% identity -> twilight zone
  • <15% identity -> likely unrelated

Practical Issues
Searching on DNA or protein level? In general, protein-encoding DNA should be translated!
• DNA yields more random matches:
  • chance of a random match per position is 25% for DNA (1 in 4 bases) vs. 5% for proteins (1 in 20 amino acids)
• DNA databases are larger and grow faster
• Selection (generally) acts on protein level
  • Synonymous mutations are usually neutral
  • DNA sequence similarity decays faster

BLAST vs FASTA
• Seeding:
  • BLAST integrates scoring matrix into first phase
  • FASTA requires exact matches (uses hashing)
  • BLAST increases search speed by finding fewer, but better, words during initial screening phase
  • FASTA uses shorter word sizes - so can be more sensitive
• Results:
  • BLAST can return multiple best scoring alignments
  • FASTA returns only one final alignment

BLAST & FASTA References
• FASTA - developed first
  • Pearson & Lipman (1988) Improved Tools for Biological Sequence Comparison. PNAS 85:2444-2448
• BLAST
  • Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990)
  • Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-402

BLAST Notes - & DP Alternatives
• BLAST uses heuristics: it may miss some good matches
• But, it's fast: 50 - 100X faster than Smith-Waterman (SW) DP
• Large impact:
  • NCBI's BLAST server handles more than 100,000 queries/day
  • Most used bioinformatics program in the world!
• But - Xiong says: "It has been estimated that for some families of protein sequences BLAST can miss 30% of truly significant matches."
• Increased availability of parallel processing has made DP-based approaches feasible:
  • 2 DP-based web servers, both more sensitive than BLAST:
    • Scan Protein Sequence: http://www.ebi.ac.uk/scanps/index.html - implements modified SW optimized for parallel processing
    • ParAlign: www.paralign.org - parallel SW or heuristics

NCBI - BLAST Programs, Glossary & Tutorials
BLAST
• http://www.ncbi.nlm.nih.gov/BLAST/
• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

Chp 5 - Multiple Sequence Alignment
SECTION II: SEQUENCE ALIGNMENT
Xiong: Chp 5 Multiple Sequence Alignment
• Scoring Function
• Exhaustive Algorithms
• Heuristic Algorithms
• Practical Issues

Multiple Sequence Alignments
Credits for slides: Caragea & Brown, 2007; Fernandez-Baca, Heber & Hunter

Overview
• What is a multiple sequence alignment (MSA)?
• Where/why do we need MSA?
• What is a good MSA?
• Algorithms to compute a MSA

Multiple Sequence Alignment
• Generalize pairwise alignment of sequences to include > 2 homologous sequences
• Analyzing more than 2 sequences gives us much more information:
  • Which amino acids are required? Correlated?
  • Evolutionary/phylogenetic relationships
• Similar to PSI-BLAST idea (not yet covered in lecture): use a set of homologous sequences to provide more "sensitivity"

What is a MSA?
ATT-GC    AT-TGC    AT-T-GC
ATTTGC    ATTTGC    ATTT-GC
ATTTG     ATTTG-    ATTT-G-
Not a MSA   MSA     Not a MSA
Why?

Definition: MSA
Given a set of sequences, a multiple sequence alignment is an assignment of gap characters, such that
• resulting sequences have same length
• no column contains only gaps
ATT-GC    AT-TGC    AT-T-GC
ATTTGC    ATTTGC    ATTT-GC
ATTTG     ATTTG-    ATTT-G-
NO        YES       NO
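The two conditions in the definition translate directly into a check. A minimal sketch (the function name `is_msa` is illustrative), applied to the three candidate alignments on the slide:

```python
def is_msa(rows: list[str], gap: str = "-") -> bool:
    """True if all rows have the same length and no column is all gaps."""
    if len({len(row) for row in rows}) != 1:
        return False
    # zip(*rows) iterates over the alignment column by column.
    return all(any(ch != gap for ch in column) for column in zip(*rows))


print(is_msa(["ATT-GC", "ATTTGC", "ATTTG"]))      # rows differ in length
print(is_msa(["AT-TGC", "ATTTGC", "ATTTG-"]))     # valid
print(is_msa(["AT-T-GC", "ATTT-GC", "ATTT-G-"]))  # fifth column is all gaps
```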

Displaying MSAs: using CLUSTAL W
RED: AVFPMILW (small)
BLUE: DE (acidic, negative chg)
MAGENTA: RHK (basic, positive chg)
GREEN: STYHCNGQ (hydroxyl + amine + basic)
* entirely conserved column
: all residues have ~ same size AND hydropathy
. all residues have ~ same size OR hydropathy

What is a Consensus Sequence?
FGGHL-GF
F-GHLPGF
FGGHP-FG
FGGHL-GF
A single sequence that represents the most common residue of each column in a MSA
Example: Steiner consensus sequence: Given sequences s1,…, sk, find a sequence s* that maximizes Σi S(s*,si)
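The "most common residue per column" definition can be sketched with a column-wise majority vote. This is the simple majority consensus, not the Steiner consensus, and it treats the gap character as just another symbol; ties go to whichever symbol appears first in the column, which is an arbitrary choice made for this sketch.

```python
from collections import Counter


def consensus(rows: list[str]) -> str:
    """Most common symbol in each column of an alignment."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*rows))


alignment = ["FGGHL-GF",
             "F-GHLPGF",
             "FGGHP-FG",
             "FGGHL-GF"]
print(consensus(alignment))
```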

Applications of MSA
• Building phylogenetic trees
• Finding conserved patterns, e.g.:
  • Regulatory motifs (TF binding sites)
  • Splice sites
  • Protein domains
• Identifying and characterizing protein families
  • Find out which protein domains have same function
• Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms)
• DNA fragment assembly (in genomic sequencing)

Application: Recover Phylogenetic Tree
What was the series of events that led to current species?
[Figure: example sequences NYLS, NFLS, NYLS]

Application: Discover Conserved Patterns
Is there a conserved cis-acting regulatory sequence?
Rationale: if they are homologous (derived from a common ancestor), they may be structurally equivalent
TATA box = transcriptional promoter element

Goal: Characterize Protein Families
Which parts of globin sequences are most highly conserved?
