
Understanding Sequence Similarity: Pairwise Alignment in Bioinformatics

Learn how to determine sequence similarity, perform pairwise alignments, distinguish homology from convergence, and assess similarity in DNA and polypeptide sequences. Explore scoring methods, visual comparisons, gap penalties, alignment algorithms, and substitution matrices. Understand the significance of local vs. global alignments and the principles behind BLAST. Discover protein evolution, similarity inference, and the importance of substitution matrices in protein comparisons.

mmckay


Presentation Transcript


  1. Pairwise Alignment How do we tell whether two sequences are similar? Assigned reading: Ch 4.1-4.7, Ch 5.1, get what you can out of 5.2, 5.4 BIO520 Bioinformatics Jim Lund

  2. DNA:DNA polypeptide:polypeptide Pairwise alignment The BASIC Sequence Analysis Operation

  3. Pairwise sequence alignments One-to-One One-to-Database Multiple sequence alignments Many-to-Many Alignments

  4. Homology common evolutionary descent Chance Short similar segments are very common. Similarity in function Convergence (very rare) Origins of Sequence Similarity

  5. Visual sequence comparison: Dotplot

  6. Visual sequence comparison: Filtered dotplot 4 bp window, 75% identity cutoff

  7. Visual sequence comparison: Dotplot 4 bp window, 75% identity cutoff

  8. Dotplots of sequence rearrangements
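The filtered dotplot on slides 6 and 7 can be sketched in a few lines of Python. This is a minimal illustration, not the program used to draw the slides: the function names `dotplot` and `render` are made up here, and the defaults simply mirror the 4 bp window and 75% identity cutoff quoted on the slides.

```python
def dotplot(seq_a, seq_b, window=4, min_identity=0.75):
    """Return the set of (i, j) cells where the windows starting at
    seq_a[i] and seq_b[j] agree at >= min_identity of their positions."""
    hits = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches / window >= min_identity:
                hits.add((i, j))
    return hits


def render(seq_a, seq_b, hits):
    """Draw the plot as text: seq_a runs across, seq_b runs down."""
    lines = ["  " + seq_a]
    for j, base in enumerate(seq_b):
        row = "".join("*" if (i, j) in hits else "." for i in range(len(seq_a)))
        lines.append(base + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    a = "GAACAATGAACAAT"
    print(render(a, a, dotplot(a, a)))
```

Plotting a sequence against itself gives the main diagonal, and any repeat shows up as a parallel off-diagonal line. Raising `window` or `min_identity` removes more of the chance matches.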

  9. Assessing similarity GAACAAT ||||||| 7/7 OR 100% GAACAAT Which is BETTER? How do we SCORE? GAACAAT | 1/7 or 14% GAACAAT

  10. MISMATCH Similarity GAACAAT ||||||| 7/7 OR 100% GAACAAT GAACAAT ||| ||| 6/7 OR 86% GAATAAT

  11. Mismatches GAACAAT ||| ||| 6/7 OR 86% GAATAAT Same?? GAACAAT ||| ||| 6/7 OR 86% GAAGAAT

  12. Count this? Terminal Mismatch GAACAATttttt ||| ||| aaaccGAATAAT 6/7 OR 86%
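The identity fractions on slides 9 to 12 are plain match counts over the aligned columns. A small sketch, assuming two ungapped strings of equal length (the name `percent_identity` is chosen here for illustration):

```python
def percent_identity(a, b):
    """Return (matches, percent) for two ungapped, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches, 100.0 * matches / len(a)


if __name__ == "__main__":
    for other in ("GAACAAT", "GAATAAT", "GAAGAAT"):
        matches, pct = percent_identity("GAACAAT", other)
        print(f"GAACAAT vs {other}: {matches}/7 = {pct:.0f}%")
```

GAATAAT and GAAGAAT both match GAACAAT at 6 of 7 positions (about 86%). Percent identity alone cannot tell a C-to-T change from a C-to-G change, which is the question slide 11 raises.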

  13. INDEL INDELS GAAgCAAT ||| |||| 7/7 OR 100% GAA*CAAT

  14. vs. GAAggggCAAT ||| |||| GAA****CAAT Indels, cont’d GAAgCAAT ||| |||| GAA*CAAT

  15. Common Method: Terminal mismatches (0) Match score (1) Mismatch penalty (-3) Gap penalty (-1) Gap extension penalty (-1) Similarity Scoring DNA Defaults

  16. DNA Scoring GGGGGGAGAA |||||*|*|| 8(1)+2(-3)=2 GGGGGAAAAAGGGGG GGGGGGAGAA--GGG |||||*|*|| ||| 11(1)+2(-3)+1(-1)+1(-1)=3 GGGGGAAAAAGGGGG
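The arithmetic on slide 16 can be reproduced with a short scoring function. This is a sketch under stated assumptions: the alignment is given as two equal-length strings with `-` for gaps, the first position of a gap costs the gap penalty, each further position costs the extension penalty, and terminal overhangs score 0. The name `score_alignment` is hypothetical.

```python
def score_alignment(top, bottom, match=1, mismatch=-3,
                    gap_open=-1, gap_extend=-1):
    """Score a finished alignment using the slide-15 DNA defaults."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must be the same length")
    cols = list(zip(top, bottom))
    # Terminal overhangs score 0: drop gap columns at both ends.
    while cols and "-" in cols[0]:
        cols.pop(0)
    while cols and "-" in cols[-1]:
        cols.pop()
    score = 0
    prev_gap = None  # which strand the previous gap column was in, if any
    for x, y in cols:
        if x == "-" or y == "-":
            side = "top" if x == "-" else "bottom"
            score += gap_extend if side == prev_gap else gap_open
            prev_gap = side
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score


if __name__ == "__main__":
    print(score_alignment("GGGGGGAGAA-----", "GGGGGAAAAAGGGGG"))  # ungapped
    print(score_alignment("GGGGGGAGAA--GGG", "GGGGGAAAAAGGGGG"))  # one gap
```

The two calls correspond to the slide's 8(1)+2(-3)=2 and 11(1)+2(-3)+1(-1)+1(-1)=3.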

  17. Absurdity of Low Gap Penalty GATCGCTACGCTCAGC A.C.C..C..T Perfect similarity, Every time!

  18. Local alignment Smith-Waterman Global alignment Needleman-Wunsch Sequence alignment algorithms

  19. Local alignment (Smith-Waterman) BLAST (simplified Smith-Waterman) FASTA (simplified Smith-Waterman) BESTFIT (GCG program) Global alignment (Needleman-Wunsch) GAP Alignment Programs

  20. Local vs. global alignment 10 gaggc 15 ||||| 3 gaggc 7 Local alignment: alignment of regions of substantial similarity 1 gggggaaaaagtggccccc 19 || |||| || 1 gggggttttttttgtggtttcc 22 Global alignment: alignment of the full length of the sequences

  21. Local vs. global alignment
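The difference between the two algorithms is small in code. The sketch below computes only the optimal score (no traceback) and assumes a linear gap penalty of -1 per position, which matches the DNA defaults above because gap opening and extension are both -1. The function names are chosen here for illustration.

```python
def global_score(a, b, match=1, mismatch=-3, gap=-1):
    """Needleman-Wunsch: best score aligning all of a against all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-3, gap=-1):
    """Smith-Waterman: best score over any pair of substrings.
    Cells are floored at 0, so an alignment can start anywhere."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(local_score("ccccgaggctttt", "aagaggcaa"))
    print(global_score("GAACAAT", "GAATAAT"))
```

With these penalties a mismatch (-3) costs more than a pair of single-position gaps (-2), so the global alignment of GAACAAT against GAATAAT prefers two gaps over one mismatch. That is a small-scale version of the low-gap-penalty problem on slide 17.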

  22. Look for local alignment, a High Scoring Pair (HSP) Finding word (W) in query and subject. Score > T. Extend local alignment until score reaches maximum-X. Keep High Scoring Segment Pairs (HSPs) with scores > S. Find multiple HSPs per query if present Expectation value (E value) using Karlin-Altschul stats BLAST Algorithm
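The seed-and-extend idea on this slide can be sketched as follows. This is a heavily simplified illustration and not BLAST itself: it assumes exact word matches only (no neighborhood words scored against T), extends only to the right and without gaps, and the names `find_seeds` and `extend_right` and the default values are hypothetical.

```python
def find_seeds(query, subject, w=4):
    """Return (query_pos, subject_pos) for every exact word of length w
    shared by the two sequences."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]


def extend_right(query, subject, i, j, match=1, mismatch=-3, x_drop=5):
    """Ungapped extension from (i, j); stop once the running score falls
    more than x_drop below the best score seen. Returns (score, length)."""
    score = best = best_len = 0
    k = 0
    while i + k < len(query) and j + k < len(subject):
        score += match if query[i + k] == subject[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break
    return best, best_len


if __name__ == "__main__":
    q, s = "aagaggcaa", "ccccgaggctttt"
    for i, j in find_seeds(q, s):
        print((i, j), extend_right(q, s, i, j))
```

A fuller version would also extend to the left, merge overlapping extensions, and keep only segment pairs scoring above S.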

  23. BLAST statistical significance: assessing the likelihood a match occurs by chance Karlin-Altschul statistic: E = k m N exp(-Lambda S) m = Size of query sequence N = Size of database k = Search space scaling parameter Lambda = scoring scaling parameter S = BLAST HSP score Low E -> good match
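The Karlin-Altschul formula translates directly into code. In the sketch below the function name `expect_value` is made up, and the `k` and `lam` values in the usage lines are placeholders, not real BLAST parameters: the actual values depend on the scoring matrix and gap penalties.

```python
import math


def expect_value(score, m, n, k, lam):
    """E = k * m * N * exp(-lambda * S): the expected number of chance
    HSPs scoring at least `score` for query length m and database size n."""
    return k * m * n * math.exp(-lam * score)


if __name__ == "__main__":
    # k and lam are placeholder values chosen only for illustration.
    for s in (20, 40, 80):
        print(s, expect_value(s, m=300, n=1_000_000, k=0.1, lam=0.3))
```

E falls exponentially as the score rises and grows linearly with query length and database size, so the same score is less significant against a larger database.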

  24. BLAST statistical significance: • Rule of thumb for a good match: • Nucleotide match • E < 1e-6 • Identity > 70% • Protein match • E < 1e-3 • Identity > 25%

  25. Identity - Easy WEAK Alignments Chemical Similarity L vs I, K vs R… Evolutionary Similarity How do proteins evolve? How do we infer similarities? Protein Similarity Scoring

  26. BLOSUM62
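A substitution matrix lets an alignment reward chemically similar pairs such as L/I and K/R instead of counting them as plain mismatches. The sketch below uses a small excerpt of BLOSUM62 covering only four residues. It assumes that "similar" means a positive matrix score, which is how BLAST counts "Positives". The function names are chosen here for illustration.

```python
# Excerpt of BLOSUM62 for the residues L, I, K, R only.
BLOSUM62_EXCERPT = {
    ("L", "L"): 4, ("I", "I"): 4, ("K", "K"): 5, ("R", "R"): 5,
    ("L", "I"): 2, ("K", "R"): 2,
    ("L", "K"): -2, ("L", "R"): -2, ("I", "K"): -3, ("I", "R"): -3,
}


def pair_score(x, y):
    """Look up a symmetric matrix entry; raises KeyError outside the excerpt."""
    if (x, y) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(x, y)]
    return BLOSUM62_EXCERPT[(y, x)]


def identity_and_similarity(a, b):
    """Return (% identical, % positive-scoring) for ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    identical = sum(x == y for x, y in zip(a, b))
    positive = sum(pair_score(x, y) > 0 for x, y in zip(a, b))
    return 100.0 * identical / len(a), 100.0 * positive / len(a)


if __name__ == "__main__":
    print(identity_and_similarity("LKIR", "LRKI"))
```

Under this definition similarity is always at least as high as identity, because every identical pair in the excerpt has a positive score.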

  27. CAU=H CAC=H CGU=R UAU=Y CAA=Q CCU=P GAU=D CAG=Q CUU=L AAU=N Single-base evolution changes the encoded AA
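The nine codons listed around CAU are exactly its single-base neighbors. A short sketch that enumerates them with the standard genetic code (the table is written in the usual U, C, A, G ordering; `single_base_neighbors` is a name chosen here):

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def single_base_neighbors(codon):
    """Map each codon one substitution away to its amino acid."""
    neighbors = {}
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                neighbors[mutant] = CODON_TABLE[mutant]
    return neighbors


if __name__ == "__main__":
    print("CAU =", CODON_TABLE["CAU"])
    for mutant, aa in sorted(single_base_neighbors("CAU").items()):
        print(mutant, "=", aa)
```

Of the nine neighbors of CAU (His), only CAC keeps histidine. The other eight change the amino acid, and substitution matrices score which of those replacements evolution tends to accept.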

  28. Two main classes: PAM-Dayhoff BLOSUM-Henikoff Substitution Matrices

  29. Built from closely related proteins, substitutions constrained by evolution and function “accepted” by evolution (Point Accepted Mutation=PAM) 1 PAM::1% divergence PAM120=closely related proteins PAM250=divergent proteins PAM-Dayhoff

  30. Built from ungapped alignments in proteins: “BLOCKS” Merge blocks at given % similar to one sequence Calculate “target” frequencies BLOSUM62=62% similar blocks good general purpose BLOSUM30 Detects weak similarities, used for distantly related proteins BLOSUM-Henikoff&Henikoff

  31. BLOSUM62

  32. No general theory for significance of matches!! G+L(n) indel mutations rare variation in gap length “easy”, G > L Gapped alignments
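Reading the slide's G+L(n) as an affine gap cost (a fixed opening cost G plus a per-position cost L times the gap length n), the preference for one long gap over several short ones is easy to show. The numbers below are hypothetical and chosen only so that G > L.

```python
def affine_gap_cost(n, gap_open, gap_extend):
    """Cost of a single gap of length n under the assumed G + L*n model."""
    if n <= 0:
        return 0
    return gap_open + gap_extend * n


if __name__ == "__main__":
    G, L = 5, 1  # hypothetical values with G > L
    print("one gap of length 4:  ", affine_gap_cost(4, G, L))
    print("four gaps of length 1:", 4 * affine_gap_cost(1, G, L))
```

Because the opening cost is paid once per gap, a single four-position indel is much cheaper than four scattered one-position indels. This matches the slide's point that indel mutations are rare while variation in gap length is "easy".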

  33. Real Alignments Protein-Protein Close-Distant DNA-DNA

  34. Phylogeny Myoglobin

  35. Cow-to-Pig Protein 88% identical

  36. Cow-to-Pig cDNA 80% Identity (88% at aa!)

  37. DNA similarity reflects polypeptide similarity

  38. Coding vs Non-coding Regions 90% in coding (70% in non-coding)

  39. Third Base of Codon is Hypervariable 28 third base 11 second 8 first
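Counts like those on this slide come from tallying mismatches by codon position. A minimal sketch, assuming two ungapped coding sequences of equal length that both start in frame (the function name is made up for illustration):

```python
def differences_by_codon_position(cds_a, cds_b):
    """Return [first, second, third]: the number of mismatches at each
    codon position of two aligned, in-frame, ungapped coding sequences."""
    if len(cds_a) != len(cds_b):
        raise ValueError("sequences must be the same length")
    counts = [0, 0, 0]
    for idx, (x, y) in enumerate(zip(cds_a, cds_b)):
        if x != y:
            counts[idx % 3] += 1
    return counts


if __name__ == "__main__":
    print(differences_by_codon_position("CAUCAU", "CACAAU"))
```

Third-position changes are often silent at the protein level (CAU and CAC both encode His), which is why they accumulate faster than changes at the first two positions.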

  40. Cow-to-Fish Protein 42% identity, 51% similarity

  41. Cow-to-Fish DNA 48% similarity

  42. Polypeptide similarity > DNA Coding DNA > Non-coding 3rd base of codon hypervariable Moderate Distance -> poor DNA similarity Protein vs. DNA Alignments

  43. DNA-DNA similarities 50% significant if “long” E < 1e-6, 70% identity Protein-protein similarities 80% end-end: same structure, same function 30% over domain, similar function, structure overall similar 15-30% “twilight zone” Short, strong match…could be a “motif” Rules of Thumb

  44. BLASTN DNA to DNA database BLASTP protein to protein database BLASTX DNA (translated) to protein database TBLASTN protein to DNA database (translated) TBLASTX DNA (translated) to DNA database (translated) Basic BLAST Family

  45. nr (non-redundantish merge of Genbank, EMBL, etc…) EXCLUDES HTGS0,1,2, EST, GSS, STS, PAT, WGS est (expressed sequence tags) htgs (high throughput genome seq.) gss (genome survey sequence) vector, yeast, ecoli, mito chromosome (complete genomes) And more DNA Databases http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#nucleotide_databases

  46. nr (non-redundant Swiss-prot, PIR, PRF, PDB, Genbank CDS) swissprot ecoli, yeast, fly, month And more Protein Databases

  47. Program Database Options - see more Sequence FASTA gi or accession# BLAST Input >one line gggtcgagtac
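FASTA input is a `>` header line followed by one or more sequence lines. A small parser sketch (the name `parse_fasta` is chosen here; any text before the first header is ignored):

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    print(parse_fasta(">one line\ngggtcgagtac\n"))
```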

  48. Algorithm and output options # descriptions, # alignments returned Probability cutoff Strand Alignment parameters Scoring Matrix PAM30, PAM70, BLOSUM45, BLOSUM62, BLOSUM80 Filter (low complexity) PPPPP->XXXXX BLAST Options

  49. Gapped Blast (default) PSI-Blast (Position-specific iterated blast) “self” generated scoring matrix PHI BLAST (motif plus BLAST) BLAST2 client (align two seqs) megablast (genomic sequence) rpsblast (search for domains) Extended BLAST Family
