
Sequence similarity, BLAST alignments & multiple sequence alignments


rmaines

Presentation Transcript


  1. Sequence similarity, BLAST alignments & multiple sequence alignments • June 20, 2019

  2. Sequence similarity • Why do we care? • Workhorse of bioinformatics: • Genome assembly & annotation • Protein function prediction • Phylogeny & evolution (metagenomics)

  3. Most common methods • Pairwise alignment • BLAST • Multiple sequence alignment • ClustalW, MUSCLE • Protein domain profiles • Pfam, InterPro, PANTHER

  4. Pairwise alignments • How are two sequences related to each other? • Homologous: share a common ancestor • Homology itself cannot be measured directly • Measure similarity; infer homology • Orthologs: separated by speciation • Paralogs: separated by duplication • Are there gaps in one versus the other? • What is the percent similarity? • What is a significant alignment?

  5. Multiple sequence alignments • Do these sequences share a core level of similarity? • Can be used to build a profile for that family • Protein domain profiles used to annotate the function of genes in a newly sequenced genome • Starting point for phylogenetic analyses

  6. Pairwise alignment
     First string  = a b c d e
     Second string = a c d d e f
     Two alignments:
       (1) a b c d - e -
           a - c d d e f
       (2) a b c - d e -
           a - c d d e f
     Which alignment is better?
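To compare the two alignments above, each column has to be given a score. The sketch below uses a simple linear scheme; the values (+1 match, -1 mismatch, -2 gap) and the function name `score_alignment` are assumptions chosen for illustration, not values from the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Sum per-column scores for two equal-length aligned strings ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # column contains a gap
        elif a == b:
            total += match        # identical characters
        else:
            total += mismatch     # substitution
    return total

# The two alignments from the slide, written without spaces.
score_1 = score_alignment("abcd-e-", "a-cddef")
score_2 = score_alignment("abc-de-", "a-cddef")
print(score_1, score_2)
```

Under these assumed scores the two alignments tie, because each has four matching columns and three gap columns. Deciding between them requires a scoring scheme that carries more biological information.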

  7. Scoring schemes • A biologically relevant method of scoring matches, mismatches & gaps • Nucleotide alignments: • Identity only, with a positive score for matches & a negative score for mismatches • Transitions (A <-> G, C <-> T) and transversions (purine <-> pyrimidine) scored differently • Transitions are more common and more likely to be silent
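A minimal sketch of a nucleotide scoring function that treats transitions and transversions differently. The specific values (+1, -1, -2) and the helper names are assumptions for illustration only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned nucleotide pair; transitions are penalized less than transversions."""
    x, y = x.upper(), y.upper()
    if x == y:
        return match
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion

def score_ungapped(seq1, seq2):
    """Total score of two equal-length sequences aligned without gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(nucleotide_score(a, b) for a, b in zip(seq1, seq2))

# A/G is a transition, G/T is a transversion, C/C and T/T are matches.
example_total = score_ungapped("ACGT", "GCTT")
print(example_total)
```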

  8. Amino acid substitution matrices • Based on observed frequencies of amino acid distributions and substitutions • Model the conservative nature of substitutions • Implicitly represent evolutionary patterns • Scores are grounded in information theory

  9. Scoring amino acid substitutions • Amino acids share similarity based on chemical and physical properties • Not all substitutions are equally likely due to physical/chemical constraints • e.g. L -> I is much more conservative than L -> Y

  10. Information theory • The information H associated with an event of probability p is the base-2 logarithm of the inverse of p: H = log2(1/p) = -log2(p) • Values expressed as base-2 logarithms have the unit bits • Information is described as a message of symbols • If there are n symbols and all n are equally probable, the probability of any symbol appearing is 1/n, so each symbol carries log2(n) bits

  11. Information Theory • If all symbols are NOT equally probable, the entropy (H) is the negative sum over all n symbols of the probability of each symbol (p_i) multiplied by the base-2 logarithm of that probability: H = -Σ p_i log2(p_i) • Entropy of a normal coin: -( (0.5)(-1) + (0.5)(-1) ) = 1 bit • Entropy of a trick coin where heads comes up ¾ of the time: -( (0.75)(-0.415) + (0.25)(-2) ) = 0.81 bit • Entropy of random DNA: -( (0.25)(-2) + (0.25)(-2) + (0.25)(-2) + (0.25)(-2) ) = 2 bits
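The three entropy values on this slide can be reproduced with a few lines of Python. This is a small sketch; the function names `information_bits` and `entropy_bits` are illustrative choices.

```python
import math

def information_bits(p):
    """Information (in bits) of a single event with probability p: log2(1/p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)

def entropy_bits(probabilities):
    """Shannon entropy H = -sum(p_i * log2(p_i)), in bits."""
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing and are skipped to avoid log2(0).
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair_coin = entropy_bits([0.5, 0.5])
trick_coin = entropy_bits([0.75, 0.25])
random_dna = entropy_bits([0.25, 0.25, 0.25, 0.25])
print(round(fair_coin, 2), round(trick_coin, 2), round(random_dna, 2))
```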

  12. Scoring matrices • S = log( q_ij / (p_i p_j) ) is the score for an amino acid pairing in the alignment • q_ij is the observed pairing frequency of amino acids i and j • p_i and p_j are the expected frequencies of amino acids i and j • Commonly observed substitutions: S > 0 • Rarely observed substitutions: S < 0 • Observed and random frequency the same: S = 0
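The sign behaviour described on this slide follows from a log-odds ratio of the observed pairing frequency to the frequency expected by chance. The sketch below assumes a base-2 logarithm with no scaling or rounding (published matrices such as BLOSUM62 are additionally scaled and rounded to integers), and the frequencies are made-up numbers chosen only to show the three cases.

```python
import math

def log_odds_score(q_ij, p_i, p_j):
    """Log-odds score: log2 of observed pair frequency over the chance expectation p_i * p_j."""
    if q_ij <= 0 or p_i <= 0 or p_j <= 0:
        raise ValueError("frequencies must be positive")
    return math.log2(q_ij / (p_i * p_j))

# Hypothetical frequencies: with p_i = p_j = 0.25 the chance expectation is 0.0625.
common = log_odds_score(0.125, 0.25, 0.25)     # observed twice as often as chance
rare = log_odds_score(0.03125, 0.25, 0.25)     # observed half as often as chance
neutral = log_odds_score(0.0625, 0.25, 0.25)   # observed exactly as often as chance
print(common, rare, neutral)
```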

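  13. BLOSUM62 Matrix • BLOSUM (BLOcks SUbstitution Matrix) matrices are based on alignments of conserved protein blocks • The number is the percent-identity threshold used when building the matrix: sequences in a block that are at least that identical (62% for BLOSUM62) are clustered and counted as one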

  14. Amino acid chemical relationships

  15. BLOSUM62 Matrix • Large positive: rare amino acids • Large negative: unlikely substitutions • Near zero: no penalty for substitutions

  16. BLOSUM90 • Scores are more extreme (more positive and more negative) than in BLOSUM62 • Based on blocks of aligned protein sequences in which sequences at least 90% identical to another sequence in the block are clustered together

  17. NCBI BLAST matrices

  18. BLAST • Build a list of words from the query sequence (word length 3 for proteins, 11 for DNA) • Evaluate each word for a match using the scoring matrix and discard all below threshold • Generally about 50 matches per word • The T value is the threshold; it determines the sensitivity and speed of the search • Steps: (1) Build word list from query sequence (2) Find hits in database sequence (3) Extend the hits to form HSPs (4) Calculate statistical significance of matches

  19. Query sequence: PSATPVLICWAAG • Word list (selected words): PSA, ATP, VLI, CWA • Threshold score (T): 11 • Matches to PSA (word: score): PSA: 15, PST: 9, PDA: 11, WSA: 4
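A small sketch of the word-list step using the query from this slide. The candidate scores are copied from the slide rather than recomputed from a matrix. The function names are illustrative, and keeping words that score exactly T is an assumption based on "discard all below threshold".

```python
def build_words(query, k=3):
    """Return every overlapping word of length k in the query, in order."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]

def filter_by_threshold(scored_words, threshold):
    """Keep only candidate words whose score is at or above the threshold T."""
    return {word: score for word, score in scored_words.items() if score >= threshold}

query = "PSATPVLICWAAG"
words = build_words(query)

# Candidate matches to the word PSA, with the scores listed on the slide.
candidates = {"PSA": 15, "PST": 9, "PDA": 11, "WSA": 4}
kept = filter_by_threshold(candidates, threshold=11)

print(words)
print(kept)
```

The full list contains 11 overlapping words. The four words shown on the slide (PSA, ATP, VLI, CWA) are a subset of it.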

  20. BLAST • Find matches for each word in the database • The database is indexed, so all possible words in all sequences are known • This search is very fast (500K words/sec) • Matches scoring at or above the threshold (T) are used as seeds for alignments

  21. BLAST • Extend the alignment from each word in both directions so long as the score increases • These alignments are the high scoring pairs (HSPs) • Keep HSPs if the score is above a given threshold

  22. Extending the hit
      Score of new alignment = score of previous alignment + score of new aligned pair
      (1) PSA / PSA (15) + C / C (9) = PSAC / PSAC (24)
      (2) PSAC / PSAC (24) + Y / W (2) = PSACY / PSACW (26)
      (3) Repeat adding aligned pairs until the score goes down or the end of the sequence is reached.
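The extension rule on this slide can be sketched as a loop that keeps adding aligned pairs to the right of the seed. The totals 15, 9 and 2 come from the slide; splitting 15 into 7 + 4 + 4 uses BLOSUM62 values. The default score of -4 for unlisted pairs and the trailing residues K and D are assumptions added so the loop has a place to stop. The sketch follows the slide's simplified stopping rule and does not reproduce BLAST's actual drop-off heuristic.

```python
# Toy pair scores: only the pairs needed for the slide's example.
TOY_SCORES = {("P", "P"): 7, ("S", "S"): 4, ("A", "A"): 4, ("C", "C"): 9, ("Y", "W"): 2}
DEFAULT_SCORE = -4  # assumption: any pair not listed above is penalized

def pair_score(a, b):
    """Look up a symmetric pair score, falling back to DEFAULT_SCORE."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), DEFAULT_SCORE))

def extend_right(query, subject, q_start, s_start, seed_len):
    """Extend a seed to the right until the next pair would lower the score."""
    score = sum(pair_score(query[q_start + i], subject[s_start + i]) for i in range(seed_len))
    length = seed_len
    while q_start + length < len(query) and s_start + length < len(subject):
        step = pair_score(query[q_start + length], subject[s_start + length])
        if step < 0:      # adding this pair would make the score go down
            break
        score += step
        length += 1
    return score, length

# Seed PSA/PSA (15), then C/C (+9 -> 24), then Y/W (+2 -> 26), then K/D stops the extension.
final_score, final_length = extend_right("PSACYK", "PSACWD", 0, 0, 3)
print(final_score, final_length)
```

Extension to the left works the same way with the indices decreasing.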

  23. BLAST • Highest scoring HSPs extended in both directions as long as score > threshold • Do NOT usually get an alignment over the ENTIRE length of the sequence Combine HSPs into a gapped alignment Build word list from query sequence Find hits in database sequence Extend the hits to form HSPs

  24. Positives = 200/310 (64%) Identities = 135/310 (43%) Score = 272 bits Expect = 2e-73

  25. Significance of alignment • P = probability that the observed match could have happened by chance; values between 0 and 1 • E = expect value: number of matches as good as the observed one that would be expected to appear by chance in a database of the size probed • E = P x size of the database • E values range from 0 to the size of the database
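The relationship E = P x size of the database is simple arithmetic. The sketch below mirrors that simplified relationship only. The numbers are made up, and treating "size" as a count of database sequences is an assumption; BLAST itself derives E from the alignment score and the size of the search space.

```python
def expect_value(p_value, database_size):
    """Simplified expect value: chance probability of one match times the database size."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if database_size < 0:
        raise ValueError("database_size must be non-negative")
    return p_value * database_size

# Hypothetical numbers: the same P is less impressive in a larger database.
e_small_db = expect_value(1e-9, 10_000)
e_large_db = expect_value(1e-9, 2_000_000)
print(e_small_db, e_large_db)
```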

  26. When is an alignment significant? • Identify a true ortholog between species • In a protein-protein alignment, E-values < 10^-25 • Are all the domains present in both? • Does the number of exons match? • Are the splice boundaries the same? • Annotation (transfer between species) • E-values < 10^-25 • Functional homolog? • Protein alignment, E-values < 10^-10

  27. Limitations of BLAST • Many, many sequences in the database with NO annotation • Searching the NR database may just bring back matches to multiple versions of the query without identifying any potential function • Pairwise alignments have a limited amount of information

  28. Multiple Sequence Alignments (MSA) • Alignment of ≥ 3 sequences to bring as many similar characters into register as possible • Hypothetical model of mutations (substitutions, insertions & deletions) • Best represents most likely evolutionary scenario. • Cannot be unambiguously established

  29. MSA: Motivation • Correspondence: which parts “do the same thing”? • Similar genes are conserved across widely divergent species, often performing similar functions • Structure prediction • Use knowledge of the structure of one or more members of a protein MSA to predict the structure of other members • Structure is more conserved than sequence • Create “profiles” for protein families • Allow us to search for other members of the family • MSA is the starting point for phylogenetic analysis

  30. Globin alignment

  31. ClustalW Alignment • Symbols: "*" identity, ":" high similarity, "." low similarity, "-" gap in sequence • Amino acids are often color coded based on physicochemical properties

  32. MSA -> Profiles • Profile: a table that lists the frequency of each amino acid at each position of a protein sequence • Frequencies are calculated from an MSA containing a domain of interest • Allows us to identify a consensus sequence • The derived scoring scheme allows us to align a new sequence to the profile • The profile can be used in database searches • Find new sequences that match the profile
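A minimal sketch of turning an MSA into a frequency profile and a consensus sequence. The four-sequence alignment is a made-up toy example, gaps are simply ignored when counting, and the function names are illustrative.

```python
from collections import Counter

def build_profile(msa):
    """Return one {residue: frequency} dict per alignment column, ignoring gaps."""
    if len({len(seq) for seq in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*msa):
        residues = [r for r in column if r != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile

def consensus(profile):
    """Pick the most frequent residue in each column ('-' for all-gap columns)."""
    return "".join(max(col, key=col.get) if col else "-" for col in profile)

# Hypothetical toy alignment of four short sequences.
toy_msa = ["ACDE", "ACDF", "AGDE", "A-DE"]
toy_profile = build_profile(toy_msa)
toy_consensus = consensus(toy_profile)
print(toy_profile)
print(toy_consensus)
```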

  33. Why not just use BLAST? • Database searches using a profile or position-specific scoring matrices (PSSM) are much more sensitive for detecting weak or distant relationships than are database searches using a single sequence as query • Information content higher in a PSSM
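To make the idea of a PSSM concrete, the sketch below converts column counts from a toy MSA into log-odds scores and scores a candidate sequence against them. The uniform background (1/20 per amino acid), the pseudocount of 1 and the toy alignment are all assumptions for illustration; production tools use more refined sequence weighting and background frequencies.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(msa, pseudocount=1.0):
    """Per-column log2 odds of each residue against a uniform background (assumption)."""
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*msa):
        counts = Counter(r for r in column if r in ALPHABET)   # gaps are ignored
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm

def score_sequence(pssm, sequence):
    """Sum the position-specific scores of an ungapped sequence of matching length."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence length must equal the number of PSSM columns")
    return sum(column[residue] for column, residue in zip(pssm, sequence))

# Hypothetical toy alignment; a family member should outscore an unrelated sequence.
toy_pssm = build_pssm(["ACDE", "ACDF", "AGDE"])
member_score = score_sequence(toy_pssm, "ACDE")
unrelated_score = score_sequence(toy_pssm, "WWWW")
print(round(member_score, 2), round(unrelated_score, 2))
```

Because each column has its own scores, a residue such as A is rewarded in the first position and penalized elsewhere, which a single substitution matrix cannot express.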

  34. Pairwise alignment

  35. Position Specific Scoring Matrix (PSSM)

  36. Where and how are profiles used? • Used extensively in defining functional domain profiles of proteins • Pfam, InterPro, PANTHER protein domain databases

  37. This week's exercise • Use BLAST to identify the taxonomic distribution of a known sequence • Use BLAST to identify homologs of specific proteins in other species • Use Primer-BLAST to check the specificity of PCR primers
