Sequence Alignment

theresek
Presentation Transcript


  1. Sequence Alignment
     Goal: line up two or more sequences.
     An alignment of two amino acid sequences:

           123456789…
     Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP
           HKIYHLQSKVP  +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P
     Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP

     Each position within the alignment is a column.

  2. Sequence Alignment
     The alignment is a hypothesis:
     • Positions with identical nucleotides (nt) or amino acids (AA) were present in the common ancestor
     • Differences represent nucleotides or amino acids that have diverged since the common ancestor

     Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP
           HKIYHLQSKVP  +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P
     Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP

  3. From extant sequences to evolution

  4. Constructing and evaluating alignments
     • How do we identify regions of similarity between two sequences?
     • How do we evaluate the degree of similarity?
     • What is the biological significance of the alignment?

  5. Dot Plots: visualization of an alignment
     • Take two English words: THISSEQUENCE and THATSEQUENCE
     • Place the two sequences on the vertical and horizontal axes of a graph
     • Put a dot wherever there is a match
     • A diagonal line marks a region of identity
     • Each diagonal run corresponds to a local alignment
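
The dot-plot procedure can be sketched in a few lines of Python. This is only an illustration under simple assumptions: matches are exact character matches, and the function names `dot_plot` and `diagonal_matches` are made up for this example.

```python
def dot_plot(seq_x: str, seq_y: str) -> list[str]:
    """Return one text row per character of seq_y; '*' marks a match with seq_x."""
    return [
        "".join("*" if x == y else "." for x in seq_x)
        for y in seq_y
    ]


def diagonal_matches(seq_x: str, seq_y: str) -> int:
    """Count matches on the main diagonal (same position in both sequences)."""
    return sum(1 for x, y in zip(seq_x, seq_y) if x == y)


if __name__ == "__main__":
    horizontal, vertical = "THISSEQUENCE", "THATSEQUENCE"
    # Header row: the sequence on the horizontal axis.
    print("  " + horizontal)
    # One row per character of the sequence on the vertical axis.
    for label, row in zip(vertical, dot_plot(horizontal, vertical)):
        print(label + " " + row)
```

Running the script prints a grid in which the shared SEQUENCE suffix shows up as an unbroken diagonal, while the I/A and S/T differences leave a break in it.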

  6. Alignments reveal insertions and deletions
     seq1 THIS---SEQUENCE
     seq2 THISISASEQUENCE
     A gap in seq1 accounts for the insertion of ISA into seq2.

  7. Alignments reveal substitutions
     THISSEQUENCE
     THATSEQUENCE
     • How many substitutions?
     • Are all substitutions equal?
     • If these were real AA sequences in two extant organisms, how could we determine whether they reflect evolutionary ancestry?
     • Would two unrelated sequences share this level of identity?

  8. Substitutions
     We need a method for evaluating the likelihood of each substitution, for example:
     - A to V (alanine to valine)
     - R to F (arginine to phenylalanine)

  9. Scoring schemes to assess similarity
     • Percent identity = percentage of aligned positions with identical amino acids
     • Percent similarity (biochemical equivalence)
     • Substitution matrices
       • a value is assigned to each pair based on the probability of substitution
       • the values are used to score the alignment

           123456789…
     Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP
           HKIYHLQSKVP  +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P
     Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP
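
As a rough illustration of percent identity, the sketch below counts identical columns in an already-aligned pair. It assumes the denominator is the full alignment length including gap columns (the convention BLAST uses when it reports Identities); other definitions divide by the shorter sequence or by ungapped columns only. The function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same number of columns")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if both rows hold the same non-gap symbol.
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    # Assumption: divide by the total number of columns, gaps included.
    return 100.0 * identical / len(aligned_a)


if __name__ == "__main__":
    print(percent_identity("THISSEQUENCE", "THATSEQUENCE"))
    print(percent_identity("THIS---SEQUENCE", "THISISASEQUENCE"))
```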

  10. Substitution matrices
      How might one construct a scoring matrix?
      • What types of sequence events should we consider?
        • DNA level? Transition vs. transversion
        • Amino acid level? Biochemical equivalence
      • Using known proteins?
        • comparing protein homologs
        • post hoc determination of probabilities
      • Should we use the same substitution matrix for two very closely related proteins vs. proteins that diverged long ago?
        • The probability of substitution increases over time
        • So does the probability that multiple substitutions occurred at a single position
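
For the DNA-level case, the transition/transversion distinction is easy to state in code: a transition swaps a purine for a purine (A/G) or a pyrimidine for a pyrimidine (C/T), and a transversion crosses the two classes. The helper name below is hypothetical; this is a small sketch, not part of any alignment tool.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a: str, b: str) -> str:
    """Classify a nucleotide pair as identity, transition, or transversion."""
    a, b = a.upper(), b.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid:
        raise ValueError("expected one of A, C, G, T")
    if a == b:
        return "identity"
    # Same chemical class on both sides means a transition.
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    for pair in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
        print(pair, classify_substitution(*pair))
```

A DNA scoring matrix could then penalize transversions more heavily than transitions, if one assumes transitions are the more frequent event.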

  11. Substitution matrix
      Each cell represents the likelihood of substitution for one possible pair of amino acids.

      THISSEQUENCE
      THATSEQUENCE

      Per-position scores: 5, 8, -1, 1, 4, 5, 5, 0, 5, 6, 9, 5
      Sum of the scores = 52
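
The per-position scoring can be reproduced with a small lookup table. The sketch below hard-codes only the pairs that occur in the example, using BLOSUM62 values; that the slide used BLOSUM62 is an assumption, as is scoring the non-amino-acid letter U against itself as 0. The names `PAIR_SCORES` and `alignment_scores` are illustrative.

```python
# Scores for the residue pairs in the example, taken from BLOSUM62.
# Assumption: "U" is not a standard amino acid, so U/U is scored as 0 here.
# Keys are stored in alphabetical order so lookups can be symmetric.
PAIR_SCORES = {
    ("T", "T"): 5, ("H", "H"): 8, ("A", "I"): -1, ("S", "T"): 1,
    ("S", "S"): 4, ("E", "E"): 5, ("Q", "Q"): 5, ("U", "U"): 0,
    ("N", "N"): 6, ("C", "C"): 9,
}


def pair_score(a: str, b: str) -> int:
    """Look up a pair regardless of order (the matrix is symmetric)."""
    return PAIR_SCORES[tuple(sorted((a, b)))]


def alignment_scores(seq_a: str, seq_b: str) -> list[int]:
    """Score an ungapped alignment column by column."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [pair_score(a, b) for a, b in zip(seq_a, seq_b)]


if __name__ == "__main__":
    scores = alignment_scores("THISSEQUENCE", "THATSEQUENCE")
    print(scores, sum(scores))
```

Identical pairs do not all score the same (H/H scores 8, S/S only 4), which is the point of a substitution matrix over a plain identity count.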

  12. PAM and BLOSUM matrices for AA sequences
      Most protein alignment matrices are empirically derived:
      • PAM scoring matrices
        • compared the full length of closely related proteins
        • measured the frequency of all possible substitution pairs
      • BLOSUM scoring matrices
        • compared highly conserved regions of proteins (blocks)

  13. How to score gaps?
      More than one alignment is possible. THISISASEQUENCE vs THATSEQUENCE?
      • THISISASEQUENCE
      • TH----ATSEQUENCE
      • THA---TSEQUENCE
      • TH---ATSEQUENCE
      • TH-A-T-SEQUENCE
      Scoring the alignment must take into account:
      1) Substitutions
      2) Gaps
      Gap penalties:
      1) start a new gap (-4)
      2) extend an existing gap (-1)
      Score all candidate alignments and choose the one with the highest score.
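
The gap penalties above can be applied mechanically. The sketch below assumes that the first position of a gap costs the opening penalty (-4) and each further position in the same gap costs the extension penalty (-1); some tools instead charge open plus extend for the first position. The function name is hypothetical.

```python
import re

GAP_OPEN = -4    # penalty for the first position of a new gap (from the slide)
GAP_EXTEND = -1  # penalty for each additional position in the same gap


def gap_penalty(aligned_seq: str, gap_char: str = "-") -> int:
    """Total gap penalty for one row of an alignment."""
    total = 0
    # Each maximal run of gap characters is one gap.
    for run in re.finditer(re.escape(gap_char) + "+", aligned_seq):
        length = len(run.group())
        total += GAP_OPEN + (length - 1) * GAP_EXTEND
    return total


if __name__ == "__main__":
    for candidate in ["THA---TSEQUENCE", "TH---ATSEQUENCE", "TH-A-T-SEQUENCE"]:
        print(candidate, gap_penalty(candidate))
```

Under these assumptions one gap of three positions costs -6 while three separate single-position gaps cost -12, so the scheme favors a single contiguous indel. The full alignment score would add the substitution scores of the aligned columns to this penalty.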

  14. Alignments • Finding regions of sequence identity or similarity • Inserting gaps to reflect indels • Scoring the possible alignments to find the optimal alignment by

  15. Common tools that produce alignments
      • BLAST to identify similar sequences, given a query sequence
      • ClustalW to align two or more sequences across their entire length

  16. The BLAST algorithm
      • Uses a PAM or BLOSUM matrix
      • Divides the query sequence into short strings, called words
      • Searches the database for subject sequences that contain similar words
      • When it finds similar words, it extends and scores the alignment
      • Output consists of all subject sequences that align to the query at or above a threshold score
      • If no words are similar, no alignment is reported

  17. BLAST Algorithm
      Divide the entire length of the query into words (segments of length X)
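
The word-splitting and seeding steps might look like the sketch below. It is a simplification, not BLAST itself: words overlap by sliding one position at a time, the word length of 3 is chosen only for illustration, and only exact word matches are treated as hits, whereas BLAST also accepts similar words that score well under the substitution matrix. Both function names are hypothetical.

```python
def split_into_words(sequence: str, word_size: int = 3) -> list[tuple[int, str]]:
    """Return (start position, word) for every overlapping word in the sequence."""
    return [
        (i, sequence[i:i + word_size])
        for i in range(len(sequence) - word_size + 1)
    ]


def find_seed_hits(query: str, subject: str, word_size: int = 3) -> list[tuple[int, int]]:
    """Return (query position, subject position) for every exact word match."""
    # Index the subject words once so each query word is a dictionary lookup.
    index: dict[str, list[int]] = {}
    for pos, word in split_into_words(subject, word_size):
        index.setdefault(word, []).append(pos)
    return [
        (q_pos, s_pos)
        for q_pos, word in split_into_words(query, word_size)
        for s_pos in index.get(word, [])
    ]


if __name__ == "__main__":
    print(split_into_words("HKIYH"))
    print(find_seed_hits("THISSEQUENCE", "THATSEQUENCE"))
```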

  18. Extend hits one base at a time
      S is the alignment score: if it falls below a threshold, the extension process ends.
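
One way to read "falls below a threshold" is a drop-off rule: keep extending while the running score stays within a fixed distance of the best score seen so far. The sketch below makes that assumption and simplifies further: it extends to the right only, allows no gaps, and uses flat match and mismatch scores instead of a substitution matrix. All parameter values and the function name are hypothetical.

```python
def extend_right(
    query: str,
    subject: str,
    q_start: int,
    s_start: int,
    word_size: int = 3,
    match: int = 5,
    mismatch: int = -4,
    max_drop: int = 10,
) -> tuple[int, int]:
    """Extend an exact word hit to the right without gaps.

    Returns (best score, number of positions added beyond the word).
    """
    score = best = match * word_size  # the seed word itself is an exact match
    best_len = 0
    steps = 0
    q, s = q_start + word_size, s_start + word_size
    while q < len(query) and s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        steps += 1
        if score > best:
            best, best_len = score, steps
        elif best - score > max_drop:
            break  # score has dropped too far below the best: stop extending
        q += 1
        s += 1
    return best, best_len


if __name__ == "__main__":
    # Seed "SEQ" found at position 4 in both words.
    print(extend_right("THISSEQUENCE", "THATSEQUENCE", 4, 4))
```

The extension is trimmed back to the point where the score peaked, so trailing mismatches collected before the cutoff are not part of the reported segment.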

  19. HSPs are Aligned Regions
      • High-scoring segment pairs (HSPs) = the original word match plus the extension
        • high scoring = the score of the alignment is above the threshold
        • segment = the region of the query sequence aligned to the subject
        • pair = the alignment is between two sequences (query and subject)
      • BLAST often produces several short HSPs rather than a single aligned region

  20. BLAST Results report local alignments
      • Query was the entire protein sequence (position 1 to 749)
      • Score, E-value, Identities, Positives, Gaps

      >gi|17556182|ref|NP_497582.1| Predicted CDS, phosphatidylinositol transfer protein [Caenorhabditis elegans]
       Score = 283 bits (723), Expect = 8e-75
       Identities = 144/270 (53%), Positives = 186/270 (68%), Gaps = 13/270 (4%)

      Query: 48  KEYRVILPVSVDEYQVGQLYSVAEASKNXXXXXXXXXXXXXXPYEK----DGE--KGQYT 101
                 K+ RV+LP+SV+EYQVGQL+SVAEASK               P++     +G+  KGQYT
      Sbjct: 70  KKSRVVLPMSVEEYQVGQLWSVAEASKAETGGGEGVEVLKNEPFDNVPLLNGQFTKGQYT 129

      Query: 102 HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP 160
                 HKIYHLQSKVP  +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P
      Sbjct: 130 HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP 189

      Query: 161 DLGTQENVHKLEPEAWKHVEAVYIDIADRSQVL-SKDYKAEEDPAKFKSIKTGRGPLGPN 219
                 D GT EN H L+ +     E V I+IA+  + L S D   +  P+KF+S KTGRGPL  N
      Sbjct: 190 DNGTTENAHGLKGDELAKREVVNINIANDHEYLNSGDLHPDSTPSKFQSTKTGRGPLSGN 249

      Query: 220 WKQELVNQKDCPYMCAYKLVTVKFKWWGLQNKVENFIHKQERRLFTNFHRQLFCWLDKWV 279
                 WK  +      P MCAYKLVTV FKW+G Q  VEN+ H Q  RLF+ FHR++FCW+DKW
      Sbjct: 250 WKDSVQ-----PVMCAYKLVTVYFKWFGFQKIVENYAHTQYPRLFSKFHREVFCWIDKWH 304

      Query: 280 DLTMDDIRRMEEETKRQLDEMRQKDPVKGM 309
                  LTM DIR +E + +++L+E R+   V+GM
      Sbjct: 305 GLTMVDIREIEAKAQKELEEQRKSGQVRGM 334

  21. BLAST Statistics
      • The E-value is comparable to a P value (for small values the two are nearly equal)
        • smaller numbers are more significant
        • 1e-4 = 1 x 10^-4 = 0.0001
        • 1e-50 = 1 x 10^-50
      • The E-value is calculated from the alignment score (S)
        • how many alignments of that score would likely occur by chance if you query a database of that size?
        • if GenBank contains 10 million sequences, there is a good probability that the sequence "MAGAV" will occur multiple times in sequences that are NOT evolutionarily related
      • The E-value is the number of alignments with a score of at least S that are expected by chance alone
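
The link between score, database size, and E-value can be made concrete with the standard Karlin-Altschul relation for bit scores, E = m x n x 2^(-S'), where S' is the bit score, m the query length, and n the total database length. The sketch below applies it and converts E to a P value with P = 1 - e^(-E). The lengths in the example are made-up numbers, and real BLAST uses effective lengths that are somewhat shorter than the raw ones.

```python
import math


def expect_from_bit_score(bit_score: float, query_length: int, database_length: int) -> float:
    """Expected number of chance alignments scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value_from_expect(e_value: float) -> float:
    """Probability of seeing at least one such chance alignment: 1 - exp(-E)."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Hypothetical sizes: a 749-residue query against 10 million sequences
    # of 300 residues each.
    e = expect_from_bit_score(50.0, 749, 10_000_000 * 300)
    print(e, p_value_from_expect(e))
```

Doubling the database size doubles E for the same score, which is why the same alignment can look less significant against a larger database.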

  22. Interpretation of output
      • Very low E-values (e.g. 1e-100) indicate sequences that are very close to being identical
      • Moderate E-values indicate related genes (homologs)
      • A long list of gradually declining E-values indicates a large gene family
      • You must examine the results when the E-value is in the 10^-4 to 10^-5 range
        • examine the sequences
        • a few AA matches in a long sequence?
        • many AA matches in a very short sequence?

  23. Evaluating BLAST results
      • Alignment:
        • colored bar alignments (region of alignment, score along length)
        • sequence alignments (region of alignment, AA information)
      • Exploring potential function from significant BLAST hits
        • use the accession link to go to the record page for each hit
        • published papers
        • full sequence information
        • annotation
      • BLAST is linked to a protein domain tool
