
BNFO 602 Lecture 2

Usman Roshan


Presentation Transcript


  1. BNFO 602 Lecture 2 Usman Roshan

  2. DNA Sequence Evolution [Figure: evolutionary tree of a DNA sequence. -3 mil yrs: AAGACTT. -2 mil yrs: AAGGCTT, T_GACTT. -1 mil yrs: _GGGCTT, TAGACCTT, A_CACTT. Today: _G_GCTT (Mouse), TAGGCCTT (Human), TAGCCCTTA (Monkey), A_CACTTC (Lion), A_C_CTT (Cat). With gaps removed: GGCTT (Mouse), TAGGCCTT (Human), TAGCCCTTA (Monkey), ACACTTC (Lion), ACCTT (Cat).]

  3. Sequence alignments They tell us about • Function or activity of a new gene/protein • Structure or shape of a new protein • Location or preferred location of a protein • Stability of a gene or protein • Origin of a gene or protein • Origin or phylogeny of an organelle • Origin or phylogeny of an organism • And more…

  4. Pairwise sequence alignment • How to align two sequences?

  5. Pairwise alignment • How to align two sequences? • We use dynamic programming • Treat DNA sequences as strings over the alphabet {A, C, G, T}

  6. Pairwise alignment

  7. Dynamic programming Define V(i,j) to be the optimal pairwise alignment score between $S_{1..i}$ and $T_{1..j}$ (|S|=m, |T|=n)

  8. Dynamic programming Define V(i,j) to be the optimal pairwise alignment score between $S_{1..i}$ and $T_{1..j}$ (|S|=m, |T|=n) • Time and space complexity is O(mn)
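The recurrence itself did not survive in the transcript, so the following is only a minimal sketch of the standard global alignment fill for V(i,j). The function name and the scoring parameters (`match`, `mismatch`, and a linear `gap` penalty) are assumptions made for illustration, not values from the lecture.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Return V[m][n], the best global alignment score of s and t.

    V[i][j] is the best score for aligning the prefixes s[:i] and t[:j].
    The scoring values are illustrative assumptions.
    """
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]

    # Base cases: a prefix aligned against the empty string is all gaps.
    for i in range(1, m + 1):
        V[i][0] = i * gap
    for j in range(1, n + 1):
        V[0][j] = j * gap

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                V[i - 1][j] + gap,      # s[i-1] against a gap
                V[i][j - 1] + gap,      # t[j-1] against a gap
            )
    return V[m][n]
```

The table has (m+1)(n+1) cells and each is filled in constant time, which is where the O(mn) time and space bound comes from.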

  9. Dynamic programming Animation slides by Elizabeth Thomas at Cold Spring Harbor Labs (CSHL) http://meetings.cshl.org/tgac/tgac/flash/DynamicProgramming.swf

  10. How do we pick gap parameters?

  11. Structural alignments • Recall that proteins have 3-D structure.

  12. Structural alignment - example 1 Alignment of thioredoxins from human and fly taken from the Wikipedia website. This protein is found in nearly all organisms and is essential for mammals. PDB ids are 3TRX and 1XWC.

  13. Structural alignment - example 2 • Unaligned proteins: 2bbm and 1top are proteins from fly and chicken, respectively. • Computer-generated aligned proteins. • Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html

  14. Structural alignments • We can produce high-quality manual alignments if the structure is available. • These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.

  15. Benchmark alignments • Protein alignment benchmarks • BAliBASE, SABMARK, PREFAB, and HOMSTRAD are frequently used in studies of protein alignment. • Protein benchmarks are generally large and have been used in the research community for some time. • BAliBASE 3.0

  16. Biologically realistic scoring matrices • PAM and BLOSUM are the most popular • PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations in 71 families of closely related proteins • BLOSUM is more recent and is computed from blocks of sequences with sufficient similarity

  17. PAM • We need to compute the probability transition matrix M, which defines the probability of amino acid i converting to j • Examine a set of closely related sequences which are easy to align (for PAM, 1572 mutations in 71 families) • Compute probabilities of change and background probabilities by simple counting
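The "simple counting" step can be sketched as below. This is a deliberately simplified illustration, not Dayhoff's actual procedure, which also relies on phylogenetic trees and scales the matrix to one accepted mutation per 100 residues. The function name, the symmetric counting, and the use of "-" as the gap character are all assumptions.

```python
from collections import Counter


def estimate_transition_probabilities(aligned_pairs, gap_char="-"):
    """Estimate M[(i, j)] and background frequencies by counting.

    aligned_pairs is a list of (seq_a, seq_b) tuples of equal-length,
    already aligned sequences. Each column is counted in both directions,
    which is a simplifying assumption that treats change as symmetric.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    for a, b in aligned_pairs:
        for x, y in zip(a, b):
            if x == gap_char or y == gap_char:
                continue  # columns containing a gap are ignored
            pair_counts[(x, y)] += 1
            pair_counts[(y, x)] += 1
            residue_counts[x] += 1
            residue_counts[y] += 1

    total = sum(residue_counts.values())
    background = {r: c / total for r, c in residue_counts.items()}
    # Each row of M is normalized by how often residue x was observed.
    M = {(x, y): c / residue_counts[x] for (x, y), c in pair_counts.items()}
    return M, background
```

With this counting scheme every row of M sums to 1, so M[(i, j)] can be read as the estimated probability that residue i is paired with residue j in the training alignments.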

  18. Local alignment • Global alignment recursions: • Local alignment recursions

  19. Local alignment traceback • Let V(i,j) be the score matrix, T(i,j) the traceback matrix, and m and n the lengths of the input sequences. • Global alignment traceback: • Begin from T(m,n) and stop at T(0,0). • Local alignment traceback: • Find i*,j* such that V(i*,j*) is the maximum over all V(i,j). • Begin traceback from T(i*,j*) and stop when V(i,j) = 0.
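The local alignment recursion and traceback can be sketched as follows. This is one common formulation, shown under assumed scoring values (`match`, `mismatch`, linear `gap`); instead of storing a separate traceback matrix it recomputes which neighbor produced each score, which is a simplification chosen to keep the example short.

```python
def local_alignment(s, t, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_s, aligned_t) for a local alignment.

    Scores are floored at 0, the traceback starts at the maximum cell
    and stops at the first cell whose score is 0.
    """
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,
                          V[i - 1][j - 1] + sub,
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)
            if V[i][j] > best:
                best, bi, bj = V[i][j], i, j

    # Traceback from the best cell (i*, j*) until a zero score is reached.
    a, b = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and V[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))
```

The only differences from the global version are the extra 0 in the maximum, the search for the best cell anywhere in the table, and the early stop in the traceback.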

  20. BLAST • Local pairwise alignment heuristic • Faster than standard pairwise alignment programs such as SSEARCH, but less sensitive. • Online server: http://www.ncbi.nlm.nih.gov/blast

  21. BLAST • Given a query q and a target sequence, find substrings of length k (k-mers) with score at least t; these are called hits. k is normally 3 to 5 for amino acids and 12 for nucleotides. • Extend each hit to a locally maximal segment. Terminate the extension when the reduction in score exceeds a pre-defined threshold. • Report maximal segments above score S.

  22. Finding k-mers quickly • Preprocess the database of sequences: • For each sequence in the database store all k-mers in hash-table. • This takes linear time • Query sequence: • For each k-mer in the query sequence look up the hash table of the target to see if it exists • Also takes linear time
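The preprocessing and lookup steps above can be sketched with an ordinary Python dictionary as the hash table. The function names are hypothetical, and the sketch only finds exact k-mer matches; the thresholded "score at least t" neighborhood used by BLAST is not modeled here.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map every k-mer to the (sequence id, position) pairs where it occurs.

    sequences is a dict of {sequence_id: sequence_string}.
    """
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_hits(query, index, k):
    """Return (query_pos, sequence_id, target_pos) for each shared k-mer."""
    hits = []
    for qpos in range(len(query) - k + 1):
        # .get avoids inserting empty entries into the defaultdict
        for seq_id, tpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, seq_id, tpos))
    return hits
```

Building the index touches each database position once, and each query k-mer costs one expected constant-time lookup plus the number of hits it returns, which matches the linear-time claims on the slide when k is treated as a constant.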

  23. Profile-sequence alignment • Given a family alignment, how can we align it to a sequence? • First, we compute a profile of the alignment. • We then align the profile to the sequence using standard dynamic programming. • However, we need to describe how to align a profile vector to a nucleotide or residue.

  24. Profile • A profile can be described by a set of vectors of nucleotide/residue frequencies. • For each position i of the alignment, we compute the normalized frequency of nucleotides A, C, G, and T
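A minimal sketch of computing such a profile from an alignment is shown below. The function name is hypothetical, and the choice to ignore gap characters when normalizing each column is an assumption; other conventions (for example, counting gaps as a fifth symbol) are also in use.

```python
def compute_profile(alignment, alphabet="ACGT"):
    """Return one frequency vector (as a dict) per alignment column.

    alignment is a list of equal-length strings. Characters outside the
    alphabet, such as the gap symbol, are ignored when normalizing.
    """
    profile = []
    for col in range(len(alignment[0])):
        column = [row[col] for row in alignment]
        counts = {c: column.count(c) for c in alphabet}
        total = sum(counts.values())
        profile.append(
            {c: (counts[c] / total if total else 0.0) for c in alphabet}
        )
    return profile
```

For example, a column containing A, a gap, and C would yield frequencies of 0.5 for A, 0.5 for C, and 0 for G and T under this convention.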

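  25. Aligning a profile vector to a nucleotide • ClustalW/MUSCLE • Let f be the profile vector • $\mathrm{Score}(f,j) = \sum_{i \in \{A,C,G,T\}} f_i \, S(i,j)$ • where $f_i$ is the frequency of nucleotide i in f and S(i,j) is the substitution scoring matrix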
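The score of a profile vector against a single nucleotide is a frequency-weighted average of substitution scores. The sketch below assumes the profile vector is a dict of frequencies and the substitution matrix S is a nested dict; both representations and the function name are illustrative choices.

```python
def profile_score(f, j, S):
    """Score profile vector f against nucleotide j.

    f maps each nucleotide i to its frequency in one alignment column,
    and S[i][j] is the substitution score for aligning i with j.
    """
    return sum(freq * S[i][j] for i, freq in f.items())
```

This value can stand in for the match/mismatch term of the pairwise dynamic programming recurrence, so that a profile column is aligned to a sequence position the same way two characters would be.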

  26. Multiple sequence alignment • “Two sequences whisper, multiple sequences shout out loud” (Arthur Lesk) • Computationally very hard (NP-hard)

  27. Formally…

  28. Multiple sequence alignment • Unaligned sequences: GGCTT, TAGGCCTT, TAGCCCTTA, ACACTTC, ACCTT • Aligned sequences: _G__GCTT_, TAGGCCTT_, TAGCCCTTA, A__CACTTC, A__C_CTT_ • Conserved regions help us to identify functionality

  29. Sum of pairs score

  30. What is the sum of pairs score of this alignment? Sum of pairs score
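The slide's definition did not survive in the transcript, so the following is a sketch of the usual sum-of-pairs score: the sum, over every pair of rows, of the pairwise score of the alignment those two rows induce. The scoring values, the convention that a gap aligned with a gap scores 0, and the "_" gap character are assumptions.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2, gap_char="_"):
    """Return the sum-of-pairs score of a multiple alignment.

    alignment is a list of equal-length strings.
    """
    def column_pair(a, b):
        if a == gap_char and b == gap_char:
            return 0          # gap against gap carries no cost here
        if a == gap_char or b == gap_char:
            return gap
        return match if a == b else mismatch

    return sum(
        column_pair(a, b)
        for row1, row2 in combinations(alignment, 2)
        for a, b in zip(row1, row2)
    )
```

For k sequences of aligned length L this evaluates k(k-1)/2 row pairs over L columns, so scoring a given alignment is cheap even though finding the best-scoring alignment is not.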

  31. Iterative alignment (heuristic for sum-of-pairs) • Pick a random sequence from input set S • Do (n-1) pairwise alignments and align to the closest one, t, in S • Remove t from S and compute the profile of the alignment • While sequences remain in S: • Do |S| pairwise alignments and align to the closest one, t • Remove t from S

  32. Iterative alignment • Once the alignment is computed, randomly divide it into two parts • Compute the profile of each sub-alignment and realign the profiles • If the sum-of-pairs score of the new alignment is better than that of the previous one, keep it; otherwise continue with a different division until a specified iteration limit is reached

  33. Progressive alignment • Idea: perform profile alignments in the order dictated by a tree • Given a guide-tree do a post-order search and align sequences in that order • Widely used heuristic
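The post-order traversal that drives progressive alignment can be sketched as below. The guide tree is assumed to be a nested tuple with sequence names at the leaves, and `merge` is a hypothetical callback standing in for the actual sequence or profile alignment step, which is not implemented here.

```python
def progressive_order(tree, merge):
    """Combine the leaves of a binary guide tree in post-order.

    tree is either a leaf (a string) or a pair (left, right) of subtrees.
    merge(a, b) is called once per internal node, after both children
    have been processed, and its result is passed up to the parent.
    """
    if isinstance(tree, str):
        return tree
    left, right = tree
    return merge(progressive_order(left, merge),
                 progressive_order(right, merge))
```

With a `merge` that simply records its arguments, the tree `(("mouse", "human"), ("monkey", ("lion", "cat")))` is processed as mouse with human, then lion with cat, then monkey with the lion/cat result, and finally the two halves together.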

  34. Popular alignment programs • ClustalW: most popular, progressive alignment • MUSCLE: fast and accurate, combines progressive and iterative alignment • T-COFFEE: slow but accurate, consistency-based alignment (align sequences in the multiple alignment to be close to the optimal pairwise alignments) • PROBCONS: slow but highly accurate, progressive scheme based on probabilistic consistency • DIALIGN: very good for local alignments

  35. MUSCLE

  36. MUSCLE

  37. Evaluation of multiple sequence alignments • Compare to benchmark “true” alignments • Use simulation • Measure conservation of an alignment • Measure accuracy of phylogenetic trees • How well does it align motifs? • More…

  38. Comparison of alignments on BAliBASE
