Module A: Fundamental Algorithms in Sequence Analysis
Section 1: Sequence Alignments
Srinivas Aluru

"Biology easily has 500 years of exciting problems to work on." - Donald E. Knuth

Biological Data
DNA:
• Self-replicating
• Codes for proteins
Proteins:
• Perform most functions in living organisms
BBSI Summer School - Iowa State University

DNA: Sequence of nucleotides
Nucleotide: Deoxyribose sugar + Phosphate + Base
Nucleotides: A, T, G, and C
[Figure: chemical structure of a nucleotide, with the sugar carbons labeled 1’ to 5’]

[Figure: two antiparallel DNA strands, each with a phosphate (P) backbone running 5’ to 3’, joined by the base pairs A-T, C-G, and G-C]

For computational purposes, DNA = a sequence over the alphabet {A, C, G, T}
5’ A T T C G G G A A T G C A T G C C A 3’
3’ T A A G C C C T T A C G T A C G G T 5’

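The two strands on this slide are related by base pairing (A with T, C with G). A minimal sketch of that relationship, treating each strand as a plain string; the function names `complement` and `reverse_complement` are illustrative and not from the slides:

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Base-by-base complement, kept in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """The opposite strand rewritten in the 5' to 3' direction."""
    return complement(strand)[::-1]


top = "ATTCGGGAATGCATGCCA"        # top strand from the slide, 5' to 3'
print(complement(top))            # bottom strand as drawn on the slide, 3' to 5'
print(reverse_complement(top))    # the same bottom strand read 5' to 3'
```
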
Proteins: Chains of amino acid residues. There are 20 different amino acids.
Functions:
• Tissue building blocks (structural proteins)
• Catalysts (enzymes)
• Oxygen transport
• Antibody defense

Example
RNA:     AUG GGA GAG CUA UGA
Protein: Met Gly Glu Leu STOP

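The example above reads the RNA three bases at a time. A small sketch of that lookup follows; the table is deliberately partial (only the five codons on the slide, whereas the standard genetic code has 64), and the name `translate` is an assumption for illustration:

```python
# Partial codon table: only the codons that appear in the slide's example.
CODON_TABLE = {
    "AUG": "Met",
    "GGA": "Gly",
    "GAG": "Glu",
    "CUA": "Leu",
    "UGA": "STOP",
}


def translate(rna: str) -> list[str]:
    """Translate an RNA string codon by codon, stopping at the first STOP."""
    rna = rna.replace(" ", "")
    protein = []
    for i in range(0, len(rna) - 2, 3):
        residue = CODON_TABLE[rna[i:i + 3]]   # KeyError for codons not in the partial table
        if residue == "STOP":
            break
        protein.append(residue)
    return protein


print(translate("AUG GGA GAG CUA UGA"))   # ['Met', 'Gly', 'Glu', 'Leu']
```
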
Challenges in Computational Biology
• Obtain the genome of an organism.
• Identify and annotate genes.
• Find the sequences, three-dimensional structures, and functions of proteins.
• Find sequences of proteins that have desired three-dimensional structures.
• Compare DNA sequences and protein sequences for similarity.
• Study the evolution of sequences and species.

Sequence Comparison Caveats
[Figure: protein structures] Magenta regions are structurally equivalent to enterotoxin (top left).
http://www.sbg.bio.ic.ac.uk/AH/explanation.html

Pairwise Sequence Alignment
Problem: Find similarity between two sequences.
Variations:
• Given two sequences, find if parts of them are similar (local alignment).
• Given a large sequence and a short sequence, find if the short sequence is similar to a stretch of the long sequence.

Alignments
• Show one sequence placed above another such that similarity is revealed
Example:
A: C A T - T C A - C
B: C - T C G C A G C

Measuring Similarity
Score: A measure of alignment quality
  C   A   T   -   T   C   A   -   C
  C   -   T   C   G   C   A   G   C
------------------------------------
 10  -5  10  -5  -2  10  10  -5  10
Total = 33

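The column-by-column total above can be reproduced with a few lines of code. This is a sketch that assumes the scoring used later in the deck (+10 match, -2 mismatch, -5 gap); `score_alignment` is a hypothetical helper name:

```python
def score_alignment(row_a: str, row_b: str, match=10, mismatch=-2, gap=-5) -> int:
    """Sum the per-column scores of two equal-length aligned rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("CAT-TCA-C", "C-TCGCAGC"))   # 33, as on the slide
```
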
Pairwise Global Alignment
T[i,j] = Score of optimally aligning the first i bases of s with the first j bases of t.

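Calculating Alignments
Case 1: Match s[i] with t[j] (the rest aligns the first i-1 bases of s with the first j-1 bases of t)
s: C A T T C A C
t: C - T T C A G
Case 2: Match t[j] with a gap (the rest aligns the first i bases of s with the first j-1 bases of t)
s: C A T T C A C -
t: C - T T C A - G
Case 3: Match s[i] with a gap (the rest aligns the first i-1 bases of s with the first j bases of t)
s: C A T T C A - C
t: C - T T C A G -
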
        λ    C    T    C    G    C    A    G    C
λ       0   -5  -10  -15  -20  -25  -30  -35  -40
C      -5   10    5
A     -10
T     -15
T     -20
C     -25
A     -30
C     -35
+10 for match, -2 for mismatch, -5 for gap

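The three cases suggest the recurrence T[i,j] = max(T[i-1,j-1] + score(s[i], t[j]), T[i-1,j] + gap, T[i,j-1] + gap), with the first row and column filled by gap penalties as in the table. The recurrence itself is not printed in the slide text, so the following is a sketch under that assumption; `global_alignment_table` is an illustrative name:

```python
def global_alignment_table(s: str, t: str, match=10, mismatch=-2, gap=-5):
    """Fill the (len(s)+1) x (len(t)+1) table of optimal global alignment scores."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        T[i][0] = i * gap              # first i bases of s against nothing
    for j in range(1, n + 1):
        T[0][j] = j * gap              # first j bases of t against nothing
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(
                T[i - 1][j - 1] + sub,   # case 1: s[i] with t[j]
                T[i][j - 1] + gap,       # case 2: t[j] with a gap
                T[i - 1][j] + gap,       # case 3: s[i] with a gap
            )
    return T


T = global_alignment_table("CATTCAC", "CTCGCAGC")
print(T[1][:3])     # [-5, 10, 5], the entries already filled in on the slide
print(T[-1][-1])    # 33, the optimal global alignment score
```
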
[Table: the completed table for s = CATTCAC (rows) and t = CTCGCAGC (columns), with the traceback paths marked by asterisks]
Traceback yields both optimal alignments in this example

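Traceback walks from the bottom-right cell back to the top-left, at each step choosing a neighbor whose score explains the current cell. The sketch below returns only one optimal alignment (it prefers the diagonal move, then the upward move, when there are ties), so it does not enumerate every optimal alignment; the function name and the tie-breaking order are assumptions:

```python
def global_align(s: str, t: str, match=10, mismatch=-2, gap=-5):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        T[i][0] = i * gap
    for j in range(1, n + 1):
        T[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + sub, T[i - 1][j] + gap, T[i][j - 1] + gap)

    # Traceback: rebuild the alignment columns from last to first.
    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if T[i][j] == T[i - 1][j - 1] + sub:      # s[i] paired with t[j]
                row_s.append(s[i - 1])
                row_t.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and T[i][j] == T[i - 1][j] + gap:    # s[i] paired with a gap
            row_s.append(s[i - 1])
            row_t.append("-")
            i -= 1
        else:                                         # t[j] paired with a gap
            row_s.append("-")
            row_t.append(t[j - 1])
            j -= 1
    return T[m][n], "".join(reversed(row_s)), "".join(reversed(row_t))


score, a, b = global_align("CATTCAC", "CTCGCAGC")
print(score)
print(a)
print(b)
```
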
End-gap free alignment
• We often don’t want to penalize gaps at the start or end of the alignment, especially when comparing short and long sequences
• Same as global alignment, except:
  • Initialize with zeros (free gaps at start)
  • Locate max in the last row/column (free gaps at end)

        λ    C    T    C    G    C    A    G    C
λ       0    0    0    0    0    0    0    0    0
C       0   10    5   10    5   10    5    0   10
A       0    5    8    5    8    5   20   15   10
T       0    0   15   10    5    6   15   18   13
T       0   -2   10   13    8    3   10   13   16
C       0   10    5   20   15   18   13    8   23
A       0    5    8   15   18   13   28   23   18
G       0    0    3   10   25   20   23   38   33
+10 for match, -2 for mismatch, -5 for gap

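The two changes relative to global alignment (zero initialization, maximum over the last row and column) can be sketched as follows. The function name is illustrative, and only the score is returned, not the alignment:

```python
def end_gap_free_score(s: str, t: str, match=10, mismatch=-2, gap=-5) -> int:
    """Best alignment score when gaps at either end of the alignment are free."""
    m, n = len(s), len(t)
    # Row 0 and column 0 stay at zero: leading gaps cost nothing.
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + sub, T[i - 1][j] + gap, T[i][j - 1] + gap)
    # Trailing gaps cost nothing: take the best cell in the last row or last column.
    return max(max(T[m]), max(row[n] for row in T))


print(end_gap_free_score("CATTCAG", "CTCGCAGC"))   # 38, the largest entry in the last row above
```
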
Local Alignment
T[i, j] = Score of optimally aligning a suffix of the first i bases of s with a suffix of the first j bases of t.
Initialize top row and leftmost column to zero.

[Table: local alignment table for s = CATTCAC (rows) and t = CTCGCAGC (columns)]
+1 for a match, -1 for a mismatch, -5 for a gap

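The local alignment recurrence is not spelled out in the slide text. The sketch below assumes the usual Smith-Waterman form, in which every cell is also allowed to restart at zero (an empty suffix) and the answer is the largest entry anywhere in the table; `local_alignment_score` is a hypothetical name:

```python
def local_alignment_score(s: str, t: str, match=1, mismatch=-1, gap=-5) -> int:
    """Best score over all pairs of substrings of s and t (assumed Smith-Waterman form)."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]   # top row and leftmost column are zero
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(
                0,                       # start a fresh local alignment here
                T[i - 1][j - 1] + sub,
                T[i - 1][j] + gap,
                T[i][j - 1] + gap,
            )
            best = max(best, T[i][j])
    return best


print(local_alignment_score("CATTCAC", "CTCGCAGC"))   # 2
```

With this scoring a gap costs more than any short run of matches can recover, so the best local alignments in the example are ungapped.
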
Some Results
• Most pairwise sequence alignment problems can be solved in O(mn) time.
• The space requirement can be reduced to O(m+n) while keeping the run-time at O(mn) [Myers88].
• Two highly similar sequences can be aligned in O(dn) time, where d is a measure of the distance between the sequences [Landau86].

Reducing space requirements
• O(mn) tables are often the limiting factor in computing large alignments
• There is a linear space technique that only doubles the time required [Hirschberg77]

        λ    C    T    C    G    C    A    G    C
λ       0    0    0    0    0    0    0    0    0
C       0   10    5   10    5   10    5    0   10
A       0    5    8    5    8    5   20   15   10
T
T
C
A
G
IDEA: We only need the previous row to calculate the next

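The previous-row idea can be sketched for the global alignment score as follows. This keeps only two rows, so it yields the score but not the alignment; recovering the alignment in linear space needs the divide-and-conquer step of the cited technique, which is not shown here. The function name is illustrative:

```python
def global_score_two_rows(s: str, t: str, match=10, mismatch=-2, gap=-5) -> int:
    """Optimal global alignment score using O(len(t)) space instead of a full table."""
    prev = [j * gap for j in range(len(t) + 1)]       # row 0 of the table
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)               # column 0 of row i
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
        prev = curr                                   # row i becomes the "previous" row
    return prev[-1]


print(global_score_two_rows("CATTCAC", "CTCGCAGC"))   # 33
```
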
Linear-space Alignments
mn + mn/2 + mn/4 + mn/8 + mn/16 + … = 2mn

Affine Gap Penalty Functions
Gap penalty = h + gk
where
k = length of a maximal sequence of gaps
h = gap opening penalty
g = gap continuation penalty

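As a small illustration of the formula, the sketch below charges h + gk for every maximal run of k gap characters in one aligned row. The values h = 10 and g = 1 are made up for the example, and this only evaluates the penalty of a given alignment; it is not the dynamic programming algorithm for optimizing under affine gaps:

```python
import re


def affine_gap_penalty(k: int, h: int, g: int) -> int:
    """Penalty for one maximal run of k gaps: opening cost h plus g per gap."""
    return h + g * k


def total_gap_penalty(aligned_row: str, h: int, g: int) -> int:
    """Sum the affine penalties over all maximal gap runs in one aligned row."""
    runs = re.findall(r"-+", aligned_row)     # each match is one maximal run of '-'
    return sum(affine_gap_penalty(len(run), h, g) for run in runs)


# Hypothetical parameters: h = 10 (opening), g = 1 (continuation).
print(total_gap_penalty("AC---GT-A", h=10, g=1))   # (10 + 3) + (10 + 1) = 24
```
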
PAM matrices
• Percent Accepted Mutation. A unit introduced by Dayhoff et al. to quantify evolutionary change within a protein sequence [Dayhoff78].
• A PAM unit is the amount of evolution that will, on average, change 1% of the amino acids within a protein sequence.

[Figure: PAM250 scoring matrix]

BLOSUM matrices
• Blocks Substitution Matrix. Scores for each position are obtained from the frequencies of substitutions in blocks of local alignments of protein sequences [Henikoff92].
• For example, BLOSUM62 is derived from sequence alignments with no more than 62% identity.

Comparison
• PAM is based on an evolutionary model using phylogenetic trees
• BLOSUM assumes no evolutionary model, but rather conserved “blocks” of proteins

Multiple Sequence Alignment
VTISCTGSSSNIGAGNHVKWYQQLPG
VTISCTGTSSNIGSITVNWYQQLPG
LRLSCSSSGFIFSSYAMYWVRQAPG
LSLTCTVSGTSFDDYYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG
ATLVCLISDFYPGAVTVAWKADS
ATLVCLISDFYPGAVTVAWKADS
AALGCLVKDYFPEPVTVSWNSG-
VSLTCLVKGFYPSDIAVEWESNG-

Induced Pairwise Alignment
S1  S - T I S C T G - S - N I
S2  L - T I - C N G S S - N I
S3  L R T I S C S G F S Q N I
Induced pairwise alignment of S1 and S2:
S1  S T I S C T G - S N I
S2  L T I - C N G S S N I

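The induced alignment is obtained by taking the two rows of interest and dropping every column in which both have a gap. A short sketch, with `induced_pairwise` as an illustrative name and the rows written without spaces:

```python
def induced_pairwise(msa: list[str], a: int, b: int) -> tuple[str, str]:
    """Rows a and b of the multiple alignment, minus columns where both are gaps."""
    kept = [(x, y) for x, y in zip(msa[a], msa[b]) if not (x == "-" and y == "-")]
    return "".join(x for x, _ in kept), "".join(y for _, y in kept)


msa = [
    "S-TISCTG-S-NI",   # S1
    "L-TI-CNGSS-NI",   # S2
    "LRTISCSGFSQNI",   # S3
]
row1, row2 = induced_pairwise(msa, 0, 1)
print(row1)   # STISCTG-SNI
print(row2)   # LTI-CNGSSNI
```
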
Sum-of-Pairs Scoring Function
Score of multiple alignment = $\sum_{i<j} \mathrm{score}(A_{ij})$
where $A_{ij}$ is the pairwise alignment of $S_i$ and $S_j$ induced by the multiple alignment.

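A sum-of-pairs score adds up the scores of all induced pairwise alignments. The sketch below assumes the simple match/mismatch/gap scoring from the DNA slides (protein alignments would normally use a PAM or BLOSUM matrix instead), and the function names are illustrative:

```python
from itertools import combinations


def induced_pair_score(row_a: str, row_b: str, match=10, mismatch=-2, gap=-5) -> int:
    """Score of the pairwise alignment induced by two rows of a multiple alignment."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue                     # column is dropped in the induced alignment
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def sum_of_pairs(msa: list[str], match=10, mismatch=-2, gap=-5) -> int:
    """Sum the induced pairwise scores over every unordered pair of rows."""
    return sum(
        induced_pair_score(a, b, match, mismatch, gap)
        for a, b in combinations(msa, 2)
    )


print(sum_of_pairs(["AC-", "A-G", "ACG"]))   # 0 + 15 + 15 = 30
```
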
Multiple Alignment
Run-time of dynamic programming solution = O(2^k n^k)
where
n = length of each sequence
k = number of sequences
Space, O(n^k), is prohibitively large!
Example: 6 sequences of length 100
6.4 × 10^13 calculations!

Carrillo-Lipman Heuristic
L = Lower bound on multiple alignment score
If [equation: pruning condition] then T[i1, i2, …, ik] cannot be on an optimal path.

Multiple Alignment to a Phylogenetic Tree
• A tree showing the evolutionary relationship between sequences is available.
• Compute a multiple alignment such that, for each edge (i, j) in the tree:
  Induced alignment between Si and Sj = Optimal alignment between Si and Sj.

Examples
[Figures: phylogenetic trees of primates and of Darwin’s finches]
http://members.aol.com/darwinpage/trees.htm

Multiple Alignment to a Tree
• Build the multiple alignment incrementally.
• To add a new sequence, an edge should connect it in the tree to a sequence already incorporated in the multiple alignment.
• Insert the new sequence according to its optimal alignment with the other sequence connected by the edge.
• Adjust other sequences in the multiple alignment.
• Run-time = time for k pairwise alignments.

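The incremental procedure above can be sketched as follows. This is one possible reading, not necessarily the exact procedure intended by the slides: each new sequence is globally aligned with its tree neighbor, and wherever that pairwise alignment puts a gap in the neighbor, a new all-gap column is opened in the existing rows. The names `add_sequence` and `align_to_tree`, the example sequence "c", and the DNA scoring values are assumptions:

```python
def global_align(s, t, match=10, mismatch=-2, gap=-5):
    """One optimal global alignment of s and t, returned as two gapped rows."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        T[i][0] = i * gap
    for j in range(1, n + 1):
        T[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + sub, T[i - 1][j] + gap, T[i][j - 1] + gap)
    a, b, i, j = [], [], m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and T[i][j] == T[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))


def add_sequence(msa, anchor, new_seq):
    """Add new_seq to msa (list of equal-length rows) via its alignment with row `anchor`."""
    a_row, b_row = global_align(msa[anchor].replace("-", ""), new_seq)
    out = [[] for _ in msa]
    new_row = []
    p = q = 0                                  # p: column in msa, q: column in pairwise alignment
    while p < len(msa[anchor]) or q < len(a_row):
        if q < len(a_row) and a_row[q] == "-":
            for r in out:                      # open a new column: gaps in all existing rows
                r.append("-")
            new_row.append(b_row[q]); q += 1
        elif msa[anchor][p] == "-":
            for r, row in zip(out, msa):       # existing column where the anchor has a gap
                r.append(row[p])
            new_row.append("-"); p += 1
        else:
            for r, row in zip(out, msa):       # anchor residue: copy column, place new symbol
                r.append(row[p])
            new_row.append(b_row[q]); p += 1; q += 1
    return ["".join(r) for r in out] + ["".join(new_row)]


def align_to_tree(seqs, edges):
    """seqs: name -> sequence. edges: (placed, new) pairs, ordered so `placed` is already in."""
    order = [edges[0][0]]
    msa = [seqs[order[0]]]
    for placed, new in edges:
        msa = add_sequence(msa, order.index(placed), seqs[new])
        order.append(new)
    return dict(zip(order, msa))


seqs = {"a": "CATTCAC", "b": "CTCGCAGC", "c": "CATCAC"}   # "c" is a made-up third sequence
for name, row in align_to_tree(seqs, [("a", "b"), ("a", "c")]).items():
    print(name, row)
```
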
Searching Biological Databases
BLAST (Basic Local Alignment Search Tool)
http://www.ncbi.nlm.nih.gov
• BLASTN (DNA)
• BLASTP (Protein)
• BLASTX (DNA against Protein)
• PSI-BLAST (Position Specific Iterative BLAST)

Multiple Alignment Software
• ClustalW (http://www.ebi.ac.uk/clusalw)
• MSA (http://softlib.rice.edu/softlib/msa.html)
• HMMER (http://hmmer.wustl.edu/)
• SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)

References
• [Dayhoff78] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, A model of evolutionary change in proteins, Atlas of Protein Sequence and Structure, 5:345-352, 1978.
• [Henikoff92] S. Henikoff and J. G. Henikoff, Amino acid substitution matrices from protein blocks, Proc. Natl. Acad. Sci., 89:10915-10919, 1992.
• [Hirschberg77] D. S. Hirschberg, Algorithms for the longest common subsequence problem, J. ACM, 24:664-675, 1977.
• [Landau86] G. M. Landau and U. Vishkin, Efficient string matching with k mismatches, Theoretical Computer Science, 43:239-249, 1986.
• [Myers88] E. Myers and W. Miller, Optimal alignments in linear space, Computer Applications in the Biosciences, 4(1):11-17, 1988.