Sequence alignment

Sequence alignment Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ Sushmita Roy sroy@biostat.wisc.edu Sep 18th, 2012 BMI/CS 576

Outline of this segment • Pairwise Sequence alignment • Local and global alignments • Gap penalities • Heuristic algorithms • Multiple sequence alignment • Phylogenetic trees • Probabilistic and parsimony methods

Sequences are related • Darwin said that all organisms are related by a common ancestor. • Sequence changed gradually from ancestral sequence through changes • Similar DNA sequence often implies similar function

Compare sequence • Enables us to ask how similar are two sequences • Closer in sequence -> closer in ancestry • Horizontal transfer in bacteria generates some exceptions • Infer function based on sequence similarity.

Pairwise alignment:task definition Given • a pair of sequences (DNA or protein) • a scoring function Do • determine the common substrings in the sequences such that the similarity score is maximized

How to compute distances between two strings? • Hamming distance • Works only with strings of equal lenght • How far are ATATATAT and TATATATA? • Edit distance between sequences s and t • Number of edits needed to transform s to t • Editing could be switches, insertions and deletions • How far are ATATATAT and TATATATA?

An alignment • A two row matrix: one row for string s and another for string t • No column can have a “gap” in both rows. • At most m+n

Alignment example (DNA) DNA sequence alignment of the RAP1 gene between S. cerevisiae and K. lactis

Alignment example (Protein) Protein sequence alignment of the RAP1 gene between S. cerevisiae and K. lactis Notice fewer aligned columns?

Issues in sequence alignment • the sequences we’re comparing typically differ in length • Insertions, deletions • there may be only a relatively small region in the sequences that matches • we want to allow partial matches (i.e. some amino acid pairs are more substitutable than others) • variable length regions may have been inserted/deleted from the common ancestral sequence

DNA sequence edits • substitutions: ACGAAGGA • insertions: ACGAACGGA • deletions: ACGAAGA Gaps

Mismatches and gaps CA--GATTCGAAT CGCCGATT---AT mismatch gap Gaps make sequence alignment complicated

Scoring an alignment: what is needed? • Scoring matrix to score alignments • s(a,b) indicates score of aligning character a with character b • A kXk matrix where k is the size of our alphabet. • Algorithm to score an alignment • Statistical model to assess how likely is the alignment to occur by random chance

BLOSUM • BLOckSUbstitution Matrix • Specifies how likely one amino acid switches into another over evolutionary time • Some AA changes are more rare than others • E.g. Serine->Phenylalanine is three times more likely than Tryptophan->Phenylalanine • AA that have more codons can tolerate more mutations. • Matrix is created from alignments themselves(!) • Consider sequences at least x% similar, exclude very similar blocks, count substitutions • BLOSUM50: sequences were no more than 50% similar of x% • Each entry is a log odds ratio of observing A and B to A alone or B alone

Example substitution scoring matrix (BLOSUM50) A5 R -2 7 N -1 -1 7 D -2 -2 2 8 C -1 -4 -2 -4 13 Q -1 1 0 0 -3 7 E -1 0 0 2 -3 2 6 G 0 -3 0 -1 -3 -2 -3 8 H -2 0 1 -1 -3 1 0 -2 10 I -1 -4 -3 -4 -2 -3 -4 -4 -4 5 L -2 -3 -4 -4 -2 -2 -3 -4 -3 25 K -1 3 0 -1 -3 21 -2 0 -3 -3 6 M -1 -2 -2 -4 -2 0 -2 -3 -1 23 -2 7 F -3 -3 -4 -5 -2 -4 -3 -4 -1 0 1 -4 0 8 P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10 S1 -1 1 0 -1 0 -1 0 -1 -3 -3 0 -2 -3 -1 5 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 25 W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1 1 -4 -4 -3 15 Y -2 -1 -2 -3 -3 -1 -2 -3 2 -1 -1 -2 0 4 -3 -2 -2 28 V 0 -3 -3 -4 -1 -3 -3 -4 -4 41 -3 1 -1 -3 -2 0 -3 -1 5 A R N D C Q E G H I L K M F P S T W Y V

Scoring Gaps • Linear score • γ(d)=-dg, where g is a fixed cost, d is the gap length • Longer the gap higher the penalty • Affine score • γ(d)=-(d +(g-1)e), d and g are fixed penalties and g<d. • d : gap open penalty • g : gap extension penalty

Sequence alignment problems • Pairwise • Compare two sequences • Global alignment (Needleman-Wunsch algorithm) • Local alignment (Smith-Waterman algorithm) • Multiple-sequence alignment • Compare >2 sequences

Global alignment • Given two sequences u and v and a scoring matrix delta, find the alignment with the maximal score. • The number of possible alignments is very big.

Number of possible alignments Sequence 1 CAATGA Sequence 2 ATTGAT CAATGA ATTGAT CAATGA-- --ATTGAT CAATGA-- AT--TGAT Alignment 1 Alignment 2 Alignment 3

Number of possible alignments • there are • possible global alignments for 2 sequences of length n • e.g. two sequences of length 100 have possible alignments • but we can use dynamic programming to find an optimal alignment efficiently

Dynamic programming • Divide and conquer • Solve a problem using pre-computed solutions for smaller subparts of the problem

Dynamic programming for sequence alignment • Two sequences: AAAC, AAG • x: AAAC • xi: Denotes the ithletter • y: AAG • yidenotes thejthletter in y. • Assume we have aligned x1..xi to y1..yj • There are three possibilities AAxiAAyj AAAAAG AA--AAG AAAAA--

A G C A A A C Dynamic programming idea • Letx’s length be n • Let y’s length be m • construct an (n+1)(m+1)matrix F • F (j, i) = score of the best alignment ofx1…xiwith y1…yj x score of best alignment of AAA to AG y

DP for global alignment with linear gap penalty • DP recurrence relation:

DP algorithm sketch: global alignment • initialize first row and column of matrix • fill in rest of matrix from top to bottom, left to right • for each F(i, j), save pointer(s) to cell(s) that resulted in best score • F(m, n) holds the optimal alignment score; • Trace back from F(m, n) to F(0, 0) to recover alignment

Global alignment example • suppose we choose the following scoring scheme: • +1 • -1 • d(penalty for aligning with a gap) =2

A G C -d -2d -3d 0 A -d A -2d A -3d C -4d Initializing matrix: global alignment with linear gap penalty

Global alignment example A G C 0 -2 -4 -6 one optimal alignment A -1 -3 -2 1 x: A A A C y: A G - C A -1 -4 0 -2 A -6 -3 -2 -1 C -8 -5 -4 -1

Computational complexity • initialization: O(m), O(n) where sequence lengths are m, n • filling in rest of matrix: O(mn) • traceback: O(m + n) • hence, if sequences have nearly same length, the computational complexity is

Summary of DP • Maximize F(j,i) using F(j-1,i-1), F(j-1,i) or F(j,i-1) • works for either DNA or protein sequences, although the substitution matrices used differ • finds an optimal alignment • the exact algorithm (and computational complexity) depends on gap penalty function (we’ll come back to this issue)

Local alignment • Look for local regions of sequence similarity. • Aligning substrings of x and y. • Try to find the best alignment over all possible substrings. • This seems difficult computationally. But can be solved by DP.

Local alignment motivation • Comparing protein sequences that share a common sequence pattern or domain (independently folded unit) but differ elsewhere • Comparing DNA sequences that share a similar sequence patternbut differ elsewhere • More sensitive when comparing highly diverged sequences • Selection on the genome differs between regions • Unconstrained regions are not very alignable

Local alignment DP • Similar to global alignment with a few exceptions • Initialize the first row and column to 0s Start a new alignment Key difference for local

Local alignment DP algorithm • initialization: first row and first column initialized with 0’s • traceback: • find maximum value of F(j, i)can be anywhere in matrix • stop when we get to a cell with value 0

A A G A 0 0 0 0 0 T 0 T 0 A 0 1 A 0 1 G 0 Local alignment example 0 0 0 0 0 0 0 0 1 0 1 2 0 1 0 0 3 1 x: A A G y: A A G

Sequence alignment