Applied Bioinformatics

Applied Bioinformatics Week 3

Theory I • Similarity • Dot plot

3.2 On sequence alignment • Sequence alignment is the most important task in bioinformatics! (Source: Introduction to Bioinformatics, LECTURE 3: SEQUENCE ALIGNMENT, http://www.personeel.unimaas.nl/Westra/Education/BioInf/slides_of_bioinformatics.htm)

3.2 On sequence alignment Sequence alignment is important for: • prediction of function • database searching • gene finding • sequence divergence • sequence assembly

3.3 On sequence similarity • Homology: genes that derive from a common ancestral gene are called homologs • Orthologous genes are homologous genes in different organisms that diverged through speciation • Paralogous genes are homologous genes in one organism that derive from gene duplication • Gene duplication: one gene is duplicated into multiple copies, which are therefore free to evolve and assume new functions

HOMOLOGOUS and PARALOGOUS

HOMOLOGOUS and PARALOGOUS versus ANALOGOUS

[Figure labels: plants, globin, Ath-g, analogs]

Causes for sequence (dis)similarity • mutation (substitution): a nucleotide at a certain location is replaced by another nucleotide (e.g.: ATA → AGA) • insertion: at a certain location one new nucleotide is inserted in between two existing nucleotides (e.g.: AA → AGA) • deletion: at a certain location one existing nucleotide is deleted (e.g.: ACTG → AC-G) • indel: an insertion or a deletion

Similarity • We can only measure current similarity • We can form hypotheses

Similarity Searching • DotPlot • Needleman-Wunsch • Smith-Waterman • FASTA • BLAST

Dot Plot • Writing one sequence horizontally • Writing the other vertically • At each intersection with equal nucleotides make a dot in the matrix
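The three steps above can be sketched in a few lines of Python. This is only an illustration: the function names `dot_plot` and `render` and the two short example sequences are assumptions, not part of the lecture material.

```python
def dot_plot(horizontal, vertical):
    """Return a matrix of booleans: True where the two letters are equal."""
    # One row per letter of the vertical sequence,
    # one column per letter of the horizontal sequence.
    return [[h == v for h in horizontal] for v in vertical]


def render(matrix, horizontal, vertical):
    """Draw the matrix as text, using '*' for a dot and '.' for no dot."""
    lines = ["  " + horizontal]
    for letter, row in zip(vertical, matrix):
        lines.append(letter + " " + "".join("*" if dot else "." for dot in row))
    return "\n".join(lines)


plot = dot_plot("ACGT", "AGGT")
print(render(plot, "ACGT", "AGGT"))
```

Matching stretches show up as runs of dots along a diagonal; isolated dots are chance matches between single letters.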

Dot Plot

Dot Plot • Messy? • Strong similarities can be visually enhanced • Select a window size and a similarity score for that window (e.g. 10 and 8) • Create a new matrix with dots where the window score >= 8
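A minimal sketch of the window filter, assuming the window runs along the diagonal and the dot is placed at the window's starting position (some tools center the window instead). The function name `windowed_dot_plot` and the example sequences are hypothetical.

```python
def windowed_dot_plot(horizontal, vertical, window=10, threshold=8):
    """Dot at (i, j) when at least `threshold` of the `window` letter pairs
    along the diagonal starting at (i, j) are equal."""
    rows = max(len(vertical) - window + 1, 0)
    cols = max(len(horizontal) - window + 1, 0)
    plot = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Count equal letters along the diagonal inside the window.
            matches = sum(
                vertical[i + k] == horizontal[j + k] for k in range(window)
            )
            row.append(matches >= threshold)
        plot.append(row)
    return plot


# Small example with a window of 4 and a threshold of 3.
plot = windowed_dot_plot("ACGTACGT", "ACGAACGT", window=4, threshold=3)
for row in plot:
    print("".join("*" if dot else "." for dot in row))
```

Raising the threshold toward the window size removes more noise but also hides diagonals that contain mismatches.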

Dot Plot

Dot Plot Interpretation

Creating a Dot Plot

End Theory I • Mindmapping • 10 min break

Practice I • Dot plot

Dot Plot • ACGTGTGCGTTTGAAC • GGGTGTTCGTTTAAAC • Make a Dot plot for the two sequences above • Use a window of 3 to refine the view • Can you use Excel? • Get any two DNA sequences and try the tool below • http://www.vivo.colostate.edu/molkit/dnadot/

Definitions • Optimal alignment: the one that exhibits the most correspondences, i.e. the alignment with the highest score. It may or may not be biologically meaningful. • Global alignment: Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences. • Local alignment: Smith-Waterman (1981) gives the highest-scoring local match between two sequences.

How can we find an optimal alignment? • ACGTCTGATACGCCGTATAGTCTATCT • CTGAT---TCG-CATCGTC--T-ATCT • How many possible alignments? C(27,7) gap positions = ~888,000 possibilities • Dynamic programming: The Needleman & Wunsch algorithm

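Time Complexity • Consider two sequences: AAGT and AGTC • How many possible alignments do the 2 sequences have? • $\binom{2n}{n} = \frac{(2n)!}{(n!)^2} = \Theta\left(2^{2n}/\sqrt{n}\right)$, which grows exponentially in n • For n = 4: $\binom{8}{4} = 70$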
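The two counts quoted on these slides can be checked directly with the standard library. The variable names below are illustrative only.

```python
from math import comb, factorial

# Ways to choose 7 gap positions among 27 alignment columns.
gap_placements = comb(27, 7)
print(gap_placements)  # 888030, i.e. roughly 888,000

# Central binomial coefficient for two sequences of length n = 4.
n = 4
alignments = comb(2 * n, n)
print(alignments)  # 70

# Same value from the factorial form (2n)! / (n!)^2.
assert alignments == factorial(2 * n) // factorial(n) ** 2
```

Because the count grows exponentially with n, enumerating every alignment is not feasible for realistic sequence lengths, which is the motivation for dynamic programming.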

Scoring a sequence alignment • Match/mismatch score: +1/+0 • Open/extension penalty: –2/–1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
• Matches: 18 × (+1) • Mismatches: 2 × 0 • Open: 2 × (–2) • Extension: 5 × (–1) • Score = 18 + 0 – 4 – 5 = +9
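The tally above can be reproduced with a short function. This is a sketch under one assumption taken from the slide's own counts: the first position of each gap is charged the open penalty and every further position of the same gap is charged the extension penalty. The name `score_alignment` is hypothetical.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score two already-aligned strings of equal length ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False  # True while we are inside a run of gap columns
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            # First gap column pays the open penalty, later ones the extension.
            # Simplification: a gap that switches from one sequence to the
            # other without a letter column in between counts as one run.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # 9
```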

Pairwise Global Alignment • Computationally: • Given: a pair of sequences (strings of characters) • Output: an alignment that maximizes the similarity

Needleman-Wunsch Alg

Needleman-Wunsch Alg • Which Alignment is better? • For scoring use: • Match 1 • Mismatch 0 • Gap open -2 • Gap extension -1 • How can substitution matrices be integrated?

Needleman & Wunsch • Place each sequence along one axis • Place score 0 at the upper-left corner • Fill in 1st row & column with gap penalty multiples • Fill in the matrix with the max value of 3 possible moves: • Vertical move: Score + gap penalty • Horizontal move: Score + gap penalty • Diagonal move: Score + match/mismatch score • The optimal alignment score is in the lower-right corner • To reconstruct the optimal alignment, trace back where the max at each step came from and stop when you reach the origin.

Three steps in Needleman-Wunsch Algorithm • Initialization • Scoring • Trace back (Alignment) • Consider two DNA sequences to be globally aligned: ATCG (x = 4, length of sequence 1) and TCG (y = 3, length of sequence 2) Pooja Anshul Saxena, University of Mississippi

Scoring Scheme • Match Score = +1 • Mismatch Score = -1 • Gap penalty = -1 • Substitution Matrix

Initialization Step • Create a matrix with x + 1 rows and y + 1 columns • The 1st row and the 1st column of the score matrix are filled with multiples of the gap penalty
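A minimal sketch of the initialization step for the example sequences ATCG and TCG with a gap penalty of -1. The function name `init_matrix` is an assumption, and the code uses 0-based indices, so the corner cell is `matrix[0][0]`.

```python
def init_matrix(seq1, seq2, gap=-1):
    """Create the (len(seq1)+1) x (len(seq2)+1) score matrix and fill
    the first row and first column with multiples of the gap penalty."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        matrix[i][0] = i * gap  # first column: 0, -1, -2, ...
    for j in range(cols):
        matrix[0][j] = j * gap  # first row: 0, -1, -2, ...
    return matrix


matrix = init_matrix("ATCG", "TCG")
for row in matrix:
    print(row)
```

The remaining cells stay at 0 here only as placeholders; they are overwritten in the scoring step.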

Scoring • The score of any cell C(i, j) is the maximum of: • score_diag = C(i-1, j-1) + S(i, j) • score_up = C(i-1, j) + g • score_left = C(i, j-1) + g • where S(i, j) is the substitution score for the letters compared at cell (i, j), and g is the gap penalty

Scoring … • Example: The calculation for the cell C(2, 2), the first cell after the gap row and column (A versus T, a mismatch): • score_diag = C(i-1, j-1) + S(i, j) = 0 + (-1) = -1 • score_up = C(i-1, j) + g = -1 + (-1) = -2 • score_left = C(i, j-1) + g = -1 + (-1) = -2 • C(2, 2) = max(-1, -2, -2) = -1

Scoring … • Final Scoring Matrix
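The final scoring matrix for ATCG and TCG can be recomputed from the recurrence. The sketch below assumes sequence 1 runs along the rows and uses the scoring scheme of match +1, mismatch -1, gap -1. The name `nw_matrix` is hypothetical, and indices are 0-based, so the slide's cell C(2, 2) is `c[1][1]` here.

```python
def nw_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill the Needleman-Wunsch score matrix (seq1 on rows, seq2 on columns)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    c = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        c[i][0] = i * gap
    for j in range(1, cols):
        c[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            c[i][j] = max(
                c[i - 1][j - 1] + s,  # diagonal: match or mismatch
                c[i - 1][j] + gap,    # from above
                c[i][j - 1] + gap,    # from the left
            )
    return c


c = nw_matrix("ATCG", "TCG")
for row in c:
    print(row)
```

The lower-right cell holds the optimal global alignment score, which is 2 for this pair.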

Trace back • The trace back step determines the actual alignment(s) that result in the maximum score • There are likely to be multiple maximal alignments • Trace back starts from the last cell, i.e. the lower-right corner of the matrix • Gives the alignment in reverse order

Trace back … • There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left • Trace back takes the current cell and looks at the neighbor cells that could be direct predecessors: with sequence 1 along the rows, the neighbor to the left (gap in sequence #1), the diagonal neighbor (match/mismatch), and the neighbor above (gap in sequence #2) • The algorithm then chooses one of the possible predecessors as the next cell in the path

Trace back … • From the last cell, the only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. • This gives us a current alignment of Seq 1: G and Seq 2: G

Trace back … • Final Trace back • Best Alignment:
A T C G
  | | |
_ T C G
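Putting the three steps together, a compact sketch of the whole procedure might look as follows. It assumes sequence 1 on the rows, the scoring scheme of match +1, mismatch -1, gap -1, and a fixed tie-breaking order (diagonal, then up, then left), so it returns only one of possibly several optimal alignments. The name `needleman_wunsch` is hypothetical.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_seq1, aligned_seq2) for a global alignment."""
    rows, cols = len(seq1) + 1, len(seq2) + 1

    def s(i, j):
        # Substitution score for the letters compared at cell (i, j).
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    # Initialization and scoring.
    c = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        c[i][0] = i * gap
    for j in range(1, cols):
        c[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            c[i][j] = max(c[i - 1][j - 1] + s(i, j),
                          c[i - 1][j] + gap,
                          c[i][j - 1] + gap)

    # Trace back from the lower-right corner to the origin.
    i, j = len(seq1), len(seq2)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and c[i][j] == c[i - 1][j - 1] + s(i, j):
            top.append(seq1[i - 1]); bottom.append(seq2[j - 1])
            i -= 1; j -= 1
        elif i > 0 and c[i][j] == c[i - 1][j] + gap:
            # Move up: a letter of sequence 1 faces a gap in sequence 2.
            top.append(seq1[i - 1]); bottom.append("-")
            i -= 1
        else:
            # Move left: a letter of sequence 2 faces a gap in sequence 1.
            top.append("-"); bottom.append(seq2[j - 1])
            j -= 1
    # The alignment was collected in reverse order.
    return c[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


score, a1, a2 = needleman_wunsch("ATCG", "TCG")
print(score)
print(a1)
print(a2)
```

For the example pair this yields a score of 2 with ATCG aligned over -TCG, in agreement with the best alignment shown on the slide.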

Local Alignment • Problem first formulated by: • Smith and Waterman (1981) • Problem: • Find an optimal alignment between a substring of s and a substring of t • Algorithm: • A variant of the basic algorithm for global alignment
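As a rough illustration of how the local variant differs from the global algorithm, the sketch below applies the usual Smith-Waterman changes: the first row and column are 0, no cell may drop below 0, and trace back starts at the highest-scoring cell and stops at a 0. The scoring values (match +1, mismatch -1, gap -1), the function name `smith_waterman`, and the example strings are assumptions for this sketch, not taken from the slides.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_part_of_s, aligned_part_of_t) for a local alignment."""
    rows, cols = len(s) + 1, len(t) + 1
    h = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            h[i][j] = max(0,
                          h[i - 1][j - 1] + sub,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1])
            i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


result = smith_waterman("TTTACGTAAA", "GGACGTCC")
print(result)
```

In this example the shared substring ACGT is recovered while the unrelated flanking letters are ignored, which a global alignment could not do.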

Motivation • Searching for unknown domains or motifs within proteins from different families • Proteins encoded by Homeobox genes (only conserved in 1 region called the Homeo domain, 60 amino acids long) • Identifying active sites of enzymes • Comparing long stretches of anonymous DNA • Querying databases where the query is much smaller than the sequences in the database • Analyzing repeated elements within a single sequence
