Bioinformatics Alignment of biological sequences

Bioinformatics Alignment of biological sequences UL, 2019, Juris Viksna

Topics • Short review of sequence comparison: • biological motivation for comparing sequences • sequence similarity criteria • basic DP algorithm for computing the distance between sequences • Global and local sequence comparison • similarity matrices and gap penalties • modified algorithms that use gap penalties • local sequence comparison • Similarity matrices • how to obtain them • relation between similarity matrices and sequence evolution • suitability of matrices for specific sequences

Comparison of biological sequences • Comparison of two sequences (pairwise alignment): • the formulation of the problem • DP algorithm (match = 1, mismatch = 1, gap = 2) • global and local comparisons • affine gap penalties • similarity matrices • Multiple alignment • the formulation of the problem (SOP) • Star alignment • relation with phylogenetic trees, progressive alignment • Sequence classification: profiles and motifs • profile matrices • HMM (Hidden Markov Models)

Why do we need to compare sequences? • The genome is already sequenced (assume...) • There are methods that predict DNA coding regions (genes) • What are the biological functions of these genes? • We can find out what protein (sequence) a gene encodes • But we still do not know what this protein does... • However, we can search for proteins with similar sequences whose functions are already known • We want to find out something about proteins in humans • The best approach is “experimental”, but tricky with humans... • But we can take a similar protein (e.g. in mice) and start our experiments with it

Basic assumptions • We will consider proteins and RNA/DNA just as sequences over 20-letter and 4-letter alphabets, respectively • Aims of comparison: • to find out how similar the sequences are (some similarity measure) • to find a “common motif” of the sequences (alignment) • Regarding algorithmic complexity there are two distinct cases: • comparison of two sequences (relatively easy) • simultaneous comparison of n sequences (complexity grows exponentially with n) • In this lecture we will consider the problem of comparing two sequences

Nucleotides and DNA [Watson, Crick 1953] For us, DNA is a sequence over a 4-letter alphabet [Adapted from Y.Guo]

Proteins For our purposes we will treat proteins as sequences over a 20-symbol alphabet [Adapted from R.Shamir]

From DNA to proteins • Each codon consists of 3 nucleotides • Mutations: • Substitution: changes a single aa • Insertion/deletion: “frame shift” (changes all subsequent aa) • NB! The length of an insertion/deletion might be a multiple of 3, in which case the reading frame is preserved • “Silent mutation”: DNA changed, but not the aa • “Nonsense mutation”: creates a “stop” codon
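The mutation types listed above can be illustrated with a minimal Python sketch. It assumes the standard genetic code written as a 64-letter string with bases in TCAG order; the names `translate`, `CODON_TABLE` and the example DNA strings are hypothetical and chosen only for illustration.

```python
BASES = "TCAG"
# Standard genetic code; codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate DNA codon by codon; trailing 1-2 bases are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


original = "ATGAAATGGCTG"       # hypothetical coding fragment
substitution = "ATGAGATGGCTG"   # AAA -> AGA changes one amino acid
silent = "ATGAAGTGGCTG"         # AAA -> AAG encodes the same amino acid
nonsense = "ATGAAATGACTG"       # TGG -> TGA creates a stop codon
frameshift = "ATGCAAATGGCTG"    # one inserted base shifts all later codons

for name in ("original", "substitution", "silent", "nonsense", "frameshift"):
    print(name, translate(globals()[name]))
```

Note how the single-base insertion changes every amino acid after the insertion point, while the silent substitution leaves the protein unchanged.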

Genetic code Completely worked out in 1962

Evolution of sequences • Mutations are a natural process of DNA evolution • DNA replication errors: • substitutions • insertions and deletions (together: indels) • Similarity between sequences: • indicates their common ancestral origin • indicates similarity of biological functions • This is of course a simplification: the change in protein function determines whether the organism will have offspring and whether the changed gene will survive • Protein sequence similarity is closely associated with similarity of the DNA coding regions


Sequence evolution [Figure: example DNA sequences diverging by mutation: ggcatt, agcatt, agcata, agccta, aggatt, agcatg, gacatt]

Sequence homology • Homologs: evolved from a common ancestor • Orthologs: the same function in different organisms • Paralogs: similar function in the same organism

Orthologs vs paralogs [Adapted from R.Shamir]

How to compare sequences? Given two proteins:
>sp|P69905|HBA_HUMAN Hemoglobin
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPT
TKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDM
PNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP
AEFTPAVHASLDKFLASVSTVLTSKYR
>tr|Q61287|Q61287_MOUSE Hemoglobin
MVLSGEDKSNIKAAWGKIGGHGAEYVAEALERMFASFP
TTKTYFPHFDVSHGSAQVKGHGKKVADALASAAGHLDD
LPGALSALSDLHAHKLRVDPVNFKLLSHCLLVTLASHH
PADFTPAVHASLDKFLASVSTVLTSKYR
How to assess their similarity?

Sequence alignment - BLAST

Sequence alignment – the results we expect

Sequence alignment - SSEARCH

Sequence alignment - scores • Sequence similarity/identity (%): well-defined for the aligned parts of the sequences • “Score”: usually very method-specific in absolute value • p-value: the probability that an alignment with the given score or higher is found by chance; the reported values are normally only approximations • Expect (E) value: the number of hits one can "expect" to see by chance when searching a database of a particular size; the lower the better, but short similar sequences might have comparatively high values (the E-value decreases exponentially with the score) • Z-score: the number of standard deviations from the mean value
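The exponential dependence of the E-value on the score can be written down explicitly. The form below is the Karlin-Altschul approximation commonly quoted for BLAST-style local alignment statistics; it is given here as background, not as something stated on the slide. The symbols m (query length), n (database size), S (raw score) and the scoring-system constants K and \lambda are notation introduced for this illustration.

```latex
E = K \, m \, n \, e^{-\lambda S},
\qquad
P(\text{at least one hit with score} \ge S) = 1 - e^{-E}
```

For small E the p-value and the E-value are nearly equal, since 1 - e^{-E} \approx E.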

Z-score
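As a minimal sketch of the definition, a Z-score can be computed from the score of the real alignment and a list of background scores. How the background is obtained is an assumption here (for example, scores against shuffled copies of one sequence); the function name `z_score` is hypothetical.

```python
from statistics import mean, pstdev


def z_score(score, background_scores):
    """Number of standard deviations by which `score` exceeds the background mean."""
    mu = mean(background_scores)
    sigma = pstdev(background_scores)  # population standard deviation
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (score - mu) / sigma


# Hypothetical background: scores obtained for eight shuffled sequences.
background = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score(10, background))
```

With this background (mean 5, standard deviation 2) a real score of 10 lies 2.5 standard deviations above the mean.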

How to align two sequences - BLAST • Find two regions of exact similarity (usually 4 aa each) • Try to join and extend these matches until the score falls below a threshold • Anyway, how should we do this “correctly”?
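The seed-and-extend idea can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions (exact k-mer seeds, ungapped extension to the right only, match = +1, mismatch = -1, a fixed drop-off of 2), not the actual BLAST implementation; `find_seeds` and `extend_right` are hypothetical names.

```python
from collections import defaultdict


def find_seeds(a, b, k=4):
    """Return all (i, j) such that a[i:i+k] == b[j:j+k]."""
    index = defaultdict(list)
    for j in range(len(b) - k + 1):
        index[b[j:j + k]].append(j)
    seeds = []
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], []):
            seeds.append((i, j))
    return seeds


def extend_right(a, b, i, j, k, match=1, mismatch=-1, drop=2):
    """Extend a seed without gaps; stop once the score falls `drop` below the best.

    Returns (best score, length of the best-scoring segment starting at the seed).
    """
    score = best = k * match
    best_len = k
    x, y = i + k, j + k
    while x < len(a) and y < len(b) and best - score <= drop:
        score += match if a[x] == b[y] else mismatch
        x += 1
        y += 1
        if score > best:
            best, best_len = score, x - i
    return best, best_len


seeds = find_seeds("MVLSPADK", "AALSPADR")
print(seeds)
print(extend_right("MVLSPADK", "AALSPADR", 2, 2, 4))
```

Indexing the k-mers of one sequence makes seed lookup fast, which is the reason such heuristics scale to database searches where full dynamic programming would be too slow.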

The “Manhattan Tourist” problem Visit as many sights as possible starting from top-left corner and moving just down or right

Longest common subsequence Given two sequences A and B, find a longest possible sequence C that is a subsequence of both A and B (such a C need not be unique). Example: A = GGATATCGGGCGAT B = ATTCCCCCGCCCTA C = ATTCGCA or ATTCGCT How can we find it?

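LCS – dynamic programming solution
A = a_1 a_2 \dots a_n, B = b_1 b_2 \dots b_m
c(i,k): length of the LCS of a_1 a_2 \dots a_i and b_1 b_2 \dots b_k
\[
c(i,k) =
\begin{cases}
0, & \text{if } i = 0 \text{ or } k = 0 \\
c(i-1,k-1) + 1, & \text{if } i, k > 0 \text{ and } a_i = b_k \\
\max\{c(i,k-1),\; c(i-1,k)\}, & \text{if } i, k > 0 \text{ and } a_i \ne b_k
\end{cases}
\]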
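The recurrence can be turned into a short Python sketch that fills the table c(i,k) and then traces back one LCS. The function name `lcs` and the tie-breaking rule in the traceback are choices made for this illustration; when several LCSs exist, a different rule may return a different one.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    n, m = len(a), len(b)
    c = [[0] * (m + 1) for _ in range(n + 1)]   # c[i][k] = LCS length of a[:i], b[:k]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            if a[i - 1] == b[k - 1]:
                c[i][k] = c[i - 1][k - 1] + 1
            else:
                c[i][k] = max(c[i][k - 1], c[i - 1][k])
    # Trace back from the bottom-right corner to recover the subsequence itself.
    out = []
    i, k = n, m
    while i > 0 and k > 0:
        if a[i - 1] == b[k - 1]:
            out.append(a[i - 1])
            i -= 1
            k -= 1
        elif c[i - 1][k] >= c[i][k - 1]:
            i -= 1
        else:
            k -= 1
    return "".join(reversed(out))


print(lcs("GGATATCGGGCGAT", "ATTCCCCCGCCCTA"))
print(lcs("GADTAMAWGRAMMA", "GAGAWKIAMM"))
```

Both example pairs are taken from the slides; each has an LCS of length 7.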

LCS – example A = GADTAMAWGRAMMA B = GAGAWKIAMM

LCS - example

LCS - example LCS: GAAWAMM Alignment:
GA-DTAMAW--GRAMMA
GAG----AWKI--AMM-

Edit distance • Levenshtein 1966 • Minimal number of operations that transform one sequence into another • Operations: insert, delete, substitute (one symbol each) • Edit distance is 0 (sequences are identical) or positive • For example, “AIMS” and “AMOS” have distance 2, e.g.:
AIMS    AIM-S
AMOS    A-MOS
(two substitutions, or one deletion and one insertion) [Adapted from D.Gilbert]

Edit distance Given two sequences A and B, find the smallest possible number of insertion, deletion and substitution operations that changes A to B. Example: A = GGATATCGGGCGAT B = ATTCCCCCGCCCTA [G][G]AT[A]T[C][C][C][C]CG[G-C][G-C]C[G-C]C[A]T[A] ED = 12? How can we find it?

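Edit distance
A = a_1 a_2 \dots a_n, B = b_1 b_2 \dots b_m
e(i,k): edit distance between a_1 a_2 \dots a_i and b_1 b_2 \dots b_k
\[
e(i,k) =
\begin{cases}
i, & \text{if } k = 0 \\
k, & \text{if } i = 0 \\
e(i-1,k-1), & \text{if } i, k > 0 \text{ and } a_i = b_k \\
\min\{e(i-1,k-1),\; e(i,k-1),\; e(i-1,k)\} + 1, & \text{if } i, k > 0 \text{ and } a_i \ne b_k
\end{cases}
\]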
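A direct transcription of this recurrence into Python might look as follows. It is a minimal sketch that keeps the full table in memory; the function name `edit_distance` is chosen for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (full-table version)."""
    n, m = len(a), len(b)
    e = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        e[i][0] = i                      # delete all i symbols of a[:i]
    for k in range(m + 1):
        e[0][k] = k                      # insert all k symbols of b[:k]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            if a[i - 1] == b[k - 1]:
                e[i][k] = e[i - 1][k - 1]
            else:
                e[i][k] = 1 + min(e[i - 1][k - 1],   # substitution
                                  e[i][k - 1],       # insertion
                                  e[i - 1][k])       # deletion
    return e[n][m]


print(edit_distance("AIMS", "AMOS"))
print(edit_distance("GGATATCGGGCGAT", "ATTCCCCCGCCCTA"))
```

The first call reproduces the distance 2 of the “AIMS”/“AMOS” example; the second computes the exact value for the pair whose distance the slide only estimates.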

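ED - modifications
\[
e(i,0) = i \cdot t, \qquad e(0,j) = j \cdot t
\]
\[
e(i,j) = \min
\begin{cases}
e(i-1,j) + t \\
e(i,j-1) + t \\
e(i-1,j-1) + t(a_i,b_j)
\end{cases}
\]
Here t is the cost of a gap and t(a_i,b_j) the cost of pairing a_i with b_j; for ED (t = 1) the boundary values are e(i,0) = i and e(0,j) = j.
If you are interested in the result only «up to a sign», it does not matter whether min or max is used. min is more natural for ED, max for LCS. max is also the usual choice for sequence comparison. For protein comparison, t(a_i,b_j) is based on the probability that amino acid a_i changes into amino acid b_j.
For ED: t = 1, t(a_i,b_j) = 0 if a_i = b_j, t(a_i,b_j) = 1 if a_i \ne b_j
For «inverse» LCS: t = 0, t(a_i,b_j) = 1 if a_i = b_j, t(a_i,b_j) = 0 if a_i \ne b_j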
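The generalized recurrence can be sketched as one Python function in which the gap cost, the pair score and the choice of min or max are parameters. The name `dp_score` and its signature are assumptions made for this illustration; the boundary values are taken as i·t and j·t so that both special cases come out right.

```python
def dp_score(a, b, gap, pair_score, choose=min):
    """Generic alignment DP: `choose` is min (distance) or max (similarity)."""
    n, m = len(a), len(b)
    e = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        e[i][0] = i * gap
    for j in range(1, m + 1):
        e[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e[i][j] = choose(e[i - 1][j] + gap,
                             e[i][j - 1] + gap,
                             e[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]))
    return e[n][m]


# Edit distance: gap cost 1, mismatch cost 1, minimize.
ed = dp_score("AIMS", "AMOS", 1, lambda x, y: 0 if x == y else 1, min)
# LCS length: gap cost 0, match score 1, maximize.
lcs_len = dp_score("GADTAMAWGRAMMA", "GAGAWKIAMM", 0,
                   lambda x, y: 1 if x == y else 0, max)
print(ed, lcs_len)
```

Replacing `pair_score` with a lookup in a substitution matrix and `gap` with a negative penalty turns the same loop into global alignment scoring with max.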

Substitution (similarity) matrices
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  8 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  7 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  6 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 10  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  6 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
• Similarity matrix • Most popular: • PAM • BLOSUM • Gonnet • The one shown is BLOSUM62 (almost :)
"Traditional assumption": the substitution score is > 0 for substitutions that occur more frequently than expected by chance, and < 0 for those that occur less frequently than expected by chance.
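The "traditional assumption" corresponds to the usual log-odds form of a substitution score. The formula below is the standard textbook formulation rather than something stated on the slide; q_{ab} (observed frequency of the aligned pair a, b in trusted alignments), p_a and p_b (background frequencies of the two amino acids) and the scaling constant \lambda are symbols introduced here.

```latex
s(a,b) = \frac{1}{\lambda} \, \log \frac{q_{ab}}{p_a \, p_b}
```

A pair seen more often than chance predicts has q_{ab} > p_a p_b and therefore a positive score; a pair seen less often gets a negative score.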

Sequence similarity as the longest path problem We can treat the matrix as a graph with weighted edges. The problem then translates to finding the path with the largest (or smallest) weight in a directed acyclic graph.

Complexity of similarity computation Size of matrix: nm Computing the value of each cell: constant time Total time: Θ(nm) Total memory: Θ(nm) Notice that if we want just the score, only two rows are needed. In this case the required memory is O(min(n, m)). However, what if we also need the alignment (and we usually do)?
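The two-row observation can be sketched in Python for the edit distance. This version returns only the score, not the alignment; the function name `edit_distance_two_rows` is chosen for this example.

```python
def edit_distance_two_rows(a, b):
    """Edit distance using only two rows of the DP table."""
    if len(b) > len(a):
        a, b = b, a                     # rows have the length of the shorter string
    prev = list(range(len(b) + 1))      # row 0: e(0, k) = k
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)        # e(i, 0) = i
        for k in range(1, len(b) + 1):
            if a[i - 1] == b[k - 1]:
                cur[k] = prev[k - 1]
            else:
                cur[k] = 1 + min(prev[k - 1], cur[k - 1], prev[k])
        prev = cur
    return prev[len(b)]


print(edit_distance_two_rows("AIMS", "AMOS"))
```

Swapping the arguments is safe because the edit distance is symmetric, and it keeps the memory proportional to the shorter sequence. The traceback information is lost, which is exactly the difficulty the question at the end of the slide points to.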

Edit distance in linear space?

Interpretation of comparison results Alignment grid (edit graph). Every alignment is a path from (0,0) to (n,m).
