EE3J2 Data Mining Lecture 14: Sequence Analysis & Dynamic Programming Martin Russell


Objectives
• To understand the principle of Dynamic Programming (DP)
• To understand how DP can be used to compute the distance between two sequences of data
• To understand what is meant by, and how to compute:
  • An alignment path
  • The DP recurrence equation
  • The distance matrix
  • The accumulated distance matrix
  • The optimal path

Sequences
• Sequential data are common in real applications:
  • DNA analysis in bioinformatics and forensic science
    • Sequences of the letters A, G, C and T
  • Signature recognition biometrics
  • Words and text
    • Spelling and grammar checkers, author verification, ...
  • Speech, music and audio
    • Speech/speaker recognition, speech coding and synthesis
    • Electronic music
  • Radar signature recognition…

Retrieving sequential data
• Large corpora of sequential data
  • E.g. DNA databases
• Individual sequences may not be amenable to human interpretation
• Need for automated sequential data retrieval
• Need to ‘mine’ sequential data
• Fundamental requirement is a measure of distance between two sequences

Basic definitions
• In a typical sequence analysis application we have a basic alphabet A consisting of N symbols
• Examples:
  • In conventional text A is the set of letters plus punctuation plus ‘white space’
  • In bioinformatics, scientists use {A, C, G, T} to describe DNA sequences

Sequences of continuous variables
• In some important applications the elements of a discrete sequence are taken from a continuous vector space, rather than a finite set
• Sequences of continuous variables can be dealt with in two ways:
  • Directly
  • Clustering can be used to represent the space as a set of K centroids, and each vector can be replaced by its closest centroid (Vector Quantisation)
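
The vector quantisation step can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the K centroids have already been found by some clustering method, and the function name `quantise` and the example values are hypothetical, not taken from the lecture.

```python
import math

def quantise(vectors, centroids):
    """Replace each vector by the index of its closest centroid.

    Assumes the centroids were produced beforehand by a clustering step;
    closeness is measured with Euclidean distance.
    """
    symbols = []
    for v in vectors:
        # Index of the centroid with the smallest Euclidean distance to v
        nearest = min(range(len(centroids)),
                      key=lambda k: math.dist(v, centroids[k]))
        symbols.append(nearest)
    return symbols

# Made-up example with K = 2 centroids in two dimensions
centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = [(1.0, 1.0), (9.0, 8.0), (0.0, 2.0)]
symbols = quantise(vectors, centroids)
```

The output is a sequence of centroid indices, so a continuous-valued sequence becomes a sequence over a finite alphabet of K symbols.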

Distance between sequences (1)
• Consider two sequences of symbols from the alphabet A = {A, B, C, D}
• How similar are the sequences:
  • S1 = ABCD
  • S2 = ABD
• Intuitively S2 is obtained from S1 by deleting C
• An alternative explanation is that S2 is obtained from S1 by substituting D for C and then deleting D

Distance between sequences (2)
• Alternatively S2 was obtained from S1 by deleting ABCD and inserting ABD
• …
• Why is the first explanation intuitively ‘correct’ and the last one intuitively ‘wrong’?
• Because we favour the simplest explanation, the one which involves the minimum number of insertions, deletions and substitutions
• Or do we?

Distance between sequences (3)
• Consider:
  • S1 = AABC
  • S2 = SABC
  • S3 = PABC
  • S4 = ASCB
• If we know that these sequences were typed then S2 is closer to S1 than S3 is, because A and S are adjacent on a keyboard
• Similarly S4 is close to S2 because letter-swapping (SA → AS etc.) is a common typographical error

Alignments
• The relationship between two sequences can be expressed as an alignment between their elements
• E.g. insertion
• [Figure: alignment of ABBC against ABC, illustrating an insertion]

Alignment: deletion and substitution
• [Figure: alignment of AXC against ABC, illustrating a substitution]
• [Figure: alignment of AC against ABC, illustrating a deletion]
• N.B. All edits are described relative to the vertical string

General alignment path
• [Figure: an alignment path between ACXCCD and ABCD]
• Which alignment is best?

The Distance Matrix
• Suppose that d(A,B) denotes the distance between the alphabet symbols A and B
• Examples:
  • d(A,B) = 0 if A = B, otherwise d(A,B) = 1
  • In typing, d(A,B) might indicate how unlikely it is that A would be mistyped as B
  • In bioinformatics d(A,B) might indicate how unlikely it is that symbol A in a DNA string is replaced by B
  • For continuous valued sequences d could be Euclidean distance

Notation
• Suppose we have an alphabet: $A = \{a_1, a_2, \dots, a_N\}$
• A distance matrix for A is an N × N matrix $D = [d_{m,n}]$, where $d_{m,n} = d(a_m, a_n)$ is the distance between the mth and nth alphabet symbols
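
As an illustration, a distance matrix for a small alphabet could be built as follows. This is a sketch under assumptions: it uses the simple 0/1 symbol distance from the examples, and the names `zero_one` and `distance_matrix` are hypothetical helpers introduced here.

```python
def zero_one(a, b):
    """0/1 symbol distance: 0 if the symbols are equal, otherwise 1."""
    return 0 if a == b else 1

def distance_matrix(alphabet, d=zero_one):
    """Return the N x N matrix whose (m, n) entry is d(alphabet[m], alphabet[n])."""
    return [[d(a, b) for b in alphabet] for a in alphabet]

# Example alphabet with N = 4 symbols
D = distance_matrix("ABCD")
```

With the 0/1 distance the matrix has zeros on the diagonal and ones elsewhere; a typing or DNA application would substitute its own symbol distance for `zero_one`.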

The Accumulated Distance
• Consider two sequences:
  • S1 = ABCD
  • S2 = ACXCCD
• For any alignment path p between S1 and S2 we define the accumulated distance between S1 and S2, denoted by ADp(S1,S2), to be the sum over all the nodes of p of the corresponding distances between elements of S1 and S2.
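
The definition can be sketched in Python as below. The sketch makes several assumptions: a path is represented as a list of 1-based nodes (i, j), meaning that the ith symbol of S1 is aligned with the jth symbol of S2; the 0/1 symbol distance is used; and the example path is made up for illustration rather than being a path from the slides.

```python
def zero_one(a, b):
    """0/1 symbol distance: 0 if the symbols are equal, otherwise 1."""
    return 0 if a == b else 1

def accumulated_distance(path, s1, s2, d=zero_one):
    """Sum the symbol distances over all nodes (i, j) of an alignment path.

    Nodes are 1-based, so node (i, j) pairs s1[i - 1] with s2[j - 1].
    """
    return sum(d(s1[i - 1], s2[j - 1]) for i, j in path)

s1, s2 = "ABCD", "ACXCCD"
# Hypothetical path: A-A, B-C, B-X, C-C, C-C, D-D
path = [(1, 1), (2, 2), (2, 3), (3, 4), (3, 5), (4, 6)]
total = accumulated_distance(path, s1, s2)
```

For this path only the nodes B-C and B-X contribute, so the accumulated distance is 2.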

Accumulated distance along p
• [Figure: path p through the grid of ACXCCD against ABCD]

Accumulated distance (continued)
• [Figure: alignment path between ACXCCD and ABCD, annotated with the penalties KDEL, KINS, KINS, KINS]

Optimal path and DP distance
• The optimal path is the path with minimum accumulated distance
• Formally the optimal path is $\hat{p}$, where: $\hat{p} = \arg\min_{p} AD_p(S_1, S_2)$
• The DP distance, or accumulated distance $AD(S_1, S_2)$, between S1 and S2 is given by: $AD(S_1, S_2) = \min_{p} AD_p(S_1, S_2) = AD_{\hat{p}}(S_1, S_2)$

Calculating the optimal path
• Given
  • the distance matrix D,
  • the insertion penalty KINS, and
  • the deletion penalty KDEL*
• How can we compute the optimal path between two (potentially long) sequences S1 and S2?
• In a real application the lengths of S1 and S2 might be large
*If KDEL and KINS are not defined you should assume that they are zero

Dynamic Programming (DP)
• The optimal path is calculated using Dynamic Programming (DP)
• DP is based on the principle of optimality
• Suppose all paths from X to Y must pass through A or B immediately before reaching Y. The optimal path from X to Y is the best of:
  • (i) the best path from X to A plus the cost of going from A to Y
  • (ii) the best path from X to B plus the cost of going from B to Y
• [Figure: paths from X to Y passing through A or B]
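
In symbols, the principle of optimality stated on this slide can be written as follows. The notation is introduced here for illustration and is not taken from the slides: $C(X \to Y)$ stands for the cost of the best path from X to Y, and $c(A, Y)$ for the cost of the single step from A to Y.

```latex
C(X \to Y) = \min \bigl\{\, C(X \to A) + c(A, Y),\;\; C(X \to B) + c(B, Y) \,\bigr\}
```

The point is that only the best path into A and the best path into B need to be remembered; every other path into A or B can be discarded.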

DP – step 1
• Step 1: draw the trellis of all possible paths
• [Figure: trellis of all possible paths between ACXCCD and ABCD]

DP – forward pass – initialisation
• ad(1,1) = d(A,A)
• ad(2,1) = ad(1,1) + d(2,1)
• [Figure: (forward) accumulated distance matrix and (forward) path matrix for ACXCCD against ABCD]

[Figure: the point (i,j), with accumulated distance ad(i,j), and its possible predecessors (i-1,j), (i-1,j-1) and (i,j-1)]
• ad(i,j) is the sum of distances along the best (partial) path from (1,1) to (i,j)
• It is calculated using the principle of optimality
• In addition, the forward path matrix records the local optimal path at each point
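
A recurrence consistent with these definitions can be sketched as follows. It rests on assumptions that the slide text does not state explicitly: $d(i,j)$ denotes the distance between the ith symbol of S1 and the jth symbol of S2; a horizontal step from (i,j-1) is treated as an insertion and a vertical step from (i-1,j) as a deletion, both relative to the vertical string; and diagonal steps carry no penalty.

```latex
ad(1,1) = d(1,1), \qquad
ad(i,j) = d(i,j) + \min \bigl\{\, ad(i-1,j-1),\;\; ad(i,j-1) + K_{INS},\;\; ad(i-1,j) + K_{DEL} \,\bigr\}
```

Terms that refer to points outside the trellis (i = 0 or j = 0) are simply left out of the minimum. With both penalties set to zero, the first column reduces to ad(2,1) = ad(1,1) + d(2,1).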

DP – forward pass – continued
• ad(1,2) = ad(1,1) + d(A,C)
• [Figure: (forward) accumulated distance matrix and (forward) path matrix for ACXCCD against ABCD]

DP – forward pass – continued
• [Figure: (forward) accumulated distance matrix and (forward) path matrix for ACXCCD against ABCD]

DP – forward pass – continued
• [Figure: (forward) accumulated distance matrix and (forward) path matrix for ACXCCD against ABCD]
• Optimal path obtained by back-tracking through the forward path matrix, starting at the bottom right-hand corner
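
The forward pass and the back-tracking step can be sketched together in Python. This is an illustrative sketch under stated assumptions rather than the lecture's own implementation: indices are 0-based, both sequences are non-empty, horizontal steps are charged the insertion penalty and vertical steps the deletion penalty, and ties are broken in favour of the diagonal step. The names `dp_align` and `zero_one` are hypothetical.

```python
def zero_one(a, b):
    """0/1 symbol distance: 0 if the symbols are equal, otherwise 1."""
    return 0 if a == b else 1

def dp_align(s1, s2, d=zero_one, k_ins=0.0, k_del=0.0):
    """Return (DP distance, optimal path) for non-empty sequences s1 and s2.

    s1 indexes the rows (vertical string), s2 the columns.
    The path is a list of 0-based (i, j) nodes from (0, 0) to the last node.
    """
    n, m = len(s1), len(s2)
    ad = [[0.0] * m for _ in range(n)]      # forward accumulated distances
    back = [[None] * m for _ in range(n)]   # forward path matrix (predecessors)
    for i in range(n):
        for j in range(m):
            local = d(s1[i], s2[j])
            if i == 0 and j == 0:
                ad[i][j] = local
                continue
            candidates = []
            if i > 0 and j > 0:             # diagonal step, no penalty
                candidates.append((ad[i - 1][j - 1], (i - 1, j - 1)))
            if j > 0:                       # horizontal step, insertion
                candidates.append((ad[i][j - 1] + k_ins, (i, j - 1)))
            if i > 0:                       # vertical step, deletion
                candidates.append((ad[i - 1][j] + k_del, (i - 1, j)))
            best, prev = min(candidates, key=lambda c: c[0])
            ad[i][j] = local + best
            back[i][j] = prev
    # Back-track from the bottom right-hand corner to (0, 0)
    path = [(n - 1, m - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        path.append(back[path[-1][0]][path[-1][1]])
    path.reverse()
    return ad[n - 1][m - 1], path

distance, path = dp_align("ABCD", "ACXCCD")
```

With the 0/1 distance and zero penalties, the DP distance between ABCD and ACXCCD comes out as 2. The double loop visits each trellis point once, so the cost grows with the product of the two sequence lengths rather than with the number of possible paths.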

Summary
• Introduction to sequence analysis
• Definitions of:
  • distance,
  • forward accumulated distance,
  • forward path matrix,
  • optimal path
• Dynamic programming
  • How to compute the accumulated distance using DP
  • How to recover the optimal path
