280 likes | 452 Views
EE3J2 Data Mining Lecture 14: Sequence Analysis & Dynamic Programming Martin Russell. Objectives. To understand the principal of Dynamic Programming (DP) To understand how DP can be used to compute the distance between two sequences of data To understand what is meant by, and how to compute:
E N D
EE3J2 Data MiningLecture 14: Sequence Analysis & Dynamic ProgrammingMartin Russell EE3J2 Data Mining
Objectives • To understand the principal of Dynamic Programming (DP) • To understand how DP can be used to compute the distance between two sequences of data • To understand what is meant by, and how to compute: • An alignment path • The DP recurrence equation • The distance matrix • The accumulated distance matrix • The optimal path EE3J2 Data Mining
Sequences • Sequential data common in real applications: • DNA analysis in bioinformatics and forensic science • Sequences of the letters A, G, C and T • Signature recognition biometrics • Words and text • Spelling and grammar checkers, author verification,.. • Speech, music and audio • Speech/speaker recognition, speech coding and synthesis • Electronic music • Radar signature recognition… EE3J2 Data Mining
Retrieving sequential data • Large corpora of sequential data • E.g: DNA databases • Individual sequences may not be amenable to human interpretation • Need for automated sequential data retrieval • Need to ‘mine’ sequential data • Fundamental requirement is a measure of distance between two sequences EE3J2 Data Mining
Basic definitions • In a typical sequence analysis application we have a basic alphabet consisting of N symbols • Examples: • In conventional text A is the set of letters plus punctuation plus ‘white space’ • In bioinformatics, scientists us {A, B, D, F} to describe DNA sequences EE3J2 Data Mining
Sequences of continuous variables • In some important applications the elements of a discrete sequence are taken from a continuous vector space, rather than a finite set • Sequences of continuous variables can be dealt with in two ways: • Directly • Clustering can be used to represent the space as a set of K centroids, and each vector can be replaced by its closest centroid (Vector Quantisation) EE3J2 Data Mining
Distance between sequences (1) • Consider two sequences of symbols from the alphabet A = {A, B, C, D} • How similar are the sequences: • S1 = ABCD • S2 = ABD • Intuitively S2 is obtained from S1 by deleting C • An alternative explanation is that S2 is obtained from S1 by substituting D for C and then deleting D EE3J2 Data Mining
Distance between sequences (2) • Alternatively S2 was obtained from S1 be deleting ABCD and inserting ABC • … • Why is the first explanation intuitively ‘correct’ and the last one intuitively ‘wrong’? • Because we favour the simplest explanation which involves the minimum number of insertions, deletions and substitutions • Or do we? EE3J2 Data Mining
Distance between sequences (3) • Consider: • S1 = AABC • S2 = SABC • S3 = PABC • S4 = ASCB • If we know that these sequences were typed then S2 is closer to S1 than S3 is, because A and S are adjacent on a keyboard • Similarly S4 is close to S2 because letter-swapping (SA AS etc) is a common typographical error EE3J2 Data Mining
A B B C A B C Alignments • The relationship between two sequences can be expressed as an alignment between their elements • E.G: Insertion EE3J2 Data Mining
substitution A X C A B C Alignment: deletion and substitution deletion A C A B C N.B: All edits described relative to vertical string EE3J2 Data Mining
General alignment path A C X C C D A B C D Which alignment is best? EE3J2 Data Mining
The Distance Matrix • Suppose that d(A,B) denotes the distance between the alphabet symbols A and B • Examples: • d(A,B) = 0 if A = B, otherwise d(A,B) = 1 • In typing, d(A,B) might indicate how unlikely it is that A would be mistyped as B • In bioinformatics d(A,B) might indicate how unlikely it is that symbol A in a DNA string is replaced by B • For continuous valued sequences d could be Euclidean distance EE3J2 Data Mining
Notation • Suppose we have an alphabet: • A distance matrix for A is an N N matrix where is the distance between the mthand nthalphabet symbols EE3J2 Data Mining
The Accumulated Distance • Consider two sequences: • S1 = ABCD • S2 = ACXCCD • For any alignment path p between S1 and S2 we define the accumulated distance between S1 and S2, denoted by ADp(S1,S2), to be the sum over all the nodes of p of the corresponding distances between elements of S1 and S2. EE3J2 Data Mining
A C X C C D A B C D Path p Accumulated distance along p EE3J2 Data Mining
A C X C C D A B C D Accumulated distance (continued) KDEL KINS KINS KINS EE3J2 Data Mining
Optimal path and DP distance • The optimal path is the path with minimum accumulated distance • Formally the optimal path is where: • The DP distance, or accumulated distanceAD(S1,S2) between S1 and S2 is given by: EE3J2 Data Mining
*If KDEL and KINSare not defined you should assume that they are zero Calculating the optimal path • Given • the distance matrix D, • the insertion penalty KINS , and • the deletion penalty KDEL* • How can we compute the optimal path between two (potentially long) sequences S1 and S2? • In a real application the lengths of S1 and S2 might be large EE3J2 Data Mining
A X Y B Dynamic Programming (DP) • The optimal path is calculated using Dynamic Programming (DP) • DP is based on the principle of optimality • Suppose all paths from X to Y must pass through A or B immediately before reaching Y. The optimal path from X to Y is the best of: • (i) the best path from X to A plus the cost of going from A to Y • (ii) the best path from X to B plus the cost of going from B to Y EE3J2 Data Mining
A C X C C D A B C D DP – step 1 • Step 1: draw the trellis of all possible paths EE3J2 Data Mining
d(A,A) ad(2,1)=ad(1,1)+d(2,1) DP – forward pass – initialisation (Forward) accumulated distance matrix A C X C C D (Forward) path matrix A B C D EE3J2 Data Mining
(i-1,j) (i-1,j-1) (i,j-1) ad(i,j) • ad(i,j) is the sum of distances along the best (partial) path from (1,1) to (i,j) • It is calculated using the principle of optimality • In addition, the forward path matrix records the local optimal path at each point EE3J2 Data Mining
ad(1,2) = ad(1,1)+d(A,C) DP – forward pass – continued (Forward) accumulated distance matrix A C X C C D (Forward) path matrix A B C D EE3J2 Data Mining
DP – forward pass – continued (Forward) accumulated distance matrix A C X C C D (Forward) path matrix A B C D EE3J2 Data Mining
DP – forward pass – continued (Forward) accumulated distance matrix A C X C C D (Forward) path matrix A B C D EE3J2 Data Mining
DP – forward pass – continued (Forward) accumulated distance matrix A C X C C D (Forward) path matrix A B C D Optimal path obtained by back-tracking through the forward path matrix, starting at the bottom right-hand corner EE3J2 Data Mining
Summary • Introduction to sequence analysis • Definitions of: • distance, • forward accumulated distance, • forward path matrix, • optimal path • Dynamic programming • How to computed the accumulated distance using DP • How to recover the optimal path EE3J2 Data Mining