The Longest Common Subsequence Problem and Its Variants. Chang-Biao Yang (楊昌彪), Department of Computer Science and Engineering, National Sun Yat-sen University (中山大學資訊工程學系), http://www.nsysu.edu.tw
Outline • Introduction to Bioinformatics • Traditional LCS Algorithms • Our Works • Block Edit Problems • LCS of Run-Length Encoded Strings • Merged LCS Problem • Mosaic LCS Problem • Conclusions
Animal Cell (nucleus, cytoplasm, cell membrane) • DNA is located in the cell nucleus.
DNA and RNA • Nucleotides: adenine (A), guanine (G), cytosine (C), thymine (T), uracil (U) • DNA (deoxyribonucleic acid): {A, G, C, T} (base pairs: G≡C, A=T) • RNA (ribonucleic acid): {A, G, C, U} (base pairs: G≡C, A=U, G-U)
DNA Length • The total length of human DNA is about $3 \times 10^9$ (3 billion) base pairs. • Only about 1% to 1.5% of the DNA sequence is useful. • Number of human genes: 30,000 to 40,000 • This is a conclusion of the Human Genome Project (1990-2003). • The number originally expected was 100,000.
DNA, Genes and Proteins • DNA: program for cell processes • Proteins: execute cell processes • [Figure labels: DNA (TCCAACGGTGCTGAGGTGCAC), Gene, Protein]
Amino Acids • Amino acids are the basic units of proteins; there are 20 kinds.
Traditional Dynamic Programming (DP) for the Longest Common Subsequence (LCS) Problem
The Longest Common Subsequence (LCS) Problem • A string: S1 = “TAGTCACG” • A subsequence of S1 is obtained by deleting zero or more symbols from S1 (the remaining symbols need not be consecutive in S1), e.g. G, AGC, TATC, AGACG • Common subsequences of S1 = “TAGTCACG” and S2 = “AGACTGTC”: GG, AGC, AGACG • Longest common subsequence (LCS): S1 = TAGTCACG, S2 = AGACTGTC, LCS = AGACG
Applications of LCS • The edit distance between two strings or files (the number of deletions and insertions). Example: S1 = TAGTCACG, S2 = AGACTGTC, operations: DMMDDMMIMII (D = delete, M = match, I = insert) • Spoken word recognition • Similarity of two biological sequences (DNA or protein) • Sequence alignment
The Traditional LCS Algorithm • $S_1 = a_1 a_2 \cdots a_m$ and $S_2 = b_1 b_2 \cdots b_n$ • $A_{i,j}$ denotes the length of the longest common subsequence of $a_1 a_2 \cdots a_i$ and $b_1 b_2 \cdots b_j$. • Dynamic programming:
$$A_{i,j} = \begin{cases} A_{i-1,j-1} + 1 & \text{if } a_i = b_j \\ \max\{A_{i-1,j},\; A_{i,j-1}\} & \text{if } a_i \neq b_j \end{cases}$$
with $A_{0,0} = A_{0,j} = A_{i,0} = 0$ for $1 \le i \le m$, $1 \le j \le n$. • Time complexity: $O(mn)$
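The recurrence above can be sketched in Python as follows. This is a minimal illustration rather than code from the slides: the function name `lcs` and the tie-breaking rule in the traceback are assumptions, and a different tie-breaking rule may return a different (equally long) common subsequence.

```python
def lcs(s1, s2):
    """Return (length, one LCS) of s1 and s2 using the O(mn) DP table."""
    m, n = len(s1), len(s2)
    # A[i][j] = LCS length of s1[:i] and s2[:j]; row 0 and column 0 stay 0.
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                A[i][j] = A[i - 1][j - 1] + 1
            else:
                A[i][j] = max(A[i - 1][j], A[i][j - 1])

    # Trace back from A[m][n] to recover one optimal subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])
            i -= 1
            j -= 1
        elif A[i - 1][j] >= A[i][j - 1]:  # assumed tie-break: move up first
            i -= 1
        else:
            j -= 1
    return A[m][n], "".join(reversed(out))


length, subsequence = lcs("TAGTCACG", "AGACTGTC")
print(length, subsequence)
```

For the two example strings the reported length is 5, which agrees with the LCS AGACG given earlier.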
LCS and Edit Distance • Edit distance = |S1| + |S2| - 2 * |LCS(S1, S2)|
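The identity can be checked numerically with a small sketch. It assumes, as in the applications slide, that only insertions and deletions are allowed (no substitutions); the helper names `lcs_length` and `indel_distance` are illustrative and not taken from the slides.

```python
def lcs_length(s1, s2):
    """LCS length with a rolling row (O(n) extra space)."""
    prev = [0] * (len(s2) + 1)
    for a in s1:
        cur = [0]
        for j, b in enumerate(s2, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(s1, s2):
    """Edit distance computed directly, allowing only insertions and deletions."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # delete all of s1[:i]
    for j in range(n + 1):
        D[0][j] = j  # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                D[i][j] = D[i - 1][j - 1]
            else:
                D[i][j] = 1 + min(D[i - 1][j], D[i][j - 1])
    return D[m][n]


s1, s2 = "TAGTCACG", "AGACTGTC"
print(indel_distance(s1, s2), len(s1) + len(s2) - 2 * lcs_length(s1, s2))
```

Both values are 6 for the example strings, matching the three deletions and three insertions in the operation string DMMDDMMIMII.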
Sequence Alignment • S1 = TAGTCACG, S2 = AGACTGTC
Alignment 1:
----TAGTCACG
AGACT-GTC---
Alignment 2:
TAGTCAC-G--
-AG--ACTGTC
• Which one is better? • We can set different gap penalties as parameters for different purposes.
Gap Penalty for Sequence Alignment • is the gap penalty. • Suppose
Example for Sequence Alignment
TAGTCAC-G--
-AG--ACTGTC
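To make the effect of the scoring parameters concrete, the sketch below scores the two alignments of S1 and S2 shown earlier. The scoring values (match = 2, mismatch = -1, gap = -1) and the function name `alignment_score` are assumptions chosen for illustration; the parameter values used on the original slide did not survive extraction.

```python
def alignment_score(row1, row2, match=2, mismatch=-1, gap=-1):
    """Score a pairwise alignment column by column.

    The default scores are hypothetical; '-' marks a gap.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


alignment_1 = ("----TAGTCACG", "AGACT-GTC---")
alignment_2 = ("TAGTCAC-G--", "-AG--ACTGTC")
print(alignment_score(*alignment_1))  # 4 matches, 8 gaps -> 0
print(alignment_score(*alignment_2))  # 5 matches, 6 gaps -> 4
```

Under these assumed scores the second alignment is preferred; a heavier gap penalty widens the difference because the first alignment contains more gaps.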
MSA, ET and LCS • [Diagram labels: multiple sequence alignment (MSA), LCS, phylogeny (evolutionary tree, ET)]
Hunt-Szymanski LCS Algorithm • By extending the idea of the RSK (Robinson-Schensted-Knuth) algorithm for the longest increasing subsequence problem, the LCS problem can be solved in O(r log n) time, where r denotes the number of matches, i.e. pairs (i, j) with a_i = b_j (the commonly stated bound is O((r + n) log n), which also accounts for building the match lists). • This algorithm is faster than the traditional dynamic programming when r is small.
The Pairs of Matching in Hunt-Szymanski Algorithm • Input sequences: TAGTCACG and AGACTGTC • Pairs of matching:
Example for Hunt-Szymanski Algorithm • Matches are inserted in row-major order, and within each row from the rightmost column to the leftmost. • Time complexity: O(r log n), where r is the number of matches; each match needs O(log n) time for binary search.
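The insertion order described above can be sketched as follows. This is a simplified, length-only version under stated assumptions: the function name `hunt_szymanski_length` is hypothetical, the threshold list stands in for the structure drawn on the slide, and recovering the subsequence itself would need extra back-pointers that are omitted here.

```python
from bisect import bisect_left
from collections import defaultdict


def hunt_szymanski_length(s1, s2):
    """Return (LCS length, number of matches r) for s1 and s2."""
    # For each symbol, the columns of s2 where it occurs, in increasing order.
    positions = defaultdict(list)
    for j, ch in enumerate(s2):
        positions[ch].append(j)

    # thresh[k] = smallest column that can end a common subsequence
    # of length k + 1 using the rows processed so far.
    thresh = []
    matches = 0
    for ch in s1:  # row-major order
        # Columns are visited backward so that two matches in the same
        # row can never extend each other.
        for j in reversed(positions.get(ch, ())):
            matches += 1
            k = bisect_left(thresh, j)  # O(log n) binary search per match
            if k == len(thresh):
                thresh.append(j)
            else:
                thresh[k] = j
    return len(thresh), matches


print(hunt_szymanski_length("TAGTCACG", "AGACTGTC"))
```

For the example input sequences there are r = 16 matches (each of the four symbols occurs twice in both strings), and the computed LCS length is 5.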
Block Edit Problems • Operations: block copy, block deletion and block move. • Shapira and Storer (2002) proved that the problem is NP-hard when recursive block-move operations are allowed. • Various approximations have been proposed. • Our assumptions (restricted edit sequence): • A series of edit operations is performed from left to right on the source string X. • No two block-edit operations are performed on overlapping regions of X.
Restricted Edit Sequence (a) General (recursive) edit operations (b) Restricted edit sequence
Definitions of the Problems (1/2) • Let P(o, c) denote a block edit problem: • o: a composition of block-edit operations • c: the class of cost measures • The block-copy operations: • External copy: copy a substring of X to $W_i$ • Internal copy: copy a valid substring of $W_{i-1}$ to $W_i$ • Shifted copy: copy a shifted substring
Definitions of the Problems (2/2) • The cost measures that can be chosen: • Constant cost: $p_{\mathrm{copy}}$ • Linear cost: $p_s + k \times p_e$ • Nested cost: $p_{\mathrm{copy}} + d_c(A, B)$ • Three problems are defined in our work: • P(EIS,C) • P(EI,L) • P(EI,N)
Problem 1 -- P(EIS,C) – External, Internal, Shifted, Constant • External and internal copies are allowed at constant cost. • Shifted copies are allowed at constant cost. • It can be solved by a straightforward DP algorithm in $O(nm^2(n + m)|\Sigma|)$ time. • We propose an $O(nm)$ time DP algorithm with • $O(n + m^2)$ preprocessing time in the worst case • $O(n + m \log m)$ preprocessing time in the average case
Recurrence DP Formula for P(EIS,C) • Straightforward implementation: $O(nm^2(n + m)|\Sigma|)$ time.
Functions and Operations (1) • Character operations: • Block deletions:
Functions and Operations (2) • External copies: • Internal copies:
Functions and Operations (3) • Shifted copies:
Preprocessing for P(EIS,C) • For external copies: • Build a suffix tree $T(X^R \# Y^R \$)$ to find the common substrings between X and Y. • For internal copies: • Build a suffix tree $T(Y^R)$ to find the valid common substrings to be copied from working string $W_i$ to $W_{i+1}$. • For shifted copies: • Compute the differential strings X' and Y' of X and Y. • Find the valid common substrings for external / internal copies.
Preprocessing – Longest Common Prefixes (LCP) and Suffix trees
Problem 2 -- P(EI,L) – External, Internal, Linear • Each copy or deletion costs an initial penalty plus a linear extension penalty.
Problem 3 -- P(EI,N) – External, Internal, Nested • The copied strings can be further edited with character-edit operations.
LCS of Run-Length Encoded Strings • Run-length encoding (RLE) compression: aaaaabbbccccdd → a5b3c4d2 • Input: • RLE string X: length n, k runs • RLE string Y: length m, l runs • Output: • LCS between X and Y.
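As a small illustration of the encoding used for the input strings, the sketch below converts a plain string into a list of runs and into the compact text form. The function names `rle_encode` and `rle_to_text` are assumptions made for this example.

```python
from itertools import groupby


def rle_encode(s):
    """Return the runs of s as (symbol, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_to_text(runs):
    """Render runs in the compact 'a5b3...' notation."""
    return "".join(f"{ch}{count}" for ch, count in runs)


runs = rle_encode("aaaaabbbccccdd")
print(runs)               # [('a', 5), ('b', 3), ('c', 4), ('d', 2)]
print(rle_to_text(runs))  # a5b3c4d2
```

Here the string has length n = 14 and k = 4 runs, so an algorithm that works on runs handles a much smaller input than one that works on individual symbols.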
Dark & Light Blocks • Divide the DP lattice into k × l blocks, one per pair of runs. • Dark blocks: matched blocks (the two runs have the same symbol) • Light blocks: mismatched blocks (the two runs have different symbols)
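The block partition can be sketched as a k × l map that labels each pair of runs. This only classifies the blocks and does not compute the LCS; the function names `runs` and `block_map` and the labels "D" and "L" are illustrative choices.

```python
from itertools import groupby


def runs(s):
    """Return the runs of s as (symbol, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def block_map(x, y):
    """Label each (run of x, run of y) block: 'D' = dark, 'L' = light."""
    rx, ry = runs(x), runs(y)
    return [["D" if cx == cy else "L" for cy, _ in ry] for cx, _ in rx]


# x has runs a2 b3 (k = 2); y has runs a1 b2 a1 (l = 3).
for row in block_map("aabbb", "abba"):
    print("".join(row))
# DLD
# LDL
```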
Results of Bunke and Csirik (1995) • Lemma 1 (Dark block): • Lemma 2 (Light block): • Only the boundaries of the blocks are needed.
Results of Liu et al. (2008) • A complex modified DP formula which computes the DP lattice row by row. • Only the bottom boundaries of the blocks are needed.