Bioinformatic PhD. course Bioinformatics Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) LSI Dep. de Llenguatges i Sistemes Informàtics BSC Barcelona Supercomputing Center Universitat Politècnica de Catalunya
Contents
1. Biological introduction
2. Comparison of short sequences (up to 10,000 bp): Dot Matrix, Pairwise alignment, Multiple alignment, Hash algorithms
3. Comparison of large sequences (more than 10,000 bp): Data structures, Suffix trees, MUMs
4. String matching: Exact, Extended, Approximate
5. Sequence assembly
6. Projects: PROMO, MREPATT, …
String matching
1. (Exact) String matching of one pattern
2. (Exact) String matching of many patterns
3. Extended string matching
4. Approximate string matching (Dynamic programming)
• G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings, Cambridge University Press, 2002
• D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997
String matching
Definition: given a long text T and a set of k patterns p1, p2, …, pk, the string matching problem is to find all the occurrences of all the patterns in the text T.
On-line algorithms: the patterns are known in advance and can be preprocessed.
Off-line algorithms: the text is known in advance and can be preprocessed.
• Only one pattern (exact and approximate)
• Five, ten, a hundred, a thousand, ... patterns (exact)
• Suffix trees
Master Course First part: (Exact) string matching
String matching: one pattern
How do string matching algorithms perform the search? For instance, given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC, search for the pattern ACTGA, and for the pattern TACTACGGTATGACTAA.
String Matching: Brute force algorithm
Example: given the pattern ATGTA, the search is
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
    A T G T A
      A T G T A
        A T G T A
          A T G T A
String Matching: Brute force algorithm Connect to http://www-igm.univ-mlv.fr/~lecroq/string/index.html and open Brute Force algorithm
String Matching of one pattern
The cost of the Brute Force algorithm is O(nm). What is the expected number of comparisons? Can the search be made with lower cost?
Three approaches: Prefix search, Suffix search, Factor search.
Text: CTACTACTACGTCTATACTGATCGTAGCTACTACATGC
Pattern: TACTACGGTATGACTAA
String matching of one pattern
How do string matching algorithms perform the search? There is a sliding window along the text against which the pattern is compared. [Figure: the pattern aligned under a window of the text]
At each step the comparison is made and the window is shifted to the right.
What distinguishes the algorithms?
• How the comparison is made.
• The length of the shift.
String Matching: Brute force algorithm
• How is the comparison made? From left to right: prefix search.
• What is the next position of the window? The window is shifted by only one cell.
The cost is O(mn).
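The two answers above (left-to-right comparison, shift of one cell) can be sketched as follows. This is a minimal illustration, not code from the course; the function name `brute_force` and the 0-based positions are assumptions made here.

```python
def brute_force(text, pattern):
    """Return the 0-based start positions of every occurrence of pattern."""
    n, m = len(text), len(pattern)
    hits = []
    for pos in range(n - m + 1):        # the window is shifted one cell at a time
        j = 0
        while j < m and text[pos + j] == pattern[j]:   # compare left to right
            j += 1
        if j == m:                      # the whole pattern matched
            hits.append(pos)
    return hits


print(brute_force("GTACTAGAGGACGTATGTACTG", "ATGTA"))
```

In the worst case each of the n-m+1 windows costs m comparisons, which gives the O(mn) bound stated on the slide.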
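String Matching: one pattern
Most efficient algorithms (Navarro & Raffinot): Horspool, BNDM (Backward Nondeterministic Dawg Matching), BOM (Backward Oracle Matching)
[Figure: map of the most efficient algorithm as a function of the alphabet size |Σ| (vertical axis, 2 to 64) and the length of the pattern (horizontal axis, 2 to 256), with regions labelled Horspool, BOM and BNDM, and a mark w.]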
String Matching: Horspool algorithm
• How is the comparison made? From right to left: suffix search.
• What is the next position of the window? It depends on where the last character of the text window, say 'a', appears in the pattern: the window is shifted so that the rightmost 'a' of the pattern (excluding its last position) is aligned with that text character. A preprocessing step is therefore needed to determine the length of the shift for each character.
String Matching: Horspool algorithm
Example: given the pattern ATGTA, the shift table is
A 4
C 5
G 2
T 1
And the search:
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
          A T G T A
              A T G T A
                        A T G T A
                            A T G T A
String Matching: Horspool algorithm
Example: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
Given the pattern ATGTA, the shift table is
A 4
C 5
G 2
T 1
And the search:
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
          A T G T A
              A T G T A
                        A T G T A
                            A T G T A
                                    A T G T A
…
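A minimal sketch of the Horspool search, assuming a non-empty pattern and 0-based positions (the function names are chosen here for illustration). The shift of a character is the distance from its rightmost occurrence in the pattern, excluding the last position, to the end of the pattern; characters that do not occur get a shift of m.

```python
def horspool_shift_table(pattern):
    """Shift for each character occurring in pattern[:-1]; others shift by m."""
    m = len(pattern)
    # Later (more rightward) occurrences overwrite earlier ones,
    # so each character keeps its smallest shift.
    return {ch: m - 1 - j for j, ch in enumerate(pattern[:-1])}


def horspool(text, pattern):
    """Return the 0-based start positions of every occurrence of pattern."""
    n, m = len(text), len(pattern)
    shift = horspool_shift_table(pattern)
    hits = []
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:   # compare right to left
            j -= 1
        if j < 0:
            hits.append(pos)
        # The shift depends only on the last character of the window.
        pos += shift.get(text[pos + m - 1], m)
    return hits


print(horspool_shift_table("ATGTA"))
print(horspool("GTACTAGAGGACGTATGTACTG", "ATGTA"))
```

For ATGTA the table holds A 4, G 2 and T 1; C is absent from the dictionary and falls back to the default shift m = 5, as in the slide.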
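String Matching: one pattern
The most efficient algorithms (Navarro & Raffinot): Horspool, BNDM (Backward Nondeterministic Dawg Matching), BOM (Backward Oracle Matching)
[Figure: map of the most efficient algorithm as a function of the alphabet size |Σ| (vertical axis, 2 to 64) and the length of the pattern (horizontal axis, 2 to 256), with regions labelled Horspool, BOM and BNDM, and a mark w.]
What happens with many patterns?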
String matching: many patterns Given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC Search for the patterns ACTGACT GTCT AATT ACTGATCTTT GTAGC AATACT ACATGC ACTGA.
Horspool for many patterns
Search for ATGTATG, TATG, ATAAT, ATGTG
1. Build the trie of the inverted patterns [Figure: trie of the inverted patterns]
2. lmin = 4
3. Table of shifts: A 1, C 4 (lmin), G 2, T 1
4. Start the search
Horspool for many patterns
Search for ATGTATG, TATG, ATAAT, ATGTG in the text ACATGCTATGTGACA…
Table of shifts: A 1, C 4 (lmin), G 2, T 1
[Figure: trie of the inverted patterns, traversed while the text window is read from right to left]
Horspool for many patterns
Search for ATGTATG, TATG, ATAAT, ATGTG in the text ACATGCTATGTGACA… …
Table of shifts: A 1, C 4 (lmin), G 2, T 1
[Figure: trie of the inverted patterns]
Short shifts!
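The four steps (trie of the inverted patterns, lmin, table of shifts, search) can be sketched as below. This is an illustrative reading of the slides, not the author's code: the trie is a nested dictionary, the key `None` marks the end of a pattern, and the function names are assumptions.

```python
def build_reversed_trie(patterns):
    """Trie of the inverted patterns as nested dicts; key None stores the pattern."""
    trie = {}
    for p in patterns:
        node = trie
        for ch in reversed(p):
            node = node.setdefault(ch, {})
        node[None] = p
    return trie


def set_horspool_shifts(patterns):
    """Smallest distance from each character to the end of any pattern, capped at lmin."""
    lmin = min(len(p) for p in patterns)
    shift = {}
    for p in patterns:
        for j in range(len(p) - 1):          # the last position is excluded
            d = len(p) - 1 - j
            shift[p[j]] = min(shift.get(p[j], lmin), d)
    return shift, lmin


def set_horspool(text, patterns):
    """Return (start position, pattern) for every occurrence, 0-based."""
    trie = build_reversed_trie(patterns)
    shift, lmin = set_horspool_shifts(patterns)
    hits = []
    pos = lmin - 1                           # index of the last cell of the window
    while pos < len(text):
        node, i = trie, pos
        while i >= 0 and text[i] in node:    # read the text backwards through the trie
            node = node[text[i]]
            if None in node:
                hits.append((i, node[None]))
            i -= 1
        pos += shift.get(text[pos], lmin)
    return hits


patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
print(set_horspool_shifts(patterns))
print(set_horspool("ACATGCTATGTGACA", patterns))
```

With these four patterns every character that occurs has a shift of 1 or 2, which is the "short shifts" problem that motivates the next slide.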
Horspool to Wu-Manber
How can we increase the length of the shifts? With a shift table of l-mers (blocks of L symbols). For the patterns ATGTATG, TATG, ATAAT, ATGTG:
1 symbol: A 1, C 4 (lmin), G 2, T 1
2 symbols: AA 1, AC 3 (lmin-L+1), AG 3, AT 1, CA 3, CC 3, CG 3, …
The 2-symbol entries that differ from the default value lmin-L+1 = 3 are: AA 1, AT 1, GT 1, TA 2, TG 2
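Wu-Manber algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG in the text: ACATGCTATGTGACATAATA …
Table of shifts (2 symbols): AA 1, AT 1, GT 1, TA 2, TG 2
[Figure: trie of the inverted patterns]
Experimental length of the blocks: $\log_{|\Sigma|}(2 \cdot lmin \cdot r)$ (r: number of patterns)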
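A sketch of the search with a shift table of 2-symbol blocks, following the table shown on the slides. It is a simplified variant written for illustration: candidate windows are verified with the trie of the inverted patterns used in the previous slides, whereas the published Wu-Manber algorithm verifies candidates with hash tables. The function names and the parameter `L` are assumptions, and lmin is assumed to be at least L.

```python
def block_shift_table(patterns, L=2):
    """Shift for each block of L symbols; unseen blocks shift by lmin - L + 1."""
    lmin = min(len(p) for p in patterns)
    default = lmin - L + 1
    table = {}
    for p in patterns:
        # Blocks ending at index j, excluding the block that ends the pattern.
        for j in range(L - 1, len(p) - 1):
            block = p[j - L + 1 : j + 1]
            d = len(p) - 1 - j
            table[block] = min(table.get(block, default), d)
    return table, default


def block_shift_search(text, patterns, L=2):
    """Return (start position, pattern) for every occurrence, 0-based."""
    table, default = block_shift_table(patterns, L)
    lmin = min(len(p) for p in patterns)
    trie = {}                                  # trie of the inverted patterns
    for p in patterns:
        node = trie
        for ch in reversed(p):
            node = node.setdefault(ch, {})
        node[None] = p                         # key None marks the end of a pattern
    hits = []
    pos = lmin - 1                             # index of the last cell of the window
    while pos < len(text):
        node, i = trie, pos
        while i >= 0 and text[i] in node:      # verify by reading backwards
            node = node[text[i]]
            if None in node:
                hits.append((i, node[None]))
            i -= 1
        pos += table.get(text[pos - L + 1 : pos + 1], default)
    return hits


patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
print(block_shift_table(patterns))
print(block_shift_search("ACATGCTATGTGACATAATA", patterns))
```

Blocks that never occur in the patterns, such as CT or AC, now give a shift of 3 instead of the 1 or 2 obtained with single symbols.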
String matching of many patterns
[Figure: three panels (5 patterns, 10 patterns, 100 patterns) showing the most efficient algorithm, Wu-Manber or SBOM, as a function of the alphabet size |Σ| (vertical axis: 2, 4, 8) and lmin (horizontal axis: 5 to 45).]
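String Matching: one pattern
The most efficient algorithms (Navarro & Raffinot): Horspool, BNDM (Backward Nondeterministic Dawg Matching), BOM (Backward Oracle Matching)
[Figure: map of the most efficient algorithm as a function of the alphabet size |Σ| (vertical axis, 2 to 64) and the length of the pattern (horizontal axis, 2 to 256), with regions labelled Horspool, BOM and BNDM, and a mark w.]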
BNDM algorithm
• How is the comparison made? The window is read from right to left, searching for suffixes of T that are factors of P. [Figure: text window and pattern] This state is expressed with an array D of bits, one per cell of the pattern: D2 = 1 0 0 0 1 0 0
• How can the next state be obtained? Given B(x), the mask of character x (the cells where x appears in the pattern), reading x gives D = (D << 1) & B(x). If B(x) = ( 0 0 1 1 0 0 0 ) then D3 = ( 0 0 0 1 0 0 0 ) & ( 0 0 1 1 0 0 0 ) = ( 0 0 0 1 0 0 0 )
• How is the shift determined? ?
BNDM algorithm: example
Given the pattern ATGTA, the masks of the characters are:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
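The masks can be computed with a one-line preprocessing step. The sketch below returns them as strings of 0s and 1s so that they read like the slide; the function name `char_masks` and the default alphabet ACGT are assumptions made for this illustration.

```python
def char_masks(pattern, alphabet="ACGT"):
    """B(x) for each character x: a 1 in every cell where x appears in the pattern."""
    return {x: "".join("1" if p == x else "0" for p in pattern) for x in alphabet}


for x, mask in char_masks("ATGTA").items():
    print(f"B({x}) = {mask}")
```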
BNDM algorithm: example
Given the pattern ATGTA, the masks of the characters are:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
Given the text:
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
          A T G T A
                    A T G T A
                            A T G T A
First window (GTACT):
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
Second window (AGAGG):
D1 = ( 0 0 1 0 0 )
D2 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )
Third window (ACGTA):
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
BNDM algorithm: example
The pattern is ATGTA, the masks are:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
and the text:
G T A C T A G A G G A C G T A T G T A C T G ...
                            A T G T A
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )
D5 = ( 1 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D6 = ( 0 0 0 0 0 ) & ( * * * * * ) = ( 0 0 0 0 0 )
Pattern found!
…
BNDM algorithm
• How is the comparison made? It searches for suffixes of T that are factors of P. [Figure: text window and pattern] This state is expressed with an array D of bits: D = 1 0 0 0 1 0 0
• How is the shift determined? If the left bit is set to one in step i, a prefix of P of length i is equal to a suffix of T (if i = m, the pattern has been found). The window is then shifted m-i cells, for the largest such i smaller than m; otherwise it is shifted m cells.
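The comparison and the shift rule can be put together as in the sketch below. It follows the usual bit-parallel formulation of BNDM with Python integers standing in for the machine word; the names `bndm`, `full` and `top` are chosen here, the pattern is assumed non-empty, and a real implementation would require m to fit in a word of w bits.

```python
def bndm(text, pattern):
    """Return the 0-based start positions of every occurrence of pattern."""
    n, m = len(text), len(pattern)
    # B[x]: the leftmost bit corresponds to the first cell of the pattern.
    B = {}
    for j, ch in enumerate(pattern):
        B[ch] = B.get(ch, 0) | (1 << (m - 1 - j))
    full = (1 << m) - 1          # m ones: every factor is still possible
    top = 1 << (m - 1)           # the left bit
    hits = []
    pos = 0
    while pos <= n - m:
        j, last = m, m
        D = full
        while D != 0:
            D &= B.get(text[pos + j - 1], 0)   # read one more character, right to left
            j -= 1
            if D & top:                        # the suffix read is a prefix of the pattern
                if j > 0:
                    last = j                   # remember the shift m - i
                else:
                    hits.append(pos)           # i = m: pattern found
            D = (D << 1) & full
        pos += last
    return hits


print(bndm("GTACTAGAGGACGTATGTACTG", "ATGTA"))
```

On the example text the windows start at positions 0, 5, 10 and 14 (0-based), which reproduces the shifts of 5, 5 and 4 cells seen in the worked example.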
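String matching: one pattern
The most efficient algorithms (Navarro & Raffinot): Horspool, BNDM (Backward Nondeterministic Dawg Matching), BOM (Backward Oracle Matching)
[Figure: map of the most efficient algorithm as a function of the alphabet size |Σ| (vertical axis, 2 to 64) and the length of the pattern (horizontal axis, 2 to 256), with regions labelled Horspool, BOM and BNDM, and a mark w.]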
BOM (Backward Oracle Matching)
• How is the comparison made? [Figure: text window and pattern] Automaton: Factor Oracle (1999). It checks whether the suffix read so far is a factor of the pattern.
• How is the shift determined? ?
Automaton Factor Oracle: properties
Factor Oracle of the word GTATGTA [Figure: the automaton]
Suffixes of the prefix GTATG:
• Suffixes that have not been found before: GTATG, TATG, ATG, TG
• Suffixes found before: G
The automaton recognizes all the factors of the word, but it also recognizes other strings, such as GTG; hence it is useful only to discard words that are not factors!
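The property can be checked with a small sketch of the usual on-line construction of the Factor Oracle, which uses a supply (failure) link per state. The function names `build_factor_oracle` and `accepts` are chosen here for illustration.

```python
def build_factor_oracle(word):
    """Return the transitions of the Factor Oracle: one dict per state 0..len(word)."""
    trans = [{} for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)        # supply link of each state, -1 for state 0
    for i, ch in enumerate(word):
        trans[i][ch] = i + 1               # internal transition
        k = supply[i]
        while k > -1 and ch not in trans[k]:
            trans[k][ch] = i + 1           # external transition
            k = supply[k]
        supply[i + 1] = 0 if k == -1 else trans[k][ch]
    return trans


def accepts(trans, s):
    """True if s can be read from the initial state (every state is accepting)."""
    state = 0
    for ch in s:
        if ch not in trans[state]:
            return False
        state = trans[state][ch]
    return True


word = "GTATGTA"
oracle = build_factor_oracle(word)
print(accepts(oracle, "TATG"), accepts(oracle, "GTG"), "GTG" in word)
```

Every factor of GTATGTA is accepted, and so is GTG although it is not a factor, which is why a rejection is conclusive but an acceptance is not.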
BOM: example
Given the pattern ATGTATG:
• How is the comparison made? The Factor Oracle of the inverted pattern, GTATGTA, is built. [Figure: Factor Oracle of GTATGTA]
• Search:
G T A C T A G A A T G T G T A G A C A T G T A T G G T G ...
A T G T A T G
            A T G T A T G
                A T G T A T G
                      A T G T A T G
                                    A T G T A T G
…
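BOM (Backward Oracle Matching)
• How is the comparison made? [Figure: text window and pattern] Automaton: Factor Oracle. It checks whether the suffix read so far is a factor of the pattern.
• How is the shift determined? a is the first mismatch, that is, the first character for which the automaton has no transition; the window is shifted just past a.
But what happens with many patterns?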
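A sketch of the whole BOM search under these rules: the Factor Oracle of the inverted pattern is built, each window is read from right to left, and on the first character without a transition the window jumps just past it. The function names are assumptions, and the shift of one cell after a match is the simplest safe choice rather than something stated on the slides.

```python
def build_factor_oracle(word):
    """Transitions of the Factor Oracle of word: one dict per state."""
    trans = [{} for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)
    for i, ch in enumerate(word):
        trans[i][ch] = i + 1
        k = supply[i]
        while k > -1 and ch not in trans[k]:
            trans[k][ch] = i + 1
            k = supply[k]
        supply[i + 1] = 0 if k == -1 else trans[k][ch]
    return trans


def bom(text, pattern):
    """Return the 0-based start positions of every occurrence of pattern."""
    n, m = len(text), len(pattern)
    trans = build_factor_oracle(pattern[::-1])   # oracle of the inverted pattern
    hits = []
    pos = 0
    while pos <= n - m:
        state, j = 0, m
        while j > 0 and state is not None:
            state = trans[state].get(text[pos + j - 1])   # None: no transition
            j -= 1
        if state is not None:
            hits.append(pos)     # all m characters were read: the window is the pattern
        # On a mismatch the window restarts just after the failing character;
        # after a match j is 0, so the window moves one cell.
        pos += j + 1
    return hits


print(bom("GTACTAGAATGTGTAGACATGTATGGTGA", "ATGTATG"))
```

On the example text the windows start at positions 0, 6, 8, 11 and 18 (0-based), and the pattern is found at 18.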
SBOM
• How is the comparison made? [Figure: text window and patterns] Automaton: Factor Oracle. It checks whether the suffix read so far is a factor of any pattern.
• How is the shift determined? ?
Factor Oracle of many patterns
The AFO (Factor Oracle automaton) of GTATGTA, GTAA, TAATA and GTGTA
[Figure: the automaton, with states labelled 1,4, 3 and 2]
SBOM algorithm
• How is the comparison made? [Figure: text window and patterns] Automaton: ………… of length lmin
• How is the shift determined?
• If a does not appear in the AFO
• If lmin characters have been read
SBOM algorithm: example
Search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG in the text ACATGCTAGCTATAATAATGTATG
[Figure: the automaton, with states labelled 1, 4, 2 and 3]