Advanced Data Structure: Bioinformatics. First week : Algorithms for exact string matching. Second week : Approximate search and alignment of short sequences. Third week : Dealing with long sequences. Advanced Data Structure:bibliography. Bioinformatics, Sequence and Genome Analysis
Advanced Data Structure: Bioinformatics • First week: Algorithms for exact string matching. • Second week: Approximate search and alignment of short sequences. • Third week: Dealing with long sequences.
Advanced Data Structure: bibliography
• Bioinformatics: Sequence and Genome Analysis. David W. Mount
• Flexible Pattern Matching in Strings (2002). Gonzalo Navarro and Mathieu Raffinot
• http://www-igm.univ-mlv.fr/~lecroq/string/index.html
• http://www.ncbi.nlm.nih.gov/
First week • First week: algorithms for exact string matching. One pattern: the algorithm depends on |p| and |Σ|. k patterns: the algorithm depends on k, |p| and |Σ|. • Second week: approximate search and alignment of short sequences. • Third week: dealing with long sequences.
Exact string matching for one pattern • How do string matching algorithms carry out the search? For instance, given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC, search for the pattern ACTGA and for the pattern TACTACGGTATGACTAA.
Exact string matching: Brute force algorithm • Example: given the pattern ATGTA, the search over the text G T A C T A G A G G A C G T A T G T A C T G ... places the window A T G T A at successive text positions, moving one cell to the right each time.
Exact string matching: Brute force algorithm • How is the comparison made? From left to right (prefix search). • What is the next position of the window? The window is shifted by only one cell.
Exact string matching: one pattern • How do the matching algorithms carry out the search? A window slides along the text and the pattern is compared against it: at each step the comparison is made and the window is shifted to the right. • What distinguishes one algorithm from another? How the comparison is made, and the length of the shift.
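The two answers above (left-to-right comparison, shift of one cell) can be sketched in a few lines. This is a minimal illustration, not code from the course; the function name `brute_force_search` is an assumption, and positions are reported 0-based.

```python
def brute_force_search(text: str, pattern: str) -> list[int]:
    """Return every 0-based text position where the pattern occurs."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):      # the window moves one cell at a time
        j = 0
        # Compare the pattern against the window from left to right.
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:                      # every pattern character matched
            matches.append(start)
    return matches


print(brute_force_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))  # [14]
```

In the worst case every window is compared in full, so the cost is proportional to the text length times the pattern length; the algorithms that follow try to shift by more than one cell.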
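Exact string matching for one pattern • Experimental efficiency (Navarro & Raffinot). BNDM: Backward Nondeterministic Dawg Matching. BOM: Backward Oracle Matching. [Figure: map of the most efficient algorithm as a function of the alphabet size |Σ| (vertical axis, 2 to 64) and the pattern length (horizontal axis, 2 to 256, with w marked); the regions are labelled Horspool, BOM and BNDM.]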
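Horspool algorithm • How is the comparison made? By suffix search: the pattern is compared against the window from right to left. • What is the next position of the window? Let "a" be the text character aligned with the last cell of the window; the window is shifted until that character is aligned with the next occurrence of "a" in the pattern. • A preprocessing phase is needed to build the shift table.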
Horspool algorithm: example • Given the pattern ATGTA, the shift table is: A 4, C 5, G 2, T 1. Each entry is the distance from the rightmost occurrence of the character among the first four pattern positions to the end of the pattern; C does not occur in the pattern, so its shift is the pattern length, 5.
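The preprocessing phase that produces this table can be sketched as follows. The function name `horspool_shift_table` and the default DNA alphabet are assumptions made for the illustration.

```python
def horspool_shift_table(pattern: str, alphabet: str = "ACGT") -> dict[str, int]:
    """Build the Horspool shift table for a pattern over the given alphabet."""
    m = len(pattern)
    # A character that never occurs in the pattern allows a full shift of m.
    table = {c: m for c in alphabet}
    # The last pattern position is skipped; later occurrences overwrite
    # earlier ones, so the rightmost occurrence decides the shift.
    for i, c in enumerate(pattern[:-1]):
        table[c] = m - 1 - i
    return table


print(horspool_shift_table("ATGTA"))  # {'A': 4, 'C': 5, 'G': 2, 'T': 1}
```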
Horspool algorithm: example • Given the pattern ATGTA, the shift table is: A 4, C 5, G 2, T 1. • The searching phase over the text G T A C T A G A G G A C G T A T G T A C T G ...: the window A T G T A is placed at text positions 0, 1, 5, 7, 12 and 14 (0-based). The pattern is found at position 14, after which the window is shifted by 4 again.
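The searching phase can be sketched as below. This is a minimal version written for this example, not a reference implementation; the name `horspool_search` is an assumption, and the function also returns the window positions so that they can be compared with the slide.

```python
def horspool_search(text: str, pattern: str) -> tuple[list[int], list[int]]:
    """Return (match positions, visited window positions), both 0-based."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], []
    # Shift table: distance from the rightmost occurrence (last cell excluded)
    # to the end of the pattern; absent characters default to m below.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    matches, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        j = m - 1
        # Suffix search: compare from right to left.
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(pos)
        # The shift depends only on the text character under the last cell.
        pos += shift.get(text[pos + m - 1], m)
    return matches, windows


found, visited = horspool_search("GTACTAGAGGACGTATGTACTG", "ATGTA")
print(found)    # [14]
print(visited)  # [0, 1, 5, 7, 12, 14]
```

Six windows are examined here, against eighteen for the brute force search on the same text.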
Some questions about the Horspool algorithm • Given the pattern ATGTA, the shift table is: A 4, C 5, G 2, T 1. • Given a random text over an equally likely probability distribution (EPD):
1. Determine the expected shift of the window. What if the probability distribution is not equally likely?
2. Determine the expected number of shifts, assuming a text of length n.
3. Determine the expected number of comparisons in the suffix search phase.
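A possible way to approach question 1 is to note that the shift depends only on the text character under the last cell of the window, so the expected shift is the probability-weighted mean of the table entries. The sketch below assumes the text characters are independent draws from the stated distribution; the function name `expected_shift` and the non-uniform probabilities are made up for the illustration.

```python
from fractions import Fraction


def expected_shift(shift_table, probs=None):
    """Mean of the table entries, weighted by the character probabilities."""
    if probs is None:
        # Equally likely characters.
        probs = {c: Fraction(1, len(shift_table)) for c in shift_table}
    return sum(probs[c] * shift_table[c] for c in shift_table)


table = {"A": 4, "C": 5, "G": 2, "T": 1}
print(expected_shift(table))  # 3, that is (4 + 5 + 2 + 1) / 4

# Hypothetical skewed distribution, chosen only to show the effect.
skewed = {"A": Fraction(2, 5), "C": Fraction(1, 10),
          "G": Fraction(1, 10), "T": Fraction(2, 5)}
print(expected_shift(table, skewed))  # 27/10
```

Under the same assumptions, a rough estimate for question 2 is n divided by the expected shift, about n/3 windows for the uniform case.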
BNDM algorithm • How is the comparison made? Search for the suffixes of the window that are factors of the pattern. The positions where such a suffix can occur in the pattern are kept in a bit vector, for instance D2 = (1 0 0 0 1 0 0). Once the next character x is read, D3 = (D2 << 1) & B(x), where B(x) is the mask of x in the pattern P. For instance, if B(x) = (0 0 1 1 0 0 0), then D3 = (0 0 0 1 0 0 0) & (0 0 1 1 0 0 0) = (0 0 0 1 0 0 0). • What is the next position of the window? It depends on the value of the leftmost bit of D.
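The masks B(x) and the update of D can be sketched with plain integers, assuming the convention of the slides in which the leftmost bit corresponds to the first pattern character. The names `bndm_masks`, `bndm_step` and `bits` are assumptions made for this illustration.

```python
def bndm_masks(pattern: str, alphabet: str = "ACGT") -> dict[str, int]:
    """B(x): bit i (counted from the left) is set when pattern[i] == x."""
    m = len(pattern)
    masks = {c: 0 for c in alphabet}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - i))
    return masks


def bndm_step(d: int, mask: int, m: int) -> int:
    """One update D' = (D << 1) & B(x), truncated to m bits."""
    return ((d << 1) & ((1 << m) - 1)) & mask


def bits(value: int, m: int) -> str:
    """Render an m-bit vector, leftmost bit first."""
    return format(value, f"0{m}b")


masks = bndm_masks("ATGTA")
print({c: bits(v, 5) for c, v in masks.items()})
# {'A': '10001', 'C': '00000', 'G': '00100', 'T': '01010'}

# The update from the slide: D2 = 1000100, B(x) = 0011000.
print(bits(bndm_step(0b1000100, 0b0011000, 7), 7))  # 0001000
```

Python integers are unbounded, so this sketch does not show the practical limit of the technique: in a compiled implementation the pattern has to fit in one machine word for the update to be a single instruction.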
BNDM algorithm: example • Given the pattern ATGTA, the character masks are: B(A) = ( 1 0 0 0 1 ), B(C) = ( 0 0 0 0 0 ), B(G) = ( 0 0 1 0 0 ), B(T) = ( 0 1 0 1 0 ). • The searching phase over the text G T A C T A G A G G A C G T A T G T A C T G ...:
• Window G T A C T: D1 = ( 0 1 0 1 0 ), D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
• Window A G A G G: D1 = ( 0 0 1 0 0 ), D2 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )
• Window A C G T A: D1 = ( 1 0 0 0 1 ), D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 ), D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 ), D4 = ( 0 1 0 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
BNDM algorithm: example of window shift • Given the pattern ATGTA, the character masks are: B(A) = ( 1 0 0 0 1 ), B(C) = ( 0 0 0 0 0 ), B(G) = ( 0 0 1 0 0 ), B(T) = ( 0 1 0 1 0 ). • The searching phase over the text G T A C T A G A G G A C G T A T G T A C T G ...:
• Window A T G T A: D1 = ( 1 0 0 0 1 ), D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 ), D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 ), D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 ), D5 = ( 1 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 ): found, because the leftmost bit is set once the whole window has been read. D6 = ( 0 0 0 0 0 ) & ( * * * * * ) = ( 0 0 0 0 0 )
BNDM algorithm: example • Given the pattern ATGTA, the character masks are: B(A) = ( 1 0 0 0 1 ), B(C) = ( 0 0 0 0 0 ), B(G) = ( 0 0 1 0 0 ), B(T) = ( 0 1 0 1 0 ). • The searching phase over the text G T A C T A G A A T A C G T A T G T A C T G ...: how is the shift determined?
• Window G T A C T: D1 = ( 0 1 0 1 0 ), D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
• Window A G A A T: D1 = ( 0 1 0 1 0 ), D2 = ( 1 0 1 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 ), D3 = ( 0 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 )
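The whole BNDM search, including the rule for the shift, can be sketched as follows. The sketch follows the usual formulation of the algorithm rather than any code from the course, and the names `bndm_search`, `last` and `top` are assumptions. Whenever the leftmost bit of D is set before the window has been read completely, a prefix of the pattern has been seen and the variable `last` remembers where it starts; the window is then shifted by `last`.

```python
def bndm_search(text: str, pattern: str) -> tuple[list[int], list[int]]:
    """Return (match positions, visited window positions), both 0-based."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], []
    full = (1 << m) - 1            # m ones
    top = 1 << (m - 1)             # leftmost bit of D
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - i))
    matches, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        j, last, d = m, m, full
        while d != 0:
            # Read the window from right to left.
            d &= masks.get(text[pos + j - 1], 0)
            j -= 1
            if d & top:
                if j > 0:
                    last = j       # a pattern prefix starts at window offset j
                else:
                    matches.append(pos)
            d = (d << 1) & full
        pos += last                # shift to the last prefix seen, or by m
    return matches, windows


print(bndm_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))
# ([14], [0, 5, 10, 14])
print(bndm_search("GTACTAGAATACGTATGTACTG", "ATGTA"))
# ([14], [0, 5, 8, 13, 14])
```

In the second text the window A G A A T ends with the prefix AT of the pattern, which is why it is shifted by 3 cells instead of 5.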
Extended string matching
• Classes of characters: some DNA files or patterns contain additional characters such as N or R, where N = {A,C,G,T} and R = {G,A}.
• Bounded length gaps: patterns such as ATx(2,3)TA, where x(2,3) means any 2 or 3 characters.
• Optional characters: patterns such as AC?ACT?T?A, where C? means that C may or may not appear in the text.
• Wild cards: patterns such as AT*TA, where * means an arbitrarily long string.
• Repeatable characters: patterns such as AT[TA]*AT, where [TA]* means that TA can appear zero or more times.
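To make the five extensions concrete, each one can be written as a regular expression for Python's standard `re` module. This is only a stand-in for experimenting with the pattern classes: a general regex engine is used here, not the bit-parallel algorithms of the course. The helper `with_classes`, the pattern ATNRA and the test strings are made up for the illustration.

```python
import re

# Character classes named on the slide.
IUPAC_CLASSES = {"N": "[ACGT]", "R": "[GA]"}


def with_classes(pattern: str) -> str:
    """Expand class characters such as N or R into regex character classes."""
    return "".join(IUPAC_CLASSES.get(c, c) for c in pattern)


EXAMPLES = {
    "classes of characters": with_classes("ATNRA"),   # hypothetical pattern
    "bounded length gap": "AT[ACGT]{2,3}TA",          # ATx(2,3)TA
    "optional characters": "AC?ACT?T?A",              # AC?ACT?T?A
    "wild card": "AT[ACGT]*TA",                       # AT*TA
    "repeatable characters": "AT(?:TA)*AT",           # AT[TA]*AT, TA repeated
}


def find_all(regex: str, text: str) -> list[tuple[int, str]]:
    """Return (start, matched string) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in re.finditer(regex, text)]


print(EXAMPLES["classes of characters"])                     # AT[ACGT][GA]A
print(find_all(EXAMPLES["optional characters"], "GGAACAGG"))  # [(2, 'AACA')]
```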
Exact string matching for one pattern • Most efficient algorithms (Navarro & Raffinot). BNDM: Backward Nondeterministic Dawg Matching. BOM: Backward Oracle Matching. [Figure: the same map of alphabet size |Σ| against pattern length, with regions labelled Horspool, BOM and BNDM.]
Factor Oracle automaton: properties • [Figure: Factor Oracle of the word G T A T G T A, with transitions labelled G, T, A, G, T, T, A, G, T, A.] • All states are accepting states. • It recognizes all the factors of the word, but also some other words: which ones? • If a word is rejected, it isn't a factor, then ...
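BOM algorithm (Backward Oracle Matching) • How is the comparison made? With the Factor Oracle automaton, which checks the window from right to left. • How many cells is the window shifted? • If the character a has no transition in the automaton, the window is shifted just past the position of a. • If the last state of the automaton is reached with a, the whole window has been read and the pattern has been found.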
BOM algorithm: example • How is the comparison made? The automaton of the reversed pattern is built: given the pattern ATGTATG, the Factor Oracle of G T A T G T A is used. • The search over the text G T A C T A G A A T G T G T A G A C A T G T A T G G G A ...: the window A T G T A T G is placed at text positions 0, 6, 8, 11 and 18 (0-based), and the pattern is found at position 18.
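The search in this example can be reproduced with the sketch below. It follows the usual formulation of BOM and is not taken from the course; the function names are assumptions. It relies on the property that the only word of the same length as the pattern that the oracle accepts is the reversed pattern itself, so reading a whole window without failing means a match.

```python
def build_factor_oracle(word: str) -> list[dict[str, int]]:
    """Return the transitions of the factor oracle; state 0 is initial."""
    m = len(word)
    trans = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)          # supply links used during construction
    for i in range(1, m + 1):
        c = word[i - 1]
        trans[i - 1][c] = i
        k = supply[i - 1]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i          # extra transition into the new state
            k = supply[k]
        supply[i] = 0 if k == -1 else trans[k][c]
    return trans


def bom_search(text: str, pattern: str) -> tuple[list[int], list[int]]:
    """Return (match positions, visited window positions), both 0-based."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], []
    trans = build_factor_oracle(pattern[::-1])   # oracle of the reversed pattern
    matches, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        state, j = 0, m
        # Read the window from right to left until the oracle fails.
        while j > 0 and state is not None:
            state = trans[state].get(text[pos + j - 1])
            j -= 1
        if state is not None:
            matches.append(pos)      # whole window read: occurrence
        pos += j + 1                 # restart just past the failing character
    return matches, windows


text = "GTACTAGAATGTGTAGACATGTATGGGA"
print(bom_search(text, "ATGTATG"))  # ([18], [0, 6, 8, 11, 18, 19])
```

The first five windows are the ones drawn on the slide; the window at position 19 is the one examined after the match.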
Factor Oracle automaton • Given the pattern GTATA, in which state is each factor accepted? [Figure: the states of the automaton labelled with the factors that end there: G; GT, T; GTA, TA, A; GTAT, TAT, AT, T.] • When the new T is read, 4 factors should be accepted: GTAT, TAT, AT, T. How can they be reached? • When the new A is read, 5 factors should be accepted: GTATA, TATA, ATA, TA, A. How can they be reached?
Factor Oracle automaton • [Figure: the same automaton extended with the states for GTATA, TATA, ATA, TA, A and for GTATAG, TATAG, ATAG, TAG, AG, G.] • When the new G is read, 6 factors should be accepted: GTATAG, TATAG, ATAG, TAG, AG, G.
Factor Oracle automaton: algorithm • If there is a T transition ...
Factor Oracle automaton: algorithm • But if there isn't a T transition ... and recursively continue ...
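The two cases above correspond to the loop of the usual on-line construction of the factor oracle, sketched below. The names `build_factor_oracle`, `supply` and `accepts` are assumptions, and the supply link is the standard device for "recursively continue", which is not named on the slides. When a new character is appended, the construction follows the supply links from the previous state: while no transition by that character exists, one is added into the new state; as soon as one exists, the walk stops.

```python
def build_factor_oracle(word: str):
    """Return (transitions, supply links) of the factor oracle of word."""
    m = len(word)
    trans = [dict() for _ in range(m + 1)]   # state 0 is the initial state
    supply = [-1] * (m + 1)
    for i in range(1, m + 1):
        c = word[i - 1]
        trans[i - 1][c] = i                  # internal transition
        k = supply[i - 1]
        # No transition by c: add one into the new state and continue
        # recursively along the supply links.
        while k > -1 and c not in trans[k]:
            trans[k][c] = i
            k = supply[k]
        # A transition by c exists (or the links ran out): stop here.
        supply[i] = 0 if k == -1 else trans[k][c]
    return trans, supply


def accepts(trans, word: str) -> bool:
    """All states are accepting, so a word is accepted if it can be read."""
    state = 0
    for c in word:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True


trans, supply = build_factor_oracle("GTATGTA")
print(trans[0])  # {'G': 1, 'T': 2, 'A': 3}
print(trans[2])  # {'A': 3, 'G': 5}

# Every factor is accepted, but so is GTG, which is not a factor of GTATGTA.
print(accepts(trans, "GTG"), "GTG" in "GTATGTA")  # True False
```

The last line gives one answer to the earlier question about which extra words the oracle recognizes; it also shows why a rejection is conclusive while an acceptance is not.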