Information Retrieval
• Modern Information Retrieval (1999)
  Ricardo Baeza-Yates and Berthier Ribeiro-Neto
• Flexible Pattern Matching in Strings (2002)
  Gonzalo Navarro and Mathieu Raffinot
• http://www-igm.univ-mlv.fr/~lecroq/string/index.html
Algorithms for:
• Pattern search, exact and approximate (string matching and pattern matching)
• Text indexing: suffix trees, suffix arrays
String Matching
String matching: definition of the problem (text, pattern) depends on what we have: text or patterns
• Exact matching:
  • The patterns ---> Data structures for the patterns
    • 1 pattern ---> The algorithm depends on |p| and |Σ|
    • k patterns ---> The algorithm depends on k, |p| and |Σ|
    • Extensions: Regular Expressions
  • The text ---> Data structure for the text (suffix tree, ...)
• Approximate matching:
  • Dynamic programming
  • Sequence alignment (pairwise and multiple)
  • Sequence assembly: hash algorithm
  • Probabilistic search: Hidden Markov Models
Exact string matching: one pattern (text on-line)
Experimental efficiency (Navarro & Raffinot)
BNDM: Backward Nondeterministic Dawg Matching
BOM: Backward Oracle Matching
[Figure: map of the fastest algorithm (Horspool, BOM or BNDM) as a function of the pattern length (horizontal axis, 2 to 256, with w marked) and the alphabet size |Σ| (vertical axis, 2 to 64)]
Multiple string matching
[Figure: four maps of the fastest algorithm for sets of 5, 10, 100 and 1000 strings, as a function of lmin (horizontal axis, 5 to 45) and the alphabet size |Σ| (vertical axis, 2 to 8). Regions: Wu-Manber and SBOM, plus Ad AC for the sets of 10 strings or more]
Trie
Construct the trie of GTATGTA, GTAT, TAATA, GTGTA
[Figure: the trie, built one pattern at a time. Its branches are G-T-A-T (end of GTAT) continued by G-T-A (end of GTATGTA); G-T-G-T-A (end of GTGTA), which shares the prefix GT with the first branch; and T-A-A-T-A (end of TAATA)]
What is the cost?
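As a rough answer to the cost question: inserting a pattern visits or creates one node per symbol, so building the trie takes time proportional to the total length of the patterns, assuming constant-time child lookup. The sketch below is only an illustration; the nested-dictionary representation, the `None` end marker and the function names are assumptions, not part of the slides.

```python
def build_trie(patterns):
    """Build a trie as nested dicts; the key None marks the end of a pattern."""
    root = {}
    for pattern in patterns:
        node = root
        for symbol in pattern:
            # Reuse the child if it exists, otherwise create it.
            node = node.setdefault(symbol, {})
        node[None] = pattern
    return root


def count_nodes(node):
    """Count the nodes of the trie, including the root."""
    return 1 + sum(count_nodes(child)
                   for key, child in node.items() if key is not None)


trie = build_trie(["GTATGTA", "GTAT", "TAATA", "GTGTA"])
print(count_nodes(trie) - 1)  # 15 non-root nodes
```

The 15 non-root nodes correspond to the 15 letters of the drawn trie: the 21 pattern symbols minus the shared prefixes.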
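Set Horspool algorithm
• How is the comparison made? By suffixes: the text window is read backwards and matched against the trie of all the reversed patterns.
• What is the next position of the window? Let a be the last text symbol of the window. We shift until a is aligned with its first occurrence in the trie (not counting the first level) at a depth smaller than lmin; if there is none, we shift by lmin.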
Set Horspool algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
1. Construct the trie of the reversed patterns GTATGTA, GTAT, TAATA and GTGTA
2. Determine lmin = 4
3. Determine the shift table: A 1, C 4 (lmin), G 2, T 1
4. Find the patterns
[Figure: the trie of GTATGTA, GTAT, TAATA and GTGTA next to the shift table]
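Steps 2 and 3 can be sketched as follows. For every symbol the table stores the smallest distance from one of its occurrences to the end of a pattern, ignoring the last symbol of each pattern and never exceeding lmin. The function name and the explicit `alphabet` argument are assumptions made for this illustration.

```python
def set_horspool_shift_table(patterns, alphabet):
    """Return (lmin, shift), where shift maps each symbol to its window shift."""
    lmin = min(len(p) for p in patterns)
    # Symbols that occur in no pattern keep the maximum shift, lmin.
    shift = {symbol: lmin for symbol in alphabet}
    for p in patterns:
        # The last symbol of each pattern is skipped (distance 0).
        for i in range(len(p) - 1):
            distance = len(p) - 1 - i
            shift[p[i]] = min(shift.get(p[i], lmin), distance)
    return lmin, shift


lmin, shift = set_horspool_shift_table(
    ["ATGTATG", "TATG", "ATAAT", "ATGTG"], "ACGT")
print(lmin, shift)  # 4 {'A': 1, 'C': 4, 'G': 2, 'T': 1}
```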
Set Horspool algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
Shift table: A 1, C 4 (lmin), G 2, T 1
text: ACATGCTATGTGACA…
[Figure: the trie of the reversed patterns and the window sliding along the text, one step per animation frame]
The more patterns we search for, the shorter the shifts!
Is the expected length of the shifts related to the number of patterns?
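The whole search can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function name, the dictionary-based trie with a `None` end marker and the 0-based start positions are assumptions, and the text is only the visible part of the slide's example (the trailing "…" is dropped).

```python
def set_horspool_search(text, patterns):
    """Return (start, pattern) for every occurrence of a pattern in text."""
    lmin = min(len(p) for p in patterns)

    # Trie of the reversed patterns; the key None marks the end of a pattern.
    trie = {}
    for p in patterns:
        node = trie
        for symbol in reversed(p):
            node = node.setdefault(symbol, {})
        node[None] = p

    # Shift table: smallest distance to a pattern end, last symbol excluded.
    shift = {}
    for p in patterns:
        for i in range(len(p) - 1):
            shift[p[i]] = min(shift.get(p[i], lmin), len(p) - 1 - i)

    matches = []
    end = lmin - 1  # index of the last symbol of the current window
    while end < len(text):
        node, i = trie, end
        # Read the text backwards while the trie has a matching branch.
        while i >= 0 and text[i] in node:
            node = node[text[i]]
            i -= 1
            if None in node:
                matches.append((i + 1, node[None]))
        # Shift according to the last symbol of the window.
        end += shift.get(text[end], lmin)
    return matches


print(set_horspool_search("ACATGCTATGTGACA",
                          ["ATGTATG", "TATG", "ATAAT", "ATGTG"]))
# [(6, 'TATG'), (7, 'ATGTG')]
```

On this text the window moves by 1, 2, 1, 1, 1, 2, 2 and 4 positions, which illustrates how short the shifts already are with four patterns.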
Set Horspool algorithm → Wu-Manber algorithm
How can the length of the shifts be increased? By reading blocks of symbols instead of only one!
Given ATGTATG, TATG, ATAAT, ATGTG:
1 symbol: A 1, C 4 (lmin), G 2, T 1
2 symbols: AA 1, AC 3 (lmin - L + 1, with L = 2 symbols per block), AG 3, AT 1, CA 3, CC 3, CG 3, …
Blocks with a shift below the default of 3: AA 1, AT 1, GT 1, TA 2, TG 2
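The two-symbol table above can be reproduced with a short sketch. It follows the convention that the slide's numbers imply: distances are measured on the whole patterns and the final block of each pattern is skipped, so the table never contains 0. The original Wu-Manber formulation may differ in such details. The function name and the `block` parameter are assumptions.

```python
def block_shift_table(patterns, block=2):
    """Return (default_shift, shift) for blocks of `block` symbols."""
    lmin = min(len(p) for p in patterns)
    default = lmin - block + 1  # shift for blocks that occur in no pattern
    shift = {}
    for p in patterns:
        # `end` is the index of the last symbol of the block; the final
        # block of the pattern (distance 0) is skipped.
        for end in range(block - 1, len(p) - 1):
            key = p[end - block + 1:end + 1]
            distance = len(p) - 1 - end
            shift[key] = min(shift.get(key, default), distance)
    return default, shift


default, shift = block_shift_table(["ATGTATG", "TATG", "ATAAT", "ATGTG"])
print(default, sorted(shift.items()))
# 3 [('AA', 1), ('AT', 1), ('GT', 1), ('TA', 2), ('TG', 2)]
```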
Wu-Manber algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
Block shift table: AA 1, AT 1, GT 1, TA 2, TG 2
text: ACATGCTATGTGACATAATA
[Figure: the trie of the reversed patterns and the window sliding along the text, one step per animation frame]
But given k patterns, how many symbols should we take? … log_|Σ|(2·lmin·k)
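A minimal sketch of this block-based search, under the same conventions as the slide's table: every window is verified backwards against the trie of the reversed patterns and then shifted according to its last two symbols. It is not the full Wu-Manber algorithm with hash tables. The function names are assumptions, and the sketch assumes lmin is at least the block length.

```python
import math


def suggested_block_size(k, lmin, sigma):
    """Block length suggested by the slide's formula log_sigma(2 * lmin * k)."""
    return math.log(2 * lmin * k, sigma)


def block_horspool_search(text, patterns, block=2):
    """Return (start, pattern) for every occurrence; assumes lmin >= block."""
    lmin = min(len(p) for p in patterns)
    default = lmin - block + 1

    # Trie of the reversed patterns; the key None marks the end of a pattern.
    trie = {}
    for p in patterns:
        node = trie
        for symbol in reversed(p):
            node = node.setdefault(symbol, {})
        node[None] = p

    # Block shift table; the final block of each pattern is skipped.
    shift = {}
    for p in patterns:
        for end in range(block - 1, len(p) - 1):
            key = p[end - block + 1:end + 1]
            shift[key] = min(shift.get(key, default), len(p) - 1 - end)

    matches = []
    end = lmin - 1  # index of the last symbol of the current window
    while end < len(text):
        node, i = trie, end
        while i >= 0 and text[i] in node:
            node = node[text[i]]
            i -= 1
            if None in node:
                matches.append((i + 1, node[None]))
        # Shift according to the last `block` symbols of the window.
        end += shift.get(text[end - block + 1:end + 1], default)
    return matches


print(block_horspool_search("ACATGCTATGTGACATAATA",
                            ["ATGTATG", "TATG", "ATAAT", "ATGTG"]))
# [(6, 'TATG'), (7, 'ATGTG'), (14, 'ATAAT')]
```

For the four patterns of the example (k = 4, lmin = 4, |Σ| = 4) the formula gives log_4(32) = 2.5, which is consistent with using blocks of two or three symbols.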
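BOM algorithm (Backward Oracle Matching)
• How is the comparison made? The text window is read backwards through an automaton, the Factor Oracle of the reversed pattern, to check whether the suffix read so far is a factor of the pattern.
• What is the next position of the window? The window is shifted so that it starts at the last text character read that still had a transition in the automaton, that is, just past the character where the automaton failed.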
Factor Oracle of k strings
How can we build the Factor Oracle of GTATGTA, GTAA, TAATA and GTGTA?
[Figure: the resulting automaton, with final states labelled 1,4, 2 and 3]
Factor Oracle of k strings
Given the Factor Oracle of GTATGTA
[Figure: the Factor Oracle of GTATGTA, drawn over several animation frames]
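For a single string, the factor oracle can be built on-line with the standard supply-function construction, sketched below for GTATGTA. The function names and the list-of-dictionaries representation are assumptions made for this illustration.

```python
def factor_oracle(p):
    """Build the factor oracle of p; state i is reached after reading p[:i]."""
    m = len(p)
    delta = [dict() for _ in range(m + 1)]  # delta[state][symbol] -> state
    supply = [None] * (m + 1)               # supply function, None for state 0
    for i in range(m):
        c = p[i]
        delta[i][c] = i + 1                 # internal transition
        k = supply[i]
        # Add external transitions along the supply path until c is defined.
        while k is not None and c not in delta[k]:
            delta[k][c] = i + 1
            k = supply[k]
        supply[i + 1] = 0 if k is None else delta[k][c]
    return delta, supply


def accepts(delta, word):
    """True if word can be read from the initial state of the oracle."""
    state = 0
    for c in word:
        if c not in delta[state]:
            return False
        state = delta[state][c]
    return True


delta, supply = factor_oracle("GTATGTA")
print(delta[0])  # {'G': 1, 'T': 2, 'A': 3}
print(delta[2])  # {'A': 3, 'G': 5}
```

The oracle has the 7 internal transitions plus 3 external ones (T and A from state 0, G from state 2). It accepts every factor of GTATGTA, but also some words that are not factors: GTG is accepted through the states 0, 1, 2, 5 although it does not occur in GTATGTA. This is harmless for the search, because a failed transition still proves that the word read is not a factor.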
Factor Oracle of k strings
Given the Factor Oracle of GTATGTA (final state 1) … we insert GTAA (final state 2)
Given the AFO of GTATGTA and GTAA … we insert TAATA (final state 3)
Given the AFO of GTATGTA, GTAA and TAATA … we insert GTGTA
[Figure: the automaton after each insertion, one step per animation frame]
Factor Oracle of k strings
This is the Automata Factor Oracle (AFO) of GTATGTA, GTAA, TAATA and GTGTA
[Figure: the final automaton, with final states labelled 1,4, 2 and 3]
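One way to build a factor oracle for a set of strings is to start from their trie and add external transitions while visiting the states in breadth-first order. The sketch below uses that trie-based construction, which is an assumption about the method: the slides insert the strings one at a time, and the figure labels a single final state with 1,4, so the automaton produced here (one trie state per distinct pattern) may differ from the one drawn. Function names are hypothetical.

```python
from collections import deque


def multi_factor_oracle(patterns):
    """Trie of the patterns plus external transitions added in BFS order."""
    delta = [{}]      # delta[state][symbol] -> state; state 0 is the root
    parent = [None]   # parent[state] = (parent state, symbol of the trie edge)
    for p in patterns:
        state = 0
        for c in p:
            if c not in delta[state]:
                delta.append({})
                parent.append((state, c))
                delta[state][c] = len(delta) - 1
            state = delta[state][c]

    # Fix the BFS order now, while delta still holds trie edges only.
    order, queue = [], deque([0])
    while queue:
        s = queue.popleft()
        for child in delta[s].values():
            order.append(child)
            queue.append(child)

    supply = [None] * len(delta)
    for current in order:
        par, c = parent[current]
        down = supply[par]
        while down is not None and c not in delta[down]:
            delta[down][c] = current  # external transition
            down = supply[down]
        supply[current] = 0 if down is None else delta[down][c]
    return delta, supply


def accepts(delta, word):
    """True if word can be read from the initial state of the oracle."""
    state = 0
    for c in word:
        if c not in delta[state]:
            return False
        state = delta[state][c]
    return True


patterns = ["GTATGTA", "GTAA", "TAATA", "GTGTA"]
delta, supply = multi_factor_oracle(patterns)
print(len(delta))                # 17 states, including the root
print(accepts(delta, "ATGTA"))   # True: a factor of GTATGTA
```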
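SBOM algorithm
• How is the comparison made? The text window, of length lmin, is read backwards through an automaton, the Factor Oracle of the reversed patterns cut to length lmin, to check whether the suffix read so far is a factor of any pattern.
• What is the next position of the window? The window is shifted so that it starts at the last text character read that still had a transition in the automaton, that is, just past the character where the automaton failed.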
SBOM algorithm: example
We search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG
… then we build the Automata Factor Oracle of GTATG, GTAAT, TAATA and GTGTA, of length lmin = 5
text: ACATGCTAGCTATAATAATGTATG
[Figure: the automaton, with final states labelled 1 to 4, and the window sliding along the text]
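The example can be traced with the sketch below. It makes several assumptions that are not in the slides: the oracle is built with a trie-based, breadth-first construction; a window that is read completely is verified by comparing every pattern against the text, rather than through the final-state labels; and positions are 0-based. Function names are hypothetical.

```python
from collections import deque


def build_oracle(words):
    """Factor oracle of a set of words: trie plus external transitions."""
    delta, parent = [{}], [None]
    for w in words:
        state = 0
        for c in w:
            if c not in delta[state]:
                delta.append({})
                parent.append((state, c))
                delta[state][c] = len(delta) - 1
            state = delta[state][c]
    # BFS order over trie edges, fixed before external transitions are added.
    order, queue = [], deque([0])
    while queue:
        s = queue.popleft()
        for child in delta[s].values():
            order.append(child)
            queue.append(child)
    supply = [None] * len(delta)
    for current in order:
        par, c = parent[current]
        down = supply[par]
        while down is not None and c not in delta[down]:
            delta[down][c] = current
            down = supply[down]
        supply[current] = 0 if down is None else delta[down][c]
    return delta


def sbom_search(text, patterns):
    """Return (start, pattern) for every occurrence of a pattern in text."""
    lmin = min(len(p) for p in patterns)
    # Oracle of the reversed patterns, cut to their first lmin symbols.
    delta = build_oracle([p[::-1][:lmin] for p in patterns])
    matches = []
    end = lmin  # the window is text[end - lmin:end]
    while end <= len(text):
        state, read = 0, 0
        # Read the window backwards while the oracle has a transition.
        while read < lmin and text[end - 1 - read] in delta[state]:
            state = delta[state][text[end - 1 - read]]
            read += 1
        if read == lmin:
            # Candidate position: check which patterns really end here.
            for p in patterns:
                if text.endswith(p, 0, end):
                    matches.append((end - len(p), p))
            end += 1
        else:
            # Move the window just past the symbol where the oracle failed.
            end += lmin - read
    return matches


print(sbom_search("ACATGCTAGCTATAATAATGTATG",
                  ["ATGTATG", "TAATG", "TAATAAT", "AATGTG"]))
# [(12, 'TAATAAT'), (15, 'TAATG'), (17, 'ATGTATG')]
```

On this text the first windows fail after three, one and one symbols, giving shifts of 2, 4 and 4 positions, so most of the first half of the text is never read.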