Information Retrieval
• Modern Information Retrieval (1999), Ricardo Baeza-Yates and Berthier Ribeiro-Neto
• Flexible Pattern Matching in Strings (2002), Gonzalo Navarro and Mathieu Raffinot
• Algorithms on Strings (2001), M. Crochemore, C. Hancart and T. Lecroq
• http://www-igm.univ-mlv.fr/~lecroq/string/index.html
String Matching
String matching: the definition of the problem (text, pattern) depends on what we have: the text or the patterns.
• Exact matching:
  • The patterns ---> Data structures for the patterns
    • 1 pattern ---> The algorithm depends on |p| and |Σ|
    • k patterns ---> The algorithm depends on k, |p| and |Σ|
    • Extensions
      • Regular expressions
  • The text ---> Data structure for the text (suffix tree, ...)
• Approximate matching:
  • Dynamic programming
  • Sequence alignment (pairwise and multiple)
  • Sequence assembly: hash algorithm
• Probabilistic search: Hidden Markov Models
Regular expression
A regular expression ℛ is a string over the set of symbols Σ ∪ { ε, |, ·, *, (, ) } that is defined recursively as follows:
• ε (the empty string) is a regular expression
• A character of Σ is a regular expression
• ( ℛ ) is a regular expression
• ℛ1 · ℛ2 is a regular expression
• ℛ1 | ℛ2 is a regular expression
• ℛ* is a regular expression
Regular language
The language defined by a regular expression ℛ is the set of strings generated by ℛ. Searching for a regular expression in a text T means finding all the factors (substrings) of T that belong to this language.
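The search problem stated above can be illustrated with a brute-force sketch that tests every non-empty factor of the text against the expression. It relies on Python's standard `re` module rather than on the automata of the following slides, and the function name `matching_factors` is only an illustrative choice.

```python
import re


def matching_factors(pattern: str, text: str):
    """Return all (start, end) pairs such that text[start:end] belongs
    to the language of `pattern` (brute force over every factor)."""
    compiled = re.compile(pattern)
    return [
        (start, end)
        for start in range(len(text))
        for end in range(start + 1, len(text) + 1)
        # fullmatch with pos/endpos checks that the whole factor matches
        if compiled.fullmatch(text, start, end)
    ]


print(matching_factors("bb*(b|b*a)", "abba"))
```

On the text `abba` the factors found are `bb`, `bba` and `ba`. The number of factors is quadratic in the length of the text, which is one reason the automaton-based methods are preferred.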
Methods
Regular expression ---> Parse tree ---> NFA (Thompson automaton)
• NFA ---> Search with the bit-parallel Thompson automaton ---> Strings found
• NFA ---> DFA ---> Search with the deterministic finite automaton ---> Strings found
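Search with a deterministic finite automaton
Given the regular expression bb*(b|b*a), the NFA is:
[Figure: NFA with states 0, 1, 2, 3; transitions labeled b, b, b, a]
Because the NFA is nondeterministic, the text cannot simply be read through it, so the NFA is transformed into a DFA:
[Figure: DFA with states 0, 1, 12, 3; transitions labeled b, b, b, a, a]
What is the cost? And what about the search process?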
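The NFA to DFA transformation can be sketched with the classical subset construction. The arrows of the NFA below are an assumption: they are reconstructed from the state names and labels of the figure (0 to 1 on b, a loop on 1 and an arrow to 2 on b, 1 to 3 on a, finals 2 and 3), chosen so that the automaton accepts bb*(b|b*a). The names `NFA` and `subset_construction` are illustrative.

```python
from collections import deque

# Assumed NFA for bb*(b|b*a): a reconstruction, not a copy of the figure.
NFA = {
    (0, "b"): {1},
    (1, "b"): {1, 2},
    (1, "a"): {3},
}
NFA_START = 0
NFA_FINALS = {2, 3}


def subset_construction(nfa, start, finals, alphabet):
    """Build a DFA whose states are sets of NFA states (no ε-transitions)."""
    start_set = frozenset([start])
    dfa = {}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue  # this subset was already expanded
        dfa[current] = {}
        for symbol in alphabet:
            target = frozenset(
                t for s in current for t in nfa.get((s, symbol), ())
            )
            if target:  # an empty target means "no transition"
                dfa[current][symbol] = target
                queue.append(target)
    dfa_finals = {state for state in dfa if state & finals}
    return dfa, start_set, dfa_finals


dfa, dfa_start, dfa_finals = subset_construction(NFA, NFA_START, NFA_FINALS, "ab")
for state, moves in dfa.items():
    print(sorted(state), {sym: sorted(dst) for sym, dst in moves.items()})
```

Under this assumption the reachable subsets are {0}, {1}, {1, 2} and {3}, which agrees with the DFA state names 0, 1, 12 and 3. As for the cost, an NFA with m states can in the worst case produce a DFA with up to 2^m states, although here only four subsets are reachable.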
Search example with DFA
Given the regular expression bb*(b|b*a) and its DFA:
[Figure: DFA with states 0, 1, 12, 3; transitions labeled b, b, b, a, a]
The search on the text: b b b a a b a a b b …
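The search on this text can be sketched as follows. The transition table is an assumption reconstructed from the labels of the figure, and the sketch restarts the DFA at every text position, which is simple but quadratic in the worst case. A single-pass search would instead use a DFA built for Σ* followed by the expression. Only the visible prefix of the text is used.

```python
# Assumed DFA for bb*(b|b*a); state names follow the figure (0, 1, 12, 3).
DFA = {
    ("0", "b"): "1",
    ("1", "b"): "12",
    ("1", "a"): "3",
    ("12", "b"): "12",
    ("12", "a"): "3",
}
FINALS = {"12", "3"}


def dfa_search(text):
    """Return every (start, end) such that text[start:end] is accepted."""
    occurrences = []
    for start in range(len(text)):
        state = "0"
        for end in range(start, len(text)):
            state = DFA.get((state, text[end]))
            if state is None:  # no transition: this start cannot be extended
                break
            if state in FINALS:
                occurrences.append((start, end + 1))
    return occurrences


print(dfa_search("bbbaabaabb"))
```

Each reported pair is a factor of the text that belongs to the language, for example (0, 4) for `bbba` and (5, 7) for `ba`.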
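Parse tree
The parse tree of a regular expression ℛ is a tree such that:
- internal nodes are labeled by operators
- leaves are labeled by characters of Σ and ε
[Figure: tree shapes for ( ℛ ), ℛ1 · ℛ2, ℛ1 | ℛ2 and ℛ*: a "·" node and a "|" node, each with children ℛ1 and ℛ2, and a "*" node with child ℛ]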
Parse tree: example
Given the regular expression bb*(b|b*a), the parse tree is:
[Figure: parse tree with internal nodes ·, | and *, and leaves b, b, b, b, a]
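A parse tree like this one can be produced by a small recursive-descent parser. The sketch below makes several assumptions: concatenation is implicit in the input, the operators bind in the usual order (star, then concatenation, then union), concatenation associates to the left, and the tree is represented as nested tuples. The shape of the tree in the figure may group the concatenations differently.

```python
def parse(regex):
    """Parse a regular expression into nested tuples:
    a character, (".", left, right), ("|", left, right) or ("*", child)."""
    pos = 0

    def peek():
        return regex[pos] if pos < len(regex) else None

    def take():
        nonlocal pos
        ch = regex[pos]
        pos += 1
        return ch

    def union():
        node = concat()
        while peek() == "|":
            take()
            node = ("|", node, concat())
        return node

    def concat():
        node = star()
        # keep concatenating until the end, a '|' or a closing ')'
        while peek() is not None and peek() not in "|)":
            node = (".", node, star())
        return node

    def star():
        node = atom()
        while peek() == "*":
            take()
            node = ("*", node)
        return node

    def atom():
        ch = peek()
        if ch == "(":
            take()
            node = union()
            if peek() != ")":
                raise ValueError("missing ')'")
            take()
            return node
        if ch is None or ch in "|)*":
            raise ValueError(f"unexpected symbol at position {pos}")
        return take()

    tree = union()
    if pos != len(regex):
        raise ValueError(f"unexpected symbol at position {pos}")
    return tree


print(parse("bb*(b|b*a)"))
```

The internal nodes of the result are the operators and the leaves are the characters b, b, b, b and a, as in the definition of the parse tree.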
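NFA (Thompson automaton)
From the regular expression or from the parse tree we define the automaton:
• For a character a of Σ: [Figure: two states joined by a transition labeled a]
• For ℛ1 · ℛ2: [Figure: the automaton of ℛ1 followed by the automaton of ℛ2]
• For ℛ1 | ℛ2: [Figure: the automata of ℛ1 and ℛ2 in parallel, joined by four ε-transitions]
• For ℛ*: [Figure: the automaton of ℛ with ε-transitions]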
Thompson automaton construction
[Figure: parse tree of bb*(b|b*a) and the Thompson automaton built from it, with transitions labeled b, b, a, b, b]
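The construction can be sketched as a recursive walk over the parse tree. The tuple representation of the tree, the function name `thompson` and the order in which states are numbered are assumptions, so the state numbers inside the union differ from those of the figure. The sketch does not handle ε leaves.

```python
def thompson(tree):
    """Build an ε-NFA from a parse tree given as nested tuples:
    a character, (".", l, r), ("|", l, r) or ("*", child)."""
    char_edges = []  # (source, symbol, target)
    eps_edges = []   # (source, target)
    count = 1        # state 0 is the initial state

    def new_state():
        nonlocal count
        count += 1
        return count - 1

    def build(node, start):
        """Attach the automaton of `node` at `start`; return its last state."""
        if isinstance(node, str):  # leaf: one character transition
            end = new_state()
            char_edges.append((start, node, end))
            return end
        op = node[0]
        if op == ".":  # the second automaton starts where the first ends
            return build(node[2], build(node[1], start))
        if op == "|":  # two parallel branches joined by ε-transitions
            branch_ends = []
            for child in node[1:]:
                entry = new_state()
                eps_edges.append((start, entry))
                branch_ends.append(build(child, entry))
            end = new_state()
            eps_edges.extend((b, end) for b in branch_ends)
            return end
        if op == "*":  # loop back, or skip the inner automaton entirely
            entry = new_state()
            eps_edges.append((start, entry))
            inner_end = build(node[1], entry)
            end = new_state()
            eps_edges.extend([(inner_end, entry), (inner_end, end), (start, end)])
            return end
        raise ValueError(f"unknown operator: {op!r}")

    final = build(tree, 0)
    return {"states": count, "start": 0, "final": final,
            "chars": char_edges, "eps": eps_edges}


TREE = (".", (".", "b", ("*", "b")), ("|", "b", (".", ("*", "b"), "a")))
nfa = thompson(TREE)
print(nfa["states"], nfa["chars"])
```

For bb*(b|b*a) this gives 13 states, numbered 0 to 12, and every character transition goes from a state i to the state i + 1. That property is what allows the bit-parallel simulation to advance all character transitions with a single shift.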
NFA: ε-closure (states reachable through ε-transitions)
Regular expression: bb*(b|b*a)
[Figure: Thompson NFA with states 0 to 12; transitions labeled b, b, a, b, b]
State   ε-closure
1       1, 2, 4, 5, 6, 8, 10
3       2, 3, 4, 5, 6, 8, 10
4       4, 5, 6, 8, 10
5       5, 6, 8
7       6, 7, 8
9       9, 12
11      11, 12
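The ε-closure of a state is the set of states reachable from it using only ε-transitions, and it can be computed with a simple graph traversal. The ε-arrows in `EPS` are an assumption: the figure did not survive, so they are inferred from the closure table so that the traversal reproduces it.

```python
# Assumed ε-transitions of the Thompson NFA for bb*(b|b*a),
# inferred from the closure table (state -> ε-successors).
EPS = {
    1: [2, 4],
    3: [2, 4],
    4: [5, 10],
    5: [6, 8],
    7: [6, 8],
    9: [12],
    11: [12],
}


def epsilon_closure(state, eps=EPS):
    """Return the set of states reachable from `state` through ε-transitions,
    including `state` itself."""
    closure = {state}
    stack = [state]
    while stack:
        current = stack.pop()
        for target in eps.get(current, ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


for s in (1, 3, 4, 5, 7, 9, 11):
    print(s, sorted(epsilon_closure(s)))
```

With these arrows the printed closures coincide with the rows of the table.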
Bit-parallel Thompson algorithm
Regular expression: bb*(b|b*a)    Text: ababbbaab
[Figure: Thompson NFA with states 0 to 12]
• The bit-vector D marks the active states. At the beginning only state 0 is active:
    state  0  1  2  3  4  5  6  7  8  9  10 11 12
    D      1  0  0  0  0  0  0  0  0  0  0  0  0
    ->     0  1  0  0  0  0  0  0  0  0  0  0  0   (D shifted one position to the right)
• At every step D is shifted one position to the right and then combined, with a bitwise AND, with the mask B of the character just read. The masks are:
    state  0  1  2  3  4  5  6  7  8  9  10 11 12
    B[a]   0  0  0  0  0  0  0  0  0  1  0  0  0
    B[b]   0  1  0  1  0  0  0  1  0  0  0  1  0
• Finally, the set of active states is extended with the ε-closure of each active state:
    State   ε-closure
    1       1, 2, 4, 5, 6, 8, 10
    3       2, 3, 4, 5, 6, 8, 10
    4       4, 5, 6, 8, 10
    5       5, 6, 8
    7       6, 7, 8
    9       9, 12
    11      11, 12
Bit-parallel Thompson algorithm: trace on the text ababbbaab
Bit positions are the states 0 to 12, from left to right.
                   0  1  2  3  4  5  6  7  8  9  10 11 12
D, row 0           1  0  0  0  0  0  0  0  0  0  0  0  0
Read a:
  shift ->         0  1  0  0  0  0  0  0  0  0  0  0  0
  AND B[a]         0  0  0  0  0  0  0  0  0  0  0  0  0
Read b (the initial state stays active):
  shift ->         0  1  0  0  0  0  0  0  0  0  0  0  0
  AND B[b]         0  1  0  0  0  0  0  0  0  0  0  0  0
  ε-closure        0  1  1  0  1  1  1  0  1  0  1  0  0   (D, row 1)
Read a:
  shift ->         0  1  1  1  0  1  1  1  0  1  0  1  0   (bit 1 comes from the initial state)
  AND B[a]         0  0  0  0  0  0  0  0  0  1  0  0  0
  ε-closure        0  0  0  0  0  0  0  0  0  1  0  0  1   (D, row 2; state 12, the last state, is active)
Row 2 shifted one position:
  shift ->         0  0  0  0  0  0  0  0  0  0  1  0  0
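The steps described in these slides (shift, AND with the mask of the character read, ε-closure) can be sketched with Python integers as bit-vectors. Several choices are assumptions of this sketch. Bit i of an integer stands for state i, so the "shift to the right" of the slides becomes `<< 1`. State 12 is taken as the final state. The initial state is re-activated at every step so that occurrences may start anywhere. The ε-closure is computed state by state instead of through a precomputed table indexed by the whole bit-vector.

```python
NUM_STATES = 13  # states 0..12 of the Thompson NFA for bb*(b|b*a)
FINAL = 12       # assumed final state

# Targets of the character transitions (the 1 bits of the masks B).
CHAR_TARGETS = {"a": [9], "b": [1, 3, 7, 11]}
# ε-closures of the states that can become active after a character.
EPS_CLOSURE = {
    1: [1, 2, 4, 5, 6, 8, 10],
    3: [2, 3, 4, 5, 6, 8, 10],
    7: [6, 7, 8],
    9: [9, 12],
    11: [11, 12],
}


def bits(states):
    """Encode a collection of states as an integer bit-vector."""
    mask = 0
    for s in states:
        mask |= 1 << s
    return mask


B = {ch: bits(targets) for ch, targets in CHAR_TARGETS.items()}
E = {s: bits(closure) for s, closure in EPS_CLOSURE.items()}
FULL = (1 << NUM_STATES) - 1


def close(d):
    """Extend the active states in d with their ε-closures."""
    result, state = d, 0
    while d:
        if d & 1:
            result |= E.get(state, 0)
        d >>= 1
        state += 1
    return result


def bp_thompson_search(text):
    """Return the 1-based text positions where an occurrence ends."""
    ends = []
    d = 1  # only state 0 is active
    for position, ch in enumerate(text, start=1):
        d = ((d << 1) & FULL) & B.get(ch, 0)  # follow character transitions
        d = close(d) | 1                      # ε-closure, keep state 0 active
        if d & (1 << FINAL):
            ends.append(position)
    return ends


print(bp_thompson_search("ababbbaab"))
```

On the text ababbbaab the first occurrence is reported after the third character, where states 9 and 12 become active as in row 2 of D. The single shift is valid only because every character transition of this automaton goes from a state i to the state i + 1.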