Similarity and Correction of Strings and Trees: Towards a Correction of XML Documents
Agata SAVARY
Université François-Rabelais de Tours, Campus de Blois, Laboratoire d’Informatique
IPIPAN Seminar, 24 April 2006
Traditional string-to-string correction (Wagner & Fischer 1974, Lowrance & Wagner 1975, …)
• CONTEXT:
  • Finite set of symbols (alphabet)
  • Elementary operations on symbols (editing operations, e.g. deletion, insertion, or replacement of a letter, transposition of two adjacent letters), each with a cost (usually 1 per operation)
  • Sequences of editing operations (edit sequences; each operation applies to the word resulting from the previous operations), each with a cost (the sum of the costs of the editing operations involved)
  • Measure of similarity between words A and B (edit distance or error distance): minimum cost over all edit sequences transforming A into B
• INPUT:
  • Two words A and B
• OUTPUT:
  • Distance between A and B
Examples of elementary edit operations
• Insertion of a letter: monter → montaer, monter → montrer
• Deletion of a letter: monter → montr, monter → monte
• Replacement of a letter by another: monter → ponter, monter → conter
• Transposition of two adjacent letters: monter → mnoter, monter → montre
Each elementary operation has a non-negative cost. From now on we assume cost 1 for each elementary operation.
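The four elementary operations can be illustrated with a minimal Python sketch. The function names and the choice of the lowercase ASCII alphabet are assumptions made for this illustration only; the slides do not prescribe an implementation.

```python
import string

# Assumption for this sketch: the alphabet is the lowercase ASCII letters.
ALPHABET = string.ascii_lowercase

def insertions(word):
    """All words obtained by inserting one letter at any position."""
    return {word[:i] + c + word[i:]
            for i in range(len(word) + 1) for c in ALPHABET}

def deletions(word):
    """All words obtained by deleting one letter."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}

def replacements(word):
    """All words obtained by replacing one letter by a different one."""
    return {word[:i] + c + word[i + 1:]
            for i in range(len(word)) for c in ALPHABET if c != word[i]}

def transpositions(word):
    """All words obtained by swapping two different adjacent letters."""
    return {word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1) if word[i] != word[i + 1]}
```

Each function returns the set of words at cost 1 from its argument for one type of operation; the examples on the slide (montaer, montr, ponter, mnoter, …) all appear in the corresponding set for monter.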
Edit sequence
• Edit sequence = sequence of elementary edit operations
• For each pair of words X and Y, many edit sequences exist that transform X into Y.
• Example 1: transforming sorting into string:
  • sorting → srting → sting → string (3 operations)
  • sorting → sotring → string (2 operations)
  • sorting → srting → string (2 operations)
  • sorting → strting → string (2 operations)
  • sorting → srting → sting → sing → sring → string (5 operations)
  • …
• Example 2: transforming abc into ca:
  • abc → ac → ca (2 operations)
  • abc → cabc → cac → ca (3 operations)
• From now on, we’ll be interested in linear edit sequences (Du & Chang 1992), i.e. sequences in which the operations are performed from left to right and no later operation may alter the result of a previous operation.
[Slide annotation: four of the sequences above are marked "Linear sequence".]
Edit (error) distance
• Cost of an edit sequence = sum of the costs of all elementary operations in the sequence
  • sorting → srting → sting → string (3 operations): cost = 3
  • sorting → sotring → string (2 operations): cost = 2
  • sorting → srting → sting → sing → sring → string (5 operations): cost = 5
• Edit distance (error distance) between two words X and Y, written ed(X,Y) = minimal cost over all edit sequences transforming X into Y:
  • ed(sorting, string) = 2
  • ed(abc, ca) = 2 if all edit sequences are taken into account
  • ed(abc, ca) = 3 if only linear edit sequences are taken into account
Calculating the edit distance (1/4)
Notation: word $X = x_1 x_2 \ldots x_i \ldots x_n$; the prefix of length $i$ of $X$ is $X[i] = x_1 x_2 \ldots x_i$.
The distance between two prefixes $X[i+1]$ and $Y[j+1]$ can be calculated from the distances between shorter prefixes. There are 3 cases.
Case 1: if $x_{i+1} = y_{j+1}$, then $\mathrm{ed}(X[i+1], Y[j+1]) = \mathrm{ed}(X[i], Y[j])$.
Calculating the edit distance (2/4)
Case 2: if $x_i = y_{j+1}$ and $x_{i+1} = y_j$ (the last 2 characters may be transposed), then 4 sub-cases are possible:
• The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains a transposition of $x_i$ and $x_{i+1}$: $\mathrm{ed}(X[i+1], Y[j+1]) = \mathrm{ed}(X[i-1], Y[j-1]) + 1$ (the added 1 is the transposition's cost)
• The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the replacement of $x_{i+1}$ by $y_{j+1}$: $\mathrm{ed}(X[i+1], Y[j+1]) = \mathrm{ed}(X[i], Y[j]) + 1$ (replacement's cost)
• The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the insertion of $y_{j+1}$: $\mathrm{ed}(X[i+1], Y[j+1]) = \mathrm{ed}(X[i+1], Y[j]) + 1$ (insertion's cost)
• The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the deletion of $x_{i+1}$: $\mathrm{ed}(X[i+1], Y[j+1]) = \mathrm{ed}(X[i], Y[j+1]) + 1$ (deletion's cost)
Calculating the edit distance (3/4)
Case 3: otherwise (if $x_{i+1} \neq y_{j+1}$, and ($x_i \neq y_{j+1}$ or $x_{i+1} \neq y_j$)), 3 sub-cases are possible:
• The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the replacement of $x_{i+1}$ by $y_{j+1}$: $\mathrm{ed}(X[i+1], Y[j+1]) = \mathrm{ed}(X[i], Y[j]) + 1$ (replacement's cost)
• The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the insertion of $y_{j+1}$: $\mathrm{ed}(X[i+1], Y[j+1]) = \mathrm{ed}(X[i+1], Y[j]) + 1$ (insertion's cost)
• The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the deletion of $x_{i+1}$: $\mathrm{ed}(X[i+1], Y[j+1]) = \mathrm{ed}(X[i], Y[j+1]) + 1$ (deletion's cost)
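Calculating the edit distance (4/4)
Edit distance between $X[i]$ and $Y[j]$, recursive definition. For $i = 0, \ldots, n$ and $j = 0, \ldots, m$:

1° $\mathrm{ed}(X[-1], Y[j]) = \mathrm{ed}(X[i], Y[-1]) = \max(m, n)$

2° $\mathrm{ed}(X[0], Y[j]) = j$ and $\mathrm{ed}(X[i], Y[0]) = i$

3° $$
\mathrm{ed}(X[i+1], Y[j+1]) =
\begin{cases}
\mathrm{ed}(X[i], Y[j]) & \text{if } x_{i+1} = y_{j+1} \\
1 + \min\{\mathrm{ed}(X[i], Y[j]),\ \mathrm{ed}(X[i+1], Y[j]),\ \mathrm{ed}(X[i], Y[j+1]),\ \mathrm{ed}(X[i-1], Y[j-1])\} & \text{if } x_i = y_{j+1} \text{ and } x_{i+1} = y_j \\
1 + \min\{\mathrm{ed}(X[i], Y[j]),\ \mathrm{ed}(X[i+1], Y[j]),\ \mathrm{ed}(X[i], Y[j+1])\} & \text{otherwise}
\end{cases}
$$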
Calculating the edit distance: dynamic programming
[Figure: edit distance matrix indexed by i (up to n) and j (up to m)]
• Cell [i,j] contains the edit distance between the prefix [1,…,i] of one word and the prefix [1,…,j] of the other word
• Cell [n,m] contains the edit distance between the 2 words
Dynamic programming: case 1 ($x_{i+1} = y_{j+1}$)
[Figure: edit distance matrix with cell [i+1, j+1] highlighted]
Dynamic programming: case 2 ($x_i = y_{j+1}$ and $x_{i+1} = y_j$)
[Figure: edit distance matrix with cell [i+1, j+1] highlighted]
Dynamic programming: case 3 ($x_{i+1} \neq y_{j+1}$ and ($x_i \neq y_{j+1}$ or $x_{i+1} \neq y_j$))
[Figure: edit distance matrix with cell [i+1, j+1] highlighted]
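The three cases of the recursive definition can be sketched as a dynamic-programming routine in Python. This is a minimal sketch assuming cost 1 for every operation; the function name `edit_distance` is hypothetical. Instead of the extra border row and column at index -1, the sketch uses an explicit bounds check before the transposition case.

```python
def edit_distance(x, y):
    """Edit distance with insertion, deletion, replacement and
    transposition of adjacent letters, each at cost 1."""
    n, m = len(x), len(y)
    # d[i][j] holds ed(X[i], Y[j]); index 0 stands for the empty prefix.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # i deletions
    for j in range(m + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:     # case 1: last letters are equal
                d[i][j] = d[i - 1][j - 1]
                continue
            # case 3: replacement, insertion, deletion
            best = min(d[i - 1][j - 1], d[i][j - 1], d[i - 1][j])
            # case 2: the last two letters may be transposed
            if i > 1 and j > 1 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                best = min(best, d[i - 2][j - 2])
            d[i][j] = 1 + best
    return d[n][m]
```

Because a transposed pair is never edited again, this recurrence yields the distance restricted to linear edit sequences: it gives 2 for sorting and string, and 3 for abc and ca.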
String-to-language correction: problem definition
• CONTEXT:
  • Finite set of symbols (alphabet)
  • Elementary edit operations on symbols (as before) with their costs (1 per operation)
  • Edit sequences (as before)
  • Edit distance (error distance) between words: as before
• INPUT:
  • Regular grammar describing words (a finite set of words in particular)
  • Incorrect word A (not recognized by the grammar)
  • Threshold t
• OUTPUT:
  • A set of correct words B1, B2, …, Bn whose distance from A stays within t (the nearest neighbors of A)
String-to-language correction: simplistic approach
• METHOD:
  • For each word B recognized by the grammar, calculate the edit distance matrix between A and B.
  • Propose the candidates whose distance from A does not exceed the threshold t (ed(A,B) ≤ t).
• FEASIBILITY:
  • Impossible for infinite languages
• COMPLEXITY: O(n * m * |D|)
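The simplistic approach can be sketched in Python as a loop over a finite word list. The names `ed` and `naive_candidates` and the sample dictionary are assumptions made for this illustration; the sketch only works for finite languages, as noted above.

```python
def ed(x, y):
    """Edit distance with unit-cost insertion, deletion, replacement
    and transposition of adjacent letters."""
    n, m = len(x), len(y)
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(m + 1)]
         for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                best = min(d[i - 1][j - 1], d[i][j - 1], d[i - 1][j])
                if i > 1 and j > 1 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                    best = min(best, d[i - 2][j - 2])
                d[i][j] = 1 + best
    return d[n][m]

def naive_candidates(word, dictionary, t):
    """Compare `word` with every dictionary entry (one full matrix per
    entry) and keep the entries whose distance stays within t."""
    result = {}
    for entry in dictionary:
        distance = ed(word, entry)
        if distance <= t:
            result[entry] = distance
    return result

# Hypothetical sample dictionary, for illustration only.
print(naive_candidates("aply", ["apple", "apples", "apply", "banana"], 2))
```

Every dictionary entry costs a full matrix computation, even when entries share long prefixes; this is the redundancy that the automaton-based exploration avoids.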
String-to-language correction: threshold-controlled depth-first exploration of an FSA (Oflazer 1996, …)
String correction with respect to a deterministic FSA (1/4)
Word to be corrected: *aply, threshold 2
[Figure: deterministic FSA with states 1 to 9 and transitions labelled a, p, l, e, s, y, shown next to the edit distance matrix for *aply. The part of the matrix for the prefix appl is calculated only once for all valid words sharing that prefix.]
Column calculated for the transition e: 5 4 3 2 2
• Each time a transition is followed, a new column is calculated in the edit distance matrix
• If we reach a final state and the edit distance remains within the threshold, a new candidate has been found
Candidates found: apple
String correction with respect to a deterministic FSA (2/4)
Word to be corrected: *aply, threshold 2
[Figure: deterministic FSA with states 1 to 9 and transitions labelled a, p, l, e, s, y, shown next to the edit distance matrix for *aply. The part of the matrix for the prefix appl is calculated only once for all valid words sharing that prefix.]
Column calculated for the transition e: 5 4 3 2 2
Column calculated for the following transition s: 6 5 4 3 3
• Each time a transition is followed, a new column is calculated in the edit distance matrix
• If we reach a final state and the edit distance remains within the threshold, a new candidate has been found
Candidates found: apple
String correction with respect to a deterministic FSA (3/4)
Word to be corrected: *aply, threshold 2
[Figure: deterministic FSA with states 1 to 9 and transitions labelled a, p, l, e, s, y, shown next to the edit distance matrix for *aply. The part of the matrix for the prefix appl is calculated only once for all valid words sharing that prefix.]
Column calculated for the transition e: 5 4 3 2 2
Column calculated for the following transition s: 6 5 4 3 3
• Each time a transition is followed, a new column is calculated in the edit distance matrix
• If we reach a final state and the edit distance remains within the threshold, a new candidate has been found
• Backtracking results in deleting the current column
Candidates found: apple
String correction with respect to a deterministic FSA (4/4)
Word to be corrected: *aply, threshold 2
[Figure: deterministic FSA with states 1 to 9 and transitions labelled a, p, l, e, s, y, shown next to the edit distance matrix for *aply. The part of the matrix for the prefix appl is calculated only once for all valid words sharing that prefix.]
Column calculated for the transition y: 5 4 3 2 1
• Each time a transition is followed, a new column is calculated in the edit distance matrix
• If we reach a final state and the edit distance remains within the threshold, a new candidate has been found
• Backtracking results in deleting the current column
Candidates found: apple, apply
Controlling the search space by the threshold
Word to be corrected: abcbb, t = 2
[Figure: fragment of an FSA with states 1, 2, 8, 9 and transitions labelled a, b, c, d]
• If the current column exceeds the threshold, the whole path is cut off
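The threshold-controlled depth-first exploration can be sketched in Python as follows. This is an illustration under stated assumptions, not the algorithm exactly as published: a trie built from a word list stands in for the deterministic FSA, all costs are 1, a path is cut off when every entry of its current column exceeds t, and the names `build_trie` and `correct` are hypothetical.

```python
def build_trie(words):
    """Trie as nested dicts; the key "" marks a final state."""
    root = {}
    for w in words:
        node = root
        for c in w:
            node = node.setdefault(c, {})
        node[""] = True
    return root

def correct(word, trie, t):
    """Return {candidate: distance} for all words in the trie whose
    edit distance from `word` stays within the threshold t."""
    n = len(word)
    results = {}
    columns = [list(range(n + 1))]   # column for the empty path
    path = []                        # letters of the transitions followed

    def explore(node):
        if "" in node and columns[-1][n] <= t:
            results["".join(path)] = columns[-1][n]
        for c, child in node.items():
            if c == "":
                continue
            prev = columns[-1]
            col = [prev[0] + 1]      # new column for transition c
            for i in range(1, n + 1):
                if word[i - 1] == c:
                    col.append(prev[i - 1])
                    continue
                best = min(prev[i - 1], prev[i], col[i - 1])
                # transposition of the last two letters
                if i > 1 and path and word[i - 2] == c and word[i - 1] == path[-1]:
                    best = min(best, columns[-2][i - 2])
                col.append(1 + best)
            if min(col) <= t:        # otherwise the whole path is cut off
                columns.append(col)
                path.append(c)
                explore(child)
                columns.pop()        # backtracking deletes the column
                path.pop()

    explore(trie)
    return results
```

With the hypothetical word list apple, apples, apply, the call `correct("aply", build_trie([...]), 2)` visits the shared prefix appl once and computes one extra column per branch, in contrast to one full matrix per word in the simplistic approach.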
Tree-to-tree correction (Selkow 1977, …)
• CONTEXT:
  • Finite set of node symbols (alphabet)
  • Elementary edit operations on trees:
    • Insertion of a leaf
    • Deletion of a leaf
    • Renaming of a node (leaf or internal node)
  • Non-negative cost for each elementary operation
  • Edit sequences (sequences of edit operations) with their costs (sums of the costs of the editing operations involved)
  • Edit distance between two trees A and B: minimum cost over all edit sequences transforming A into B
• INPUT:
  • Two trees A and B
• OUTPUT:
  • Distance between A and B
Comparing two trees (Selkow 1977, …)
• A partial tree A0:i consists of the root of A and its subtrees A0, …, Ai
• The comparison is based on comparing the roots and then recursively comparing the roots’ subtrees
[Figure: tree A with root(A) and subtrees A0, A1, A2, and tree B with root(B) and subtrees B0, B1, B2, B3; the nodes carry labels from a to f; the partial trees A0:1 and B0:2 are marked.]
Edit distance matrix between two trees (Selkow 1977, …)
[Figure: matrix indexed by i (up to n) and j (up to m)]
• Cell [i,j] contains the edit distance between the partial trees A0:i and B0:j
• Cell [-1,-1] contains the cost of renaming root(A) into root(B)
• Cell [n,m] contains the edit distance between the 2 trees
Calculation of the tree matrix (Selkow 1977, …)
[Figure: cell [i,j] of the matrix together with the neighbouring cells it is calculated from]
• Adding the cost of deleting Ai (here +1)
• Adding the edit distance between Ai and Bj (here +0)
• Adding the cost of inserting Bj (here +1)
• Taking the minimum (here min(4+0, 5+1, 4+1) = 4)
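The matrix calculation can be sketched in Python as a recursive routine. The representation of a tree as a `(label, children)` pair, the function names, and the unit costs are assumptions of this sketch. Since only leaves can be inserted or deleted, inserting or deleting a whole subtree is taken to cost its number of nodes. Indices are shifted by one with respect to the slides: `d[0][0]` plays the role of cell [-1,-1].

```python
def size(tree):
    """Number of nodes of a tree given as (label, [children])."""
    _, children = tree
    return 1 + sum(size(c) for c in children)

def tree_distance(a, b):
    """Edit distance between trees a and b with unit-cost leaf
    insertion, leaf deletion and node renaming."""
    (label_a, kids_a), (label_b, kids_b) = a, b
    n, m = len(kids_a), len(kids_b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    # Cost of renaming root(A) into root(B).
    d[0][0] = 0 if label_a == label_b else 1
    for i in range(1, n + 1):        # delete the subtrees of A one by one
        d[i][0] = d[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, m + 1):        # insert the subtrees of B one by one
        d[0][j] = d[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                # transform subtree A_i into subtree B_j (recursive call)
                d[i - 1][j - 1] + tree_distance(kids_a[i - 1], kids_b[j - 1]),
                # insert subtree B_j
                d[i][j - 1] + size(kids_b[j - 1]),
                # delete subtree A_i
                d[i - 1][j] + size(kids_a[i - 1]),
            )
    return d[n][m]

# Hypothetical example trees: B is A without the leaf d.
A = ("a", [("b", []), ("c", [("d", [])])])
B = ("a", [("b", []), ("c", [])])
print(tree_distance(A, B))
```

The recursive call is made for every pair of subtrees, which keeps the sketch short; an implementation intended for large trees would cache these sub-results.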
Extension to the correction of XML documents
[Figure: tree of an XML document with elements root, x, y, z, a, b, c]
• The validity of a node is described by a set of regular expressions, e.g. E = ab*c + db*
• The "horizontal" correction at the level of siblings is similar to string-to-language correction (Oflazer 1996)
• The "vertical" correction is inspired by tree-to-tree correction (Selkow 1977)
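The "horizontal" validity check can be illustrated with a small Python sketch, assuming that the regular expression E constrains the sequence of child labels of a node and that every label is a single letter. Both assumptions, as well as the names `CONTENT_MODEL` and `children_valid`, belong to this sketch only. The union written + in E becomes | in Python's `re` syntax.

```python
import re

# E = ab*c + db*, with '+' (union) written as '|' for the re module.
CONTENT_MODEL = re.compile(r"ab*c|db*")

def children_valid(labels):
    """True if the sequence of one-letter child labels is a word of E."""
    return CONTENT_MODEL.fullmatch("".join(labels)) is not None

print(children_valid(["a", "b", "b", "c"]))  # a word of ab*c
print(children_valid(["a", "b"]))            # the closing c is missing
```

A sequence of siblings rejected by this check is the kind of input that the horizontal correction would then repair by searching for nearby words of E within a threshold.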
Main idea
• String-to-string (Wagner & Fischer 1974)
• String-to-(regular) language (Oflazer 1996)
• Tree-to-tree (Selkow 1977)
• Tree-to-(regular) tree language (Cheriat, Savary, Bouchou, Halfeld, to be continued)
Edit distance matrix with edit sequences
[Figure: matrix indexed by i and j]
• Cell [i,j] contains the edit distance between the partial trees A0:i and B0:j, and the edit sequence necessary to transform A0:i into B0:j
Bibliography
• Clarke, G., Barnard, D. T., Duncan, N. (1995): Tree-to-tree Correction for Document Trees. Technical Report 95-372, Department of Computing and Information Science, Queen’s University, Kingston, Ontario.
• Du, M. W., Chang, S. C. (1992): A model and a fast algorithm for multiple errors spelling correction. Acta Informatica, Vol. 29. Springer Verlag, pp. 281-302.
• Hall, P., Dowling, G. (1980): Approximate String Matching. ACM Computing Surveys, Vol. 12(4). ACM, New York, pp. 381-402.
• Lowrance, R., Wagner, R. A. (1975): An Extension of the String-to-String Correction Problem. Journal of the ACM, Vol. 22(2), pp. 177-183.
• Mihov, S., Schulz, K. (2004): Fast approximate search in large dictionaries. Computational Linguistics, Vol. 30(4). MIT Press, Cambridge, Massachusetts, pp. 451-477.
• Oflazer, K. (1996): Error-tolerant finite state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, Vol. 22(1). MIT Press, Cambridge, Massachusetts, pp. 73-89.
• Selkow, S. (1977): The tree-to-tree editing problem. Information Processing Letters, Vol. 6(6), pp. 184-186.
• Wagner, R. A. (1974): Order-n Correction for Regular Languages. Communications of the ACM, Vol. 17(5), pp. 265-268.
• Wagner, R. A., Fischer, M. J. (1974): The String-to-String Correction Problem. Journal of the ACM, Vol. 21(1), pp. 168-173.
Some details of the state of the art
• Wagner & Fischer (1974):
  • Elegant and solid theoretical definition of the string-to-string correction problem
  • 3 elementary operations on single letters admitted (insertion, deletion, replacement)
  • Model of a trace describing the edit distance between two strings
  • Dynamic programming method
• Lowrance & Wagner (1975):
  • Additional elementary operation: transposition (inversion) of two adjacent letters
  • Restriction of the cost function
• Du & Chang (1992):
  • Cost 1 for each elementary operation
  • Restriction to linear editing sequences
  • Application to the nearest-neighbor search in a dictionary, with a threshold
• Oflazer (1996):
  • Nearest-neighbor search in finite-state automata
  • Application to large natural-language dictionaries
• Selkow (1977), Tai (1979), Zhang & Shasha (1989), Clarke, Barnard & Duncan (1995), de Rougemont (2003):
  • Tree-to-tree correction problem
• Mihov & Schulz (2004):
  • Levenshtein automaton
  • Backward dictionary
• Bouchou, B. & Halfeld Ferrari Alves, M. (2003):
  • Incremental validation of XML documents resulting from updates: human-computer interaction