
Similarity and Correction of Strings and Trees : Towards a Correction of XML Documents

Similarity and Correction of Strings and Trees: Towards a Correction of XML Documents. Agata SAVARY, Université François-Rabelais de Tours, Campus de Blois, Laboratoire d’Informatique. Seminarium IPIPAN, 24 April 2006.


Presentation Transcript


  1. Similarity and Correction of Strings and Trees: Towards a Correction of XML Documents. Agata SAVARY, Université François-Rabelais de Tours, Campus de Blois, Laboratoire d’Informatique. Seminarium IPIPAN, 24 April 2006

  2. String-to-string correction

  3. Traditional string-to-string correction (Wagner & Fischer 1974, Lowrance & Wagner 1975, …)
     • CONTEXT:
       • Finite set of symbols (alphabet)
       • Elementary operations on symbols (editing operations, e.g. deletion, insertion, or replacement of a letter, inversion of two adjacent letters), each with a cost (usually 1 per operation)
       • Sequences of editing operations (edit sequences; each operation applies to the word resulting from the previous operations), each with a cost (the sum of the costs of the editing operations involved)
       • Measure of similarity between words A and B (edit distance or error distance): the minimum cost over all edit sequences transforming A into B
     • INPUT:
       • Two words A and B
     • OUTPUT:
       • Distance between A and B

  4. Examples of elementary edit operations
     • Insertion of a letter: monter → montaer, monter → montrer
     • Deletion of a letter: monter → montr, monter → monte
     • Replacement of a letter by another: monter → ponter, monter → conter
     • Transposition of two adjacent letters: monter → mnoter, monter → montre
     Each elementary operation has a non-negative cost. From now on we assume cost 1 for each elementary operation.

  5. Edit sequence
     • Edit sequence = sequence of elementary edit operations
     • For each pair of words X and Y, many edit sequences exist that transform X into Y.
     • Example 1: transforming sorting into string:
       • sorting → srting → sting → string (3 operations)
       • sorting → sotring → string (2 operations)
       • sorting → srting → string (2 operations)
       • sorting → strting → string (2 operations)
       • sorting → srting → sting → sing → sring → string (5 operations)
       • ...
     • Example 2: transforming abc into ca:
       • abc → ac → ca (2 operations)
       • abc → cabc → cac → ca (3 operations)
     • From now on, we’ll be interested in linear edit sequences (Du & Chang 1992), i.e. those in which the operations are performed from left to right and no later operation may alter the result of a previous operation.
     [Slide annotation: four of the sequences above are labelled "Linear sequence".]

  6. Edit (error) distance
     • Cost of an edit sequence = sum of the costs of all elementary operations in the sequence:
       • sorting → srting → sting → string (3 operations): cost = 3
       • sorting → sotring → string (2 operations): cost = 2
       • sorting → srting → sting → sing → sring → string (5 operations): cost = 5
     • Edit distance (error distance) between two words X and Y, ed(X,Y) = minimal cost over all edit sequences transforming X into Y:
       • ed(sorting, string) = 2
       • ed(abc, ca) = 2 if all edit sequences are taken into account
       • ed(abc, ca) = 3 if only linear edit sequences are taken into account

  7. Calculating the edit distance (1/4)
     Notation: word X = x_1 x_2 ... x_i ... x_n; the prefix of length i of X: X[i] = x_1 x_2 ... x_i
     The distance between two prefixes X[i+1] and Y[j+1] can be calculated from the distances between shorter prefixes. There are 3 cases.
     • Case 1: if x_{i+1} = y_{j+1}, then ed(X[i+1],Y[j+1]) = ed(X[i],Y[j])
     [Figure: the prefix X[i] within X, and the prefixes X[i+1] and Y[j+1] ending at positions i+1 and j+1]

  8. Calculating the edit distance (2/4)
     • Case 2: if x_i = y_{j+1} and x_{i+1} = y_j (the last 2 characters may be inverted), then 4 sub-cases are possible:
       • The cheapest sequence transforming X[i+1] into Y[j+1] contains a transposition of x_i and x_{i+1}: ed(X[i+1],Y[j+1]) = ed(X[i-1],Y[j-1]) + 1 (transposition’s cost)
       • The cheapest sequence transforming X[i+1] into Y[j+1] contains the replacement of x_{i+1} by y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1 (replacement’s cost)
       • The cheapest sequence transforming X[i+1] into Y[j+1] contains the insertion of y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1 (insertion’s cost)
       • The cheapest sequence transforming X[i+1] into Y[j+1] contains the deletion of x_{i+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1 (deletion’s cost)
     [Figure: the prefixes X[i+1] and Y[j+1] with their last two characters crossed]

  9. Calculating the edit distance (3/4)
     • Case 3: OTHERWISE (if x_{i+1} ≠ y_{j+1}, and (x_i ≠ y_{j+1} or x_{i+1} ≠ y_j)), 3 sub-cases are possible:
       • The cheapest sequence transforming X[i+1] into Y[j+1] contains the replacement of x_{i+1} by y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1 (replacement’s cost)
       • The cheapest sequence transforming X[i+1] into Y[j+1] contains the insertion of y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1 (insertion’s cost)
       • The cheapest sequence transforming X[i+1] into Y[j+1] contains the deletion of x_{i+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1 (deletion’s cost)

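  10. Calculating the edit distance (4/4)
     Edit distance between X[i] and Y[j], recursive definition. For i = 0,...,m and j = 0,...,n:
     \[
     \begin{aligned}
     1^\circ\quad & \mathrm{ed}(X[-1],Y[j]) = \mathrm{ed}(X[i],Y[-1]) = \max(m,n) \\
     2^\circ\quad & \mathrm{ed}(X[0],Y[j]) = j, \qquad \mathrm{ed}(X[i],Y[0]) = i \\
     3^\circ\quad & \mathrm{ed}(X[i+1],Y[j+1]) =
     \begin{cases}
     \mathrm{ed}(X[i],Y[j]) & \text{if } x_{i+1} = y_{j+1} \\
     1 + \min\{\mathrm{ed}(X[i],Y[j]),\ \mathrm{ed}(X[i+1],Y[j]),\ \mathrm{ed}(X[i],Y[j+1]),\ \mathrm{ed}(X[i-1],Y[j-1])\} & \text{if } x_i = y_{j+1} \text{ and } x_{i+1} = y_j \\
     1 + \min\{\mathrm{ed}(X[i],Y[j]),\ \mathrm{ed}(X[i+1],Y[j]),\ \mathrm{ed}(X[i],Y[j+1])\} & \text{otherwise}
     \end{cases}
     \end{aligned}
     \]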

  11. Calculating the edit distance: dynamic programming
     • Cell [i,j] contains the edit distance between the prefix [1,...,i] of one word and the prefix [1,...,j] of the other word
     • Cell [n,m] contains the edit distance between the 2 words
     [Figure: edit distance matrix indexed by i (up to n) and j (up to m)]

  12. Dynamic programming: case 1, x_{i+1} = y_{j+1} [Figure: matrix cell (i+1, j+1)]

  13. Dynamic programming: case 2, x_i = y_{j+1} and x_{i+1} = y_j [Figure: matrix cell (i+1, j+1)]

  14. Dynamic programming: case 3, x_{i+1} ≠ y_{j+1} and (x_i ≠ y_{j+1} or x_{i+1} ≠ y_j) [Figure: matrix cell (i+1, j+1)]
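
The recurrence of slides 7 to 10, filled in as the matrix of slides 11 to 14, can be sketched in Python as follows. This is only an illustrative sketch, not code from the talk: the function name `edit_distance` is an assumption, every operation costs 1, and an explicit index guard stands in for the sentinel value ed(X[-1],Y[j]) = max(m,n) of slide 10.

```python
def edit_distance(x: str, y: str) -> int:
    """Edit distance with insertion, deletion, replacement and
    transposition of adjacent letters, each of cost 1 (slide 10)."""
    m, n = len(x), len(y)
    # d[i][j] holds ed(X[i], Y[j]), the distance between the prefixes
    # of length i and j.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # i deletions turn X[i] into the empty word
    for j in range(n + 1):
        d[0][j] = j  # j insertions turn the empty word into Y[j]

    for i in range(m):
        for j in range(n):
            # x[i] is x_{i+1} and y[j] is y_{j+1} in the slides' notation.
            if x[i] == y[j]:
                d[i + 1][j + 1] = d[i][j]  # case 1: last letters match
                continue
            best = min(d[i][j],      # replacement
                       d[i + 1][j],  # insertion of y_{j+1}
                       d[i][j + 1])  # deletion of x_{i+1}
            # Case 2: the last two letters are inverted. The guard i, j > 0
            # replaces the sentinel row and column of index -1.
            if i > 0 and j > 0 and x[i - 1] == y[j] and x[i] == y[j - 1]:
                best = min(best, d[i - 1][j - 1])  # transposition
            d[i + 1][j + 1] = 1 + best
    return d[m][n]


print(edit_distance("sorting", "string"))  # 2
print(edit_distance("abc", "ca"))          # 3
```

The second result is 3 rather than 2 because this recurrence only accounts for linear edit sequences, as noted on slide 6.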

  15. String-to-language correction

  16. String-to-language correction: problem definition
     • CONTEXT:
       • Finite set of symbols (alphabet)
       • Elementary edit operations on symbols (as before) with their costs (1 per operation)
       • Edit sequences (as before)
       • Edit distance (error distance) between words: as before
     • INPUT:
       • Regular grammar describing words (a finite set of words in particular)
       • Incorrect word A (not recognized by the grammar)
       • Threshold t
     • OUTPUT:
       • A set of correct words B1, B2, …, Bn whose distance from A stays within t (the nearest neighbors of A)

  17. String-to-language correction: simplistic approach
     • METHOD:
       • For each word B recognized by the grammar, calculate the edit distance matrix between A and B.
       • Propose the candidates whose distance from A does not exceed the threshold t (ed(A,B) ≤ t).
     • FEASIBILITY:
       • Impossible in the case of infinite languages
     • COMPLEXITY: O(n * m * |D|)

  18. String-to-language correction: threshold-controlled depth-first exploration of an FSA (Oflazer 1996, …)

  19. String correction with respect to a deterministic FSA (1/4)
     Word to be corrected: *aply, threshold 2
     • Each time a transition is followed, a new column is calculated in the edit distance matrix
     • If we reach a final state and the edit distance remains within the threshold, a new candidate has been found
     • The part of the matrix belonging to a shared prefix (here appl) is calculated only once for all valid words sharing that prefix
     Column after following e (prefix apple): 5 4 3 2 2. Candidate found: apple
     [Figure: deterministic FSA with states 1 to 9 and transitions labelled with letters, next to the edit distance matrix built along the current path]

  20. String correction with respect to a deterministic FSA (2/4)
     Same word, threshold and automaton as on slide 19.
     Column after following e (prefix apple): 5 4 3 2 2. Column after following s (prefix apples): 6 5 4 3 3. Candidate found so far: apple

  21. String correction with respect to a deterministic FSA (3/4)
     Same word, threshold and automaton as on slide 19.
     • Backtracking results in deleting the current column
     Candidate found so far: apple

  22. String correction with respect to a deterministic FSA (4/4)
     Same word, threshold and automaton as on slide 19.
     Column after following y (prefix apply): 5 4 3 2 1. Candidates found: apple, apply

  23. Controlling the search space by the threshold
     Word to be corrected: abcbb, t = 2
     • If the current column exceeds the threshold, the whole path is cut off
     [Figure: FSA fragment with transitions labelled a, b, c, d]
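
The exploration of slides 19 to 23 can be sketched as below. This is a hedged illustration rather than Oflazer's published algorithm: the lexicon is stored in a plain trie (one simple kind of deterministic FSA), the names `State`, `build_trie`, `next_column` and `correct` are hypothetical, and a path is cut off when every value of the current column exceeds the threshold.

```python
class State:
    """One state of a deterministic automaton (here: a trie node)."""
    def __init__(self):
        self.next = {}      # outgoing transitions: letter -> State
        self.final = False  # True if a valid word ends in this state


def build_trie(words):
    root = State()
    for w in words:
        s = root
        for ch in w:
            s = s.next.setdefault(ch, State())
        s.final = True
    return root


def next_column(word, col, prev_col, last, ch):
    """Column of the matrix for the path extended by letter `ch`.
    `col` is the column of the current path, `prev_col` the one before
    it (None at the start state), `last` the last letter of the path."""
    new = [col[0] + 1]
    for i in range(len(word)):
        if word[i] == ch:
            new.append(col[i])
            continue
        best = min(col[i], col[i + 1], new[i])
        if prev_col is not None and i > 0 and word[i - 1] == ch and word[i] == last:
            best = min(best, prev_col[i - 1])  # transposition
        new.append(1 + best)
    return new


def correct(word, root, t):
    """Return {candidate: distance} for all words within distance t."""
    found = {}

    def explore(state, prefix, col, prev_col):
        if state.final and col[-1] <= t:
            found[prefix] = col[-1]           # new candidate
        for ch in sorted(state.next):
            new = next_column(word, col, prev_col, prefix[-1:], ch)
            if min(new) <= t:                 # otherwise the path is cut off
                explore(state.next[ch], prefix + ch, new, col)
            # Returning from the call is the backtracking step:
            # the column `new` is simply discarded.

    explore(root, "", list(range(len(word) + 1)), None)
    return found


root = build_trie(["apple", "apples", "apply", "zebra"])
print(correct("aply", root, 2))  # {'apple': 2, 'apply': 1}
```

Because the columns are passed down the recursion, the columns for the shared prefix appl are computed once and reused for apple, apples and apply, which is the saving pointed out on slide 19.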

  24. Tree-to-tree correction

  25. Tree-to-tree correction (Selkow 1977, …)
     • CONTEXT:
       • Finite set of node symbols (alphabet)
       • Elementary edit operations on trees:
         • Insertion of a leaf
         • Deletion of a leaf
         • Renaming of a node (leaf or internal node)
       • Non-negative cost for each elementary operation
       • Edit sequences (sequences of edit operations), each with a cost (the sum of the costs of the editing operations involved)
       • Edit distance between two trees A and B: the minimum cost over all edit sequences transforming A into B
     • INPUT:
       • Two trees A and B
     • OUTPUT:
       • Distance between A and B

  26. Comparing two trees (Selkow 1977, …)
     • A partial tree A0:i consists of the root of A and its subtrees A0,...,Ai
     • The comparison is based on comparing the roots, and then recursively comparing the roots’ subtrees
     [Figure: tree A with root(A) and subtrees A0, A1, A2, and tree B with root(B) and subtrees B0, B1, B2, B3; the partial trees A0:1 and B0:2 are marked]

  27. Edit distance matrix between two trees (Selkow 1977, …)
     • Cell [i,j] contains the edit distance between the partial trees A0:i and B0:j
     • Cell [-1,-1] contains the cost of renaming root(A) into root(B)
     • Cell [n,m] contains the edit distance between the 2 trees
     [Figure: matrix indexed by i (up to n) and j (up to m)]

  28. Calculation of the tree matrix (Selkow 1977, …)
     Cell [i,j] is the minimum of three values (in the example, min(4+0, 5+1, 4+1) = 4), each obtained from a neighbouring cell:
     • cell [i-1,j-1] plus the edit distance between Ai and Bj (here +0)
     • cell [i-1,j] plus the cost of deleting Ai (here +1)
     • cell [i,j-1] plus the cost of inserting Bj (here +1)
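
The matrix computation of slides 26 to 28 can be sketched as follows. This is a minimal sketch under stated assumptions: the names `Node`, `size` and `tree_distance` are hypothetical, every elementary operation costs 1 (so deleting or inserting a whole subtree costs its number of nodes, one leaf operation per node), and the matrix is shifted by one so that `d[0][0]` plays the role of cell [-1,-1].

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes: the unit cost of deleting or inserting
    the whole subtree leaf by leaf."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    n, m = len(a.children), len(b.children)
    # d[i+1][j+1] is the distance between the partial trees A0:i and B0:j.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0 if a.label == b.label else 1   # renaming root(A) into root(B)
    for i in range(n):
        d[i + 1][0] = d[i][0] + size(a.children[i])   # delete subtree Ai
    for j in range(m):
        d[0][j + 1] = d[0][j] + size(b.children[j])   # insert subtree Bj
    for i in range(n):
        for j in range(m):
            d[i + 1][j + 1] = min(
                d[i][j] + tree_distance(a.children[i], b.children[j]),
                d[i][j + 1] + size(a.children[i]),    # delete Ai
                d[i + 1][j] + size(b.children[j]),    # insert Bj
            )
    return d[n][m]


A = Node("a", [Node("b"), Node("c")])
B = Node("a", [Node("b"), Node("d"), Node("c")])
print(tree_distance(A, B))  # 1 (insert the leaf d)
```

The recursive calls recompute subtree distances many times; a practical version would presumably cache them, which is left out here for brevity.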

  29. Extension to the correction of XML documents
     • The validity of a node is described by a set of regular expressions, e.g. E = ab*c + db*
     • The "horizontal" correction on a siblings’ level is similar to the string-to-language correction (Oflazer 1996)
     • The "vertical" correction is inspired by the tree-to-tree correction (Selkow 1977)
     [Figure: XML document tree with the elements root, x, y, z, a, b and c]
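
As a small illustration of the "horizontal" validity condition, the sketch below checks a sequence of sibling element names against the example expression E = ab*c + db*. It assumes one-letter element names, so that a sibling sequence can be written as a string; the function name `children_are_valid` is hypothetical, and the union written "+" on the slide becomes "|" in Python's `re` syntax. It only tests validity and does not perform any correction.

```python
import re

# E = ab*c + db*, with the slide's "+" (union) written as "|".
E = re.compile(r"ab*c|db*")


def children_are_valid(children: str) -> bool:
    """True if the whole sequence of child labels matches E."""
    return E.fullmatch(children) is not None


print(children_are_valid("abbc"))  # True
print(children_are_valid("db"))    # True
print(children_are_valid("abd"))   # False
```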

  30. Main idea
     • String-to-string (Wagner & Fischer 1974)
     • String-to-(regular) language (Oflazer 1996)
     • Tree-to-tree (Selkow 1977)
     • Tree-to-(regular) tree language (Cheriat, Savary, Bouchou, Halfeld, to be continued)

  31. Edit distance matrix with edit sequences
     • Cell [i,j] contains the edit distance between the partial trees A0:i and B0:j, and the edit sequence necessary to transform A0:i into B0:j

  32. Bibliography
     • Clarke, G., Barnard, D. T., Duncan, N. (1995): Tree-to-tree Correction for Document Trees. Technical Report 95-372, Department of Computing and Information Science, Queen’s University, Kingston, Ontario.
     • Du, M. W., Chang, S. C. (1992): A model and a fast algorithm for multiple errors spelling correction. Acta Informatica, Vol. 29. Springer Verlag, pp. 281-302
     • Hall, P., Dowling, G. (1980): Approximate String Matching. ACM Computing Surveys, Vol. 12(4). ACM, New York, pp. 381-402
     • Lowrance, R., Wagner, R. A. (1975): An Extension of the String-to-String Correction Problem. Journal of the ACM, Vol. 22(2), pp. 177-183
     • Mihov, S., Schulz, K. (2004): Fast approximate search in large dictionaries. Computational Linguistics, Vol. 30(4). MIT Press, Cambridge, Massachusetts, pp. 451-477
     • Oflazer, K. (1996): Error-tolerant finite state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, Vol. 22(1). MIT Press, Cambridge, Massachusetts, pp. 73-89
     • Selkow, S. (1977): The tree-to-tree editing problem. Information Processing Letters, Vol. 6(6), pp. 184-186
     • Wagner, R. A. (1974): Order-n Correction for Regular Languages. Communications of the ACM, Vol. 17(5), pp. 265-268
     • Wagner, R. A., Fischer, M. J. (1974): The String-to-String Correction Problem. Journal of the ACM, Vol. 21(1), pp. 168-173

  33. Some details of the state of the art
     • Wagner & Fischer (1974):
       • Elegant and solid theoretical definition of the string-to-string correction problem
       • 3 elementary operations on single letters admitted (insertion, deletion, replacement)
       • Model of a trace describing the edit distance between two strings
       • Dynamic programming method
     • Lowrance & Wagner (1975):
       • Additional elementary operation: inversion of two adjacent letters
       • Restriction of the cost function
     • Du & Chang (1992):
       • Cost 1 for each elementary operation
       • Restriction to linear editing sequences
       • Application to the nearest-neighbor search in a dictionary, with a threshold
     • Oflazer (1996):
       • Nearest-neighbor search in finite-state automata
       • Application to large natural-language dictionaries
     • Selkow (1977), Tai (1979), Zhang & Shasha (1989), Clarke, Barnard & Duncan (1995), de Rougemont (2003):
       • Tree-to-tree correction problem
     • Mihov & Schulz (2004):
       • Levenshtein automaton
       • Backward dictionary
     • Bouchou, B. & Halfeld Ferrari Alves, M. (2003):
       • Incremental validation of XML documents resulting from updates: human-computer interaction
