Efficient Incremental Validation of XML Documents
Denilson Barbosa, Alberto O. Mendelzon, Leonid Libkin, Laurent Mignet, Marcelo Arenas
Presented by Daria Barger
Daria Barger – DB Seminar
Outline
• Introduction
• Types of constraints
• Update operations
• Incremental validation
• Experiments
• Conclusions
• Future work
Introduction
• The problems of storing and querying XML documents have attracted a great deal of interest.
• Other aspects of XML data management, however, have not yet been satisfactorily explored.
• Among them is the problem of checking that documents are valid with respect to their specifications, and that they remain valid after updates.
DTD
• One popular form of XML document specification is the Document Type Definition (DTD).
• A DTD D is a grammar that defines a set of documents L(D).
• Each document in L(D) is said to be valid with respect to D.
The Validation Problem
• The validation problem is: given a DTD D and an XML document X, is it the case that X ∈ L(D)?
• The incremental validation problem is: let U be some update operation. Given X ∈ L(D), is it the case that U(X) ∈ L(D)?
Validation of structural constraints
• Elements are declared in a DTD by rules of the form <!ELEMENT e c>, where c is the content model of e.
• Content model is a regular expression E: the element is valid iff the string formed by concatenating the names of its child elements belongs to L(E), the language denoted by E.
• Content model is #PCDATA: validation can be done trivially.
Example:
<?xml version="1.0"?>
<!ELEMENT db (person*)>
<!ELEMENT person (name, dep, email, tel*)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT dep (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
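The content-model check above can be illustrated with a minimal Python sketch. It is not the presenters' implementation: the `CONTENT_MODELS` table and the `children_valid` helper are hypothetical names, and the content models of `db` and `person` are hand-translated into Python regular expressions in which every child name is followed by one space.

```python
import re

# Hypothetical encoding (an assumption, not from the slides): each content
# model is a regular expression over child tag names, with every name
# followed by a single space so that names cannot run together.
CONTENT_MODELS = {
    "db": r"(person )*",
    "person": r"name dep email (tel )*",
}

def children_valid(tag, child_tags):
    """Return True iff the concatenated child tags belong to L(E) for `tag`."""
    word = "".join(name + " " for name in child_tags)
    return re.fullmatch(CONTENT_MODELS[tag], word) is not None

print(children_valid("person", ["name", "dep", "email", "tel", "tel"]))
print(children_valid("person", ["name", "email"]))
```

The first call describes a `person` with two `tel` children and is accepted; the second lacks `dep` and is rejected.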
Validation of attributes
Attribute validation is trivial, except for the ID and IDREF attribute types. A valid XML document must satisfy:
• The values of all ID attributes are unique.
• The value of each IDREF attribute is equal to the value of some ID attribute.
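The two conditions can be checked in one full pass over a document, as in the following sketch. It assumes, purely for illustration, that the attribute named `id` has type ID and the attribute named `ref` has type IDREF; in a real document these types come from the DTD's attribute declarations. The function name `ids_and_refs_valid` is hypothetical.

```python
import xml.etree.ElementTree as ET

def ids_and_refs_valid(root, id_attr="id", idref_attr="ref"):
    """Full (non-incremental) check of the ID and IDREF conditions.

    Assumption: `id_attr` is the only ID-typed attribute and `idref_attr`
    the only IDREF-typed attribute.
    """
    seen = set()
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            if value in seen:      # two ID attributes share a value
                return False
            seen.add(value)
    # Every IDREF value must equal the value of some ID attribute.
    return all(
        elem.get(idref_attr) in seen
        for elem in root.iter()
        if elem.get(idref_attr) is not None
    )

good = ET.fromstring('<db><person id="p1"/><person id="p2" ref="p1"/></db>')
dangling = ET.fromstring('<db><person id="p1" ref="p9"/></db>')
print(ids_and_refs_valid(good), ids_and_refs_valid(dangling))
```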
1-unambiguous regular expressions
• The specification of XML DTDs restricts the regular expressions used for defining element content to be 1-unambiguous (deterministic).
• Marking: E′ is the expression E with its symbols subscripted so that each occurrence is distinct.
• Position: a subscripted symbol in E′; pos(E′) is the set of positions.
• For a given position x, χ(x) denotes the corresponding (unmarked) symbol in Σ.
• For example: pos(E′) = {a, b1, b2, c} and χ(b1) = b.
1-unambiguous regular expressions
A regular expression E is 1-unambiguous if and only if, for all words u, v, w over the subscripted alphabet pos(E′) and all x, y in pos(E′), the conditions uxv, uyw ∈ L(E′) and x ≠ y imply χ(x) ≠ χ(y).
Which regular expression is deterministic?
• (ab)|(ac)
• a(b|c)
• a(a+b)*ac
The Glushkov automaton for Regular Expressions
• first(E′): the set of positions that appear as the first symbol of some word in L(E′).
• follow(E′, x): the set of positions that appear immediately after position x in some word in L(E′).
• last(E′): the set of positions that appear as the last symbol of some word in L(E′).
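The three sets can be computed by one recursive pass over the expression, and E is 1-unambiguous exactly when first(E′) and every follow(E′, x) contain no two positions with the same unmarked symbol. The sketch below assumes a hypothetical tuple encoding of regular expressions (`("sym", a)`, `("cat", l, r)`, `("alt", l, r)`, `("star", e)`) and numbers positions 1, 2, 3, … instead of subscripting symbols; the function names are illustrative.

```python
from itertools import count

def glushkov(expr):
    """Return (first, last, follow, nullable, chi) for a regex AST.

    AST nodes (an assumed encoding): ("sym", a), ("cat", l, r),
    ("alt", l, r), ("star", e). Positions are the integers 1, 2, 3, ...
    """
    counter = count(1)
    chi = {}     # position -> unmarked symbol
    follow = {}  # position -> set of positions that may come next

    def visit(node):
        kind = node[0]
        if kind == "sym":
            pos = next(counter)
            chi[pos] = node[1]
            follow[pos] = set()
            return {pos}, {pos}, False
        if kind == "star":
            first, last, _ = visit(node[1])
            for x in last:              # loop back to the start of the body
                follow[x] |= first
            return first, last, True
        f1, l1, n1 = visit(node[1])
        f2, l2, n2 = visit(node[2])
        if kind == "alt":
            return f1 | f2, l1 | l2, n1 or n2
        for x in l1:                    # kind == "cat"
            follow[x] |= f2
        return (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2), n1 and n2

    first, last, nullable = visit(expr)
    return first, last, follow, nullable, chi

def is_one_unambiguous(expr):
    """True iff no first/follow set holds two positions with the same symbol."""
    first, _, follow, _, chi = glushkov(expr)
    for positions in [first, *follow.values()]:
        symbols = [chi[p] for p in positions]
        if len(symbols) != len(set(symbols)):
            return False
    return True

sym = lambda a: ("sym", a)
ab_or_ac = ("alt", ("cat", sym("a"), sym("b")), ("cat", sym("a"), sym("c")))
a_b_or_c = ("cat", sym("a"), ("alt", sym("b"), sym("c")))
print(is_one_unambiguous(ab_or_ac), is_one_unambiguous(a_b_or_c))
```

For (ab)|(ac) both positions in the first set carry the symbol a, so the check fails; a(b|c) denotes the same language and passes.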
Update operations
• Append(p, y): insert element y as the last child of element p.
Update operations (2)
• InsertBefore(x, y): insert element y as the immediate left sibling of element x. (This operation is not defined if x is the root of the document.)
Update operations (3)
• Delete(x): delete element x from the document. Note that if x is the root of the document, the operation is trivially valid.
Observation
Incremental validation concerns only the content of the element where the update takes place. For example, after an Append(p, y) operation only the content of p needs to be revalidated.
The approach
• Let w1, w2, w3, …, wk be the children of p.
• Together with the i-th child of p we store the state reached, after reading w1 … wi, by the automaton that validates the content model of p.
• This requires auxiliary storage of size O(n log d), where n is the size of the XML document and d is the size of the DTD.
Append at the end
• Append(p, y) operation: y is added after wk, the last child of p.
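With a state stored next to each child, an append needs only one automaton step from the state of the last child. The sketch below hand-codes a deterministic automaton for the content model (name, dep, email, tel*) from the earlier DTD example; the names `DELTA`, `FINAL`, `START` and `ContentState` are hypothetical, and the per-child states live in a plain Python list rather than in the document store.

```python
# Deterministic automaton for (name, dep, email, tel*); states are named
# after positions and START is the initial state. All names are assumptions.
START = "q0"
DELTA = {
    (START, "name"): "name",
    ("name", "dep"): "dep",
    ("dep", "email"): "email",
    ("email", "tel"): "tel",
    ("tel", "tel"): "tel",
}
FINAL = {"email", "tel"}

class ContentState:
    """Per-child automaton states for the content of one element p."""

    def __init__(self, labels):
        # Full validation of the existing (assumed valid) content.
        self.states = []
        state = START
        for label in labels:
            state = DELTA[(state, label)]
            self.states.append(state)

    def append(self, label):
        """Append(p, y): one transition from the state of the last child."""
        prev = self.states[-1] if self.states else START
        nxt = DELTA.get((prev, label))
        if nxt is None or nxt not in FINAL:
            return False               # update rejected, content unchanged
        self.states.append(nxt)
        return True

person = ContentState(["name", "dep", "email"])
print(person.append("tel"), person.append("name"))
```

Appending `tel` is accepted after a single lookup; appending `name` is rejected without rereading the earlier children.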
Arbitrary insertions and deletions
• Delete(x) operation, where x is some child wi of p.
• Problem: Complexity
1,2 Conflict Free Regular Expression
Possible solution:
This condition does not always hold. For example, consider E′ = a(b1*|cb2*) and w = acb…b.
• All b's match state b2.
• Delete c from w to obtain w′ = ab…b.
• Now all b's match state b1.
• We have to revalidate the entire string.
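The example can be replayed with a hand-coded automaton for a(b1*|cb2*). This is only an illustration of the state flip; the transition table `DELTA` and the helper `run` are hypothetical names.

```python
# Deterministic automaton for a(b1*|c b2*); states are positions.
DELTA = {
    ("q0", "a"): "a",
    ("a", "b"): "b1",
    ("a", "c"): "c",
    ("b1", "b"): "b1",
    ("c", "b"): "b2",
    ("b2", "b"): "b2",
}

def run(word):
    """Return the state stored with each symbol of `word`."""
    states, state = [], "q0"
    for symbol in word:
        state = DELTA[(state, symbol)]
        states.append(state)
    return states

before = run("acbbb")   # states stored before Delete(c)
after = run("abbb")     # states required after Delete(c)

# Compare the state stored with each surviving b against the state it needs.
flipped = sum(1 for old, new in zip(before[2:], after[1:]) if old != new)
print(before, after, flipped)
```

Every b that follows the deleted c changes from b2 to b1, so the stored states of the whole suffix must be rewritten.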
Definition of 1,2 Conflict-free
Let E be a regular expression over alphabet Σ.
Follow(E, x): the set of positions in E that can follow x in some path through E.
Define such that E is 1,2 conflict-free regular expression if:
Restricted forms of DTD
• 1,2 Conflict Free DTD:
  • There is no "flipping" between automata states after the update.
  • The per-update complexity is O(log n + log d) time and O(n log d) auxiliary space.
• Conflict-free DTD:
  • No repeated symbols.
  • The per-update complexity is O(log n + log d) time and constant auxiliary space.
Incremental validation of ID and IDREF for adding element
• Append(p, y) and InsertBefore(x, y) operations require checking that no two ID attributes are the same and that every IDREF attribute in y refers to some existing ID value in the document.
• The complexity: O(|y| log n) time and linear auxiliary space, where |y| is the size of the added subtree.
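A sketch of the insertion check follows. It keeps the document's ID values in a Python `set` (hash-based) instead of an ordered index with O(log n) lookups, which is a substitution made here for brevity. It also assumes that an IDREF in y may point to an ID introduced by y itself. The function name `can_insert` is hypothetical.

```python
def can_insert(existing_ids, new_ids, new_idrefs):
    """Check the ID/IDREF conditions for an inserted subtree y.

    existing_ids: set of ID values already in the document.
    new_ids:      list of ID values found in y.
    new_idrefs:   list of IDREF values found in y.
    """
    if len(new_ids) != len(set(new_ids)):        # duplicate inside y
        return False
    if any(i in existing_ids for i in new_ids):  # clash with the document
        return False
    # Assumption: references to IDs defined inside y are allowed.
    known = existing_ids | set(new_ids)
    return all(r in known for r in new_idrefs)

doc_ids = {"p1", "p2"}
print(can_insert(doc_ids, ["p3"], ["p1", "p3"]))
print(can_insert(doc_ids, ["p2"], []))
```

The work is proportional to the number of ID and IDREF attributes in y, independent of the rest of the document apart from the lookups.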
Incremental validation of ID and IDREF for deleting element
• After a Delete(x) operation we have to check that the subtree rooted at x contains no node whose ID attribute is referenced by some other node that is not a descendant of x.
• Checking the reference counter in delete requires O(log n) time.
• Updating the reference counter when inserting or removing an IDREF attribute: O(h log n) time.
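The deletion check can be sketched with per-ID reference counters. This simplified version walks the ID and IDREF values of the deleted subtree and therefore does not reach the O(log n) bound quoted above; the function name `can_delete` and its argument layout are hypothetical.

```python
from collections import Counter

def can_delete(ref_count, subtree_ids, subtree_idrefs):
    """Return True iff no ID in the subtree is referenced from outside it.

    ref_count:      maps each ID value to the number of IDREF attributes
                    in the whole document that point to it.
    subtree_ids:    ID values found in the subtree rooted at x.
    subtree_idrefs: IDREF values found in the subtree rooted at x.
    """
    internal = Counter(subtree_idrefs)   # references that disappear with x
    return all(ref_count.get(i, 0) - internal[i] == 0 for i in subtree_ids)

ref_count = {"a": 1, "b": 2}
print(can_delete(ref_count, ["a"], ["a"]))   # only self-contained references
print(can_delete(ref_count, ["b"], ["b"]))   # one reference remains outside
```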
Valid Insertion
[Plot: time in microseconds (log scale, 100 to 1e+08) against document size (64K to 2G). Series: Incr CF, Incr 1,2 CF, Incr Arb, Full Arb, Full CF.]
Valid Deletion
[Plot: time in microseconds (log scale, 100 to 1e+08) against document size (64K to 2G). Series: Incr CF, Incr 1,2 CF, Incr Arb, Full Arb, Full CF.]
Invalid Deletion
[Plot: time in microseconds (log scale, 10 to 1000) against document size (64K to 2G). Series: Incr CF, Incr 1,2 CF, Incr Arb, Full Arb, Full CF.]
Conclusions
• Handled insertion and deletion of subtrees (not leaf nodes only).
• Validated ID and IDREF attributes.
• Characterized a class of DTDs that appears to capture most real-life DTDs and admits a logarithmic-time, constant-space incremental validation algorithm.
• Conducted experiments showing that the method is practical for large documents and behaves much better than full revalidation.
Future Work
• Handling complex updates, involving several insertions and deletions as a single transaction.