Validating Streaming XML Documents. Luc Segoufin & Victor Vianu. Presented by Harel Paz.
The Challenge • XML is becoming a standard for data exchange on the Web. • Need: on-line processing of large amounts of data in XML format, using limited memory. • Our focus: validating XML documents against given DTDs.
Validating Streaming XML Documents • Restrictions on the validation: • In a single pass. • Using a fixed amount of memory, depending on the DTD. • [Figure: an input stream of tags (... <u><v></v><v><w></w></v> ...) is read by an FSA with start and accept states, which outputs Yes/No.]
The Problem in 2 Flavors • There are 2 flavors to the problem: • Strong validation: validation that includes checking well-formedness. • Validation: checking satisfaction of the DTD, under the assumption that the input is a well-formed XML document.
Tree Document • XML documents are abstracted by “tree documents”. • A tree document over a finite alphabet Σ is a finite unranked tree with labels in Σ and an order on the children of each node. • [Figure: example tree document with node labels r, a, a, b, c, b, c, c.]
String Representation • XML documents are a string representation of trees using opening and closing tags for each element. • For each a ∈ Σ, • a represents the opening tag. • ā represents the closing tag for a. • Notation: Σ̄ = { ā | a ∈ Σ }. • [Figure: example tree document with node labels r, a, a, b, c, b, c, c.]
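The string representation can be sketched as a traversal that emits an opening tag on entering a node and a closing tag on leaving it. This is only an illustration: the `Node` class and the convention of writing the closing tag ā as the string "/a" are assumptions made here, not notation from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of a tree document: a label plus an ordered list of children."""
    label: str
    children: list = field(default_factory=list)


def to_stream(node):
    """Yield the tags of the tree in document order ('a' opens, '/a' closes)."""
    yield node.label
    for child in node.children:
        yield from to_stream(child)
    yield "/" + node.label


# A small tree: r has one child a, which has children b and c.
tree = Node("r", [Node("a", [Node("b"), Node("c")])])
stream = list(to_stream(tree))
print(" ".join(stream))
```

A streaming validator never sees the tree itself, only a tag sequence like the one produced here.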
DTDs • A DTD consists of an extended context-free grammar over alphabet Σ. • A tree document over Σ satisfies a DTD if it is a derivation tree of the grammar. • Example DTD: • r → a* • a → bc • b → c? • c → ε • [Figure: a tree document with node labels r, a, a, b, c, b, c, c that satisfies this DTD.]
DTDs – cont’ • Each DTD d has a unique rule a → R_a for each symbol a ∈ Σ. • R_a denotes the regular expression of the rule for a. • L(d) is the language over Σ ∪ Σ̄ consisting of the string representations of all tree documents satisfying d.
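Satisfaction of a DTD by a tree document (not yet by a stream) can be sketched as follows, using the example rules r → a*, a → bc, b → c?, c → ε. The encoding is an assumption made for illustration: labels are single lowercase letters, each rule is a Python regular expression over labels, and a tree is a `(label, children)` pair.

```python
import re

# Example DTD: one regular expression over child labels per symbol.
DTD = {"r": "a*", "a": "bc", "b": "c?", "c": ""}


def node_ok(tree, dtd):
    """Check that the children of every node spell a word matching its rule."""
    label, children = tree
    word = "".join(child[0] for child in children)
    if label not in dtd or re.fullmatch(dtd[label], word) is None:
        return False
    return all(node_ok(child, dtd) for child in children)


def satisfies(tree, dtd, root="r"):
    """A tree satisfies the DTD if it is rooted at `root` and every node is ok."""
    return tree[0] == root and node_ok(tree, dtd)


good = ("r", [("a", [("b", [("c", [])]), ("c", [])]),
              ("a", [("b", []), ("c", [])])])
bad = ("r", [("a", [("c", []), ("b", [])])])  # children of a are "cb", not "bc"
print(satisfies(good, DTD), satisfies(bad, DTD))
```

This check keeps the whole tree in memory, which is exactly what the streaming setting rules out.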
Strong Validation of Streaming XML Documents • The problem: validating an XML document with respect to a given DTD d. • Need to characterize the DTDs d for which L(d) can be recognized by an FSA. • Such DTDs are called strongly recognizable.
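Strong Validation – Example 1 • DTD d: • r → a • a → a? • L(d) = { r a^n ā^n r̄ | n ≥ 1 }. • L(d) is not regular, so d cannot be strongly validated by an FSA. • d is not strongly recognizable. • [Figure: a tree consisting of root r above a chain of a nodes.]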
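Strong Validation – Example 2 • DTD d: • r → a* • a → b|c • L(d) = r (a (b b̄ | c c̄) ā)* r̄. • L(d) is regular, so d is strongly recognizable. • [Figure: a tree with root r whose children are a nodes, each with a single child b or c.]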
More Definitions • Let d be a DTD over Σ. • The dependency graph of d, G_d, is the graph constructed as follows: • Its set of vertices is Σ. • For each rule a → R_a in d, there is an edge from a to b, for each b occurring in R_a.
More Definitions (cont’) • Two labels, a and b, are mutually recursive if they belong to some cycle of G_d. • a is recursive if it is mutually recursive with itself. • DTD d is non-recursive iff G_d is acyclic. • A DTD d is fully recursive if all labels from which recursive labels are reachable in G_d are mutually recursive.
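Dependency Graph – Examples • DTD d: • r → a • a → a? • G_d is not acyclic (an edge from r to a and a self-loop on a). • d is recursive. • d is not fully recursive. • DTD d: • r → a* • a → b|c • G_d is acyclic (edges from r to a, from a to b and from a to c). • d is non-recursive.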
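The two examples can be checked mechanically. The sketch below builds the dependency graph and tests whether some label lies on a cycle. It assumes, for illustration only, that labels are single lowercase letters and that each rule is written as a regular expression string; the function names are hypothetical.

```python
import re


def dependency_graph(dtd):
    """Edge a -> b for every label b occurring in the regular expression of a."""
    return {a: set(re.findall(r"[a-z]", regex)) for a, regex in dtd.items()}


def reachable(graph, start):
    """Labels reachable from `start` by a path of at least one edge."""
    seen, stack = set(), [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_recursive(dtd):
    """A DTD is recursive iff some label can reach itself in its dependency graph."""
    graph = dependency_graph(dtd)
    return any(a in reachable(graph, a) for a in graph)


d1 = {"r": "a", "a": "a?"}
d2 = {"r": "a*", "a": "b|c", "b": "", "c": ""}
print(is_recursive(d1), is_recursive(d2))
```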
Characterization of Strongly Recognizable DTDs • Theorem 3.1 (partial): A DTD is strongly recognizable iff it is non-recursive. • Proof sketch: • If d is a strongly recognizable DTD, there is an FSA recognizing exactly L(d). Suppose towards a contradiction that d is recursive, and show using the pumping lemma that this FSA also accepts strings that are not well-balanced. • If d is non-recursive, an algorithm to build an FSA recognizing L(d) is given.
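One way to see the "non-recursive implies strongly recognizable" direction is to unfold the rules into a single regular expression for L(d). The sketch below is not the algorithm from the paper. It assumes single lowercase letters as labels and encodes the closing tag ā as the uppercase letter A, and it terminates only when the DTD is non-recursive.

```python
import re


def language_regex(dtd, label):
    """Regex for the string representations of all trees rooted at `label`.

    Every label b in the rule of `label` is replaced by the regex for the
    trees rooted at b. For a recursive DTD this unfolding never terminates.
    """
    def expand(match):
        return "(?:" + language_regex(dtd, match.group()) + ")"

    body = re.sub(r"[a-z]", expand, dtd[label])
    return f"{label}(?:{body}){label.upper()}"


d = {"r": "a*", "a": "b|c", "b": "", "c": ""}
rx = language_regex(d, "r")

print(re.fullmatch(rx, "rabBAacCAR") is not None)  # r(a(b), a(c)): in L(d)
print(re.fullmatch(rx, "rabBcCAR") is not None)    # a has two children: rejected
```

Because the result is an ordinary regular expression, an FSA for it also rejects strings that are not well-balanced, which is what strong validation requires.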
Validating Well-Formed XML Documents • The problem: validating an XML document with respect to a given DTD d, assuming the XML document is well-formed. • Validation using an FSA. • DTDs that can be validated this way are called recognizable. • The requirement that L(d) should be regular is now too strong. • The FSA only needs to work correctly on well-balanced strings representing trees.
Validation - Example 1 • DTD d: • r → a • a → a? • d is not strongly recognizable. • But it is recognizable: • If the input is known to be well-balanced, the FSA should just check that the string is of the form r a* ā* r̄ (more precisely r a a* ā* ā r̄).
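The check described in this example fits in one regular expression. In the sketch below the closing tags ā and r̄ are written as the uppercase letters A and R, an encoding assumed here for illustration. The pattern `ra+A+R` is the form r a a* ā* ā r̄.

```python
import re

# r, then one or more opening a's, then one or more closing a's, then closing r.
PATTERN = re.compile(r"ra+A+R")


def accepts(word):
    """FSA-style check, meaningful only for well-balanced input."""
    return PATTERN.fullmatch(word) is not None


print(accepts("raaAAR"))  # well-balanced chain of two a's: accepted
print(accepts("rR"))      # well-balanced, but r has no a child: rejected
print(accepts("raaAR"))   # not well-balanced, yet accepted
```

The last case is why d is recognizable but not strongly recognizable: the automaton cannot count tags, and it does not need to once well-formedness is guaranteed by assumption.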
Validation - Example 2 • DTD d: • a → (ab|ca|ε) • b → ε • c → ε • d is not recognizable. • An FSA cannot store enough information to recall, when it reads ā, whether the corresponding node has a left sibling c (in which case b is not allowed to its right). • [Figure: example tree with node labels a, a, b, a, b, c, a, c, a.]
Characterizing Recognizable DTDs • Which DTDs are recognizable? • Non-recursive DTDs. • What about recursive DTDs? • Not a trivial question. • Are there necessary conditions for a DTD to be recognizable? • Are there subclasses of DTDs for which the necessary conditions are also sufficient?
Necessary Condition for a Recognizable DTD • Lemma 4.2: Let d be a recognizable DTD. Then the following hold, where […] are words over Σ while […] (possibly subscripted) are individual symbols: Let […] be a positive integer and […], […] be mutually recursive symbols of d (not necessarily distinct). If […], and […] for […], then […] must be in […].
Fully Recursive DTDs • The necessary condition for recognizability stated in Lemma 4.2 is also sufficient when the DTD is fully recursive. • Next, we’ll see how to construct an FSA A_d for a DTD d, which accepts all words in L(d) (and possibly more). • For fully recursive DTDs satisfying the conditions of Lemma 4.2, the well-balanced words accepted by A_d are precisely the words in L(d) (A_d may also accept words that are not well-balanced).
The Standard FSA • Let d be a DTD over alphabet Σ. • Equivalence relation ≡ on Σ. • The equivalence classes of ≡ are the strongly connected components of G_d. • Let ≤ be a partial order on the classes of ≡, where C ≤ C′ iff for some a ∈ C and b ∈ C′ there is an edge from a to b in G_d. • ≤ may have several maximal classes, but only one minimum class.
Example • DTD d: • r → aa • a → a? • The classes of ≡ are {r} and {a}. • {r} ≤ {a}. • [Figure: dependency graph with an edge from r to a and a self-loop on a.]
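The classes in this example can be computed as strongly connected components of the dependency graph. The sketch below uses plain mutual reachability rather than an optimized SCC algorithm; the graph is written out by hand and the function names are hypothetical.

```python
def reachable(graph, start):
    """Labels reachable from `start` by a path of at least one edge."""
    seen, stack = set(), [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def classes(graph):
    """Group labels that reach each other; a label alone forms its own class."""
    result = []
    for a in graph:
        cls = frozenset({a} | {b for b in reachable(graph, a)
                               if a in reachable(graph, b)})
        if cls not in result:
            result.append(cls)
    return result


# Dependency graph of the DTD r -> aa, a -> a?
graph = {"r": {"a"}, "a": {"a"}}
print(classes(graph))
```

The edge from r to a places the class {r} below the class {a} in the partial order, so {r} is the minimum class and {a} is maximal.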
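Example – cont’ • DTD d: • r → aa • a → a? • Constructing the FSA of class {a}’s string representation: for each edge labeled a in the FSA for a’s rule, add a transition on a into the initial state and a transition on ā out of each final state. • [Figure: the FSA for class {a}.]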
Example – cont’ • DTD d: • r → aa • a → a?
Example – cont’ • DTD d: • r → aa • a → a? • The above FSA accepts all well-balanced words produced by the above DTD. • But it also accepts other well-balanced words. • No FSA recognizes this DTD.
Recognizable Fully Recursive DTDs • Theorem 4.1: The following are equivalent for each fully recursive DTD d: (i) d is recognizable. (ii) d satisfies the conditions of Lemma 4.2. (iii) The set of well-balanced strings accepted by the standard FSA A_d is precisely L(d).
Recognizable DTDs • Which DTDs are recognizable? • Non-recursive DTDs. • Fully recursive DTDs satisfying the conditions of Lemma 4.2. • And others… • But, characterization in the general case remains an open question. • Partial progress: necessary conditions for recognizability.
Alternative Validation Approaches • 2 alternative approaches for validating against DTDs that are not recognizable: • Relaxing the constant memory requirement. • Refining the original DTD.
Validation with Bounded Stack • Relaxing the constant memory requirement. • Use a stack whose depth is bounded by the depth of the XML document. • Validation is done in a single deterministic pass. • Appealing approach in practice. • For each DTD, there exists a deterministic PDA that accepts precisely its language. • Example: the DTD • r → aa • a → a?
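For the example DTD r → aa, a → a?, a stack-based validator can be sketched as follows. It keeps one (label, DFA state) pair per open element, so the stack never grows beyond the depth of the document. The hand-written DFA tables, the "/a" convention for closing tags and the function name are assumptions made for this illustration, not the construction from the paper.

```python
# One DFA per rule, reading the labels of the children from left to right.
DFAS = {
    "r": {"delta": {(0, "a"): 1, (1, "a"): 2}, "accept": {2}},  # r -> aa
    "a": {"delta": {(0, "a"): 1}, "accept": {0, 1}},            # a -> a?
}


def validate(stream, dfas, root="r"):
    """Single pass over the tag stream using a stack bounded by document depth."""
    stack = []          # one (label, DFA state) pair per open element
    seen_root = False
    for tag in stream:
        if tag.startswith("/"):                 # closing tag
            if not stack:
                return False
            label, state = stack.pop()
            if tag[1:] != label or state not in dfas[label]["accept"]:
                return False
        else:                                   # opening tag
            if tag not in dfas:
                return False
            if stack:                           # advance the parent's DFA
                parent, state = stack[-1]
                nxt = dfas[parent]["delta"].get((state, tag))
                if nxt is None:
                    return False
                stack[-1] = (parent, nxt)
            elif seen_root or tag != root:
                return False
            seen_root = True
            stack.append((tag, 0))
    return seen_root and not stack


valid = ["r", "a", "a", "/a", "/a", "a", "/a", "/r"]
invalid = ["r", "a", "a", "/a", "a", "/a", "/a", "a", "/a", "/r"]  # an a with two children
print(validate(valid, DFAS), validate(invalid, DFAS))
```

Unlike an FSA, this validator also detects input that is not well-balanced, because every closing tag is compared with the label on top of the stack.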
Refining the DTD • Refining a DTD means providing additional information in the tags that can be used for validation. • Example: [Figure: an original DTD and its refined version.] • The refined DTD can be validated by an FSA. • For every DTD there exists such an equivalent refined DTD, of quadratic size, that is recognizable.
Summary • First step towards the formal investigation of processing streaming XML. • Provided conditions under which validation can be done in a single pass and constant memory, using an FSA. • Considered alternative approaches, when validation using an FSA is not possible.
Appendix The Standard FSA Construction
The Standard FSA • A_d is inductively constructed starting from the maximal elements of ≤. • Let C be a maximal element of ≤. • For each regular expression R_a (a ∈ C), a non-deterministic FSA A_a is built. • Disjoint states for different a’s. • The initial state of A_a is q_a, while its final states are F_a.
The Standard FSA – cont’ • C is a maximal element of ≤. • Build A_C: • Its states are the union of the states of the FSAs A_a for a ∈ C. • Transitions: for each transition (p, b, q) of A_a, add to A_C the transitions: • (p, b, q_b) for the initial state q_b of A_b. • (f, b̄, q) for each final state f of A_b. • Here b must belong to C.
The Standard FSA – cont’ • Build A_C for non-maximal elements C of ≤, when all FSAs A_C′ of elements C′ such that C ≤ C′ are already constructed: • Unlike the maximal elements case, A_a has transitions (p, b, q) where b ∉ C (i.e., b belongs to a class C′ above C). • For such transitions, we add to A_C: • A new disjoint copy of A_C′. • (p, b, q_b) for the initial state q_b of A_b in that copy. • (f, b̄, q) for each final state f of A_b in that copy.
The Standard FSA – cont’ • The final FSA A_d is obtained by adding to the FSA of the minimum class (containing the root label r): • A new start state with a transition on r to the start state of A_r. • A final state with a transition on r̄ from each final state of A_r.
The Standard FSA - Lemma • Lemma 4.3: For each DTD d, let A_d be the automaton described. We have: (i) Every word in L(d) is accepted by A_d. (ii) A_d can be constructed from d in exponential time. • The complexity of A_d’s construction depends on the maximum size of an FSA for a regular expression of d and on the depth of the partial order ≤.