Streamed Validation Ksenia Rybenko, TU Dresden
XML-processing • Querying • Computing running aggregates of streams • Validating XML-documents against given DTDs
Statement of the problem • Verify that an XML-document is valid with respect to a given DTD in a single pass and using a fixed amount of memory, depending on the DTD but not on the XML-document • Validation by a finite state automaton (FSA) that makes a single pass over the XML-document as it streams through the network
Validation • Strong validation (additionally checks well-formedness) -> strongly recognizable DTDs • Validation -> recognizable DTDs
XML as a tree document • A tree document over ∑ is a finite unranked tree with labels in ∑ and an order on the children of each node • Let T be a set of tree documents; ζ(T) is the language consisting of the string representations of the tree documents in T
String representation • The string associated with a tree document t is denoted [t] and is defined by induction • If t is a single root labeled a then [t] = aā • If t consists of a root labeled a and subtrees t1…tk then [t] = a[t1]…[tk]ā
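The inductive definition of [t] can be sketched in a few lines of Python. The encoding is an assumption made for illustration only: a tree document is a pair `(label, [subtrees])`, and the closing tag ā is written as the token `"/a"`.

```python
from typing import List, Tuple

# Hypothetical encoding: a tree document is (label, [subtrees]).
Tree = Tuple[str, list]

def string_repr(t: Tree) -> List[str]:
    """Return [t] as a list of tokens; "/a" stands for the closing tag of a."""
    label, children = t
    tokens = [label]                       # opening tag a
    for child in children:
        tokens.extend(string_repr(child))  # [t1] ... [tk]
    tokens.append("/" + label)             # closing tag
    return tokens

doc = ("r", [("a", []), ("a", [("b", [])])])
print(" ".join(string_repr(doc)))
```

For a single root labeled a the function returns the two tokens `a` and `/a`, matching the base case of the definition.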
Context Free Grammar (CFG) • Context-free grammar G = (V, ∑, P, S) • V - set of nonterminals • ∑ - set of terminals • S in V - start symbol • P - finite set of productions of the form A -> α, where A is in V and α is in (V U ∑)* • A tree document with root r over ∑ satisfies a DTD d if it is a derivation tree of the CFG G = (∑, ∅, P, r) for d, whose productions may have regular expressions on the right-hand side
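Satisfaction of a DTD by a tree document can be illustrated with a small recursive check. This is a minimal sketch under two assumptions that are not part of the slides: labels are single characters, so the word formed by the children of a node can be matched with Python's `re.fullmatch`, and the DTD is a dictionary from each label to the regular expression on the right-hand side of its rule. The example DTD (r -> a*, a -> b?, b -> ε) is hypothetical.

```python
import re

def satisfies(tree, dtd, root):
    """Check that `tree` has root label `root` and every node obeys its rule."""
    label, _children = tree
    return label == root and _node_ok(tree, dtd)

def _node_ok(tree, dtd):
    label, children = tree
    if label not in dtd:
        return False
    # Word formed by the labels of the children, in document order.
    child_word = "".join(child[0] for child in children)
    if re.fullmatch(dtd[label], child_word) is None:
        return False
    return all(_node_ok(child, dtd) for child in children)

# Hypothetical DTD: r -> a*, a -> b?, b -> (empty word)
dtd = {"r": "a*", "a": "b?", "b": ""}

good = ("r", [("a", []), ("a", [("b", [])])])
bad = ("r", [("b", [])])          # b is not allowed directly under r
print(satisfies(good, dtd, "r"), satisfies(bad, dtd, "r"))
```

This check works on the whole tree in memory; it is not a streaming validator and only illustrates what membership in SAT(d) means.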
Regular languages • ∅, {ε} and {a} for a in ∑ are in Reg∑ • If L, L1, L2 are in Reg∑ then so are L1 U L2, L1·L2 = {u·v | u in L1 and v in L2}, and L* = {u1…un | n ≥ 0 and each ui in L} • Example: (ab)*a is regular, while {a^n b^n | n ≥ 0} is not
Useful notation • The rule a -> Ra is unique for each a in ∑ (two rules a -> Ra1 and a -> Ra2 are merged into a -> Ra1 | Ra2) • The set of tree documents satisfying a DTD d is denoted by SAT(d) • ζ(d) is the language consisting of all string representations of elements of SAT(d)
Finite State Automaton • A = (Q, ∑, I, Δ, F) • a finite set of states Q • a finite alphabet ∑ • a set of initial states I ⊆ Q • a transition relation Δ ⊆ Q × ∑ × Q • a set of final states F ⊆ Q • A path in the automaton is a sequence q0 a1 q1 a2 … an qn, written q0 ->a1…an-> qn, where (qi-1, ai, qi) is in Δ for 1 ≤ i ≤ n • A path is successful if q0 is in I and qn is in F • Accepted language: L(A) = {w in ∑* | there is a successful path q0 ->w-> qn in A} • Kleene's theorem: L is recognizable iff it is regular
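The definition of acceptance can be made concrete with a short simulation. The sketch below tracks the set of states reachable by some path, which is one standard way to run a nondeterministic automaton; the function name `accepts` and the two-state automaton for (ab)*a are illustrative assumptions.

```python
def accepts(word, initial, delta, final):
    """Return True if some successful path of the automaton is labeled `word`.

    `delta` is a set of triples (q, a, q2), i.e. a subset of Q x Sigma x Q.
    """
    current = set(initial)
    for symbol in word:
        # All states reachable from the current set by one `symbol` transition.
        current = {q2 for (q1, a, q2) in delta if q1 in current and a == symbol}
    return bool(current & set(final))

# Hypothetical automaton for (ab)*a: 0 -a-> 1, 1 -b-> 0, I = {0}, F = {1}
delta = {(0, "a", 1), (1, "b", 0)}
for w in ["a", "aba", "ab", ""]:
    print(repr(w), accepts(w, {0}, delta, {1}))
```

The memory used during the run is bounded by the number of states, independently of the length of the input word.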
Strong validation of XML-documents • DTD d is strongly recognizable if ζ(d) can be recognized by an FSA • Strong validation includes also checking well-formedness of the XML-document
Example of recognizability (figure with two cases, labeled "not regular" and "regular")
Dependency graph for DTDs • Construction of Gd: • the set of vertices is ∑ • for each rule a -> Ra, add an edge from a to every b occurring in some word of Ra • Two labels a and b are mutually recursive if they belong to some cycle of Gd, and a is recursive if it is mutually recursive with itself • Example: Gd with edges (r,a), (a,a), (a,b); here a is recursive
Recursivity of DTDs • A DTD d is nonrecursive iff Gd is acyclic • A specialized DTD d = (∑, ∑', d', μ) is nonrecursive iff the DTD d' over ∑' is nonrecursive • A DTD d is fully recursive if all labels from which recursive labels are reachable in Gd are mutually recursive
Recognizability condition (1) • A DTD is strongly recognizable iff it is nonrecursive
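Since strong recognizability reduces to acyclicity of the dependency graph, it can be tested by a simple graph search. The sketch below rests on assumptions that are not in the slides: labels are single letters, the DTD is a dictionary from labels to regular expressions, and every letter written in an expression occurs in some word of its language. The example rules are hypothetical, chosen so that the graph has the edges (r,a), (a,a), (a,b).

```python
def dependency_graph(dtd):
    """Edge a -> b for every letter b written in the expression of a."""
    return {a: {ch for ch in regex if ch.isalpha()} for a, regex in dtd.items()}

def reachable(graph, start):
    """Labels reachable from `start` by a path with at least one edge."""
    seen, stack = set(), list(graph.get(start, ()))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph.get(v, ()))
    return seen

def recursive_labels(graph):
    """A label is recursive if it lies on a cycle, i.e. reaches itself."""
    return {a for a in graph if a in reachable(graph, a)}

def is_nonrecursive(dtd):
    return not recursive_labels(dependency_graph(dtd))

# Hypothetical rules giving the edges (r,a), (a,a), (a,b)
recursive_dtd = {"r": "a", "a": "(a|b)*", "b": ""}
flat_dtd = {"r": "a*", "a": "b?", "b": ""}
print(recursive_labels(dependency_graph(recursive_dtd)))
print(is_nonrecursive(recursive_dtd), is_nonrecursive(flat_dtd))
```

Under the stated criterion, `flat_dtd` would be strongly recognizable and `recursive_dtd` would not.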
Validating well-formed XML-documents • Let ζ(Tree) denote the language consisting of all string representations of tree documents over ∑ • A DTD d is recognizable, i.e. can be validated by an FSA on well-formed input, iff there exists some regular language R such that ζ(d) = ζ(Tree) ∩ R
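The role of the regular language R can be seen on a small example that is an assumption of this sketch, not taken from the slides: the recursive DTD r -> a*, a -> a*. Its documents are exactly the trees with root r and all other nodes labeled a, so R = r (a | ā)* r̄ works. In the code, an opening tag is a lowercase letter and the matching closing tag is the same letter in uppercase, again a hypothetical encoding.

```python
import re

# R for the hypothetical DTD r -> a*, a -> a*  ("A" closes "a", "R" closes "r")
R = re.compile(r"r[aA]*R")

def is_tree_string(s):
    """Membership in ζ(Tree): balanced tags and a single root (needs a stack)."""
    stack = []
    for i, ch in enumerate(s):
        if ch.islower():
            if not stack and i > 0:
                return False          # a second root would start here
            stack.append(ch)
        elif not stack or stack.pop() != ch.lower():
            return False
    return bool(s) and not stack

def validate_well_formed(s):
    """Validation in the weak sense: only meaningful if s is well-formed."""
    return R.fullmatch(s) is not None

for s in ["raAaaAAR", "raaAR"]:
    print(s, is_tree_string(s), validate_well_formed(s))
```

The second string is accepted by R although it is not well-formed, which illustrates why validation of this kind assumes well-formed input and is weaker than strong validation.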
Condition of recognizability (2) • Lemma 1: Let d be a recognizable DTD. Then the following holds, where α, β, u, v, w (possibly primed or subscripted) are words over ∑ while x, y, z (possibly subscripted) are individual symbols: Let k be a positive integer and let x_i, z_i, 1 ≤ i ≤ k, be mutually recursive symbols of d (not necessarily distinct). If α x_1 β is in R_{z_1}, α' x_k β' is in R_{z_1}, and u_i x_{i-1} v_i x_i w_i is in R_{z_i} for 2 ≤ i ≤ k, then α x_1 v_2 x_2 … v_k x_k β' must be in R_{z_1}
Example of a non-recognizable DTD • The condition of Lemma 1 does not hold for k = 2: a and b are mutually recursive, Ra contains the words a and b, Rb contains the word ab, but Ra does not contain the required word ab
Constructing a standard FSA Ad, which accepts ζ(d) • Ad is constructed from the separate automata for every regular expression, connected by additional transitions, with new initial and final states • The procedure is based on induction on the order of the edges in Gd, starting from the maximal element
Example of constructing Ad • Note: Ad also accepts additional words such as raaāaāār̄
Condition of recognizability (3) • A fully recursive DTD is recognizable • iff the set of well-balanced strings accepted by the standard FSA Ad is precisely ζ(d) • iff d satisfies the conditions of Lemma 1
Alternative approaches to validation • Validation with bounded stack • Refining the DTD
Validation with bounded stack • Relaxing the memory requirement • A stack whose depth is bounded by the depth of the XML-document is allowed as auxiliary memory • Formally, this is done by a pushdown automaton (PDA): a finite automaton that controls both an input tape and a stack, where the stack is a string of symbols over some alphabet
Validation of a DTD by PDA • The class of languages accepted by PDAs is precisely the class of context-free languages • ζ(d) is context-free for every DTD d, so every DTD can be strongly validated by some PDA
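A stack-based streaming validator in the spirit of the PDA approach might look as follows. This is a sketch under stated assumptions, not the construction from the talk: the stream is a list of tokens where `"a"` opens and `"/a"` closes an element, and each content model is supplied as a hand-written deterministic automaton `(start, transitions, accepting)`. Each stack frame stores one label and one automaton state, so the stack depth never exceeds the depth of the document.

```python
def validate_stream(tokens, dtd, root):
    """Single-pass strong validation with a stack of [label, state] frames."""
    stack = []
    seen_root = False
    for tok in tokens:
        if tok.startswith("/"):
            # Closing tag: must match the open element, whose content
            # automaton must be in an accepting state.
            if not stack or stack[-1][0] != tok[1:]:
                return False
            label, state = stack.pop()
            if state not in dtd[label][2]:
                return False
        else:
            if tok not in dtd:
                return False
            if stack:
                # Advance the parent's content automaton on this child label.
                state = stack[-1][1]
                nxt = dtd[stack[-1][0]][1].get((state, tok))
                if nxt is None:
                    return False
                stack[-1][1] = nxt
            elif seen_root or tok != root:
                return False
            seen_root = True
            stack.append([tok, dtd[tok][0]])
    return seen_root and not stack

# Hypothetical DTD r -> a*, a -> b?, b -> (empty), as (start, transitions, accepting)
dtd = {
    "r": (0, {(0, "a"): 0}, {0}),
    "a": (0, {(0, "b"): 1}, {0, 1}),
    "b": (0, {}, {0}),
}
print(validate_stream("r a /a a b /b /a /r".split(), dtd, "r"))
print(validate_stream("r b /b /r".split(), dtd, "r"))
```

Unlike an FSA, this validator also rejects streams that are not well-formed, at the cost of memory that grows with the nesting depth.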
Refining the DTD • Refining a DTD means enriching its tags with additional information (specialization) • For every DTD d there exists an equivalent specialized DTD dspec of size quadratic in the size of d such that dspec is recognizable
DTDs with specialization • A specialized DTD over ∑ is a tuple d = (∑, ∑', d', μ) where • ∑ and ∑' are finite alphabets • d' is a DTD over ∑' • μ is a mapping from ∑' to ∑ • A tree document t over ∑ satisfies a specialized DTD d if t is in μ(SAT(d'))
Example of DTD with specialization • μ maps both ca and cb to c, i.e. μ⁻¹(c) = {ca, cb}
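The definition "t is in μ(SAT(d'))" can be illustrated by relabeling a witness tree. Everything concrete in this sketch is hypothetical: the specialized labels are single characters (`x` and `y` play the roles of the two specializations of c), the rules of d' are invented for illustration, and regular expressions are matched with `re.fullmatch`.

```python
import re

def satisfies(tree, dtd):
    """Every node's children word must match the rule of the node's label."""
    label, children = tree
    word = "".join(child[0] for child in children)
    if label not in dtd or re.fullmatch(dtd[label], word) is None:
        return False
    return all(satisfies(child, dtd) for child in children)

def apply_mu(tree, mu):
    """Relabel a tree over the specialized alphabet into a tree over Sigma."""
    label, children = tree
    return (mu[label], [apply_mu(child, mu) for child in children])

# Hypothetical d': a c below a (written x) needs a child d, a c below b (y) is a leaf
d_prime = {"r": "ab", "a": "x*", "b": "y*", "x": "d", "y": "", "d": ""}
mu = {"r": "r", "a": "a", "b": "b", "x": "c", "y": "c", "d": "d"}

witness = ("r", [("a", [("x", [("d", [])])]), ("b", [("y", [])])])
document = apply_mu(witness, mu)
print(satisfies(witness, d_prime), document)
```

The sketch only shows that `document` satisfies the specialized DTD because a witness over the specialized alphabet exists; deciding satisfaction for an arbitrary document would require searching for such a witness.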
Conclusion • Conditions under which validation can be done in a single pass and with constant memory are provided • Whenever a DTD is recognizable, it can be validated by the standard FSA • Other options for validation are considered: PDA, specializing a DTD