This paper explores the use of tree automata for typing semistructured data, specifically focusing on XML data. The motivations include simplifying software development for XML, enhancing interoperability, and improving storage and performance. It covers various aspects such as type checking, static verification, and regular tree languages. The advantages, challenges, and applications of tree automata in handling XML data are discussed in detail.
Typing semistructured data Serge Abiteboul 2008
Organization • Motivations • Automata • Automata on words • Ranked tree automata • Unranked tree automata • Automata and monadic second-order logic • Automata – to compute • XML typing: DTD, XML schema • Graphs and bisimulation
Motivation
XML typing • Not compulsory • Simplify writing software for XML • Improve interoperability between programs • Improve storage and performance • Ease querying: data guide • Simplify data protection • Reject illegal update – like relational dependencies
Improve storage • Lower-bound schema • Store rest in overflow graph [Figure: schema graph with nodes Root, Company and Employee; edges person, company, works-for, managed-by, c.e.o., name, address; leaves of type string]
Improve performance • Query without type information: select X.title from Bib._ X where X.*.zip = "12345" • Query using the type: select X.title from Bib.book X where X.address.zip = "12345" [Figure: schema graph for Bib with paper and book nodes; fields address (zip, street, city), year, title, journal, author (last name, first name); leaves of type string or int]
Type checking • Who checks • XML editor: check that the data conforms to its type • XML exchange, e.g., with Web service • Server when delivering the data • Client/application: when receiving it • Dynamic verification: after the data is produced • Static verification: verification of the program that generates the data
Static verification • Input: input type T and code of function f • f is XQuery, XPath, XSLT, etc. • Verification of T’ • Is it true that ∀d ╞ T, f(d) ╞ T’ ? • Type inference • Find the smallest T’ such that ∀d ╞ T, f(d) ╞ T’ • Rapidly undecidable because of “joins”
Example for $p in doc("parts.xml")//part[color="red"] return <part> <name>$p/name</name> <desc>$p/desc</desc> </part> Result type: (part (name (string) desc (any)))* If the type of parts.xml//part/desc is string: (part (name (string) desc (string)))*
Difficulty for $X in Input, $Y in Input do { print ( <b/> ) } Input: <a/> <a/> Result: <b/> <b/> <b/> <b/> Problem: { b^i | i = n^2 for n ≥ 0 } cannot be described in XML Schema. There is no « best » result type, only ever tighter regular approximations • b* • ε + b b* • ε + b + b^4 b* • ε + b + b^4 + b^9 b* • …
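The nested loop on this slide can be mimicked in a few lines of Python to see why the output sizes are exactly the perfect squares. This is only an illustrative sketch: the function names `nested_loop_output` and `count_b` are hypothetical and not part of the original slides.

```python
def nested_loop_output(n: int) -> str:
    """Mimic `for $X in Input, $Y in Input` over n <a/> elements.

    One <b/> is printed for every pair (X, Y), so n inputs give n * n outputs.
    """
    inputs = ["<a/>"] * n
    return "".join("<b/>" for _x in inputs for _y in inputs)


def count_b(n: int) -> int:
    """Number of <b/> elements produced for an input of n <a/> elements."""
    return nested_loop_output(n).count("<b/>")
```

For an input of two `<a/>` elements, `count_b(2)` gives 4, matching the slide. The set of possible output lengths is not regular, which is why no regular type describes it exactly.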
Why tree automata? • XML = unranked trees • No theory for XML • Rich theory for strings: Automata • Extend to rich theory for ranked trees: Tree automata • Nice algorithms • Nice theorems • Can this carry to unranked trees and XML? • Yes!
From strings to trees • Word: finite state automata • Binary tree…: ranked tree automata • Unranked tree (no bound on number of children): unranked tree automata [Figure: a word, a binary tree and an unranked tree over labels a and b]
Only unranked tree automata? • Missing practical gadgets • Complexity of verification • Goal: typing at reasonable cost • Unranked tree automata + …
Automata: Automata on words
Finite state automata on words • A = (Σ, Q, q0, F, δ) • Σ: alphabet • Q: states • q0: initial state • F: accepting states • δ: transitions
Nondeterministic automaton: Example [Figure: runs of a nondeterministic automaton with states q0, q1, q2 on words over {a, b}; one run ends in an accepting state (OK), another does not (KO)]
Reminder • Deterministic • No ε-transition • No alternative transitions, such as two transitions from the same state on the same letter • Determinization • It is possible to obtain an equivalent deterministic automaton • State of new automaton = set of states of the original one • Possible exponential blow-up • Minimization • Limitations: cannot do context-free languages • Essential tool, e.g., lexical analysis
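The determinization step recalled on this slide (state of the new automaton = set of states of the original one) is the classical subset construction. Below is a minimal sketch, assuming an NFA without ε-transitions given as a dictionary; the example automaton `NFA_DELTA` (words over {a, b} ending in "ab") and all function names are hypothetical illustrations, not taken from the slides.

```python
def determinize(alphabet, delta, start, accepting):
    """Subset construction.

    delta maps (state, letter) to a set of states; no epsilon moves.
    Returns (dfa_states, dfa_delta, dfa_start, dfa_accepting).
    """
    start_set = frozenset([start])
    dfa_delta = {}
    seen = {start_set}
    todo = [start_set]
    while todo:
        current = todo.pop()
        for letter in alphabet:
            # Union of all NFA moves from any state in the current subset.
            target = frozenset(
                t for s in current for t in delta.get((s, letter), ())
            )
            dfa_delta[(current, letter)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A subset is accepting if it contains at least one accepting NFA state.
    dfa_accepting = {s for s in seen if s & set(accepting)}
    return seen, dfa_delta, start_set, dfa_accepting


def dfa_accepts(dfa, word):
    """Run the deterministic automaton on a word."""
    _states, dfa_delta, state, dfa_accepting = dfa
    for letter in word:
        state = dfa_delta[(state, letter)]
    return state in dfa_accepting


# Hypothetical NFA: q0 guesses when the final "ab" starts.
NFA_DELTA = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
}
DFA = determinize("ab", NFA_DELTA, "q0", {"q2"})
```

Only reachable subsets are built, so this small example yields three deterministic states. In the worst case the number of subsets is exponential in the number of original states, which is the blow-up the slide mentions.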
Reminder (2) • L(A) = set of words accepted by automaton A • Regular languages • Can be described by regular expressions, e.g. a(b+c)*d • Closed under complement • Closed under union, intersection • Product automaton with states (s,s’) where s is from A and s’ is from A’
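The product construction for union and intersection can be sketched by running two deterministic automata in lockstep, so that the pair (s, s') plays the role of the product state. The two example automata below (`EVEN_A`, `ENDS_B`) and the `mode` parameter are assumptions made for illustration only.

```python
def product_accepts(dfa1, dfa2, word, mode="intersection"):
    """Simulate the product of two complete DFAs on a word.

    Each DFA is a triple (delta, start, accepting) where delta maps
    (state, letter) to a single state.
    """
    delta1, start1, final1 = dfa1
    delta2, start2, final2 = dfa2
    s, s_prime = start1, start2  # the product state (s, s')
    for letter in word:
        s, s_prime = delta1[(s, letter)], delta2[(s_prime, letter)]
    if mode == "intersection":
        return s in final1 and s_prime in final2
    return s in final1 or s_prime in final2  # union


# Hypothetical DFA: words with an even number of a's.
EVEN_A = (
    {("e", "a"): "o", ("e", "b"): "e", ("o", "a"): "e", ("o", "b"): "o"},
    "e",
    {"e"},
)
# Hypothetical DFA: words ending with b.
ENDS_B = (
    {("n", "a"): "n", ("n", "b"): "y", ("y", "a"): "n", ("y", "b"): "y"},
    "n",
    {"y"},
)
```

The only difference between union and intersection is the choice of accepting pairs, which is why both closure properties follow from the same construction.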
Automata on words versus trees • Words: left to right or right to left: no difference • Trees: top down or bottom up: differences [Figure: a word and a tree over labels a and b]
Automata: Automata on ranked trees
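Binary tree automata • Parallel evaluation • For leaves: a rule assigns a state from the label alone • For other nodes: a rule assigns a state from the label and the states of the two children [Figure: bottom-up run on a binary tree over labels a and b, with states q, q’, q”, q1, q2]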
Bottom-up tree automata • Bottom-up: if a node labeled a has its children in states q, q’ then the node moves nondeterministically to state r or r’ • Accepts if the root is in some state in F • Not deterministic if there are alternatives (such as r or r’) or ε-transitions
Boolean circuit evaluation [Figure: a tree of Boolean gates with leaves labeled 0 and 1, evaluated bottom up; the root evaluates to 1: OK]
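Boolean circuit evaluation is a convenient instance of a deterministic bottom-up automaton on binary trees: the two states stand for the values 0 and 1, and a tree is accepted when the root reaches the state for 1. The sketch below assumes the gates are "and" and "or" and represents trees as nested tuples; the names `LEAF_RULES`, `NODE_RULES`, `run` and `accepts` are hypothetical.

```python
# Leaf rules: the state depends only on the label.
LEAF_RULES = {"0": "q0", "1": "q1"}

# Inner-node rules: (label, left state, right state) -> state.
NODE_RULES = {}
for x in (0, 1):
    for y in (0, 1):
        NODE_RULES[("and", f"q{x}", f"q{y}")] = f"q{x & y}"
        NODE_RULES[("or", f"q{x}", f"q{y}")] = f"q{x | y}"

ACCEPTING = {"q1"}


def run(tree):
    """Return the state reached at the root, computed bottom up.

    A tree is either a leaf label ("0" or "1") or a triple
    (gate, left subtree, right subtree).
    """
    if isinstance(tree, str):
        return LEAF_RULES[tree]
    label, left, right = tree
    return NODE_RULES[(label, run(left), run(right))]


def accepts(tree):
    """The circuit is accepted if its root evaluates to 1."""
    return run(tree) in ACCEPTING
```

For example, `accepts(("or", ("and", "1", "0"), "1"))` holds, while `accepts(("and", ("or", "0", "0"), "1"))` does not. The two subtrees of a node can be evaluated independently, which is the parallel evaluation mentioned for binary tree automata.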
Regular tree language = set of trees accepted by a bottom-up tree automaton
Regular tree languages The following are equivalent • L is a regular tree language • L is accepted by a nondeterministic bottom-up automaton • L is accepted by a deterministic bottom-up automaton • L is accepted by a nondeterministic top-down automaton Deterministic top-down is weaker
Top-down tree automata • Top-down: if a node labeled a is in state q”, then its left child moves to state q (right to q’) • Accepts if all leaves are in states in F • Not deterministic if a node labeled a in state q” has alternative choices for the states of its children
Why is deterministic top-down weaker? • Consider the language • L = { f(a,b), f(b,a) } • It can be accepted by a bottom-up TA • Exercise: write a BUTA A such that L = L(A) • Suppose that B is a deterministic top-down TA with L = L(B) • Exercise: Show that B also accepts f(a,a) • A contradiction Fact: No deterministic top-down tree automaton accepts L
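One possible answer to the first exercise is sketched below: a bottom-up automaton that remembers in the state of each leaf whether it is an a or a b, and accepts at f only for the two mixed combinations. The state names `qa`, `qb`, `ok` and the tuple encoding of trees are assumptions chosen for the illustration.

```python
# Leaf rules: remember which letter was read.
LEAF = {"a": "qa", "b": "qb"}

# Rules for f: only the two mixed combinations lead to the accepting state.
NODE = {
    ("f", "qa", "qb"): "ok",
    ("f", "qb", "qa"): "ok",
}
FINAL = {"ok"}


def state_of(tree):
    """Return the state reached at the root, or None if no rule applies.

    A tree is a leaf label ("a" or "b") or a triple ("f", left, right).
    """
    if isinstance(tree, str):
        return LEAF.get(tree)
    label, left, right = tree
    return NODE.get((label, state_of(left), state_of(right)))


def accepts(tree):
    """Accept if the root reaches a state in FINAL."""
    return state_of(tree) in FINAL
```

The second exercise follows from the fact that a deterministic top-down automaton sends a fixed state to the left child and a fixed state to the right child of f, whatever the leaves are. Accepting f(a,b) means a is accepted on the left, accepting f(b,a) means a is accepted on the right, so f(a,a) is accepted as well.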
Ranked tree automata: Properties • Like for words, only with higher complexity • Determinization • Minimization • Closed under • Complement • Intersection • Union
But… • XML documents are unranked • The kind of things we want to do: book (intro,section*,conclusion)
Automata: Automata on unranked trees
Unranked tree automata Issue: represent an infinite set of transitions Solution: a regular language
Unranked tree automata (2) • Rule: given for a label a by a regular expression Q over states and target states r1,…,rm • Meaning: if the states of the children of some node labeled a form a word in L(Q), this node moves to some state in {r1,…,rm}
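The rule format described here can be sketched with ordinary regular expressions over state names. The example encodes the earlier requirement book (intro, section*, conclusion); it assumes single-letter state names so that a word of states is a plain string, and the names `RULES`, `states` and `accepts` are hypothetical.

```python
import re
from itertools import product

# Each rule: (label, regular expression over child states, resulting state).
# States are single letters: i = intro, s = section, c = conclusion, k = book.
RULES = [
    ("intro", r"", "i"),
    ("section", r"", "s"),
    ("conclusion", r"", "c"),
    ("book", r"is*c", "k"),
]
FINAL = {"k"}


def states(tree):
    """Set of states reachable at the root of tree = (label, [children])."""
    label, children = tree
    child_states = [states(child) for child in children]
    reached = set()
    # Try every choice of one state per child (nondeterministic run).
    for word in product(*child_states):
        for rule_label, regex, result in RULES:
            if rule_label == label and re.fullmatch(regex, "".join(word)):
                reached.add(result)
    return reached


def accepts(tree):
    """Accept if the root can reach a state in FINAL."""
    return bool(states(tree) & FINAL)
```

A single rule for `book` covers any number of sections, which is how a regular language stands in for an infinite set of transitions. Enumerating all state choices is exponential in the worst case; it keeps the sketch short and is not meant as an efficient algorithm.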
Building on ranked trees • Ranked tree: FirstChild-NextSibling • F: encoding into a ranked tree • F is a bijection • F^-1: decoding [Figure: an unranked tree over labels a and b and its FirstChild-NextSibling encoding as a binary tree]
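The FirstChild-NextSibling encoding F and its inverse can be written as two short recursive functions. This is a sketch under an assumed representation: an unranked tree is a pair (label, list of children), and a binary node is a triple (label, first child or None, next sibling or None). The function names are hypothetical.

```python
def encode_forest(forest):
    """Encode a list of sibling trees as one binary tree (or None if empty)."""
    if not forest:
        return None
    (label, children), rest = forest[0], forest[1:]
    # Left pointer: first child. Right pointer: next sibling.
    return (label, encode_forest(children), encode_forest(rest))


def encode(tree):
    """F: unranked tree -> binary tree whose root has no next sibling."""
    return encode_forest([tree])


def decode_forest(node):
    """Decode a binary tree back into the list of sibling trees it represents."""
    if node is None:
        return []
    label, first_child, next_sibling = node
    return [(label, decode_forest(first_child))] + decode_forest(next_sibling)


def decode(node):
    """F^-1: binary tree (root without next sibling) -> unranked tree."""
    forest = decode_forest(node)
    if len(forest) != 1:
        raise ValueError("the root of an encoded tree has no next sibling")
    return forest[0]
```

For instance, a root a with children b and c becomes `("a", ("b", None, ("c", None, None)), None)`, and decoding returns the original tree.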
Building on bottom-up ranked trees (2) • For each unranked TA A, there is a ranked TA accepting F(L(A)) • For each ranked TA A, there is an unranked TA accepting F^-1(L(A)) • Both are easy to construct • Consequence: unranked TA are closed under union, intersection, complement
Determinization • Always possible for bottom-up • Can we use the FirstChild-NextSibling encoding? • No: it does not preserve determinism
Top-down? • This is more delicate • Transition δ(a,q) = A(a,q) • The state of the automaton A(a,q) when reading the labels of the children of a node labeled a determines the states of the children of that node • Accepts if all the leaves are in accepting states
Boolean circuit evaluation • The example circuit is accepted • The automaton rejects if some leaf is neither 0 with state q0 nor 1 with state q1 [Figure: a tree of Boolean gates with leaves labeled 0 and 1]
Automata: Automata and monadic second-order logic
Monadic second-order logic • Representation of a tree as a logical structure • E(1,2), E(1,3)… E(3,9) • S(2,3), S(3,4), S(4,5)… S(8,9) • a(1), a(4), a(8) • b(2), b(3), b(5), b(6), b(7), b(9) [Figure: tree with nodes numbered 1 to 9, each labeled a or b]
Monadic second-order logic: MSO syntax • Atoms: E(x,y) (child), S(x,y) (next sibling), a(x), b(x) (labels) • Set variable X, with atoms x ∈ X • Quantification over a set variable: ∃X, ∀X
Example of MSO • Each a node has a b-descendant • This corresponds to the formula ∀x ( a(x) → ∀X ( ( x ∈ X ∧ ∀y ∀z ( (y ∈ X ∧ E(y,z)) → z ∈ X ) ) → ∃y ( y ∈ X ∧ b(y) ) ) ) • For each node x labeled a: each set X that (i) contains x and (ii) is closed under descendant contains some y labeled b
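The property "each a node has a b-descendant" can be checked directly on a tree given as a logical structure. It is enough to look at the smallest set that contains x and is closed under the child relation E, since every other closed set containing x includes it. The sketch below assumes that in the numbered example tree, nodes 2 to 5 are the children of node 1 and nodes 6 to 9 are the children of node 3; the names `CHILDREN`, `LABEL`, `closure` and `every_a_has_b_descendant` are hypothetical.

```python
# Child relation E and labels of the numbered example tree (assumed shape).
CHILDREN = {1: [2, 3, 4, 5], 3: [6, 7, 8, 9]}
LABEL = {1: "a", 2: "b", 3: "b", 4: "a", 5: "b", 6: "b", 7: "b", 8: "a", 9: "b"}


def closure(x, children):
    """Smallest set that contains x and is closed under the child relation."""
    closed, todo = {x}, [x]
    while todo:
        node = todo.pop()
        for child in children.get(node, []):
            if child not in closed:
                closed.add(child)
                todo.append(child)
    return closed


def every_a_has_b_descendant(children, label):
    """Check the property for every node labeled a."""
    return all(
        any(label[y] == "b" for y in closure(x, children))
        for x in label
        if label[x] == "a"
    )
```

Under the assumed shape, the example tree does not satisfy the property, because nodes 4 and 8 are leaves labeled a. A two-node tree with root a and a single child b does satisfy it.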
Bridge • Theorem: for a set L of trees, the following are equivalent • L = L(A) for some bottom-up tree automaton A, i.e. L is definable with bottom-up tree automata • L = {T | T satisfies φ} for some MSO formula φ, i.e. L is definable in MSO
XML typing: DTDs
DTD • Describe the children of a node of a label a by a regular expression • Bizarre syntax <!ELEMENT populationdata (continent*) > <!ELEMENT continent (name, country*) > <!ELEMENT country (name, province*)> <!ELEMENT province (name, city*) > <!ELEMENT city (name, pop) > <!ELEMENT name (#PCDATA) > <!ELEMENT pop (#PCDATA) >
DTD and determinism • Regular expressions in DTD should be deterministic • Complicated definition • Intuition: the corresponding automaton should be deterministic • (a+b)*a is not • When reading <a>, one cannot tell whether it is an a from (a+b) or if it is the a of the end • (b*a)(b*a)* is an equivalent expression that is deterministic
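The claim that (b*a)(b*a)* describes the same language as (a+b)*a can be sanity-checked by brute force over short words. The sketch below uses Python's `re` syntax, where the slide's `+` for alternation is written `|`. It only compares the two expressions on all words up to a length bound, so it is a test and not a proof, and it says nothing about determinism itself. The function name `same_language` is hypothetical.

```python
import re
from itertools import product


def same_language(regex1, regex2, alphabet="ab", max_len=8):
    """Compare two regular expressions on every word up to max_len letters."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            in_first = re.fullmatch(regex1, word) is not None
            in_second = re.fullmatch(regex2, word) is not None
            if in_first != in_second:
                return False
    return True
```

Calling `same_language(r"(a|b)*a", r"(b*a)(b*a)*")` finds no difference: both expressions accept exactly the nonempty words that end with a.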
Very efficient validation • It suffices to verify for each node labeled a that the word formed by the labels of its children is accepted by the finite state automaton Aa • Possible to type check the document while scanning it, e.g. with a SAX parser
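Very efficient validation (2) • DTD: <!ELEMENT a (b, c)> <!ELEMENT b (d+)> • Document: <a><b><d/><d/></b><c/></a> [Figure: the document tree with root a, children b and c, and two d children under b; automaton Aa with states s, t, u reads b then c and accepts; automaton Ab with states s’, t’ reads d]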
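The per-node check can be sketched with the standard library alone. Each content model is written as a regular expression over the sequence of child element names, and every node is checked independently. For brevity the sketch parses the whole document with `xml.etree.ElementTree` rather than streaming it with SAX, and it assumes that the elements c and d, which the slide does not declare, are empty. The names `CONTENT_MODEL` and `is_valid` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Content models as regular expressions over child names,
# each name followed by a space so that names cannot run together.
CONTENT_MODEL = {
    "a": r"b c ",   # <!ELEMENT a (b, c)>
    "b": r"(d )+",  # <!ELEMENT b (d+)>
    "c": r"",       # assumed empty (not declared on the slide)
    "d": r"",       # assumed empty (not declared on the slide)
}


def is_valid(xml_text):
    """Check every node's children against the content model of its label."""
    root = ET.fromstring(xml_text)
    for node in root.iter():
        model = CONTENT_MODEL.get(node.tag)
        if model is None:
            return False  # undeclared element
        word = "".join(child.tag + " " for child in node)
        if not re.fullmatch(model, word):
            return False
    return True
```

The document from the slide, `<a><b><d/><d/></b><c/></a>`, passes this check, while `<a><b/><c/></a>` fails because b requires at least one d.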
Warning • The previous example can be checked with a simple automaton on words • But not the following one: <!ELEMENT part ( part* ) > • A stack is needed for accepting <a>…<a></a>…</a>, that is, n opening tags <a> followed by n closing tags </a>
Some bad news for DTD • Not closed under union • DTD1: … <!ELEMENT used (ad*)> <!ELEMENT ad (year, brand)> • DTD2: … <!ELEMENT new (ad*)> <!ELEMENT ad (brand)> • L(DTD1) ∪ L(DTD2) cannot be described by a DTD but can be described easily by a tree automaton • Problem with the type of ad, which depends on its parent • Also not closed under complement • Limited expressive power