Optimal Probabilistic Generators for XML Collections. Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart [ICDT 2012]. Adding probabilities to an XML Schema. XML schemas are useful for describing the structure of XML documents, e.g., DTD or XSD.
Optimal Probabilistic Generators for XML Collections Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart [ICDT 2012]
Adding probabilities to an XML Schema
XML schemas are useful for describing the structure of XML documents.
• E.g., DTD or XSD
Schemas may be very general (e.g., XHTML, RSS).
We want to add probabilities that reflect the likelihood of different parts of the schema.
• We will use the probabilities to turn the schema into a probabilistic generative model for XML documents.
• In particular, we want them to maximize the likelihood of a given XML document or document collection.
One Application: XML Auto-Completion [SIGMOD 2012]
Based on previous document versions / a corpus of example documents
Suggest nodes / sub-trees / node values to the user
For example:
<MyPapers>
  <Paper>
    <title>XML for Beginners</title>
    <author>M. Jones</author>
    <author>H. Q. David</author>
    <author>L. Martin</author>
    <author>S. Smith</author>
  </Paper>
  <Paper>
    <title>Advanced XML</title>
    <author>M. Jones</author>
    <author>J. E. Peterson</author>
    <author>G. L. Williams</author>
  </Paper>
  <Paper>
    <title> </title>
    <author> </author>
    <author> </author>
    <author> </author>
  </Paper>
</MyPapers>
Challenges:
• Allow editing every part of the document
• What kind of completion to suggest?
• Finding the top-k best completions
Many Other Usages for a Probabilistic Schema
• Testing – e.g., generating many XML messages to simulate network load and test system performance.
• Explaining – e.g., a probabilistic schema for DBLP may show which types of publications are rarely used, which kinds of attributes are not filled for BibTeX, etc.
• Schema Evaluation – how well a given schema describes a given corpus.
Our Solution: An Outline
Preliminaries – Tree Automata
Generators for Schemas without Constraints
Adding Constraints
• Restart Generators
• Continuation-Test Generators
Leaf Values
Schema as a Deterministic Tree Automaton
An XML document is modeled as an ordered tree.
Schema validation: the children of an a-labeled node are accepted by the DFA Aa.
Validation is performed for the children of every inner node.
[Figure: document d0, a root r with children a, b, c followed by the end marker $, with leaf values abcd, 532, abcd; automaton Ar with states q0, q1, q2 and transitions labeled a, b, c, $, where L(Ar) = a*bc*$]
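A minimal sketch of the validation step for the children of one node. The transition table is an assumption reconstructed from the language L(Ar) = a*bc*$ given on the slide; the names TRANSITIONS and accepts are illustrative only.

```python
# Assumed encoding of A_r, reconstructed from L(A_r) = a*bc*$.
TRANSITIONS = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}
START, ACCEPTING = "q0", {"q2"}


def accepts(children):
    """Return True if the child-label sequence (ending with '$') is accepted."""
    state = START
    for label in children:
        key = (state, label)
        if key not in TRANSITIONS:
            return False  # no transition for this label: reject
        state = TRANSITIONS[key]
    return state in ACCEPTING


print(accepts(["a", "b", "c", "$"]))  # children of r in d0
print(accepts(["b", "a", "$"]))       # 'a' after 'b' is not allowed
```

Validating a whole document would repeat this check for every inner node, each with the automaton of its own label.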
Using the Schema as a Generator
Recall that we want to turn the schema from an acceptor into a probabilistic generative model.
Straightforward nondeterministic generator: repeatedly choose an accepting run for a node's automaton, and generate children accordingly.
Adding probabilities: we consider two problem settings
• Generating documents that are accepted by the schema, while maximizing the likelihood of a corpus.
• Additionally, imposing integrity constraints on the documents (e.g., key constraints).
Probabilistic Generator
Each transition is assigned a probability.
We assume independent choices (a Markovian process), thus the document probability is the product of the probabilities of the transitions taken.
[Figure: automaton Ar with states q0, q1, q2 and transition probabilities pa, pb, pc, p$; generated document d, a root r with children a, a, b, $]
In this case, Pr(d) = pa∙pa∙pb∙p$
The schema and generator ignore leaf values (for now!)
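A small sketch of generation and of the product formula, assuming the automaton a*bc*$ from the preliminaries. The numeric probabilities (0.5 everywhere) are placeholders chosen for illustration, not values from the talk, and all names are hypothetical.

```python
import random

# Assumed probabilistic automaton: state -> label -> (next state, probability).
PROBS = {
    "q0": {"a": ("q0", 0.5), "b": ("q1", 0.5)},
    "q1": {"c": ("q1", 0.5), "$": ("q2", 0.5)},
}
START, FINAL = "q0", "q2"


def generate_children(rng):
    """Sample one child sequence by independent choices at each state."""
    state, children = START, []
    while state != FINAL:
        labels = list(PROBS[state])
        weights = [PROBS[state][label][1] for label in labels]
        label = rng.choices(labels, weights=weights)[0]
        children.append(label)
        state = PROBS[state][label][0]
    return children


def probability(children):
    """Product of the probabilities of the transitions used; 0.0 if rejected."""
    state, p = START, 1.0
    for label in children:
        out = PROBS.get(state, {})
        if label not in out:
            return 0.0
        state, q = out[label]
        p *= q
    return p if state == FINAL else 0.0


print(probability(["a", "a", "b", "$"]))  # pa * pa * pb * p$ with the placeholder values
print(generate_children(random.Random(0)))
```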
Formal Problem Definition
Given a corpus D of documents, and a deterministic schema S that accepts every document in D
We want to find an optimal generator based on S:
• Find probabilities for the transitions of S that maximize the probability of generating D,
• i.e., the maximum likelihood estimator (MLE).
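Written out under the independence assumption, the objective can be sketched as follows. The notation is ours and not from the slides: p(q, x) is the probability of the transition from state q on label x, and n_d(q, x) is the number of times that transition is used when validating document d.

```latex
\hat{p} \;=\; \arg\max_{p} \; \prod_{d \in D} \Pr\nolimits_{p}(d),
\qquad
\Pr\nolimits_{p}(d) \;=\; \prod_{(q, x)} p(q, x)^{\, n_d(q, x)},
\qquad
\text{subject to } \sum_{x} p(q, x) = 1 \ \text{ for every state } q .
```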
A Learning Algorithm
The frequency of using each transition during the corpus verification process is recorded.
[Figure: document d0 (root r with children a, b, c, $) validated by automaton Ar; each of the transitions a, b, c, $ is used once, so each has count 1]
A Learning Algorithm (Cont.)
This is repeated for every node in every corpus document.
We set the probability of each transition to be its relative frequency among the transitions leaving the same state.
[Figure: automaton Ar with each of the four counts divided by 2, the total count of its source state]
• Theorem: This efficient algorithm learns the MLE probabilities, i.e., it finds an optimal probabilistic generator
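The counting-and-normalizing procedure can be sketched as below. This is an illustration under assumptions: the transition table is reconstructed from L(Ar) = a*bc*$, the corpus is given as one child-label sequence per validated node, and the name learn is hypothetical.

```python
from collections import Counter, defaultdict

# Assumed encoding of A_r, reconstructed from L(A_r) = a*bc*$.
TRANSITIONS = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}


def learn(corpus):
    """Relative frequency of each transition among those leaving its state."""
    counts = Counter()
    for children in corpus:
        state = "q0"
        for label in children:
            counts[(state, label)] += 1
            state = TRANSITIONS[(state, label)]
    totals = defaultdict(int)
    for (state, _label), n in counts.items():
        totals[state] += n
    # Transitions never used in the corpus get no entry (probability 0).
    return {t: n / totals[t[0]] for t, n in counts.items()}


# The children of r in d0: every transition is used once.
probs = learn([["a", "b", "c", "$"]])
print(probs)
```

On this one-document corpus each state has two outgoing transitions used once each, which gives the /2 normalization shown on the slide.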
Termination
Theorem: generation terminates with probability 1.
• Guaranteed only because of the choice of probabilities according to the corpus.
Integrity Constraints
We want to support integrity constraints, which are used in XML schema languages.
Key Constraint: the values of a-labeled leaves are unique (unary key)
Inclusion Constraint: the values of a-labeled leaves are contained in those of b-labeled leaves
Domain Constraint: the values of a-labeled leaves belong to some (finite or infinite) domain
New Problem
• We want to find optimal generators for XML schemas with constraints.
• Valid generator output: an XML document, which
• is accepted by the schema, and
• has a valid leaf value assignment, i.e., one that does not violate the constraints
• Example: a, b, c are unique and contain each other
[Figure: example documents with a-, b- and c-labeled leaves]
Restart Generators
A simple idea:
• Use a probabilistic generator to generate a document
• Check if it has a value assignment valid w.r.t. the constraints
• If not, 'restart' and try again until a valid document is generated
Proposition: Given a document with no values, checking for the existence of a valid value assignment is in PTIME
• Proof: By translating the constraints to bounds on the number of unique values for each leaf label
Bad news: the number of restarts can be unboundedly large in an optimal generator
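The restart loop can be sketched as follows. Several things here are assumptions for illustration: the automaton a*bc*$ with placeholder probabilities, the constraint set "c is unique and included in a" (so a skeleton is valid iff it has at most as many c-leaves as a-leaves), and the max_restarts budget, which is our addition so the sketch cannot loop forever. The is_valid function is a stand-in for the general PTIME check, not the check itself.

```python
import random

# Assumed probabilistic automaton: state -> label -> (next state, probability).
PROBS = {
    "q0": {"a": ("q0", 0.5), "b": ("q1", 0.5)},
    "q1": {"c": ("q1", 0.5), "$": ("q2", 0.5)},
}


def generate_children(rng):
    """Unconstrained generation: sample transitions until the final state."""
    state, children = "q0", []
    while state != "q2":
        labels = list(PROBS[state])
        weights = [PROBS[state][label][1] for label in labels]
        label = rng.choices(labels, weights=weights)[0]
        children.append(label)
        state = PROBS[state][label][0]
    return children


def is_valid(children):
    """Stand-in validity check for the assumed constraints: |c| <= |a|."""
    return children.count("c") <= children.count("a")


def restart_generate(rng, max_restarts=10_000):
    """Generate, check, and restart until a valid skeleton is produced."""
    for restarts in range(max_restarts + 1):
        children = generate_children(rng)
        if is_valid(children):
            return children, restarts
    raise RuntimeError("no valid document within the restart budget")


children, restarts = restart_generate(random.Random(1))
print(children, restarts)
```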
Continuation-test Generators
Never make choices that lead to a 'dead end', thus always generate a valid document.
We use a binary test to check if a choice has a continuation.
Example: add to the schema of d0 the constraints:
• c is included in a
• c is unique
Implies |c| ≤ |a|
The generation process: perform a continuation-test before taking the transition.
[Figure: automaton Ar with states q0, q1, q2 and transition probabilities pa, pb, pc, p$; generated document d, a root r with children a, b, c, $]
Pr(d) = pa∙pb∙pc∙1
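A sketch of the generation process for this example only. The has_continuation function hard-codes the example's constraint |c| ≤ |a| and is not the general continuation test; the probabilities are placeholders; and the code assumes binary choices, so a transition taken with no available alternative contributes a factor of 1.

```python
import random

# Assumed automaton for a*bc*$ and placeholder transition probabilities.
TRANSITIONS = {"q0": {"a": "q0", "b": "q1"}, "q1": {"c": "q1", "$": "q2"}}
P = {("q0", "a"): 0.5, ("q0", "b"): 0.5, ("q1", "c"): 0.5, ("q1", "$"): 0.5}


def has_continuation(prefix, label):
    """Example-specific test: another c is allowed only while |c| < |a|."""
    if label == "c":
        return prefix.count("c") + 1 <= prefix.count("a")
    return True


def available(state, prefix):
    return [x for x in TRANSITIONS.get(state, {}) if has_continuation(prefix, x)]


def generate(rng):
    state, doc = "q0", []
    while state != "q2":
        options = available(state, doc)
        if len(options) == 1:
            label = options[0]  # forced choice, taken with probability 1
        else:
            weights = [P[(state, x)] for x in options]
            label = rng.choices(options, weights=weights)[0]
        doc.append(label)
        state = TRANSITIONS[state][label]
    return doc


def probability(doc):
    state, prefix, p = "q0", [], 1.0
    for label in doc:
        options = available(state, prefix)
        if label not in options:
            return 0.0
        if len(options) > 1:
            p *= P[(state, label)]
        prefix.append(label)
        state = TRANSITIONS[state][label]
    return p if state == "q2" else 0.0


print(probability(["a", "b", "c", "$"]))  # pa * pb * pc * 1
```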
Learning Algorithm for Continuation-test Generators
The probabilities are again relative frequencies, but only in cases where there was an alternative choice.
• (q1, $) was chosen only when (q1, c) was not available.
[Figure: automaton Ar with learned counts for d0; the counts of the transitions from q0 are divided by 2 and those of the transitions from q1 by 1]
The learned generator will generate as many c-s as a-s
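The modified counting rule can be sketched like this, again for the example constraints only. The has_continuation function is an example-specific stand-in for the continuation test, the automaton is the assumed encoding of a*bc*$, and the function and variable names are hypothetical.

```python
from collections import Counter

# Assumed automaton for a*bc*$.
TRANSITIONS = {"q0": {"a": "q0", "b": "q1"}, "q1": {"c": "q1", "$": "q2"}}


def has_continuation(prefix, label):
    """Example-specific test for 'c unique and included in a': |c| <= |a|."""
    if label == "c":
        return prefix.count("c") + 1 <= prefix.count("a")
    return True


def learn(corpus):
    """Relative frequencies, counting only choices that had an alternative."""
    chosen, offered = Counter(), Counter()
    for doc in corpus:
        state, prefix = "q0", []
        for label in doc:
            options = [x for x in TRANSITIONS[state] if has_continuation(prefix, x)]
            if len(options) > 1:  # forced choices are not counted
                chosen[(state, label)] += 1
                offered[state] += 1
            prefix.append(label)
            state = TRANSITIONS[state][label]
    return {
        (state, label): chosen[(state, label)] / offered[state]
        for state, out in TRANSITIONS.items()
        for label in out
        if offered[state]
    }


probs = learn([["a", "b", "c", "$"]])
print(probs)
```

On d0 the final $ is a forced choice and is skipped, so c gets probability 1 at q1 and the learned generator emits c until the constraint stops it.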
Results for Continuation-test Generators
Theorem: The algorithm learns an optimal continuation-test generator, for automata with binary choices.
• Extensions to non-binary choices are discussed in the paper
Theorem: Continuation-test is NP-Complete
• But only in the size of the schema; it is polynomial in the document size
• Both generation and finding the optimal generator are polynomial when using a continuation-test oracle.
• Based on schema satisfiability test [David et al. 2011]
Theorem: the probability of termination for a continuation-test generator may be arbitrarily small!
• Proof: by construction of a simple, non-recursive schema
• Can be handled by adding a constraint on the document size.
• Sub-classes of schemas that guarantee termination?
Adding Values to the Structure
So far our generators were used only for the document structure.
Leaf values may also have a distribution according to which they can be generated.
• The distribution may be learned from the same document collection
We will focus on the interesting case: generating leaf values for a schema with constraints
Suggested Algorithm
We start with a valid document skeleton.
Order labels by inclusion constraints (e.g., c, b, a).
Choose a leaf with the 'smallest' (most included) label, together with leaves of the labels that include it.
Draw a value (from the domain) according to a given distribution.
Use the PTIME test to verify validity; if the test fails, revert the step.
Improvements are presented in the paper.
[Figure: document skeleton, a root r with children a, b, c, $, with leaf values abcd, efg, abcd]
Related Work
Schema satisfiability tests [Fan & Libkin 2001; David, Libkin & Tan 2011]
Probabilistic XML and probabilistic schemas [e.g., Benedikt, Kharlamov, Olteanu & Senellart 2010]
Probabilistic XML generation [e.g., Antonopoulos, Geerts, Martens & Neven 2011]
Schema inference [e.g., Bex, Gelade, Neven & Vansummeren 2008]
AXML [Abiteboul, Benjelloun & Milo 2008]
PCFGs [e.g., Chi & Geman 1998]
Conclusion
A model for probabilistic XML generators
Unconstrained case
• Generation and learning optimal generators can be done efficiently
• Termination is guaranteed
Constrained case
• Restart generators: the number of restarts is unbounded
• Continuation-test generators: generation and learning optimal generators are expensive, and termination is not guaranteed
Leaf value generation
In the talk, labels and states are coupled (as in a DTD), but all the results hold when they are uncoupled.
Future work
• More efficient combinations of restart and continuation-test generators
Thank You! Q&A