Optimal Probabilistic Generators for XML Collections

Optimal Probabilistic Generators for XML Collections. Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart [ICDT 2012]. Adding probabilities to an XML schema. XML schemas (e.g., DTD or XSD) are useful for describing the structure of XML documents.


Presentation Transcript


  1. Optimal Probabilistic Generators for XML Collections. Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart [ICDT 2012]

  2. Adding Probabilities to an XML Schema
     XML schemas (e.g., DTD or XSD) are useful for describing the structure of XML documents. Schemas may be very general (e.g., XHTML, RSS). We want to add probabilities that reflect the likelihood of different parts of the schema.
     • We will use the probabilities to turn the schema into a probabilistic generative model for XML documents.
     • In particular, we want the probabilities to maximize the likelihood of a given XML document or document collection.

  3. One Application: XML Auto-Completion [SIGMOD 2012]
     Based on previous document versions or a corpus of example documents, suggest nodes, sub-trees, or node values to the user. For example:
       <MyPapers>
         <Paper>
           <title>XML for Beginners</title>
           <author>M. Jones</author>
           <author>H. Q. David</author>
           <author>L. Martin</author>
           <author>S. Smith</author>
         </Paper>
         <Paper>
           <title>Advanced XML</title>
           <author>M. Jones</author>
           <author>J. E. Peterson</author>
           <author>G. L. Williams</author>
         </Paper>
         <Paper>
           <title> </title>
           <author> </author>
           <author> </author>
           <author> </author>
         </Paper>
       </MyPapers>
     Challenges:
     • Allowing edits to every part of the document
     • Deciding what kind of completion to suggest
     • Finding the top-k best completions

  4. Many Other Usages for a Probabilistic Schema
     • Testing: e.g., generating many XML messages to simulate network load and test system performance.
     • Explaining: e.g., a probabilistic schema for DBLP may show which types of publications are rarely used, which kinds of attributes are not filled in for BibTeX, etc.
     • Schema evaluation: how well a given schema describes a given corpus.

  5. Our Solution: An Outline
     • Preliminaries: Tree Automata
     • Generators for Schemas without Constraints
     • Adding Constraints
       • Restart Generators
       • Continuation-Test Generators
     • Leaf Values

  6. Schema as a Deterministic Tree Automaton
     An XML document is modeled as an ordered tree. Schema validation: the children of an a-labeled node must be accepted by a DFA Aa. Validation is performed for the children of every inner node.
     [Figure: document d0, a root r with children a, b, c followed by the end marker $ (leaf values abcd, 532, abcd); automaton Ar with states q0, q1, q2 and transitions labeled a, b, c, $, where L(Ar) = a*bc*$.]
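
The validation step for a single node can be sketched in Python as follows. This is only an illustration: the transition table is an assumption reconstructed from the stated language L(Ar) = a*bc*$ and the three states q0, q1, q2, the names TRANSITIONS, ACCEPTING, and accepts are hypothetical, and $ is treated as an end-of-children marker that the validator appends.

```python
# Assumed transition table for A_r, reconstructed from L(A_r) = a*bc*$.
TRANSITIONS = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}
ACCEPTING = {"q2"}


def accepts(children, start="q0"):
    """Return True if the child labels, followed by '$', are accepted."""
    state = start
    for label in list(children) + ["$"]:
        next_state = TRANSITIONS.get((state, label))
        if next_state is None:
            return False  # no transition: the sequence is rejected
        state = next_state
    return state in ACCEPTING


print(accepts(["a", "b", "c"]))  # children of the root of d0
print(accepts(["a", "c"]))       # missing the mandatory b
```

Validating a whole document would repeat this check at every inner node, each with the automaton associated with that node's label.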

  7. Using the Schema as a Generator
     Recall that we want to turn the schema from an acceptor into a probabilistic generative model. A straightforward nondeterministic generator repeatedly chooses an accepting run for a node's automaton and generates children accordingly. To add probabilities, we consider two problem settings:
     • Generating documents that are accepted by the schema, while maximizing the likelihood of a corpus.
     • Additionally, imposing integrity constraints on the documents (e.g., key constraints).

  8. Probabilistic Generator
     Each transition is assigned a probability. We assume the choices are independent (a Markovian process), so the probability of a document is the product of the probabilities of the transitions used to generate it. The schema and generator ignore leaf values (for now!).
     [Figure: automaton Ar with transition probabilities pa, pb, pc, p$, and a document with root r and children a, a, b, $. In this case, Pr(d) = pa∙pa∙pb∙p$.]
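
As a minimal sketch of this product, the following Python computes the probability of one node's child sequence. The transition table is again an assumption reconstructed from L(Ar) = a*bc*$, the numeric values chosen for pa, pb, pc, p$ are hypothetical, and document_probability is an illustrative name.

```python
# Assumed transitions of A_r (reconstructed from L(A_r) = a*bc*$).
TRANSITIONS = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}


def document_probability(children, probs, start="q0"):
    """Multiply the probabilities of the transitions used for children + '$'."""
    state, result = start, 1.0
    for label in list(children) + ["$"]:
        if (state, label) not in TRANSITIONS:
            return 0.0  # a rejected sequence is never generated
        result *= probs[(state, label)]
        state = TRANSITIONS[(state, label)]
    return result


# Hypothetical values for pa, pb, pc, p$ (each state's outgoing values sum to 1).
probs = {("q0", "a"): 0.6, ("q0", "b"): 0.4, ("q1", "c"): 0.3, ("q1", "$"): 0.7}
print(document_probability(["a", "a", "b"], probs))  # pa * pa * pb * p$
```

For a deeper document, the overall probability would be the product of these per-node values over all inner nodes.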

  9. Formal Problem Definition (Without Constraints)
     Given a corpus D of documents and a deterministic schema S that accepts every document in D, we want to find an optimal generator based on S:
     • Find probabilities for the transitions of S that maximize the probability of generating D, i.e., the maximum likelihood estimator (MLE).
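
One way to write this objective down is shown below. The notation is an assumption made for illustration and does not come from the slides: p(q, ℓ) denotes the probability of the transition that leaves state q on label ℓ, and Pr_p(d) is the probability of document d under the assignment p.

```latex
\hat{p} \;=\; \operatorname*{arg\,max}_{p} \; \prod_{d \in D} \Pr\nolimits_{p}(d)
\qquad \text{subject to} \qquad
\sum_{\ell} p(q, \ell) = 1 \quad \text{for every state } q \text{ of } S .
```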

  10. A Learning Algorithm
     The frequency with which each transition is used during verification of the corpus is recorded.
     [Figure: verifying a document with root r and children a, b, c, $ against automaton Ar sets the counter of each used transition (a, b, c, $) to 1.]

  11. A Learning Algorithm (Cont.)
     This is repeated for every node in every corpus document. We set the probability of each transition to its relative frequency, i.e., its counter divided by the total count of the transitions leaving the same state.
     • Theorem: This efficient algorithm learns the MLE probabilities, i.e., it finds an optimal probabilistic generator.
     [Figure: each of the four counters is divided by 2, the total for its state.]
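
The counting and normalization steps can be sketched as follows, assuming the automaton Ar reconstructed from L(Ar) = a*bc*$ and a corpus given as lists of child labels for a single node. The function name learn_probabilities and the data layout are illustrative choices, not taken from the paper.

```python
from collections import Counter

# Assumed transitions of A_r (reconstructed from L(A_r) = a*bc*$).
TRANSITIONS = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}


def learn_probabilities(corpus, start="q0"):
    """Count transition uses during verification, then normalize per state."""
    counts = Counter()
    for children in corpus:
        state = start
        for label in list(children) + ["$"]:
            counts[(state, label)] += 1
            # Raises KeyError if the schema does not accept the sequence.
            state = TRANSITIONS[(state, label)]
    totals = Counter()
    for (state, _label), n in counts.items():
        totals[state] += n
    return {key: n / totals[key[0]] for key, n in counts.items()}


# The single-document corpus from the slides: every transition is used once,
# so each of the four probabilities comes out as 1/2.
print(learn_probabilities([["a", "b", "c"]]))
```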

  12. Termination
     Theorem: generation terminates with probability 1.
     • This is guaranteed only because the probabilities are chosen according to the corpus.

  13. Integrity Constraints
     We want to support integrity constraints, which are used in XML schema languages.
     • Key constraint: the values of a-labeled leaves are unique (unary key).
     • Inclusion constraint: the values of a-labeled leaves are contained in those of b-labeled leaves.
     • Domain constraint: the values of a-labeled leaves belong to some (finite or infinite) domain.

  14. New Problem
     We want to find optimal generators for XML schemas with constraints. A valid generator output is an XML document which
     • is accepted by the schema, and
     • has a valid leaf value assignment, i.e., one that does not violate the constraints.
     Example: a, b, c are unique and contain each other.
     [Figure: example documents with root r and children labeled a, b, c.]

  15. Restart Generators
     A simple idea:
     • Use a probabilistic generator to generate a document.
     • Check whether it has a value assignment that is valid w.r.t. the constraints.
     • If not, 'restart' and try again until a valid document is generated.
     Proposition: Given a document with no values, checking for the existence of a valid value assignment is in PTIME.
     • Proof: by translating the constraints into bounds on the number of unique values for each leaf label.
     Bad news: the number of restarts can be unboundedly large in an optimal generator.
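
The restart loop can be sketched as follows. Everything specific here is an assumption for illustration: the automaton is the one reconstructed from L(Ar) = a*bc*$, the probabilities in PROBS are hypothetical, and has_valid_assignment hard-codes the bound |c| ≤ |a| that follows from the example constraints "c is unique" and "c is included in a" rather than implementing the general PTIME check. The max_restarts budget is also an addition, since the slides note that the number of restarts is unbounded.

```python
import random

# Assumed transitions of A_r and hypothetical transition probabilities.
TRANSITIONS = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}
PROBS = {"q0": {"a": 0.5, "b": 0.5}, "q1": {"c": 0.5, "$": 0.5}}


def generate_children(rng):
    """Run the unconstrained probabilistic generator for one node."""
    state, children = "q0", []
    while state != "q2":
        labels = list(PROBS[state])
        weights = [PROBS[state][label] for label in labels]
        label = rng.choices(labels, weights=weights)[0]
        state = TRANSITIONS[(state, label)]
        if label != "$":
            children.append(label)
    return children


def has_valid_assignment(children):
    """Example-specific check: unique c values contained in a need |c| <= |a|."""
    return children.count("c") <= children.count("a")


def restart_generate(rng, max_restarts=1000):
    """Generate until a document with a valid value assignment appears."""
    for restarts in range(max_restarts + 1):
        children = generate_children(rng)
        if has_valid_assignment(children):
            return children, restarts
    raise RuntimeError("no valid document within the restart budget")


children, restarts = restart_generate(random.Random(0))
print(children, "after", restarts, "restart(s)")
```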

  16. Continuation-Test Generators
     Never make choices that lead to a 'dead end', and thus always generate a valid document. We use a binary test to check whether a choice has a continuation, and perform this continuation test before taking each transition.
     Example: add to the schema of d0 the constraints:
     • c is included in a
     • c is unique
     Together these imply |c| ≤ |a|.
     [Figure: the generation process on automaton Ar with transition probabilities pa, pb, pc, p$, producing a document with root r and children a, b, c, $. After one a and one c, another c has no valid continuation, so $ is taken with probability 1: Pr(d) = pa∙pb∙pc∙1.]
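
A minimal sketch of this generation process for the example is given below. The continuation test here is hand-written for the two example constraints only (it allows a further c just when the result still satisfies |c| ≤ |a|) and is not the general test from the paper. The automaton is the one assumed from L(Ar) = a*bc*$, and the probabilities and function names are hypothetical.

```python
import random

# Assumed transitions of A_r and hypothetical transition probabilities.
TRANSITIONS = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}
PROBS = {"q0": {"a": 0.5, "b": 0.5}, "q1": {"c": 0.5, "$": 0.5}}


def has_continuation(children, label):
    """Example-specific test: can the prefix plus this label still be completed?"""
    if label == "c":
        # No more a-children can follow, so |c| must stay within |a|.
        return children.count("c") + 1 <= children.count("a")
    return True  # a, b and $ never lead to a dead end in this example


def generate(rng):
    """Generate one node's children, taking only transitions that pass the test."""
    state, children = "q0", []
    while state != "q2":
        options = [l for l in PROBS[state] if has_continuation(children, l)]
        if len(options) == 1:
            label = options[0]  # forced choice, taken with probability 1
        else:
            weights = [PROBS[state][l] for l in options]
            label = rng.choices(options, weights=weights)[0]
        state = TRANSITIONS[(state, label)]
        if label != "$":
            children.append(label)
    return children


print(generate(random.Random(1)))
```

Unlike a restart generator, every run of this sketch yields a document that admits a valid value assignment, at the price of evaluating the test before each choice.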

  17. Learning Algorithm for Continuation-Test Generators
     The probabilities are again relative frequencies, but computed only over the cases where there was an alternative choice.
     • In the example, (q1, $) was chosen only when (q1, c) was not available.
     The learned generator will generate as many c-s as a-s.
     [Figure: the counters for a and b are divided by 2; the counters for c and $ are divided by 1.]
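
The modified counting can be sketched as follows for the running example. As before, the automaton is the one assumed from L(Ar) = a*bc*$, has_continuation is a hand-written stand-in for the continuation test under the constraints "c is unique" and "c is included in a", and all names are illustrative. The corpus is assumed to contain only documents that satisfy the constraints.

```python
from collections import Counter

# Assumed transitions of A_r and the labels leaving each state.
TRANSITIONS = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}
LABELS = {"q0": ["a", "b"], "q1": ["c", "$"]}


def has_continuation(children, label):
    """Example-specific continuation test (keeps |c| <= |a|)."""
    if label == "c":
        return children.count("c") + 1 <= children.count("a")
    return True


def learn_with_continuation_test(corpus):
    """Relative frequencies over the steps where more than one choice existed."""
    chosen, offered = Counter(), Counter()
    for children in corpus:
        state, prefix = "q0", []
        for label in list(children) + ["$"]:
            options = [l for l in LABELS[state] if has_continuation(prefix, l)]
            if len(options) > 1:  # forced choices are not counted
                chosen[(state, label)] += 1
                offered[state] += 1
            state = TRANSITIONS[(state, label)]
            if label != "$":
                prefix.append(label)
    return {
        (s, l): chosen[(s, l)] / offered[s]
        for s in LABELS
        for l in LABELS[s]
        if offered[s]
    }


# For the document a, b, c: a and b get 1/2 each, c gets 1/1, and $ gets 0/1.
print(learn_with_continuation_test([["a", "b", "c"]]))
```

With these values, c is chosen whenever it is available, which is why the learned generator produces as many c-s as a-s.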

  18. Results for Continuation-Test Generators
     Theorem: The algorithm learns an optimal continuation-test generator, for automata with binary choices.
     • Extensions to non-binary choices are discussed in the paper.
     Theorem: The continuation test is NP-complete.
     • But only in the size of the schema; it is polynomial in the document size.
     • Both generation and finding the optimal generator are polynomial when using a continuation-test oracle.
     • Based on the schema satisfiability test of [David et al. 2011].
     Theorem: The probability of termination for a continuation-test generator may be arbitrarily small!
     • Proof: by construction of a simple, non-recursive schema.
     • Can be handled by adding a constraint on the document size.
     • Open question: are there sub-classes of schemas that guarantee termination?

  19. Adding Values to the Structure
     So far our generators were used only for the document structure. Leaf values may also have a distribution according to which they can be generated.
     • The distribution may be learned from the same document collection.
     We will focus on the interesting case: generating leaf values for a schema with constraints.

  20. Suggested Algorithm
     • We start with a valid document skeleton.
     • Order the labels by inclusion constraints (e.g., c, b, a).
     • Choose a leaf of the 'smallest' (most included) label, along with leaves of the labels that include it.
     • Draw a value (from the domain) according to a given distribution.
     • Use the PTIME test to verify validity; if the document is no longer valid, revert the step.
     Improvements are presented in the paper.
     [Figure: document skeleton with root r and children a, b, c, $, with leaf values abcd, efg, abcd.]

  21. Related Work
     • Schema satisfiability tests [Fan & Libkin 2001; David, Libkin & Tan 2011]
     • Probabilistic XML and probabilistic schemas [e.g., Benedikt, Kharlamov, Olteanu & Senellart 2010]
     • Probabilistic XML generation [e.g., Antonopoulos, Geerts, Martens & Neven 2011]
     • Schema inference [e.g., Bex, Gelade, Neven & Vansummeren 2008]
     • AXML [Abiteboul, Benjelloun & Milo 2008]
     • PCFGs [e.g., Chi & Geman 1998]

  22. Conclusion
     A model for probabilistic XML generators.
     • Unconstrained case
       • Generation and learning of optimal generators can be done efficiently.
       • Termination is guaranteed.
     • Constrained case
       • Restart generators: the number of restarts is unbounded.
       • Continuation-test generators: generation and learning of optimal generators are expensive, and termination is not guaranteed.
     • Leaf value generation
     In the talk, labels and states are coupled (as in a DTD), but all the results hold when they are uncoupled.
     Future work:
     • More efficient combinations of restart and continuation-test generators.

  23. Thank You! Q&A
