Finding Optimal Probabilistic Generators for XML Collections Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart
Adding probabilities to an XML Schema (Motivation)
Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer
Given a collection of XML documents, we sometimes have a schema the documents conform to.
• E.g., a DTD or XSD
• The schema restricts the structure, mostly parent-child node relations (using regular expressions)
The schema may be very general (e.g., XHTML, RSS).
We want to add probabilities that reflect the likelihood of different parts of the schema.
• We will use the probabilities to turn the schema into a probabilistic generative model for XML documents
• In particular, we want them to maximize the likelihood of a given XML document or document collection
One Application: XML Auto-Completion [SIGMOD 2012] (Motivation)
Based on previous document versions or a corpus of example documents, suggest nodes, sub-trees, or node values to the user.
For example:
<MyPapers>
  <Paper>
    <title>XML for Beginners</title>
    <author>M. Jones</author>
    <author>H. Q. David</author>
    <author>L. Martin</author>
    <author>S. Smith</author>
  </Paper>
  <Paper>
    <title>Advanced XML</title>
    <author>M. Jones</author>
    <author>J. E. Peterson</author>
    <author>G. L. Williams</author>
  </Paper>
  <Paper>
    <title> </title>
    <author> </author>
    <author> </author>
    <author> </author>
  </Paper>
</MyPapers>
Challenges:
• Allow editing in every part of the document
• What kind of completion to suggest?
• Finding the top-k best completions
Many Other Usages for a Probabilistic Schema (Motivation)
• Testing: e.g., generating many XML messages to simulate network load and test system performance.
• Explaining: e.g., a probabilistic schema for DBLP may show which types of publications are rarely used, which kinds of attributes are not filled for BibTeX, etc.
• Schema evaluation: how well a given schema describes a given corpus.
• …
Our Solution: An Outline
• Preliminaries: Tree Automata
• Generators for Schemas without Constraints
• Adding Constraints
  • Restart Generators
  • Continuation-Test Generators
• Leaf Values
Schema as a Deterministic Tree Automaton (Preliminaries)
An XML document is modeled as an ordered tree.
[Figure: document d0, a root r with children a, b, c followed by the end marker $; leaf values abcd, 532, abcd]
Schema validation: the children of an a-labeled node are accepted by the DFA Aa.
[Figure: automaton Ar with states q0, q1, q2 and transitions labeled a, b, c, $; L(Ar) = a*bc*$]
Validation is performed for the children of every inner node.
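The validation step for a single node can be sketched in Python. This is a minimal illustration and not code from the talk: the transition table is one reading of the slide's automaton Ar (with L(Ar) = a*bc*$), and the names `A_R` and `accepts` are hypothetical. The end marker $ is appended to the child sequence automatically.

```python
# Transition table for the root automaton A_r, assumed from the slide:
# q0 loops on a, moves to q1 on b; q1 loops on c, moves to q2 on $.
A_R = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}
ACCEPTING = {"q2"}


def accepts(children, delta=A_R, start="q0", accepting=ACCEPTING):
    """Return True if the sequence of child labels is accepted by the DFA."""
    state = start
    for label in list(children) + ["$"]:   # $ marks the end of the children
        state = delta.get((state, label))
        if state is None:                  # no transition: reject
            return False
    return state in accepting


print(accepts(["a", "b", "c"]))  # children of r in document d0
print(accepts(["a", "c"]))       # the mandatory b is missing
```

A full validator would run this check for the children of every inner node, each with the automaton of that node's label.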
Using the Schema as a Generator (Preliminaries)
Recall that we want to turn the schema from an acceptor into a probabilistic generative model.
Straightforward nondeterministic generator: repeatedly choose an accepting run for a node's automaton, and generate children accordingly.
Adding probabilities: we consider two problem settings
• Generating documents that are accepted by the schema, while maximizing the likelihood of a corpus.
• Additionally, imposing integrity constraints on the documents (e.g., key constraints)
Probabilistic Generator (Without Constraints)
[Figure: automaton with states q0, q1, q2 whose transitions a, b, c, $ carry probabilities pa, pb, pc, p$; generated document: root r with children a, a, b, $]
Each transition is assigned a probability.
We assume independent choices (a Markovian process), thus the document probability is the product of the probabilities of the transitions taken.
In this case, Pr(d) = pa∙pa∙pb∙p$
The schema and generator ignore leaf values (for now!)
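As a small worked example of the product rule, the sketch below computes the probability of one node's child sequence. The transition table is one reading of the slide's automaton, and the probability values (0.5 each) are made up for illustration; `DELTA`, `PROBS`, and `document_probability` are hypothetical names.

```python
# Root automaton (L = a*bc*$), assumed from the slide.
DELTA = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}

# Hypothetical probabilities; transitions leaving the same state sum to 1.
PROBS = {
    ("q0", "a"): 0.5,  # pa
    ("q0", "b"): 0.5,  # pb
    ("q1", "c"): 0.5,  # pc
    ("q1", "$"): 0.5,  # p$
}


def document_probability(children, delta=DELTA, probs=PROBS, start="q0"):
    """Product of the probabilities of the transitions used for one node."""
    state, prob = start, 1.0
    for label in list(children) + ["$"]:
        key = (state, label)
        if key not in delta:       # sequence not accepted: probability 0
            return 0.0
        prob *= probs[key]
        state = delta[key]
    return prob


# Children a, a, b, $ give pa * pa * pb * p$.
print(document_probability(["a", "a", "b"]))
```

For a document with several inner nodes, the same product would be taken over the runs of all nodes.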
Formal Problem Definition (Without Constraints)
Given a corpus D of documents, and a deterministic schema S that accepts every document in D, we want to find an optimal generator based on S:
• Find probabilities for the transitions of S that maximize the probability of generating D,
• i.e., the maximum likelihood estimator (MLE).
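The objective can also be written in symbols. The notation is an assumption introduced here for illustration: p_t is the probability of transition t, n_t(D) is the number of times t is used when validating the corpus, out(q) is the set of transitions leaving state q, and src(t) is the source state of t. The sketch assumes the documents of D are generated independently; because S is deterministic, each document has a single run.

```latex
\[
\hat{p} \;=\; \arg\max_{p}\ \Pr(D),
\qquad
\Pr(D) \;=\; \prod_{d \in D} \Pr(d) \;=\; \prod_{t} p_t^{\,n_t(D)},
\qquad
\text{subject to } \sum_{t \in \mathrm{out}(q)} p_t = 1 \ \text{ for every state } q .
\]
\[
\text{Maximizing } \sum_{t} n_t(D)\,\log p_t \text{ under these constraints gives }
\hat{p}_t \;=\; \frac{n_t(D)}{\sum_{t' \in \mathrm{out}(\mathrm{src}(t))} n_{t'}(D)} ,
\]
```

that is, the relative frequency of each transition among the transitions leaving the same state.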
A Learning Algorithm (Without Constraints)
The frequency of using each transition during the corpus verification process is recorded.
[Figure: document with root r and children a, b, c, $; each of the transitions a, b, c, $ of the automaton over q0, q1, q2 has a recorded count of 1]
An Algorithm for Learning Probabilities (Cont.) (Without Constraints)
This is repeated for every node in every corpus document.
We set the probability of each transition to be its relative frequency.
[Figure: each of the four recorded counts is divided by 2]
• Theorem: This efficient algorithm learns the MLE probabilities, i.e., it finds an optimal probabilistic generator.
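The counting-and-normalizing procedure can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the corpus is represented as a list of child-label sequences (one per validated node), the transition table is one reading of the slide's automaton, and `learn_probabilities` is a hypothetical name.

```python
from collections import Counter, defaultdict

# Root automaton (L = a*bc*$), assumed from the slide.
DELTA = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}


def learn_probabilities(corpus, delta=DELTA, start="q0"):
    """Relative frequency of each transition among those leaving its state."""
    counts = Counter()
    for children in corpus:
        state = start
        for label in list(children) + ["$"]:
            key = (state, label)
            next_state = delta[key]   # KeyError if the schema rejects the node
            counts[key] += 1          # record one use of this transition
            state = next_state
    totals = defaultdict(int)
    for (state, _label), n in counts.items():
        totals[state] += n
    return {key: n / totals[key[0]] for key, n in counts.items()}


# One node with children a, b, c: every transition is used once,
# so each count is divided by 2 (the total for its source state).
print(learn_probabilities([["a", "b", "c"]]))
```

Transitions that never occur in the corpus get no entry here, which corresponds to probability 0.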
An Additional Result (Without Constraints)
Theorem: generation terminates with probability 1.
• Guaranteed only because of the choice of probabilities according to the corpus.
Integrity Constraints (Adding Constraints)
We want to support integrity constraints, which are used in XML schema languages.
• Key constraint: the a-labeled leaves have unique values (unary key)
• Inclusion constraint: the values of a-labeled leaves are contained in those of b-labeled leaves
• Domain constraint: the values of a-labeled leaves belong to some (finite or infinite) domain
Different types are considered in the literature [Fan & Libkin 2001; David, Libkin & Tan 2011]
New Problem (Adding Constraints)
• We want to find optimal generators for XML schemas with constraints.
• Valid generator output: an XML document, which
  • is accepted by the schema, and
  • has a valid leaf value assignment, i.e., one that does not violate the constraints
• Example: each of a, b, c is unique, and contained in the others
[Figure: example document trees rooted at r with leaves labeled a, b, c]
Restart Generators (Adding Constraints)
A simple idea:
• Use a probabilistic generator to generate a document
• Check if it has a value assignment valid w.r.t. the constraints
• If not, 'restart' and try again until a valid document is generated
Problem definition: same as in the case without constraints (but now the schema includes constraints)
Proposition: Given a document with no values, checking for the existence of a valid value assignment is in PTIME
• Proof: By translating the constraints to bounds on the number of unique values for each leaf label
Bad news: the number of restarts can be unboundedly large in an optimal generator
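The generate-check-restart loop can be sketched as below. Everything concrete here is an assumption made for illustration: the automaton and the 0.5 probabilities, the hypothetical names `generate_children`, `has_valid_assignment`, and `restart_generate`, and in particular the validity check, which stands in for the PTIME test by hard-coding a single bound (|c| ≤ |a|, borrowed from the continuation-test example in this talk). The restart budget is a safety net added here, not part of the talk.

```python
import random

# Root automaton (L = a*bc*$) with made-up probabilities.
DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "c"): "q1", ("q1", "$"): "q2"}
PROBS = {("q0", "a"): 0.5, ("q0", "b"): 0.5,
         ("q1", "c"): 0.5, ("q1", "$"): 0.5}


def generate_children(rng, delta=DELTA, probs=PROBS, start="q0", final="q2"):
    """Sample one child sequence by following transitions at random."""
    state, children = start, []
    while state != final:
        options = [(lbl, p) for (s, lbl), p in probs.items() if s == state]
        labels, weights = zip(*options)
        label = rng.choices(labels, weights=weights)[0]
        state = delta[(state, label)]
        if label != "$":
            children.append(label)
    return children


def has_valid_assignment(children):
    """Stand-in for the PTIME check: only the bound |c| <= |a| is tested."""
    return children.count("c") <= children.count("a")


def restart_generate(rng, max_restarts=10_000):
    """Generate, check, and restart until a valid document is produced."""
    for restarts in range(max_restarts + 1):
        children = generate_children(rng)
        if has_valid_assignment(children):
            return children, restarts
    raise RuntimeError("no valid document within the restart budget")


children, restarts = restart_generate(random.Random(0))
print(children, "after", restarts, "restart(s)")
```

The number of restarts depends on how much probability mass the underlying generator puts on invalid documents, which is the source of the "bad news" above.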
Continuation-test Generators (Adding Constraints)
Never make choices that lead to a 'dead end', and thus always generate a valid document.
We use a binary test to check whether a choice has a continuation.
Example: add to the schema of d0 the constraints:
• c is included in a
• c is unique
Together these imply |c| ≤ |a|.
The generation process: perform a continuation test before taking each transition.
[Figure: automaton with states q0, q1, q2 and transition probabilities pa, pb, pc, p$; generated document: root r with children a, b, c, $]
Pr(d) = pa∙pb∙pc∙1 (the final $ contributes a factor of 1, since a second c would violate |c| ≤ |a| and $ is the only remaining choice)
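The effect of the continuation test on the document probability can be sketched as follows. The general test is not implemented here; the function `available` hard-codes it for this one example (c is allowed only while |c| < |a|), and the automaton, the 0.5 probabilities, and all names are assumptions made for illustration. A forced transition, where only one choice has a continuation, contributes a factor of 1.

```python
# Root automaton (L = a*bc*$) with made-up probabilities.
DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "c"): "q1", ("q1", "$"): "q2"}
PROBS = {("q0", "a"): 0.5, ("q0", "b"): 0.5,
         ("q1", "c"): 0.5, ("q1", "$"): 0.5}


def available(state, prefix, delta=DELTA):
    """Labels at `state` that still have a valid continuation.

    Hard-coded for the example constraints: another c is possible
    only while the number of c-s is below the number of a-s.
    """
    labels = [lbl for (s, lbl) in delta if s == state]
    if prefix.count("c") >= prefix.count("a"):
        labels = [lbl for lbl in labels if lbl != "c"]
    return labels


def probability_with_tests(children, delta=DELTA, probs=PROBS, start="q0"):
    """Product over real choices only; forced transitions count as 1."""
    state, prob, prefix = start, 1.0, []
    for label in list(children) + ["$"]:
        options = available(state, prefix)
        if label not in options:      # a dead end is never generated
            return 0.0
        if len(options) > 1:          # a real (binary) choice was made
            prob *= probs[(state, label)]
        state = delta[(state, label)]
        if label != "$":
            prefix.append(label)
    return prob


# Children a, b, c, $: pa * pb * pc * 1
print(probability_with_tests(["a", "b", "c"]))
```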
Learning Algorithm for Continuation-test Generators (Adding Constraints)
The probabilities are again relative frequencies, but only in cases where there was an alternative choice.
[Figure: the counts of the two transitions from q0 are divided by 2; the counts of the two transitions from q1 are divided by 1]
• (q1, $) was chosen only when (q1, c) was not available.
The learned generator will generate as many c-s as a-s.
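A sketch of this counting rule is given below, assuming the same example (constraints implying |c| ≤ |a|) and a hard-coded stand-in for the continuation test. The automaton, the corpus representation, and the names `available` and `learn_with_tests` are hypothetical and introduced only for illustration.

```python
from collections import Counter, defaultdict

# Root automaton (L = a*bc*$), assumed from the slides.
DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "c"): "q1", ("q1", "$"): "q2"}


def available(state, prefix, delta=DELTA):
    """Stand-in continuation test: c is allowed only while |c| < |a|."""
    labels = [lbl for (s, lbl) in delta if s == state]
    if prefix.count("c") >= prefix.count("a"):
        labels = [lbl for lbl in labels if lbl != "c"]
    return labels


def learn_with_tests(corpus, delta=DELTA, start="q0"):
    """Relative frequencies over steps where an alternative was available."""
    counts = Counter()
    for children in corpus:
        state, prefix = start, []
        for label in list(children) + ["$"]:
            if len(available(state, prefix)) > 1:   # a real choice was made
                counts[(state, label)] += 1
            state = delta[(state, label)]
            if label != "$":
                prefix.append(label)
    totals = defaultdict(int)
    for (state, _label), n in counts.items():
        totals[state] += n
    # Report every transition whose source state saw at least one real choice.
    return {key: counts[key] / totals[key[0]]
            for key in delta if totals[key[0]] > 0}


# For children a, b, c: (q1, $) is taken only when c is unavailable,
# so it is not counted, and (q1, c) gets probability 1/1.
print(learn_with_tests([["a", "b", "c"]]))
```

With (q1, c) at probability 1, a generator using these values keeps producing c while the test allows it, which matches the remark that it generates as many c-s as a-s.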
Results for Continuation-test Generators (Adding Constraints)
Theorem: The algorithm learns an optimal continuation-test generator, for automata with binary choices.
• Extensions to non-binary choices are discussed in the paper
Theorem: The continuation test is NP-complete
• But only in the size of the schema; it is polynomial in the document size
• Both generation and finding the optimal generator are exponential in the schema size unless P=NP.
• Based on the schema satisfiability test [David et al. 2011]
Theorem: The probability of termination for a continuation-test generator may be arbitrarily small!
• Proof: by construction of a simple, non-recursive schema
• Can be handled by adding a constraint on the document size.
• Sub-classes of schemas that guarantee termination?
Adding Values to the Structure (Leaf Values)
So far our generators were used only for the document structure.
Leaf values may also have a distribution according to which they can be generated.
• The distribution may be learned from the same document collection
We will focus on the interesting case: generating leaf values for a schema with constraints.
Suggested Algorithm (Leaf Values)
[Figure: document with root r and children a, b, c, $; leaf values abcd, efg, abcd]
• We start with a valid document skeleton.
• Order the labels by inclusion constraints (e.g., c, b, a).
• Choose a leaf of the 'smallest' (most included) label, together with leaves of the labels that include it.
• Draw a value (from the domain) according to a given distribution.
• Use the PTIME test to verify validity; if the test fails, revert the step.
Improvements are presented in the paper.
Related Work (Summary)
• Schema satisfiability tests [Fan & Libkin 2001; David, Libkin & Tan 2011]
• Probabilistic XML and probabilistic schemas [e.g., Benedikt, Kharlamov, Olteanu & Senellart 2010]
• Probabilistic XML generation [e.g., Antonopoulos, Geerts, Martens & Neven 2011]
• Schema inference [e.g., Bex, Gelade, Neven & Vansummeren 2008]
• AXML [Abiteboul, Benjelloun & Milo 2008]
• PCFGs [e.g., Chi & Geman 1998]
Conclusion (Summary)
A model for probabilistic XML generators
Unconstrained case
• Generation and learning optimal generators can be done efficiently
• Termination is guaranteed
Constrained case
• Restart generators
  • The number of restarts is unbounded
• Continuation-test generators
  • Generation and learning optimal generators are expensive
  • Termination is not guaranteed
Leaf value generation
In the talk, labels and states are coupled (as in a DTD), but all the results hold when they are uncoupled.
Future work
• Efficient combinations of restart and continuation-test generators
• Experimental study
Thank You! Q&A