Information Preserving XML Schema Embedding
Philip Bohannon (Bell Laboratories), Wenfei Fan (Univ of Edinburgh & Bell Labs), Michael Flaster (Bell Laboratories), PPS Narayan (Bell Laboratories)
XML mapping
XML mapping σd: I(S1) → I(S2):
• Instance-level: from XML instances of a given source DTD schema S1 to XML trees of a predefined target DTD schema S2
• Information preserving (lossless)
• Applications: XML data exchange, migration, integration, P2P, …
[Figure: the XML mapping takes an XML tree T of S1 to an XML tree of S2]
Example: XML mapping – source DTD
Source schema S1:
db → class*
class → cno, title, type
type → regular + project
regular → prereq
prereq → class*
• DTD: (E, P, r). E: element types; r: root; P: element type definitions
  A ::= PCDATA | ε | B1, …, Bk | B1 + … + Bk | B*
• Graph representation:
  • concatenation production B1, …, Bk: AND edge (solid)
  • disjunction B1 + … + Bk: OR edge (dashed)
  • Kleene star B*: STAR edge (with edge label *)
[Figure: graph of S1 with nodes db, class, cno, title, type, regular, project, prereq]
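The DTD triple (E, P, r) and its graph view can be sketched as a small data structure. The dictionary layout, the name `S1`, and the helper `edges` are illustrative assumptions, not notation from the slides; the leaf types are assumed to hold text because the slide does not give their productions.

```python
# Hypothetical in-memory encoding of the source DTD S1 = (E, P, r).
# Each production is (kind, children), where kind is "AND" (concatenation),
# "OR" (disjunction), "STAR" (Kleene star) or "PCDATA" (text leaf).
S1 = {
    "root": "db",
    "productions": {
        "db": ("STAR", ["class"]),
        "class": ("AND", ["cno", "title", "type"]),
        "type": ("OR", ["regular", "project"]),
        "regular": ("AND", ["prereq"]),
        "prereq": ("STAR", ["class"]),
        # Assumption: the slide does not define these leaves; treat them as text.
        "cno": ("PCDATA", []),
        "title": ("PCDATA", []),
        "project": ("PCDATA", []),
    },
}


def edges(dtd):
    """Return the labelled edges (parent, child, kind) of the DTD graph."""
    return [
        (parent, child, kind)
        for parent, (kind, children) in dtd["productions"].items()
        for child in children
    ]
```

Calling `edges(S1)` lists one labelled edge per parent/child pair, which is the AND/OR/STAR graph the slide draws.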
Example: XML mapping – target DTD
Target schema S2:
[Figure: graph of S2 rooted at school, with element types courses, students, history, current, student, course, ssn, name, gpa, taking, category, basic, cno, credit, semester, mandatory, advanced, title, year, term, regular, lab, seminar, project, required, prereq; several edges are STAR edges]
Information preserving XML mapping
Objective: find an XML mapping σd: I(S1) → I(S2) such that
• Type safety: for any XML tree T of S1, σd(T) is an XML document that conforms to the predefined target schema S2
• Information preserving:
  • Invertibility: there exists an inverse σd⁻¹: I(S2) → I(S1) such that for any XML tree T of S1, T = σd⁻¹(σd(T)). The source T can be recovered from the target σd(T)
  • Query preservation w.r.t. a query language L: there is a query-rewriting function F: L → L such that for any Q in L and any T of S1, Q(T) = F(Q)(σd(T)). All queries in L on the source can be answered on the target
Challenge: different structures
S1 and S2 have vastly different structures: graph similarity (simulation) does not work here!
[Figure: source schema S1 (db, class, cno, title, type, regular, project, prereq) beside target schema S2 (school, courses, students, history, current, course, category, basic, cno, credit, semester, mandatory, advanced, title, year, term, project, regular, lab, seminar, required, prereq, gpa)]
Challenge: data integration
Multiple sources are to be mapped to a single target: the target schema must have a larger information capacity, so it cannot be similar to the sources
[Figure: source schemas S1 (db, class, cno, title, type, regular, project, prereq) and S1’ (db, student, ssn, name, taking, cno), both mapped into target schema S2 (school, courses, students, history, current, student, . . .)]
About query preservation: XML query languages
• Regular XPath:
  Q ::= ε | A | Q/text() | Q/Q | Q ∪ Q | Q* | Q[q]
  q ::= Q | Q/text() = ‘c’ | position() = k | q ∧ q | q ∨ q | not q
• An XPath fragment: Q//Q instead of Q*
Example: a regular XPath query over S1: find all prerequisites of CS331
Q1: class[cno/text() = ‘CS331’]/(type/regular/prereq/class)*
Query rewriting gives the corresponding query over S2:
Q2: courses/current/course[basic/cno/text() = ‘CS331’]/(category/mandatory/regular/required/prereq/course)*
[Figure: graph of S1 with nodes db, class, cno, title, type, regular, project, prereq]
Challenge: information preservation for XML
For relational data w.r.t. relational calculus (L), invertibility (calculus dominance) and query preservation (dominance) coincide [Hull 84]
Separation:
(a) There is an invertible XML mapping that is NOT query preserving w.r.t. XPath.
(b) There is an XML mapping that is query preserving w.r.t. XPath without position( ) but is NOT invertible.
Complexity: It is undecidable to determine, for an XML mapping defined in any language subsuming FO, whether it is (a) invertible, or (b) query preserving w.r.t. any query language with projection. Checking these properties is thus beyond reach for XML mappings defined in XQuery/XSLT
Other results:
• query preservation w.r.t. regular XPath: stronger than invertibility
• sufficient conditions under which the two coincide
Previous work
• XML mappings defined in XQuery/XSLT: no guidance on
  • type safety: for any XML tree T of S1, is σd(T) guaranteed to conform to the predefined (recursive) target schema S2?
  • how to ensure information preservation
• Schema mapping: to derive instance-level mapping
  • similarity flooding, Cupid, Clio, TransSCM…
  • cannot guarantee information preservation
• Information preservation in traditional data models: not directly applicable to XML mappings
No prior work has considered information-preserving XML mapping
Our approach
A systematic way to find XML mappings commonly used in practice
• find a schema mapping (embedding) σ: S1 → S2 with certain properties, if there is any
• derive an instance-level mapping σd: I(S1) → I(S2) from σ
• automatically guarantee information preservation
• accommodate integration (multiple sources)
Input:
• source DTD S1 = (E1, P1, r1), target DTD S2 = (E2, P2, r2);
• similarity matrix att( ) on element type names: att(A, B) in [0, 1] indicates how close A ∈ E1 is to B ∈ E2
Output: schema embedding σ = (λ( ), path( ))
Schema embedding σ = (λ( ), path( ))
• λ: E1 → E2, type mapping: λ(r1) = r2 and att(A, λ(A)) > 0
• path(A, B) maps an edge (A, B) in S1 to a unique path from λ(A) to λ(B) in S2: A1[position( ) = k1]/…/An[position( ) = kn]
• path type: an AND (OR, STAR) edge maps to an AND (OR, STAR) path (respectively: solid/star edges; solid + at least 1 dashed; solid edges + *)
• prefix-free: if P1(A) = A1, …, An, then path(A, Ai) is NOT a prefix of any path(A, Aj) for j ≠ i; similarly for P1(A) = A1 + … + An
These conditions concern information capacity and type safety (valid mapping)
Is there a schema embedding for the following?
[Figure: two pairs of schemas S1 and S2 over element types A, B, C]
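The prefix-free condition can be checked mechanically for the children of one source type. The sketch below is a minimal illustration, assuming paths are written as '/'-separated strings and that steps are compared as plain strings (so a position( ) qualifier, if present, is simply part of the step); the function names are hypothetical.

```python
def is_prefix(p, q):
    """True if the step list p is a prefix of the step list q."""
    return len(p) <= len(q) and q[:len(p)] == p


def prefix_free(paths):
    """Check the prefix-free condition for one source type A.

    `paths` maps each child B of A to path(A, B), written as a
    '/'-separated string. The condition fails if the path of one child
    is a prefix of (or equal to) the path of a different child.
    """
    split = {child: p.split("/") for child, p in paths.items()}
    return not any(
        is_prefix(split[a], split[b])
        for a in split
        for b in split
        if a != b
    )
```

For instance, `prefix_free({"B": "X/B", "C": "X/C"})` holds because the two paths only share the step `X`, while `prefix_free({"B": "B", "C": "B/C"})` fails because the first path is a prefix of the second.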
Example: Schema embedding
λ(A) = A, λ(B) = B, λ(C) = C
path(A, B) = A/B
path(A, C) = B/C
Unfolding: the prefix-free condition
Query translation: B/C
[Figure: schemas S1 and S2 over element types A, B, C]
Schema embedding is not a mild generalization of graph simulation
[Figure: schemas S1 and S2 over element types A and B, with labels 1 and 2. Schema embedding: NO. Graph simulation: YES]
Schema embedding: example
• λ(db) = school, λ(class) = course
  path(db, class) = courses/current/course
• mapping edge to path
• STAR edge to STAR path
• Graph similarity? NO
[Figure: the S1 edge db → class* beside the S2 fragment school, courses, students, history, current, course, student, ssn, name, gpa, taking]
Schema embedding: example
• λ(type) = category, λ(A) = A
  path(class, cno) = basic/cno
  path(class, title) = basic/semester/title
  path(class, type) = category
• AND (STAR) edges to AND (STAR) paths
• Relative path: relative to course
[Figure: the S1 production class → cno, title, type beside the S2 fragment course, category, basic, cno, credit, semester, title, year, term]
Schema embedding: example
• λ(X) = X
  path(type, regular) = mandatory/regular
  path(type, project) = advanced/project
• λ(X) = X
  path(regular, prereq) = required/prereq
  path(prereq, class) = course
• OR edges to OR paths
[Figure: the S1 fragment type, regular, project beside the S2 fragment category, mandatory, advanced, regular, lab, seminar, project; and the S1 fragment class, regular, prereq beside the S2 fragment course, regular, required, gpa, prereq]
Deriving instance-level mapping
Each schema embedding σ: S1 → S2 determines an XML mapping σd: I(S1) → I(S2) (relying on path types and the prefix-free condition)
Given an XML tree T1 of S1, σd(T1) constructs an instance T2 of S2 top-down by mapping A-elements of T1 to λ(A)-nodes in T2:
• the root of T2 is mapped from the root of T1;
• for each λ(A)-element in T2 mapped from an A-element of T1, generate path(A, B) in T2 for each B-child of the A-element;
• when all the elements in T2 mapped from nodes in T1 are fully expanded, add the necessary “default” elements to T2 so that T2 satisfies S2.
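The first two steps of this top-down construction can be sketched for the running example. This is a simplified illustration, not the authors' implementation: the `Node` class, `STAR_EDGES`, and the helper names are hypothetical; it assumes that the repeated (starred) element of a STAR path is its last step, which holds for the two STAR edges of the example; and it omits the final step that pads T2 with “default” elements, since that needs the full target DTD.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    text: str = ""


# Embedding for the running example; types not listed map to themselves.
LAMBDA = {"db": "school", "class": "course", "type": "category"}
PATH = {
    ("db", "class"): "courses/current/course",
    ("class", "cno"): "basic/cno",
    ("class", "title"): "basic/semester/title",
    ("class", "type"): "category",
    ("type", "regular"): "mandatory/regular",
    ("type", "project"): "advanced/project",
    ("regular", "prereq"): "required/prereq",
    ("prereq", "class"): "course",
}
STAR_EDGES = {("db", "class"), ("prereq", "class")}


def descend(node, steps, fresh_last):
    """Walk `steps` below `node`, reusing children that already exist.

    With `fresh_last`, the final step always creates a new element, so
    every child of a STAR edge gets its own element in the target.
    """
    for i, tag in enumerate(steps):
        is_last = i == len(steps) - 1
        found = None
        if not (is_last and fresh_last):
            found = next((c for c in node.children if c.tag == tag), None)
        if found is None:
            found = Node(tag)
            node.children.append(found)
        node = found
    return node


def embed(src, dst):
    """Expand dst (mapped from src) with path(A, B) for each B-child of src."""
    dst.text = src.text
    for child in src.children:
        edge = (src.tag, child.tag)
        embed(child, descend(dst, PATH[edge].split("/"), edge in STAR_EDGES))


def sigma_d(root):
    """Map a source tree to a target tree, without default-element padding."""
    out = Node(LAMBDA.get(root.tag, root.tag))
    embed(root, out)
    return out


def select(node, path):
    """Evaluate a plain child-step path and return the matching nodes."""
    nodes = [node]
    for tag in path.split("/"):
        nodes = [c for n in nodes for c in n.children if c.tag == tag]
    return nodes
```

Applied to a db tree with two class children, `sigma_d` yields a school tree with a single courses/current branch holding two course elements, each with its cno under basic. Sibling paths such as basic/cno and basic/semester/title share the `basic` element because `descend` reuses existing children.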
Properties of schema embedding
Theorem: The XML mapping σd: I(S1) → I(S2) derived from a schema embedding σ: S1 → S2 is
• well defined (type safety),
• invertible (with a quadratic-time inverse), and
• query preserving w.r.t. regular XPath (query rewriting: linear-time data complexity, quadratic-time combined complexity)
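To make query rewriting concrete, the sketch below rewrites a filter-free chain of child steps by replacing each source edge with its path( ) image, using the path values from the schema embedding example. It is one plausible reading for this simple case only: the slides do not spell out the rewriting algorithm for full regular XPath, and the function name `rewrite_steps` is hypothetical.

```python
# path(A, B) for the running example, copied from the example slides.
PATH = {
    ("db", "class"): "courses/current/course",
    ("class", "cno"): "basic/cno",
    ("class", "title"): "basic/semester/title",
    ("class", "type"): "category",
    ("type", "regular"): "mandatory/regular",
    ("type", "project"): "advanced/project",
    ("regular", "prereq"): "required/prereq",
    ("prereq", "class"): "course",
}


def rewrite_steps(context, steps):
    """Rewrite a chain of child steps that starts at source type `context`."""
    out = []
    for step in steps:
        out.append(PATH[(context, step)])  # image of the edge (context, step)
        context = step
    return "/".join(out)


# Reassemble the target query for "prerequisites of CS331" piece by piece;
# the filter and the Kleene star are put back by hand in this sketch.
q2 = (
    rewrite_steps("db", ["class"])
    + "[" + rewrite_steps("class", ["cno"]) + "/text() = 'CS331']/("
    + rewrite_steps("class", ["type", "regular", "prereq", "class"]) + ")*"
)
```

Here `q2` reproduces the target query courses/current/course[basic/cno/text() = 'CS331']/(category/mandatory/regular/required/prereq/course)* from the three rewritten fragments.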
Integration: multiple sources
For source S1’:
λ(db) = school, λ(X) = X
path(db, student) = students/student
path(taking, cno) = cno
Pairwise disjoint path mappings from S1, S1’ to S2
[Figure: source schemas S1 (db, class, cno, title, type, regular, project, prereq) and S1’ (db, student, ssn, name, taking, cno) mapped into target schema S2 (school, courses, students, history, current, student, ssn, name, gpa, taking, cno, . . .)]
Schema embedding vs. graph simulation
• Definition:
  • embedding: mapping edges to paths
  • simulation: mapping edges to edges
• Restructuring:
  • embedding: various DTD constructs, different structures
  • simulation: source and target schemas with similar structures
• Information preservation for XML mappings:
  • embedding: automatically guarantees both invertibility and query preservation w.r.t. regular XPath
  • simulation: no
• Data integration:
  • embedding: multiple source DTDs to a single target schema
  • simulation: no
A systematic method to define information-preserving XML mappings
Complexity: finding schema embedding
Input: two DTD schemas S1 and S2, and a similarity matrix att( )
Output: a schema embedding σ: S1 → S2 such that qual(σ, att) is maximal, if there is any
qual(σ, att) is the sum of att(A, λ(A)) for all A in S1
Theorem: It is NP-complete to determine whether or not there is a schema embedding from S1 to S2, even when S1 and S2 are nonrecursive and consist of concatenation types only.
Efficient algorithms are necessarily heuristic.
• Find a local embedding for each DTD production of S1
• Assemble local embeddings to make a schema embedding
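The quality measure is a plain sum and can be written down directly. In the sketch below the similarity values in `ATT` are made up for illustration, and the representation of att( ) as a dictionary keyed by (source type, target type) is an assumption.

```python
def qual(lam, att):
    """qual(σ, att): sum of att(A, λ(A)) over all source types A."""
    return sum(att.get((a, b), 0.0) for a, b in lam.items())


def admissible(lam, att, r1, r2):
    """Check the type-mapping conditions λ(r1) = r2 and att(A, λ(A)) > 0."""
    return lam.get(r1) == r2 and all(
        att.get((a, b), 0.0) > 0 for a, b in lam.items()
    )


# Hypothetical similarity scores; the slides do not give concrete values.
ATT = {("db", "school"): 0.5, ("class", "course"): 0.75, ("cno", "cno"): 1.0}
LAM = {"db": "school", "class": "course", "cno": "cno"}
```

With these made-up scores, `qual(LAM, ATT)` adds 0.5, 0.75 and 1.0, and `admissible(LAM, ATT, "db", "school")` holds because the root maps to the root and every chosen pair has a positive score.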
Computing local embedding – fixed type mapping
Input: a production A → P(A) in source DTD S1, target schema S2
Output: σ0 = (λ0, path0), a partial embedding from P(A) to S2
Example: find λ0( ) from types in P(A) to types of S2, and path0( )
[Figure: the S1 production type → regular + project beside the S2 fragment category, mandatory, advanced, regular, lab, seminar, project]
• If λ0 is given: an O(|P(A)| |S2|) algorithm findPath finds a local embedding (depth-first search, checking each S2 subtree only once)
• When λ0 is not fixed, the local embedding problem is NP-hard
• Heuristic: randomized findPath to find both λ0 and path0 (randomly pick a possible type-node match in the search)
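The core of the fixed-λ0 case is a depth-first search in the target schema graph. The sketch below shows only that search for a single pair of target types; it is not the authors' findPath, which must also enforce path types and the prefix-free condition. The graph `S2_FRAGMENT` is an assumption read off the flattened figure (in particular, the placement of lab and seminar is a guess), and `find_path` is a hypothetical name.

```python
# Assumed fragment of the target schema S2 as an adjacency list.
S2_FRAGMENT = {
    "category": ["mandatory", "advanced"],
    "mandatory": ["regular", "lab"],   # placement of lab is a guess
    "advanced": ["seminar", "project"],  # placement of seminar is a guess
}


def find_path(graph, start, goal):
    """Depth-first search for a nonempty path from start to goal.

    Returns the list of steps below `start`, or None if goal is
    unreachable. Each node is expanded at most once.
    """
    stack = [(start, [])]
    seen = {start}
    while stack:
        node, steps = stack.pop()
        for child in graph.get(node, []):
            if child == goal:
                return steps + [child]
            if child not in seen:
                seen.add(child)
                stack.append((child, steps + [child]))
    return None
```

With λ0(type) = category and λ0(regular) = regular, `find_path(S2_FRAGMENT, "category", "regular")` recovers the steps of path(type, regular) = mandatory/regular from the example.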
Assembling local embeddings
Input: C(A), a set of local embeddings for each A in the source DTD (initialized via randomized findPath); a target schema S2
Output: σ = (λ, path), a schema embedding from S1 to S2, if any
Theorem: The assemble-embedding problem is NP-complete even when S1 and S2 are nonrecursive.
Conflicts: type mapping, prefix-free
Three heuristic algorithms:
(1) Fix an order O on S1 types via qual( ), pick a local embedding σA from C(A) in O, and increment σ with σA if there is no conflict
(2) Assume a random order O on S1 types, then do the same as (1)
(3) Reduce to the MAX-Weight-Independence-Set problem, leveraging an existing tool for that problem.
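The ordered heuristics (1) and (2) share one greedy loop, sketched below under simplifying assumptions: each local embedding is reduced to its partial type mapping λ0 (a dictionary), only type-mapping conflicts are checked, and prefix-free conflicts are ignored. The function name `assemble` and the candidate data are hypothetical.

```python
def assemble(candidates, order):
    """Greedy assembly of local embeddings.

    `candidates[A]` is a list of partial type mappings (dicts) for the
    production of source type A. Types are visited in `order`; for each
    one, the first candidate that agrees with the mapping built so far
    is merged in. Returns the combined type mapping, or None if some
    type has no conflict-free candidate.
    """
    lam = {}
    for a in order:
        for local in candidates[a]:
            # No conflict: every type is either new or mapped identically.
            if all(lam.get(t, img) == img for t, img in local.items()):
                lam.update(local)
                break
        else:
            return None
    return lam


# Made-up candidate sets for two productions of the running example.
CANDIDATES = {
    "db": [{"db": "school", "class": "course"}],
    "class": [
        {"class": "student", "cno": "ssn"},  # conflicts on class
        {"class": "course", "cno": "cno", "type": "category"},
    ],
}
```

Running `assemble(CANDIDATES, ["db", "class"])` skips the first candidate for class because it maps class to student after db has already fixed class to course, and merges the second one instead. Heuristic (2) would call the same loop with a shuffled `order`.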
Experimental evaluation
• benchmark
  • XMark (99 type nodes in its original form)
  • Real-life DTDs: SIGMOD (13), PSD (121), mondial (70), etc.
  • Generating target schemas by adding noise: changing edges to paths, mutating names, inserting new subtrees
  • selectivity/accuracy of att( ): [0, 1] (1.0: exact match)
  • Target schemas with 75% noise: XMark (581-748), SIGMOD (54-96), PSD (712-820), mondial (395-496)
• system
  • 933 MHz/1.0 GHz Pentium III, 256 MB memory
  • QUALEX: a tool for MAX-Weight-Independence-Set
  • Algorithms implemented in Java
Experimental result – target size
XMark (acc 0.75). RandomOrder and MAXSet-Reduction perform well
Experimental result – running time required
XMark (acc 0.75). Running times are in seconds for schemas with hundreds of nodes
Experimental result – different source schemas
Various source schemas (acc 0.75). RandomOrder finds solutions more than 90% of the time, within seconds
Summary
• Information preservation: the first study for XML mappings
  • more intriguing than its relational counterpart: separation, equivalence, complexity of invertibility and query preservation
  • important for data exchange, migration, integration, P2P, …
• Schema embedding:
  • mapping edges to paths
  • captures various DTD constructs, supports restructuring
  • automatically guarantees information preservation
  • accommodates multiple sources to a single target
  • NP-complete, but with efficient and effective heuristics
A practical solution for finding information-preserving XML mappings