Inferring XML Schema Definitions from XML Data

Geert Jan Bex, Frank Neven, Stijn Vansummeren (Hasselt Univ. and Transnational Univ. of Limburg, Belgium), VLDB 2007. 2008. 02. 15. Summarized and presented by Chulki Lee, IDS Lab., Seoul National University

Inferring XML Schema • Why schemas? • automation & optimization of search • integration of XML data sources • … • Why infer schemas? • 50% of XML documents on the web have no schema • 33% of the schemas that exist are not valid • Why infer XSDs (XML Schema Definitions)? • DTDs (Document Type Definitions) have limitations • an element's type depends only on the element's name (the path to the element is not considered)

Example: DTD vs. XSD name type

Theorem • XSDs in general cannot be learned from an XML corpus using positive data only • In an XSD, the content model of an element is uniquely determined by the path from the root to that element

Observation: local context • An XSD is k-local if • its content models depend only on labels up to the k-th ancestor • For 98% of XSDs, k = 2
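To make the idea of local context concrete, here is a minimal sketch that groups the child-name sequences found in a document by context. It assumes the context is the last k labels on the root path, counting the element's own label; whether the element itself counts toward k is an assumption of this sketch, and the function name `child_strings_by_context` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def child_strings_by_context(xml_text, k):
    """Group child-name sequences by the last k labels on the root path.

    Assumption: the element's own label is part of the context.
    """
    root = ET.fromstring(xml_text)
    samples = defaultdict(list)

    def visit(elem, path):
        path = path + (elem.tag,)
        context = path[-k:]  # keep only the k most recent labels
        samples[context].append(tuple(child.tag for child in elem))
        for child in elem:
            visit(child, path)

    visit(root, ())
    return dict(samples)


doc = "<store><book><title/><author/></book><dvd><title/></dvd></store>"
by_name = child_strings_by_context(doc, 1)    # DTD-like: name only
by_parent = child_strings_by_context(doc, 2)  # name plus parent label
```

With k = 1 both `title` elements fall into one context, as in a DTD; with k = 2 the `title` under `book` and the `title` under `dvd` get separate sample sets and could therefore receive different content models.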

Observation: SORE (no duplicated element names) • Single Occurrence Regular Expression (SORE): every element name occurs at most once • What's a SORE: title, (author, affiliation?)+, abstract • What's not a SORE: title, ((author, affiliation)+ + (editor, affiliation)+), abstract (affiliation occurs twice) • 99% of regular expressions are single occurrence
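The single occurrence condition is easy to check mechanically. The sketch below simply counts element names in the textual form of a content model; the helper name `is_single_occurrence` and the name pattern used to pick out element names are assumptions made for illustration.

```python
import re
from collections import Counter


def is_single_occurrence(expr):
    """Return True if every element name appears at most once in expr."""
    # Operators such as ',', '+', '?', '(' and ')' are skipped by the pattern.
    names = re.findall(r"[A-Za-z_][\w.-]*", expr)
    return all(count == 1 for count in Counter(names).values())


sore = "title, (author, affiliation?)+, abstract"
not_sore = "title, ((author, affiliation)+ + (editor, affiliation)+), abstract"
```

Here `is_single_occurrence(sore)` holds, while `is_single_occurrence(not_sore)` does not, because `affiliation` is named twice.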

Proposed Algorithms • SOA: Single Occurrence Automaton • Theorem • XSDs with local context and SORE content models are learnable from positive examples only (given a 'sufficiently large' corpus) • iLocal = iSOA + ToSORE + MINIMIZE • infers a k-local, single occurrence target XSD • iXSD = iLocal & REDUCE • REDUCE: unify sufficiently similar types

Algorithm: iLocal (1/4)

Algorithm: iLocal (3/4) • iSOA: make an SOA from strings • ToSORE: translate SOA → SORE
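As a rough illustration of the iSOA step, the sketch below builds an automaton with one state per element name and one edge for every pair of names seen next to each other in the sample strings. The slide only says that iSOA makes an SOA from strings, so this adjacent-pair construction, the marker states `SRC` and `SINK`, and the function names are assumptions of the sketch. ToSORE and MINIMIZE are not shown.

```python
SRC, SINK = "<src>", "<sink>"  # hypothetical start and end markers


def infer_soa(strings):
    """Collect the edges of a single occurrence automaton from sample strings."""
    edges = set()
    for s in strings:
        path = [SRC, *s, SINK]
        edges.update(zip(path, path[1:]))  # every adjacent pair becomes an edge
    return edges


def accepts(edges, s):
    """A string is accepted if every adjacent pair in it is an edge."""
    path = [SRC, *s, SINK]
    return all(pair in edges for pair in zip(path, path[1:]))


sample = [
    ("title", "author", "abstract"),
    ("title", "author", "affiliation", "author", "abstract"),
]
soa = infer_soa(sample)
```

The automaton generalizes beyond the sample: a string with three authors, each but the last followed by an affiliation, is accepted even though it never occurred, while a string that skips `author` is rejected.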

Algorithm: iXSD • With incomplete data • iLocal derives too many types • REDUCE: practical heuristics • define a distance between types • for types s and t • if distance(s, t) < ε then unify s and t
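The unification rule can be sketched as a simple merge loop. The slide does not say how the distance between types is defined, so the Jaccard distance over automaton edge sets used here is purely an assumption for illustration, as are the names `jaccard_distance` and `reduce_types` and the choice to merge two types by taking the union of their edges.

```python
def jaccard_distance(a, b):
    """Assumed distance: 1 - |a ∩ b| / |a ∪ b| over two edge sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def reduce_types(types, eps, distance=jaccard_distance):
    """Repeatedly unify any two types whose distance is below eps.

    types maps a type name to a set of edges; the result is a list of
    (names, edges) groups.
    """
    groups = [({name}, set(edges)) for name, edges in types.items()]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if distance(groups[i][1], groups[j][1]) < eps:
                    # Unify: merge names and edges, then rescan from the start.
                    groups[i] = (groups[i][0] | groups[j][0],
                                 groups[i][1] | groups[j][1])
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups


types = {
    "t1": {("a", "b"), ("b", "c"), ("c", "d")},
    "t2": {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")},
    "t3": {("x", "y")},
}
loose = reduce_types(types, eps=0.3)
strict = reduce_types(types, eps=0.1)
```

In this toy input `t1` and `t2` are at distance 0.25, so they are unified when ε = 0.3 and kept apart when ε = 0.1; `t3` shares no edges with either and always stays separate.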

Experiments • 8 schemas & 200 generated documents for each schema • schemas: 12 to 23 types with unbounded depth and width • local with k = 2, 3 • types of iXSD imprecisions: • the content models of the target and inferred types can differ • inherent to learning from positive examples, can't be avoided • a type in the target XSD can correspond to multiple types in the inferred XSD: false positives • a type in the inferred XSD can correspond to multiple types in the target XSD: false negatives • a type in the target XSD is not derived • caused by an incomplete corpus, can't be avoided

Experiments • k = 3, parsing 697 XSDs (40Mb), Pentium M 1.73 GHz → 17 seconds • k = 2, without REDUCE → 29 false positives • shows the power of REDUCE • Sensitivity to parameters • context size k ↑ ⇒ false positives ↑, false negatives ↓ • ε ↑ ⇒ false positives ↓, false negatives ↑

Experiments • iXSD derives good XSDs from small training sets (50 documents or more)

Conclusions • Two algorithms proposed • iLocal: sound & k-complete • iXSD: deals with poor data • good performance on real-world data • good runtime performance • Future work • determine the best locality k
