Inference of Concise DTDs from XML data
Geert Jan Bex (1), Frank Neven (1), Thomas Schwentick (2), Karl Tuyls (3)
(1) Hasselt University and Transnational University of Limburg
(2) Dortmund University
(3) Maastricht University and Transnational University of Limburg
Outline
• Goals & motivation
• Problem setting
• iDTD: Sample → SOA → SORE
• CRX: Sample → CHARE
• Experiments
• Extensions
• Conclusions
Aims & requirements
• Problem: infer a DTD from an XML corpus (XML → DTD)
• Requirements:
  • Concise: humans can interpret/validate
  • Works on large data sets
  • Works on small data sets
  • Robust to noise
Why DTD inference?
• Schema inference
  • ≈ 50% of XML documents: no schema [Barbosa et al. 2005]
  • ≈ 66% of DTDs and XSDs: not valid [Bex et al. 2005]
• Improving existing schemas
• “Noisy” XML documents
  • ≈ 90% of XHTML docs: not valid
• Related work
  • Fails on real-world, large data sets
  • Results not concise
Why schemas?
• Validation: efficiency, security
• Optimization: search, processing
• Static analysis, type checking (e.g., XQuery)
• Software development: modeling, OR-mapping
• Integration: (meta-)data sources
• Schema matching
• Semantics
Learning a regular expression from a set of strings
• [Figure: XML documents with book elements whose children are, e.g., title editor year isbn and title author author year]
• Content model to learn for book: title (author+ + editor+) year isbn?
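As a rough illustration of how an XML corpus turns into a set of strings, the sketch below collects, for every element name, the sequences of child element names that occur in a document. The helper name `child_sequences` and the toy `library` document are made up for this example and are not taken from the slides; only Python's standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def child_sequences(xml_text):
    """Map each element name to the set of child-name sequences observed for it."""
    sample = defaultdict(set)
    root = ET.fromstring(xml_text)
    for elem in root.iter():                  # the root and all of its descendants
        sample[elem.tag].add(tuple(child.tag for child in elem))
    return dict(sample)

# Hypothetical toy corpus, loosely modeled on the book example on the slide.
DOC = """
<library>
  <book><title/><editor/><year/><isbn/></book>
  <book><title/><author/><author/><year/></book>
</library>
"""

for name, sequences in sorted(child_sequences(DOC).items()):
    print(name, sorted(sequences))
```

Each set of tuples is then the sample of strings from which a regular expression for that element would have to be learned.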
Learning automata?
• Well studied, but…
• Learning automata ≠ learning regular expressions
• [Figure: an automaton next to the regular expression ((b?(a+c))+ d)+ e]
Learning regular languages?
• S = {abbb, abbd, acd, ac}
• Most specific regex for S: abbb + abbd + acd + ac
• Most general regex for S: (a + b + c + d)*
• Somewhere in between: a (b* + c) d? ???
• Positive examples only!
• Generalization vs. specificity
• Impossible… in general
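The tension between specificity and generalization can be checked directly. The sketch below translates the three expressions from the slide into Python `re` syntax (the slide's `+` for union becomes `|`) and confirms that all of them are consistent with S, while they disagree on strings outside S. The probe words `abd` and `dcba` are chosen for this example only.

```python
import re

S = ["abbb", "abbd", "acd", "ac"]

# Slide notation uses '+' for union; Python's re module uses '|'.
CANDIDATES = {
    "most specific": r"abbb|abbd|acd|ac",
    "in between":    r"a(b*|c)d?",
    "most general":  r"[abcd]*",
}

def accepts(pattern, word):
    """True if the whole word belongs to the language of the pattern."""
    return re.fullmatch(pattern, word) is not None

for label, pattern in CANDIDATES.items():
    consistent = all(accepts(pattern, w) for w in S)
    probes = {w: accepts(pattern, w) for w in ("abd", "dcba")}
    print(f"{label:14} consistent with S: {consistent}  probes: {probes}")
```

All three candidates fit the positive examples, so the sample alone cannot tell a learner which one was intended.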
Subclasses
• Single Occurrence Regular Expressions (SOREs)
  • 99% of regular expressions in DTDs/XSDs
  • Infer with iDTD
• CHAin Regular Expressions (CHAREs)
  • 90% of regular expressions in DTDs/XSDs
  • Infer with CRX
SOREs
• What’s a SORE
  • header . protein . organism . reference* . comment* . genetics* . complex* . function* . classification? . keywords? . feature* . summary . sequence
  • authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs?
  • title . (author . affiliation?)+ . abstract
• … and what’s not
  • title . ((author . affiliation)+ + (editor . affiliation)+) . abstract (duplicate element names)
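The single-occurrence condition is easy to test mechanically: every element name may appear at most once in the expression. The sketch below is a minimal check in that spirit; the function names are invented for this example, and it assumes element names consist of letters, digits, underscores and hyphens.

```python
import re
from collections import Counter

def repeated_names(expr):
    """Return the element names that occur more than once in a content model."""
    names = re.findall(r"[A-Za-z_][\w-]*", expr)
    return sorted(name for name, count in Counter(names).items() if count > 1)

def is_single_occurrence(expr):
    """True if no element name is repeated (the SORE condition from the slide)."""
    return not repeated_names(expr)

sore = "title . (author . affiliation?)+ . abstract"
not_sore = "title . ((author . affiliation)+ + (editor . affiliation)+) . abstract"

print(is_single_occurrence(sore), repeated_names(sore))
print(is_single_occurrence(not_sore), repeated_names(not_sore))
```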
Sample → SOA
• W = {bacacdacde, cbacdbacde, abccaadcde}
• 2T-INF [Garcia & Vidal 1990] turns the sample into a Single Occurrence Automaton (SOA)
• [Figure: SOA over the symbols a, b, c, d, e]
Sample → SOA
• SOA size
  • |Σ| + 2 states
  • O(|Σ|²) transitions
• Complexity of algorithm
  • O(||W||)
  • streaming
• Algorithm sound
  • W ⊆ L(SOA)
  • in general: |S| ≪ |L(SOA)|
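A minimal sketch of the Sample → SOA step, assuming the SOA is represented simply as the set of edges between consecutive symbols plus edges from a source state and to a sink state. The state names `<src>` and `<sink>` and the function names are made up for this example.

```python
SRC, SINK = "<src>", "<sink>"   # hypothetical names for the two extra states

def two_t_inf(sample):
    """Collect the SOA edge set in a single pass over the positive examples."""
    edges = set()
    for word in sample:
        path = [SRC, *word, SINK]
        edges.update(zip(path, path[1:]))   # every pair of consecutive states
    return edges

def accepts(edges, word):
    """A word is accepted if every step of its path is an edge of the SOA."""
    path = [SRC, *word, SINK]
    return all(step in edges for step in zip(path, path[1:]))

W = ["bacacdacde", "cbacdbacde", "abccaadcde"]
soa = two_t_inf(W)
states = {state for edge in soa for state in edge}

print(len(states), "states,", len(soa), "edges")
print(all(accepts(soa, w) for w in W))   # soundness: W is contained in L(SOA)
print(accepts(soa, "ade"))               # a string outside W that the SOA also accepts
```

Each string is read once and nothing but the edge set is kept, which is what makes the construction linear in the size of the sample and suitable for streaming.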
SOA → SORE: REWRITE
• [Figure: the SOA is rewritten step by step until a single expression remains]
• optional: b gives b?
• disjunction: a, c gives a+c
• concatenation: b?, a+c gives b? (a+c)
• self-loop: b? (a+c) gives (b? (a+c))+
• Final SORE: ((b? (a+c))+ d)+ e
REWRITE: properties
• Theorem
  • REWRITE transforms SOA into equivalent SORE for sufficient data, reports failure otherwise (sound & complete)
• Complexity: O(|Σ|⁴)
• SORE size
  • |Σ| symbols
  • O(|Σ|) operators
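The equivalence claim can be sanity-checked by brute force for the running example. The sketch below does not implement REWRITE; it only compares the SOA built from W = {bacacdacde, cbacdbacde, abccaadcde} with the SORE ((b? (a+c))+ d)+ e on every string over {a, …, e} up to a small length. The edge-set representation of the SOA, the length bound and all function names are assumptions of this example.

```python
import re
from itertools import product

W = ["bacacdacde", "cbacdbacde", "abccaadcde"]
SORE = re.compile(r"((b?(a|c))+d)+e")   # slide notation '+' (union) written as '|'

def soa_edges(sample):
    """Edge set of the SOA: consecutive symbols plus source and sink edges."""
    edges = set()
    for word in sample:
        path = ["<src>", *word, "<sink>"]
        edges.update(zip(path, path[1:]))
    return edges

def soa_accepts(edges, word):
    path = ["<src>", *word, "<sink>"]
    return all(step in edges for step in zip(path, path[1:]))

def disagreements(edges, pattern, alphabet="abcde", max_len=6):
    """Strings up to max_len on which the SOA and the regex give different answers."""
    bad = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if soa_accepts(edges, word) != (pattern.fullmatch(word) is not None):
                bad.append(word)
    return bad

print(disagreements(soa_edges(W), SORE))
```

An empty list is evidence for equivalence on short strings only, not a proof.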
REWRITE + repairs = iDTD
• W = {bacacdacde, cbacdbacde}
• REWRITE on the resulting SOA: no rules apply!
• a and c are almost a disjunction
• Fix with repair rules: enable-disjunction, enable-optional
• [Figure: SOA before and after repair, with the SORE ((b? (a+c))+ d)+ e]
iDTD: properties
• Theorem
  • iDTD transforms SOA into SORE such that L(SOA) ⊆ L(SORE)
• iDTD can be parameterized for performance
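To see what the repairs have to make up for, one can compare the SOA obtained from the two-string sample with the SOA obtained from the three-string sample used on the earlier Sample → SOA slide. The sketch below lists the edges that the smaller sample never exhibits; it does not implement enable-disjunction or enable-optional, and the edge-set representation and names are assumptions of this example.

```python
def soa_edges(sample):
    """Edge set of the SOA: consecutive symbols plus source and sink edges."""
    edges = set()
    for word in sample:
        path = ["<src>", *word, "<sink>"]
        edges.update(zip(path, path[1:]))
    return edges

small = soa_edges(["bacacdacde", "cbacdbacde"])
full = soa_edges(["bacacdacde", "cbacdbacde", "abccaadcde"])

missing = sorted(full - small)
print(small < full)   # the two-string SOA has strictly fewer edges
print(missing)        # edges never witnessed by the smaller sample
```

Because a word is accepted exactly when all of its steps are edges, adding edges can only enlarge the language. That is the sense in which a repaired automaton still contains L(SOA).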
CHAREs
• CRX (Chain Regular expression eXtraction) derives CHAin Regular Expressions
• Definition: a chain regular expression is a sequence of factors f1, …, fn such that no alphabet symbol occurs more than once and each factor is one of
  • (a1 + … + ak)
  • (a1 + … + ak)?
  • (a1 + … + ak)+
  • (a1 + … + ak)*
CHAREs
• What’s a chain
  • header . protein . organism . reference* . comment* . genetics* . complex* . function* . classification? . keywords? . feature* . summary . sequence
  • authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs?
• … and what’s not
  • title . (author . affiliation?)+ . abstract (not a factor)
  • title . ((author . affiliation)+ + (editor . affiliation)+) . abstract (duplicate element names)
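The definition of a chain translates into a small syntactic check: split the expression on top-level dots, require every piece to be a disjunction of names with an optional ?, + or *, and require all names to be distinct. The sketch below follows the slide notation; the function names and the assumption that names use only letters, digits, underscores and hyphens are specific to this example.

```python
import re

# One factor: a bare name or a parenthesized disjunction of names, then an optional qualifier.
FACTOR = re.compile(r"^(?:[\w-]+|\(\s*[\w-]+(?:\s*\+\s*[\w-]+)*\s*\))[?+*]?$")

def top_level_factors(expr):
    """Split a content model on '.' characters that are outside parentheses."""
    factors, current, depth = [], [], 0
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "." and depth == 0:
            factors.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    factors.append("".join(current).strip())
    return factors

def is_chare(expr):
    """True if every top-level factor is well formed and no name is repeated."""
    names = re.findall(r"[\w-]+", expr)
    well_formed = all(FACTOR.match(f) is not None for f in top_level_factors(expr))
    return well_formed and len(names) == len(set(names))

print(is_chare("authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs?"))
print(is_chare("title . (author . affiliation?)+ . abstract"))
print(is_chare("title . ((author . affiliation)+ + (editor . affiliation)+) . abstract"))
```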
CRX run: pre-order relation
• Sample W = {abccde, cccad, bfeg, bfhi}
• [Figure: graph of the pre-order relation →W over the symbols a, b, c, d, e, f, g, h, i]
CRX run: transitive closure
• Sample W = {abccde, cccad, bfeg, bfhi}
• a →W b and b →W c then a →W c
• [Figure: graph of →W over the symbols a, b, c, d, e, f, g, h, i]
CRX run: transitive closure
• Sample W = {abccde, cccad, bfeg, bfhi}
• a →W b and b →W a then a ≈W b
• Equivalence class: {a, b, c}
• Symbol occurs in exactly one equivalence class
• [Figure: a, b and c collapsed into one node; d, e, f, g, h, i remain separate nodes]
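The first steps of the run can be sketched in a few lines. The code below assumes that the sample on the slide splits into the four strings abccde, cccad, bfeg and bfhi, and that the relation starts from pairs of symbols that occur directly after one another in some string. The function names and the cubic-time closure are choices made for this example.

```python
def direct_pairs(sample):
    """Pairs (x, y) such that y directly follows x in some sample string."""
    return {(x, y) for word in sample for x, y in zip(word, word[1:])}

def transitive_closure(pairs, symbols):
    """Warshall-style closure: cubic in the number of symbols."""
    reach = set(pairs)
    for k in symbols:
        for i in symbols:
            for j in symbols:
                if (i, k) in reach and (k, j) in reach:
                    reach.add((i, j))
    return reach

def equivalence_classes(sample):
    """Group symbols that can reach each other in the closed relation."""
    symbols = sorted(set("".join(sample)))
    reach = transitive_closure(direct_pairs(sample), symbols)
    classes = []
    for s in symbols:
        cls = frozenset({s} | {t for t in symbols if (s, t) in reach and (t, s) in reach})
        if cls not in classes:
            classes.append(cls)
    return classes

W = ["abccde", "cccad", "bfeg", "bfhi"]   # assumed split of the sample on the slide
for cls in equivalence_classes(W):
    print(sorted(cls))
```

On this sample only a, b and c lie on a cycle, so they form one class and every other symbol is a class of its own.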
CRX run: folding partial order
• Sample W = {abccde, cccad, bfeg, bfhi}
• Predecessor set: pred(γ) = {γ′ | γ′ →W γ}
• Successor set: succ(γ) = {γ′ | γ →W γ′}
• [Figure: nodes {a,b,c}, d, e, f, g, h, i with predecessor and successor sets marked]
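CRX run: folding partial order
• Sample W = {abccde, cccad, bfeg, bfhi}
• Predecessor set: pred(γ) = {γ′ | γ′ →W γ}
• Successor set: succ(γ) = {γ′ | γ →W γ′}
• →W: partial order on the folded nodes
• [Figure: d and f folded into one node {d,f}; remaining nodes {a,b,c}, e, g, h, i]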
CRX run: multiplicity & RE
• Sample W = {abccde, cccad, bfeg, bfhi}
• Topological sort of the nodes: {a,b,c}, {d,f}, e, g, h, i
• Multiplicities: + for {a,b,c}, none for {d,f}, ? for e, g, h and i
• Chain Regular Expression: (a + b + c)+ (d + f) e? g? h? i?
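The last step can be sketched as follows, taking the ordered factors from the slide as given rather than computing the folding and the topological sort. The counting rule used here is an assumption of this example: for each factor, count how many of its symbols occur in each sample string, then pick no qualifier, ?, + or * from the minimum and maximum. The split of the sample into four strings is assumed as well.

```python
def qualifier(counts):
    """Choose '', '?', '+' or '*' from per-string occurrence counts of a factor."""
    lo, hi = min(counts), max(counts)
    if lo >= 1:
        return "" if hi == 1 else "+"
    return "?" if hi <= 1 else "*"

def chare(sample, factors):
    """Assemble a chain regular expression for already ordered factors."""
    parts = []
    for factor in factors:
        counts = [sum(word.count(symbol) for symbol in factor) for word in sample]
        body = " + ".join(factor)
        if len(factor) > 1:
            body = f"({body})"
        parts.append(body + qualifier(counts))
    return " ".join(parts)

W = ["abccde", "cccad", "bfeg", "bfhi"]          # assumed split of the sample on the slide
FACTORS = [("a", "b", "c"), ("d", "f"), ("e",), ("g",), ("h",), ("i",)]   # order from the slide

print(chare(W, FACTORS))   # (a + b + c)+ (d + f) e? g? h? i?
```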
CRX algorithm: properties
• Optimality: if →W is linearly ordered, then for every CHARE r with W ⊆ L(r) and L(r) ⊆ L(r_W): r_W = r
• Performance: O(||W|| + |Σ|³)
• Training set size: any CHARE r can be learned from {w | w ∈ L(r) ∧ ∀w′ ∈ L(r): |w| ≤ |w′| + 2}
Related work
• XTRACT [Garofalakis et al. 2000]
  • Pioneer
  • More general than iDTD
  • Focuses on regular expressions that don’t occur in real DTDs → no concise schemas
• Trang: roughly equivalent to CRX
  • Inconsistent results
Data
• Real-world regular expressions
  • SOREs
  • Non-SOREs
• Real-world data when available
• Synthetic data otherwise
Experiments: generalization
• [Figure: plot with series CRX, iDTD and no repairs]
Experiments: generalization
• [Figure: plot with series CRX and iDTD]
Extensions
• Incremental computation
  • New data → update internal representation (SOA or partial order)
• Noise
  • Support for element name too small → ignore element
  • SOA: support for edges too small → delete edges before repair
• Numerical predicates
  • Bookkeeping: minOccurs, maxOccurs
• Generating XSDs
  • Infer data types (integer, double, date, …)
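Incremental computation and noise handling fit together naturally if the SOA keeps a support count per edge. The sketch below is one possible reading, not the authors' implementation: the class name `IncrementalSOA`, the definition of support as the number of sample strings containing an edge, and the threshold value are all assumptions of this example.

```python
from collections import Counter

class IncrementalSOA:
    """SOA edge set with per-edge support, updated one string at a time."""

    def __init__(self):
        self.support = Counter()   # edge -> number of strings that contain it

    def add(self, word):
        path = ["<src>", *word, "<sink>"]
        self.support.update(set(zip(path, path[1:])))   # count each edge once per string

    def edges(self, min_support=1):
        """Edges whose support reaches the threshold; the rest are treated as noise."""
        return {edge for edge, n in self.support.items() if n >= min_support}

soa = IncrementalSOA()
for word in ["bacacdacde", "cbacdbacde", "abccaadcde"]:
    soa.add(word)
soa.add("bacdde")   # made-up noisy string that introduces the edge d -> d

print(("d", "d") in soa.edges())                # present without a threshold
print(("d", "d") in soa.edges(min_support=2))   # dropped before any repair step
```

A threshold also removes rare but legitimate edges, so in practice it would have to be tuned against the size of the sample.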
Conclusions
• iDTD + CRX
  • learn a robust class of regexes from positive examples
  • are complete in their target class for sufficient data
  • deal with insufficient data
  • perform well on real-world data
  • run efficiently
• Future work: inferring XML Schemas