7 Translating Data to XML


Presentation Transcript


  1. 7 Translating Data to XML
     • How to translate existing data formats to XML? (and why?)
     • XW (XML Wrapper)
       • an "XML wrapper description language"
       • developed in the XRAKE project, Univ. of Kuopio, 2001–02
     • Ek, Hakkarainen, Kilpeläinen, Kuikka, Penttinen: Describing XML Wrappers for Information Integration. In Proc. of XML Finland 2001, Tampere, Finland, Nov. 2001, 38–51.
     • Ek, Hakkarainen, Kilpeläinen, Penttinen: Declarative XML Wrapping of Data. Report A/2002/2, Dept. of CS & Appl. Math, Univ. of Kuopio.
     Notes 7: XML Wrapping

  2. XRAKE Project
     • "XML-rajapintojen kehittäminen" (Developing XML-based interfaces)
     • Studied the definition and implementation of XML-based interfaces, and their application in
       • integration of heterogeneous data sources
       • management of mass printing
       • assembly and manipulation of electronic patient records

  3. XRAKE - Support
     • National Technology Agency of Finland (TEKES) and seven local IT companies/organizations
       • DEIO IS
       • Enfo Group
       • JSOP Interactive
       • Kuopio University Hospital
       • Medigroup
       • SysOpen
       • TietoEnator

  4. XW: Motivation
     • XML-based protocols have been developed for e-business, medical messages, …
     • Legacy data formats need to be converted to XML
       • How?

  5. XML-wrapping
     • Need "XML wrappers" (aka extractors)
       • interface/conversion program to produce an XML representation for source data
     [Figure: source1, source2 and source3; wrapper1, wrapper2 and wrapper3; results XML-form-1 and XML-form-2]

  6. How to wrap?
     1. With an interface integrated into the source
        • E.g. XML interfaces of database systems
        • OK, if available
     2. With an ad-hoc written translator
        • E.g. JDBC + Java, or separator-encoded text form + Perl
        • OK; conversion possibly efficient
        • Development and maintenance tedious :-(

  7. How to wrap? (2)
     3. Generic source-independent wrapping
        • requires a file/message/report produced by the system
          • normally available
        • development and maintenance of wrappers should become easier
        => Wrapper description language XW

  8. XW (XML Wrapper)
     • XML-based, declarative wrapper description language
     • To convert from a textual or binary source to XML form
       • currently (XW 1.59) only text sources are supported

  9. XW: Design principles
     • A concise and natural XML syntax
       • description of simple and typical conversion tasks should be simple
     • Solving the key problem: initial conversion of a legacy data format to XML
       • more general post-processing with XSLT/SAX/DOM
       • necessary for being able to apply XML techniques

  10. XW: Influences
      • XML Namespaces
        • for separating XW commands and result elements: xmlns:xw="http://www.cs.uku.fi/XW/2001"
      • XML Schema
        • description of alternative and repetitive structures (CHOICE, minoccurs, maxoccurs)
        • data types of binary source data (string, byte, int, …)
      • XSLT
        • template-based description of result documents
        • variables for storing result fragments

  11. What does XW look like?
      <xw:wrapper xw:sourcetype="text"
          xmlns:xw="http://www.cs.uku.fi/XW/2001"
          xw:inputencoding="Cp850" … >
        <invoice note="XW-generated" xw:starter="\^INVOICE">
          <identifierdata ...>Inserted result text ...
            ...
          </identifierdata>
          <specification xw:starter="\^PHONE SPECIFICATION" ...>
            ...
          </specification>
          <data xw:starter="\^---" xw:maxoccurs="unbounded" ...>
            ...
          </data>
        </invoice>
      </xw:wrapper>

  12. XW-architecture (1)
      [Figure: the XW-engine takes the source data and the wrapper description and produces a result document for post-processing]
      • source data: AA x1 x2 BB y1y2 z1 z2
      • wrapper description: <xw:wrapper … > … </xw:wrapper>
      • result document:
          <part-a> <e1>x1</e1> <e2>x2</e2> </part-a>
          <part-b> <line-1> <d1>y1</d1> <d2>y2</d2> </line-1> <d3>z2</d3> </part-b>
      • post-processing: XSLT, SAX, DOM

  13. XW-architecture (2)
      • to use as a program component
      [Figure: the XW-engine takes the source data and the wrapper description and reports SAX events]
      • source data: AA x1 x2 BB y1y2 z1 z2
      • wrapper description: <xw:wrapper … > … </xw:wrapper>
      • SAX events: startElement(part-a, …) startElement(e1, …) characters("x1") …
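
The component-style use described on this slide can be illustrated with a short Python sketch. It is not the XW engine: `emit_part_a` is a hypothetical stand-in that reports the events listed above by hand, and the standard-library `XMLGenerator` plays the role of the receiving SAX application by serializing the events back to XML.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def emit_part_a(handler):
    """Hypothetical stand-in for a wrapper engine: report SAX events for part-a."""
    no_attrs = AttributesImpl({})
    handler.startDocument()
    handler.startElement("part-a", no_attrs)
    for name, text in (("e1", "x1"), ("e2", "x2")):
        handler.startElement(name, no_attrs)
        handler.characters(text)
        handler.endElement(name)
    handler.endElement("part-a")
    handler.endDocument()


def events_to_xml():
    """Feed the events to a ContentHandler that writes them out as XML text."""
    out = io.StringIO()
    emit_part_a(XMLGenerator(out, encoding="utf-8"))
    return out.getvalue()
```

Any other `ContentHandler` could be passed to `emit_part_a` in place of `XMLGenerator`, which is the point of reporting events instead of building a document.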

  14. XW-architecture (3)
      [Figure components: source data, wrapper description, XW-engine, SAX application, result document]
      • source data: AA x1 x2 BB y1y2 z1 z2
      • wrapper description: <xw:wrapper … > … </xw:wrapper>
      • result document:
          <part-a> <e1>x1</e1> <e2>x2</e2> </part-a>
          <part-b> <line-1> <d1>y1</d1> <d2>y2</d2> </line-1> <d3>z2</d3> </part-b>

  15. XW: Basic Ideas
      • Wrapper description ~ a grammar for the source
      • Wrapping ~ parsing the source data
        • split data into parts according to the description
      • Result document = XML for the parse tree of the source
      [Figure: parse tree with root <whole>, children <a> and <b>, and <b1>, <b2>, <b3> under <b>]

  16. XW Syntax
      <xw:wrapper xw:sourcetype="text"
          xmlns:xw="http://www.cs.uku.fi/XW/2001">
        <invoice … >
          <identifierdata ...> ... </identifierdata>
          <specification ...> ... </specification>
        </invoice>
      </xw:wrapper>
      • Splitting of source content into parts (-> elements)

  17. Recognition of content parts (1)
      • by separators; for example:
          <invoice xw:starter="\^INVOICE" …
        for sub-parts:
          <identifierdata xw:childterminator="\n" … >
      • by position (within the surrounding part):
          <invoicenumber xw:position="53 64"/>
        (The invoice number is in positions 53..64 of the first row of an identifierdata part)
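
A minimal Python sketch of the two recognition styles on this slide, under stated assumptions: positions are taken to be 1-based and inclusive, and the helper names `split_rows`, `field_at` and `invoice_number` are made up for illustration. It mimics what `xw:childterminator="\n"` and `xw:position="53 64"` describe; it is not how the XW engine is implemented.

```python
def split_rows(part, child_terminator="\n"):
    """Split a part into sub-parts at each child terminator.

    A trailing empty piece (text ending with the terminator) is dropped.
    """
    rows = part.split(child_terminator)
    return rows[:-1] if rows and rows[-1] == "" else rows


def field_at(row, first, last):
    """Return characters first..last of row (assumed 1-based, inclusive)."""
    return row[first - 1:last]


def invoice_number(identifierdata):
    """Take positions 53..64 of the first row of an identifierdata part."""
    return field_at(split_rows(identifierdata)[0], 53, 64).strip()
```

For a first row that is blank up to column 52 and then carries a 12-character number, `invoice_number` returns exactly that number.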

  18. Recognition of content parts (2)
      • In binary data, by content data types; for example:
          <xw:wrapper xw:sourcetype="binary" ...>
            <A xw:type="byte"/>
            <B xw:type="string" xw:stringLength="20"/>
            <C xw:type="int"/>
          </xw:wrapper>
      • Split the input into a byte, a string of 20 characters, and an integer (-> elements A, B and C)
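
The type-driven split can be sketched in Python with the standard `struct` module. The slide does not specify byte order, integer width or character encoding, so the sketch assumes a big-endian 32-bit integer and a Latin-1 string padded with NUL bytes or spaces; `wrap_record` is a hypothetical helper, not part of XW.

```python
import struct
from xml.sax.saxutils import escape

# Assumed layout: one unsigned byte, a 20-byte string, a big-endian 32-bit int.
RECORD = struct.Struct(">B20si")


def wrap_record(data):
    """Split one binary record into elements A, B and C."""
    a, b, c = RECORD.unpack(data[:RECORD.size])
    # Assumption: strings are Latin-1 and padded with NULs or spaces.
    text = b.decode("latin-1").rstrip("\x00 ")
    return f"<A>{a}</A><B>{escape(text)}</B><C>{c}</C>"
```

With these assumptions a record is 25 bytes long, and a record holding 7, "hello" and 1234 yields `<A>7</A><B>hello</B><C>1234</C>`.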

  19. Recognition of content parts (3)
      • Alternative parts:
          <xw:CHOICE xw:maxoccurs="unbounded">
            <A xw:starter="\^aa" xw:terminator="\n" />
            <B xw:starter="\^bb" xw:terminator="\n" />
          </xw:CHOICE>
        • an arbitrary number (at least 1) of lines starting with "aa" or "bb" -> elements A or B
      • Repetition:
          <line xw:terminator="\n" xw:minoccurs="2" xw:maxoccurs="2"/>
        • 2 input lines -> 2 line elements
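
The xw:CHOICE example can be imitated in a few lines of Python. This is a sketch under assumptions: the starter `\^aa` is read as "line begins with aa" (as the slide's own explanation suggests), delimiters are left out of the element content, and `wrap_choice` is an illustrative name only.

```python
from xml.sax.saxutils import escape

# (line prefix, result element) pairs standing in for the two alternatives.
CHOICES = (("aa", "A"), ("bb", "B"))


def wrap_choice(text):
    """Wrap leading lines that start with "aa" or "bb" as A or B elements."""
    out = []
    for line in text.split("\n"):
        for prefix, name in CHOICES:
            if line.startswith(prefix):
                out.append(f"<{name}>{escape(line[len(prefix):])}</{name}>")
                break
        else:
            break  # neither alternative matches: the repetition ends here
    if not out:
        raise ValueError("expected at least one line starting with aa or bb")
    return out
```

The `ValueError` corresponds to the "at least 1" requirement; the first line matching neither alternative ends the repetition.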

  20. XW: Modifying the structure of data
      • Limited modification possible:
        • discarding parts of data
        • collapsing levels of hierarchy
        • adding levels of hierarchy
        • generating verbatim content and attributes
        • re-arranging existing data

  21. Discarding parts of data
      • Input parts not matched by wrapper elements are ignored
          <spec xw:starter="SPEC" xw:childterminator="\n">
            <!-- Split the "SPEC" into rows: -->
            <!-- Ignore the first three rows: -->
            <xw:ignore xw:minoccurs="3" xw:maxoccurs="3" />
            . . .
          </spec>

  22. Collapsing hierarchy
      <data xw:starter="START" xw:terminator="END"
          xw:childterminator="\n">
        <!-- 'data' is made of rows -->
        <xw:collapse>
          <date xw:position="5 14"/>
          <sum xw:position="16 21"/>
        </xw:collapse>
        . . .
      </data>

  23. Collapsing hierarchy (2)
      • Split source data into parts according to the specified separators
      [Figure: the input "START 17.8.1996 95.50 END" is matched by <data> . . . </data>]

  24. Collapsing hierarchy (3)
      • Split parts into sub-parts, according to sub-elements
      [Figure: the row "17.8.1996 95.50" is matched by <xw:collapse> </xw:collapse> inside <data> . . . </data>]

  25. Collapsing hierarchy (4)
      • input row: 17.8.1996 95.50
      • wrapper description:
          <data> <xw:collapse><date></date> <sum></sum></xw:collapse> . . .</data>
      • result (the collapse level does not appear in the output):
          <data><date>17.8.1996</date> <sum>95.50</sum> . . .</data>

  26. Collapsing hierarchy (5)
      Input part    wrapper element     result
      17.8.1996   + <data />            <data>17.8.1996</data>
      17.8.1996   + <xw:collapse />     17.8.1996
      • default: discardwhitespace="true"

  27. Adding levels of hierarchy
      • Example: recognizing IP addresses in binary data
          <xw:ELEMENT xw:name="IP-address">
            <a xw:type="byte"/>
            <b xw:type="byte"/>
            <c xw:type="byte"/>
            <d xw:type="byte"/>
          </xw:ELEMENT>

  28. Adding levels of hierarchy (2)
      • Binary data = string of bytes
      • input bytes: 193 167 232 253
      • parts matched by the sub-elements:
          <a>193</a> <b>167</b> <c>232</c> <d>253</d>
      • result with the added level:
          <IP-address> <a>193</a> <b>167</b> <c>232</c> <d>253</d> </IP-address>
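
The IP-address example maps directly onto a short Python sketch. The function name `wrap_ip` is made up for illustration; the element names and byte values are the ones from the slide.

```python
def wrap_ip(data):
    """Wrap four consecutive bytes as <a>..<d> inside an added IP-address level."""
    if len(data) < 4:
        raise ValueError("need at least four bytes")
    # Iterating over a bytes object yields integers 0..255.
    children = "".join(f"<{name}>{byte}</{name}>" for name, byte in zip("abcd", data[:4]))
    # The surrounding element corresponds to no input of its own; it only adds a level.
    return f"<IP-address>{children}</IP-address>"
```

Calling `wrap_ip(bytes([193, 167, 232, 253]))` gives the nested result shown on the slide, without the spaces between elements.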

  29. Adding levels of hierarchy (3)
      • NB: an xw:ELEMENT does not correspond to parts of input data (like ordinary result elements do):
          <!-- Wrap first two lines as INTRO: -->
          <data xw:childterminator="\n">
            <xw:ELEMENT xw:name="INTRO">
              <!-- lines are matched by these elements: -->
              <xw:collapse /><xw:collapse />
            </xw:ELEMENT>
            …
          </data>

  30. Rearranging content
      • Content can be rearranged by storing results temporarily in variables:
          <data xw:childterminator="\n">
            <xw:STORE xw:name="lines">
              <!-- lines are matched by these elements: -->
              <line1 /><line2 />
            </xw:STORE>
            …
            <xw:COPY-OF xw:select="lines" />
          </data>

  31. Rearranging result structures
      • source data: Axy.. z$??? B.. 1one 2two 3three
      • wrapper description:
          <whole>
            <xw:STORE xw:name="xx">
              <a xw:starter="A" xw:terminator="$"/>
            </xw:STORE>
            <b xw:starter="B">
              <b1 xw:starter="1"/>
              <b2 xw:starter="2"/>
              <xw:COPY-OF xw:select="xx"/>
              <b3 xw:starter="3"/>
            </b>
          </whole>
      • result:
          <whole> <b> <b1>one</b1> <b2>two</b2> <b3>three</b3> </b> </whole>
      • fragment stored in xx and copied by xw:COPY-OF: <a>xy..z</a>

  32. Rearranging result content
      • source data: Axy.. z$??? B.. 1one 2two 3three
      • wrapper description:
          <whole>
            <xw:STORE xw:name="xx">
              <a xw:starter="A" xw:terminator="$"/>
            </xw:STORE>
            <b xw:starter="B">
              <b1 xw:starter="1"/>
              <b2 xw:starter="2"/>
              <xw:VALUE-OF xw:select="xx"/>
              <b3 xw:starter="3"/>
            </b>
          </whole>
      • result:
          <whole> <b> <b1>one</b1> <b2>two</b2> <b3>three</b3> </b> </whole>
      • text stored in xx and inserted by xw:VALUE-OF: xy..z

  33. XW: Implementation
      • Prototype implemented in Java
      • Apache Xerces 2.0.1 used as a SAX parser
        • to read the wrapper description, which is represented internally as a wrapper tree
        • the wrapper tree guides the parsing of source data

  34. Wrapper Tree
      • Wrapper tree node
        • corresponds to an element of the wrapper description
        • used for matching parts of source data
        • includes sets S, B, T and F of strings
          • computed from the wrapper description
          • S: the element's own starter strings
          • B: strings that can begin the part of the element = S ∪ starters of subelements that can begin the part of the element
          • T: terminating delimiters for the part of the element
          • F: strings that can follow the part of the element
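
One way to read the definition of B is as a "first set" computation over the wrapper tree. The Python sketch below follows that reading under assumptions: the `Node` class and `begin_set` function are hypothetical, a sequence contributes the starters of its children up to and including the first non-optional child, and an xw:CHOICE contributes the starters of all alternatives. The actual XW implementation may compute the sets differently.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical wrapper tree node; only what is needed for the set B."""
    name: str
    starters: frozenset = frozenset()   # the set S
    min_occurs: int = 1
    choice: bool = False                # True for an xw:CHOICE node
    children: list = field(default_factory=list)


def begin_set(node):
    """B = S plus the starters of subelements that can begin the node's part."""
    b = set(node.starters)
    for child in node.children:
        b |= begin_set(child)
        # In a sequence, stop at the first child that must occur.
        if not node.choice and child.min_occurs > 0:
            break
    return b


def example_tree():
    """A tree with optional a, then b, c, and an optional choice of d or e."""
    choice = Node("xw:CHOICE", min_occurs=0, choice=True, children=[
        Node("d", frozenset({r"\^D"})), Node("e", frozenset({r"\^E"}))])
    return Node("doku", children=[
        Node("a", frozenset({r"\^A"}), min_occurs=0),
        Node("b", frozenset({r"\^B"})),
        Node("c", frozenset({r"\^C"})),
        choice])
```

For the root of `example_tree()`, `begin_set` yields the starters of a and b: a is optional, so the part may also begin with b.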

  35. Wrapper tree example
      <xw:wrapper xw:name="Wrapper tree example" xw:sourcetype="text"
          xmlns:xw="http://www.cs.uku.fi/XW/2001">
        <doku xw:childterminator="\n" xw:terminator="$">
          <a xw:starter="\^A" xw:minoccurs="0"/>
          <b xw:starter="\^B" />
          <c xw:starter="\^C"/>
          <xw:CHOICE xw:minoccurs="0" xw:maxoccurs="unbounded">
            <d xw:starter="\^D"/>
            <e xw:starter="\^E"/>
          </xw:CHOICE>
        </doku>
      </xw:wrapper>

      node        S      B          T     F
      doku               \^A, \^B   $
      a           \^A    \^A        \n    \^B
      b           \^B    \^B        \n    \^C
      c           \^C    \^C        \n    \^D, \^E, $
      xw:CHOICE          \^D, \^E         \^D, \^E, $
      d           \^D    \^D        \n    \^D, \^E, $
      e           \^E    \^E        \n    \^D, \^E, $

      source data:
        Aaaa
        Bbbb
        Cccc
        Eeee
        Dddd
        Dddd
        $

  36. Executing a wrapper (simplified)
      • Traverse the wrapper tree; in each node:
        • scan input until the start of the corresponding part is found (= a delimiter belonging to set B)
        • report startElement(…)
        • either
          • process child nodes recursively, or
          • report characters(…) for a leaf-level element
        • scan input until the end of the part (using sets T and F)
        • report endElement(…)
        • if the node is iterative and a string in B is found, reprocess the node
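
The traversal above can be approximated by a much reduced Python sketch. Assumptions, all of them simplifications and not taken from the XW implementation: the input is line-oriented, a starter is a plain line prefix, the part ends at the first "$", unmatched input is not skipped, and events are collected as tuples. The names `Node`, `run` and `wrap` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    starter: str = ""            # plain line prefix (simplification of \^X)
    min_occurs: int = 1
    max_occurs: float = 1        # float("inf") stands for "unbounded"
    choice: bool = False
    children: list = field(default_factory=list)


def run(node, lines, pos, events):
    """Match node against lines[pos:], append events, return the new position."""
    count = 0
    while count < node.max_occurs and pos < len(lines):
        if node.choice:
            alt = next((c for c in node.children if lines[pos].startswith(c.starter)), None)
            if alt is None:
                break
            pos = run(alt, lines, pos, events)
        elif node.children:
            events.append(("startElement", node.name))
            for child in node.children:          # process child nodes recursively
                pos = run(child, lines, pos, events)
            events.append(("endElement", node.name))
        elif lines[pos].startswith(node.starter):
            events.append(("startElement", node.name))
            events.append(("characters", lines[pos][len(node.starter):]))
            events.append(("endElement", node.name))
            pos += 1
        else:
            break
        count += 1                               # iterative node: try again
    if count < node.min_occurs:
        raise ValueError(f"missing required part for {node.name}")
    return pos


def wrap(root, text, terminator="$"):
    """Run the wrapper tree over the text up to the terminator."""
    lines = [ln for ln in text.split(terminator, 1)[0].split("\n") if ln]
    events = []
    run(root, lines, 0, events)
    return events


def doku_tree():
    """Tree mirroring the doku example: optional a, then b, c, then d or e repeated."""
    choice = Node("xw:CHOICE", min_occurs=0, max_occurs=float("inf"), choice=True,
                  children=[Node("d", "D"), Node("e", "E")])
    return Node("doku", children=[Node("a", "A", min_occurs=0),
                                  Node("b", "B"), Node("c", "C"), choice])
```

Run on the sample input of the wrapper tree example, the sketch reports start events for doku, a, b, c, e, d, d in that order; the sets T and F are not modelled here because line splitting already delimits each part.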

  37. Development status
      • Fall 2001: language designed from concrete examples
      • 2002: design of implementation principles, implementation
        • wrapping of separator-based and positional text data implemented
        • wrapping of binary data (and a few other details) unimplemented

  38. XW: Some possible extensions
      • Evaluation of expressions
        • for generating computed attributes (implemented recently)
        • for guiding repetition (min/maxoccurs) by content values
      • Namespace support for results
      • Describing recursive (unlimited nesting) source structures
        => recognizing LL(k) languages (Usefulness for wrapping data formats?)

  39. XW: Summary
      • XW: a convenient "XML wrapper description language"
        • for translating legacy data to XML
      • declarative wrapper description
        • easier than procedural ad-hoc conversion programs
      • working prototype implementation
        • to be available at www.cs.uku.fi/research/XRAKE
