Processing XML Documents

Processing XML Documents SNU IDB Lab.

Processing XML documents • Processing XML Data • Document Formatting (XSL & XSLT)

Contents: Processing XML data • Concepts • Writing XML • Reading XML • Event processing • Tree manipulation • Events or trees? • Transformation tools

Concepts (1/4) [Figure: Application <--> XML processor, exchanging data, errors and rules] Developing software to generate XML output is a trivial matter. However, reading an XML document can be complicated by a number of issues and features of the language. For example, the DTD may need to be processed, either to add default information or to compare it against the document instance in order to validate it.

Concepts (2/4) • Programmers wishing to read XML data files need an XML-aware processing module, termed an XML processor. • The XML processor is responsible for making the content of the document available to the application. • It also detects problems such as file formats that the application cannot process, or URLs that do not point to valid resources.

Concepts (3/4) • Two fundamentally different approaches to reading the content of an XML document are known as the ‘event-driven’ and ‘tree-manipulation’ techniques. • Event-driven • The document is processed in strict sequence. • Each element in the data stream is treated as an event trigger, which may precipitate some special action on the part of the application.

Concepts (4/4) • Tree-manipulation • The tree approach provides access to the entire document, allowing its contents to be interrogated and manipulated in any order.

Writing XML (1/3) To produce XML data, it is only necessary to include XML tags in the output strings. However, one decision that has to be made is whether to output line-end codes or whether to omit them. In many respects it is simpler and safer to omit line-end codes. But if the XML document is likely to be viewed or edited using tools that are not XML-aware, this approach makes the document very difficult to read.

Writing XML (2/3) Some text editors will only display as much text as will fit on one line in the window. Although some editors are able to display more text by creating ‘soft’ line breaks at the right margin, the content is still not very legible. It would seem to be more convenient to break the document into separate lines at obvious points in the text. However, the recipient application may then have trouble determining which line-end codes are there purely to make the XML data file more legible.

Writing XML (3/3) Without line-end codes:
  <book><front><title>The Book Title</title><author>J. Smith</author><date>October 1917</date></front><body><chapter><title>First Chapter</title><para>This is the first chapter in the book.</para><para>This is the …
With line-end codes:
  <book>
    <front>
      <title>The Book Title</title>
      <author>J. Smith</author>
      <date>October 1917</date>
    </front>
    <body>
      <chapter>
        <title>First Chapter</title>
        <para>This is the first chapter in the book.</para>
        <para>This is the …
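
The two layouts can be produced from the same data. The following is a minimal sketch in Python, assuming the standard library module xml.etree.ElementTree (a tool not named in these slides) and Python 3.9 or later for ET.indent. The helper names build_book and significant_text are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_book():
    """Build the sample book document as an element tree."""
    book = ET.Element("book")
    front = ET.SubElement(book, "front")
    ET.SubElement(front, "title").text = "The Book Title"
    ET.SubElement(front, "author").text = "J. Smith"
    ET.SubElement(front, "date").text = "October 1917"
    body = ET.SubElement(book, "body")
    chapter = ET.SubElement(body, "chapter")
    ET.SubElement(chapter, "title").text = "First Chapter"
    ET.SubElement(chapter, "para").text = "This is the first chapter in the book."
    return book

# Compact form: no line-end codes at all.
compact = ET.tostring(build_book(), encoding="unicode")

# Legible form: line-end codes and indentation added between elements
# (ET.indent is available from Python 3.9).
pretty_tree = build_book()
ET.indent(pretty_tree, space="  ")
pretty = ET.tostring(pretty_tree, encoding="unicode")

def significant_text(xml_text):
    """Return the text content, ignoring white space used only for layout."""
    root = ET.fromstring(xml_text)
    return [t.strip() for t in root.itertext() if t.strip()]

print(compact)
print(pretty)
```

A recipient that ignores white-space-only text between elements, as significant_text does here, sees the same content in both forms. That rule is only safe when the document has no mixed content in which such white space matters.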

Reading XML (1/4) [Figure: Application <--> XML processor (containing the entity manager) <--> data: XML document, XML fragment, image]

Reading XML (2/4) The XML processor hides many complications from the application. The XML processor has at least one sub-unit, termed the entity manager, which is responsible for locating fragments of the document held in entity declarations or in other data files, and for handling the replacement of all references to them.
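
The effect of entity replacement can be seen from the application's side. The following is a minimal sketch in Python, assuming the standard library module xml.etree.ElementTree; the entity name publisher and its value are made up for illustration. It covers only an internal entity declared in the document itself, not fragments held in other data files.

```python
import xml.etree.ElementTree as ET

# An internal entity declared in the document type declaration.
# The entity name "publisher" and its value are hypothetical.
document = """<!DOCTYPE book [
  <!ENTITY publisher "Example Press">
]>
<book><para>Published by &publisher;.</para></book>"""

root = ET.fromstring(document)

# The application never sees the reference, only the replacement text.
para_text = root.find("para").text
print(para_text)
```

The application receives ordinary text and does not need to know that part of it came from an entity declaration.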

Reading XML (3/4) • The XML processor delivers data to the application, but there are two distinct ways in which this can be done. • (1) Event-driven • The simplest is to pass the data directly to the application as a stream. The application accepts the data stream and reacts to the markup as it is encountered.

Reading XML (4/4) • (2) Tree-walking • The XML processor holds onto the data on the application’s behalf, allowing the application to ask questions about the data and request portions of it. • Grove • A tree or group of trees can be stored in a data structure, termed a grove.

Event processing (1/2) The simplest method of processing an XML document is to read the content as a stream of data, and to interpret markup as it is encountered. If out-of-sequence processing is required, such as collecting all the titles in a document for insertion at the start of the document as a table of contents, then a ‘two-pass’ processor is needed. In the first pass, the titles are collected. In the second pass, they are inserted where they are required.

Event processing (2/2) • Simple API for XML (SAX 1.0) • To reduce the workload of the application developer, and to make it easy to replace one parser with another, a common event-driven interface has been proposed for object-oriented languages such as Java.
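
The slides name the Java interface. The following is a minimal sketch of the same event-driven style in Python, assuming the standard library module xml.sax, which follows the SAX model. The class name TitleCollector is hypothetical. It performs what would be the first pass of a two-pass table-of-contents job: collecting the titles.

```python
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every title element as events arrive."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None means "not currently inside a title"

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None

document = (
    "<book><front><title>The Book Title</title></front>"
    "<body><chapter><title>First Chapter</title>"
    "<para>This is the first chapter in the book.</para></chapter></body></book>"
)

handler = TitleCollector()
xml.sax.parseString(document.encode("utf-8"), handler)
print(handler.titles)
```

The handler keeps only the state it needs (the current title buffer), which is why this style uses little memory.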

Tree manipulation (1/3) Software that holds the entire document in memory needs to organize the content so that it can be easily searched and manipulated. There is no need for multi-pass parsing when any part of the document can be accessed instantly. Applications that benefit from this approach include XML-aware editors, pagination engines and hypertext-enabled browsers.
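
With the whole document in memory, the table-of-contents task needs only one parse. The following is a minimal sketch in Python, assuming the standard library module xml.dom.minidom; the element names toc and entry are made up for illustration.

```python
from xml.dom import minidom

document = (
    "<book><body>"
    "<chapter><title>First Chapter</title><para>Text.</para></chapter>"
    "<chapter><title>Second Chapter</title><para>More text.</para></chapter>"
    "</body></book>"
)

dom = minidom.parseString(document)
book = dom.documentElement

# Build a table of contents from titles found anywhere in the tree.
# The element names "toc" and "entry" are hypothetical.
toc = dom.createElement("toc")
for title in dom.getElementsByTagName("title"):
    entry = dom.createElement("entry")
    entry.appendChild(dom.createTextNode(title.firstChild.data))
    toc.appendChild(entry)

# Insert the table of contents at the start of the document, out of sequence.
book.insertBefore(toc, book.firstChild)

result = book.toxml()
print(result)
```

The titles are read from the end of the document and written to its start without re-reading the input.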

Tree manipulation (2/3) The abstract description of the model for SGML documents is called a grove, and the grove scheme is equally applicable to XML. The name ‘grove’ is appropriate because it mainly describes a series of trees. A grove is a ‘directed graph of nodes’. Each node is an object of a specified type: a package of information that conforms to a pre-defined template.

Tree manipulation (3/3) [Figure: a node and its properties]
  property    property value
  type        element
  gi          para
A property has a name and a value, so it can be compared to an attribute. A node that describes a person may have a property called ‘age’, which holds a value representing the age of that individual. A node must have a type property and a name property, so that it can be identified or referred to.

Events or trees ? (1/3) • Event-driven benefits • The parser does not have to hold much information about the document in memory. • The document structure does not have to be managed in memory, either by the parser or, depending on what it needs to do, by the application. This makes parsing very fast. • The application does not have to do anything special in order to process the document in a simple linear fashion, from start to end.

Events or trees ? (2/3) • Tree-walking benefits • With the entire document held in memory, the document structure can be analyzed several times over, quickly and easily. • The data structure management module may be profitably utilized by the application to manage the document components on its behalf. • A document that contains errors can be rejected before the application begins to process its contents, thereby eliminating the need for messy roll-back routines.

Events or trees ? (3/3) • Other considerations • The memory usage advantage of the event-driven approach may be only theoretical. • If the application uses an event-driven API, the parser need not build a document tree, but if the application uses a tree-walking API, it can itself use the event-driven API to build its tree model.
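
The last point, a tree model built on top of an event-driven API, can be sketched as follows. This is a minimal illustration in Python, assuming the standard library module xml.sax; the class names Node and EventTreeBuilder are hypothetical, and attributes are ignored for brevity.

```python
import xml.sax

class Node:
    """A minimal tree node: an element name, child nodes and text."""

    def __init__(self, name):
        self.name = name
        self.children = []
        self.text = ""

class EventTreeBuilder(xml.sax.ContentHandler):
    """Build a Node tree from the stream of parser events."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._stack = []  # currently open elements, innermost last

    def startElement(self, name, attrs):
        node = Node(name)
        if self._stack:
            self._stack[-1].children.append(node)
        else:
            self.root = node
        self._stack.append(node)

    def characters(self, content):
        if self._stack:
            self._stack[-1].text += content

    def endElement(self, name):
        self._stack.pop()

builder = EventTreeBuilder()
xml.sax.parseString(
    b"<book><title>The Book Title</title><body><para>Text.</para></body></book>",
    builder,
)

root = builder.root
print(root.name, [child.name for child in root.children])
```

Once the tree exists, the memory cost is the same as for any tree-walking approach, which is why the memory advantage of events may be only theoretical for such applications.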

Transformation tools When the intent is simply to change an XML document structure into a new structure, there are existing tools. These tools can usually do much more advanced things, such as changing the order of elements, sorting them, and generating new content automatically. They can transform an XML document into another XML document, or into an HTML document.


Contents: Document Formatting • Concepts • Selecting a style sheet • XSLT • Style sheet DTD issues • XSL

Concepts of XSL • XSL: Extensible Stylesheet Language • XML documents are intended to be easily read by both people and software. • People do not want to see documents with tags. • It is necessary to replace the tags with appropriate text styles.

Concepts of Style sheets (1/2) Source:
  <title>An example of style</title>
  <intro><para>This example shows how important style is to material intended to be read.</para></intro>
  <para>This is a <em>normal</em> paragraph.</para>
  <warning><para>Styles are important!</para></warning>
Removal of tags?
  An example of style This example shows how important style is to material intended to be read. This is a normal paragraph. Styles are important!
Style applied:
  An example of style
  This example shows how important style is to material intended to be read.
  This is a normal paragraph.
  Warning: Styles are important!

Concepts of Style sheets (2/2) [Figure: DTD, authoring, documents, style sheet, presentation]
Document:
  <title>This is a title</title>
  <p>This paragraph contains a <em>highlighted</em> term.</p>
Presentation:
  This is a title
  This paragraph contains a highlighted term.

Concepts of DTD and style sheet [Figure: DTD, Authoring, Documents; Style sheet A and Style sheet B each lead to a Presentation] A single style sheet may be applied to a number of documents formatted in the same way. An XML document can be associated with more than one style sheet.

Concepts of Styling with XSL • A set of formatting objects • In this first version, all allowed formatting objects are rectangular • FO DTD (Formatting Objects DTD) • Elements such as ‘block’ • Attributes such as ‘text-align’

Concepts of Transforming with XSLT (1/2) Authoring XML documents directly with the FO DTD would negate the entire philosophy of XML, which is to be self-describing rather than self-formatting like HTML. Instead, an XSLT processor takes an existing XML document as input and generates a new XML document, conforming to a new DTD, as output.

Concepts of Transforming with XSLT (2/2) [Figure: XML document (source DTD) + XSLT style sheet -> XSLT processor -> new XML document (FO DTD) -> XSL processor -> presentation]
XSLT style sheet:
  <template match="emph">
    <fo:inline-sequence font-weight="bold">
      <apply-templates/>
    </fo:inline-sequence>
  </template>
XML document: An <emph>emphasized</emph> word.
Presentation: An emphasized word.

Selecting a style sheet An XML processing instruction is used for selecting a style sheet.
  <?xml-stylesheet href="mystyles.xsl" type="text/xsl" title="default" ?>
  <?xml-stylesheet href="myBIGstyles.xsl" type="text/xsl" title="bigger font" alternate="yes" ?>

XSLT : general structure (1/3) • Root element: stylesheet or transform
  <stylesheet xmlns="http://www.w3.org/XSL/Transform/1.0">
  <transform xmlns="http://www.w3.org/XSLT/Transform/1.0">
• Another namespace: an XSLT style sheet may also contain elements that are not part of stylesheet or transform
  <stylesheet xmlns="http://www.w3.org/XSL/Transform/1.0" xmlns:X="…">
    … <X:my-element>…</X:my-element> …

XSLT : general structure (2/3) • Result namespace: indicates what the output of the XSL processor is
  <stylesheet xmlns="http://www.w3.org/XSL/Transform/1.0" xmlns:X="…" result-ns="X">
• Id: identifies a style sheet embedded in a larger XML document
  <?xml-stylesheet type="text/xsl" href="#MyStyles" ?>
  <X:book>
    <stylesheet id="MyStyles" …> … </stylesheet>
    …

XSLT : general structure (3/3) • Result Version, Result Encoding: specify which version of XML and which character set encoding scheme should be used for the output file
  <stylesheet … result-version="2.0" result-encoding="ISO-8859-1">

XSLT : White space • An XSLT processor creates a tree of nodes, including nodes for each text string in and between the markup tags. • Default: all white space is preserved. • Default Space: when ‘strip’ is applied, white space can be removed; the Preserve Space element names elements whose white space is kept.
  <stylesheet … default-space="strip">
    <preserve-space elements="pre poetry"/>
    …
  </stylesheet>

XSLT : Templates • The body of the style sheet consists of at least one transformation rule, represented by the Template element.
  <template match="para"> … </template>
  <template match="warning/para"> … </template>

XSLT : Imports and Inclusions • Multiple style sheets may share some definitions.
  <stylesheet …>
    <import href="tables.xsl"/>
    <import href="colours.xsl"/>
    <template …>…</template>
    …
  <include href="…"/>
• Imported rules are not considered to be as important as other rules. • The Include element can be used anywhere, and included rules are not considered to be less important than other rules.

XSLT : Priorities • When more than one complex rule matches the current element, it is necessary to explicitly give one rule a higher priority than the others, using the Priority attribute.
  <template match="chapter//para"> … </template>
  <template match="warning//para" priority="2"> … </template>
• If the priority attribute is not used, or not used correctly, an XSLT processor may choose to simply select the last rule.

XSLT : Recursive processing • If an animal element existed within the paragraph, and there was no rule for this element, but it could contain the emphasis element, then the emphasized text would not be formatted.
  <para>A <animal><emph>Giraffe</emph></animal> is an animal.</para>
• To eliminate this problem, a rule is needed to act as a catch-all, representing the elements not covered by explicit formatting rules.
  <template match="/|*">
    <apply-templates />
  </template>
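
The role of the catch-all rule can be sketched outside XSLT. The following is a minimal illustration in Python, assuming the standard library module xml.etree.ElementTree; it is not an XSLT processor. The function names and the output tags p and b are made up for illustration.

```python
import xml.etree.ElementTree as ET

def apply_templates(element, rules):
    """Process the element's text and children in order, like apply-templates."""
    parts = [element.text or ""]
    for child in element:
        parts.append(transform(child, rules))
        parts.append(child.tail or "")
    return "".join(parts)

def transform(element, rules):
    """Use the explicit rule for this element, or fall back to the catch-all."""
    rule = rules.get(element.tag, apply_templates)  # catch-all: just recurse
    return rule(element, rules)

# Explicit rules exist only for para and emph; animal has none.
# The output tags <p> and <b> are hypothetical.
rules = {
    "para": lambda e, r: "<p>" + apply_templates(e, r) + "</p>",
    "emph": lambda e, r: "<b>" + apply_templates(e, r) + "</b>",
}

source = ET.fromstring(
    "<para>A <animal><emph>Giraffe</emph></animal> is an animal.</para>"
)
output = transform(source, rules)
print(output)
```

Because the fallback keeps descending, the emph rule is still reached inside animal, and the emphasized text is formatted.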

XSLT : Selective processing • The Apply Templates element can take a Select attribute, which overrides the default action of processing all children. Using XPath patterns, it is possible to select specific children and ignore the rest.
  <template match="names">
    <apply-templates select="name[@type='company']" />
  </template>
• The Apply Templates element can be used more than once in a template.
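
The same selection pattern can be tried directly. The following is a minimal sketch in Python, assuming the standard library module xml.etree.ElementTree, which supports a limited subset of XPath including attribute predicates. The sample names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample data; the names themselves are hypothetical.
names = ET.fromstring(
    "<names>"
    "<name type='person'>J. Smith</name>"
    "<name type='company'>ACME</name>"
    "<name type='company'>Example Ltd</name>"
    "</names>"
)

# Select only the name children whose type attribute is 'company';
# all other children are ignored.
companies = [n.text for n in names.findall("name[@type='company']")]
print(companies)
```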

XSLT : Output formats • An XSLT transformation tool is expected to write out a new XML document. One way to do this is simply to insert the appropriate elements into the templates.
  <template match="para">
    <html:p><apply-templates/></html:p>
  </template>
• Comments and processing instructions can be inserted into the output document using the Comment and Processing Instruction elements.
  <processing-instruction name="ACME">INSERT_TOC</processing-instruction>
  <comment>This is the HTML version</comment>

XSLT : Sorting elements • The Sort element is used within the Apply Templates element to sort the elements it selects.
Source:
  <list>
    <item sortcode="1">ZZZ</item>
    <item sortcode="3">MMM</item>
    <item sortcode="2">AAA</item>
  </list>
Style sheet:
  <template match="list">
    <apply-templates><sort/></apply-templates>
  </template>
Sorting on an attribute instead:
  <sort select="@sortcode" />
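
The two orderings can be compared on the sample list. The following is a minimal sketch in Python, assuming the standard library module xml.etree.ElementTree; it imitates the effect of the Sort element using text comparison and is not an XSLT processor.

```python
import xml.etree.ElementTree as ET

source = ET.fromstring(
    '<list>'
    '<item sortcode="1">ZZZ</item>'
    '<item sortcode="3">MMM</item>'
    '<item sortcode="2">AAA</item>'
    '</list>'
)
items = source.findall("item")

# Like a Sort element with no Select attribute: order by each item's content.
by_content = [i.text for i in sorted(items, key=lambda i: i.text)]

# Like select="@sortcode": order by the sortcode attribute (compared as text).
by_sortcode = [i.text for i in sorted(items, key=lambda i: i.get("sortcode"))]

print(by_content)
print(by_sortcode)
```

Sorting on content yields AAA, MMM, ZZZ, while sorting on the attribute yields ZZZ, AAA, MMM.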

XSLT : Automatic numbering • In many XML documents, list items are not physically numbered in the text, making it easy to insert, move or delete items without having to edit all the items, so the style sheet must add the required numbering.
  <template match="section/title">
    <number level="multi" count="chapter|section" format="1.A" />
    <apply-templates/>
  </template>
Output:
  1.A First section of Chapter One
  2.C Third section of Chapter Two
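
How a multi-level number such as 2.C is derived from document position can be sketched as follows. This is a minimal illustration in Python, assuming the standard library module xml.etree.ElementTree and a made-up book containing chapter and section elements. The function name number_sections is hypothetical, and the letter scheme handles at most 26 sections per chapter.

```python
import string
import xml.etree.ElementTree as ET

def number_sections(book):
    """Return labels such as '1.A <title>' for every section title."""
    labels = []
    for chapter_no, chapter in enumerate(book.findall("chapter"), start=1):
        for section_no, section in enumerate(chapter.findall("section")):
            # A, B, C ... for the section level (at most 26 sections)
            letter = string.ascii_uppercase[section_no]
            labels.append(f"{chapter_no}.{letter} {section.findtext('title')}")
    return labels

# Hypothetical sample document: no numbers are stored in the text itself.
book = ET.fromstring(
    "<book>"
    "<chapter><section><title>First section of Chapter One</title></section></chapter>"
    "<chapter>"
    "<section><title>First section of Chapter Two</title></section>"
    "<section><title>Second section of Chapter Two</title></section>"
    "<section><title>Third section of Chapter Two</title></section>"
    "</chapter>"
    "</book>"
)
labels = number_sections(book)
print("\n".join(labels))
```

Because the numbers are computed from position, inserting or moving a section changes the labels without any edit to the source text.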

XSLT : Variables and templates (1/3) • A style sheet often contains a number of templates that produce output that is identical, or very similar, and XSLT includes some mechanisms for avoiding such redundancy. • Variable, Value Of
  <variable name="Colour">red</variable>
  <html:h1>The colour is <value-of select="$Colour"/>.</html:h1>
Output: The colour is red.

XSLT : Variables and templates (2/3) • When the same formatting is required in a number of places, it is possible to simply reuse the same template.
  <template name="CreateHeader">
    <html:h2>*****<apply-templates/>*****</html:h2>
  </template>
  <template match="title">
    <call-template name="CreateHeader" />
  </template>
  <template match="head">
    <call-template name="CreateHeader" />
  </template>

XSLT : Variables and templates (3/3) • Such a mechanism is even more useful when the action performed by the named template can be modified by passing parameters to it that override default values.
  <template name="CreateHeader">
    <param name="Prefix">%%%</param>
    <html:h2><value-of select="$Prefix"/><apply-templates/>*****</html:h2>
  </template>
  <call-template name="CreateHeader">
    <with-param name="Prefix">%%%%%</with-param>
  </call-template>
Output with the default prefix: %%%Header*****

XSLT : Creating and copying elements (1/2) • An element can be created in the output document using the Element element, with the element name specified using the Name attribute, and an optional namespace specified using the Namespace attribute. • Elements can also be created that are copies of the source element, using the Copy element.
  <template match="third-header-level">
    <element namespace="html" name="h3">
      <apply-templates/>
    </element>
  </template>

XSLT : Creating and copying elements (2/2) • Source document elements can also be selected and copied out to the destination document using the Copy Of element, which uses a Select attribute to identify the document fragment or set of elements to be reproduced at the current position.
  <template match="body">
    <body>
      <copy-of select="//h1 | //h2" />
      <apply-templates/>
    </body>
  </template>
