XML

Today we will
• Learn what XML is and where it comes from
• Learn how to parse and create XML documents
• See the difference between the SAX and DOM (Document Object Model) APIs
• Learn about XML validation through DTDs (Document Type Definitions) and XML Schema
• Learn how to use XSLT style sheets for transforming XML into a presentation form such as HTML

XML history
• Has its roots in variants of SGML (Standard Generalized Markup Language), which became an international standard in 1986.
• SGML is a complex tagging language. HTML was inspired by SGML, and by adding the <A> tag the hyperlink was born.
• HTML was easy and flexible for sharing structured information with embedded hyperlinks, but HTML was targeted at human interpretation.
• During 1999 the process of merging HTML and XML began.
• XHTML 1.0, a reformulation of HTML 4 in XML, became a W3C Recommendation in 2000.
• An XHTML document is also an XML document.
• XML is targeted towards machine interpretation.

XML and stylesheets
• XML is a markup language.
• It uses tags to identify the components of the document.
• It does not imply how the components should be presented.
• It is all about data structure.
• Presentation details are left for the stylesheets to define.
• A markup language states which parts of the text are 1st level headings, and the stylesheet defines how these headings should look.

HTML Example
    <html>
      <body>
        <h1>Heading Text Goes Here</h1>
        <p>This is a paragraph with some <b>boldfaced</b> text
           as well as some text that forms a list
        <ul>
          <li>First list item
          <li>Second item
        </ul>
      </body>
    </html>
• The presentation of HTML can be changed by stylesheets.
• In HTML all tags have a predefined “meaning”. You cannot define your own tags.

XML
• Can be used to mark up just about any information.
• The name stands for Extensible Markup Language.
• The plain-text structure makes it “portable”, since it may be edited in any simple text editor.
• Standards committees try to define suites of XML tags/structures that fit the needs of particular business branches.
• Just as Java promises portable programs, XML promises portable data structures.

XML structure
• An XML element is the combination of an opening tag, a closing tag, and all the data in between.
    <tag> some text in between</tag>
• If the element is empty, the opening and closing tags can be collapsed into one:
    <tag/>
• Data appearing within an element may contain other tags.
• Proper nesting: a nested element must be closed before its containing element is closed.
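
The nesting rule can be checked mechanically. The following is a minimal sketch, assuming the standard JAXP parser shipped with the JDK; the class name WellFormedCheck and the sample strings are made up for illustration. It treats any parse error as "not well-formed".

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original slides.
class WellFormedCheck {

    /** Returns true if the given text parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler stays silent but still rethrows fatal errors,
            // so problems surface as exceptions instead of stderr output.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the markup
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Properly nested: <inner> is closed before <tag>
        System.out.println(isWellFormed("<tag><inner>text</inner></tag>"));
        // Improperly nested: <tag> is closed while <inner> is still open
        System.out.println(isWellFormed("<tag><inner>text</tag></inner>"));
    }
}
```

The first call should report true and the second false, since the second string closes the outer element before the inner one.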

Using attributes or nested elements
• Any attribute may be included in the opening tag of an XML element.
    <tag1 height="12.1" length="7"> some text </tag1>
• The same values may also be represented as nested elements.
    <tag1><height>12.1</height><length>7</length> some text </tag1>
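
To make the difference concrete, here is a small sketch of how an application might read the height value from either form with DOM. The class and method names are illustrative assumptions, not taken from the slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original slides.
class AttributeVsElement {

    /** Reads "height" from the root element, as attribute or as nested element. */
    static String heightOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            if (root.hasAttribute("height")) {
                return root.getAttribute("height");      // attribute form
            }
            NodeList nested = root.getElementsByTagName("height");
            return nested.getLength() > 0
                    ? nested.item(0).getTextContent()    // nested-element form
                    : null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(heightOf(
                "<tag1 height=\"12.1\" length=\"7\"> some text </tag1>"));
        System.out.println(heightOf(
                "<tag1><height>12.1</height><length>7</length> some text </tag1>"));
    }
}
```

Both calls yield the same value; which form to choose is a design decision, and attributes are often used for simple metadata such as units.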

Namespaces
• With many user-defined element and tag names there is a considerable risk of name conflicts.
• This is handled through namespaces, similar to namespaces in C++ and other languages.
• A namespace is typically declared in the root element of the document:
    <tagRoot xmlns="http://www.defaulttags.com/tags"
             xmlns:xyz="http://www.xyztags.com/tags">
      <xyz:tag1>some text here</xyz:tag1>
      <tag1>some other text here</tag1>
    </tagRoot>
• Above we have specified the default namespace and the xyz namespace.
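
As a rough illustration, a namespace-aware DOM parser can tell the two tag1 elements apart by their namespace URI. This sketch reuses the example document from the slide; the class name NamespaceDemo is an assumption for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original slides.
class NamespaceDemo {
    static final String XML =
            "<tagRoot xmlns=\"http://www.defaulttags.com/tags\""
          + " xmlns:xyz=\"http://www.xyztags.com/tags\">"
          + "<xyz:tag1>some text here</xyz:tag1>"
          + "<tag1>some other text here</tag1>"
          + "</tagRoot>";

    /** Returns the text of the first tag1 element in the given namespace, or null. */
    static String tag1Text(String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList hits = doc.getElementsByTagNameNS(namespaceUri, "tag1");
            return hits.getLength() > 0 ? hits.item(0).getTextContent() : null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tag1Text("http://www.xyztags.com/tags"));
        System.out.println(tag1Text("http://www.defaulttags.com/tags"));
    }
}
```

Note that the prefix xyz is only a local shorthand; the lookup is done on the namespace URI.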

Well formed and Valid?
• An XML document must always be well-formed,
• but a well-formed document may still not be a valid XML document.
• There may be additional rules that require tags to be used in a predefined order.
• Some tags or attributes may be mandatory.
• A well-formed but invalid XML document is like a syntactically correct (compilable) program that executes improperly.

XML Example (complete file)
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="dinosaurs.xsl"?>
    <!DOCTYPE DinoList SYSTEM "dinosaurs.dtd">
    <DinoList>
      <Dinosaur period="Late Cretaceous">
        <Name>Tyrannosaurus Rex</Name>
        <Group>Carnosaur</Group>
        <Range>
          <Region>Europe</Region>
          <Region>North America</Region>
        </Range>
        <PhysicalAttr>
          <Length unit="feet">39</Length>
          <Weight unit="tons">6</Weight>
        </PhysicalAttr>
      </Dinosaur>
      <Dinosaur period="Late Jurassic">
        <Name>Stegosaurus</Name>
        <Group>Stegosaur</Group>
        <Range>
          <Region>Europe</Region>
          <Region>Asia</Region>
          <Region>North America</Region>
        </Range>
        <PhysicalAttr>
          <Length unit="metres">9</Length>
          <Weight unit="kgs">3100</Weight>
        </PhysicalAttr>
      </Dinosaur>
    </DinoList>

XML Example continued
• The top row identifies the XML version.
• DinoList is the top-level element.
• The top-level element is called the “root” element. There should be only one root element in a document.
• The sub-elements are quite straightforward to understand…
• Notice the tree structure of elements and sub-elements.
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="dinosaurs.xsl"?>
    <!DOCTYPE DinoList SYSTEM "dinosaurs.dtd">
    <DinoList>
      <Dinosaur period="Late Cretaceous">
        <Name>Tyrannosaurus Rex</Name>
        <Group>Carnosaur</Group>
        <Range>
          <Region>Europe</Region>
          <Region>North America</Region>
        </Range>
        <PhysicalAttr>
          <Length unit="feet">39</Length>
          <Weight unit="tons">6</Weight>
        </PhysicalAttr>
    Etc...

JAXP, SAX and DOM
• JAXP (Java API for XML Processing) is the official API for XML processing from Sun.
• It contains SAX, DOM and XML Schema support.
• Packages: javax.xml.*, org.w3c.dom.*, org.xml.sax.*
• Both SAX and DOM are language-independent APIs for processing XML documents.
• Many hope that JDOM (org.jdom.*) will be part of JAXP in the future.

SAX
• … is an event-based API for XML processing.
• An XML tree is not viewed as a data structure, but as a stream of events generated by the parser.
• … reports parsing events (such as the start and end of elements) directly to the application through callbacks.
• The application implements handlers to deal with the different events, much like handling events in a graphical user interface.
• … is efficient when we are only interested in a subset of the entire XML source.
• When you only need one pass over the XML source, and are not interested in building up a complete in-memory tree representation, SAX may be the preferred choice.
• If only a fraction of the document needs to be processed, or if the document is very large compared to internal memory, SAX is more efficient than DOM.
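
The callback style described above can be sketched as follows. This is an illustrative handler, not code from the slides: the class name SaxNameCollector and the inline sample document are assumptions. It collects the text of every <Name> element in a single pass without building a tree.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of the original slides.
class SaxNameCollector extends DefaultHandler {
    private final List<String> names = new ArrayList<>();
    private StringBuilder text; // non-null only while inside a <Name> element

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (qName.equals("Name")) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append rather than assign
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("Name")) {
            names.add(text.toString());
            text = null;
        }
    }

    /** Parses the XML text and returns the content of all <Name> elements. */
    static List<String> collectNames(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            SaxNameCollector handler = new SaxNameCollector();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<DinoList>"
                + "<Dinosaur><Name>Tyrannosaurus Rex</Name><Group>Carnosaur</Group></Dinosaur>"
                + "<Dinosaur><Name>Stegosaurus</Name><Group>Stegosaur</Group></Dinosaur>"
                + "</DinoList>";
        System.out.println(collectNames(xml));
    }
}
```

The handler keeps only the state it needs (the current text buffer and the result list), which is why SAX can cope with documents much larger than memory.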

DOM
• … is a tree-based API.
• A DOM parser automatically maps an XML document into an internal tree structure, which allows an application to navigate that tree with random access.
• It therefore often consumes more resources than SAX.
• It is more convenient when performance is not an issue.

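A DOM parser
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.w3c.dom.Text;
    import org.xml.sax.InputSource;

    public class DOMParser {
        public static void main(String[] args) throws Exception {
            // Parse the whole file into an in-memory DOM tree
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder parser = factory.newDocumentBuilder();
            Document document = parser.parse(new InputSource("dinosaurs.xml"));
            Element dinoList = document.getDocumentElement();
            NodeList dinosaurs = dinoList.getElementsByTagName("Dinosaur");
            Element currElement = null;
            String groupName = null;
            // Scan every <Dinosaur> for the one named "Dilophosaurus";
            // groupName stays null if no such entry exists
            for (int i = 0; i < dinosaurs.getLength(); i++) {
                currElement = (Element) dinosaurs.item(i);
                String nameValue = getSimpleElementText(currElement, "Name");
                if (nameValue.equals("Dilophosaurus")) {
                    groupName = getSimpleElementText(currElement, "Group");
                }
            }
            System.out.println("Dilophosaurus group: " + groupName);
        }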

A DOM parser
        /**
         * Method to return the first element of a specified
         * name from the given element
         */
        public static Element getFirstElement(Element element, String name) {
            NodeList nl = element.getElementsByTagName(name);
            if (nl.getLength() < 1) {
                throw new RuntimeException("Element: " + element + " does not contain: " + name);
            }
            return (Element) nl.item(0);
        }

        /**
         * Method to return the text contained within the
         * element with the given name found within the
         * specified element
         */
        public static String getSimpleElementText(Element node, String name) {
            Element nameEl = getFirstElement(node, name);
            Node textNode = nameEl.getFirstChild();
            if (textNode instanceof Text) {
                return textNode.getNodeValue();
            } else {
                throw new RuntimeException("No text in " + name);
            }
        }
    }
This code shows how to scan through a DOM data structure to find specific information.

JDOM vs DOM
• DOM is a technology independent of programming language.
• It doesn’t utilize the Java Collections framework.
• JDOM is a free third-party, Java-specific alternative to DOM that utilizes the Collections framework.
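
The practical consequence for plain DOM is that a NodeList is not a java.util.List, so it cannot be used directly in a for-each loop or with the Collections utilities. One possible workaround, sketched here with only JDK classes (the class NodeListAdapter and its methods are hypothetical names), is to copy the nodes into a List first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical adapter, not part of the original slides.
class NodeListAdapter {

    /** Copies the element nodes of a DOM NodeList into a java.util.List. */
    static List<Element> elements(NodeList nodes) {
        List<Element> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            if (node instanceof Element) {
                result.add((Element) node);
            }
        }
        return result;
    }

    /** Returns the text of every <Region> element in the given XML text. */
    static List<String> regionNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            // With a List, the ordinary for-each loop becomes available
            for (Element region : elements(doc.getElementsByTagName("Region"))) {
                names.add(region.getTextContent());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(regionNames(
                "<Range><Region>Europe</Region><Region>Asia</Region></Range>"));
    }
}
```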

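Generating XML content with DOM
    import java.io.FileOutputStream;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.Result;
    import javax.xml.transform.Source;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.DOMImplementation;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Text;

    public class DOMPrinter {
        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            DOMImplementation domImpl = builder.getDOMImplementation();
            // Create an empty document whose root element is <tagRoot>
            Document document = domImpl.createDocument(null, "tagRoot", null);
            Element root = document.getDocumentElement();
            root.setAttribute("testAttr", "testValue");
            // <tag1>sample text</tag1> as a child of the root
            Element tag1Element = document.createElement("tag1");
            Text tag1Text = document.createTextNode("sample text");
            tag1Element.appendChild(tag1Text);
            root.appendChild(tag1Element);
            Element tag2Element = document.createElement("tag2");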

Generating XML content with DOM
            // <tag2>more text</tag2> nested inside <tag1>
            Text tag2Text = document.createTextNode("more text");
            tag2Element.appendChild(tag2Text);
            tag1Element.appendChild(tag2Element);
            // An empty <tag3/> as a second child of the root
            Element tag3Element = document.createElement("tag3");
            root.appendChild(tag3Element);
            // An identity transformer serializes the DOM tree to a file
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer transformer = tf.newTransformer();
            Source source = new DOMSource(document);
            // try-with-resources closes the file even if the transform fails
            try (FileOutputStream fos = new FileOutputStream("tags.xml")) {
                Result output = new StreamResult(fos);
                transformer.transform(source, output);
            }
        }
    }

Output (shown with line breaks and indentation for readability):
    <?xml version="1.0" encoding="UTF-8"?>
    <tagRoot testAttr="testValue">
      <tag1>sample text
        <tag2>more text</tag2>
      </tag1>
      <tag3/>
    </tagRoot>
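
By default the transformer does not add whitespace between elements. If indented output is wanted, one option is to set the standard INDENT output property. The sketch below is an illustration under that assumption; the class name IndentedOutput is made up, it writes to a string instead of tags.xml, and the exact whitespace produced depends on the XSLT implementation in use.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of the original slides.
class IndentedOutput {

    /** Builds the same small tree as DOMPrinter and serializes it to a string. */
    static String render(boolean indent) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = document.createElement("tagRoot");
            root.setAttribute("testAttr", "testValue");
            document.appendChild(root);

            Element tag1 = document.createElement("tag1");
            tag1.appendChild(document.createTextNode("sample text"));
            root.appendChild(tag1);

            Element tag2 = document.createElement("tag2");
            tag2.appendChild(document.createTextNode("more text"));
            tag1.appendChild(tag2);

            root.appendChild(document.createElement("tag3"));

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> line to keep the demo output short
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");

            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render(false));
        System.out.println(render(true));
    }
}
```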

Validating
• Document Type Definition (DTD)
• XML Schema
• DTD or XML Schema?
• DTD is well established and simple.
• XML Schema is more elaborate and may have a bright future.

Example: A DTD for the dinosaurs
    <?xml version='1.0' encoding="UTF-8"?>
    <!ELEMENT DinoList (Dinosaur+)>
    <!ELEMENT Dinosaur (Name,Group,Range,PhysicalAttr)>
    <!ATTLIST Dinosaur period CDATA #IMPLIED>
    <!ELEMENT Group (#PCDATA)>
    <!ELEMENT Height (#PCDATA)>
    <!ATTLIST Height unit CDATA #IMPLIED>
    <!ELEMENT Length (#PCDATA)>
    <!ATTLIST Length unit CDATA #IMPLIED>
    <!ELEMENT Name (#PCDATA)>
    <!ELEMENT PhysicalAttr (Height?,Length?,Weight?)>
    <!ELEMENT Range (Region+)>
    <!ELEMENT Region (#PCDATA)>
    <!ELEMENT Weight (#PCDATA)>
    <!ATTLIST Weight unit CDATA #IMPLIED>

RegExp repetition
• “*”, “+”, “?”
• “*” = 0 or more
• “+” = 1 or more
• “?” = 0 or 1

Comments on the DTD dinosaur example
• A <Dinosaur> element has exactly one <Name>, <Group>, <Range> and <PhysicalAttr>, IN THAT ORDER.
• CDATA – Character data, often used in attributes.
• PCDATA – Parsed character data, often used inside elements (in mixed content it may be combined with other elements).
• To apply a DTD to an XML file we must modify the XML file header.
• Insert <!DOCTYPE DinoList SYSTEM "dinosaurs.dtd">
• This identifies DinoList as the root element.

Validation - Finally activate validation
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Ask the parser to check the document against its DTD
        factory.setValidating(true);
        DocumentBuilder parser = factory.newDocumentBuilder();
        // Validation problems are reported to this handler
        parser.setErrorHandler(new ParserErrorHandler());
        Document document = parser.parse(new InputSource("dinosaurs.xml"));
        Element dinoList = document.getDocumentElement();
        NodeList dinosaurs = dinoList.getElementsByTagName("Dinosaur");
        Element currElement = null;
        String groupName = null;
        for (int i = 0; i < dinosaurs.getLength(); i++) {
            currElement = (Element) dinosaurs.item(i);
            String nameValue = getSimpleElementText(currElement, "Name");
            if (nameValue.equals("Dilophosaurus")) {
                groupName = getSimpleElementText(currElement, "Group");
            }
        }
        System.out.println("Dilophosaurus group: " + groupName);
    }
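
The slides do not show the ParserErrorHandler class. As a rough idea of what such a handler could look like, here is a sketch with a hypothetical CollectingErrorHandler that records validation problems. To keep the example self-contained it uses a shortened DTD embedded in the document (an internal subset) instead of dinosaurs.dtd; all names here are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class, not part of the original slides.
class DtdValidationDemo {

    /** Shortened DTD, embedded in the document as an internal subset. */
    static final String DOCTYPE =
            "<!DOCTYPE DinoList ["
          + "<!ELEMENT DinoList (Dinosaur+)>"
          + "<!ELEMENT Dinosaur (Name,Group)>"
          + "<!ELEMENT Name (#PCDATA)>"
          + "<!ELEMENT Group (#PCDATA)>"
          + "]>";

    /** Records warnings and validation errors; rethrows fatal errors. */
    static class CollectingErrorHandler implements ErrorHandler {
        final List<String> problems = new ArrayList<>();

        @Override
        public void warning(SAXParseException e) {
            problems.add("warning: " + e.getMessage());
        }

        @Override
        public void error(SAXParseException e) {
            // Validity violations arrive here; parsing continues afterwards
            problems.add("error: " + e.getMessage());
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXParseException {
            throw e; // not well-formed: parsing cannot continue
        }
    }

    /** Validates DOCTYPE + body and returns the number of reported problems. */
    static int countProblems(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder parser = factory.newDocumentBuilder();
            CollectingErrorHandler handler = new CollectingErrorHandler();
            parser.setErrorHandler(handler);
            parser.parse(new InputSource(new StringReader(DOCTYPE + body)));
            return handler.problems.size();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Name before Group, as the DTD requires
        System.out.println(countProblems("<DinoList><Dinosaur>"
                + "<Name>Stegosaurus</Name><Group>Stegosaur</Group>"
                + "</Dinosaur></DinoList>"));
        // Group before Name: well-formed, but not valid
        System.out.println(countProblems("<DinoList><Dinosaur>"
                + "<Group>Stegosaur</Group><Name>Stegosaurus</Name>"
                + "</Dinosaur></DinoList>"));
    }
}
```

The second document still parses into a DOM tree; it is the handler's job to decide whether validity errors should abort processing.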

XML Schema
• File extension: *.xsd
• An alternative to DTDs.
• Is itself an XML document.
• It includes the full capabilities of DTDs, so that existing DTDs can be converted to XML Schema.
• XML Schemas have additional capabilities compared to DTDs.
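
One such additional capability is data typing. The following sketch uses the javax.xml.validation API from the JDK with a tiny schema written for this illustration (it is not the dinosaur schema from the slides): the Weight element must contain a decimal number, a constraint a DTD cannot express.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original slides.
class SchemaValidationDemo {

    /** Illustrative schema: <Weight unit="..."> must hold a decimal number. */
    static final String XSD =
            "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">"
          + "<xsd:element name=\"Weight\">"
          + "<xsd:complexType><xsd:simpleContent>"
          + "<xsd:extension base=\"xsd:decimal\">"
          + "<xsd:attribute name=\"unit\" type=\"xsd:string\"/>"
          + "</xsd:extension>"
          + "</xsd:simpleContent></xsd:complexType>"
          + "</xsd:element>"
          + "</xsd:schema>";

    /** Returns true if the XML text is valid according to XSD. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory schemaFactory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = schemaFactory.newSchema(
                    new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            try {
                validator.validate(new StreamSource(new StringReader(xml)));
                return true;
            } catch (SAXException e) {
                return false; // the document violates the schema
            }
        } catch (SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<Weight unit=\"tons\">6</Weight>"));
        System.out.println(isValid("<Weight unit=\"tons\">heavy</Weight>"));
    }
}
```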

XML Schema example
    <?xml version="1.0" encoding="UTF-8"?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="DinoList">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element maxOccurs="unbounded" minOccurs="1" ref="Dinosaur"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
      <xsd:element name="Dinosaur">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element ref="Name"/>
            <xsd:element ref="Group"/>
            <xsd:element ref="Range"/>
            <xsd:element ref="PhysicalAttr"/>
          </xsd:sequence>
          <xsd:attribute name="period" type="xsd:string" use="optional"/>
        </xsd:complexType>
      </xsd:element>
      <xsd:element name="Group" type="xsd:string"/>

Transforming XML into other forms
• Common target: HTML documents.
• By using different stylesheets (XSL) we can present the same data to a web browser, a mobile phone and a PDA.

Example
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:output method="html"/>
      <xsl:template match="/">
        <html><head><title>Dinosaurs!</title></head>
        <body><h1>Dinosaurs!</h1>
          <xsl:apply-templates select="DinoList/Dinosaur"/>
        </body></html>
      </xsl:template>
      <xsl:template match="Dinosaur">
        <h2><xsl:value-of select="Name"/></h2>
        <table border="1" width="400" cellpadding="5">
          <tr>
            <th>Period</th>
            <td><xsl:value-of select="@period"/></td>
          </tr>
          <tr>
            <th>Group</th>
            <td><xsl:value-of select="Group"/></td>
          </tr>
          <xsl:apply-templates select="Range"/>
          <xsl:apply-templates select="PhysicalAttr"/>
        </table>
      </xsl:template>
      <xsl:template match="Range">
        <tr>
          <th>Range</th>
          <td>
            <ul>
              <xsl:for-each select="Region">
                <li><xsl:value-of select="."/></li>
              </xsl:for-each>
            </ul>
          </td>
        </tr>
      </xsl:template>
      <xsl:template match="PhysicalAttr">
        <xsl:if test="Height">
          <tr>
            <th>Height</th>
            <td>
              <xsl:value-of select="Height"/>
              <xsl:text disable-output-escaping="yes">   </xsl:text>
              <xsl:value-of select="Height/@unit"/>
            </td>
          </tr>
        </xsl:if>
    etc

Stylesheet transforms… • Tree structured starting with the root node. • See Figure with guiding comments.
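
A stylesheet like the one in the example can also be applied from Java through the same Transformer API that was used for writing DOM trees. The sketch below assumes a much reduced stylesheet and input document, both made up for illustration; the class name XsltDemo is hypothetical as well.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class, not part of the original slides.
class XsltDemo {

    /** Reduced stylesheet: one <h2> heading per dinosaur name. */
    static final String XSL =
            "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">"
          + "<xsl:output method=\"html\"/>"
          + "<xsl:template match=\"/\">"
          + "<html><body><xsl:apply-templates select=\"DinoList/Dinosaur\"/></body></html>"
          + "</xsl:template>"
          + "<xsl:template match=\"Dinosaur\">"
          + "<h2><xsl:value-of select=\"Name\"/></h2>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    /** Applies XSL to the given XML text and returns the generated HTML. */
    static String toHtml(String xml) {
        try {
            // Passing a stylesheet source yields a transformer that applies it
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHtml(
                "<DinoList><Dinosaur><Name>Stegosaurus</Name></Dinosaur></DinoList>"));
    }
}
```

Processing starts at the template matching the root node "/", which then hands each Dinosaur element to its own template, mirroring the tree structure of the input.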
