Processing XML with Java

Representation and Management of Data on the Internet

XML • XML is eXtensible Markup Language • It is a metalanguage: • A language used to describe other languages using “markup” tags that describe properties of the data • Designed to be structured • Strict rules about how data can be formatted • Designed to be extensible • Can define own terms and markup

XML Family • XML is an official recommendation of the W3C • Aims to accomplish what HTML cannot and be simpler to use and implement than SGML • [Figure: the markup family SGML, XML, HTML, XHTML]

The Essence of XML • Syntax: The permitted arrangement or structure of letters and words in a language as defined by a grammar (XML) • Semantics: The meaning of letters or words in a language • XML uses syntax to add semantics to documents

Using XML • In XML there is a separation of the content from the display • XML can be used for: • Data representation • Data exchange

Databases and XML • Database content can be presented in XML • XML processor can access DBMS or file system and convert data to XML • Web server can serve content as either XML or HTML

HTML vs. XML • HTML tolerates improper nesting; XML requires proper nesting • HTML allows start tags without end tags, like <br>; in XML, empty tags must have a trailing slash, as in <br/> • HTML allows unquoted attribute values; XML requires quoted attribute values • HTML is case insensitive; XML is case sensitive • In HTML whitespace is ignored; in XML whitespace is important • An HTML document begins with <html>; an XML document begins with <?xml version='1.0' ?>

HTML vs. XML • HTML has a well-defined set of tags; in XML you can use any tag you like • HTML tags have a known meaning; XML tags have no known meaning

Some Things in Common • Comments are allowed: <!-- comment --> • Special characters must be escaped (e.g., &gt; for >)

Processing XML – The Idea

Sample Document
<transaction>
    <account>89-344</account>
    <buy shares="100">
        <ticker exch="NASDAQ">WEBM</ticker>
    </buy>
    <sell shares="30">
        <ticker exch="NYSE">GE</ticker>
    </sell>
</transaction>

DOM Parser • DOM = Document Object Model • Parser creates a tree object out of the document • User accesses data by traversing the tree • The API allows for constructing, accessing and manipulating the structure and content of XML documents

Document as Tree • Methods like: getRoot, getChildren, getAttributes, etc.
transaction
    account
        89-344
    buy (attribute shares = 100)
        ticker (attribute exch = NASDAQ)
            WEBM
    sell (attribute shares = 30)
        ticker (attribute exch = NYSE)
            GE
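The tree view can be tried out with a short Java sketch. It uses the standard JAXP classes (javax.xml.parsers) and the org.w3c.dom interfaces rather than the getRoot/getChildren names on the slide, which are only indicative. The class name DomTickers and the method tickers are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTickers {
    // The sample transaction document, written with plain ASCII quotes.
    static final String SAMPLE =
        "<transaction><account>89-344</account>"
        + "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>"
        + "<sell shares=\"30\"><ticker exch=\"NYSE\">GE</ticker></sell>"
        + "</transaction>";

    // Builds the DOM tree, then reads every <ticker> as "exchange:symbol".
    static String tickers(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList list = doc.getElementsByTagName("ticker");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < list.getLength(); i++) {
                Element ticker = (Element) list.item(i);
                if (i > 0) out.append(", ");
                out.append(ticker.getAttribute("exch")).append(':')
                   .append(ticker.getTextContent());
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tickers(SAMPLE));
    }
}
```

Because the whole tree stays in memory, the same Document could be queried again without re-reading the input.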

Advantages and Disadvantages • Advantages: • Natural and relatively easy to use • Can repeatedly traverse tree • Disadvantages: • High memory requirements – the whole document is kept in memory • Must parse the whole document before use

SAX Parser • SAX = Simple API for XML • Parser creates “events” while reading the document • Parser calls methods (that you write) to deal with the events • Similar to an I/O stream: it goes in one direction only

 End tag: account Start tag: transaction Start tag: account Text: 89-344 Value: 100  Start tag: buy Attribute: shares Document as Events <transaction> <account>89-344</account> <buy shares=“100”> <ticker exch=“NASDAQ”>WEBM</ticker> </buy> <sell shares=“30”> <ticker exch=“NYSE”>GE</ticker> </sell> </transaction>

Advantages and Disadvantages • Advantages: • Requires little memory • Fast • Disadvantages: • Cannot reread • Less natural for object oriented programmers (perhaps)

Which should we use? DOM vs. SAX • If your document is very large and you only need a few elements - use SAX • If you need to manipulate (i.e., change) the XML - use DOM • If you need to access the XML many times - use DOM

XML Parsers

XML Parsers • There are several different ways to categorise parsers: • Validating versus non-validating parsers • DOM parsers versus SAX parsers • Parsers written in a particular language (Java, C++, Perl, etc.)

Validating Parsers • A validating parser makes sure that the document conforms to the specified DTD • This is time consuming, so a non-validating parser is faster
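The difference can be seen with a small sketch, assuming the JDK's built-in JAXP parser. The document below is well formed but violates its own internal DTD. The class name ValidationDemo and the sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ValidationDemo {
    // The DTD allows only <to> inside <note>, but the document uses <from>.
    static final String INVALID =
        "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>"
        + "<note><from>me</from></note>";

    // Returns how many validity errors the parser reported.
    static int countErrors(String xml, boolean validating) {
        final int[] errors = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors[0]++; // validity errors are recoverable, parsing continues
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(validating);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors[0];
    }
}
```

With validating set to false the same document is accepted silently, since it is still well formed.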

Using an XML Parser • Three basic steps • Create a parser object • Pass the XML document to the parser • Process the results • Generally, writing out XML is not in the scope of parsers (though some may implement proprietary mechanisms)

SAX – Simple API for XML

The SAX Parser • SAX parser is an event-driven API • An XML document is sent to the SAX parser • The XML file is read sequentially • The parser notifies the application when events happen, including errors • The events are handled by callback methods that the programmer implements

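SAX Interfaces • ContentHandler: handles document events (start tag, end tag, etc.) • XMLReaderFactory: used to create a SAX parser • ErrorHandler: handles parser errors • DTDHandler and EntityResolver: handle DTDs and entities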

Problem • The SAX interface is an accepted standard • There are many implementations • We would like to be able to change the implementation used without changing any code in the program • How is this done?

Factory Design Pattern • Have a “Factory” class that creates the actual Parsers. • The Factory checks the value of a system property that states which implementation should be used • In order to change the implementation, simply change the system property
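The pattern itself can be sketched without any XML library. Everything in the following example is hypothetical (the TinyParser interface, the two implementations, and the property name demo.parser.impl); it only illustrates how a factory can pick a class from a system property.

```java
class ParserFactoryDemo {
    // Hypothetical parser abstraction with two interchangeable implementations.
    interface TinyParser { String name(); }

    static class DefaultParser implements TinyParser {
        public String name() { return "default"; }
    }

    static class AlternativeParser implements TinyParser {
        public String name() { return "alternative"; }
    }

    // Hypothetical property that names the implementation class to use.
    static final String PROPERTY = "demo.parser.impl";

    // The factory: callers never mention a concrete class.
    static TinyParser createParser() {
        String className = System.getProperty(PROPERTY, DefaultParser.class.getName());
        try {
            return (TinyParser) Class.forName(className)
                .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create " + className, e);
        }
    }
}
```

Switching implementations then requires only a different property value, for example on the command line with -Ddemo.parser.impl=..., while the calling code stays unchanged.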

Creating a SAX Parser • Import the following packages: • org.xml.sax.*; • org.xml.sax.helpers.*; • Set the following system property: • System.setProperty("org.xml.sax.driver", "org.apache.xerces.parsers.SAXParser"); • Create the instance from the Factory: • XMLReader reader = XMLReaderFactory.createXMLReader();

Receiving Parsing Information • A SAX Parser calls methods such as “startDocument”, “startElement”, etc., as it runs • In order to react to such events we must: • implement the ContentHandler interface • set the parser’s content handler with an instance of our class

ContentHandler
// Methods (partial list)
public void startDocument();
public void endDocument();
public void characters(char[] ch, int start, int length);
public void startElement(String namespaceURI, String localName,
                         String qName, Attributes atts);
public void endElement(String namespaceURI, String localName, String qName);

Namespaces and Element Names
<?xml version='1.0' encoding='utf-8'?>
<forsale date="12/2/03" xmlns:xhtml = "urn:http://www.w3.org/1999/xhtml">
    <book>
        <title> <xhtml:em> DBI: </xhtml:em> The Course I Wish I never Took </title>
        <comment> My <xhtml:b> favorite </xhtml:b> book! </comment>
    </book>
</forsale>

Namespaces and Element Names (same document) • For <book>: namespaceURI = "", localName = book, qName = book • For <xhtml:em>: namespaceURI = urn:http://www.w3.org/1999/xhtml, localName = em, qName = xhtml:em
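The three name parts can be inspected with a short sketch. It assumes the JDK's JAXP parser with namespace awareness switched on; the class name NameReporter and the "uri|localName|qName" output format are invented here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class NameReporter extends DefaultHandler {
    final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts) {
        // Record the three name parts of every element as "uri|localName|qName".
        names.add(namespaceURI + "|" + localName + "|" + qName);
    }

    static List<String> report(String xml) {
        NameReporter handler = new NameReporter();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP factories are not namespace aware by default
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.names;
    }
}
```

Elements without a prefix and without a default namespace declaration are reported with an empty namespaceURI.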

Receiving Parsing Information (cont.) • An easy way to implement the ContentHandler interface is to extend DefaultHandler, which implements this interface (and a few others) with empty methods • To actually parse a document, create an InputSource from the document and supply the input source to the parse method of the XMLReader

import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class InfoWithSax extends DefaultHandler {

    public static void main(String[] args) {
        // Tell the factory which SAX implementation to create.
        System.setProperty("org.xml.sax.driver",
                           "org.apache.xerces.parsers.SAXParser");
        try {
            XMLReader reader = XMLReaderFactory.createXMLReader();
            reader.setContentHandler(new InfoWithSax());
            reader.parse(new InputSource(new FileReader(args[0])));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void startDocument() throws SAXException {
        System.out.println("START DOCUMENT");
    }

    public void endDocument() throws SAXException {
        System.out.println("END DOCUMENT");
    }

    int depth;            // current nesting depth, used for indentation
    String indent = " ";

    // Prints one event, indented according to the current depth.
    private void println(String header, String value) {
        for (int i = 0; i < depth; i++) System.out.print(indent);
        System.out.println(header + ": " + value);
    }

    public void characters(char buf[], int offset, int len)
            throws SAXException {
        String s = (new String(buf, offset, len)).trim();
        if (!"".equals(s)) println("CHARACTERS", s);
    }

    public void endElement(String namespaceURI, String localName,
                           String name) throws SAXException {
        depth--;
        String elementName = name;
        if (!"".equals(namespaceURI) && !"".equals(localName))
            elementName = namespaceURI + ":" + localName;
        println("END ELEMENT", elementName);
    }

    public void startElement(String namespaceURI, String localName,
                             String name, Attributes attrs)
            throws SAXException {
        String elementName = name;
        if (!"".equals(namespaceURI) && !"".equals(localName))
            elementName = namespaceURI + ":" + localName;
        println("START ELEMENT", elementName);
        if (attrs != null && attrs.getLength() > 0) {
            for (int i = 0; i < attrs.getLength(); i++)
                println("ATTRIBUTE",
                        attrs.getLocalName(i) + "=" + attrs.getValue(i));
        }
        depth++;
    }
}

Bachelor Tags • What do you think happens when the parser parses a bachelor tag? <rating stars="five" />
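One way to find out is to log the callbacks. The sketch below assumes the JDK's JAXP parser; the class name EmptyTagDemo and the log format are made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EmptyTagDemo extends DefaultHandler {
    final StringBuilder log = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        log.append("start:").append(qName).append(' ');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        log.append("end:").append(qName).append(' ');
    }

    // Returns the element events seen while parsing, separated by spaces.
    static String trace(String xml) {
        EmptyTagDemo handler = new EmptyTagDemo();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.log.toString().trim();
    }
}
```

A bachelor tag such as <rating stars="five" /> produces both a startElement and an endElement call, exactly like <rating stars="five"></rating>, so the handler cannot tell the two forms apart.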

Attributes Interface • Elements may have attributes • There is no distinction between attributes that are written explicitly in the document and those that are supplied by the DTD (with a default value)

Attributes Interface (cont.) • int getLength(); • String getQName(int i); • String getType(int i); • String getValue(int i); • String getType(String qname); • String getValue(String qname); • etc.

Attribute Types • The following are possible types for attributes: • "CDATA", • "ID", • "IDREF", "IDREFS", • "NMTOKEN", "NMTOKENS", • "ENTITY", "ENTITIES", • "NOTATION"
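A sketch of the Attributes interface in use, assuming the JDK's JAXP parser. The sample element carries one explicit attribute and one that exists only as a DTD default. The class name AttributeDump and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AttributeDump extends DefaultHandler {
    // exch is written in the document; shares comes only from the DTD default.
    static final String SAMPLE =
        "<!DOCTYPE buy [<!ATTLIST buy shares CDATA \"10\">]><buy exch=\"NYSE\"/>";

    // Maps each attribute name to "type=value".
    final Map<String, String> seen = new TreeMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            seen.put(atts.getQName(i), atts.getType(i) + "=" + atts.getValue(i));
        }
    }

    static Map<String, String> dump(String xml) {
        AttributeDump handler = new AttributeDump();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.seen;
    }
}
```

Both attributes arrive through the same interface; an attribute without a declaration is reported with type "CDATA".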

Setting Features • It is possible to set the features of a parser using the setFeature method. • Examples: • reader.setFeature("http://xml.org/sax/features/namespaces", true) • reader.setFeature("http://xml.org/sax/features/validation", false) • For a full list, see: http://www.saxproject.org/?selected=get-set
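A small sketch of setting and reading back features. It obtains the XMLReader through the JDK's JAXP factory instead of XMLReaderFactory; the class name FeatureDemo and the made-up URI used to provoke an unknown-feature error are assumptions of the example.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.XMLReader;

class FeatureDemo {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    static XMLReader newReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Sets a feature and reads it back; returns null if the parser
    // does not recognize the feature name at all.
    static Boolean setAndGet(String feature, boolean value) {
        XMLReader reader = newReader();
        try {
            reader.setFeature(feature, value);
            return reader.getFeature(feature);
        } catch (SAXNotRecognizedException e) {
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Feature names are plain URIs used as identifiers; the parser does not fetch them.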

ErrorHandler Interface • We implement ErrorHandler to receive error events (similar to implementing ContentHandler) • DefaultHandler implements ErrorHandler with empty methods, so we can extend it (as before) • An ErrorHandler is registered with • reader.setErrorHandler(handler); • Three methods: • void error(SAXParseException ex); • void fatalError(SAXParseException ex); • void warning(SAXParseException ex);

Extending the InfoWithSax Program
public void warning(SAXParseException err) throws SAXException {
    System.out.println("Warning in line " + err.getLineNumber()
                       + " and column " + err.getColumnNumber());
}

public void error(SAXParseException err) throws SAXException {
    System.out.println("Oy va'avoi, an error!");
}

public void fatalError(SAXParseException err) throws SAXException {
    System.out.println("OY VA'AVOI, a fatal error!");
}
• Will these methods be called in the case of a problem?
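The question can be explored with a sketch that parses a malformed document twice, once with and once without registering the error handler. It assumes the JDK's JAXP parser; the class name ErrorCounter is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ErrorCounter extends DefaultHandler {
    int fatalErrors;

    @Override
    public void fatalError(SAXParseException e) {
        fatalErrors++;
    }

    // Parses xml and returns how often fatalError was invoked on the handler.
    static int parse(String xml, boolean registerErrorHandler) {
        ErrorCounter handler = new ErrorCounter();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            if (registerErrorHandler) reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // The input is not well formed, so the parser stops; expected here.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.fatalErrors;
    }
}
```

Extending DefaultHandler is not enough: the methods are invoked only after setErrorHandler has been called. Also note that error is used for recoverable problems such as validity errors, which are reported only when validation is switched on.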

Lexical Events • Lexical events have to do with the way that a document was written and not with its content • Examples: • A comment is a lexical event (<!-- comment -->) • The use of an entity is a lexical event (&gt;) • These can be dealt with by implementing the LexicalHandler interface, which is set on a parser by • reader.setProperty("http://xml.org/sax/properties/lexical-handler", mylexicalhandler);

LexicalHandler
// Methods (partial list)
public void startEntity(String name);
public void endEntity(String name);
public void comment(char[] ch, int start, int length);
public void startCDATA();
public void endCDATA();
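A minimal sketch of receiving comments through a LexicalHandler. It assumes the JDK's JAXP parser and extends org.xml.sax.ext.DefaultHandler2, which provides empty implementations of the LexicalHandler methods; the class name CommentCollector is invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class CommentCollector extends DefaultHandler2 {
    final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length).trim());
    }

    static List<String> collect(String xml) {
        CommentCollector handler = new CommentCollector();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // Lexical handlers are registered as a property, not with a setter.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.comments;
    }
}
```

Without the setProperty call the comment would be skipped silently, because comments are not part of the ContentHandler events.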

DOM – Document Object Model

Creating a DOM Tree • How can we create a DOM Tree independently of the implementation chosen? • Creating a DOM Tree using the Apache Xerces package: • Import: org.apache.xerces.parsers.DOMParser • Import: org.w3c.dom.*; • Use the following lines of code:
DOMParser dom = new DOMParser();
dom.parse(fileName);
Document doc = dom.getDocument();
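One possible answer to the opening question, offered as a sketch rather than as the approach the slides had in mind: the JAXP DocumentBuilderFactory in javax.xml.parsers plays the same role for DOM that the factory plays for SAX, so the code names no concrete parser class. The class name DomFactoryDemo and the method rootName are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DomFactoryDemo {
    // Parses the XML with whatever DOM implementation the factory selects
    // and returns the tag name of the root element.
    static String rootName(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootName("<transaction><account>89-344</account></transaction>"));
    }
}
```

Only org.w3c.dom and javax.xml.parsers types appear in the code, so the underlying implementation can be exchanged without touching it.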

Using a DOM Tree • [Figure: XML File, DOM Parser, DOM Tree, API, Application]

Nodes in a DOM Tree • Node types: Node, Document, DocumentFragment, DocumentType, Element, Attr, CharacterData, Text, CDATASection, Comment, Entity, EntityReference, Notation, ProcessingInstruction • Collection types: NodeList, NamedNodeMap • Figure as appears in: “The XML Companion” - Neil Bradley
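To see which node types a parsed document actually contains, a short recursive walk is enough. The sketch assumes the JDK's JAXP DOM parser with default settings; the class name NodeOutline and the output format are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeOutline {
    // Maps the numeric node type to the name of its DOM interface.
    static String typeName(Node n) {
        switch (n.getNodeType()) {
            case Node.DOCUMENT_NODE: return "Document";
            case Node.ELEMENT_NODE: return "Element";
            case Node.TEXT_NODE: return "Text";
            case Node.COMMENT_NODE: return "Comment";
            case Node.CDATA_SECTION_NODE: return "CDATASection";
            case Node.PROCESSING_INSTRUCTION_NODE: return "ProcessingInstruction";
            default: return "Other";
        }
    }

    // Renders a node and its children as Type(child,child,...).
    static String outline(Node n) {
        StringBuilder sb = new StringBuilder(typeName(n));
        NodeList kids = n.getChildNodes();
        if (kids.getLength() > 0) {
            sb.append('(');
            for (int i = 0; i < kids.getLength(); i++) {
                if (i > 0) sb.append(',');
                sb.append(outline(kids.item(i)));
            }
            sb.append(')');
        }
        return sb.toString();
    }

    static String outlineOf(String xml) {
        try {
            return outline(DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Attributes do not show up in the walk: Attr nodes are reached through getAttributes (a NamedNodeMap), not through getChildNodes.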
