1. Web Programming Course Lecture 12 – XML
2. Briefly: The Power of XML XML is Extensible Markup Language
Text-based representation for describing data structure
Both human and machine readable
Originated from Standard Generalized Markup Language (SGML)
Became a World Wide Web Consortium (W3C) standard in 1998
XML is a great choice for exchanging data between disparate systems
3. Synergy between Java and XML Java+XML=Portable language+Portable Data
Use Java to generate XML data
Use Java to access SQL databases
Use Java to format data in XML
Use Java to parse data
Use Java to validate data
Use Java to transform data
4. HTML and XML HTML and XML look similar, because they are both SGML languages
use elements enclosed in tags (e.g. <body>This is an element</body>)
use tag attributes (e.g., <font face="Verdana" size="+1" color="red">)
More precisely,
HTML is defined in SGML
XML is a (very small) subset of SGML
5. HTML and XML HTML is for humans
HTML describes web pages
Browsers ignore and/or correct many HTML errors, so HTML is often sloppy
XML is for computers
XML describes data
The rules are strict and errors are not allowed
In this way, XML is like a programming language
Current versions of most browsers display XML
6. Example XML document
7. Overall structure An XML document may start with one or more processing instructions or directives:
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="ss.css"?>
Following the directives, there must be exactly one root element containing all the rest of the XML:
<weatherReport> ...</weatherReport>
8. XML building blocks Aside from the directives, an XML document is built from:
elements: high in <high scale="F">103</high>
tags, in pairs: <high scale="F">103</high>
attributes: <high scale="F">103</high>
entities: <afternoon>Sunny &amp; hot</afternoon>
data: <high scale="F">103</high>
9. Elements and attributes Attributes and elements are interchangeable
Example:
Elements are easier to use from Java
Attributes may contain elaborate metadata, such as unique IDs
10. Well-formed XML In XML, every element must have both a start tag and an end tag, e.g. <name> ... </name>
Empty elements can be abbreviated: <break />.
XML tags are case sensitive and may not begin with the letters xml, in any combination of cases
Elements must be properly nested
e.g. not <b><i>bold and italic</b></i>
An XML document must have one and only one root element
The values of attributes must be enclosed in quotes
e.g. <time unit="days">
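The well-formedness rules above can be checked mechanically by handing the text to a parser. The following is a minimal sketch, not part of the lecture's own code: the class name WellFormedCheck and the method isWellFormed are hypothetical, and it assumes the standard JAXP DocumentBuilder, which reports a violation of these rules as a SAXException.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string is well-formed XML.
public class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors; installing it avoids the
            // built-in handler that prints messages to System.err.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the text: it breaks a well-formedness rule.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<name>Ann</name>",               // start and end tag
            "<break />",                      // abbreviated empty element
            "<b><i>bold and italic</b></i>",  // improperly nested
            "<time unit=days></time>",        // attribute value not quoted
            "<a/><b/>"                        // more than one root element
        };
        for (String sample : samples) {
            System.out.println(isWellFormed(sample) + "  " + sample);
        }
    }
}
```

Only the first two samples satisfy the rules; the other three each break one of the rules listed on this slide.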
11. DTDs and Namespaces DTDs are used to define the tags that can be used in an XML document
A document may refer to a number of DTDs
Namespaces specify which DTD defines a given tag
This helps to avoid collisions between names
XML: myDTD:myTag
Note that colon (:) is used rather than a dot (.)
12. XML as a tree An XML document represents a hierarchy
A hierarchy is a tree
13. Viewing XML XML is designed to be processed by computer programs, not to be displayed to humans
Nevertheless, almost all current Web browsers can display XML documents
They do not all display it the same way
They may not display it at all if it has errors
This is just an added value. Remember: HTML is designed to be viewed, XML is designed to be used
14. Stream Model Stream seen by parser is a sequence of elements
As each XML element is seen, an event occurs
Some code registered with the parser (the event handler) is executed
This approach is popularized by the Simple API for XML (SAX)
Problem:
Hard to get a global view of the document
Parsing state represented by global variables set by the event handlers
15. Data Model The XML data is transformed into a navigable data structure in memory
Because of the nesting of XML elements, a tree data structure is used
The tree is navigated to discover the XML document
This approach is popularized by the Document Object Model (DOM)
Problem:
May require large amounts of memory
May not be as fast as stream approach
Some DOM parsers use SAX to build the tree
16. SAX and DOM SAX and DOM are standards for XML parsers
DOM is a W3C standard
SAX is an ad-hoc (but very popular) standard
There are various implementations available
Java implementations are provided as part of JAXP (Java API for XML Processing)
The JAXP package is included in the JDK starting from JDK 1.4
It is available separately for Java 1.3
17. Difference between SAX and DOM DOM reads the entire document into memory and stores it as a tree data structure
SAX reads the document and calls handler methods for each element or block of text that it encounters
Consequences:
DOM provides "random access" into the document
SAX provides only sequential access to the document
DOM is slower and requires a large amount of memory, so it is impractical for very large documents
SAX is fast and requires very little memory, so it can be used for huge documents
This makes SAX much more popular for web sites
18. Parsing with SAX SAX uses the source-listener-delegate model for parsing XML documents
The source is XML data consisting of XML elements
A listener written in Java is attached to the parser and listens for events
When an event is fired, handling is delegated to the corresponding listener method
19. Callbacks SAX works through callbacks:
The program calls the parser
The parser calls methods provided by the program
20. Simple SAX program The program consists of two classes:
Sample -- This class contains the main method; it
Gets a factory to make parsers
Gets a parser from the factory
Creates a Handler object to handle callbacks from the parser
Tells the parser which handler to send its callbacks to
Reads and parses the input XML file
Handler -- This class contains handlers for three kinds of callbacks:
startElement callbacks, generated when a start tag is seen
endElement callbacks, generated when an end tag is seen
characters callbacks, generated for the contents of an element
21. The Sample class
import javax.xml.parsers.*; // for both SAX and DOM
import org.xml.sax.*;
import org.xml.sax.helpers.*;

// For simplicity, we let the operating system handle exceptions
// In "real life" this is poor programming practice
public class Sample {
    public static void main(String args[]) throws Exception {
        // Create a parser factory
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Tell factory that the parser must understand namespaces
        factory.setNamespaceAware(true);
        // Make the parser
        SAXParser saxParser = factory.newSAXParser();
        XMLReader parser = saxParser.getXMLReader();
22. The Sample class
        // Create a handler
        Handler handler = new Handler();
        // Tell the parser to use this handler
        parser.setContentHandler(handler);
        // Finally, read and parse the document
        parser.parse("hello.xml");
    } // end of main
} // end of Sample class
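The parser reads the file hello.xml
The name is treated as a system identifier (a URI); a relative name such as this one is typically resolved against the current working directory
The classpath is not searched, so run the program from the directory that contains hello.xml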
23. The Handler class public class Handler extends DefaultHandler {
DefaultHandler is an adapter class that defines empty methods to be overridden
We define 3 methods to handle (1) start tags, (2) contents, and (3) end tags.
The methods will just print a line
Each of these 3 methods throws a SAXException
// SAX calls this when it encounters a start tag
public void startElement(String namespaceURI, String localName,
                         String qualifiedName, Attributes attributes)
        throws SAXException {
    System.out.println("startElement: " + qualifiedName);
}
24. The Handler class
// SAX calls this method to pass in character data
public void characters(char ch[], int start, int length) throws SAXException {
    System.out.println("characters: \"" + new String(ch, start, length) + "\"");
}
// SAX calls this method when it encounters an end tag
public void endElement(String namespaceURI, String localName,
                       String qualifiedName) throws SAXException {
    System.out.println("endElement: /" + qualifiedName);
}
} // End of Handler class
25. Results
If the file hello.xml contains:
<?xml version="1.0"?>
<display>Hello World!</display>
Then the output from running java Sample will be:
startElement: display
characters: "Hello World!"
endElement: /display
26. More results Now suppose the file hello.xml contains:
<?xml version="1.0"?>
<display> <i>Hello</i> World!</display>
Notice that the root element, <display>, contains a nested element <i> and whitespace (including newlines)
The result will be:
startElement: display
characters: ""
characters: ""
characters: " "
startElement: i
characters: "Hello"
endElement: /i
characters: "World!"
characters: " "
endElement: /display
27. Factories SAX uses a parser factory
A factory is a design pattern alternative to constructors
Factories allow the programmer to:
Decide whether or not to create a new object
Decide what kind of object to create
class TrustMe {
    private TrustMe() { } // private constructor
    public static TrustMe makeTrust() { // factory method
        if ( /* test of some sort */ )
            return new TrustMe();
        return null; // the factory may decide not to create an object
    }
}
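The two decisions a factory can make (whether to create a new object, and what kind) can be illustrated with a small self-contained sketch. Everything here is hypothetical and chosen only for illustration: the names GreeterFactory, Greeter, Formal and Casual do not come from the lecture or from any library.

```java
// Hypothetical example of a factory method; none of these names are from a real API.
public class GreeterFactory {

    interface Greeter {
        String greet();
    }

    private static final class Formal implements Greeter {
        public String greet() { return "Good day"; }
    }

    private static final class Casual implements Greeter {
        public String greet() { return "Hi"; }
    }

    // A single shared Formal instance, created on first use.
    private static Greeter sharedFormal;

    private GreeterFactory() { } // no instances of the factory itself

    // Factory method: decides what kind of object to return
    // and whether a new one needs to be created at all.
    static Greeter make(boolean formal) {
        if (formal) {
            if (sharedFormal == null) {
                sharedFormal = new Formal();
            }
            return sharedFormal;      // reuse the existing object
        }
        return new Casual();          // always create a fresh object
    }

    public static void main(String[] args) {
        System.out.println(make(true).greet());
        System.out.println(make(false).greet());
        System.out.println(make(true) == make(true));   // same shared object
        System.out.println(make(false) == make(false)); // two distinct objects
    }
}
```

A constructor could do neither of these things: it always creates a new object, and always of the class it belongs to.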
28. Parser factories
To create a SAX parser factory, call the static method SAXParserFactory.newInstance()
Returns an object of type SAXParserFactory
It may throw a FactoryConfigurationError
Then, the parser can be customized:
public void setNamespaceAware(boolean awareness)
Call this with true if you are using namespaces
The default (if you don’t call this method) is false
public void setValidating(boolean validating)
Call this with true if you want to validate against a DTD
The default (if you don’t call this method) is false
Validation will give an error if you do not have a DTD
29. Getting a parser
Once a SAXParserFactory factory has been set up, parsers can be created with:
SAXParser saxParser = factory.newSAXParser();
XMLReader parser = saxParser.getXMLReader();
Note: SAXParser is not thread-safe
If a parser will be used in multiple threads, create a separate SAXParser object for each thread
30. Declaring which handler to use Since the SAX parser will call the handlers, we need to supply these methods
Binding the parser with a handler:
Handler handler = new Handler();
parser.setContentHandler(handler);
These statements could be combined:
parser.setContentHandler(new Handler());
Finally, the parser is invoked on the file to parse:
parser.parse("hello.xml");
Everything else is done in the handler methods
31. SAX handlers
SAX defines four callback interfaces; a full-featured handler implements all of them:
interface ContentHandler
Handles basic parsing callbacks, e.g., element starts and ends
interface DTDHandler
Handles only notation and unparsed entity declarations
interface EntityResolver
Does customized handling for external entities
interface ErrorHandler
Should be implemented and registered; otherwise most parsing errors are silently ignored!
Implementing all these interfaces is a lot of work
It is easier to use an adapter class
32. Class DefaultHandler
DefaultHandler is an adapter class in package org.xml.sax.helpers
DefaultHandler implements ContentHandler, DTDHandler, EntityResolver, and ErrorHandler
DefaultHandler provides empty methods for every method declared in each of the interfaces
To use this class, extend it and override the methods that are important to the application
33. ContentHandler methods public void startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) throws SAXException
This method is called at the beginning of elements
When SAX calls startElement, it passes in a parameter of type Attributes
The following methods look up attributes by name rather than by index:
public int getIndex(String qualifiedName)
public int getIndex(String uri, String localName)
public String getValue(String qualifiedName)
public String getValue(String uri, String localName)
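As a sketch of how these lookups are used inside startElement, the example below collects the scale attribute of every high element, using the weather-report vocabulary from the earlier slides. The class name AttributeLookup, the method scales and the "(none)" placeholder are assumptions made for illustration; the parser setup mirrors the Sample class, and the XML is read from a string so the example is self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: look up an attribute by name in startElement.
public class AttributeLookup {

    // Returns the "scale" attribute of each <high> element, in document order.
    static List<String> scales(String xml) throws Exception {
        List<String> found = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so localName is filled in
        XMLReader parser = factory.newSAXParser().getXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String namespaceURI, String localName,
                                     String qualifiedName, Attributes atts) {
                if (localName.equals("high")) {
                    // Lookup by namespace URI and local name; an attribute
                    // without a namespace has the empty string as its URI.
                    String scale = atts.getValue("", "scale");
                    // getValue returns null when the attribute is absent.
                    found.add(scale == null ? "(none)" : scale);
                }
            }
        });
        parser.parse(new InputSource(new StringReader(xml)));
        return found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<weatherReport>"
                   + "<high scale=\"F\">103</high>"
                   + "<high>39</high>"
                   + "</weatherReport>";
        System.out.println(scales(xml));
    }
}
```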
34. ContentHandler methods
public void endElement(String namespaceURI, String localName, String qualifiedName) throws SAXException
The parameters to endElement are the same as those to startElement, except that the Attributes parameter is omitted
public void characters(char[] ch, int start, int length) throws SAXException
ch is an array of characters
Only length characters, starting from ch[start], are the contents of the element
35. Error Handling SAX error handling is unusual
Most errors are ignored unless an error handler (org.xml.sax.ErrorHandler) is registered
Ignored errors can cause unexpected behavior
The ErrorHandler interface declares:
public void fatalError (SAXParseException exception) throws SAXException // XML not well structured
public void error (SAXParseException exception) throws SAXException // XML validation error
public void warning (SAXParseException exception) throws SAXException // minor problem
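A minimal sketch of registering an ErrorHandler follows. It assumes a validating parser and a document that carries its own small DTD, so that validation errors reach the error method. The class name ValidationProblems and the method validate are hypothetical; the handler simply records warnings and errors in a list and rethrows fatal errors.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical example: collect validation problems instead of ignoring them.
public class ValidationProblems {

    static List<String> validate(String xml) throws Exception {
        List<String> problems = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true); // validate against the document's DTD
        XMLReader parser = factory.newSAXParser().getXMLReader();
        parser.setErrorHandler(new ErrorHandler() {
            @Override
            public void warning(SAXParseException e) {
                problems.add("warning: " + e.getMessage());
            }
            @Override
            public void error(SAXParseException e) {
                problems.add("error: " + e.getMessage());
            }
            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                throw e; // not well-formed: stop parsing
            }
        });
        parser.parse(new InputSource(new StringReader(xml)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        // The DTD allows only text inside <display>.
        String dtd = "<!DOCTYPE display [<!ELEMENT display (#PCDATA)>]>";
        System.out.println(validate(dtd + "<display>Hello</display>"));
        System.out.println(validate(dtd + "<display><i>Hello</i></display>"));
    }
}
```

The first document is valid, so the list stays empty; the second uses an element the DTD does not declare, so at least one error is recorded. The exact message text depends on the parser implementation.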
36. External parsers Alternatively, you can use an existing parser:
Xerces, Electric XML, Expat, MSXML, CMarkup
Stages of parsing:
Get the URL object for the source
Create InputSource object encapsulating the data source
Create the parser
Launch the parser on the data source
37. Creating InputSource
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.servlet.*;
import javax.servlet.http.*;
import java.io.*;
import java.net.*;
import java.util.*;
public class MyServlet extends HttpServlet {
    private static URL url;
    public void init() throws ServletException {
        try {
            url = new URL("http://server/data.xml");
        } catch (MalformedURLException e) {
            System.err.println(e);
        }
    }
38. Creating InputSource & Parser
    public void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException, ServletException {
        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        out.println("<html><title> mytitle </title><body>");
        InputStream in = url.openStream();
        InputSource src = new InputSource(in);
        try {
            XMLReader parser = XMLReaderFactory.createXMLReader(
                "org.apache.xerces.parsers.SAXParser");
            parser.parse(src);
        }
        catch (SAXException e) { System.err.println(e); }
        catch (IOException e) { System.err.println(e); }
        out.println("</body></html>");
    }
} // end of MyServlet class
39. Problems with SAX SAX provides only sequential access to the document being processed
SAX has only a local view of the current element being processed
Global knowledge of parsing must be stored in global variables
A single startElement() method for all elements
In startElement() there are many “if-then-else” tests for checking a specific element
When an element is seen, a global flag is set
When finished with the element global flag must be set to false
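The flag-based bookkeeping described above might look like the following sketch. It is an illustration rather than the lecture's code: the class names ChapterTitles and TitleHandler are hypothetical, and the flag is kept as a field of the handler rather than a true global variable. The handler collects the text of every chapter element, buffering characters callbacks because the text of one element may arrive in several pieces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: tracking parsing state with a flag in a SAX handler.
public class ChapterTitles {

    static class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private boolean inChapter = false;                   // the state flag
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String namespaceURI, String localName,
                                 String qualifiedName, Attributes atts) {
            // One method for all elements, so the element must be tested for.
            if (qualifiedName.equals("chapter")) {
                inChapter = true;      // set the flag when the element is seen
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inChapter) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String namespaceURI, String localName,
                               String qualifiedName) {
            if (qualifiedName.equals("chapter")) {
                inChapter = false;     // reset the flag when the element ends
                titles.add(text.toString());
            }
        }
    }

    static List<String> titles(String xml) throws Exception {
        TitleHandler handler = new TitleHandler();
        XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<novel>"
                   + "<chapter num=\"1\">The Beginning</chapter>"
                   + "<chapter num=\"2\">The Middle</chapter>"
                   + "<chapter num=\"3\">The End</chapter>"
                   + "</novel>";
        System.out.println(titles(xml));
    }
}
```

With more element types, the tests in startElement and endElement grow into the chains of if-then-else checks that this slide warns about.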
40. DOM DOM represents the XML document as a tree
Hierarchical nature of tree maps well to hierarchical nesting of XML elements
Tree contains a global view of the document
Makes navigation of document easy
Allows modification of any subtree
Easier processing than SAX but memory intensive!
Like SAX, DOM is only an API
Does not specify a parser
Lists the API and requirements for the parser
DOM parsers typically use SAX parsing
41. Simple DOM program First we need to create a DOM parser, called a DocumentBuilder
The parser is created, not by a constructor, but by calling a static factory method
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
42. Simple DOM program
An XML file hello.xml will be parsed:
<?xml version="1.0"?>
<display>Hello World!</display>
To read this file, we add the following line:
Document document = builder.parse("hello.xml");
document contains the entire XML file as a tree
The following code finds the content of the root element and prints it
Element root = document.getDocumentElement();
Node textNode = root.getFirstChild();
System.out.println(textNode.getNodeValue());
The output of the program is: Hello World!
43. Reading in the tree The parse method reads in the entire XML document and represents it as a tree in memory
For a large document, parsing could take a while
If you want to interact with your program while it is parsing, you need to run the parser in a separate thread
In practice, an XML parse tree may require up to 10 times as much memory as the original XML document
If you have a lot of tree manipulation to do, DOM is much more convenient than SAX
If you do not have a lot of tree manipulation to do, consider using SAX instead
44. Structure of the DOM tree The DOM tree is composed of Node objects
Node is an interface
Some of the more important sub-interfaces are Element, Attr, and Text
An Element node may have children
Attr and Text nodes are the leaves of the tree
Hence, every object in the DOM tree is a Node, and a Node can be downcast to a more specific type if needed
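45. Operations on Nodes
The results returned by getNodeName(), getNodeValue(), getNodeType() and getAttributes() depend on the subtype of the node, as follows:
                  Element        Text            Attr
getNodeName()     tag name       "#text"         attribute name
getNodeValue()    null           text contents   attribute value
getNodeType()     ELEMENT_NODE   TEXT_NODE       ATTRIBUTE_NODE
getAttributes()   NamedNodeMap   null            null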
46. Distinguishing Node types An easy way to handle different types of nodes:
switch(node.getNodeType()) {
case Node.ELEMENT_NODE:
Element element = (Element)node;...;break;
case Node.TEXT_NODE:
Text text = (Text)node;...break;
case Node.ATTRIBUTE_NODE:
Attr attr = (Attr)node;...break;
default: ...
}
47. Operations on Nodes Tree-walking methods that return a Node:
getParentNode()
getFirstChild()
getNextSibling()
getPreviousSibling()
getLastChild()
Test methods that return a boolean:
hasAttributes()
hasChildNodes()
48. Operations for Elements String getTagName()
Returns the name of the tag
boolean hasAttribute(String name)
Returns true if this Element has the named attribute
String getAttribute(String name)
Returns the value of the named attribute
boolean hasAttributes()
Returns true if this Element has any attributes
NamedNodeMap getAttributes()
Returns a NamedNodeMap of all the Element’s attributes
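The Element operations listed above can be combined as in the sketch below, which reads the root element of a small document and copies its attributes into a map. The class name ElementAttributes and its two methods are hypothetical names chosen for illustration; the XML is parsed from a string so the example runs on its own.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

// Hypothetical example: inspecting an Element and its attributes.
public class ElementAttributes {

    // Parses the text and returns the root element of the resulting tree.
    static Element parseRoot(String xml) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return document.getDocumentElement();
    }

    // Copies all attributes of the element into a map sorted by name.
    static Map<String, String> attributesOf(Element element) {
        Map<String, String> result = new TreeMap<>();
        if (element.hasAttributes()) {
            NamedNodeMap map = element.getAttributes();
            for (int i = 0; i < map.getLength(); i++) {
                Attr attr = (Attr) map.item(i); // every item is an Attr node
                result.put(attr.getName(), attr.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Element high = parseRoot("<high scale=\"F\" source=\"airport\">103</high>");
        System.out.println(high.getTagName());
        System.out.println(high.hasAttribute("scale"));
        System.out.println(high.getAttribute("scale"));
        System.out.println(attributesOf(high));
    }
}
```

Note that getAttribute returns an empty string, not null, for an attribute that is not present, so hasAttribute is the reliable way to test for presence.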
49. Operations on Texts Text is a subinterface of CharacterData and inherits the following operations (among others):
public String getData() throws DOMException
Returns the text contents of this Text node
public int getLength()
Returns the number of Unicode characters in the text
public String substringData(int offset, int count) throws DOMException
Returns a substring of the text contents
50. Operations on Attributes String getName()
Returns the name of this attribute.
Element getOwnerElement()
Returns the Element node this attribute is attached to
boolean getSpecified()
Returns true if this attribute was explicitly given a value in the document
String getValue()
Returns the value of the attribute as a String
51. Pre-order traversal The DOM is stored in memory as a tree
Trees can be traversed using pre-order, in-order, or post-order
A simple way to traverse a tree is in preorder
The general form of a pre-order traversal is:
Visit the root
Traverse each one of the sub-trees, in order
52. Pre-order traversal in Java
static void simplePreorderPrint(String indent, Node node) {
    printNode(indent, node);
    if (node.hasChildNodes()) {
        Node child = node.getFirstChild();
        while (child != null) {
            simplePreorderPrint(indent + " ", child);
            child = child.getNextSibling();
        }
    }
}
static void printNode(String indent, Node node) {
    System.out.print(indent);
    System.out.print(node.getNodeType() + " ");
    System.out.print(node.getNodeName() + " ");
    System.out.print(node.getNodeValue() + " ");
    System.out.println(node.getAttributes());
}
53. Trying out the program
Input:
<?xml version="1.0"?>
<novel>
 <chapter num="1">The Beginning</chapter>
 <chapter num="2">The Middle</chapter>
 <chapter num="3">The End</chapter>
</novel>
Output:
1 novel null
 3 #text null
 1 chapter null num="1"
  3 #text The Beginning null
 3 #text null
 1 chapter null num="2"
  3 #text The Middle null
 3 #text null
 1 chapter null num="3"
  3 #text The End null
 3 #text null
54. Overview
DOM, unlike SAX, allows you to create and modify XML trees
There are three basic kinds of operations:
Creating a new DOM
Modifying the structure of a DOM
Modifying the content of a DOM
Creating a new DOM requires a few extra methods just to get started
Afterwards, you can add elements through modifying its structure and contents
55. Creating a new DOM
56. Creating structure The following are instance methods of Document:
public Element createElement(String tagName)
public Element createElementNS(String namespaceURI, String qualifiedName)
public Attr createAttribute(String name)
public Attr createAttributeNS(String namespaceURI, String qualifiedName)
public ProcessingInstruction createProcessingInstruction (String target, String data)
public EntityReference createEntityReference(String name)
public Text createTextNode(String data)
public Comment createComment(String data)
57. Methods of Node public Node appendChild(Node newChild)
public Node insertBefore(Node newChild, Node refChild)
public Node removeChild(Node oldChild)
public Node replaceChild(Node newChild, Node oldChild)
public void setNodeValue(String nodeValue)
Functionality depends on the type of the node
58. Methods of Element public void setAttribute(String name, String value)
public Attr setAttributeNode(Attr newAttr)
public void setAttributeNS(String namespaceURI, String qualifiedName, String value)
public Attr setAttributeNodeNS(Attr newAttr)
public void removeAttribute(String name)
public void removeAttributeNS(String namespaceURI, String localName)
public Attr removeAttributeNode(Attr oldAttr)
59. Method of Attribute public void setValue(String value)
This is the only method that modifies an Attribute
The rest just retrieve information
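The creation and modification methods from the preceding slides can be put together as in the following sketch. It obtains an empty Document from DocumentBuilder.newDocument(), a standard JAXP call that the slides above do not list, builds a one-chapter novel, and then changes both an attribute and a text node. The class name BuildNovel and the method build are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example: creating a DOM from scratch and then modifying it.
public class BuildNovel {

    static Document build() throws ParserConfigurationException {
        // Start with an empty document.
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();

        // Create the structure: <novel><chapter num="1">The Beginning</chapter></novel>
        Element novel = document.createElement("novel");
        document.appendChild(novel);                 // becomes the root element
        Element chapter = document.createElement("chapter");
        chapter.setAttribute("num", "1");
        chapter.appendChild(document.createTextNode("The Beginning"));
        novel.appendChild(chapter);

        // Modify the content: change the attribute value and the text.
        chapter.getAttributeNode("num").setValue("2");
        chapter.getFirstChild().setNodeValue("The Middle");
        return document;
    }

    public static void main(String[] args) throws Exception {
        Element chapter = (Element) build().getDocumentElement().getFirstChild();
        System.out.println(chapter.getTagName()
                + " num=" + chapter.getAttribute("num")
                + " text=" + chapter.getFirstChild().getNodeValue());
    }
}
```

Nodes must be created by the Document they will belong to, which is why every create method is an instance method of Document.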
60. Writing out the DOM as XML
The core DOM interfaces define no method for writing out a DOM as XML (the JAXP javax.xml.transform API can serialize a DOM)
Writing out a DOM is conceptually simple
It is just a tree walk
Practically, there are a lot of details
Various node types
Binding attributes
…
Doing a good job isn’t complicated, but it is lengthy
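As an alternative to a hand-written tree walk, JAXP's transformation API can serialize a DOM with an identity transform. The sketch below assumes that approach; the class name WriteDom and its methods parse and toXml are hypothetical, while Transformer, DOMSource, StreamResult and OutputKeys are the standard javax.xml.transform classes.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example: writing a DOM out as XML with an identity transform.
public class WriteDom {

    // Parses XML text into a DOM tree.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Serializes the whole document to a string.
    static String toXml(Document document) throws TransformerException {
        // A Transformer created without a stylesheet copies input to output.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document document = parse("<display>Hello World!</display>");
        // Modify the tree, then write it back out.
        document.getDocumentElement().setAttribute("lang", "en");
        System.out.println(toXml(document));
    }
}
```

The transformer handles the details mentioned on this slide, such as the various node types and attributes, so no explicit tree walk is needed.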