3.4 Streaming API for XML (StAX). Could we process XML documents more conveniently than with SAX, and yet more efficiently? A: Yes, with the Streaming API for XML (StAX) • general introduction • an example • comparison with SAX
StAX: General • Latest of the standard Java XML parser interfaces • Origin: the XMLPull API (A. Slominski, ~2000) • developed as a Java Community Process led by BEA Systems (2003) • included in JAXP 1.4, in Java WSDP 1.6, and in Java SE 6 (JDK 1.6) • An event-driven streaming API, like SAX • does not build an in-memory representation • A "pull API" • lets the application ask for individual events • unlike a "push API" like SAX
Advantages of Pull Parsing • A pull API provides events, on demand, from the chosen stream • can cancel parsing, say, after processing the header of a long message • can read multiple documents simultaneously • application-controlled access (~ iterator design pattern) is usually simpler than SAX-style call-backs (~ observer design pattern)
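The "cancel parsing after the header" point can be sketched as follows. This is a minimal illustration, not code from the experiments in this section: the class name HeaderPeek, the method firstTitle, and the element name "title" are hypothetical, and the sketch uses the cursor-style XMLStreamReader. The application simply stops asking for events once it has what it needs.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class HeaderPeek {
    // Returns the text of the first <title> element, or null if there is none.
    // "title" is an assumed element name, chosen only for this sketch.
    static String firstTitle(String xml) throws XMLStreamException {
        XMLStreamReader xsr = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (xsr.hasNext()) {
                if (xsr.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(xsr.getLocalName())) {
                    // Stop here: no further events are pulled from the stream.
                    return xsr.getElementText();
                }
            }
            return null;
        } finally {
            xsr.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String msg = "<msg><header><title>Hello</title></header>"
                   + "<body><p>a very long body ...</p></body></msg>";
        System.out.println(firstTitle(msg));
    }
}
```

With a SAX parser the same effect typically requires throwing an exception from a call-back to abort the parse; here the loop just returns.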
Cursor and Iterator APIs • StAX consists of two sets of APIs • (1) cursor APIs, and (2) iterator APIs • differ by representation of parse events • (1) cursor API XMLStreamReader • lower-level • methods hasNext() and next() to scan events, represented as int constants START_DOCUMENT, START_ELEMENT, ... • access methods, depending on current event type: • getName(), getAttributeValue(..), getText(), ...
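A minimal sketch of the cursor API, assuming a small document held in a string: the loop scans events with hasNext() and next(), and calls only those access methods that are valid for the current event type. The class name CursorScan, the method describe, and the attribute name "id" are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class CursorScan {
    // Produces one descriptive line per start tag, non-whitespace text, and end tag.
    static List<String> describe(String xml) throws XMLStreamException {
        List<String> out = new ArrayList<>();
        XMLStreamReader xsr = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (xsr.hasNext()) {
            switch (xsr.next()) {            // event types are int constants
                case XMLStreamConstants.START_ELEMENT:
                    // getName() and getAttributeValue(..) are valid on START_ELEMENT
                    String id = xsr.getAttributeValue(null, "id");
                    out.add("start " + xsr.getName().getLocalPart()
                            + (id != null ? " id=" + id : ""));
                    break;
                case XMLStreamConstants.CHARACTERS:
                    // getText() is valid on CHARACTERS
                    if (!xsr.isWhiteSpace()) out.add("text " + xsr.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    out.add("end " + xsr.getName().getLocalPart());
                    break;
                default:
                    break;                   // ignore all other event types
            }
        }
        xsr.close();
        return out;
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String line : describe("<a id=\"1\"><b>hi</b></a>")) {
            System.out.println(line);
        }
    }
}
```

Calling an access method in the wrong state (for example getText() on a START_ELEMENT) raises an IllegalStateException, which is why the switch on the event type comes first.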
(2) Iterator API XMLEventReader • XMLEventReader provides the contents of an XML document to the application using an event object iterator • Parse events are represented as immutable XMLEvent objects • received using methods hasNext() and nextEvent() • event properties are accessed through their methods • can be stored (if needed) • require more resources than the cursor API (see later) • Event lookahead, without advancing in the stream, with XMLEventReader.peek() and XMLStreamReader.getEventType()
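Lookahead with peek() can be sketched like this. The example is hypothetical (class PeekDemo, method emptyElements): it reports elements whose start tag is immediately followed by their end tag, which requires inspecting the next event without consuming it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class PeekDemo {
    // Returns the local names of elements that have no content at all.
    static List<String> emptyElements(String xml) throws XMLStreamException {
        List<String> names = new ArrayList<>();
        XMLEventReader xer = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        while (xer.hasNext()) {
            XMLEvent event = xer.nextEvent();
            if (event.isStartElement()) {
                XMLEvent next = xer.peek();   // look ahead; the reader does not advance
                if (next != null && next.isEndElement()) {
                    names.add(event.asStartElement().getName().getLocalPart());
                }
            }
        }
        xer.close();
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(emptyElements("<a><b/><c>x</c><d></d></a>"));
    }
}
```

Because XMLEvent objects are immutable, the event obtained before the peek() call remains usable afterwards, and could equally well be stored in a collection for later processing.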
Writing APIs • StAX is a bidirectional API • also allows writing XML data • through an XMLStreamWriter or an XMLEventWriter • Useful for "marshaling" data structures into XML • Writers are not required to enforce well-formedness (not to mention validity) • they provide some support: escaping of reserved chars like & and <, and adding unclosed end tags
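A minimal marshaling sketch with XMLStreamWriter, assuming the default writer implementation bundled with the JDK. The class name WriterDemo and the element and attribute names are hypothetical. It relies on the two kinds of support mentioned above: reserved characters are escaped by writeCharacters(), and writeEndDocument() closes any elements that are still open.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class WriterDemo {
    // Serializes a (name, note) pair as a small XML fragment.
    static String marshal(String name, String note) throws XMLStreamException {
        StringWriter sw = new StringWriter();
        XMLStreamWriter xsw = XMLOutputFactory.newInstance().createXMLStreamWriter(sw);
        xsw.writeStartElement("person");
        xsw.writeAttribute("name", name);
        xsw.writeStartElement("note");
        xsw.writeCharacters(note);   // & and < in the text are escaped by the writer
        xsw.writeEndDocument();      // closes the still-open note and person elements
        xsw.close();
        return sw.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(marshal("Ann", "a < b & c"));
    }
}
```

Note that the writer does not check, for instance, that element names are legal XML names; producing well-formed output remains the responsibility of the application.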
Example of Using StAX (1/6) • Use StAX iterator interfaces to • fold element tag names to uppercase, and to • strip comments • Outline: • Initialize • an XMLEventReader for the input document • an XMLEventWriter (for System.out) • an XMLEventFactory for creating modified StartElement and EndElement events • Use them to read all input events, and to write some of them, possibly modified
StAX example (2/6) • First import relevant interfaces & classes:
    import java.io.*;
    import javax.xml.stream.*;
    import javax.xml.stream.events.*;
    import javax.xml.namespace.QName;

    public class capitalizeTags {
      public static void main(String[] args)
          throws FactoryConfigurationError, XMLStreamException, IOException {
        if (args.length != 1) System.exit(1);
        InputStream input = new FileInputStream(args[0]);
StAX example (3/6) • Initialize XMLEventReader/Writer/Factory:
        XMLInputFactory xif = XMLInputFactory.newInstance();
        xif.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, true);
        XMLEventReader xer = xif.createXMLEventReader(input);
        XMLOutputFactory xof = XMLOutputFactory.newInstance();
        XMLEventWriter xew = xof.createXMLEventWriter(System.out);
        XMLEventFactory xef = XMLEventFactory.newInstance();
StAX example (4/6) • Iterate over events of the InputStream:
        while (xer.hasNext()) {
          XMLEvent inEvent = xer.nextEvent();
          if (inEvent.isStartElement()) {
            StartElement se = (StartElement) inEvent;
            QName inQName = se.getName();
            String localName = inQName.getLocalPart();
            xew.add( xef.createStartElement(
                inQName.getPrefix(), inQName.getNamespaceURI(),
                localName.toUpperCase(),
                se.getAttributes(), se.getNamespaces() ) );
StAX example (5/6) • Event iteration continues, to capitalize end tags:
          } else if (inEvent.isEndElement()) {
            EndElement ee = (EndElement) inEvent;
            QName inQName = ee.getName();
            String localName = inQName.getLocalPart();
            xew.add( xef.createEndElement(
                inQName.getPrefix(), inQName.getNamespaceURI(),
                localName.toUpperCase(),
                ee.getNamespaces() ) );
StAX example (6/6) • Output other events, except for comments; finish when input ends:
          } else if (inEvent.getEventType() != XMLStreamConstants.COMMENT) {
            xew.add(inEvent);
          }
        } // while (xer.hasNext())
        xer.close(); input.close();
        xew.flush(); xew.close();
      } // main()
    } // class capitalizeTags
Efficiency of Streaming APIs? • An experiment of SAX vs StAX for scanning documents • Task: Count and report the number of elements, attributes, character fragments, and total char length • Inputs: Similar prose-oriented documents, of different size • repeated fragments of W3C XML Schema Rec (Part 1) • Tested on OpenJDK 1.6.0 (different updates), with • Red Hat Linux 6.0.52, 3 GHz Pentium, 1 GB RAM ("OLD") • 64-bit CentOS Linux 5, 2.93 GHz Intel Core 2 Duo, 4 GB RAM ("NEW")
Essentials of the SAX Solution • Obtain and use a JAXP SAX parser:
    String docFile; // initialized from cmd line
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setValidating(validate); // from cmd option
    spf.setNamespaceAware(true);
    SAXParser sp = spf.newSAXParser();
    CountHandler ch = new CountHandler();
    sp.parse( new File(docFile), ch );
    ch.printResult(); // print the statistics
SAX Solution: CountHandler
    public static class CountHandler extends DefaultHandler {
      // Instance vars for statistics:
      int elemCount = 0, charFragCount = 0, totalCharLen = 0, attrCount = 0;

      public void startElement(String nsURI, String locName,
                               String qName, Attributes atts) {
        elemCount++;
        attrCount += atts.getLength();
      }

      public void characters(char[] buf, int start, int length) {
        charFragCount++;
        totalCharLen += length;
      }

      // printResult(), which prints the statistics, is not shown
    }
Essentials of the StAX Solution • First, initialize:
    XMLInputFactory xif = XMLInputFactory.newInstance();
    xif.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, true);
    InputStream input = new FileInputStream( docFile );
    int elemCount = 0, charFragCount = 0, totalCharLen = 0, attrCount = 0;
• Then parse the InputStream, using (a) the cursor API, or (b) the event iterator API
(a) StAX Cursor API Solution (1)
    XMLStreamReader xsr = xif.createXMLStreamReader(input);
    while (xsr.hasNext()) {
      int eventType = xsr.next();
      switch (eventType) {
        case XMLEvent.START_ELEMENT:
          elemCount++;
          attrCount += xsr.getAttributeCount();
          break;
(a) StAX Cursor API Solution (2)
        case XMLEvent.CHARACTERS:
          charFragCount++;
          totalCharLen += xsr.getTextLength();
          break;
        default:
          break;
      } // switch
    } // while (xsr.hasNext())
    xsr.close(); input.close();
(b) StAX Iterator API Solution (1)
    XMLEventReader xer = xif.createXMLEventReader( input );
    while (xer.hasNext()) {
      XMLEvent event = xer.nextEvent();
      if (event.isStartElement()) {
        elemCount++;
        Iterator attrs = event.asStartElement().getAttributes();
        while (attrs.hasNext()) {
          attrs.next();
          attrCount++;
        }
      } // if (event.isStartElement())
(b) StAX Iterator API Solution (2)
      if (event.isCharacters()) {
        charFragCount++;
        totalCharLen += ((Characters) event).getData().length();
      }
    } // while (xer.hasNext())
    xer.close(); input.close();
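To check that variants (a) and (b) compute the same statistics, both can be run on a small in-memory document. The following is a self-contained sketch under stated assumptions, not the benchmark code used in the experiment: the class name CountBoth and its method names are hypothetical, input comes from a string instead of a file, and only element count, attribute count, and total character length are reported.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.Iterator;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;

public class CountBoth {
    // Returns {elemCount, attrCount, totalCharLen}, computed with the cursor API.
    static int[] countWithCursor(String xml) throws XMLStreamException {
        int elems = 0, attrs = 0, chars = 0;
        XMLStreamReader xsr = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (xsr.hasNext()) {
            int type = xsr.next();
            if (type == XMLStreamConstants.START_ELEMENT) {
                elems++;
                attrs += xsr.getAttributeCount();
            } else if (type == XMLStreamConstants.CHARACTERS) {
                chars += xsr.getTextLength();
            }
        }
        xsr.close();
        return new int[] { elems, attrs, chars };
    }

    // Returns the same statistics, computed with the event iterator API.
    static int[] countWithIterator(String xml) throws XMLStreamException {
        int elems = 0, attrs = 0, chars = 0;
        XMLEventReader xer = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        while (xer.hasNext()) {
            XMLEvent event = xer.nextEvent();
            if (event.isStartElement()) {
                elems++;
                // Attributes are only available through an iterator here.
                Iterator<?> it = event.asStartElement().getAttributes();
                while (it.hasNext()) { it.next(); attrs++; }
            } else if (event.isCharacters()) {
                chars += event.asCharacters().getData().length();
            }
        }
        xer.close();
        return new int[] { elems, attrs, chars };
    }

    public static void main(String[] args) throws XMLStreamException {
        String doc = "<a x=\"1\" y=\"2\"><b>hi</b><c/></a>";
        System.out.println(Arrays.toString(countWithCursor(doc)));
        System.out.println(Arrays.toString(countWithIterator(doc)));
    }
}
```

The iterator variant allocates an XMLEvent object per event and must walk an attribute iterator just to count attributes, whereas the cursor variant reads the counts directly from the reader.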
Efficiency of SAX vs StAX
Efficiency of SAX vs StAX (NEW)
Observations • The StAX cursor API is the most efficient • Overhead of XMLEvent objects makes the StAX iterator some 50–80% slower • On small documents, SAX is ~40–100% slower than the StAX cursor API • Overhead of DTD validation adds ~5–10% to SAX parsing time • StAX loses its advantage with bigger documents:
Times on Larger Documents • Why? Let's take a look at memory usage
Memory Usage of SAX vs StAX • < 6 MB • The StAX implementation has a memory leak! (Should get fixed in future releases)
Memory Usage of SAX vs StAX (NEW) • Memory leak also in the SAX implementation!
Circumventing the Memory Leak • The bug appears to be related to a DOCTYPE declaration with an external DTD • Without a DOCTYPE declaration • In the first experiment, each API uses less than 6 MB • In the second experiment, the StAX event objects still require increasing amounts of memory; see next
SAX vs StAX memory need (w/o DTD)
Speed on documents without DTD
Speed on documents without DTD (NEW)
StAX: Summary • Event-based streaming pull API for XML documents • More convenient than SAX • and often more efficient, esp. the cursor API with small docs • Also supports writing of XML data • A potential substitute for SAX • once some implementation bugs (in JDK 1.6) get eliminated • NB: Sun Java Streaming XML Parser (in JDK 1.6) is non-validating (but the API allows validation, too)