3.4 Streaming API for XML (StAX)
• Could we process XML documents more conveniently than with SAX, and yet more efficiently?
• A: Yes, with the Streaming API for XML (StAX)
  • general introduction
  • an example
  • comparison with SAX

StAX: General
• Latest of the standard Java XML parser interfaces
  • Origin: the XMLPull API (A. Slominski, ~2000)
  • developed in a Java Community Process led by BEA Systems (2003)
  • included in JAXP 1.4, in Java WSDP 1.6, and in Java SE 6 (JDK 1.6)
• An event-driven streaming API, like SAX
  • does not build an in-memory representation
• A "pull API"
  • lets the application ask for individual events
  • unlike a "push API" like SAX

Advantages of Pull Parsing
• A pull API provides events, on demand, from the chosen stream
  • can cancel parsing, say, after processing the header of a long message
  • can read multiple documents simultaneously
  • application-controlled access (~ iterator design pattern) is usually simpler than SAX-style callbacks (~ observer design pattern)
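
The "cancel parsing after the header" advantage can be sketched as follows. This is a minimal illustration, not part of the original example: the class name HeaderOnly, the method readSubject, and the element names message, header and subject are all hypothetical. The application simply stops calling next() once it has what it needs.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class HeaderOnly {
    // Hypothetical helper: returns the text of the first <subject> element,
    // or null if the <header> element ends before one is seen.
    static String readSubject(String xml) throws XMLStreamException {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new StringReader(xml));
        try {
            while (xsr.hasNext()) {
                int event = xsr.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "subject".equals(xsr.getLocalName())) {
                    // Found what we need: return without pulling further events.
                    return xsr.getElementText();
                }
                if (event == XMLStreamConstants.END_ELEMENT
                        && "header".equals(xsr.getLocalName())) {
                    break; // header is over; the rest of the message is never requested
                }
            }
            return null;
        } finally {
            xsr.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        // Assumed sample message, for illustration only.
        String xml = "<message><header><subject>Hello</subject></header>"
                   + "<body><p>a long body ...</p></body></message>";
        System.out.println(readSubject(xml));
    }
}
```

With a SAX parser the handler would keep receiving callbacks for the body; a common workaround there is to throw an exception from the handler to abort the parse.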

Cursor and Iterator APIs
• StAX consists of two sets of APIs
  • (1) cursor APIs, and (2) iterator APIs
  • differ by representation of parse events
• (1) Cursor API: XMLStreamReader
  • lower-level
  • methods hasNext() and next() to scan events, represented as int constants START_DOCUMENT, START_ELEMENT, ...
  • access methods, depending on current event type:
    • getName(), getAttributeValue(..), getText(), ...
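
A small sketch of the cursor API's access methods, assuming a made-up input document; the class name CursorDemo, the method describe, and the id attribute are hypothetical. Which access methods may be called depends on the event type that next() just returned.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class CursorDemo {
    // Hypothetical helper: returns one descriptive line per start tag,
    // non-whitespace text node, and end tag.
    static List<String> describe(String xml) throws XMLStreamException {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        // Deliver adjacent character data as a single CHARACTERS event.
        xif.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader xsr = xif.createXMLStreamReader(new StringReader(xml));
        List<String> out = new ArrayList<>();
        while (xsr.hasNext()) {
            switch (xsr.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    // getName() and getAttributeValue(..) are valid on START_ELEMENT;
                    // a null namespace URI means "do not check the namespace".
                    out.add("start " + xsr.getName().getLocalPart()
                            + " id=" + xsr.getAttributeValue(null, "id"));
                    break;
                case XMLStreamConstants.CHARACTERS:
                    // getText() is valid on CHARACTERS.
                    if (!xsr.isWhiteSpace()) out.add("text " + xsr.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    out.add("end " + xsr.getName().getLocalPart());
                    break;
                default:
                    break;
            }
        }
        xsr.close();
        return out;
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String line : describe("<note id=\"n1\">Hi</note>")) {
            System.out.println(line);
        }
    }
}
```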

(2) XMLEventReader: Iterator API
• XMLEventReader provides the contents of an XML document to the application using an event object iterator
• Parse events are represented as immutable XMLEvent objects
  • received using methods hasNext() and nextEvent()
  • event properties accessed through their methods
  • can be stored (if needed)
  • require more resources than the cursor API (see later)
• Event lookahead, without advancing in the stream, with XMLEventReader.peek() and XMLStreamReader.getEventType()
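
Lookahead with XMLEventReader.peek() might be used as in the sketch below, which lists elements whose start tag is immediately followed by their end tag. The class name PeekDemo, the method emptyElements, and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class PeekDemo {
    // Hypothetical helper: local names of elements with no content at all.
    static List<String> emptyElements(String xml) throws XMLStreamException {
        XMLEventReader xer = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        while (xer.hasNext()) {
            XMLEvent event = xer.nextEvent();
            if (event.isStartElement()) {
                // peek() returns the next event without consuming it,
                // so the main loop still sees it on the next iteration.
                XMLEvent next = xer.peek();
                if (next != null && next.isEndElement()) {
                    names.add(event.asStartElement().getName().getLocalPart());
                }
            }
        }
        xer.close();
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(emptyElements("<doc><a/><b>text</b><c></c></doc>"));
    }
}
```

Because the event objects are immutable, the start-element event can be kept in a variable (or a collection) while the reader moves on, which the cursor API does not allow.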

Writing APIs
• StAX is a bidirectional API
  • also allows writing XML data
  • through an XMLStreamWriter or an XMLEventWriter
• Useful for "marshaling" data structures into XML
• Writers are not required to enforce well-formedness (not to mention validity)
  • provide some support: escaping of reserved chars like & and <, and adding end tags for unclosed elements
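
A minimal sketch of marshaling with XMLStreamWriter, showing the two kinds of support mentioned above. The class name WriterDemo, the method marshal, and the note/title vocabulary are hypothetical; the exact serialization (for example, attribute quoting) may differ between StAX implementations.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class WriterDemo {
    // Hypothetical helper: writes <note title="...">text</note> into a string.
    static String marshal(String title, String text) throws XMLStreamException {
        StringWriter sw = new StringWriter();
        XMLStreamWriter xsw = XMLOutputFactory.newInstance().createXMLStreamWriter(sw);
        xsw.writeStartElement("note");
        xsw.writeAttribute("title", title);
        xsw.writeCharacters(text);   // reserved characters such as & and < are escaped
        // No writeEndElement() here: writeEndDocument() closes any start tags
        // that are still open and writes the corresponding end tags.
        xsw.writeEndDocument();
        xsw.close();
        return sw.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(marshal("greeting", "a & b < c"));
    }
}
```

Note that nothing stops the application from writing, say, two root elements; the writer helps with escaping and closing, but well-formedness remains the application's responsibility.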

Example of Using StAX (1/6)
• Use the StAX iterator interfaces to
  • fold element tag names to uppercase, and to
  • strip comments
• Outline:
  • Initialize
    • an XMLEventReader for the input document
    • an XMLEventWriter (for System.out)
    • an XMLEventFactory for creating modified StartElement and EndElement events
  • Use them to read all input events, and to write some of them, possibly modified

StAX example (2/6)
• First import relevant interfaces & classes:

    import java.io.*;
    import javax.xml.stream.*;
    import javax.xml.stream.events.*;
    import javax.xml.namespace.QName;

    public class capitalizeTags {
      public static void main(String[] args)
          throws FactoryConfigurationError, XMLStreamException, IOException {
        if (args.length != 1) System.exit(1);
        InputStream input = new FileInputStream(args[0]);

StAX example (3/6)
• Initialize XMLEventReader/Writer/Factory:

        XMLInputFactory xif = XMLInputFactory.newInstance();
        xif.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, true);
        XMLEventReader xer = xif.createXMLEventReader(input);
        XMLOutputFactory xof = XMLOutputFactory.newInstance();
        XMLEventWriter xew = xof.createXMLEventWriter(System.out);
        XMLEventFactory xef = XMLEventFactory.newInstance();

StAX example (4/6)
• Iterate over events of the InputStream:

        while (xer.hasNext()) {
          XMLEvent inEvent = xer.nextEvent();
          if (inEvent.isStartElement()) {
            StartElement se = (StartElement) inEvent;
            QName inQName = se.getName();
            String localName = inQName.getLocalPart();
            xew.add(xef.createStartElement(
                inQName.getPrefix(), inQName.getNamespaceURI(),
                localName.toUpperCase(),
                se.getAttributes(), se.getNamespaces()));

StAX example (5/6)
• Event iteration continues, to capitalize end tags:

          } else if (inEvent.isEndElement()) {
            EndElement ee = (EndElement) inEvent;
            QName inQName = ee.getName();
            String localName = inQName.getLocalPart();
            xew.add(xef.createEndElement(
                inQName.getPrefix(), inQName.getNamespaceURI(),
                localName.toUpperCase(),
                ee.getNamespaces()));

StAX example (6/6)
• Output other events, except for comments; finish when input ends:

          } else if (inEvent.getEventType() != XMLStreamConstants.COMMENT) {
            xew.add(inEvent);
          }
        } // while (xer.hasNext())
        xer.close(); input.close();
        xew.flush(); xew.close();
      } // main()
    } // class capitalizeTags

Efficiency of Streaming APIs?
• An experiment of SAX vs StAX for scanning documents
• Task: Count and report the number of elements, attributes, character fragments, and total char length
• Inputs: Similar prose-oriented documents, of different size
  • repeated fragments of W3C XML Schema Rec (Part 1)
• Tested on OpenJDK 1.6.0 (different updates), with
  • Red Hat Linux 6.0.52, 3 GHz Pentium, 1 GB RAM ("OLD")
  • 64-bit CentOS Linux 5, 2.93 GHz Intel Core 2 Duo, 4 GB RAM ("NEW")

Essentials of the SAX Solution
• Obtain and use a JAXP SAX parser:

    String docFile; // initialized from cmdline
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setValidating(validate); // from cmd option
    spf.setNamespaceAware(true);
    SAXParser sp = spf.newSAXParser();
    CountHandler ch = new CountHandler();
    sp.parse(new File(docFile), ch);
    ch.printResult(); // print the statistics

SAX Solution: CountHandler

    public static class CountHandler extends DefaultHandler {
      // Instance vars for statistics:
      int elemCount = 0, charFragCount = 0,
          totalCharLen = 0, attrCount = 0;

      public void startElement(String nsURI, String locName,
                               String qName, Attributes atts) {
        elemCount++;
        attrCount += atts.getLength();
      }

      public void characters(char[] buf, int start, int length) {
        charFragCount++;
        totalCharLen += length;
      }

Essentials of the StAX Solution
• First, initialize:

    XMLInputFactory xif = XMLInputFactory.newInstance();
    xif.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, true);
    InputStream input = new FileInputStream(docFile);
    int elemCount = 0, charFragCount = 0,
        totalCharLen = 0, attrCount = 0;

• Then parse the InputStream, using (a) the cursor API, or (b) the event iterator API

(a) StAX Cursor API Solution (1)

    XMLStreamReader xsr = xif.createXMLStreamReader(input);
    while (xsr.hasNext()) {
      int eventType = xsr.next();
      switch (eventType) {
        case XMLEvent.START_ELEMENT:
          elemCount++;
          attrCount += xsr.getAttributeCount();
          break;

(a) StAX Cursor API Solution (2)

        case XMLEvent.CHARACTERS:
          charFragCount++;
          totalCharLen += xsr.getTextLength();
          break;
        default:
          break;
      } // switch
    } // while (xsr.hasNext())
    xsr.close(); input.close();

(b) StAX Iterator API Solution (1)

    XMLEventReader xer = xif.createXMLEventReader(input);
    while (xer.hasNext()) {
      XMLEvent event = xer.nextEvent();
      if (event.isStartElement()) {
        elemCount++;
        Iterator attrs = event.asStartElement().getAttributes();
        while (attrs.hasNext()) {
          attrs.next();
          attrCount++;
        }
      } // if (event.isStartElement())

(b) StAX Iterator API Solution (2)

      if (event.isCharacters()) {
        charFragCount++;
        totalCharLen += ((Characters) event).getData().length();
      }
    } // while (xer.hasNext())
    xer.close(); input.close();

Efficiency of SAX vs StAX

Efficiency of SAX vs StAX (NEW)

Observations
• The StAX cursor API is the most efficient
• Overhead of XMLEvent objects makes the StAX iterator some 50–80% slower
• On small documents, SAX is ~40–100% slower than the StAX cursor API
• Overhead of DTD validation adds ~5–10% to SAX parsing time
• StAX loses its advantage with bigger documents:

Times on Larger Documents
Why? Let's take a look at memory usage

Memory Usage of SAX vs StAX
< 6 MB
StAX implementation has a memory leak! (Should get fixed in future releases)

Memory Usage of SAX vs StAX (NEW)
Memory leak also in the SAX implementation!

Circumventing the Memory Leak
• The bug appears to be related to a DOCTYPE declaration with an external DTD
• Without a DOCTYPE declaration
  • In the first experiment, each API uses less than 6 MB
  • In the second experiment, the StAX event objects still require increasing amounts of memory; see next

SAX vs StAX memory need (without DTD)

Speed on documents without DTD

Speed on documents without DTD (NEW)

StAX: Summary
• Event-based streaming pull API for XML documents
• More convenient than SAX
  • and often more efficient, esp. the cursor API with small docs
• Also supports writing of XML data
• A potential substitute for SAX, once some implementation bugs (in JDK 1.6) get eliminated
  • NB: Sun Java Streaming XML Parser (in JDK 1.6) is non-validating (but the API allows validation, too)