Simple API for XML (SAX)
Aug’10 – Dec ’10

Introduction to SAX
Simple API for XML (SAX) was developed as a standardized way to parse an XML document.
It enables more efficient analysis of large XML documents.
This chapter covers the following:
❑ What SAX is
❑ Where to download it and how to set it up
❑ How and when to use the primary SAX interfaces

Problem with DOM
Before traversal can start, DOM has to build a complete in-memory map of the document.
This takes up space and time.
If DOM is used to extract only a small amount of information from the document, this can be extremely inefficient.
DOM is better suited to small XML documents.

How SAX works
As the XML parser parses the document, it returns a stream of events to the application.
There are events for the start of the document, the end of the document, the start and end of each element, the contents of each element, and so on.
Once parsing has started, the application cannot interrupt the parser to go back and look at an earlier part of the document.
Unlike DOM, which gives access to the entire document at once, SAX stores little or nothing from event to event.
This is what makes SAX faster than DOM.
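
The event stream described above can be made visible with a small trace. This is a minimal sketch, not the TrainReader example from these slides: it assumes the SAX parser bundled with the JDK (obtained through javax.xml.parsers.SAXParserFactory) instead of a separately installed Xerces-J, and the class name EventTrace is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EventTrace {
    /** Parses the XML text and returns the SAX events in the order they arrived. */
    static List<String> trace(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { events.add("startDocument"); }
            @Override public void endDocument() { events.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                events.add("startElement " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                events.add("endElement " + qName);
            }
        };
        // The parser pushes events into the handler; nothing is kept between events.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : trace("<train><car/></train>")) {
            System.out.println(event);
        }
    }
}
```

The handler only ever sees one event at a time, which is why the list has to be built by the application itself if earlier events are needed later.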

Where to get SAX
SAX is specified as a set of Java interfaces.
Downloads are available at http://saxproject.org
Xerces-J is a parser developed to work with SAX.
Downloads are available at http://xml.apache.org/xerces-j
Requires the Java Development Kit, release 1.1 or later.

Receiving SAX Events
Write a Java class that implements one of the SAX interfaces:
    public class Myclass implements ContentHandler
ContentHandler is the name of the interface. It is the most important interface in SAX.
The ContentHandler interface defines the callback methods for content-related events.
It is usually better to extend the DefaultHandler class, which provides default implementations of the functions in the ContentHandler interface:
    public class Myclass extends DefaultHandler

ContentHandler Interface
Designed to control the reporting of events for the content of the document.
Includes information about text, attributes, processing instructions, elements and the document itself.
ContentHandler methods (event - description):
startDocument - Notifies the application that the parser has read the start of the document
endDocument - Notifies the application that the parser has read the end of the document
startElement - Notifies the application that the parser has read an element start-tag

ContentHandler Interface
ContentHandler methods (event - description):
endElement - Notifies the application that the parser has read an element end-tag
skippedEntity - Notifies the application that the parser has skipped an external entity
processingInstruction - Notifies the application that the parser has read a processing instruction
startPrefixMapping - Notifies the application that the parser has read an XML namespace declaration and that a new namespace prefix is in scope

Example: TrainReader
createXMLReader function
    XMLReader reader = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
Creates an XMLReader object through a factory helper by passing the name of a registered parser class to the factory function.
setContentHandler function
    reader.setContentHandler(this);
Tells the XMLReader which object should receive events about the content of the XML document.

Handling Element Events
startElement function
    public void startElement(String uri, String localName, String qName, Attributes atts)
The first three parameters identify the element the parser encountered.
The fourth parameter, Attributes, is used to look up the attributes of the element and their values.
endElement function
    public void endElement(String uri, String localName, String qName)

Handling Element Events
startElement
The first three parameters identify the element either by its namespace name and local name or by its prefixed (qualified) name.
This makes it possible to tell apart similarly named elements from different vocabularies.
If the parser encounters the following element:
    <myPrefix:myElement xmlns:myPrefix="http://example.com">
the parameters are:
uri - http://example.com
localName - myElement
qName - myPrefix:myElement
If the element name has no prefix, localName and qName should be the same.
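
The three naming parameters can be inspected directly. The following is a minimal sketch assuming the JDK's built-in parser; note that with SAXParserFactory, namespace processing has to be switched on with setNamespaceAware(true), otherwise uri and localName may be empty. The class name ElementNames is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ElementNames {
    /** Returns the uri, localName and qName reported for the root element. */
    static String describeRoot(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so uri and localName are filled in
        StringBuilder out = new StringBuilder();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override public void startElement(String uri, String localName,
                                                   String qName, Attributes atts) {
                    if (out.length() == 0) { // record only the first (root) element
                        out.append("uri=").append(uri)
                           .append(" localName=").append(localName)
                           .append(" qName=").append(qName);
                    }
                }
            });
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describeRoot(
            "<myPrefix:myElement xmlns:myPrefix=\"http://example.com\"/>"));
    }
}
```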

Handling Element Events
Attributes
The Attributes interface makes it easy to look up the attributes and their values at the start of each element.
The Attributes interface provides the following functions:
getLength - Returns the number of attributes available in the list
getIndex - Retrieves the index of a specific attribute in the list, given either the attribute's qualified name or both its local name and namespace URI
getLocalName - Retrieves a specific attribute's local name, given its index in the list
getQName - Retrieves a specific attribute's qualified name, given its index in the list
getURI - Retrieves a specific attribute's namespace URI, given its index in the list

Handling Element Events
getType - Retrieves a specific attribute's type, given its index in the list, its qualified name, or both its local name and namespace URI. If there is no DTD, this function always returns CDATA
getValue - Retrieves a specific attribute's value, given its index in the list, its qualified name, or both its local name and namespace URI
Some parsers expose extended behavior through an interface called Attributes2, which makes it possible to check:
whether an attribute was declared in a DTD
whether the attribute value appeared in the XML document or was supplied by a DTD attribute default declaration
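
The lookup functions listed above can be combined to walk the whole attribute list of an element. This is an illustrative sketch assuming the JDK's built-in parser; the class name AttributeDump and the wheels attribute are hypothetical and not part of train.xml.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class AttributeDump {
    /** Lists every attribute in the document as "name=value (type)". */
    static List<String> attributesOf(String xml) throws Exception {
        List<String> found = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override public void startElement(String uri, String localName,
                                                   String qName, Attributes atts) {
                    // Index-based access; the Attributes object is only valid
                    // during this callback, so values are copied out here.
                    for (int i = 0; i < atts.getLength(); i++) {
                        found.add(atts.getQName(i) + "=" + atts.getValue(i)
                                + " (" + atts.getType(i) + ")");
                    }
                }
            });
        return found;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(attributesOf("<car type=\"Engine\" wheels=\"8\"/>"));
    }
}
```

Because the document has no DTD, every attribute type is reported as CDATA. SAX does not promise any particular attribute order, so code should not depend on it.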

Element and Attribute Events
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        if (localName.equals("car")) {
            if (atts != null) {
                System.out.println("Car: " + atts.getValue("type"));
            }
        }
    }
Output:
    Running train reader
    Start of the train
    Car: Engine
    Car: Baggage
    Car: Dining
    End of the train

Handling Character Content
    public void characters(char[] ch, int start, int len) throws SAXException
Called to deliver the character content between two tags.
Characters are delivered in a buffer.
start and len give the starting position and the length of the data to read when copying it out of the buffer.
The parser may report the characters of a single element in multiple chunks.

Handling Character Content
To retrieve the character content of the color tag in train.xml:
    private boolean isColor;
    private String trainCarType = "";
    private StringBuffer trainCarColor = new StringBuffer();
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        if (localName.equals("car")) {
            if (atts != null) trainCarType = atts.getValue("type");
        }
        if (localName.equals("color")) {
            trainCarColor.setLength(0);   // start a fresh buffer for this color element
            isColor = true;
        } else {
            isColor = false;
        }
    }

Handling Character Content
    public void characters(char[] ch, int start, int len) throws SAXException {
        if (isColor) trainCarColor.append(ch, start, len);   // may be called more than once per element
    }
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (isColor) {
            System.out.println("The color of the " + trainCarType + " car is " + trainCarColor.toString());
        }
    }

Handling Character Content
Output:
    Running train reader
    Start of the train
    The color of the Engine car is Black
    The color of the Baggage car is Green
    The color of the Dining car is Green and Yellow
    End of the train
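
The reason for buffering in characters rather than reading the text in one go is that a parser is free to split the content of one element across several calls. The sketch below is a self-contained variant of the color example, assuming the JDK's built-in parser; the class name ColorReader is hypothetical, and the entity reference in the sample text is there only because it is a typical place for a parser to split the text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ColorReader extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inColor;
    final List<String> colors = new ArrayList<>();

    @Override public void startElement(String uri, String localName,
                                       String qName, Attributes atts) {
        if (qName.equals("color")) {
            text.setLength(0); // reset the buffer for each color element
            inColor = true;
        }
    }

    @Override public void characters(char[] ch, int start, int len) {
        if (inColor) text.append(ch, start, len); // append every chunk
    }

    @Override public void endElement(String uri, String localName, String qName) {
        if (qName.equals("color")) {
            colors.add(text.toString()); // the text is complete only at the end tag
            inColor = false;
        }
    }

    /** Returns the text of every color element in the document. */
    static List<String> colorsIn(String xml) throws Exception {
        ColorReader handler = new ColorReader();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.colors;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(colorsIn(
            "<train><car><color>Green &amp; Yellow</color></car></train>"));
    }
}
```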

When to ignore IgnorableWhitespace
    public void ignorableWhitespace(char[] ch, int start, int len) throws SAXException
Similar to the characters event.
The parser may call this function multiple times within a single element.
Whitespace such as spaces, tabs and line feeds, used to make the XML document more readable, is often not important to the application:
    <car type="Engine">
        <color>Black</color>
        <weight>512 tons</weight>
    </car>

When to ignore IgnorableWhitespace
The only way for the SAX parser to know that whitespace is ignorable is when the element is declared in the DTD not to contain PCDATA.
Validating parsers must report this event; non-validating parsers may also report it if they are able to read and use the content models in the DTD.
If the parser has no knowledge of the DTD, it assumes that all character data, including whitespace, is important.

Skipped Entities
The skippedEntity event alerts the application that the SAX parser has encountered information that the application can or must skip.
An entity can be skipped for several reasons:
- The entity is a reference to an external resource that cannot be parsed or cannot be found
- The entity is an external general entity and the http://xml.org/sax/features/external-general-entities feature is set to false
- The entity is an external parameter entity and the http://xml.org/sax/features/external-parameter-entities feature is set to false
    public void skippedEntity(String name) throws SAXException

Skipped Entities
The skippedEntity event is declared as follows:
    public void skippedEntity(String name) throws SAXException
The name parameter is the name of the entity that was skipped.
The name begins with "%" in the case of a parameter entity.
SAX considers the external DTD subset an entity.
If the name parameter is "[dtd]", the external DTD subset was not processed.

Processing Instructions
Processing instructions pass specific instructions to applications.
SAX delivers these instructions to the application through the processingInstruction event:
    public void processingInstruction(String target, String data) throws SAXException
If the processing instruction in the XML document is as follows:
    <?TrainApplication instructionForTrainPrograms?>
target - TrainApplication
data - instructionForTrainPrograms
The XML declaration is not a processing instruction.
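
The split into target and data, and the fact that the XML declaration is not reported, can be checked with a short sketch. It assumes the JDK's built-in parser; the class name PiReader is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PiReader {
    /** Returns each processing instruction as "target -> data". */
    static List<String> instructions(String xml) throws Exception {
        List<String> seen = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override public void processingInstruction(String target, String data) {
                    seen.add(target + " -> " + data);
                }
            });
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // The leading <?xml ...?> declaration does not produce an event.
        System.out.println(instructions(
            "<?xml version=\"1.0\"?>"
            + "<?TrainApplication instructionForTrainPrograms?><train/>"));
    }
}
```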

Namespace Prefixes
SAX parsers fire a startPrefixMapping event and an endPrefixMapping event for every namespace declaration:
    public void startPrefixMapping(String prefix, String uri) throws SAXException
    public void endPrefixMapping(String prefix) throws SAXException
The prefix parameter is the namespace prefix being declared.
For a default namespace declaration, the prefix is an empty string.
The uri parameter is the namespace URI being declared.
    xmlns:example="http://example.com"
prefix - example
uri - http://example.com
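
A short sketch of the two prefix-mapping events, assuming the JDK's built-in parser with namespace processing switched on. The class name PrefixTrace and the second (default) namespace URI are hypothetical, added only to show the empty prefix.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PrefixTrace {
    /** Records every prefix mapping that starts or ends while parsing. */
    static List<String> mappings(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // prefix events are only fired in this mode
        List<String> events = new ArrayList<>();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override public void startPrefixMapping(String prefix, String uri) {
                    events.add("start [" + prefix + "] " + uri);
                }
                @Override public void endPrefixMapping(String prefix) {
                    events.add("end [" + prefix + "]");
                }
            });
        return events;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(mappings(
            "<example:train xmlns:example=\"http://example.com\""
            + " xmlns=\"http://example.org/default\"/>"));
    }
}
```

The default namespace shows up with an empty prefix, written here as [].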

Stopping the process in exceptional circumstances
To stop processing, create and throw a new SAXException.
For example, to throw an exception if the Engine color is not Black:
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (isColor) {
            System.out.println("The color of the " + trainCarType + " car is " + trainCarColor.toString());
            if (trainCarType.equals("Engine") && !trainCarColor.toString().equals("Black")) {
                throw new SAXException("The engine is not black !");
            }
        }
        isColor = false;
    }
If the Engine color is not Black, the parsing process is stopped.

Stopping the process in exceptional circumstances
Output:
    Running train reader..
    Start of the train
    The color of the Engine car is Red
    Exception in thread "main" org.xml.sax.SAXException: The engine is not black !
        at TrainReader.endElement (TrainReader.java:80)
        at org.apache.xerces…
        . . . .
        at TrainReader.read
        at TrainReader.main
When the exception is raised, it stops the whole application.
This is because the exception is not handled anywhere.

Stopping the process in exceptional circumstances
Add a try..catch block to handle the exception:
    public void read(String fileName) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
        reader.setContentHandler(this);
        try {
            reader.parse(fileName);
        } catch (SAXException e) {
            System.out.println("Parsing stopped ! " + e.getMessage());
        }
    }
Output:
    Running train reader..
    Start of the train
    The color of the Engine car is Red
    Parsing stopped ! The engine is not black !
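
The throw-and-catch pattern above can be reduced to a self-contained sketch. It assumes the JDK's built-in parser rather than the Xerces class named in the slides; the class name StopEarly and the stop element are hypothetical stand-ins for the engine-color check.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StopEarly {
    /** Collects element names until a stop element aborts the parse. */
    static String parse(String xml) throws Exception {
        List<String> seen = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)), new DefaultHandler() {
                    @Override public void startElement(String uri, String localName,
                            String qName, Attributes atts) throws SAXException {
                        if (qName.equals("stop")) {
                            // Throwing from a callback is how the application stops the parser.
                            throw new SAXException("stop element reached");
                        }
                        seen.add(qName);
                    }
                });
            return "finished " + seen;
        } catch (SAXException e) {
            // Handling the exception here keeps the application running.
            return "stopped " + seen + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse("<a><b/><stop/><c/></a>"));
    }
}
```

Elements that follow the stop element are never reported, because the parser does not resume after the exception.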

Providing the location of the Error
SAX can provide the line number and column position of an error through the setDocumentLocator event.
The setDocumentLocator event lets the parser pass a Locator object to the application.
The methods of the Locator object include:
getLineNumber - Retrieves the line number for the current event
getColumnNumber - Retrieves the column number for the current event
getSystemId - Retrieves the system identifier of the document for the current event
getPublicId - Retrieves the public identifier of the document for the current event

Providing the location of the Error
    private Locator trainLocator = null;
    public void setDocumentLocator(Locator loc) {
        trainLocator = loc;
    }
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (isColor) {
            System.out.println("The color of the " + trainCarType + " car is " + trainCarColor.toString());
            if (trainCarType.equals("Engine") && !trainCarColor.toString().equals("Black")) {
                if (trainLocator != null)
                    throw new SAXException("The engine is not black ! at line " + trainLocator.getLineNumber()
                            + ", column " + trainLocator.getColumnNumber());
            }
        }
        isColor = false;
    }

Providing the location of the Error
Output:
    Running train reader..
    Start of the train
    The color of the Engine car is Red
    Parsing stopped ! The engine is not black ! at line 4, column 20
The Locator object makes it easy to tell the user where in the XML document the error occurred.
The position reported by the Locator object is an approximation and is not always exact.
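
An alternative to building the position into the message string is to pass the Locator to a SAXParseException, which copies the line and column at the moment it is created. This is a sketch of that variation, not the approach used in the slides; it assumes the JDK's built-in parser, and the class name LocatedStop and the bad element are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class LocatedStop extends DefaultHandler {
    private Locator locator; // supplied by the parser, if it supports locators

    @Override public void setDocumentLocator(Locator loc) {
        locator = loc;
    }

    @Override public void startElement(String uri, String localName,
            String qName, Attributes atts) throws SAXException {
        if (qName.equals("bad")) {
            // The exception captures the current line and column from the locator
            // (both are -1 when no locator was supplied).
            throw new SAXParseException("Unexpected element " + qName, locator);
        }
    }

    /** Returns the reported line of the first bad element, or 0 if there is none. */
    static int lineOfFirstBad(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new LocatedStop());
            return 0;
        } catch (SAXParseException e) {
            return e.getLineNumber();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(lineOfFirstBad("<train>\n<car/>\n<bad/>\n</train>"));
    }
}
```

The reported position should be treated as a hint for the user rather than an exact offset.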

ErrorHandler Interface
Used to receive error events.
Add a call to setErrorHandler and set the validation feature.
warning - Allows the parser to notify the application of a warning it has encountered in the parsing process
error - Allows the parser to notify the application that it has encountered an error. Even though the parser has encountered an error, parsing can continue
fatalError - Allows the parser to notify the application that it has encountered a fatal error and cannot continue parsing. Well-formedness errors are reported through this event

ErrorHandler Interface
Add an internal DTD to validate the document:
    <?xml version="1.0"?>
    <!DOCTYPE train [
        <!ELEMENT train (car*)>
        <!ELEMENT car (color,weight,length,occupants)>
        <!ATTLIST car type CDATA #IMPLIED>
        <!ELEMENT color (#PCDATA)>
        <!ELEMENT weight (#PCDATA)>
        <!ELEMENT length (#PCDATA)>
        <!ELEMENT occupants (#PCDATA)>
    ]>

ErrorHandler Interface
Modifying the read function:
    public void read(String fileName) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
        reader.setContentHandler(this);
        reader.setErrorHandler(this);
        try {
            reader.setFeature("http://xml.org/sax/features/validation", true);
        } catch (SAXException e) {
            System.err.println("Cannot activate validation");
        }
        try {
            reader.parse(fileName);
        } catch (SAXException e) {
            System.out.println("Parsing stopped ! " + e.getMessage());
        }
    }

ErrorHandler Interface
Add the error handling functions:
    public void warning(SAXParseException exception) throws SAXException {
        System.err.println("[Warning] " + exception.getMessage() + " at line " + exception.getLineNumber()
                + ", column " + exception.getColumnNumber());
    }
    public void error(SAXParseException exception) throws SAXException {
        System.err.println("[Error] " + exception.getMessage() + " at line " + exception.getLineNumber()
                + ", column " + exception.getColumnNumber());
    }
    public void fatalError(SAXParseException exception) throws SAXException {
        System.err.println("[Fatal Error] " + exception.getMessage() + " at line " + exception.getLineNumber()
                + ", column " + exception.getColumnNumber());
        throw exception;
    }
Introduce errors into the XML document and check the result:
Rename an element to a name that the DTD does not declare (a validity error, reported through error).
Remove the closing > bracket of a tag, which violates well-formedness (reported through fatalError).
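
The two experiments suggested above can be run without editing train.xml. The sketch below assumes the JDK's built-in parser with validation turned on through SAXParserFactory.setValidating(true); the class name ValidatingReader, the shortened DTD and the wagon element are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class ValidatingReader {
    static final String DTD =
        "<!DOCTYPE train [<!ELEMENT train (car*)><!ELEMENT car EMPTY>]>";

    /** Returns the kind of each problem reported while parsing. */
    static List<String> problems(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true); // same effect as setting the validation feature
        List<String> problems = new ArrayList<>();
        try {
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override public void warning(SAXParseException e) {
                        problems.add("warning");
                    }
                    @Override public void error(SAXParseException e) {
                        problems.add("error"); // validity error: parsing continues
                    }
                    @Override public void fatalError(SAXParseException e)
                            throws SAXException {
                        problems.add("fatalError");
                        throw e; // well-formedness error: give up
                    }
                });
        } catch (SAXException e) {
            // Parsing was stopped by fatalError; the list already records it.
        }
        return problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(problems(DTD + "<train><car/></train>"));
        System.out.println(problems(DTD + "<train><wagon/></train>"));
        System.out.println(problems(DTD + "<train><car></train>"));
    }
}
```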

DTDHandler Interface
Used to receive events about declarations in the DTD.
It supports events for notations and unparsed entities only:
notationDecl - Allows the parser to notify the application that it has read a notation declaration
unparsedEntityDecl - Allows the parser to notify the application that it has read an unparsed entity declaration
Events for declarations of elements, attributes and internal entities are made available in one of the extension interfaces, DeclHandler.
To use the DTDHandler interface:
    reader.setDTDHandler(this);

EntityResolver Interface
Controls how a SAX parser behaves when it attempts to resolve external entity references within the DTD.
The EntityResolver interface defines one function:
resolveEntity - Allows the application to handle the resolution of entity lookups for the parser
To use the EntityResolver interface:
    reader.setEntityResolver(this);
This lets the application control how the parser opens and connects to external resources.
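
One common use of resolveEntity is to keep the parser from fetching an external DTD. The sketch below is an illustration under assumptions: it uses the JDK's built-in parser, which by default asks the resolver for the external DTD subset, and it answers with an empty replacement. The class name OfflineResolver and the DTD URL are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class OfflineResolver {
    /** Parses the document and returns the system identifiers the parser asked for. */
    static List<String> requestedEntities(String xml) throws Exception {
        List<String> requested = new ArrayList<>();
        // SAXParser.parse registers the DefaultHandler as the entity resolver too.
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override public InputSource resolveEntity(String publicId,
                                                           String systemId) {
                    requested.add(systemId);
                    // Hand back an empty entity so nothing is read from the network.
                    return new InputSource(new StringReader(""));
                }
            });
        return requested;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(requestedEntities(
            "<!DOCTYPE train SYSTEM \"http://example.com/train.dtd\"><train/>"));
    }
}
```

Returning null from resolveEntity instead would tell the parser to open the system identifier itself.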

Features and Properties
Some behavior of SAX parsers is controlled by setting features and properties.
Working with Features
To change the value of a feature in SAX, call the setFeature function of the XMLReader:
    public void setFeature(String name, boolean value) throws SAXNotRecognizedException, SAXNotSupportedException
Parsers may not support or recognize every feature.
The getFeature function returns the current value of any feature:
    public boolean getFeature(String name) throws SAXNotRecognizedException, SAXNotSupportedException

Features and Properties
Working with Features
http://xml.org/sax/features/validation - Controls whether or not the parser validates the document as it parses
http://xml.org/sax/features/external-general-entities - Controls whether or not external general entities are processed
http://xml.org/sax/features/external-parameter-entities - Controls whether or not external parameter entities are processed
http://xml.org/sax/features/xml-1.1 - Read-only feature that is true if the parser supports both XML 1.1 and XML 1.0
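
Because a parser may not recognize or support every feature, calls to setFeature and getFeature are usually wrapped in exception handling. A minimal sketch, assuming the XMLReader of the JDK's built-in parser; the class name FeatureCheck and the no-such-feature URI are hypothetical.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

public class FeatureCheck {
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    /** Creates an XMLReader from the parser bundled with the JDK. */
    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    /** Describes the state of a feature without letting exceptions escape. */
    static String describe(XMLReader reader, String feature) {
        try {
            return feature + " = " + reader.getFeature(feature);
        } catch (SAXNotRecognizedException e) {
            return feature + " is not recognized";
        } catch (SAXNotSupportedException e) {
            return feature + " is not supported";
        }
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = newReader();
        reader.setFeature(VALIDATION, true);
        System.out.println(describe(reader, VALIDATION));
        System.out.println(describe(reader, "http://example.com/no-such-feature"));
    }
}
```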

Features and Properties
Working with Properties
Properties are used to connect helper objects to an XMLReader.
SAX comes with a set of extension interfaces, DeclHandler and LexicalHandler, that allow the application to receive additional events about the XML document.
The only way to register these handlers with the XMLReader is through the setProperty function:
    public void setProperty(String name, Object value) throws SAXNotRecognizedException, SAXNotSupportedException
    public Object getProperty(String name) throws SAXNotRecognizedException, SAXNotSupportedException

Features and Properties
Working with Properties
http://xml.org/sax/properties/declaration-handler - Specifies the DeclHandler object registered to receive events for declarations within the DTD
http://xml.org/sax/properties/lexical-handler - Specifies the LexicalHandler object registered to receive lexical events such as comments, CDATA sections and entity references
http://xml.org/sax/properties/document-xml-version - Read-only property that describes the actual version of the XML document, such as "1.0" or "1.1"

Extension Interfaces
DeclHandler interface, for declarations within the DTD.
The DeclHandler interface declares the following events:
attributeDecl - Allows the parser to notify the application that it has read an attribute declaration
elementDecl - Allows the parser to notify the application that it has read an element declaration
externalEntityDecl - Allows the parser to notify the application that it has read an external entity declaration
internalEntityDecl - Allows the parser to notify the application that it has read an internal entity declaration

Extension Interfaces
LexicalHandler interface, for lexical events.
The LexicalHandler interface declares the following events:
comment - Allows the parser to notify the application that it has read a comment
startCDATA - Allows the parser to notify the application that it has encountered a CDATA section start marker
endCDATA - Allows the parser to notify the application that it has encountered a CDATA section end marker
Other events supported are startDTD, endDTD, startEntity and endEntity.
To register the handler:
    reader.setProperty("http://xml.org/sax/properties/lexical-handler", this);
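
Registering a LexicalHandler through setProperty can be sketched as follows. The example assumes the JDK's built-in parser and uses org.xml.sax.ext.DefaultHandler2, a convenience class that supplies empty implementations of the extension interfaces; the class name CommentReader is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class CommentReader {
    /** Returns the text of every comment in the document. */
    static List<String> comments(String xml) throws Exception {
        List<String> found = new ArrayList<>();
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override public void comment(char[] ch, int start, int length) {
                found.add(new String(ch, start, length));
            }
        };
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        // Lexical events are delivered only to a handler registered through this property.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return found;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(comments("<train><!-- engine first --><car/></train>"));
    }
}
```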

Good SAX and Bad SAX
Advantages
It is simple.
It doesn't load the whole document into memory.
The parser has a smaller footprint than DOM.
It is faster.
It focuses on the real content rather than the way it is laid out.
It is good for filtering data and lets the application concentrate on the subset of interest.

Good SAX and Bad SAX
Disadvantages
The application receives the data in the order SAX delivers it and has absolutely no control over the order in which the parser reads the document.
SAX programming requires fairly intricate state keeping.
If the focus is on analyzing an entire document, DOM is much better.

Consumers, Producers and Filters
In addition to consuming events from an XMLReader, it is possible to write a class that produces SAX events, e.g. a class that reads a comma-delimited file and fires SAX events.
Events can also be filtered as they pass from the XMLReader to the event handler.
A SAX filter acts as a middleman between the parser and the application.
Filters can insert, remove or modify events before passing them on to the application.
Other Languages
SAX is also available for C++, Perl, Python, Pascal, Visual Basic, .NET and Curl.
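
The middleman role of a filter, described earlier on this slide, can be sketched with the XMLFilterImpl helper class from org.xml.sax.helpers. This is an illustrative example assuming the JDK's built-in parser as the upstream reader; the class name WeightFilter and the choice to drop weight elements are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class WeightFilter extends XMLFilterImpl {
    WeightFilter(XMLReader parent) {
        super(parent); // events flow from the parent reader through this filter
    }

    // Start and end tags of weight elements are swallowed; everything else is
    // passed on. The text inside weight is not suppressed by this sketch.
    @Override public void startElement(String uri, String localName,
            String qName, Attributes atts) throws SAXException {
        if (!qName.equals("weight")) super.startElement(uri, localName, qName, atts);
    }

    @Override public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (!qName.equals("weight")) super.endElement(uri, localName, qName);
    }

    /** Returns the element names the application sees after filtering. */
    static List<String> elementNames(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        WeightFilter filter = new WeightFilter(parser);
        filter.setContentHandler(new DefaultHandler() {
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                names.add(qName);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames(
            "<car><color>Black</color><weight>512 tons</weight></car>"));
    }
}
```

The application registers its handler with the filter instead of the parser, so it never needs to know that events were removed along the way.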