Introduction to SAX: a standard interface for event-based XML parsing Cheng-Chia Chen
What is SAX ? • SAX : Simple API for XML • Started as community-driven project • xml-dev mailing list • Originally designed as Java API • Others (C++, Python, Perl) are now supported • SAX2 • Namespaces • configurable features and properties
SAX Features • Event-driven • You provide various event handlers • Fast and lightweight • Document does not have to be entirely in memory • Sequential read access only • Does not support modification of document
What is an Event-Based Interface? Two major types of XML APIs: • Tree-based APIs ==> DOM • compiles an XML document into an internal tree structure, then allows an application to navigate that tree. • Event-based APIs. ==> SAX • reports parsing events (such as the start and end of elements) directly to the application through callbacks, • usually does not build an internal tree. • The application implements handlers to deal with the different events, much like handling events in a graphical user interface. • Comparison: For tree-based APIs • useful for many applications • require more system resources, especially if the document is large.
How an event-based API works • Sample document: • <?xml version="1.0"?> • <doc> • <para>Hello, world!</para> • </doc> • An event-based interface will break down the structure of this document into a sequence of SAX events: • start document • start element: doc • start element: para • characters: Hello, world! • end element: para • end element: doc • end document
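The event sequence listed above can be reproduced with a small handler. The following is a minimal sketch, not part of the original slides: the class name EventTrace and the helper trace() are made up for illustration, and it assumes the parser bundled with the JDK. Character data is buffered and trimmed because a parser may deliver text in several chunks and also reports the whitespace between elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper: records SAX events as human-readable strings.
class EventTrace extends DefaultHandler {
    private final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    // Emit buffered character data (if any) as a single event.
    private void flushText() {
        String s = text.toString().trim();
        if (!s.isEmpty()) events.add("characters: " + s);
        text.setLength(0);
    }

    @Override public void startDocument() { events.add("start document"); }
    @Override public void endDocument() { events.add("end document"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        events.add("start element: " + localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        events.add("end element: " + localName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    static List<String> trace(String xml) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        EventTrace handler = new EventTrace();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n<doc>\n<para>Hello, world!</para>\n</doc>";
        for (String event : trace(xml)) System.out.println(event);
    }
}
```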
Quick Start for SAX2 Application Writers 1. Make sure you have the required libraries (available in the JDK): 1. the SAX2 interfaces and classes and 2. an XML parser that supports SAX2. Xerces => org.apache.xerces.parsers.SAXParser or • com.sun.org.apache.xerces.internal.parsers.SAXParser 2. Get the parser via XMLReaderFactory#createXMLReader() • XMLReader parser = XMLReaderFactory.createXMLReader(); 3. Create event handlers to receive information about the document. • The most important one is the ContentHandler, which receives events for the start and end of elements, character data, processing instructions, and other basic XML structure. • You can simply subclass the built-in adapter class DefaultHandler and implement only the methods that you need.
Example: (MyHandler.java) • prints a message each time an element starts or ends:

    import org.xml.sax.helpers.DefaultHandler;
    import org.xml.sax.Attributes;
    import static java.lang.System.out;

    public class MyHandler extends DefaultHandler {
        // called once for every start tag
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.println("Start element: " + localName);
        }
        // called once for every end tag
        public void endElement(String uri, String localName, String qName) {
            out.println("End element: " + qName);
        }
    }
The main program (SAXApp.java)

    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;
    import org.xml.sax.helpers.XMLReaderFactory;

    public class SAXApp {
        // static final String parserClass =
        //     "org.apache.xerces.parsers.SAXParser"; // use my own parser!
        public static void main(String[] args) throws Exception {
            XMLReader xr = XMLReaderFactory.createXMLReader(/*parserClass*/);
            DefaultHandler handler = new MyHandler();
            xr.setContentHandler(handler);
            // each command-line argument is the URL of a document to parse
            for (int i = 0; i < args.length; i++) {
                xr.parse(args[i]);
            }
        }
    }
The input • the input XML document (roses.xml): • <?xml version="1.0"?> • <poem> • <line>Roses are red,</line> • <line>Violets are blue.</line> • <line>Sugar is sweet,</line> • <line>and I love you.</line> • </poem> • To parse this with your SAXApp application, you would supply the absolute URL of the document on the command line: java SAXApp file://localhost/tmp/roses.xml or java SAXApp file:///tmp/roses.xml
The output • The output should be as follows: Start element: poem Start element: line End element: line Start element: line End element: line Start element: line End element: line Start element: line End element: line End element: poem
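The handler above only reports element boundaries. As a hedged extension, the sketch below also collects the text of each line element through the characters() callback. The class name LineCollector and the method collect() are illustrative only and do not come from the slides.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper: gathers the text content of every <line> element.
class LineCollector extends DefaultHandler {
    private final List<String> lines = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <line> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("line".equals(localName)) current = new StringBuilder();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so append rather than assign.
        if (current != null) current.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("line".equals(localName) && current != null) {
            lines.add(current.toString());
            current = null;
        }
    }

    static List<String> collect(String xml) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        LineCollector handler = new LineCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<poem><line>Roses are red,</line><line>Violets are blue.</line></poem>";
        for (String line : collect(xml)) System.out.println(line);
    }
}
```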
[Figure: SAX 1.0 architecture. Labels: SAX driver's parser class name (supplied by application writer); SAX; implementations of Parser, AttributeList, and Locator (supplied by driver writer).]
[Figure: SAX 2.0 architecture. Labels: SAX driver's XMLReader class name (supplied by application writer); Content; SAX 2; implementations of XMLReader, Attributes, and Locator (supplied by driver writer).]
SAX 2.0: Java Road Map • The SAX Java distribution contains • 17 core classes/interfaces, • 10 helper classes • 2 extension interfaces + 6 extension implementations • For application writers • 7 interfaces available, but most XML applications will need only one or two of them.
SAX classes and interfaces • Falling into six groups: 1. interfaces implemented by the parser: • XMLReader, Attributes (required), and Locator (optional) 2. interfaces implemented by the application: • ContentHandler, ErrorHandler, DTDHandler, and • EntityResolver • (all optional: ContentHandler will be the most important one for typical XML applications) • XMLFilter: for cascaded applications • DeclHandler, LexicalHandler: for additional DTD/lexical events 3. standard SAX classes supplied by SAX2: • InputSource, • SAXException, • SAXParseException, • SAXNotSupportedException, SAXNotRecognizedException
SAX classes and interfaces 4. Helper classes in the org.xml.sax.helpers package: • Default implementations: • AttributesImpl, LocatorImpl, XMLFilterImpl • Namespace support: • NamespaceSupport • Factory classes: • XMLReaderFactory 5. Legacy SAX 1.0 classes: Parser, ParserFactory, HandlerBase, AttributeList, AttributeListImpl, DocumentHandler. 6. Conversion between SAX 1.0 Parser and SAX 2.0 XMLReader: • ParserAdapter, XMLReaderAdapter
Interfaces for Parser Writers (org.xml.sax package) • A SAX-conformant XML parser needs to implement only two or three simple interfaces; 1.XMLReader • the main interface to a SAX parser: • allow the user to register handlers for callbacks, to set the locale for error reporting, and to start an XML parse. 2.Attributes • allow users to iterate through an attribute list. • a convenience implementation available in the AttributesImpl. 3.Locator • allows users to find the location of current event in the XML source document.
Interfaces for Application Writers (org.xml.sax package) • A SAX application may implement any or none of the following interfaces, as required. • may need only ContentHandler and possibly ErrorHandler. • can implement all of these interfaces in a single class. 1. ContentHandler • receives notification of basic document-related events like the start and end of elements. • the interface applications use most often • in many cases, it is the only one needed. 2. ErrorHandler • used for special error handling.
Interfaces for Application Writers (cont'd) 3. DTDHandler • to receive notification of the NOTATION and unparsed ENTITY declarations. 4. EntityResolver • redirection of URIs in documents (or other types of custom handling). 5. DeclHandler • to receive notification of element and attribute-list declarations in the DTD. 6. LexicalHandler • to receive notification of markup boundary events. • Comment, CDATA section (begin and end) • Entity expansion (begin and end), … 7. XMLFilter • for cascading applications.
Standard SAX Classes (org.xml.sax package) 1. InputSource • Input for a parser. • wraps information for a single input, including a public identifier, system identifier, byte stream, and character stream (as appropriate). • may be instantiated by EntityResolvers. 2. SAXException: • represents a general SAX exception. • SAXParseException: represents a SAX exception tied to a specific point in an XML source document. • SAXNotSupportedException, SAXNotRecognizedException 3. DefaultHandler (located in the org.xml.sax.helpers package) • default implementations for ContentHandler, ErrorHandler, DTDHandler, and EntityResolver. • users can subclass this to simplify handler writing.
Helper Classes (org.xml.sax.helpers package) • provided simply as a convenience for Java programmers. 1. XMLReaderFactory • used to load SAX parsers dynamically at run time, based on the class name. 2. AttributesImpl • default implementation of Attributes. • can be used to make a copy of an Attributes 3. LocatorImpl • used to make a persistent snapshot of a Locator's values at a specific point in the parse. 4. XMLFilterImpl
SAX2: Features and Properties • standard methods to query and set features and properties in an XMLReader. • Features are boolean properties. • can request an XMLReader • to validate (or not to validate) a document, or • to intern (or not to intern) all names. • Use the getFeature, setFeature, getProperty, and setProperty methods to get/set a feature/property of an XMLReader: • EX:

    // check if a parser is doing validation!
    try {
        if (xmlReader.getFeature("http://xml.org/sax/features/validation")) {
            out.println("Parser is validating.");
        } else {
            out.println("Parser is not validating.");
        }
    } catch (SAXException e) {
        // feature not recognized or its value cannot be determined right now
        out.println("Parser may or may not be validating.");
    }
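Setting a feature follows the same pattern as querying one. The sketch below is illustrative only: FeatureProbe and trySetFeature are hypothetical names, and the example.org feature URI is deliberately made up to show how an unknown feature is rejected. It assumes the JDK's bundled parser.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper: tries to set a feature and reports whether it took effect.
class FeatureProbe {
    static boolean trySetFeature(XMLReader reader, String name, boolean value) {
        try {
            reader.setFeature(name, value);
            return reader.getFeature(name) == value;
        } catch (SAXNotRecognizedException e) {
            return false; // the reader does not know this feature name
        } catch (SAXNotSupportedException e) {
            return false; // known feature, but this value cannot be set now
        }
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        String[] names = {
            "http://xml.org/sax/features/namespaces",
            "http://xml.org/sax/features/validation",
            "http://example.org/features/no-such-feature" // made-up name
        };
        for (String name : names) {
            System.out.println(name + " -> " + trySetFeature(reader, name, true));
        }
    }
}
```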
SAX2 features • See SAX2 standard feature flags for more • Anyone can define his own features (by designating a unique uri) . • A feature may be read-only or read/write, and it may be modifiable only when parsing, or only when not parsing. • http://xml.org/sax/features/namespaces • true => Perform Namespace processing. • (URI + localPart ) reported + prefixMapping events generated • false: Optionally do not perform Namespace processing (implies namespace-prefixes). • access: (parsing) read-only; (not parsing) read/write • …/namespace-prefixes // qName + xmlns* attributes reported • true: qualified names (pref:local) reported and namespace declarations (xmlns*) treated as attributes as well. • false: no Namespace declarations reported, and optionally no qualified names reported. • access: (parsing) read-only; (not parsing) read/write
standard Features supplied by SAX2 • …/string-interning • true => All element names, prefixes, attribute names, Namespace URIs, and local names are internalized using java.lang.String#intern(). • access: (parsing) read-only; (not parsing) read/write • …/validation • true => Report all validation errors (implies external-general-entities and external-parameter-entities). • access: (parsing) read-only; (not parsing) read/write • …/external-general-entities • true => Include all external general (text) entities. • access: (parsing) read-only; (not parsing) read/write • .../external-parameter-entities • true: Include all external parameter entities, including the external DTD subset. • false: Do not include any external parameter entities, even the external DTD subset. • access: (parsing) read-only; (not parsing) read/write
SAX2 Properties • See standard SAX2 Properties for more • http://xml.org/sax/properties/lexical-handler • data type: org.xml.sax.ext.LexicalHandler • description: The registered lexical handler. access: read/write • …/declaration-handler • data type: org.xml.sax.ext.DeclHandler • description: The registered Declaration handler. access: read/write • …/document-xml-version • XML version; String:“1.0” or “1.1” • …/dom-node • data type: org.w3c.dom.Node • description: the current DOM node being visited if this is a DOM tree Walker • access: (parsing) read-only; (not parsing) read/write • …/xml-string// not supported by Xerces • data type: java.lang.String • description: The string source for the current event. • access: read-only
SAX2 Namespace Support • standardized Namespace support • essential for higher-level standards like XSL, XML Schemas, RDF, and XLink. • Namespace processing affects only element and attribute names. • ex: <x:e y:att="z:val"/> // the x and y prefixes are resolved, but not z. • With Namespace processing: • name = [URI] + localName (must not contain ':') • and qName may or may not be valid • Without Namespace processing: • name = qName (a qualified name may contain ':') • SAX2 • supports either of these views, or both simultaneously.
SAX2 namespace support • affects the ContentHandler and Attributes interfaces. • In SAX2, the startElement and endElement callbacks in a content handler look like this: public void startElement (String uri, String localName, String qName, Attributes atts) throws SAXException; public void endElement (String uri, String localName, String qName) throws SAXException; • By default, an XML reader will report a Namespace URI and a local name for every element, in both the start and end handler. • Example: <html:hr xmlns:html="http://www.w3.org/1999/xhtml"/> • uri = "http://www.w3.org/1999/xhtml" • localName = "hr" • qName = "html:hr" or "", depending on whether the namespace-prefixes feature is set
startPrefixMapping, endPrefixMapping • SAX2 also reports the scope of Namespace declarations, so that applications can resolve prefixes in attribute values or character data if necessary. public void startPrefixMapping (String prefix, String uri) throws SAXException; public void endPrefixMapping (String prefix) throws SAXException; Ex: Before the start-element event, the XML reader would call : startPrefixMapping("html","http://www.w3.org/1999/xhtml") After the end-element event ,the XML reader would call : endPrefixMapping("html")
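The ordering of prefix-mapping events around an element can be observed directly. This is a minimal sketch under the default feature settings; PrefixScope and its record() method are hypothetical names chosen for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper: logs prefix-mapping and element events in arrival order.
class PrefixScope extends DefaultHandler {
    private final List<String> log = new ArrayList<>();

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        log.add("startPrefixMapping " + prefix + " -> " + uri);
    }

    @Override
    public void endPrefixMapping(String prefix) {
        log.add("endPrefixMapping " + prefix);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        log.add("startElement {" + uri + "}" + localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        log.add("endElement {" + uri + "}" + localName);
    }

    static List<String> record(String xml) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        PrefixScope handler = new PrefixScope();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.log;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html:hr xmlns:html=\"http://www.w3.org/1999/xhtml\"/>";
        for (String entry : record(xml)) System.out.println(entry);
    }
}
```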
Configuring Namespace Support • "http://xml.org/sax/features/namespaces" feature • true [default] => • Namespace URIs + local names valid, and • start/endPrefixMapping events reported. • "http://xml.org/sax/features/namespace-prefixes" feature • true => • prefixed names (qName) valid and • Namespace declarations (xmlns* attributes) reported in attributes. • false [default] => qualified (prefixed) names (qName) may optionally be reported (in practice, all are reported), but • xmlns* attributes must not be reported. Note: 1. At least one of the two features must be true. Suggestion: 1. namespace-aware: use the default settings. 2. no use of namespaces: toggle both defaults (namespaces false, namespace-prefixes true).
Configuration Example • Consider the following simple sample document: <h:hello xmlns:h="http://www.greeting.com/ns/" id="a1" h:person="David"/> • NS true, NSP false (the default) => report prefixMapping events + • h:hello => "http://www.greeting.com/ns/" + "hello"; • xmlns:h => not appearing in attrs; • id => "" (empty string) + "id" • h:person => "http://www.greeting.com/ns/" + "person". • namespaces, namespace-prefixes both true: prefixMapping events + • h:hello => "http://www.greeting.com/ns/" + "hello" + "h:hello" • xmlns:h => "…" + "h" + "xmlns:h" • id => "" (empty string) + "id" + "id" • h:person => "http://www.greeting.com/ns/" + "person" + "h:person". • namespaces is false and namespace-prefixes is true: • "" + "" + "h:hello"; "" + "" + "xmlns:h"; • "" + "" + "id"; and "" + "" + "h:person".
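The effect of the namespace-prefixes feature on the reported attributes can be checked with a short program. This is a sketch, assuming the JDK's bundled parser; AttributeViews and attributes() are made-up names. It lists each attribute as {uri}localName and only relies on the guarantees stated above: the xmlns:h declaration is absent by default and present when namespace-prefixes is true.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper: lists the attributes a handler sees under a given configuration.
class AttributeViews extends DefaultHandler {
    private final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            seen.add("{" + atts.getURI(i) + "}" + atts.getLocalName(i));
        }
    }

    static List<String> attributes(String xml, boolean namespacePrefixes) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        reader.setFeature("http://xml.org/sax/features/namespaces", true);
        reader.setFeature("http://xml.org/sax/features/namespace-prefixes", namespacePrefixes);
        AttributeViews handler = new AttributeViews();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<h:hello xmlns:h=\"http://www.greeting.com/ns/\" id=\"a1\" h:person=\"David\"/>";
        System.out.println("namespace-prefixes=false: " + attributes(xml, false));
        System.out.println("namespace-prefixes=true:  " + attributes(xml, true));
    }
}
```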
SAX2 packages • 3 packages • org.xml.sax • org.xml.sax.helpers • XMLReaderFactory, DefaultHandler • AttributesImpl, LocatorImpl • NamespaceSupport, XMLFilterImpl • AttributeListImpl, ParserAdapter, ParserFactory, XMLReaderAdapter (SAX 1.0 support; partly deprecated) • org.xml.sax.ext • DeclHandler: for DTD declaration events • LexicalHandler: for lexical events • DefaultHandler2 • Locator2, Locator2Impl, EntityResolver2, Attributes2, Attributes2Impl
Package: org.xml.sax for SAX2 • Interfaces: AttributeList (SAX1), Attributes, Attributes2 (org.xml.sax.ext), ContentHandler, DocumentHandler (SAX1), DTDHandler, EntityResolver, EntityResolver2 (org.xml.sax.ext), ErrorHandler, Locator, Locator2 (org.xml.sax.ext), Parser (SAX1), XMLReader, XMLFilter • Classes: HandlerBase (SAX1), InputSource • Exceptions: SAXException, SAXParseException, SAXNotRecognizedException, SAXNotSupportedException
Methods index: getLength() Return the number of attributes in this list. getName(int index) Return the name of an attribute in this list (by position). getType(int index) Return the type of an attribute in the list (by position). getValue(int index) Return the value of an attribute in the list (by position). getIndex(String name) getType(String name) Return the type of an attribute in the list (by name). getValue(String name) Return the value of an attribute in the list (by name). Interface org.xml.sax.AttributeList(SAX1.0 deprecated)
interface org.xml.sax.Attributes and org.xml.sax.ext.Attributes2 (methods marked [2] belong to Attributes2 only) int getLength() int getIndex(String qName) int getIndex(String uri, String localName) Look up the index of an attribute by qName or uri+localName (0-based). String getLocalName(int index) String getQName(int index) String getURI(int index) isDeclared(index | qName | uri,local) [2] declared in DTD => true String getType(int index) String getType(String qName) String getType(String uri, String localName) possible results: "CDATA", "ID", "IDREF", "IDREFS", "NMTOKEN" (+enumeration), "NMTOKENS", "ENTITY", "ENTITIES", "NOTATION" String getValue(int index) String getValue(String qName) String getValue(String uri, String localName) isSpecified(index | qName | uri,local) [2] Note: All methods return null if namespace processing does not support them, e.g. if the namespaces feature is false => getValue(uri, localName) returns null.
startDocument() endDocument() startElement( uri, localName, qName, Attributes atts) endElement(uri, localName, qName) startPrefixMapping(prefix, uri) Begin the scope of a prefix-URI Namespace mapping. endPrefixMapping(prefix) no guarantee of proper nesting among start- and end-prefixing mapping characters(char[] ch, int start, int length) Receive notification of character data. ignorableWhitespace(char[] ch, int start, int length) processingInstruction(target, data) setDocumentLocator(Locator locator) Receive an object for locating the origin of SAX document events. will be invoked only once and before any other method is called. skippedEntity( name) Receive notification of a skipped entity. interface ContentHandler
skippedEntity(name) Receive notification of a skipped entity. The Parser will invoke this method once for each entity skipped. Non-validating processors may skip entities if they have not seen the declarations (because, for example, the entity was declared in an external DTD subset). All processors may skip external entities, depending on the values of the http://xml.org/sax/features/external-general-entities and the http://xml.org/sax/features/external-parameter-entities features. <test> <a/>&ge1;bc<c/> </test> interface ContentHandler
Interface org.xml.sax.DTDHandler Method Index notationDecl(String, String, String) throws SAXException Receive notification of a notation declaration event. parameters: name + publicId + systemId Ex: <!NOTATION GIF PUBLIC "abc"> => notationDecl("GIF", "abc", null) // no system identifier given unparsedEntityDecl(name, publicId, systemId, notationName) Receive notification of an unparsed entity declaration event. Ex: <!ENTITY aPic SYSTEM "here" NDATA GIF> => unparsedEntityDecl("aPic", null, // publicId: none given "here", // systemId (a parser may resolve it to an absolute URI) "GIF" // notationName)
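A DTDHandler is registered with setDTDHandler and receives these declarations while the DTD is read. The sketch below is illustrative: DtdEvents and scan() are hypothetical names, and the system identifier is left out of the log because a parser may rewrite it into an absolute URI.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.DTDHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper: records notation and unparsed-entity declarations.
class DtdEvents implements DTDHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void notationDecl(String name, String publicId, String systemId) {
        events.add("notation " + name + " public=" + publicId);
    }

    @Override
    public void unparsedEntityDecl(String name, String publicId, String systemId,
                                   String notationName) {
        events.add("unparsed entity " + name + " notation=" + notationName);
    }

    static List<String> scan(String xml) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        DtdEvents handler = new DtdEvents();
        reader.setDTDHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE doc [\n"
                + "<!NOTATION GIF PUBLIC \"abc\">\n"
                + "<!ENTITY aPic SYSTEM \"here\" NDATA GIF>\n"
                + "]>\n<doc/>";
        for (String event : scan(xml)) System.out.println(event);
    }
}
```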
Method index parse(InputSource) Parse an XML document. parse(String) Parse an XML document from a system identifier (URI). setDocumentHandler(DocumentHandler) Allow an application to register a document event handler. setDTDHandler(DTDHandler) Allow an application to register a DTD event handler. setEntityResolver(EntityResolver) Allow an application to register a custom entity resolver. setErrorHandler(ErrorHandler) Allow an application to register an error event handler. setLocale(Locale) Allow an application to request a locale for errors and warnings. Note: all return types are void. Interface org.xml.sax.Parser(SAX1.0; skipped!)
ContentHandler : getContentHandler() setContentHandler(ContentHandler handler) DTDHandler getDTDHandler() setDTDHandler(DTDHandler handler) EntityResolver getEntityResolver() setEntityResolver(EntityResolver resolver) ErrorHandler getErrorHandler() setErrorHandler(ErrorHandler handler) parse: parse(InputSource input) parse(String systemId) Features and Properties: boolean getFeature(name) Object getProperty(name) setFeature(name, boolean value) setProperty(name, Object value) interface XMLReader
Method Index characters(char[], int, int) Receive notification of character data. endDocument() Receive notification of the end of a document. endElement(String) Receive notification of the end of an element. ignorableWhitespace(char[], int, int) Receive notification of ignorable whitespace in element content. processingInstruction(String, String) Receive notification of a processing instruction. setDocumentLocator(Locator) Receive an object for locating the origin of SAX document events. startDocument() Receive notification of the beginning of a document. startElement(String, AttributeList) Receive notification of the beginning of an element. Interface org.xml.sax.DocumentHandler(SAX1.0 skipped)
Interface org.xml.sax.Locator, org.xml.sax.ext.Locator2 (methods marked [2] belong to Locator2 only) Method Index getColumnNumber() Return the column number where the current document event ends. getLineNumber() Return the line number where the current document event ends. getPublicId() Return the public identifier for the current document event. getSystemId() Return the system identifier for the current document event. String getEncoding() [2] character encoding used String getXMLVersion() [2] XML version for the entity Note: If an implementation supports Locator2, XMLReader.getFeature("…/use-locator2") will return true.
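A handler typically keeps the Locator it receives in setDocumentLocator and queries it inside later callbacks. A minimal sketch follows; ElementLines and locate() are made-up names, and a parser is not required to supply a locator at all, so the code checks for null.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper: notes the line on which each start tag ends.
class ElementLines extends DefaultHandler {
    private Locator locator; // supplied by the parser before any other event
    private final List<String> positions = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The locator is only meaningful during a callback, so read it right here.
        if (locator != null) {
            positions.add(localName + "@" + locator.getLineNumber());
        }
    }

    static List<String> locate(String xml) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        ElementLines handler = new ElementLines();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.positions;
    }

    public static void main(String[] args) throws Exception {
        for (String p : locate("<doc>\n<para>Hello</para>\n</doc>")) System.out.println(p);
    }
}
```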
Interface org.xml.sax.EntityResolver, org.xml.sax.ext.EntityResolver2 (methods marked [2] belong to EntityResolver2 only) InputSource resolveEntity(String publicId, String systemId) InputSource resolveEntity(entityName, publicId, baseURI, systemId) [2] // baseURI + systemId => absolute URI Allow the application to resolve external entities. The parser will call this method before opening any external entity, including: the external DTD subset (entityName is "[dtd]"), and external entities referenced within the DTD or within the document element (parameter entity: %name; general entity: name). InputSource getExternalSubset(rootName, baseURI) [2] Allows applications to provide an external subset for documents that don't explicitly define one, // either no DOCTYPE, or a DOCTYPE with no external subset given. rootName: document root name; baseURI: absolute, additional hint. To use version 2, must setFeature("…/use-entity-resolver2", true). Version 2 will hide version 1 if it is used.
Special entity processing for XHTML DTD

    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;
    import org.xml.sax.EntityResolver;
    import org.xml.sax.InputSource;

    public class MyResolver implements EntityResolver {
        public InputSource resolveEntity(String publicId, String systemId) throws IOException {
            // publicId may be null, so keep the constants on the left-hand side
            if ("-//W3C//DTD XHTML 1.0 Strict//EN".equals(publicId)
                    || "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd".equals(systemId)) {
                // return my local XHTML 1.0 DTD
                Reader reader = new FileReader("myXhtmlDtdFile.dtd");
                return new InputSource(reader);
            } else {
                // use the default behaviour
                return null;
            }
        }
    }
Method Index error(SAXParseException) Receive notification of a recoverable error. fatalError(SAXParseException) Receive notification of a non-recoverable error. warning(SAXParseException) Receive notification of a warning. Interface org.xml.sax.ErrorHandler
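An ErrorHandler is registered with setErrorHandler. The following sketch collects the reported problems instead of printing them; ErrorLog and check() are hypothetical names. It assumes, as is the usual default, that the parser stops after a fatal error, so parse() is still expected to throw and the exception is caught.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper: keeps a list of everything the parser complains about.
class ErrorLog implements ErrorHandler {
    final List<String> entries = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        entries.add("warning at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) {
        entries.add("error at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) {
        entries.add("fatal at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    static List<String> check(String xml) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        ErrorLog log = new ErrorLog();
        reader.setErrorHandler(log);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Parsing was aborted; the cause has already been logged by fatalError.
        }
        return log.entries;
    }

    public static void main(String[] args) throws Exception {
        for (String entry : check("<a><b></a>")) System.out.println(entry);
    }
}
```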
interface org.xml.sax.ext.DeclHandler • attributeDecl(String eName, String aName, String type, String valueDefault, String value) • Report an attribute type declaration. • valueDefault - "#IMPLIED", "#REQUIRED", "#FIXED" or null if none of these applies. • value - A string representing the attribute's default value, or null if there is none. • enumeration or notations => [NOTATION](nm1|…|nmk) • elementDecl(name, String model) • Report an element type declaration. • externalEntityDecl(name, publicId, systemId) • Report a parsed external entity declaration. • parameter entity => name begins with %. • internalEntityDecl(name, String value) • Report an internal entity declaration. • parameter entity => name begins with %; value is replacement text.
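A DeclHandler is attached through the declaration-handler property rather than a dedicated setter. The sketch below extends org.xml.sax.ext.DefaultHandler2, which provides empty implementations of the extension handlers; DeclEvents and scan() are illustrative names, and the reader is assumed to support the property.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper: records element and attribute declarations from the DTD.
class DeclEvents extends DefaultHandler2 {
    final List<String> events = new ArrayList<>();

    @Override
    public void elementDecl(String name, String model) {
        events.add("element " + name);
    }

    @Override
    public void attributeDecl(String eName, String aName, String type,
                              String mode, String value) {
        events.add("attribute " + eName + "/" + aName + " " + type + " " + mode);
    }

    static List<String> scan(String xml) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        DeclEvents handler = new DeclEvents();
        // Throws SAXNotRecognizedException if the reader lacks this extension.
        reader.setProperty("http://xml.org/sax/properties/declaration-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE doc [\n"
                + "<!ELEMENT doc (#PCDATA)>\n"
                + "<!ATTLIST doc id ID #REQUIRED>\n"
                + "]>\n<doc id=\"d1\">text</doc>";
        for (String event : scan(xml)) System.out.println(event);
    }
}
```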
Interface org.xml.sax.ext.LexicalHandler • optional extension handler for SAX2 that provides lexical information about an XML document, such as comments and CDATA section boundaries; • XMLReaders are not required to support it. • applies to the entire document, not just to the document element, • all lexical handler events must appear between the startDocument and endDocument events. • set a LexicalHandler/DeclHandler for an XMLReader:

    try {
        xmlReader.setProperty("http://xml.org/sax/properties/lexical-handler", aLexicalHandler);
        xmlReader.setProperty("http://xml.org/sax/properties/declaration-handler", aDeclHandler);
    } catch (SAXNotRecognizedException e) {
        // the reader does not know these properties
    } catch (SAXNotSupportedException e) {
        // the reader knows the property but cannot set it now
    }
interface LexicalHandler • startDTD(String name, String publicId, String systemId) • Report the start of DTD declarations, if any. • endDTD() • Report the end of DTD declarations. • startCDATA() • Report the start of a CDATA section. • endCDATA() • Report the end of a CDATA section. • comment(char[] ch, int start, int length) • Report an XML comment anywhere in the document. • endEntity(String name) // general or parameter entity • Report the end of an entity [expansion]. • parameter entity begins with %
interface LexicalHandler • startEntity(String name) • Report the beginning of an entity in document. • name: name of the entity. • parameter entity begin with ‘%’ • external dtd subset “[dtd]” • NOTE: • Entity references in attribute values -- and the start and end of the document entity -- are never reported. • Skipped entities will be reported through the skippedEntity event, which is part of the ContentHandler interface.
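Comments and CDATA boundaries only become visible once a LexicalHandler is registered. This is a minimal sketch with made-up names (LexicalEvents, scan()), again built on DefaultHandler2 and assuming the reader supports the lexical-handler property.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper: records comment and CDATA boundary events.
class LexicalEvents extends DefaultHandler2 {
    final List<String> events = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        events.add("comment: " + new String(ch, start, length).trim());
    }

    @Override
    public void startCDATA() {
        events.add("startCDATA");
    }

    @Override
    public void endCDATA() {
        events.add("endCDATA");
    }

    static List<String> scan(String xml) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        LexicalEvents handler = new LexicalEvents();
        reader.setContentHandler(handler);
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : scan("<doc><!-- note --><![CDATA[a < b]]></doc>")) {
            System.out.println(event);
        }
    }
}
```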
Class org.xml.sax.InputSource Constructors: InputSource() Zero-argument default constructor. InputSource(InputStream) Create a new input source with a byte stream. InputSource(Reader) Create a new input source with a character stream. InputSource(String) Create a new input source with a system identifier. Access order: character stream, byte stream, systemId, publicId. Methods getByteStream() Get the byte stream for this input source. getCharacterStream() Get the character stream for this input source. getEncoding() Get the character encoding for a byte stream or URI. getPublicId() Get the public identifier for this input source. getSystemId() Get the system identifier for this input source.
setByteStream(InputStream) Set the byte stream for this input source. setCharacterStream(Reader) Set the character stream for this input source. setEncoding(String) Set the character encoding, if known. setPublicId(String) Set the public identifier for this input source. setSystemId(String) Set the system identifier for this input source. Class org.xml.sax.InputSource