3.2 Document Object Model (DOM)
• How can structured documents be accessed uniformly in parsers, browsers, editors, databases, ...?
• Overview of the W3C DOM Spec
  • Level 1, W3C Rec, Oct. 1998
  • Level 2, W3C Rec, Nov. 2000
  • Level 3 Validation, Core, and Load and Save W3C Recs (Spring 2004)
W3C DOM Activity has been closed
3.2: Document Object Model
DOM: What is it?
• An object-based, language-neutral API for XML and HTML documents
• Allows programs and scripts to build, access, and modify documents
• Supports building querying, filtering, transformation, formatting etc. applications on top of DOM implementations
• Mnemonic: where SAX could be read as “Serial Access XML”, DOM could be read as “Directly Obtainable in Memory”
DOM structure model
• Based on O-O concepts:
  • objects (encapsulation of data and methods)
  • methods (to access or change an object’s state)
  • interfaces (declaration of a set of methods)
• Somewhat similar to the XPath data model (to be discussed with XSLT and XQuery); roughly a syntax tree
• The tree structure is implied by the abstract relationships defined by the API; the data structures of an implementation may differ
DOM structure model (example)
<invoice form="00" type="estimated">
  <addressdata>
    <name>John Doe</name>
    <address>
      <streetaddress>Pyynpolku 1</streetaddress>
      <postoffice>70460 KUOPIO</postoffice>
    </address>
  </addressdata>
  ...
[Figure: the corresponding DOM tree. A Document node contains the Element invoice, whose attributes form="00" and type="estimated" are held in a NamedNodeMap; below it are the Elements addressdata, name, address, streetaddress, and postoffice, with Text leaves "John Doe", "Pyynpolku 1", and "70460 KUOPIO".]
Structure of DOM Level 1
I: DOM Core Interfaces
• Fundamental interfaces
  • basic interfaces: Document, Element, Attr, Text, ...
• "Extended" (XML specific) interfaces
  • CDATASection, DocumentType, Notation, Entity, EntityReference, ProcessingInstruction
II: DOM HTML Interfaces
• more convenient access to HTML documents
• we'll ignore these
DOM Level 2
• Level 1: basic representation and manipulation of document structure and content (no access to the contents of a DTD)
• DOM Level 2 adds
  • support for namespaces
  • Document.getElementById("id_val"), to access elements by ID attribute values
  • optional features (we’ll skip these)
    • interfaces to document views and style sheets
    • an event model (for user actions on elements)
    • methods for traversing the document tree and manipulating regions of a document (e.g., selected in an editor)
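A practical point about getElementById: it looks up attributes of type ID, not attributes that merely happen to be named "id". The following is a minimal sketch, assuming a document built in memory without a DTD; it uses the DOM Level 3 method setIdAttribute to mark the attribute as an ID. The class name IdLookupDemo is made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class IdLookupDemo {
    // Builds <reglist><student id="RDK1"/></reglist> in memory (no DTD involved).
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element list = doc.createElement("reglist");
            Element student = doc.createElement("student");
            student.setAttribute("id", "RDK1");
            list.appendChild(student);
            doc.appendChild(list);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = build();
        // Nothing has declared "id" to be of type ID, so the lookup finds nothing.
        System.out.println(doc.getElementById("RDK1"));               // null
        Element student = (Element) doc.getDocumentElement().getFirstChild();
        student.setIdAttribute("id", true);   // DOM Level 3: mark the attribute as an ID
        System.out.println(doc.getElementById("RDK1").getTagName());  // student
    }
}
```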
DOM Language Bindings
• Language-independence:
  • DOM interfaces are defined using the OMG Interface Definition Language (IDL, defined in the CORBA Specification)
• Language bindings (implementations of interfaces) are defined in the Recommendation for
  • Java (see the Java API doc) and
  • ECMAScript (standardised JavaScript)
Core Interfaces: Node & its variants
Node
  Document
  DocumentFragment
  Element
  Attr
  CharacterData
    Comment
    Text
      CDATASection (“extended”)
  “Extended interfaces”: DocumentType, Notation, Entity, EntityReference, ProcessingInstruction
DOM interfaces: Node
Node
  getNodeType, getNodeName, getNodeValue
  getOwnerDocument
  getParentNode
  hasChildNodes, getChildNodes
  getFirstChild, getLastChild
  getPreviousSibling, getNextSibling
  hasAttributes, getAttributes
  appendChild(newChild)
  insertBefore(newChild, refChild)
  replaceChild(newChild, oldChild)
  removeChild(oldChild)
[Figure: the invoice DOM tree with its Document, Element, and Text nodes and the NamedNodeMap of attributes.]
Type and Name of a Node
• node.getNodeType(): short int constants 1, 2, …, 12 for Node.ELEMENT_NODE, Node.ATTRIBUTE_NODE, Node.TEXT_NODE, …
• node.getNodeName()
  • for an Element = element.getTagName()
  • for an Attr: the name of the attribute
  • for anonymous nodes: "#text", "#document", "#comment" etc.
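The type codes and node names can be checked with a few lines of Java. This is a minimal sketch using the standard JAXP parser; the class name NodeNameDemo, the helper describe, and the sample document are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeNameDemo {
    // Parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns "<type code>:<node name>" for a node.
    static String describe(Node node) {
        return node.getNodeType() + ":" + node.getNodeName();
    }

    public static void main(String[] args) {
        Document doc = parse("<invoice form=\"00\">text<!-- note --></invoice>");
        Node invoice = doc.getDocumentElement();
        System.out.println(describe(doc));                                           // 9:#document
        System.out.println(describe(invoice));                                       // 1:invoice
        System.out.println(describe(invoice.getAttributes().getNamedItem("form")));  // 2:form
        System.out.println(describe(invoice.getFirstChild()));                       // 3:#text
        System.out.println(describe(invoice.getLastChild()));                        // 8:#comment
    }
}
```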
The Value of a Node
• node.getNodeValue()
  • content of a text node, value of an attribute, …; null for an Element (Notice!)
  • (Cf. XPath, where a node’s value is its full textual content)
• DOM 3 provides the full text content with the method node.getTextContent()
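The difference between getNodeValue() and getTextContent() is easy to see on a small element. A minimal sketch, assuming the standard JAXP parser; the class name NodeValueDemo is made up, and the sample markup borrows element names from the registration-list example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeValueDemo {
    // Parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<name><given>Juho</given> <family>Ahopelto</family></name>");
        Element name = doc.getDocumentElement();
        Node given = name.getFirstChild();
        System.out.println(name.getNodeValue());                   // null: Elements have no value
        System.out.println(given.getFirstChild().getNodeValue());  // Juho (value of the Text child)
        System.out.println(name.getTextContent());                 // Juho Ahopelto (DOM 3)
    }
}
```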
Object Creation in DOM
• Each DOM Node n belongs to a Document: n.getOwnerDocument()
• Objects that implement interface X are created by factory methods Document.createX(…). E.g., when doc is a Document object: doc.createElement("A"), doc.createAttribute("href"), doc.createTextNode("Hello!")
• Loading & saving are specified in DOM 3 (or are implementation-specific, or done via JAXP)
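The factory methods can be tried out as follows. This is a sketch only: the class name FactoryDemo, the helper buildLink, and the URL are placeholders. It also shows that a newly created node already has an owner document but no parent until it is attached.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class FactoryDemo {
    // Creates an empty Document through JAXP.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds <A href="http://example.org/">Hello!</A> without attaching it to the tree.
    static Element buildLink(Document doc) {
        Element a = doc.createElement("A");
        Attr href = doc.createAttribute("href");
        href.setValue("http://example.org/");   // placeholder URL
        a.setAttributeNode(href);
        a.appendChild(doc.createTextNode("Hello!"));
        return a;
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        Element a = buildLink(doc);
        System.out.println(a.getOwnerDocument() == doc);             // true
        System.out.println(a.getParentNode());                       // null: not yet in the tree
        doc.appendChild(a);
        System.out.println(doc.getDocumentElement().getTagName());   // A
    }
}
```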
DOM interfaces: Document
Document
  getDocumentElement
  getElementById(IdVal)
  getElementsByTagName(tagName)
  createElement(tagName)
  createTextNode(data)
[Figure: the invoice DOM tree with its Document, Element, and Text nodes and the NamedNodeMap of attributes.]
DOM interfaces: Element
Element
  getTagName()
  hasAttribute(name)
  getAttribute(name)
  setAttribute(attrName, value)
  removeAttribute(name)
  getElementsByTagName(name)
[Figure: the invoice DOM tree with its Document, Element, and Text nodes and the NamedNodeMap of attributes.]
Text Content Manipulation in DOM
• for objects c that implement the CharacterData interface (Text, Comment, CDATASection):
  • c.substringData(offset, count)
  • c.appendData(string)
  • c.insertData(offset, string)
  • c.deleteData(offset, count)
  • c.replaceData(offset, count, string) ( = c.deleteData(offset, count); c.insertData(offset, string) )
DOM CharacterData
• DOM strings are 0-based sequences of 16-bit characters:
    C:  Hello world, nice to see you!
        0         1         2
        01234567890123456789012345678
                                    ^ offset C.getLength()-1
• C.substringData(6, 5) = ?
• C.substringData(0, C.getLength()) = ?
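The CharacterData methods can be exercised on a Text node holding the same string. A minimal sketch; the class name CharDataDemo and the helper newText are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Text;

class CharDataDemo {
    // Creates a detached Text node with the given content.
    static Text newText(String data) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .newDocument().createTextNode(data);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Text c = newText("Hello world, nice to see you!");
        System.out.println(c.getLength());                        // 29
        System.out.println(c.substringData(6, 5));                // world
        System.out.println(c.substringData(0, c.getLength()));    // Hello world, nice to see you!
        c.replaceData(6, 5, "there");    // same as deleteData(6, 5); insertData(6, "there")
        System.out.println(c.getData());                          // Hello there, nice to see you!
        c.appendData(" Bye.");
        System.out.println(c.getData());                          // Hello there, nice to see you! Bye.
    }
}
```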
Interfaces to node collections (1)
• NodeList for ordered lists of nodes
  • returned by Node.getChildNodes() and Element/Document.getElementsByTagName("name")
  • getElementsByTagName gives the (proper) descendant elements of type "name" in document order ("*" ~ any element type)
[Figure: a tree of E and A elements numbered 1 to 6 in document order, illustrating the result of E.getElementsByTagName("E").]
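The "proper descendants" rule can be checked on a small tree. The sketch below uses its own sample document (not the tree from the slide figure); the class name DescendantsDemo is made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DescendantsDemo {
    // Parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<E><E/><A><E/></A></E>");
        Element root = doc.getDocumentElement();
        // Called on an Element: the element itself is not included.
        System.out.println(root.getElementsByTagName("E").getLength());  // 2
        // Called on the Document: the root element is a descendant too.
        System.out.println(doc.getElementsByTagName("E").getLength());   // 3
        // "*" matches every element type.
        System.out.println(root.getElementsByTagName("*").getLength());  // 3
    }
}
```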
Typical child-node access pattern
• Accessing specific nodes, or iterating over a NodeList:
• to process all children of node:
    for (int i = 0; i < node.getChildNodes().getLength(); i++)
        process(node.getChildNodes().item(i));
Interfaces to node collections (2)
• NamedNodeMap for unordered sets of nodes accessed by their name:
  • returned by Node.getAttributes(), DocumentType.getEntities()
• DocumentFragment
  • Temporary container of child nodes
  • Disappears when inserted in the tree: only its children are inserted
• NodeLists and NamedNodeMaps are "live":
  • reflect updates of the doc tree immediately
  • see the next slide
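Both behaviours, the vanishing DocumentFragment and the live NamedNodeMap, can be observed directly. A minimal sketch; the class name FragmentDemo, the helper addTwoStudents, and the attribute "course" are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;

class FragmentDemo {
    // Creates an empty Document through JAXP.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Moves two new <student> elements into parent via a fragment; returns the fragment.
    static DocumentFragment addTwoStudents(Document doc, Element parent) {
        DocumentFragment frag = doc.createDocumentFragment();
        frag.appendChild(doc.createElement("student"));
        frag.appendChild(doc.createElement("student"));
        parent.appendChild(frag);   // inserts the children, not the fragment itself
        return frag;
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        Element list = doc.createElement("reglist");
        doc.appendChild(list);
        DocumentFragment frag = addTwoStudents(doc, list);
        System.out.println(list.getChildNodes().getLength());   // 2
        System.out.println(frag.hasChildNodes());                // false: the fragment is now empty

        list.setAttribute("lastID", "41");
        NamedNodeMap attrs = list.getAttributes();
        list.setAttribute("course", "XML");      // hypothetical attribute, added after getAttributes()
        System.out.println(attrs.getLength());                   // 2: the map is live
    }
}
```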
NodeLists are “live”
• E.g., this would delete every other child of n:
    NodeList cList = n.getChildNodes();
    for (int i = 0; i < cList.getLength(); i++)
        n.removeChild(cList.item(i));
• What happens?
[Figure: node n with children A, B, C, D and the list cList, with the index positions i=0, i=1, i=2 marked.]
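The effect can be reproduced with four children A, B, C, D. The sketch below runs the loop from the slide and then one possible way to remove all children; the class name LiveListDemo and its helper methods are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class LiveListDemo {
    // Builds <n><A/><B/><C/><D/></n> and returns n.
    static Element sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element n = doc.createElement("n");
            for (String name : new String[] {"A", "B", "C", "D"}) {
                n.appendChild(doc.createElement(name));
            }
            doc.appendChild(n);
            return n;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Concatenates the names of the children of n.
    static String childNames(Node n) {
        StringBuilder sb = new StringBuilder();
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            sb.append(c.getNodeName());
        }
        return sb.toString();
    }

    // The loop from the slide: the list shrinks while the index grows.
    static void removeEveryOther(Node n) {
        NodeList cList = n.getChildNodes();
        for (int i = 0; i < cList.getLength(); i++) {
            n.removeChild(cList.item(i));
        }
    }

    // One way to really remove all children: always take the first one.
    static void removeAll(Node n) {
        while (n.hasChildNodes()) {
            n.removeChild(n.getFirstChild());
        }
    }

    public static void main(String[] args) {
        Element n = sample();
        removeEveryOther(n);
        System.out.println(childNames(n));   // BD
        removeAll(n);
        System.out.println(n.hasChildNodes());   // false
    }
}
```

With i=0 the loop removes A, so B moves to position 0; with i=1 it removes C; at i=2 the list has only two items left and the loop ends.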
DOM: XML Implementations
• Java-based parsers, e.g. Apache Xerces, Apache Crimson, …
• In the MS IE browser: COM programming interfaces for C/C++ and Visual Basic; ActiveX object programming interfaces for script languages
• Perl: XML::DOM (implements DOM Level 1)
• Others, say, database APIs?
• Vendors of different kinds of systems participated in the W3C DOM WG
A Java-DOM Example
• Command-line tool RegListMgr for maintaining a course registration list
  • with single-letter commands for listing, adding, updating and deleting student records
• Example (the command l lists the contents):
    $ java RegListMgr reglist.xml
    Document loaded successfully
    > l
    …
    40: Tero Ulvinen, TKM1, tero@fake.addr.fi, 2
    41: heli viinikainen, tkt5, heli@fake.addr.fi, 1
Registration list: the XML file
<?xml version="1.0" ?>
<!DOCTYPE reglist SYSTEM "reglist.dtd">
<reglist lastID="41">
  <student id="RDK1">
    <name><given>Juho</given>
      <family>Ahopelto</family></name>
    <branchAndYear>TKT4</branchAndYear>
    <email>juho@fake.addr.fi</email>
    <group>2</group>
  </student>
  <!-- … and the other students … -->
</reglist>
Registration List: the DTD
<!ELEMENT reglist (student*)>
<!ATTLIST reglist lastID CDATA #REQUIRED>
<!ELEMENT student (name, branchAndYear, email, group)>
<!ATTLIST student id ID #REQUIRED>
<!ELEMENT name (given, family)>
<!ELEMENT given (#PCDATA)>
<!-- … and the same for family, branchAndYear, email, and group -->
Loading and Saving the RegList
• Loading of the registration list into a DOM Document doc is implemented with a JAXP DocumentBuilder
  • (to be discussed later)
  • doc is a handle to the Document
• Saving is implemented with a JAXP Transformer
  • to be discussed later
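As a preview of the JAXP calls involved, the following sketch loads a document with a DocumentBuilder and serialises it with an identity Transformer. It is not the RegListMgr code: the class name RegListIO is made up, and it works on strings instead of files so that it is self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class RegListIO {
    // Loading: parse XML text into a DOM Document with a JAXP DocumentBuilder.
    static Document load(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Saving: an identity Transformer copies the DOM tree to an output stream.
    static String save(Document doc) {
        try {
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = load("<reglist lastID=\"41\"/>");
        doc.getDocumentElement().setAttribute("lastID", "42");
        System.out.println(save(doc));   // serialised XML, now with lastID="42"
    }
}
```

For files, DocumentBuilder.parse(File) and new StreamResult(File) can be used in place of the string-based source and result.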
Listing student records (1)
NodeList students = doc.getElementsByTagName("student");
for (int i = 0; i < students.getLength(); i++)
    showStudent((Element) students.item(i));

private void showStudent(Element student) {
    // Collect relevant sub-elements.
    // Note: getNextSibling() finds the next element only if the parser has
    // dropped the whitespace between elements (validated structure).
    Node given = student.getElementsByTagName("given").item(0);
    Node family = given.getNextSibling();
    Node bAndY = student.getElementsByTagName("branchAndYear").item(0);
    Node email = bAndY.getNextSibling();
    Node group = email.getNextSibling();
Listing student records (2)
    // Method showStudent continues:
    System.out.print(student.getAttribute("id").substring(3));
    System.out.print(": " + given.getFirstChild().getNodeValue());
    // or given.getTextContent() with DOM3
    // .. similarly access and display the
    // value of family, bAndY, email, and group
    // …
} // showStudent
Lessons of accessing DOM
• Access methods for relevant nodes
  • getElementsByTagName("tagName")
    • robust wrt structure modifications
  • Also others, if the structure is known (validated)
    • getFirstChild(), getLastChild(), getPreviousSibling(), getNextSibling()
• Element nodes have no value!
  • Get the value from child Text nodes, or use getTextContent()
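Why the sibling-based methods need a known (validated) structure: with a plain non-validating parse, whitespace between elements shows up as Text nodes. The sketch below illustrates this and a name-based helper that does not depend on sibling positions; the class name ChildTextDemo and the helper textOf are hypothetical and not part of RegListMgr.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ChildTextDemo {
    // Parses an XML string without validation, so whitespace between elements is kept.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Text content of the first descendant element named tag, or null if there is none.
    static String textOf(Element parent, String tag) {
        Node n = parent.getElementsByTagName(tag).item(0);
        return n == null ? null : n.getTextContent();
    }

    public static void main(String[] args) {
        Document doc = parse("<student id=\"RDK1\"><name><given>Juho</given>\n"
                + "<family>Ahopelto</family></name></student>");
        Element student = doc.getDocumentElement();
        Node given = student.getElementsByTagName("given").item(0);
        // The next sibling is the line break, not the <family> element:
        System.out.println(given.getNextSibling().getNodeName());   // #text
        System.out.println(textOf(student, "family"));              // Ahopelto
    }
}
```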
Adding New Records
• Example (the command a adds students):
    > a
    First name (or <return> to finish): Antti
    Last name: Ahkera
    Branch&year: tkt3
    email: antti@fake.addr.fi
    group: 2
    First name (or <return> to finish):
    Finished adding records
    > l
    …
    41: heli viinikainen, tkt5, heli@fake.addr.fi, 1
    42: Antti Ahkera, tkt3, antti@fake.addr.fi, 2
Implementing addition of records (1)
Element rootElem = doc.getDocumentElement();
String lastID = rootElem.getAttribute("lastID");
int lastIDnum = Integer.parseInt(lastID);
System.out.print("First name (or <return> to finish): ");
String firstName = terminalReader.readLine().trim();
while (firstName.length() > 0) {
    // Get the next unused ID:
    String ID = "RDK" + Integer.toString(++lastIDnum);
    // … Read values lastName, bAndY, email,
    // and group from the terminal, and then ...
Implementing addition of records (2)
    Element newStudent = newStudent(doc, ID, firstName, lastName, bAndY, email, group);
    rootElem.appendChild(newStudent);
    System.out.print("First name (or <return> to finish): ");
    firstName = terminalReader.readLine().trim();
} // while firstName.length() > 0
// Update the last ID used:
String newLastID = Integer.toString(lastIDnum);
rootElem.setAttribute("lastID", newLastID);
System.out.println("Finished adding records");
Creating new student records (1)
private Element newStudent(Document doc, String ID, String fName, String lName,
                           String bAndY, String email, String grp) {
    Element stu = doc.createElement("student");
    stu.setAttribute("id", ID);
    Element newName = doc.createElement("name");
    Element newGiven = doc.createElement("given");
    newGiven.appendChild(doc.createTextNode(fName));
    Element newFamily = doc.createElement("family");
    newFamily.appendChild(doc.createTextNode(lName));
    newName.appendChild(newGiven);
    newName.appendChild(newFamily);
    stu.appendChild(newName);
Creating new student records (2)
    // method newStudent(…) continues:
    Element newBr = doc.createElement("branchAndYear");
    newBr.appendChild(doc.createTextNode(bAndY));
    stu.appendChild(newBr);
    Element newEmail = doc.createElement("email");
    newEmail.appendChild(doc.createTextNode(email));
    stu.appendChild(newEmail);
    Element newGrp = doc.createElement("group");
    newGrp.appendChild(doc.createTextNode(grp));   // the parameter is named grp
    stu.appendChild(newGrp);
    return stu;
} // newStudent
Lessons of modifying DOM
• Each node must be created with
  • Document.create...("nameOrValue")
  • Attributes of an element are set more easily with setAttribute("name", "value")
• ... and connected to the structure
  • Normally with parent.appendChild(newChild)
• Updates and deletions in the RegListMgr are done similarly, by manipulating the DOM structures
  • -> exercises
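One possible shape for such updates and deletions is sketched below. It is an illustration under assumptions, not the course's solution: the class name UpdateDeleteDemo and the methods findStudent, deleteStudent, and updateEmail are hypothetical, and students are located by comparing the id attribute rather than with getElementById.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class UpdateDeleteDemo {
    // Parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the <student> whose id attribute equals id, or null.
    static Element findStudent(Document doc, String id) {
        NodeList students = doc.getElementsByTagName("student");
        for (int i = 0; i < students.getLength(); i++) {
            Element s = (Element) students.item(i);
            if (id.equals(s.getAttribute("id"))) return s;
        }
        return null;
    }

    // Deletion: detach the student element from its parent.
    static boolean deleteStudent(Document doc, String id) {
        Element s = findStudent(doc, id);
        if (s == null) return false;
        s.getParentNode().removeChild(s);
        return true;
    }

    // Update: replace the text content of the student's <email> element (DOM 3).
    static boolean updateEmail(Document doc, String id, String email) {
        Element s = findStudent(doc, id);
        if (s == null) return false;
        Node e = s.getElementsByTagName("email").item(0);
        if (e == null) return false;
        e.setTextContent(email);
        return true;
    }

    public static void main(String[] args) {
        Document doc = parse("<reglist lastID=\"2\">"
                + "<student id=\"RDK1\"><email>a@fake.addr.fi</email></student>"
                + "<student id=\"RDK2\"><email>b@fake.addr.fi</email></student>"
                + "</reglist>");
        updateEmail(doc, "RDK2", "new@fake.addr.fi");
        deleteStudent(doc, "RDK1");
        System.out.println(doc.getElementsByTagName("student").getLength());   // 1
        System.out.println(doc.getDocumentElement().getTextContent());         // new@fake.addr.fi
    }
}
```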
Efficiency of SAX vs DOM?
• DOM has a reputation of requiring more resources than streaming interfaces like SAX
• A small experiment to test this hypothesis:
  • Test task: Retrieve the title of the last section that mentions "XML Schema definition language"
  • Target docs: repeats of fragments from the W3C XML Schema Recommendation (Part 1)
  • Environment: JDK 1.6, Red Hat Linux 6, 3 GHz Pentium with 1 GB RAM
The speed of DOM vs SAX
• On small documents, up to ~ 2 MB, the SAX and DOM based solutions are roughly equal
[Chart: throughput of the two solutions, annotated with ~ 3.0 MB/s and ~ 3.9 MB/s.]
Resource needs of DOM vs SAX
• On larger documents, up to ~ 60 MB, the DOM application becomes faster than SAX(!)
  • DOM throughput ~ 8 MB/s
  • SAX ~ 4 MB/s
• But DOM takes a relatively large amount of RAM
  • here ~ 6 x the size of the input XML document
• The SAX application runs in a fixed space of ~ 6 MB
Summary of XML APIs so far
• Give applications access to the structure and contents of XML documents
• Event-based APIs (e.g. SAX)
  • notify the application through parsing events
  • efficient
• Object-model (or tree) based APIs (e.g. DOM)
  • provide a full parse tree
  • more convenient, but require a lot of resources with large documents
• Major parsers support both SAX and DOM
  • used through proprietary methods
  • used through JAXP (-> next)