CSE 121/131 Programming, Spring 2001, Lecture Notes 7. © 2000-2001 A. Sahuguet & V. Tannen

Data on the Web, today: HTML . . . <a name="primary"> <H2> Primary Faculty </H2> <DL> <DT> <BR> <A href="http://www.cis.upenn.edu/~alur/info.html"> <IMG SRC="images/resdesc.gif" ALIGN=right ALT="resdesc"></A> <A href="http://www.cis.upenn.edu/~alur/home.html"> <IMG SRC="images/home.gif" ALIGN=right ALT="Home"></A> <B>Rajeev Alur</B><BR> Associate Professor, Computer and Information Science <DD> Formal support for design and analysis of reactive, real-time, and hybrid systems. Hardware verification; Software engineering; Control of distributed multi-agent systems; Logic and concurrency theory; Distributed computing. . . .

Data on the Web, tomorrow: XML . . . <primary> <name> <first>Rajeev</first> <last>Alur</last> </name> <title>Associate Professor</title> <department>Computer and Information Science</department> <bio>http://www.cis.upenn.edu/~alur/info.html</bio> <homepage>http://www.cis.upenn.edu/~alur/home.html</homepage> <interest>Formal support for design and analysis of reactive, real-time, and hybrid systems. Hardware verification; Software engineering; Control of distributed multi-agent systems; Logic and concurrency theory; Distributed computing.</interest> </primary> . . .

What is XML? • Like HTML, XML is a “document markup language” i.e., a way to enrich text with tags and attributes. • HTML’s markup is about visual presentation. However, it is difficult for a program to manipulate the data in HTML. • XML’s markup is about the meaning of the information. This makes it easier for programs to manipulate XML. • Still, what we saw on the previous slide is an external format. Internally, XML is represented as trees.

How XML overcomes some HTML limitations • Using XML, content providers can separate form and content. [Diagram: XML content, combined with XSL stylesheets, is rendered as HTML, HTML (Web-TV), or Wireless Markup Language.] WML specification: http://www.wapforum.org/docs/technical/wml-30-apr-98.pdf
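The separation of form and content can be sketched with the XSLT support that ships with the JDK (`javax.xml.transform`). This is only an illustration: the content string is modeled on the faculty example earlier in these notes, and the stylesheet and the names `StylesheetDemo`, `CONTENT`, `TO_HTML` and `render` are made up for this sketch.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StylesheetDemo {
    // Content only: no presentation information (hypothetical sample data).
    public static final String CONTENT =
        "<primary><name><first>Rajeev</first><last>Alur</last></name>"
      + "<title>Associate Professor</title></primary>";

    // Form only: a hypothetical stylesheet that renders the content as HTML.
    public static final String TO_HTML =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\" indent=\"no\"/>"
      + "<xsl:template match=\"/primary\">"
      + "<p><b><xsl:value-of select=\"name/first\"/><xsl:text> </xsl:text>"
      + "<xsl:value-of select=\"name/last\"/></b><br/>"
      + "<xsl:value-of select=\"title\"/></p>"
      + "</xsl:template></xsl:stylesheet>";

    // Applies the stylesheet to the XML content and returns the result as text.
    public static String render(String xml, String xsl) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render(CONTENT, TO_HTML));
    }
}
```

A second stylesheet targeting another output format could be applied to the same `CONTENT` string without touching the content itself.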

Wireless Applications • Hand-held devices have some constraints • small display • narrowband network connection • limited memory and computational resources • HTML is not suitable to deliver information to them -> Need for a Wireless Markup Language (WML) • What WML offers • specific layout • new metaphor (deck, cards) • state management • binary XML format to make data more concise The same metaphor can be used for e-forms in various domains: interactive kiosks, medical forms, etc.

Manipulating XML documents • Manipulation • parsing: reading, checking syntax, transforming into an internal format • navigating • modifying • Fortunately, XML comes with a standard API that offers all these features: the Document Object Model (DOM). (API: Application Programming Interface)

DOM • “DOM provides a programmatic access to the content, structure and style of XML documents and allows languages such as Java to extract information from documents containing specific tags as if they were objects.” [Ardent’s white paper on XML] • Platform neutral API designed by W3C using CORBA/IDL • Mapping to various programming languages (Java, C++, Perl, etc.) • DOM supported by all the major players • DOM makes XML documents parser and representation independent

DOM overview • What DOM is doing <TABLE> <TBODY><TR><TD>Shady Grove</TD><TD>Aeolian</TD></TR><TR><TD>Over the River, Charlie</TD><TD>Dorian</TD></TR></TBODY></TABLE>
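A minimal sketch of the tree that a DOM parser builds from the table markup above. It uses the JDK's standard `javax.xml.parsers` API rather than any parser named in these notes; the class and method names (`TreeOutline`, `outline`, `walk`) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TreeOutline {
    // The table from the slide, written without whitespace between tags.
    public static final String TABLE =
        "<TABLE><TBODY>"
      + "<TR><TD>Shady Grove</TD><TD>Aeolian</TD></TR>"
      + "<TR><TD>Over the River, Charlie</TD><TD>Dorian</TD></TR>"
      + "</TBODY></TABLE>";

    // Parses the XML and returns one line per node, indented by tree depth.
    public static String outline(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        walk(doc.getDocumentElement(), 0, out);
        return out.toString();
    }

    // Depth-first traversal: element names as-is, text nodes in quotes.
    private static void walk(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append('"').append(node.getNodeValue()).append('"');
        } else {
            out.append(node.getNodeName());
        }
        out.append('\n');
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, out);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.print(outline(TABLE));
    }
}
```

Each `TD` element ends up with a single text-node child, which is why the cell contents appear one level below the `TD` lines in the outline.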

The DOM API (overview) • Interfaces: Node, NodeList, Attr, CharacterData, Document, Element, Entity, Comment, Text, CDATASection • interface Document: createAttribute(…), createCDATASection(…), createComment(…), createElement(…), createTextNode(…) • interface Node: appendChild(…), getAttributes(…), getChildNodes(…) • interface Element: getAttribute(name), getAttributeNode(name), getElementsByTagName(name) • The full API can be found at http://www.w3c.org/DOM
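As a rough illustration of how these interfaces fit together, the sketch below builds a small document from scratch with `createElement`, `createTextNode` and `appendChild`, then reads it back with `getElementsByTagName` and `getAttribute`. It assumes the JDK's built-in DOM implementation; the `id` attribute and the class name `BuildDemo` are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class BuildDemo {
    // Builds <primary id="alur"><title>Associate Professor</title></primary>.
    public static Document build() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .newDocument();

        // Document acts as the factory for every node it will contain.
        Element primary = doc.createElement("primary");
        primary.setAttribute("id", "alur"); // hypothetical attribute
        doc.appendChild(primary);

        Element title = doc.createElement("title");
        title.appendChild(doc.createTextNode("Associate Professor"));
        primary.appendChild(title);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = build();
        Element root = doc.getDocumentElement();
        System.out.println(root.getTagName());
        System.out.println(root.getAttribute("id"));
        System.out.println(
            doc.getElementsByTagName("title").item(0).getFirstChild().getNodeValue());
    }
}
```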

DOM in action • We take an HTML page from the IBM Patent server and we XML-ize it. • From it, we want to extract some specific information, such as the name of the inventors. • 4 ways to do it • Java DOM • Java XQL • Perl • XML-QL (will return an XML document)

The Patent Example Converted using W4F

DOM with Java
import com.ibm.xml.parser.*;
import org.w3c.dom.*;
import java.io.*;

public class Test {
    public static void main(String args[]) throws Exception {
        // Parse the file named on the command line into a DOM tree.
        Parser parser = new Parser( args[0] );
        Document doc = parser.readStream( new FileInputStream( args[0] ));
        // Collect every <Inventor> element in document order.
        NodeList nodes = doc.getElementsByTagName("Inventor");
        int n = nodes.getLength();
        for (int i = 0; i < n; i++) {
            Element node = (Element) nodes.item(i);
            // Print the value of the First_Name attribute.
            String firstName = node.getAttribute("First_Name");
            System.out.println(firstName);
        }
    }
}
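The same extraction can be sketched with the parser API bundled in current JDKs (`javax.xml.parsers`) instead of the IBM parser used on the slide. The sample document is hypothetical: it only assumes, as the slide does, that each `<Inventor>` element carries a `First_Name` attribute.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class InventorNames {
    // Hypothetical stand-in for the XML-ized patent page.
    public static final String SAMPLE =
        "<Patent>"
      + "<Inventor First_Name=\"Ada\"/>"
      + "<Inventor First_Name=\"Grace\"/>"
      + "</Patent>";

    // Returns the First_Name attribute of every <Inventor> element.
    public static List<String> firstNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("Inventor");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element inventor = (Element) nodes.item(i);
            names.add(inventor.getAttribute("First_Name"));
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        for (String name : firstNames(SAMPLE)) {
            System.out.println(name);
        }
    }
}
```

Only the parsing step differs from the slide; the `org.w3c.dom` calls are the same, which is the point of a parser-independent API.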

DOM with Java and XQL (GMD, IBM)
import de.gmd.ipsi.xql.*;
import org.w3c.dom.*;
import com.ibm.xml.parser.*;
import java.io.*;

public class XQLTest {
    public static void main(String args[]) throws Exception {
        // Parse the file named on the command line into a DOM tree.
        Parser parser = new Parser( args[0] );
        Document doc = parser.readStream( new FileInputStream( args[0] ));
        // Select every <Inventor> element, at any depth, with an XQL query.
        XQLResult r = XQL.execute("//Inventor", doc );
        for (int i = 0; i < r.getLength(); i++) {
            Element inventor = (Element) r.getItem(i);
            // Print the value of the First_Name attribute.
            String firstName = inventor.getAttribute("First_Name");
            System.out.println(firstName);
        }
    }
}
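For comparison, a query-based version can be sketched with XPath, a W3C standard in which the path `//Inventor` is also valid, using the JDK's `javax.xml.xpath` package. This swaps XPath in for XQL and is not the GMD API shown above; the sample document and the class name `XPathInventors` are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathInventors {
    // Hypothetical stand-in for the XML-ized patent page.
    public static final String SAMPLE =
        "<Patent><Inventors>"
      + "<Inventor First_Name=\"Ada\"/>"
      + "<Inventor First_Name=\"Grace\"/>"
      + "</Inventors></Patent>";

    // Evaluates //Inventor and collects the First_Name attributes.
    public static List<String> firstNames(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList result = (NodeList) xpath.evaluate(
            "//Inventor",
            new InputSource(new StringReader(xml)),
            XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < result.getLength(); i++) {
            names.add(((Element) result.item(i)).getAttribute("First_Name"));
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstNames(SAMPLE));
    }
}
```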

DOM with Perl • Extracting the name of the Inventors from the IBM Patent database.
#!/usr/bin/perl
# Include the Perl package.
use XML::DOM;

# Instantiate a new parser and parse the source file.
my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ("patent.xml");

# Get the list of nodes that correspond to <Inventor>.
my $nodes = $doc->getElementsByTagName ("Inventor");
my $n = $nodes->getLength;

# For each node, extract the First_Name attribute and print it.
for (my $i = 0; $i < $n; $i++) {
    my $node = $nodes->item ($i);
    my $first_name = $node->getAttribute ("First_Name");
    print $first_name, "\n";
}

SAX, a low-level alternative to DOM • SAX • Simple API for XML • supported by most XML parsers • event-driven parser • Instead of reading the entire file into memory and building a tree, SAX reads a stream of tokens and triggers events • startDocument • startElement • endElement • endDocument • The programmer has to write a document handler that captures these events and does something with the tokens.

An Example of SAX
import java.io.PrintWriter;
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;

public class OutputHandler implements DocumentHandler {
    private PrintWriter pw;

    public OutputHandler() { this.pw = new PrintWriter( System.out ); }
    public OutputHandler(PrintWriter pw) { this.pw = pw; }

    // Flushes pending output as a side effect.
    public String toString() { pw.flush(); return ""; }

    // Character data: print only the slice [start, start + length).
    public void characters(char[] ch, int start, int length) {
        pw.print(new String(ch, start, length));
    }

    /* to be continued … */
    // The remaining DocumentHandler methods (ignorableWhitespace,
    // processingInstruction, setDocumentLocator) are not shown.

    public void endDocument() { pw.println(""); }
    public void endElement(String name) { pw.println("</" + name + ">"); }
    public void startDocument() { pw.println("<?xml version=\"1.0\"?>"); }

    // Print the start tag followed by its attributes, if any.
    public void startElement(String name, AttributeList atts) {
        pw.print("<" + name);
        if (atts != null)
            for (int i = 0; i < atts.getLength(); ++i)
                pw.print(" " + atts.getName(i) + "=\"" + atts.getValue(i) + "\"");
        pw.println(">");
    }
}
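The handler above is written against the original SAX interface, `DocumentHandler`, which SAX2 later replaced with `ContentHandler`. The sketch below shows the same event-driven idea with the SAX2 classes bundled in the JDK: it extends `DefaultHandler`, records each event in a list, and also shows how a handler is handed to a parser. The class name `EventTrace` and the method `trace` are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EventTrace extends DefaultHandler {
    // Events are recorded in the order the parser reports them.
    private final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("startElement " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement " + qName);
    }

    @Override
    public void endDocument() {
        events.add("endDocument");
    }

    // Runs a SAX parse over the given XML text and returns the event list.
    public static List<String> trace(String xml) throws Exception {
        EventTrace handler = new EventTrace();
        SAXParserFactory.newInstance()
            .newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : trace("<a><b/></a>")) {
            System.out.println(event);
        }
    }
}
```

No tree is built at any point: the handler sees each event once and keeps only what it chooses to store.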

SAX vs DOM • SAX • does not keep the document in memory (great for stream-based processing) • navigation in the document is clumsy • does not allow updating an XML document • DOM • permits updates • offers the DOM API for navigation/construction • requires the entire document to be stored in main memory
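To make the "permits updates" point concrete, here is a small sketch that parses a document into a DOM tree, appends an element in memory, and writes the result back out with an identity `Transformer`. The element and attribute names follow the patent example; the class name `DomUpdate` and the method `addInventor` are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DomUpdate {
    // Adds one <Inventor First_Name="..."/> under the root and serializes the tree.
    public static String addInventor(String xml, String firstName) throws Exception {
        // The whole document is loaded into memory as a tree.
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        // Update the tree in place.
        Element inventor = doc.createElement("Inventor");
        inventor.setAttribute("First_Name", firstName);
        doc.getDocumentElement().appendChild(inventor);

        // A Transformer with no stylesheet copies the tree to the output.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String before = "<Patent><Inventor First_Name=\"Ada\"/></Patent>";
        System.out.println(addInventor(before, "Grace"));
    }
}
```

With SAX the same change would require re-emitting the whole document from the event stream, since there is no tree to modify.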

The Missing Link [Diagram: XML (input) -> Application -> XML (output)] • There is only a “gentlemen’s agreement” between the application and its XML environment. • Why do we need to go beyond that? • performance • static guarantees (helps to identify and control failures) • How do we create a tight contract between the application and its XML environment?

XML Binding • Requirements • a high-level specification for XML (e.g. DTD, XML-Schemas, UML, etc.) • a mapping to your favorite programming language (e.g. Java) • a compiler that will generate code (“stubs” that define an API) (Same paradigm as CORBA/IDL or ODMG/ODL) [Diagram: XML spec. -> compiler -> stubs] Sun’s Proposal: <http://www.javasoft.com/xml/white-papers.html>

Generic (DOM/SAX) vs Domain Specific API
Generic API                               | Domain specific API
generic parsing                           | domain specific parsing
getElement(“order”), getAttribute(“date”) | get_order(), get_date()
generic marshalling                       | domain specific marshalling
only runtime checks                       | both static and runtime checks
• Instead of a generic API (e.g. SAX, DOM), the application will use a domain specific API generated from the specification. • Issues • mapping XML “types” accurately to a programming language • static checks vs runtime checks (some features from the specification cannot be checked statically)
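The contrast can be sketched by hand. The `Order` class below is a hypothetical example of the kind of stub a binding compiler might generate from a specification saying that an `<order>` element has a `date` attribute; it is not the output of any tool named in these notes. The generic DOM calls are hidden inside `unmarshal`, and the application sees a typed `getDate()` instead of `getAttribute("date")`.

```java
import java.io.StringReader;
import java.time.LocalDate;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class OrderBinding {
    /** Hypothetical domain specific stub for an <order date="..."/> element. */
    public static final class Order {
        private final LocalDate date;

        private Order(LocalDate date) {
            this.date = date;
        }

        // Typed accessor: a misspelled method name fails at compile time,
        // whereas a misspelled attribute name in getAttribute(...) would
        // only show up at run time.
        public LocalDate getDate() {
            return date;
        }

        // Domain specific parsing: the runtime checks live here, in one place.
        public static Order unmarshal(String xml) throws Exception {
            Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            if (!"order".equals(root.getTagName())) {
                throw new IllegalArgumentException("expected <order>, got <"
                    + root.getTagName() + ">");
            }
            // LocalDate.parse rejects values that are not ISO dates.
            return new Order(LocalDate.parse(root.getAttribute("date")));
        }
    }

    public static void main(String[] args) throws Exception {
        Order order = Order.unmarshal("<order date=\"2001-03-15\"/>");
        System.out.println(order.getDate().getYear());
    }
}
```

The check on the root element name illustrates the last issue on the slide: some constraints from the specification still have to be verified at run time, even with a typed API.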

XML programming • Resources • Java and XML, Brett McLaughlin, Mike Loukides • XML parsers (DOM/SAX) • Apache http://xml.apache.org/xerces-j/index.html • Oracle http://technet.us.oracle.com/tech/xml/ • Sun Project X http://java.sun.com/xml/ • Microsoft http://msdn.microsoft.com/xml/default.asp • XML-binding frameworks • Oracle ClassGenerator http://technet.us.oracle.com/tech/xml/classgen/index.htm • Castor http://castor.exolab.org/
