XML and Its Specification. Zachary G. Ives, University of Pennsylvania. CIS 455 / 555 – Internet and Web Systems. January 31, 2008.
Today’s Topics
• XML
• DOM
• Schemas
• Manipulating XML via query languages
• Reminder: Assignment, milestone 1 due tonight before midnight
  • Create a .tar.gz file containing your Java files
  • From bash/csh: tar czvf {my-id}.tar.gz *
  • Send the {my-id}.tar.gz via email to svilen@cis.upenn.edu
Recall: Issues in Sending Data How do we send data within a program? • What is the implicit model? • How does this change when we need to make the data persistent? What happens when we are coupling systems? • How do we send data between programs on the same machine? • Between different machines?
One Aspect Is Marshalling (Serialization/Deserialization) • Converting from an in-memory data structure to something that can be sent elsewhere • Pointers -> something else • Specific byte orderings • Metadata • Note that the same logical data gets a different physical encoding • A specific case of Codd’s idea of logical-physical separation • “Data model” vs. “data”
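A minimal sketch of the "same logical data, different physical encoding" point, using Python's standard struct module. The record (a student id and a grade-point value) is a made-up example, not something from the course materials.

```python
import struct

# Hypothetical record: a student id and a grade-point value.
sid, grade = 23, 3.5

# The same logical data marshalled with two different byte orderings.
big = struct.pack(">If", sid, grade)     # big-endian ("network order")
little = struct.pack("<If", sid, grade)  # little-endian


def unmarshal(buf: bytes, byte_order: str) -> tuple:
    """Decode a buffer; the byte order is metadata both sides must agree on."""
    return struct.unpack(byte_order + "If", buf)


print(big.hex(), little.hex())
print(unmarshal(big, ">"), unmarshal(little, "<"))
```

The two buffers differ byte for byte, yet both decode to the same values once the receiver knows the byte ordering.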
Communication and Streams • When storing data to disk, we have a combination of sequential and random access • When sending data on “the wire”, data is only sequential • “Stream-based communication” based on packets • What are the implications here? • Pipelining, incremental evaluation, …
Why Data Interchange Is Hard
Need to be able to understand:
• Data encoding (physical data model)
  • May have syntactic heterogeneity
  • Endian-ness, marshalling issues
  • Impedance mismatches
• Data representation (logical data model)
  • May have semantic heterogeneity
  • Imprecise and ambiguous values/descriptions
Examples MP3 ID3 format – record at end of file
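As an illustration of a "record at end of file" format, here is a sketch that reads an ID3v1 tag, which occupies the last 128 bytes of an MP3 file and starts with the marker "TAG". The slide does not say which ID3 version it means, so ID3v1 is an assumption, and the file contents below are fabricated for the demonstration.

```python
import struct

# ID3v1: marker, title, artist, album, year, comment, genre (128 bytes total).
ID3V1_LAYOUT = "<3s30s30s30s4s30sB"


def parse_id3v1(data: bytes):
    """Return the ID3v1 fields from the last 128 bytes, or None if no tag."""
    if len(data) < 128:
        return None
    fields = struct.unpack(ID3V1_LAYOUT, data[-128:])
    tag, title, artist, album, year, comment, genre = fields
    if tag != b"TAG":
        return None

    def text(raw: bytes) -> str:
        # Fields are fixed-width and padded with NUL bytes or spaces.
        return raw.split(b"\x00")[0].decode("latin-1").strip()

    return {"title": text(title), "artist": text(artist),
            "album": text(album), "year": text(year),
            "comment": text(comment), "genre": genre}


def pad(s: str, width: int) -> bytes:
    """Helper for building the made-up sample tag."""
    return s.encode("latin-1").ljust(width, b"\x00")


# Made-up file: placeholder audio bytes followed by a hand-built tag.
fake_mp3 = (b"\xff\xfb" + b"\x00" * 64 + b"TAG"
            + pad("Some Song", 30) + pad("Some Artist", 30)
            + pad("Some Album", 30) + b"2008" + pad("", 30) + bytes([17]))
info = parse_id3v1(fake_mp3)
print(info)
```

Nothing in the bytes themselves says where the fields begin or how wide they are; the reader has to know the layout from a manual.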
Examples
JPEG “JFIF” header:
• Start of Image (SOI) marker -- two bytes (FFD8)
• JFIF marker (FFE0)
  • length -- two bytes
  • identifier -- five bytes: 4A, 46, 49, 46, 00 (the ASCII code equivalent of a zero-terminated "JFIF" string)
  • version -- two bytes: often 01, 02
    • the most significant byte is used for major revisions
    • the least significant byte for minor revisions
  • units -- one byte: units for the X and Y densities
    • 0 => no units, X and Y specify the pixel aspect ratio
    • 1 => X and Y are dots per inch
    • 2 => X and Y are dots per cm
  • Xdensity -- two bytes
  • Ydensity -- two bytes
  • Xthumbnail -- one byte: 0 = no thumbnail
  • Ythumbnail -- one byte: 0 = no thumbnail
  • (RGB)n -- 3n bytes: packed (24-bit) RGB values for the thumbnail pixels, n = Xthumbnail * Ythumbnail
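The field list above can be decoded with a fixed binary layout. The sketch below assumes the fields appear back to back in the order listed, with big-endian multi-byte values; the sample bytes are hand-built for illustration (version 01 02, 72 dots per inch, no thumbnail) and do not come from a real image.

```python
import struct

# SOI, APP0 marker, length, identifier, major, minor, units,
# Xdensity, Ydensity, Xthumbnail, Ythumbnail: 20 bytes, big-endian.
JFIF_LAYOUT = ">HHH5sBBBHHBB"


def parse_jfif_header(data: bytes) -> dict:
    """Decode the first 20 bytes of a JFIF file into named fields."""
    (soi, app0, length, ident, major, minor,
     units, xdensity, ydensity, xthumb, ythumb) = struct.unpack(
        JFIF_LAYOUT, data[:20])
    if soi != 0xFFD8 or app0 != 0xFFE0 or ident != b"JFIF\x00":
        raise ValueError("not a JFIF header")
    return {
        "length": length,
        "version": (major, minor),
        "units": units,
        "xdensity": xdensity,
        "ydensity": ydensity,
        # Size of the packed RGB thumbnail that would follow the header.
        "thumbnail_bytes": 3 * xthumb * ythumb,
    }


# Hand-built sample header (an assumption for illustration only).
sample = bytes.fromhex("ffd8 ffe0 0010 4a46494600 0102 01 0048 0048 00 00")
header = parse_jfif_header(sample)
print(header)
```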
Finding File Formats You may need to know details in order to, e.g., crawl certain kinds of data formats • http://www.wikipedia.org/ • http://www.wotsit.org/ • etc.
The Problem You need to look into a manual to find file formats • (At best, e.g., MS .DOC file format) The Web is about making data exchange easier… Maybe we can do better! • “The mother of all file formats”
Desiderata for Data Interchange
• Ability to represent many kinds of information
  • Different data structures
• Hardware-independent encoding
  • Endian-ness, UTF vs. ASCII vs. EBCDIC
• Standard tools and interfaces
• Ability to define “shape” of expected data
  • With forwards- and backwards-compatibility!
• That’s XML…
Consumers of XML A myriad of tools and interfaces, including: • DOM – document object model • Standard OO representation of an XML tree • SAX – simple API for XML • An event-driven parser interface for XML • startElement, endElement, etc. • Ant – Java-based “make” tool with XML “makefile” • XPath, XQuery, XSL, XSLT • Web service standards
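To make the event-driven SAX style concrete, here is a small sketch using Python's xml.sax module, whose ContentHandler exposes the startElement and endElement callbacks named above. The TitleCollector class and the trimmed-down document are illustrative choices, not part of the lecture.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of every <title> element as events stream past."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


doc = b"""<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <title>PRPL: A Database Workload Specification Language</title>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <title>The 1995 SQL Reunion</title>
  </article>
</dblp>"""

handler = TitleCollector()
xml.sax.parseString(doc, handler)
print(handler.titles)
```

The handler never holds the whole tree in memory, which is what makes this style a fit for stream-based communication.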
XML as a Data Model
XML “information set” includes 7 types of nodes:
• Document (root)
• Element
• Attribute
• Processing instruction
• Text (content)
• Namespace
• Comment
XML data model includes this, plus typing info, plus order info and a few other things
Example XML Document
<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <journal>Digital System Research Center Report</journal>
    <volume>SRC1997-018</volume>
    <year>1997</year>
    <ee>db/labs/dec/SRC1997-018.html</ee>
    <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
  </article>
</dblp>
Parts labeled on the slide: processing instruction (the <?xml ... ?> line), open-tag, element, attribute, close-tag
XML Data Model Visualized (~ Document Object Model)
[Figure: tree view of the dblp document. Legend: root, p-i, element, attribute, text. The Root node has the ?xml processing instruction and the dblp element as children; dblp contains mastersthesis and article, each with mdate and key attributes and child elements (author, title, year, school; editor, title, journal, volume, year, ee, ee) whose text nodes hold the values.]
A Few Common Uses of XML
• Extensible HTML
  • Allows custom tags (e.g., used by MS Word, OpenOffice)
  • Supplement it with stylesheets (XSL) to define formatting
• Exchange format for data (still need to agree on terminology)
  • Tables, objects, etc.
• Format for marshalling and unmarshalling data in Web Services (remote function calls + returns)
XML as a Super-HTML (MS Word)
<h1 class="Section1"><a name="_top" />CIS 550: Database and Information Systems</h1>
<h2 class="Section1">Fall 2003</h2>
<p class="MsoNormal">
  <place>311 Towne</place>, Tuesday/Thursday
  <time Hour="13" Minute="30">1:30PM – 3:00PM</time>
</p>
XML Easily Encodes Relations
Student-course-grade
<student-course-grade>
  <tuple><sid>1</sid><course>330-f03</course><grade>B</grade></tuple>
  <tuple><sid>23</sid><course>455-s04</course><grade>A</grade></tuple>
</student-course-grade>
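A sketch of the relation-to-XML mapping using Python's xml.etree.ElementTree. The helper names relation_to_xml and xml_to_relation are hypothetical; the rows are the two tuples shown on the slide.

```python
import xml.etree.ElementTree as ET

columns = ["sid", "course", "grade"]
rows = [(1, "330-f03", "B"), (23, "455-s04", "A")]


def relation_to_xml(name, columns, rows):
    """One <tuple> element per row, one child element per column."""
    root = ET.Element(name)
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        for col, value in zip(columns, row):
            ET.SubElement(tup, col).text = str(value)
    return root


def xml_to_relation(root):
    """Inverse mapping; every value comes back as a string."""
    return [tuple(field.text for field in tup) for tup in root]


root = relation_to_xml("student-course-grade", columns, rows)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
print(xml_to_relation(ET.fromstring(xml_text)))
```

Note that the round trip loses the integer type of sid: plain XML carries only text, which is one motivation for the schema languages discussed later.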
It Also Encodes Objects
<projects>
  <project class="cse455">
    <type>Programming</type>
    <memberList>
      <teamMember>Joan</teamMember>
      <teamMember>Jill</teamMember>
    </memberList>
    <codeURL>www….</codeURL>
    <incorporatesProjectFrom class="cse330" />
  </project>
  …
XML and Code Web Services (.NET, recent Java web service toolkits) are using XML to pass parameters and make function calls • Why? • Easy to be forwards-compatible • Easy to read over and validate (?) • Generally firewall-compatible • Drawbacks? XML is a verbose and inefficient encoding! • We’ll see Web Services in more detail soon…
Integrating XML: What If We Have Multiple Sources with the Same Tags?
Namespaces allow us to specify a context for different tags
Two parts:
• Binding of namespace to URI
• Qualified names
<tag xmlns="http://www.fictitious.com/mypath" xmlns:myns="http://www.fictitious.com/mypath">
  <thistag>is in namespace myns (through the default namespace binding)</thistag>
  <myns:thistag>is the same</myns:thistag>
  <otherns:thistag>is a different tag (otherns must be bound to a different URI)</otherns:thistag>
</tag>
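A namespace-aware parser compares the bound URI, not the prefix. The sketch below uses Python's xml.etree.ElementTree, which expands each name to {URI}localname. It assumes the unprefixed tag is covered by a default namespace bound to the same URI as myns, and the URI bound to otherns is a made-up placeholder.

```python
import xml.etree.ElementTree as ET

# The otherns URI is a placeholder invented for this example.
doc = """<tag xmlns="http://www.fictitious.com/mypath"
     xmlns:myns="http://www.fictitious.com/mypath"
     xmlns:otherns="http://www.example.org/other">
  <thistag>default namespace</thistag>
  <myns:thistag>same URI, explicit prefix</myns:thistag>
  <otherns:thistag>different URI</otherns:thistag>
</tag>"""

root = ET.fromstring(doc)
names = [child.tag for child in root]
for name in names:
    print(name)
```

The first two children expand to the same qualified name, while the third differs even though its local name is identical.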
XML Isn’t Enough on Its Own It’s too unconstrained for many cases! • How will we know when we’re getting garbage? • How will we query? • How will we understand what we got?
Document Type Definitions (DTDs) DTD is an EBNF grammar defining XML structure • XML document specifies an associated DTD, plus the root element • DTD specifies children of the root (and so on) DTD defines special significance for attributes: • IDs – special attributes that are analogous to keys for elements • IDREFs – references to IDs • IDREFS – space-delimited list of IDREFs
An Example DTD
Example DTD:
<!ELEMENT dblp ((mastersthesis | article)*)>
<!ELEMENT mastersthesis (author, title, year, school, committeemember*)>
<!ATTLIST mastersthesis
  mdate   CDATA #REQUIRED
  key     ID    #REQUIRED
  advisor CDATA #IMPLIED>
<!ELEMENT author (#PCDATA)>
…
Example use of DTD in XML file:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE dblp SYSTEM "my.dtd">
<dblp>…
Representing Graphs in XML
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE graph SYSTEM "special.dtd">
<graph>
  <author id="author1">
    <name>John Smith</name>
  </author>
  <article>
    <author ref="author1" />
    <title>Paper1</title>
  </article>
  <article>
    <author ref="author1" />
    <title>Paper2</title>
  </article>
  …
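Following the references turns the tree into a graph. In this sketch the id and ref attributes are resolved by naming convention only: Python's xml.etree.ElementTree does not read the DTD, so it has no knowledge of which attributes are declared as ID or IDREF. The DOCTYPE line is omitted and the document is closed so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

doc = """<graph>
  <author id="author1"><name>John Smith</name></author>
  <article><author ref="author1" /><title>Paper1</title></article>
  <article><author ref="author1" /><title>Paper2</title></article>
</graph>"""

root = ET.fromstring(doc)

# Index every element that carries an id attribute (the "key" side).
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}

# Follow each ref attribute (the "reference" side) to its target.
papers_by_author = {}
for article in root.findall("article"):
    for ref in article.findall("author"):
        target = by_id[ref.get("ref")]
        name = target.findtext("name")
        papers_by_author.setdefault(name, []).append(article.findtext("title"))

print(papers_by_author)
```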
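Graph Data Model
[Figure: the graph document drawn as a tree. Root has ?xml, !DOCTYPE, and graph as children; graph contains one author (attribute id = author1, child name = John Smith) and two articles (titles Paper1 and Paper2), each with an author child whose ref attribute has the value author1.]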
Graph Data Model
[Figure: the same tree, with the ref attribute values no longer shown as text; each ref appears to be drawn as an edge to the author node whose id is author1.]
DTDs Are Very Limited DTDs capture grammatical structure, but have some drawbacks: • Not themselves in XML – inconvenient to build tools for them • Don’t capture types of scalars • Global ID/reference space is inconvenient • No way of defining OO-like inheritance
XML Schema: DTDs Rethought Features: • XML syntax • Better way of defining keys using XPaths • Type subclassing • … And, of course, built-in datatypes
Basic Constructs of Schema
Separation of elements (and attributes) from types:
• complexType is a structured type
  • It can have sequences or choices
• element and attribute have name and type
  • Elements may also have minOccurs and maxOccurs
Subtyping, most commonly using:
<complexContent>
  <extension base="prevType"> … </…>
Simple Schema Example
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="mastersthesis" type="ThesisType"/>
  <xsd:complexType name="ThesisType">
    <xsd:sequence>
      <xsd:element name="author" type="xsd:string"/>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="year" type="xsd:integer"/>
      <xsd:element name="school" type="xsd:string"/>
      <xsd:element name="committeemember" type="CommitteeType" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attribute name="mdate" type="xsd:date"/>
    <xsd:attribute name="key" type="xsd:string"/>
    <xsd:attribute name="advisor" type="xsd:string"/>
  </xsd:complexType>
</xsd:schema>
Note: within a complexType, attribute declarations must follow the sequence. CommitteeType is not defined in this fragment.
Embedding XML Schema
<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="s1.xsd">
  <grade>a</grade>
</root>
<s1:root xmlns:s1="http://www.schemaValid.com/s1ns"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.schemaValid.com/s1ns s1ns.xsd">
  <s1:grade>a</s1:grade>
</s1:root>
But the XML parser is actually free to ignore this – the schema is typically specified “from outside” the document
Designing an XML Schema/DTD Often we are given a DTD/Schema; if not, we need to design one We orient the XML tree around the “central” objects in a particular application
Manipulating XML Sometimes: • Need to restructure an XML document • Or simply need to retrieve certain parts that satisfy a constraint, e.g.: • All books • All books by author XYZ
Document Object Model (DOM) vs. Queries
• Build a DOM tree (as we saw earlier) and access via Java (etc.) DOM Node objects
  • DOM objects have methods like getFirstChild(), getNextSibling()
  • Common way of traversing the tree
  • Can also modify the DOM tree – alter the XML – via insertBefore(), appendChild(), etc.
• Alternate approach: a query language
  • Define some sort of a template describing traversals from the root of the directed graph
  • In XML, the basis of this template is called an XPath
  • Can also declare some constraints on the values you want
  • The XPath returns a node set of matches
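A sketch of DOM-style traversal and modification using Python's xml.dom.minidom. The Java accessor methods correspond here to the firstChild and nextSibling properties; the document is a shortened version of the dblp sample, written without whitespace between tags so that no whitespace text nodes appear among the children.

```python
from xml.dom import minidom

xml_text = (
    '<dblp><mastersthesis mdate="2002-01-03" key="ms/Brown92">'
    "<author>Kurt P. Brown</author><year>1992</year>"
    "</mastersthesis></dblp>"
)
doc = minidom.parseString(xml_text)
thesis = doc.documentElement.firstChild  # the <mastersthesis> element


def child_element_names(parent):
    """Walk the children with firstChild / nextSibling."""
    names = []
    node = parent.firstChild
    while node is not None:
        if node.nodeType == node.ELEMENT_NODE:
            names.append(node.nodeName)
        node = node.nextSibling
    return names


before = child_element_names(thesis)

# Modify the tree: insert a <title> element ahead of <year>.
title = doc.createElement("title")
title.appendChild(
    doc.createTextNode("PRPL: A Database Workload Specification Language"))
year = thesis.getElementsByTagName("year")[0]
thesis.insertBefore(title, year)

after = child_element_names(thesis)
print(before, after)
```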
XPaths
In its simplest form, an XPath is like a path in a file system: /mypath/subpath/*/morepath
• The XPath returns a node set representing the XML nodes (and their subtrees) at the end of the path
• XPaths can have node tests at the end, returning only particular node types, e.g., text(), processing-instruction(), comment(), and (in XPath 2.0) element(), attribute()
• XPath is fundamentally an ordered language: it can query in order-aware fashion, and it returns nodes in order
Sample XML
<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <journal>Digital System Research Center Report</journal>
    <volume>SRC1997-018</volume>
    <year>1997</year>
    <ee>db/labs/dec/SRC1997-018.html</ee>
    <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
  </article>
</dblp>
XML Data Model Visualized
[Figure: tree view of the sample dblp document. Legend: root, p-i, element, attribute, text. The Root node has ?xml and dblp as children; dblp contains mastersthesis and article with their mdate and key attributes, child elements, and text values.]
Some Example XPath Queries • /dblp/mastersthesis/title • /dblp/*/editor • //title • //title/text()
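These queries can be approximated with Python's xml.etree.ElementTree, which implements only a subset of XPath. Two adaptations are assumptions of this sketch: paths are written relative to the dblp root element rather than starting with /dblp, and since the text() node test is not supported, the text is read from each matched element instead.

```python
import xml.etree.ElementTree as ET

doc = """<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <year>1997</year>
  </article>
</dblp>"""

dblp = ET.fromstring(doc)  # the context node is the <dblp> element

# /dblp/mastersthesis/title
thesis_titles = dblp.findall("mastersthesis/title")

# /dblp/*/editor
editors = dblp.findall("*/editor")

# //title
all_titles = dblp.findall(".//title")

# //title/text()  (approximated by reading .text)
title_texts = [t.text for t in all_titles]

print([t.text for t in thesis_titles])
print([e.text for e in editors])
print(title_texts)
```

Results come back in document order, matching the ordered nature of XPath.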
Context Nodes and Relative Paths XPath has a notion of a context node: it’s analogous to a current directory • “.” represents this context node • “..” represents the parent node • We can express relative paths: subpath/sub-subpath/../.. gets us back to the context node • By default, the document root is the context node
Predicates – Filtering Operations
A predicate allows us to filter the node set based on selection-like conditions over sub-XPaths:
  /dblp/article[title = "Paper1"]
which is equivalent to:
  /dblp/article[./title/text() = "Paper1"]
because of type coercion.
What does this do:
  /dblp/article[@key = "123" and ./title/text() = "Paper1" and ./author/*/element()]
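One reading of the last query: keep articles whose key attribute is "123", whose title text is "Paper1", and whose author has at least one child that itself has an element child. The sketch below checks that reading with Python's xml.etree.ElementTree, which has no "and" operator or element() test, so the conditions are expressed as chained predicates plus a Python filter. The sample document, including the first element under name, is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up data: only the first article satisfies all three conditions.
doc = """<dblp>
  <article key="123">
    <title>Paper1</title>
    <author><name><first>John</first></name></author>
  </article>
  <article key="456"><title>Paper1</title></article>
  <article key="123"><title>Paper2</title></article>
</dblp>"""

dblp = ET.fromstring(doc)

# /dblp/article[title = "Paper1"]
by_title = dblp.findall("article[title='Paper1']")

# [@key = "123" and ./title/text() = "Paper1"] as two chained predicates
by_key_and_title = dblp.findall("article[@key='123'][title='Paper1']")

# ... and ./author/*/element(): some child of author has an element child
matches = [a for a in by_key_and_title if a.findall("author/*/*")]

print(len(by_title), len(by_key_and_title), len(matches))
```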
Axes: More Complex Traversals
Thus far, we’ve seen XPath expressions that go down the tree (and up one step)
• But we might want to go up, left, right, etc.
• These are expressed with so-called axes:
  • self::path-step
  • child::path-step, parent::path-step
  • descendant::path-step, ancestor::path-step
  • descendant-or-self::path-step, ancestor-or-self::path-step
  • preceding-sibling::path-step, following-sibling::path-step
  • preceding::path-step, following::path-step
• The previous XPaths we saw were in “abbreviated form”
Users of XPath • XML Schema uses simple XPaths in defining keys and uniqueness constraints • XLink and XPointer, hyperlinks for XML • XSLT – useful for converting from XML to other representations (e.g., HTML, PDF, SVG) • XQuery – useful for restructuring an XML document or combining multiple documents • Might well turn into the “glue” between Web Services, etc.