XML & related technologies

Outline • Markup Language • XML • DTD • API for XML • DOM • SAX • Related Technologies • Name Space • Xpath • Xlink • XSL • Query Languages for XML • Quilt

A critique of HTML • Extraordinarily flexible, but low on structure • Fixed tag set (vocabulary) • No automatic validation • Unreliable use of the syntax • Nobody uses the data model

How XML solves this • Define your own tags (vocabulary) • Validate against the definition • Error handling and strict definition of the syntax • Smaller and simpler than SGML • Standardized APIs for working with it • A data model specification is coming

XML background • A subset of SGML • Simplifies SGML by: • leaving out many syntactical options and variants • leaving out some DTD features • leaving out some troublesome features • Recommendation approved by the W3C

Elements • A simple and complete element: <address> <street>33, Terry Dr.</street> <city>Morristown</city> </address> • The markup of an element consists of a start tag, the content, and an end tag

Elements • Attach a meaning to a piece of a document • Have an element type ('example', 'name') represented by markup (a tag) • Can be nested at any depth

Elements • Can contain: • other elements (sub-elements) <address> <street>33, Terry Dr.</street> <city>Morristown</city> </address> • text (data content) <street>33, Terry Dr.</street> • a combination of them (mixed content) <par>Today, <date>05-06-2000</date> Mr. <name>Bill Gates</name> is in California to talk to ... </par>

Document Element • It is the outermost element containing all the elements in a document example: <employee> … </employee> • It must always exist

Empty Elements • Elements without content • They need no separate end tag • They use a special form of the start tag, ending in "/>" example: <medical-dossier …/>

Attributes • Used to annotate the element with extra information • Always attached to start tags: <el-name attr-name1="v1" ... attr-nameN="vN"> …… </el-name> • An element can have any number of attributes, but their names must all be distinct

<Orders> <SalesOrder SONumber="12345"> <Customer CustNumber="543"> <CustName>ABC Industries</CustName> <Street>123 Main St.</Street> <City>Chicago</City> <State>IL</State> <PostCode>60609</PostCode> </Customer> <OrderDate>981215</OrderDate> <Line LineNumber="1"> <Part PartNumber="123"> <Description> Turkey wrench: Stainless steel, one piece construction, lifetime guarantee. </Description> <Price>9.95</Price> </Part> <Quantity>10</Quantity> </Line> </SalesOrder> </Orders>
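
To make the element/attribute distinction concrete, here is a minimal sketch that reads a trimmed copy of the sales-order document with Python's standard xml.etree.ElementTree module. The function name order_totals and the idea of computing a total are assumptions made for illustration; they are not part of the original slides.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sales-order document shown on the slide.
ORDERS_XML = """<Orders>
  <SalesOrder SONumber="12345">
    <Customer CustNumber="543">
      <CustName>ABC Industries</CustName>
      <City>Chicago</City>
    </Customer>
    <OrderDate>981215</OrderDate>
    <Line LineNumber="1">
      <Part PartNumber="123">
        <Price>9.95</Price>
      </Part>
      <Quantity>10</Quantity>
    </Line>
  </SalesOrder>
</Orders>"""


def order_totals(xml_text):
    """Return {SONumber: total price} for every SalesOrder (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    totals = {}
    for order in root.findall("SalesOrder"):
        total = 0.0
        for line in order.findall("Line"):
            # Element content is read as text ...
            price = float(line.findtext("Part/Price"))
            quantity = int(line.findtext("Quantity"))
            total += price * quantity
        # ... while attributes are read from the start tag.
        totals[order.get("SONumber")] = total
    return totals


totals = order_totals(ORDERS_XML)
customer_number = ET.fromstring(ORDERS_XML).find("SalesOrder/Customer").get("CustNumber")
```

Note that both element content and attribute values arrive as strings; any typing (float, int) is up to the application.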

An XML document <?xml version="1.0"?> <books> <book> <entry isbn="1-55860-622-X"> <title>Data on the Web:...</title> <publisher>Morgan Kaufmann</publisher> </entry> <author> Serge Abiteboul</author> <author> Peter Buneman</author> <author> Dan Suciu</author> <bookRef to="0-201-53771-O 1-55860-463-4"/> <articleLink href="http://…/articles.xml#id(Abi97)"/> </book> <book> <entry isbn="0-201-53771-O"> <title>Foundations of Databases</title> <publisher>Addison Wesley</publisher> </entry> <author> Serge Abiteboul</author> ... </book> ... </books>

Elements vs. Attributes • Do I use an element or an attribute to store semantic info? • An element, when: • I need fast searching • it is visible to everyone • it is relevant for the meaning of the document • An attribute, when: • it is a choice • it is visible only to the system • it is not relevant for the meaning of the document

Other Stuff • Processing instructions, used mainly for extension purposes (<?target data?>) • Comments (<!-- comment -->) • Character references (&#163; for £) • Entities: • named files or pieces of markup • can be referred to recursively, inserted at point of reference
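
A small sketch of how these constructs surface in a parsed document, using Python's standard xml.dom.minidom. The sample string and the price element are made up for illustration.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# Hypothetical sample: a comment, a processing instruction, and a
# character reference (&#163; is the pound sign).
SAMPLE = '<?xml version="1.0"?><!-- a comment --><?target data?><price>&#163;5</price>'

doc = minidom.parseString(SAMPLE)

kinds = []
for node in doc.childNodes:
    if node.nodeType == Node.COMMENT_NODE:
        kinds.append(("comment", node.data))
    elif node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
        kinds.append(("pi", node.target, node.data))
    elif node.nodeType == Node.ELEMENT_NODE:
        kinds.append(("element", node.tagName))

# The parser has already replaced the character reference by the character.
price_text = doc.documentElement.firstChild.data
```

The comment and the processing instruction are kept as separate nodes, while the character reference is resolved by the parser and never reaches the application as markup.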

Document Types • Basic idea: we need a type associated with a document, just like objects and values • A document type is a class of documents with similar structure and semantics Examples: slide presentations, journal articles, meeting agendas, method calls, etc.

DTDs • DTDs provide a standardized means for declaratively describing the structure of a document type • This means: • which (sub-)elements an element can contain • whether it can contain text or not • which attributes it can have • some typing and defaulting of attributes

DTD • A DTD can be • Internal: the DTD is in the document • External: the DTD is in an external file and referenced from the document • Mixed: part in the document and part outside • A DTD is logically composed of 2 parts: • Element Type Definition • Attribute List Declaration

Element Type Definition • The element type definition specifies: • structure of the document • allowed contents (content model) • allowed attributes (by means of attribute-list declarations)

Element Type Definition • The following are examples of possible declarations: • <!ELEMENT A (B*, C, D?)> • <!ELEMENT A (B | C+)> • <!ELEMENT A (#PCDATA)> • <!ELEMENT A EMPTY> • <!ELEMENT A ANY> • <!ELEMENT A (#PCDATA| B | C)*>
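
Element content models are essentially regular expressions over child element names. The sketch below makes that explicit by translating a model such as (B*, C, D?) into a Python regular expression. It is an illustration only, not how a real validating parser works, and it covers only element content (no #PCDATA, EMPTY, or ANY). The function names are hypothetical.

```python
import re

# An XML name, simplified to ASCII letters, digits, '_', '.', and '-'.
NAME = re.compile(r"[A-Za-z_][\w.\-]*")


def content_model_to_regex(model):
    """Translate an element-content model into a compiled regex.

    Each child name N becomes the group (?:N;), so a child sequence
    B, B, C is matched as the string "B;B;C;".
    """
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + ";)", model)
    # ',' means sequence in a DTD; in a regex, sequence is implicit.
    pattern = re.sub(r"[,\s]", "", pattern)
    return re.compile(pattern)


def children_match(model, children):
    """Check a list of child element names against a content model."""
    encoded = "".join(name + ";" for name in children)
    return content_model_to_regex(model).fullmatch(encoded) is not None


ok_sequence = children_match("(B*, C, D?)", ["B", "B", "C"])
missing_c = children_match("(B*, C, D?)", ["D"])
choice_ok = children_match("(B | C+)", ["C", "C"])
choice_bad = children_match("(B | C+)", ["B", "C"])
```

The operators *, +, ?, and | carry the same meaning in both notations, which is why the translation only has to rewrite names and drop commas.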

Attribute-List Declarations • The list of allowed attributes for each element. For each attribute: name, type, and other information. • Attribute types. Three groups: • string types (CDATA) • tokenized types (ID, IDREF, IDREFS, ...) • enumerated types (like the ones in Pascal)

Attribute-List Declarations • <!ELEMENT A (#PCDATA)> • <!ATTLIST A a CDATA #IMPLIED> • <!ATTLIST A a CDATA #IMPLIED b CDATA #REQUIRED> • <!ATTLIST A a CDATA "aaa"> • <!ATTLIST A a CDATA #FIXED "aaa"> • <!ATTLIST A a (aaa|bbb) "aaa"> • <!ATTLIST A id ID #REQUIRED> • <!ATTLIST A ref IDREF #IMPLIED> • Note: a default value cannot be combined with #IMPLIED or #REQUIRED; it is written alone or after #FIXED

A DTD <!DOCTYPE Orders [ <!ELEMENT Orders (SalesOrder)+> <!ELEMENT SalesOrder (Customer,OrderDate,Line*)> <!ELEMENT Customer (CustName,Street,City,State,PostCode,tel*)> <!ELEMENT CustName (#PCDATA)> <!ELEMENT Street (#PCDATA)> <!ELEMENT City (#PCDATA)> <!ELEMENT State (#PCDATA)> <!ELEMENT PostCode (#PCDATA)> <!ELEMENT tel (#PCDATA)> <!ELEMENT OrderDate (#PCDATA)> <!ELEMENT Line (Part,Quantity)> <!ELEMENT Part (Description,Price)> <!ELEMENT Quantity (#PCDATA)> <!ELEMENT Description (#PCDATA)> <!ELEMENT Price (#PCDATA)> <!ATTLIST SalesOrder SONumber CDATA #REQUIRED> <!ATTLIST Customer CustNumber CDATA #REQUIRED> <!ATTLIST Line LineNumber CDATA #IMPLIED> <!ATTLIST Part PartNumber CDATA #REQUIRED> ]>

A DTD <!DOCTYPE books [ <!ELEMENT books (book)+> <!ELEMENT book (entry, author+, bookRef, articleLink*)> <!ELEMENT entry (title, publisher)> <!ELEMENT bookRef EMPTY> <!ELEMENT articleLink EMPTY> <!ELEMENT title (#PCDATA)> <!ELEMENT author (#PCDATA)> <!ELEMENT publisher (#PCDATA)> <!ATTLIST entry isbn ID #REQUIRED> <!ATTLIST bookRef to IDREFS #IMPLIED> <!ATTLIST articleLink xmlns:xlink CDATA #FIXED "http://www.w3.org/1999/xlink" xlink:type CDATA #FIXED "simple" xlink:href CDATA #REQUIRED> ]>
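
ID and IDREFS attributes express links inside one document. The following sketch shows, under stated assumptions, how an application might resolve them by hand with xml.etree.ElementTree: the isbn attribute of entry plays the role of the ID, and the to attribute of bookRef holds a whitespace-separated list of references. The helper name resolve_book_refs and the trimmed sample are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the books document; the second reference in "to"
# has no matching entry in this excerpt.
BOOKS_XML = """<books>
  <book>
    <entry isbn="1-55860-622-X"><title>Data on the Web:...</title></entry>
    <bookRef to="0-201-53771-O 1-55860-463-4"/>
  </book>
  <book>
    <entry isbn="0-201-53771-O"><title>Foundations of Databases</title></entry>
  </book>
</books>"""


def resolve_book_refs(xml_text):
    """Map each referring book's isbn to a list of (id, title-or-None)."""
    root = ET.fromstring(xml_text)
    # ID values are unique within a document, so they can serve as keys.
    by_id = {entry.get("isbn"): entry for entry in root.iter("entry")}
    resolved = {}
    for book in root.findall("book"):
        ref = book.find("bookRef")
        if ref is None:
            continue
        # An IDREFS value is a whitespace-separated list of IDs.
        targets = ref.get("to", "").split()
        resolved[book.find("entry").get("isbn")] = [
            (t, by_id[t].findtext("title") if t in by_id else None)
            for t in targets
        ]
    return resolved


refs = resolve_book_refs(BOOKS_XML)
```

A validating parser would report the dangling reference as a validity error; this sketch simply returns None for it.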

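Well-formed and Valid Docs • A document is well-formed if it follows the syntax rules of the W3C XML Recommendation (properly nested tags, a single document element, quoted attribute values, ...). • A document is valid if it is well-formed and also conforms to a DTD which specifies the allowed structure of the document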
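
A minimal sketch of a well-formedness check with Python's standard xml.etree.ElementTree, which is backed by a non-validating parser. The helper name is_well_formed is an assumption for illustration. Checking validity against a DTD needs a validating parser, which the Python standard library does not provide.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


nested_ok = is_well_formed("<address><city>Morristown</city></address>")
bad_nesting = is_well_formed("<address><city>Morristown</address></city>")
two_roots = is_well_formed("<a/><b/>")
```

The second string fails because the tags overlap, the third because a document must have exactly one document element.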

Uses of XML Entities • Physical partition • size, reuse, "modularity", … (both XML docs & DTDs) • Non-XML data • unparsed entities => binary data • Non-standard characters • character entities • Shorthand for phrases & markup => effectively macros

Types of Entities • Internal (to a doc) vs. External (use URI) • General (in XML doc) vs. Parameter (in DTD) • Parsed (XML) vs. Unparsed (non-XML)

Entities & Physical Structure • A logical element can be split into multiple physical entities • Mylife.xml: DTD... <mylife> ... </mylife> • Chap1.xml: <teen>yada yada</teen> • Chap2.xml: <adult>blah blah..</adult>

External Text Entities • DTD: external text entity declaration (the system identifier is a URL) <!ENTITY chap1 SYSTEM "http://...chap1.xml"> • XML: entity reference <mylife> &chap1; &chap2; </mylife> • Logically equivalent to inlining the file contents <mylife> <teen>yada yada</teen> <adult>blah blah</adult> </mylife>

Internal Text Entities • DTD: internal text entity declaration <!ENTITY WWW "World Wide Web"> • XML: entity reference <p>We all use the &WWW;.</p> • Logically equivalent to the text actually appearing <p>We all use the World Wide Web.</p>
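
The macro-like behaviour of internal text entities can be observed with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree and places the entity declaration in the internal DTD subset of the document; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the internal DTD subset and referenced in the body.
DOC = """<!DOCTYPE p [
  <!ENTITY WWW "World Wide Web">
]>
<p>We all use the &WWW;.</p>"""

# The parser replaces the reference before the application sees the text.
expanded = ET.fromstring(DOC).text
print(expanded)  # We all use the World Wide Web.
```

Because expansion happens inside the parser, deeply nested entity definitions can make a small document grow very large, so parsers are often configured to limit entity expansion for untrusted input.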

Unparsed (& "Binary") Entities • DTD: declare an external, unparsed entity <!ENTITY fusion SYSTEM "http://... fusion.ps" NDATA ps> • DTD: declare the attribute type to be ENTITY <!ATTLIST fullPaper source ENTITY #REQUIRED> • DTD: NOTATION declaration (helper app.) <!NOTATION ps SYSTEM "ghostview.exe"> • XML: element with ENTITY attribute <fullPaper source="fusion"/>

Processing XML • Non-validating parser: • checks that XML doc is syntactically well-formed • Validating parser: • checks that XML doc is also valid w.r.t. a given DTD • Parsing yields tree/object representation: • Document Object Model (DOM) API • Or a stream of events (open/close tag, data): • Simple API for XML (SAX)

API for handling XML Documents DOM, SAX

DOM Structure Model and API • hierarchy of Node objects: • document, element, attribute, text, comment, ... • language-independent programming DOM API: • get... first/last child, prev/next sibling, childNodes • insertBefore, replace • getElementsByTagName • ... • alternative event-based SAX API (Simple API for XML) • does not build a parse tree (reports events when encountering begin/end tags) • for (partially) parsing very large documents
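
The navigation and update operations listed above can be tried with Python's standard xml.dom.minidom, one implementation of the DOM API. This is a minimal sketch; the inserted note element is hypothetical and only serves to show insertBefore.

```python
from xml.dom import minidom

# No whitespace between tags, so the tree contains no whitespace-only text nodes.
doc = minidom.parseString(
    "<address><street>33, Terry Dr.</street><city>Morristown</city></address>"
)

address = doc.documentElement

# Navigation: first/last child and siblings.
street = address.firstChild
city = address.lastChild
siblings_linked = street.nextSibling is city and city.previousSibling is street

# Lookup by element name anywhere in the document.
city_text = doc.getElementsByTagName("city")[0].firstChild.data

# Update: build a new (hypothetical) element and insert it before <city>.
note = doc.createElement("note")
note.appendChild(doc.createTextNode("inserted by the example"))
address.insertBefore(note, city)

child_names = [child.tagName for child in address.childNodes]
serialized = address.toxml()
```

The whole tree lives in memory, which is what makes random access and in-place updates possible.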

DOM Summary • Object-Oriented approach to traverse the XML node tree • Automatic processing of XML docs • Operations for manipulating XML tree • Manipulation & Updating of XML on client & server • Database interoperability mechanism

SAX Event-Based API • Pros: • The whole file doesn’t need to be loaded into memory • XML stream processing • Simple and fast • Allows you to ignore less interesting data • Cons: • limited expressive power (query/update) when working on streams => application needs to build (some) parse-tree when necessary
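
A minimal sketch of event-based processing with Python's standard xml.sax module. The handler collects the text of every title element and ignores everything else; the class and function names are assumptions for illustration.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collect the text content of <title> elements from SAX events."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so it is buffered.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def collect_titles(xml_text):
    handler = TitleHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles


titles = collect_titles(
    "<books><book><entry><title>Data on the Web</title></entry>"
    "<author>Serge Abiteboul</author></book></books>"
)
```

The handler keeps only the state it needs (a flag and a buffer), which is the "build some parse tree when necessary" trade-off mentioned in the cons.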

Related XML technologies Namespace, Xlink, Xpath, XSL

Namespace

Namespace • Namespaces make it possible to declare a set of names whose meaning is unambiguous, i.e. everyone agrees on their meaning. • In other words, namespaces make it possible to distinguish two elements with the same name but different meanings • An element/attribute name is a combination of 2 parts: prefix:name

Example: Namespace • Without namespaces, name and address are ambiguous: <person> <name>Rosalie Panelli</name> <address>33 Terry Dr.</address> </person> <webSite> <name>XML Italia</name> <address>http://www.xml.it</address> </webSite> • With namespaces: <person xmlns:person="http://namespaces.xml.it/person"> <person:name>Rosalie Panelli</person:name> <person:address>33 Terry Dr.</person:address> </person> <webSite xmlns:webSite="http://namespaces.xml.it/webSite"> <webSite:name>XML Italia</webSite:name> <webSite:address>http://www.xml.it</webSite:address> </webSite>
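
A namespace-aware parser identifies a name by its namespace URI, not by its prefix. The sketch below shows this with Python's standard xml.etree.ElementTree. The wrapping directory element is an assumption added so the two fragments form one document, and the short prefixes p and w in the query are chosen freely.

```python
import xml.etree.ElementTree as ET

# <directory> is a hypothetical wrapper; the rest follows the slide.
DOC = """<directory>
  <person xmlns:person="http://namespaces.xml.it/person">
    <person:name>Rosalie Panelli</person:name>
  </person>
  <webSite xmlns:webSite="http://namespaces.xml.it/webSite">
    <webSite:name>XML Italia</webSite:name>
  </webSite>
</directory>"""

root = ET.fromstring(DOC)

# Prefixes used in queries are local to the query; only the URIs matter.
NS = {
    "p": "http://namespaces.xml.it/person",
    "w": "http://namespaces.xml.it/webSite",
}
person_name = root.findtext("person/p:name", namespaces=NS)
site_name = root.findtext("webSite/w:name", namespaces=NS)

# Internally, ElementTree stores qualified names as "{uri}local-name".
qualified_tags = [el.tag for el in root.iter() if el.tag.startswith("{")]
```

The two name elements end up with different qualified names, so a query for one never matches the other.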

Namespace declaration • Currently, in order to validate a document containing a namespace against a DTD, it is necessary to include the prefixes in the element declarations: <!ELEMENT person (person:name, person:address)> <!ATTLIST person xmlns:person CDATA #FIXED "http://namespaces.xml.it/person"> <!ELEMENT person:name (#PCDATA)> <!ELEMENT person:address (#PCDATA)> • Note: the address http://namespaces.xml.it/person might be dangling, i.e. it need not point to an existing resource

Xlink

Xlink • Only internal links can be represented by means of ID/IDREF(S) attributes • Xlink is a language that allows the definition of links among documents (external links) through Xlink elements • Unidirectional links can be defined as in HTML, but other kinds of links are also possible • Based on a namespace defined specifically by the W3C

The Xlink Namespace • It consists of the following attributes: type, href, role, title, show, actuate, from, to • By means of these attributes it is possible to describe the different kinds of links

Optional and Required Attributes (R = required, O = optional)
Attribute   SIMPLE  EXTENDED  RESOURCE  LOCATOR  ARC  TITLE
TYPE        R       R         R         R        R    R
HREF        O                           R
ROLE        O       O         O         O        O
TITLE       O       O         O         O        O
SHOW        O                                    O
ACTUATE     O                                    O
FROM                                             O
TO                                               O

Description of attributes • Type: the type of the Xlink element • Href: URI address of the resource used • Role: link description used by the application • Title: link description used by the human user • Show: specifies the behavior of an application when it crosses the link • Actuate: specifies when the behavior selected by the show attribute must be executed • From/To: specify the role attributes of the sources and of the targets in an extended link

A simple link <dsi xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="http://www.dsi.unimi.it" xlink:show="new" xlink:actuate="onRequest" xlink:role="DSI" xlink:title="Dipartimento di Scienze dell’Info.."> Dipartimento di Scienze dell’Informazione </dsi>
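
Xlink attributes are ordinary namespaced attributes, so an application can read them with any namespace-aware parser. The sketch below uses Python's standard xml.etree.ElementTree and assumes the W3C XLink namespace URI http://www.w3.org/1999/xlink; the helper name xlink_attributes is hypothetical. Actually following the link (the show and actuate behavior) is left to the application.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace URI defined by the W3C XLink Recommendation.
XLINK_NS = "http://www.w3.org/1999/xlink"

DOC = f"""<dsi xmlns:xlink="{XLINK_NS}"
     xlink:type="simple"
     xlink:href="http://www.dsi.unimi.it"
     xlink:show="new"
     xlink:actuate="onRequest">Dipartimento di Scienze dell'Informazione</dsi>"""


def xlink_attributes(element):
    """Return the element's Xlink attributes keyed by local name."""
    prefix = "{" + XLINK_NS + "}"
    return {
        key[len(prefix):]: value
        for key, value in element.attrib.items()
        if key.startswith(prefix)
    }


link = xlink_attributes(ET.fromstring(DOC))
is_simple_link = link.get("type") == "simple"
```

Any element can become a link this way, because the linking semantics come from the attributes rather than from the element name.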

Actuate and Show • By means of these attributes it is possible to specify when a particular link should be crossed and how an application should behave when the link is actually crossed

The values of show • New: when the link is crossed, the resource is loaded in a new page • Replace: when the link is crossed, the target resource replaces the current page • Embed: the resource is included in the current document • Undefined: the application is free to apply the behavior it likes

The values of actuate • onLoad: the link is crossed when the document is loaded into the application • onRequest: the link is crossed when the user explicitly requests to cross the link • Undefined: the application crosses the link as it likes
