Chapter 3 XML – the ‘X’ in Ajax
Introduction • Integration of heterogeneous information systems is the key challenge to information technologies • System integration: how to let distributed and heterogeneous systems communicate? • Data integration: how to let distributed and heterogeneous systems understand each other’s data? • XML technologies address the data integration problem • XML is important for providing different views of the same data
Extensible Markup Language • XML is a simplified descendant of SGML (Standard Generalized Markup Language) • Like XHTML, an XML document marks up data with tags and attributes • For each data type, a different set of tag names, attributes, and syntax rules can be defined in the form of an XML dialect; like XHTML, a dialect is fixed and cannot be extended by its users • All XML documents based on the same XML dialect are called instance documents of that dialect • “Extensible” in XML means new XML dialects can always be introduced for new data types
Tags and Elements • Each XML element consists of a start tag and an end tag with nested elements or text in between (called the element value) • The start tag is of the form <tagName>, as in <html> and <p> • The end tag is of the form </tagName>, as in </html> and </p> • An XML dialect defines its allowed tag names • As a simplified rule, any string consisting of a letter followed by an optional sequence of letters or digits, and not starting with any variation of “xml”, is a valid XML tag name (the full XML rule also allows underscores, hyphens and periods)
Tags and Elements … • Tag names are case-sensitive: <p> is not the same as <P> • If an element has no value, the start tag and end tag can be combined into <tagName/>, as in <br/> • Elements must not partially overlap; for example, <a><b>data</a>data</b> is invalid
XML Attributes • The start tag of an element can have attributes in the form of a sequence of attributeName="attributeValue" pairs separated by white space • <dvd id="1"></dvd> • <letter class="firstClass" type="business">…</letter> • Single quotes can also delimit attribute values, which is useful when the value itself contains a double quote • As a simplified rule, any string consisting of a letter followed by an optional sequence of letters or digits can be a valid attribute name • Attribute names are case-sensitive: “id” and “ID” are different
XML Document Structure • An XML document contains an optional XML declaration followed by a single top-level element, which may contain nested elements and text
<?xml version="1.0" encoding="UTF-8"?>
<!-- This XML document describes a DVD library -->
<library>
  <dvd id="1">
    <title>Gone with the Wind</title>
    <format>Movie</format>
    <genre>Classic</genre>
  </dvd>
  <dvd id="2">
    <title>Star Trek</title>
    <format>TV Series</format>
    <genre>Science fiction</genre>
  </dvd>
</library>
XML Document Structure … • The optional XML declaration, if present, must be the first line • It declares the XML version; version 1.0 is the most widely used • It declares the character encoding • XML data are based on Unicode to support international characters • UTF-8 is the most compact Unicode encoding for Western languages (one byte for each ASCII character) • XML comment: <!-- multiple-line comments -->
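The byte-count remark about UTF-8 can be checked with a short Python sketch. The helper name utf8_length and the three sample characters are chosen here for illustration only.

```python
# Count how many bytes UTF-8 needs for a few sample characters.
# "A" is an ASCII letter, "\u00e9" is e-acute, "\u20ac" is the euro sign.
SAMPLES = ["A", "\u00e9", "\u20ac"]


def utf8_length(ch):
    """Return the number of bytes UTF-8 uses to encode the given text."""
    return len(ch.encode("utf-8"))


for ch in SAMPLES:
    print(repr(ch), utf8_length(ch))
```

ASCII characters take one byte each, while characters outside ASCII take two or more bytes.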
XML Document Structure … • [Figure: tree diagram of the DVD library document, with root “library”, two “dvd” children, and under each “dvd” the nodes @id, title, format and genre] • The nesting structure of an XML document can be described by a tree growing downwards (prefix @ for attributes) • Here “library” is the root or top-level element
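The tree view can be reproduced in code. The following Python sketch uses the standard xml.etree.ElementTree module to print the nesting structure of the DVD library document, with @ marking attributes; the helper name tree_lines is hypothetical.

```python
import xml.etree.ElementTree as ET

LIBRARY_XML = """<library>
  <dvd id="1">
    <title>Gone with the Wind</title>
    <format>Movie</format>
    <genre>Classic</genre>
  </dvd>
  <dvd id="2">
    <title>Star Trek</title>
    <format>TV Series</format>
    <genre>Science fiction</genre>
  </dvd>
</library>"""


def tree_lines(element, depth=0):
    """Return one indented line per element and per attribute (prefixed with @)."""
    indent = "  " * depth
    lines = [indent + element.tag]
    for name in element.attrib:
        lines.append(indent + "  @" + name)
    for child in element:
        lines.extend(tree_lines(child, depth + 1))
    return lines


print("\n".join(tree_lines(ET.fromstring(LIBRARY_XML))))
```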
Using Special Characters • The following five characters are used for identifying XML document structures and thus cannot be used in XML data directly: & < > " ' • As part of element or attribute values, they should be represented as &amp; &lt; &gt; &quot; &apos; • Invalid: <Organization>IBM & Microsoft</Organization> • Valid: <Organization>IBM &amp; Microsoft</Organization> • These alternative representations of characters are examples of entity references, which are introduced next
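When XML is generated by a program, the replacement of special characters is usually left to a library. A minimal Python sketch using the standard xml.sax.saxutils.escape function follows; the wrapper name escape_text is hypothetical.

```python
from xml.sax.saxutils import escape


def escape_text(value):
    """Escape the five special XML characters in a text or attribute value."""
    # escape() handles &, < and > by default; the extra mapping
    # adds the two quote characters.
    return escape(value, {'"': "&quot;", "'": "&apos;"})


print("<Organization>" + escape_text("IBM & Microsoft") + "</Organization>")
```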
Entity References • If a character has hexadecimal Unicode code point nnn, you can refer to it as &#xnnn; • If a character has decimal Unicode code point nnn, you can refer to it as &#nnn; • If a character or string has an entity name entityName, you can refer to it in XML as &entityName; • Define entity name euro for €: <!ENTITY euro "&#x20AC;"> • Define entity name cs for “Computer Science”: <!ENTITY cs "Computer Science"> • XML: A &cs; book costs me &euro;52. • View: A Computer Science book costs me €52.
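A parser expands numeric character references and named entities while reading the document. The Python sketch below illustrates this with xml.etree.ElementTree; the note element and the entity declaration in its internal DTD are made up for this example.

```python
import xml.etree.ElementTree as ET

# The internal DTD declares the entity "cs"; the euro sign (U+20AC, decimal 8364)
# is written once as a hexadecimal and once as a decimal character reference.
DOC = """<!DOCTYPE note [
  <!ENTITY cs "Computer Science">
]>
<note>A &cs; book costs me &#x20AC;52, which is the same as &#8364;52.</note>"""

note = ET.fromstring(DOC)
print(note.text)
```

After parsing, the application sees only the expanded text, not the references.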
Well-Formed XML Documents • A well-formed XML document must conform to the following rules, among others: • Non-empty elements are delimited by a pair of matching start tag and end tag • Empty elements may be in their self-ending tag form, such as <tagName /> • All attribute values are enclosed in matching single (') or double (") quotes • Elements may be nested but must not partially overlap. Each non-root element must be completely contained in another element • The document complies with its declared or default character encoding
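Any XML parser can check these rules. A minimal Python sketch that reports whether a string is well-formed, using xml.etree.ElementTree (the function name is_well_formed is hypothetical):

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<a><b>data</b>data</a>"))   # properly nested
print(is_well_formed("<a><b>data</a>data</b>"))   # partially overlapping
print(is_well_formed("<dvd id=1></dvd>"))         # unquoted attribute value
```

This checks well-formedness only; it does not validate the document against a DTD or an XML Schema.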
Defining XML Dialects • “Well-formedness” is the minimal requirement for an XML document; all XML parsers can check it • Any useful XML document must follow the syntax rules of a specific XML dialect: which tags and attributes can be used, how elements can be ordered or nested … • Major mechanisms for defining XML dialects: • Document Type Definition (DTD) • XML Schema (XSD) • XML validating parsers can read an XML instance document and its syntax definition DTD/XSD file to validate whether the XML document conforms to the syntax constraints
Document Type Definition • DTD is simpler than XSD and can specify fewer syntax constraints • DTD is part of the XML specification • DTD syntax is not based on XML • Usually a DTD is specified in a separate file so it can be referred to by many of its instance XML documents • Local DTD definitions, especially entity name definitions, can also be included at the top of an XML document to override some global definitions
External DTD Example
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT library (dvd+)>
<!ELEMENT dvd (title, format, genre)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT format (#PCDATA)>
<!ELEMENT genre (#PCDATA)>
<!ATTLIST dvd id CDATA #REQUIRED>
• A “library” element contains one or more “dvd” elements • A “dvd” element contains one “title” element, one “format” element, and one “genre” element, in that order • The “title”, “format” and “genre” elements all have strings as their values • A “dvd” element has a required attribute “id” whose value is a string
Declaring Elements • Empty elements • <!ELEMENT elementName EMPTY> • <!ELEMENT br EMPTY> • Elements with text • <!ELEMENT elementName (#PCDATA)> • #PCDATA (parsed character data) means that the element contains text that the parser scans for markup such as entity references; nested elements are not allowed • DTD has no #CDATA content model for elements: CDATA is an attribute type, and text that the parser should not scan for markup is wrapped in a CDATA section, <![CDATA[ … ]]>, in the instance document • Elements with any content • <!ELEMENT elementName ANY> • The keyword ANY declares an element with any content as its value, including text, entity references and nested elements. Any element nested in this element must also be declared
Declaring Elements… • Elements with children (sequences) • <!ELEMENT elementName (childElementNames)> • <!ELEMENT index (term, pages)> • <!ELEMENT footnote (message)> • Elements with zero or more nested elements • <!ELEMENT elementName (childName*)> • <!ELEMENT footnote (message*)> • Elements with one or more nested elements • <!ELEMENT elementName (childName+)> • <!ELEMENT footnote (message+)> • Elements with an optional nested element • <!ELEMENT elementName (childName?)> • <!ELEMENT footnote (message?)>
Declaring Elements… • Elements with alternative nested elements • <!ELEMENT section (section1 | section2)> • Elements with mixed content • <!ELEMENT email (#PCDATA | to | from | header | message)*> • An email element may contain parsed character data interleaved with any number of to, from, header and message child elements, in any order • In a DTD, #PCDATA must come first in a mixed-content model, and the model cannot constrain the order or the number of the child elements
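The sequence and occurrence operators in DTD content models (the comma, *, + and ?) behave much like regular expressions over the names of child elements. The Python sketch below illustrates that idea only; it is not a DTD validator, and the two regular expressions are hand translations (an assumption of this example) of the library and dvd declarations used earlier.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written regular expressions over space-terminated child names.
CONTENT_MODELS = {
    "library": r"(dvd )+",            # <!ELEMENT library (dvd+)>
    "dvd": r"title format genre ",    # <!ELEMENT dvd (title, format, genre)>
}


def children_match(element):
    """Check the child sequence of one element against its recorded model."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return True  # no model recorded in this sketch, so nothing to check
    names = "".join(child.tag + " " for child in element)
    return re.fullmatch(model, names) is not None


doc = ET.fromstring(
    "<library><dvd><title/><format/><genre/></dvd></library>")
print(all(children_match(e) for e in doc.iter()))
```

A real validating parser also checks attributes, text content and entity declarations, which this sketch ignores.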
Declaring Attributes • Syntax: <!ATTLIST elementName attributeName attributeType defaultValue> • DTD • <!ELEMENT circle EMPTY> • <!ATTLIST circle radius CDATA "1"> • XML • <circle radius="10"></circle> • <circle/> (having default radius 1) • DTD • <!ATTLIST circle type (solid|outline) "solid"> • XML • <circle type="solid"/> • <circle type="outline"/>
Declaring Attributes … • Declaring an optional attribute without default value • DTD • <!ATTLIST circle radius CDATA #IMPLIED> • XML • <circle radius="10"></circle> • <circle/> (having no attribute radius ) • Declaring a mandatory attribute • DTD • <!ATTLIST circle radius CDATA #REQUIRED> • XML • <circle radius="10"></circle> • <circle/> (invalid element)
Declaring Entity Names • Syntax: <!ENTITY entityName "entityValue"> • <!ENTITY euro "&#x20AC;"> • <!ENTITY cs "Computer Science"> • Example usage of entity names • XML: A &cs; book costs me &euro;52. • View: A Computer Science book costs me €52.
Associating DTD Declarations to XML Documents • Including DTD declarations in an XML document • <!DOCTYPE rootElementTag [DTD-Declarations]> <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE library [ <!ELEMENT library (dvd+)> <!ELEMENT dvd (title, format, genre)> <!ELEMENT title (#PCDATA)> <!ELEMENT format (#PCDATA)> <!ELEMENT genre (#PCDATA)> <!ATTLIST dvd id CDATA #REQUIRED> ]> <library> <dvd id="1"> <title>Gone with the Wind</title> <format>Movie</format> <genre>Classic</genre> </dvd> </library>
Associating DTD Declarations to XML Documents … • Referencing an external DTD file from an XML document • <!DOCTYPE rootElementTag SYSTEM DTD-URL> <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE library SYSTEM "dvd.dtd"> <library> <dvd id="1"> <title>Gone with the Wind</title> <format>Movie</format> <genre>Classic</genre> </dvd> </library>
XML Schema (XSD) • An alternative industry standard for defining XML dialects • More expressive than DTD • Using XML syntax • Promoting declaration reuse so common declarations can be factored out and referenced by multiple element or attribute declarations
Example XSD <?xml version="1.0" encoding="UTF-8"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"> <xs:element name="library"> <xs:complexType> <xs:sequence> <xs:element name="dvd" minOccurs="0" maxOccurs="unbounded"> <xs:complexType> <xs:sequence> <xs:element name="title" type="xs:string"/> <xs:element name="format" type="xs:string"/> <xs:element name="genre" type="xs:string"/> </xs:sequence> <xs:attribute name="id" type="xs:integer" use="required"/> </xs:complexType>
Example XSD … </xs:element> </xs:sequence> </xs:complexType> </xs:element> </xs:schema>
XML Namespace • Tag and attribute names are supposed to be meaningful • Namespaces reduce the chance of name conflicts • A namespace is any unique string, typically in URL form • An XML document can define a short prefix for each namespace to qualify tag and attribute names declared under that namespace • <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"> • “http://www.w3.org/2001/XMLSchema” is the namespace for tag and attribute names of XML Schema • “xs” is defined as a prefix for this namespace • “xs:schema”: the “schema” tag name defined in namespace “http://www.w3.org/2001/XMLSchema”, which is referred to here through its prefix “xs”
XML Namespace … • To declare "http://csis.pace.edu" to be the default namespace (to which all unqualified tag names belong), use attribute xmlns="http://csis.pace.edu" • To declare that all tag/attribute names declared in the current XSD file belong to namespace "http://csis.pace.edu", use the targetNamespace attribute: <xs:schema targetNamespace="http://csis.pace.edu" ……> • Declarations in a schema element that does not specify a targetNamespace value do not belong to any namespace
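A namespace-aware parser replaces each prefix with the namespace string it stands for. The Python sketch below shows how xml.etree.ElementTree reports the xs:schema tag; the one-element schema is a made-up fragment for illustration.

```python
import xml.etree.ElementTree as ET

XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="color" type="xs:string"/>
</xs:schema>"""

root = ET.fromstring(XSD)
# ElementTree writes qualified names as {namespace}localName.
print(root.tag)

# A prefix map lets search expressions use the short prefix again.
XS = {"xs": "http://www.w3.org/2001/XMLSchema"}
element_names = [e.get("name") for e in root.findall("xs:element", XS)]
print(element_names)
```

The prefix itself carries no meaning; only the namespace string it is bound to identifies the names.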
XSD Declarations: Global vs. Local • Global declarations: XSD declarations immediately nested in the top-level schema element • Global declarations are the key to reusing declarations • Local declarations: nested inside other declarations rather than directly in the schema element, and only valid within their hosting elements
Declaring Simple Elements • To declare element color that can take on any string value • <xs:element name="color" type="xs:string"/> • Element “<color>blue</color>” will have value “blue”, and element “<color/>” will have no value • To declare element color that can take on any string value with “red” as its default value • <xs:element name="color" type="xs:string" default="red"/> • Element “<color>blue</color>” will have value “blue”, and element “<color/>” will have the default value “red”
Declaring Simple Elements … • To declare element color that can take on only the fixed string value “red” • <xs:element name="color" type="xs:string" fixed="red"/> • Element “<color>red</color>” will be correct, element “<color>blue</color>” will be invalid, and element “<color />” will have the fixed (default) value “red”
Declaring Attributes • Attribute declarations are always nested in their hosting elements’ declarations • To declare that lang is an attribute of type xs:string, and its default value is “EN” • <xs:attribute name="lang" type="xs:string" default="EN"/> • If the above attribute lang doesn’t have a default value but it must be specified for its hosting element • <xs:attribute name="lang" type="xs:string" use="required"/>
Declaring Complex Elements • To declare that product is an empty element type with an optional integer-typed attribute pid
<xs:element name="product">
  <xs:complexType>
    <xs:attribute name="pid" type="xs:integer"/>
  </xs:complexType>
</xs:element>
• Example product elements • <product/> • <product pid="1"/>
Declaring Complex Elements .. • To declare that an employee element’s value is a sequence of two nested elements: a firstName element followed by a lastName element, both of type string <xs:element name="employee"> <xs:complexType> <xs:sequence> <xs:element name="firstName" type="xs:string"/> <xs:element name="lastName" type="xs:string"/> </xs:sequence> </xs:complexType> </xs:element> • Example <employee> <firstName>Tom</firstName> <lastName>Sawyer</lastName> </employee>
Using Global Type Declarations • Global declarations promote declaration reuse <xs:element name="employee" type="fullName"/> <xs:element name="manager" type="fullName"/> <xs:complexType name="fullName"> <xs:sequence> <xs:element name="firstName" type="xs:string"/> <xs:element name="lastName" type="xs:string"/> </xs:sequence> </xs:complexType>
Declaring Complex Type Elements • To declare a complexType element shoeSize with integer element value and a string-type attribute named country <xs:element name="shoeSize"> <xs:complexType> <xs:simpleContent> <xs:extension base="xs:integer"> <xs:attribute name="country" type="xs:string"/> </xs:extension> </xs:simpleContent> </xs:complexType> </xs:element> • Example: <shoeSize country="france">35</shoeSize>
Declaring Mixed Complex Type Elements • A mixed complex type element can contain attributes, elements, and text • To declare a letter element that can have a mixture of elements and text as its value <xs:element name="letter"> <xs:complexType mixed="true"> <xs:sequence> <xs:element name="name" type="xs:string"/> <xs:element name="orderID" type="xs:positiveInteger"/> <xs:element name="shipDate" type="xs:date"/> </xs:sequence> </xs:complexType> </xs:element>
Declaring Mixed Complex Type Elements … • Example letter element <letter> Dear Mr.<name>John Smith</name>, Your order <orderID>1032</orderID> will be shipped on <shipDate>2008-09-23</shipDate>. </letter>
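For a program reading mixed content, the text fragments sit between the child elements. The Python sketch below shows how xml.etree.ElementTree exposes them through the text and tail properties; the letter is written on one line here (an adjustment made for this example) so that the white space is predictable.

```python
import xml.etree.ElementTree as ET

LETTER = ("<letter>Dear Mr. <name>John Smith</name>, your order "
          "<orderID>1032</orderID> will be shipped on "
          "<shipDate>2008-09-23</shipDate>.</letter>")

letter = ET.fromstring(LETTER)
name = letter.find("name")

print(repr(letter.text))   # text before the first child element
print(repr(name.tail))     # text between </name> and the next child

# itertext() walks text and nested element values in document order.
full_text = "".join(letter.itertext())
print(full_text)
```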
Specifying Unlimited Element Order • Use “xs:all” elements in place of “xs:sequence” elements if you allow the nested elements to occur in any order
<xs:element name="employee">
  <xs:complexType>
    <xs:all>
      <xs:element name="firstName" type="xs:string"/>
      <xs:element name="lastName" type="xs:string"/>
    </xs:all>
  </xs:complexType>
</xs:element>
• The firstName and lastName elements can occur in any order
Specifying Multiple Occurrences of an Element • Use the occurrence indicators maxOccurs and minOccurs to indicate how many times an element can occur • Attribute maxOccurs has default value 1; use the value unbounded to remove the upper limit • Attribute minOccurs has default value 1 • To declare that the dvd element can occur zero or an unlimited number of times <xs:element name="dvd" minOccurs="0" maxOccurs="unbounded">
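Because an XSD file is itself XML, the occurrence indicators can be read with an ordinary XML parser. The Python sketch below extracts them from a small fragment, assuming the XML Schema defaults of 1 for both minOccurs and maxOccurs; the owner element and the function name occurrence_range are hypothetical.

```python
import xml.etree.ElementTree as ET

XS_NS = "{http://www.w3.org/2001/XMLSchema}"

FRAGMENT = """<xs:sequence xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="dvd" minOccurs="0" maxOccurs="unbounded"/>
  <xs:element name="owner"/>
</xs:sequence>"""


def occurrence_range(decl):
    """Return (min, max) for an xs:element declaration; max is None if unbounded."""
    min_occurs = int(decl.get("minOccurs", "1"))
    raw_max = decl.get("maxOccurs", "1")
    max_occurs = None if raw_max == "unbounded" else int(raw_max)
    return min_occurs, max_occurs


ranges = {d.get("name"): occurrence_range(d)
          for d in ET.fromstring(FRAGMENT).iter(XS_NS + "element")}
print(ranges)
```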
Specifying an XML Schema without Target Namespace • Assume that • an XML dialect is specified with an XML Schema file schemaFile.xsd without using a target namespace • the Schema file has URL schemaFileURL, which is either a local file system path like “schemaFile.xsd” or a Web URL like “http://csis.pace.edu/schemaFile.xsd” • The instance documents of this dialect can be associated with its XML Schema declaration with the following structure, where rootTag is the name of a root element <rootTag xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="schemaFileURL" >
Specifying an XML Schema with Namespace • Assume that • an XML dialect is specified with an XML Schema file schemaFile.xsd using target namespace namespaceString • the Schema file has URL schemaFileURL, which is either a local file system path like “schemaFile.xsd” or a Web URL like “http://csis.pace.edu/schemaFile.xsd” • The instance documents of this dialect can be associated with its XML Schema declaration with the following structure, where rootTag is the name of a root element and the namespace string and the schema URL are separated by white space <rootTag xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="namespaceString schemaFileURL" >
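The value of xsi:schemaLocation is a white-space-separated list of namespace and schema URL pairs. The Python sketch below reads it from a root element; the sample document and the function name schema_locations are made up, reusing the namespace and file name mentioned above.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

DOC = """<library xmlns="http://csis.pace.edu"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://csis.pace.edu schemaFile.xsd"/>"""


def schema_locations(root):
    """Map each namespace string to the schema URL paired with it."""
    tokens = root.get("{%s}schemaLocation" % XSI, "").split()
    return dict(zip(tokens[0::2], tokens[1::2]))


print(schema_locations(ET.fromstring(DOC)))
```

This only reads the hint; xml.etree.ElementTree does not load the schema or validate against it.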
XML Parsing and Validation with SAX and DOM • XML parsers are used to • Read and parse XML documents • Check whether an XML document is well-formed • Check whether an XML (instance) document conforms to the syntax specification of its DTD or XSD declarations • Two types of XML parsers • SAX (Simple API for XML) • SAX works as a pipeline. It reads the input XML document sequentially and fires events when it detects the start or end of language features like elements and attributes. It is memory-efficient for sequential data processing. • DOM (Document Object Model) • A DOM parser builds a complete tree data structure in memory, which makes it more convenient for detailed document analysis and language transformation.
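The difference between the two styles can be seen by extracting the DVD titles both ways. The Python sketch below uses the standard xml.sax and xml.dom.minidom modules; the class and function names are hypothetical.

```python
import xml.sax
from xml.dom import minidom

LIBRARY_XML = b"""<library>
  <dvd id="1"><title>Gone with the Wind</title></dvd>
  <dvd id="2"><title>Star Trek</title></dvd>
</library>"""


class TitleCollector(xml.sax.ContentHandler):
    """SAX style: react to events while the document streams by."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def titles_with_sax(data):
    handler = TitleCollector()
    xml.sax.parseString(data, handler)
    return handler.titles


def titles_with_dom(data):
    """DOM style: build the whole tree first, then query it."""
    document = minidom.parseString(data)
    return [node.firstChild.data
            for node in document.getElementsByTagName("title")]


print(titles_with_sax(LIBRARY_XML))
print(titles_with_dom(LIBRARY_XML))
```

The SAX handler keeps only the state it needs, while the DOM version holds the entire document in memory before the query runs.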
XML Transformation with XSLT • XSL (Extensible Stylesheet Language) is the standard language for writing stylesheets to transform XML documents among different dialects or into other languages • XSL stylesheets are pure XML documents • XSL includes three components: • XSLT (XSL Transformation) as an XML dialect for specifying XML transformation rules or stylesheets • XPath as a standard notation system for specifying subsets of elements in an XML document • XSL-FO for formatting XML documents
Example XML Document
<?xml version="1.0" encoding="UTF-8"?>
<!-- This XML document describes a DVD library -->
<library>
  <dvd id="1">
    <title>Gone with the Wind</title>
    <format>Movie</format>
    <genre>Classic</genre>
  </dvd>
  <dvd id="2">
    <title>Star Trek</title>
    <format>TV Series</format>
    <genre>Science fiction</genre>
  </dvd>
</library>
Identifying XML Nodes with XPath • [Figure: tree diagram of the DVD library document, with root “library”, two “dvd” children, and under each “dvd” the nodes @id, title, format and genre] • Visualize all components in an XML document, including the elements, attributes and text, as graph nodes • A node is connected to another node under it if the latter is immediately nested in the former or is an attribute or text value of the former • The attribute names have symbol @ as their prefix • The sibling nodes are ordered as they appear in the XML document
Path Expressions • Path expressions are used to select nodes in an XML document • An absolute location path starts with a slash / and has the general form of /step/step/… • A relative location path does not start with a slash / and has the general form of step/step/… • In both cases, the path expression is evaluated from left to right, and each step is evaluated in the current node set to refine it
Path Expressions … • Each step has the following general form: [axisName::]nodeTest[predicate] • the optional axis name specifies the tree-relationship between the selected nodes and the current node • the node test identifies a node type within an axis • zero or more predicates are for further refining the selected node set
Path Expressions … • library: all the library elements in the current node set • /library: the root element library • library/dvd: all dvd elements that are children of library elements in the current node set • //dvd: all dvd elements anywhere in the document, no matter how many levels they are nested in other elements • library//title: all title elements that are descendants of the library elements in the current node set, no matter where they are under the library elements • //@id: all attributes that are named “id”, anywhere in the document
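Some of these path expressions can be tried in Python. The sketch below uses xml.etree.ElementTree, which supports only a subset of XPath: paths are evaluated relative to an element (here the root library element), and attribute nodes cannot be selected directly, so the equivalent of //@id is written as a loop. A full XPath engine would accept the expressions exactly as listed above.

```python
import xml.etree.ElementTree as ET

LIBRARY_XML = """<library>
  <dvd id="1">
    <title>Gone with the Wind</title>
    <format>Movie</format>
    <genre>Classic</genre>
  </dvd>
  <dvd id="2">
    <title>Star Trek</title>
    <format>TV Series</format>
    <genre>Science fiction</genre>
  </dvd>
</library>"""

root = ET.fromstring(LIBRARY_XML)   # the library element

# Relative path "dvd": dvd children of the current node (library).
dvds = root.findall("dvd")

# ".//title": title descendants at any depth below the current node.
titles = [t.text for t in root.findall(".//title")]

# A predicate on an attribute value refines the selected node set.
second_title = root.find("dvd[@id='2']/title").text

# Stand-in for //@id: collect the id attribute wherever it occurs.
ids = [e.get("id") for e in root.iter() if "id" in e.attrib]

print(len(dvds), titles, second_title, ids)
```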