COMS E6125 Web-enHanced Information Management (WHIM) Prof. Gail Kaiser Spring 2008 Kaiser: COMS E6125
Today’s Topics:
• Document Structure Definition
  • Document Type Definition (DTD)
  • XML Schema (XSD)
• Querying XML Documents
  • NOT the same as Web search engines!
  • XPath
  • XQuery
Pure XML - Instance Model
• XML 1.0 implicit data model:
  • nested containers ("boxes within boxes")
  • labeled ordered trees (= semistructured data model)
• Relational or object-oriented data is easy to encode
Example instance:
<A>
  <B>foo</B>
  <C>bar</C>
  <C>psl</C>
</A>
[Figure: the same instance drawn as nested boxes and as a labeled tree with root A and children B ("foo"), C ("bar"), C ("psl"); children are ordered.]
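The labeled ordered tree can be observed directly with a parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (not part of the lecture), that parses the instance above and lists the children of A in document order.

```python
import xml.etree.ElementTree as ET

# The instance from the slide: element A with three ordered children.
doc = "<A><B>foo</B><C>bar</C><C>psl</C></A>"
root = ET.fromstring(doc)

# Iterating over an element yields its children in document order.
children = [(child.tag, child.text) for child in root]
print(root.tag, children)
```

The two C children keep their relative order, which is what distinguishes this model from an unordered set of fields.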
XML Namespaces
• Allow mixing of different tag vocabularies
• Only identify the vocabulary (lexicon)
• Additional mechanisms are required for structure and meaning (or at least metadata) of tags
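As a rough illustration of mixing vocabularies, the sketch below parses a document that uses two title elements from different namespaces. The namespace URIs and element names are hypothetical, chosen only for this example; the parser is Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, used only for illustration.
doc = """<order xmlns:po="http://example.com/po" xmlns:doc="http://example.com/doc">
  <po:title>Lawnmower order</po:title>
  <doc:title>Cover memo</doc:title>
</order>"""
root = ET.fromstring(doc)

# ElementTree expands each prefix to "{namespace-uri}local-name",
# so the two title elements remain distinct.
tags = [child.tag for child in root]
print(tags)
```

The namespace only tells the two title elements apart; it says nothing about what either may contain, which is the job of a structure definition.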
From Documents to Data
<memo importance='high' date='2008-02-11'>
  <from>Gail Kaiser</from>
  <to>Swapneel Sheth</to>
  <subject>whim tomorrow</subject>
  <body>Remember to pick up the sign-in sheet after class tomorrow </body>
</memo>
• We want to be able to
  • Extract the element structure of a document
  • Re-use this structure for other similar documents
  • Share structure and metadata with others
  • Automate processing of this structure and metadata
<invoice>
  <orderDate>2007-12-01</orderDate>
  <shipDate>2007-12-26</shipDate>
  <billingAddress>
    <name>Gail Kaiser</name>
    <street>500 West 120th Street</street>
    <city>New York</city>
    <state>NY</state>
    <zip>10027</zip>
  </billingAddress>
  <voice>212-555-1234</voice>
  <fax>212-555-4321</fax>
</invoice>
Adding Structure and Semantics
• A Document Structure Description (DSD) defines the syntax of XML documents for a particular application domain
• Defines the grammar for an XML-based markup language
Processing XML
• Non-validating parser:
  • checks that the XML document is syntactically well-formed, e.g., all open-tags have matching close-tags and are properly nested, each attribute appears at most once in an element, etc.
• Validating parser:
  • checks that the XML document is also valid with respect to a given DSD (now usually XML Schema)
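The well-formedness check performed by a non-validating parser can be sketched as follows. This assumes Python's standard xml.etree.ElementTree, which is non-validating; the helper name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if text parses as well-formed XML (no DSD is consulted)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b>ok</b></a>"))    # properly nested tags
print(is_well_formed("<a><b>bad</a></b>"))   # close-tags in the wrong order
print(is_well_formed('<a x="1" x="2"/>'))    # attribute repeated in one element
```

A validating parser would go one step further and also compare the parsed tree against the grammar given by the DSD.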
Using DSD Validators
• A DSD processor can be useful both on the server side (when writing XML documents) and on the client side (when processing XML documents):
  • Checking validity (conformance) of XML documents
  • Performing default insertion (inserts missing fragments)
DSD Processing Kaiser: COMS E6125
Several Proposed DSDs
• XML Document Type Definitions (DTDs):
  • Define the structure of “allowed” documents
  • Database schema
  • Non-XML syntax
• XML Schemas (XSDs):
  • Define structure and data types
  • Allow developers to build their own libraries of interchangeable data types
  • Written in an XML vocabulary
• Others (e.g., RELAX NG, Schematron)
Document Type Definitions
• A DTD is a grammar defining XML structure
• An XML document specifies an associated DTD, plus the root element
• The DTD specifies the children of the root element, their children, and so on
Example DTD
<!ELEMENT bib (book*)>
<!ELEMENT book (thesis | article)>
<!ELEMENT thesis (title, author, year, school, committeemember*)>
<!ATTLIST thesis
  date    CDATA #REQUIRED
  key     ID    #REQUIRED
  advisor CDATA #IMPLIED
  idref   IDREF #IMPLIED>
<!ELEMENT article (title, (author+ | editor+), publisher)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (name)>
<!ATTLIST author id ID #REQUIRED>
. . .
DTD Interpretation
CDATA      “Character Data”, a sequence of characters
#PCDATA    “Parsed Character Data”, text and character entities (e.g., &amp; -> &, &eacute; -> é)
ID         a unique identifier
IDREF      reference to an ID
#IMPLIED   The attribute is optional and no default value is declared.
( ... )    Specifies a group.
A | B      Either A or B may occur, but not both.
A , B      A must occur before B.
A & B      A and B must both occur once, but may do so in any order (SGML only; not available in XML DTDs).
A?         A can occur zero or one times
A*         A can occur zero or more times
A+         A can occur one or more times
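Because a DTD content model is a regular expression over child element names, the operators above can be tried out with an ordinary regex engine. The sketch below, assuming Python's standard re and xml.etree.ElementTree modules, encodes the article content model (title, (author+ | editor+), publisher) from the example DTD. The helper name and the encoding of child names as a space-terminated string are choices made for this illustration, not how a real validator works.

```python
import re
import xml.etree.ElementTree as ET

# (title, (author+ | editor+), publisher) written as a regex over a
# string of child tag names, each name followed by one space.
ARTICLE_MODEL = re.compile(r"title ((author )+|(editor )+)publisher ")

def matches_article_model(xml_text: str) -> bool:
    """Check the sequence of child names of the root against the model."""
    element = ET.fromstring(xml_text)
    child_names = "".join(child.tag + " " for child in element)
    return ARTICLE_MODEL.fullmatch(child_names) is not None

print(matches_article_model(
    "<article><title/><author/><author/><publisher/></article>"))
print(matches_article_model(
    "<article><title/><author/><editor/><publisher/></article>"))
```

The second document mixes author and editor, which the choice operator | rules out, so it does not match.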
DTD Defines Special Significance for Attributes
• IDs - special attributes that are analogous to relational database keys (globally unique IDs for elements)
• IDREF - reference to an ID
• IDREFS - a list of IDREFs
Instance Visualization as a Graph
<?xml version="1.0"?>
<!DOCTYPE bib SYSTEM "http://webserver/bib.dtd">
<bib>
  <author id="author1">
    <name>John Smith</name>
  </author>
  <article>
    <author idref="author1" />
    <title>Paper1</title>
  </article>
  <article>
    <author idref="author1" />
    <title>Paper2</title>
  </article>
  . . .
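Graph Data Model
[Figure: the bib instance drawn as a graph. Root has children ?xml, !DOCTYPE, and bib; bib contains an author (id "author1", name "John Smith") and two article elements (titles "Paper1" and "Paper2"), each with an author child whose idref edge points to author1.]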
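The idref edges that turn the tree into a graph can be followed with a small lookup table. This is a minimal sketch using Python's standard xml.etree.ElementTree. That parser does not read the DTD, so the code simply assumes that attributes named id and idref play the ID and IDREF roles, as in the example, and it closes the bib element that the slide leaves elided.

```python
import xml.etree.ElementTree as ET

doc = """<bib>
  <author id="author1"><name>John Smith</name></author>
  <article><author idref="author1"/><title>Paper1</title></article>
  <article><author idref="author1"/><title>Paper2</title></article>
</bib>"""
root = ET.fromstring(doc)

# Index every element that carries an id attribute (IDs are unique per document).
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}

# Follow each idref edge from an article's author to the element it names.
papers = {}
for article in root.findall("article"):
    target = by_id[article.find("author").get("idref")]
    papers[article.findtext("title")] = target.findtext("name")
print(papers)
```

Both articles resolve to the same author element, which is exactly the sharing that a pure tree cannot express.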
Drawbacks of DTDs
• Not themselves XML - additional effort to build tools
• No support for data types - cannot do data validation
• No support for OO-like structures (e.g., inheritance)
• Horrible syntax
XML Schema Design Principles
• More expressive than DTDs (which came from SGML, although modified slightly in XML 1.0)
• Notation is itself an XML vocabulary
• Self-describing
• Usable by a wide variety of applications that employ XML
• Straightforwardly usable on the Internet
• Optimized for interoperability
• Simple enough to be implemented with modest design and runtime resources
• Coordinated with relevant W3C specs
Purpose of an XML Schema
• Defines a class of XML instances
• Neither instances nor schemas need to exist as documents per se; they may exist as:
  • Byte stream sent between applications
  • Fields in a database record
  • Collection of XML “infoset” information items
What is an XML “infoset”?
• XML Information Set, 2nd edition, W3C Recommendation February 2004
• For use by other specs that need to refer to the information in a well-formed XML document [or PSVI = post-schema-validation infoset]
• Defines an abstract data set generated by a parser or by other means, conceptually a tree of items, each with several properties
(Some) Information Items
• Document (root of infoset) - properties include base URI, XML version, character encoding, etc.
• One root element - and its children
• Attributes of elements
• Namespace scoping for elements
• Processing instructions
• Unexpanded entities (processor may or may not expand all entities)
Example Instance Document (file po.xml)
<?xml version="1.0"?>
<purchaseOrder orderDate="2007-10-20">
  <shipTo country="US">
    <name>Alice Smith</name>
    <street>123 Maple Street</street>
    <city>Mill Valley</city>
    <state>CA</state>
    <zip>90952</zip>
  </shipTo>
  <billTo country="US">
    <name>Robert Smith</name>
    <street>8 Oak Avenue</street>
    <city>Old Town</city>
    <state>PA</state>
    <zip>95819</zip>
  </billTo>
  <comment>Hurry, my lawn is going wild!</comment>
  <items>
    <item partNum="872-AA">
      <productName>Lawnmower</productName>
      <quantity>1</quantity>
      <USPrice>148.95</USPrice>
      <comment>Confirm this is electric</comment>
    </item>
    <item partNum="926-AA">
      . . .
    </item>
  </items>
</purchaseOrder>
Where is the Schema?
• The instance document may reference a schema explicitly, or a processor may obtain a schema separately without reference from the instance
• Schema defines elements and attributes, and their complex and simple types
• Determines the appearance of elements and their content in instance documents
Example Schema <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"> <xsd:annotation> . . . </xsd:annotation> <xsd:element name="purchaseOrder" type="PurchaseOrderType"/> <xsd:element name="comment" type="xsd:string"/> <xsd:complexType name="PurchaseOrderType"> . . . </xsd:complexType> </xsd:schema> file po.xsd • The schema consists of a schema element and various subelements, e.g., element, complexType • The prefix xsd: associates names with the XML Schema namespace specified in the xmlns:xsd declaration • Same prefix, and hence same association, also appears on names of built-in types, e.g., xsd:string • Identifies elements and simple types as belonging to XML Schema language vocabulary rather than vocabulary of schema author
Example Schema (file po.xsd)
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:annotation> . . . </xsd:annotation>
  <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
  <xsd:element name="comment" type="xsd:string"/>
  <xsd:complexType name="PurchaseOrderType"> . . . </xsd:complexType>
</xsd:schema>
• An annotation element may appear at the beginning of most schema constructions
• It can contain two kinds of subelements
  • documentation: human-readable material
  • appinfo: for tools and applications
Complex Type Definitions <xsd:complexType name="USAddress"> <xsd:sequence> <xsd:element name="name" type="xsd:string"/> <xsd:element name="street" type="xsd:string"/> <xsd:element name="city" type="xsd:string"/> <xsd:element name="state" type="xsd:string"/> <xsd:element name="zip" type="xsd:decimal"/> </xsd:sequence> <xsd:attribute name="country" type="xsd:NMTOKEN" fixed="US"/> </xsd:complexType> • New complex types are defined using the complexType element; it contains element declarations, attribute declarations and element references • This example says elements of type USAddress must have • 5 subelements that must be called name, street, city, state and zip (in this order), each having the corresponding type declared above • 1 attribute called country may appear with the element; NMTOKEN represents an atomic indivisible value • All element declarations within USAddress involve simple types
Complex Type Definitions <xsd:complexType name="USAddress"> <xsd:sequence> <xsd:element name="name" type="xsd:string"/> <xsd:element name="street" type="xsd:string"/> <xsd:element name="city" type="xsd:string"/> <xsd:element name="state" type="xsd:string"/> <xsd:element name="zip" type="xsd:decimal"/> </xsd:sequence> <xsd:attribute name="country" type="xsd:NMTOKEN" fixed="US"/> </xsd:complexType> • An attribute may be specified as fixed or default. • Default attribute values apply when attributes are missing. • For fixed attributes, if a value appears, it must be the value declared with a fixed value. • The schema processor will provide the value for missing attributes. Kaiser: COMS E6125
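The behavior of a fixed attribute can be imitated in a few lines. The sketch below is not a schema processor: apply_fixed is a made-up helper, written with Python's standard xml.etree.ElementTree, that supplies the fixed value when the attribute is missing and rejects any other value, mirroring the country attribute of USAddress.

```python
import xml.etree.ElementTree as ET

def apply_fixed(element, name, fixed):
    """Mimic a fixed attribute: supply it if missing, reject any other value."""
    value = element.get(name)
    if value is None:
        element.set(name, fixed)
    elif value != fixed:
        raise ValueError(f"{name} must be {fixed!r}, got {value!r}")
    return element

# The attribute is absent, so the fixed value is filled in.
addr = apply_fixed(ET.fromstring("<shipTo/>"), "country", "US")
print(addr.get("country"))
```

A default attribute would differ only in the second branch: any value present in the instance would be accepted as-is.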
Complex Type Definitions <xsd:complexType name="PurchaseOrderType"> <xsd:sequence> <xsd:element name="shipTo" type="USAddress"/> <xsd:element name="billTo" type="USAddress"/> <xsd:element ref="comment" minOccurs="0"/> <xsd:element name="items" type="Items"/> </xsd:sequence> <xsd:attribute name="orderDate" type="xsd:date"/> </xsd:complexType> • A declaration may reference an existing element, e.g., comment; the value of the ref attribute must reference a global element (i.e., declared under schema) • Every element of type PurchaseOrderType must consist of subelements shipTo and billTo, each containing the five subelements declared as part of USAddress, items and (optionally) comment; it may have one attribute called orderDate
Complex Type Definitions
<xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
    <xsd:element name="shipTo" type="USAddress"/>
    <xsd:element name="billTo" type="USAddress"/>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
• Occurrence constraints may specify minOccurs and/or maxOccurs (both default to 1)
Complex Type Definitions <xsd:complexType name="PurchaseOrderType"> <xsd:sequence> <xsd:element name="shipTo" type="USAddress"/> <xsd:element name="billTo" type="USAddress"/> <xsd:element ref="comment" minOccurs="0"/> <xsd:element name="items" type="Items"/> </xsd:sequence> <xsd:attribute name="orderDate" type="xsd:date"/> </xsd:complexType> • Attributes may appear once or not at all (the default), but no more than once • use may be specified as optional, required, or prohibited Kaiser: COMS E6125
Simple Built-in Types
• string, normalizedString, token
• byte, unsignedByte
• integer, positiveInteger, etc.
• long, short, etc.
• decimal, float, double
• boolean
• time, dateTime, duration, date, etc.
• anyURI
• etc.
The following types should only be used in attributes (to retain compatibility with XML 1.0 DTDs):
• ID
• IDREF, IDREFS
• ENTITY, ENTITIES
• NMTOKEN, NMTOKENS
Simple Derived Types <xsd:simpleType name="myInteger"> <xsd:restriction base="xsd:integer"> <xsd:minInclusive value="10000"/> <xsd:maxInclusive value="99999"/> </xsd:restriction> </xsd:simpleType> • The simpleType element is used to define and name a new simple type • The restriction element indicates the base type and identifies the “facets” that constrain the range of values (here minInclusive and maxInclusive) Kaiser: COMS E6125
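The effect of the two range facets can be approximated outside a schema processor. This is a rough sketch in plain Python; is_my_integer is a hypothetical helper, and the lexical check is a simplification of the xsd:integer rules.

```python
import re

def is_my_integer(text: str) -> bool:
    """Check text against myInteger: an integer with 10000 <= value <= 99999."""
    text = text.strip()  # surrounding whitespace is ignored for xsd:integer
    if re.fullmatch(r"[+-]?[0-9]+", text) is None:
        return False     # not an integer literal at all
    return 10000 <= int(text) <= 99999

print(is_my_integer("10000"), is_my_integer("9999"), is_my_integer("12.5"))
```

Both bounds are inclusive, matching minInclusive and maxInclusive; the exclusive variants would use strict comparisons instead.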
Simple Derived Types (pattern facet)
<!-- Stock Keeping Unit, a code for identifying products -->
<xsd:simpleType name="SKU">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="\d{3}-[A-Z]{2}"/>
  </xsd:restriction>
</xsd:simpleType>
• Constrain the values of SKU using the pattern facet in conjunction with the regular expression "\d{3}-[A-Z]{2}" (3 digits followed by a hyphen followed by 2 upper-case ASCII letters)
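The SKU pattern can be tried out with an ordinary regex engine. A minimal sketch, assuming Python's standard re module, whose syntax happens to agree with XML Schema for this particular expression; is_sku is a name invented for the example.

```python
import re

SKU_PATTERN = re.compile(r"\d{3}-[A-Z]{2}")

def is_sku(text: str) -> bool:
    # An XSD pattern must match the whole value, hence fullmatch.
    return SKU_PATTERN.fullmatch(text) is not None

print(is_sku("872-AA"), is_sku("872-aa"), is_sku("872-AAA"))
```

Using search or match instead of fullmatch would wrongly accept values with trailing characters, such as "872-AAA".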
Simple Derived Types (enumeration facet)
<xsd:simpleType name="USState">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="AK"/>
    <xsd:enumeration value="AL"/>
    <xsd:enumeration value="AR"/>
    <!-- and so on ... -->
  </xsd:restriction>
</xsd:simpleType>
• The enumeration facet limits a simple type to a set of distinct values
• Enables a better definition of the USAddress type
<xsd:complexType name="USAddress">
  . . .
  <xsd:element name="state" type="USState"/>
  . . .
</xsd:complexType>
Anonymous Type Definitions <xsd:complexType name="Items"> <xsd:sequence> <xsd:element name="item" minOccurs="0" maxOccurs="unbounded"> <xsd:complexType> <xsd:sequence> <xsd:element name="productName" type="xsd:string"/> <xsd:element name="quantity"> <xsd:simpleType> <xsd:restriction base="xsd:positiveInteger"> <xsd:maxExclusive value="100"/> </xsd:restriction> </xsd:simpleType> </xsd:element> <xsd:element name="USPrice" type="xsd:decimal"/> <xsd:element ref="comment" minOccurs="0"/> <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/> </xsd:sequence> <xsd:attribute name="partNum" type="SKU" use="required"/> </xsd:complexType> </xsd:element> </xsd:sequence> </xsd:complexType>
Recap Example Schema <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"> <xsd:annotation> . . . </xsd:annotation> <xsd:element name="purchaseOrder" type="PurchaseOrderType"/> <xsd:element name="comment" type="xsd:string"/> <xsd:complexType name="PurchaseOrderType"> <xsd:sequence> <xsd:element name="shipTo" type="USAddress"/> <xsd:element name="billTo" type="USAddress"/> <xsd:element ref="comment" minOccurs="0"/> <xsd:element name="items" type="Items"/> </xsd:sequence> <xsd:attribute name="orderDate" type="xsd:date"/> </xsd:complexType> <xsd:complexType name="USAddress"> . . . </xsd:complexType> <xsd:complexType name="Items"> . . . </xsd:complexType> </xsd:schema> file po.xsd
XML Schema Data Types
• Complex types
• Built-in simple types
• Derived simple types
• Also derived complex types, lists and unions of simple types
• Define structure – what about the content?
Element Content: Simple content
• Declare an element that has an attribute and contains a simple value
<xsd:element name="internationalPrice">
  <xsd:complexType>
    <xsd:simpleContent>
      <xsd:extension base="xsd:decimal">
        <xsd:attribute name="currency" type="xsd:string"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>
</xsd:element>
<internationalPrice currency="EUR">423.46</internationalPrice>
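From the consumer's side, an element with simple content plus an attribute is read in two parts. A small sketch, assuming Python's standard xml.etree.ElementTree and decimal modules:

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

price = ET.fromstring(
    '<internationalPrice currency="EUR">423.46</internationalPrice>')

amount = Decimal(price.text)       # the simple (decimal) content
currency = price.get("currency")   # the attribute added by the extension
print(amount, currency)
```

Decimal is used rather than float so that the value keeps the exact digits written in the document.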
Element Content:Empty content • Declare an element with attributes only - no content at all <xsd:element name="internationalPrice"> <xsd:complexType> <xsd:attribute name="currency" type="xsd:string"/> <xsd:attribute name="value" type="xsd:decimal"/> </xsd:complexType> </xsd:element> <internationalPrice currency="EUR" value="423.46"/> Kaiser: COMS E6125
Element Content: Entire element omitted
• The absence of an element does not carry any particular meaning; it could be
  • Information unknown
  • Information not applicable
  • I just forgot to enter the information
• Absence does/should not imply some value like zero, empty string, empty list, etc.
• Database systems faced with similar problems have introduced “null” values
• XML does not provide a null value representation that actually appears in element content; instead, there is an attribute to indicate content is nil
<xsd:element name="shipDate" type="shipDateType" nillable="true"/>
<shipDate xsi:nil="true"></shipDate>
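An application reading such an instance has to look for the xsi:nil attribute explicitly. The sketch below assumes Python's standard xml.etree.ElementTree; is_nil and the surrounding item document are invented for the example, while the namespace URI is the standard XML Schema instance namespace that the xsi prefix conventionally denotes.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

def is_nil(element) -> bool:
    """True if the element carries xsi:nil with a true value."""
    return element.get(f"{{{XSI}}}nil") in ("true", "1")

doc = (f'<item xmlns:xsi="{XSI}">'
       '<shipDate xsi:nil="true"></shipDate><comment/></item>')
item = ET.fromstring(doc)
print(is_nil(item.find("shipDate")), is_nil(item.find("comment")))
```

An empty comment element is not nil; only the explicit attribute makes the distinction between "empty" and "no value".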
Element Content:Mixed content <letterBody> <salutation>Dear Mr.<name>Robert Smith</name>.</salutation> Your order of <quantity>1</quantity> <productName>Baby Monitor</productName> shipped from our warehouse on <shipDate>1999-05-21</shipDate>. .... </letterBody> • Text appears between the elements salutation, quantity, productName, and shipDate (all children of letterBody) • To allow this, the mixed attribute of the parent’s complexType must be set to true Kaiser: COMS E6125
Element Content: Mixed content <xsd:element name="letterBody"> <xsd:complexType mixed="true"> <xsd:sequence> <xsd:element name="salutation"> <xsd:complexType mixed="true"> <xsd:sequence> <xsd:element name="name" type="xsd:string"/> </xsd:sequence> </xsd:complexType> </xsd:element> <xsd:element name="quantity" type="xsd:positiveInteger"/> <xsd:element name="productName" type="xsd:string"/> <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/> <!-- etc. --> </xsd:sequence> </xsd:complexType> </xsd:element> • The order and number of child elements appearing in an instance must agree with order/number of child elements specified in the content model
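How the interleaved text of mixed content reaches an application depends on the parser's data model. As a hedged illustration, Python's standard xml.etree.ElementTree stores the text that follows a child's end tag in that child's tail attribute; the document below is the letterBody instance with the elided part dropped.

```python
import xml.etree.ElementTree as ET

doc = ("<letterBody><salutation>Dear Mr.<name>Robert Smith</name>.</salutation>"
       "Your order of <quantity>1</quantity> <productName>Baby Monitor</productName> "
       "shipped from our warehouse on <shipDate>1999-05-21</shipDate>.</letterBody>")
body = ET.fromstring(doc)

# Text that follows a child's end tag is kept in child.tail.
pieces = [(child.tag, child.tail) for child in body]
for tag, tail in pieces:
    print(tag, repr(tail))
```

The child elements still appear in the order required by the sequence; the text between them is unconstrained by the schema.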
Element Content: anyType
• The anyType type does not constrain its content in any way
• When no type is defined, anyType is the default, so the following two declarations are equivalent:
<xsd:element name="anything" type="xsd:anyType"/>
<xsd:element name="anything"/>
Grouping Content Elements – group & sequence • group – groups elements so that they can be used as a unit to build up types • sequence grouping (default) – elements in instance doc must appear in the listed order <xsd:group name="shipAndBill"> <xsd:sequence> <xsd:element name="shipTo" type="USAddress"/> <xsd:element name="billTo" type="USAddress"/> </xsd:sequence> </xsd:group> Kaiser: COMS E6125
Content Groups - choice • choice grouping – only one element appears in an instance <xsd:complexType name="PurchaseOrderType"> <xsd:sequence> <xsd:choice> <xsd:group ref="shipAndBill"/> <xsd:element name="singleUSAddress" type="USAddress"/> </xsd:choice> <xsd:element ref="comment" minOccurs="0"/> <xsd:element name="items" type="Items"/> </xsd:sequence> <xsd:attribute name="orderDate" type="xsd:date"/> </xsd:complexType> Kaiser: COMS E6125
Content Groups - all • all grouping – elements may appear in any order, each element appears zero or one times • An all group must appear as the sole child at the top of a content model <xsd:complexType name="PurchaseOrderType"> <xsd:all> <xsd:element name="shipTo" type="USAddress"/> <xsd:element name="billTo" type="USAddress"/> <xsd:element ref="comment" minOccurs="0"/> <xsd:element name="items" type="Items"/> </xsd:all> <xsd:attribute name="orderDate" type="xsd:date"/> </xsd:complexType>
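The rule enforced by an all group can be written down as a short check. This is a simplified sketch, not a schema validator: satisfies_all_group is a made-up helper using Python's standard xml.etree.ElementTree, and the table of names and required flags is copied by hand from the declarations above (comment has minOccurs="0", the others are required).

```python
import xml.etree.ElementTree as ET

# Child names of the all group, mapped to whether each one is required.
ALL_GROUP = {"shipTo": True, "billTo": True, "comment": False, "items": True}

def satisfies_all_group(element) -> bool:
    names = [child.tag for child in element]
    if len(names) != len(set(names)):        # no element more than once
        return False
    if any(name not in ALL_GROUP for name in names):
        return False                         # unknown child element
    return all(name in names
               for name, required in ALL_GROUP.items() if required)

order = ET.fromstring("<purchaseOrder><items/><billTo/><shipTo/></purchaseOrder>")
print(satisfies_all_group(order))
```

Unlike the sequence version of PurchaseOrderType, the order of the children does not matter here.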
Attribute Grouping
• We can create a named attribute group containing all the desired attributes and reference this group by name in an element
<xsd:element name="Item">
  <xsd:complexType>
    . . .
    <xsd:attribute name="partNum" type="SKU" use="required"/>
    <xsd:attribute name="weightKg" type="xsd:decimal"/>
    <xsd:attribute name="shipBy">
      <xsd:simpleType>
        <xsd:restriction base="xsd:string">
          <xsd:enumeration value="air"/>
          <xsd:enumeration value="land"/>
          <xsd:enumeration value="any"/>
        </xsd:restriction>
      </xsd:simpleType>
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>
Attribute Groups
<xsd:element name="Item">
  <xsd:complexType>
    . . .
    <xsd:attributeGroup ref="ItemDelivery"/>
  </xsd:complexType>
</xsd:element>
<xsd:attributeGroup name="ItemDelivery">
  <xsd:attribute name="partNum" type="SKU" use="required"/>
  <xsd:attribute name="weightKg" type="xsd:decimal"/>
  <xsd:attribute name="shipBy">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="air"/>
        <xsd:enumeration value="land"/>
        <xsd:enumeration value="any"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:attribute>
</xsd:attributeGroup>
Target Namespaces
• Tired of repeating the prefix xsd: ?
• We could make the XMLSchema namespace the default namespace (so no more xsd prefixes), but then we would have to prefix the locally defined types and locally declared elements and attributes
• The solution is Target Namespaces
• Target namespaces enable distinguishing between definitions and declarations from different vocabularies