Lecture 5 XML Schema (Based on Møller and Schwartzbach, 2006, pp.113-159). CIS336 Website design, implementation and management (also Semester 2 of CIS219, CIS221 and IT226). David Meredith d.meredith@gold.ac.uk www.titanmusic.com/teaching/cis336-2006-7.html
Problems with DTDs • DTDs cannot constrain character data • e.g., cannot specify that (#PCDATA) must only be a valid integer representation • need more powerful datatype mechanism • Attribute types are too limited • e.g., cannot specify that an attribute value must be an integer, a URI etc. • Element and attribute definitions cannot depend on context • e.g., cannot specify that unit attribute only allowed if amount attribute is present • Character data cannot be combined with regular expression content model • i.e., mixed content always has form (#PCDATA | e1 | e2)* • cannot specify order in which character data may be interspersed with elements • Element content model lacks "interleaving" operator that allows us to specify that an element may occur anywhere inside an element • e.g., cannot (easily) specify that comment element may occur anywhere in contents of recipe element
More problems with DTDs • DTD provides very limited support for modularity, reuse and evolution of schemas • hard to write, maintain and read large DTD schemas • ID/IDREF mechanism is too limited • sometimes want to specify a more restricted scope for an ID attribute than the whole instance document • also might want to use multiple attribute values or character data as keys rather than just single attribute value • DTDs do not support namespaces
XML Schema • DTDs defined as part of the XML 1.0 specification (February 1998) • inherited from SGML • Shortly afterwards, W3C initiated XML Schema project to deal with problems in DTDs • XML Schema Requirements (1999) specifies that XML Schema should be: • more expressive than XML DTD • a well-formed XML language • self-describing • i.e., it should be possible to describe the syntax of XML Schema using an XML Schema (since XML Schema is an XML language) • simple enough to implement with modest design and runtime resources (which limits expressiveness) • XML Schema specification should be: • defined quickly to prevent competing schema languages gaining a foothold • precise, concise, human-readable and illustrated with examples
XML Schema technical requirements • XML Schema should • contain mechanism for constraining use of namespaces • allow creation of user-defined datatypes for describing character data and attribute values • enable inheritance for element, attribute and datatype definitions • support evolution of schemas • permit embedded structured documentation within schemas
XML Schema recommendation • Official XML Schema specification published as W3C recommendation in 2001 • in 2 parts: • XML Schema Part 1: Structures • Describes core XML Schema including, for example, element and attribute declarations • Most recent version: Second Edition, 28 October 2004 • Available online at http://www.w3.org/TR/xmlschema-1/ • XML Schema Part 2: Datatypes • Defines facilities for defining datatypes in XML Schema • Most recent version: Second Edition, 28 October 2004 • Available online at http://www.w3.org/TR/xmlschema-2/ • Does not satisfy all original requirements: • not simple • Partly remedied by XML Schema Part 0: Primer • Provides easily readable description of the XML Schema facilities • Most recent version: 28 October 2004 • Available online at http://www.w3.org/TR/xmlschema-0/ • not fully self-describing • not sufficiently expressive • e.g., cannot express full syntax of RecipeML
XML Schema overview • Contains a sophisticated type system like those in common programming languages • Facilitates re-use and improves schema structure • Four central constructs in XML Schema all based on types and are as follows: • Simple type definition • Defines a family of Unicode text strings • Describes text without markup • Complex type definition • Defines validity requirements for attributes, sub-elements and character data in an element of that type • Describes text which may contain markup • Element declaration • Associates element name with either a simple or complex type • Attribute declaration • Associates attribute name with simple type • Attribute values are always unstructured text
An example schema written in XML Schema • The example schema shows • one element declaration • student • two attribute declarations: • id, score • one complex type definition: • StudentType • one simple type definition: • Score • XML Schema elements identified by namespace http://www.w3.org/2001/XMLSchema • Namespace prefix ("xsd") is arbitrary but conventional • Root element in XML Schema document is named schema • usually contains targetNamespace attribute • names the namespace that the schema defines • also declare this namespace with a prefix so that definitions within the schema can be referred to • Definitions create new types; declarations describe constituents of the instance document • Definitions and declarations populate the target namespace
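A minimal sketch of a schema with the constituents listed above (one element declaration, two attribute declarations, one complex type, one simple type). The namespace http://example.org/students, the prefix s and the 0 to 100 range chosen for Score are assumptions made for illustration; they are not taken from the original slide.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:s="http://example.org/students"
            targetNamespace="http://example.org/students">

  <!-- Element declaration: associates the name student with a complex type -->
  <xsd:element name="student" type="s:StudentType"/>

  <!-- Attribute declarations: each associates a name with a simple type -->
  <xsd:attribute name="id" type="xsd:string"/>
  <xsd:attribute name="score" type="s:Score"/>

  <!-- Complex type definition: a student has no content, only attributes -->
  <xsd:complexType name="StudentType">
    <xsd:attribute ref="s:id" use="required"/>
    <xsd:attribute ref="s:score" use="required"/>
  </xsd:complexType>

  <!-- Simple type definition: integers in an assumed range of 0 to 100 -->
  <xsd:simpleType name="Score">
    <xsd:restriction base="xsd:integer">
      <xsd:minInclusive value="0"/>
      <xsd:maxInclusive value="100"/>
    </xsd:restriction>
  </xsd:simpleType>

</xsd:schema>
```

Because the target namespace is bound to the prefix s, the schema can refer to its own definitions as s:StudentType and s:Score.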
Syntax for element and attribute declarations • Element declaration has form <element name="name" type="type"/> • associates simple or complex type, type, with the element named name • Attribute declaration has form <attribute name="name" type="type"/> • associates simple type, type, with an attribute named name
Simple student instance document • Can avoid use of prefixes in attribute names
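One way to avoid prefixes on attribute names, sketched under the same assumed namespace (http://example.org/students): declare the attributes locally inside the complex type instead of at the top level. Locally declared attributes are unqualified by default, whereas top-level attribute declarations belong to the target namespace and must be prefixed in the instance document.

```xml
<!-- With the local declarations below, an instance can be written as
       <s:student xmlns:s="http://example.org/students"
                  id="19701234" score="97"/>
     With top-level attribute declarations it would need s:id and s:score.
     The values 19701234 and 97 are made up for illustration. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:s="http://example.org/students"
            targetNamespace="http://example.org/students">

  <xsd:element name="student" type="s:StudentType"/>

  <xsd:complexType name="StudentType">
    <!-- Local attribute declarations: no namespace, so no prefix is needed -->
    <xsd:attribute name="id" type="xsd:string" use="required"/>
    <xsd:attribute name="score" type="xsd:integer" use="required"/>
  </xsd:complexType>

</xsd:schema>
```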
Business card example • An instance document in a business card language, together with the schema that defines the language • Assume we own the domain businesscard.org • so no-one else uses this namespace • Can fix it so that there is no need for a prefix in the uri attribute • Compare with the equivalent DTD
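The original figures are not available here, so the following is only a plausible sketch of such a schema. The element names name, title, email, phone and logo, the choice of which elements are optional, and the sample data are assumptions; only the namespace http://businesscard.org, the type name card_type and the uri attribute come from the slides.

```xml
<!-- A matching instance document might look like this:
       <b:card xmlns:b="http://businesscard.org">
         <b:name>John Doe</b:name>
         <b:title>CEO, Widget Inc.</b:title>
         <b:email>john.doe@widget.com</b:email>
         <b:logo b:uri="widget.gif"/>
       </b:card>
-->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:b="http://businesscard.org"
            targetNamespace="http://businesscard.org">

  <xsd:element name="card" type="b:card_type"/>
  <xsd:element name="name" type="xsd:string"/>
  <xsd:element name="title" type="xsd:string"/>
  <xsd:element name="email" type="xsd:string"/>
  <xsd:element name="phone" type="xsd:string"/>
  <xsd:element name="logo" type="b:logo_type"/>

  <!-- Top-level attribute: lives in the target namespace, hence b:uri above -->
  <xsd:attribute name="uri" type="xsd:anyURI"/>

  <xsd:complexType name="card_type">
    <xsd:sequence>
      <xsd:element ref="b:name"/>
      <xsd:element ref="b:title"/>
      <xsd:element ref="b:email"/>
      <xsd:element ref="b:phone" minOccurs="0"/>
      <xsd:element ref="b:logo" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="logo_type">
    <xsd:attribute ref="b:uri" use="required"/>
  </xsd:complexType>

</xsd:schema>
```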
Connecting instance documents and schemas • Instance document can refer to a schema using schemaLocation attribute from the namespace, http://www.w3.org/2001/XMLSchema-instance • Value of schemaLocation attribute has two parts, separated by whitespace: • target namespace of schema • URI of schema document • schemaLocation indicates that document is supposed to be valid with respect to the schema • schemaLocation attributes may appear in any element • usually appear in root element • can also appear in another element to indicate that the schema applies to the subtree under that element • means XML languages can be combined at will • schemaLocation attribute value is actually sequence of "namespace URI" pairs • if more than one pair, all schemas apply independently
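A short sketch of an instance document that points at its schema. The file name business_card.xsd and the element content are assumptions; the prefix xsi is conventional but arbitrary.

```xml
<b:card xmlns:b="http://businesscard.org"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://businesscard.org business_card.xsd">
  <!-- schemaLocation value: target namespace, whitespace, schema URI -->
  <b:name>John Doe</b:name>
  <b:title>CEO, Widget Inc.</b:title>
  <b:email>john.doe@widget.com</b:email>
</b:card>
```

To apply several schemas, the attribute value simply lists more pairs, for example "ns1 schema1.xsd ns2 schema2.xsd".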
More on schemaLocation • All attributes defined in http://www.w3.org/2001/XMLSchema-instance are implicitly declared for all elements in the instance document • schemaLocation attributes are optional • make instance documents self-describing • Applications require documents to be valid relative to schemas decided by application developers, not schemas decided by document authors • XML Schema does not directly enforce a particular root element • e.g., an XML Schema definition of XHTML cannot express that the root element must be html • means that application must check root element as well as carrying out XML validation
Simple types • Simple type or datatype is set of Unicode strings with a particular semantic interpretation • e.g., decimal datatype is built-in XML Schema datatype which consists of all strings that represent decimal numbers (e.g., 3.1415) • 3.1415 is equal to 3.141500 • 42 is less than 117 • XML Schema contains some primitive simple types with pre-defined meanings • XML Schema also provides various mechanisms for deriving new types from existing ones
Simple Types (Datatypes) – Primitive • string: any Unicode string • boolean: true, false, 1, 0 • decimal: 3.1415 • float: 6.02214199E23 • double: 42E970 • dateTime: 2004-09-26T16:29:00-05:00 • time: 16:29:00-05:00 • date: 2004-09-26 • hexBinary: 48656c6c6f0a • base64Binary: SGVsbG8K • anyURI: http://www.brics.dk/ixwt/ • QName: rcp:recipe, recipe • ...
Some built-in derived simple types • normalizedString • as string but whiteSpace facet is replace • token • as string but whiteSpace facet is collapse • language • "en", "da", "en-US", etc. • NMTOKEN • e.g., "42", "my.form", "r103" • NMTOKENS • e.g., "42 my.form r103" • nonPositiveInteger • e.g., "-87", "0"
A simple type element declaration • <element name="serialnumber" type="nonNegativeInteger"/> • assigns built-in derived simple type, nonNegativeInteger, to elements named serialnumber • contents of a serialnumber element must match nonNegativeInteger (possibly with surrounding whitespace) • serialnumber element cannot contain child elements or attributes
Deriving new simple types by restriction • Restriction of a simple type defines a new type by restricting possible values of a base type • restriction performed on facets of base type (e.g., length, minLength, maxLength, pattern, enumeration, whiteSpace, minInclusive, maxInclusive, minExclusive, maxExclusive, totalDigits, fractionDigits) • restriction may contain multiple constraining facets • Facet restrictions operate at semantic not syntactic level • e.g., with base type decimal, <totalDigits value="3"/> allows 123, 0123 and 0123.0 but not 1234 and 123.05
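A minimal sketch of a restriction that combines several constraining facets. The type name Percentage and the chosen bounds are hypothetical.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Decimals from 0 to 100 with at most two digits after the point -->
  <xsd:simpleType name="Percentage">
    <xsd:restriction base="xsd:decimal">
      <xsd:minInclusive value="0"/>
      <xsd:maxInclusive value="100"/>
      <xsd:fractionDigits value="2"/>
    </xsd:restriction>
  </xsd:simpleType>

</xsd:schema>
```

Since facets apply to the value rather than to the string, "007.50" is accepted here: its value is 7.5, which lies within the bounds.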
Deriving new simple types by restriction • enumeration facet restricts values to a finite set of possibilities • pattern facet allows values to be constrained to satisfy regular expressions • symbols that have a special meaning within regular expressions can be escaped by prefixing with a backslash (e.g., \*) • For most facets, restrictions may be changed in further derivations unless fixed="true" attribute is added to constraining facet
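The enumeration and pattern facets can be sketched as follows. The type names Weekday and PartCode, and the particular values and pattern, are hypothetical.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- enumeration: the value must be one of the listed strings -->
  <xsd:simpleType name="Weekday">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="Mon"/>
      <xsd:enumeration value="Tue"/>
      <xsd:enumeration value="Wed"/>
      <xsd:enumeration value="Thu"/>
      <xsd:enumeration value="Fri"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- pattern: two capital letters, a hyphen, four digits (e.g. AB-1234) -->
  <xsd:simpleType name="PartCode">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[A-Z]{2}-\d{4}"/>
    </xsd:restriction>
  </xsd:simpleType>

</xsd:schema>
```

A pattern must match the whole value, so no explicit start or end anchors are needed.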
Deriving simple types using list and union • Use the list element inside a simpleType definition to define a whitespace-separated string of values of a particular type • e.g., "23 4 56 -7" is of type integerlist • Use union element inside a simpleType definition to specify that a value must be one of two or more types • e.g., "true" and "1.3" are both of type boolean_or_decimal
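A sketch of how the two types named above could be defined. The type names come from the slide; the definitions themselves are reconstructions, since the original figures are not available.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- list: whitespace-separated integers, e.g. "23 4 56 -7" -->
  <xsd:simpleType name="integerlist">
    <xsd:list itemType="xsd:integer"/>
  </xsd:simpleType>

  <!-- union: a value is valid if it is a boolean or a decimal -->
  <xsd:simpleType name="boolean_or_decimal">
    <xsd:union memberTypes="xsd:boolean xsd:decimal"/>
  </xsd:simpleType>

</xsd:schema>
```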
Complex types • An element declaration may assign a complex type to an element name: <element name="card" type="b:card_type"/> • means that elements with the name card must satisfy all the requirements specified in the definition of the type card_type • complex type definition may specify attributes, child element types and ordering and character data • Complex type defined using XML Schema element, complexType • content of complexType element can be either complex or simple
Element reference • Element reference takes the form <element ref="name" /> • name is the name of an element that has already been declared • Note difference between element element with name attribute and one with a ref attribute!
sequence element • Concatenation within the content of an element with a complex content model is expressed using the sequence element
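Element references and the sequence element can be sketched together. The namespace http://example.org/contacts and all element and type names below are hypothetical.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:c="http://example.org/contacts"
            targetNamespace="http://example.org/contacts">

  <!-- name attribute: these are declarations that introduce new elements -->
  <xsd:element name="contact" type="c:contact_type"/>
  <xsd:element name="name" type="xsd:string"/>
  <xsd:element name="phone" type="xsd:string"/>

  <xsd:complexType name="contact_type">
    <!-- sequence: a name element followed by a phone element, in that order -->
    <xsd:sequence>
      <!-- ref attribute: these refer to the declarations made above -->
      <xsd:element ref="c:name"/>
      <xsd:element ref="c:phone"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```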
choice element • Union (i.e., the '|' operator in a regular expression) corresponds to the choice element • e.g., a card element can be declared to contain either an email element or zero or one phone elements, but not both
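A sketch of a choice between an email element and an optional phone element. Apart from the namespace http://businesscard.org and the names card, email and phone, the details (the name element, the string types) are assumptions.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:b="http://businesscard.org"
            targetNamespace="http://businesscard.org">

  <xsd:element name="card" type="b:card_type"/>
  <xsd:element name="name" type="xsd:string"/>
  <xsd:element name="email" type="xsd:string"/>
  <xsd:element name="phone" type="xsd:string"/>

  <xsd:complexType name="card_type">
    <xsd:sequence>
      <xsd:element ref="b:name"/>
      <!-- either one email, or zero or one phone, but never both -->
      <xsd:choice>
        <xsd:element ref="b:email"/>
        <xsd:element ref="b:phone" minOccurs="0"/>
      </xsd:choice>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```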
all element • A content sequence matches an all expression if each constituent of the expression is matched somewhere in the content sequence and every element in the content sequence is matched by a constituent of the expression • Essentially a variant of sequence in which order does not matter
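A sketch of an all content model, again using assumed element names in the http://businesscard.org namespace. The children may appear in any order.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:b="http://businesscard.org"
            targetNamespace="http://businesscard.org">

  <xsd:element name="card" type="b:card_type"/>
  <xsd:element name="name" type="xsd:string"/>
  <xsd:element name="email" type="xsd:string"/>
  <xsd:element name="phone" type="xsd:string"/>

  <xsd:complexType name="card_type">
    <!-- name and email exactly once, phone at most once, in any order -->
    <xsd:all>
      <xsd:element ref="b:name"/>
      <xsd:element ref="b:email"/>
      <xsd:element ref="b:phone" minOccurs="0"/>
    </xsd:all>
  </xsd:complexType>

</xsd:schema>
```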
any element • The any element is an empty element that acts as a wildcard matching any element • Its namespace attribute limits the matching elements in various ways • whitespace-separated list of URIs, which may also include ##targetNamespace and ##local (the empty namespace) • ##any (any namespace) • ##other (any namespace except targetNamespace)
any element • Can be used to specify that a different language is used inside an element • e.g., XHTML used inside the info element in WidgetML • content must consist of one or more elements from the XHTML namespace
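A sketch of how an info element could admit XHTML content. The WidgetML namespace http://example.org/widgetml and the type name info_type are hypothetical, and processContents="lax" is an added assumption so that validation does not fail when no XHTML schema is available.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:w="http://example.org/widgetml"
            targetNamespace="http://example.org/widgetml">

  <xsd:element name="info" type="w:info_type"/>

  <xsd:complexType name="info_type">
    <!-- the wildcard is wrapped in a sequence, as a complex type
         cannot consist of a bare any element -->
    <xsd:sequence>
      <!-- one or more elements from the XHTML namespace -->
      <xsd:any namespace="http://www.w3.org/1999/xhtml"
               processContents="lax"
               minOccurs="1" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```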
Some restrictions • all element may only contain element references • sequence and choice elements cannot contain all elements • complexType contents cannot consist of single element or any declaration • need to wrap it in a sequence or choice element
Attribute references • A complex type may optionally contain a number of attribute references of the form <attribute ref="name" /> • name is the name of the attribute that has been declared elsewhere • attribute reference must appear after the content model description of a complex type • attribute reference can contain an attribute named use which can take the values optional (default) or required
minOccurs and maxOccurs • minOccurs and maxOccurs attributes can be used with • element, sequence, choice, all and any elements • define possible cardinalities of the element • values must be non-negative integers or, for maxOccurs, unbounded • by default, minOccurs and maxOccurs are 1
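The cardinality attributes can be sketched as follows. The type phonebook_type and its child elements are hypothetical.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:complexType name="phonebook_type">
    <xsd:sequence>
      <!-- defaults apply: minOccurs="1" and maxOccurs="1", so exactly once -->
      <xsd:element name="owner" type="xsd:string"/>
      <!-- zero or more occurrences -->
      <xsd:element name="phone" type="xsd:string"
                   minOccurs="0" maxOccurs="unbounded"/>
      <!-- optional: zero or one occurrence -->
      <xsd:element name="note" type="xsd:string" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```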
mixed attribute • complexType may optionally have an attribute, mixed="true" • means arbitrary character data is permitted anywhere in the content in addition to the elements declared in the content model • Without mixed="true" attribute, only whitespace allowed between elements in content model • Character data cannot be constrained if we also want to allow elements in the content
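A sketch of a mixed content model. The element names para and em are hypothetical, and the schema has no target namespace to keep the example short.

```xml
<!-- With mixed="true", an instance such as
       <para>Stir <em>gently</em> for two minutes.</para>
     is valid. Without it, only whitespace could surround the em element. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="para" type="para_type"/>

  <xsd:complexType name="para_type" mixed="true">
    <xsd:sequence>
      <xsd:element name="em" type="xsd:string"
                   minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```

The schema constrains which elements appear and in what order, but it places no constraint on the character data between them.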