Learn about XML schemas and their role in validating XML documents, including DTDs and other schema definition languages.
Chapter 4 - Quality Control with Schemas
Learning XML by Erik T. Ray
Slides were developed by Jack Davis, College of Information Science and Technology, Radford University
Schemas
• define an XML tag set - primarily elements, attributes, entities, and structure
• act as a pass-or-fail test for XML documents (validation)
• ensure that a document fulfills a minimum set of requirements, finding flaws that could result in anomalous processing
• are not required
• A validating XML parser takes an XML instance as input and produces a validation report as output. This report typically lists the errors found in the document (the places where it does not conform to the schema).
• Validation considers: structure, data typing, integrity (status of links between nodes and resources), and business rules (spell checking, checksums)
Schema Types
• DTD - Document Type Definition
  The oldest and most widely supported schema language. DTDs don't support namespaces (you can't mix tag sets within a single DTD) and have very weak data typing.
• The W3C built XML Schema
  XML Schemas are themselves XML documents, so they can be checked for well-formedness and validity. XML Schema supports namespaces and has a much broader ability to specify data types, including things like date types.
• Other schema definition languages are available (RELAX NG, Schematron, …).
DTDs
• XML elements and attributes are defined in a DTD.
• DTDs are extensible - they can be extended to meet the needs of the current task.
• A DTD can be specified within an XML document (internal) or in a separate file (external).
• Many DTDs are available on the internet and can be downloaded for free.
• DTDs declare a set of allowed elements. A conforming XML document can't use any elements not defined in this set.
• DTDs define a content model for each element. This describes what elements or data can go inside an element, in what order, in what number, and whether they are required or optional.
• DTDs declare a set of allowed attributes for each element, with data types and default values.
• DTDs provide mechanisms to manage the model, providing links to other components.
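The "allowed elements" rule above can be illustrated with a rough sketch. This is not a DTD validator; it only assumes a hand-written set of declared element names and uses Python's standard xml.etree.ElementTree to list any element a document uses outside that set. The function name and the sample tags (list, item, note) are hypothetical.

```python
import xml.etree.ElementTree as ET

def undeclared_elements(xml_text, declared):
    """Return the sorted element names used in xml_text but not declared."""
    root = ET.fromstring(xml_text)
    # root.iter() visits the root element and every descendant element.
    used = {element.tag for element in root.iter()}
    return sorted(used - set(declared))

# Hypothetical tag set: only <list> and <item> are declared.
report = undeclared_elements("<list><item/><note/></list>", {"list", "item"})
print(report)
```

An empty list means every element in the document was declared; a real validating parser would also check content models and attributes.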
Element Declarations
• Element declaration: <!ELEMENT element_name (content model)>
Content Model
• Text:
  • Description: text or character data
  • Syntax: (#PCDATA)
• Elements:
  • Description: contains other elements
  • Syntax: (element_1, element_2, …)
• Mixed Content:
  • Description: contains both text and other elements
  • Syntax: (#PCDATA | element_1 | element_2 …)*
• Empty:
  • Description: does not contain any content
  • Syntax: EMPTY
• Any:
  • Description: can contain text or elements
  • Syntax: ANY
Element Declaration Syntax
• Declaration syntax is flexible when it comes to whitespace. You can add extra space anywhere except in the string of characters at the beginning that identifies the declaration type (<!ELEMENT).
  For example, these are all acceptable:
    <!ELEMENT thingie ANY>
    <!ELEMENT   thingie   ANY   >
    <!ELEMENT thingie ( foo | bar | zap )*>
Element: Character Notations
• Question Mark:
  • Character: ?
  • Description: element may occur zero or one time
  • Usage: email?
• Asterisk:
  • Character: *
  • Description: element may occur zero or more times
  • Usage: email*
• Plus:
  • Character: +
  • Description: element may occur one or more times
  • Usage: email+
Element: Character Notations (cont.)
• Parentheses:
  • Character: ( )
  • Description: used to group a set of elements
  • Usage: (name, address, zip_code)
• Vertical bar:
  • Character: |
  • Description: used to indicate a choice (exactly one of the alternatives)
  • Usage: a | b | c
• Comma:
  • Character: ,
  • Description: used to indicate element sequence (elements must appear in this order)
  • Usage: (a, b, c)
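The occurrence characters behave much like regular-expression quantifiers. As a minimal sketch (not how a real validating parser is implemented), the content model (subject?, body, reply*) can be translated by hand into a Python regular expression over the sequence of child element names. The helper name children_match and the constant MESSAGE_MODEL are made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

def children_match(element, model_regex):
    """Return True if the element's child tag sequence matches the regex."""
    # Each child name is followed by a space so names cannot run together.
    sequence = "".join(child.tag + " " for child in element)
    return re.fullmatch(model_regex, sequence) is not None

# Hand translation of the DTD content model (subject?, body, reply*).
MESSAGE_MODEL = r"(subject )?(body )(reply )*"

ok = ET.fromstring("<message><subject/><body/><reply/><reply/></message>")
bad = ET.fromstring("<message><body/><subject/></message>")
print(children_match(ok, MESSAGE_MODEL))   # True
print(children_match(bad, MESSAGE_MODEL))  # False: subject appears after body
```

The comma becomes plain concatenation in the regex, and a DTD choice such as (a | b | c) would become a regex alternation.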
Attribute Declarations
  <!ATTLIST element_name
      attribute_name-1  datatype  default_value
      attribute_name-2  datatype  default_value
      attribute_name-3  datatype  default_value>
Examples:
  <!ATTLIST student level CDATA #REQUIRED>
  <!ATTLIST student level (fr | soph | jr | sr) "fr">
Attribute Data Types
• Data type: CDATA
  • Description: character data
• Data type: ID
  • Description: unique identifier that gives an element a label
• Data type: Enumerated list, e.g. (a | b | c)
  • Description: list of all possible values that the attribute can contain
Attributes: Default Values
• Default declaration: #FIXED
  • Description: if the attribute is present, its value must match the value assigned in the DTD
• Default declaration: #REQUIRED
  • Description: element must contain the attribute to be valid
• Default declaration: #IMPLIED
  • Description: attribute is optional
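A rough sketch of how these three keywords could be checked in code follows. It assumes a hand-written rule table rather than a parsed DTD; the attribute names (id, version, note), the fixed value, and the function name check_attributes are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: attribute name -> (keyword, fixed value or None).
RULES = {
    "id": ("#REQUIRED", None),
    "version": ("#FIXED", "1.0"),
    "note": ("#IMPLIED", None),
}

def check_attributes(element, rules):
    """Return a list of error messages for one element's attributes."""
    errors = []
    for name, (keyword, fixed) in rules.items():
        value = element.get(name)  # None when the attribute is absent
        if keyword == "#REQUIRED" and value is None:
            errors.append(f"missing required attribute '{name}'")
        elif keyword == "#FIXED" and value is not None and value != fixed:
            errors.append(f"attribute '{name}' must equal '{fixed}'")
        # #IMPLIED attributes are optional, so there is nothing to check.
    return errors

print(check_attributes(ET.fromstring('<doc version="2.0"/>'), RULES))
print(check_attributes(ET.fromstring('<doc id="a1"/>'), RULES))
```

The second call reports no errors: a #FIXED attribute may be omitted, and only a conflicting value is flagged.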
Example XML Document
  <?xml version="1.0" standalone="yes"?>
  <emails>
    <message num="a1" to="joe@acmeshipping.com"
             from="brenda@xyzcompany.com" date="02/09/01">
      <subject title="Order 10011"/>
      <body>
        Joe,
        Please let me know if order number 10011 has shipped.
        Thanks,
        Brenda
      </body>
      <reply status="yes"/>
    </message>
  </emails>
Internal DTD
  <!DOCTYPE emails [
    <!ELEMENT emails (message+)>
    <!ELEMENT message (subject?, body, reply*)>
    <!ATTLIST message
        num   ID     #REQUIRED
        to    CDATA  #REQUIRED
        from  CDATA  #FIXED "brenda@xyzcompany.com"
        date  CDATA  #REQUIRED>
    <!ELEMENT subject EMPTY>
    <!ATTLIST subject title CDATA #IMPLIED>
    <!ELEMENT body ANY>
    <!ELEMENT reply EMPTY>
    <!ATTLIST reply status (yes | no) "no">
  ]>
In a standalone XML document, this DOCTYPE declaration is placed in the document itself, after the XML declaration and before the root element. If the DTD is external, the XML document must instead contain a declaration like one of the following:
  <!DOCTYPE emails SYSTEM "emails.dtd">
or
  <!DOCTYPE emails SYSTEM "http://…">
DTD Census Example
• Here's a typical Census XML document, created after an interview with one family. All such documents could be compiled and overall statistics generated. (Example 4-1)
• Here's the DTD that defines the rules by which the Census XML documents are created. (Example 4-2)
DTD Design
• DTD design and construction is part science and part art form. The basic concepts are simple, but maintaining hundreds of element and attribute declarations while keeping them readable and bug-free can be a challenge.
• Keep it organized
  Good comments can save hours of scrutiny later, so do not wait until the end to document. Keep declarations separated into sections by purpose.
  Pad declarations with plenty of whitespace. Content models and attribute lists suffer from dense syntax, so spacing out the parts, even placing them on separate lines, helps. Indent lines inside declarations to make the delimiters clearer. Use extra space between logical divisions.
  DTDs will require updating as requirements change. Number the versions to avoid confusion later.
DTD Design (cont.)
• Parameter entities
  Parameter entities can hold recurring parts of declarations and allow you to edit them in one place. In the external subset, they can be used in element-type declarations to hold element groups and content models, or in attribute list declarations to hold attribute definitions.
  For example, assume you want every element to have an optional ID attribute for linking and an optional class attribute to assign specific role information. Parameter entities, which apply only in DTDs, look much like ordinary general entities but have an extra % in the declaration. You can declare a parameter entity as follows:
    <!ENTITY % common.atts "
        id     ID     #IMPLIED
        class  CDATA  #IMPLIED">
  The entity can then be used in attribute list declarations:
    <!ATTLIST foo %common.atts;>
    <!ATTLIST bar %common.atts;
        extra  CDATA  #FIXED "blah">
  Note that keywords such as #IMPLIED are case-sensitive and must be written in uppercase.
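To make the substitution concrete, here is a simplified sketch of parameter-entity expansion as plain text replacement. It is an illustration only: it assumes internal entities with double-quoted values, ignores external entities and the spec's finer points (such as the spaces a real parser adds around the replacement text), and the function name expand_parameter_entities is made up.

```python
import re

def expand_parameter_entities(dtd_text):
    """Replace each %name; reference with the text of its declaration."""
    # Collect declarations of the form <!ENTITY % name "replacement text">.
    entities = dict(
        re.findall(r'<!ENTITY\s+%\s+([\w.\-]+)\s+"([^"]*)"\s*>', dtd_text)
    )
    # Unknown references are left untouched.
    return re.sub(
        r"%([\w.\-]+);",
        lambda match: entities.get(match.group(1), match.group(0)),
        dtd_text,
    )

dtd = (
    '<!ENTITY % common.atts "id ID #IMPLIED class CDATA #IMPLIED">\n'
    "<!ATTLIST foo %common.atts;>"
)
print(expand_parameter_entities(dtd))
```

After expansion, the ATTLIST for foo contains both attribute definitions, which is the effect the parser produces before it interprets the declaration.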
Attributes vs. Elements
• Making a DTD from scratch is not easy. You have to break information down into its conceptual atoms and package it as a hierarchical structure, but it's not always clear how to divide the information.
• Choose names that make sense. Element names like thing, object, and chunk are nearly impossible to figure out.
• Hierarchy adds information. A newspaper has articles that contain paragraphs and heads. Containers create boundaries that make it easier to write stylesheets and processing applications.
• Strive for a tree structure that resembles a wide, bushy shrub. If you go too deep, the markup begins to overwhelm the content and it becomes harder to edit a document; too shallow, and the information content is diluted.
Attributes vs. Elements (cont.)
• Know when to use elements over attributes. An element holds content that is part of your document. An attribute modifies the behavior of an element. The trick is to find a balance between using general elements with attributes to specify purpose and creating an element for every single contingency.
• There are advantages to splitting a monolithic DTD into smaller components, or modules. The first is that a modularized DTD can be easier to maintain.
• XML provides two ways to modularize your DTD. The first is to store parts in separate files, then import them with external parameter entities. The second is to use a syntactic device called a conditional section.
Importing Modules
• To import whole DTDs or parts of DTDs, use an external parameter entity.
    <!ELEMENT catalog (title, metadata, front, entries+)>
    <!ENTITY % basic.stuff SYSTEM "basics.mod">
    %basic.stuff;
    <!ENTITY % frnt.matter SYSTEM "front.mod">
    %frnt.matter;
    <!ENTITY % metadata PUBLIC "-//Standards Stuff//DTD Metadata v3.2//EN" "http://www.standards- ….">
    %metadata;
  This DTD has two local components, which are specified by system identifiers. Each component has a .mod filename extension, which is a traditional way to show that a file contains declarations but should not be used as a DTD on its own.
Examples
• standalone.xml
• itfac.xml
  Review the itfac.xml document; students should then develop the DTD.
• faculty.dtd
• faculty.css
XML Schema Overview
• The XML Schema specification was released by the W3C in May 2001 and contains two parts:
  • Part I - structure
  • Part II - data types
• It was developed as an alternative to DTDs and is much more powerful
• Features:
  • Pattern matching
  • Rich set of data types
  • Attribute grouping
  • Supports XML namespaces
  • Follows XML syntax
XML Schemas
• The XML Schema specification was released by the W3C in May of 2001
• XML Schemas, like DTDs, are used to describe the structure of an XML document
• The XML Schema specification consists of two parts:
  • XML Schema: Structures. This specification consists of a definition language for describing and constraining the content of XML documents.
  • XML Schema: Datatypes. This specification defines the datatypes to be used in XML schemas.
• The namespace for XML Schema is: http://www.w3.org/2001/XMLSchema
XML Schema - advantages
• XML Schema allows you to import vocabularies (tag sets).
• XML Schemas are XML documents, so they can be validated.
• The XML Schema specification contains a number of built-in datatypes, and also allows developers to create their own datatypes.
• Some of the datatypes are:
    xs:string      text
    xs:token       text tokens (whitespace normalized)
    xs:QName       namespace-qualified name
    xs:decimal     positive and negative decimal numbers, including whole numbers
    xs:integer     integers
    xs:float       floating-point number
    xs:ID, IDREF   identification token
    xs:boolean     true or false
    xs:time        HH:MM:SS
    xs:date        CCYY-MM-DD
    xs:dateTime    CCYY-MM-DDTHH:MM:SS-Zone
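As a loose illustration of what data typing buys you, the sketch below maps a few of these built-in types onto Python standard-library parsers. The function names are hypothetical and the mapping is an approximation: Python's parsers do not accept exactly the same lexical forms as XML Schema, and time-zone suffixes are not handled here.

```python
from datetime import date, time
from decimal import Decimal

def parse_boolean(text):
    """xs:boolean accepts the literals true, false, 1, and 0."""
    if text in ("true", "1"):
        return True
    if text in ("false", "0"):
        return False
    raise ValueError(f"not an xs:boolean literal: {text!r}")

# Approximate parsers for a handful of the built-in types.
PARSERS = {
    "xs:string": str,
    "xs:integer": int,
    "xs:decimal": Decimal,
    "xs:boolean": parse_boolean,
    "xs:date": date.fromisoformat,  # CCYY-MM-DD
    "xs:time": time.fromisoformat,  # HH:MM:SS
}

def parse_value(type_name, text):
    """Convert text to a Python value, raising an error if it is malformed."""
    return PARSERS[type_name](text)

print(parse_value("xs:date", "2001-02-09"))
print(parse_value("xs:boolean", "true"))
```

A value such as 02/09/01 is rejected as an xs:date, whereas a DTD, which only knows CDATA, would accept it.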
Complex Elements
• Most elements are not simple. They can contain elements, attributes, and character data with specialized formats, so complex types can be defined for them.
  Here's an example complex type definition.
    <xs:element name="date">
      <xs:complexType>
        <xs:all>
          <xs:element ref="year"/>
          <xs:element ref="mo"/>
          <xs:element ref="day"/>
        </xs:all>
      </xs:complexType>
    </xs:element>
    <xs:element name="year" type="xs:integer"/>
    <xs:element name="mo" type="xs:integer"/>
    <xs:element name="day" type="xs:integer"/>
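The xs:all group in this definition means the child elements may appear in any order, and with the default occurrence settings each must appear exactly once. A minimal sketch of that rule (the function name is hypothetical, and this is not a schema processor):

```python
import xml.etree.ElementTree as ET

def matches_all_group(element, required_names):
    """True if the children are exactly the required names, in any order."""
    names = [child.tag for child in element]
    # Sorting ignores order but still catches missing or repeated children.
    return sorted(names) == sorted(required_names)

DATE_CHILDREN = ["year", "mo", "day"]

reordered = ET.fromstring("<date><day>9</day><year>2001</year><mo>2</mo></date>")
incomplete = ET.fromstring("<date><year>2001</year><mo>2</mo></date>")
print(matches_all_group(reordered, DATE_CHILDREN))   # True
print(matches_all_group(incomplete, DATE_CHILDREN))  # False
```

Replacing xs:all with xs:sequence in the schema would additionally require the year, mo, day order.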
Restriction Elements
• In the previous example the month number was just given as type integer. However, this would allow the user to insert any integer into the document for the month number. We'd like to restrict the month number to 1-12.
    <xs:simpleType name="monthNum">
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="1"/>
        <xs:maxInclusive value="12"/>
      </xs:restriction>
    </xs:simpleType>
    <xs:element name="mo" type="monthNum"/>
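In procedural terms, the monthNum restriction amounts to "is an integer, and lies between 1 and 12 inclusive". A small sketch of the equivalent check, with a hypothetical function name:

```python
import re

def is_valid_month(text):
    """Mimic monthNum: an integer with minInclusive 1 and maxInclusive 12."""
    stripped = text.strip()  # surrounding whitespace is ignored
    if re.fullmatch(r"[+-]?[0-9]+", stripped) is None:
        return False  # not an integer at all
    return 1 <= int(stripped) <= 12

print(is_valid_month("7"))    # True
print(is_valid_month("13"))   # False
print(is_valid_month("may"))  # False
```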
Restriction Elements (cont.)
• Restrictions can create fixed values, constrain the length of strings, and match patterns with regular expressions. Here's an example that restricts a postal code (three digits followed by three capital letters).
    <xs:element name="postalcode" type="pcode"/>
    <xs:simpleType name="pcode">
      <xs:restriction base="xs:token">
        <xs:pattern value="[0-9]{3}[A-Z]{3}"/>
      </xs:restriction>
    </xs:simpleType>
• Restrictions can also implement enumeration types
    <xs:simpleType name="gender">
      <xs:restriction base="xs:token">
        <xs:enumeration value="female"/>
        <xs:enumeration value="male"/>
      </xs:restriction>
    </xs:simpleType>
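Both restrictions translate directly into ordinary code. The sketch below reuses the same regular expression and value list; the function names are hypothetical. Note that a schema pattern must match the whole value, so the Python equivalent is fullmatch rather than search.

```python
import re

# Same expression as the xs:pattern facet above.
PCODE_PATTERN = re.compile(r"[0-9]{3}[A-Z]{3}")

# Same values as the xs:enumeration facets above.
GENDER_VALUES = {"female", "male"}

def is_valid_pcode(text):
    """Three digits followed by three capital letters, and nothing else."""
    return PCODE_PATTERN.fullmatch(text) is not None

def is_valid_gender(text):
    """The value must be one of the enumerated tokens."""
    return text in GENDER_VALUES

print(is_valid_pcode("123ABC"))   # True
print(is_valid_pcode("123abc"))   # False
print(is_valid_gender("female"))  # True
```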
XML Schema Occurrence Constraints
• Occurrence constraints define the number of times a particular element can or must occur
• Attributes:
  • minOccurs: defines the minimum number of times an element can occur. Default value is 1.
  • maxOccurs: defines the maximum number of times an element can occur. Default value is 1.
• The value of the "maxOccurs" attribute can be set to "unbounded" to indicate that there is no maximum number of times the element can occur
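The occurrence rules, including the defaults of 1, can be sketched as a single comparison. The function name and parameter names are hypothetical; only the defaults and the "unbounded" keyword come from the slide.

```python
def occurrence_ok(count, min_occurs=1, max_occurs=1):
    """Check a child count against minOccurs/maxOccurs (both default to 1)."""
    if count < min_occurs:
        return False
    if max_occurs == "unbounded":
        return True  # no upper limit
    return count <= max_occurs

print(occurrence_ok(1))                                        # True
print(occurrence_ok(0))                                        # False
print(occurrence_ok(0, min_occurs=0))                          # True
print(occurrence_ok(25, min_occurs=1, max_occurs="unbounded")) # True
```

In DTD terms, minOccurs="0" with the default maxOccurs corresponds to ?, minOccurs="0" with maxOccurs="unbounded" to *, and minOccurs="1" with maxOccurs="unbounded" to +.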
XML Schema Simple Type Example
• XML schemas are put together like DTDs, with element and attribute declarations along with type declarations. A simple example shows the structure.
• XML file:
    <?xml version="1.0"?>
    <email xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="email_schema.xsd">
      This is my e-mail message
    </email>
• Schema file:
    <?xml version="1.0"?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="email" type="xsd:string"/>
    </xsd:schema>
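The link between the instance and its schema is just a namespaced attribute on the root element. As a small sketch, Python's standard xml.etree.ElementTree can read it; ElementTree does not validate against the schema, so this only shows where a validating processor would find the schema file name.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

document = '''<?xml version="1.0"?>
<email xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:noNamespaceSchemaLocation="email_schema.xsd">
  This is my e-mail message
</email>'''

root = ET.fromstring(document)
# ElementTree spells namespaced names as {namespace-uri}local-name.
schema_file = root.get(f"{{{XSI_NS}}}noNamespaceSchemaLocation")
message = root.text.strip()
print(schema_file)  # email_schema.xsd
print(message)      # This is my e-mail message
```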
XML Schemas
• XML Schemas utilize:
  • type extension
  • type restriction
  • lists
  • unions
  • namespace features
  and much, much more.
  This brief presentation only scratches the surface of XML schemas.
XML Schema Example
• Here's a schema for the Census example for which a DTD was defined earlier. Note the differences. (Example 4-6)