Introduction to DTD Bun Yue Professor, CS/CIS UHCL
Introduction • A DTD is a grammar that is used to determine the validity of an XML document. • There is no separate recommendation for DTD. • It is embedded inside the XML recommendation: http://www.w3.org/TR/2008/REC-xml-20081126/ (5th edition).
DTD • DTD is used to specify additional constraints and rules for a given vocabulary, such as • element nesting rules • attribute name and value constraints. • DTD allows XML parsers to catch errors as early as possible. Errors are less costly to fix in earlier stages.
Validation • An XML document satisfying the rules of a DTD is said to be valid. • The command line DTD validation tool, xmlvalid, can be obtained from http://www.elcel.com/products/xmlvalid.html. • XML editors and parsers can usually be used to validate XML documents.
Example
<?xml version="1.0"?>
<person>
  <name>Adam</name>
  <spouse>Lucy</spouse>
  <spouse>Eva</spouse>
</person>
• Should there be two spouses? Is it an error? • Is "Eva" or "Lucy" a person? • Is there any additional information about "Lucy" or "Eva"?
Creator’s Intentions • General problems with XML documents: Creators may not know what applications will use the file. • Need to communicate creator's intentions to users.
Document Modeling • XML document modeling defines a grammar to restrict and constrain an XML application. • Advantages of document modeling: • Clear intention. • Restrictions lead to easier processing. • Interoperability improves if everyone uses the same standards. • Facilitate the development of tools for the XML applications.
Document Modeling • Disadvantages of document modeling: • Time for development. • Potentially more time-consuming to check validity. • May be too restrictive.
XML Modeling Languages • Many methods. • Two main standards: • Document Type Definition (DTD): more established, but limited. • XML Schema: more sophisticated and gaining popularity. • May use both.
Example Continuing on the previous example, a better approach is to specify the constraints using DTD, such as: A person may only have up to one spouse. A spouse must refer to a person in the same XML doc.
DTD Example A possible DTD declaration for this:
…
<!ELEMENT person (name, pet*)>
<!ATTLIST person
  id     ID    #REQUIRED
  spouse IDREF #IMPLIED>
…
XML Example An XML document satisfying the DTD:
...
<person id="p12324" spouse="p10001">
  <name>Adam</name>
  <pet>Eva</pet>
</person>
<person id="p10001">
  <name>Lucy</name>
</person>
...
This XML document is valid w.r.t. the DTD.
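The ID/IDREF constraint in this example can also be checked by hand, which helps show what a validating parser does on our behalf. The following is a minimal Python sketch, not a DTD validator. It assumes a hypothetical <persons> root element around the two person elements (the slide elides the surrounding document), and it hard-codes the knowledge that id is an ID and spouse is an IDREF instead of reading it from the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: the <persons> root is an assumption, since the
# slide only shows the two <person> elements.
DOC = """
<persons>
  <person id="p12324" spouse="p10001">
    <name>Adam</name>
    <pet>Eva</pet>
  </person>
  <person id="p10001">
    <name>Lucy</name>
  </person>
</persons>
"""

def dangling_spouse_refs(xml_text):
    """Return spouse values that do not match the id of any person."""
    root = ET.fromstring(xml_text)
    ids = {p.get("id") for p in root.iter("person")}
    return [p.get("spouse") for p in root.iter("person")
            if p.get("spouse") is not None and p.get("spouse") not in ids]

# An empty list means every spouse reference resolves to a person.
print(dangling_spouse_refs(DOC))
```

Changing spouse="p10001" to a value that no person carries makes the function report it, which is the same situation a validating parser would flag as an IDREF error.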
Document Modeling • Without a document model, an XML document only needs to be well-formed and it may have: • unlimited and unrestricted vocabulary: any element and attributes will be allowed. • no grammar rules, for example: • any element can be nested within any other element. • any element may have any attribute. • an attribute may have any value.
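To make the difference between well-formed and valid concrete, the sketch below feeds the earlier two-spouse document to Python's standard-library ElementTree parser, which is non-validating. It is only an illustration: the helper name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET

DOC = """<person>
  <name>Adam</name>
  <spouse>Lucy</spouse>
  <spouse>Eva</spouse>
</person>"""

def is_well_formed(xml_text):
    """True if a non-validating parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

root = ET.fromstring(DOC)
spouses = [s.text for s in root.findall("spouse")]

# Both spouses are accepted: without a document model nothing limits them.
print(spouses)
# A mismatched tag is still rejected, because that breaks well-formedness.
print(is_well_formed("<person><name>Adam</person>"))
```

The parser only rejects the second input because its tags do not nest properly; the question of how many spouses a person may have never comes up.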
Associating XML to Document Type Declarations • The <!DOCTYPE> tag is used in XML to associate the XML document with its document type declarations. • It is optional, but if present it must come after the XML declaration and before the root element. • DTD declarations can be: • Internal DTD Subset • External DTD • The name of the root element must follow the keyword DOCTYPE.
Internal Subset • DTD is defined at the beginning of an XML document within the <!DOCTYPE> tag. • Format: <!DOCTYPE root-element external-subset-declaration [internal-subset-declaration]>.
Internal Subset Example
<?xml version="1.0"?>
<!DOCTYPE persons [
  <!ELEMENT persons (#PCDATA)>
]>
<persons> Kwok-Bun Yue </persons>
Internal Subset Example
<?xml version="1.0"?>
<!DOCTYPE board SYSTEM "msg.dtd" [
  <!ENTITY monitor "Kwok-Bun Yue">
  <!ENTITY monitoremail "yue@cl.uh.edu">
]>
…
Consideration • Internal DTD declarations have higher precedence than external DTD. • Internal DTD advantages: • always available as it is part of the XML document. • Higher precedence than external DTD. • Disadvantages: • Wasted transmission for non-validating parsers. • Redundancy problems: many documents may have the same internal DTD subset definitions. • Good to use Internal DTD subset to override external DTD (for example, to define entities suitable for the XML document.)
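As a small illustration of defining entities in the internal subset, the sketch below parses a document with Python's standard-library ElementTree. That parser is non-validating but still reads the internal subset and expands the entities declared there. The content of the <board> element and the omission of the external msg.dtd are assumptions made so the example is self-contained.

```python
import xml.etree.ElementTree as ET

# The <board> text is hypothetical; only the two entity declarations
# come from the slides.
DOC = """<?xml version="1.0"?>
<!DOCTYPE board [
  <!ENTITY monitor "Kwok-Bun Yue">
  <!ENTITY monitoremail "yue@cl.uh.edu">
]>
<board>Contact &monitor; at &monitoremail;.</board>"""

board = ET.fromstring(DOC)

# The entity references have been replaced by their declared values.
print(board.text)
```

Because the declarations travel with the document, the replacement text is available even when no external DTD can be fetched.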
External DTD • External DTD is stored in external resources (e.g. files specified by a URL). Example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
External DTD Format • Instruct the XML document to get the DTD from the URL. • The keyword after the root element can be: • SYSTEM: always get the DTD from the URL. • PUBLIC: may get the DTD from some other means.
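The parts of an external DTD declaration can be inspected programmatically. The following sketch uses the StartDoctypeDeclHandler callback of Python's xml.parsers.expat module, which reports the root element name, the system identifier (the URL), the public identifier, and whether an internal subset is present. The helper name read_doctype and the empty <html/> root are assumptions for the example, and the parser does not download the DTD.

```python
import xml.parsers.expat

DOC = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
       '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
       '<html/>')

def read_doctype(xml_text):
    """Return the fields of the DOCTYPE declaration as a dict."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["name"] = name
        found["system_id"] = system_id
        found["public_id"] = public_id
        found["has_internal_subset"] = bool(has_internal_subset)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found

info = read_doctype(DOC)
for key, value in info.items():
    print(key, "=", value)
```

With SYSTEM instead of PUBLIC there is no public identifier to report, so the public_id field would be None.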
Formal Public Identifier • "-//W3C//DTD XHTML 1.0 Strict//EN" is the formal public identifier (FPI) of the XHTML DTD. FPIs identify resources by name instead of by URL and thus do not have the URL relocation problem. Rough meaning in this example: • '-': not registered. • 'W3C': owner id, W3C in this case. • 'DTD XHTML 1.0 Strict': type and description of document. • 'EN': language, English, in this case.
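The four fields of an FPI are separated by "//", so the rough meaning listed above can be recovered with a simple split. This is a minimal sketch: the FPI class and parse_fpi function are names invented for the example, and it performs no check beyond counting the fields.

```python
from typing import NamedTuple

class FPI(NamedTuple):
    registration: str   # "-" means not registered
    owner: str          # owner id, e.g. W3C
    description: str    # type and description of the document
    language: str       # language code, e.g. EN

def parse_fpi(text):
    """Split a formal public identifier into its four fields."""
    parts = text.split("//")
    if len(parts) != 4:
        raise ValueError("expected 4 fields separated by '//'")
    return FPI(*parts)

xhtml = parse_fpi("-//W3C//DTD XHTML 1.0 Strict//EN")
print(xhtml.owner, "|", xhtml.description, "|", xhtml.language)
```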
FPI • Another example of FPI: "-//W3C//DTD HTML 4.0 Transitional//EN" • FPI is required for PUBLIC but not SYSTEM.
DTD Declarations • Rigorous data modeling should be used to define a DTD. • Need to define the right business rules and constraints. • Errors in DTD are costly. • Usually, define the DTD to be as restrictive as possible.
DTD • A DTD is composed of a sequence of declarations. • Each DTD declaration declares one of the following constructs: • ELEMENT: XML element types • ATTLIST: attributes of an element • ENTITY: reusable content referenced by the &…; syntax • NOTATION: external contents not to be parsed.
DTD Declarations • If there is a conflict, earlier declarations have higher precedence. • Although internal declarations are physically located after external declarations, they are read first and thus have higher precedence. • No forward reference is allowed for parameter entities.
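The rule that earlier declarations win can be observed with entity declarations. In the sketch below the same entity is declared twice in an internal subset and the document is parsed with Python's standard-library ElementTree. The element name note and the entity name author are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE note [
  <!ENTITY author "first binding">
  <!ENTITY author "second binding">
]>
<note>&author;</note>"""

note = ET.fromstring(DOC)

# The first declaration is the binding one; the second is ignored.
print(note.text)
```

This is also why an internal subset can override an external DTD: its declarations are read first, so a later declaration of the same entity in the external DTD has no effect.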
Element Declarations • Format of element declaration: <!ELEMENT element-name element-declaration>. • Element declarations can be one of the following four kinds. • EMPTY • ANY • #PCDATA • Content model: most important.
EMPTY and ANY • EMPTY: empty element. E.g. <!ELEMENT file EMPTY> • ANY: may contain anything. • No checking of the content. • Any embedded descendant elements will still need to be declared within the DTD. E.g. <!ELEMENT freeForAll ANY>
#PCDATA and Content Model • #PCDATA (parsed character data): text that is parsed for entity reference replacement. E.g. <!ELEMENT firstname (#PCDATA)> • Content model: a declaration of contents enclosed by ( and ) for specifying child elements.
Content Model • The following symbols can be used by content models: • ,: sequencing. • (): grouping • ?: 0 or 1. • *: 0 or more. • +: 1 or more. • |: or.
Example
<!ELEMENT abc EMPTY>
<!ELEMENT generalnote ANY>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT name (lastname, firstname)>
Example
<!ELEMENT name (first, middleinitial?, last)>
Acceptable:
<name><first>Bun</first><last>Yue</last></name>
<name>
  <first>Bun</first>
  <middleinitial>K</middleinitial>
  <last>Yue</last>
</name>
Example
<!ELEMENT name (first, middleinitial?, last)>
Not acceptable:
<name><last>Yue</last><first>Bun</first></name>
<name>The one and only: <middleinitial>K</middleinitial> <first>Bun</first><last>Yue</last></name>
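A content model behaves much like a regular expression over the sequence of child element names. The sketch below makes that analogy explicit by translating an element-only content model into a Python regex and matching it against the children of an element. It is an illustration under stated assumptions, not how a real validating parser is implemented: the function names are invented here, only ASCII names are handled, #PCDATA, EMPTY and ANY are not supported, and character data between children is ignored.

```python
import re
import xml.etree.ElementTree as ET

def content_model_to_regex(model):
    """Translate an element-only content model into a compiled regex.

    Handles names, ',', '|', '?', '*', '+' and parentheses only.
    """
    compact = re.sub(r"\s+", "", model)
    # Each element name becomes a group matching that name plus one space.
    pattern = re.sub(r"[A-Za-z_][\w.\-]*",
                     lambda m: "(?:" + re.escape(m.group(0)) + " )",
                     compact)
    # Sequencing needs no operator in a regex, so commas are dropped.
    return re.compile(pattern.replace(",", ""))

def child_sequence(xml_text):
    """Return the root's child element names, each followed by a space."""
    return "".join(child.tag + " " for child in ET.fromstring(xml_text))

NAME_MODEL = content_model_to_regex("(first, middleinitial?, last)")

def matches_name_model(xml_text):
    return NAME_MODEL.fullmatch(child_sequence(xml_text)) is not None

print(matches_name_model("<name><first>Bun</first><last>Yue</last></name>"))
print(matches_name_model("<name><last>Yue</last><first>Bun</first></name>"))
```

The operators ?, *, + and | keep their meaning in the regex, which is why only the names and the commas need rewriting.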
Exercise #1 Provide a DTD that will validate the following:
<names>
  <name><first>Bun</first><last>Yue</last></name>
  <name>
    <first>Bun</first>
    <middleinitial>K</middleinitial>
    <last>Yue</last>
  </name>
</names>
Mixed Content Model • For mixed content model, the following format must be used: (#PCDATA | child-element-1 | child-element-2 ...)* • #PCDATA must come first. • * must be used. • (#PCDATA) is also mixed content. • In general, mixed content models (character data and elements) should be avoided if possible.
Mixed Content Model • Mixed content models should be avoided if possible because: • they provide minimal constraints. • they are harder to parse. • behavior may differ with and without a DTD: some spaces may be for cosmetic use only. • scattered #PCDATA is up to interpretation.
Exercise #2 • Can you provide an example of well known elements that use mixed content models?
Exercise #3a How many text nodes are there? <a> <title>Greeting</title> <b>Hello</b> How are you? <b>Goodbye</b> </a>
Exercise #3b How many text nodes are there using <!ELEMENT a (#PCDATA | title | b)* >? <a> <title>Greeting</title> <b>Hello</b> How are you? <b>Goodbye</b> </a>
Exercise #3c How many text nodes are there using <!ELEMENT a (title | b | c)*>? <a> <title>Greeting</title> <b>Hello</b> How are you? <b>Goodbye</b> </a>
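One way to explore these three exercises is to count the text nodes that a non-validating parser reports, since such a parser has no DTD to tell it which white space is only cosmetic. The sketch below uses Python's xml.dom.minidom on the document exactly as it is written in the exercises, on one line with single spaces. It does not answer 3b or 3c, where the DTD changes how the white space should be treated, and the helper name text_nodes is made up for the example.

```python
from xml.dom import minidom

DOC = "<a> <title>Greeting</title> <b>Hello</b> How are you? <b>Goodbye</b> </a>"

def text_nodes(node):
    """Collect every text node below the given DOM node."""
    found = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            found.append(child)
        else:
            found.extend(text_nodes(child))
    return found

nodes = text_nodes(minidom.parseString(DOC).documentElement)
whitespace_only = [n for n in nodes if n.data.strip() == ""]

# Total text nodes, and how many of them contain nothing but white space.
print(len(nodes), len(whitespace_only))
```

Comparing the two numbers shows how much of the count depends on white space that a reader would probably consider formatting.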
Example
<!ELEMENT email (from, to+, cc*, subject, body)>
Acceptable:
<email>
  <from>Yue</from><to>Lee</to><to>Smith</to>
  <cc>King</cc>
  <subject>hello</subject>
  <body>good bye</body>
</email>
Example
<!ELEMENT email (from, to+, cc*, subject, body)>
Not acceptable:
<email>
  <from>Yue</from><cc>King</cc><to>Lee</to>
  <cc>Queen</cc><to>Smith</to>
  <subject>hello</subject>
  <body>good bye</body>
</email>
Exercise #4 • Modify the DTD so cc and to can come in any order (there should still be at least one to).
Exercise #5 Comment on this DTD:
<!ELEMENT bookcollection (book+)>
<!ELEMENT book (author, publisher, isbn, chapter*)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT isbn (#PCDATA)>
<!ELEMENT chapter (#PCDATA)>
Exercise #6 • How do you declare in DTD that element <a> may have a <b> and a <c> child in any order?
ATTLIST Declarations • To declare attribute properties of an element. • Format: <!ATTLIST element-name attribute-declarations> • Attribute declarations declare one or more attributes. • Each attribute declaration includes the attribute name, its type and a setting.
Example
<!ATTLIST person comment CDATA #IMPLIED>
<!ATTLIST person
  ssn    ID            #REQUIRED
  gender (male|female) #IMPLIED
  age    CDATA         #IMPLIED
  iq     CDATA         "100">
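The effect of the different settings shows up when a document omits attributes. In the sketch below the two ATTLIST declarations are placed in an internal subset and the document is parsed with Python's xml.parsers.expat, a non-validating parser that still applies declared default values. The <person> instance and the helper name root_attributes are assumptions for the example.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE person [
  <!ATTLIST person comment CDATA #IMPLIED>
  <!ATTLIST person
    ssn    ID            #REQUIRED
    gender (male|female) #IMPLIED
    age    CDATA         #IMPLIED
    iq     CDATA         "100">
]>
<person ssn="s123" gender="male"/>"""

def root_attributes(xml_text):
    """Return the attributes the parser reports for the root element."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: seen.append(attrs)
    parser.Parse(xml_text, True)
    return seen[0]

attrs = root_attributes(DOC)

# iq was not written in the document but is reported with its default;
# the omitted #IMPLIED attributes (age, comment) are simply absent.
print(sorted(attrs.items()))
```

Since this parser does not validate, it would not complain if the #REQUIRED ssn attribute were missing; only a validating parser enforces that.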
Attribute data types • CDATA: character data. A string, least restrictive. • ID: unique identifier. • A unique name string within the document; • Must start with a letter, a "_" or a ":". • Like a primary key of the element, but not exactly so. • No two elements within the XML document should have the same ID value. • The scope of ID is for the document, not for the element.
Attribute data types • IDREF: identifier reference. Refers to the ID value of some other element. • IDREFS: identifier reference list. Refers to many ID values separated by white space. • ENTITY: entity name. Name of a pre-defined external entity. • ENTITIES: entity name list. Many entity names separated by white space.
Attribute data types • NMTOKEN: name token. A token formed from name characters only: letters, digits, ".", "-", "_", and ":". Unlike a name, it may start with any of these characters, including a digit. • NMTOKENS: name token list. Many NMTOKENs separated by white space. • NOTATION: notation list. For referencing data other than XML. • A list of notation names. • Each notation contains instructions for processing non-XML data.
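The difference between a name (as required for ID values) and a name token can be sketched with two regular expressions. In the XML recommendation a name token is one or more name characters, so it may begin with a digit, while a name must begin with a letter, "_" or ":". The patterns below are ASCII-only approximations, and the function names are invented for this example.

```python
import re

# ASCII-only approximations of the XML productions; the real grammar
# also admits many non-ASCII letters and digits.
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")
NMTOKEN_RE = re.compile(r"[A-Za-z0-9._:\-]+")

def is_name(value):
    """Approximate check for an XML name (usable as an ID value)."""
    return NAME_RE.fullmatch(value) is not None

def is_nmtoken(value):
    """Approximate check for a single name token."""
    return NMTOKEN_RE.fullmatch(value) is not None

def is_nmtokens(value):
    """Approximate check for a white-space separated list of name tokens."""
    tokens = value.split()
    return bool(tokens) and all(is_nmtoken(t) for t in tokens)

for candidate in ["p12324", "12324", "-draft", "two words"]:
    print(candidate, is_name(candidate), is_nmtoken(candidate))
```

This is why a purely numeric key such as 12324 can be stored in an NMTOKEN attribute but not in an ID attribute.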