XML Syntax - Writing XML and Designing DTD's Alan Robinson
XML Syntax • Elements • XML tags for markup • Attributes • Tuple information of elements • Declarations • Instructions to XML processor • Processing Instructions • Instructions to external applications
A Piece of XML <seq id="my_seq" name="NUCLEAR RIBONUCLEOPROTEIN"> <dbxref> <database>SWISS-PROT</database> <unique_id>P09651</unique_id> </dbxref> <residues type="aa"> SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVTYATVEEVDAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHLRDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKALSKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDSYNNGGGRGFGGGSGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSSSSSSYGSGRRF </residues> </seq>
An XML DTD <?xml version='1.0' encoding="US-ASCII"?> <!DOCTYPE seq [ <!ELEMENT seq (dbxref*, residues?) > <!ATTLIST seq id ID #REQUIRED name CDATA #IMPLIED length CDATA #IMPLIED > <!ELEMENT residues (#PCDATA)> <!ATTLIST residues type (dna | rna | aa) #REQUIRED> ]>
Elements • Basic rules • Start tag <tag_name> and end tag </tag_name> • Tags must be nested • <tag1><tag2>…</tag2></tag1> • Tags may be empty (no enclosed data) • <empty_tag/> • Whitespace in element content usually ignored • <section><p> … </p></section> • <section> <p> … </p></section>
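The nesting rule is easy to try out with a parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed is made up for illustration and is not part of the original slides.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

nested = "<tag1><tag2>data</tag2></tag1>"        # tags close in reverse order
overlapping = "<tag1><tag2>data</tag1></tag2>"   # tags overlap instead of nesting
empty = "<tag1><empty_tag/></tag1>"              # empty element, no end tag needed

print(is_well_formed(nested))
print(is_well_formed(overlapping))
print(is_well_formed(empty))
```

Only the overlapping document should be rejected; the empty-element form is accepted like any other element.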
Attributes • Provide additional information about an element • Values enclosed by quotes - either " or ' • Case-sensitive • May be character data or tokenized • value="Blue Peter" (character data) • value="blue" (single token) • value="red green blue" (tokens) • Values may be enumerated or defaulted (DTD)
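A short sketch of the quoting and case rules, assuming Python's standard xml.etree.ElementTree module. The element name mug and the attribute names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Single and double quotes are both legal delimiters for attribute values.
elem = ET.fromstring("""<mug color='blue' Color="red" label="Blue Peter"/>""")

# Attribute names are case-sensitive, so color and Color are two
# different attributes rather than a duplicate.
attrs = dict(elem.attrib)
print(attrs)
```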
Declarations • Instructions for the XML processor • Format - <! … > or <! … [<! … >]> • Document type - <!DOCTYPE … > • Character data - <![CDATA[ … ]]> • Entities - <!ENTITY … > • Notation - <!NOTATION … > • Element - <!ELEMENT … > • Attributes - <!ATTLIST … > • <![INCLUDE[…]]> and <![IGNORE[…]]>
Document Type Declaration • Identifies the name of the document root element • <!DOCTYPE My_XML_Doc> • May also add entity definitions and DTD • <!DOCTYPE My_XML_Doc [ … ] ><My_XML_Doc> ...</My_XML_Doc>
Comment Declaration • Comments are not considered part of XML document and should not be published • <!-- A comment --> • Cannot have additional '--' in comment • Cannot embed inside other declarations
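The restriction on '--' can be seen by feeding both forms to a parser. This is a sketch under the assumption that Python's standard xml.etree.ElementTree module (backed by expat) is used; the document content is invented for the example.

```python
import xml.etree.ElementTree as ET

good = "<doc><!-- A comment -->text</doc>"
bad = "<doc><!-- A -- comment -->text</doc>"   # extra '--' inside the comment

ET.fromstring(good)   # parses without complaint

try:
    ET.fromstring(bad)
    rejected = False
except ET.ParseError:
    rejected = True

print(rejected)
```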
Character Data Declaration • For occasions when text must contain uninterpreted markup characters • Press <<<ENTER>>> • <![CDATA[Press <<<ENTER>>>]]>
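To see that a parser hands the CDATA content through untouched, here is a small sketch assuming Python's standard xml.etree.ElementTree module. The wrapping para element is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Without the CDATA section the raw '<' characters would be a
# well-formedness error; inside it they are plain character data.
para = ET.fromstring("<para><![CDATA[Press <<<ENTER>>>]]></para>")
text = para.text
print(text)
```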
Processing Instructions • Information required by an external application • Format - <? … ?> • XML declaration - <?xml version='1.0' ?> • Confusingly, the XML declaration uses the same <? … ?> syntax, but the XML 1.0 specification does not class it as a processing instruction; the target name 'xml' is reserved
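A sketch of how an application might read a processing instruction, assuming Python's standard xml.dom.minidom module. The target my-app and its data are hypothetical. The <?xml …?> line at the top is consumed by the parser itself and does not show up as a node.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?my-app colour="blue"?>'   # hypothetical instruction for an external application
    '<doc/>'
)

# The first node of the document is the processing instruction.
pi = doc.firstChild
target, data = pi.target, pi.data
print(pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE, target, data)
```

The parser does not interpret the data part; it is passed on as an opaque string for the named application to decode.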
Entities • XML document may be distributed among a number of files • Each unit of information is called an entity • Each entity has a name to identify it • Defined using an entity declaration • Used by calling an entity reference
When to use Entities • Use an entity when the information • Is used in several places • May be represented differently • Is part of a larger document that needs to be split up to be manageable • Conforms to a data format other than XML
Types of Entity • Internal Entity - Stored in main document; text content only • External Entity - Stored externally to the main document; text or binary; can use to group many internal entities together • General Entity - Referred to in XML document • Parameter Entity - Referred to in markup declarations in DTD
General Entities • Declared in 'Document Type Declaration' • <!DOCTYPE My_XML_Doc [ <!ENTITY name "replacement"> ]> • <!ENTITY xml "eXtensible Markup Language"> • The &xml; includes entities • The eXtensible Markup Language includes entities
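The expansion shown on this slide can be reproduced with a parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree module, which expands internal general entities declared in the document itself:

```python
import xml.etree.ElementTree as ET

document = """<!DOCTYPE My_XML_Doc [
  <!ENTITY xml "eXtensible Markup Language">
]>
<My_XML_Doc>The &xml; includes entities</My_XML_Doc>"""

root = ET.fromstring(document)
expanded = root.text   # the reference has been replaced by the entity text
print(expanded)
```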
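Parameter Entities • Declared in 'Document Type Declaration' • <!DOCTYPE My_XML_Doc [ <!ENTITY % name "replacement"> ]> • <!ENTITY % param "(para | list)"> • <!ELEMENT section (%param;)*> • A reference inside a declaration, as in the ELEMENT line here, is only permitted in an external DTD subset or external parameter entity; in the internal subset, references may appear only between declarations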
External Entities • External Text Entities • Location specified with SYSTEM keyword • <!ENTITY ent SYSTEM "/ENTS/MYENT.XML"> • May specify with public identifier • <!ENTITY ent PUBLIC "-//EBI//ENTITIES ents//EN" … > • External Binary Entities • Need to identify format of data - NDATA • <!ELEMENT pic EMPTY><!ATTLIST pic name ENTITY #REQUIRED><!ENTITY photo SYSTEM "/ENTS/photo.tif" NDATA TIFF> • Referenced by empty element • A photograph <pic name="photo"/>.
Restrictions on Entities • General text entities • Can appear in element content • <para> … &ent; … </para> • Can appear in attribute value • <para name="&ent;"> … </para> • Can appear in internal entity content • <!ENTITY cod "&ent;"> • Cannot appear in other parts of DTD
Restrictions on Entities (2) • Binary entities • If entity content is not XML, the entity cannot be used as a textual reference • Error - <!ELEMENT sec (para|&photo;)> • Error - <para> &photo; </para> • Binary entity can only appear as an attribute of type ENTITY • <!ENTITY photo SYSTEM "photo.tif" NDATA TIFF>…<!ELEMENT pic (#PCDATA)><!ATTLIST pic name ENTITY #REQUIRED>
Element Declarations • Used to define new elements and their content • <!ELEMENT name (#PCDATA)> <name> … </name> • Empty element has no content • <!ELEMENT name EMPTY> <name/> • When children allowed - any or model group • <!ELEMENT name ANY> • <!ELEMENT person (name, e-mail*)>
Model Groups • Used to define content of elements • <!ELEMENT person (name, e-mail*)> • Used to define hierarchies of elements • <!ELEMENT name (fname, surname)><!ELEMENT fname (#PCDATA)><!ELEMENT surname (#PCDATA)><!ELEMENT e-mail (#PCDATA)> • Control organisation of elements • Sequence connector - ',' - (A, B, C) [then] • Choice connector - '|' - (A | B | C) [or]
Model Group Quantity Indicators • Describe constraints on elements in DTD • A? - May occur [0..1] • A+ - Must occur [1..*] • A* - May occur [0..*] • A | B - Either A or B • A, B - A followed by B • (A, B)+ • ((A,B?) | C+)*
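Model groups behave much like regular expressions over the sequence of child element names. The sketch below hand-translates the last model group on this slide into a Python regular expression; it assumes one-letter element names so that each child is a single character, and the helper name allowed is made up for illustration.

```python
import re

# DTD model group:  ((A, B?) | C+)*
# Hand-translated:  ',' becomes concatenation, '|' stays '|',
# and ?, + and * keep their meaning.
MODEL = re.compile(r"((AB?)|C+)*")

def allowed(children):
    """children is a string of one-letter child element names, in order."""
    return MODEL.fullmatch(children) is not None

print(allowed("ABCCA"))   # A,B then C,C then A (B omitted)
print(allowed(""))        # zero repetitions of the outer group
print(allowed("BA"))      # B may only follow an A
```

A validating parser performs the equivalent check itself; this is only a way to reason about what a model group accepts.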
Attribute Declarations • Attributes can be attached to elements • Declared separately in ATTLIST declaration • <!ATTLIST tag … > • Rest of definition specifies • attribute name • attribute type • default value
Attribute Names and Types • Attribute name • <!ATTLIST tag name type default> • <!ATTLIST tag first_attr … second_attr … third_attr … > • Attribute types • ID • IDREF • IDREFS • NOTATION • name group • CDATA • NMTOKEN • NMTOKENS • ENTITY • ENTITIES
Attribute Types • CDATA - Character data • NMTOKEN - Single token • NMTOKENS - Multiple tokens • ENTITY - Attribute is entity ref • ENTITIES - Multiple entity refs • ID - Unique ID • IDREF - Match to ID • IDREFS - Match to multiple IDs • NOTATION - Describe non-XML data • Name group - Restricted list
Attribute Types • CDATA - <!ATTLIST person name CDATA … > • NMTOKEN - <!ATTLIST mug color NMTOKEN … > • NMTOKENS - <!ATTLIST temp values NMTOKENS … > • ENTITY - <!ATTLIST person photo ENTITY … > • ENTITIES - <!ATTLIST album photos ENTITIES … > • ID - <!ATTLIST person id ID … > • IDREF - <!ATTLIST person father IDREF … > • IDREFS - <!ATTLIST person children IDREFS … > • NOTATION - <!ATTLIST image format NOTATION (TeX|TIFF) … > • Name group - <!ATTLIST point coord (X|Y|Z) … >
Attribute Types • CDATA - name = "Tom Jones" • NMTOKEN - color="red" • NMTOKENS - values="12 15 34" • ENTITY - photo="MyPic" • ENTITIES - photos="pic1 pic2" • ID - ID = "P09567" • IDREF - IDREF="P09567" • IDREFS - IDREFS="A01 A02" • NOTATION - FORMAT="TeX" • Name group - coord="X"
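ID, IDREF and IDREFS constraints are enforced by a validating parser. Python's standard xml.etree.ElementTree module does not validate, so the sketch below checks the references by hand to show what the constraint means. The attribute names id, father and children follow the person examples in these slides; the family wrapper element, the sample values and the helper name dangling_refs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

document = """<family>
  <person id="P1" children="P2 P3"/>
  <person id="P2" father="P1"/>
  <person id="P3" father="P9"/>
</family>"""

def dangling_refs(root):
    """Return IDREF/IDREFS values that match no id attribute."""
    ids = {p.get("id") for p in root.iter("person")}
    missing = []
    for person in root.iter("person"):
        refs = person.get("children", "").split()   # IDREFS: space-separated list
        if person.get("father") is not None:        # IDREF: a single value
            refs.append(person.get("father"))
        missing.extend(r for r in refs if r not in ids)
    return missing

print(dangling_refs(ET.fromstring(document)))
```

Here the third person refers to P9, which no element declares as its id, so a validating parser would report the document as invalid.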
Default Attribute Values • Can specify a default attribute value for when it is missing from the XML document, or state that a value must be entered • #REQUIRED - Must be specified • #IMPLIED - May be specified • "default" - Default value if unspecified • #FIXED - Only one value allowed • <!ATTLIST tag name type default> • <!ATTLIST seqlist sepchar NMTOKEN #REQUIRED type (alpha|num) "num">
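A sketch of attribute defaulting in practice, assuming Python's standard xml.etree.ElementTree module, whose underlying expat parser applies defaults declared in the internal DTD subset. The sepchar value "x" is invented for the example.

```python
import xml.etree.ElementTree as ET

document = """<!DOCTYPE seqlist [
  <!ATTLIST seqlist
      sepchar NMTOKEN     #REQUIRED
      type    (alpha|num) "num">
]>
<seqlist sepchar="x"/>"""

# The type attribute is absent from the start tag, so the parser
# supplies the declared default.
attrs = dict(ET.fromstring(document).attrib)
print(attrs)

# This parser does not validate, so a missing #REQUIRED attribute
# is not reported; only the default is filled in.
without_required = document.replace(' sepchar="x"', "")
attrs_missing = dict(ET.fromstring(without_required).attrib)
print(attrs_missing)
```

Checking #REQUIRED, #FIXED and the enumerated values is the job of a validating parser.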
Parameter Entities • Use parameter entities within DTD • <!ENTITY % common "(para|list|table)"><!ELEMENT chapter ((%common;)*, section*)><!ELEMENT section (%common;)*> • Safest to include parentheses in entity definition and around entity reference
Conditional Sections • INCLUDE and IGNORE declarations • <![INCLUDE[ … ]]> • <![IGNORE[ … ]]> • Can be used as switch for including and/or excluding declarations • <!ENTITY % variant "INCLUDE"><![%variant;[ <!ENTITY % Text "(#PCDATA|temp)"> ]]><!ENTITY % Text "(#PCDATA)">
Notation Declaration • Description of external non-XML entity given in Notation declaration • Also specifies helper app and documentation • Use with NDATA keyword • <!ENTITY Logo SYSTEM "LOGO.TIFF" NDATA TIFF><!ATTLIST image pic ENTITY #REQUIRED> • <!NOTATION TIFF SYSTEM "/usr/local/bin/display.exe"> • or, with a public identifier: <!NOTATION TIFF PUBLIC "-//EBI//NOTATION tiff help file//EN" "/usr/local/bin/display.exe">
Putting it all together... • We have now been introduced to the main components and rules of XML and DTDs • Entities, elements, declarations, processing instructions, attribute lists • Use all these components in the 'Document Type Definition' (DTD) to specify the rules about the format of the XML document
What is a DTD? • A template for document markup • A file which contains a formal definition of a particular type of document • A DTD describes: • What names can be used for element types • Where element types can occur • How element types fit together • Specifies document hierarchy and granularity • Specifies names and types of element attributes
Why have a DTD • Validating XML parser can check the structure of the XML file against a DTD and check that it is valid and well-formed • DTD can be a mechanism for standardisation and hence document/data manipulation and exchange
DTD Declaration • DTD syntax is stored either in an external file, in the XML file itself, or both • Internal DTD overrides or adds to the external in cases of ENTITY and ATTLIST repetition • DTD composed of declarations • ELEMENT - Tag definition • ATTLIST - Attribute definitions • ENTITY - Entity definition • NOTATION - Data type notation definition
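The override rule follows from two facts: the internal subset is read before the external one, and the first declaration of an entity is the one that binds. Python's standard xml.etree.ElementTree module does not fetch external DTD files, so the sketch below only shows the first-declaration-wins part inside a single internal subset. The entity name greeting is hypothetical.

```python
import xml.etree.ElementTree as ET

document = """<!DOCTYPE MyDoc [
  <!ENTITY greeting "first declaration">
  <!ENTITY greeting "second declaration">
]>
<MyDoc>&greeting;</MyDoc>"""

# The later, repeated declaration is ignored.
text = ET.fromstring(document).text
print(text)
```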
Internal DTD Definition • Include in the DOCTYPE declaration • <!DOCTYPE MyDoc [ <!-- DTD appears here --> <! … > <! … >]><!-- Rest of XML file -->
External DTD Definition • Reference external DTD file as pathname in DOCTYPE declaration • <!DOCTYPE MyDoc SYSTEM "./MyDoc.dtd" [ <!-- Extra declarations --> <! … > <! … >]><!-- Rest of XML file --> • Document specific declarations kept internally
External DTD Definition (2) • Reference external DTD file as URL in DOCTYPE declaration • <!DOCTYPE MyDoc SYSTEM "http://…/MyDoc.dtd" [ <!-- Extra declarations --> <! … > <! … >]><!-- Rest of XML file --> • Document specific declarations kept internally
Designing a DTD • Not trivial! • If an XML DTD needs to be changed, this may have serious consequences for other software • Separation of interface and implementation • Analogous to database schema design • Need to consider • Granularity • Attributes versus Elements • Limitations of DTD declarations
Designing a DTD (2) • Identify features of the data that need markup • For each feature, determine • Can it be given a name • Does it always appear • May there be more than one • Does it deconstruct to smaller features • Is some of the textual content always the same • How is it associated with other features
Granularity of DTD • <PERSON> <NAME>Jon Smith</NAME></PERSON> • <PERSON> <FORENAME>Jon</FORENAME> <SURNAME>Smith</SURNAME></PERSON>
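The practical effect of granularity shows up as soon as a program needs part of the name. A minimal sketch, assuming Python's standard xml.etree.ElementTree module and the two markup styles from this slide:

```python
import xml.etree.ElementTree as ET

coarse = ET.fromstring("<PERSON><NAME>Jon Smith</NAME></PERSON>")
fine = ET.fromstring(
    "<PERSON><FORENAME>Jon</FORENAME><SURNAME>Smith</SURNAME></PERSON>"
)

# Coarse markup: the surname has to be guessed by splitting the text,
# which breaks for names that do not follow the "forename surname" pattern.
surname_from_coarse = coarse.findtext("NAME").split()[-1]

# Fine markup: the surname is addressed directly by element name.
surname_from_fine = fine.findtext("SURNAME")

print(surname_from_coarse, surname_from_fine)
```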
Elements or Attributes? • How should data be encapsulated? • <book><title>The Forty-nine Steps</title> … </book> • <book title="The Forty-nine Steps"> …</book> • Depends upon what document type is designed for • A "religious" issue, rather than technical…
Elements or Attributes? (2) • Discriminate content from metadata: • Data to be printed as character data • Metadata as attributes • General rule: • If all markup is stripped away, the document should still be readable and useable • If in doubt, use an attribute • Fallacy: "If you use an attribute to encode information, a browser won't display it"
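The "strip away the markup" rule can be tried directly. This sketch assumes Python's standard xml.etree.ElementTree module and reuses the book example from the previous slide; the helper name strip_markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

as_element = ET.fromstring(
    "<book><title>The Forty-nine Steps</title></book>"
)
as_attribute = ET.fromstring(
    '<book title="The Forty-nine Steps"></book>'
)

def strip_markup(elem):
    """Concatenate all character data, discarding tags and attributes."""
    return "".join(elem.itertext())

print(repr(strip_markup(as_element)))
print(repr(strip_markup(as_attribute)))
```

The title survives only when it is element content; held in an attribute, it disappears along with the tag.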
Elements or Attributes? (3) • Use attributes to stress the 1-to-1 relationship among pieces of information - show that the element represents a tuple of information • Use attribute when it's a property of the element • Use an attribute when the information is inherent to the parent but not a constituent part • head versus height • Use attributes for simple data type validation • Use embedded elements for complex structure validation http://www.oasis-open.org/cover/attrSperberg92.html
Elements or Attributes? (4) • [1] Does the value have one of an enumeration of values or is the value free-form? • [1a] - enumerated values can be name token groups in attributes • [1e] - no restrictions on values for the content of elements • [2] Is the value to be specified, manipulated, organised, consumed by a program or by a human? • [2a] - I use attributes for computer-manipulated values • [2e] - I use elements for human-manipulated values • [3] Does the information represent information *about* content, or is the information the content itself? • [3a] - I typically put meta-data in attributes • [3e] - I typically put content into elements • [4] Is the information flat or hierarchical? • [4a] - attributes are flat and a value has no hierarchy • [4e] - elements can be either flat or hierarchical • [5] Is the information unordered or ordered? • [5a] - multiple attribute values in a single start element have no prescribed order • [5e] - multiple child elements of an element can be modelled in a prescribed order • [6] Is the content to be spell-checked? • [6a] - I put values guaranteed to fail a spell-checker (or undesired to be spell-checked) in attributes • [6e] - I put values to be spell checked in elements http://www.oasis-open.org/cover/holmanElementsAttrs.html
Standardisation • Two scenarios for allowing standardised access to information in XML files • Require that XML documents represent the information in a particular way • Industry-standard DTDs • Introduce middleware that knows how to interpret a particular DTD for our application • Lightweight document interface objects (Java, CORBA, COM) and DTD-specific interface service objects
XML DTD Standards • OASIS has list of specifications released and in development • http://www.xml.org/ • http://www.oasis-open.org/ • Need to be able to import parts of DTD's • Re-specify in own DTD • Consider using XML Namespaces