XML Validation I: DTDs • Robin Burke • ECT 360 • Winter 2004
Outline • History • Grammars / Regular expressions • DTDs • elements • attributes • entities • Declarations
Validation • Why bother?
SGML • SGML was designed for text documents • Only content of interest = text • Interest in data types is new (XML) • SGML used existing notation for programming languages • production-based grammar • Extended Backus-Naur Form (EBNF)
The idea • Language consists of terminals • a, b, c • Set of productions • beginning with non-terminals • A, B, C • rules specifying how to generate sequences of terminals
Example • A → aB • A → aBA • B → b • generates strings • ab, abab, ababab etc.
Grammar • Can be used to efficiently parse a language • basis of all modern programming language parsing since Algol-60
Java Language Spec • ClassOrInterfaceDeclaration: • ModifiersOpt (ClassDeclaration | InterfaceDeclaration) • InterfaceDeclaration: • interface Identifier [extends TypeList] InterfaceBody • TypeList: • Type { , Type} • InterfaceBody: • { {InterfaceBodyDeclaration} } • InterfaceBodyDeclaration: • (ModifiersOpt InterfaceMemberDecl )* • InterfaceMemberDecl: • InterfaceMethodOrFieldDecl • void Identifier VoidInterfaceMethodDeclaratorRest • ClassOrInterfaceDeclaration • InterfaceMethodOrFieldDecl: • Type Identifier InterfaceMethodOrFieldRest • InterfaceMethodOrFieldRest: • ConstantDeclaratorsRest • InterfaceMethodDeclaratorRest • InterfaceMethodDeclaratorRest: • FormalParameters BracketsOpt [throws QualifiedIdentifierList] ; • VoidInterfaceMethodDeclaratorRest: • FormalParameters [throws QualifiedIdentifierList] ;
Grammar • XML • grammar-based syntax • adheres to EBNF • SGML • SGML had a more complex language definition syntax • HTML is defined the SGML way
Regular expressions • Language for expressing patterns • Basic components • pattern elements • optional element = ? • repetition (1 or more) = + • repetition (0 or more) = * • choice = | • grouping = ( ) • sequence = ,
Examples • (a, b)* • all strings "ab" "abab" etc. (and the empty string) • (a | b | c)+, q, (b | c)* • aaqb • bq • bqcccccccc
Note • Regular expressions are different in different applications • Perl • Javascript • XML Schemas • DTDs only support • ?+*|,()
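A minimal sketch of how these operators look inside a DTD content model. The element names x, a, b, c and q are placeholders invented for illustration; the pattern is one or more of a, b or c, then exactly one q, then any number of b or c.

```xml
<?xml version="1.0"?>
<!DOCTYPE x [
  <!-- one or more of a/b/c, then exactly one q, then zero or more of b/c -->
  <!ELEMENT x ((a | b | c)+, q, (b | c)*)>
  <!ELEMENT a EMPTY>
  <!ELEMENT b EMPTY>
  <!ELEMENT c EMPTY>
  <!ELEMENT q EMPTY>
]>
<!-- corresponds to the string "bqcc" -->
<x><b/><q/><c/><c/></x>
```

Dropping the q element, or putting an a after it, should make a validating parser reject the document.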
EBNF • EBNF is a more compact version of BNF • it uses regular expressions to simplify grammar expression • A → aB • A → aBA • turns into • A → aB(A)? • only one production per non-terminal allowed
DTDs • Use EBNF to specify structure of XML documents • Plus • attributes • entities • Syntax • holdover from SGML • Ugly
DTD Syntax • <!ELEMENT element-name content_model> • Content model contains the RHS of the production rule • Example <!ELEMENT name (firstName, lastName)>
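To see the declaration in context, here is a small sketch of a complete document that carries its DTD in an internal subset. The declarations for firstName and lastName and the sample values are assumptions; the slide only gives the declaration for name.

```xml
<?xml version="1.0"?>
<!DOCTYPE name [
  <!-- name must contain exactly one firstName followed by one lastName -->
  <!ELEMENT name (firstName, lastName)>
  <!-- assumed: both children hold plain text -->
  <!ELEMENT firstName (#PCDATA)>
  <!ELEMENT lastName (#PCDATA)>
]>
<name>
  <firstName>Jane</firstName>
  <lastName>Doe</lastName>
</name>
```

Swapping the two children, or omitting one, would violate the content model.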
DTD Syntax cont'd • Not XML • <! begins a declaration • No "content" • Empty elements not indicated
Some special cases • Content can be any text • #PCDATA • Content can be anything at all • (useful for debugging) • ANY • Element has no content • EMPTY
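The three special cases can be sketched side by side. The element names record, title, br and scratch are hypothetical and chosen only for this illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE record [
  <!ELEMENT record (title, br, scratch)>
  <!-- text only -->
  <!ELEMENT title (#PCDATA)>
  <!-- no content allowed -->
  <!ELEMENT br EMPTY>
  <!-- any mix of text and declared elements -->
  <!ELEMENT scratch ANY>
]>
<record>
  <title>Draft</title>
  <br/>
  <scratch>notes <title>nested</title> more notes</scratch>
</record>
```

Note that ANY still requires every element that appears to be declared somewhere in the DTD.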
Example <grades> <grade> <student>Jane Doe</student> <assigned-grade>A</assigned-grade> </grade> <grade> <student>John Doe</student> <assigned-grade>A-</assigned-grade> </grade> </grades>
Example <grades> <grade> <student>Jane Doe</student> <assigned-grade>A</assigned-grade> </grade> <grade> <student>John Doe</student> <assigned-grade>A-</assigned-grade> </grade> <grade> <student>Wayne Doe</student> <assigned-grade>I</assigned-grade> <reason>Alien abduction</reason> </grade> </grades>
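The slides do not show a DTD for these grade documents, so the following is only one plausible sketch. It assumes that reason is optional, that every child holds plain text, and that an empty grades list is acceptable (use grade+ instead of grade* to forbid it).

```xml
<?xml version="1.0"?>
<!DOCTYPE grades [
  <!ELEMENT grades (grade*)>
  <!-- reason? makes the explanation optional -->
  <!ELEMENT grade (student, assigned-grade, reason?)>
  <!ELEMENT student (#PCDATA)>
  <!ELEMENT assigned-grade (#PCDATA)>
  <!ELEMENT reason (#PCDATA)>
]>
<grades>
  <grade>
    <student>Jane Doe</student>
    <assigned-grade>A</assigned-grade>
  </grade>
  <grade>
    <student>Wayne Doe</student>
    <assigned-grade>I</assigned-grade>
    <reason>Alien abduction</reason>
  </grade>
</grades>
```

Under this sketch both example documents validate, with or without the reason element.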
Mixed content • Legal to have a content model with text and element data <story category="national" byline="Karen Wheatley"> <headline>President Meets with Congress</headline> <![CDATA[ The President met with Congressional leaders today in an effort to jump-start faltering budget negotiations. Sources described the mood of the meeting as "cordial". ]]> <full_text ref="news801" /> <image src="img2071.jpg" /> <image src="img2072.jpg" /> <image src="img2073.jpg" /> </story>
CDATA? • Forgot to mention last week • Content that appears here will not be parsed • Can include arbitrary text including <, &, etc. • Only restriction • termination sequence • ]]>
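Mixed content, cont'd • <!ELEMENT story (#PCDATA | headline | full_text | image)*> • DTDs allow mixed content only in this form • #PCDATA first, then a choice of element names, repeated with * • order and number of child elements cannot be constrained • Mixed content is usually discouraged • Makes transformations more difficult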
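A sketch of the story example with a DTD that a validating parser should accept. XML 1.0 requires mixed content to be written as (#PCDATA | name | ...)*, so that form is used here. The attribute declarations are assumptions, since the slides do not give them, and the story text is shortened.

```xml
<?xml version="1.0"?>
<!DOCTYPE story [
  <!-- mixed content: text and these elements, in any order, any number -->
  <!ELEMENT story (#PCDATA | headline | full_text | image)*>
  <!ELEMENT headline (#PCDATA)>
  <!ELEMENT full_text EMPTY>
  <!ELEMENT image EMPTY>
  <!-- assumed attribute declarations -->
  <!ATTLIST story category CDATA #IMPLIED
                  byline   CDATA #IMPLIED>
  <!ATTLIST full_text ref CDATA #REQUIRED>
  <!ATTLIST image src CDATA #REQUIRED>
]>
<story category="national" byline="Karen Wheatley">
  <headline>President Meets with Congress</headline>
  <![CDATA[The President met with Congressional leaders today.]]>
  <full_text ref="news801" />
  <image src="img2071.jpg" />
</story>
```

Because the model is a repeated choice, the DTD cannot insist that the headline comes first or appears only once.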
Recursion • Unlike grammars • recursive formulation ≠ repetition • Difference between • <!ELEMENT students (student+)> • <!ELEMENT students (student, students?)>
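The difference shows up in the shape of the documents each declaration accepts. In this sketch the two variants are renamed flat-students and nested-students (hypothetical names) so that both fit in one DTD, and demo is an invented wrapper element.

```xml
<?xml version="1.0"?>
<!DOCTYPE demo [
  <!ELEMENT demo (flat-students, nested-students)>
  <!-- repetition: all students are siblings -->
  <!ELEMENT flat-students (student+)>
  <!-- recursion: each further student sits one level deeper -->
  <!ELEMENT nested-students (student, nested-students?)>
  <!ELEMENT student (#PCDATA)>
]>
<demo>
  <flat-students>
    <student>Jane</student>
    <student>John</student>
    <student>Wayne</student>
  </flat-students>
  <nested-students>
    <student>Jane</student>
    <nested-students>
      <student>John</student>
      <nested-students>
        <student>Wayne</student>
      </nested-students>
    </nested-students>
  </nested-students>
</demo>
```

In a string grammar both rules would generate the same sequences; in XML the recursive rule forces the nesting into the document tree.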
Restriction • The grammar cannot be ambiguous • A → (a, b) | (a, c) • this makes the parser implementation difficult • Usually easy to make non-ambiguous • A → a, (b | c)
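A sketch of the rewritten rule as a DTD. The ambiguous form is kept only inside a comment, and the EMPTY declarations for a, b and c are assumptions made to keep the example self-contained.

```xml
<?xml version="1.0"?>
<!DOCTYPE A [
  <!-- Ambiguous form, shown for comparison only:
       ELEMENT A ((a, b) | (a, c))
       After reading the a element, a parser cannot tell which
       branch of the choice it is in without looking ahead. -->
  <!ELEMENT A (a, (b | c))>
  <!ELEMENT a EMPTY>
  <!ELEMENT b EMPTY>
  <!ELEMENT c EMPTY>
]>
<A><a/><c/></A>
```

Both forms accept the same documents; the factored version simply lets the parser decide one element at a time.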
Attribute lists • Declared separately from elements • Specification includes • name of the element • name of the attribute • attribute type • default
Attribute types • Character data • CDATA • different from XML CDATA section! • Enumerated • (yes|no) • ID • must be unique in the document • IDREF • must refer to an ID in the document • NMTOKEN • a restriction of CDATA to a single "word" • Also IDREFS and NMTOKENS
Default declaration • #REQUIRED • #IMPLIED • means optional • Value • this becomes the default • #FIXED • value provided
Examples <!ATTLIST img src CDATA #REQUIRED alt CDATA #REQUIRED align (left|right|center) "left" id ID #IMPLIED > <!ATTLIST timestamp time-zone NMTOKEN #IMPLIED>
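To show how the attribute types and default declarations behave together, here is a sketch built around the img declaration. The page and caption elements, the version and for attributes, and all attribute values are invented for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE page [
  <!ELEMENT page (img+, caption*)>
  <!ELEMENT img EMPTY>
  <!ELEMENT caption (#PCDATA)>
  <!ATTLIST img
    src   CDATA               #REQUIRED
    alt   CDATA               #REQUIRED
    align (left|right|center) "left"
    id    ID                  #IMPLIED>
  <!-- IDREF: the value must match some ID in the document -->
  <!ATTLIST caption for IDREF #REQUIRED>
  <!-- #FIXED: if the attribute is present, it must have this value -->
  <!ATTLIST page version CDATA #FIXED "1.0">
]>
<page>
  <!-- align is omitted here, so the default "left" applies -->
  <img src="logo.png" alt="Logo" id="logo"/>
  <img src="map.png" alt="Map" align="center"/>
  <caption for="logo">Company logo</caption>
</page>
```

Removing alt from either img, or pointing for at an id that does not exist, should produce a validation error.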
Entities • Like macros • content to be inserted • indicated with &name; • Predefined general entities • &amp; &lt; • essential part of XML • User-defined general entities • &disclaimer;
Entities, cont'd • Parameter entities • can also be used to simplify DTD creation • or to combine DTDs • indicated with a % • Example from book • %Books; • %Mags;
Defining entities • General entities • <!ENTITY name "content"> • Example <!ENTITY disclaimer "This is a work of fiction. Any resemblance to persons living or dead is unintentional.">
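A sketch of the disclaimer entity in use. The book, title and notice elements and the title text are hypothetical; only the entity declaration comes from the slide.

```xml
<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book (title, notice)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT notice (#PCDATA)>
  <!ENTITY disclaimer "This is a work of fiction. Any resemblance to persons living or dead is unintentional.">
]>
<book>
  <!-- &amp; is predefined; &disclaimer; is declared above -->
  <title>Cats &amp; Dogs</title>
  <notice>&disclaimer;</notice>
</book>
```

A parser that reads the internal subset replaces &disclaimer; with the full text before the application sees the notice element.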
Defining entities, cont'd • Parameter entities <!ENTITY % name "content"> • or, more typically, <!ENTITY % name SYSTEM "url">
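A sketch of parameter entities pulling declarations into a DTD. The names Books and Mags echo the book example mentioned earlier, but their contents here are invented, and they are defined inline rather than with SYSTEM so that the example stays self-contained.

```xml
<?xml version="1.0"?>
<!DOCTYPE library [
  <!-- each parameter entity holds a complete declaration -->
  <!ENTITY % Books "<!ELEMENT book (#PCDATA)>">
  <!ENTITY % Mags  "<!ELEMENT magazine (#PCDATA)>">
  <!-- referenced between declarations, which expands them in place -->
  %Books;
  %Mags;
  <!ELEMENT library (book | magazine)*>
]>
<library>
  <book>XML in Practice</book>
  <magazine>Markup Monthly</magazine>
</library>
```

Inside an internal subset, parameter entity references may only appear between declarations, not inside one; that restriction is relaxed in an external DTD file.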
Unparsed data • What about non-text data? • images, audio files • In XML • we define a notation • create a name and associate an application • suggestion to the application • how to interpret the unparsed data • not part of parsing operation
Using Notation • <!NOTATION name SYSTEM "url"> • Example • <!NOTATION jpeg SYSTEM "IExplore.exe"> • declares the jpeg notation • Example • <!ENTITY photo53 SYSTEM "photo53.jpg" NDATA jpeg>
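A sketch of how the notation and the unparsed entity reach a document. An unparsed entity cannot be referenced as &photo53; in content; it is named through an attribute of type ENTITY. The album and picture elements and the source attribute are assumptions for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE album [
  <!NOTATION jpeg SYSTEM "IExplore.exe">
  <!ENTITY photo53 SYSTEM "photo53.jpg" NDATA jpeg>
  <!ELEMENT album (picture+)>
  <!ELEMENT picture EMPTY>
  <!-- the attribute value must be the name of a declared unparsed entity -->
  <!ATTLIST picture source ENTITY #REQUIRED>
]>
<album>
  <picture source="photo53"/>
</album>
```

The parser only reports the entity's system identifier and notation; opening photo53.jpg is left to the application.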
Notation, cont'd • Note that the content is defined in the DTD • not the document • binary data embedded in XML document • Not that useful in practice • more likely
Typical Example <story category="national" byline="Karen Wheatley"> ... <full_text ref="news801" /> <image src="img2071.jpg" /> <image src="img2072.jpg" /> <image src="img2073.jpg" /> </story> • Now it is up to the application to do something appropriate with the src attribute
A better solution • Use XLink
DTD limitations • Not in XML • need a special parser for the DTD • No content type restrictions • #PCDATA can be anything • Element names must be globally unique • cannot reuse a common term at different places in the document • course-name • professor-name
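Two of these limitations can be seen in one small sketch. The credits element and its value are invented for illustration; the other names and values are taken from the slides.

```xml
<?xml version="1.0"?>
<!DOCTYPE course [
  <!ELEMENT course (course-name, credits, professor)>
  <!ELEMENT professor (professor-name)>
  <!-- each element name gets one global declaration, hence the
       prefixed names instead of a context-dependent "name" -->
  <!ELEMENT course-name (#PCDATA)>
  <!ELEMENT professor-name (#PCDATA)>
  <!-- #PCDATA cannot be restricted to numbers -->
  <!ELEMENT credits (#PCDATA)>
]>
<course>
  <course-name>ECT 360</course-name>
  <credits>plenty</credits>
  <professor>
    <professor-name>Robin Burke</professor-name>
  </professor>
</course>
```

The document is valid even though credits is not a number, because the DTD has no way to say otherwise.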