This article provides an overview of XML validation, DTD syntax, and the use of grammars and regular expressions. It covers topics such as the history of XML, elements, attributes, entity declarations, and the validation process. Examples and explanations are given for each concept.
XML Validation I: DTDs • Robin Burke • ECT 360 • Winter 2004
Outline • History • Grammars / Regular expressions • DTDs • elements • attributes • entities • Declarations
Validation • Why bother?
The idea • Language consists of terminals • a, b, c • Set of productions • beginning with non-terminals • A, B, C • rules specifying how to generate sequences of terminals
Example • A → aB • A → aBA • B → b • generates strings • ab, abab, ababab, etc.
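The derivation process can be sketched in a few lines of Python. This is only an illustration, assuming the productions are A → aB, A → aBA and B → b, with uppercase letters as non-terminals; the names `PRODUCTIONS` and `generate` are made up for this sketch.

```python
from collections import deque

# Productions from the slide: A -> aB | aBA, B -> b.
# Uppercase letters are non-terminals, lowercase letters are terminals.
PRODUCTIONS = {
    "A": ["aB", "aBA"],
    "B": ["b"],
}

def generate(start="A", max_len=6):
    """Return every terminal string of length <= max_len derivable from start."""
    results = set()
    queue = deque([start])
    seen = {start}
    while queue:
        form = queue.popleft()
        # Find the leftmost non-terminal, if any.
        idx = next((i for i, ch in enumerate(form) if ch.isupper()), None)
        if idx is None:
            results.add(form)  # only terminals left: a finished string
            continue
        for rhs in PRODUCTIONS[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            # No production shrinks a form, so overlong forms can be pruned.
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(results, key=len)

print(generate())  # ['ab', 'abab', 'ababab']
```

Raising `max_len` yields longer repetitions of "ab", which is the whole language of this small grammar.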
Grammar • Can be used to efficiently parse a language • basis of all modern programming language parsing since Algol-60 • Java Language Specification is completely in EBNF grammar
Grammar • XML • grammar-based syntax • adheres to EBNF • SGML • SGML had a more complex language definition syntax • HTML is defined the SGML way
Regular expressions • Language for expressing patterns • Basic components • pattern elements • optional element = ? • repetition (1 or more) = + • repetition (0 or more) = * • choice = | • grouping = ( ) • sequence = ,
Examples • (a, b)* • all strings "ab" "abab" etc. • (a | b | c)+, q, (b | c)* • aaqb • bq • bqcccccccc
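One way to experiment with such patterns is to translate them into Python's `re` syntax. The sketch below assumes single-letter symbols and relies on the fact that ?, +, *, | and ( ) mean the same thing in both notations, while the DTD sequence operator "," corresponds to plain concatenation. The helper name `accepts` is hypothetical.

```python
import re

def accepts(dtd_pattern, text):
    """Check text against a DTD-style pattern over single-letter symbols."""
    # In DTD notation "," means sequence; Python's re expresses sequence by
    # plain concatenation, so commas and spaces are simply removed.
    python_pattern = dtd_pattern.replace(",", "").replace(" ", "")
    return re.fullmatch(python_pattern, text) is not None

print(accepts("(a, b)*", "abab"))                          # True
print(accepts("(a, b)*", "aba"))                           # False
print(accepts("(a | b | c)+, q, (b | c)*", "bqcccccccc"))  # True
```

Note that a trailing group written as a sequence, such as (b, c)*, would only accept whole "bc" pairs after the q, whereas the choice (b | c)* accepts any mix of b and c.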
Note • Regular expressions are different in different applications • Perl • Javascript • XML Schemas • DTDs only support • ?+*|,()
EBNF • EBNF is a more compact version of BNF • it uses regular expressions to simplify grammar expression • A → aB • A → aBA • turns into • A → aB(A)? • only one production per non-terminal allowed
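Having a single production per non-terminal maps neatly onto one function per non-terminal. The following recursive-descent recognizer is a minimal sketch, assuming the rules A → a B (A)? and B → b; the function names are invented for the example.

```python
def parse_A(text, pos=0):
    """Match A -> a B (A)? starting at pos; return the end position or None."""
    if pos >= len(text) or text[pos] != "a":
        return None
    pos = parse_B(text, pos + 1)
    if pos is None:
        return None
    # The optional (A)?: try to match another A, keep the old position if not.
    rest = parse_A(text, pos)
    return pos if rest is None else rest

def parse_B(text, pos):
    """Match B -> b starting at pos; return the end position or None."""
    if pos < len(text) and text[pos] == "b":
        return pos + 1
    return None

def recognizes(text):
    """True when the whole string can be derived from A."""
    return parse_A(text) == len(text)

print(recognizes("ababab"))  # True
print(recognizes("aba"))     # False
```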
DTDs • Use EBNF to specify structure of XML documents • Plus • attributes • entities • Syntax • holdover from SGML • Ugly
DTD Syntax • <!ELEMENT element-name content_model> • Content model contains the RHS of the production rule • Example <!ELEMENT name (firstName, lastName)>
DTD Syntax cont'd • Not XML • <! begins a declaration • No "content" • Empty elements not indicated with />
Simple content models • Content can be any text • #PCDATA • Content can be anything at all • (useful for debugging) • ANY • Element has no content • EMPTY
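To see how a parser reads element declarations, Python's standard `xml.parsers.expat` module can report each one through its `ElementDeclHandler` callback. Expat does not validate; the sketch below only inspects the declarations. The document, the `note` and `br` elements, and the helper names `render` and `element_declarations` are assumptions made up for illustration.

```python
import xml.parsers.expat as expat

M = expat.model  # constants describing content-model nodes
QUANT = {
    M.XML_CQUANT_NONE: "",
    M.XML_CQUANT_OPT: "?",
    M.XML_CQUANT_REP: "*",
    M.XML_CQUANT_PLUS: "+",
}

def render(content_model):
    """Turn expat's (type, quantifier, name, children) tuple back into text."""
    ctype, quant, name, children = content_model
    if ctype == M.XML_CTYPE_EMPTY:
        return "EMPTY"
    if ctype == M.XML_CTYPE_ANY:
        return "ANY"
    if ctype == M.XML_CTYPE_NAME:
        return name + QUANT[quant]
    if ctype == M.XML_CTYPE_MIXED:
        parts = ["#PCDATA"] + [child[2] for child in children]
        return "(" + " | ".join(parts) + ")" + QUANT[quant]
    separator = ", " if ctype == M.XML_CTYPE_SEQ else " | "
    return "(" + separator.join(render(c) for c in children) + ")" + QUANT[quant]

def element_declarations(document):
    """Collect {element name: content model} from the internal DTD subset."""
    decls = {}
    parser = expat.ParserCreate()

    def on_element_decl(name, content_model):
        decls[name] = render(content_model)

    parser.ElementDeclHandler = on_element_decl
    parser.Parse(document, True)
    return decls

DOC = """<!DOCTYPE name [
  <!ELEMENT name (firstName, lastName)>
  <!ELEMENT firstName (#PCDATA)>
  <!ELEMENT lastName (#PCDATA)>
  <!ELEMENT note ANY>
  <!ELEMENT br EMPTY>
]>
<name><firstName>Jane</firstName><lastName>Doe</lastName></name>"""

for element, model in element_declarations(DOC).items():
    print(element, "->", model)
```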
Example
<grades>
  <grade>
    <student>Jane Doe</student>
    <assigned-grade>A</assigned-grade>
  </grade>
  <grade>
    <student>John Doe</student>
    <assigned-grade>A-</assigned-grade>
  </grade>
</grades>
Example
<grades>
  <grade>
    <student>Jane Doe</student>
    <assigned-grade>A</assigned-grade>
  </grade>
  <grade>
    <student>John Doe</student>
    <assigned-grade>A-</assigned-grade>
  </grade>
  <grade>
    <student>Wayne Doe</student>
    <assigned-grade>I</assigned-grade>
    <reason>Alien abduction</reason>
  </grade>
</grades>
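The optional `reason` element is exactly what a content model such as (student, assigned-grade, reason?) would describe. The slides do not give a DTD for this document, so the models below are an assumption. The sketch treats each content model as a regular expression over the names of an element's children, which is roughly what a validator does; `CONTENT_MODELS` and `invalid_elements` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Assumed content models for the grades example, written as regular
# expressions over space-terminated child element names.
CONTENT_MODELS = {
    "grades": r"(grade )+",                          # (grade+)
    "grade": r"student assigned-grade (reason )?",   # (student, assigned-grade, reason?)
}

def invalid_elements(root):
    """Return the tags of elements whose children violate their content model."""
    bad = []
    for elem in root.iter():
        model = CONTENT_MODELS.get(elem.tag)
        if model is None:
            continue  # text-only elements have no child sequence to check
        children = "".join(child.tag + " " for child in elem)
        if re.fullmatch(model, children) is None:
            bad.append(elem.tag)
    return bad

GOOD = ("<grades><grade><student>Wayne Doe</student>"
        "<assigned-grade>I</assigned-grade>"
        "<reason>Alien abduction</reason></grade></grades>")
BAD = "<grades><grade><assigned-grade>A</assigned-grade></grade></grades>"

print(invalid_elements(ET.fromstring(GOOD)))  # []
print(invalid_elements(ET.fromstring(BAD)))   # ['grade']
```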
Mixed content • Legal to have a content model with text and element data
<story category="national" byline="Karen Wheatley">
  <headline>President Meets with Congress</headline>
  The President met with Congressional leaders today in an effort to jump-start faltering budget negotiations. Sources described the mood of the meeting as "cordial".
  <full_text ref="news801" />
  <image src="img2071.jpg" />
  <image src="img2072.jpg" />
  <image src="img2073.jpg" />
</story>
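Mixed content, cont'd • <!ELEMENT story (#PCDATA | headline | full_text | image)*> • In a DTD, mixed content must be declared as a repeated choice that begins with #PCDATA, so the order and number of child elements cannot be constrained • Mixed content makes handling XML complex • necessary for many applications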
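The extra handling effort shows up quickly in code. In Python's `xml.etree.ElementTree`, text that follows a child element is not stored on the parent but on that child's `tail` attribute. The snippet below is a small sketch using a shortened version of the story; the helper `mixed_text` is a made-up name.

```python
import xml.etree.ElementTree as ET

STORY = (
    '<story category="national">'
    "<headline>President Meets with Congress</headline>"
    "The President met with Congressional leaders today."
    '<image src="img2071.jpg" />'
    "</story>"
)

story = ET.fromstring(STORY)
headline = story.find("headline")

def mixed_text(elem):
    """Collect the character data that sits directly inside elem."""
    pieces = [elem.text or ""]
    pieces += [child.tail or "" for child in elem]
    return "".join(pieces).strip()

print(story.text)         # None: there is no text before <headline>
print(headline.tail)      # The President met with Congressional leaders today.
print(mixed_text(story))  # The President met with Congressional leaders today.
```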
Recursion • Unlike grammars • recursive formulation ≠ repetition • Difference between • <!ELEMENT students (student+)> • <!ELEMENT students (student, students?)>
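The difference is visible in the documents each declaration admits. The repetition form gives a flat list, while the recursive form forces every further student into a nested `students` element. A quick sketch with `xml.etree.ElementTree`, using two hand-written sample documents and a hypothetical `depth` helper:

```python
import xml.etree.ElementTree as ET

# Shape allowed by <!ELEMENT students (student+)>
FLAT = "<students><student/><student/><student/></students>"

# Shape required by <!ELEMENT students (student, students?)>
NESTED = (
    "<students><student/>"
    "<students><student/>"
    "<students><student/></students>"
    "</students></students>"
)

def depth(elem):
    """Number of levels in the tree rooted at elem."""
    return 1 + max((depth(child) for child in elem), default=0)

flat_root = ET.fromstring(FLAT)
nested_root = ET.fromstring(NESTED)

print(len(list(flat_root.iter("student"))), depth(flat_root))      # 3 2
print(len(list(nested_root.iter("student"))), depth(nested_root))  # 3 4
```

Both documents hold three students, but the nested one grows one level deeper with each additional student.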
Restriction • The grammar cannot be ambiguous • A → (a, b) | (a, c) • this makes the parser implementation difficult • Usually easy to make non-ambiguous • A → a, (b | c)
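A rough way to spot this problem is to check whether two alternatives of a choice start with the same symbol, since a parser that looks only at the next element cannot then pick a branch. The check below is a deliberately simplified sketch: it assumes every alternative is a flat, non-empty sequence of required names, and `ambiguous_first_symbols` is an invented helper, not part of any XML library.

```python
def ambiguous_first_symbols(alternatives):
    """Return symbols that begin more than one alternative of a choice."""
    seen, clashes = set(), set()
    for sequence in alternatives:
        first = sequence[0]
        if first in seen:
            clashes.add(first)
        seen.add(first)
    return clashes

# (a, b) | (a, c): both branches start with "a"
print(ambiguous_first_symbols([["a", "b"], ["a", "c"]]))  # {'a'}

# a, (b | c): the remaining choice is between "b" and "c"
print(ambiguous_first_symbols([["b"], ["c"]]))            # set()
```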
Attribute lists • Declared separately from elements • can be anywhere in the DTD • Specification includes • name of the element • name of the attribute • attribute type • default
Attribute types • Character data • CDATA • different from XML CDATA section! • Enumerated • (yes|no) • ID • must be unique in the document • IDREF • must refer to an ID in the document • NMTOKEN • a restriction of CDATA to a single "word" • Also IDREFS and NMTOKENS
Default declaration • #REQUIRED • #IMPLIED • means optional • Value • this becomes the default • #FIXED • value provided
Examples
<!ATTLIST img
    src   CDATA               #REQUIRED
    alt   CDATA               #REQUIRED
    align (left|right|center) "left"
    id    ID                  #IMPLIED
>
<!ATTLIST timestamp time-zone NMTOKEN #IMPLIED>
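The four parts of each attribute declaration (element, attribute name, type, default) can be observed with the `AttlistDeclHandler` callback of Python's `xml.parsers.expat`. This is a sketch for inspection only, since expat does not validate; the wrapper document and the function name `attribute_declarations` are assumptions for the example.

```python
import xml.parsers.expat as expat

DOC = """<!DOCTYPE img [
  <!ELEMENT img EMPTY>
  <!ATTLIST img src   CDATA               #REQUIRED
                alt   CDATA               #REQUIRED
                align (left|right|center) "left"
                id    ID                  #IMPLIED>
]>
<img src="a.png" alt="A"/>"""

def attribute_declarations(document):
    """Collect {(element, attribute): (type, default, required)}."""
    decls = {}
    parser = expat.ParserCreate()

    def on_attlist_decl(elname, attname, att_type, default, required):
        # default is None for #REQUIRED and #IMPLIED attributes
        decls[(elname, attname)] = (att_type, default, bool(required))

    parser.AttlistDeclHandler = on_attlist_decl
    parser.Parse(document, True)
    return decls

for key, value in attribute_declarations(DOC).items():
    print(key, value)
```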
Entities • Like macros • content to be inserted • indicated with &name; • Predefined general entities • &amp; &lt; • essential part of XML • User-defined general entities • &disclaimer;
Entities, cont'd • Parameter entities • can also be used to simplify DTD creation • or to combine DTDs • indicated with a % • More on this next week
Defining general entities • <!ENTITY name "content"> • Example <!ENTITY disclaimer "This is a work of fiction. Any resemblance to persons living or dead is unintentional.">
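The macro-like behaviour is easy to observe with Python's `xml.etree.ElementTree`, which expands general entities declared in a document's internal DTD subset as well as the predefined ones. The `book` document below is invented for the example and uses a shortened disclaimer.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares one user-defined general entity.
DOC = """<!DOCTYPE book [
  <!ENTITY disclaimer "This is a work of fiction.">
]>
<book><note>&disclaimer;</note><title>Cats &amp; Dogs</title></book>"""

book = ET.fromstring(DOC)

print(book.find("note").text)   # This is a work of fiction.
print(book.find("title").text)  # Cats & Dogs
```

The parser replaces `&disclaimer;` with its declared content before the application sees the text, just as it turns the predefined `&amp;` into an ampersand.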
In-class exercise • Business cards
Next week • More DTDs • Entities • Modularization and parameterization • pg. 129-148