XML Grammars 95-733 Internet Technologies Internet Technologies
XML Grammars: Three Major Uses
1. Validation
2. Code Generation
3. Communication
XML Validation
Sources for this lecture:
• “Data on the Web”, Abiteboul, Buneman and Suciu
• “XML in a Nutshell”, Harold and Means
• “The XML Companion”, Bradley
The validation examples were originally tested with an older parser, so the specific outputs may differ from those shown.
XML Validation
A batch validating process compares a complete document instance against the DTD and produces a report containing any errors or warnings. Batch validation is analogous to program compilation, with similar errors detected.
Interactive validation constantly compares a document against the DTD while the document is being created.
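The batch process described above can be sketched as follows. This is only an illustration: it assumes the JAXP parser bundled with the JDK rather than the Xerces class used later in the lecture, and the class name `BatchReport` and the `note` document type are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class BatchReport {

    /** Parses the complete document once and returns every problem the parser reported. */
    static List<String> validate(String xml) {
        List<String> report = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // compare the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) {
                    report.add("warning: " + e.getMessage());
                }
                @Override public void error(SAXParseException e) {
                    report.add("error: " + e.getMessage()); // validity error, parsing continues
                }
                @Override public void fatalError(SAXParseException e) throws SAXParseException {
                    report.add("fatal: " + e.getMessage()); // not well-formed, parsing stops
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // already recorded by fatalError above
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("could not run the parser", e);
        }
        return report;
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE note [ <!ELEMENT note (to, body)> "
                + "<!ELEMENT to (#PCDATA)> <!ELEMENT body (#PCDATA)> ]>";
        System.out.println(validate(dtd + "<note><to>A</to><body>B</body></note>"));
        System.out.println(validate(dtd + "<note><body>B</body></note>"));
    }
}
```

Like a compiler, the sketch keeps going after an ordinary validity error so that one run can list several problems, and it stops only when the document is not well-formed.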
XML Validation
The benefits of validating documents against a DTD include:
• Programmers can write extraction and manipulation filters without fear of their software processing input with an unexpected structure.
• Using an XML-aware word processor, authors and editors can be guided and constrained to produce conforming documents. Consider how NetBeans allows you to edit web.xml files.
XML Validation Examples
XML elements may contain further, embedded elements, and the entire document must be enclosed by a single document element. These are recursive hierarchical structures.
A Document Type Definition (DTD) contains rules for each element allowed within a specific class of documents.
Things the DTD does not do:
• Specify the document root.
• Specify the number of instances of each kind of element. (Or, it’s rather hard to do.)
• Describe the character data inside an element (the precise syntax).
• DTDs don’t naturally handle namespaces.
The XML Schema language is much more recent and improves on DTDs. With it we have “programmer level” type specifications.
To see a real DTD, view source on http://www.silmaril.ie/software/rss2.dtd
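The third point can be seen in a small experiment. The sketch below is not from the lecture: it assumes the JDK's built-in JAXP parser, and the class `CharacterDataGap` and the one-element `price` document type are made up for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class CharacterDataGap {

    // Hypothetical document type: a single element holding character data.
    static final String DOCTYPE = "<!DOCTYPE price [ <!ELEMENT price (#PCDATA)> ]>";

    /** Returns true when the validating parser reports no validity errors. */
    static boolean isValid(String xml) {
        boolean[] valid = {true};
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    valid[0] = false; // a validity error was reported
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse the sample document", e);
        }
        return valid[0];
    }

    public static void main(String[] args) {
        System.out.println(isValid(DOCTYPE + "<price>100</price>"));
        System.out.println(isValid(DOCTYPE + "<price>not a number</price>"));
    }
}
```

Both documents are valid, because `#PCDATA` only says that text is allowed and says nothing about what the text must look like. Checking that a price is numeric is left to the application or to a schema language with data types.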
We’ll run this program against several XML files with DTDs. We’ll study the code soon.

    // Validate.java using Xerces
    import java.io.IOException;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.SAXParseException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;
    import org.xml.sax.helpers.XMLReaderFactory;

This slide shows the imported classes.

    public class Validate {

        public static boolean valid = true;

        public static void main(String[] argv) {

            if (argv.length != 1) {
                System.err.println("Usage: java Validate filename.xml");
                System.exit(1);
            }

Here we check if the command line is correct.

            try {
                // get a parser
                XMLReader reader = XMLReaderFactory.createXMLReader(
                        "org.apache.xerces.parsers.SAXParser");

                // request validation
                reader.setFeature("http://xml.org/sax/features/validation", true);

                // SAX reports validity errors to the ErrorHandler instead of
                // throwing them, so print each one and record the failure
                reader.setErrorHandler(new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        System.out.println(e);
                        valid = false;
                    }
                });

                // associate an InputSource object with the file name
                InputSource inputSource = new InputSource(argv[0]);

                // go ahead and parse
                reader.parse(inputSource);
            }
            // Fatal (well-formedness) errors arrive here as exceptions.
            // DefaultHandler ignores simple warnings.
            catch (SAXException e) {
                System.out.println("Error in parsing " + e);
                valid = false;
            }
            catch (IOException e) {
                System.out.println("Error in I/O " + e);
                System.exit(1);
            }
            System.out.println("Valid document is " + valid);
        }
    }
XML Document

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE FixedFloatSwap SYSTEM "FixedFloatSwap.dtd">
    <FixedFloatSwap>
        <Notional>100</Notional>
        <Fixed_Rate>5</Fixed_Rate>
        <NumYears>3</NumYears>
        <NumPayments>6</NumPayments>
    </FixedFloatSwap>

DTD

    <?xml version="1.0" encoding="utf-8"?>
    <!ELEMENT FixedFloatSwap (Notional, Fixed_Rate, NumYears, NumPayments) >
    <!ELEMENT Notional (#PCDATA) >
    <!ELEMENT Fixed_Rate (#PCDATA) >
    <!ELEMENT NumYears (#PCDATA) >
    <!ELEMENT NumPayments (#PCDATA) >

Output

    Valid document is true
XML Document

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE FixedFloatSwap SYSTEM "http://localhost:8001/dtd/FixedFloatSwap.dtd">
    <FixedFloatSwap>
        <Notional>100</Notional>
        <Fixed_Rate>5</Fixed_Rate>
        <NumYears>3</NumYears>
        <NumPayments>6</NumPayments>
    </FixedFloatSwap>

DTD on the Web? Very nice.

    <?xml version="1.0" encoding="utf-8"?>
    <!ELEMENT FixedFloatSwap (Notional, Fixed_Rate, NumYears, NumPayments) >
    <!ELEMENT Notional (#PCDATA) >
    <!ELEMENT Fixed_Rate (#PCDATA) >
    <!ELEMENT NumYears (#PCDATA) >
    <!ELEMENT NumPayments (#PCDATA) >

Output

    Valid document is true
XML Document with an internal subset

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE FixedFloatSwap [
        <!ELEMENT FixedFloatSwap (Notional, Fixed_Rate, NumYears, NumPayments) >
        <!ELEMENT Notional (#PCDATA) >
        <!ELEMENT Fixed_Rate (#PCDATA) >
        <!ELEMENT NumYears (#PCDATA) >
        <!ELEMENT NumPayments (#PCDATA) >
    ]>
    <FixedFloatSwap>
        <Notional>100</Notional>
        <Fixed_Rate>5</Fixed_Rate>
        <NumYears>3</NumYears>
        <NumPayments>6</NumPayments>
    </FixedFloatSwap>

Output

    Valid document is true
XML Document

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE FixedFloatSwap SYSTEM "FixedFloatSwap.dtd">
    <FixedFloatSwap>
        <Notional>100</Notional>
        <Fixed_Rate>5</Fixed_Rate>
        <NumYears>3</NumYears>
        <NumPayments>6</NumPayments>
    </FixedFloatSwap>

DTD (NumYears is no longer declared or allowed)

    <?xml version="1.0" encoding="utf-8"?>
    <!ELEMENT FixedFloatSwap (Notional, Fixed_Rate, NumPayments) >
    <!ELEMENT Notional (#PCDATA) >
    <!ELEMENT Fixed_Rate (#PCDATA) >
    <!ELEMENT NumPayments (#PCDATA) >

Output

    Valid document is false
XML Document

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE Swaps SYSTEM "FixedFloatSwap.dtd">
    <Swaps>
        <FixedFloatSwap>
            <Notional>100</Notional>
            <Fixed_Rate>5</Fixed_Rate>
            <NumYears>3</NumYears>
            <NumPayments>6</NumPayments>
        </FixedFloatSwap>
        <FixedFloatSwap>
            <Notional>100</Notional>
            <Fixed_Rate>5</Fixed_Rate>
            <NumYears>3</NumYears>
            <NumPayments>6</NumPayments>
        </FixedFloatSwap>
    </Swaps>
DTD

    <?xml version="1.0" encoding="utf-8"?>
    <!ELEMENT Swaps (FixedFloatSwap+) >
    <!ELEMENT FixedFloatSwap (Notional, Fixed_Rate, NumYears, NumPayments) >
    <!ELEMENT Notional (#PCDATA) >
    <!ELEMENT Fixed_Rate (#PCDATA) >
    <!ELEMENT NumYears (#PCDATA) >
    <!ELEMENT NumPayments (#PCDATA) >

Output

    C:\McCarthy\www\examples\sax>java Validate FixedFloatSwap.xml
    Valid document is true

Quantity Indicators

    ?   0 or 1 time
    +   1 or more times
    *   0 or more times
Is this a valid document?

    <?xml version="1.0"?>
    <!DOCTYPE person [
        <!ELEMENT person (name+, profession*)>
        <!ELEMENT profession (#PCDATA)>
        <!ELEMENT name (#PCDATA)>
    ]>
    <person>
        <name>Alan Turing</name>
        <profession>computer scientist</profession>
        <profession>cryptographer</profession>
    </person>

Sure!
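To experiment with the quantity indicators, the `person` DTD above can be paired with different document bodies. The sketch below is an assumption-laden illustration, not the lecture's Validate program: it uses the JDK's built-in JAXP parser, and the class name `QuantityIndicators` is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class QuantityIndicators {

    // The person DTD from the slide, as an internal subset.
    static final String DOCTYPE =
            "<!DOCTYPE person [\n"
          + "  <!ELEMENT person (name+, profession*)>\n"
          + "  <!ELEMENT profession (#PCDATA)>\n"
          + "  <!ELEMENT name (#PCDATA)>\n"
          + "]>\n";

    /** Validates the given person element against the DTD above. */
    static boolean isValid(String personElement) {
        boolean[] valid = {true};
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    valid[0] = false;
                }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + personElement)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse the sample document", e);
        }
        return valid[0];
    }

    public static void main(String[] args) {
        // name+ is satisfied by one name, profession* by zero professions
        System.out.println(isValid("<person><name>Alan Turing</name></person>"));
        // name+ requires at least one name
        System.out.println(isValid("<person><profession>cryptographer</profession></person>"));
    }
}
```

The first call reports a valid document and the second an invalid one, since `+` demands at least one `name` while `*` lets `profession` be absent.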
The locations where document text data is allowed are indicated by the keyword ‘#PCDATA’ (Parsed Character Data).

XML Document

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE FixedFloatSwap SYSTEM "FixedFloatSwap.dtd">
    <FixedFloatSwap>
        <Notional>100</Notional>
        <Fixed_Rate>5</Fixed_Rate>
        <NumYears>
            <StartYear>2000</StartYear>
            <EndYear>2002</EndYear>
        </NumYears>
        <NumPayments>6</NumPayments>
    </FixedFloatSwap>
DTD

    <?xml version="1.0" encoding="utf-8"?>
    <!ELEMENT FixedFloatSwap (Notional, Fixed_Rate, NumYears, NumPayments) >
    <!ELEMENT Notional (#PCDATA) >
    <!ELEMENT Fixed_Rate (#PCDATA) >
    <!ELEMENT NumYears (#PCDATA) >
    <!ELEMENT NumPayments (#PCDATA) >

Output

    C:\McCarthy\www\46-928\examples\sax>java Validate FixedFloatSwap.xml
    org.xml.sax.SAXParseException: Element "NumYears" does not allow "StartYear" -- (#PCDATA)
    org.xml.sax.SAXParseException: Element type "StartYear" is not declared.
    org.xml.sax.SAXParseException: Element "NumYears" does not allow "EndYear" -- (#PCDATA)
    org.xml.sax.SAXParseException: Element type "EndYear" is not declared.
    Valid document is false
Mixed Content
Strict rules apply when an element is allowed to contain both text and child elements:
• The #PCDATA keyword must be the first token in the group.
• The group must be a choice group (using “|”, not “,”).
• The group must be optional and repeatable (it ends with “*”).
This is known as a mixed content model.
DTD

    <?xml version="1.0" encoding="utf-8"?>
    <!ELEMENT Mixed (emph) >
    <!ELEMENT emph (#PCDATA | sub | super)* >
    <!ELEMENT sub (#PCDATA)>
    <!ELEMENT super (#PCDATA)>

XML Document

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE Mixed SYSTEM "Mixed.dtd">
    <Mixed>
        <emph>H<sub>2</sub>O is water.</emph>
    </Mixed>

Output

    Valid document is true
Is this a valid document?

    <?xml version="1.0"?>
    <!DOCTYPE page [
        <!ELEMENT page (paragraph+)>
        <!ELEMENT paragraph (#PCDATA | profession | bold)*>
        <!ELEMENT profession (#PCDATA)>
        <!ELEMENT bold (#PCDATA)>
    ]>
    <page>
        <paragraph>
            Alan Turing broke codes during <bold>World War II</bold>.
            He very precisely defined the notion of "algorithm".
            And so he had several professions:
            <profession>computer scientist</profession>
            <profession>cryptographer</profession>
            And <profession>mathematician</profession>
        </paragraph>
    </page>

Sure!
How about this one?

    <?xml version="1.0"?>
    <!DOCTYPE page [
        <!ELEMENT page (paragraph+)>
        <!ELEMENT paragraph (#PCDATA | profession | bold)*>
        <!ELEMENT profession (#PCDATA)>
        <!ELEMENT bold (#PCDATA)>
    ]>
    <page>
        The following is a paragraph marked up in XML.
        <paragraph>
            Alan Turing broke codes during <bold>World War II</bold>.
            He very precisely defined the notion of "algorithm".
            And so he had several professions:
            <profession>computer scientist</profession>
            <profession>cryptographer</profession>
            And <profession>mathematician</profession>
        </paragraph>
    </page>

Output

    java Validate mixed.xml
    org.xml.sax.SAXParseException: The content of element type "page" must match "(paragraph)+".
    Valid document is false

The text placed directly inside page is the problem: page has element content only, so character data is not allowed there.
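The mixed content rules apply to the declaration itself, not just to the documents. As a hedged sketch (JDK built-in JAXP parser assumed, class name `MixedContentRules` hypothetical), the helper below plugs different content models into a `paragraph` declaration and reports whether the parser accepts the result.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class MixedContentRules {

    /** Returns true when a paragraph declared with this content model parses and validates. */
    static boolean accepts(String contentModel) {
        String xml =
                "<!DOCTYPE paragraph [\n"
              + "  <!ELEMENT paragraph " + contentModel + ">\n"
              + "  <!ELEMENT bold (#PCDATA)>\n"
              + "]>\n"
              + "<paragraph>Alan Turing broke codes during <bold>World War II</bold>.</paragraph>";
        boolean[] valid = {true};
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    valid[0] = false; // validity error in the document
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            return false; // the parser rejected the declaration or the document outright
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("could not run the parser", e);
        }
        return valid[0];
    }

    public static void main(String[] args) {
        System.out.println(accepts("(#PCDATA | bold)*")); // follows all three rules
        System.out.println(accepts("(bold | #PCDATA)*")); // #PCDATA is not first
        System.out.println(accepts("(#PCDATA, bold)*"));  // sequence instead of choice
        System.out.println(accepts("(#PCDATA | bold)"));  // not repeatable
    }
}
```

Only the first model is accepted. The other three break one rule each and are rejected.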
CDATA Section

XML Document

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE FixedFloatSwap SYSTEM "FixedFloatSwap.dtd">
    <FixedFloatSwap>
        <Notional>100</Notional>
        <Fixed_Rate>5</Fixed_Rate>
        <NumYears>3</NumYears>
        <NumPayments>6</NumPayments>
        <Note>
            <![CDATA[This is text that <b>will not be parsed for markup]]>
        </Note>
    </FixedFloatSwap>

DTD

    <?xml version="1.0" encoding="utf-8"?>
    <!ELEMENT FixedFloatSwap (Notional, Fixed_Rate, NumYears, NumPayments, Note) >
    <!ELEMENT Notional (#PCDATA) >
    <!ELEMENT Fixed_Rate (#PCDATA) >
    <!ELEMENT NumYears (#PCDATA) >
    <!ELEMENT NumPayments (#PCDATA) >
    <!ELEMENT Note (#PCDATA) >
Recursion

    <?xml version="1.0"?>
    <!DOCTYPE tree [
        <!ELEMENT tree (node)>
        <!ELEMENT node (leaf | (node,node))>
        <!ELEMENT leaf (#PCDATA)>
    ]>
    <tree>
        <node>
            <leaf>A DTD is a context-free grammar</leaf>
        </node>
    </tree>

Output

    java Validate recursive1.xml
    Valid document is true
How about this one?

    <?xml version="1.0"?>
    <!DOCTYPE tree [
        <!ELEMENT tree (node)>
        <!ELEMENT node (leaf | (node,node))>
        <!ELEMENT leaf (#PCDATA)>
    ]>
    <tree>
        <node>
            <leaf>Alan Turing would like this</leaf>
        </node>
        <node>
            <leaf>Alan Turing would like this</leaf>
        </node>
    </tree>

Output

    java Validate recursive1.xml
    org.xml.sax.SAXParseException: The content of element type "tree" must match "(node)".
    Valid document is false
Relational Databases and XML
Consider the relational database r1(a,b,c), r2(c,d):

    r1:  a   b   c        r2:  c   d
         a1  b1  c1            c2  d2
         a2  b2  c2            c3  d3
                               c4  d4

How can we represent this database with an XML DTD?
Relations

    <?xml version="1.0"?>
    <!DOCTYPE db [
        <!ELEMENT db (r1*, r2*)>
        <!ELEMENT r1 (a,b,c)>
        <!ELEMENT r2 (c,d)>
        <!ELEMENT a (#PCDATA)>
        <!ELEMENT b (#PCDATA)>
        <!ELEMENT c (#PCDATA)>
        <!ELEMENT d (#PCDATA)>
    ]>
    <db>
        <r1><a> a1 </a> <b> b1 </b> <c> c1 </c> </r1>
        <r1><a> a2 </a> <b> b2 </b> <c> c2 </c> </r1>
        <r2><c> c2 </c> <d> d2 </d> </r2>
        <r2><c> c3 </c> <d> d3 </d> </r2>
        <r2><c> c4 </c> <d> d4 </d> </r2>
    </db>

Output

    java Validate Db.xml
    Valid document is true

There is a small problem…
Relations

    <?xml version="1.0"?>
    <!DOCTYPE db [
        <!ELEMENT db (r1|r2)* >
        <!ELEMENT r1 ((a,b,c) | (a,c,b) | (b,a,c) | (b,c,a) | (c,a,b) | (c,b,a))>
        <!ELEMENT r2 ((c,d) | (d,c))>
        <!ELEMENT a (#PCDATA)>
        <!ELEMENT b (#PCDATA)>
        <!ELEMENT c (#PCDATA)>
        <!ELEMENT d (#PCDATA)>
    ]>
    <db>
        <r1><a> a1 </a> <b> b1 </b> <c> c1 </c> </r1>
        <r1><a> a2 </a> <b> b2 </b> <c> c2 </c> </r1>
        <r2><c> c2 </c> <d> d2 </d> </r2>
        <r2><c> c3 </c> <d> d3 </d> </r2>
        <r2><c> c4 </c> <d> d4 </d> </r2>
    </db>

The order of the relations should not count, and neither should the order of columns within rows.
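Writing out every column order by hand gets tedious quickly, because a relation with n columns needs n! alternatives. The following sketch generates such a declaration. The class `UnorderedModel` and its method names are hypothetical, and the alternatives may come out in a different order than on the slide.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class UnorderedModel {

    /** Returns every ordering of the names, each written as a DTD sequence such as "(a,b,c)". */
    static List<String> sequences(List<String> names) {
        List<String> out = new ArrayList<>();
        permute(new ArrayList<>(names), 0, out);
        return out;
    }

    // Classic swap-based permutation: fix position 'start', then recurse on the rest.
    private static void permute(List<String> names, int start, List<String> out) {
        if (start == names.size()) {
            out.add("(" + String.join(",", names) + ")");
            return;
        }
        for (int i = start; i < names.size(); i++) {
            Collections.swap(names, start, i);
            permute(names, start + 1, out);
            Collections.swap(names, start, i); // undo the swap before trying the next choice
        }
    }

    /** Builds an element declaration that accepts the children in any order. */
    static String elementDecl(String element, List<String> children) {
        return "<!ELEMENT " + element + " (" + String.join(" | ", sequences(children)) + ")>";
    }

    public static void main(String[] args) {
        System.out.println(elementDecl("r2", List.of("c", "d")));
        System.out.println(elementDecl("r1", List.of("a", "b", "c")));
        System.out.println(sequences(List.of("a", "b", "c", "d")).size() + " orderings for four columns");
    }
}
```

Two columns need 2 alternatives, three need 6, and four already need 24, which is why this approach is only practical for very small relations.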
Attributes
An attribute is associated with a particular element by the DTD and is assigned an attribute type. The attribute type can restrict the range of values it can hold. Example attribute types include:
• CDATA indicates a simple string of characters
• NMTOKEN indicates a word or token
• A named token group such as (left | center | right)
• ID an element id that holds a unique value (among other element IDs in the document)
• IDREF attributes refer to an ID
DTD

    <?xml version="1.0" encoding="utf-8"?>
    <!ELEMENT FixedFloatSwap (Notional, Fixed_Rate, NumYears, NumPayments) >
    <!ELEMENT Notional (#PCDATA) >
    <!ELEMENT Fixed_Rate (#PCDATA) >
    <!ELEMENT NumYears (#PCDATA) >
    <!ELEMENT NumPayments (#PCDATA) >
    <!ATTLIST Notional currency (Dollars | Pounds) #REQUIRED>

XML Document

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE FixedFloatSwap SYSTEM "FixedFloatSwap.dtd">
    <FixedFloatSwap>
        <Notional>100</Notional>
        <Fixed_Rate>5</Fixed_Rate>
        <NumYears>3</NumYears>
        <NumPayments>6</NumPayments>
    </FixedFloatSwap>

Output

    C:\McCarthy\www\46-928\examples\sax>java Validate FixedFloatSwap.xml
    org.xml.sax.SAXParseException: Attribute value for "currency" is #REQUIRED.
    Valid document is false
DTD

    <?xml version="1.0" encoding="utf-8"?>
    <!ELEMENT FixedFloatSwap (Notional, Fixed_Rate, NumYears, NumPayments) >
    <!ELEMENT Notional (#PCDATA) >
    <!ELEMENT Fixed_Rate (#PCDATA) >
    <!ELEMENT NumYears (#PCDATA) >
    <!ELEMENT NumPayments (#PCDATA) >
    <!ATTLIST Notional currency (Dollars | Pounds) #REQUIRED>

XML Document

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE FixedFloatSwap SYSTEM "FixedFloatSwap.dtd">
    <FixedFloatSwap>
        <Notional currency="Pounds">100</Notional>
        <Fixed_Rate>5</Fixed_Rate>
        <NumYears>3</NumYears>
        <NumPayments>6</NumPayments>
    </FixedFloatSwap>

Output

    Valid document is true
DTD

    <?xml version="1.0" encoding="utf-8"?>
    <!ELEMENT FixedFloatSwap (Notional, Fixed_Rate, NumYears, NumPayments) >
    <!ELEMENT Notional (#PCDATA) >
    <!ELEMENT Fixed_Rate (#PCDATA) >
    <!ELEMENT NumYears (#PCDATA) >
    <!ELEMENT NumPayments (#PCDATA) >
    <!ATTLIST Notional currency (Dollars | Pounds) #REQUIRED>
    <!ATTLIST FixedFloatSwap note CDATA #IMPLIED>

XML Document

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE FixedFloatSwap SYSTEM "FixedFloatSwap.dtd">
    <FixedFloatSwap>
        <Notional currency="Pounds">100</Notional>
        <Fixed_Rate>5</Fixed_Rate>
        <NumYears>3</NumYears>
        <NumPayments>6</NumPayments>
    </FixedFloatSwap>

Output

    Valid document is true

#IMPLIED means optional, so the document is valid without a note attribute.
DTD

    <?xml version="1.0" encoding="utf-8"?>
    <!ELEMENT FixedFloatSwap (Notional, Fixed_Rate, NumYears, NumPayments) >
    <!ELEMENT Notional (#PCDATA) >
    <!ELEMENT Fixed_Rate (#PCDATA) >
    <!ELEMENT NumYears (#PCDATA) >
    <!ELEMENT NumPayments (#PCDATA) >
    <!ATTLIST Notional currency (Dollars | Pounds) #REQUIRED>
    <!ATTLIST FixedFloatSwap note CDATA #IMPLIED>

XML Document

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE FixedFloatSwap SYSTEM "FixedFloatSwap.dtd">
    <FixedFloatSwap note="For your eyes only">
        <Notional currency="Pounds">100</Notional>
        <Fixed_Rate>5</Fixed_Rate>
        <NumYears>3</NumYears>
        <NumPayments>6</NumPayments>
    </FixedFloatSwap>

Output

    Valid document is true
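The attribute examples above cover a missing and a correct value. A value outside the token group is rejected as well, as this sketch shows. It is an illustration only: it assumes the JDK's built-in JAXP parser, trims the document down to a lone `Notional` element, and uses the hypothetical class name `CurrencyAttribute`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class CurrencyAttribute {

    // Cut-down document type: only the Notional element and its currency attribute.
    static final String DOCTYPE =
            "<!DOCTYPE Notional [\n"
          + "  <!ELEMENT Notional (#PCDATA)>\n"
          + "  <!ATTLIST Notional currency (Dollars | Pounds) #REQUIRED>\n"
          + "]>\n";

    /** Returns the validity error messages reported for the given Notional element. */
    static List<String> errors(String notionalElement) {
        List<String> messages = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    messages.add(e.getMessage());
                }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + notionalElement)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse the sample document", e);
        }
        return messages;
    }

    public static void main(String[] args) {
        System.out.println(errors("<Notional currency='Pounds'>100</Notional>"));
        System.out.println(errors("<Notional currency='Euros'>100</Notional>"));
        System.out.println(errors("<Notional>100</Notional>"));
    }
}
```

The exact wording of the messages depends on the parser, so the sketch only relies on whether the list is empty.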
ID and IDREF Attributes
We can represent complex relationships within an XML document using ID and IDREF attributes.
An Undirected Graph
[Figure: an undirected graph on the vertices u, v, w, x, y, z, with one edge and one vertex labeled.]
A Directed Graph
[Figure: a directed graph on the vertices u, v, w, x, y.]
[Figure: a prerequisite graph over the courses Math100, Geom100, Calc100, Calc200, Calc300, CS1, CS2 and Philo45.]
This is called a DAG (Directed Acyclic Graph).
    <?xml version="1.0"?>
    <!DOCTYPE Course_Descriptions SYSTEM "course_descriptions.dtd">
    <Course_Descriptions>
        <Course>
            <Course-ID id="Math100" />
            <Title>Algebra I</Title>
            <Description>
                Students in this course study introductory algebra.
            </Description>
            <Prerequisites/>
        </Course>

This course has an ID but no prerequisites.
        <Course>
            <Course-ID id="Geom100" />
            <Title>Geometry I</Title>
            <Description>
                Students in this course study how to prove several theorems in geometry.
            </Description>
            <Prerequisites/>
        </Course>

The DTD will force this id to be unique.
        <Course>
            <Course-ID id="Calc100" />
            <Title>Calculus I</Title>
            <Description>
                Students in this course study the derivative.
            </Description>
            <Prerequisites pre="Math100 Geom100" />
        </Course>
        <Course>

The values of pre are references to IDs (IDREFS).
            <Course-ID id="Calc200" />
            <Title>Calculus II</Title>
            <Description>
                Students in this course study the integral.
            </Description>
            <Prerequisites pre="Calc100" />
        </Course>

The DTD requires that the name in pre match an ID defined within this document. Otherwise, the document is invalid.
        <Course>
            <Course-ID id="Calc300" />
            <Title>Calculus III</Title>
            <Description>
                Students in this course study the derivative and the integral (in 3-space).
            </Description>
            <Prerequisites pre="Calc200" />
        </Course>

Prerequisites is an EMPTY element. It’s used only for its attributes.
        <Course>
            <Course-ID id="CS1" />
            <Title>Introduction to Computer Science I</Title>
            <Description>
                In this course we study Turing machines.
            </Description>
            <Prerequisites pre="Calc100" />
        </Course>
        <Course>

One IDREF pointing at one ID: a one-to-one link.
            <Course-ID id="CS2" />
            <Title>Introduction to Computer Science II</Title>
            <Description>
                In this course we study basic data structures.
            </Description>
            <Prerequisites pre="Calc200 CS1"/>
        </Course>
        <Course>

An IDREFS value pointing at several IDs: one-to-many links.
            <Course-ID id="Philo45" />
            <Title>Ethical Implications of Information Technology</Title>
            <Description>
                TBA
            </Description>
            <Prerequisites/>
        </Course>
    </Course_Descriptions>
The Course_Descriptions.dtd

    <?xml version="1.0" encoding="utf-8"?>
    <!-- Course Description DTD -->
    <!ELEMENT Course_Descriptions (Course)+>
    <!ELEMENT Course (Course-ID,Title,Description,Prerequisites)>
    <!ELEMENT Course-ID EMPTY>
    <!ELEMENT Title (#PCDATA)>
    <!ELEMENT Description (#PCDATA)>
    <!ELEMENT Prerequisites EMPTY>
    <!ATTLIST Course-ID id ID #REQUIRED>
    <!ATTLIST Prerequisites pre IDREFS #IMPLIED>
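The two guarantees this DTD gives, unique IDs and no dangling references, can be checked with a small experiment. The sketch below is an illustration under assumptions: it uses the JDK's built-in JAXP parser, copies the declarations above into an internal subset, and builds courses with placeholder titles through the hypothetical helper `course`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class CourseLinks {

    static final String DOCTYPE =
            "<!DOCTYPE Course_Descriptions [\n"
          + "  <!ELEMENT Course_Descriptions (Course)+>\n"
          + "  <!ELEMENT Course (Course-ID,Title,Description,Prerequisites)>\n"
          + "  <!ELEMENT Course-ID EMPTY>\n"
          + "  <!ELEMENT Title (#PCDATA)>\n"
          + "  <!ELEMENT Description (#PCDATA)>\n"
          + "  <!ELEMENT Prerequisites EMPTY>\n"
          + "  <!ATTLIST Course-ID id ID #REQUIRED>\n"
          + "  <!ATTLIST Prerequisites pre IDREFS #IMPLIED>\n"
          + "]>\n";

    /** Builds one Course element; an empty pre string means no prerequisites. */
    static String course(String id, String pre) {
        String prerequisites = pre.isEmpty()
                ? "<Prerequisites/>"
                : "<Prerequisites pre='" + pre + "'/>";
        return "<Course><Course-ID id='" + id + "'/><Title>T</Title>"
                + "<Description>D</Description>" + prerequisites + "</Course>";
    }

    /** Returns the validity errors for a document made of the given Course elements. */
    static List<String> errors(String... courses) {
        String xml = DOCTYPE + "<Course_Descriptions>"
                + String.join("", courses) + "</Course_Descriptions>";
        List<String> messages = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    messages.add(e.getMessage());
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse the sample document", e);
        }
        return messages;
    }

    public static void main(String[] args) {
        // every reference resolves
        System.out.println(errors(course("Math100", ""), course("Geom100", ""),
                course("Calc100", "Math100 Geom100")));
        // Geom100 is referenced but never defined
        System.out.println(errors(course("Math100", ""), course("Calc100", "Math100 Geom100")));
        // Math100 is defined twice
        System.out.println(errors(course("Math100", ""), course("Math100", "")));
    }
}
```

Note what the DTD does not check: it cannot require that a reference point at a Course-ID rather than some other ID-typed attribute, and it does not detect cycles, so keeping the prerequisite graph acyclic is up to the application.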
General Entities
General entities are used to place text into the XML document. They may be declared in the DTD and referenced in the document as &name;. They may also be declared in the DTD as residing in a file, and then be referenced in the document in the same way.
Document using a General Entity

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE FixedFloatSwap SYSTEM "FixedFloatSwap.dtd" [
        <!ENTITY bankname "Mellon National Bank and Trust" >
    ]>
    <FixedFloatSwap>
        <Bank>&bankname;</Bank>
        <Notional>100</Notional>
        <Fixed_Rate>5</Fixed_Rate>
        <NumYears>3</NumYears>
        <NumPayments>6</NumPayments>
    </FixedFloatSwap>

DTD

    <?xml version="1.0" encoding="utf-8"?>
    <!ELEMENT FixedFloatSwap (Bank, Notional, Fixed_Rate, NumYears, NumPayments) >
    <!ELEMENT Bank (#PCDATA) >
    <!ELEMENT Notional (#PCDATA) >
    <!ELEMENT Fixed_Rate (#PCDATA) >
    <!ELEMENT NumYears (#PCDATA) >
    <!ELEMENT NumPayments (#PCDATA) >

Output

    Valid document is true
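To see what an application receives when a document references a general entity, the document can be parsed into a DOM and the element's text read back. This is a minimal sketch rather than part of the lecture's Validate program: it assumes the JDK's built-in JAXP parser, declares the entity in an internal subset only so that no external DTD file is needed, and uses the hypothetical class name `EntityExpansion`.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class EntityExpansion {

    // Same entity as on the slide, declared in an internal subset.
    static final String XML =
            "<!DOCTYPE FixedFloatSwap [\n"
          + "  <!ENTITY bankname \"Mellon National Bank and Trust\">\n"
          + "]>\n"
          + "<FixedFloatSwap><Bank>&bankname;</Bank><Notional>100</Notional></FixedFloatSwap>";

    /** Parses the document and returns the text content of its first Bank element. */
    static String bankName(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("Bank").item(0).getTextContent();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(bankName(XML)); // prints "Mellon National Bank and Trust"
    }
}
```

The application never sees `&bankname;` itself. The parser substitutes the replacement text, so changing the bank name in one declaration changes it everywhere the entity is referenced.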