DTD++ 2.0: Adding support for co-constraints. Davide Fiorello, Nicola Gessa, Paolo Marinelli, Fabio Vitali, University of Bologna. Two sales pitches here: DTDs aren’t dead yet and should not be; co-constraints are important, and the very next step in validation. The war of schema languages.
DTD++ 2.0: Adding support for co-constraints
Davide Fiorello, Nicola Gessa, Paolo Marinelli, Fabio Vitali
University of Bologna
Two sales pitches here
• DTDs aren’t dead yet and should not be
• Co-constraints are important, and the very next step in validation
The war of schema languages
• DTD?
• XML Schema?
• Relax NG?
• Schematron?
• ISO/IEC 19757 DSDL (especially Part 9: “Data type- and namespace-aware DTDs”)
My own story
• The project NormeInRete (http://www.normeinrete.it): XML-ization of national and regional laws and basically any kind of normative document in Italy
• Supported by the Italian Office of the Prime Minister, the Ministry of Justice and the department for Informatics in Public Administration. All national laws and regional laws from 3 (soon 7) of the 20 regions are now available in XML and locatable through URNs.
• Yours truly is the main author of the DTDs and of the documentation manuals providing guidance for conversion.
• The document type contains 150+ elements and 50+ attributes, dealing with content, meta-content, evolution in time and space, and non-ASCII characters. By the end of the year we will deal with judicial documents.
NormeInRete: DTD or XML Schema?
• Started in 1999; the first version of the rules was ready in 2000: necessarily a DTD!
• The syntax is clear, easy to look up and use, and well known to users and tool implementers.
• The birth of XML Schema started many discussions on whether to switch:
  • “All my friends use XML Schema”
  • “XML Spy creates very nice drawings of an XML Schema”
  • “XML Schema is the future”
  • “Admit you don’t know the first thing about XML Schema”
• In truth, there is very little real reason to switch: DTDs are fine for our purposes.
• So far, the two sides are balanced. European integration may provide the necessary pressure.
But…
… is the switch inevitable?
Are DTDs dead?
• The need for an XML-based syntax
  • For automatic processing and generation
• The presence of strong competition
  • XML Schema
  • Relax NG
• The absence of many important features
Yes, but …
• DTDs are easier to learn
• DTDs are easier to read
• DTDs are easier to use
• Many people still think in terms of DTDs
So: DTD++ 1.0 (Extreme Markup 2003)
• The idea: create a DTD-like language that is as powerful as the most powerful validation language, XML Schema.
• Syntax from DTD, structures and concepts from XML Schema:
  • Namespace support
  • Complex types for managing markup structures
  • Simple types for managing constraints on data containers
• Use as much as possible of the DTD syntax, invent as little as possible, recycle concepts with new meanings.
What about XML-based syntax?
• Semantic equivalence to another XML-based schema language means this is no longer a problem. Just convert it!
• All human tasks use the original DTD++ form; all computer tasks use the corresponding XSD version. Conversion is easy and fast.
A taste of DTD++ (1)
• Anonymous complex types in XSD are content models
  <!ELEMENT X (A?, (B | C)[2-5], D*) >
• Predefined simple types are predefined keywords
  <!ELEMENT A (#PCDATA)> or <!ELEMENT A (#STRING)>
  <!ELEMENT B (#INTEGER)>
  <!ELEMENT C (#DATE)>
• Anonymous simple types add facets to predefined simple types. The syntax for facets uses well-known mathematical constructs: for instance {} for lengths and [] for ranges.
  <!ELEMENT D (#INTEGER[,100])>
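Since DTD++ 1.0 is meant to be converted to XSD, the occurrence indicators in a content model have to end up as minOccurs/maxOccurs pairs. The following is a minimal sketch of that one step, not the authors' converter: the function name `occurrence_bounds` is hypothetical, and reading `[2-5]` as "between 2 and 5 occurrences" is an assumption based on the example above.

```python
import re
from typing import Optional, Tuple

# Classic DTD indicators and their (minOccurs, maxOccurs) equivalents.
# None stands for maxOccurs="unbounded".
_DTD_INDICATORS = {
    "": (1, 1),
    "?": (0, 1),
    "*": (0, None),
    "+": (1, None),
}


def occurrence_bounds(suffix: str) -> Tuple[int, Optional[int]]:
    """Map a DTD or DTD++ occurrence suffix to (minOccurs, maxOccurs).

    Hypothetical helper: the "[m-n]" form is assumed to mean
    "at least m and at most n occurrences".
    """
    if suffix in _DTD_INDICATORS:
        return _DTD_INDICATORS[suffix]
    match = re.fullmatch(r"\[([0-9]+)-([0-9]+)\]", suffix)
    if match is None:
        raise ValueError(f"unrecognized occurrence suffix: {suffix!r}")
    low, high = int(match.group(1)), int(match.group(2))
    if low > high:
        raise ValueError(f"empty occurrence range: {suffix!r}")
    return low, high


print(occurrence_bounds("[2-5]"))
print(occurrence_bounds("*"))
```

Under these assumptions, `(B | C)[2-5]` would become an `xsd:choice` with `minOccurs="2"` and `maxOccurs="5"`.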
A taste of DTD++ (2)
• Named types are named entities that use different characters to differentiate themselves
  <!ENTITY # myInt "(#INTEGER[0,100])">
  <!ELEMENT D #myInt; >
  <!ENTITY @ myType "(A?, (B | C)[2-5], D*)" >
  <!ELEMENT X @myType; >
• Complex types that specify attributes have an additional quoted block:
  <!ENTITY @ myType "(A?, (B | C)[2-5], D*)" "anAttr #STRING{10} #IMPLIED">
  <!ELEMENT X @myType; >
A taste of DTD++ (3)
• Mixed content models extend the DTD syntax to allow any structure allowed by XSD:
  <!ENTITY @ myType "#PCDATA (A?, (B | C)[2-5], D*)" >
  <!ELEMENT X @myType; >
• The ANY structure is extended
  <!ELEMENT comment ANY[0,3]{http://www.foo.org}>
• Target namespaces use the newly introduced TARGETNS structure
  <!TARGETNS "http://www.foo.org">
  <!TARGETNS ns "http://www.bar.org">
  <!ELEMENT name (ns:firstname)>
  <!ELEMENT ns:firstname (#PCDATA)>
Limits
• No support (yet) for keys, keyrefs, uniques
• No local elements
• No support for refs
• Only two design styles supported:
  • Salami slices
  • Garden of Eden
• No redefine or include (but no need for them)
Co-constraints and what they are for
• Better constraints
• Real-life constraints
• Constraints difficult to formalize
Is DTD++ 1.0 enough, then?
• No, since XML Schema is not enough
• XML Schema cannot express all the structure and data constraints that document designers may need:
  • Mutual exclusion (“element x may have either the a attribute or the b attribute, but not both”)
  • Deep exclusions (“element x cannot contain, at any level of its subtree, element y”)
  • Structure-dependent structures (“if the item is gratis, i.e., the attribute gratis is present, then no price should be specified, i.e., the element price should be absent”)
  • Data-dependent structures (“if the address is a PO box, then the address must include a PO box number, otherwise it must include a street name and a street number”)
• These kinds of constraints are known as co-constraints, or co-occurrence constraints. Most real-life XML document types have one or more of these constraints.
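To make the first two kinds of co-constraint concrete, here is a minimal sketch of what they look like as hand-written checks in application code. It uses only Python's standard `xml.etree.ElementTree`; the function names and the sample document are made up for illustration and are not part of DTD++ or SchemaPath.

```python
import xml.etree.ElementTree as ET


def mutual_exclusion_ok(elem: ET.Element, first: str, second: str) -> bool:
    """Return False only when both attributes are present on elem."""
    return not (first in elem.attrib and second in elem.attrib)


def deep_exclusion_ok(elem: ET.Element, forbidden: str) -> bool:
    """Return True if no descendant of elem, at any depth, is named forbidden."""
    return elem.find(f".//{forbidden}") is None


# Hypothetical instance that violates both constraints:
# x carries both a and b, and contains a y two levels down.
doc = ET.fromstring('<x a="1" b="2"><p><y/></p></x>')
print(mutual_exclusion_ok(doc, "a", "b"))
print(deep_exclusion_ok(doc, "y"))
```

A schema that declares both attributes optional and allows `y` inside `p` would accept this document, which is exactly the gap co-constraints are meant to close.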
For example…
• XHTML
  • “a elements cannot contain other a elements” (appendix B)
  • Neither the normative DTD nor the non-normative XML Schema can fully express this requirement (they only express a weaker form: “a elements cannot directly contain other a elements”)
• XSLT
  • “In a template element at least one of the match and name attributes must be present”
  • Again, the DTD and the XML Schema cannot express this requirement, and specify both attributes as optional.
• XML Schema itself
  • “An element definition must either contain a ref or a name attribute, but not both. Furthermore, if the name attribute is present, then the type attribute or one of the simpleType or complexType elements must be present, but not two.”
  • The normative XML Schema can only specify all these elements and attributes as optional.
• … and plenty more…
Who cares?
[Figure: validation pipeline with the labels XML doc, DOM parser (not well-formed), DOM tree, rules, schema validator (invalid), DOM tree + PSVI, rules, downstream application, ? ? ?]
• Documents that violate these rules are still considered valid by the XML Schema validator.
• Three solutions:
  • Hope for the best (“It won’t happen”), subject to Murphy’s Law
  • Provide a default behavior (“If both attributes are present, consider the first only”)
  • Provide validation code within the downstream application
SchemaPath and DTD++ 2.0
• At the WWW2004 conference, we presented SchemaPath, our proposal to minimally extend XML Schema to handle co-constraints.
• The idea is to find a way to conditionally assign types to elements and attributes. Furthermore, a non-satisfiable type is added for specifying error conditions to avoid.
• SchemaPath maintains the XML Schema syntax, adds only ONE construct and ONE predefined simple type, maintains important XML Schema properties (the validation theorem and the round-tripping and reverse round-tripping properties), and does not impact the PSVI for valid documents.
• DTD++ 2.0 is the DTD-like syntax for SchemaPath
DTD++ 2.0
• Conditional assignment of types
  • Multiple definitions of the same element, each conditioned by an XPath expression. Implicit and explicit priorities are used.
  • Each condition is tested on the instance element, and the one that holds with the highest priority is selected.
  • The type specified by the selected definition is assigned to the element.
  • This is NOT a way to provide conditional types: types are just plain old DTD++ 1.0 (XML Schema) types.
• The #ERROR simple type
  • When we want to specify the non-validity of a condition, we assign the element the #ERROR type.
  • The #ERROR type is a non-satisfiable type, whose presence in the instance document always and automatically signals a validation error.
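The selection rule described on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the DTD++ 2.0 engine: conditions are plain Python callables standing in for XPath expressions, priorities are always explicit, the names `Definition` and `assign_type` are hypothetical, and returning `None` when no condition holds is a choice made here because the slides do not say what happens in that case.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, NamedTuple, Optional

ERROR = "#ERROR"  # the non-satisfiable type


class Definition(NamedTuple):
    condition: Callable[[ET.Element], bool]  # stand-in for an XPath condition
    priority: int
    type_name: str


def assign_type(elem: ET.Element, definitions: List[Definition]) -> Optional[str]:
    """Pick the type of the highest-priority definition whose condition holds."""
    holding = [d for d in definitions if d.condition(elem)]
    if not holding:
        return None  # assumption: behavior not specified in the slides
    return max(holding, key=lambda d: d.priority).type_name


# "Element x may have either the a attribute or the b attribute, but not both."
x_definitions = [
    Definition(lambda e: "a" in e.attrib and "b" in e.attrib, 1, ERROR),
    Definition(lambda e: True, 0, "myType"),  # empty condition: always holds
]

print(assign_type(ET.fromstring('<x a="1" b="2"/>'), x_definitions))
print(assign_type(ET.fromstring('<x a="1"/>'), x_definitions))
```

Note that the types themselves stay unconditional: only the choice of which type to assign depends on the instance.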
Examples
• Mutual exclusion
  • “Element x may have either the a attribute or the b attribute but not both”. Suppose we have defined a type myType with both a and b attributes as optional.
  <xsd:element name="x">
    <xsd:alt cond="(@a and @b)" type="xsd:error"/>
    <xsd:alt type="myType"/>
  </xsd:element>
  <!ELEMENT x "(@a and @b)" #ERROR>
  <!ELEMENT x "" @myType;>
• Data-dependent structures
  • “The element quantity must be an integer if the unit element is ‘items’, and it must be a decimal value if the unit element is ‘meters’”. Suppose we have already defined the data type for the unit element to only contain the values “meters” or “items”.
  <xsd:element name="quantity">
    <xsd:alt cond="../unit='items'" type="xsd:integer"/>
    <xsd:alt cond="../unit='meters'" type="xsd:decimal"/>
  </xsd:element>
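For comparison, the quantity/unit rule written as downstream application code might look like the sketch below. It is not how SchemaPath evaluates the condition: the regular expressions are simplified approximations of the xsd:integer and xsd:decimal lexical forms, the helper name `quantity_is_valid` is hypothetical, and it takes the parent element because `xml.etree.ElementTree` has no parent axis.

```python
import re
import xml.etree.ElementTree as ET

# Simplified lexical forms (assumptions, not the full XSD definitions).
INTEGER = re.compile(r"[+-]?[0-9]+")
DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")

# The type of quantity is chosen by the value of its sibling unit.
PATTERN_BY_UNIT = {"items": INTEGER, "meters": DECIMAL}


def quantity_is_valid(parent: ET.Element) -> bool:
    """Check quantity against the type selected by the sibling unit element."""
    unit = (parent.findtext("unit") or "").strip()
    quantity = (parent.findtext("quantity") or "").strip()
    pattern = PATTERN_BY_UNIT.get(unit)
    if pattern is None:
        return False  # unit is assumed to be restricted to items/meters
    return pattern.fullmatch(quantity) is not None


line = ET.fromstring("<line><unit>items</unit><quantity>2.5</quantity></line>")
print(quantity_is_valid(line))
```

The two `xsd:alt` alternatives replace all of this with a declaration that the schema validator itself can enforce.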
One possible solution to the W3C problems (1)
• XHTML
  • “a elements cannot contain other a elements” (appendix B)
  <!ELEMENT a ".//a" (#ERROR)>
  <!ELEMENT a "" (@inlineType;)>
• XSLT
  • “In a template element at least one of the match and name attributes must be present”
  <!ELEMENT template "not(@match) and not(@name)" (#ERROR) >
  <!ELEMENT template "" (@templateType;) >
  <!ENTITY @ templateType "%templateContent;" "match (#patternType;) name (#NCName;)">
One possible solution to the W3C problems (2)
• XML Schema
  • “An element definition must either contain a ref or a name attribute, but not both. Furthermore, if the name attribute is present, then the type attribute or one of the simpleType or complexType elements must be present, but not two.”
  <!ELEMENT simpleType (@localSimpleType;)>
  <!ELEMENT complexType (@localComplexType;)>
  <!ENTITY @ element "(simpleType|complexType)" "name (#NCName;) #IMPLIED ref (#QName;) #IMPLIED type (#QName;) #IMPLIED">
  <!ELEMENT element "@name and @ref":4 (#ERROR)>
  <!ELEMENT element "(@type or @ref) and (xsd:simpleType or xsd:complexType)":3 (#ERROR)>
  <!ELEMENT element "../xsd:schema and @ref":2 (#ERROR)>
  <!ELEMENT element "not(@ref) and not(@name)":1 (#ERROR)>
  <!ELEMENT element "":0 (@element;)>
The “Trojan Milestones” requirements
“1. the element must be empty exactly when its sID or eID attribute is set.
2. when eID is present, no other attributes are permitted.
3. each sID/eID value should occur only twice (once on sID and once on eID)
4. empty elements with matching sID and eID values should match up in proper pairs and in order.
Note that because of the second rule above, no attributes may be required for milestoneable elements. Schema languages that can make attributes optional or required depending on the presence of other attributes (in this case eID) do not suffer this problem.” [DeRose, Extreme Markup 2004]
A DTD++ 2.0 solution to the Trojan Milestones requirements
  <!ENTITY @ startMarker "EMPTY" "sID ID #REQUIRED %regularAtts;">
  <!ENTITY @ endMarker "EMPTY" "eID IDREF #REQUIRED">
  <!ELEMENT X "":0 %regularCM; >
  <!ATTLIST X "":0 %regularAtts;>
  <!ELEMENT X "@sID":2 @startMarker;>
  <!ELEMENT X "@sID = preceding::*/@sID":3 #ERROR>
  <!ELEMENT X "@eID=preceding::X/@sID":4 @endMarker;>
  <!ELEMENT X "@eID = preceding::*/@eID":3 #ERROR>
  <!ELEMENT X "@eID":2 #ERROR>
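As a cross-check of what these definitions are meant to enforce, the milestone rules can also be written as a single document-order pass. This is a sketch under assumptions, not the DTD++ 2.0 validator: the function name `milestones_ok` is hypothetical, only the "marker implies empty" direction of rule 1 is checked, and markers are allowed to overlap (an end marker only has to follow its own start marker).

```python
import xml.etree.ElementTree as ET


def milestones_ok(root: ET.Element) -> bool:
    """Check the sID/eID milestone rules over the whole tree in document order."""
    started, ended = set(), set()
    for elem in root.iter():  # iter() yields elements in document order
        sid, eid = elem.get("sID"), elem.get("eID")
        if sid is None and eid is None:
            continue  # a regular element, nothing to check here
        # Rule 1 (one direction only): a marker element must be empty.
        if len(elem) > 0 or (elem.text or "").strip():
            return False
        # Rule 2: when eID is present, no other attributes are permitted.
        if eid is not None and len(elem.attrib) > 1:
            return False
        if sid is not None:
            if sid in started:  # rule 3: each value starts only once
                return False
            started.add(sid)
        if eid is not None:
            # Rules 3 and 4: the end marker follows its start, and only once.
            if eid not in started or eid in ended:
                return False
            ended.add(eid)
    return started == ended  # every start marker has been closed


overlapping = ET.fromstring(
    '<t><X sID="a"/><X sID="b"/>text<X eID="a"/><X eID="b"/></t>'
)
print(milestones_ok(overlapping))
```

The two sets play the role of the `preceding::` conditions in the definitions above: a value seen twice as sID or twice as eID, or an eID with no earlier sID, is rejected.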
Implementation of the DTD++ 2.0 parser
• A DTD++ 2.0 validator exists and can be tested online at http://tesi.fabio.web.cs.unibo.it/dpp
• It consists of a Java application and a plain XML Schema validating engine (tested with Xalan and MS XML parsers)
• The application is a pre-processor to any XML Schema validator. Given an XML document X and a DTD++ document D:
  • It converts D into (one or more) equivalent SchemaPath files SP
  • It converts SP into a plain XML Schema file XS
  • It converts X into a different XML file X’, so that XS validates X’ if and only if SP validates X, and thus if and only if D validates X
… but who cares for DTD anyway?
This part is not in the published paper
• On July 21st, 2004 we ran a test on the relative speed and precision of DTD++ and XML Schema
• 14 volunteers (10M, 4F) were recruited, all 3rd and 4th year computer science students versed in both DTD and XML Schema (they had all passed with good marks both the Web Technologies exam and, specifically, the questions on DTDs and XML Schema)
• The volunteers were divided into two groups and given 15 questions. Half had to solve them using XML Schema, half using DTD++.
The test
• The 15 questions were identical in both tests, and covered:
  • Write XML: apply the rules from a schema and write valid XML fragments (5 questions)
  • Validate XML: apply the rules from a schema and find errors in XML fragments (5 questions)
  • Write Schemas: write a fragment of a schema given a plain-text description of the problem (5 questions)
A sample question
• Verify whether the fragment:
  <order>
    <to id="125">John Smith</to>
    <lines><line>
      <art>130</art>
      <description>Some nice stuff</description>
      <col>Red</col>
      <price>0,65</price>
      <quant>130</quant>
    </line></lines>
  </order>
is valid with respect to the following DTD++ fragment:
  <!ELEMENT order (to, lines) >
  <!ELEMENT to (#STRING)>
  <!ATTLIST to id ID #REQUIRED>
  <!ELEMENT lines (line+) >
  <!ELEMENT line (art, col, price, quant)>
  <!ELEMENT art (#PCDATA{,20}) >
  <!ENTITY # colors "(red | blue | green | yellow)" >
  <!ELEMENT col (#colors;) >
  <!ELEMENT quant (#INTEGER]0,]) >
  <!ELEMENT price (#DECIMAL]0,]) >
The results
• DTD++ was a clear winner in all categories
  • 36% faster on group A (Write XML)
  • 53% faster on group B (Validate XML)
  • Twice as fast (99%) on group C (Write Schemas)
• The question on the previous slide was answered in 0:01:33 on average with DTD++, and in 0:03:03 on average with XML Schema.
• Errors were slightly more frequent with DTD++ than with XML Schema (123%), but this might be due to the fact that the language was brand new.
• Of course the volunteers were very few, and the test might be considered non-significant, but it gives at least an initial approximate measure of the relative value of the two languages.
• An interesting note is that one of the volunteers converted the XML Schema into DTD fragments with textual annotations before answering each question.
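The two average times quoted for the sample question can be turned into relative figures in more than one way, and the slides do not say which definition was used for the percentages above. The sketch below computes two common ones from the quoted times only; the variable names are illustrative.

```python
def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an h:mm:ss duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


dtdpp = to_seconds(0, 1, 33)  # average time with DTD++
xsd = to_seconds(0, 3, 3)     # average time with XML Schema

# Fraction of the XML Schema time saved by using DTD++.
time_saved = (xsd - dtdpp) / xsd

# How many times longer the XML Schema answer took.
speed_ratio = xsd / dtdpp

print(f"time saved: {time_saved:.1%}")
print(f"speed ratio: {speed_ratio:.2f}x")
```

On this single question the XML Schema group needed roughly twice as long, whichever of the two measures is preferred.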
Demo
• A demo of the validating engine and the full results of the tests can be found at http://tesi.fabio.web.cs.unibo.it/dpp
• Time for a demo?
Conclusions
• DTDs are faster to learn and use
• XML Schema is powerful and expressive
• Schematron-like co-constraints are even more expressive
• Why learn three languages?
  • DTD++ 1.0 is semantically equivalent to a relevant subset of XML Schema
  • SchemaPath provides co-constraints with a very limited syntax and the new idea of conditional assignment of types (rather than conditional typing)
  • DTD++ 2.0 uses the same principle with a DTD-like syntax
• What now? Maybe ISO/IEC 19757 DSDL:
  • Part 5: Data types
  • Part 9: Data type- and namespace-aware DTDs
End of presentation