0360-569 Semantic Web (Winter 2007)
A Report on DTD vs XML Schema: A Practical Study
By: Bex, G. J., Neven, F., Bussche, J. V.
Presented By: Quazi Rahman, Titas Mutsuddi

Outline
• Introduction
• Structural View of DTDs and XSDs
• Dataset
• Expressiveness of XSDs
• Additional Features
• Regular Expression Characterization
• Schema and Ambiguity
• Errors
• Conclusion
• References
60-569

1. Introduction
• DTDs and XSDs are two widely used schema languages for describing the contents of XML documents.
• Although DTDs and XSDs differ syntactically, they are closely related on an abstract level.
• In this paper the authors present a comparative study of DTDs and XSDs. They try to answer two questions:
  • Which of the extra features (expressiveness) of XML Schema that are not allowed in DTDs are effectively used in practice?
  • How sophisticated are the structural properties (the nature of the regular expressions) of the two formalisms?

1. Introduction (cont’d): Definition of DTD and XSD
• Both Document Type Definitions (DTDs) and XML Schema Definitions (XSDs) state which tags and attributes are used to describe the elements in an XML document, where each tag is allowed, and which tags can appear within other tags.
• Applications use a document's DTD or XSD to properly read and display the document's contents.
• Changes in the format of the document can be made easily by modifying its DTD or XSD.

1. Introduction (cont’d): Merits and Demerits of DTD and XSD
• Shortcomings of DTDs
  • No support for namespaces
  • Limited support for data types
  • Limited support for cardinality
• Shortcomings of XSDs
  • They are more complex than DTDs
  • There are complaints about performance
• Merits of XSDs
  • XSDs are extensible to future additions
    • Reuse a schema in other schemas
    • Create new data types derived from the standard types
    • Reference multiple schemas in the same document
  • XSDs are richer and more powerful than DTDs

1. Introduction (cont’d): Merits of XSDs
• XSDs are written in XML
  • Don't have to learn a new language
  • Can use an XML editor to edit schema files
  • Can use an XML parser to parse schema files
  • Can transform a schema with XSLT
• XSDs support data types. It is easier to:
  • Describe allowable document content
  • Validate the correctness of data
  • Work with data from a database
  • Define data facets (restrictions on data)
  • Define data patterns (data formats)
  • Convert data between different data types
• XSDs support namespaces

2. Structural View of DTD and XSD
• An XML document may be viewed as a finite ordered tree structure.
• An example:
  <store>
    <dvd>
      <title>Amelie</title>
      <price>17</price>
    </dvd>
    <dvd>
      <title>Good bye, Lenin</title>
      <price>20</price>
      <discount>20%</discount>
    </dvd>
  </store>

2. Structural View of DTD and XSD (cont’d)
• Corresponding tree structure:
  store
    dvd
      title: “Amelie”
      price: “17”
    dvd
      title: “Good bye, Lenin”
      price: “20”
      discount: “20%”

2. Structural View of DTD and XSD (cont’d)
• A DTD describing the previous document:
  <!ELEMENT store (dvd+)>
  <!ELEMENT dvd (title, price, discount?)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT price (#PCDATA)>
  <!ELEMENT discount (#PCDATA)>
• For the tree above, assume that every node label is a member of some finite alphabet Σ.
• Definition 1. A DTD is a pair (d, s) where d is a function that maps Σ-symbols to regular expressions over Σ, and s ∈ Σ is the start symbol. A tree satisfies the DTD if its root is labeled by s and, for every node u with label a, the sequence a1…an of labels of its children matches the regular expression d(a).

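Definition 1 can be illustrated with a short Python sketch. This is only a minimal illustration, not the authors' tooling: the dictionary D, the helper names satisfies and valid, and the encoding of each content model as a Python regular expression over space-terminated child labels are all assumptions made here for the example.

```python
import re
import xml.etree.ElementTree as ET

# The DTD from the slide as a pair (d, s). Each content model is encoded as a
# Python regex over space-terminated child labels (an encoding assumed here).
D = {
    "store": r"(dvd )+",
    "dvd": r"title price (discount )?",
    "title": r"",      # #PCDATA: no element children
    "price": r"",
    "discount": r"",
}
S = "store"


def satisfies(node, d):
    """Check that every node's sequence of child labels matches d(label)."""
    if node.tag not in d:
        return False
    children = "".join(child.tag + " " for child in node)
    if re.fullmatch(d[node.tag], children) is None:
        return False
    return all(satisfies(child, d) for child in node)


def valid(xml_text, d=D, s=S):
    """A tree satisfies (d, s) if its root is labeled s and all nodes match."""
    root = ET.fromstring(xml_text)
    return root.tag == s and satisfies(root, d)


DOC = """<store>
  <dvd><title>Amelie</title><price>17</price></dvd>
  <dvd><title>Good bye, Lenin</title><price>20</price><discount>20%</discount></dvd>
</store>"""

print(valid(DOC))                # True
print(valid("<store></store>"))  # False: dvd+ requires at least one dvd
```

The sketch ignores attributes and text content, since Definition 1 only constrains element structure.
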
2. Structural View of DTD and XSD (cont’d)
• We can abstract a DTD by a set of rules of the form a → r, where a is an element and r is a regular expression over the alphabet of elements, for example:
  store → dvd+
  dvd → title price discount?
• Definition 2. A specialized DTD (SDTD) is a 4-tuple (Σ, Σ′, δ, μ), where Σ′ is an alphabet of types, δ is a DTD over Σ′, and μ is a mapping from Σ′ to Σ. Note that μ can be applied to a Σ′-tree as a relabeling of the nodes, thus yielding a Σ-tree. A Σ-tree t then satisfies the SDTD if t can be written as μ(t′), where t′ satisfies the DTD δ.

2. Structural View of DTD and XSD (cont’d)
• A simple example of an SDTD:
  store → (dvd1 + dvd2)* dvd2 (dvd1 + dvd2)*
  dvd1 → title price
  dvd2 → title price discount
• Here, dvd1 defines ordinary DVDs while dvd2 defines DVDs on sale. The rule for store specifies that there should be at least one of the latter.
• Definition 3. A single-type SDTD is an SDTD (Σ, Σ′, (d, s), μ) with the property that no regular expression d(a) has occurrences of types of the form bi and bj with the same b but different i and j.
• The example above is not a single-type SDTD, as both dvd1 and dvd2 occur in the rule for store.

2. Structural View of DTD and XSD (cont’d)
• An example of a single-type grammar is given below:
  store → regulars discounts
  regulars → (dvd1)*
  discounts → dvd2 (dvd2)*
  dvd1 → title price
  dvd2 → title price discount
• Although there are still two element definitions dvd1 and dvd2, they can only occur in different contexts, regulars and discounts respectively.

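The single-type condition of Definition 3 can be checked mechanically. The following Python sketch is one possible way to do so under stated assumptions: rules are given as plain strings, the mapping μ is the dictionary MU, and the names is_single_type, FIRST and SECOND are hypothetical, introduced only for this illustration.

```python
import re


def is_single_type(rules, mu):
    """True if no rule mentions two different types mapped to the same element name."""
    for rhs in rules.values():
        seen = {}  # element name -> the type used for it in this rule
        for type_name in re.findall(r"[A-Za-z_]\w*", rhs):
            element = mu[type_name]
            if seen.setdefault(element, type_name) != type_name:
                return False
    return True


# The relabeling mu from types to element names (dvd1 and dvd2 are both "dvd").
MU = {
    "store": "store", "regulars": "regulars", "discounts": "discounts",
    "dvd1": "dvd", "dvd2": "dvd",
    "title": "title", "price": "price", "discount": "discount",
}

# The first SDTD example: dvd1 and dvd2 both occur in the rule for store.
FIRST = {
    "store": "(dvd1 + dvd2)* dvd2 (dvd1 + dvd2)*",
    "dvd1": "title price",
    "dvd2": "title price discount",
}

# The single-type grammar: dvd1 and dvd2 occur only in different contexts.
SECOND = {
    "store": "regulars discounts",
    "regulars": "(dvd1)*",
    "discounts": "dvd2 (dvd2)*",
    "dvd1": "title price",
    "dvd2": "title price discount",
}

print(is_single_type(FIRST, MU))   # False
print(is_single_type(SECOND, MU))  # True
```
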
2. Structural View of DTD and XSD (cont’d)
• The rule store → (dvd1 + dvd2)* dvd2 (dvd1 + dvd2)* of the SDTD example may be written as an XSD fragment:
  <xs:element name="store">
    <xs:complexType>
      <xs:sequence>
        <xs:choice minOccurs="0" maxOccurs="unbounded">
          <xs:element name="dvd" type="dvd1"/>
          <xs:element name="dvd" type="dvd2"/>
        </xs:choice>
        <xs:element name="dvd" type="dvd2"/>
        <xs:choice minOccurs="0" maxOccurs="unbounded">
          <xs:element name="dvd" type="dvd1"/>
          <xs:element name="dvd" type="dvd2"/>
        </xs:choice>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

3. Dataset
• The authors gathered a representative sample of DTDs and XSDs for this comparative study, mostly from the online source xml.coverpages.org.
• They obtained 109 DTDs and 93 XSDs for this study.

4. Expressiveness of XSDs: Single-Type
• The authors tried to find out whether the expressive power of single-type SDTDs is actually used in real-world XSDs.
• Most XSDs define local tree languages, that is, languages that can already be defined by DTDs.
• Only 5 of the 30 XSDs used in this analysis (reported as 15%, although 5 of 30 is closer to 17%) are true single-type SDTDs.
• All five XSDs were of the form:
  p → …a1…
  q → …a2…
  a1 → expr1
  a2 → expr2
  which means: when the parent of an a is p (or q), use the rule for a1 (or a2).

4. Expressiveness of XSDs (cont’d): Derived Types
• XML Schema provides two kinds of types: simple and complex types.
• A simple type describes the character data an element can contain (like #PCDATA in DTDs).
• A complex type specifies which elements may occur as children of a given element.
• In XSDs, new types may be derived from existing types using two mechanisms:
  • Extension
  • Restriction

4. Expressiveness of XSDs (cont’d): Derived Types
• A simple type can be extended to a complex type to add attributes to elements.
• A complex type can be extended to add a sequence of additional elements to its content model or to add attributes.
• A simple type can be restricted to limit the acceptable range of values for that type.
• A complex type can be restricted to limit the set of acceptable sub-trees.

Table 1: Relative use of derivation features in XSDs
              Simple type (%)   Complex type (%)
Extension     27                37
Restriction   73                7

4. Expressiveness of XSDs (cont’d): Derived Types
Out of the 93 XSDs considered:
• Approximately one fifth (20%) do not construct new types through derivation at all.
• Extension is used to define additional attributes in 58%, and to add new elements to a content model in 42%.
• Restriction of complex types is used in only 7%.
• Note that only 37% use extension of complex types, which parallels inheritance in object-oriented programming.
• Extension of simple types occurs in 27% of XSDs.
• Restriction of simple types is the most heavily used feature (73%), which reflects a shortcoming of DTDs, where only the unrestricted #PCDATA is available.

4. Expressiveness of XSDs (cont’d): Derived Types
• 6 XSDs use the feature of finalizing a type definition, that is, using an attribute that specifies that the type can be neither restricted nor extended.
• 11 XSDs use abstract type definitions, from which new types must be derived.
• A derived type can occur anywhere in the content model where the original type is allowed, but this can be prevented by applying the block attribute to the original type. 2 XSDs use this blocking feature.
• The fixed attribute is used to indicate that an element or attribute is restricted to a specific value. Only a single XSD uses this feature.
• With the substitutionGroup feature, the name of an element can be substituted by another name. This feature is used by 10 XSDs.

5. Additional Features
• The &-operator, which specifies that all elements must occur but that their order is not significant, was available in SGML DTDs but was lost in XML DTDs (a1 & a2 & a3 ≡ a1a2a3 | a1a3a2 | … | a3a2a1). In XSDs this feature is restored by the xsd:all element. Only 4 XSDs used this operator.
• Elements of an XML document can be identified using an ID attribute and referred to by IDREF or IDREFS (also supported by DTDs). IDs are unique throughout the document. Only 6 XSDs used this feature.
• Referring to elements can also be accomplished by key/keyref pairs. Using a reference to a key implies that the element with the corresponding key must exist in the document. This is used by 4 XSDs.
• One important feature of XSDs is the use of namespaces. This makes it possible to use elements and types in the current XSD that are defined elsewhere. Apart from the obvious inclusion of the XML Schema namespace, 20 XSDs used this feature.

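To make the equivalence for the &-operator concrete, here is a small Python sketch that expands a1 & a2 & … & an into the disjunction of all orderings. The function name interleave_all is hypothetical and chosen only for this illustration.

```python
from itertools import permutations


def interleave_all(symbols):
    """Expand a1 & a2 & ... & an into the equivalent disjunction of sequences."""
    return " | ".join("".join(order) for order in permutations(symbols))


print(interleave_all(["a1", "a2", "a3"]))
# a1a2a3 | a1a3a2 | a2a1a3 | a2a3a1 | a3a1a2 | a3a2a1
```

The expansion has n! alternatives, which suggests why a dedicated operator is convenient once more than a few elements are involved.
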
6. Regular Expression Characterization
• The second question the authors tried to answer is how sophisticated regular expressions tend to be in real-world DTDs and XSDs.
• For this analysis, the authors had to perform some preprocessing on the schemas:
  • DTD element definitions were converted to a canonical form that keeps only the structural information. For example, <!ELEMENT lib ((book | journal)*)> was converted to (c1 | c2)*.
  • XSDs were converted to the same canonical form using XSLT.
• For DTDs, a total of 11802 element definitions was reduced to 750 canonical forms, and for XSDs, a total of 1016 element definitions was reduced to 138 canonical forms, giving a total of 838 for both types of schema.

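The canonicalization step for DTDs can be approximated in a few lines of Python. This is a sketch under assumptions: the authors' exact procedure is not described on the slide, so the renaming by first occurrence, the removal of redundant outer parentheses, and the names canonical_form and strip_outer_parens are choices made here for illustration.

```python
import re

# An element name that is not part of a keyword such as #PCDATA.
NAME = re.compile(r"(?<!#)\b[A-Za-z_][\w.\-]*")


def strip_outer_parens(expr):
    """Drop parentheses that wrap the whole expression, e.g. ((a|b)*) -> (a|b)*."""
    while expr.startswith("(") and expr.endswith(")"):
        depth = 0
        for i, ch in enumerate(expr):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0 and i < len(expr) - 1:
                    return expr  # the first "(" closes before the end
        expr = expr[1:-1].strip()
    return expr


def canonical_form(element_decl):
    """Rename element names in a content model to c1, c2, ... by first occurrence."""
    m = re.fullmatch(r"\s*<!ELEMENT\s+\S+\s+(.*?)\s*>\s*", element_decl, re.DOTALL)
    if m is None:
        raise ValueError("not an <!ELEMENT ...> declaration")
    names = {}

    def rename(match):
        name = match.group(0)
        if name in ("EMPTY", "ANY"):
            return name  # keywords are kept as they are
        return names.setdefault(name, "c%d" % (len(names) + 1))

    return strip_outer_parens(NAME.sub(rename, m.group(1)))


print(canonical_form("<!ELEMENT lib ((book | journal)*)>"))        # (c1 | c2)*
print(canonical_form("<!ELEMENT dvd (title, price, discount?)>"))  # c1, c2, c3?
```

Two element definitions with the same shape but different element names map to the same string, which is how many definitions can collapse into far fewer canonical forms.
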
6. Regular Expression Characterization (cont’d)
• Definition 4. A base symbol is a regular expression a, a?, or a*, where a ∈ Σ; a factor is of the form e, e?, or e*, where e is a disjunction of base symbols. A simple regular expression is ε, ∅, or a sequence of factors, such as (a*+b*)(a+b)?b*(a+b)*.
• The authors introduce a uniform syntax to denote subclasses of simple regular expressions by specifying the allowed factors. They distinguish base symbols extended by ? or *. Further, they distinguish between factors with one disjunct and factors with arbitrarily many disjuncts; the latter are denoted by (+…). Finally, factors can again be extended by * or ?. For example, they write RE((+a)*, a?) for the set of regular expressions e1…en where every ei is (a1+…+an)* for some a1,…,an ∈ Σ and n ≥ 1, or a? for some a ∈ Σ.

6. Regular Expression Characterization (cont’d)
• The following table lists the possible factors in simple regular expressions and how they are denoted (a, a1, …, an ∈ Σ).

Table 2
Factor               Abbr.
a                    a
a*                   a*
a?                   a?
(a1 + … + an)        (+a)
(a1 + … + an)*       (+a)*
(a1 + … + an)?       (+a)?
(a1* + … + an*)      (+a*)
(a1* + … + an*)*     (+a*)*

6. Regular Expression Characterization (cont’d)
• The authors analyzed the DTDs and XSDs to characterize their content models according to the subclasses defined above.
• The result is presented in Table 3, which lists the non-overlapping categories of expressions having a significant population (more than 0.5%).
• There are two major differences between DTDs and XSDs:
  • XSDs have more simple-type elements (#PCDATA). This may be because XSD introduces more distinct simple types, which makes it possible to fine-tune the specification of an element’s content.
  • XSDs have fewer expressions in the category RE(a, (+a)*). This is most probably due to the nature of the XSDs in the sample, in which schemas describing data are overrepresented with respect to those describing meta documents.

6. Regular Expression Characterization (cont’d)

Table 3: Relative occurrence of various types of regular expressions, given in % of element definitions
                           DTDs (%)   XSDs (%)
#PCDATA                    34         48
EMPTY                      16         10
ANY                        1          0
RE(a)                      5          5
RE(a, a?)                  2          10
RE(a, a*)                  8          10
RE(a, a?, a*)              1          4
RE(a, (+a))                3          3
RE(a, (+a)?)               0          1
RE(a, (+a)*)               20         2
RE(a, (+a)?, (+a)*)        0          1
RE(a, (+a*)*)              0          2
Total simple expressions   92         97
Non-simple expressions     8          3

6. Regular Expression Characterization (cont’d)
• The authors compared DTDs and XSDs using several measures but did not observe any significant differences between them. More importantly, the comparisons make clear that the vast majority of expressions are simple, both in DTDs (92%) and in XSDs (97%).
• Some of the comparisons they carried out are:
  • Density
  • Width and depth of canonical forms
  • Simple content models
  • Star height

6. Regular Expression Characterization (cont’d)
• The density of a schema is defined as the number of elements occurring in the right-hand sides of its rules divided by the number of elements.

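As a worked example of the density measure, the Python sketch below computes it for the store DTD used earlier. It assumes that every occurrence of an element on a right-hand side is counted (the slide does not say whether repeated occurrences count once or several times), and the names density and RULES are hypothetical.

```python
import re


def density(rules):
    """Element occurrences on all right-hand sides divided by the number of elements."""
    occurrences = sum(len(re.findall(r"[A-Za-z_]\w*", rhs)) for rhs in rules.values())
    return occurrences / len(rules)


# The store DTD in rule form; #PCDATA elements have an empty right-hand side.
RULES = {
    "store": "dvd+",
    "dvd": "title price discount?",
    "title": "",
    "price": "",
    "discount": "",
}

print(density(RULES))  # 0.8, i.e. 4 occurrences over 5 elements
```
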
6. Regular Expression Characterization (cont’d)
• A table on this slide (not reproduced here) relates the fraction of DTDs and XSDs to the fraction of their content models that are simple: the majority of schemas have 90% or more simple content models.

6. Regular Expression Characterization (cont’d)
• The star height of a regular expression is the maximum nesting depth of Kleene stars occurring in the expression. Content models with star height larger than 1 are very rare.
• In DTDs, the larger share of expressions with star height 1 is due to the abundance of RE(a, (+a)*) expressions in DTDs compared with XSDs.

Table 4: Star height observed in DTDs and XSDs
Star height   DTDs (%)   XSDs (%)
0             61         78
1             38         17
2             1          4
3             0          0

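The star height definition above can be turned into a short Python sketch. It assumes DTD-style syntax (| for choice, an optional comma for sequence, postfix *, ? and +) and treats the postfix + as a Kleene star, since a+ abbreviates aa*; both are assumptions of this illustration, and the name star_height is hypothetical.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z_#][\w.\-]*|[()|,*?+])")


def star_height(model):
    """Maximum nesting depth of * (and postfix +) in a DTD-style content model."""
    stack = [0]  # highest star height seen so far at each parenthesis level
    last = 0     # star height of the most recently completed atom or group
    for tok in TOKEN.findall(model):
        if tok == "(":
            stack.append(0)
        elif tok == ")":
            last = stack.pop()
            stack[-1] = max(stack[-1], last)
        elif tok in ("*", "+"):
            last += 1
            stack[-1] = max(stack[-1], last)
        elif tok in ("?", "|", ","):
            continue
        else:
            last = 0  # a plain element name has star height 0
    return stack[0]


print(star_height("title, price, discount?"))  # 0
print(star_height("(c1 | c2)*"))               # 1
print(star_height("(c1*, c2)*"))               # 2
```
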
7. Schema and Ambiguity
• The XML 1.0 specification by the W3C requires content models to be deterministic, or one-unambiguous.
• The authors checked whether the DTDs and XSDs in the study respect this requirement using IBM’s XML Schema Quality Checker (SQC).
• The authors found that almost all of them follow the rule.
• Only 3 of the 93 XSDs have one or more ambiguous content models, of two canonical forms: c1?(c1|c2)* and (c1c2)|(c1c3).

7. Schema and Ambiguity (cont’d)
• For DTDs, the first exception is a regular expression of the type (… | ci | … | ci | …)*. The authors consider this to be a typo rather than a design feature.
• The second type of ambiguous regular expression is of the type c1c2?c2?. The designer’s intention was clearly to state that c2 may occur zero, one or two times.
• This illustrates a shortcoming of DTDs that has been addressed in XSDs, as in the following example:
  <xsd:sequence>
    <xsd:element name="c1" type="t1"/>
    <xsd:element name="c2" type="t2" minOccurs="0" maxOccurs="2"/>
  </xsd:sequence>

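The ambiguous forms discussed in this section can be detected with a standard position-based test: number every element occurrence, compute which positions can start the expression and which can follow each position, and report ambiguity when two different positions carrying the same element name compete. The Python sketch below follows that idea; it is not how SQC works internally (that is not described here), it assumes DTD-style syntax with well-formed input, and the name is_deterministic is hypothetical.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z_][\w.\-]*|[()|,*?+])")


def is_deterministic(model):
    """Position-based test for one-unambiguity of a DTD-style content model."""
    tokens = TOKEN.findall(model)
    pos = 0
    symbol = []  # symbol[i] = element name at position i
    follow = []  # follow[i] = positions that may come directly after position i

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def choice():
        nullable, first, last = sequence()
        while peek() == "|":
            advance()
            n2, f2, l2 = sequence()
            nullable, first, last = nullable or n2, first | f2, last | l2
        return nullable, first, last

    def sequence():
        nullable, first, last = True, set(), set()
        while peek() not in (None, "|", ")"):
            if peek() == ",":
                advance()
                continue
            n2, f2, l2 = postfix()
            for p in last:
                follow[p] |= f2
            first = first | f2 if nullable else first
            last = last | l2 if n2 else l2
            nullable = nullable and n2
        return nullable, first, last

    def postfix():
        nullable, first, last = atom()
        while peek() in ("*", "?", "+"):
            op = peek()
            advance()
            if op in ("*", "+"):  # repetition: last positions loop back to first
                for p in last:
                    follow[p] |= first
            if op in ("*", "?"):
                nullable = True
        return nullable, first, last

    def atom():
        tok = peek()
        advance()
        if tok == "(":
            result = choice()
            advance()  # skip the closing ")"
            return result
        symbol.append(tok)
        follow.append(set())
        p = len(symbol) - 1
        return False, {p}, {p}

    def clash(positions):
        names = [symbol[p] for p in positions]
        return len(names) != len(set(names))

    _, first, _ = choice()
    return not clash(first) and not any(clash(f) for f in follow)


print(is_deterministic("c1, c2?, c2?"))         # False
print(is_deterministic("c1?, (c1 | c2)*"))      # False
print(is_deterministic("(c1, c2) | (c1, c3)"))  # False
print(is_deterministic("c1, (c2, c2?)?"))       # True
```

The last call shows a deterministic way to say "c2 zero, one or two times" within DTD syntax, which is what minOccurs="0" maxOccurs="2" expresses directly in the XSD example.
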
8. Errors
• The authors found errors in some of the XSDs they retrieved.
• Only 30 of the 93 XSDs passed the conformance test by SQC, that is, complied with the W3C specification.
• 19 XSDs were designed according to a version of the specification older than the 2001 version.
• Some simple types were omitted or added from one version of the specification to another, causing SQC to report errors.
• Some errors concern violations of the Datatypes part of the specification, such as a regular expression wrongfully restricting xsd:string.
• Some XSDs violate the specification by specifying a type attribute for a complexType element, or by leaving out the name attribute for a top-level complexType element.

9. Conclusion
• Many features defined in the XML Schema specification are not widely used yet, especially those related to object-oriented data modeling, such as derivation of complex types by extension.
• The expressive power of the XSDs under investigation is almost equivalent to that of DTDs, which means that, disregarding some exceptions, these XSDs could just as well have been written as DTDs. This might show that the level of sophistication offered by XSDs is not necessary for most applications, at least until now.

9. Conclusion (cont’d)
• The data type part of the XML Schema specification is heavily used, since it alleviates a major shortcoming of DTDs, namely the inability to specify the format and type of the text of an element. In XSDs this is accomplished by restricting a simple type. Example:
  <xs:element name="letter">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="[a-z]"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
• The content models specified in both DTDs and XSDs tend to be very simple. For XSDs, 97% of all content models can be classified as simple expressions.

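The effect of the pattern restriction in the letter example can be mimicked in Python. This is only an illustration: XSD patterns must match the whole value, which corresponds to re.fullmatch here, and while the XSD and Python regular expression dialects differ in general, they agree for the simple class [a-z]. The function name matches_letter is hypothetical.

```python
import re


def matches_letter(value):
    """Mimic the xs:pattern facet [a-z]: the whole value must match the pattern."""
    return re.fullmatch(r"[a-z]", value) is not None


print(matches_letter("q"))   # True
print(matches_letter("Q"))   # False: uppercase is outside [a-z]
print(matches_letter("ab"))  # False: exactly one letter is allowed
```

A DTD can only declare letter as #PCDATA, so all three values would be accepted there.
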
10. References
• Bex, G. J., Neven, F. and Bussche, J. V., DTDs versus XML Schema: A Practical Study, in Proceedings of the Seventh International Workshop on the Web and Databases, WebDB 2004, pages 79-84, Maison de la Chimie, Paris, France, June 17-18, 2004.
• http://www.webopedia.com/TERM/D/DTD.html
• http://searchwebservices.techtarget.com/sDefinition/0,,sid26_gci831325,00.html
• http://en.wikipedia.org/wiki/XML_Schema
• http://www.w3schools.com/schema/default.asp
• http://www.w3schools.com/dtd/dtd_intro.asp
• IBM Corp., XML Schema Quality Checker, 2003, http://www.alphaworks.ibm.com/tech/xmlsqc
• R. Cover, The Cover Pages, 2003, http://xml.coverpages.org/
• P. Biron and A. Malhotra, XML Schema Part 2: Datatypes, W3C, May 2001, http://www.w3.org/TR/xmlschema-2/
• http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-01-02/03-01-02.pdf

Thank you.