What Are Real DTDs Like Group Members: Xijie Zeng, Peiyu Cai Presenter: Xijie Zeng
Outline • Overview • Introduction • Local properties • Global properties
Overview • XML is widely used in a variety of areas • DTDs with different structures define XML for different uses • This talk presents a survey of a number of real-world DTDs
Introduction • The DTDs come from the XML.org DTD repository • Three DTD categories: • app: describe objects interchanged between programs/applications • data: describe data stored in databases • meta: describe the structure of document markup • 60 DTDs in total: 7 are app, 13 are data, 40 are meta
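Introduction (cont.) • A DTD can be described as a collection of element declarations of the form e → α, where e is the element name and α is the content model. The content model is given by the grammar α ::= ε | pcdata | e | α, α | α | α | α* | α+ | α?, that is: the empty model, text, an element name, a sequence, a choice, and the repetitions *, + and ?.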
Introduction (cont.) Email DTD
<!ELEMENT email (head, body)>
<!ELEMENT head (from, to+, cc*, subject)>
<!ELEMENT from EMPTY>
<!ATTLIST from name CDATA #IMPLIED address CDATA #REQUIRED>
<!ELEMENT to EMPTY>
<!ATTLIST to name CDATA #IMPLIED address CDATA #REQUIRED>
<!ELEMENT cc EMPTY>
<!ATTLIST cc name CDATA #IMPLIED address CDATA #REQUIRED>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT body (text, attachment*)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT attachment EMPTY>
<!ATTLIST attachment encoding (mime|binhex) "mime" file CDATA #REQUIRED>
Abstracted form:
email → (head, body)
head → (from, to+, cc*, subject)
from → (ε)
to → (ε)
cc → (ε)
subject → (pcdata)
body → (text, attachment*)
text → (pcdata)
attachment → (ε)
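The step from `<!ELEMENT ...>` declarations to abstract (element name, content model) pairs can be sketched with a regular expression. This is only a minimal illustration, not the tooling used in the survey: the names `ELEMENT_DECL` and `abstract_dtd` are hypothetical, and the sketch assumes a DTD without comments, parameter entities or conditional sections.

```python
import re

# Matches one element declaration: <!ELEMENT name content-model>
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\S+)\s+(.*?)\s*>", re.S)


def abstract_dtd(text):
    """Map each declared element name to its abstracted content model."""
    models = {}
    for name, model in ELEMENT_DECL.findall(text):
        # Attributes (<!ATTLIST ...>) are ignored, as in the survey.
        model = model.replace("#PCDATA", "pcdata")
        if model == "EMPTY":
            model = "(ε)"
        models[name] = model
    return models


sample = """
<!ELEMENT email (head, body)>
<!ELEMENT from EMPTY>
<!ATTLIST from name CDATA #IMPLIED address CDATA #REQUIRED>
<!ELEMENT subject (#PCDATA)>
"""
for element, content_model in abstract_dtd(sample).items():
    print(element, content_model)
```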
Introduction (cont.) • Local properties Describe content models in individual element declarations • Global properties Describe the graph-theoretic structure of the whole DTD
Local properties • Content model classification • (1) pcdata • (2) ε • (3) any: no restriction on subelements • (4) Mixed content. Example: body → (text, attachment*), text → (pcdata); slide callout: body1 → (pcdata, attachment*) • (5) “|” only, but not mixed content • (6) “,” only • (7) Complex content: contains both “|” and “,”, e.g. directory → (dirname, dirinfo?, dirdesc?, (file | directory)*) • (8) List: α* or α+ • (9) Single: α?
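The nine classes can be approximated by a purely textual test on the abstracted content model. The sketch below is a heuristic under stated assumptions, not the survey's classifier: the function name `classify`, the class labels, and the precedence between overlapping classes (for instance, treating `(a|b)*` as "choice-only" rather than "list", and a bare element name as "single") are all assumptions.

```python
def classify(model):
    """Heuristically classify an abstracted content model string."""
    m = model.replace(" ", "").replace("#", "").lower()
    bare = m.replace("(", "").replace(")", "")  # drop grouping parentheses
    if bare == "pcdata":
        return "pcdata"
    if bare in ("", "ε", "empty"):
        return "empty"
    if bare == "any":
        return "any"
    if "pcdata" in bare:
        return "mixed"  # text together with other content
    has_choice, has_sequence = "|" in bare, "," in bare
    if has_choice and has_sequence:
        return "complex"
    if has_choice:
        return "choice-only"
    if has_sequence:
        return "sequence-only"
    if bare.endswith(("*", "+")):
        return "list"
    return "single"  # assumption: covers both e? and a bare element name


for cm in ["(pcdata)", "(from, to+, cc*, subject)",
           "(dirname, dirinfo?, dirdesc?, (file | directory)*)",
           "(pcdata, attachment*)", "attachment*", "cc?"]:
    print(cm, "->", classify(cm))
```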
Local properties (cont.) • Content model classification
Local properties (cont.) • Syntactic complexity
depth(ε) = 0;
depth(e) = depth(pcdata) = 1;
depth(α*) = depth(α+) = depth(α?) = depth(α) + 1;
depth(α1, α2, …, αn) = depth(α1 | α2 | … | αn) = max(depth(αi)) + 1;
Local properties (cont.) • An example: head → (from, to+, cc*, subject). depth(from, to+, cc*, subject) = depth(cc*) + 1 = depth(cc) + 1 + 1 = 1 + 1 + 1 = 3
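The depth measure is a straightforward recursion over the content model. The sketch below assumes a small tuple encoding of content models that is not part of the slides: `("eps",)`, `("pcdata",)`, `("elem", name)`, `("star" | "plus" | "opt", child)` and `("seq" | "alt", [children])`.

```python
def depth(node):
    """Syntactic depth of a content model in the assumed tuple encoding."""
    kind = node[0]
    if kind == "eps":
        return 0
    if kind in ("elem", "pcdata"):
        return 1
    if kind in ("star", "plus", "opt"):
        return depth(node[1]) + 1
    if kind in ("seq", "alt"):
        return max(depth(child) for child in node[1]) + 1
    raise ValueError(f"unknown node kind: {kind}")


# head -> (from, to+, cc*, subject)
head = ("seq", [
    ("elem", "from"),
    ("plus", ("elem", "to")),
    ("star", ("elem", "cc")),
    ("elem", "subject"),
])
print(depth(head))  # 3, matching the worked example
```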
Local properties (cont.) • Determinism: a content model is deterministic if it does not require lookahead when parsing. Non-deterministic content model: (a, b) | (a, c). Deterministic content model: a, (b | c) • Result: the survey detects 5 non-deterministic content models in 4 DTDs.
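The slides do not say how determinism was tested. One common formalization, assumed here, is that a content model is deterministic when no two distinct occurrences of the same element name compete at any point, which can be checked on the first and follow sets of the model's positions (the Glushkov construction). The tuple encoding and the names `glushkov`, `has_clash` and `is_deterministic` are illustrative only.

```python
from itertools import count


def glushkov(node, ids, follow):
    """Return (nullable, first, last) for node and fill the follow sets."""
    kind = node[0]
    if kind == "eps":
        return True, set(), set()
    if kind == "elem":
        pos = (next(ids), node[1])  # a numbered occurrence of a name
        follow[pos] = set()
        return False, {pos}, {pos}
    if kind in ("star", "plus", "opt"):
        nullable, first, last = glushkov(node[1], ids, follow)
        if kind != "opt":  # repetition: the end loops back to the start
            for pos in last:
                follow[pos] |= first
        return nullable or kind != "plus", first, last
    parts = [glushkov(child, ids, follow) for child in node[1]]
    if kind == "alt":
        return (any(n for n, _, _ in parts),
                set().union(*(f for _, f, _ in parts)),
                set().union(*(l for _, _, l in parts)))
    if kind == "seq":
        nullable, first, last = True, set(), set()
        for n, f, l in parts:
            for pos in last:
                follow[pos] |= f
            if nullable:
                first |= f
            last = (last | l) if n else set(l)
            nullable = nullable and n
        return nullable, first, last
    raise ValueError(f"unknown node kind: {kind}")


def has_clash(positions):
    """True if two distinct positions carry the same element name."""
    names = [name for _, name in positions]
    return len(names) != len(set(names))


def is_deterministic(model):
    follow = {}
    _, first, _ = glushkov(model, count(), follow)
    return not has_clash(first) and not any(has_clash(s) for s in follow.values())


a, b, c = ("elem", "a"), ("elem", "b"), ("elem", "c")
non_det = ("alt", [("seq", [a, b]), ("seq", [a, c])])  # (a, b) | (a, c)
det = ("seq", [a, ("alt", [b, c])])                    # a, (b | c)
print(is_deterministic(non_det), is_deterministic(det))
```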
Local properties (cont.) • Ambiguity. Definition: an expression R is ambiguous if and only if there exists some string s in the language of R that can be parsed in more than one way. Example: partner → (name?, onetime?, partnrid?, partnrtype?, syncind?, name*, parentid?, partnridx?, partnrratg*), where a single name can match either name? or name* • Result: the survey detects 2 ambiguous content models.
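For a given witness string, ambiguity can be illustrated by counting the distinct parses. The sketch below is a brute-force counter, not a decision procedure for ambiguity and not the method used in the survey. It assumes a hashable tuple encoding (`("elem", name)`, `("opt" | "star", child)`, `("seq" | "alt", (children...))`), supports only those operators, and lets each iteration of `*` consume at least one symbol so that the count stays finite.

```python
from functools import lru_cache


def count_parses(model, word):
    """Number of distinct ways `word` (a list of names) matches `model`."""
    word = tuple(word)

    @lru_cache(maxsize=None)
    def ways(node, i, j):
        kind = node[0]
        if kind == "elem":
            return int(j - i == 1 and word[i] == node[1])
        if kind == "opt":
            return int(i == j) + ways(node[1], i, j)
        if kind == "star":
            if i == j:
                return 1  # zero iterations
            return sum(ways(node[1], i, k) * ways(node, k, j)
                       for k in range(i + 1, j + 1))
        if kind == "alt":
            return sum(ways(child, i, j) for child in node[1])
        if kind == "seq":
            counts = {i: 1}  # counts[k]: ways the children so far match word[i:k]
            for child in node[1]:
                nxt = {}
                for k, c in counts.items():
                    for m in range(k, j + 1):
                        w = ways(child, k, m)
                        if w:
                            nxt[m] = nxt.get(m, 0) + c * w
                counts = nxt
            return counts.get(j, 0)
        raise ValueError(f"unknown node kind: {kind}")

    return ways(model, 0, len(word))


def opt(name):
    return ("opt", ("elem", name))


def star(name):
    return ("star", ("elem", name))


partner = ("seq", (opt("name"), opt("onetime"), opt("partnrid"),
                   opt("partnrtype"), opt("syncind"), star("name"),
                   opt("parentid"), opt("partnridx"), star("partnrratg")))
print(count_parses(partner, ["name"]))
```

For the partner model, the one-symbol string `name` has two parses (via `name?` or via `name*`), which is what makes the model ambiguous.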
Global properties • Reachability. Definition: an element name e’ is reachable from e, denoted by e ⇒ e’, if either e → α and e’ occurs in α, or e ⇒ e” and e” ⇒ e’. An example: with email → (head, body) and head → (from, to+, cc*, subject), we have email ⇒ head and head ⇒ subject, hence email ⇒ subject. Definition: an element name e is reachable if r ⇒ e, where r is the name of the root element. Otherwise the element name e is called unreachable or useless.
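Reachability is a plain graph search over the "occurs in the content model of" relation. The sketch below assumes the DTD has already been reduced to a dictionary from element name to the names occurring in its content model; the element `signature` is a hypothetical addition used only to show an unreachable name, and treating the root itself as reachable is an assumption.

```python
from collections import deque


def reachable(dtd, root):
    """Element names reachable from root (root included by convention)."""
    seen = {root}
    queue = deque([root])
    while queue:
        element = queue.popleft()
        for child in dtd.get(element, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


email_dtd = {
    "email": ["head", "body"],
    "head": ["from", "to", "cc", "subject"],
    "body": ["text", "attachment"],
    "from": [], "to": [], "cc": [], "subject": [],
    "text": [], "attachment": [],
    "signature": [],  # hypothetical element, declared but never used
}
unreachable = set(email_dtd) - reachable(email_dtd, "email")
print(sorted(unreachable))
```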
Global properties (cont.) • Reachability Unreachable element names in DTDs
Global properties (cont.) • Recursions. Definition: a content model α is derivable from an element name e, denoted by e ⇒ α, if either e → α, or e ⇒ α’, e’ → α”, and α = α’[e’/α”], where α’[e’/α”] denotes the content model obtained by substituting α” for all occurrences of e’ in α’. An example: with email → (head, body) and head → (from, to+, cc*, subject), substituting for head gives email ⇒ (from, to+, cc*, subject, body). Definition: a DTD is recursive if and only if it has an element name e such that e ⇒ e and e is reachable.
Global properties (cont.) • Recursions. Definition: a DTD is linear recursive if and only if it is recursive and, for any reachable element name e and any e ⇒ α, e occurs at most once in α and the occurrence is not enclosed in “*” or “+”. A DTD is said to be non-linear recursive if it is recursive but not linear recursive. An example of non-linear recursion: directory → (dirname, dirinfo?, dirdesc?, (file | directory)*). An example of linear recursion: e → (pcdata | e) • Result: no linear recursive DTD is found in the sample DTDs. There are 7, 2 and 26 non-linear recursive DTDs in the app, data and meta categories respectively.
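Detecting whether a DTD is recursive amounts to finding a reachable element that can reach itself. The sketch below covers only that test; it does not distinguish linear from non-linear recursion, which would also require looking at how often and under which operators the element reoccurs. The dictionary encoding and the names `descendants` and `is_recursive` are assumptions.

```python
def descendants(dtd, start):
    """Element names reachable from start in one or more steps."""
    seen, stack = set(), list(dtd.get(start, ()))
    while stack:
        element = stack.pop()
        if element not in seen:
            seen.add(element)
            stack.extend(dtd.get(element, ()))
    return seen


def is_recursive(dtd, root):
    """True if some element reachable from root can reach itself."""
    candidates = descendants(dtd, root) | {root}
    return any(e in descendants(dtd, e) for e in candidates)


directory_dtd = {
    "directory": ["dirname", "dirinfo", "dirdesc", "file", "directory"],
    "dirname": [], "dirinfo": [], "dirdesc": [], "file": [],
}
email_dtd = {
    "email": ["head", "body"],
    "head": ["from", "to", "cc", "subject"],
    "body": ["text", "attachment"],
}
print(is_recursive(directory_dtd, "directory"), is_recursive(email_dtd, "email"))
```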
Global properties (cont.) • Chain of stars. An example: entity → (name*, contact*, location*, phone*, fax*), location → (city*, otherinfo?). There is a chain of 2 stars (location* followed by city*).
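The slides give only an example of a chain of stars, so the sketch below rests on an assumed reading: a chain is a path through the DTD in which every step goes to a child that appears under "*" in its parent's content model, and the quantity of interest is the longest such path. Whether "+" should also count is not stated and is left out. The encoding (element name mapped to `(child, starred)` pairs) and the name `longest_star_chain` are hypothetical.

```python
def longest_star_chain(dtd):
    """Length of the longest path that follows only starred children."""
    star_edges = {e: [c for c, starred in kids if starred]
                  for e, kids in dtd.items()}
    memo = {}

    def chain(element, on_path):
        if element in memo:
            return memo[element]
        best = 0
        for child in star_edges.get(element, ()):
            if child in on_path:
                raise ValueError("starred cycle: the chain is unbounded")
            best = max(best, 1 + chain(child, on_path | {child}))
        memo[element] = best
        return best

    return max((chain(e, {e}) for e in dtd), default=0)


entity_dtd = {
    "entity": [("name", True), ("contact", True), ("location", True),
               ("phone", True), ("fax", True)],
    "location": [("city", True), ("otherinfo", False)],
}
print(longest_star_chain(entity_dtd))  # 2 for the example above
```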
Global properties (cont.) • Chain of stars
Global properties (cont.) • Hubs. Definition: the fan-in of an element name e is the cardinality of the set {e’ | e’ → α and e occurs in α}. An element name with a large fan-in value is called a hub. An example: email → (head, body), head → (from, to+, cc*, subject), from → (ε), to → (ε), cc → (ε), subject → (pcdata), body → (text, attachment*), text → (pcdata), attachment → (ε). The fan-in value of the email element is 0, and the fan-in value of every other element in this DTD is 1.
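Fan-in can be computed by inverting the parent-to-children relation and counting distinct parents. The sketch assumes the same dictionary encoding of the email DTD as an illustration; `fan_in` is a hypothetical helper name.

```python
def fan_in(dtd):
    """Number of distinct element names whose content model mentions each element."""
    parents = {element: set() for element in dtd}
    for parent, children in dtd.items():
        for child in children:
            parents.setdefault(child, set()).add(parent)
    return {element: len(ps) for element, ps in parents.items()}


email_dtd = {
    "email": ["head", "body"],
    "head": ["from", "to", "cc", "subject"],
    "body": ["text", "attachment"],
    "from": [], "to": [], "cc": [], "subject": [],
    "text": [], "attachment": [],
}
for element, value in fan_in(email_dtd).items():
    print(element, value)
```

An element repeated inside one content model still counts once for that parent, since the definition counts parent names rather than occurrences.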
Global properties (cont.) • Hubs • Result: [figures: fan-in of elements in data DTDs; fan-in of elements in meta DTDs]
Summary • Local properties • Content model classification • Syntactic complexity • Determinism • Ambiguity • Global properties • Reachability • Recursions • Chain of stars • Hubs • One drawback of this survey • It does not study any properties of attributes