XML Schema Languages Dongwon Lee, Ph.D. The Pennsylvania State University IST 516 / Fall 2010 http://www.practicingsafetechs.com/TechsV1/XMLSchemas/
Today • DUE: Proj #1 Planning Doc • Team Based • Assignment: Lab #1 • Individual
Course Objectives • Understand the purpose of using schemas • Study regular expressions as the framework to formalize schema languages • Understand details of DTD and XML Schema • Learn the concept of schema validation
Outline • Schema Language in general • Content Model • DTD • XML Schema (XSD) • Schema Validation
Motivation • The company “Nittany Vacations, LLC” wants to standardize all internal documents using the XML-based format nvML • Gather requirements from all employees • Informal description (i.e., narrative) of what nvML should look like • Q • How to describe nvML formally and unambiguously? • How to validate an nvML document?
Motivation: Schema = Rules • An XML schema is all about expressing rules: • Rules about what data is allowed • Rules about how the data must be organized • Rules about the relationships between data
Motivation: sep-9.xml <Vacation date="2010-09-09" guide-by="Lee"> <Trip segment="1" mode="air"> <Transportation>airplane</Transportation> </Trip> <Trip segment="2" mode="water"> <Transportation>boat</Transportation> </Trip> <Trip segment="3" mode="ground"> <Transportation>car</Transportation> </Trip> </Vacation> Example modified from Roger L. Costello’s slides @ xfront.com
Motivation: Validate <Vacation date="2010-09-09" guide-by="Lee"> <Segment id="1" mode="air"> <Transportation>airplane</Transportation> </Segment> <Segment id="2" mode="water"> <Transportation>boat</Transportation> </Segment> <Segment id="3" mode="ground"> <Transportation>car</Transportation> </Segment> </Vacation> Validate the XML document against the XML Schema • XML Schema = RULES (nvML.dtd or nvML.xsd) • Rule 1: A vacation has segments. • Rule 2: Each segment is uniquely identified. • Rule 3: There are three modes of transportation: air, water, ground. • Rule 4: Each segment has a mode of transportation. • Rule 5: Each segment must identify the specific mode used.
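To make the five rules concrete, here is a minimal hand-written check in Python. It is only a sketch of what a schema processor would do from nvML.dtd or nvML.xsd; the function name `check_vacation` and the error strings are assumptions made for illustration, not part of nvML.

```python
import xml.etree.ElementTree as ET

DOC = """<Vacation date="2010-09-09" guide-by="Lee">
  <Segment id="1" mode="air"><Transportation>airplane</Transportation></Segment>
  <Segment id="2" mode="water"><Transportation>boat</Transportation></Segment>
  <Segment id="3" mode="ground"><Transportation>car</Transportation></Segment>
</Vacation>"""

ALLOWED_MODES = {"air", "water", "ground"}  # Rule 3


def check_vacation(xml_text):
    """Return a list of rule violations; an empty list means all rules hold."""
    root = ET.fromstring(xml_text)
    errors = []
    segments = root.findall("Segment")
    if root.tag != "Vacation" or not segments:
        errors.append("Rule 1: a vacation has segments")
    ids = [s.get("id") for s in segments]
    if None in ids or len(set(ids)) != len(ids):
        errors.append("Rule 2: each segment is uniquely identified")
    for s in segments:
        mode = s.get("mode")
        if mode is None:
            errors.append("Rule 4: each segment has a mode of transportation")
        elif mode not in ALLOWED_MODES:
            errors.append("Rule 3: mode must be air, water, or ground")
        transport = s.find("Transportation")
        if transport is None or not (transport.text or "").strip():
            errors.append("Rule 5: each segment names the specific mode used")
    return errors


print(check_vacation(DOC))
```

Writing such checks by hand for every document type is exactly the effort a schema language is meant to remove: the rules are declared once and a generic validator enforces them.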
Schema Languages • Schema: a formal description of structures / constraints • Eg, relational schema describes tables, attributes, keys, .. • Schema Language: a formal language to describe schemas • Eg, SQL DDL for relational model CREATE TABLE employees ( id INTEGER PRIMARY KEY, first_name CHAR(50) NULL, last_name CHAR(75) NOT NULL, dateofbirth DATE NULL );
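The relational analogy can be tried directly. The sketch below loads the slide's table definition into an in-memory SQLite database using Python's standard `sqlite3` module and shows the schema rejecting rows that break its rules. Assumptions: SQLite stands in for "an RDBMS", the explicit `NULL` keywords are dropped (nullable is the default), and the helper names `make_db` and `try_insert` are invented for this example.

```python
import sqlite3

# The slide's DDL; nullable columns are written without the explicit NULL keyword.
DDL = """
CREATE TABLE employees (
    id          INTEGER PRIMARY KEY,
    first_name  CHAR(50),
    last_name   CHAR(75) NOT NULL,
    dateofbirth DATE
)
"""


def make_db():
    """Create an in-memory database that contains only the employees table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)
    return conn


def try_insert(conn, row):
    """Insert one row; return False if the schema rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO employees VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


conn = make_db()
print(try_insert(conn, (1, "Dongwon", "Lee", None)))  # satisfies the schema
print(try_insert(conn, (2, "Nobody", None, None)))    # violates NOT NULL
```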
Rules in Formal Schema Lang. • Why bother formalizing the syntax with a schema? • A formal definition provides a precise but human-readable reference • Schema processing can be done with existing implementations • One’s own tools for own language can benefit: by piping input documents through a schema processor, one can assume that the input is valid and defaults have been inserted
Schema Processing http://www.brics.dk/~amoeller/XML/schemas/schemas.html
Requirements for Schema Lang. • Expressiveness • Efficiency • Comprehensibility
Regular Expressions (RE) • Commonly used to describe sequences of characters or elements in schema languages • RE to capture content models • Σ: a finite alphabet • α in Σ: matches only the single character α • α?: matches zero or one α • α*: matches zero or more α’s • α+: matches one or more α’s • α β: matches the concatenation of α and β • α | β: matches the union of α and β
RE Examples • a|b* denotes {ε, a, b, bb, bbb, ...} • (a|b)* denotes the set of all strings with no symbols other than a and b, including the empty string: {ε, a, b, aa, ab, ba, bb, aaa, ...} • ab*(c|ε) denotes the set of strings starting with a, then zero or more bs and finally optionally a c: {a, ac, ab, abc, abb, abbc, ...}
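These three expressions can be checked with Python's `re` module. This is a small sketch: `matches` is a helper name chosen here, and ε is written as an empty alternative, so `ab*(c|ε)` becomes `ab*(c|)`.

```python
import re


def matches(pattern, text):
    """True if the whole string (not just a prefix) belongs to the language."""
    return re.fullmatch(pattern, text) is not None


# a|b*  denotes {ε, a, b, bb, bbb, ...}
print(matches(r"a|b*", ""), matches(r"a|b*", "bbb"), matches(r"a|b*", "ab"))

# (a|b)*  denotes every string over {a, b}, including the empty string
print(matches(r"(a|b)*", "abba"), matches(r"(a|b)*", "abc"))

# ab*(c|ε)  denotes a, then zero or more b's, then an optional c
print(matches(r"ab*(c|)", "abbc"), matches(r"ab*(c|)", "acc"))
```

`re.fullmatch` is used rather than `re.match` because membership in a regular language concerns the entire string.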
RE Examples • Valid integers: • 0 | -? (1|2|3|4|5|6|7|8|9) (0|1|2|3|4|5|6|7|8|9)* • Valid contents of table element in XHTML: • caption? (col* | colgroup*) thead? tfoot? (tbody+ | tr+)
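The integer expression translates almost symbol for symbol into Python. The sketch below assumes the usual reading in which the digits after the first may include 0 (so that 10 and 100 are integers) and uses character classes as shorthand for the long alternations; `is_integer` is a name chosen for this example.

```python
import re

# 0 | -? (1|...|9) (0|1|...|9)*   with character classes for the alternations
INTEGER = re.compile(r"0|-?[1-9][0-9]*")


def is_integer(text):
    """True if text is 0, or an optional minus sign followed by digits with no leading zero."""
    return INTEGER.fullmatch(text) is not None


for candidate in ["0", "-42", "100", "007", "-0", "4.5"]:
    print(candidate, is_integer(candidate))
```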
Which Schema Language? • Many proposals competing for acceptance • W3C Proposals: DTD, XML Data, DCD, DDML, SOX, XML-Schema, … • Non-W3C Proposals: Assertion Grammars, Schematron, DSD, TREX, RELAX, XDuce, RELAX-NG, … • Different applications have different needs from a schema language
Expressive Power (content model) • [Figure: nested regions of increasing expressive power, innermost to outermost: DTD, XML-Schema, XDuce / RELAX-NG] • Source: “Taxonomy of XML Schema Languages using Formal Language Theory”, Makoto Murata, Dongwon Lee, Murali Mani, Kohsuke Kawaguchi, in ACM Trans. on Internet Technology (TOIT), Vol. 5, No. 4, pages 1-45, November 2005
Closure (content model) • [Figure: the same nested regions: DTD, XML-Schema, XDuce / RELAX-NG] • DTD, XML-Schema: closed under INTERSECT • XDuce, RELAX-NG: closed under INTERSECT, UNION, DIFFERENCE
DTD: Document Type Definition • XML DTD is a subset of SGML DTD • XML DTD is the standard XML schema language of the past (and maybe the present…) • It is one of the simplest and least expressive schema languages proposed for the XML model • It does not use XML tag notation, but uses its own notation • It cannot express relatively complex constraints (e.g., key with scope) well • It is being replaced by XML-Schema of W3C and RELAX-NG of OASIS
DTD: Elements • <!ELEMENT element-name content-model> • Associates a content model with all elements of the given name • Content models: • EMPTY: no content is allowed • ANY: any content is allowed • Mixed content: (#PCDATA | e1 | … | en)* • arbitrary sequence of character data and listed elements
DTD: Elements • Eg: “Name” element consists of an • optional FirstName element, followed by • a mandatory LastName element, where • both are text strings <!ELEMENT Name (FirstName?, LastName)> <!ELEMENT FirstName (#PCDATA)> <!ELEMENT LastName (#PCDATA)>
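A content model is a regular expression over child element names, so it can be checked with an ordinary regex engine. The sketch below is one simple way to do this, not how a real DTD validator is implemented: it writes the children of an element as a string of tag names, each followed by a space, and matches that string against `(FirstName?, LastName)`. The names `NAME_MODEL` and `content_ok` are assumptions for this example.

```python
import re
import xml.etree.ElementTree as ET

# (FirstName?, LastName) as a regex over "tag name + space" tokens
NAME_MODEL = re.compile(r"(FirstName )?LastName ")


def content_ok(element, model):
    """True if the sequence of child tag names matches the content model."""
    children = "".join(child.tag + " " for child in element)
    return model.fullmatch(children) is not None


full = ET.fromstring("<Name><FirstName>Dongwon</FirstName><LastName>Lee</LastName></Name>")
last_only = ET.fromstring("<Name><LastName>Lee</LastName></Name>")
first_only = ET.fromstring("<Name><FirstName>Dongwon</FirstName></Name>")

print(content_ok(full, NAME_MODEL), content_ok(last_only, NAME_MODEL),
      content_ok(first_only, NAME_MODEL))
```

The trailing space after each name keeps one element name from being mistaken for a prefix of another.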
DTD: Attributes • <!ATTLIST element-name attr-name attr-type attr-default ...> • Declares which attributes are allowed or required in which elements • Attribute types: • CDATA: any value is allowed (the default) • (value|...): enumeration of allowed values • ID, IDREF, IDREFS: ID attribute values must be unique (contain "element identity"), IDREF attribute values must match some ID (reference to an element) • ENTITY, ENTITIES, NMTOKEN, NMTOKENS, NOTATION: consider them obsolete…
DTD: Attributes • Attribute defaults: • #REQUIRED: the attribute must be explicitly provided • #IMPLIED: attribute is optional, no default provided • "value": if not explicitly provided, this value inserted by default • #FIXED "value": as above, but only this value is allowed
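The four kinds of attribute default can be sketched as a small function that a schema processor might run on each element's attributes. Everything named here (`DECLS`, `apply_defaults`, and the attributes `id`, `note`, `mode`, `version`) is hypothetical and chosen only to exercise each case.

```python
# Hypothetical declarations: attribute name -> (kind of default, value)
DECLS = {
    "id":      ("#REQUIRED", None),
    "note":    ("#IMPLIED", None),
    "mode":    ("default", "ground"),  # plain "value" default
    "version": ("#FIXED", "1.0"),
}


def apply_defaults(attrs, decls):
    """Return a copy of attrs with defaults inserted; raise ValueError on a violation."""
    result = dict(attrs)
    for name, (kind, value) in decls.items():
        if kind == "#REQUIRED" and name not in result:
            raise ValueError(f"missing required attribute {name!r}")
        if kind == "#FIXED" and result.get(name, value) != value:
            raise ValueError(f"attribute {name!r} must be {value!r}")
        if kind in ("default", "#FIXED"):
            result.setdefault(name, value)  # insert only if not provided
        # "#IMPLIED": optional, nothing is inserted
    return result


print(apply_defaults({"id": "1"}, DECLS))
```

This is the sense in which a tool downstream of a validator "can assume that defaults have been inserted": the optional `mode` and the fixed `version` appear in the result even though the document omitted them.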
DTD: Attributes • Eg: “Name” element has an • optional FirstName attribute and a • mandatory LastName attribute, where • both are text strings <!ELEMENT Name EMPTY> <!ATTLIST Name FirstName CDATA #IMPLIED LastName CDATA #REQUIRED>
DTD: Attributes • ID vs. IDREF/IDREFS • ID: document-wide unique ID (like key in DB) • IDREF: referring attribute (like foreign key in DB) <!ELEMENT employee (…)> <!ATTLIST employee eID ID #REQUIRED boss IDREF #IMPLIED> … <employee eID="a">…</employee> … <employee eID="b" boss="a">…</employee>
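The key / foreign-key analogy can be illustrated with a hand-written check in Python. This is only a sketch of the two constraints a validating parser enforces for ID and IDREF; the wrapper element `staff` and the function name `check_ids` are assumptions for this example.

```python
import xml.etree.ElementTree as ET


def check_ids(root, id_attr, idref_attr):
    """Report duplicate ID values and IDREF values that match no ID."""
    seen, errors = set(), []
    for element in root.iter():
        value = element.get(id_attr)
        if value is not None:
            if value in seen:
                errors.append(f"duplicate ID {value!r}")
            seen.add(value)
    for element in root.iter():
        ref = element.get(idref_attr)
        if ref is not None and ref not in seen:
            errors.append(f"dangling IDREF {ref!r}")
    return errors


good = ET.fromstring('<staff><employee eID="a"/><employee eID="b" boss="a"/></staff>')
bad = ET.fromstring('<staff><employee eID="a"/><employee eID="a" boss="z"/></staff>')
print(check_ids(good, "eID", "boss"))
print(check_ids(bad, "eID", "boss"))
```

Unlike a foreign key, an IDREF is untyped: it may point at any element carrying an ID, which is one of the limitations of DTD that later schema languages address.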
DTD: Example • XML document that conforms to “event.dtd” <?xml version="1.0"?> <!DOCTYPE event SYSTEM "../../dir/event.dtd"> <event eID="sigmod02"> <acronym>SIGMOD</acronym> <society>ACM</society> <url>www.sigmod02.org</url> <loc> <city>Madison</city> <state>WI</state> </loc> <year>2002</year> </event>
DTD: Example <!-- event.dtd --> <!ELEMENT event (acronym, society*, url?, loc, year)> <!ATTLIST event eID ID #REQUIRED> <!ELEMENT acronym (#PCDATA)> <!ELEMENT society (#PCDATA)> <!ELEMENT url (#PCDATA)> <!ELEMENT loc (city, state)> <!ELEMENT city (#PCDATA)> <!ELEMENT state (#PCDATA)> <!ELEMENT year (#PCDATA)>
DTD: Type Declaration • Associate a DTD schema with an XML document • Placed in the first lines of the XML document • DTD file (event.dtd): <!ELEMENT …. > • XML file (event.xml): <?xml version="1.0"?> <!DOCTYPE event SYSTEM "http://foo.com/event.dtd">
Exercise: Citation • [ER diagram: entities Authors (*AID, Name, Title), Papers (*PID, Area), Venues (*Vname, Year, City); relationships Publish, Attend, Appear]
Exercise: Citation • To RDBMS: • Authors(*AID, Name, Title) • Papers(*PID, Area, *Vname) • Venues(*Vname, Year, City) • Publish(*AID, *PID) • Attend(*AID, *Vname) • Appear(*PID, *Vname)
Exercise: Citation • Relational → XML: • Create a dummy root element • Make entities 1st-level children • Make columns of entities attributes of those children • Express relationships as attributes • Unlike in an RDBMS, there is no 1st-normal-form restriction, so a many-to-many relationship can be stored in a multi-valued attribute • One => IDREF, Many => IDREFS
Exercise: Citation • “Lee” publishes two papers “p1” and “p2”, which appear in venues “X” and “Y” in 2006, respectively, and attends only “Y”. “p2” is co-authored by “John”, who attends “X”. <Author AID='1' Name='Lee' Title='Prof.'/> <Author AID='2' Name='John' Title='Prof.'/> <Paper PID='p1' Area='DB'/> <Paper PID='p2' Area='DB'/> <Venue Vname='X' Year='2006' … /> <Venue Vname='Y' Year='2006' … />
Exercise: Citation • “Lee” publishes two papers “p1” and “p2”, which appear in venues “X” and “Y” in 2006, respectively, and attends only “Y”. “p2” is co-authored by “John”, who attends “X”. <Author AID='1' Name='Lee' Title='Prof.'/> <Author AID='2' Name='John' Title='Prof.'/> <Paper PID='p1' Area='DB' Vname='X' /> <Paper PID='p2' Area='DB' Vname='Y' /> <Venue Vname='X' Year='2006' … /> <Venue Vname='Y' Year='2006' … />
Exercise: Citation • “Lee” publishes two papers “p1” and “p2”, which appear in venues “X” and “Y” in 2006, respectively, and attends only “Y”. “p2” is co-authored by “John”, who attends “X”. <Author AID='1' Name='Lee' Title='Prof.' Publish='p1 p2' Attend='Y' /> <Author AID='2' Name='John' Title='Prof.' Publish='p2' Attend='X' /> <Paper PID='p1' Area='DB' Vname='X' /> <Paper PID='p2' Area='DB' Vname='Y' /> <Venue Vname='X' Year='2006' … /> <Venue Vname='Y' Year='2006' … />
Exercise: Citation <Dummy> <Author AID='1' Name='Lee' Title='Prof.' Publish='p1 p2' Attend='Y' /> <Author AID='2' Name='John' Title='Prof.' Publish='p2' Attend='X' /> <Paper PID='p1' Area='DB' Vname='X' /> <Paper PID='p2' Area='DB' Vname='Y' /> <Venue Vname='X' Year='2006' … /> <Venue Vname='Y' Year='2006' … /> </Dummy>
Exercise: Citation <!ELEMENT Dummy (Author | Paper | Venue)*> <!ELEMENT Author EMPTY> <!ATTLIST Author AID ID #REQUIRED Name CDATA #IMPLIED Title CDATA #IMPLIED Publish IDREFS #IMPLIED Attend IDREFS #IMPLIED> <!ELEMENT Paper EMPTY> <!ATTLIST Paper PID ID #REQUIRED Area CDATA #IMPLIED Title CDATA #IMPLIED Vname IDREF #REQUIRED> <!ELEMENT Venue EMPTY> <!ATTLIST Venue Vname ID #REQUIRED Year CDATA #IMPLIED City CDATA #IMPLIED>
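To see how the IDREFS and IDREF attributes encode the relationships, the sketch below follows an author's Publish references to the papers and then each paper's Vname reference to its venue. Assumptions: the attributes elided with "…" on the slides are simply left out of the sample document, and `publication_venues` is a name invented for this example.

```python
import xml.etree.ElementTree as ET

DOC = """<Dummy>
  <Author AID="1" Name="Lee" Title="Prof." Publish="p1 p2" Attend="Y"/>
  <Author AID="2" Name="John" Title="Prof." Publish="p2" Attend="X"/>
  <Paper PID="p1" Area="DB" Vname="X"/>
  <Paper PID="p2" Area="DB" Vname="Y"/>
  <Venue Vname="X" Year="2006"/>
  <Venue Vname="Y" Year="2006"/>
</Dummy>"""


def publication_venues(xml_text, author_name):
    """Return the set of venue names where the given author's papers appear."""
    root = ET.fromstring(xml_text)
    papers = {paper.get("PID"): paper for paper in root.findall("Paper")}
    venues = set()
    for author in root.findall("Author"):
        if author.get("Name") == author_name:
            # IDREFS: a whitespace-separated list of ID values
            for pid in author.get("Publish", "").split():
                venues.add(papers[pid].get("Vname"))  # IDREF: one ID value
    return venues


print(sorted(publication_venues(DOC, "Lee")))
print(sorted(publication_venues(DOC, "John")))
```

The many-to-many Publish relationship lives entirely in one attribute, which is why no separate Publish table is needed in the XML version.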
XML Schema • New XML schema language from W3C • Successor of DTD • Unlike DTD, XML Schema is in XML syntax • http://www.w3.org/XML/Schema <xsd:complexType name="PurchaseOrderType"> <xsd:sequence> <xsd:element name="shipTo" type="USAddress"/> <xsd:element name="billTo" type="USAddress"/> <xsd:element ref="comment" minOccurs="0"/> <xsd:element name="items" type="Items"/> </xsd:sequence> <xsd:attribute name="orderDate" type="xsd:date"/> </xsd:complexType>
XML Schema vs. DTD: What’s New • XML Schemas are extensible to future additions • XML Schema V 1.0 → 1.1 → … • XML Schemas are richer and more powerful than DTDs • XML Schemas are written in XML • No <!ELEMENT …> or <!ATTLIST ..> notation • XML Schemas support data types • XML Schemas support namespaces
New: Data Types • XML Schema supports data types, which makes it easier to: • Describe allowable document content • Validate the correctness of data • Work with data from a database • Define data facets (restrictions on data) • Define data patterns (data formats) • Convert data between different data types • Eg, <date type="date">2010-09-11</date> • Ensures a mutual understanding of the content • The XML data type "date" requires the format "YYYY-MM-DD"
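What a typed value buys can be shown with a simplified date check in Python. This sketch covers only the plain "YYYY-MM-DD" form mentioned on the slide; the real xs:date type also permits details such as a time-zone suffix, which are deliberately ignored here. The function name `is_date_like` is an assumption.

```python
import re
from datetime import date


def is_date_like(text):
    """True if text has the form YYYY-MM-DD and names a real calendar day."""
    m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", text)
    if m is None:
        return False
    try:
        date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    except ValueError:  # e.g. month 13 or February 30
        return False
    return True


for value in ["2010-09-11", "09/11/2010", "2010-02-30", "2010-9-11"]:
    print(value, is_date_like(value))
```

With a DTD, all four strings would be accepted as #PCDATA; a typed schema lets the validator reject the last three.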
New: in XML Notation • XML Schema uses XML notation • <> and </> • XML Schema file itself IS an XML file, too • No need to learn a new language • No need to use new tools • Use an XML editor to edit XML Schema files • Use XML parser to parse XML Schema files • Manipulate an XML Schema using DOM • Transform an XML Schema with XSLT
New: Extensibility • XML Schema is extensible because XML is extensible • XML Schema lets you: • Reuse your schema in other schemas • Create your own data types derived from the standard types (inheritance) • Reference multiple schemas in the same document
Well-Formed: Not Enough • Well-Formed: a document conforms to XML syntax rules such as: • Begin with XML decl. • One unique root • Case-sensitive • Matching Start / End tags • Properly nested • Well-formed documents can still contain semantic errors or inconsistencies • Need VALID documents according to schema
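Well-formedness is what any XML parser checks before a schema is involved. The sketch below uses Python's standard `xml.etree.ElementTree`, which raises `ParseError` on a document that is not well-formed; `is_well_formed` is a helper name chosen here. Note that the parser accepts the last sample even though it makes little sense as a note, which is the gap that validation against a schema fills.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if text parses as XML; says nothing about validity against a schema."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<note><to>Tove</to></note>"))  # matching, nested tags
print(is_well_formed("<note><to>Tove</note>"))       # <to> is never closed
print(is_well_formed("<Note></note>"))               # names are case-sensitive
print(is_well_formed("<a/><b/>"))                    # more than one root
print(is_well_formed("<note><price>Tove</price></note>"))  # well-formed, but is it valid?
```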
note.xml <?xml version="1.0"?> <!-- Reference to schema goes here --> <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
note.dtd <!ELEMENT note (to, from, heading, body)> <!ELEMENT to (#PCDATA)> <!ELEMENT from (#PCDATA)> <!ELEMENT heading (#PCDATA)> <!ELEMENT body (#PCDATA)>
note.xml with Reference to DTD <?xml version="1.0"?> <!DOCTYPE note SYSTEM "http://www.w3schools.com/dtd/note.dtd"> <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
note.xsd <?xml version="1.0"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.w3schools.com" xmlns="http://www.w3schools.com" elementFormDefault="qualified"> <xs:element name="note"> <xs:complexType> <xs:sequence> <xs:element name="to" type="xs:string"/> <xs:element name="from" type="xs:string"/> <xs:element name="heading" type="xs:string"/> <xs:element name="body" type="xs:string"/> </xs:sequence> </xs:complexType> </xs:element> </xs:schema>
<schema> element <?xml version="1.0"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.w3schools.com" xmlns="http://www.w3schools.com" elementFormDefault="qualified"> . . . </xs:schema> • <schema> element is the root element of every XML Schema
<schema> element <?xml version="1.0"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.w3schools.com" xmlns="http://www.w3schools.com" elementFormDefault="qualified"> . . . </xs:schema> • Elements & data types in this schema file come from the http://www.w3.org/2001/XMLSchema namespace • They are to be prefixed with “xs:”
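Because an XML Schema file is itself XML, an ordinary XML parser can read it. The sketch below loads the note.xsd content with Python's `xml.etree.ElementTree` and lists the declared element names. It also shows what the "xs:" prefix means to a parser: the prefix is only shorthand, and each tag is really identified by the namespace URI. The variable names are assumptions for this example.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

NOTE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.w3schools.com"
    xmlns="http://www.w3schools.com"
    elementFormDefault="qualified">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
        <xs:element name="from" type="xs:string"/>
        <xs:element name="heading" type="xs:string"/>
        <xs:element name="body" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = ET.fromstring(NOTE_XSD)

# ElementTree expands "xs:schema" to "{namespace-URI}schema"
root_is_schema = schema.tag == f"{{{XS}}}schema"

# Every xs:element declaration, in document order
declared = [e.get("name") for e in schema.iter(f"{{{XS}}}element")]

print(root_is_schema, schema.get("targetNamespace"), declared)
```

The same approach extends to the other points on the "in XML notation" slide: the schema can be edited, queried, or transformed with any general-purpose XML tool.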