CSIT600b: XML Programming
XML, DTD, Schema, XMI, DOM (SAX)
Dickson K.W. Chiu, PhD, SMIEEE
Thanks to Prof. SC Cheung (HKUST), Prof. Francis Lau (HKU)
References: XML How To Program, Deitel, Prentice Hall 2001; J2EE 1.4 tutorial
General HTML Problems
The Web has changed everything
• Except the need for things such as:
  • Data integrity
  • Process repeatability
  • Competitive cost structure
• HTML fails to meet these critical business needs
  • High development and maintenance costs
  • Internet and server bottlenecks
  • Interoperability
Dickson Chiu 2004
Specific HTML Problems
• HTML mixes up structure and style
• Many wanted personalized tags
  • Non-standard HTML web pages on the Internet nowadays
• Want to put other data into HTML
  • Mathematics, database entries, literary text, poems, purchase orders, graphic layouts ...
• Different conceptions for the language
• Software processing
  • Server management of data (library Web site, any large site)
  • Client data processing (machine-to-machine communication)
• But HTML is so ill-formed that this is hard!
Web Page Processing
[Figure: diagram with the labels "Web software", "Web server engine", "HTML data", several "HTML chunk" boxes (from somewhere on the Web ...), and "Into a database, or other tool".]
Case Study: Price Comparison
Scenario: compare the prices of books
For example, a user enters a book title, and your page displays the price at bn.com, amazon.com, bestbuy.com, etc.
The user can then choose the cheapest price.
SGML
• Standard Generalized Markup Language
  • See: http://www.w3.org/MarkUp/SGML/
  • Developed in the 1970s
  • Used by big organizations: IBM, US DoD
• A meta-language for defining languages
  • Focuses on content structure, not look and feel
  • HTML is defined using SGML
• Complex, sophisticated, powerful
  • Information model of freedom and extensibility
  • Write once, reuse many times
  • Future-proof, platform-proof
  • Validation for completeness and correctness
  • Infinite possibilities for expressing information (user-defined tag set)
Problems of SGML
• Too complicated
• Rules too strict
  • Can't distribute 'muddle-able', loosely formatted text (like HTML)
• Not good in a distributed environment
  • Can't mix different data together
  • Can't add arbitrary tags
  • No mainstream browser support
• Unlimited options, which complicates the tools
• Not much support for styles
• Limited vendor support
eXtensible Markup Language: XML to the Rescue
• Well-behaved subset of SGML designed to enable delivery over the Web
  • A structured meta-language in the format of ASCII plain text
  • SGML--, not HTML++
• Designed by the World Wide Web Consortium (W3C)
• Overwhelming vendor support
• Can use XML to define new languages
• Distributes easily on the Web
  • Can mix different types of data together
  • Can easily add new tags, and tell a browser what to do with them (more or less...)
• Tools are easier to build
• Mainstream browsers (IE 5 and Netscape 6) support XML
• However! Reuse, interchange and automation still require data analysis and enforcement of rules
XML History and Pointers
• XML is an official standard of the World Wide Web Consortium (W3C)
• Official information is available at: http://www.w3.org/XML/
• Version: 1.0 (2nd edition: 6 October 2000)
• New version 1.1: http://www.w3.org/TR/2004/REC-xml11-20040204/
• The official spec is available at: http://www.w3.org/TR/2000/REC-xml-20001006
• The official XML FAQ: http://www.ucc.ie/xml/
• Popular reference sites: http://www.xml.com/ and http://www.xml.org/
• Reference book: XML - How to Program, Deitel, Deitel, Nieto, Lin & Sadhu (Prentice Hall 2000)
XML Family of Technologies (partial)
• DTD / Schema - defining XML documents, elements and attributes
• DOM - manipulating an XML (or HTML) file from a programming language
• XPath - addressing parts of an XML document
• XLink - adding hyperlinks to an XML file
• XPointer - pointing to parts of an XML document
• CSS - applicable to XML as it is to HTML
• XSL - an advanced language for expressing style sheets (XML represents data but not how it looks...)
• XSLT - transforming XML to other formats
• Namespaces - differentiating elements of different XML documents
Official (W3C) Design Goals of XML
• XML shall be straightforwardly usable over the Internet.
• XML shall support a wide variety of applications.
• XML shall be compatible with SGML.
• It shall be easy to write programs which process XML documents.
• The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
• XML documents should be human-legible and reasonably clear.
• The XML design should be prepared quickly.
• The design of XML shall be formal and concise.
• XML documents shall be easy to create.
• Terseness in XML markup is of minimal importance.
Examples of Hot XML Applications
• E-publishing
  • Web (intranet, extranet, Internet)
  • CD-ROM
  • Print
• E-commerce
  • Electronic commerce (business-to-consumer)
  • Electronic Data Interchange, EDI (business-to-business)
• Software applications
  • Data exchange between applications and databases
  • Application integration
• Standard data formats for industries
  • MathML for mathematics
  • SpeechML for synthesised voices
XML Advantages for Web Delivery
Over SGML:
• Faster download
• Supported by mainstream browsers
• Standard linking
• Standard stylesheet
Over HTML:
• Interchangeable
• Reusable
• Enables automation
• Searchable
The XML Family Tree
[Figure: family tree rooted at SGML, with HTML, TEI and XML derived from SGML, and SMIL, SpeechML, WML, XHTML, MathML and RDF (among others) derived from XML.]
HTML vs. XML: a Quick Example
HTML
  <html>
    <body>
      <p>333 MHz Pentium II with 256K internal cache, 512K external cache, 32MB standard RAM, 512MB max. RAM</p>
    </body>
  </html>
XML
  <pcinfo>
    <processor>
      <type>Pentium II</type>
      <speed>333</speed>
      <intcache>256</intcache>
    </processor>
    <extcache>512</extcache>
    <ram>
      <standard>32</standard>
      <max>512</max>
    </ram>
  </pcinfo>
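The practical difference shows up once a program has to read the data. The following is a minimal sketch in Python (chosen only for illustration; the course material itself refers to Java), using the standard-library ElementTree parser on the pcinfo document. The function name summarize and the dictionary keys are assumptions, not part of the slides.

```python
import xml.etree.ElementTree as ET

# The <pcinfo> document from the slide (closing tag of <extcache> corrected).
PCINFO = """\
<pcinfo>
  <processor>
    <type>Pentium II</type>
    <speed>333</speed>
    <intcache>256</intcache>
  </processor>
  <extcache>512</extcache>
  <ram>
    <standard>32</standard>
    <max>512</max>
  </ram>
</pcinfo>
"""

def summarize(xml_text):
    """Return selected fields of a <pcinfo> document as a dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        "cpu": root.findtext("processor/type"),
        "mhz": int(root.findtext("processor/speed")),
        "max_ram_mb": int(root.findtext("ram/max")),
    }

info = summarize(PCINFO)
print(info)
```

Getting the same three values out of the HTML paragraph would require pattern matching on free text, which breaks as soon as the wording changes.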
MathML Example
• Designed to express the layout of maths
• Can also express semantics
• Cut and paste into Maple, Mathematica
• x² + 4x + 4 = 0
  <mrow>
    <mrow>
      <msup> <mi>x</mi> <mn>2</mn> </msup>
      <mo>+</mo>
      <mrow> <mn>4</mn> <mo>&InvisibleTimes;</mo> <mi>x</mi> </mrow>
      <mo>+</mo>
      <mn>4</mn>
    </mrow>
    <mo>=</mo>
    <mn>0</mn>
  </mrow>
Case Study: EDI
• EDI (Electronic Data Interchange) aims at eliminating the use of paper for business data exchange
• Single point of information capture, electronic delivery, low storage and retrieval costs
• Statistics show that only the top 10,000 companies on a global scale are using EDI. In the rest of the business world, only 5% use EDI; all others use paper
The Case of a Small Business
• Arthur runs a music wholesaling business
  • He buys CDs from publishers using EDI or by fax
  • He sells CDs to shops, taking orders by mail, phone, fax, or over the Web
• His ordering using EDI is actually worse than fax
  • Big business (the record company) benefits, the small guy (Arthur) suffers
  • His suppliers all use different EDI standards
  • Arthur has to use four PCs, one for each supplier, running some expensive software to produce EDI orders and accept EDI invoices
  • Worse, none of the systems links to his accounting system
Problems of EDI
• Information coded in EDI is not self-describing
  • A record might look like: "Wing Discspinner Music", "Music distributor", "Wing Discspinner", Wing@discspinner.co.uk
• Systems must be 100% compatible in the message structures they understand
  • Imagine what happens when a new field is added: "Arthur Discspinner Music", "Music distributor", "Arthur Discspinner", "0118 912 3456", Arthur@discspinner.co.uk
  • In general, the EDI system will report an error
• So companies need to band together to define standards
EDI by XML
• Using XML, the same info will be coded as
  <Company>Arthur Discspinner Music</Company>
  <MarketSector>Music distributor</MarketSector>
  <Contact>
    <Name>Arthur Discspinner</Name>
    <Phone>0118 912 3456</Phone>
    <Email>arthur@discspinner.co.uk</Email>
  </Contact>
• The software will access the data by element name
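Access by element name is what makes the added phone field harmless. A minimal Python sketch follows, assuming the three elements are wrapped in a hypothetical root element <Supplier> (the slide shows only the inner elements) and using an illustrative helper contact_field.

```python
import xml.etree.ElementTree as ET

# <Supplier> is a hypothetical wrapper added so the fragment has a single root.
RECORD = """\
<Supplier>
  <Company>Arthur Discspinner Music</Company>
  <MarketSector>Music distributor</MarketSector>
  <Contact>
    <Name>Arthur Discspinner</Name>
    <Phone>0118 912 3456</Phone>
    <Email>arthur@discspinner.co.uk</Email>
  </Contact>
</Supplier>
"""

def contact_field(xml_text, field):
    """Look up a contact field by element name; returns None if it is absent."""
    return ET.fromstring(xml_text).findtext(f"Contact/{field}")

email = contact_field(RECORD, "Email")
phone = contact_field(RECORD, "Phone")
fax = contact_field(RECORD, "Fax")  # element not present: None, not a parse error
```

A receiver that only asks for Email keeps working whether or not Phone is present, which is the positional-format problem from the EDI slide turned around.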
New Vocabularies for E-Business
• What Arthur is (and we are) looking for
  • A system that can link accounting systems over the Web or by email
  • A "many-to-many" solution
  • "Flexible interoperability"
• XML can achieve all this
  • Use XML to define vocabularies for business relationships and transactions
  • An example: ebXML (http://www.ebxml.org/)
XML Markup
  <?xml version = "1.0" encoding="utf-8"?>
  <!-- Fig. 5.1 : intro.xml -->
  <!-- Simple introduction to XML markup -->
  <myMessage>
    <message>Welcome to XML!</message>
  </myMessage>
• Declaration; version 1.0
• Encoding specification, e.g., UTF-8 Unicode (Unicode Transformation Format-8: www.utf-8.com)
• Comments
• A tree of elements
  • One root element per document, e.g., <myMessage>
  • Child elements
    • <message>, which contains the text Welcome to XML!
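The tree structure described above can be inspected through the DOM. This is a small sketch using Python's standard-library xml.dom.minidom rather than the Java DOM API; the variable names are illustrative only.

```python
from xml.dom import minidom

INTRO = """<?xml version="1.0" encoding="utf-8"?>
<!-- Simple introduction to XML markup -->
<myMessage>
   <message>Welcome to XML!</message>
</myMessage>
"""

# Parse from bytes so the encoding declaration is honoured by the parser.
doc = minidom.parseString(INTRO.encode("utf-8"))

root = doc.documentElement                      # the single root element
child_names = [node.tagName for node in root.childNodes
               if node.nodeType == node.ELEMENT_NODE]
text = root.getElementsByTagName("message")[0].firstChild.data

print(root.tagName, child_names, text)
```

The filter on nodeType is needed because the whitespace between tags also appears in the DOM as text nodes.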
XML Markup Syntax
• Tags written as in HTML, but ...
• Only one root element in an XML document
• Tag names are case-sensitive
• Always need end tags
  • Special empty-element tags: <img src = "img.gif" /> or <img src = "img.gif"></img> (<img src = "img.gif"> is not well-formed)
• Always quote attribute values
• Proper nesting for XML elements: <x><y>hello</x></y> is an error
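Each of these rules can be tried against a real parser. The sketch below uses Python's ElementTree, which checks well-formedness only (it does not validate); is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as a well-formed document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

results = {
    "proper nesting":      is_well_formed("<x><y>hello</y></x>"),
    "improper nesting":    is_well_formed("<x><y>hello</x></y>"),
    "empty-element tag":   is_well_formed('<img src="img.gif"/>'),
    "start + end tag":     is_well_formed('<img src="img.gif"></img>'),
    "missing end tag":     is_well_formed('<img src="img.gif">'),
    "unquoted attribute":  is_well_formed("<img src=img.gif/>"),
    "case mismatch":       is_well_formed("<a></A>"),
    "two root elements":   is_well_formed("<a/><b/>"),
}

for rule, ok in results.items():
    print(f"{rule:20} {'well-formed' if ok else 'error'}")
```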
XML Characters
• Unicode characters (http://www.unicode.org)
  • ASCII is a small subset
  • Covers most languages in the world
  • E.g., دايتَ
• Reserved characters: & < > ' "
• Entity references
  • Definition: <!ENTITY myName "Dickson Chiu">
  • Using them: &myName;
  • Built-in entities: &amp; &lt; &gt; &apos; &quot;
  • &lt;hello&gt; is displayed as <hello>
• By default, consecutive white space, tabs and blank lines are treated as a single space. To override:
  <myCProgram xml:space = "preserve"> if ( x &lt;= 0) x = 5; </myCProgram>
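Replacing reserved characters by the built-in entities is usually left to a library. A minimal Python sketch with xml.sax.saxutils follows; the sample string is made up for illustration.

```python
from xml.sax.saxutils import escape, unescape

raw = 'if (x <= 0 && name == "Chiu") x = 5;'

# escape() handles &, < and > by default; quotes are added via the extra mapping.
escaped = escape(raw, {'"': "&quot;", "'": "&apos;"})

# unescape() reverses the mapping; the quote entities must be listed again.
restored = unescape(escaped, {"&quot;": '"', "&apos;": "'"})

print(escaped)
print(restored == raw)
```

The order matters inside the library: & is replaced first when escaping and last when unescaping, so an already escaped &lt; is not mangled.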
Why Use Attributes?
• Elements define structure, attributes describe elements
  • <car doors="4"/> or
  • <car> <doors type="4"/></car>?
• Many debates: why pollute the language with two ways of doing the same thing?
  • "Attributes can provide metadata that may not be relevant to most applications dealing with XML"
  • Metadata is data about data (i.e., description)
  • Attributes save bandwidth?
  • Personal preference
CDATA
• Character data not parsed by the parser (good for code)
• IE5 displays CDATA as is, including whitespace
  <?xml version = "1.0"?>
  <!-- Fig. 5.7 : cdata.xml -->
  <book title = "C++ How to Program" edition = "3">
    <sample>
      // C++ comment
      if ( this-&gt;getX() &lt; 5 &amp;&amp; value[ 0 ] != 3 )
        cerr &lt;&lt; this-&gt;displayError();
    </sample>
    <sample>
      <![CDATA[
      // C++ comment
      if ( this->getX() < 5 && value[ 0 ] != 3 )
        cerr << this->displayError();
      ]]>
    </sample>
    C++ How to Program by Deitel & Deitel
  </book>
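To the application, the escaped form and the CDATA form carry the same text. A short Python check with ElementTree, using a one-line version of the C++ snippet (the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

code = "if ( this->getX() < 5 && value[ 0 ] != 3 ) cerr << this->displayError();"

# Same content written with entity references ...
escaped_doc = ("<sample>if ( this-&gt;getX() &lt; 5 &amp;&amp; value[ 0 ] != 3 ) "
               "cerr &lt;&lt; this-&gt;displayError();</sample>")
# ... and inside a CDATA section, where no escaping is needed.
cdata_doc = "<sample><![CDATA[" + code + "]]></sample>"

escaped_text = ET.fromstring(escaped_doc).text
cdata_text = ET.fromstring(cdata_doc).text

# Writing the code directly, without either mechanism, is not well-formed.
try:
    ET.fromstring("<sample>" + code + "</sample>")
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False

print(escaped_text == cdata_text == code, raw_accepted)
```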
XML Namespaces
• To avoid name collisions (same name for different elements)
• A namespace is tied to a uniform resource identifier (URI)
• A common practice is to use a URL
  <?xml version = "1.0"?>
  <!-- Fig. 5.9 : defaultnamespace.xml -->
  <directory xmlns = "urn:deitel:textInfo"
             xmlns:image = "urn:deitel:imageInfo">
    <file filename = "book.xml"> <!-- default ns -->
      <description>A book list</description>
    </file>
    <image:file filename = "funny.jpg">
      <image:description>A funny picture</image:description>
      <image:size width = "200" height = "100"/>
    </image:file>
  </directory>
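A namespace-aware parser keeps the two file elements apart by their URIs, not by their prefixes. The sketch below uses Python's ElementTree; the prefix "text" in the lookup table is an arbitrary local choice, not something defined in the document.

```python
import xml.etree.ElementTree as ET

DIRECTORY = """\
<directory xmlns="urn:deitel:textInfo" xmlns:image="urn:deitel:imageInfo">
  <file filename="book.xml">
    <description>A book list</description>
  </file>
  <image:file filename="funny.jpg">
    <image:description>A funny picture</image:description>
    <image:size width="200" height="100"/>
  </image:file>
</directory>
"""

# Prefixes here are local to the query; only the URIs have to match the document.
NS = {"text": "urn:deitel:textInfo", "image": "urn:deitel:imageInfo"}

root = ET.fromstring(DIRECTORY)
text_files = [f.get("filename") for f in root.findall("text:file", NS)]
image_files = [f.get("filename") for f in root.findall("image:file", NS)]

print(root.tag)        # ElementTree writes qualified names as {uri}localname
print(text_files, image_files)
```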
Document Type Definition
• DTD (Document Type Definition) defines a document's structure: what tags/attributes are permitted, and the "grammar"
• Validity = conformance to some DTD ("grammatically correct")
• Well-formedness is required; validity is optional
• DTD recommended, esp. for B2B transactions
• DTDs are defined using EBNF (Extended Backus-Naur Form), not XML
XML Parsers
• "Parse": to separate a sentence into its parts [Webster]
• XML parser: a program/function that reads the XML document
  • To check its syntax
  • To allow programmatic access (DOM or SAX) to the contents
• An XML document is well-formed if it is syntactically correct
  • One root element
  • Start and end tag for each element
  • Proper nesting, etc.
• Validity implies well-formedness; the reverse is not true
• All XML parsers check for well-formedness; validating parsers also check for validity
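SAX access is event-driven: the parser calls back into application code as it meets each tag. The following Python sketch uses the standard-library xml.sax module (a non-validating parser); the class OutlineHandler and the function outline are illustrative names.

```python
import xml.sax

class OutlineHandler(xml.sax.ContentHandler):
    """Record (depth, element name) for every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def startElement(self, name, attrs):
        self.outline.append((self.depth, name))
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1

def outline(xml_text):
    """Parse xml_text with SAX and return its element outline."""
    handler = OutlineHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.outline

elements = outline("<myMessage><message>Welcome to XML!</message></myMessage>")
print(elements)

# A document that is not well-formed makes the parser raise an error.
try:
    outline("<x><y>hello</x></y>")
    rejected = False
except xml.sax.SAXParseException:
    rejected = True
```

Unlike DOM, nothing is kept in memory unless the handler stores it, which is why SAX suits large documents.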
Many Free XML Parsers
• Apache's Xerces, Sun's JAXP, IBM's XML4J, etc.
• IE5 has one built in, msxml
  • It uses a default style sheet
  • With style sheets such as CSS or XSL, the data can be displayed in any desired format
• msxml is a validating parser
  • But the validation feature needs to be turned on in IE5
• Current version is msxml4, see: http://msdn.microsoft.com/xml
DTD <!DOCTYPE ... >
• DTDs are specified using <!DOCTYPE ... >
• Internal DTD:
  <!DOCTYPE myMessage [ <!ELEMENT myMessage ( #PCDATA )> ]>
• External DTD:
  <!DOCTYPE myMessage SYSTEM "myDTD.dtd">
Element Type Declarations
• The <!ELEMENT ... > declaration inside the DOCTYPE is an element type declaration (ETD):
  <!DOCTYPE myMessage [ <!ELEMENT myMessage ( #PCDATA )> ]>
• #PCDATA means "parsed character data": the text will be parsed, so characters such as <, >, &, etc. are treated specially
• EMPTY: no content allowed
• ANY: anything allowed (poor design)
• Deitel Fig. 6.1 & 6.2 (intro.xml and intro.dtd)
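• ,  sequence: children appear in the listed order
• |  choice: one of the alternatives
• +  one or more occurrences
• *  zero or more occurrences
• ?  zero or one occurrence (optional)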
Examples
• <!ELEMENT class ( number, instructor, demtors+, ( assignment+ | project ), test*, exam, ( credit | noCredit ) )>
• <!ELEMENT farm ( farmer+, ( dog* | cat? ), pig*, ( goat | cow )?, ( chicken+ | duck* ) )>
• Fig. 6.5 (mixed.xml)
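A DTD content model is a regular expression over child element names, so the farm declaration can be tried out with an ordinary regex. This is a hand translation for illustration only (not how a validating parser is necessarily implemented); FARM_MODEL and matches_farm are hypothetical names.

```python
import re

# ( farmer+, ( dog* | cat? ), pig*, ( goat | cow )?, ( chicken+ | duck* ) )
# Each child name is followed by one space so that names cannot run together.
FARM_MODEL = re.compile(
    r"(farmer )+"              # farmer+
    r"((dog )*|(cat )?)"       # ( dog* | cat? )
    r"(pig )*"                 # pig*
    r"(goat |cow )?"           # ( goat | cow )?
    r"((chicken )+|(duck )*)"  # ( chicken+ | duck* )
)

def matches_farm(children):
    """True if the list of child element names satisfies the farm content model."""
    sequence = "".join(name + " " for name in children)
    return FARM_MODEL.fullmatch(sequence) is not None

print(matches_farm(["farmer", "dog", "dog", "pig", "cow", "chicken"]))  # valid
print(matches_farm(["farmer"]))               # valid: everything else is optional
print(matches_farm(["farmer", "dog", "cat"])) # invalid: dog and cat are alternatives
print(matches_farm(["dog", "farmer"]))        # invalid: wrong order
```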
Attribute Declarations
• <!ELEMENT car EMPTY>
  <!ATTLIST car doors CDATA #REQUIRED>
• <!ELEMENT point EMPTY>
  <!ATTLIST point x CDATA #REQUIRED
                  y CDATA #REQUIRED>
• <!ELEMENT point EMPTY>
  <!ATTLIST point x CDATA #REQUIRED>
  <!ATTLIST point y CDATA #REQUIRED>
• CDATA: non-parsed character data, except <, >, &, ' and "
• #REQUIRED: the attribute must be provided
  • <car doors="4"/>
• #IMPLIED: the application can derive a value if the attribute does not appear
• #FIXED: only one possible value, as specified, if the attribute is present
  <!ATTLIST po ... confirmed CDATA #FIXED "yes">
Attribute Types
• ID - a key that uniquely identifies an element
• IDREF - points to an element with an ID attribute
• Enumerated attribute types (with default values)
  • <!ATTLIST person gender ( M | F ) "F">
  <?xml version = "1.0"?>
  <!-- Fig. 6.8: IDExample.xml -->
  <!DOCTYPE bookstore [
    <!ELEMENT bookstore ( shipping+, book+ )>
    <!ELEMENT shipping ( duration )>
    <!ATTLIST shipping shipID ID #REQUIRED>
    <!ELEMENT book ( #PCDATA )>
    <!ATTLIST book shippedBy IDREF #IMPLIED>
    <!ELEMENT duration ( #PCDATA )>
  ]>
  <bookstore>
    <shipping shipID = "s1">
      <duration>2 to 4 days</duration>
    </shipping>
    <shipping shipID = "s2">
      <duration>1 day</duration>
    </shipping>
    <book shippedBy = "s2">Java How to Program 3rd edition.</book>
    <book shippedBy = "s2">C How to Program 3rd edition.</book>
  </bookstore>
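What a validating parser enforces for ID and IDREF can be sketched by hand: every ID value is unique, and every IDREF names an existing ID. The Python below is an illustration under that reading, not a DTD validator; check_ids and its parameters are hypothetical, and the attribute names are passed in because the sketch does not read the DTD.

```python
import xml.etree.ElementTree as ET

BOOKSTORE = """\
<bookstore>
  <shipping shipID="s1"><duration>2 to 4 days</duration></shipping>
  <shipping shipID="s2"><duration>1 day</duration></shipping>
  <book shippedBy="s2">Java How to Program 3rd edition.</book>
  <book shippedBy="s2">C How to Program 3rd edition.</book>
</bookstore>
"""

def check_ids(xml_text, id_attr="shipID", ref_attr="shippedBy"):
    """Return a list of ID/IDREF problems; an empty list means consistent."""
    root = ET.fromstring(xml_text)
    problems, seen = [], set()
    for elem in root.iter():               # first pass: collect IDs
        value = elem.get(id_attr)
        if value is not None:
            if value in seen:
                problems.append(f"duplicate ID {value!r}")
            seen.add(value)
    for elem in root.iter():               # second pass: resolve references
        ref = elem.get(ref_attr)
        if ref is not None and ref not in seen:
            problems.append(f"dangling IDREF {ref!r}")
    return problems

ok = check_ids(BOOKSTORE)
broken = check_ids(BOOKSTORE.replace('shippedBy="s2"', 'shippedBy="s9"', 1))
print(ok, broken)
```

Two passes are used because an IDREF may legally appear before the element that carries the matching ID.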
More Attribute Types
• NMTOKEN / NMTOKENS - name token / tokens, each consisting only of letters, digits, periods, underscores, hyphens and colons
  • <!ELEMENT born EMPTY>
    <!ATTLIST born year NMTOKEN #REQUIRED>
  • Conforming element: <born year = "1934" />
  • The attribute value need not start with a letter
• ENTITY - the attribute value must be a declared entity referring to an external unparsed entity
  • <!ENTITY city ... >
    <!ENTITY boat ... >
    <!ENTITY city ... >
    <!ATTLIST company tour ENTITY #REQUIRED>
  • Conforming element: <company tour = "city">
• ENTITIES - one or more of the above ENTITY values
  • <!ATTLIST company tourset ENTITIES #REQUIRED>
  • Conforming element: <company tourset = "city boat train"> (assume city, boat and train are declared entities)
Limitations of DTD of XML 1.0
• DTD not extensible
• Only one DTD per document
• Limited support of namespaces
• Weak data typing
• No inheritance
• Document can override an external DTD
• Non-XML syntax
• No (direct) DOM support
• Limited tools
• Cannot specify cardinality
Something better?
• Schemas are the answer
• "Schema" originated in databases and means the organization or structure of a database
  • Naming of data items
  • Constraints to be applied to data (e.g., data typing)
  • Relationships between data items
• W3C schemas (May 2001): http://www.w3c.org/XML/Schema
• XML-Data Reduced (XDR) - Microsoft's non-W3C-compliant implementation
• See tutorial: http://zvon.org/xxl/XMLSchemaTutorial/Output/index.html
Schema vs. DTDs
• DTD is weak in data typing
  • <quantity>hello</quantity>
• Schemas are XML documents which can be manipulated like other XML documents
  • Valid schemas conform to DTDs
• Schemas have more detailed and robust content models
• Schemas are extensible
• Dynamic schemas - can be modified at runtime
XML Schema - Simple Types
• Elements that do not contain other elements or attributes are of type simpleType.
  <xsd:element name="STAFFNO" type="xsd:string"/>
  <xsd:element name="DOB" type="xsd:date"/>
  <xsd:element name="SALARY" type="xsd:decimal"/>
• Attributes must be defined last:
  <xsd:attribute name="branchNo" type="xsd:string"/>
XML Schema - Enumeration Types
  <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="root">
      <xsd:simpleType>
        <xsd:restriction base="xsd:string">
          <xsd:enumeration value="N/A"/>
          <xsd:enumeration value="#REF!"/>
        </xsd:restriction>
      </xsd:simpleType>
    </xsd:element>
  </xsd:schema>
XML Schema - Range Restrictions
• Element "root" must be in the range 0-100 or 300-400 (including the border values).
  <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="root">
      <xsd:simpleType>
        <xsd:union>
          <xsd:simpleType>
            <xsd:restriction base="xsd:integer">
              <xsd:minInclusive value="0"/>
              <xsd:maxInclusive value="100"/>
            </xsd:restriction>
          </xsd:simpleType>
          <xsd:simpleType>
            <xsd:restriction base="xsd:integer">
              <xsd:minInclusive value="300"/>
              <xsd:maxInclusive value="400"/>
            </xsd:restriction>
          </xsd:simpleType>
        </xsd:union>
      </xsd:simpleType>
    </xsd:element>
  </xsd:schema>
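The union of the two restrictions amounts to a simple predicate on the element's text. A hand-written Python equivalent, for illustration only (a schema validator would derive this from the schema itself; in_root_range is a hypothetical name):

```python
import re

# Plain decimal integer with optional sign, as a simplification of xsd:integer.
_INTEGER = re.compile(r"[+-]?[0-9]+")

def in_root_range(text):
    """True if text is an integer in 0..100 or 300..400, bounds included."""
    text = text.strip()
    if not _INTEGER.fullmatch(text):
        return False
    value = int(text)
    # xsd:union accepts the value if either member type accepts it.
    return 0 <= value <= 100 or 300 <= value <= 400

for sample in ["0", "100", "101", "299", "300", "400", "401", "hello"]:
    print(sample, in_root_range(sample))
```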
XML Schema - Complex Types
• Elements that contain other elements are of type complexType.
• The list of children of a complex type is described by
  • all - all must appear, in any order
  • sequence - all must appear, in the specified order
  • choice - exactly one of the listed children appears
  <xsd:element name="STAFFLIST">
    <xsd:complexType>
      <xsd:sequence>
        <!-- children defined here -->
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
Cardinality
• Cardinality of an element can be represented using the attributes minOccurs and maxOccurs (default 1).
• To represent an optional element, set minOccurs to 0; to indicate there is no maximum number of occurrences, set maxOccurs to "unbounded".
  <xsd:element name="DOB" type="xsd:date" minOccurs="0"/>
  <xsd:element name="NOK" type="xsd:string" minOccurs="0" maxOccurs="3"/>
References
• Can use references to element and attribute definitions.
  <xsd:element name="STAFFNO" type="xsd:string"/>
  ...
  <xsd:element ref="STAFFNO"/>
• If there are many references to STAFFNO, using references keeps the definition in one place and improves the maintainability of the schema.
Defining New Types
• Can also define new data types to create elements and attributes.
  <xsd:simpleType name="STAFFNOTYPE">
    <xsd:restriction base="xsd:string">
      <xsd:maxLength value="5"/>
    </xsd:restriction>
  </xsd:simpleType>
  <xsd:simpleType name="myNumber">
    <xsd:restriction base="xsd:decimal">
      <xsd:totalDigits value="5"/>
      <xsd:fractionDigits value="2"/>
    </xsd:restriction>
  </xsd:simpleType>
  <xsd:simpleType name="myString">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[^@]+@[^.]+\..+"/>
    </xsd:restriction>
  </xsd:simpleType>
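The myString and myNumber facets can be approximated by hand to see what they accept. This Python sketch assumes that an xsd:pattern must match the whole value (hence fullmatch) and that totalDigits and fractionDigits count digits of the value, not of its written form; the function names are hypothetical and the checks are a simplification of a real schema validator.

```python
import re
from decimal import Decimal

EMAIL_LIKE = re.compile(r"[^@]+@[^.]+\..+")       # pattern facet of myString
DECIMAL_LEXICAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")

def is_my_string(text):
    """Pattern facet: the whole value must match the expression."""
    return EMAIL_LIKE.fullmatch(text) is not None

def is_my_number(text, total_digits=5, fraction_digits=2):
    """totalDigits / fractionDigits facets, counted on the normalized value."""
    if not DECIMAL_LEXICAL.fullmatch(text):
        return False                               # no exponent notation, NaN, etc.
    _, digits, exponent = Decimal(text).normalize().as_tuple()
    if exponent >= 0:                              # integer value, e.g. 100 -> 1E+2
        total, fraction = len(digits) + exponent, 0
    else:
        total, fraction = len(digits), -exponent
    return total <= total_digits and fraction <= fraction_digits

print(is_my_string("chiu@example.org"), is_my_string("no-at-sign"))
print(is_my_number("123.45"), is_my_number("1234.56"), is_my_number("12.345"))
```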
XMI
• XML Metadata Interchange (XMI)
  • Created by the OMG
  • As a standard for exchanging metamodels and models
• Provides a standard method for mapping object models and instances to XML
• Mapping UML models to XML is only one specific subset of how XMI can be applied
Applying XMI to UML
• Mapping UML models to XML schemas and documents
[Figure: "Our Model" is an instance of the "UML Meta Model"; the "XML DTD / Schema" is produced from the model according to XMI; "Our Model Instance" is an instance of "Our Model" and is translated according to XMI into an "XML Document", which is validated by the XML DTD / Schema.]
Simplified Product Catalog Example
[Figure: UML class diagram]
• Catalog: name : String, expirationDate : Date
• CatalogItem: name : String, description : String, sku : String, listPrice : Money, keyword [0..*] : String
• Organization: name : String, address : String, city : String, state : String, zip : String
• Money: currency : String, amount : double
• Product: photoURL : String, units : UnitOfMeasure
• Service: units : UnitOfTime
• <<enumeration>> UnitOfTime: hour, day, week, month, year
• <<enumeration>> UnitOfMeasure: each, dozen, meter, kilogram
• Association ends: +item, +supplier; multiplicities shown: 0..*, *, 1, 1