Web data exchange formats Introduction and Overview
Web data exchange formats • XML • JSON • YAML
XML outline • What is XML & Why XML • The rules of XML documents • XML schema and validation • XML processing • DOM • SAX • JAXP • JAXB • Digester
Before XML • HTML, Hyper-Text Markup Language, the most successful markup language of all time • First definition, HTML 1.0 – 1992 • Latest version, HTML 4.01 – 1999 • Fixed collection of markup tags • <head>, <body>, <h1>, <br>, etc…
What is XML? • XML, Extensible Markup Language, is a framework for defining markup languages • Created by the World Wide Web Consortium (W3C) to overcome the limitations of HTML • Like HTML, XML is based on SGML - Standard Generalized Markup Language • XML was designed with the Web in mind!
XML design goals • XML shall be straightforwardly usable over the Internet • XML shall support a wide variety of applications • XML shall be compatible with SGML • It shall be easy to write programs which process XML documents • The number of optional features in XML is to be kept to the absolute minimum, ideally zero
XML design goals • XML documents should be human-legible and reasonably clear • The XML design should be prepared quickly • The design of XML shall be formal and concise • XML documents shall be easy to create • Terseness in XML markup is of minimal importance
Typical XML usages • Web development and content management • Data exchange • Data storage • Configuration files • Web services
Historical outline • The development of XML began in the mid-90s • Initial XML draft – November 1996 • XML 1.0, W3C recommendation – February 1998 • XML 1.1 – February 2004
More about XML • XML lets us define our own tags • Each XML language is targeted to a particular application domain • XML specification says nothing about the semantics of the markup tags • XML is internationalized and platform independent
XML specification • Is located at • XML 1.0: http://www.w3.org/TR/REC-xml/ • XML 1.1: http://www.w3.org/TR/xml11/ • Defines the basic rules for XML documents
Sample XML document <?xml version="1.0" encoding="UTF-8"?> <people> <person id="person_1"> <name>David</name> <surname>Gilmour</surname> </person> <person id="person_2"> <name>Richard</name> <surname>Wright</surname> </person> <person id="person_3"> <name>Nick</name> <surname>Mason</surname> </person> </people>
Examples of XML markups • XHTML • WML - Wireless Markup Language • MathML – Mathematical Markup Language • ebXML - Electronic Business XML • CML - Chemical Markup Language • MusicXML – Musical Scores Markup Language • ThML - Theological Markup Language See more at http://en.wikipedia.org/wiki/List_of_XML_markup_languages
XHTML versus HTML • XHTML 1.0 is W3C’s XMLification of HTML 4.01 • The most notable differences: • HTML allows certain elements to omit the end tag (forbidden in XML) • Element and attribute names must be lowercase • Attribute values in XHTML must be present and they must be surrounded by quotes
XML document rules • The creators of XML decided to enforce document structure from the beginning • The XML specification requires a parser to reject any XML document that doesn't follow the basic rules • A parser is a piece of code that attempts to read a document and interpret its contents
Three kinds of XML documents • Invalid documents • Don't follow the syntax rules defined by XML specification or DTD/schema • Valid documents • Follow both the XML syntax rules and the rules defined in their DTD/schema • Well-formed documents • Follow the XML syntax rules but don't have a DTD/schema
How to check an XML document? An easy way to check whether an XML document is well-formed: • Simply open it in a browser
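The same check can also be done programmatically. The following is a minimal sketch, assuming the JAXP DOM parser that ships with the JDK; the class name WellFormedCheck and the method isWellFormed are illustrative names, not part of any API. It relies only on the rule stated earlier: a parser must reject a document that breaks the basic syntax rules.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative helper (hypothetical name): reports whether a string is well-formed XML.
class WellFormedCheck {

    // Returns true if the parser accepts the text, false if it rejects it.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (Exception e) {
            throw new IllegalStateException(e); // parser setup or I/O problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<greeting>Hello, World!</greeting>"));              // true
        System.out.println(isWellFormed("<greeting>Hello</greeting><greeting>Hola</greeting>")); // false: two roots
        System.out.println(isWellFormed("<p><b>My name is <i>John Brown</b>.</i></p>"));     // false: overlap
        System.out.println(isWellFormed("<ol compact></ol>"));                               // false: no attribute value
    }
}
```

This only checks well-formedness; it says nothing about validity against a DTD or schema.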
XML main notions • There are three common terms used to describe parts of an XML document: • tags • elements • attributes <people> <person id="person_1"> <name>David</name> <surname>Gilmour</surname> </person> </people>
Rule: The root element An XML document must be contained in a single element <?xml version="1.0"?> <!-- A well-formed document --> <greeting> Hello, World! </greeting> <?xml version="1.0"?> <!-- An invalid document --> <greeting> Hello, World! </greeting> <greeting> Hola, el Mundo! </greeting>
Rule: Elements can't overlap Invalid XML documents: <?xml version="1.0"?> <!-- An invalid document --> <person><name>John Brown</person></name> <?xml version="1.0"?> <!-- An invalid document --> <p> <b>My name is <i>John Brown</b>.</i> </p>
Rule: End tags are required • You can't leave out any end tags • If an element contains no content at all it is called an empty element • In empty elements in XML documents, you can put the closing slash in the start tag <!-- NOT legal XML markup --> <p>My name is John Brown <p>I am 25 years old <p>... <!-- Two equivalent break elements --> <br></br> <br />
Rule: Elements are case sensitive In HTML, <h1> and <H1> are the same; in XML, they're not <!-- NOT legal XML markup --> <Person> Elements are case sensitive </person> <!-- legal XML markup --> <person> Elements are case sensitive </person>
Rule: Quoted attribute values There are two rules for attributes in XML documents: • Attributes must have values • Those values must be enclosed within quotation marks (single or double) <!-- NOT legal XML markup --> <ol compact> <!-- legal XML markup --> <ol compact="yes">
XML declarations • Most XML documents start with an XML declaration that provides basic information about the document to the parser • An XML declaration is recommended, but not required <?xml version="1.0" encoding="UTF-8" standalone="no"?>
XML document as a tree • Conceptually, an XML document is a hierarchical structure called an XML tree • Although there is no consensus on the terminology used on XML trees, at least two standard terminologies exist: • XPath Data Model • XML Information Set http://www.ibm.com/developerworks/xml/library/x-hands-on-xsl/
Namespaces • Different XML languages may use the same tags • Namespaces • a solution for a name clashing problem <?xml version="1.0"?> <customer_summary xmlns:addr="http://www.xyz.com/addresses/" xmlns:books="http://www.zyx.com/books/" xmlns:mortgage="http://www.yyz.com/mortage/"> ... <addr:title>Mrs.</addr:title> ... ... <books:title>Lord of the Rings</books:title> ... ... <mortgage:title>NC2948-388-1983</mortgage:title> ...
Namespaces • XML namespaces are similar to Java packages • The string in a namespace definition looks like a URL, but it’s just a string! • For simplicity, unprefixed element names belong to the default namespace, which is empty (xmlns="") unless one is declared • The default can be overridden using a declaration of the form xmlns="URI"
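To see how a parser tells the clashing title elements apart, here is a small sketch assuming the JDK's JAXP DOM parser. The class and method names (NamespaceDemo, titleIn) are made up for illustration, and the document is a shortened version of the customer_summary sample.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch (hypothetical names): look up elements by namespace URI plus local name.
class NamespaceDemo {

    static final String XML =
          "<customer_summary xmlns:addr='http://www.xyz.com/addresses/'"
        + " xmlns:books='http://www.zyx.com/books/'>"
        + "<addr:title>Mrs.</addr:title>"
        + "<books:title>Lord of the Rings</books:title>"
        + "</customer_summary>";

    // Returns the text of the first <title> in the given namespace, or null if there is none.
    static String titleIn(String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP factories are not namespace-aware by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList matches = doc.getElementsByTagNameNS(namespaceUri, "title");
            return matches.getLength() == 0 ? null : matches.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleIn("http://www.xyz.com/addresses/")); // Mrs.
        System.out.println(titleIn("http://www.zyx.com/books/"));     // Lord of the Rings
    }
}
```

The lookup uses the namespace string, not the prefix: the prefixes addr and books are only local shorthand inside the document.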
Defining document content • The elements of particular XML language have to be defined in some way • A schema is a formal definition of the syntax of an XML-based language • Two main schema languages: • DTD • XML Schema
DTD - Document Type Definition • Built-in schema language since the first XML working draft • DTD is not itself written in XML notation <!-- address.dtd --> <!ELEMENT address (name, street, city, postal-code)> <!ELEMENT name (title?, first-name, last-name)> <!ELEMENT title (#PCDATA)> <!ELEMENT first-name (#PCDATA)> <!ELEMENT last-name (#PCDATA)> <!ELEMENT street (#PCDATA)> <!ELEMENT city (#PCDATA)> <!ELEMENT postal-code (#PCDATA)>
Document Type Declaration • An XML document may contain a reference to a DTD schema <?xml version="1.1"?> <!DOCTYPE people SYSTEM "http://www.music.com/people.dtd"> • XHTML documents often contain: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
DTD – Element declaration • An element declaration looks as follows: <!ELEMENT element-name content-model> • Content model defines the validity requirements of the contents (the sequence of its immediate child nodes) of all elements of the given name
DTD – Content model Constructs used in content model description:
DTD: example <!ELEMENT people (person+)> <!ELEMENT person (name, surname, birthdate?, address*)> <!ELEMENT name (#PCDATA)> <!ELEMENT surname (#PCDATA)> <!ELEMENT birthdate (#PCDATA)> <!ELEMENT address (#PCDATA)>
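A sketch of how this DTD could be enforced from Java, assuming the JDK's JAXP DOM parser with validation switched on. To stay self-contained, the DTD is embedded as an internal subset rather than referenced through a SYSTEM identifier; DtdValidationDemo and validate are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch (hypothetical names): validate a document body against the people DTD.
class DtdValidationDemo {

    // The DTD from the slide, embedded as an internal subset.
    static final String DOCTYPE =
          "<!DOCTYPE people [\n"
        + "<!ELEMENT people (person+)>\n"
        + "<!ELEMENT person (name, surname, birthdate?, address*)>\n"
        + "<!ELEMENT name (#PCDATA)>\n"
        + "<!ELEMENT surname (#PCDATA)>\n"
        + "<!ELEMENT birthdate (#PCDATA)>\n"
        + "<!ELEMENT address (#PCDATA)>\n"
        + "]>\n";

    // Parses DOCTYPE + body with DTD validation on; returns the validity errors reported.
    static List<String> validate(String body) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // enables DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage()); // validity errors are recoverable, so collect them
                }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        } catch (Exception e) {
            throw new IllegalStateException(e); // not well-formed, or parser setup failed
        }
        return errors;
    }

    public static void main(String[] args) {
        String ok = "<people><person><name>David</name><surname>Gilmour</surname></person></people>";
        String swapped = "<people><person><surname>Gilmour</surname><name>David</name></person></people>";
        System.out.println(validate(ok).isEmpty());      // true
        System.out.println(validate(swapped).isEmpty()); // false: name must come before surname
    }
}
```

Note that a person element carrying an id attribute would be reported as invalid here, because this DTD declares no attributes.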
DTD: Attribute-List declarations • An attribute-list declaration looks as follows: <!ATTLIST element-name attribute-definitions> • attribute-definitions is a list, each item of the form: attribute-name attribute-type default-declaration • Default declarations:
DTD: examples <!ELEMENT rectangle EMPTY> <!ATTLIST rectangle length CDATA "0px" width CDATA "0px"> <rectangle width="80px" length="40px"/> <!ELEMENT img EMPTY> <!ATTLIST img alt CDATA #REQUIRED src CDATA #REQUIRED width CDATA #IMPLIED height CDATA #IMPLIED> <img src="xmlj.jpg" alt="XMLJ Image" width="300"/> <!ELEMENT address (#PCDATA)> <!ATTLIST address country CDATA #FIXED "USA"> <address country="USA"> 123 15th St. Troy NY 12180</address>
XML Schema • Shortly after XML 1.0, the W3C initiated the development of the next generation schema language to attack the problems with DTD • Some judicious guiding design principles, that the new schema language should be: • More expressive than XML DTD • Expressed in XML • Self-describing • Simple enough
XML Schema Specification • Published in 2001 • Specification consists of the following parts: • Part 0 - Primer: http://w3.org/TR/xmlschema-0 • Part 1 - Document structures: http://w3.org/TR/xmlschema-1 • Part 2 - Datatypes: http://w3.org/TR/xmlschema-2
XML Schema • Unfortunately, the resulting language does not fulfill the original requirements • It provides good support for namespaces, modularization and datatypes, but: • It is not simple – Part 1 alone is more than 160 pages, and even XML experts do not find it human-readable • It is not fully self-describing – there is a schema for XML Schema, but it doesn’t capture all syntactical aspects of the language
XML Schema advantages Several advantages over DTDs • XML schemas use XML syntax • You can process a schema just like any other document • XML schemas support datatypes • Integers, floating point numbers, dates, times, strings, URLs • XML schemas are extensible • User-defined datatypes, derived datatypes • XML schemas have more expressive power • XML schemas support namespaces
XSD An XML Schema instance is an XML Schema Definition (XSD) and typically has the filename extension ".xsd" <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"> <xsd:element name="country" type="Country"/> <xsd:complexType name="Country"> <xsd:sequence> <xsd:element name="name" type="xsd:string"/> <xsd:element name="population" type="xsd:decimal"/> </xsd:sequence> </xsd:complexType> </xsd:schema>
Example: people.xsd <?xml version="1.0" encoding="UTF-8"?> <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified"> <xsd:element name="people" type="peopleType"/> <xsd:complexType name="peopleType"> <xsd:sequence maxOccurs="unbounded"> <xsd:element name="person" type="personType"/> </xsd:sequence> </xsd:complexType> <xsd:complexType name="personType"> <xsd:sequence> <xsd:element name="name" type="xsd:string"/> <xsd:element name="surname" type="xsd:string"/> </xsd:sequence> <xsd:attribute name="id" type="xsd:string"/> </xsd:complexType> </xsd:schema>
Declaring XML Schema To declare that people.xml uses the people.xsd schema, add the following: <?xml version="1.0" encoding="UTF-8"?> <!-- schema is located in the same folder --> <people xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="people.xsd"> . . . </people> <?xml version="1.0" encoding="UTF-8"?> <!-- schema location specified as URL --> <people xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation= "http://www.ante.lv/lab01-music-serverside/data/people.xsd"> . . . </people>
XML Schema: Defining elements • To define an element is to define its name and content model (type) • A type can be simple or complex • A simple type cannot contain elements or attributes in its value • A complex type can create the effect of embedding elements in other elements or it can associate attributes with an element
Simple, non-nested elements An element that does not contain attributes or other elements can be defined to be of a • simple type • predefined • user-defined <element name='name' type='string'/> <element name='birthday' type='date'/> <element name='age' type='integer'/> <element name='price' type='decimal'/> http://www.ibm.com/developerworks/xml/library/xml-schema/sidetable2.html
Complex types • Elements with attributes must have a complex type • Elements that embed other elements must have a complex type <complexType name="personType"> <sequence> <element name="name" type="string"/> <element name="surname" type="string"/> </sequence> <attribute name="id" type="string"/> </complexType>
Expressing constraints on elements • XML Schema offers greater flexibility than DTD for expressing constraints on the content model of elements • For example, element occurrence definition: • DTD: * + ? • XML Schema: • maxOccurs • minOccurs <element name='Book'> <complexType> <sequence> <element ref='Title' minOccurs='0'/> <element ref='Author' maxOccurs='2'/> </sequence> </complexType> </element>
XML validation • Online XML validator against XML Schema: http://tools.decisionsoft.com/schemaValidate/ • Java API also provides a way to make an XML parser validate a document
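One way to validate from Java is the javax.xml.validation package in the JDK. The sketch below assumes that API and inlines the people.xsd schema from the earlier slide (without the elementFormDefault attribute) instead of loading it from a file; XsdValidationDemo and isValid are illustrative names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Illustrative sketch (hypothetical names): validate a document against an inlined people.xsd.
class XsdValidationDemo {

    static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + "<xsd:element name='people' type='peopleType'/>"
        + "<xsd:complexType name='peopleType'>"
        + "<xsd:sequence maxOccurs='unbounded'>"
        + "<xsd:element name='person' type='personType'/>"
        + "</xsd:sequence></xsd:complexType>"
        + "<xsd:complexType name='personType'><xsd:sequence>"
        + "<xsd:element name='name' type='xsd:string'/>"
        + "<xsd:element name='surname' type='xsd:string'/>"
        + "</xsd:sequence>"
        + "<xsd:attribute name='id' type='xsd:string'/>"
        + "</xsd:complexType></xsd:schema>";

    // Returns true if xml is well-formed and valid against XSD, false otherwise.
    static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself could not be compiled", e);
        }
        try {
            // With no error handler set, the validator throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(
            "<people><person id='person_1'><name>David</name><surname>Gilmour</surname></person></people>")); // true
        System.out.println(isValid(
            "<people><person id='person_1'><name>David</name></person></people>")); // false: surname missing
    }
}
```

In a real project the schema would more likely be loaded once from people.xsd and the resulting Schema object reused across documents.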
XML processing APIs • The three basic XML parsing interfaces are: • Document Object Model (DOM) • Simple API for XML (SAX) • Streaming API for XML (StAX) • Java API for XML Processing (JAXP) • Provides common interfaces for processing XML documents (using DOM, SAX or StAX) • XML to Java classes binding • Java Architecture for XML Binding (JAXB) • Digester
DOM • The Document Object Model defines a set of interfaces to the parsed version of an XML document • The parser reads in the entire document and builds an in-memory tree • Your code can then use the DOM interfaces to manipulate the tree
DOM Using DOM API you can • move through the tree to see what the original document contained • delete sections of the tree • rearrange the tree • add new branches • and so on . . .
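The operations listed above can be sketched with the JDK's org.w3c.dom interfaces. This is a minimal illustration using a shortened version of the people sample; the class DomDemo and its helper methods are hypothetical names chosen for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch (hypothetical names): walk, prune and extend a DOM tree.
class DomDemo {

    static final String XML =
          "<people>"
        + "<person id='person_1'><name>David</name><surname>Gilmour</surname></person>"
        + "<person id='person_2'><name>Richard</name><surname>Wright</surname></person>"
        + "</people>";

    // Reads the whole document into an in-memory tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Moves through the tree: joins the <surname> text of every <person>, in document order.
    static String surnames(Document doc) {
        StringBuilder out = new StringBuilder();
        NodeList persons = doc.getElementsByTagName("person");
        for (int i = 0; i < persons.getLength(); i++) {
            Element person = (Element) persons.item(i);
            if (out.length() > 0) out.append(", ");
            out.append(person.getElementsByTagName("surname").item(0).getTextContent());
        }
        return out.toString();
    }

    // Deletes a section of the tree: removes the first <person> branch.
    static void removeFirstPerson(Document doc) {
        Node first = doc.getElementsByTagName("person").item(0);
        first.getParentNode().removeChild(first);
    }

    // Adds a new branch: appends a <person> under the root element.
    static void addPerson(Document doc, String id, String name, String surname) {
        Element person = doc.createElement("person");
        person.setAttribute("id", id);
        Element nameEl = doc.createElement("name");
        nameEl.setTextContent(name);
        Element surnameEl = doc.createElement("surname");
        surnameEl.setTextContent(surname);
        person.appendChild(nameEl);
        person.appendChild(surnameEl);
        doc.getDocumentElement().appendChild(person);
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        System.out.println(surnames(doc)); // Gilmour, Wright
        removeFirstPerson(doc);
        addPerson(doc, "person_3", "Nick", "Mason");
        System.out.println(surnames(doc)); // Wright, Mason
    }
}
```

All changes happen on the in-memory tree only; writing the result back out as XML text would need a separate serialization step.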