XML - Extensible Markup Language
HTML to XML • HTML documents • Emerging Web standards - XML • XML is good for enterprise-wide data interchange across platforms • Conversion from HTML to XML - IBM, Microsoft
XML - Motivation • In HTML, both the tag set and the tag semantics are fixed; tags have a limited and strict interpretation. • HTML has been widely successful in disseminating documents across the Internet. • Although data can be disseminated through HTML, extracting it is painful and laborious. • EDI has been a predominant mode of exchanging data among businesses, but its very rigid format requires highly customized applications.
XML - Introduction • XML aims to combine the ease of authoring of HTML documents with the ease of data exchange that is possible with EDI. • Tags are used to mark up documents. • XML is a meta-language for describing markup languages. • XML provides a facility to define tags and the structural relationships between them. • No predefined tag set implies no preconceived semantics; the semantics of an XML document is defined by the applications that process it.
XML - Goals • Straightforward to use over the Internet • Support a wide variety of applications: authoring, browsing, content analysis, etc. • Easy to write programs that process XML documents and validate them. • XML documents must be human-legible and reasonably clear. • The design of XML shall be formal and concise - expressed in EBNF (Extended Backus-Naur Form) - and amenable to modern compiler tools and techniques.
XML - Features • Some structure - not rigid • Extensibility - user-defined tags • Nested elements • Validation - documents may specify their own grammar • DTD (Document Type Definition) - schema exists with data as tag names • Applications - EDI: extraction, conversion, transformation, integration • Can be modeled using DOM
More terminology • RDF - Resource Description Framework - a method to describe metadata for XML documents • XSL - Extensible Stylesheet Language - a language for transforming and formatting XML. • Transformation and linking languages - XSLT, XPath, XPointer, XLink
Example-HTML • Print - Sanjay Madria Web Warehouse Tutorial, ADBIS’99 HTML <H2> Sanjay Madria </H2> <I> Web Warehouse Tutorial, ADBIS’99</I> Very difficult to understand, structure is hidden, describes only appearance
XML • <Ref> <Speaker> <Firstname> Sanjay </Firstname> <Lastname> Madria </Lastname> </Speaker> <Title> Web Warehouse Tutorial </Title> <Conference> ADBIS’99 </Conference> </Ref> Another format: <Firstname Value="Sanjay"/>
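Because the structure is explicit in the XML version, a program can pick out fields by name instead of guessing from appearance. The following is a minimal Python sketch using the standard library's xml.etree.ElementTree; the constant REF_XML and the helper describe_reference are illustrative names, not part of the original slides.

```python
import xml.etree.ElementTree as ET

# A well-formed copy of the reference record; tag names follow the slide.
REF_XML = """
<Ref>
  <Speaker>
    <Firstname>Sanjay</Firstname>
    <Lastname>Madria</Lastname>
  </Speaker>
  <Title>Web Warehouse Tutorial</Title>
  <Conference>ADBIS'99</Conference>
</Ref>
"""

def describe_reference(xml_text):
    """Return (speaker, title, conference), each looked up by element name."""
    ref = ET.fromstring(xml_text.strip())
    first = ref.findtext("Speaker/Firstname").strip()
    last = ref.findtext("Speaker/Lastname").strip()
    title = ref.findtext("Title").strip()
    conference = ref.findtext("Conference").strip()
    return f"{first} {last}", title, conference

speaker, title, conference = describe_reference(REF_XML)
```

The same lookup against the HTML version would have to rely on the fact that the speaker happens to be in an H2 and the title in an I element.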
XML can Separate Data from HTML • XML is used to Exchange Data • XML can be used to Share Data • XML can be used to Store Data • XML can be used to Create new Languages (WML)
XML • <Person> - a start-tag • </Person> - an end-tag • Tags are also called markup. • Tags must be balanced; they close in the inverse order of their opening • Tags are defined by users; there are no predefined tags
<person> <name> Alan </name> <age> 42 </age> <email> agb@abc.com </email> </person> Element - <person>…</person> Subelement - <age>
XML elements must follow these naming rules: • Names can contain letters, digits, and other characters such as hyphens, underscores, and periods • Names must start with a letter or "_" (underscore), not with a digit, hyphen, or period • Names should not start with the letters xml (or XML or Xml ..), which are reserved • Names cannot contain spaces
<table> <description> People on the fourth floor </description> <people> <person> <name> Alan </name> <age> 42 </age> <email> agb@abc.com </email> </person> <person> <name> Patsy </name> <age> 36 </age> <email> ptn@abc.com </email> </person> <person> <name> Ryan </name> <age> 58 </age> <email> rgz@abc.com </email> </person> </people> </table>
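The table document maps naturally onto a list of records. As a rough sketch (Python standard library; the names TABLE_XML and rows are illustrative), each person element becomes a dictionary keyed by subelement name:

```python
import xml.etree.ElementTree as ET

# A well-formed copy of the "people on the fourth floor" table.
TABLE_XML = """
<table>
  <description>People on the fourth floor</description>
  <people>
    <person><name>Alan</name><age>42</age><email>agb@abc.com</email></person>
    <person><name>Patsy</name><age>36</age><email>ptn@abc.com</email></person>
    <person><name>Ryan</name><age>58</age><email>rgz@abc.com</email></person>
  </people>
</table>
"""

root = ET.fromstring(TABLE_XML.strip())

# One dict per <person>; keys are the subelement tags, values their text.
rows = [
    {child.tag: child.text.strip() for child in person}
    for person in root.iter("person")
]
```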
<married></married> Can be abbreviated to <married/>
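A parser treats the two spellings of an empty element the same way. A small Python check, assuming the standard xml.etree.ElementTree parser (the helper is_empty is an illustrative name):

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<married></married>")
short_form = ET.fromstring("<married/>")

def is_empty(element):
    """True when the element has no child elements and no text."""
    return len(element) == 0 and not (element.text or "").strip()

# Both spellings yield an element with the same tag and no content.
same = (
    long_form.tag == short_form.tag
    and is_empty(long_form)
    and is_empty(short_form)
)
```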
XML Attributes • An attribute is a (name, value) pair <product> <name language="French"> trompette six trous </name> <price currency="Euro"> 420.12 </price> <address format="XLB56" language="French"> <street>31 rue Croix-Bosset</street> <zip>92310</zip><city>Sevres</city> <country>France</country> </address> </product>
Attribute values are always strings, written in quotes ("..") • A given attribute may occur only once within a tag, while subelements with the same tag name may be repeated
XML tags are case sensitive • With XML, white space is preserved • <b><i>This text is bold and italic</b></i> • Tolerated in HTML, but not well-formed XML • <b><i>This text is bold and italic</i></b> • Correctly nested, as XML requires
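The well-formedness rules stated so far (balanced and properly nested tags, case-sensitive names, at most one occurrence of an attribute per tag) can be observed by handing small fragments to a parser. This is a sketch using Python's standard xml.etree.ElementTree; is_well_formed and checks are illustrative names.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

checks = {
    "properly nested": is_well_formed("<b><i>bold and italic</i></b>"),
    "overlapping tags": is_well_formed("<b><i>bold and italic</b></i>"),
    "case mismatch": is_well_formed("<Person>Alan</person>"),
    "repeated attribute": is_well_formed(
        '<name language="French" language="English"/>'
    ),
    "repeated subelement": is_well_formed(
        "<product><name>a</name><name>b</name></product>"
    ),
}
```

Only the properly nested fragment and the one with repeated subelements are accepted; the other three are rejected as not well-formed.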
XML Elements are Extensible • From the note document, an application extracts the <to>, <from>, and <body> elements to produce: • MESSAGE To: Tove From: Jani • Don't forget me this weekend!
<?xml version="1.0" ?> <note><to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
<note> <date>1999-08-01</date> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note> • No problem: an application that looks up <to>, <from>, and <body> by name still works after a <date> element is added
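A minimal sketch of such an extracting application, in Python with the standard library. The function name render_message and the exact output layout are assumptions made for illustration; the point is that lookup by element name is unaffected by the extra date element.

```python
import xml.etree.ElementTree as ET

NOTE_V1 = (
    "<note><to>Tove</to><from>Jani</from><heading>Reminder</heading>"
    "<body>Don't forget me this weekend!</body></note>"
)
# Same note, extended with a <date> element.
NOTE_V2 = (
    "<note><date>1999-08-01</date><to>Tove</to><from>Jani</from>"
    "<heading>Reminder</heading>"
    "<body>Don't forget me this weekend!</body></note>"
)

def render_message(xml_text):
    """Build the message text from the <to>, <from>, and <body> elements."""
    note = ET.fromstring(xml_text)
    return "MESSAGE To: {} From: {} {}".format(
        note.findtext("to"), note.findtext("from"), note.findtext("body")
    )

message_v1 = render_message(NOTE_V1)
message_v2 = render_message(NOTE_V2)
```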
Book Title: My First XML • Chapter 1: Introduction to XML • What is HTML • What is XML • Chapter 2: XML Syntax • Elements must have a closing tag • Elements must be correctly nested
<book> <title>My First XML</title> <prod id="33-657" media="paper"></prod> <chapter>Introduction to XML <para>What is HTML</para> <para>What is XML</para> </chapter> <chapter>XML Syntax <para>Elements must have a closing tag</para> <para>Elements must be properly nested</para> </chapter> </book>
<person sex="female"> <firstname>Anna</firstname> <lastname>Smith</lastname> </person> • or • <person> <sex>female</sex> <firstname>Anna</firstname> <lastname>Smith</lastname> </person>
Bad Design • <note day="12" month="11" year="99" to="Tove" from="Jani" heading="Reminder" body="Don't forget me this weekend!"> </note>
<note date="12/11/99"> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
<note> <date>12/11/99</date> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
<note> <date> <day>12</day> <month>11</month> <year>99</year> </date> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
PCDATA • XML parsers normally treat the text of an element as parsed character data (PCDATA). • When an XML element is parsed, the text between the XML tags is also parsed, so markup and entity references inside it are recognized. • CDATA • Text inside a CDATA section is not parsed as markup; it is passed through as literal character data. • A CDATA section starts with "<![CDATA[" and ends with "]]>".
<person> <name> Alan </name> <age> 42 </age> <email> agb@abc.com </email> </person> or <person name="Alan" age="42" email="agb@abc.com" /> or <person age="42"> <name> Alan </name> <email> agb@abc.com </email> </person>
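The difference between parsed text and a CDATA section shows up as soon as the text contains characters such as "<" or "&". A short Python sketch with the standard parser; the script element and its content are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The same text, once protected by a CDATA section and once as plain PCDATA.
WITH_CDATA = "<script><![CDATA[if (a < b && b < c) return 1;]]></script>"
WITHOUT_CDATA = "<script>if (a < b && b < c) return 1;</script>"

# Inside CDATA, "<" and "&" are ordinary characters.
cdata_text = ET.fromstring(WITH_CDATA).text

# As PCDATA, the parser tries to read "<" and "&" as markup and fails.
try:
    ET.fromstring(WITHOUT_CDATA)
    plain_text_parses = True
except ET.ParseError:
    plain_text_parses = False
```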
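[Figure: two tree representations of the person data, each with root person and children name, age, and email (listed in different orders) holding the values Alan, 42, and agb@abc.com]
XML can associate a unique identifier with an element, as the value of a certain attribute called id • Other elements refer to that element using idref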
<messages> <note ID="501"> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note> <note ID="502"> <to>Jani</to> <from>Tove</from> <heading>Re: Reminder</heading> <body>I will not!</body> </note> </messages>
<state id="s2"> <scode>NV</scode> <sname>Nevada</sname> </state> <city id="c2"> <ccode>CCN</ccode> <cname>Carson City</cname> <state-of idref="s2"/> </city>
<a><b id="&o123"> some string </b></a> <a c="&o123"/> • Assume c is a reference attribute • <a b="&o123"/> <a><c id="&o123"> some string </c></a> • Assume b is a reference attribute
<geography> <states> <state id="s1"> <scode>ID</scode> <sname>Idaho</sname> <capital idref="c1"/> <cities-in idref="c1"/><cities-in idref="c3"/>…… </state> <state id="s2"> <scode>NV</scode> <sname>Nevada</sname> <capital idref="c2"/> <cities-in idref="c2"/>……. </state> …. </states>
<cities> <city id="c1"> <ccode>BOI</ccode> <cname>Boise</cname> <state-of idref="s1"/> </city> <city id="c2"> <ccode>CCN</ccode> <cname>Carson City</cname> <state-of idref="s2"/> </city> <city id="c3"> <ccode>MOC</ccode> <cname>Moscow</cname> <state-of idref="s1"/> </city> … </cities> </geography>
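References of this kind turn the tree into a graph that an application can navigate in both directions. The sketch below, in Python with the standard library, uses a cut-down geography document; the names GEO_XML, by_id, capital_of, and state_of are illustrative, and the plain dictionary lookup is only one possible way to resolve idref values.

```python
import xml.etree.ElementTree as ET

# A reduced version of the geography document (one state, two cities).
GEO_XML = """
<geography>
  <states>
    <state id="s1"><scode>ID</scode><sname>Idaho</sname>
      <capital idref="c1"/>
      <cities-in idref="c1"/><cities-in idref="c3"/>
    </state>
  </states>
  <cities>
    <city id="c1"><cname>Boise</cname><state-of idref="s1"/></city>
    <city id="c3"><cname>Moscow</cname><state-of idref="s1"/></city>
  </cities>
</geography>
"""

root = ET.fromstring(GEO_XML.strip())

# Index every element that carries an id attribute.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}

def capital_of(state_id):
    """Follow state -> capital idref -> city and return the city name."""
    state = by_id[state_id]
    city = by_id[state.find("capital").get("idref")]
    return city.findtext("cname")

def state_of(city_id):
    """Follow city -> state-of idref -> state and return the state name."""
    city = by_id[city_id]
    state = by_id[city.find("state-of").get("idref")]
    return state.findtext("sname")

def cities_in(state_id):
    """Names of all cities referenced by the state's cities-in elements."""
    state = by_id[state_id]
    return [by_id[c.get("idref")].findtext("cname")
            for c in state.findall("cities-in")]
```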
Ordering • person: {firstname: "John", lastname: "Smith"} • person: {lastname: "Smith", firstname: "John"} • As semistructured data (SSD), both are the same
These two are not the same as XML documents, because subelements are ordered: <person><firstname>John</firstname> <lastname>Smith </lastname></person> <person><lastname>Smith </lastname> <firstname>John</firstname></person> The following two are equivalent, as attributes are not ordered: <person firstname="John" lastname="Smith"/> <person lastname="Smith" firstname="John"/>
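The ordering rules can be checked directly: child elements come back from a parser as an ordered sequence, while attributes come back as a mapping. A small Python sketch with the standard library (variable and function names are illustrative):

```python
import xml.etree.ElementTree as ET

def child_tags(xml_text):
    """Tags of the root's children, in document order."""
    return [child.tag for child in ET.fromstring(xml_text)]

first_then_last = (
    "<person><firstname>John</firstname><lastname>Smith</lastname></person>"
)
last_then_first = (
    "<person><lastname>Smith</lastname><firstname>John</firstname></person>"
)

# Subelement order is part of the document, so these sequences differ.
elements_differ = child_tags(first_then_last) != child_tags(last_then_first)

# Attributes form an unordered name -> value mapping, so these are equal.
attrs_a = ET.fromstring('<person firstname="John" lastname="Smith"/>').attrib
attrs_b = ET.fromstring('<person lastname="Smith" firstname="John"/>').attrib
attributes_equal = attrs_a == attrs_b
```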
Mixing elements and Text <Person> This is my best friend <Name> Alan </Name> <Age> 42 </Age> I am not too sure of the following email <Email> agb@abc.com </Email > </Person>
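Mixed content is visible to a program as text runs interleaved with child elements. In Python's xml.etree.ElementTree, the text before the first child is stored in the parent's text field and text following a child in that child's tail field. The helper text_runs below is an illustrative name, not a library function.

```python
import xml.etree.ElementTree as ET

MIXED_XML = (
    "<Person>This is my best friend<Name>Alan</Name><Age>42</Age>"
    "I am not too sure of the following email"
    "<Email>agb@abc.com</Email></Person>"
)

person = ET.fromstring(MIXED_XML)

def text_runs(element):
    """Collect the free text that sits directly inside element."""
    runs = []
    if element.text and element.text.strip():
        runs.append(element.text.strip())
    for child in element:
        # Text after a child's end-tag is kept in child.tail.
        if child.tail and child.tail.strip():
            runs.append(child.tail.strip())
    return runs

free_text = text_runs(person)
element_names = [child.tag for child in person]
```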
<!-- this is a comment --> - Comments are allowed anywhere except inside markup, and are part of the document (though not of its character data). • <?xml-stylesheet href="book.css" type="text/css"?> - Processing instructions (PIs) are intended for applications. • <?xml version="1.0"?> - This is not a PI and is not passed to the application. • <![CDATA[<start>this is an incorrect element </end>]]> - A CDATA section; its content is treated as text, not markup. • <!DOCTYPE name [markupdeclarations]> - Document type declaration. • Overall layout: <?xml….?> <!DOCTYPE name [markupdeclarations]> <name>…</name>
<db><person> <name> Alan </name> <age> 42 </age> <email> agb@abc.com </email> </person> <person>… </person> … </db> • DTD for this document: <!DOCTYPE db [ <!ELEMENT db (person*)> <!ELEMENT person (name,age,email)> <!ELEMENT name (#PCDATA)> <!ELEMENT age (#PCDATA)> <!ELEMENT email (#PCDATA)> ]>
Recursion <!ELEMENT node (leaf | (node,node))> <!ELEMENT leaf (#PCDATA)> An example of such XML document is <node> <node> <node> <leaf> 1 </leaf> </node> <node> <leaf> 2 </leaf> </node> </node> <node> <leaf> 3 </leaf> </node> </node>
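A recursive DTD calls for a recursive traversal. The sketch below (Python standard library; the function name leaves is illustrative) walks a document of this shape and follows the two alternatives of the content model: a node holds either a single leaf or exactly two nodes.

```python
import xml.etree.ElementTree as ET

TREE_XML = (
    "<node>"
    "<node><node><leaf>1</leaf></node><node><leaf>2</leaf></node></node>"
    "<node><leaf>3</leaf></node>"
    "</node>"
)

def leaves(node):
    """Return leaf values left to right, mirroring (leaf | (node,node))."""
    children = list(node)
    if len(children) == 1 and children[0].tag == "leaf":
        return [children[0].text.strip()]
    if len(children) == 2 and all(c.tag == "node" for c in children):
        return leaves(children[0]) + leaves(children[1])
    raise ValueError("node does not match (leaf | (node,node))")

leaf_values = leaves(ET.fromstring(TREE_XML))
```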
<db> <r1><a> a1 </a><b> b1 </b><c> c1 </c></r1> <r1><a> a2 </a><b> b2 </b><c> c2 </c></r1> <r2><c> c2 </c><d> d2 </d></r2> <r2><c> c3 </c><d> d3 </d></r2> <r2><c> c4 </c><d> d4 </d></r2> </db>
<!DOCTYPE db [ <!ELEMENT db (r1*,r2*)> <!ELEMENT r1 (a,b,c)> <!ELEMENT r2 (c,d)> <!ELEMENT a (#PCDATA)> <!ELEMENT b (#PCDATA)> <!ELEMENT c (#PCDATA)> <!ELEMENT d (#PCDATA)> ]>
<!ELEMENT r2 ((c,d) | (d,c))> <!ELEMENT db ((r1|r2)*)> <!ELEMENT r1 (a,b?,c+)> <!DOCTYPE db [<!ELEMENT …>…]> <!DOCTYPE db SYSTEM "schema.dtd"> <!DOCTYPE db SYSTEM "http://www.schemaauthority.com/schema.dtd">
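Element content models are regular expressions over child element names, so their effect can be imitated with an ordinary regex engine. The following Python sketch is not a DTD validator; it hand-translates the three content models above into regexes (the table CONTENT_MODELS and the functions matches_model and is_valid are illustrative names) and checks each element's sequence of child tags against them. Text content and attributes are ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written regex equivalents of the content models; each child tag
# is followed by a comma so that names cannot run into each other.
CONTENT_MODELS = {
    "db": r"((r1|r2),)*",       # ((r1|r2)*)
    "r1": r"a,(b,)?(c,)+",      # (a,b?,c+)
    "r2": r"(c,d,|d,c,)",       # ((c,d) | (d,c))
}

def matches_model(element):
    """Check one element's child sequence against its content model."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return True  # elements such as a, b, c, d are not checked here
    sequence = "".join(child.tag + "," for child in element)
    return re.fullmatch(model, sequence) is not None

def is_valid(xml_text):
    """True if every element in the document matches its model."""
    root = ET.fromstring(xml_text)
    return all(matches_model(el) for el in root.iter())

good = "<db><r1><a/><c/><c/></r1><r2><d/><c/></r2><r1><a/><b/><c/></r1></db>"
bad = "<db><r1><b/><a/><c/></r1></db>"  # b appears before a
```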
<product> <name language="French" department="music"> trompette six trous </name> <price currency="Euro"> 420.12 </price> </product> <!ATTLIST name language CDATA #REQUIRED department CDATA #IMPLIED> <!ATTLIST price currency CDATA #IMPLIED>
IDREF - the attribute's value is some other element's identifier • IDREFS - the attribute's value is a list of identifiers, separated by spaces <!DOCTYPE family [ <!ELEMENT family (person*)> <!ELEMENT person (name)> <!ELEMENT name (#PCDATA)> <!ATTLIST person id ID #REQUIRED mother IDREF #IMPLIED father IDREF #IMPLIED children IDREFS #IMPLIED> ]>
<family> <person id="jane" mother="mary" father="john"> <name> Jane Doe </name> </person> <person id="john" children="jane jack"> <name> John Doe </name> </person> <person id="mary" children="jane jack"> <name> Mary Smith </name> </person> <person id="jack" mother="mary" father="john"> <name> Jack Smith </name> </person> </family>
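A validating parser checks that every IDREF and IDREFS value names an existing ID. The same check can be sketched by hand in Python with the standard library. The sample below is a variant of the family document in which jack's mother attribute names "smith", an id that no person declares; the function name dangling_references is illustrative.

```python
import xml.etree.ElementTree as ET

FAMILY_XML = """
<family>
  <person id="jane" mother="mary" father="john"><name>Jane Doe</name></person>
  <person id="john" children="jane jack"><name>John Doe</name></person>
  <person id="mary" children="jane jack"><name>Mary Smith</name></person>
  <person id="jack" mother="smith" father="john"><name>Jack Smith</name></person>
</family>
"""

def dangling_references(root):
    """Return (person id, attribute, value) for references to unknown ids."""
    ids = {person.get("id") for person in root.iter("person")}
    problems = []
    for person in root.iter("person"):
        # IDREF attributes hold a single identifier.
        for attr in ("mother", "father"):
            value = person.get(attr)
            if value is not None and value not in ids:
                problems.append((person.get("id"), attr, value))
        # An IDREFS attribute holds a space-separated list of identifiers.
        for value in (person.get("children") or "").split():
            if value not in ids:
                problems.append((person.get("id"), "children", value))
    return problems

problems = dangling_references(ET.fromstring(FAMILY_XML.strip()))
```

Here the only problem reported is jack's mother reference; replacing "smith" with "mary" makes the list empty.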