Introduction to XML and its processing techniques. Cheng-Chia Chen 4/22 2003. outlines. What is XML ? A glimpse of XML Why do we need XML ? Some XML applications XML and related Core Specifications APIs for XML Combine XML technology with traditional language processing technology.
Introduction to XML and its processing techniques Cheng-Chia Chen 4/22 2003
Outline • What is XML? • A glimpse of XML • Why do we need XML? • Some XML applications • XML and related core specifications • APIs for XML • Combining XML technology with traditional language processing technology • Other important XML programming technologies • Summary and information for further study
What is XML? • The eXtensible Markup Language • a data format (syntax) for representing, storing, and transmitting data whose structure is defined using XML. • a data-structure definition language: lets you define the structure and format of your own data. • a text-based markup language: lets you define your own HTML-like markup languages. • Recommended by the World Wide Web Consortium (W3C) in Feb. 1998. • intended as a new message format for the Internet that makes up for the shortcomings of HTML.
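The idea of XML • Existing student information (student ID, name, department, year, email) • S9010 張得功 資科系 (Computer Science) 三年級 (3rd year) chang10@cs.nccu.edu.tw • S9021 王德財 應數系 (Applied Mathematics) 二年級 (2nd year) null • …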
HTML’s concerns • How to present the data: <TABLE BORDER=1 bgcolor="yellow"> <TR><TH>學號</TH><TH>姓名</TH><TH>科系</TH> <TH>年級</TH> <TH>電郵</TH> </TR> <TR><TD> S9010</TD><TD>張得功</TD> <TD>資科系</TD> <TD>三年級</TD> <TD> chang10@cs.nccu.edu.tw </TD></TR> <TR> <TD> S9021 </TD> <TD>王德財</TD> <TD>應數系</TD> <TD>二年級 </TD> </TR> </TABLE>
XML’s concerns • XML uses markup tags as well, but they describe the content rather than the presentation of that content. • the same example coded in XML: <students> <student><學號> S9010 </學號> <姓名>張得功</姓名> <科系>資科系</科系> <年級>三年級</年級> <電郵> chang10@cs.nccu.edu.tw </電郵> </student> <student><學號> S9021 </學號> <姓名>王德財</姓名> <科系>應數系</科系> <年級>二年級</年級><電郵/> </student> … </students> Notes: 1. Only content is encoded in the XML text. 2. Every piece of data is annotated by tags indicating its role or function in the message.
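Because every value is labeled by its tag, a program can pick fields out by name rather than by position. The following is a minimal sketch in Python, assuming the standard-library `xml.etree.ElementTree` module; `STUDENTS_XML` and `load_students` are illustrative names that do not appear in the slides.

```python
import xml.etree.ElementTree as ET

# The student example from the slide (the elided "…" entries are left out).
STUDENTS_XML = """
<students>
  <student><學號> S9010 </學號><姓名>張得功</姓名><科系>資科系</科系>
    <年級>三年級</年級><電郵> chang10@cs.nccu.edu.tw </電郵></student>
  <student><學號> S9021 </學號><姓名>王德財</姓名><科系>應數系</科系>
    <年級>二年級</年級><電郵/></student>
</students>
"""

def load_students(xml_text):
    """Return one dict per <student>, keyed by child tag name (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    records = []
    for student in root.findall("student"):
        # An empty element such as <電郵/> has text None; map it to "".
        records.append({child.tag: (child.text or "").strip() for child in student})
    return records

students = load_students(STUDENTS_XML)
print(students[0]["姓名"], students[0]["電郵"])
```

Nothing in the code depends on how the data would be rendered, which is the point of separating content from presentation.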
Where does XML come from? • a simplified subset of the Standard Generalized Markup Language (SGML), which was standardized in 1986. • simplified for more general use on the Web and as a data interchange format, • without losing extensibility; • easier for anyone to write valid XML; • easier to write a parser; • easier for the parser to quickly verify that documents are well-formed and/or valid. • Recommended by the W3C in Feb. 1998.
An example XML document <?xml version="1.0"?> <note> <to>Wang</to> <from>Chen</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note> Notes: • The XML declaration should always be included. • <note>…</note> is the root element, which has 4 children.
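As a quick check of the notes above, the document can be parsed and its root inspected. This is a small sketch using Python's standard `xml.etree.ElementTree`; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

NOTE_XML = """<?xml version="1.0"?>
<note>
  <to>Wang</to>
  <from>Chen</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""

root = ET.fromstring(NOTE_XML)               # returns the root element
child_tags = [child.tag for child in root]   # direct children, in document order
print(root.tag, len(child_tags), child_tags)
```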
<!-- the structure of the document element --> <department> <employee id="s8931"> <name>張德治</name> </employee> <employee id="s9017" id-no="L12345678"> <name>李大春</name> <url href="http://www.xml.com.tw/~lee/"/> </employee> </department>
Key terminology • Element • Element type (or element name) • Start tag • End tag • [Element] Content • child element • character data [PCDATA] • Attribute • Attribute name • Attribute value • DTD • Comment • Processing Instructions • <?target data?>
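Key terminology, illustrated: <!-- the structure of the document element --> <department> <employee id="s8931"> <name>張德治</name> </employee> <employee id="s9017" id-no="L12345678"> <name>李大春</name> <url href="http://www.xml.com.tw/~lee/"/> </employee> </department> • Labels in the slide figure: element type (or name), start-tag, end-tag, attributes, attribute name, attribute value, PCDATA, the root (document) element • For example: <department> is the start-tag and </department> the end-tag of the root (document) element, whose element type is department; id and id-no are attribute names and "s9017" is an attribute value; 張德治 is PCDATA.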
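The same terms show up as node kinds when a document is loaded into a DOM tree. The sketch below uses Python's standard `xml.dom.minidom`; the processing instruction `<?demo-target some data?>` and the `KINDS` table are additions made only for this illustration.

```python
from xml.dom import Node, minidom

DOC = """<?xml version="1.0"?>
<!-- the structure of the document element -->
<?demo-target some data?>
<department>
  <employee id="s8931"><name>張德治</name></employee>
</department>"""

# Human-readable names for the DOM node types used in this example.
KINDS = {
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "character data",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
}

dom = minidom.parseString(DOC)
top_level = [KINDS[node.nodeType] for node in dom.childNodes]

root = dom.documentElement                        # the root (document) element
employee = root.getElementsByTagName("employee")[0]
name = employee.getElementsByTagName("name")[0]

print(top_level)
print(root.tagName, employee.getAttribute("id"), name.firstChild.data)
```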
All XML elements must have an end tag • In HTML some elements do not have to have a closing tag. The following code is legal in HTML: <p>This is a paragraph <p>This is another paragraph • In XML all elements must have a closing tag like this: <p>This is a paragraph</p> <p>This is another paragraph</p>
XML tags are case sensitive • XML tags are case sensitive. • <Letter> != <letter> • Opening and closing tags must match with the same case: • <Message>This is incorrect</message> • <message>This is correct</message>
All XML elements must be properly nested • HTML allows overlapping elements: <b><i>bold and italic</b> italic only</i> • In XML all elements must be properly nested: <b><i>bold and italic</i> bold only</b>
Single root [document] element • A document contains exactly one root element. • All other elements must be nested within the root element. • Elements can have sub (child) elements, subelements can have sub-subelements, and so on. • Which elements and text data can appear as children of an element, and their order and multiplicity, are definable [by DTD/XML Schema]. <root> <child> <subchild>…</subchild> or text data <subchild>…</subchild> </child> … </root>
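The syntax rules covered so far (end tags, case-sensitive matching, proper nesting, a single root) are exactly what a parser checks when it tests well-formedness. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One sample per rule, taken from the slides where possible.
checks = {
    "missing end tag": "<p>This is a paragraph",
    "case mismatch": "<Message>This is incorrect</message>",
    "overlapping elements": "<b><i>bold and italic</b> italic only</i>",
    "two root elements": "<a/><b/>",
    "properly nested": "<b><i>bold and italic</i> bold only</b>",
}
results = {label: is_well_formed(text) for label, text in checks.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'rejected'}")
```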
XML Attributes • Appear within the start tag of an element. • The attributes that may appear in an element's start tag are definable [by DTD or XML Schema]. • ID attributes are for identification; no two ID attributes in a document instance may have the same value. • HTML examples: <img src="computer.gif"> <a href=demo.asp> • XML examples: <file type="gif"> <person id='3344'> Note: • In XML, attribute values must be quoted with ' or ".
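Two of the points above can be checked mechanically. The sketch below, in Python with the standard `xml.etree.ElementTree`, shows that an unquoted attribute value is rejected and looks for repeated identifier values. It assumes, purely for illustration, that the attribute named `id` is the one declared with type ID; without a DTD the parser cannot know that, and `duplicate_ids` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def duplicate_ids(xml_text, id_attr="id"):
    """Return the values of `id_attr` that occur on more than one element.

    Assumption: `id_attr` plays the role of an ID-typed attribute.
    """
    root = ET.fromstring(xml_text)
    values = [el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None]
    return sorted(v for v, n in Counter(values).items() if n > 1)

unique = "<people><person id='3344'/><person id=\"3345\"/></people>"
clash = "<people><person id='3344'/><person id='3344'/></people>"

# The HTML-style unquoted value from the slide is not well-formed XML.
try:
    ET.fromstring("<a href=demo.asp>link</a>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False

print(duplicate_ids(unique), duplicate_ids(clash), unquoted_accepted)
```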
Well-formed vs. Valid XML Documents • Well-formed XML documents • Essentially any document conforming to the XML syntax rules described above. • Every text/document must be well-formed to be an XML document. • Example: <?xml version="1.0"?> <note> <to>Wang</to> <from>Chen</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
Valid XML documents • A valid XML document • is a well-formed XML document and • conforms to the grammar attached to it. • The grammar attached to an XML document is called a DTD [Document Type Definition]. • A document with a reference to an external DTD: <?xml version="1.0"?> <!DOCTYPE note SYSTEM "Note.dtd"> <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
DTD • DTD • Document Type Definition; • a grammar for a class of XML documents • used to define the legal building blocks of an XML document. • Document Type Declaration: • Declares the DTD for an XML document; • External subset: // defined at external places <!DOCTYPE note SYSTEM "note.dtd"> • Internal subset: // inline declarations <!DOCTYPE note SYSTEM "externSubset.dtd" [ ……inline markup declarations……… ]>
DTD: Markup Declarations • Element type declarations • Attribute list declarations • Entity declarations • declare macro-like abbreviations. • <!ENTITY chencc "Cheng-Chia Chen"> • <!ENTITY chapter1 SYSTEM "chapter1.xml"> • <!ENTITY % subDTD SYSTEM "dtd1.dtd"> • Notation declarations • Define types of non-XML data • <!NOTATION png SYSTEM "http://www.w3.org/png">
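The macro-like behavior of an internal general entity can be seen by declaring it in the internal subset and referencing it in the content. This is a small sketch with Python's standard `xml.etree.ElementTree`; the surrounding `<note>` document is made up for the example, and external entities (the `SYSTEM` forms above) are deliberately not covered.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ENTITY chencc "Cheng-Chia Chen">
]>
<note><from>&chencc;</from></note>"""

root = ET.fromstring(DOC)
sender = root.findtext("from")   # the parser has already replaced &chencc;
print(sender)
```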
DTD: Element Type Declaration • Specifies the element type and content: <!ELEMENT Name contentSpec> • Element's content: • Empty: <!ELEMENT homepage EMPTY> • Any: <!ELEMENT container ANY> • Only elements (element content) • No character data • Mixed: • Character data mixed with child elements
DTD: Element content model • Basically represented by a regular expression over element types. • Building Blocks: • Choice (p | list | table | form ) • Sequence (street, zip, city, country) • Occurrences ? + * • Example: <!ELEMENT person (name, address+, homepage?, (email | telephone )+, note*)>
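Since a content model is essentially a regular expression over element types, it can be checked with an ordinary regex engine. The sketch below is a simplified illustration in Python, not how a validating parser is actually implemented: it handles only element-content models (no `#PCDATA`, `EMPTY`, or `ANY`), and `model_to_regex` and `children_match` are hypothetical helper names.

```python
import re

NAME = re.compile(r"[A-Za-z_][\w.\-]*")

def model_to_regex(model):
    """Translate a DTD element-content model into a Python regex.

    Each element name becomes a group matching "name;". The DTD sequence
    comma disappears (regex concatenation), while ( ) | ? + * keep their meaning.
    """
    body = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + ";)", model)
    body = body.replace(",", "").replace(" ", "")
    return re.compile(body)

def children_match(model, children):
    """Check a list of child element names against a content model."""
    encoded = "".join(tag + ";" for tag in children)
    return model_to_regex(model).fullmatch(encoded) is not None

PERSON = "(name, address+, homepage?, (email | telephone )+, note*)"

print(children_match(PERSON, ["name", "address", "email"]))   # address present
print(children_match(PERSON, ["name", "email"]))              # address missing
```

The `;` terminator after each name keeps one element name from matching as a prefix of another.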
DTD: Mixed element content • can contain either • other elements and character data or • only character data • Examples: <!ELEMENT para (#PCDATA |em | strong | abbr )* > <!ELEMENT p (#PCDATA |em | i | b | a| ul)*> <!ELEMENT street (#PCDATA)> <!ELEMENT city (#PCDATA)>
DTD: Attribute List Declaration • Defines the attributes that can appear in an element type. • format: <!ATTLIST elName attrName1 attrType1 attrDefault1 attrName2 attrType2 attrDefault2 ………………………………… > • Attribute types: • String type: • Tokenized type: • Enumerated type:
DTD: ATTLIST Attribute Type • String type: <!ATTLIST person age CDATA #IMPLIED> • Tokenized types: • ID, IDREF, IDREFS • ENTITY, ENTITIES • NMTOKEN, NMTOKENS <!ATTLIST person id ID #REQUIRED father IDREF #REQUIRED children IDREFS #IMPLIED> • Enumerated type: <!ATTLIST person gender (Male|Female) #REQUIRED>
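An IDREF value must match the ID of some element in the same document, and an IDREFS value is a space-separated list of such references. A validating parser enforces this; the sketch below only imitates the check by hand in Python for the `person` declaration above. The `<family>` sample data and the helper name `dangling_refs` are invented for the illustration.

```python
import xml.etree.ElementTree as ET

FAMILY = """
<family>
  <person id="p1" father="p0" children="p2 p3"/>
  <person id="p0" father="p9"/>
  <person id="p2" father="p1"/>
  <person id="p3" father="p1"/>
</family>
"""

def dangling_refs(xml_text):
    """Return referenced IDs that no <person> element declares."""
    root = ET.fromstring(xml_text)
    people = root.findall("person")
    ids = {p.get("id") for p in people}
    missing = set()
    for p in people:
        refs = [p.get("father")] if p.get("father") else []   # IDREF: one name
        refs += p.get("children", "").split()                  # IDREFS: several names
        missing.update(r for r in refs if r not in ids)
    return sorted(missing)

print(dangling_refs(FAMILY))
```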
DTD: ATTLIST Attribute defaults Provide information about the attribute’s presence: • #REQUIRED • Attribute must appear in the associated element. • <!ATTLIST person gender (Male |Female) #REQUIRED> • #IMPLIED • The attribute may be absent. • no default value. • <!ATTLIST person age CDATA #IMPLIED> • Default/constant value • <!ATTLIST list type (ol|ul) "ul"> • <!ATTLIST list type (ol|ul) #FIXED "ul">
XML unifies the syntax of information • Layers of information (data): • bit • byte • character: BCD, EBCDIC, ASCII, BIG5, ISO-8859 ==> Unicode • syntax (form): XML • semantics (ontology): Semantic Web • application • Semantic Web: • an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. • --- Tim Berners-Lee et al.
New desired requirements in the internet age • Easy retrieval of information over the net • realized by current Web/internet technology • good browser, • web server • HTTP, DNS, search engines. • HTML, URI, HyperText, MIME • Easy/cheap interoperation of existing software in the internet. • also the old goal of distributed system/computing • RPC, RMI, CORBA,... • a prerequisite for eCommerce • issues: • data transmission ==> solved by existing internet infrastructure • data representations ?
Why do we need a unifying format for data? • Case: 10 word processors, each of which must be able to process documents generated by any other. • 1st approach: • write a converter A-->B for every ordered pair of distinct processors A and B. • number of converters = n x (n-1) = 10 x 9 = 90 (bad!) • 2nd approach: • invent a common format (C). • write a pair of converters (A-->C, C-->A) for each word processor. • To let B process a document generated by A, simply chain two converters: • A ==(A-->C)==> C ==(C-->B)==> B • required converters: 2 x n = 20 (much better!) • prerequisite: a common format. • This is the role XML plays.
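The two counts can be reproduced with a few lines of Python. The function names are illustrative; the formulas are the ones stated on the slide.

```python
def pairwise_converters(n):
    """One converter per ordered pair of distinct formats: n x (n - 1)."""
    return n * (n - 1)

def hub_converters(n):
    """Two converters (to and from the common format) per format: 2 x n."""
    return 2 * n

for n in (3, 10, 100):
    print(n, pairwise_converters(n), hub_converters(n))
```

The pairwise count grows quadratically while the common-format count grows linearly, so the gap widens quickly as more formats join.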
Additional benefits of XML (as a common format) • Enables the interoperation of internet/intranet/extranet software and services. • The software required for processing XML is free (or cheap) to obtain. • no need to reinvent the wheel. • developers can focus on value-added software built on this underlying software. • Decouples tightly coupled distributed systems into loosely coupled ones. • less monopolization of software by vendors • more combinations for buyers to choose from • more chances for small companies to contribute software. • less investment for buyers.
Comparison of XML with Other formats • HTML • Text-based non-markup formats • .c .cpp .java .ini … • Binary formats • .dll .exe .o .swf • .class .png .jpeg …
Advantages of XML over HTML • XML lets you define your own tags. • XML tags describe the content, rather than the presentation of that content • easier content search (no annoying presentation data). • easier page development (separating content from view) • easy for devices to render the content according to their environments (single model/multiple views)
Advantages of XML over text formats • Examples: • JavaML vs. Java; CppML vs. C++ • XMI vs. Rational's proprietary format • web.xml, plugin.xml vs. ***.ini (for configuration) • build.xml vs. makefile • XQuery XML format vs. plain text format • RelaxNG XML vs. plain text format • advantages: • structure is explicitly represented in the XML format. • (free and) standard tools (and APIs) exist for quick parsing of the XML format. => front-end processing avoided/reduced • disadvantage: too verbose • for storage and transmission: can be overcome by compression • for human generation: requires a smarter editor (not a problem for machine generation) • for human reading/comprehension: a real problem!!
Advantages of XML over binary formats • Examples: • ASN.1 XER (XML Encoding Rules) vs. BER/CER/DER/PER • classML vs. the .class file format • swfml vs. swf (Flash file format) • advantages: • readable; editable • (free and) open software and APIs available • disadvantage: • takes longer to parse. The trend: • one data model / multiple representation formats + • converters among the formats.
Some XML applications • An XML application is a language that adopts the XML syntax [and is usually defined by a DTD/Schema]. • XML as an alternative representation format • Scalable Vector Graphics (SVG): for vector graphics • MathML: for mathematical expressions • SMIL (Synchronized Multimedia Integration Language) • Resource Description Framework (RDF): an XML language for describing web resources and their relationships • CML (Chemical Markup Language): for chemical molecules • JavaML: for Java programs • CppML: XML formats for C++ • Ant: a replacement for make for Java • Maven: a Java project management and project comprehension tool • OOML: an object-oriented programming language in XML • UIML: User Interface Markup Language • WAP WML (Wireless Markup Language) • See The XML Cover Pages for an extensive listing.
Mathematical Markup Language <?xml version="1.0"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN" "http://www.w3.org/TR/MathML2/dtd/xhtml-math11-f.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://www.w3.org/1998/Math/MathML" > <head> <title>Fiat Lux</title> </head> <body> <p> And God said, </p> <m:math> <m:mrow> <m:msub> <m:mi>δ</m:mi> <m:mi>α</m:mi> </m:msub> <m:msup> <m:mi>F</m:mi> <m:mi>αβ</m:mi> </m:msup> <m:mi> </m:mi> <m:mo>=</m:mo> <m:mi></m:mi> <m:mfrac> <m:mrow> <m:mn>4</m:mn> <m:mi>π</m:mi> </m:mrow> <m:mi>c</m:mi> </m:mfrac> <m:mi> </m:mi> <m:msup> <m:mi>J</m:mi> <m:mrow> <m:mi>β</m:mi> <m:mo> </m:mo> </m:mrow> </m:msup> </m:mrow> </m:math> <p> and there was light </p> </body> </html>
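The document above mixes two vocabularies, XHTML and MathML, and keeps them apart with namespace declarations (`xmlns` and `xmlns:m`). A minimal sketch of how a program can tell the two apart, assuming Python's standard `xml.etree.ElementTree`; the shortened document and the prefix map `ns` are made up for the example (the prefixes used in code need not match those in the document).

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
MATHML = "http://www.w3.org/1998/Math/MathML"

# A cut-down version of the XHTML + MathML document.
DOC = f"""<html xmlns="{XHTML}" xmlns:m="{MATHML}">
  <body>
    <p>And God said,</p>
    <m:math><m:mfrac><m:mn>4</m:mn><m:mi>c</m:mi></m:mfrac></m:math>
    <p>and there was light</p>
  </body>
</html>"""

root = ET.fromstring(DOC)
ns = {"h": XHTML, "m": MATHML}   # local prefix -> namespace URI

# ElementTree stores each tag as "{namespace-uri}local-name".
paragraphs = [p.text for p in root.findall(".//h:p", ns)]
math_tags = [el.tag for el in root.iter() if el.tag.startswith("{" + MATHML + "}")]

print(root.tag)
print(paragraphs)
print(len(math_tags))
```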
Vector Graphics • Scalable Vector Graphics (SVG) • Adobe SVG Viewer • Apache Batik SVG toolkit • Vector Markup Language (VML) • Internet Explorer 5.0 or above • Microsoft Office 2000
Ant • A make-like build tool • Sample build.xml <project default="echoFoo" name="ant-test" basedir="."> <property name="foo5.1" value="${foo5}"/> <target name="writeFoo3Bar3"> <echo message="foo3 = bar3" file="test.properties"/> </target> <target name="readWriteFoo4.1Foo4"> <echo message="foo4.1 = ${foo4}" file="test.properties"/> </target> <target name="readWriteFoo5.1Foo5InStart"> <echo message="foo5.1 = ${foo5.1}" file="test.properties"/> </target> <target name="echoFoo"> <echo message="${foo}"/> </target> </project>
Related technologies • XML is a key technology to ensure interoperability • But XML, by itself, is not really useful... we need to • have datatypes, validation (DTD-s, Schemas, ...) • mix XML applications (Namespaces) • link (XLink, XBase,...) • compose/decompose (XInclude, Fragments, ...) • refer to XML data content (XPath, Query, ...) • transform (XSLT) • encrypt, decrypt, sign (Signature, Encryption, ...) • interact, script (DOM, Events, ...) • etc
Core specifications for XML • XML 1.0 • XML Namespaces • XML Path Language (XPath) • Extensible Stylesheet Language (XSL) • XSL Transformations (XSLT) • XSL Formatting Objects (XSL-FO) • XML Linking Language (XLink) • XML Pointer Language (XPointer) • XML Schemas (; RelaxNG) • XHTML • XML signatures/canonicalization • XML protocols • XMLForm • XQuery (XML language for querying XML documents)
Core Specifications for XML • XML • document type definition (DTD): a mechanism for defining the format and content of valid XML documents. • a specification defining which texts are well-formed XML documents • XML Namespaces • Defines a mechanism to avoid collisions of element and/or attribute names in documents that use multiple sets of DTDs. • XLink • Defines the mechanism for linking to web resources from an XML document. • XPointer • Defines a mechanism for pointing to locations inside an XML document. • XPath • Defines a mechanism for referring to parts of an XML document
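To give a feel for how XPath refers to parts of a document, the sketch below runs a few path expressions over the department example from earlier in the deck. It assumes Python's standard `xml.etree.ElementTree`, which implements only a limited subset of XPath; a full XPath engine accepts many more expressions than the ones shown here.

```python
import xml.etree.ElementTree as ET

DEPARTMENT = """
<department>
  <employee id="s8931"><name>張德治</name></employee>
  <employee id="s9017" id-no="L12345678">
    <name>李大春</name>
    <url href="http://www.xml.com.tw/~lee/"/>
  </employee>
</department>
"""

root = ET.fromstring(DEPARTMENT)

# Child steps: every <name> directly under an <employee> child of the root.
all_names = [n.text for n in root.findall("./employee/name")]

# Attribute predicate: the employee whose id attribute equals 's9017'.
lee = root.find(".//employee[@id='s9017']")

# Child-element predicate: employees that have a <url> child.
with_url = [e.get("id") for e in root.findall("./employee[url]")]

print(all_names, lee.get("id-no"), with_url)
```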