This course provides an overview of XML, covering its origins, syntax, and various applications. Topics include XML Schema, document object modeling, transforming XML documents, searching XML, and more.
X-Informatics: I-400 and I-590
An Introduction to XML
Spring Semester 2002, MW 6:00 pm – 7:15 pm Indiana Time
Geoffrey Fox and Bryan Carpenter
PTLIU Laboratory for Community Grids
Informatics, (Computer Science, Physics)
Indiana University, Bloomington IN 47404
gcf@indiana.edu
xmlintrospring02
Outline of Introduction to XML
• The two drivers for XML
  • Original: a better way of specifying documents that is more powerful than HTML and easier to understand than SGML
  • Current: XML as an object structure for totally general entities
• Basic XML: well-formed and valid XML
• Examples of use of XML
• XML Syntax (this presentation)
• Further presentations on
  • XML Schema
  • XML based Document Object Model and style sheets
  • Transforming XML documents (XSLT)
  • Searching XML documents
  • Applications of XML: Dublin Core, RDF, SVG, SOAP
An Information Nugget
Nugget in XML
Essential Issues
• We have a world of objects
• Objects are instances of "classes of object"
  • This Sony VAIO is an instance of the general Sony VAIO laptop class
  • This Sony VAIO laptop class is a subclass of the laptop class
• Object classes are made up of smaller things, which have "types"
  • Color is a special type of variable with a 24 bit (or some other) specification
  • String (characters) is a type (simple type); Integers are a type etc.
  • Date is also a simple type in XML because it is so common
• XML has Schema to specify classes, and these are made up of simple types and complex (complicated) types
• Further, one Schema can be built from one or more other Schemas
An Example of RDF and Dublin Core
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/metadata/dublin_core#">
      <rdf:Description about="http://www.dlib.org">
        <dc:Title>D-Lib Program - Research in Digital Libraries</dc:Title>
        <dc:Description>The D-Lib program supports the community of people with research interests in digital libraries and electronic publishing. </dc:Description>
        <dc:Publisher>Corporation For National Research Initiatives</dc:Publisher>
        <dc:Date>1995-01-07</dc:Date>
        <dc:Subject>
          <rdf:Bag>
            <rdf:li>Research; statistical methods</rdf:li>
            <rdf:li>Education, research, related topics</rdf:li>
            <rdf:li>Library use Studies</rdf:li>
          </rdf:Bag>
        </dc:Subject>
        <dc:Type>World Wide Web Home Page</dc:Type>
        <dc:Format>text/html</dc:Format>
        <dc:Language>en</dc:Language>
      </rdf:Description>
    </rdf:RDF>
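A program can read a record like this by mapping each namespace prefix to its URI. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the shortened document and the names RDF_DOC and NS are illustrative assumptions, not part of the course material.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's record (assumption: only a few fields kept).
RDF_DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/metadata/dublin_core#">
  <rdf:Description about="http://www.dlib.org">
    <dc:Title>D-Lib Program - Research in Digital Libraries</dc:Title>
    <dc:Subject>
      <rdf:Bag>
        <rdf:li>Research; statistical methods</rdf:li>
        <rdf:li>Education, research, related topics</rdf:li>
        <rdf:li>Library use Studies</rdf:li>
      </rdf:Bag>
    </dc:Subject>
    <dc:Language>en</dc:Language>
  </rdf:Description>
</rdf:RDF>"""

# Prefix-to-URI map used only for the queries below; the prefixes chosen
# here need not match the ones written in the document.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/metadata/dublin_core#",
}

root = ET.fromstring(RDF_DOC)
description = root.find("rdf:Description", NS)
about = description.get("about")                      # un-prefixed attribute
title = description.find("dc:Title", NS).text
subjects = [li.text for li in description.findall("dc:Subject/rdf:Bag/rdf:li", NS)]

print(about)
print(title)
print(subjects)
```

The element names are matched by namespace URI, so the same queries work even if a document author picks different prefixes.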
XML Example for SOAP
• This is a way to use XML to send the command ls (list files) from one machine to another
[Figure labels: HTTP Header; SOAP Envelope with body; Specify ls as; First argument]
XML Example II
• And this is the result of ls sent back to the client in SOAP over HTTP
[Figure labels: HTTP Header; SOAP Envelope and body]
Overview of HTML
• HTML = Hypertext Markup Language
  • the lingua franca of the World Wide Web
• HTML is a simple language well suited for hypertext, multimedia and the display of small and reasonably simple documents
• HTML 2.0 spec completed in Nov 95
• HTML+ and HTML 3.0 never released
• HTML 3.2 (Jan 97) added tables, applets, and other capabilities (approximately 70 tags)
  • this is what most people are familiar with today
• HTML 4.0 spec released in Dec 97
• XHTML (XML version of HTML 4.0) released January 2000 as a W3C recommendation
W3C Process
• The Web Consortium http://www.w3c.org has a highly effective process for initiating and refining standards for the web
• The agreed standards for protocols and APIs are as critical to the success of the web as are technologies
• The process to define standards involves moving from Working Draft to Last Call Working Draft to Candidate Recommendation to Proposed Recommendation and finally to Recommendation.
• The standards discussed here are quite recent
  • XML Schema became a recommendation May 2 2001
  • SVG (2D Vector Graphics done in XML, relevant for animated web pages) became a recommendation September 5 2001
  • XQUERY (a proposed way of searching XML data structures/documents) is currently a working draft dated December 20 2001
XML in the HTML world
• XML = eXtensible Markup Language (name suggests documents not objects)
• XML was originally designed as a subset of SGML (Standard Generalized Markup Language), but unlike the latter, XML was specifically designed for the web and for comparative simplicity
• Specification of W3C: http://www.w3.org/XML and lots of links like http://www.xml.org
• XML 1.0 in February 98, with continuing refinements
• How XML fits into the Browser world:
  • XML with Application Specific Schema describes the logical structure of the document.
  • CSS (Cascading Style Sheets) or other style language describes the visual presentation of the document.
  • The DOM (Document Object Model) allows scripting languages, such as JavaScript, to access document objects.
  • DHTML (Dynamic HTML) allows a dynamic presentation of the document.
• XHTML is XML syntax for specifying text display, so HTML just does DISPLAY; XML does "Knowledge" and DISPLAY
Informatics View of Architecture
• Note that the Server Tier uses lots of subsystems that are themselves separated by XML interfaces
[Figure labels: Raw Data; Resource; XML for Data; (Virtual) XML Interface; Processing Server; Information/Knowledge; (Virtual) XML Interface; XML for Knowledge; Rendering to XML syntax Display Format (XHTML and SVG are examples); Clients]
The original Motivation for XML as an enhancement of HTML separating Display and Knowledge
• Limitations of HTML:
  • Extensibility: HTML does not allow users to specify their own tags or attributes in order to parameterize or otherwise semantically qualify their data.
  • Structure: HTML does not support the specification of deep structures needed to represent database schema or object-oriented hierarchies.
  • Validation: HTML does not support the kind of language specification that allows applications to check data for structural validity when it is imported.
Logical vs. Visual Design
• This is XML used as the interface between Knowledge and Rendering
• The logical design of a document (content) should be separate from its visual design (presentation)
• Separation of logical and visual (rendering) design
  • promotes sound typography
  • encourages better writing
  • is more flexible
  • allows the same "knowledge/information" (defined in XML) to be displayed on PCs, PDAs, Braille devices etc.
• XML can be used to define the logical design, while XSL (Extensible Stylesheet Language) is used to define the visual design (usually by mapping XML into HTML, but better by mapping XML for Knowledge into XHTML or SVG or ….).
What is SGML?
• SGML = Standard Generalized Markup Language, defined as an ISO (not W3C) standard (ISO 8879) in 1986
• An SGML document carries with it a grammar called a Document Type Definition (DTD). The DTD defines the tags and the meaning of those tags
  • DTD syntax is not very nice
• Presentation is governed by a style sheet written in the Document Style Semantics and Specification Language (DSSSL)
• Note that HTML is a fixed SGML application, a hard-wired set of about 70 tags and 50 attributes, and does not need to have a DTD for each HTML instance.
SGML Example
• A simple SGML document with embedded DTD:
    <!DOCTYPE DOCUMENT [
      <!ELEMENT DOCUMENT O O (p*,BIGP*)>
      <!ELEMENT p - O (#PCDATA)>
      <!ELEMENT BIGP - O (#PCDATA)>
    ]>
    <DOCUMENT>
    <p>Welcome to
    <BIGP>XML Style!
    </DOCUMENT>
SGML Example (cont'd)
• A corresponding DSSSL style sheet:
    <!DOCTYPE style-sheet PUBLIC "-//James Clark//DTD DSSSL Style Sheet//EN">
    (root (make simple-page-sequence))
    (element p (make paragraph))
    (element BIGP (make paragraph font-size: 24pt space-before: 12pt))
• XSL simplifies DSSSL just as XML simplifies SGML
XML as a simple SGML
• XML is a simplified subset of SGML rather than a fixed application like HTML. Since XML is extensible (XML can be considered a metalanguage), the tags of a document are defined by its DTD (or Schema), which must accompany the document if it is to be validated
• XML is a compromise between the non-extensible, limited capabilities of HTML and the full power and complexity of SGML
• XML offers "80% of the benefits of SGML for 20% of its complexity"
• XML designers tried to leave out all the SGML that would be rarely used on the web
• Note that the XML specification is 30 pages and the SGML specification is 500 pages.
• XML allows you to define your own tags and to describe nested hierarchies of information.
Some Global Concepts
• We are defining objects, possibly just for documents as in SGML
• We need to define "object templates" or their structure
  • This is a class for Java
  • This is a DTD for SGML and XML, or a Schema for XML
• We have instances of objects
  • XML files or Java objects, with some way (optional for XML) of specifying the template they follow
• We need to transform objects
  • We can do this with "real software", i.e. read the object into a program, interpret it and spit it out in a different form
  • We can use a specialized transformation language with some control data: this is DSSSL plus stylesheet for SGML; XSLT plus stylesheet for XML; browser plus CSS stylesheet for HTML
XML Design Goals
1) XML shall be usable over the Internet
2) XML shall support a variety of applications
3) XML shall be compatible with SGML
4) It shall be easy to write programs that process XML documents
5) Optional features in XML shall be kept to the absolute minimum, ideally zero
6) XML documents should be human-legible and reasonably clear
7) Design of XML should be prepared quickly
8) Design of XML shall be formal and concise
9) XML documents shall be easy to create
10) Terseness in XML markup is of minimal importance
Features of XML I
• The documents are stored in plain text and thus can be transferred and processed anywhere.
• Inline-reusability: documents can be composed of many pieces
• Unifying principles make it easily acceptable
  • "everything is a tree"
  • UNICODE for different languages
• XML documents enable several types of uses
  • traditional data processing: XML documents can be the data interchange medium
  • document-driven programming
  • archiving
A Tree
[Figure: tree with nodes root, onedown (attribute val), nextone, twodown, threedown (attribute anotherval), and Content (stuff)]
    <root>
      <onedown val="abc">
        <twodown>
          <threedown anotherval="123">stuff</threedown>
        </twodown>
      </onedown>
      <nextone></nextone>
    </root>
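The "everything is a tree" view can be checked by parsing the document and walking it recursively. This is a small sketch with Python's standard xml.etree.ElementTree; the helper name walk is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

TREE_DOC = """<root>
  <onedown val="abc">
    <twodown>
      <threedown anotherval="123">stuff</threedown>
    </twodown>
  </onedown>
  <nextone></nextone>
</root>"""

def walk(element, depth=0):
    """Yield (depth, tag, attributes) for an element and all its descendants."""
    yield depth, element.tag, dict(element.attrib)
    for child in element:          # iterating an Element gives its children
        yield from walk(child, depth + 1)

nodes = list(walk(ET.fromstring(TREE_DOC)))
for depth, tag, attrs in nodes:
    print("  " * depth + tag, attrs)
```

The indentation of the printed output mirrors the nesting of the tags, which is the tree drawn on the slide.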
Features of XML II
• It is important to remember that XML is a markup language, not a programming language. XSL can be viewed as a way of programming data whose structure is defined in XML
• Except this isn't really correct: you can build a programming language with XML syntax
    <myshell program="cat" args="grades" />
    <myshell program="ls" args="-l" />
• The M in XML is Markup, reflecting its origin in the "publication" community, with markup specifying the layout of a document, fonts to use etc.
• XML's most important use is not this original one but specifying abstract data structures, equivalent to structures in C++, classes in Java or entity relationships in the database world
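To make the "programming language with XML syntax" remark concrete, here is a hedged sketch of an interpreter front end for the myshell elements. The wrapping script element (XML needs a single root) and the function name to_argv are assumptions; the commands are only turned into argument lists, not executed.

```python
import shlex
import xml.etree.ElementTree as ET

# Hypothetical wrapper element <script>; the slide only shows the two
# <myshell> elements.
SCRIPT = """<script>
  <myshell program="cat" args="grades" />
  <myshell program="ls" args="-l" />
</script>"""

def to_argv(script_xml):
    """Translate each <myshell> element into an argv-style list."""
    commands = []
    for step in ET.fromstring(script_xml).iter("myshell"):
        argv = [step.get("program")] + shlex.split(step.get("args", ""))
        commands.append(argv)
    return commands

commands = to_argv(SCRIPT)
print(commands)
```

Each resulting list could be handed to a process launcher; keeping that step separate is a deliberate choice so that the XML stays pure data.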
Origins of XML
• First draft of XML spec released by W3C in Nov 96 (four other drafts published in 1997)
• The first XML parser (written in Java) released by Microsoft in July 97
• Microsoft released version 1.8 of its XML parser (which supports XML 1.0) in Jan 98
• W3C finalized the XML 1.0 spec in Feb 98
• First XML-aware beta versions of Netscape and IE5.0 released in June 98
• Sun announced Java Standard Extension for XML (XML API) in March 99
• W3C ongoing effort as discussed
HTML becomes XHTML
• This shows how near HTML is to XML but also the differences! http://www.w3schools.com/xhtml/default.asp
XHTML II
• You must have a body and a head section
• You can NOT use capital letters in element or attribute names
XHTML III
[Figure: markup examples labelled no / yes]
XHTML IV
[Figure: markup examples labelled no / yes]
XHTML V
XHTML VI
XHTML VII
XHTML VIII
XHTML IX
• Converting HTML to XHTML requires …
Homework 2
• Go to http://www.w3schools.com/xhtml/default.asp
• Take the XHTML Course
• Take the quiz, returning either a screen dump or the saved HTML of the results page
• Build a valid XHTML file as your course home page
  • It should NOT have frames
Homework 2 Continued
• Validate your XHTML file and send it to gcf@indiana.edu
• http://validator.w3.org/
"Hello World!" in XML
• An XML document with external DTD:
    <?xml version="1.0"?>
    <!DOCTYPE greeting SYSTEM "hello.dtd">
    <greeting>Hello World!</greeting>
• An XML document with embedded DTD:
    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE greeting [
      <!ELEMENT greeting (#PCDATA)>
    ]>
    <greeting>Hello World!</greeting>
• Current XHTML has a DTD but not a Schema
• Next version of XHTML with modules is Schema based
• You don't need to understand DTDs to use XHTML
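The embedded-DTD form can be fed straight to a parser. The sketch below uses Python's standard xml.etree.ElementTree, which reads the DOCTYPE but does not validate against it, so it only demonstrates that the document is well-formed.

```python
import xml.etree.ElementTree as ET

HELLO = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello World!</greeting>"""

# The parser accepts the internal DTD subset but performs no validation.
greeting = ET.fromstring(HELLO)
print(greeting.tag, "->", greeting.text)
```

Checking validity against the DTD would need a validating parser, which is outside the standard library module used here.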
XML and Related Acronyms
• Document Type Definition (DTD), which defines the tags and their relationships; to be replaced (IMHO) by XML Schema
• Extensible Stylesheet Language (XSL) style sheets, which specify the presentation of the document
• Cascading Style Sheets (CSS), a less powerful presentation technology without tag mapping capability
• XPATH, which specifies a location in the document
• XLINK and XPOINTER, which define link-handling details
• Resource Description Framework (RDF), document metadata
• Document Object Model (DOM), an API for converting the document to a tree object in your program for processing and updating
• Simple API for XML (SAX), a "serial access", fast-to-execute protocol for processing the document on the fly
• XML Namespaces, for an environment of multiple sets of XML tags
• XHTML, a definition of HTML tags for XML documents (which are then just HTML documents)
• XML Schema, which offers a better alternative to DTD
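The difference between DOM (build the whole tree, then query it) and SAX (react to events while the document streams past) can be sketched with Python's standard xml.dom.minidom and xml.sax modules. The tiny LINKS document and the class name LinkCounter are assumptions for illustration.

```python
import xml.sax
from xml.dom import minidom

DOC = "<LINKS><LINK>a</LINK><LINK>b</LINK></LINKS>"

# DOM: the whole document becomes a tree object that can be queried and updated.
dom = minidom.parseString(DOC)
dom_count = len(dom.getElementsByTagName("LINK"))

# SAX: the parser calls back on each start tag; no tree is kept in memory.
class LinkCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "LINK":
            self.count += 1

handler = LinkCounter()
xml.sax.parseString(DOC.encode("utf-8"), handler)

print(dom_count, handler.count)
```

Both approaches count the same elements; SAX tends to suit large documents processed once, while DOM suits documents that are edited in place.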
Document Type Definition
• The DTD specifies the logical structure of the document; it is a formal grammar describing document syntax and semantics
• The DTD does not describe the physical layout of the document; this is left to the style sheets and the scripts
• It is no mean task to write a DTD, so most users will adopt predefined DTDs (or can write an XML document without a DTD).
• DTDs can be written in separate files to facilitate re-use.
• Content-providers, industries and other groups can collaborate to define sets of tags: the essence of "any" field (physics, music …) is captured in a domain specific DTD/Schema
• XML documents are valid if they are consistent with a specified DTD or Schema
• We will NOT discuss DTD significantly in this presentation
XML Editors
• There are several XML editors at various prices and capabilities
• One list of available editors is at http://www.perfectxml.com/soft.asp?cat=6
• We have good experience with XML Spy, which costs money but renewable 30 day licenses are available
• The capabilities of editors depend on how well they support Schemas
• As XML gets more complicated, expect a new generation of "processing tools" that accept XML as input with multiple Schema and produce some sort of output for people and/or computers
• Microsoft XML Notepad is simple, free and dated
• http://www.w3.org/XML/Schema has a set of good XML Schema links which inter alia discuss XML
• http://www.w3schools.com/xml/default.asp
XML must be "well-formed"
• For the data contained in an XML document to be parsed correctly, its markup must be well-formed, meaning in part that properly nested and non-abbreviated starting and ending tags are used.
• This well-formedness provides a well defined encapsulation mechanism allowing designated sections of the data to be accessed programmatically.
• Current HTML browsers allow rule violations, but XML is strict, which is essential for many (robust) applications
• If XML were just used to render, then sloppiness would be allowable, but as XML is aimed at capturing object structure or information, we cannot have errors interpreted unpredictably by parsers
• Well-formed is less restrictive than valid
• XML documents must be well-formed; the user can decide if they need to be valid
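The strictness described above is easy to observe: a conforming parser rejects markup that a browser would tolerate in HTML. A minimal sketch with Python's standard xml.etree.ElementTree follows; the helper name is_well_formed and the sample strings are assumptions.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

good = "<p>Welcome to <b>XML</b></p>"
bad_nesting = "<p>Welcome to <b>XML</p></b>"   # tags overlap instead of nesting
unclosed = "<p>Welcome to XML"                 # end tag abbreviated away

for sample in (good, bad_nesting, unclosed):
    print(is_well_formed(sample), sample)
```

This only tests well-formedness; validity against a DTD or Schema is a separate, optional check.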
Character Data in XML: CDATA and PCDATA
• XML documents are made up of markup and CDATA (character data)
• PCDATA is text obtained by parsing the document and processing markup as necessary
• "markup" includes
  • Tags and attributes (ALL that is important), Entity references, Character references, Comments, CDATA Section delimiters, DTD declarations and Processing Instructions
• XML allows you to specify chunks of text which may contain "reserved characters/strings" with an ugly syntax
    <![CDATA[ <ignored>Anything </ignored> ]]>
• Maybe (hopefully) this will be replaced by alternatives based on ideas like mail attachments (see http://www.w3.org/TR/SOAP-attachments)
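A quick way to see what a CDATA section does is to parse one and look at the resulting text. This sketch uses Python's standard xml.etree.ElementTree; the enclosing note element is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical <note> wrapper around the CDATA section from the slide.
DOC = "<note><![CDATA[ <ignored>Anything </ignored> ]]></note>"

note = ET.fromstring(DOC)
print(repr(note.text))   # the tags inside CDATA arrive as plain characters
print(len(note))         # number of child elements
```

The text inside the section is delivered verbatim, and no ignored element is created.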
Characters in XML
• We can choose the character encoding, such as UTF-8 (a variable-width encoding whose 7 bit codes coincide with ASCII), UTF-16 (16 bit character codes as used by Java) or even UCS-4, which offers 32 bits for each character. UTF-8 is assumed when no encoding is declared and there is no byte order mark. The encoding is specified in the XML declaration in the document prolog.
• You can use character reference markup
  • &#x03C0; is the character π (Unicode 03C0) wrapped in the &#x .. ; syntax for a 16 bit (4 hexadecimal symbols) character reference in Unicode (ISO/IEC 10646)
  • &#960; is also π, using the decimal form of the Unicode value
• One can use the five built-in entity references
  • &amp; for &
  • &apos; for '
  • &gt; for >
  • &lt; for <
  • &quot; for "
• In the DTD approach (which we are ignoring), one can define arbitrary entity references
• &#x----; is hexadecimal (base 16, digits 0..9 A..F); &#----; is decimal (base 10, digits 0..9)
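The character and entity references above are resolved by the parser before the application sees the text. A small sketch with Python's standard xml.etree.ElementTree; the element names chars, hex, dec and builtin are assumptions.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<chars>"
    "<hex>&#x03C0;</hex>"                                  # hexadecimal reference
    "<dec>&#960;</dec>"                                    # decimal reference
    "<builtin>&amp; &apos; &gt; &lt; &quot;</builtin>"     # five built-in entities
    "</chars>"
)

chars = ET.fromstring(DOC)
hex_pi = chars.findtext("hex")
dec_pi = chars.findtext("dec")
builtin = chars.findtext("builtin")

print(hex_pi == dec_pi, builtin)
```

Both numeric forms name the same code point (hexadecimal 3C0 equals decimal 960), so the two strings compare equal.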
White Space in XML
• By default XML treats spaces, tabs, line feeds and carriage returns "just" as white space. Thus <greeting>Hello World!</greeting> and <greeting>Hello     World!</greeting> can be treated as identical by an application (the parser itself passes all the white space on)
• This is similar to HTML. One can overrule this using the attribute xml:space with syntax
    <greeting xml:space="preserve">Hello     World!</greeting>
• In a valid document this attribute must be declared in the DTD with
    <!ATTLIST greeting xml:space (default|preserve) 'preserve'>
  which defines element greeting to allow an attribute xml:space that can take the values default or preserve, with the latter as default
• If you specify xml:space, then it holds not only for the given element but for all those contained within it.
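In practice the parser hands every white space character to the application, and xml:space is a signal the application may act on. The sketch below, using Python's standard xml.etree.ElementTree, shows one possible application policy; the function text_of is an assumption and, for brevity, it does not implement inheritance of xml:space by contained elements.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is always bound to this namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"

def text_of(element):
    """Return element text, collapsing white space unless xml:space='preserve'."""
    text = element.text or ""
    if element.get(f"{{{XML_NS}}}space") == "preserve":
        return text
    return " ".join(text.split())

plain = ET.fromstring("<greeting>Hello \n   World!</greeting>")
kept = ET.fromstring('<greeting xml:space="preserve">Hello \n   World!</greeting>')

print(repr(plain.text))      # the parser itself keeps every character
print(repr(text_of(plain)))
print(repr(text_of(kept)))
```

A fuller version would look up the nearest ancestor carrying xml:space, since the setting applies to everything contained in the element.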
XML Example
• Another example which could be used for URL exchanges between network capable applications:
    <LINK>
      <TITLE>XML Recommendation</TITLE>
      <URL> http://www.w3.org/TR/REC-xml </URL>
      <DESCRIPTION> The official XML spec from W3C </DESCRIPTION>
    </LINK>
XML Example (cont'd)
• A document may have many such links:
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <?xml-stylesheet type="text/css" href="fred.css" ?>
    <DOCUMENT>
      <LINKS>
        <LINK>…</LINK>
        <LINK>…</LINK>
        …
      </LINKS>
    </DOCUMENT>
• Here we have also added prolog processing instructions.
XML Prolog and Processing Instructions
• Every XML file starts with the prolog, giving information about the document. The minimal prolog identifies it as an XML document
    <?xml version="1.0"?>
• The prolog may also include the encoding and whether it is a standalone document:
    <?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>
• If it is not standalone, it may specify external "entities" which may be named in the document, or an external DTD
• An XML file may also contain more general processing instructions for the application processing the document:
    <?target instructions ?>
  where target is the name of the application.
• Only <?xml … ?> is understood by all XML processors
• Specification of a stylesheet by <?xml-stylesheet .. ?> is common
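Processing instructions survive parsing and can be inspected by the application. This is a sketch with Python's standard xml.dom.minidom; the document is a shortened version of the links example and the variable names are assumptions.

```python
from xml.dom import minidom

DOC = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml-stylesheet type="text/css" href="fred.css"?>
<DOCUMENT><LINKS/></DOCUMENT>"""

dom = minidom.parseString(DOC)

# Collect (target, data) for each processing instruction in the prolog.
# The <?xml ...?> declaration is consumed by the parser and is not listed.
instructions = [
    (node.target, node.data)
    for node in dom.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

print(instructions)
print(dom.documentElement.tagName)
```

The data part is an uninterpreted string, so it is up to the named target application to parse pseudo-attributes such as href.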
XML Prolog and Comments
• The Prolog can contain:
  • Processing Instructions
  • DTD Specifications (we have illustrated these but will not discuss them further)
  • Comments
• Comments have the same form anywhere in the XML document and are just like comments in HTML
    <!--This is the Prolog and <tag> Lousy Course</tag> is not treated as a tag-->
• You cannot have -- inside comments, but <tag> </tag> is not treated as markup
Processing XML
• So in the beginning (1999), it was not clear how XML would be used
• One (major?) original goal was specifying the content of web pages, and this implied processing of XML with "style-sheets" that specified the mapping of XML into HTML
  • Obviously this is some sort of "processing"
• XML was so popular that lots of other applications with lots of totally different processing were invented
• <?target instructions ?> was insufficient in its ability to specify the way processing is to be done, and not very useful, as it is always better to be modular and NOT associate details of processing with data
• So it is best to ignore XML processing tags unless they are used in a very conventional way, such as for style sheets.
• Modern web page technology tends not to use this way but rather has a separate "configuration file" matching XML and style-sheets
Comments in XML
[Figure (OLD MODEL): XML File → XML Parser → Output XML File; use processing instructions to control parsing; <!-- --> ignored]
• The <!-- --> syntax represents a comment on the "file", as it is ignored by the XML parser
• This is sometimes useful, but more valuable is a comment that is preserved by the parser, as this can be either thrown away or preserved as you please
• Do this with some sort of tag like
    <yourcomment> This is a comment</yourcomment>
• Parsers read XML, check if it is well-formed/valid and return some sort of answer; in the simplest model this is a modified file
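The contrast between a syntactic comment and a comment carried in a tag can be seen with Python's standard xml.etree.ElementTree, whose default parser discards <!-- --> comments. The notes wrapper element is an assumption.

```python
import xml.etree.ElementTree as ET

DOC = """<notes>
  <!-- This comment is dropped by the default parser -->
  <yourcomment>This is a comment</yourcomment>
</notes>"""

notes = ET.fromstring(DOC)
tags = [child.tag for child in notes]
serialized = ET.tostring(notes, encoding="unicode")

print(tags)
print(serialized)
```

Because the yourcomment element is ordinary data, later processing stages can choose to keep it or strip it.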
Role of Parsers
[Figure: XML File → XML Parser → "Business Logic" (Web Service) → Output XML File; the Web Service is specified in XML (WSDL does this)]
• The new model disassociates data and the action on the data
• XML parsers are a critical technology: editors are built on top of them, but parsers are the basis of all use of XML in web services