Learn the fundamental syntax and concepts of XML in this tutorial by Geoffrey Fox and Bryan Carpenter. Explore XML structure, applications, transformations, and more. Discover how XML fits into the world of browsers and logical vs. visual design.
Basic Syntax of XML • DoD Users Group Tutorial on XML and Science, June 10 2002, Austin TX • Geoffrey Fox and Bryan Carpenter, PTLIU Laboratory for Community Grids, Informatics (Computer Science, Physics), Indiana University, Bloomington IN 47404 • gcf@indiana.edu xmlintrougcjune02
Outline of Introduction to XML • The two drivers for XML • Original: A better way of specifying documents that is more powerful than HTML and easier to understand than SGML • Current: XML as an object structure for totally general entities • Introductory XML: well formed and valid XML • Examples of use of XML • Basic XML Syntax – this presentation • Further Presentations to be chosen from • XML Schema (Key Syntax) and XML technology/applications • XML based Document Object Model and style sheets • Transforming XML documents (XSLT) • Searching XML documents • Applications of XML – Science, Computing and community: Dublin Core (books), RDF (AI), SVG (Graphics), SOAP (Messages)
An Information Nugget
Nugget in XML • Figure labels: Start Tags of Elements (<okc>, <event>); Attributes; End Tags of Elements
XML Example for SOAP • This is the way that the system we developed (Gateway, http://www.gateway.org) uses XML to send the command ls (list files) from one machine to another • Figure labels: SOAP Envelope with body; HTTP Header; Start XML; Specify Schema (namespace); Actual Information; Specify ls as
XML Example from SOAP II • And this is the result of ls sent back to the client in SOAP over HTTP • Figure labels: HTTP Header (nothing to do with XML); Start XML; Namespaces: URI points to Schema; SOAP Envelope and body; Attribute; 4 tags starting XML elements; 4 tags ending XML elements
XML in the HTML world • XML = eXtensible Markup Language (name suggests documents not objects) • XML was originally designed as a subset of SGML -- Standard Generalized Markup Language, but unlike the latter, XML was specifically designed for the web and for comparative simplicity • Specification of W3C: http://www.w3.org/XML and lots of links like http://www.xml.org • XML 1.0 in February 98, with continuing refinements • How XML fits into the Browser world: • XML with Application Specific Schema describes the logical structure of the document. • CSS (Cascading Style Sheets) or other style language describes the visual presentation of the document. • The DOM (Document Object Model) allows scripting languages, such as JavaScript, to access document objects. • DHTML (Dynamic HTML) allows a dynamic presentation of the document. • XHTML is XML Syntax for specifying Text Display – so HTML just does DISPLAY; XML does “Knowledge” and DISPLAY
Logical vs. Visual Design • This is XML used as Interface between Knowledge and Rendering • The logical design of a document (content) should be separate from its visual design (presentation) • Separation of logical and visual (rendering) design • promotes sound typography • encourages better writing • is more flexible • Allows the same “knowledge/information” (defined in XML) to be displayed on PC’s, PDA’s, Braille devices etc. • XML can be used to define the logical design, while XSL (Extensible Style Language) is one way to define the visual design in terms of the logical design in XML (usually by mapping XML into HTML, but better by mapping XML for Knowledge into XHTML or SVG or ….).
Features of XML • It is important to remember that XML is a markup language, not a programming language. XSL can be viewed as a way of programming data whose structure is defined in XML • Except this isn’t really correct – you can build a programming language with XML Syntax • <myshell program="cat" args="grades" /> <myshell program="ls" args="-l" /> • M in XML is Markup, reflecting its origin in the “publication” community with markup specifying layout of document, fonts to use etc. • XML’s most important use is not this original goal but rather specifying abstract data structures -- equivalent to structures in C++ or classes in Java or Entity relationship in database world
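To make the "programming language with XML syntax" point concrete, here is a minimal Python sketch that reads the two <myshell> elements from the slide and turns each one into an argv-style command list. The <script> wrapper element and the function name to_commands are assumptions added for illustration (an XML document needs a single root); nothing is actually executed.

```python
import shlex
import xml.etree.ElementTree as ET

# Hypothetical wrapper element <script>: not in the slide, added only because
# an XML document must have exactly one root element.
SCRIPT = """
<script>
  <myshell program="cat" args="grades" />
  <myshell program="ls" args="-l" />
</script>
"""

def to_commands(xml_text):
    """Translate each <myshell> element into an argv-style list."""
    root = ET.fromstring(xml_text)
    commands = []
    for step in root.findall("myshell"):
        # The program name comes first, followed by the split argument string.
        argv = [step.get("program")] + shlex.split(step.get("args", ""))
        commands.append(argv)
    return commands

commands = to_commands(SCRIPT)
for argv in commands:
    print(argv)
```

An interpreter could hand each list to something like subprocess.run; that step is left out here so the sketch stays side-effect free.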
Object Issues • We have a world of objects • Objects are instances of “classes of object” • This Sony VAIO is an instance of the general Sony VAIO laptop class • This Sony VAIO laptop class is a subclass of the laptop class • Object classes are made up of smaller things – which have “types” • Color is a special type of variable with a 24 bit (or some other) specification • String (characters) is a type (simple type); Integers are a type etc. • Date is also a simple type in XML because it is so common • XML has Schema to specify classes and these are made up of simple types and complex (complicated) types • Further, one Schema can be built from one or more other Schemas
Some Global Concepts • We are defining objects – possibly just for documents as in SGML • We need to define “object templates” or their structure • This is class for Java • This is DTD for SGML and DTD or Schema for XML • We have instances of objects • XML files or Java Objects with some way (optional for XML) of specifying • We need to transform objects • We can do this with “real software” i.e. read object into program, interpret and spit out in a different form • We can use a specialized transformation language with some control data – this is DSSSL plus stylesheet for SGML; XSLT plus stylesheet for XML; browser plus CSS stylesheet for HTML
XML defines Objects as A Tree • Figure: tree with node root; its children onedown (attribute val) and nextone; twodown under onedown; threedown (attribute anotherval, content “stuff”) under twodown • <root><onedown val="abc"><twodown><threedown anotherval="123">stuff</threedown></twodown></onedown><nextone></nextone></root> • DOM – Document Object Model describes XML Object as a tree (hierarchy) • In particular real documents are made of document fragments which are themselves made of document (sub) fragments
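The tree view can be checked with a few lines of Python. This is a minimal sketch using the standard library's ElementTree rather than a full DOM implementation; the helper name outline is an assumption for illustration. It parses the document from the slide and returns one indented line per element.

```python
import xml.etree.ElementTree as ET

# The example document from the slide, with straight quotes.
DOC = ('<root><onedown val="abc"><twodown>'
       '<threedown anotherval="123">stuff</threedown>'
       '</twodown></onedown><nextone></nextone></root>')

def outline(element, depth=0):
    """Return one indented line per element, visiting the tree depth first."""
    line = "  " * depth + element.tag
    if element.attrib:
        line += " " + str(dict(element.attrib))
    text = (element.text or "").strip()
    if text:
        line += " -> " + repr(text)
    lines = [line]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(DOC))
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one level of nesting in the markup, so nextone appears at the same depth as onedown.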
An Example of RDF and Dublin Core • <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/metadata/dublin_core#"> • <rdf:Description about="http://www.dlib.org"> • <dc:Title>D-Lib Program - Research in Digital Libraries</dc:Title><dc:Description>The D-Lib program supports the community of people with research interests in digital libraries and electronic publishing. </dc:Description><dc:Publisher>Corporation For National Research Initiatives</dc:Publisher><dc:Date>1995-01-07</dc:Date> • <dc:Subject> • <rdf:Bag> <rdf:li>Research; statistical methods</rdf:li><rdf:li>Education, research, related topics</rdf:li><rdf:li>Library use Studies</rdf:li> </rdf:Bag> • </dc:Subject><dc:Type>World Wide Web Home Page</dc:Type> • <dc:Format>text/html</dc:Format> • <dc:Language>en</dc:Language> • </rdf:Description></rdf:RDF>
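As an illustration of how a program might read such metadata, the following Python sketch parses an abridged copy of the RDF example and pulls out the title and the subject list. The prefix-to-URI map NS and the variable names are choices made for this sketch; the two namespace URIs are the ones declared in the slide.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the slide's RDF / Dublin Core example.
RDF_DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/metadata/dublin_core#">
  <rdf:Description about="http://www.dlib.org">
    <dc:Title>D-Lib Program - Research in Digital Libraries</dc:Title>
    <dc:Date>1995-01-07</dc:Date>
    <dc:Subject>
      <rdf:Bag>
        <rdf:li>Research; statistical methods</rdf:li>
        <rdf:li>Education, research, related topics</rdf:li>
        <rdf:li>Library use Studies</rdf:li>
      </rdf:Bag>
    </dc:Subject>
  </rdf:Description>
</rdf:RDF>"""

# Prefixes used in the search paths below; they only need to map to the same
# URIs as in the document, the prefix spelling itself is arbitrary.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/metadata/dublin_core#",
}

root = ET.fromstring(RDF_DOC)
description = root.find("rdf:Description", NS)
resource = description.get("about")
title = description.findtext("dc:Title", namespaces=NS)
subjects = [li.text for li in description.findall("dc:Subject/rdf:Bag/rdf:li", NS)]

print(resource, "|", title)
print(subjects)
```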
Origins of XML • First draft of XML spec released by W3C in Nov 96 (four other drafts published in 1997) • The first XML parser (written in Java) released by Microsoft in July 97 • Microsoft released version 1.8 of its XML parser (which supports XML 1.0) in Jan 98 • W3C finalized the XML 1.0 spec in Feb 98 • First XML-aware beta versions of Netscape and IE5.0 released in June 98 • Sun announced Java Standard Extension for XML (XML API) in March 99 • W3C ongoing effort as discussed
HTML becomes XHTML • This shows how near HTML is to XML but also differences! http://www.w3schools.com/xhtml/default.asp
XHTML II • You must have a body and a head section • You canNOT use capital letters in element or attribute names
XHTML III • Figure labels: no; yes; no or yes
XHTML IV • Figure labels: no; yes
XHTML V
XHTML VI
XHTML VII
XHTML VIII
XHTML IX • Converting HTML to XHTML requires …
XML and Related Acronyms • Document Type Definition (DTD), which defines the tags and their relationships – to be replaced (IMHO) by XML Schema • Extensible Style Language (XSL) style sheets, which specify the presentation of the document • Cascading Style Sheets (CSS), a less powerful presentation technology without tag mapping capability • XPATH, which specifies location in document • XLINK and XPOINTER, which define link-handling details • Resource Description Framework (RDF), document metadata • Document Object Model (DOM), API for converting the document to a tree object in your program for processing and updating • Simple API for XML (SAX), “serial access” protocol, fast-to-execute protocol for processing document on the fly • XML Namespaces, for an environment of multiple sets of XML tags • XHTML, a definition of HTML tags for XML documents (which are then just HTML documents) • XML Schema, offers a better alternative to DTD
XML Editors • There are several XML editors at various prices and capabilities • One list of available editors is at http://www.perfectxml.com/soft.asp?cat=6 • We have good experience with XML Spy, which costs money, but renewable 30 day licenses are available • The capabilities of editors depend on how well they support Schemas • As XML gets more complicated, expect a new generation of “processing tools” that accept XML as input with multiple Schema and produce some sort of output for people and/or computers • Microsoft XML Notepad is simple, free and dated • http://www.w3.org/XML/Schema has a set of good XML Schema links which inter alia discuss XML • http://www.w3schools.com/xml/default.asp
XML must be “well-formed” • For the data contained in an XML document to be parsed correctly, its markup must be well-formed, meaning in part that properly nested and non-abbreviated starting and ending tags are used. • This well-formed-ness provides a well defined encapsulation mechanism allowing designated sections of the data to be accessed programmatically. • Current HTML browsers allow rule violations but XML is strict, which is essential for many (robust) applications • If XML were used just to render, then sloppiness would be allowable, but as XML is aimed at capturing object structure or information, we cannot have errors interpreted unpredictably by parsers • Well-formed is less restrictive than valid • XML documents must be well-formed – the user can decide if they need to be valid
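The strictness described above is easy to observe with any conforming parser. Below is a minimal Python sketch, assuming the standard library's ElementTree (which checks well-formedness only and does not validate against a DTD or Schema). The helper name first_error and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def first_error(xml_text):
    """Return None if the text is well-formed, else the (line, column) of the first error."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        return err.position

# Illustrative samples (not from the slides).
well_formed = "<note><to>you</to></note>"
missing_end_tag = "<note><to>you</note>"    # <to> is never closed
html_style = "<p>one<p>two"                 # abbreviated, HTML-style markup

for sample in (well_formed, missing_end_tag, html_style):
    print(repr(sample), "->", first_error(sample))
```

Unlike an HTML browser, the parser does not guess at a repair: it stops at the first violation and reports where it occurred.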
Character Data in XML: CDATA and PCDATA • XML documents are made up of markup and CDATA (character data) • PCDATA is text obtained from parsing the document and processing markup as necessary • “markup” includes • Tags and attributes (ALL that is important), Entity references, Character references, Comments, CDATA Section delimiters, DTD declarations and Processing Instructions • XML allows you to specify chunks of text which may contain “reserved characters/strings” with an ugly syntax • <![CDATA[ <ignored>Anything </ignored> ]]> • Maybe (hopefully) this will be replaced by alternatives based on ideas like mail attachments (see http://www.w3.org/TR/SOAP-attachments)
Characters in XML • We can choose the character encoding, such as UTF-8 (a variable-width encoding of Unicode whose one-byte codes coincide with ASCII, and the default when no encoding is declared), UTF-16 (16 bit code units, as used by Java) or even UCS-4, which offers 32 bits for each character. This is specified in the XML declaration in the document prolog. • You can use character reference markup • &#x03C0; is Unicode for π, wrapped in &#x .. ; syntax for a 16 bit (4 hexadecimal symbols) character reference in Unicode (ISO/IEC 10646) • &#960; is also π, using the decimal form of Unicode • One can use the five built-in entity references • &amp; for & • &apos; for ' • &gt; for > • &lt; for < • &quot; for " • In the DTD approach (which we are ignoring), one can define arbitrary entity references • General forms: &#x----; hexadecimal (base 16, digits 0..9ABCDEF); &#----; decimal (base 10, digits 0..9)
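A short Python check of the character references and the five built-in entity references, again a sketch using the standard library's ElementTree; the element name <c> and the variable names are arbitrary choices for the example.

```python
import xml.etree.ElementTree as ET

# Hexadecimal and decimal character references for the same character:
# 0x03C0 in base 16 equals 960 in base 10.
hex_ref = ET.fromstring("<c>&#x03C0;</c>").text
dec_ref = ET.fromstring("<c>&#960;</c>").text

# The five built-in entity references, resolved by the parser.
entities = ET.fromstring("<c>&amp; &apos; &gt; &lt; &quot;</c>").text

print(hex_ref, dec_ref, hex_ref == dec_ref)
print(entities)
```

The application never sees the reference syntax itself; the parser hands over the resolved characters.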
White Space in XML • XML as default treats spaces, tabs, line feeds and carriage returns “just” as white space. Thus <greeting>Hello World!</greeting> and the same greeting written with a line feed or extra spaces between Hello and World! are treated as identical • This is similar to HTML. One can overrule this using attribute xml:space with syntax • <greeting xml:space="preserve">Hello World!</greeting> • This attribute must be defined in the DTD with • <!ATTLIST greeting xml:space (default|preserve) 'preserve'> • which defines element greeting to allow an attribute xml:space which can take values default or preserve, with the latter as default • If you specify xml:space, then it holds not only for the given element but for all those contained within it.
XML Prolog and Processing Instructions • Every XML file starts with the prolog, giving information about the document. The minimal prolog identifies it as an XML document: <?xml version="1.0"?> • The prolog may also include the encoding and whether it is a standalone document: <?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?> • If it is not standalone, it may specify external “entities” which may be named in the document, or an external DTD • An XML file may also contain more general processing instructions for the application processing the document: <?target instructions ?> where target is the name of the application. • Only <?xml … ?> is understood by all XML processors • Specification of a stylesheet by <?xml-stylesheet .. ?> is common
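To see how a parser reports the prolog, here is a small Python sketch using xml.dom.minidom. The stylesheet name style.xsl and the <greeting> document are hypothetical, chosen only for the example. In minidom the <?xml … ?> declaration is consumed by the parser and surfaces as properties of the document object, while other processing instructions appear as nodes with a target and a data string.

```python
from xml.dom import minidom

# Bytes input, so the parser itself honours the declared encoding.
# "style.xsl" is a hypothetical stylesheet name used only for illustration.
DOC = (b'<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>'
       b'<?xml-stylesheet type="text/xsl" href="style.xsl"?>'
       b'<greeting>Hello</greeting>')

doc = minidom.parseString(DOC)

# Collect (target, instructions) for every processing instruction that
# appears at document level, before or after the root element.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

print("declared encoding:", doc.encoding)
print("root element:", doc.documentElement.tagName)
print("processing instructions:", instructions)
```

What to do with the instruction text is entirely up to the application named by the target; the parser only passes it through.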
Processing XML • So in the beginning (1999), it was not clear how XML would be used • One (major?) original goal was specifying the content of web pages, and this implied processing of XML with “style-sheets” that specified the mapping of XML into HTML • Obviously this is some sort of “processing” • XML was so popular that lots of other applications with lots of totally different processing were invented • <?target instructions ?> was insufficient to specify the way processing is to be done, and not very useful, as it is always better to be modular and NOT associate details of processing with data • So it is best to ignore XML processing instructions unless used in a very conventional way such as style sheets. • Modern web page technology tends not to use this approach but rather has a separate “configuration file” matching XML and style-sheets
Comments in XML • Figure labels (OLD MODEL): XML File; XML Parser; Output XML File; Use processing instructions to control parsing; <!-- --> ignored • <!-- --> syntax represents a comment on the “file”, as it is ignored by the XML parser • This is sometimes useful, but more valuable is a comment that is preserved by the parser, as this can be either thrown away or preserved as you please • Do this with some sort of tag like <yourcomment> This is a comment</yourcomment> • Parsers read XML – check if well-formed/valid and return some sort of answer – in the simplest model this is a modified file
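The difference between the two kinds of comment can be seen in a few lines of Python. This sketch relies on the default behaviour of the standard library's ElementTree, which discards <!-- --> comments while building the tree; the <doc> and <data> elements are invented for the example.

```python
import xml.etree.ElementTree as ET

DOC = ("<doc>"
       "<!-- dropped by the parser -->"
       "<yourcomment>This is a comment</yourcomment>"
       "<data>42</data>"
       "</doc>")

root = ET.fromstring(DOC)

# Only real elements survive in the parsed tree.
child_tags = [child.tag for child in root]

# The element-style comment is ordinary data the program can keep or drop.
kept = root.findtext("yourcomment")

print(child_tags)
print(kept)
```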
Role of Parsers • Figure labels: Web Service; XML File; XML Parser; “Business Logic”; XML File; Output XML File; Specify in XML Web Service (WSDL does this) • New model disassociates data and action on the data • XML Parsers are critical technology – Editors are built on top of them, but parsers are the basis of all use of XML in web services
XML tag structure • In XML terminology, a pair of start and end tags is an element. • XML documents must have a strict hierarchical structure. • All start tags must have an end tag. • Any element must be properly nested within another. • <LI> XML requires <B><I>proper nesting</I></B>.</LI> is well formed • <LI> XML requires <B><I>proper nesting</I></LI>.</B> would be rejected by an XML Parser • Empty tags (no content except perhaps attributes) are allowed as elements in XML documents. • An empty tag is a start and end tag together and is identified by a trailing / after the tag name. So in XHTML one uses <br/> for the empty break tag. (So empty tags with no attributes are “flags”) • A start tag and end tag with nothing in-between can also be considered an empty tag: <IMG SRC="face.gif"></IMG> • XML tags are case-sensitive. (<H1> is not the same as <h1>.)
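The rules on this slide can be exercised directly. The following Python sketch uses the standard library's ElementTree with the slide's own examples; the helper name parses and the <H1>Title</h1> sample are additions for illustration.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """True if an XML parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Proper versus overlapping nesting, as on the slide.
nested_ok = parses("<LI> XML requires <B><I>proper nesting</I></B>.</LI>")
nested_bad = parses("<LI> XML requires <B><I>proper nesting</I></LI>.</B>")

# The two spellings of an empty element give the same parsed result.
short_form = ET.fromstring('<IMG SRC="face.gif"/>')
long_form = ET.fromstring('<IMG SRC="face.gif"></IMG>')
same_element = (short_form.tag == long_form.tag
                and short_form.attrib == long_form.attrib
                and short_form.text is None and long_form.text is None)

# Case sensitivity: an <H1> start tag is not closed by an </h1> end tag.
case_mismatch = parses("<H1>Title</h1>")

print(nested_ok, nested_bad, same_element, case_mismatch)
```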
Document is a Single Tree • XML documents allow only one root element. • So it must be • <?xml version="1.0" ?><rootoftree>………</rootoftree> • And not • <?xml version="1.0" ?><rootoftree>………</rootoftree><rootoftree>………</rootoftree> • So there is only one tree in each document
XML Attributes • Tags can have any number of attributes (which must be declared by the schema) • All attribute values must be within single or double quotes: <FONT COLOR="#FF00CC"> quoted attribute </FONT> • If you have a double quote inside an attribute value, then either • Use &quot; for the inside quote as in quote="&quot;" • Enclose the attribute value in single quotes as in quote='"' • Each attribute can only appear once in a given element • One can choose (matter of taste) between <person name="Fox" role="teacher" ></person> and • <person><name>Fox</name><role>teacher</role></person> • Note you can repeat elements but you cannot repeat attributes to represent multiple occurrences
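A compact Python sketch of these attribute rules, using the standard library's ElementTree. The element name <s> and the second role value "author" are made up for the example; the <person> documents follow the slide.

```python
import xml.etree.ElementTree as ET

# The same information as attributes or as child elements.
as_attributes = ET.fromstring('<person name="Fox" role="teacher"></person>')
as_elements = ET.fromstring('<person><name>Fox</name><role>teacher</role></person>')
name_from_attribute = as_attributes.get("name")
name_from_element = as_elements.findtext("name")

# Two ways to put a double quote inside an attribute value.
quote_entity = ET.fromstring('<s quote="&quot;"/>').get("quote")
quote_single = ET.fromstring("<s quote='\"'/>").get("quote")

# Elements may repeat ...
repeated = ET.fromstring('<person><role>teacher</role><role>author</role></person>')
roles = [role.text for role in repeated.findall("role")]

# ... but repeating an attribute is a well-formedness error.
try:
    ET.fromstring('<person role="teacher" role="author"/>')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True

print(name_from_attribute, name_from_element, roles, duplicate_rejected)
```

This is one practical reason to prefer child elements whenever a property can occur more than once.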
CDATA Sections • CDATA sections allow you to include unparsed characters in a document: <![CDATA[ <ignored>Anything</ignored> ]]> • In this example the ignored tag is not processed by the XML parser • Unfortunately you must guarantee that there is no ]]> string in the text between <![CDATA[ and ]]> • <script language="JavaScript"><![CDATA[ var fred = 0; if (fred < 10) { document.writeln("> and < here are NOT parsed"); } ]]></script>
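The following Python sketch shows the parser handing back the content of a CDATA section as plain text, and one common workaround for the ]]> restriction: splitting the text across two adjacent CDATA sections. The helper name cdata and the <doc> wrapper are assumptions for this example, not part of the slides.

```python
import xml.etree.ElementTree as ET

# Markup inside a CDATA section is delivered as text, not as an element.
root = ET.fromstring("<doc><![CDATA[<ignored>Anything</ignored>]]></doc>")
literal_text = root.text
child_count = len(root)

def cdata(text):
    """Wrap text in CDATA, splitting any ']]>' across two adjacent sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# Text that contains the forbidden terminator still round-trips.
tricky = "a ]]> b"
round_trip = ET.fromstring("<doc>" + cdata(tricky) + "</doc>").text

print(literal_text, child_count)
print(round_trip)
```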
XML Namespaces I • This is an extension to XML adopted January 1999; see http://www.w3.org/TR/1999/REC-xml-names-19990114/ • Namespaces address the problem in DTDs that labels (element and attribute names) cannot be repeated; more fundamentally (and for Schemas especially) they provide a subroutine or library capability for XML • Suppose you had an XML file with <student> and <faculty> and you wanted to write <student><name>you</name></student><teacher><name>me<special>Prof</special></name></teacher> • This is invalid unless <name> is identical in structure for both teacher and student, as each element in the tree must have a unique structure. • We can get round it by using <studentname> and <teachername>, but this is not so satisfactory, especially if you get this conflict by joining two different sets of tags together • This is seen in XHTML, where you could add MathML, SMIL, SVG tags ….
XML Namespaces II • So we use the new syntax xmlns="http://aspen.ucs.indiana.edu/namespaces/university.xsd" to define an XML Namespace • The value of xmlns is hopefully a useful URL/URI telling you about the tags. However this is not required. • Microsoft in its cunning way uses in Office web export: • <xml xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:p="urn:schemas-microsoft-com:office:powerpoint"> • And teaches Internet Explorer to understand these obscure “uniform resource names” for the VML, Office and PowerPoint Namespaces respectively. • xmlns is an attribute which can be used in any element (depending on the parser you may need to declare this as an allowed attribute in the DTD/Schema) • <student xmlns="studentschema"><name> ….
XML Namespaces III • And when we come to teacher, use <bigboss:teacher xmlns:bigboss="teacherschema"><bigboss:name> …. • In the above, we made student elements the default • We can more symmetrically write <university xmlns:bigboss="teacherschema" xmlns:downtrodden="studentschema"> • <downtrodden:student><downtrodden:name>you </downtrodden:name></downtrodden:student> …….. <bigboss:teacher><bigboss:name>me </bigboss:name></bigboss:teacher> • </university>
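To show what a namespace-aware parser makes of the symmetric form, here is a Python sketch using the standard library's ElementTree, which expands each prefixed name into {namespace-name}localname. The short prefixes t and s in the lookup map are arbitrary choices for this sketch; only the namespace names teacherschema and studentschema come from the slide.

```python
import xml.etree.ElementTree as ET

DOC = """<university xmlns:bigboss="teacherschema" xmlns:downtrodden="studentschema">
  <downtrodden:student><downtrodden:name>you</downtrodden:name></downtrodden:student>
  <bigboss:teacher><bigboss:name>me</bigboss:name></bigboss:teacher>
</university>"""

root = ET.fromstring(DOC)

# Prefixes disappear; each tag carries its namespace name instead.
expanded_tags = [child.tag for child in root]

# Search prefixes are local to this program and need not match the document's.
NS = {"t": "teacherschema", "s": "studentschema"}
teacher_name = root.findtext("t:teacher/t:name", namespaces=NS)
student_name = root.findtext("s:student/s:name", namespaces=NS)

print(expanded_tags)
print(teacher_name, student_name)
```

The two <name> elements no longer clash: one is {teacherschema}name and the other {studentschema}name, so each can have its own structure.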