Explore XML, a versatile markup language defining custom tags for diverse information domains such as legal documents, music notation, and more. Discover the evolution, standards, and limitations of HTML, emphasizing the power of XML for structured data storage and processing.
CIS336 Website design, implementation and management (also Semester 2 of CIS219, CIS221 and IT226) David Meredith d.meredith@gold.ac.uk www.titanmusic.com/teaching/cis336-2006-7.html Lecture 2 XML Documents (Based on Møller and Schwartzbach, 2006, Chapter 2)
What is XML? • XML = Extensible Markup Language • XML is a framework for defining markup languages • It is a subset of SGML • No fixed collection of tags like HTML • XML lets us define our own tags, designed for the kind of information we want to represent • Each XML language is targeted at a particular application domain (e.g., chemical formulae, legal documents, religious texts, music notation) • All XML languages use the same basic markup syntax and can benefit from a common set of generic tools for processing documents • XML is intended to be the future of all structured information • Including all information previously stored in relational databases • Prompted development of powerful query language, XQuery, which is designed to replace SQL
XML and HTML • XML is not an ‘extensible’ markup language • It is not a single markup language at all! • It defines a class of markup languages and a common notation that any markup language can use • XML is not an extension of or replacement for HTML • HTML should ideally be a particular application of XML, i.e., an XML language • HTML doesn’t fit directly into the XML framework • So W3C designed XHTML which is an XML-compliant variant of HTML
What XML doesn’t do • XML specification says nothing about the semantics of the markup tags • Specified by the individual XML languages • XML says nothing about how an XML document should be rendered in a browser • Can specify an XML stylesheet (using XSL) that defines how each tag should be rendered in a browser
XML and interoperability • XML is designed to be inherently internationalized and platform independent • All XML documents must use the Unicode character set • contains all international characters, past and present • XML also deals with different line-break encodings on different platforms by normalising all such breaks to the same sequence of characters • Defined by a public, free specification which can be viewed and implemented by anyone
Development of XML (see http://www.w3.org/xml/) • XML development started in mid 1990s • Initial draft specification of XML produced in November 1996 • Pure subset of SGML • XML 1.0 became a W3C recommendation in February 1998 • Latest version of XML 1.0 is the Fourth Edition, published in August 2006 • Available here: http://www.w3.org/TR/xml/ • XML 1.1 became a W3C recommendation in February 2004 • Latest version of XML 1.1 is the Second Edition, published in August 2006 • Available here: http://www.w3.org/TR/xml11/ • XML 1.1 incorporates recent and future changes in the Unicode standard and introduces the idea of normalization of character encodings • XML 1.1 is not fully compatible with XML 1.0 • Many applications were written for XML 1.0, so many stay with this standard in preference to XML 1.1 • Perhaps standard should have been simpler • But now huge amount of technology and information that relies on the standard so core features will probably not change
Limitations of HTML • HTML tags used here to indicate structure of recipes • But no way of enforcing correct format for data • Cannot easily sort recipes or select a subset with particular features • Cannot easily perform computations on the recipe data • HTML is not a good language for making a database • HTML is designed specifically for the hypertext domain, not other domains (like recipes) • In HTML, syntax and semantics (or structure and layout) are intertwined, even if we use cascading style sheets • For data storage, we want to store data with logical structure only so that it can be processed and formatted in all sorts of different ways
Recipes in XML • Can define our own recipe markup language in XML, RecipeML • Tags in RecipeML directly correspond to concepts in the recipe domain • e.g., recipe, ingredient, preparation step, etc. • Similar to identifying key domain abstractions in OO software engineering • XML-ification is the process of developing an XML representation for a particular domain • Essential information is in attributes and text between tags • Tags indicate structure only, not layout • Tags provide meta-information • For any domain, usually many possible markup designs, e.g., • could break up date into day, month and year • Could enclose ingredient list in <ingredients> tag • XML is semi-structured • Can choose level of detail at which to mark up text • Often have to choose between using attributes or elements, e.g., • name, amount, unit attributes could be tags
XML Language Syntax, Semantics & Use as a Database • Define syntax of an XML language (like RecipeML, XHTML) using XML Schema • i.e., what tags are allowed and where they can appear in the XML document • e.g., preparation tag can only contain step tags and step tag can only contain text • Define semantics of an XML language using XSLT • Transforms XML into appropriate XHTML file that can be displayed in a web browser • Use XQuery to search recipe collection and extract all sorts of information from it • For more specialized applications can use a general-purpose programming language like Java • e.g., to write a web-based recipe editor, might need to use Servlets and JSP
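The "recipe collection as a database" idea can be sketched in Python. This is only an illustration under stated assumptions: it uses the limited XPath subset of the standard library's xml.etree.ElementTree as a stand-in for XQuery, and the recipe data, the ingredient attributes and the helper names titles_using and sorted_titles are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical RecipeML-style data; the ingredient lists are made up.
RECIPES = """
<collection>
  <recipe>
    <title>Rhubarb Cobbler</title>
    <ingredient name="sugar" amount="2" unit="cup"/>
    <ingredient name="rhubarb" amount="4" unit="cup"/>
  </recipe>
  <recipe>
    <title>Garden Quiche</title>
    <ingredient name="egg" amount="3"/>
    <ingredient name="sugar" amount="1" unit="teaspoon"/>
  </recipe>
</collection>
"""

collection = ET.fromstring(RECIPES)


def titles_using(ingredient_name):
    """Return the titles of all recipes that list the given ingredient."""
    titles = []
    for recipe in collection.findall("recipe"):
        # [@name='...'] is an attribute predicate from ElementTree's XPath subset.
        if recipe.find(f"ingredient[@name='{ingredient_name}']") is not None:
            titles.append(recipe.findtext("title"))
    return titles


def sorted_titles():
    """Return all recipe titles in alphabetical order."""
    return sorted(r.findtext("title") for r in collection.findall("recipe"))
```

Because the markup records logical structure only, selecting and sorting become simple tree operations; a real application would more likely express such queries in XQuery or XSLT, as the slide suggests.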
XML Trees • Each XML document represents a hierarchical structure called an XML tree • Various ways of describing the structure of an XML tree, but here will adopt XPath Data Model • XML tree can be represented graphically with root node at the top (A in top diagram) • Edges between nodes represent parent-child relationships • A is parent of B; B is child of A in top diagram • Content of a node is sequence of child nodes • Sequence (B, C, D) is content of A in top diagram • Leaf node is one with no children • E, F, C and D are leaf nodes in top diagram • XML tree is ordered so ordering of children of a node is important • Two trees at right are not equivalent in XML • Siblings of a node are the other nodes that are children of the parent of the node • C and D are siblings of B in top diagram • Ancestors of a node include its parent, its parent’s parent, etc. back to root node • A and B are the ancestors of F in top diagram • Descendants of a node include its children, its children’s children and so on • Descendants of A are B, E, F, C and D in top diagram
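The tree terminology can be tried out on the tree described above (A with children B, C, D; B with children E, F). This is a minimal sketch using Python's xml.etree.ElementTree; the helper function names are illustrative, and note that ElementTree's root object is the root element, not the separate root node of the XPath data model.

```python
import xml.etree.ElementTree as ET

# The tree from the top diagram: A has children B, C, D; B has children E, F.
root = ET.fromstring("<A><B><E/><F/></B><C/><D/></A>")

# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def children(node):
    """Content of a node: its child elements, in document order."""
    return [c.tag for c in node]


def siblings(node):
    """Other children of the node's parent."""
    return [c.tag for c in parent_of[node] if c is not node]


def ancestors(node):
    """Parent, parent's parent, and so on back to the root element."""
    result = []
    while node in parent_of:
        node = parent_of[node]
        result.append(node.tag)
    return result


def descendants(node):
    """Children, children's children, and so on, in document order."""
    return [d.tag for d in node.iter() if d is not node]


def leaves(node):
    """Nodes in the subtree that have no children."""
    return [n.tag for n in node.iter() if len(n) == 0]


B = root.find("B")
F = root.find("B/F")
```

Since the tree is ordered, parsing <A><B/><C/></A> and <A><C/><B/></A> gives different child sequences, which is why the two trees are not equivalent.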
XML Tree Node Types • In XPath data model, XML tree is a special ordered tree in which each node is one of the following types: • Text nodes • Plain text, not an element, raw data • Always leaf nodes (i.e., cannot have child nodes) • Cannot have two consecutive sibling text nodes • Node labelled with text • Element nodes • Logical grouping of information represented by descendants • Node labelled with element name • Attribute nodes • Parent is always an element node • Specify global properties of parent element • Each attribute is a name-value pair where value is always a text string • Names of attributes of a given element must be distinct • Comment nodes • Always a leaf node • Always contains a text string • Processing instruction nodes • Used to convey specialized meta-information to XML processing tools • A target-value pair in which • Target word specifies type of tool at which instruction is directed • e.g., “xml-stylesheet” recognized by XSLT processors • Value string contains meta-information to be conveyed to the tool • e.g., URI of stylesheet to be used by XSLT processor • Always leaf nodes • Root nodes • Every XML tree has a single root node which represents entire document • Root node always contains exactly one element: the root element • Root node may also contain any number of processing instruction and comment nodes • Note distinction between root node and root element: • Root element is the element in the document that contains all the other elements • Root node in the tree represents the whole document
Tree view of XML recipe • Some subtleties: • Parent of each attribute node is an element node, but children of an element node do not include attributes • Attributes of an element node form an unordered set; but children of an element form an ordered set (or sequence) • Document ordering of nodes: • Node x occurs before node y if its start tag occurs earlier in the textual representation of the document than that of y • Parent precedes children, siblings ordered left-to-right • Tree-view conventions: • Root node drawn as a circle • Element nodes drawn as rounded boxes • Text nodes drawn as parallelograms • Attribute nodes drawn as rectangles containing “name: value” pairs
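The two subtleties about attributes can be checked with a short Python sketch. The ingredient element and its preparation and note children are hypothetical examples; ElementTree stores attributes in a dictionary and children in a sequence, which mirrors the unordered-set versus ordered-sequence distinction.

```python
import xml.etree.ElementTree as ET

# Same attributes in a different order; same children in the same order.
a = ET.fromstring('<ingredient name="sugar" amount="2" unit="cup"><preparation/><note/></ingredient>')
b = ET.fromstring('<ingredient unit="cup" amount="2" name="sugar"><preparation/><note/></ingredient>')
# Same attributes, but the children are swapped.
c = ET.fromstring('<ingredient name="sugar" amount="2" unit="cup"><note/><preparation/></ingredient>')

# Attribute order does not matter: dictionaries compare by content.
same_attributes = a.attrib == b.attrib

# Attributes are not children: only the nested elements are listed.
child_tags = [child.tag for child in a]

# Child order does matter: the sequences differ.
same_child_order = [x.tag for x in a] == [x.tag for x in c]
```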
Viewing tree structure in a browser • If you load an XML file in a modern browser and the file has no associated style sheet, then its tree structure is shown
Other XML data models • Foregoing is how XML document described in XPath data model • In DOM (Document Object Model) and JDOM (Java Document Object Model), an XML tree can contain other types of nodes such as: • Document Type nodes corresponding to Document Type Definitions (DTDs) • Entity reference nodes which are references to XML fragments defined in the DTD schema • CDATA nodes which are a special type of text node
Issues in designing an XML language • Text nodes usually contain the actual information or data • Elements and their attributes used to convey logical structure and meta-information about the data • Difference between information and meta-information not always obvious • Some languages use elements for everything • Others use attributes for everything so that all elements are empty • Most languages use a mixture of elements and attributes
Textual representation of XML documents • XML document is a Unicode text with markup tags and other meta-information representing elements, attributes and other nodes • Text nodes are written as the text they represent (character data) • Element nodes delimited by start and end tags: <related ref="42">Garden Quiche is also yummy.</related> • Text in between start and end tags is the content • This constitutes descendants of the element node • Attributes written inside element start tag and attribute values always written within double or single quotes: ref="42" or ref='42' • Empty element is one without content (i.e., nothing between start and end tags): <pineapple></pineapple> or <pineapple/> • XML document must be well-formed: • Nodes organised into a strictly nested tree structure • Every start tag must have an end tag (or use abbreviated form for empty element) • Elements must nest properly: • Properly nested: <banana><orange></orange></banana> • Improperly nested: <banana><orange></banana></orange> • Cf. HTML which allows certain tags (particularly many end tags) to be omitted and also allows improper nesting • XML is case-sensitive: • <Tag></tag> is not well-formed because end tag not the same name as start tag
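The well-formedness rules can be tested mechanically, since any conforming XML parser must reject a document that breaks them. The sketch below uses Python's xml.etree.ElementTree; the helper name is_well_formed is an illustrative choice.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


properly_nested = "<banana><orange></orange></banana>"
improperly_nested = "<banana><orange></banana></orange>"
case_mismatch = "<Tag></tag>"          # end tag name differs in case
unquoted_attribute = "<related ref=42>Garden Quiche is also yummy.</related>"
quoted_attribute = "<related ref='42'>Garden Quiche is also yummy.</related>"

# Both spellings of an empty element produce an element with no children.
long_form = ET.fromstring("<pineapple></pineapple>")
short_form = ET.fromstring("<pineapple/>")
```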
Textual representation of XML documents • XML document should begin with an XML declaration: <?xml version="1.0" encoding="UTF-8" ?> • Version attribute indicates version of XML being used • Should be 1.0 or 1.1 • Encoding attribute indicates encoding used in file • All XML parsers required to understand Unicode encodings UTF-8 and UTF-16 • Some parsers support other popular encodings like ISO-8859-1 but must then be able to convert from these encodings to Unicode code points • Best to use UTF-8 or UTF-16 if possible • XML declaration followed by root element
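The role of the encoding attribute can be illustrated with a small Python sketch. It assumes the parser behind xml.etree.ElementTree (expat) supports ISO-8859-1 in addition to the required UTF-8; the name element is a hypothetical example.

```python
import xml.etree.ElementTree as ET

body = "<name>M\u00f8ller</name>"  # "Møller", contains a non-ASCII character

# The same document stored in two encodings, each correctly declared.
utf8_doc = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")
latin1_doc = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + body).encode("iso-8859-1")

# After parsing, both yield the same Unicode string.
utf8_name = ET.fromstring(utf8_doc).text
latin1_name = ET.fromstring(latin1_doc).text


def parses(data):
    """Return True if the byte string parses as XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# A declaration that does not match the actual bytes: ISO-8859-1 bytes labelled UTF-8.
mislabeled = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("iso-8859-1")
```

The byte sequences differ, but the parser converts both to the same Unicode code points; the mislabeled document is rejected because its bytes are not valid UTF-8.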
Character data and attribute values • In character data (text nodes) and attribute values, special characters have to be escaped using Unicode character references • &#N; denotes Unicode character with code point N represented in decimal • &#xN; denotes Unicode character with code point N represented in hexadecimal • Some characters are predefined entities in XML (see table above) • Examples: • < can be referenced as &lt;, &#60; or &#x3C; • & can be referenced as &amp;, &#38; or &#x26; • < and & must be escaped in both character data and attribute values • In attribute values that are enclosed by " or ' this character must also be escaped • Also use character references to encode Unicode characters that are not accessible from the keyboard • e.g., “sake” in hiragana script is さけ, which is encoded as &#x3055;&#x3051; • Complete list of Unicode character code points is available on the Unicode website at: http://www.unicode.org/charts/
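The equivalence of the different reference forms can be checked in Python. This sketch parses each form with xml.etree.ElementTree and uses escape from xml.sax.saxutils for the reverse direction; the element names t and word are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Predefined entity, decimal reference and hexadecimal reference for "<".
less_than_forms = ["&lt;", "&#60;", "&#x3C;"]
decoded_less_than = [ET.fromstring(f"<t>{ref}</t>").text for ref in less_than_forms]

# The same three forms for "&".
ampersand_forms = ["&amp;", "&#38;", "&#x26;"]
decoded_ampersand = [ET.fromstring(f"<t>{ref}</t>").text for ref in ampersand_forms]

# Hexadecimal references for the two hiragana characters of "sake".
sake = ET.fromstring("<word>&#x3055;&#x3051;</word>").text

# Going the other way: escape special characters before writing character data.
escaped_text = escape("a<b & b>c")

# In a double-quoted attribute value the quote character must be escaped too.
escaped_attribute = escape('say "hi"', {'"': "&quot;"})
```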
CDATA sections • If you have some text that contains lots of characters that have to be escaped, then you can enclose the text within a CDATA section • CDATA section corresponds to a CDATA node in the DOM and JDOM data models • For example, in most situations, <![CDATA[a<b & b>c]]> is equivalent to a&lt;b &amp; b&gt;c • Strange syntax for CDATA sections originates in SGML
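A quick Python check of this equivalence, using xml.etree.ElementTree (which, like the XPath data model, does not distinguish CDATA sections from ordinary text); the expr element is a hypothetical wrapper.

```python
import xml.etree.ElementTree as ET

# The same text written once as a CDATA section and once with escapes.
from_cdata = ET.fromstring("<expr><![CDATA[a<b & b>c]]></expr>").text
from_escapes = ET.fromstring("<expr>a&lt;b &amp; b&gt;c</expr>").text
```

Both variables hold the same character data, so an application working at this level cannot tell which notation the author used.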
Comments, processing instructions and DTD information • Comment nodes are encoded in the source in the same way as in HTML: <!--This is a comment--> • A processing instruction is a target-value pair delimited by <?...?>: <?xml-stylesheet type="text/xsl" href="mystyle.xsl"?> in which xml-stylesheet is the target and the string type="text/xsl" href="mystyle.xsl" is the single value • Document type nodes (recognized in DOM and JDOM) are encoded as follows: <!DOCTYPE …>
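These node kinds can be inspected with Python's xml.dom.minidom, a DOM implementation that keeps comments and processing instructions. The sketch below is an assumption-laden illustration: the document is a made-up fragment built around the features root element name, and the Document object plays the role of the root node.

```python
from xml.dom import minidom, Node

SOURCE = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/xsl" href="mystyle.xsl"?>'
    '<!--This is a comment-->'
    '<features/>'
)

doc = minidom.parseString(SOURCE)

# Children of the document (root) node; the XML declaration is not a node.
kinds = [node.nodeType for node in doc.childNodes]

pi, comment, root_element = doc.childNodes
target = pi.target          # the tool the instruction is aimed at
value = pi.data             # the single value string, not parsed further
comment_text = comment.data
```

Note that the value of the processing instruction is one string; it only looks like a pair of attributes.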
Example XML document • Contains an XML declaration, followed by a document type definition and then a single root element named features • The features element contains a processing instruction, some character data and a comment
White space in XML • Often convenient to use “white space” (spaces, tabs and new lines) to format source and make it more readable • Usually this white space is not supposed to be included in the delivered version of the document • However, sometimes we want the white space to be preserved • e.g., in poetry or computer programme source code • By default, the way that white space is handled in an XML document is decided by the application that is used to process the document • If we definitely want white space within the content of an element to be preserved, then we assign the value "preserve" to the attribute xml:space in that element • Applies to all elements within content of element where xml:space attribute value specified, unless overridden by another instance of the xml:space attribute • Typically, white space handling is defined in the DTD for the specific language
Is XML too verbose? • Some argue that XML markup is more verbose than necessary • Same information can often be represented much more parsimoniously in a relational database • Leads to (misguided) advice to use attributes in preference to elements and short names for both attributes and elements • This usually leads to inflexible and incomprehensible language designs! • Better to disregard such considerations in the design phase and then compress files later using either a general purpose compression program or one that is optimized for XML, such as XMill http://sourceforge.net/projects/xmill • To represent structured information, all you really need are text and element nodes • One simpler alternative is to use something like Lisp S-Expressions which date back to 1958 • For example, (collection (recipe (title "Rhubarb Cobbler") (date "Wed, 14 Jun 95") … )) represents the same as <collection> <recipe> <title>Rhubarb Cobbler</title> <date>Wed, 14 Jun 95</date> </recipe> </collection>
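The correspondence between the two notations can be made concrete with a short Python sketch. The function to_sexpr is a hypothetical helper, and it deliberately handles only element nodes and simple text content; attributes and mixed content are ignored.

```python
import xml.etree.ElementTree as ET


def to_sexpr(element):
    """Render an element (element and text nodes only) as an S-expression."""
    parts = [element.tag]
    # Keep text content, skipping the white space used only for formatting.
    if element.text and element.text.strip():
        parts.append('"' + element.text.strip() + '"')
    parts.extend(to_sexpr(child) for child in element)
    return "(" + " ".join(parts) + ")"


xml_source = (
    "<collection> <recipe> <title>Rhubarb Cobbler</title> "
    "<date>Wed, 14 Jun 95</date> </recipe> </collection>"
)
sexpr = to_sexpr(ET.fromstring(xml_source))
```

The S-expression drops the repeated end-tag names, which is where most of the size saving comes from.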
Applications of XML • Hundreds of XML applications have been developed for many different domains: http://xml.coverpages.org/xmlApplications.html • XML languages can be roughly classified into • Data-oriented languages for describing data that would traditionally have been stored in databases. • Usually have a flat, wide structure, with the root element containing many similar children, each with a simple structure • Document-oriented languages (e.g., XHTML) are for annotating the structure of natural language text • Elements often have mixed content (elements and character data) • Unlike documents in a data-oriented language, documents in a document-oriented XML language can usually be understood even if the markup tags are removed • Protocols and programming languages including, e.g., XML Schema and XSLT • Usually have most complex syntax • Hybrids often combine features of data- and document-oriented languages • Typically allow freeform text as content of certain elements • e.g., <comment> element in RecipeML
Examples of XML languages: XHTML • XHTML 1.0 is W3C’s XML-ification of HTML 4.01 • Apart from the XML declaration and the XHTML namespace declaration, XHTML is very similar to HTML 4.01 • However, XHTML document must be a well-formed XML document, therefore • Omitting end tags is forbidden in XHTML • Empty elements can be abbreviated using <…/> notation, e.g., <br/> • XHTML element and attribute names must be lower case • Attribute values cannot be omitted and must be surrounded by double or single quotes • e.g., attribute checked in HTML must be written checked="checked"
XHTML Variants • XHTML 1.0 Strict • Clean markup in which all layout is specified using CSS • XHTML 1.0 Transitional • Additionally permits explicit layout markup like bgcolor attribute and font tag • XHTML 1.0 Frameset • Allows use of frames • XHTML 1.1 • Modularization of XHTML 1.0 in which language partitioned by functionality into modules, e.g., • Structure: includes html, head and body tags • Text: includes basic text markup • Hypertext: includes the anchor tag (<a>) • Lists: ul, ol, dl,… • Forms: form, input, select,… • … • Each module defined using a separate DTD • Allows specific subsets of the XHTML language to be included in new languages
Other XML languages • CML (Chemical Markup Language) • Data-oriented language for representing molecules and chemical reactions • One of the first XML applications • Supported by wide range of tools such as browsers and editors • WML (Wireless Markup Language) • Document-oriented XML language that replaces HTML on mobile devices that typically have small displays, limited user-input facilities and low bandwidth • ebXML (Electronic Business XML Initiative) • Worldwide initiative to use XML for exchanging electronic business data • Has provided comprehensive standards for business processes, data components, collaboration protocol agreements, messaging etc. • Complex language that belongs to “protocols and programming languages” category of XML languages • ThML (Theological Markup Language) • Superset of XHTML for markup of theological texts • Supports references, annotations, glossaries • MusicXML • For encoding Western musical staff notation • Many other XML applications and initiatives listed here: http://xml.coverpages.org/xmlApplications.html
Namespaces in XML • Not part of the XML specification • Defined separately from XML specification • For XML 1.0: • http://www.w3.org/TR/xml-names/ • For XML 1.1: • http://www.w3.org/TR/xml-names11/ • “XML namespaces provide a simple method for qualifying element and attribute names used in Extensible Markup Language documents by associating them with namespaces identified by IRI references”(http://www.w3.org/TR/xml-names/)
XML Namespaces: Motivating problem • The (fictitious) XML language, WidgetML, is designed for describing widgets • Can include explanatory text written in XHTML within a WidgetML document • i.e., WidgetML uses XHTML as a sublanguage • Example above describes a widget called gadget with a medium-sized head and a big gizmo subwidget • XHTML message contained within info element • Both XHTML and non-XHTML part of WidgetML use tags big and head • Means that big and head tags can mean different things within a WidgetML document, depending on context • Demonstrates need to be able to avoid name clashes when combining languages that may use elements with the same names to mean different things • Programming languages use namespaces and qualified names to avoid clashing names • In XML we also use namespaces and each namespace is identified by a unique URI • E.g., XHTML namespace is associated with the URI, http://www.w3.org/1999/xhtml • WidgetML developed by a company called Widget Inc. whose domain is www.widget.inc, so assigns a URI under their domain to the WidgetML namespace, such as http://www.widget.inc/widgetml/
Namespaces • So now, instead of just writing <head>…</head> inside the info element, we can write <{http://www.w3.org/1999/xhtml}head> … </{http://www.w3.org/1999/xhtml}head> to specify that we mean the head tag from XHTML, not the head tag from WidgetML • But prefixing every tag with a long URI would lead to an extremely verbose and incomprehensible document • Instead, we assign a short name to the namespace we want to use within an element and declare this association between the short name and the namespace as an attribute in the start tag of the containing element: <info xmlns:foo="http://www.w3.org/1999/xhtml"> … <foo:head>…</foo:head> </info> We can then prefix the tags within the containing element with the short name to indicate that they are from the declared namespace • The attribute, xmlns:foo="http://www.w3.org/1999/xhtml", • Declares the namespace named http://www.w3.org/1999/xhtml • Gives this namespace the prefix, foo
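The {URI}name form is more than a teaching device: Python's xml.etree.ElementTree reports every qualified name in exactly this expanded notation. The sketch below assumes a made-up WidgetML fragment (the original example document is not reproduced here) in which an XHTML head appears inside the info element.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Hypothetical WidgetML document with an XHTML fragment inside <info>.
doc = ET.fromstring(
    '<widget type="gadget">'
    '<head size="medium"/>'
    '<info xmlns:foo="http://www.w3.org/1999/xhtml">'
    '<foo:head><foo:title>Description of gadget</foo:title></foo:head>'
    '</info>'
    '</widget>'
)

# The parser replaces each prefix with the namespace URI in braces.
tags = [element.tag for element in doc.iter()]

# The prefix itself is irrelevant: another prefix bound to the same URI
# produces the same expanded name.
other = ET.fromstring(
    '<info xmlns:bar="http://www.w3.org/1999/xhtml"><bar:head/></info>'
)

# Searching by namespace: the mapping passed to find() is local to the query.
xhtml_head = doc.find("info/x:head", {"x": XHTML})
widget_head = doc.find("head")
```

The WidgetML head and the XHTML head now have different names, so a tool can tell them apart regardless of where they appear.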
Namespaces • Namespace declaration applies to all contents of element in whose start tag it occurs • Can use any name as a prefix except one that contains a colon or one starting with the letters “XML” (in any combination of upper or lower case) • NCName (No Colon Name) is one that does not contain a colon • QName (Qualified name) may be an NCName or an NCName prefixed with a namespace prefix and a colon • Unprefixed element names are assigned a default namespace • Default namespace can be overridden by setting attribute xmlns to the URI for the namespace to be used for unprefixed tags: <widget type="gadget" xmlns="http://www.widget.inc">
Default namespaces don’t apply to attributes! • 1 and 2 are equivalent: • The size attribute does not come from the http://www.widget.inc namespace • In 3, the size attribute does come from the http://www.widget.inc namespace
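The rule can be demonstrated in Python with xml.etree.ElementTree. The three documents below are assumed reconstructions in the spirit of examples 1 to 3 (the original examples are not reproduced here), so the exact element content is hypothetical.

```python
import xml.etree.ElementTree as ET

WIDGET = "http://www.widget.inc"

# 1: default namespace on the element, unprefixed attribute.
one = ET.fromstring('<widget xmlns="http://www.widget.inc" size="big"/>')

# 2: the same element name written with an explicit prefix, unprefixed attribute.
two = ET.fromstring('<w:widget xmlns:w="http://www.widget.inc" size="big"/>')

# 3: the attribute itself carries the prefix.
three = ET.fromstring('<w:widget xmlns:w="http://www.widget.inc" w:size="big"/>')

element_names = [one.tag, two.tag, three.tag]
attribute_names = [list(doc.attrib) for doc in (one, two, three)]
```

All three elements have the same expanded name, but only in the third does the size attribute belong to the namespace: the default namespace declaration in the first document affects the element name and leaves the unprefixed attribute alone.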
Example RecipeML document • An example RecipeML document is available at http://www.brics.dk/ixwt/examples/recipes.xml
Summary • XML is a framework for developing markup languages in any conceivable domain • XML is just a notation for hierarchically structuring textual data • Strength of XML is that it is a widely accepted standard supported by many generic languages and tools • Means you get lots of free infrastructure if you build on it • Considered XML in the form of trees and in its textual representation • Considered the namespace mechanism for resolving name conflicts