2 Basics of XML and XML documents
Survivor's Guide to XML, or XML for Computer Scientists / Dummies
2.1 XML and XML documents
2.2 Basics of XML DTDs
2.3 XML Namespaces
2: XML Basics

2.1 XML and XML documents
• XML 1.0 - Extensible Markup Language, W3C Recommendation, February 1998
  • not an official standard, but a stable industry standard
  • 2nd Ed 2000, 3rd Ed 2004, 4th Ed 2006, 5th Ed Nov 2008
  • editorial revisions, not new versions of XML 1.0
• a simplified subset of SGML, Standard Generalized Markup Language, ISO 8879:1987
  • what is said later about valid XML documents applies to SGML documents, too

What is XML?
• Extensible Markup Language is not a markup language!
  • does not fix a tag set nor its semantics (unlike markup languages, e.g. HTML)
• XML documents have no inherent (processing or presentation) semantics
  • even though many think that XML is semantic or self-describing; see next

Semantics of XML Markup
• Meaning of this XML fragment?
• The computer cannot know it either!
• Implementing the semantics is the topic of this course

What is XML (2)?
• XML is
  (A) a way to use markup to represent information
  (B) a metalanguage
    • supports definition of specific markup languages through XML DTDs or Schemas
    • e.g. XHTML, a reformulation of HTML using XML
  (C) often, “XML” ≈ XML + XML technology
    • that is, processing models and languages we’re studying (and many others ...)

How does it look?
    <?xml version='1.0' encoding="iso-8859-1" ?>
    <invoice num="1234">
      <client clNum="00-01">
        <name>Pekka Kilpeläinen</name>
        <email>kilpelai@cs.uku.fi</email>
      </client>
      <item price="60" unit="EUR">XML Handbook</item>
      <item price="350" unit="FIM">XSLT Programmer's Ref</item>
    </invoice>

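To see what an application gets out of such a document, here is a minimal sketch that reads the invoice with Python's standard `xml.etree.ElementTree` module. The variable names are illustrative only, and the document is encoded to ISO-8859-1 bytes so that it agrees with its own XML declaration.

```python
import xml.etree.ElementTree as ET

# The invoice from the slide, stored as ISO-8859-1 bytes to match its XML declaration.
INVOICE = """<?xml version='1.0' encoding='iso-8859-1'?>
<invoice num="1234">
  <client clNum="00-01">
    <name>Pekka Kilpeläinen</name>
    <email>kilpelai@cs.uku.fi</email>
  </client>
  <item price="60" unit="EUR">XML Handbook</item>
  <item price="350" unit="FIM">XSLT Programmer's Ref</item>
</invoice>""".encode("iso-8859-1")

root = ET.fromstring(INVOICE)                 # root element: <invoice>
client_name = root.find("client/name").text   # text content of a nested element
# Attributes arrive as strings; converting price to int is the application's job.
items = [(item.text.strip(), int(item.get("price")), item.get("unit"))
         for item in root.findall("item")]
print(root.get("num"), client_name, items)
```

Note that the parser only reports names and strings; deciding that `price` is a number and `unit` a currency is left to the application.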
Essential Features of XML
• Overview of XML essentials
  • many details skipped
  • some to be discussed in exercises or with other topics when the need arises
• Learn to consult original sources (specifications, documentation etc.) for details!
  • The XML specification is easy to browse

XML Document Characters
• XML documents are made of ISO-10646 (32-bit) characters; in practice, of 16-bit Unicode chars (cf. Java)
• Three aspects of characters (see next):
  • visual presentation
  • representation by bytes/octets
  • numeric code points (e.g. 97, 98, 99, ...)

External Aspects of Characters
• Documents are stored/transmitted as a sequence of bytes (or octets). An encoding determines how characters are represented by bytes.
  • UTF-8 (a superset of 7-bit ASCII) is the XML default encoding
  • encoding="iso-8859-1" ~> 256 Western European chars as single bytes
• A font (a collection of character images called glyphs) determines the visual presentation of characters

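The difference between code points and their byte representation can be checked directly. A small sketch in Python; the sample string is an arbitrary choice for illustration.

```python
text = "abcä"

# Numeric code points: independent of any encoding.
code_points = [ord(ch) for ch in text]

# Byte representations: depend on the chosen encoding.
utf8_bytes = text.encode("utf-8")          # 'ä' needs two bytes in UTF-8
latin1_bytes = text.encode("iso-8859-1")   # 'ä' is a single byte in ISO-8859-1

print(code_points, utf8_bytes, latin1_bytes)
```

The ASCII letters are encoded identically in both cases; only the non-ASCII character differs.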
XML Encoding of Structure 1
• XML is, essentially, just a textual encoding scheme of labelled, ordered and attributed trees, in which
  • internal nodes are elements labelled by type names
  • leaves are text nodes labelled by string values, or empty element nodes
  • the left-to-right order of children of a node matters
  • element nodes may carry attributes (= name-string-value pairs)

XML Encoding of Structure 2
• XML encoding of a tree
  • corresponds to a pre-order walk
  • start of an element node with type name A is denoted by a start tag <A>, and its end by an end tag </A>
  • possible attributes are written within the start tag: <A attr1="value1" … attrn="valuen">
    • names must be unique: attrk ≠ attrh when k ≠ h
  • text nodes are written as their string value

XML Encoding of Structure: Example
[Figure: tree with root S and three children, in order: W (text "Hello"), E (attribute A=1, empty), W (text "world!")]
    <S>
      <W> Hello </W>
      <E A='1'/>
      <W> world! </W>
    </S>

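The pre-order encoding can be sketched in a few lines of Python. The tuple representation `(name, attributes, children)` is an assumption made for this illustration, not something the slides define; escaping is delegated to `xml.sax.saxutils`.

```python
from xml.sax.saxutils import escape, quoteattr

def serialize(node):
    """Pre-order walk: start tag, then children left to right, then end tag."""
    if isinstance(node, str):                  # text node: written as its string value
        return escape(node)
    name, attrs, children = node               # element node
    start = name + "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
    if not children:                           # empty element
        return f"<{start}/>"
    return f"<{start}>" + "".join(serialize(c) for c in children) + f"</{name}>"

# The example tree: S with children W("Hello"), E(A=1), W("world!").
tree = ("S", {}, [("W", {}, ["Hello"]), ("E", {"A": "1"}, []), ("W", {}, ["world!"])])
xml_text = serialize(tree)
print(xml_text)
```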
XML: Logical Document Structure
• Elements
  • indicated by matching (case-sensitive!) tags <ElementTypeName> … </ElementTypeName>
  • can contain text and/or subelements
  • can be empty: <elem-type></elem-type> or <elem-type/> (e.g. <br/> in XHTML)
  • unique root element -> the document is a single tree

Logical document structure (2)
• Attributes
  • name-value pairs attached to elements
  • "metadata", usually not treated as content
  • in the start-tag after the element type name: <div class="preface" date='990126'> …
• Also:
  • <!-- comments outside other markup -->
  • <?PI value of processing instruction 'PI'?>

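How these constructs show up to an application can be inspected with a DOM parser. A minimal sketch using Python's `xml.dom.minidom`; the sample element content is made up for illustration.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    '<div class="preface" date=\'990126\'><!-- a comment --><?PI value?>text</div>')
div = doc.documentElement

# Attributes are name-value pairs attached to the element, not child nodes.
attributes = dict(div.attributes.items())

# Comments, processing instructions and text are separate kinds of child nodes.
kinds = [child.nodeType for child in div.childNodes]
pi = div.childNodes[1]
print(attributes, kinds, pi.target, pi.data)
```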
CDATA Sections
• "CDATA Sections" to include XML markup characters as textual content
    <![CDATA[ Here we can include for example code fragments:
    <example>if (Count < 5 && Count > 0) </example> ]]>

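A parser reports the content of a CDATA section as ordinary text, exactly as if the markup characters had been escaped. A short sketch with `xml.etree.ElementTree`; the `example` element is reused from the slide, the rest is illustrative.

```python
import xml.etree.ElementTree as ET

# Markup characters inside a CDATA section are treated as data.
code = ET.fromstring(
    "<example><![CDATA[if (Count < 5 && Count > 0)]]></example>").text

# The same text written with escaped markup characters.
escaped = ET.fromstring(
    "<example>if (Count &lt; 5 &amp;&amp; Count &gt; 0)</example>").text

print(code == escaped, code)
```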
Two levels of correctness (1)
• Well-formed documents
  • roughly: follows the syntax of XML, markup correct (elements properly nested, tag names match, attributes of an element have unique names, ...)
  • violation is a fatal error
• Valid documents
  • (in addition to being well-formed) obey an associated grammar (DTD/Schema)

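Well-formedness is what any XML parser checks. A minimal sketch, assuming Python's non-validating `xml.etree.ElementTree` parser; the helper name `is_well_formed` is made up here. It says nothing about validity.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:      # a well-formedness violation is a fatal error
        return False

print(is_well_formed("<a><b/></a>"))        # properly nested
print(is_well_formed("<a><b></a></b>"))     # improperly nested
print(is_well_formed('<a x="1" x="2"/>'))   # duplicate attribute name
```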
XML docs and valid XML docs
[Figure: DTD-valid documents and Schema-valid documents shown within XML documents (= well-formed XML documents)]

An XML Processor (Parser)
• Reads XML documents and reports their contents to an application
  • relieves the application from details of markup
• XML Recommendation specifies, partly, the behaviour of XML processors:
  • recognition of characters as markup or data; what information to pass to applications; how to check the correctness of documents
  • validation based on comparing a document against its grammar

2.2 Basics of XML DTDs
• A Document Type Declaration provides a grammar (document type definition, DTD) for a class of documents [Defined in XML Rec]
• Syntax (in the prolog of a document instance):
    <!DOCTYPE rootElemType SYSTEM "ex.dtd"   <!-- "external subset" in file ex.dtd -->
      [ <!-- "internal subset" may come here --> ]>
• DTD = union of the external and internal subset (could be empty); internal has higher precedence
  • can override entity and attribute declarations (see next)

Markup Declarations
• DTD consists of markup declarations
  • element type declarations
    • similar to productions of ECFGs
  • attribute-list declarations
    • for declared element types
  • entity declarations
    • short-hand notations and physical storage units
  • notation declarations
    • information about external (binary) objects

What Do Declarations Look Like?
    <!ELEMENT invoice (client, item+)>
    <!ATTLIST invoice num NMTOKEN #REQUIRED>
    <!ELEMENT client (name, email?)>
    <!ATTLIST client num NMTOKEN #IMPLIED>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT email (#PCDATA)>
    <!ELEMENT item (#PCDATA)>
    <!ATTLIST item price NMTOKEN #REQUIRED
                   unit (FIM | EUR) "EUR">
• NMTOKEN: string of name characters, (Letter | [0-9] | . | - | _ | : )+

Element type declarations
• The general form is
    <!ELEMENT elemType (E)>
  where E is a content model (regular expr.); (≈ ECFG production elemType -> E)
• Content model operators:
    E | F : alternation
    E, F  : concatenation
    E?    : optional
    E*    : zero or more
    E+    : one or more
    (E)   : grouping
• No priorities: either (A,B)|C or A,(B|C), but no A,B|C

Attribute-List Declarations
• Can declare attributes for elements:
  • name, data type and possible default value:
      <!ATTLIST FIG id    ID          #IMPLIED
                    descr CDATA       #REQUIRED
                    class (a | b | c) "a">
• Semantics mainly up to the application
  • parser checks that ID attributes are unique and that targets of IDREF and IDREFS attributes exist

Mixed, Empty and Arbitrary Content
• Mixed content: <!ELEMENT P (#PCDATA | I | IMG)*>
  • may contain text (#PCDATA) and elements
• Empty content: <!ELEMENT IMG EMPTY>
• Arbitrary content: <!ELEMENT HTML ANY>
  (= <!ELEMENT HTML (#PCDATA | choice-of-all-declared-element-types)*>)

Entities (1)
• Named storage units or XML fragments (~ macros in some languages)
• character entities:
  • &#60;, &#x3C; and &lt; all expand to '<' (treated as data, not as start-of-markup)
  • other predefined entities: &amp; &gt; &apos; &quot; for & > ' "
• general entities are shorthand notations:
    <!ENTITY CS "<it>School</it> of Computing">

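Entity expansion can be observed with any parser that processes the internal DTD subset. A minimal sketch using `xml.etree.ElementTree` (built on Expat, which expands internal general entities); the wrapper elements `t` and `doc` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Character references and predefined entities expand to plain data characters.
refs = ET.fromstring("<t>&#60; &#x3C; &lt; &amp; &gt; &apos; &quot;</t>").text

# A general entity declared in the internal subset; its replacement text
# contains markup, which is parsed when the reference is expanded.
doc = """<!DOCTYPE doc [
  <!ENTITY CS "<it>School</it> of Computing">
]>
<doc>&CS;</doc>"""
root = ET.fromstring(doc)
expanded = "".join(root.itertext())
print(refs, root[0].tag, expanded)
```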
Entities (2)
• physical storage units comprising a document
  • parsed entities
      <!ENTITY chap1 SYSTEM "http://myweb/ch1">
  • document entity is the starting point of processing
  • entities and elements must nest properly:
      external entity chap1:
        <sec num="1"> … </sec>
        <sec num="2"> … </sec>
      document entity:
        <!DOCTYPE doc [ <!ENTITY chap1 (… as above …)> ]>
        <doc> &chap1; </doc>

Unparsed Entities
• For connecting external binary objects to XML documents (XML processor handles only text)
    <!NOTATION TIFF SYSTEM "bin/gimp">
    <!ENTITY fig123 SYSTEM "figs/f123.tif" NDATA TIFF>
    <!ATTLIST IMG file ENTITY #REQUIRED>
• Usage: <IMG file="fig123"/>
  • parser provides notation and address to the application
• an SGML legacy technique(?)

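What "the parser provides notation and address to the application" means can be sketched with Python's low-level `xml.parsers.expat` interface, which exposes handlers for notation and unparsed-entity declarations. The `doc` wrapper element, the dictionaries and the handler names are assumptions for this illustration.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE doc [
  <!NOTATION TIFF SYSTEM "bin/gimp">
  <!ENTITY fig123 SYSTEM "figs/f123.tif" NDATA TIFF>
  <!ATTLIST IMG file ENTITY #REQUIRED>
]>
<doc><IMG file="fig123"/></doc>"""

notations, unparsed, images = {}, {}, []

def on_notation(name, base, system_id, public_id):
    notations[name] = system_id                  # notation name -> handler/address

def on_unparsed(name, base, system_id, public_id, notation):
    unparsed[name] = (system_id, notation)       # entity name -> (address, notation)

def on_start(tag, attrs):
    if tag == "IMG":                             # the application resolves the entity name
        images.append(unparsed[attrs["file"]])

parser = xml.parsers.expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.UnparsedEntityDeclHandler = on_unparsed
parser.StartElementHandler = on_start
parser.Parse(DOC, True)
print(notations, images)
```

The parser never reads the TIFF file; it only hands the address and the notation name over to the application.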
An Alternative Way
• I have rarely used unparsed entities or notations
• Easier "HTML-style" linking: <IMG type="TIFF" file="figs/fig123"/>
  • the link indicates the format and the address directly
• maintenance of numerous entities might be easier with (indirect references through) explicit declarations

Parameter Entities
• to parameterize and modularize DTDs:
    <!ENTITY % tableDTD SYSTEM "dtds/tab.dtd">
    %tableDTD; <!-- include sub-dtd -->
    <!ENTITY % stAttr "ready (yes | no) 'no'">
    <!ATTLIST CHAP %stAttr; >
    <!ATTLIST SECT %stAttr; >
  (Parameter entities inside a markup declaration are allowed only in the external DTD subset)

Speculations about XML Parsing
• Parsing involves two things:
  1. Pulling the entities together, and checking the well-formedness
  2. Building a parse tree for the input (à la DOM), or otherwise informing the application about document content and structure (e.g. à la SAX)
• Task 2 is simple (<- simplicity of XML markup; see next)
• Checking well-formedness is straightforward
• Implementing validation is a bit more challenging

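The event-based alternative to tree building can be sketched with Python's `xml.sax` module: the parser calls back into a handler instead of returning a tree. The class name `EventLogger` and the event tuples are made up for this illustration, and whitespace-only text is ignored for brevity.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Records the parser's callbacks instead of building a tree."""
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        if content.strip():                      # skip whitespace-only text
            self.events.append(("text", content.strip()))

    def endElement(self, name):
        self.events.append(("end", name))

handler = EventLogger()
xml.sax.parseString(b"<S><W>Hello</W><E/><W>world!</W></S>", handler)
events = handler.events
print(events)
```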
Building an XML Parse Tree
[Figure: parse tree with root S and children W (text "Hello"), E (empty), W (text "world!")]
    <S>
      <W> Hello </W>
      <E> </E>
      <W> world! </W>
    </S>

Simplified XML Tree Building Algorithm
    C ← newDocument();   // current node
    Scan document text until at end:
      case of scanned text:
        "<Type>":  N ← newElementNode(Type);
                   C.addChildNode(N); C ← N;
        "Txt":     C.addChildNode(newTextNode(Txt));
        "</Type>": C ← C.parentNode();
    return C;
• Uses DOM-like tree operations

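The algorithm above can be turned into a runnable sketch. The `Node` class and the regular-expression tokenizer below are assumptions of this illustration; like the pseudocode, it handles only attribute-free start tags, end tags and text, performs no well-formedness checks, and (as an extra simplification) drops whitespace-only text.

```python
import re

class Node:
    """A DOM-like tree node: kind is 'document', 'element' or 'text'."""
    def __init__(self, kind, value=None):
        self.kind, self.value = kind, value
        self.parent, self.children = None, []

    def add_child(self, node):
        node.parent = self
        self.children.append(node)

# Groups: (1) '/' for an end tag, (2) element type name, (3) text between tags.
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*>|([^<]+)")

def build_tree(text):
    current = Node("document")                   # C <- newDocument()
    for closing, name, txt in TOKEN.findall(text):
        if txt:                                  # "Txt"
            if txt.strip():
                current.add_child(Node("text", txt.strip()))
        elif closing:                            # "</Type>"
            current = current.parent
        else:                                    # "<Type>"
            node = Node("element", name)
            current.add_child(node)
            current = node
    return current                               # the document node, if tags were balanced

document = build_tree("<S> <W> Hello </W> <E> </E> <W> world! </W> </S>")
root = document.children[0]
print(root.value, [child.value for child in root.children])
```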
Sketching XML validation
• Treat the document as a tree d
• Document is valid w.r.t. a grammar (DTD/Schema) G iff d is a syntax tree over G
  • Check that the root is labelled by the start symbol of G
  • For each element node n of the tree, check that its
    • attributes match with those of the element type
    • content matches the content model of its type: if n is of type A and its children are of type B1, …, Bn, check that the grammar has a production A -> E for which B1 … Bn ∈ L(E)   (1)

Sketching the validation… (2)
• How to check condition (1) (matching of children with a content model)?
  • by an automaton built from content model E
• Example: <!ELEMENT A ((B|C)+, D)>

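Condition (1) can be tried out by translating the content model into a regular expression over the sequence of child element names. This is only a sketch under stated assumptions: it swaps in Python's backtracking `re` engine for an explicitly constructed automaton, covers only element content (no #PCDATA, EMPTY or ANY), and the helper names are made up.

```python
import re
import xml.etree.ElementTree as ET

def model_to_regex(model):
    """Translate a content model such as '((B|C)+, D)' into a compiled regex.

    Each element name N becomes the token 'N;'; ',' (concatenation) and
    whitespace are dropped; '|', '?', '*', '+' and parentheses keep their meaning.
    """
    def translate(match):
        name = match.group(1)
        return f"(?:{re.escape(name)};)" if name else ""
    return re.compile(re.sub(r"([A-Za-z_][\w.-]*)|[,\s]+", translate, model))

def children_match(element, model):
    """Check that the child type names B1 ... Bn of the element belong to L(E)."""
    child_names = "".join(child.tag + ";" for child in element)
    return model_to_regex(model).fullmatch(child_names) is not None

MODEL_A = "((B|C)+, D)"
print(children_match(ET.fromstring("<A><B/><C/><D/></A>"), MODEL_A))
print(children_match(ET.fromstring("<A><D/></A>"), MODEL_A))
```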
2.3 XML Namespaces
• Documents often comprise parts processed by different applications (and/or defined in different schemas)
  • for example, in XSLT scripts:
      <xsl:template match="doc/title">
        <H1> <xsl:apply-templates /> </H1>
      </xsl:template>
    (here H1 is an HTML element; the xsl: elements are XSLT elements/instructions)
• How to manage multiple sets of names?

XML Namespaces (2/5)
• Solution: XML Namespaces, W3C Rec. Jan '99, for separating possibly overlapping "vocabularies" (sets of element type and attribute names) within a single document
  • by introducing (arbitrary) local name prefixes, and binding them to (fixed) globally unique URIs
• For example, the local prefix "xsl:" conventionally used in XSLT scripts

XML Namespaces briefly (3/5)
• Namespace identified by a URI (through the associated local prefix), e.g. http://www.w3.org/1999/XSL/Transform for XSLT
  • conventional but not required to use URLs
  • the identifying URI has to be unique, but it does not have to be an existing address
• Association is inherited by sub-elements
  • see the next example (of an XSLT script)

XML Namespaces (4/5)
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns="http://www.w3.org/TR/xhtml1/strict">
      <!-- XHTML is the 'default namespace' -->
      <xsl:template match="doc/title">
        <H1> <xsl:apply-templates /> </H1>
      </xsl:template>
    </xsl:stylesheet>

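A namespace-aware parser resolves prefixes, and the inherited default namespace, into URIs. A minimal sketch with `xml.etree.ElementTree`, which reports each element name in the form `{namespace-URI}local-name`; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

STYLESHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/TR/xhtml1/strict">
  <xsl:template match="doc/title">
    <H1><xsl:apply-templates/></H1>
  </xsl:template>
</xsl:stylesheet>"""

XSL = "http://www.w3.org/1999/XSL/Transform"
XHTML = "http://www.w3.org/TR/xhtml1/strict"

root = ET.fromstring(STYLESHEET)
# Document order: stylesheet, template, H1, apply-templates.
tags = [elem.tag for elem in root.iter()]
print(tags)
```

The unprefixed H1 ends up in the XHTML namespace because the default namespace declared on the root is inherited by sub-elements.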
XML Namespaces briefly (5/5)
• Built on top of basic XML
  • by overloading attribute syntax (xmlns:foo=...)
  • does not affect validation
    • namespace attributes must be declared for DTD-validity
    • all element type names must be declared (with their prefixes!)
    • => other schema languages (XML Schema, Relax NG) are better for validating documents with Namespaces
• Processing languages allow declaring NS prefixes; see next

Example: Namespaces in XSLT
XML:
    <test xmlns:foo="http://www.uef.fi/Test">
      <foo:A>aaa</foo:A>
      <foo:B>bbb</foo:B>
    </test>
XSLT:
    <xsl:transform version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:bar="http://www.uef.fi/Test">
      <xsl:template match="/">
        <xsl:apply-templates select="//bar:*" />
      </xsl:template>
    </xsl:transform>

Example: Namespaces in XQuery
ns-test.xml:
    <test xmlns:foo="http://www.uef.fi/Test">
      <foo:A>aaa</foo:A>
      <foo:B>bbb</foo:B>
    </test>
XQuery:
    xquery version "1.0";
    declare namespace zap = "http://www.uef.fi/Test";
    doc("ns-test.xml")//zap:*
• For NS support in XML APIs, see next

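That the prefix (foo, bar, zap) is only a local abbreviation can be checked with a small Python sketch: selecting by namespace URI gives the same result whatever prefix the document uses. The helper `in_namespace` is made up for this illustration and mimics what `//zap:*` selects.

```python
import xml.etree.ElementTree as ET

NS = "http://www.uef.fi/Test"

doc_foo = ('<test xmlns:foo="http://www.uef.fi/Test">'
           '<foo:A>aaa</foo:A><foo:B>bbb</foo:B></test>')
doc_bar = doc_foo.replace("foo", "bar")     # same document, different local prefix

def in_namespace(xml_text, uri):
    """Return (expanded name, text) for all elements in the given namespace."""
    root = ET.fromstring(xml_text)
    return [(e.tag, e.text) for e in root.iter() if e.tag.startswith("{" + uri + "}")]

selected = in_namespace(doc_foo, NS)
print(selected == in_namespace(doc_bar, NS), selected)
```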