2 Basics of XML and XML documents

  1. 2 Basics of XML and XML documents • Survivor's Guide to XML, or XML for Computer Scientists / Dummies • 2.1 XML and XML documents • 2.2 Basics of XML DTDs • 2.3 XML Namespaces 2: XML Basics

  2. 2.1 XML and XML documents • XML 1.0 - Extensible Markup Language, W3C Recommendation, February 1998 • not an official standard, but a stable industry standard • 2nd Ed 2000, 3rd Ed 2004, 4th Ed 2006, 5th Ed Nov 2008 • editorial revisions, not new versions of XML 1.0 • a simplified subset of SGML, Standard Generalized Markup Language, ISO 8879:1986 • what is said later about valid XML documents applies to SGML documents, too

  3. What is XML? • Extensible Markup Language is not a markup language! • does not fix a tag set nor its semantics (unlike markup languages, e.g. HTML) • XML documents have no inherent (processing or presentation) semantics • even though many think that XML is semantic or self-describing; see next

  4. Semantics of XML Markup • Meaning of this XML fragment? • The computer cannot know it either! • Implementing the semantics is the topic of this course

  5. What is XML (2)? • XML is (A) a way to use markup to represent information • (B) a metalanguage • supports definition of specific markup languages through XML DTDs or Schemas • e.g. XHTML, a reformulation of HTML using XML • (C) Often "XML" = XML + XML technology • that is, the processing models and languages we're studying (and many others ...)

  6. How does it look?
          <?xml version='1.0' encoding="iso-8859-1" ?>
          <invoice num="1234">
            <client clNum="00-01">
              <name>Pekka Kilpeläinen</name>
              <email>kilpelai@cs.uku.fi</email>
            </client>
            <item price="60" unit="EUR">XML Handbook</item>
            <item price="350" unit="FIM">XSLT Programmer's Ref</item>
          </invoice>
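To make the invoice example concrete, here is a minimal sketch of how an application might read it through an XML processor. It assumes Python's standard `xml.etree.ElementTree` module; the variable names (`INVOICE`, `items`) are illustrative and not part of the slides. The document is handed to the parser as ISO-8859-1 bytes so that the encoding declaration in the prolog is actually exercised.

```python
import xml.etree.ElementTree as ET

# The invoice document from the slide ("\u00e4" is the letter a-umlaut).
INVOICE = """<?xml version='1.0' encoding="iso-8859-1" ?>
<invoice num="1234">
  <client clNum="00-01">
    <name>Pekka Kilpel\u00e4inen</name>
    <email>kilpelai@cs.uku.fi</email>
  </client>
  <item price="60" unit="EUR">XML Handbook</item>
  <item price="350" unit="FIM">XSLT Programmer's Ref</item>
</invoice>"""

# Encode to the declared encoding; the parser decodes it back to characters.
root = ET.fromstring(INVOICE.encode("iso-8859-1"))

# Attributes are reported as strings; the parser attaches no meaning to them.
items = [(item.text.strip(), int(item.get("price")), item.get("unit"))
         for item in root.iter("item")]

for title, price, unit in items:
    print(f"{title}: {price} {unit}")
```

Note that interpreting `price` as a number is a decision made by the application, not by XML: the parser only reports names and strings.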

  7. Essential Features of XML • Overview of XML essentials • many details skipped • some to be discussed in exercises or with other topics when the need arises • Learn to consult original sources (specifications, documentation etc.) for details! • The XML specification is easy to browse

  8. XML Document Characters • XML documents are made of ISO-10646 (32-bit) characters; in practice of 16-bit Unicode chars (cf. Java) • Three aspects of characters (see next): • visual presentation • representation by bytes/octets • numeric code points (e.g. 97, 98, 99, ...)

  9. External Aspects of Characters • Documents are stored/transmitted as a sequence of bytes (or octets). An encoding determines how characters are represented by bytes. • UTF-8 (a superset of 7-bit ASCII) is the XML default encoding • encoding="iso-8859-1" → 256 Western European chars as single bytes • A font (collection of character images called glyphs) determines the visual presentation of characters
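The difference between code points and their byte representation can be checked directly. The following is a small sketch using Python's built-in string encoding; the sample name is taken from the invoice example, everything else is illustrative.

```python
# "\u00e4" is the letter a-umlaut, written as an escape to keep the source ASCII-only.
name = "Kilpel\u00e4inen"

# Numeric code points: one number per character, independent of any encoding.
code_points = [ord(ch) for ch in name]

# Byte representations of the same characters under two encodings.
as_utf8 = name.encode("utf-8")         # a-umlaut takes two bytes
as_latin1 = name.encode("iso-8859-1")  # every character takes one byte

print(len(name), len(as_utf8), len(as_latin1))
print(as_utf8)
print(as_latin1)
```

For pure 7-bit ASCII text both encodings produce identical bytes, which is why the difference only shows up for characters such as the a-umlaut.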

  10. XML Encoding of Structure 1 • XML is, essentially, just a textual encoding scheme of labelled, ordered and attributed trees, in which • internal nodes are elements labelled by type names • leaves are text nodes labelled by string values, or empty element nodes • the left-to-right order of children of a node matters • element nodes may carry attributes (= name/string-value pairs)

  11. XML Encoding of Structure 2 • XML encoding of a tree • corresponds to a pre-order walk • start of an element node with type name A denoted by a start tag <A>, and its end denoted by end tag </A> • possible attributes written within the start tag: <A attr1="value1" … attrn="valuen"> • names must be unique: attrk ≠ attrh when k ≠ h • text nodes written as their string value

  12. XML Encoding of Structure: Example • Tree: root S with children W (text "Hello"), E (empty, attribute A=1) and W (text "world!") • Encoding:
          <S>
            <W> Hello </W>
            <E A='1'/>
            <W> world! </W>
          </S>
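The encoding rules of the last three slides can be sketched as a short recursive pre-order walk. This is only an illustration under simple assumptions: the `Element` class and the `serialize` function are hypothetical names invented for this sketch, and whitespace around text is not reproduced. The escaping helpers come from Python's standard `xml.sax.saxutils` module.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Element:
    """A labelled, ordered, attributed tree node (hypothetical helper class)."""
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Element nodes or plain strings


def serialize(node):
    """Pre-order walk: start tag, then the children left to right, then end tag."""
    if isinstance(node, str):
        return escape(node)  # text node: written as its (escaped) string value
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in node.attrs.items())
    if not node.children:
        return f"<{node.name}{attrs}/>"  # empty element
    inner = "".join(serialize(child) for child in node.children)
    return f"<{node.name}{attrs}>{inner}</{node.name}>"


tree = Element("S", children=[
    Element("W", children=["Hello"]),
    Element("E", {"A": "1"}),
    Element("W", children=["world!"]),
])
encoded = serialize(tree)
print(encoded)
```

Because attributes are stored in a dictionary, the uniqueness of attribute names within one element holds by construction.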

  13. XML: Logical Document Structure • Elements • indicated by matching (case-sensitive!) tags <ElementTypeName> … </ElementTypeName> • can contain text and/or subelements • can be empty: <elem-type></elem-type> or <elem-type/> (e.g. <br/> in XHTML) • unique root element → the document is a single tree

  14. Logical document structure (2) • Attributes • name-value pairs attached to elements • "metadata", usually not treated as content • in start-tag after the element type name: <div class="preface" date='990126'> … • Also: • <!-- comments outside other markup --> • <?PI value of processing instruction 'PI'?>

  15. CDATA Sections • "CDATA Sections" to include XML markup characters as textual content:
          <![CDATA[ Here we can include for example code fragments:
          <example>if (Count < 5 && Count > 0) </example> ]]>
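A quick way to see what a CDATA section does is to compare it with the same text written using entity references. The sketch below assumes Python's standard `xml.etree.ElementTree`; the wrapping `<note>` element is a made-up name for illustration.

```python
import xml.etree.ElementTree as ET

# Markup characters inside a CDATA section are plain character data.
with_cdata = "<note><![CDATA[if (Count < 5 && Count > 0)]]></note>"

# The same content written with escaped markup characters instead.
with_escapes = "<note>if (Count &lt; 5 &amp;&amp; Count &gt; 0)</note>"

text_from_cdata = ET.fromstring(with_cdata).text
text_from_escapes = ET.fromstring(with_escapes).text

print(text_from_cdata)
print(text_from_cdata == text_from_escapes)
```

The application receives identical text in both cases; CDATA is purely a convenience for the author of the document.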

  16. Two levels of correctness (1) • Well-formed documents • roughly: follows the syntax of XML, markup correct (elements properly nested, tag names match, attributes of an element have unique names, ...) • violation is a fatal error • Valid documents • (in addition to being well-formed) obey an associated grammar (DTD/Schema)
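Well-formedness can be tested with any XML processor, since a violation must be reported as a fatal error. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` (a non-validating parser); `is_well_formed` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text, False on a fatal syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = [
    "<a><b/></a>",       # properly nested
    "<a><b></a></b>",    # elements not properly nested
    "<A></a>",           # tag names are case-sensitive and must match
    "<a x='1' x='2'/>",  # attribute names must be unique
    "<a/><b/>",          # more than one root element
]
for sample in samples:
    print(sample, is_well_formed(sample))
```

This checks only the first level of correctness; checking validity additionally requires a grammar and a validating processor.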

  17. XML docs and valid XML docs • [Diagram] XML documents = well-formed XML documents; DTD-valid documents and Schema-valid documents are subsets of these

  18. An XML Processor (Parser) • Reads XML documents and reports their contents to an application • relieves the application from details of markup • XML Recommendation specifies, partly, the behaviour of XML processors: • recognition of characters as markup or data; what information to pass to applications; how to check the correctness of documents; • validation based on comparing document against its grammar

  19. 2.2 Basics of XML DTDs • A Document Type Declaration provides a grammar (document type definition, DTD) for a class of documents [Defined in XML Rec] • Syntax (in the prolog of a document instance):
          <!DOCTYPE rootElemType SYSTEM "ex.dtd"
            <!-- "external subset" in file ex.dtd -->
            [ <!-- "internal subset" may come here -->
            ]>
      • DTD = union of the external and internal subset (could be empty); internal has higher precedence • can override entity and attribute declarations (see next)

  20. Markup Declarations • DTD consists of markup declarations • element type declarations • similar to productions of ECFGs (extended context-free grammars) • attribute-list declarations • for declared element types • entity declarations • short-hand notations and physical storage units • notation declarations • information about external (binary) objects

  21. How do Declarations Look?
          <!ELEMENT invoice (client, item+)>
          <!ATTLIST invoice num NMTOKEN #REQUIRED>
          <!ELEMENT client (name, email?)>
          <!ATTLIST client num NMTOKEN #IMPLIED>
          <!ELEMENT name (#PCDATA)>
          <!ELEMENT email (#PCDATA)>
          <!ELEMENT item (#PCDATA)>
          <!ATTLIST item price NMTOKEN #REQUIRED
                         unit (FIM | EUR) "EUR" >
      • NMTOKEN: a string of name characters; (Letter | [0-9] | . | - | _ | : )+

  22. Element type declarations • The general form is <!ELEMENT elemType (E)> where E is a content model (regular expr.); (≈ ECFG production elemType → E) • Content model operators: E | F : alternation • E, F : concatenation • E? : optional • E* : zero or more • E+ : one or more • (E) : grouping • No priorities: either (A,B)|C or A,(B|C), but not A,B|C

  23. Attribute-List Declarations • Can declare attributes for elements: • Name, data type and possible default value:
          <!ATTLIST FIG id    ID      #IMPLIED
                        descr CDATA   #REQUIRED
                        class (a | b | c) "a">
      • Semantics mainly up to the application • parser checks that ID attributes are unique and that targets of IDREF and IDREFS attributes exist

  24. Mixed, Empty and Arbitrary Content • Mixed content: <!ELEMENT P (#PCDATA | I | IMG)*> • may contain text (#PCDATA) and elements • Empty content: <!ELEMENT IMG EMPTY> • Arbitrary content: <!ELEMENT HTML ANY> (= <!ELEMENT HTML (#PCDATA | choice-of-all-declared-element-types)*>)

  25. Entities (1) • Named storage units or XML fragments (~ macros in some languages) • character entities: • &lt;, &#60; and &#x3C; all expand to '<' (treated as data, not as start-of-markup) • other predefined entities: &amp; &gt; &apos; &quot; for & > ' " • general entities are shorthand notations: <!ENTITY CS "<it>School</it> of Computing">
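Entity expansion is done by the parser before the application sees the content. The sketch below assumes Python's standard `xml.etree.ElementTree`, whose underlying parser expands character references, the predefined entities, and general entities declared in the internal DTD subset. The element names `t` and `doc` are made up for illustration; the `CS` entity is the one from the slide.

```python
import xml.etree.ElementTree as ET

# Three ways of writing '<' as data rather than as start-of-markup.
less_than = ET.fromstring("<t>&lt;&#60;&#x3C;</t>").text

# The other predefined entities.
predefined = ET.fromstring("<t>&amp; &gt; &apos; &quot;</t>").text

# A general entity declared in the internal subset; its replacement text
# contains markup, which is parsed as part of the document content.
doc = """<!DOCTYPE doc [
  <!ENTITY CS "<it>School</it> of Computing">
]>
<doc>&CS;</doc>"""
expanded = ET.tostring(ET.fromstring(doc), encoding="unicode")

print(less_than)
print(predefined)
print(expanded)
```

After parsing, the application cannot tell whether a piece of content was written literally or came from an entity reference.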

  26. Entities (2) • physical storage units comprising a document • parsed entities: <!ENTITY chap1 SYSTEM "http://myweb/ch1"> • document entity is the starting point of processing • entities and elements must nest properly: • content of the entity chap1:
          <sec num="1"> … </sec>
          <sec num="2"> … </sec>
      • document entity:
          <!DOCTYPE doc [ <!ENTITY chap1 (… as above …)> ]>
          <doc> &chap1; </doc>

  27. Unparsed Entities • For connecting external binary objects to XML documents; (XML processor handles only text)
          <!NOTATION TIFF SYSTEM "bin/gimp">
          <!ENTITY fig123 SYSTEM "figs/f123.tif" NDATA TIFF>
          <!ATTLIST IMG file ENTITY #REQUIRED>
      • Usage: <IMG file="fig123"/> • parser provides notation and address to the application • an SGML legacy technique(?)

  28. An Alternative Way • I have rarely used unparsed entities or notations • Easier "HTML-style" linking: <IMG type="TIFF" file="figs/fig123"/> • the link indicates the format and the address directly • maintenance of numerous entities might be easier with (indirect references through) explicit declarations

  29. Parameter Entities • to parameterize and modularize DTDs:
          <!ENTITY % tableDTD SYSTEM "dtds/tab.dtd">
          %tableDTD; <!-- include sub-dtd -->
          <!ENTITY % stAttr "ready (yes | no) 'no'">
          <!ATTLIST CHAP %stAttr; >
          <!ATTLIST SECT %stAttr; >
      • (Parameter entities inside a markup declaration are allowed only in the external DTD subset)

  30. Speculations about XML Parsing • Parsing involves two things: 1. Pulling the entities together, and checking the well-formedness 2. Building a parse tree for the input (à la DOM), or otherwise informing the application about document content and structure (e.g. à la SAX) • Task 2 is simple (because of the simplicity of XML markup; see next) • Checking well-formedness is straightforward • Implementing validation is a bit more challenging

  31. Building an XML Parse Tree • Input:
          <S> <W> Hello </W> <E> </E> <W> world! </W> </S>
      • Resulting tree: root S with children W (text "Hello"), E (empty) and W (text "world!")

  32. Simplified XML Tree Building Algorithm (uses DOM-like tree operations)
          C ← newDocument();   // current node
          Scan document text until at end:
            case of scanned text:
              "<Type>":  N ← newElementNode(Type); C.addChildNode(N); C ← N;
              "Txt":     C.addChildNode(newTextNode(Txt));
              "</Type>": C ← C.parentNode();
          return C;
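The simplified algorithm translates almost line by line into runnable code. The following is a sketch only: the `Node` class and the `TOKEN` pattern are assumptions made for this illustration, attributes, comments and entities are not handled, no well-formedness checks are done, and (unlike the pseudocode) whitespace-only text is skipped so that the result matches the tree drawn on the previous slide.

```python
import re


class Node:
    """Minimal DOM-like node (hypothetical helper, not a real DOM API)."""

    def __init__(self, kind, value=None):
        self.kind = kind      # "document", "element" or "text"
        self.value = value    # element type name or text content
        self.parent = None
        self.children = []

    def add_child(self, node):
        node.parent = self
        self.children.append(node)


# An end tag, a start tag (without attributes), or a run of text.
TOKEN = re.compile(r"</(?P<end>[^<>\s/]+)\s*>|<(?P<start>[^<>\s/]+)\s*>|(?P<text>[^<]+)")


def build_tree(text):
    current = Node("document")                 # C <- newDocument()
    for match in TOKEN.finditer(text):
        if match.group("start"):               # "<Type>"
            node = Node("element", match.group("start"))
            current.add_child(node)
            current = node
        elif match.group("end"):               # "</Type>"
            current = current.parent
        elif match.group("text").strip():      # "Txt" (whitespace-only text skipped)
            current.add_child(Node("text", match.group("text").strip()))
    return current


document = build_tree("<S> <W> Hello </W> <E> </E> <W> world! </W> </S>")
root = document.children[0]
print(root.value, [child.value for child in root.children])
```

A real parser would in addition check that each end tag matches the type name of the current node, which is exactly where the well-formedness check fits into this loop.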

  33. Sketching XML validation • Treat the document as a tree d • Document is valid w.r.t. a grammar (DTD/Schema) G iff d is a syntax tree over G • Check that the root is labelled by the start symbol of G • For each element node n of the tree, check that its • attributes match with those of the element type • content matches the content model of its type: If n is of type A and its children of type B1, …, Bn, check that the grammar has a production A → E for which B1 … Bn ∈ L(E) (1)

  34. Sketching the validation… (2) • How to check condition (1), i.e. the matching of children with a content model? • by an automaton built from content model E • Example: <!ELEMENT A ((B|C)+, D)>
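Condition (1) can be tried out by translating a content model into a regular expression over the sequence of child type names. This sketch lets Python's `re` module stand in for the automaton (it is a backtracking matcher, not an explicitly constructed automaton); the function names are hypothetical, and `#PCDATA`, `EMPTY` and `ANY` are not handled.

```python
import re


def content_model_to_regex(model):
    """Translate a DTD content model such as "((B|C)+, D)" into a regex.

    Each type name becomes a group matching that name followed by one space;
    the operators | ? * + ( ) are shared with regex syntax, and the comma
    (concatenation) is simply dropped.
    """
    def translate(match):
        token = match.group()
        if token[0].isalpha() or token[0] == "_":
            return f"(?:{re.escape(token)} )"
        return ""  # commas and whitespace
    return re.sub(r"[A-Za-z_][\w.\-]*|,|\s+", translate, model)


def children_match(model, child_names):
    """Check whether the sequence B1 ... Bn belongs to L(E)."""
    sequence = "".join(name + " " for name in child_names)
    return re.fullmatch(content_model_to_regex(model), sequence) is not None


print(children_match("((B|C)+, D)", ["B", "C", "D"]))
print(children_match("((B|C)+, D)", ["D"]))
```

Terminating every name with a space keeps names that are prefixes of each other (say, `item` and `items`) apart.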

  35. 2.3 XML Namespaces • Documents often comprise parts processed by different applications (and/or defined in different schemas) • for example, in XSLT scripts:
          <xsl:template match="doc/title">
            <H1> <xsl:apply-templates /> </H1>
          </xsl:template>
      • here H1 is an HTML element, while xsl:template and xsl:apply-templates are XSLT elements/instructions • How to manage multiple sets of names?

  36. XML Namespaces (2/5) • Solution: XML Namespaces, W3C Rec. Jan '99, for separating possibly overlapping "vocabularies" (sets of element type and attribute names) within a single document • by introducing (arbitrary) local name prefixes, and binding them to (fixed) globally unique URIs • For example, the local prefix "xsl:" is conventionally used in XSLT scripts

  37. XML Namespaces briefly (3/5) • Namespace identified by a URI (through the associated local prefix), e.g. http://www.w3.org/1999/XSL/Transform for XSLT • conventional but not required to use URLs • the identifying URI has to be unique, but it does not have to be an existing address • Association is inherited by sub-elements • see the next example (of an XSLT script)

  38. XML Namespaces (4/5)
          <xsl:stylesheet version="1.0"
              xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
              xmlns="http://www.w3.org/TR/xhtml1/strict">
            <!-- XHTML is the 'default namespace' -->
            <xsl:template match="doc/title">
              <H1> <xsl:apply-templates /> </H1>
            </xsl:template>
          </xsl:stylesheet>
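How the namespace bindings are inherited by sub-elements can be observed with a namespace-aware parser. The sketch below assumes Python's standard `xml.etree.ElementTree`, which reports each element name in the form `{namespace-URI}local-name`; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

STYLESHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/TR/xhtml1/strict">
  <xsl:template match="doc/title">
    <H1> <xsl:apply-templates /> </H1>
  </xsl:template>
</xsl:stylesheet>"""

root = ET.fromstring(STYLESHEET)
template = root[0]   # xsl:template
h1 = template[0]     # H1, unprefixed: falls into the default namespace

for element in (root, template, h1, h1[0]):
    print(element.tag)
```

The unprefixed `H1` ends up in the XHTML namespace even though the `xmlns` declaration sits on the root element, while the unprefixed attribute `match` stays outside any namespace.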

  39. XML Namespaces briefly (5/5) • Built on top of basic XML • By overloading attribute syntax (xmlns:foo=...) • does not affect validation • namespace attributes must be declared for DTD-validity • all element type names must be declared (with their prefixes!) • → Other schema languages (XML Schema, Relax NG) are better for validating documents with Namespaces • Processing languages allow declaring NS prefixes; see next

  40. Example: Namespaces in XSLT • XML:
          <test xmlns:foo="http://www.uef.fi/Test">
            <foo:A>aaa</foo:A>
            <foo:B>bbb</foo:B>
          </test>
      • Stylesheet:
          <xsl:transform version="1.0"
              xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
              xmlns:bar="http://www.uef.fi/Test">
            <xsl:template match="/">
              <xsl:apply-templates select="//bar:*" />
            </xsl:template>
          </xsl:transform>

  41. Example: Namespaces in XQuery • ns-test.xml:
          <test xmlns:foo="http://www.uef.fi/Test">
            <foo:A>aaa</foo:A>
            <foo:B>bbb</foo:B>
          </test>
      • Query:
          xquery version "1.0";
          declare namespace zap = "http://www.uef.fi/Test";
          doc("ns-test.xml")//zap:*
      • For NS support in XML APIs, see next
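The point of the XSLT and XQuery examples, namely that the prefix is local and only the URI identifies the namespace, can also be shown with an XML API. This is a sketch assuming Python's standard `xml.etree.ElementTree`; the prefix `zap` in the lookup table is bound by the querying code and is unrelated to the prefix `foo` used in the document.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<test xmlns:foo="http://www.uef.fi/Test">'
    "<foo:A>aaa</foo:A><foo:B>bbb</foo:B>"
    "</test>"
)

# The query side declares its own prefix for the same namespace URI.
ns = {"zap": "http://www.uef.fi/Test"}

a = doc.find("zap:A", ns)
b = doc.find("zap:B", ns)
print(a.text, b.text)

# Internally the names carry the URI, not the prefix.
print([child.tag for child in doc])

# An unprefixed lookup finds nothing: the elements are not in "no namespace".
print(doc.find("A"))
```

Any prefix would work on the query side, as long as it is bound to the same URI as in the document.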
