XML http://www.flickr.com/photos/nics_events/2349632625/ Making your marks
SGML and HTML • SGML is a meta-markup language • A language for making languages • Standardized in 1986 • HTML was specified with SGML • Fixed set of tags/attributes. Users cannot customize which means that HTML must have anticipated all possible documents and their structure. • Little structure. Tags can occur almost anywhere in any order • XML is a ‘lite’ beer (a stripped down version of SGML)
XML • XML is not a replacement for HTML • HTML describes layout • XML can be used to describe anything (content or layout). The semantics are user-defined. • XML has no predefined tags • All documents described as XML documents can be parsed with a single parser (not so with SGML) • Our book refers to • TAG SET: an xml-based markup language. Others refer to this as an XML APPLICATION • XML PROCESSOR: a parser that provides xml data to a program • XML DOCUMENT: a document that conforms to XML
XML Overview • XML provides no more than a baseline on which complex semantic models can be built. All those more restricted applications will share some common invariants. • An XML document is a linearization of a tree structure. • At every node in the tree there are several character strings. • The tree structure and the character strings together form the information content of an XML document. Almost everything will follow naturally from that. • Some of the characters in the document are only there to support the linearization, others are part of the information content.
XML Overview <p> <q id="x7">The first q</q> <q id="x8">The second q</q> <q href="#x7">The third q</q> </p>
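The tree behind the fragment above can be made visible with a short Python sketch. It assumes the standard-library xml.etree.ElementTree parser; the helper name outline is illustrative and not from the slides.

```python
import xml.etree.ElementTree as ET

# The fragment from the slide, written as one string.
DOC = (
    '<p>'
    '<q id="x7">The first q</q>'
    '<q id="x8">The second q</q>'
    '<q href="#x7">The third q</q>'
    '</p>'
)

def outline(node, depth=0):
    """Return one line per tree node: tag, attributes, and character data."""
    text = (node.text or "").strip()
    lines = [f"{'  ' * depth}{node.tag} {node.attrib} {text!r}"]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(DOC)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

The angle brackets and quotes only support the linearization; the tags, attribute values and strings that the sketch prints are the information content.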
XML Syntax • Two parts to an xml tag set • The low-level rules that apply to all XML documents • The rules that apply to a particular tag-set. These rules are formalized as either a • Document type definition • XML Schema • Generally the low-level rules are easily understood by those familiar with HTML • The tag-set specific rules tend to be more complex.
XML Tags • Elements in XML are denoted by tags • An element has a type and may have attributes and content • An element is denoted by an opening/closing tag pair, or by a single empty-element tag • <BREWERS></BREWERS> • <BREWERS/> • A single element will typically have many children. • <BREWERS><PLAYER></PLAYER><PLAYER></PLAYER></BREWERS> • Attributes are name/value pairs • <BREWERS YEAR="2011"> • <PLAYER NUMBER="8">Ryan Braun</PLAYER> • <PLAYER NUMBER="28">Prince Fielder</PLAYER> • <MANAGER NUMBER="10">Ron Roenicke</MANAGER> • </BREWERS>
XML/HTML Syntax • Attributes are name/value pairs that can be attached to an element. • In HTML, you only need to quote an attribute value if it contains a space, or a character that is not allowed in a name. • <body id=main> • In XML, attribute values must always be quoted. • <happiness type="joy" /> • Element types. • In HTML there is a built-in set of element names and allowed attributes. • In XML, there are no built-in names/attributes (a couple of exceptions). • Entities. • Since some characters have a special meaning in HTML (<, >, /, etc.), HTML provides a pre-defined set of character names. These are called 'entities'. • In XML, there are only five built-in character entities: &lt;, &gt;, &amp;, &quot; and &apos; for <, >, &, " and ' respectively. You can define your own entities in a Document Type Definition, or you can use any Unicode character.
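A minimal Python sketch of the five built-in entities in both directions. It assumes the standard-library xml.etree.ElementTree parser for decoding and xml.sax.saxutils.escape for encoding; the element name t is made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The five predefined entity references, used inside element content.
doc = "<t>&lt; &gt; &amp; &quot; &apos;</t>"
decoded = ET.fromstring(doc).text

# Going the other way: escape() replaces &, < and > by default;
# replacements for the two quote characters are passed explicitly.
encoded = escape(decoded, {'"': "&quot;", "'": "&apos;"})

print(decoded)
print(encoded)
```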
XML Syntax • An XML document should begin with an XML declaration (optional in XML 1.0, required in XML 1.1) <?xml version="1.1" encoding="utf-8"?> • XML Names • Must begin with a letter or underscore • Can include digits, hyphens and periods • No length limitations • CaSeSeNsItIvE • Every document defines a single root element. The opening tag of this ‘root’ must be the first tag of the document, after the XML declaration and any DOCTYPE declaration. The ‘root’ is the root node of the document tree.
XML Syntax • An XML document that follows all of these low-level rules is ‘well formed’ <?xml version = "1.0" encoding = "utf-8" ?> <ad> <year> 1960 </year> <make> Cessna </make> <model> Centurian </model> <color> Yellow with white trim </color> <location> <city> Gulfport </city> <state> Mississippi </state> </location> </ad>
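Well-formedness can be checked mechanically. The sketch below assumes Python's standard-library xml.etree.ElementTree, which is a non-validating parser: it reports violations of the low-level rules but knows nothing about DTDs or schemas. The helper name is_well_formed and the sample strings are illustrative.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<ad><year>1960</year><make>Cessna</make></ad>"
bad_nesting = "<ad><year>1960</make></year></ad>"   # tags overlap
two_roots = "<ad/><ad/>"                            # more than one root element
unquoted = "<ad year=1960/>"                        # attribute value not quoted

for sample in (good, bad_nesting, two_roots, unquoted):
    print(is_well_formed(sample), sample)
```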
XML Syntax • One question that always arises is when to use attributes and when to use a nested element. Issues to consider: • If the information in question could be itself marked up with elements, put it in an element. • If the information is suitable for attribute form, but could end up as multiple attributes of the same name on the same element, use child elements instead. • If the information has a standardized format use an attribute. (Dates, Serial numbers, times, etc…, identifiers) • If the information should not be normalized for white space, use elements. XML processors normalize attributes in ways that can change the raw text of the attribute value.
Examples <player> <number>8</number> <name>Ryan Braun</name> </player> <player number="8"> <name> <first>Ryan</first> <last>Braun</last> </name> </player> <player number="8"> <name>Ryan Braun</name> </player> <player number="8" name="Ryan Braun"/>
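The white-space point in the attribute-versus-element guidelines can be observed directly. This is a small sketch assuming Python's standard-library xml.etree.ElementTree; the two-line name value is invented for the demonstration.

```python
import xml.etree.ElementTree as ET

# The same two-line value stored once as an attribute and once as element text.
doc = '<player name="Ryan\nBraun"><name>Ryan\nBraun</name></player>'
player = ET.fromstring(doc)

attr_value = player.get("name")          # line break normalized to a space
elem_value = player.find("name").text    # line break preserved

print(repr(attr_value))
print(repr(elem_value))
```

The parser replaces the line break inside the attribute value with a space, while the element content keeps it.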
XML Document Structure • An XML document often refers to two other files • A document that specifies the structure • A document that specifies the style • The document structure is defined by • DTD • SCHEMA • Although a document may be well-formed, it may not be valid. • Well-formed: conforms to the XML specification. Denotes syntactic correctness. • Valid: conforms to the DTD/SCHEMA. Denotes semantic correctness.
DTD • Document Type Definitions • A set of declarations which specify elements and where these elements can appear • The DTD is not an XML document. A DTD is described in a special DTD language. • The DTD language relies heavily on regular-expressions and BNF-like notation.
DTD • An XML document can refer to a DTD by using the DOCTYPE declaration. Note that DTD declarations begin with a bang. • <!DOCTYPE root_element […]> • root_element is the name of the document’s type • The […] content is the DTD • The DOCTYPE declaration must • Be placed between the XML declaration and the root element • Name (and define) the root element
DTD • There are four possible DTD declarations at the top level: • ELEMENT: declares the name of an element and its structure • ATTLIST: declares the attributes of an element • ENTITY: declares an entity • NOTATION: declares a notation
DTD ELEMENTS • What information would you have to give to specify an element’s structure? • An element declaration specifies the name of an element and the element’s structure • #PCDATA forms the lower-level character data • General form: • <!ELEMENT element_name (list of child names)> • Example: • <!ELEMENT memo (from, to, date, re, body)> • The vertical bar can be used to indicate OR • <!ELEMENT contact (mother | father | caregiver)>
DTD ELEMENTS • Child elements can have modifiers, +, *, ? which correspond to regular-expression multiplicities • * denotes zero-or-more occurrences • + denotes one-or-more occurrences • ? denotes zero-or-one occurrence (optional) • Example: <!ELEMENT person (parent+, age, spouse?, sibling*)> • Leaf nodes specify content types: • #PCDATA: (parsed character data – entities will be expanded and if tags or markup appear they will be recognized [or parsed]) • CDATA: (character data – not a valid content model in an element declaration; in a DTD it is an attribute type, and unparsed text inside an element is written as a <![CDATA[ … ]]> section, where entities are not expanded and markup is not recognized) • EMPTY: (no content) • ANY: (can have any content) • Example of a leaf declaration: <!ELEMENT name (#PCDATA)>
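The correspondence between content models and regular expressions can be sketched in Python. This is an illustration only, not how a validating parser is implemented: it handles a flat, comma-separated model with the +, * and ? modifiers and ignores the vertical bar and nested groups. The function names are made up for the example.

```python
import re

def content_model_to_regex(model):
    """Translate a flat DTD content model such as
    '(parent+, age, spouse?, sibling*)' into a compiled regular expression
    over a space-terminated list of child names."""
    inner = model.strip()[1:-1]              # drop the outer parentheses
    parts = []
    for item in inner.split(","):
        item = item.strip()
        suffix = item[-1] if item[-1] in "+*?" else ""
        name = item.rstrip("+*?")
        parts.append(f"(?:{re.escape(name)} ){suffix}")
    return re.compile("".join(parts))

def children_match(model, children):
    """Check a list of child element names against the content model."""
    encoded = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(encoded) is not None

PERSON = "(parent+, age, spouse?, sibling*)"
print(children_match(PERSON, ["parent", "parent", "age", "sibling"]))
print(children_match(PERSON, ["age", "spouse"]))
```

The second call fails because the model requires at least one parent before age.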
DTD Attributes • What information would you have to give to specify an element’s attributes? • Attributes are defined by the ATTLIST declaration • <!ATTLIST elem_name att_name att_type modifiers default_value> • There are ten att_types, we will use CDATA for now (others include ID, IDREF, IDREFS, ENTITY, ENTITIES…) • Modifiers: • #FIXED: every element has the default value • #REQUIRED: this attribute must be present • #IMPLIED: no default and not required
DTD Attributes • This DTD allows us to interpret the car element <!ATTLIST car doors CDATA "4"> <!ATTLIST car engine_type CDATA #REQUIRED> <!ATTLIST car price CDATA #IMPLIED> <!ATTLIST car make CDATA #FIXED "Ford"> <car doors = "2" engine_type = "V8"> ... </car> • A car is, by default, a 4-door • A car must have an engine_type • A car may have a price • The make of all cars is Ford • Consider • <car year="1992" engine_type="V6"/> • <car make="GMC" doors="2" engine_type="V4" price="1235"/>
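How a validating processor might treat the two "Consider" cars can be sketched in plain Python. The table CAR_ATTLIST mirrors the four ATTLIST declarations on the slide; the function apply_attlist and the label DEFAULT are hypothetical names for this illustration, not part of any XML library.

```python
# Each declaration: attribute name -> (kind, value), mirroring the slide's DTD.
CAR_ATTLIST = {
    "doors": ("DEFAULT", "4"),
    "engine_type": ("#REQUIRED", None),
    "price": ("#IMPLIED", None),
    "make": ("#FIXED", "Ford"),
}

def apply_attlist(attrs, attlist):
    """Return (effective attributes, list of validity errors)."""
    effective = dict(attrs)
    errors = []
    for name, (kind, value) in attlist.items():
        if kind == "#REQUIRED" and name not in attrs:
            errors.append(f"missing required attribute {name!r}")
        elif kind == "#FIXED":
            if name in attrs and attrs[name] != value:
                errors.append(f"{name!r} must be {value!r}")
            effective.setdefault(name, value)
        elif kind == "DEFAULT":
            effective.setdefault(name, value)
    for name in attrs:
        if name not in attlist:
            errors.append(f"undeclared attribute {name!r}")
    return effective, errors

first, first_errors = apply_attlist(
    {"year": "1992", "engine_type": "V6"}, CAR_ATTLIST)
second, second_errors = apply_attlist(
    {"make": "GMC", "doors": "2", "engine_type": "V4", "price": "1235"},
    CAR_ATTLIST)
print(first, first_errors)
print(second, second_errors)
```

Under these assumptions the first car picks up doors="4" and make="Ford" but is rejected for the undeclared year attribute, and the second car is rejected because make is fixed to "Ford".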
DTD Entities • What information would it take to define a new entity? • Recall that when an entity reference occurs in an XML document, it is simply a textual replacement. • This is an &lt;example&gt; • Entity declaration syntax: <!ENTITY entity_name "entity_value"> • Example Declaration: <!ENTITY jfk "John Fitzgerald Kennedy"> • Example Use: • &jfk; was born in 1917.
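The textual replacement can be observed with a short Python sketch. It assumes the standard-library xml.etree.ElementTree parser, which expands general entities declared in the internal DTD subset; the root element name bio is invented for the example.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<!DOCTYPE bio [
  <!ENTITY jfk "John Fitzgerald Kennedy">
]>
<bio>&jfk; was born in 1917.</bio>"""

# The parser substitutes the entity's value before the program sees the text.
sentence = ET.fromstring(doc).text
print(sentence)
```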
General and Parameter Entities • Two types of entities: • General entities (defined in the previous slide) can be used anywhere in the XML document • Parameter entities can be used only in the DTD • Parameter entity syntax: <!ENTITY % entity_name "entity_value"> • Example Declaration: <!ENTITY % abbr "bob | bill | sue | cindy"> • Example Use (inside the DTD): • %abbr;
Internal DTD <?xml version="1.0"?><!DOCTYPE note [<!ELEMENT note (to,from,heading,body)><!ELEMENT to (#PCDATA)><!ELEMENT from (#PCDATA)><!ELEMENT heading (#PCDATA)><!ELEMENT body (#PCDATA)>]><note><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend</body></note>
External DTD <?xml version="1.0"?><!DOCTYPE note SYSTEM "note.dtd"><note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body></note> • Contents of note.dtd: <!ELEMENT note (to,from,heading,body)><!ELEMENT to (#PCDATA)><!ELEMENT from (#PCDATA)><!ELEMENT heading (#PCDATA)><!ELEMENT body (#PCDATA)>
<?xml version="1.0"?> <!DOCTYPE BOOK [ <!ELEMENT p (#PCDATA)> <!ELEMENT BOOK (OPENER,SUBTITLE?,INTRODUCTION?,(SECTION | PART)+)> <!ELEMENT OPENER (TITLE_TEXT)*> <!ELEMENT TITLE_TEXT (#PCDATA)> <!ELEMENT SUBTITLE (#PCDATA)> <!ELEMENT INTRODUCTION (HEADER, p+)+> <!ELEMENT PART (HEADER, CHAPTER+)> <!ELEMENT SECTION (HEADER, p+)> <!ELEMENT HEADER (#PCDATA)> <!ELEMENT CHAPTER (CHAPTER_NUMBER, CHAPTER_TEXT)> <!ELEMENT CHAPTER_NUMBER (#PCDATA)> <!ELEMENT CHAPTER_TEXT (p)+> ]> <BOOK> <OPENER> <TITLE_TEXT>All About Me</TITLE_TEXT> </OPENER> <PART> <HEADER>Welcome To My Book</HEADER> <CHAPTER> <CHAPTER_NUMBER>CHAPTER 1</CHAPTER_NUMBER> <CHAPTER_TEXT> <p>Glad you want to hear about me.</p> <p>There's so much to say!</p> <p>Where should we start?</p> <p>How about more about me?</p> </CHAPTER_TEXT> </CHAPTER> </PART> </BOOK>
DTD Expressive limitations <!ELEMENT collection (description,recipe*)> <!ELEMENT description ANY> <!ELEMENT recipe (title,ingredient*,preparation,comment?,nutrition)> <!ELEMENT title (#PCDATA)> <!ELEMENT ingredient (ingredient*,preparation)?> <!ATTLIST ingredient name CDATA #REQUIRED amount CDATA #IMPLIED unit CDATA #IMPLIED> <!ELEMENT preparation (step*)> <!ELEMENT step (#PCDATA)> <!ELEMENT comment (#PCDATA)> <!ELEMENT nutrition EMPTY> <!ATTLIST nutrition protein CDATA #REQUIRED carbohydrates CDATA #REQUIRED fat CDATA #REQUIRED calories CDATA #REQUIRED alcohol CDATA #IMPLIED> • This DTD cannot express that: • The protein attribute must contain a non-negative number • The unit attribute should only be allowed when amount is present • The comment element should be allowed to appear anywhere • Nested ingredient elements should only be allowed when amount is absent
XML Schema • DTDs have limitations • Syntax is not XML and requires a DTD-specific parser • Structural logic is not as expressive as sometimes needed • Limited data types • XML Schema • A document that describes the structure of a family of XML documents. • Identical in purpose to a DTD. • The structure of an XML schema document is itself defined by an XML schema. The namespace is given as • http://www.w3.org/2001/XMLSchema
Namespaces • An XML document may use tags from multiple tag sets. • What if two tag-sets have a tag that is defined differently in each tag set? When the tag is used, which tag set is being referred to? • Namespaces resolve conflicts by affixing a prefix to the actual tags • A namespace declaration has the form: • <elementName xmlns[:prefix]="URI"> • For example • <gmcars xmlns:gm="http://www.gm.com/names"> • The gm prefix is associated with tags in 'http://www.gm.com/names' for the gmcars tag and all of the gmcars content • Can now have XML elements such as • <gm:pontiac doors="12"/>
Namespaces • Can have multiple namespaces of course • <cars xmlns:gm="http://www.gm.com/names" xmlns:ford="http://www.ford.com/names"> • Can now use elements such as • <gm:LaCrosse doors="4"/> • <ford:LaCrosse doors="8"/>
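A parser resolves each prefix to its URI, so the two LaCrosse elements stay distinct. The sketch below assumes Python's standard-library xml.etree.ElementTree and wraps the slide's elements in a complete cars document.

```python
import xml.etree.ElementTree as ET

doc = """
<cars xmlns:gm="http://www.gm.com/names"
      xmlns:ford="http://www.ford.com/names">
  <gm:LaCrosse doors="4"/>
  <ford:LaCrosse doors="8"/>
</cars>
"""

root = ET.fromstring(doc)

# ElementTree reports each tag as {namespace-URI}local-name, so the two
# LaCrosse elements are distinguishable even though the local names match.
tags = [child.tag for child in root]

# Lookups use a prefix map of our own choosing; the prefixes here happen to
# match the document's, but only the URIs matter.
ns = {"gm": "http://www.gm.com/names", "ford": "http://www.ford.com/names"}
gm_doors = root.find("gm:LaCrosse", ns).get("doors")
ford_doors = root.find("ford:LaCrosse", ns).get("doors")

print(tags)
print(gm_doors, ford_doors)
```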
XML Schema • Every schema has ‘schema’ as the root element • This element must specify the schema namespace • Each schema defines a tag set which is named via the targetNamespace attribute • Setting elementFormDefault to "qualified" means that locally declared elements must be namespace-qualified in instance documents <?xml version="1.0"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://charity.cs.uwlax.edu" elementFormDefault="qualified"></xs:schema>
XML Schema • The root element of a conforming XML document must then specify the namespaces it uses: • The default namespace (parsers now know which tag set this document uses) • The standard instance namespace (parsers now know to validate against a schema rather than a DTD) • The location of the schema (parsers now know which schema to validate against) <?xml version="1.0"?> <classroom xmlns="http://charity.cs.uwlax.edu" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://charity.cs.uwlax.edu/classroom.xs"> <room-number>Wing 218</room-number> <capacity>12</capacity> <projector>YES</projector> <karaoke-machine>NO</karaoke-machine></classroom>
XML Schema • An XML Schema can define two types of elements • SIMPLE: elements that are strings without attributes or nested elements • COMPLEX: Of course, these are non-simple • Schemas have 44 defined data types • Primitives: string, boolean, float, base64Binary, date, etc. • Derived: byte, integer, positiveInteger,… • Derived data types • Are those types that are defined with respect to some other type • Users can define their own derived types
Simple Elements • Just like a variable declaration in programming languages defines a name and a type, an XML element is declared by giving the name and type. • <xs:element name="XXXX" type="YYYY"/> • Common built-in type names: • xs:string • xs:decimal • xs:integer • xs:boolean • xs:date • xs:time
Simple Elements • Consider the following schema declarations • <xs:element name="lastname" type="xs:string"/> • <xs:element name="age" type="xs:integer"/> • <xs:element name="birthdate" type="xs:date"/> • An XML document that uses this tag set could contain • <lastname>Xercesanthony</lastname> • <age>32</age> • <birthdate>1983-03-12</birthdate> • Note that an XML document that uses this tag set could not contain • <age>Thirty two</age> • <birthdate>Aug, third, Nineteen eighty three</birthdate>
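Why the last two instances are rejected can be approximated in Python. These checks are a rough sketch of the lexical rules for xs:integer and xs:date, not a schema validator: time-zone suffixes and years with more than four digits are deliberately ignored, and the function names are made up.

```python
import re
from datetime import date

def is_xs_integer(text):
    """Approximate xs:integer: optional sign followed by decimal digits."""
    return re.fullmatch(r"[+-]?[0-9]+", text.strip()) is not None

def is_xs_date(text):
    """Approximate xs:date: a real calendar date written as YYYY-MM-DD."""
    text = text.strip()
    if re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", text) is None:
        return False
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False

for value in ("32", "Thirty two"):
    print(value, is_xs_integer(value))
for value in ("1983-03-12", "Aug, third, Nineteen eighty three"):
    print(value, is_xs_date(value))
```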
Simple Elements • Simple elements may have a default value or a fixed value • Default values are automatically assigned if no value is provided (the element is present but empty) • Fixed values are automatically assigned if no value is provided and, if a value is specified, it cannot be something other than the fixed value. • Schema examples: • <xs:element name="color" type="xs:string" default="red"/> • <xs:element name="pcolor" type="xs:string" fixed="red"/> • Instance examples: • <color>red</color> • <color>green</color> • <pcolor>red</pcolor> • <pcolor>green</pcolor>
Complex Elements • A complex element is an XML element that contains other elements and/or has attributes • There are four kinds of complex elements • empty elements • <sku number="1234"/> • elements that contain only other elements • <name><first>Kenny</first><last>Hunt</last></name> • elements that contain only text (these count as complex only when the element is also declared with attributes; otherwise the element is simple) • <first>Kenny</first> • elements that contain both elements and text • <chapter><title>Chapter 1</title>This is the story of…</chapter>
Complex Elements • Consider an XML document that contains the following element: • <name><first>Kenny</first><last>Hunt</last></name> • What kind of information would have to be specified in order to define the structure of the "name" element? Would the following elements be valid? • <name><last>Hunt</last><first>Kenny</first></name> • <name><first>Kenny</first></name> • <name><last>Hunt</last><first>Kenny</first><mi>A</mi></name> • <name><first>Kenny</first><first>Kenneth</first><last>Hunt</last></name> • <name verified="yes"><first>Kenny</first><hunt>Hunt</hunt></name> • In order to know for sure which of the above are valid, must be able to define • The allowable children • The order of the children • The multiplicities (occurrences) of the children • The attributes that the element might take • This is done by defining a new type and then defining an element of that type
Complex Types • To define a complex type you must give the type some structure and a name. • The basic syntax for defining a new type is: • <xs:complexType name="new_type_name"> • …. • </xs:complexType>
Complex Types • The allowed children and ordering of the children within an element type is controlled by order indicators • <xs:all>…</xs:all> • An unordered list of elements referred to in the all (there are some significant constraints when using this one) • <xs:sequence>…</xs:sequence> • An ordered list of elements referred to in the sequence • <xs:choice>…</xs:choice> • Any one of the elements referred to in the choice
Complex Types <xs:complexType name="nametype"> <xs:sequence> <xs:element name="first" type="xs:string" /> <xs:element name="last" type="xs:string" /> </xs:sequence> </xs:complexType> <xs:complexType name="nametype"> <xs:choice> <xs:element name="first" type="xs:string" /> <xs:element name="last" type="xs:string" /> </xs:choice> </xs:complexType> <xs:complexType name="nametype"> <xs:all> <xs:element name="first" type="xs:string" /> <xs:element name="last" type="xs:string" /> </xs:all> </xs:complexType>
Complex Elements • Note that we haven't defined a complex element but a complex type. • We can now declare elements of that type. The element declaration can take on two forms. • The type definition is a child element of the element definition • The type definition is referenced by an attribute of the element definition.
Complex Elements <xs:element name="name"> <xs:complexType> <xs:sequence> <xs:element name="first" type="xs:string" /> <xs:element name="last" type="xs:string" /> </xs:sequence> </xs:complexType> </xs:element> <xs:element name="name" type="nametype"/> <xs:complexType name="nametype"> <xs:sequence> <xs:element name="first" type="xs:string" /> <xs:element name="last" type="xs:string" /> </xs:sequence> </xs:complexType>
Occurrence Indicator • The number of times an element can occur is constrained by occurrence indicator attributes. • minOccurs: • gives the minimum number of occurrences. Defaults to 1. • must be a non-negative integer • maxOccurs: • gives the maximum number of occurrences. Defaults to 1. • must be a non-negative integer or 'unbounded' <xs:element name="name"> <xs:complexType> <xs:sequence> <xs:element name="first" type="xs:string" minOccurs="1" maxOccurs="1" /> <xs:element name="last" type="xs:string" minOccurs="1" maxOccurs="1" /> </xs:sequence> </xs:complexType> </xs:element>
Occurrence Indicator • Consider these variants. What is their interpretation? <xs:element name="name"> <xs:complexType> <xs:sequence> <xs:element name="first" type="xs:string" minOccurs="1" maxOccurs="unbounded" /> <xs:element name="last" type="xs:string" minOccurs="1" maxOccurs="1" /> </xs:sequence> </xs:complexType> </xs:element> <xs:element name="name"> <xs:complexType> <xs:sequence minOccurs="2" maxOccurs="unbounded"> <xs:element name="first" type="xs:string" minOccurs="1" maxOccurs="unbounded" /> <xs:element name="last" type="xs:string" minOccurs="1" maxOccurs="1" /> </xs:sequence> </xs:complexType> </xs:element>
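One way to read the two variants is to run child lists through a small matcher. The Python sketch below is a simplified illustration, not a schema validator: it assumes each pass through the sequence consumes at least one child and that the particle names are distinct, which holds for the slide's examples. The function names are hypothetical.

```python
def match_once(children, pos, particles):
    """Consume one pass of an xs:sequence starting at index pos.
    particles is a list of (name, min_occurs, max_occurs); max_occurs None
    stands for 'unbounded'. Returns the new index, or None on failure."""
    for name, min_occurs, max_occurs in particles:
        count = 0
        while (pos < len(children) and children[pos] == name
               and (max_occurs is None or count < max_occurs)):
            pos += 1
            count += 1
        if count < min_occurs:
            return None
    return pos

def sequence_accepts(children, particles, seq_min=1, seq_max=1):
    """Check child names against a sequence that may itself repeat."""
    pos = 0
    passes = 0
    while seq_max is None or passes < seq_max:
        new_pos = match_once(children, pos, particles)
        if new_pos is None or new_pos == pos:
            break
        pos = new_pos
        passes += 1
    return passes >= seq_min and pos == len(children)

NAME = [("first", 1, None), ("last", 1, 1)]

# First variant: one or more <first>, then exactly one <last>.
variant1_ok = sequence_accepts(["first", "first", "last"], NAME)
# Second variant: the whole sequence must occur at least twice.
variant2_short = sequence_accepts(["first", "last"], NAME,
                                  seq_min=2, seq_max=None)
variant2_ok = sequence_accepts(
    ["first", "last", "first", "first", "last"], NAME,
    seq_min=2, seq_max=None)
print(variant1_ok, variant2_short, variant2_ok)
```

Under this reading the first variant allows several first names before one last name, while the second requires at least two complete first/last groups.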
Complex Elements • Consider making an element 'extensible' • The xs:any element allows an element of any type to appear. • This serves as a placeholder into which users of the tag-set can place data of their own choosing. <xs:element name="name"> <xs:complexType> <xs:sequence> <xs:element name="first" type="xs:string" minOccurs="1" maxOccurs="1" /> <xs:element name="last" type="xs:string" minOccurs="1" maxOccurs="1" /> <xs:any minOccurs="0"/> </xs:sequence> </xs:complexType> </xs:element> <name><first>Kenny</first><last>Hunt</last><mi>A</mi></name> <name><first>Kenny</first><last>Hunt</last><alias>Kenneth</alias></name>
Mixed Types • What if you wanted to specify an xml document that looked like: • This doc has text and elements as children. This is known as a 'mixed' type. <letter> Dear Mr.<name>John Smith</name>. Your order <orderid>1032</orderid> will be shipped on <shipdate>2001-07-13</shipdate>. </letter> <xs:element name="letter"> <xs:complexType mixed="true"> <xs:sequence> <xs:element name="name" type="xs:string"/> <xs:element name="orderid" type="xs:positiveInteger"/> <xs:element name="shipdate" type="xs:date"/> </xs:sequence> </xs:complexType> </xs:element>
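How a program sees mixed content depends on the parser's data model. The sketch below assumes Python's standard-library xml.etree.ElementTree, which stores text before the first child on the parent and text following each child on that child's tail attribute.

```python
import xml.etree.ElementTree as ET

doc = (
    "<letter>Dear Mr.<name>John Smith</name>. Your order "
    "<orderid>1032</orderid> will be shipped on "
    "<shipdate>2001-07-13</shipdate>.</letter>"
)
letter = ET.fromstring(doc)

# Text before the first child is letter.text; text after each child is
# stored on that child as .tail.
pieces = [letter.text]
for child in letter:
    pieces.append(f"[{child.tag}={child.text}]")
    pieces.append(child.tail)
flattened = "".join(piece or "" for piece in pieces)
print(flattened)
```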
Element Attributes • The syntax for defining an attribute is nearly identical to the syntax for defining a simple element • <xs:attribute name="XXX" type="YYY"/> • Recall that simple elements can't have attributes • Attributes can also have default or fixed values. • Use the default attribute of xs:attribute • Use the fixed attribute of xs:attribute • Attributes can also be required • Use the use attribute of xs:attribute. Values are "optional" and "required" and "prohibited".
Element Attributes • Examples: • <xs:attribute name="verified" type="xs:boolean"/> • <xs:attribute name="expiration" type="xs:date"/> • <xs:attribute name="verified" type="xs:boolean" use="required"/> • <xs:attribute name="verified" type="xs:boolean" use="optional" default="false"/> (a default cannot be combined with use="required") • <xs:element name="name"> • <xs:complexType> • <xs:sequence> • <xs:element name="first" type="xs:string" minOccurs="1" maxOccurs="1" /> • <xs:element name="last" type="xs:string" minOccurs="1" maxOccurs="1" /> • <xs:any minOccurs="0"/> • </xs:sequence> • <xs:attribute name="verified" type="xs:boolean" use="optional" default="false"/> (attribute declarations follow the content model) • </xs:complexType> • </xs:element>
XML Datatypes • There are many pre-defined data types • Derivative types can be formed by • Placing restrictions on the allowed values of another type • Listing values from another type • Building the union of values from other types • Data types have properties or “Facets” • Fundamental Facets: ordered, bounded, cardinality, numeric • Constraining Facets: length, minLength, maxLength, pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minExclusive, minInclusive, totalDigits, fractionDigits, maxScale, minScale, Assertions, explicitTimeZone
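Derivation by restriction can be pictured as a base-type check plus a set of constraining facets. The Python sketch below covers only four facets (minInclusive, maxInclusive, pattern, enumeration) and assumes the numeric facets are used with an integer base; make_restriction and the three derived types are hypothetical names chosen for the example.

```python
import re

def make_restriction(base_check, *, min_inclusive=None, max_inclusive=None,
                     pattern=None, enumeration=None):
    """Build a validator for a restricted type from a base-type check
    and a few constraining facets."""
    def check(text):
        if not base_check(text):
            return False
        if pattern is not None and re.fullmatch(pattern, text) is None:
            return False
        if enumeration is not None and text not in enumeration:
            return False
        # The numeric facets assume the base check accepted an integer.
        if min_inclusive is not None and int(text) < min_inclusive:
            return False
        if max_inclusive is not None and int(text) > max_inclusive:
            return False
        return True
    return check

def is_integer(text):
    return re.fullmatch(r"[+-]?[0-9]+", text) is not None

def is_string(text):
    return isinstance(text, str)

# Hypothetical derived types, named only for this example.
jersey_number = make_restriction(is_integer, min_inclusive=0, max_inclusive=99)
yes_no = make_restriction(is_string, enumeration={"YES", "NO"})
zip_code = make_restriction(is_string, pattern=r"[0-9]{5}")

print(jersey_number("8"), jersey_number("100"))
print(yes_no("YES"), yes_no("maybe"))
print(zip_code("54601"), zip_code("5460"))
```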