XML & XML Schema Semantic Web - Fall 2005 Computer Engineering Department Sharif University of Technology
Outline
• Markup Languages
• SGML, HTML, XML
• XML Building Blocks
• XML Applications
• Namespaces
• XML Schema
Semantic web - Computer Engineering Dept. - Fall 2005
SGML (ISO 8879)
• Standard Generalized Markup Language
• The international standard for defining descriptions of structure and content in text documents
• Interchangeable: device-independent, system-independent
• Tags are not predefined
• Uses a DTD to validate the structure of the document
• Large, powerful, and very complex
• Heavily used in industrial and commercial applications for over a decade
HTML (RFC 1866)
• HyperText Markup Language
• A small SGML application used on the Web (a DTD and a set of processing conventions)
• Uses only a predefined set of tags
What is XML?
• eXtensible Markup Language
• A simplified version of SGML
• Maintains the most useful parts of SGML
• Designed so that SGML can be delivered over the Web
• More flexible and adaptable than HTML
• XHTML: a reformulation of HTML 4 in XML 1.0
HTML vs. XML
HTML vs. XML (2)
• HTML is for humans
  • HTML describes web pages
  • You don’t want to see error messages about the web pages you visit
  • Browsers ignore and/or correct as many HTML errors as they can, so HTML is often sloppy
• XML is for computers
  • XML describes data
  • The rules are strict and errors are not allowed
  • In this way, XML is like a programming language
• Current versions of most browsers can display XML
  • However, browser support of XML is spotty at best
XML-related technologies
• DTDs (Document Type Definitions) and XML Schemas are used to define the legal XML tags and their attributes for a particular purpose
• XSLT (eXtensible Stylesheet Language Transformations), together with XPath, is used to translate from one form of XML to another
• SAX (Simple API for XML)
XML Building blocks - Elements
• Delimited by angle brackets
• Identify the nature of the content they surround
• General format: <element> … </element>
• Empty element: <empty-Element />
• XML elements have relationships
  • Elements are related as parents and children
• Elements have content
  • Elements can have different content types: element, mixed, simple, empty
XML Building blocks - Attributes
• Name-value pairs that occur inside start-tags after the element name, like: <element attribute="value" />
• Provide additional information about elements that often is not part of the data
• Attributes and elements are somewhat interchangeable
• Should I use an element or an attribute?
  • Example using just elements: <name> <first>David</first> <last>Matuszek</last> </name>
  • Example using attributes: <name first="David" last="Matuszek"></name>
• Rule of thumb: metadata (data about data) should be stored as attributes, and the data itself should be stored as elements
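The two styles above carry the same information but are read differently by a program. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper names full_name_from_elements and full_name_from_attributes are made up for illustration and are not part of the slides.

```python
import xml.etree.ElementTree as ET

# The two example documents from the slide.
as_elements = ET.fromstring(
    "<name><first>David</first><last>Matuszek</last></name>"
)
as_attributes = ET.fromstring('<name first="David" last="Matuszek"></name>')


def full_name_from_elements(name):
    # Child elements are looked up by tag; findtext returns their text.
    return f"{name.findtext('first')} {name.findtext('last')}"


def full_name_from_attributes(name):
    # Attributes are looked up by key on the element itself.
    return f"{name.get('first')} {name.get('last')}"


print(full_name_from_elements(as_elements))
print(full_name_from_attributes(as_attributes))
```

Both functions recover the same name, which is why the choice between elements and attributes is largely a design convention.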
XML Building blocks - Entities
Five special characters must be written as entities:
• &amp; for & (almost always necessary)
• &lt; for < (almost always necessary)
• &gt; for > (not usually necessary)
• &quot; for " (necessary inside double quotes)
• &apos; for ' (necessary inside single quotes)
These entities can be used even in places where they are not absolutely required.
These are the only predefined entities in XML.
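As an illustration of the five predefined entities, here is a small sketch in Python using xml.sax.saxutils.escape from the standard library. The helper names to_xml_text and to_xml_attr_text and the sample string are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() always handles &, < and >; quotes are added through an extra table.
QUOTES = {'"': "&quot;", "'": "&apos;"}


def to_xml_text(raw):
    """Escape text that will be placed between a start tag and an end tag."""
    return escape(raw)


def to_xml_attr_text(raw):
    """Escape text that will be placed inside a quoted attribute value."""
    return escape(raw, QUOTES)


raw = 'AT&T says "1 < 2" & it\'s > 0'
document = f'<m a="{to_xml_attr_text(raw)}">{to_xml_text(raw)}</m>'

# A parser turns the entities back into the original characters.
parsed = ET.fromstring(document)
print(document)
print(parsed.text)
```

The parsed text and attribute value are identical to the raw string, showing that the entities are only a serialization detail.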
XML Building blocks - Declaration
The XML declaration looks like this:
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
• The XML declaration is not required by browsers, but is expected by many XML processors (so include it!)
• If present, the XML declaration must be first -- not even whitespace may precede it
• Note that the brackets are <? and ?>
• version="1.0" is required whenever the declaration is present (1.0 is by far the most widely used version)
• encoding can be "UTF-8" or "UTF-16" (both are Unicode encodings; ASCII is a subset of UTF-8), or something else, or it can be omitted
• standalone tells whether the document relies on external markup declarations, such as a separate DTD
XML Building blocks - Processing instructions
• PIs (Processing Instructions) may occur anywhere in the XML document (but usually first)
• A PI is a command to the program processing the XML document to handle it in a certain way
• XML documents are typically processed by more than one program
• Programs that do not recognize a given PI should just ignore it
• General format of a PI: <?target instructions?>
• Example: <?xml-stylesheet type="text/css" href="mySheet.css"?>
XML Building blocks - Comments
• <!-- This is a comment in both HTML and XML -->
• Comments can be put almost anywhere in an XML document (but not inside a tag, and not before the XML declaration)
• Comments are useful for:
  • Explaining the structure of an XML document
  • Commenting out parts of the XML during development and testing
• The character sequence -- cannot occur inside a comment
• Comments are not displayed by browsers, but can be seen by anyone who looks at the source code
CDATA
• By default, all text inside an XML document is parsed
• You can force text to be treated as unparsed character data by enclosing it in <![CDATA[ ... ]]>
• Any characters, even & and <, can occur inside a CDATA section
• Whitespace inside a CDATA section is (usually) preserved
• The only real restriction is that the character sequence ]]> cannot occur inside a CDATA section
• CDATA is useful when your text has a lot of illegal characters (for example, if your XML document contains some HTML text)
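A short sketch of how a parser treats a CDATA section, again with Python's standard xml.etree.ElementTree. The wrap_cdata helper is hypothetical; it uses the common workaround of splitting the text into two CDATA sections wherever ]]> appears, which is an addition not covered on the slide.

```python
import xml.etree.ElementTree as ET


def wrap_cdata(text):
    """Wrap text in CDATA, splitting sections so that ']]>' never appears inside one."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


html_fragment = "<p>Tom & Jerry: a < b</p>"
document = f"<snippet>{wrap_cdata(html_fragment)}</snippet>"

# The parser hands back the raw characters; no entities were needed.
snippet = ET.fromstring(document)
print(snippet.text)

# Text containing the forbidden sequence still survives a round trip.
tricky = "array[index[0]]>limit"
recovered = ET.fromstring(f"<snippet>{wrap_cdata(tricky)}</snippet>").text
print(recovered)
```

Note that the parsed element has no child elements: the markup-like characters inside the CDATA section are plain text.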
XML Syntax
• All XML elements must have a closing tag
• XML tags are case sensitive
• All XML elements must be properly nested
• All XML documents must have a root element
• Attribute values must always be quoted
• With XML, white space is preserved
• With XML, a new line is always stored as LF
• Comments in XML: <!-- This is a comment -->
Well-formed XML
• Every element must have both a start tag and an end tag, e.g. <name> ... </name>
  • But empty elements can be abbreviated: <break />
• XML tags are case sensitive
• XML tags may not begin with the letters xml, in any combination of cases
• Elements must be properly nested, e.g. not <b><i>bold and italic</b></i>
• Every XML document must have one and only one root element
• The values of attributes must be enclosed in single or double quotes, e.g. <time unit="days">
• Character data cannot contain < or &
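The rules above are what any conforming parser enforces. As a rough sketch, the hypothetical helper is_well_formed below simply asks Python's standard xml.etree.ElementTree parser whether it accepts a document; the sample strings are made up to mirror the rules on the slide.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = {
    "properly nested": "<b><i>bold and italic</i></b>",
    "badly nested": "<b><i>bold and italic</b></i>",
    "two root elements": "<a/><b/>",
    "case mismatch": "<Name>x</name>",
    "unquoted attribute": "<time unit=days/>",
    "bare ampersand": "<t>fish & chips</t>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first sample is accepted; each of the others breaks exactly one of the rules listed above.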
Displaying XML
• XML documents do not carry information about how to display the data
• We can add display information to XML with
  • CSS (Cascading Style Sheets)
  • XSL (eXtensible Stylesheet Language) (preferred)
XML Applications (1): Separate data
XML can separate data from HTML
• Store data in separate XML files
• Use HTML for layout and display
• Use Data Islands
  • Data Islands can be bound to HTML elements
Benefits: changes in the underlying data will not require any changes to your HTML
XML Applications (2): Exchange data
XML is used to exchange data
• Text format
• Software-independent, hardware-independent
• Exchange data between incompatible systems, provided that they agree on the same tag definitions
• Can be read by many different types of applications
Benefits:
• Reduces the complexity of interpreting data
• Makes it easier to expand and upgrade a system
XML Applications (3): Store data
XML can be used to store data
• Plain text file
• Store data in files or databases
• Applications can be written to store and retrieve information from the store
• Other clients and applications can access your XML files as data sources
Benefits: accessible to more applications
XML Applications (4): Create new languages
XML can be used to create new languages, e.g.:
• WML (Wireless Markup Language), used to mark up Internet applications for handheld devices like mobile phones (WAP)
• MusicXML, used to publish musical scores
Names in XML
• Names (as used for tags and attributes) must begin with a letter or underscore, and can consist of:
  • Letters, both Roman (English) and foreign
  • Digits, both Roman and foreign
  • . (dot)
  • - (hyphen)
  • _ (underscore)
  • : (colon), which should be used only for namespaces
  • Combining characters and extenders (not used in English)
Namespaces
• Namespaces are a simple mechanism for creating globally unique names for the elements and attributes of your markup language
• Benefits:
  • De-conflicts the meaning of identical names in different markup languages
  • Allows different markup languages to be mixed together without ambiguity
• Namespaces are implemented by having a qualified XML name consist of two parts, a prefix and a local part: <xsd:integer>
Namespaces and URIs
• A namespace is defined as a unique string
• To guarantee uniqueness, typically a URI (Uniform Resource Identifier) is used, because the author “owns” the domain
• It doesn't have to be a “real” URI; it just has to be a unique string
• Example: http://ce.sharif.edu/sw
• There are two ways to use namespaces:
  • Declare a default namespace
  • Associate a prefix with a namespace, then use the prefix in the XML to refer to the namespace
Namespace syntax
• In any start tag you can use the reserved attribute name xmlns:
  • <book xmlns="http://ce.sharif.edu/sw">
  • This namespace will be used as the default for all elements up to the corresponding end tag
  • You can override it with a specific prefix
• You can use almost this same form to declare a prefix:
  • <book xmlns:dave="http://ce.sharif.edu/sw">
  • Use this prefix on every tag and attribute you want to use from this namespace, including end tags -- it is not a default prefix
  • <dave:chapter dave:number="1">To Begin</dave:chapter>
• You can use the prefix in the start tag in which it is defined:
  • <dave:book xmlns:dave="http://ce.sharif.edu/sw">
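To see that a prefix is only shorthand for the namespace URI, here is a minimal sketch with Python's standard xml.etree.ElementTree, which reports names in the form {uri}local. The two sample documents and the lookup prefix d are assumptions made for this example; the URI is the one used on the slides.

```python
import xml.etree.ElementTree as ET

NS = "http://ce.sharif.edu/sw"

# Same vocabulary, once with an explicit prefix and once with a default namespace.
prefixed = ET.fromstring(
    f'<dave:book xmlns:dave="{NS}">'
    '<dave:chapter dave:number="1">To Begin</dave:chapter>'
    "</dave:book>"
)
defaulted = ET.fromstring(
    f'<book xmlns="{NS}"><chapter>To Begin</chapter></book>'
)

# The parser expands both spellings to the same qualified name.
print(prefixed.tag)
print(defaulted.tag)

# When searching, any prefix may be mapped to the URI; it need not be "dave".
chapter = prefixed.find("d:chapter", {"d": NS})
print(chapter.text, chapter.get(f"{{{NS}}}number"))
```

The prefix chosen in the document (dave) never appears in the parsed names, only the URI does.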
Review of XML rules
• Start with <?xml version="1.0"?>
• XML is case sensitive
• You must have exactly one root element that encloses all the rest of the XML
• Every element must have a closing tag
• Elements must be properly nested
• Attribute values must be enclosed in double or single quotation marks
• There are only five pre-declared entities
XML as a tree
• An XML document represents a hierarchy; a hierarchy is a tree
[Figure: tree diagram with the nodes novel, foreword, chapter number="1", and three paragraph nodes holding the texts "This is the great American novel.", "It was a dark and stormy night.", and "Suddenly, a shot rang out!"]
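The tree view can be made concrete by walking a parsed document. The sketch below uses Python's standard xml.etree.ElementTree; the nesting in NOVEL (one paragraph in the foreword, two in the chapter) is an assumption, since the extracted slide text does not show which paragraph belongs to which parent, and the outline helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed document structure for the "novel" example.
NOVEL = """<novel>
  <foreword>
    <paragraph>This is the great American novel.</paragraph>
  </foreword>
  <chapter number="1">
    <paragraph>It was a dark and stormy night.</paragraph>
    <paragraph>Suddenly, a shot rang out!</paragraph>
  </chapter>
</novel>"""


def outline(element, depth=0):
    """Return (depth, tag) pairs in document order, i.e. a pre-order tree walk."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


for depth, tag in outline(ET.fromstring(NOVEL)):
    print("  " * depth + tag)
```

The indentation printed by the loop mirrors the parent/child relationships of the tree.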
Extended document standards
• You can define your own XML tag sets, but here are some already available:
  • XHTML: HTML redefined in XML
  • SMIL: Synchronized Multimedia Integration Language
  • MathML: Mathematical Markup Language
  • SVG: Scalable Vector Graphics
  • DrawML: Drawing MetaLanguage
  • ICE: Information and Content Exchange
  • ebXML: Electronic Business with XML
  • cXML: Commerce XML
  • CBL: Common Business Library
XML Validation
• "Well Formed" XML document
  • Correct XML syntax
• "Valid" XML document
  • Is “well formed”
  • Conforms to the rules of a DTD
• XML DTD
  • Defines the legal building blocks of an XML document
  • Can be inline in the XML document or an external reference
• XML Schema
  • An XML-based alternative to DTD, more powerful
  • Supports namespaces and data types
An Example XML with DTD
    <?xml version="1.0"?>
    <!DOCTYPE note [
      <!ELEMENT note (to,from,heading,body)>
      <!ELEMENT to (#PCDATA)>
      <!ELEMENT from (#PCDATA)>
      <!ELEMENT heading (#PCDATA)>
      <!ELEMENT body (#PCDATA)>
    ]>
    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend</body>
    </note>
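Python's standard xml.etree.ElementTree parses a document with an inline DTD but does not validate against it, so the sketch below checks the content model (to,from,heading,body) by hand. This is only an illustration of what the DTD demands, not a DTD validator; the follows_note_model helper and the reordered sample are made up for this example.

```python
import xml.etree.ElementTree as ET

DTD = """<!DOCTYPE note [
  <!ELEMENT note (to,from,heading,body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
"""

GOOD = DTD + (
    "<note><to>Tove</to><from>Jani</from>"
    "<heading>Reminder</heading><body>Don't forget me this weekend</body></note>"
)
# Well-formed, but the children are in the wrong order for the DTD.
REORDERED = DTD + (
    "<note><from>Jani</from><to>Tove</to>"
    "<heading>Reminder</heading><body>Don't forget me this weekend</body></note>"
)


def follows_note_model(text):
    """Hand-written check of the 'note' content model from the DTD above."""
    root = ET.fromstring(text)
    child_tags = [child.tag for child in root]
    only_text = all(len(child) == 0 for child in root)  # (#PCDATA): no sub-elements
    return root.tag == "note" and child_tags == ["to", "from", "heading", "body"] and only_text


print(follows_note_model(GOOD))
print(follows_note_model(REORDERED))
```

Both documents are well formed, but only the first would be valid against the DTD.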
XML Schemas
• “Schema” is a general term
  • DTDs are a form of XML schemas
• When we say “XML Schemas,” we usually mean the W3C XML Schema Language
  • This is also known as the “XML Schema Definition” language, or XSD
XSD vs. DTD
• DTDs provide a very weak specification language
  • You can’t put any restrictions on text content
  • You have very little control over mixed content (text plus elements)
  • You have little control over ordering of elements
• DTDs are written in a strange (non-XML) format
  • You need separate parsers for DTDs and XML
• The XML Schema Definition language solves these problems
  • XSD gives you much more control over structure and content
  • XSD is written in XML
Referring to a schema
• To refer to a DTD in an XML document, the reference goes before the root element:
    <?xml version="1.0"?>
    <!DOCTYPE rootElement SYSTEM "url">
    <rootElement> ... </rootElement>
• To refer to an XML Schema in an XML document, the reference goes in the root element:
    <?xml version="1.0"?>
    <rootElement
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="url.xsd">
      ...
    </rootElement>
  • The xmlns:xsi declaration (the XML Schema Instance reference) is required
  • xsi:noNamespaceSchemaLocation is where your XML Schema definition can be found
The XSD document
• Since the XSD is written in XML, it can get confusing which one we are talking about
• The file extension is .xsd
• The root element is <schema>
• The XSD starts like this:
    <?xml version="1.0"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<schema>
• The <schema> element may have attributes:
  • xmlns:xs="http://www.w3.org/2001/XMLSchema"
    • This is necessary to specify where all our XSD tags are defined
  • elementFormDefault="qualified"
    • This means that all XML elements must be qualified (use a namespace)
• It is highly desirable to qualify all elements, or problems will arise when another schema is added
“Simple” and “complex” elements
• A “simple” element is one that contains text and nothing else
  • A simple element cannot have attributes
  • A simple element cannot contain other elements
  • A simple element cannot be empty
  • However, the text can be of many different types, and may have various restrictions applied to it
• If an element isn’t simple, it’s “complex”
  • A complex element may have attributes
  • A complex element may be empty, or it may contain text, other elements, or both text and other elements
Defining a simple element
• A simple element is defined as
    <xs:element name="name" type="type" />
  where:
  • name is the name of the element
  • the most common values for type are xs:boolean, xs:integer, xs:date, xs:string, xs:decimal, xs:time
• Other attributes a simple element may have:
  • default="default value" -- used if no other value is specified
  • fixed="value" -- no other value may be specified
Defining an attribute
• Attributes themselves are always declared as simple types
• An attribute is defined as
    <xs:attribute name="name" type="type" />
  where:
  • name and type are the same as for xs:element
• Other attributes an attribute declaration may have:
  • default="default value" -- used if no other value is specified
  • fixed="value" -- no other value may be specified
  • use="optional" -- the attribute is not required (default)
  • use="required" -- the attribute must be present
Restrictions, or “facets”
• The general form for putting a restriction on a text value is:
    <xs:element name="name">     (or xs:attribute)
      <xs:simpleType>
        <xs:restriction base="type">
          ... the restrictions ...
        </xs:restriction>
      </xs:simpleType>
    </xs:element>
• For example:
    <xs:element name="age">
      <xs:simpleType>
        <xs:restriction base="xs:integer">
          <xs:minInclusive value="0"/>
          <xs:maxInclusive value="140"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:element>
Restrictions on numbers
• minInclusive -- number must be ≥ the given value
• minExclusive -- number must be > the given value
• maxInclusive -- number must be ≤ the given value
• maxExclusive -- number must be < the given value
• totalDigits -- number must have no more than value digits in total
• fractionDigits -- number must have no more than value digits after the decimal point
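To make the inclusive bounds concrete, the sketch below reads minInclusive and maxInclusive out of a small schema with Python's standard xml.etree.ElementTree and applies them by hand. It is not a real XSD validator (the standard library does not include one); the read_bounds and age_is_valid helpers are hypothetical, and the schema wraps the restriction in xs:simpleType as XSD requires.

```python
import xml.etree.ElementTree as ET

XS = {"xs": "http://www.w3.org/2001/XMLSchema"}

SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="age">
    <xs:simpleType>
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="0"/>
        <xs:maxInclusive value="140"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>"""


def read_bounds(schema_text, element_name):
    """Return (minInclusive, maxInclusive) declared for a top-level element."""
    schema = ET.fromstring(schema_text)
    for element in schema.findall("xs:element", XS):
        if element.get("name") == element_name:
            restriction = element.find("xs:simpleType/xs:restriction", XS)
            low = int(restriction.find("xs:minInclusive", XS).get("value"))
            high = int(restriction.find("xs:maxInclusive", XS).get("value"))
            return low, high
    raise KeyError(element_name)


def age_is_valid(text):
    """Apply the two inclusive facets to the text content of an <age> element."""
    low, high = read_bounds(SCHEMA, "age")
    try:
        number = int(text)
    except ValueError:
        return False  # not an integer at all
    return low <= number <= high


for candidate in ["0", "42", "140", "141", "-1", "old"]:
    print(candidate, age_is_valid(candidate))
```

Because both facets are inclusive, the boundary values 0 and 140 are accepted while 141 and -1 are rejected.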
Restrictions on strings
• length -- the string must contain exactly value characters
• minLength -- the string must contain at least value characters
• maxLength -- the string must contain no more than value characters
• pattern -- the value is a regular expression that the string must match
• whiteSpace -- not really a “restriction”; tells what to do with whitespace
  • value="preserve" -- keep all whitespace
  • value="replace" -- change all whitespace characters to spaces
  • value="collapse" -- remove leading and trailing whitespace, and replace all sequences of whitespace with a single space
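The three whiteSpace modes can be sketched in a few lines of Python. The function name apply_whitespace is made up for illustration; it treats tab, line feed, and carriage return as the whitespace characters to be replaced, which is how XML Schema defines the replace step.

```python
import re


def apply_whitespace(value, mode):
    """Normalize a string the way the whiteSpace facet describes."""
    if mode == "preserve":
        return value
    # "replace": every tab, line feed and carriage return becomes a space.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if mode == "replace":
        return replaced
    if mode == "collapse":
        # "collapse": after replacing, squeeze runs of spaces and trim both ends.
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError(f"unknown whiteSpace mode: {mode}")


sample = "  a\t\tb \n c "
for mode in ("preserve", "replace", "collapse"):
    print(mode, repr(apply_whitespace(sample, mode)))
```

Note that replace keeps the length of the string unchanged, while collapse may shorten it.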
Enumeration
• An enumeration restricts the value to be one of a fixed set of values
• Example:
    <xs:element name="season">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="Spring"/>
          <xs:enumeration value="Summer"/>
          <xs:enumeration value="Autumn"/>
          <xs:enumeration value="Fall"/>
          <xs:enumeration value="Winter"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:element>
Complex elements
• A complex element is defined as
    <xs:element name="name">
      <xs:complexType>
        ... information about the complex type ...
      </xs:complexType>
    </xs:element>
• Example:
    <xs:element name="person">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="firstName" type="xs:string" />
          <xs:element name="lastName" type="xs:string" />
        </xs:sequence>
      </xs:complexType>
    </xs:element>
• <xs:sequence> says that elements must occur in this order
• Remember that attributes are always simple types
Declaration and use
• So far we’ve been talking about how to declare types, not how to use them
• To use a type we have declared, use it as the value of type="..."
• Examples:
  • <xs:element name="student" type="person"/>
  • <xs:element name="professor" type="person"/>
• Scope is important: you cannot use a type if it is local to some other type
xs:sequence
• We’ve already seen an example of a complex type whose elements must occur in a specific order:
    <xs:element name="person">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="firstName" type="xs:string" />
          <xs:element name="lastName" type="xs:string" />
        </xs:sequence>
      </xs:complexType>
    </xs:element>
xs:all
• xs:all allows elements to appear in any order
    <xs:element name="person">
      <xs:complexType>
        <xs:all>
          <xs:element name="firstName" type="xs:string" />
          <xs:element name="lastName" type="xs:string" />
        </xs:all>
      </xs:complexType>
    </xs:element>
• Despite the name, the members of an xs:all group can occur once or not at all
• You can use minOccurs="0" to specify that an element is optional (default value is 1)
• In this context, maxOccurs is always 1
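The difference between xs:sequence and xs:all comes down to whether child order matters. The sketch below imitates both checks by hand in Python for the person example; it is a simplification that assumes every member is required (minOccurs left at its default of 1) and is not a schema validator. The helper names matches_sequence and matches_all are hypothetical.

```python
import xml.etree.ElementTree as ET

EXPECTED = ["firstName", "lastName"]


def matches_sequence(parent, names):
    """xs:sequence: the children must appear exactly in the declared order."""
    return [child.tag for child in parent] == list(names)


def matches_all(parent, names):
    """xs:all (all members required): same children, each once, in any order."""
    return sorted(child.tag for child in parent) == sorted(names)


in_order = ET.fromstring(
    "<person><firstName>David</firstName><lastName>Matuszek</lastName></person>"
)
swapped = ET.fromstring(
    "<person><lastName>Matuszek</lastName><firstName>David</firstName></person>"
)

print(matches_sequence(in_order, EXPECTED), matches_all(in_order, EXPECTED))
print(matches_sequence(swapped, EXPECTED), matches_all(swapped, EXPECTED))
```

The swapped document fails the sequence check but passes the all check, which is exactly the distinction between the two groups.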
Empty elements
• Empty elements are (ridiculously) complex
    <xs:complexType name="counter">
      <xs:complexContent>
        <xs:restriction base="xs:anyType">
          <xs:attribute name="count" type="xs:integer"/>
        </xs:restriction>
      </xs:complexContent>
    </xs:complexType>
Mixed elements
• Mixed elements may contain both text and elements
• We add mixed="true" to the xs:complexType element
• The text itself is not mentioned in the element, and may go anywhere (it is basically ignored)
    <xs:complexType name="paragraph" mixed="true">
      <xs:sequence>
        <xs:element name="someName" type="xs:anyType"/>
      </xs:sequence>
    </xs:complexType>