1.7k likes | 1.87k Views
ΤΜΗΜΑ ΠΛΗΡΟΦΟΡΙΚΗΣ , ΑΠΘ ΜΕΤΑΠΤΥΧΙΑΚΟ ΠΡΟΓΡΑΜΜΑ ΣΠΟΥΔΩΝ Κατεύθυνση Πληροφοριακών Συστημάτων - 1ο Εξάμηνο Σημασιολογικός Ιστός lpis .csd.auth.gr/mtpx/sw/index.htm. Διδάσκων: Ν. Βασιλειάδης Αναπλ. Καθ. Τμ. Πληροφορικής ΑΠΘ. Μαθήματα: 2-3 -4. Chapter 2 Structured Web Documents in XML.
E N D
ΤΜΗΜΑ ΠΛΗΡΟΦΟΡΙΚΗΣ, ΑΠΘΜΕΤΑΠΤΥΧΙΑΚΟ ΠΡΟΓΡΑΜΜΑ ΣΠΟΥΔΩΝ Κατεύθυνση Πληροφοριακών Συστημάτων - 1ο Εξάμηνο Σημασιολογικός Ιστόςlpis.csd.auth.gr/mtpx/sw/index.htm Διδάσκων: Ν. Βασιλειάδης Αναπλ. Καθ. Τμ. Πληροφορικής ΑΠΘ Μαθήματα: 2-3-4
Chapter 2Structured Web Documents in XML Grigoris Antoniou Frank van Harmelen
Introduction • Today HTML (HyperText Markup Language) is the standard language for Web pages. • HTML was derived from SGML (Standard Generalized Markup Language) • International standard (ISO 8879) for the definition of device- and system-independent methods of representing information, both human- and machine-readable. A Semantic Web Primer, 2nd Edition
Standards • Standards are important • They enable effective communication • They support technological progress and business collaboration • In WWW, standards are set by the W3C • They are called recommendations • In a distributed environment without central authority, standards cannot be enforced. A Semantic Web Primer, 2nd Edition
SGML Applications • HTML • Developed because SGML is too complex for Internet-related purposes. • XML (eXtensible Markup Language) • Its development was driven by shortcomings of HTML. A Semantic Web Primer, 2nd Edition
An HTML Example <h2>Nonmonotonic Reasoning: Context- Dependent Reasoning</h2> <i>by <b>V. Marek</b> and <b>M. Truszczynski</b></i><br> Springer 1993<br> ISBN 0387976892 A Semantic Web Primer, 2nd Edition
The Same Example in XML <book> <title>Nonmonotonic Reasoning: Context- Dependent Reasoning</title> <author>V. Marek</author> <author>M. Truszczynski</author> <publisher>Springer</publisher> <year>1993</year> <ISBN>0387976892</ISBN> </book> A Semantic Web Primer, 2nd Edition
HTML versus XML: Similarities • Both use tags (e.g. <h2> and </year>) • Tags may be nested (tags within tags) • Human users can read and interpret both HTML and XML representations quite easily … But how about machines? A Semantic Web Primer, 2nd Edition
HTML versus XML: Differences • All tags in XML must be closed • E.g. for an opening tag <title> there must be a closing tag </title> • The enclosed content, together with its opening and closing tags, is called element • In HTML some tags (<br>), may be left open. • XHTML brings HTML more in line with XML • Any valid XHTML document is also a valid XML document • Opening and closing tags in XHTML are balanced A Semantic Web Primer, 2nd Edition
Problems with Automated Interpretation of HTML Documents An intelligent agent trying to retrieve the names of the authors of the book • Authors’ names could appear immediately after the title • or immediately after the word by • Are there two authors? • Or just one, called “V. Marek and M. Truszczynski”? A Semantic Web Primer, 2nd Edition
HTML vs XML: Structural Information • HTML documents do not contain structural information: pieces of the document and their relationships. • XML more easily accessible to machines: • Every piece of information is described. • Relations are also defined through the nesting structure • E.g., <author> tags appear within <book> tags, so they describe properties of the particular book. A Semantic Web Primer, 2nd Edition
HTML vs XML: Structural Information (2) • A machine processing the XML document would be able to deduce that • the author element refers to the enclosing book element • rather than by proximity considerations • XML allows the definition of constraints on values • E.g. a year must be a number of 4 digits A Semantic Web Primer, 2nd Edition
HTML vs XML: Formatting • The HTML representation provides more than the XML representation: • The formatting of the document is also described • Τhe main use of an HTML document is to display information: it must define formatting • XML: separation of content from display • Same information can be displayed in different ways • Content may be used for purposes other than display A Semantic Web Primer, 2nd Edition
HTML vs XML: Another Example • In HTML <h2>Relationship matter-energy</h2> <i> E = M × c2 </i> • In XML <equation> <meaning>Relationship matter energy</meaning> <leftside> E </leftside> <rightside> M × c2 </rightside> </equation> A Semantic Web Primer, 2nd Edition
HTML vs XML: Different Use of Tags • Both HTML docs have the same tags. • In XML completely different tags are used! • HTML tags define display: color, lists … • XML tags not fixed: user definable tags • XML meta markup language: language for defining markup languages A Semantic Web Primer, 2nd Edition
XML Vocabularies • Web applications must agree on common vocabularies to communicate and collaborate • Communities and business sectors are defining their specialized vocabularies • mathematics (MathML) • bioinformatics (BSML) • human resources (HRML) • … A Semantic Web Primer, 2nd Edition
Data Exchange • The main use of XML today is as a uniform data exchange format between applications. • Rather than a document markup language. • Companies retrieve information from their partners, and update their databases accordingly. • If there is not an agreed common standard, then specialized processing and querying software must be developed for each partner separately • Technical overhead • The software must be updated every time a partner decides to change its own database format. A Semantic Web Primer, 2nd Edition
Lecture Outline • Introduction • Detailed Description of XML • Structuring • DTDs • XML Schema • Namespaces • Accessing, querying XML documents: XPath • Transformations: XSLT A Semantic Web Primer, 2nd Edition
The XML Language An XML document consists of • a prolog • a number of elements • an optional epilog (not discussed) A Semantic Web Primer, 2nd Edition
Prolog of an XML Document The prolog consists of • an XML declaration and • an optional reference to external structuring documents <?xml version="1.0" encoding="UTF-16"?> <!DOCTYPE book SYSTEM "book.dtd"> A Semantic Web Primer, 2nd Edition
XML Elements • The “things” the XML document talks about • E.g. books, authors, publishers • An element consists of: • an opening tag • the content • a closing tag <lecturer>David Billington</lecturer> A Semantic Web Primer, 2nd Edition
XML Elements (2) • Tag names can be chosen almost freely. • The first character must be a letter, an underscore, or a colon • No name may begin with the string “xml” in any combination of cases • E.g. “Xml”, “xML” A Semantic Web Primer, 2nd Edition
Content of XML Elements • Content may be text, or other elements, or nothing <lecturer> <name>David Billington</name> <phone> +61 − 7 − 3875 507 </phone> </lecturer> • If there is no content, then the element is called empty; it is abbreviated as follows: <lecturer/> for <lecturer></lecturer> A Semantic Web Primer, 2nd Edition
XML Attributes • An empty element is not necessarily meaningless • It may have some properties in terms of attributes • An attribute is a name-value pair inside the opening tag of an element <lecturer name="David Billington" phone="+61 − 7 − 3875 507"/> A Semantic Web Primer, 2nd Edition
XML Attributes: An Example <order orderNo="23456“ customer="John Smith" date="October 15, 2002"> <item itemNo="a528" quantity="1"/> <item itemNo="c817" quantity="3"/> </order> A Semantic Web Primer, 2nd Edition
The Same Example without Attributes <order> <orderNo>23456</orderNo> <customer>John Smith</customer> <date>October 15, 2002</date> <item> <itemNo>a528</itemNo> <quantity>1</quantity> </item> <item> <itemNo>c817</itemNo> <quantity>3</quantity> </item> </order> A Semantic Web Primer, 2nd Edition
XML Elements vs Attributes • Attributes can be replaced by elements • When to use elements and when attributes is a matter of taste • But attributes • Cannotbenested • Cannot have two attributes with the same name in the same element A Semantic Web Primer, 2nd Edition
Mixed Content • Elements with mixed content contain both text and other elements at the same time <letter> Dear Mr.<name>John Smith</name>.Your order <orderid>1032</orderid>will be shipped on <shipdate>2001-07-13</shipdate>. </letter> A Semantic Web Primer, 2nd Edition
Further Components of XML Docs • Comments • A piece of text that is to be ignored by parser • <!-- This is a comment --> • Processing Instructions (PIs) • Define procedural attachments • <?stylesheet type="text/css" href="mystyle.css"?> A Semantic Web Primer, 2nd Edition
Well-Formed XML Documents • Syntactically correct documents • Some syntactic rules: • Only one outermost element (called root element) • Each element contains an opening and a corresponding closing tag • Tags may not overlap <author><name>Lee Hong</author></name> • Attributes within an element have unique names • Element and tag names must be permissible A Semantic Web Primer, 2nd Edition
The Tree Model of XML Documents: An Example <email> <head> <from name="Michael Maher" address="michaelmaher@cs.gu.edu.au"/> <to name="Grigoris Antoniou" address="grigoris@cs.unibremen.de"/> <subject>Where is your draft?</subject> </head> <body> Grigoris, where is the draft of the paper you promised me last week? </body> </email> A Semantic Web Primer, 2nd Edition
A Tree Example A Semantic Web Primer, 2nd Edition
The Tree Model of XML Docs • The tree representation of an XML document is an ordered labeled tree: • There is exactly one root • There are no cycles • Each non-root node has exactly one parent • Each node has a label. • The order of elements is important • … but the order of attributes is not important A Semantic Web Primer, 2nd Edition
The Tree Model of XML Docs • Trees are used just as illustrations. • Some XML aspects are not properly represented. • A more refined tree concept is required • The different types of nodes should be differentiated • Element node, attribute node etc. • There is difference between: • the root of the tree (representing the XML document) • the root element (email) • This distinction is important for addressing and querying XML documents. A Semantic Web Primer, 2nd Edition
Lecture Outline • Introduction • Detailed Description of XML • Structuring • DTDs • XML Schema • Namespaces • Accessing, querying XML documents: XPath • Transformations: XSLT A Semantic Web Primer, 2nd Edition
Why Structuring XML Documents? • An XML document is well-formed if it respects certain syntactic rules. • These rules say nothing specific about the structure of the document. • Imagine 2 applications that try to communicate • They wish to use the same vocabulary A Semantic Web Primer, 2nd Edition
Structuring XML Documents • Define all the element and attribute names that may be used • Define the structure • what values an attribute may take • which elements may or must occur within other elements, etc. • If such structuring information exists, the document can be validated A Semantic Web Primer, 2nd Edition
Structuring XML Documents (2) • An XML document is valid if • it is well-formed • respects the structuring information it uses • There are two ways of defining the structure of XML documents: • DTDs (the older and more restricted way) • XML Schema (offers extended possibilities) A Semantic Web Primer, 2nd Edition
DTD: Element Type Definition <lecturer> <name>David Billington</name> <phone> +61 − 7 − 3875 507 </phone> </lecturer> DTD for above element (and all lecturer elements): <!ELEMENT lecturer (name,phone)> <!ELEMENT name (#PCDATA)> <!ELEMENT phone (#PCDATA)> A Semantic Web Primer, 2nd Edition
The Meaning of the DTD • The element types lecturer, name, and phone may be used in the document • A lecturer element contains a name element and a phone element, in that order (sequence) • A name element and a phone element may have any content • In DTDs, #PCDATA is the only atomic type for elements A Semantic Web Primer, 2nd Edition
DTD: Disjunction in Element Type Definitions • We express that a lecturer element contains either a name element or a phone element as: <!ELEMENT lecturer (name | phone)> • Both elements below are validated: <lecturer> <name>David Billington</name> </lecturer> <lecturer> <phone> +61 − 7 − 3875 507 </phone> </lecturer> A Semantic Web Primer, 2nd Edition
DTD: Element Type Definitions – Any order • A lecturer element contains a name element and a phone element in any order. <!ELEMENT lecturer ((name,phone) | (phone,name))> • Both elements below are validated: <lecturer> <name>David Billington</name> <phone> +61 − 7 − 3875 507 </phone> </lecturer> <lecturer> <phone> +61 − 7 − 3875 507 </phone> <name>David Billington</name> </lecturer> A Semantic Web Primer, 2nd Edition
Example of an XML Element <order orderNo="23456" customer="John Smith" date="October 15, 2002"> <item itemNo="a528" quantity="1"/> <item itemNo="c817" quantity="3"/> </order> A Semantic Web Primer, 2nd Edition
The Corresponding DTD <!ELEMENT order (item+)> <!ATTLIST order orderNo ID #REQUIRED customer CDATA #REQUIRED date CDATA #REQUIRED> <!ELEMENT item EMPTY> <!ATTLIST item itemNo ID #REQUIRED quantity CDATA #REQUIRED comments CDATA #IMPLIED> A Semantic Web Primer, 2nd Edition
Comments on the DTD • The item element type is defined to be empty • + (after item) is a cardinality operator: • ?: appears zero times or once • *: appears zero or more times • +: appears one or more times • No cardinality operator means exactly once A Semantic Web Primer, 2nd Edition
Comments on the DTD (2) • In addition to defining elements, we define attributes • This is done in an attribute list containing: • Name of the element type to which the list applies • A list of triplets of attribute name, attribute type, and value type • Attribute name: A name that may be used in an XML document using a DTD A Semantic Web Primer, 2nd Edition
DTD: Attribute Types • Similar to predefined data types, but limited selection • The most important types are • CDATA, a string (sequence of characters) • ID, a name that is unique across the entire XML document • IDREF, a reference to another element with an ID attribute carrying the same value as the IDREF attribute • IDREFS, a series of IDREFs • (v1| . . . |vn), an enumeration of all possible values • Limitations: no dates, number ranges etc. A Semantic Web Primer, 2nd Edition
DTD: Attribute Value Types • #REQUIRED • Attribute must appear in every occurrence of the element type in the XML document • #IMPLIED • The appearance of the attribute is optional • #FIXED "value" • Every element must have this attribute • "value" • This specifies the default value for the attribute A Semantic Web Primer, 2nd Edition
Referencing with IDREF and IDREFS <!ELEMENT family (person*)> <!ELEMENT person (name)> <!ELEMENT name (#PCDATA)> <!ATTLIST person id ID #REQUIRED mother IDREF #IMPLIED father IDREF #IMPLIED children IDREFS #IMPLIED> A Semantic Web Primer, 2nd Edition
An XML Document Respecting the DTD <family> <person id="bob" mother="mary" father="peter"> <name>Bob Marley</name> </person> <person id="bridget" mother="mary"> <name>Bridget Jones</name> </person> <person id="mary" children="bob bridget"> <name>Mary Poppins</name> </person> <person id="peter" children="bob"> <name>Peter Marley</name> </person> </family> A Semantic Web Primer, 2nd Edition