XML And XPath DSA Term 2 Week 14 DSA/2006/week 14
Lecture overview
• Matters arising
• Character coding
• Well-formed XML
• Creating simple XML files
• Placename to BBC code
• Introduction to XPath
Character Coding
• Character set
• ISO 8859: 1 byte per character
  • 0-127 are ASCII
  • 128-255 vary depending on the part of the standard
  • 15 different character maps
  • ISO-8859-1 (Latin-1): the default for HTML
  • ISO-8859-2: Central European
  • A document must be in one encoding
  • Problem of mixing characters, e.g. an Arabic quotation in a Cyrillic text
• UTF-8: a variable-length Unicode encoding of 1-4 bytes that supports a huge range of international languages in a single code
  • ASCII is included as characters 0-127
  • Ensures that the internet is truly multi-lingual
  • Key invention by Ken Thompson: self-synchronisation, which allows character boundaries to be detected
• Character references in HTML (shown for the degree sign °)
  • Named: &deg;
  • Decimal: &#176;
  • Hexadecimal: &#xB0;
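A minimal Python sketch of the points above, using only the standard library. It shows that the degree sign takes one byte in ISO-8859-1 but two in UTF-8, that ASCII is unchanged in UTF-8, how continuation bytes make UTF-8 self-synchronising, and that the three character-reference forms name the same character. The euro sign is an extra illustration that does not appear on the slide.

```python
import html

degree = "\u00b0"  # the degree sign, code point 176 (hex B0)

# ISO-8859-1 stores it in a single byte; UTF-8 needs two.
latin1_bytes = degree.encode("iso-8859-1")
utf8_bytes = degree.encode("utf-8")
print(latin1_bytes, utf8_bytes)

# ASCII characters (0-127) have the same single-byte form in both encodings.
assert "A".encode("utf-8") == "A".encode("iso-8859-1") == b"A"

# Self-synchronisation: every continuation byte has the bit pattern 10xxxxxx,
# so a decoder can find the next character boundary from any position.
euro_bytes = "\u20ac".encode("utf-8")  # a three-byte character
continuation = [b for b in euro_bytes if b & 0xC0 == 0x80]
print(len(euro_bytes), len(continuation))

# Named, decimal and hexadecimal character references name the same character.
references = ["&deg;", "&#176;", "&#xB0;"]
decoded = [html.unescape(ref) for ref in references]
print(decoded)
```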
Defining the Encoding
• Encodings in HTML
  • In a meta tag
    <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
  • In the XML declaration
    <?xml version="1.0" encoding="ISO-8859-1"?>
  • In the HTTP Content-Type header
    Content-Type: text/html; charset=ISO-8859-1
• Setting the encoding in PHP
  header("Content-type: text/html; charset=UTF-8");
• Setting the encoding in the browser
  • Firefox: View / Character Encoding
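To see why the declared encoding matters, here is a small Python sketch (the slides themselves use PHP). It parses the same character from two byte streams, each labelled with its own encoding, and then shows a parser rejecting bytes that contradict the label. The <temp> element is a made-up example, not part of the lecture material.

```python
import xml.etree.ElementTree as ET

# The same degree sign as raw bytes in two encodings, each announced
# in the XML declaration. <temp> is a hypothetical element name.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><temp>21\xb0</temp>'
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?><temp>21\xc2\xb0</temp>'

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text
print(latin1_text == utf8_text)

# Bytes that contradict the declared encoding are rejected by the parser:
# a lone 0xB0 byte is not valid UTF-8.
mislabelled = b'<?xml version="1.0" encoding="UTF-8"?><temp>21\xb0</temp>'
try:
    ET.fromstring(mislabelled)
    mislabelled_ok = True
except ET.ParseError:
    mislabelled_ok = False
print(mislabelled_ok)
```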
Design a simple XML file
• Design an XML vocabulary to represent pairs of place names and codes
  • Bristol 1263
  • Bath 1123
• First review XML structure
Example
<MapSet>
  <Map id="P2" desc="P Block level 2">
    <room id="2P2">
      <area shape="rect" coords="118,39,138,68"/>
      <type>Staff Room</type>
      <occupant>Tony Solomonides</occupant>
    </room>
    <room id="2P3">
      <area shape="rect" coords="141,40,162,69"/>
      <type>Staff Room</type>
      <occupant>Richard Lawson</occupant>
    </room>
    <room id="2P4">
      <area shape="poly" coords="201,40,234,40,234,118,164,119,163,71,200,71"/>
      <type>Office</type>
      <occupant>Eleanor Gibbons</occupant>
      <occupant>Dee Evans</occupant>
      <occupant>Ali Jack</occupant>
    </room>
    ….
  </Map>
</MapSet>
Well-formed XML documents (1)
Every XML document must be well-formed and must therefore adhere to the following rules (among others):
• Every start-tag must have a matching end-tag.
• Elements may nest but must not overlap.
  <name>Anna<em>Coffey</em></name> - √
  <name><em>Anna</name>Coffey</em> - ×
• There must be exactly one root element.
• Attribute values must be quoted.
• An element must not have two attributes with the same name.
• Comments and processing instructions may not appear inside tags.
• No unescaped < or & signs may occur in the character data of an element.
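Any XML parser can check these rules. The sketch below uses Python's standard ElementTree parser as one convenient way to test a fragment. The helper name is_well_formed and all samples except the two <name> examples are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "nested": "<name>Anna<em>Coffey</em></name>",
    "overlapping": "<name><em>Anna</name>Coffey</em>",
    "two roots": "<a/><b/>",
    "unquoted attribute": "<room id=2P2/>",
    "unescaped ampersand": "<type>Bed & Breakfast</type>",
}
results = {label: is_well_formed(xml) for label, xml in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'not well-formed'}")
```

Only the first sample is accepted. Each of the others breaks one of the listed rules.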
Well-formed XML documents (2)
Element names are case sensitive - <NAME>, <name>, <Name> & <NaMe> are four different element types.
No white space in element names - <First Name> not allowed; <First_Name> OK.
Element names cannot start with the letters "XML" or "xml" - reserved terms.
Element names must start with a letter or an underscore.
Element names cannot start with a number, but numbers may be embedded within an element name - <2you> not allowed; <me2you> is OK.
Attribute names are constrained by the above rules for element names.
Entity references are used to substitute specific characters. There are five predefined entities built into XML:

Entity   Char  Notes
&amp;    &     Do not use inside processing instructions.
&lt;     <     Use inside attribute values quoted with ".
&gt;     >     Use after ]] in normal text and inside processing instructions.
&quot;   "     Use inside attribute values quoted with ".
&apos;   '     Use inside attribute values quoted with '.
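The predefined entities rarely need to be typed by hand. As a hedged illustration in Python (not the PHP used in the workshop), the standard xml.sax.saxutils helpers perform the substitutions. The sample strings are invented for the example.

```python
from xml.sax.saxutils import escape, quoteattr, unescape

raw = "Fish & Chips < 5"
escaped = escape(raw)  # replaces &, < and > with entity references
print(escaped)

# Quotes are only replaced when asked for, via the optional mapping.
quoted = escape('say "hi"', {'"': "&quot;", "'": "&apos;"})
print(quoted)

# quoteattr() picks a quoting style and escapes the value as required.
attr = quoteattr("P Block level 2")
print(f"<Map desc={attr}/>")

# unescape() reverses the default replacements.
restored = unescape(escaped)
print(restored)
```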
Errors
• Look at the listing of the XML file and identify everything that prevents this XML from being well-formed
<Map id=P2 desc="P Block level 2'>
  <room id="2P2"> This is a nice big office
    <area rect coords="118,39,138,68">
    <typo>Staff Room</typo>
    <occupant>Tony Solomonides</occupant>
  </Room>
  <room id="2P3">
    <area rect coords="141,40,162,69"></area>
    <typo>Staff Room</typo>
    <occupant>”Richard Lawson”</occupant>
  </Room>
  <room id="2P4">
    <area poly coords="201,40,234,40,234,118,164,119,163,71,200,71"/>
    <typo>Office</typo>
    <occupant>Eleanor Gibbons</occupant>
    <person>Dee Evans</person>
    <occupant>Ali Jack</occupant
  </Room>
---
Task
• Draw the structure
  • Use ER notation
  • Attributes in the entity
  • Crow's-foot notation for one-to-many and optional relationships
• Identify any restricted sets of values (enumerated types)
• In the lab, QSEE will allow you to define the structure and generate the schema definition (XML Schema or DTD)
XPath
• Core language for selecting nodes in XML
• Version 1.0 used in
  • XSLT 1.0
    • client-side in browsers
    • Xalan engine
  • The W3Schools tutorial is for XPath 1.0
  • SimpleXML in PHP
• Version 2.0 used in
  • XSLT 2.0
    • Saxon processor
  • XQuery 1.0
• Differences
  • Core data structure in 2.0 is a sequence (1.0 uses node-sets)
  • Full support for all XML Schema datatypes
  • Two kinds of equality operators
  • Larger function library
XPath Language
• Not a programming language
• Expressions to be evaluated
• Focus on
  • Navigation in a tree structure
  • Multiple directions or 'axes'
    • Down to children (child axis)
    • Up to parent (parent axis)
    • Down to attributes (attribute axis)
    • Across to siblings (following-sibling and preceding-sibling axes)
  • Operators
  • Functions
XPath operators
• Arithmetic operators: + - * div idiv mod
• Value comparisons: eq, ne, lt, le, gt, ge
• Sequence comparisons: =, !=
  • = is true if any item on the left equals any item on the right, i.e. there are common elements
  • != is true if any item on the left differs from any item on the right
  • (1,2,3) = (2,3,4) is true
  • (1,2,3) != (2,3,4) is also true
  • not((1,2,3) = (2,3,4)) is false
• Logical operators: and, or, not()
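The three worked results follow from the fact that XPath 2.0 sequence comparisons are existential: the test succeeds if at least one pair of items satisfies it. This Python sketch emulates that behaviour so the results can be checked. The function names general_eq and general_ne are made up for the illustration, and atomic values only are assumed.

```python
from itertools import product

def general_eq(left, right):
    """Emulate XPath 2.0 '=': true if some pair of items is equal."""
    return any(a == b for a, b in product(left, right))

def general_ne(left, right):
    """Emulate XPath 2.0 '!=': true if some pair of items differs."""
    return any(a != b for a, b in product(left, right))

left, right = (1, 2, 3), (2, 3, 4)
eq_result = general_eq(left, right)          # 2 = 2 is one matching pair
ne_result = general_ne(left, right)          # 1 != 2 is one differing pair
not_eq_result = not general_eq(left, right)  # negation of the first result
print(eq_result, ne_result, not_eq_result)
```

This is why != is not the negation of =. To test that two sequences share no items, use not(... = ...).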
Large function library
• count(seq), max(seq), min(seq), avg(seq)
  • count((1,2,3)) = 3
• String functions
  • string-length('abc')
  • tokenize('a,b,c', ',')
  • string-join(('a','b','c'), ', ')
Using the eXist database
• eXist database as an XPath / XQuery engine
  • REST interface
    ..exist/rest/db/chriswallace/rooms?_query=//Map
  • Java client
  • Sandbox (using Ajax to do dynamic syntax checking)
• Context is the whole database
• The demo database includes
  • the whole text of Romeo and Juliet
  • the Mondial world database
Examples
• All rooms
  /MapSet/Map/room
  //room
• Room 2P5
  //room[@id='2P5']
• The occupants of room 2P4
  //room[@id='2P4']/occupant
• The roomNo of the room which Colin Fudge occupies
  //room[occupant = 'Colin Fudge']/@id
• The number of occupants of 2P4
  count(//room[@id='2P4']/occupant)
• The floor of Ali Jack's room
  //room[occupant = 'Ali Jack']/../@desc
Notes
• Note how = tests if a person is amongst the occupants
• To 'serialise' an attribute use string()
• See how ../ allows navigation to the parent element
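Several of these expressions can be tried without eXist. The sketch below uses Python's standard ElementTree, which implements only a subset of XPath 1.0, so attribute steps and count() are replaced by .get() and len(). The embedded document is a cut-down copy of the rooms example, and Dee Evans stands in for the occupant lookup because Colin Fudge's room is not in the part of the file shown.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the rooms example; only data shown on the slide is used.
doc = ET.fromstring("""
<MapSet>
  <Map id="P2" desc="P Block level 2">
    <room id="2P3">
      <type>Staff Room</type>
      <occupant>Richard Lawson</occupant>
    </room>
    <room id="2P4">
      <type>Office</type>
      <occupant>Eleanor Gibbons</occupant>
      <occupant>Dee Evans</occupant>
      <occupant>Ali Jack</occupant>
    </room>
  </Map>
</MapSet>
""")

# //room
all_rooms = [room.get("id") for room in doc.findall(".//room")]

# //room[@id='2P4']/occupant
occupants = [o.text for o in doc.findall(".//room[@id='2P4']/occupant")]

# count(//room[@id='2P4']/occupant): count() is not available, so use len().
occupant_count = len(occupants)

# //room[occupant = 'Dee Evans']/@id: select the element, then read the attribute.
dee_room = doc.find(".//room[occupant='Dee Evans']").get("id")

# //room[occupant = 'Ali Jack']/../@desc: '..' steps up to the parent Map.
floor = doc.find(".//room[occupant='Ali Jack']/..").get("desc")

print(all_rooms, occupants, occupant_count, dee_room, floor)
```

The occupant predicate matches even though room 2P4 has three occupant children, which is the membership behaviour of = described in the notes.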
Examples for you
• The room number for Richard Lawson
• The coordinates of room 2P2
• All rooms with poly shape
• Who are Ali Jack's office mates?
XML design
• Rooms is a mixture of text elements and attributes.
  • Could be all attributes: what would change?
  • Could be no attributes: what would change?
• For the workshop exercise use elements instead of attributes; it's simpler, even if more verbose
• Generally, what do the experts recommend?
Workshop
• Create a simple XML file containing pairs of place names and BBC codes
• Change the PHP script to accept a placename
• Read the new XML file and decode the name to get the code, using the PHP SimpleXML interface and its xpath() method
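The lookup logic can be prototyped before writing the PHP. The following is a sketch in Python rather than the SimpleXML code the workshop asks for, and the vocabulary (places, place, name, code) is a hypothetical design using elements only. The two place/code pairs are the ones given earlier in the lecture. The equivalent XPath would be /places/place[name='Bristol']/code.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: these element names are illustrative only.
PLACES_XML = """
<places>
  <place><name>Bristol</name><code>1263</code></place>
  <place><name>Bath</name><code>1123</code></place>
</places>
"""

def code_for(placename, xml_text=PLACES_XML):
    """Return the code for placename, or None if it is not listed."""
    root = ET.fromstring(xml_text)
    # Compare in Python rather than pasting the name into a path string,
    # so quotes in user input cannot break the expression.
    for place in root.findall("place"):
        if place.findtext("name") == placename:
            return place.findtext("code")
    return None

print(code_for("Bristol"))
print(code_for("Bath"))
print(code_for("Wells"))
```

Returning None for an unknown place leaves it to the calling script to decide how to report the error.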