Introduction to XML Path Language (XPath20) Cheng-Chia Chen
What is XPath? • Latest version: 2.0 (http://www.w3.org/TR/xpath20), together with: • XQuery/XPath Data Model (XDM) • XQuery/XPath Formal Semantics • XQuery 1.0 and XPath 2.0 Functions and Operators • Earlier version: 1.0 (http://www.w3.org/TR/xpath) • a language for addressing parts of an XML document, • designed to be used by XSLT, XQuery, XML Schema and XPointer. • References: W3Schools
TOC • Introduction • Data Model • Location Paths • Expressions • Core Function Library
1. Introduction • What is XPath? • A language used to address parts of an XML [XML] document, • provides basic facilities for manipulation of strings, numbers and booleans, • operates on the abstract, logical structure of an XML document, rather than its surface syntax.
XPath (2.0) data model • provides • a tree representation of XML documents as well as • atomic values such as numbers, strings, and booleans, and • flat sequences that may contain both references to nodes in an XML document and atomic values. • The result of evaluating an XPath expression is a sequence of items, each of which is either • a node from the input document, or • an atomic value.
Type systems of XPath • XPath Expression: • the primary syntactic construct in XPath. • is evaluated to yield a value, which is a possibly empty sequence of items. • An item is either • a node or • an atomic value.
Expression evaluation • occurs with respect to a context. • XSLT, XQuery and XPointer specify how the context is determined. • A context consists of: • 1. a node (the context node) • 2. a pair of positive integers (the context position and the context size) • 3. a set of variable bindings • 4. a function library • 5. the set of namespace declarations in scope for the expression • Notes: • 3, 4 and 5 do not change when evaluating subexpressions. • 2 can only be changed by predicates. • Some expressions may change 1.
Location path • The most important kind of expression • used to select a set of nodes relative to a context node.
2. Data Model • details in XQuery/XPath Data Model • XPath operates on an XML document as a tree of nodes. • All XPath expressions are evaluated to produce a sequence of items. • An item is either • an atomic value (atom) or • a node (of an XML document tree)
Kinds of Atoms • In XPath 1.0: • number (a double-precision floating-point number) • boolean (true or false) • string (a sequence of Unicode characters) • In XPath 2.0: • generalized to include all simple datatypes defined by XML Schema • number is classified further into integer, decimal, float and double.
Atomization • A sequence of items can be atomized to produce a sequence of atoms by replacing every node item with its string value as follows: • text node • the contents of the text node • root node or element node • the concatenation in document order of the string values of all descendant text nodes • attribute node, comment node, processing-instruction node • string value of the node • The string value of a node can be queried by invoking fn:string(node).
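The string-value rule for root and element nodes can be sketched in Python. This is only an illustration built on the standard library's xml.etree.ElementTree, not an XPath 2.0 engine; the sample document and the helper name string_value are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
doc = ET.fromstring("<a><a1>data1</a1><a2>data2<b>!</b></a2></a>")

def string_value(element):
    """Concatenate all descendant text nodes in document order."""
    return "".join(element.itertext())

whole = string_value(doc)               # string value of the whole tree
subtree = string_value(doc.find("a2"))  # string value of one element
print(whole)
print(subtree)
```

ElementTree has no separate node objects for attributes or comments, so only the element case of the rule is modeled here.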
Types of nodes in an XML tree • Same as in XPath 1.0 • The tree contains nodes. • Types of nodes and their possible children: • root nodes: element (exactly one), comment, PI • element nodes: element, text, PI, comment, [attribute, namespace] • text nodes: leaves • attribute nodes: leaves • namespace nodes: leaves • processing instruction nodes: leaves • comment nodes: leaves
Basic concepts • See Concepts from XDM • Node Identities • Document Order • Sequence • Types
Node Identity • Every node has a unique identity (like objects in Java): • identical to itself, • not identical to any other node. • I.e., node1 is node2 iff node1 and node2 correspond to the same node occurrence. • Notes: • node identity ≠ ID attribute. • An element has an identity even if it has no ID attributes. • Non-element nodes also have unique identity. • Atomic values do not have identity; • every occurrence of “5” as an integer is identical to every other occurrence of “5” as an integer.
Example
<courses>
  <course name="dismath">
    <student idref="Wang" />
    <student idref="chen" />
    …
  </course>
  <course name="compiler">
    <student idref="Wang" />
    <student idref="Chang" />
    …
  </course>
</courses>
Ex: • XPath: ( /courses/course[@name='dismath']/student[1] is //student[3] ) returns false. • XPath: ( //student[1]/@idref is //student[3]/@idref ) returns false. (why?)
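The difference between node identity and value equality can be sketched in Python, where the `is` operator plays a role similar to the XPath `is` operator. This is a rough analogy using xml.etree.ElementTree, not an XPath 2.0 evaluation; the document below is a trimmed copy of the example without the elided students, and the variable names are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the courses example (the elided students are left out).
doc = ET.fromstring(
    "<courses>"
    "<course name='dismath'><student idref='Wang'/><student idref='chen'/></course>"
    "<course name='compiler'><student idref='Wang'/><student idref='Chang'/></course>"
    "</courses>"
)

students = doc.findall(".//student")   # all student elements in document order
first, third = students[0], students[2]

same_node = first is third                              # identity comparison
same_value = first.get("idref") == third.get("idref")   # value comparison
print(same_node, same_value)
```

Both elements carry the value "Wang", yet they are distinct nodes, so the identity test fails while the value test succeeds.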
Document order and reverse document order • Same as in XPath 1.0
Example [to be added]
<?xml version="1.0" ?>
<a xmlns:ns1="uri1" at1="…" at2="…">
  <a1> data1 </a1>
  <a2> data2 </a2>
  <a3><b3/><!-- comment 1 --> </a3>
  <?pi pidata ?>
</a>
• Doc order: root < a < ns1 < {at1, at2} < a1 < ns1 (for a1) < data1 … < a3 < ns1 (for a3) < b3 < ns1 (for b3) < comment < pi
Sequences • A sequence of items is the unique output type of all XPath expressions. • A sequence may contain nodes, atomic values, or any mixture of nodes and atomic values. • There is no distinction between an item and a singleton sequence containing that item. • ('123') = '123'; node2 = (node2). • A node does not lose its identity when it is added to a sequence. [i.e., only references to the node are added] • A node may occur in multiple places of one or more sequences. • Sequences are flat and never contain other sequences. • Appending (d e) to (a b c) will not produce (a b c (d e)) but flattens it to (a b c d e) automatically. • Notes: • Sequences replace node-sets from XPath 1.0. • In XPath 1.0, node-sets do not contain duplicates.
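The flattening rule can be modeled with a small Python helper. This is a sketch of the idea only, using Python lists to stand in for sequences; the function name seq is an assumption made for this example and is not part of XPath.

```python
def seq(*items):
    """Build a flat sequence: nested lists are spliced in, never nested."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(seq(*item))   # splice the members of a nested sequence
        else:
            flat.append(item)         # a single item is kept as-is
    return flat

appended = seq(["a", "b", "c"], ["d", "e"])
nested = seq(1, [2, 3, 4], [[5]])
print(appended)
print(nested)
```

Note that a Python list may hold the same object twice, which matches the remark that a node may occur in several places of a sequence.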
Types in XDM • accepts all types defined by XML Schema • supports XSLT and XQuery, whose type systems are based on XML Schema. • includes 19 built-in primitive types, 5 additional types defined by XDM, and user/implementor-defined types. • type system defined in XQuery & XPath Formal Semantics • Every item in the data model has both a value and a type. Examples: • nodes → node type • 5 → xs:integer • '5' → xs:string • "Hello World." → xs:string
XDM Type Hierarchy • from XDM Type Hierarchy.
Representation of Types • Use an expanded-QName (EQName) to represent a type. • Definition: An expanded-QName is a set of three values consisting of • {prefix} a possibly empty prefix, • {namespace name} a possibly empty namespace URI and • {local name} a local name. • Note: Only the URI and the local name are used for identity. • Lexical representation of an expanded QName: • [pre1:]localName • URI determined by context. • A type [with target namespace = n1 and local name = loc1] is represented by an EQName [whose URI = n1 and local name = loc1].
General constraints on nodes • All nodes must satisfy the following general constraints: • 1. Every node must have a unique identity, distinct from all other nodes. [unique identity] • 2. The children property of a node must not contain two consecutive Text Nodes. [no adjacent texts] • 3. The children property of a node must not contain any empty Text Nodes. [no empty text] • 4. The children and attributes properties of a node must not contain two nodes with the same identity. [no sharing of nodes] • I.e., no sharing of contained nodes (hence a tree, not a DAG).
Predefined Types • xs:untyped • denotes the dynamic type of an element node that has not been validated, or has been validated in skip mode. • xs:untypedAtomic • denotes untyped atomic data, such as text that has not been assigned a more specific type, or an attribute value that is validated in skip mode • xs:anyAtomicType • derived from xs:anySimpleType • the root of all atomic types (not including list or union types) • the base type of all primitive atomic types. • xs:dayTimeDuration, xs:yearMonthDuration • derived from xs:duration • form of xs:dayTimeDuration: PddDTddHddMdd.dddS • form of xs:yearMonthDuration: PddddYmmM
Atomic (typed) value constructions • signature (format): see XPath constructor functions • prefix:TYPE($arg as xs:anyAtomicType?) as prefix:TYPE? • prefix:TYPE is the QName of the target type, xs:anyAtomicType? is the input type, and prefix:TYPE? is the output type • Notes: • ? means the input and output are sequences of zero or one atomic value. • if $arg is empty then the output is also the empty sequence. • possible prefix:TYPE • xs:integer, xs:int, xs:dateTime, xs:boolean, … • can also be user-defined atomic types: bk:ISBN, np:IP
List of constructors for built-in types • xs:string($arg as xs:anyAtomicType?) as xs:string? • xs:string("abc") → string "abc"; xs:string(123) → "123" • xs:boolean($arg as xs:anyAtomicType?) as xs:boolean? • xs:boolean("abc") → error; xs:boolean("") → error (not a valid lexical form; compare fn:boolean(""), which is false); xs:boolean(10) → true • xs:boolean() → error; xs:boolean(()) → () • xs:decimal($arg as xs:anyAtomicType?) as xs:decimal? • xs:decimal("123.456789") → 123.456789 • xs:float($arg as xs:anyAtomicType?) as xs:float? • xs:double($arg as xs:anyAtomicType?) as xs:double? • Note: • xs:int("1234567891234") → error (out of range for xs:int) • xs:integer("1234567891234") → 1234567891234
All others are similar. • xs:duration, xs:dateTime, xs:time,xs:date,xs:gYearMonth, • xs:gYear,xs:gMonthDay,xs:gDay,xs:gMonth • xs:hexBinary,xs:base64Binary • xs:anyURI,xs:QName • xs:normalizedString, xs:token, xs:language, • xs:NMTOKEN, xs:Name, xs:NCName, • xs:ID, xs:IDREF, xs:ENTITY, • xs:integer, xs:long, xs:int, xs:short, xs:byte • xs:nonPositiveInteger,xs:negativeInteger • xs:nonNegativeInteger, • xs:unsignedLong,xs:unsignedInt,xs:unsignedShort, xs:unsignedByte, • xs:positiveInteger,xs:yearMonthDuration, • xs:dayTimeDuration, xs:untypedAtomic,
More Examples • xs:string("abc"), xs:int("123") • xs:float("123.3e10") • xs:date("2006-11-12") • xs:gMonthDay("--11-12") • xs:gMonth("--11") • xs:gDay("---12") • xs:dateTime("2006-11-12T12:00:00") • fn:dateTime(xs:date("1999-12-31"), xs:time("12:00:00")) returns xs:dateTime("1999-12-31T12:00:00"). • fn:dateTime(xs:date("1999-12-31"), xs:time("24:00:00")) returns xs:dateTime("1999-12-31T00:00:00") because "24:00:00" is an alternate lexical form for "00:00:00".
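As a loose analogy for fn:dateTime, Python's standard datetime module can combine a date and a time into one value. This sketch only mirrors the first fn:dateTime example; the helper name combine_date_time is an assumption made for this example, and the "24:00:00" alternate form is not handled here.

```python
from datetime import date, time, datetime

def combine_date_time(date_text, time_text):
    """Parse an ISO date and an ISO time, then merge them into one datetime."""
    return datetime.combine(date.fromisoformat(date_text),
                            time.fromisoformat(time_text))

combined = combine_date_time("1999-12-31", "12:00:00")
print(combined.isoformat())
```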
Construction of typed values w/o namespace • How to construct a value whose type does not belong to any namespace (no namespace URI)? • use the cast operation: • ex: weight is a subtype of xs:int that does not belong to any namespace. Then we can use: • 40 cast as weight • to get an instance of weight. • or undeclare the default function namespace: • declare default function namespace ""; • … weight(40) …
String values • Every atomic value has a string representation. • The value can be obtained by the casting operation: • Ex: • ( xs:int("123") + 45 ) cast as xs:string • returns "168"
Properties of nodes • string-value • Every node has a string-value, which is part of the node or computed from the string-values of descendant nodes. • expanded-name (XPath 1.0; in 2.0 it is replaced with EQName) • expanded-name = namespace URI + local part • The namespace URI is either null or a URI string [RFC2396]. • Two expanded-names are equal if they have the same local part and the same namespace URI.
Node relationship • Same as in xpath 1.0
Properties/relationships of nodes • Note: m(e) is the URI bound to prefix e.
3. Location Paths (renamed PathExpr in 2.0) • Same as in XPath 1.0 (except for some minor changes) • LocationPath • a special kind of expression, • used to locate a sequence of nodes in the document. • sorted in document order • no duplicates
4. General Expressions • Every expression evaluates to a sequence of items • atomic values • nodes • Atomic values may be • double (XPath 1.0) or numeric (XPath 2.0) • booleans • Unicode strings • or other datatypes defined by XML Schema
Atomization • A sequence may be atomized: • atomic members are not affected; node items become strings • This results in a sequence of atomic values • conversion rules: • document or element nodes → the concatenation of all descendant text nodes (a string) • other kinds of nodes → the obvious string: • attribute node → attribute value cast as xs:string • text node → text content • comment → comment text • PI → PI data (PI target dropped) • namespace node → text of the namespace URI
Kinds of Expressions
3.1 Primary Expressions: string + numeric literals
3.2 Path Expressions
3.3 Sequence Expressions: "," (comma), to, [ … ], |, intersect, except
3.4 Arithmetic Expressions: +, -, *, div, idiv, mod
3.5 Comparison Expressions: is, <, >, =, le, ge, eq, ne, …
3.6 Logical Expressions: and, or, not
3.7 For Expressions: for
3.8 Conditional Expressions: if
3.9 Quantified Expressions: every, some
3.10 Expressions on SequenceTypes
Primary Expressions • Literals • string: "abc", 'abc', "He said ""OK""", 'He said "ok" '. • numerical: 123 → xs:integer, 123.4 → xs:decimal • 124.4e5 → xs:double • non-literals: • xs:int("125") = xs:int(125) = 125 cast as xs:int • boolean: fn:true(), fn:false() • Variable References: $pre:name, $var-1 • Parenthesized Expressions: ( ), ( expr ) • Context Item Expression: . • (1 to 100)[. mod 5 eq 0]; //book[ fn:count(./author) > 1 ] • Function Calls: pre:fName( arg1, …, argn ) • fn:concat("abc", "def")
Literal Expressions • 42 • 3.1415 • 6.022E23 • 'XPath is a lot of fun' • "XPath is a lot of fun" • 'The cat said "Meow!"' • "The cat said ""Meow!""" • "XPath is just so much fun"
Variable References • $foo • $bar:foo • $foo-17 refers to the variable "foo-17" • To subtract 17 from $foo instead, write: ($foo)-17, $foo -17, or $foo+-17
XPath operators and their precedences • see reference • XPath 2.0 grammar
Path Expressions • Location paths are expressions • They may be applied to arbitrary sequences • evaluation rule discussed before.
Sequence Expressions • Constructing sequences: "," (comma) and to • ((1,2,3), (), (3)) → (1,2,3,3) • 2 to 4 → (2,3,4); (10, (1 to 3)) → (10,1,2,3) • (1,(2,3,4),((5))) → (1,2,3,4,5) (flattened) • Filter Expressions: PrimaryExpr [ … ]* • (1 to 30)[. mod 3 = 0][. mod 5 = 0] → (15, 30) • (10 to 20)[5] → (14) • Combining node sequences (for nodes only): • assume doc order: A < B < C < D < E • union: (A,B,A) | (B,C) | (A,C) = (A,B) union (B,C) → (A,B,C) • intersect, except: • (A,B,C,D) intersect (B,D,A,E) except (B) → (A, D).
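The node-combining operators can be modeled in Python by treating nodes as labels with a known document order. This is a sketch under that assumption: the DOC_ORDER table and the function names are invented for this example, and real XPath operates on node identity rather than on string labels.

```python
# Assumed document order of five hypothetical nodes: A < B < C < D < E.
DOC_ORDER = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}

def in_doc_order(nodes):
    """Remove duplicates and sort the remaining nodes in document order."""
    return sorted(set(nodes), key=DOC_ORDER.__getitem__)

def union(s1, s2):
    return in_doc_order(list(s1) + list(s2))

def intersect(s1, s2):
    return in_doc_order([n for n in s1 if n in s2])

def except_(s1, s2):
    return in_doc_order([n for n in s1 if n not in s2])

u = union(union(["A", "B", "A"], ["B", "C"]), ["A", "C"])
d = except_(intersect(["A", "B", "C", "D"], ["B", "D", "A", "E"]), ["B"])
print(u, d)
```

The second expression is evaluated left to right, intersect first and then except, which reproduces the (A, D) result of the slide.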
Filter Expressions • Predicates are generalized to arbitrary sequences • The expression '.' is the context item • The expression: (10 to 40)[. mod 5 = 0 and position() > 20] has the result: 30, 35, 40
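The way a predicate sees both the context item and its position can be sketched in Python. This models the evaluation rule only; the helper name filter_seq and its three-argument predicate (item, position, size) are assumptions made for this example.

```python
def filter_seq(items, predicate):
    """Keep the items whose predicate(item, position, size) holds.

    Positions are 1-based, as in XPath.
    """
    size = len(items)
    return [item for pos, item in enumerate(items, start=1)
            if predicate(item, pos, size)]

# (10 to 40)[. mod 5 = 0 and position() > 20]
late_multiples = filter_seq(list(range(10, 41)),
                            lambda item, pos, size: item % 5 == 0 and pos > 20)

# (1 to 30)[. mod 3 = 0][. mod 5 = 0]: the second filter sees the first result
by_three = filter_seq(list(range(1, 31)), lambda item, pos, size: item % 3 == 0)
by_fifteen = filter_seq(by_three, lambda item, pos, size: item % 5 == 0)

# (10 to 20)[5]: a numeric predicate selects by position
fifth = filter_seq(list(range(10, 21)), lambda item, pos, size: pos == 5)
print(late_multiples, by_fifteen, fifth)
```

The item 30 sits at position 21 of (10 to 40), which is why it is the first value to pass the position test.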
Arithmetic Expressions • Operators: +, -, *, div, idiv, mod, and unary +, - • -3 div 2 → -1.5 (decimal) • -3 idiv 2 → -1 (integer) • -3.4 mod 2 (or -2) → -1.4 • rule: x = y * (x idiv y) + (x mod y) • precedence: {+, -} < {*, mod, div, idiv} < {unary +, -} • Operators are generalized to sequences • if any argument is empty, the result is empty: • () + 3 → () • if all arguments are singleton sequences of numbers, they are combined as numbers: • (3) + (4) + 5 → 12 • otherwise, a runtime error occurs: • (1,3) + (2,4) → error
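The relation between idiv and mod can be checked with a small Python model. Python's own // and % round toward negative infinity for integers, so this sketch truncates toward zero explicitly; the function names xpath_idiv and xpath_mod are assumptions made for this example, and Decimal is used to keep -3.4 exact.

```python
from decimal import Decimal

def xpath_idiv(x, y):
    """Integer division that truncates toward zero."""
    q = abs(x) // abs(y)
    same_sign = (x >= 0) == (y >= 0)
    return int(q if same_sign else -q)

def xpath_mod(x, y):
    """Remainder defined by the rule x = y * (x idiv y) + (x mod y)."""
    return x - y * xpath_idiv(x, y)

print(xpath_idiv(-3, 2))               # truncated quotient
print(xpath_mod(Decimal("-3.4"), 2))   # remainder keeps the sign of x
print(xpath_mod(Decimal("-3.4"), -2))  # same remainder for a negative divisor
```

With this definition the remainder always takes the sign of the dividend, which is why both -3.4 mod 2 and -3.4 mod -2 give -1.4.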
Comparison Expressions (→ boolean) • Value Comparisons • comparison operators: eq, ne, lt, le, gt, ge. • used for comparing single values. • General Comparisons (**) • operators: =, !=, <, <=, >, >=. • are existentially quantified comparisons that may be applied to operand sequences of any length. • The result is true or false if it does not raise an error. • Node Comparisons • operators: is, >>, << • A is B → true if A and B are the same node • A << B = B >> A → true if A precedes B in doc order.
Value Comparison • Comparison operators: • eq (=), ne (≠), lt (<), le (<=), gt (>), ge (>=) • Used on atomic values • When applied to arbitrary values (sequences): • atomize • if either argument is empty, the result is empty • if either has length > 1, a type error is raised • if incomparable, a runtime error; ex: 8 lt "abc" • otherwise, compare the two atomic values • Examples: 8 eq 4+4; (//rcp:ingredient)[1]/@name eq "beef cube steak"
Node Comparison • Operators: is, <<, >> • Used to compare nodes on identity and order • is is for node identity; >>, << for node ordering • When applied to arbitrary values: • if either argument is empty, the result is empty • if both are singleton nodes, the nodes are compared • otherwise, a runtime error. Ex: //book[1] is "abc" • Examples: • (//student)[2] is //student[@id = "s9527"] • /rcp:collection << (//rcp:recipe)[4] • (//rcp:recipe)[4] >> (//rcp:recipe)[3]
General Comparison (use with care!!) • Operators: =, !=, <, <=, >, >= • Used on general sequences: • atomize • if there exist two values, one from each argument, whose value comparison holds, the result is true (Note: an error may be raised during the value comparison) • otherwise, the result is false
Examples:
8 = 4+4
(1,2) = (2,4)
//rcp:ingredient/@name = "salt"
() = () → false!!
(2) != ("2") → runtime error
(1,2) = (1, "2") → true
(1,2) = ("2", 1) → runtime error
I.e., seq1 gop seq2 means ∃x1∈seq1 ∃x2∈seq2 (x1 vop x2).
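The existential reading of general comparisons can be modeled in a few lines of Python. This is a sketch of the quantifier structure only: the helper name general_compare is an assumption made for this example, atomization is skipped, and the type errors that XPath raises for incomparable values are not modeled.

```python
import operator

def general_compare(seq1, seq2, op):
    """True if some pair (x1, x2), one item from each sequence, satisfies op."""
    return any(op(x1, x2) for x1 in seq1 for x2 in seq2)

overlap = general_compare([1, 2], [2, 4], operator.eq)   # (1,2) = (2,4)
empties = general_compare([], [], operator.eq)           # () = ()
differ = general_compare([1, 2], [1, 2], operator.ne)    # (1,2) != (1,2)
print(overlap, empties, differ)
```

The last case shows why these operators need care: for the same two sequences both = and != can be true, because each only needs one matching pair.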
Be Careful About Comparisons
((//rcp:ingredient)[40]/@name, (//rcp:ingredient)[40]/@amount) eq ((//rcp:ingredient)[53]/@name, (//rcp:ingredient)[53]/@amount)
• type error, since only singletons and compatible values can be compared with eq
((//rcp:ingredient)[40]/@name, (//rcp:ingredient)[40]/@amount) = ((//rcp:ingredient)[53]/@name, (//rcp:ingredient)[53]/@amount)
• true, since the two names are found to be equal
((//rcp:ingredient)[40]/@name, (//rcp:ingredient)[40]/@amount) is ((//rcp:ingredient)[53]/@name, (//rcp:ingredient)[53]/@amount)
• runtime error, since only single-node sequences can be compared
Algebraic Axioms for Comparisons • Reflexivity: • Symmetry: • Transitivity: • Anti-symmetry: • Negation: