Introduction to XML Path Language (XPath20) Cheng-Chia Chen
What is XPath ? • Latest version: • 2.0 : • http://www.w3.org/TR/xpath20 • XQuery/XPath Data Model (XDM) • XQuery/XPath Formal Semantics • XQuery 1.0 and XPath 2.0 Functions and Operators • 1.0 : http://www.w3.org/TR/xpath • a language for addressing parts of an XML document, • designed to be used by XSLT , XQuery, XML Schema and XPointer. • References: xfront, W3Schools
TOC • Introduction • Data Model • Location Paths • Expressions • Core Function Library
1. Introduction • What is XPath? • A language used to address parts of an XML document; • provides basic facilities for manipulating strings, numbers and booleans; • operates on the abstract, logical structure of an XML document rather than its surface syntax.
XPath (2.0) data model • provides • a tree representation of XML documents, as well as • atomic values such as numbers, strings, and booleans, and • flat sequences that may contain both references to nodes in an XML document and atomic values. • The result of evaluating an XPath expression is a sequence of items, each of which is either • a node from the input document, or • an atomic value.
Type systems of XPath • XPath Expression: • the primary syntactic construct in XPath. • would be evaluated to yield a value, which is a possibly empty sequence of items. • An item is either • a node or • an atomic value.
Expression evaluation (XPath 1.0) • occurs with respect to a context. • XSLT, XQuery and XPointer specify how the context is determined. • A context consists of: • 1. a node (the context node) • 2. a pair of positive integers (the context position and the context size) • 3. a set of variable bindings • 4. a function library • 5. the set of namespace declarations in scope for the expression • Notes: • 3, 4 and 5 do not change when evaluating subexpressions. • 2 can only be changed by predicates. • Some expressions may change 1.
Expression evaluation (XPath 2.0) • Expression Context • consists of all information that can affect the result of evaluating an expression • Contexts are organized into two categories: • static context: contains information available prior to execution • dynamic context: • contains information used during execution • = static context + additional information
Static context A static context consists of: 1. XPath 1.0 compatibility mode: boolean 2. Statically known namespaces (i.e., (prefix, uri) pairs) 3. Default element/type namespace (or none) • <e1 .../>, <pre:e2 xsi:type="aType" /> 4. Default function namespace (or none) • max(...), fn:f1(...), ... 5. In-scope schema definitions: • schema type definitions (local + global) + • element declarations (global + local + substitution groups) + • attribute declarations (global + local) • Identified by expanded QName (global), or by implementation-dependent identifiers (local or anonymous). 6. In-scope variables: a set of (EQName, type) pairs. • the set of variables available for reference within an expression. • some constructs (for, some, every) may extend the in-scope variables of their subexpressions.
Context item static type: the static type of the context item • Function signatures (i.e., callable functions and constructors) • the set of functions that are callable from within an expression. • Each function is identified by its expanded QName and its arity. • A function signature also specifies the static types of the function parameters and result. • Statically known collations • a set of (uri, collation) pairs. A collation specifies how character strings are compared and ordered. Collations are identified by a URI string. • Default collation: one of the statically known collations. • Base URI: the URI used to resolve relative URIs into absolute URIs. • Statically known documents: • pairs of (s: absolute doc uri, t: type), where t is the type of fn:doc(s) and the default value of t is document-node()?. • Statically known collections: pairs of (s: uri, t: type), where t is the type of fn:collection(s). • Statically known default collection type: the default type (node()* if not given) of fn:collection().
Dynamic context = static context + additional items listed below : • Focus = context {item, position, size} • ., position(), last() • Variable values : pairs of (EQName, value), where value also contains dynamic type info. • Function implementations • contains implementation of function signatures given in static context. • Current dateTime : • current-dateTime(), current-date(), current-time() • Implicit timezone: implicit-timezone() • Available documents: Map<uri, document-node> • Available collections : Map<uri, node()*> • Default collection: value of collection()
Location path • The most important kind of expression • used to select a set of nodes relative to a context node.
2. Data Model • details in XQuery/XPath Data Model • XPath operates on an XML document as a tree of nodes. • All XPath expressions are evaluated to produce a value. • In XPath 2.0, a value is always a sequence. • A sequence is an ordered collection of zero or more items. • An item is either • an atomic value or • a node. • An atomic value is a value (in the value space) of an atomic type, as defined in [XML Schema]. • 123 => xs:integer; 123.0 => xs:decimal; 1.23e2 => xs:double • xs:date("2011-12-10") xs:QName('xs:date')
XPath 2.0 data model • A node is an instance of one of the seven node kinds defined in XQuery/XPath Data Model. • Each node has a unique node identity, a typed value, and a string value. • Some nodes have a name, which is a value of type xs:QName. • The typed value of a node is a sequence of zero or more atomic values. • The string value of a node is a value of type xs:string. • In certain situations a value is said to be undefined (for example, the value of the context item, or the typed value of an element node). • This term indicates that the property in question has no value and that • any attempt to use its value results in an error.
Kinds of Atoms • Kinds of atoms • number (XPath 1.0): a double-precision floating-point number • boolean (XPath 1.0): true or false • string (XPath 1.0): a sequence of Unicode characters, or • generalized in XPath 2.0 to include all atomic datatypes defined by XML Schema • in XPath 2.0, number is further classified into • integer, decimal, float and double.
Atomization • A sequence of items can be atomized to produce a sequence of atoms by replacing every node item with its typed value, as follows: • root node, text node => string value as xs:untypedAtomic • comment node, processing-instruction node, namespace node => string value as xs:string • attribute => value according to its type annotation, or the string value if the type is xs:untypedAtomic • ex: "12.3e2" in xs:double => 12.3e2; • "s1 s2 s3" in xs:IDREFS => sequence ('s1', 's2', 's3') of type xs:IDREF* • element with simple content • xs:anySimpleType => string value as xs:untypedAtomic • otherwise => value(s) + type // ex: list type • other element nodes • xs:untyped or complex type with mixed content => string value as xs:untypedAtomic • complex type + empty content (or nilled = 'true') => () • complex type + element-only content => undefined • The typed value of a sequence s can be queried by invoking fn:data(s).
Types of nodes in an XML tree • All but the namespace node are the same as in XPath 1.0 • The tree contains nodes. • Types of nodes and their possible children: • root nodes: element (exactly 1), comment, PI • element nodes: element, text, PI, comment, [attribute, namespace] • text nodes: leaves • attribute nodes: leaves • namespace nodes: leaves (XPath 2.0 implementations need not support them; XQuery 1.0 does not support them) • processing instruction nodes: leaves • comment nodes: leaves
Basic concepts • See Concepts from XDM • Node Identities • Document Order • Sequence • Types
Node Identity • Every node has a unique identity (like objects in Java): • identical to itself, • not identical to any other node. • I.e., node1 is node2 iff node1 and node2 correspond to the same node occurrence. • Notes: • node identity ≠ ID attribute. • An element has an identity even if it has no ID attributes. • Non-element nodes also have unique identities. • Atomic values do not have identity; • every occurrence of "5" as an integer is identical to every other occurrence of "5" as an integer.
Example <courses> <course name="dismath"> <student idref="Wang" /> <student idref="chen" /> … </course> <course name="compiler"> <student idref="Wang" /> <student idref="Chang" /> … </course> </courses> Ex: • XPath: ( /courses/course[@name='dismath']/student[1] is (//student)[3] ) returns false. • XPath: ( (//student)[1]/@idref is (//student)[3]/@idref ) returns false. (why?)
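The identity-versus-value distinction in this example can be tried out with Python's standard `xml.etree.ElementTree`. This is only a sketch under assumptions: ElementTree is not an XPath 2.0 engine and does not model attribute nodes, so Python's `is` on element objects stands in for the XPath `is` operator, and the elided "…" parts of the sample document are simply left out.

```python
import xml.etree.ElementTree as ET

# Sample document modeled on the slide (the elided parts are omitted).
DOC = """
<courses>
  <course name="dismath">
    <student idref="Wang"/>
    <student idref="chen"/>
  </course>
  <course name="compiler">
    <student idref="Wang"/>
    <student idref="Chang"/>
  </course>
</courses>
"""

root = ET.fromstring(DOC)
students = list(root.iter("student"))  # all student elements, in document order

first, third = students[0], students[2]

# Both elements carry the same idref value ...
same_value = first.get("idref") == third.get("idref")
# ... but they are two distinct nodes, so an identity test fails.
same_node = first is third

print(same_value, same_node)
```

Equal content never implies identical nodes; only a node compared with itself passes the identity test.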
Document order and reverse document order • Same as in XPath 1.0
Example <?xml version="1.0" ?> <a xmlns:ns1="uri1" at1="…" at2="…"> <a1> data1 </a1> <a2> data2 </a2> <a3><b3/><!-- comment 1 --> </a3> <?pi pidata ?> </a> • Doc order: root < a < ns1 < {at1, at2} < a1 < ns1 (for a1) < data1 … < a3 < ns1 (for a3) < b3 < ns1 (for b3) < comment < pi
Sequences • A sequence of items is the unique output type of all XPath expressions. • A sequence may contain nodes, atomic values, or any mixture of nodes and atomic values. • There is no distinction between an item and a singleton sequence containing that item. • ('123') = '123'; node2 = (node2). • A node does not lose its identity when it is added to a sequence [i.e., only references to the node are added]. • A node may occur in multiple places in one or more sequences. • Sequences are flat and never contain other sequences. • Appending (d e) to (a b c) does not produce (a b c (d e)); the result is automatically flattened to (a b c d e). • Notes: • Sequences replace node-sets from XPath 1.0. • In XPath 1.0, node-sets do not contain duplicates; sequences may.
Types in XDM • accepts all types defined by XML Schema • supports XSLT and XQuery, whose type systems are based on XML Schema. • includes 19 built-in primitive types, 5 additional types defined by XDM, and user/implementor-defined types. • type system defined in XQuery & XPath Formal Semantics • Every item in the data model has both a value and a type. Examples: • nodes => node type; • 5 => xsd:integer; • '5' => xsd:string; • "Hello World." => xsd:string.
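The flattening rule can be modeled in a few lines of Python. This is a rough sketch, not part of any XPath library: sequences are represented as Python lists, and the helper name `seq` is made up for illustration. It does not model the "item equals singleton sequence" rule.

```python
def seq(*items):
    """Build a flat XPath-style sequence; nested sequences are spliced in."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(seq(*item))  # splice nested sequences recursively
        else:
            flat.append(item)
    return flat


# Appending (d e) to (a b c) yields a flat five-item sequence.
print(seq(["a", "b", "c"], ["d", "e"]))
# Arbitrary nesting disappears as well.
print(seq(1, [2, 3, 4], [[5]]))
```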
XDM Type Hierarchy • from XDM Type Hierarchy.
Representation of Types • Use an expanded-QName (EQName) to represent a type. • Definition: An expanded-QName is a set of three values consisting of • {prefix} a possibly empty prefix, • {namespace name} a possibly empty namespace URI and • {local name} a local name. • Note: Only the URI and the local name are used for identity. • Lexical representation of an expanded QName: • [pre1:] localName • URI determined by context. • A type [with target namespace = n1 and local name = loc1] is represented by an EQName [whose URI = n1 and local name = loc1].
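The note that only the URI and the local name take part in identity can be sketched with a Python dataclass. The class name `EQName` and its field names are illustrative choices, not an existing API; the prefix is stored but excluded from comparison and hashing.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EQName:
    namespace: str  # possibly empty namespace URI
    local: str      # local name
    # The prefix is kept for display only and ignored by == and hash().
    prefix: str = field(default="", compare=False)


XS = "http://www.w3.org/2001/XMLSchema"

a = EQName(XS, "date", prefix="xs")
b = EQName(XS, "date", prefix="xsd")   # same type, different prefix
c = EQName("urn:example", "date", prefix="xs")  # hypothetical other namespace

print(a == b, a == c)
```

Two names written with different prefixes compare equal as long as the prefixes are bound to the same namespace URI.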
General constraints on nodes All nodes must satisfy the following general constraints: • 1. Every node must have a unique identity, distinct from all other nodes. [unique identity] • 2. The children property of a node must not contain two consecutive Text Nodes. [no adjacent texts ] • 3. The children property of a node must not contain any empty Text Nodes. [no empty text ] • 4. The children and attributes properties of a node must not contain two nodes with the same identity. [no sharing of nodes ] • I.e., no sharing of contained nodes (hence a tree but not a dag ).
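Constraints 2 and 3 amount to a normalization step when a children list is built. The following Python sketch illustrates it under simplifying assumptions: text nodes are modeled as plain `str` values and every other node kind as any non-string object. The function name is hypothetical.

```python
def normalize_children(children):
    """Merge adjacent text items and drop empty ones.

    Text nodes are modeled as str; all other node kinds as non-str objects.
    """
    result = []
    for child in children:
        if isinstance(child, str):
            if child == "":
                continue  # constraint 3: no empty text nodes
            if result and isinstance(result[-1], str):
                result[-1] += child  # constraint 2: merge adjacent text
                continue
        result.append(child)
    return result


# ("elem", "x") stands in for an element node.
print(normalize_children(["a", "", "b", ("elem", "x"), "", "c"]))
```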
Predefined Types (link) • xs:untyped • denotes the dynamic type of an element node that has not been validated, or has been validated in skip mode. • xs:untypedAtomic • denotes untyped atomic data, such as text that has not been assigned a more specific type, or an attribute value that is validated in skip mode • xs:anyAtomicType • derived from xs:anySimpleType • the root of all atomic types (not including list or union types) • the base type of all 23 primitive types. • xs:dayTimeDuration, xs:yearMonthDuration • derived from xs:duration • xs:dayTimeDuration form: PnDTnHnMnS (the seconds may carry a fraction) • xs:yearMonthDuration form: PnYnM
Atomic (typed) value construction • signature (format): see XPath constructor functions • prefix:TYPE($arg as xs:anyAtomicType?) as prefix:TYPE? • where prefix:TYPE is the QName of the target type, xs:anyAtomicType? is the input type and prefix:TYPE? is the output type • Notes: • ? means that the input and the output are each a sequence of zero or one atomic value. • if $arg is the empty sequence (), the output is also the empty sequence (). • possible prefix:TYPE • xs:integer, xs:int, xs:dateTime, xs:boolean, … • can also be a user-defined atomic type: bk:ISBN, np:IP
List of constructors for built-in types • xs:string($arg as xs:anyAtomicType?) as xs:string? • xs:string("abc") => string "abc"; xs:string(123) => "123" • xs:boolean($arg as xs:anyAtomicType?) as xs:boolean? • xs:boolean("abc") => error; xs:boolean("") => error; xs:boolean(10) => true; • xs:boolean() => error; xs:boolean(()) => () • Note: xs:boolean != fn:boolean (effective boolean value) • xs:decimal($arg as xs:anyAtomicType?) as xs:decimal? • xs:decimal("123.456789") => 123.456789 • xs:float($arg as xs:anyAtomicType?) as xs:float? • xs:double($arg as xs:anyAtomicType?) as xs:double? • Note: • xs:int("1234567891234") => error (out of range for xs:int) • xs:integer("1234567891234") => 1234567891234
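The xs:boolean cases listed above can be mimicked by a small Python function. This is a rough model under assumptions, not an implementation of the casting rules: `None` stands for the empty sequence, only strings and numbers are handled, and Python's `strip()` only approximates XML whitespace collapsing. The function name `xs_boolean` is made up.

```python
def xs_boolean(arg):
    """Rough model of the xs:boolean constructor for a few input kinds."""
    if arg is None:  # None stands for the empty sequence ()
        return None
    if isinstance(arg, bool):
        return arg
    if isinstance(arg, (int, float)):
        # Numbers: zero and NaN become false, everything else true.
        return not (arg == 0 or arg != arg)
    if isinstance(arg, str):
        lexical = arg.strip()
        if lexical in ("true", "1"):
            return True
        if lexical in ("false", "0"):
            return False
        raise ValueError(f"invalid lexical form for xs:boolean: {arg!r}")
    raise TypeError(f"unsupported argument type: {type(arg).__name__}")


print(xs_boolean(10), xs_boolean("0"), xs_boolean(None))
```

Unlike the effective boolean value computed by fn:boolean, a non-empty string such as "abc" is rejected here rather than treated as true.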
All others are similar. • xs:duration, xs:dateTime, xs:time,xs:date,xs:gYearMonth, • xs:gYear,xs:gMonthDay,xs:gDay,xs:gMonth • xs:hexBinary,xs:base64Binary • xs:anyURI,xs:QName • xs:normalizedString, xs:token, xs:language, • xs:NMTOKEN, xs:Name, xs:NCName, • xs:ID, xs:IDREF, xs:ENTITY, • xs:integer, xs:long, xs:int, xs:short, xs:byte • xs:nonPositiveInteger,xs:negativeInteger • xs:nonNegativeInteger, • xs:unsignedLong,xs:unsignedInt,xs:unsignedShort, xs:unsignedByte, • xs:positiveInteger,xs:yearMonthDuration, • xs:dayTimeDuration, xs:untypedAtomic,
More Examples • xs:string("abc"), xs:int("123") • xs:float("123.3e10") • xs:date("2006-11-12") • xs:gMonthDay("--11-12") • xs:gMonth("--11") • xs:gDay("---12") • xs:dateTime("2006-11-12T12:00:00") • fn:dateTime(xs:date("1999-12-31"), xs:time("12:00:00")) returns xs:dateTime("1999-12-31T12:00:00"). • fn:dateTime(xs:date("1999-12-31"), xs:time("24:00:00")) returns xs:dateTime("1999-12-31T00:00:00") because "24:00:00" is an alternate lexical form for "00:00:00". • note: 24:00:00 = 00:00:00
String values • Every atomic value has a string representation. • It can be obtained with a cast: • Ex: • (xs:int("123") + 45) cast as xs:string • returns "168"
Properties of nodes • string value • Every node has a string-value, which is either part of the node or computed from the string-values of its descendant nodes. • can be obtained by string(.) • typed value • can be obtained by data(.) • expanded-name (XPath 1.0; replaced by EQName in 2.0) • expanded-name = namespace URI + local part • The namespace URI is either null or a URI string [RFC2396]. • Two expanded-names are equal if they have the same local part and the same namespace URI.
Node relationship • Same as in xpath 1.0
properties/relationship of nodes m(e) is the URI bound to prefix e
3. Location Paths (renamed PathExpr in 2.0) • Same as in XPath 1.0 (except for some minor changes) • LocationPath • a special kind of expression, • used to locate a sequence of nodes in the document; • the result is sorted in document order • and contains no duplicates
Kinds of Expressions 3.1 Primary Expressions: string + numeric literals 3.2 Path Expressions 3.3 Sequence Expressions: the comma operator, to, [ … ], |, intersect, except 3.4 Arithmetic Expressions: +, -, *, div, idiv, mod 3.5 Comparison Expressions: is, <, >, =, le, ge, eq, ne… 3.6 Logical Expressions: and, or, not() 3.7 For Expressions: for 3.8 Conditional Expressions: if 3.9 Quantified Expressions: every, some 3.10 Expressions on SequenceTypes
Primary Expressions • Literals • string: "abc", 'abc', "He said ""OK""", 'He said "ok"'. • numeric: 123 (xs:integer), 123.4 (xs:decimal), • 124.4e5 (xs:double) • non-literals: • xs:int("125") = xs:int(125) = 125 cast as xs:int • boolean: fn:true(), fn:false() • Variable References: $pre:name, $var-1 • Parenthesized Expressions: ( ), ( expr ) • Context Item Expression: . • (1 to 100)[. mod 5 eq 0]; //book[fn:count(./author) > 1] • Function Calls: pre:fName(arg1, …, argn) • fn:concat("abc", "def")
Literal Expressions • 42 • 3.1415 • 6.022E23 • 'XPath is a lot of fun' • "XPath is a lot of fun" • 'The cat said "Meow!"' • "The cat said ""Meow!""" • "XPath is just so much fun"
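How a literal's spelling determines its type can be sketched with regular expressions in Python. The patterns below follow the shape of the XPath 2.0 literal grammar (digits only, digits with a point, digits with an exponent, quoted strings with doubled-quote escapes); the function name `literal_type` is made up, and the sketch ignores surrounding whitespace and comments.

```python
import re

_INTEGER = re.compile(r"[0-9]+")
_DECIMAL = re.compile(r"\.[0-9]+|[0-9]+\.[0-9]*")
_DOUBLE = re.compile(r"(\.[0-9]+|[0-9]+(\.[0-9]*)?)[eE][+-]?[0-9]+")
# A quote inside a string literal is escaped by doubling it.
_STRING = re.compile(r'"([^"]|"")*"|\'([^\']|\'\')*\'')


def literal_type(text):
    """Return the type name an XPath 2.0 literal of this spelling would have."""
    if _INTEGER.fullmatch(text):
        return "xs:integer"
    if _DECIMAL.fullmatch(text):
        return "xs:decimal"
    if _DOUBLE.fullmatch(text):
        return "xs:double"
    if _STRING.fullmatch(text):
        return "xs:string"
    raise ValueError(f"not a literal: {text!r}")


for sample in ["42", "3.1415", "6.022E23", '"The cat said ""Meow!"""']:
    print(sample, literal_type(sample))
```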
Variable References • $foo • $bar:foo • $foo-17 refers to the variable "foo-17", not to $foo minus 17 • Possible fixes: ($foo)-17, $foo -17, $foo+-17
Path Expressions • Location paths are expressions • They may be applied to arbitrary sequences • evaluation rule discussed before.
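The reason `$foo-17` is read as one variable name is that a hyphen is a legal name character. A simplified tokenizer in Python shows the effect. This is a sketch under assumptions: the name pattern is restricted to ASCII (real NCNames allow many more characters), and `variable_name` is a made-up helper.

```python
import re

# Simplified, ASCII-only approximation of an XML NCName.
_NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"
# A variable reference: "$" followed by a name with an optional prefix.
_VAR_REF = re.compile(r"\$(" + _NCNAME + r"(?::" + _NCNAME + r")?)")


def variable_name(expr):
    """Return the name of the first variable reference in expr, or None."""
    match = _VAR_REF.search(expr)
    return match.group(1) if match else None


for expr in ["$foo-17", "($foo)-17", "$foo -17", "$foo+-17", "$bar:foo"]:
    print(expr, "->", variable_name(expr))
```

The hyphen is swallowed by the name in the first spelling only; a parenthesis, a space or a plus sign ends the name.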
Sequence Expressions • Constructing sequences: the comma operator (,) and the range operator (to) • ((1,2,3), (), (3)) => (1,2,3,3) • 2 to 4 => (2,3,4); (10, (1 to 3)) => (10,1,2,3) • (1,(2,3,4),((5))) => (1,2,3,4,5) (flattened) • Filter Expressions: PrimaryExpr [ … ]* • (1 to 30)[. mod 3 = 0][. mod 5 = 0] => (15, 30) • (10 to 20)[5] => (14) • Combining Node Sequences (nodes only): • assume doc order: A < B < C < D < E • union: (A,B,A) | (B,C) | (A,C) = (A,B) union (B,C) => (A,B,C) • intersect, except: • (A,B,C,D) intersect (B,D,A,E) except (B) • => (A, D).
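The range operator and the three node-combining operators can be imitated in Python. This is a sketch under assumptions: nodes are modeled as the strings "A" to "E", their document order is given by a fixed list, and node identity is approximated by string equality. All function names are illustrative.

```python
DOC_ORDER = ["A", "B", "C", "D", "E"]  # assumed document order of five nodes


def to(start, end):
    """Model of 'start to end'; empty when start > end."""
    return list(range(start, end + 1))


def _in_doc_order(nodes):
    # Results are duplicate-free and sorted in document order.
    return sorted(set(nodes), key=DOC_ORDER.index)


def union(left, right):
    return _in_doc_order(left + right)


def intersect(left, right):
    return _in_doc_order([n for n in left if n in right])


def except_(left, right):
    return _in_doc_order([n for n in left if n not in right])


print(to(2, 4))
print(union(union(["A", "B", "A"], ["B", "C"]), ["A", "C"]))
print(except_(intersect(list("ABCD"), list("BDAE")), ["B"]))
```

Operand order and duplicates in the inputs do not matter: the result is always a duplicate-free sequence in document order.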
Filter Expressions • Predicates generalized to arbitrary sequences • The expression '.' is the context item • The expression: (10 to 40)[. mod 5 = 0 and position() > 20] has the result: 30, 35, 40
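Predicate evaluation with a focus (context item, position, size) can be sketched in Python. The helper `filter_seq` is hypothetical: the predicate is passed as a Python function of the three focus components, and a numeric predicate result is treated as a position test, as in `(10 to 20)[5]`.

```python
def filter_seq(items, predicate):
    """Keep the items for which predicate(item, position, size) holds.

    Positions start at 1; a numeric result selects the item at that position.
    """
    size = len(items)
    kept = []
    for position, item in enumerate(items, start=1):
        result = predicate(item, position, size)
        if isinstance(result, (int, float)) and not isinstance(result, bool):
            result = (result == position)
        if result:
            kept.append(item)
    return kept


ten_to_forty = list(range(10, 41))
# (10 to 40)[. mod 5 = 0 and position() > 20]
print(filter_seq(ten_to_forty, lambda item, pos, size: item % 5 == 0 and pos > 20))
# (10 to 20)[5]
print(filter_seq(list(range(10, 21)), lambda item, pos, size: 5))
```

Position 21 in `10 to 40` holds the value 30, which is why the multiples of 5 below 30 are filtered out.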
Arithmetic Expressions • +, -, *, div, idiv, mod, and unary +, - • -3 div 2 => -1.5 (decimal) • -3 idiv 2 => -1 (integer) • -3.4 mod 2 (or -2) => -1.4 • rule: x = y * (x idiv y) + (x mod y) • precedence: {+, -} < {*, mod, div, idiv} < {unary +, -} • Operators are generalized to sequences • if any argument is empty, the result is empty • () + 3 => () • if all arguments are singleton sequences of numbers, they are added as numbers: • (3) + (4) + 5 => 12 • otherwise, a runtime error occurs • (1,3) + (2,4) => error
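The idiv/mod rule differs from Python's own `//` and `%`, which round toward negative infinity. A sketch with exact rationals shows the truncating behaviour. It assumes `fractions.Fraction` as a stand-in for xs:decimal and lists for sequences; the names `idiv`, `mod` and `plus` are illustrative.

```python
import math
from fractions import Fraction


def idiv(x, y):
    """Integer division that truncates toward zero, like XPath idiv."""
    return math.trunc(Fraction(x) / Fraction(y))


def mod(x, y):
    """Remainder defined by x = y * (x idiv y) + (x mod y)."""
    return Fraction(x) - Fraction(y) * idiv(x, y)


def plus(left, right):
    """'+' on sequences (lists): empty in, empty out; singletons only."""
    if not left or not right:
        return []
    if len(left) > 1 or len(right) > 1:
        raise TypeError("operand is a sequence of more than one item")
    return [left[0] + right[0]]


print(idiv(-3, 2), -3 // 2)            # truncation vs Python's floor division
print(mod(Fraction("-3.4"), 2))        # sign follows the dividend
print(plus([], [3]), plus([3], [4]))
```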
Comparison Expressions (result: boolean) • Value Comparisons • operators: eq, ne, lt, le, gt, ge. • used for comparing single values. • General Comparisons (**) • operators: =, !=, <, <=, >, >=. • existentially quantified comparisons that may be applied to operand sequences of any length. • The result is true or false, unless an error is raised. • Node Comparisons • operators: is, >>, << • A is B => true if A and B are the same node • A << B = B >> A => true if A precedes B in doc order.
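"Existentially quantified" means a general comparison is true as soon as one pair of items satisfies it. A Python sketch follows, with sequences as lists of already-atomized values; atomization and type promotion are left out, and the function name is made up.

```python
import operator
from itertools import product


def general_compare(op, left, right):
    """True if some pair (a, b) from left x right satisfies op(a, b)."""
    return any(op(a, b) for a, b in product(left, right))


# (1, 2) = (2, 3): the pair (2, 2) matches.
print(general_compare(operator.eq, [1, 2], [2, 3]))
# (1, 2) != (1, 2) also holds, because 1 != 2.
print(general_compare(operator.ne, [1, 2], [1, 2]))
# Comparing with an empty sequence is always false.
print(general_compare(operator.eq, [], [1]))
```

One consequence is that `=` and `!=` are not complements of each other on sequences with more than one item.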
Value Comparison • Comparison operators: • eq (=), ne (≠), lt (<), le (<=), gt (>), ge (>=) • Used on atomic values • When applied to arbitrary values (sequences): • atomize • if either argument is empty => () • if either has length > 1 => type error • if the values are incomparable, an error is raised; ex: 8 lt "abc" • otherwise, compare the two atomic values • Ex: 8 eq 4+4; (//rcp:ingredient)[1]/@name eq "beef cube steak"
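The case analysis for value comparisons translates directly into a small Python function. This is a simplified model: operands are lists of already-atomized values, the empty result is an empty list, and "incomparable" is reduced to "a string compared with a non-string". The function name is hypothetical.

```python
import operator


def value_compare(op, left, right):
    """Model of eq/ne/lt/le/gt/ge on atomized operand sequences (lists)."""
    if not left or not right:
        return []  # an empty operand gives the empty sequence
    if len(left) > 1 or len(right) > 1:
        raise TypeError("operand has more than one item")
    a, b = left[0], right[0]
    if isinstance(a, str) != isinstance(b, str):
        raise TypeError(f"cannot compare {a!r} with {b!r}")
    return [op(a, b)]


print(value_compare(operator.eq, [8], [4 + 4]))
print(value_compare(operator.eq, [], [8]))
```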
Node Comparison • Operators: is, <<, >> • Used to compare nodes on identity and order • is tests node identity; >>, << test node ordering • When applied to arbitrary values: • if either argument is empty, the result is empty • if both are singleton nodes, the nodes are compared • otherwise, a runtime error. Ex: //book[1] is "abc" • Ex: • (//student)[2] is //student[@id = "s9527"] • /rcp:collection << (//rcp:recipe)[4] • (//rcp:recipe)[4] >> (//rcp:recipe)[3]
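Identity and ordering tests can be imitated with Python's standard `xml.etree.ElementTree`. This is a sketch under assumptions: the three-student document is invented for illustration, Python's `is` plays the role of the XPath `is` operator, and document order is recorded by numbering the elements as `iter()` visits them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
root = ET.fromstring("<r><s id='s1'/><s id='s9527'/><s id='s3'/></r>")

# iter() visits elements in document order; remember each node's rank.
order = {id(node): rank for rank, node in enumerate(root.iter())}


def precedes(a, b):
    """Model of a << b."""
    return order[id(a)] < order[id(b)]


def follows(a, b):
    """Model of a >> b."""
    return precedes(b, a)


students = root.findall("s")
second = students[1]
by_id = root.find("s[@id='s9527']")

print(second is by_id)                     # same node reached by two paths
print(precedes(root, students[2]))         # the root element comes first
print(follows(students[2], students[1]))
```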