Web Data Management

Web Data Management XPath

In this lecture • Review of the XPath specification • data model • examples • syntax • Resources: • A formal semantics of patterns in XSLT, by Phil Wadler • XML Path Language (XPath), www.w3.org/TR/xpath

XPath • http://www.w3.org/TR/xpath (11/99) • Building block for other W3C standards: • XSL Transformations (XSLT) • XML Link (XLink) • XML Pointer (XPointer) • XML Query • Was originally part of XSL

XPath • An expression language to be used in another host language (e.g., XSLT, XQuery). • Allows the description of paths in an XML tree, and the retrieval of nodes that match these paths. • Can also be used for performing some (limited) operations on XML data.

Example for XPath Queries
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book price="55">
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>

Data Model for XPath • XPath expressions operate over XML trees, which consist of the following node types: • Document: the root node of the XML document; • Element: element nodes; • Attribute: attribute nodes, attached to an Element node (their parent) but not counted among its children; • Text: text nodes, i.e., leaves of the XML tree.

Data Model for XPath • [Figure: the bib document as a tree. Labels: the root; processing instruction; comment; the root element (bib); element (book, publisher, author, ...); attribute (Attr = "1"); text (Addison-Wesley, Serge Abiteboul).] • Much like the XQuery data model.

Data Model for XPath • The root node of an XML tree is the (unique) Document node; • The root element is the (unique) Element child of the root node; • A node has a name, or a value, or both • an Element node has a name, but no value; • a Text node has a value (a character string), but no name; • an Attribute node has both a name and a value. • Attributes are special! Attributes are not considered as first-class nodes in an XML tree. They must be addressed specifically, when needed.

XPath: Simple Expressions /bib/book/year Result: <year> 1995 </year> <year> 1998 </year> /bib/paper/year Result: empty (there were no papers)
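
The two queries above can be sketched with Python's standard xml.etree.ElementTree module, which implements only a subset of XPath. Assumptions: the bib document is embedded as a string with the whitespace around text values trimmed, and because ElementTree starts its search at the root element bib rather than at the document node, /bib/book/year is written as the relative path book/year.

```python
import xml.etree.ElementTree as ET

# The bib example, with text values trimmed for readability (an assumption).
BIB = """<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <author>Victor Vianu</author>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>"""

bib = ET.fromstring(BIB)  # the root element <bib>

# /bib/book/year, written relative to the root element
years = [year.text for year in bib.findall("book/year")]

# /bib/paper/year: no paper elements, so the result is empty
papers = bib.findall("paper/year")

print(years)
print(papers)
```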

XPath Tree Nodes • Seven node types: • root, element, attribute, text, comment, processing instruction and namespace • Namespace and attribute nodes have parent nodes, but are not children of those parent nodes. • The relationship between a parent node and a child node is containment. • Attribute nodes and namespace nodes describe their parent nodes.

XPath Tree Nodes
<?xml version = "1.0"?>
<html xmlns = "http://www.w3.org/TR/REC-html40">
  <head><title>Processing Instruction and Namespace Nodes </title></head>
  <?deitelprocessor example = "fig11_03.xml"?>
  <body>
    <deitel:book deitel:edition = "1" xmlns:deitel = "http://www.deitel.com/xmlhtp1">
      <deitel:title>XML How to Program</deitel:title>
    </deitel:book>
  </body>
</html>

XPath Tree Nodes • String-value: Each XPath tree node has a string representation that XPath uses to compare nodes. • The string-value of a text node consists of the character data contained in the node. • Document order: Nodes in an XPath tree are ordered by the position at which they appear in the original XML document. • The reverse document order is the reverse of this ordering. • The string-value of an element node is determined by concatenating the string-values of all of its descendant text nodes in document order. • For the element node html, this is the text of title followed by the text of deitel:title: Processing Instruction and Namespace Nodes XML How to Program • Nothing is inserted between the concatenated text nodes, and whitespace inside a text node is kept; the result above assumes that whitespace-only text nodes between elements have been stripped.

XPath Tree Nodes • For processing instructions, the string-value consists of the remainder of the processing instruction after the target, including whitespace, but excluding the ending ?> • The string-value for the processing instruction is • example = "fig11_03.xml" • Namespace-node string-values consist of the URI for the namespace. • The string-value for the deitel namespace declaration is http://www.deitel.com/xmlhtp1

XPath Tree Nodes • For the root node of the document, the string-value is also determined by concatenating the string-values of its text-node descendants in document order. • The string-value of the root node is therefore identical to the string-value calculated for the html element node. • The string-value for the edition attribute node consists of its value, which is 1. • The string-value for a comment node consists only of the comment's text, excluding the delimiters <!-- and -->. • The string-value for the second comment node is therefore: Processing instructions and namespaces.
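
The string-value rules above can be sketched on top of Python's standard xml.dom.minidom. This is an illustration under stated assumptions, not a full XPath implementation: the function name string_value and the small sample document are hypothetical, and namespace nodes are not modelled because the DOM does not expose them as nodes.

```python
from xml.dom import minidom, Node

def string_value(node):
    """Return the XPath string-value of a DOM node (sketch)."""
    if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        return node.data                    # character data of the text node
    if node.nodeType == Node.COMMENT_NODE:
        return node.data                    # text between <!-- and -->
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
        return node.data                    # everything after the target
    if node.nodeType == Node.ATTRIBUTE_NODE:
        return node.value                   # the attribute value
    # Document and element nodes: concatenate descendant text nodes
    # in document order; comments and processing instructions are skipped.
    return "".join(
        string_value(child)
        for child in node.childNodes
        if child.nodeType in (Node.ELEMENT_NODE, Node.TEXT_NODE,
                              Node.CDATA_SECTION_NODE)
    )

# Hypothetical sample document, smaller than the listing in the slides.
DOC = ('<doc><!-- a comment --><?proc example = "x.xml"?>'
       '<title>XML How </title><book edition="1">to Program</book></doc>')

dom = minidom.parseString(DOC)
root_element = dom.documentElement
comment, pi = root_element.childNodes[0], root_element.childNodes[1]
edition = root_element.getElementsByTagName("book")[0].getAttributeNode("edition")

print(repr(string_value(root_element)))
print(repr(string_value(comment)), repr(string_value(pi)))
```

Note that the trailing space inside the title text node survives the concatenation, which is why the two words are separated in the result.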

XPath Tree Nodes • Expanded-name: Certain nodes (i.e., element, attribute, processing instruction and namespace) also have an expanded-name that can be used to locate specific nodes in the XPath tree. • Expanded-names consist of both a local part and a namespace URI. • The local part for the element node html is therefore html. • If there is a prefix for the element node, the namespace URI of the expanded-name is the URI to which the prefix is bound. • If there is no prefix for the element node, the namespace URI of the expanded name is the URI for the default namespace.

XPath Tree Nodes • The local part of the expanded name for a processing instruction node corresponds to the target of the processing instruction in the XML document. • For processing instructions, the namespace URI of the expanded-name is null • The local part of the expanded-name for a namespace node corresponds to the prefix for the namespace, if one exists; or, if it is a default namespace, the local part is empty (i.e., the empty string). • The namespace URI of the expanded-name for a namespace node is always null.
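
As a rough illustration of expanded-names, Python's xml.etree.ElementTree stores element and attribute names in the form {namespace-URI}local-part. The helper expanded_name below is hypothetical, and the document is a shortened version of the listing shown earlier.

```python
import xml.etree.ElementTree as ET

XML = ('<html xmlns="http://www.w3.org/TR/REC-html40"><body>'
       '<deitel:book deitel:edition="1" '
       'xmlns:deitel="http://www.deitel.com/xmlhtp1">'
       '<deitel:title>XML How to Program</deitel:title>'
       '</deitel:book></body></html>')

def expanded_name(name):
    """Split ElementTree's '{uri}local' form into (namespace URI, local part)."""
    if name.startswith("{"):
        uri, local = name[1:].split("}", 1)
        return uri, local
    return None, name  # no namespace

html = ET.fromstring(XML)
book = html[0][0]  # html -> body -> deitel:book

# No prefix on html: the URI of the default namespace is used.
print(expanded_name(html.tag))
# Prefix deitel: the URI bound to that prefix is used.
print(expanded_name(book.tag))
# The prefixed attribute also carries the namespace URI.
print([expanded_name(key) for key in book.attrib])
```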


XPath: Axes • A location path is an expression that specifies how to navigate an XPath tree from one node to another. • A location path is composed of location steps, each of which is composed of an "axis," a "node test" and an optional "predicate." • Searching through an XML document begins at a context node in the XPath tree. • Searches through the XPath tree are made relative to this context node. • An axis indicates which nodes, relative to the context node, should be included in the search. • The axis also dictates the ordering of the nodes in the set. • Axes that select nodes that follow the context node in document order are called forward axes. • Axes that select nodes that precede the context node in document order are called reverse axes.

XPath Context • A step is evaluated in a specific context [<N1, N2, ..., Nn>, Nc], which consists of: • a context list <N1, N2, ..., Nn> of nodes from the XML tree; • a context node Nc belonging to the context list. • The context length n is a positive integer indicating the size of the context list; it is returned by the function last(); • The context node position c ∈ [1, n] is a positive integer indicating the position of the context node in the context list; it is returned by the function position().
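
A minimal sketch of the context triple, assuming the context list is an ordinary Python list. The function name contexts is hypothetical, and the strings stand in for XML nodes.

```python
def contexts(context_list):
    """Yield (context node, position(), last()) for each node of a context list."""
    last = len(context_list)                 # context length n
    for position, node in enumerate(context_list, start=1):
        yield node, position, last           # position c is in [1, n]

books = ["book-1", "book-2", "book-3"]       # placeholders for XML nodes

# Equivalent of the predicate [position() = last()]
selected = [node for node, position, last in contexts(books) if position == last]
print(selected)
```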

XPath Steps • The basic components of XPath expressions are steps, of the form: axis::node-test[P1][P2]. . . [Pn] • axis is an axis name indicating the direction of the step in the XML tree (child is the default). • node-test is a node test, indicating the kind of nodes to select. • Pi is a predicate, that is, any XPath expression, evaluated as a Boolean, indicating an additional condition. There may be no predicates at all. • A step is evaluated with respect to a context, and returns a node list.
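
To make the shape axis::node-test[P1]...[Pn] concrete, here is a small sketch that splits a single step into its three parts. The function parse_step and its regular expression are illustrative assumptions: they handle the default child axis and the @ abbreviation for the attribute axis, but not predicates that themselves contain square brackets.

```python
import re

# axis:: (optional), then the node test, then zero or more [predicate] groups
STEP = re.compile(
    r"^(?:(?P<axis>[a-z-]+)::)?(?P<test>[^\[\]]+)(?P<preds>(?:\[[^\[\]]*\])*)$"
)

def parse_step(step):
    """Return (axis, node test, list of predicate strings) for one step."""
    match = STEP.match(step)
    if match is None:
        raise ValueError(f"not a simple XPath step: {step!r}")
    axis = match.group("axis") or "child"        # child is the default axis
    test = match.group("test")
    if match.group("axis") is None and test.startswith("@"):
        axis, test = "attribute", test[1:]       # @name abbreviates attribute::name
    predicates = re.findall(r"\[([^\[\]]*)\]", match.group("preds"))
    return axis, test, predicates

print(parse_step("child::book[1][@price]"))
print(parse_step("author"))
print(parse_step("@price"))
print(parse_step("descendant::text()[1]"))
```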

Path Expressions • A path expression is of the form: [/]step1/step2/. . . /stepn • A path that begins with / is an absolute path expression; • A path that does not begin with / is a relative path expression. • Examples • /A/B is an absolute path expression denoting the Element nodes with name B, children of the root element named A; • ./B/descendant::text() is a relative path expression which denotes all the Text nodes that are descendants of an Element B, itself a child of the context node; • /A/B/@att1[. > 2] denotes all the Attribute nodes @att1 whose value is greater than 2.

Evaluation of Path Expressions • Each step stepi is interpreted with respect to a context; its result is a node list. • A step stepi is evaluated with respect to the context of stepi−1. More precisely: • For i = 1 (first step): if the path is absolute, the context is a singleton, the root of the XML tree; else (relative paths) the context is defined by the environment; • For i > 1: if N = <N1, N2, ..., Nn> is the result of step stepi−1, stepi is successively evaluated with respect to the context [N, Nj], for each j ∈ [1, n]. • The result of the path expression is the node set obtained after evaluating the last step.

Evaluation of Path Expressions • Evaluation of /A/B/@att1 • The path expression is absolute: the context consists of the root node of the tree. • The first step, A, is evaluated with respect to this context.

Evaluation of /A/B/@att1 • The result is A, the root element. • A is the context for the evaluation of the second step, B.

Evaluation of /A/B/@att1 • The result is a node list with two nodes B[1], B[2]. • @att1 is first evaluated with the context node B[1].

Evaluation of /A/B/@att1 • The result is the attribute node of B[1].

Evaluation of /A/B/@att1 • @att1 is also evaluated with the context node B[2].

Evaluation of /A/B/@att1 • The result is the attribute node of B[2].

Evaluation of /A/B/@att1 • Final result: the node set union of all the results of the last step, @att1.
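
The walkthrough above can be sketched in Python with xml.dom.minidom. Assumptions: the sample document is hypothetical (the tree from the slides is not reproduced here), only the child and attribute axes with a name test are implemented, and the union simply keeps nodes in the order they are found, which matches document order for this example.

```python
from xml.dom import minidom, Node

def child(node, name):
    """child::name, restricted to element children."""
    return [c for c in node.childNodes
            if c.nodeType == Node.ELEMENT_NODE and c.tagName == name]

def attribute(node, name):
    """attribute::name (the @name abbreviation)."""
    if node.nodeType != Node.ELEMENT_NODE:
        return []
    attr = node.getAttributeNode(name)
    return [attr] if attr is not None else []

def evaluate(document, steps):
    """Evaluate an absolute path given as a list of (axis function, name)."""
    context = [document]                     # absolute path: start at the root
    for axis, name in steps:
        result = []
        for node in context:                 # each node becomes the context node
            for selected in axis(node, name):
                if not any(selected is seen for seen in result):
                    result.append(selected)  # union without duplicates
        context = result                     # result feeds the next step
    return context

doc = minidom.parseString('<A><B att1="1"/><B att1="3"/><C att1="5"/></A>')

# /A/B/@att1
attrs = evaluate(doc, [(child, "A"), (child, "B"), (attribute, "att1")])
values = [a.value for a in attrs]
print(values)
```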


XPath: Axes • An axis has a principal node type that corresponds to the type of node the axis may select. • For the attribute axis, the principal node type is attribute. • For the namespace axis, the principal node type is namespace. • All other axes have element as their principal node type.

XPath: Axes • Child axis: denotes the Element or Text children of the context node. • Important: An Attribute node has a parent (the element on which it is located), but an attribute node is not one of the children of its parent. • Example: child::D

XPath: Axes • Parent axis: denotes the parent of the context node. • The node test is either an element name, or * which matches all names, or node() which matches all node types. • The result is always an Element or Document node, or an empty node-set (if the parent does not match the node test or does not satisfy a predicate). • .. is an abbreviation for parent::node(): the parent of the context node. • Example: parent::node()

XPath: Axes • Attribute axis: denotes the attributes of the context node. • The node test is either the attribute name, or * which matches all the names. • Example: attribute::*

XPath: Axes • Descendant axis: all the descendant nodes, except the Attribute nodes. • The node test is either the node name (for Element nodes), or * (any Element node) or text() (any Text node) or node() (all nodes). • The context node does not belong to the result: use descendant-or-self instead. • Example: descendant::node()

XPath: Axes • Example: descendant::*

XPath: Axes • Ancestor axis: all the ancestor nodes. • The node test is either the node name (for Element nodes), or node() (any Element node, and the Document root node). • The context node does not belong to the result: use ancestor-or-self instead. • Example: ancestor::node()

XPath: Axes • Following axis: all the nodes that follow the context node in document order, excluding its descendants. • Attribute nodes are not selected. • The node test is either the node name, *, text() or node(). • The axis preceding denotes all the nodes that precede the context node, excluding its ancestors. • Example: following::node()

XPath: Axes • Following sibling axis: all the nodes that follow the context node and share the same parent node. • Same node tests as descendant or following. • The axis preceding-sibling denotes all the siblings that precede the context node. • Example: following-sibling::node()
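
The axes described above can be sketched as plain functions over xml.dom.minidom nodes. This is an illustration only: the function names and the sample document are hypothetical, node tests and predicates are left out, and the reverse axes return their nodes nearest-first (reverse document order).

```python
from xml.dom import minidom, Node

def children(n):
    return list(n.childNodes)                # attributes are never children

def parent(n):
    p = n.ownerElement if n.nodeType == Node.ATTRIBUTE_NODE else n.parentNode
    return [p] if p is not None else []

def descendants(n):
    out = []
    for c in n.childNodes:                   # pre-order walk = document order
        out.append(c)
        out.extend(descendants(c))
    return out

def ancestors(n):
    out = []
    for p in parent(n):
        out.append(p)
        out.extend(ancestors(p))
    return out                               # nearest ancestor first

def following_siblings(n):
    out, s = [], n.nextSibling
    while s is not None:
        out.append(s)
        s = s.nextSibling
    return out

def preceding_siblings(n):
    out, s = [], n.previousSibling
    while s is not None:
        out.append(s)
        s = s.previousSibling
    return out                               # nearest sibling first

def following(n):
    """Nodes after n in document order, excluding descendants of n."""
    out = []
    for a in [n] + ancestors(n):
        for s in following_siblings(a):
            out.append(s)
            out.extend(descendants(s))
    return out

def names(nodes):
    return [n.nodeName for n in nodes]

doc = minidom.parseString("<A><B><D/><E/></B><C><F/></C></A>")
a = doc.documentElement
b = a.firstChild
d, e = b.firstChild, b.lastChild

print(names(descendants(a)))
print(names(ancestors(d)))
print(names(following(d)))
```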

Location Path Abbreviations

XPath: Node Tests • The set of selected nodes is refined with node tests. • Node tests rely upon the principal node type of an axis for selecting nodes in a location path.

XPath: Axes • Location Paths Using Axes and Node Tests • Location paths are composed of sequences of location steps. • A location step contains an axis and a node test separated by a double-colon (::) and, optionally, a "predicate" enclosed in square brackets ([]). child::* • The above location path selects all element-node children of the context node, because the principal node type for the child axis is element.

XPath: Wildcard //author/child::* or //author/* Result: <first-name> Rick </first-name> <last-name> Hull </last-name> * Matches any element

XPath: Axes and Node Tests • child::text() • selects all text-node children of the context node • Combining two location steps to form the location path • child::*/child::text() • selects all text-node grandchildren of the context node

XPath: Node Tests /bib/book/author/text() Result: Serge Abiteboul Victor Vianu Jeffrey D. Ullman • Rick Hull doesn't appear, because his author element contains first-name and last-name elements rather than text. /bib/book/author/*/text() Result: Rick Hull
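
A sketch of these two queries using xml.dom.minidom. Assumptions: the helpers child_elements and child_text are hypothetical, and the bib document is reduced to its author elements and written without whitespace between tags, so that no whitespace-only text nodes appear in the results.

```python
from xml.dom import minidom, Node

BIB = ("<bib><book><author>Serge Abiteboul</author>"
       "<author><first-name>Rick</first-name><last-name>Hull</last-name></author>"
       "<author>Victor Vianu</author></book>"
       "<book price='55'><author>Jeffrey D. Ullman</author></book></bib>")

def child_elements(nodes, name="*"):
    """child::name (or child::*) applied to every node of a node list."""
    return [c for n in nodes for c in n.childNodes
            if c.nodeType == Node.ELEMENT_NODE and name in ("*", c.tagName)]

def child_text(nodes):
    """child::text() applied to every node of a node list."""
    return [c for n in nodes for c in n.childNodes
            if c.nodeType == Node.TEXT_NODE]

doc = minidom.parseString(BIB)
authors = child_elements(child_elements(child_elements([doc], "bib"), "book"),
                         "author")

# /bib/book/author/text(): only text nodes that are direct children of author
direct = [t.data for t in child_text(authors)]

# /bib/book/author/*/text(): text nodes one element level below author
nested = [t.data for t in child_text(child_elements(authors))]

print(direct)
print(nested)
```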

XPath: Restricted Kleene Closure • Select all author element nodes in an entire document: /descendant-or-self::node()/child::author • Instead use the abbreviation: //author Result: <author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <author> Jeffrey D. Ullman </author> /bib//first-name Result: <first-name> Rick </first-name>
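
The expansion of // can be checked with Python's xml.etree.ElementTree, under some assumptions: the bib document is reduced to its author elements, Element.iter() stands in for descendant-or-self::node() restricted to element nodes, and the search starts at the root element bib, so //author is written .//author.

```python
import xml.etree.ElementTree as ET

BIB = ("<bib><book><author>Serge Abiteboul</author>"
       "<author><first-name>Rick</first-name><last-name>Hull</last-name></author>"
       "<author>Victor Vianu</author></book>"
       "<book price='55'><author>Jeffrey D. Ullman</author></book></bib>")

bib = ET.fromstring(BIB)

# Long form: descendant-or-self, then child::author
expanded = [c for n in bib.iter() for c in n if c.tag == "author"]

# Abbreviated form
abbreviated = bib.findall(".//author")

# /bib//first-name
first_names = [e.text for e in bib.findall(".//first-name")]

print(len(expanded), len(abbreviated))
print(first_names)
```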

XPath: Attribute Nodes /bib/book/@price Result: “55” @price means that price has to be an attribute

XPath: Predicates • A predicate is a Boolean expression, built with tests and the Boolean connectors and/or (negation is expressed with the not() function); • a test is • either an XPath expression, whose result is converted to a Boolean; • or a comparison or a call to a Boolean function. • Important: predicate evaluation requires several rules for converting nodes and node sets to the appropriate type.

Predicate Evaluation • A step is of the form axis::node-test[P] • First, axis::node-test is evaluated: one obtains an intermediate result I • Second, for each node in I, P is evaluated: the step result consists of those nodes in I for which P is true. • Example: /A/B/descendant::text()[1]

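Predicate Evaluation • Beware: an XPath step is always evaluated with respect to the context of the previous step. • For /A/B/descendant::text()[1], the result consists of the Text nodes that are the first Text descendant (in document order) of a node B. • By contrast, /A/B//text()[1] abbreviates /A/B/descendant-or-self::node()/child::text()[1]: it selects every Text node that is the first Text child of its parent, for each parent among the B nodes and their descendants.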
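
The difference between /A/B/descendant::text()[1] and /A/B//text()[1] can be sketched with xml.dom.minidom. Assumptions: the sample document and the helper names are hypothetical, and the text values t1 to t4 are chosen so that sorting them gives document order.

```python
from xml.dom import minidom, Node

DOC = "<A><B>t1<C>t2</C></B><B><C>t3</C>t4</B></A>"

def descendants(n):
    out = []
    for c in n.childNodes:                   # pre-order walk = document order
        out.append(c)
        out.extend(descendants(c))
    return out

def text_children(n):
    return [c for c in n.childNodes if c.nodeType == Node.TEXT_NODE]

def text_descendants(n):
    return [d for d in descendants(n) if d.nodeType == Node.TEXT_NODE]

doc = minidom.parseString(DOC)
bs = [c for c in doc.documentElement.childNodes
      if c.nodeType == Node.ELEMENT_NODE and c.tagName == "B"]

# /A/B/descendant::text()[1]
# [1] is applied to the text descendants of each B: one node per B at most.
first_desc = [text_descendants(b)[0].data for b in bs if text_descendants(b)]

# /A/B//text()[1] = /A/B/descendant-or-self::node()/child::text()[1]
# [1] is applied to the text children of B and of each of its descendants.
first_child = []
for b in bs:
    for n in [b] + descendants(b):
        texts = text_children(n)
        if texts:
            first_child.append(texts[0].data)
first_child = sorted(first_child)            # document order for this sample

print(first_desc)
print(first_child)
```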
