Managing XML and Semistructured Data

Managing XML and Semistructured Data Lecture : Indexes

OEM vs. XML • OEM objects correspond to elements in XML. • Sub-elements in XML are inherently ordered. • XML elements may optionally include a list of attribute-value pairs. • Graph structure (nodes with multiple incoming edges) is expressed in XML with references (ID and IDREF attributes), e.g. the project attribute.

OEM to XML • Example: • <member project="&5 &6"> <name>Jones</name> <age>46</age> <office> <building>gates</building> <room>252</room> </office> </member> • This corresponds to the rightmost member in the example OEM, where project is an attribute.

Select x From A.B x Where exists y in x.C: y = 5

In this lecture • Indexes • XSet • Region algebras • Indexes for Arbitrary Semistructured Data • DataGuides • 1-2 indexes Resources • Index Structures for Path Expressions, by Milo and Suciu, in ICDT'99 • XSet description: http://www.openhealth.org/XSet/ • Data on the Web, by Abiteboul, Buneman, Suciu: section 8.2

The problem • Input: large, irregular data graph • Output: index structure for evaluating regular path expressions

The Data Semistructured data instance = a large graph

The queries
• Regular path expressions (using Lorel-like syntax):
SELECT X FROM (Bib.*.author).(lastname|firstname).Abiteboul X
• Select x from part._*.supplier.name x
Requires traversing the data from the root and returning all nodes x reachable by a path matching the given path expression.
• Select X From part._*.supplier: {name: X, address: "Philadelphia"}
Needs an index on values to narrow the search to the parts of the database that contain the string "Philadelphia".

Analyzing the problem • what kind of data • tree data (XML): easier to index • graph data: used in more complex applications • what kind of queries • restricted regular expressions (e.g. XPath): may be more efficient

XSet: a simple index for XML • Part of the Ninja project at Berkeley • Example XML data:

XSet: a simple index for XML Each node = a hashtable Each entry = list of pointers to data nodes (not shown)

XSet: Efficient query evaluation
(R1) SELECT X FROM part.name X - yes
(R2) SELECT X FROM part.supplier.name X - yes
(R3) SELECT X FROM *.supplier.name X - maybe
(R4) SELECT X FROM part.*.subpart.name X - maybe
• To evaluate R1, look up part in the root hash table h1, follow the link to table h2, then look up name.
• R4: following part leads to h2; traverse all index nodes below h2 (corresponding to *), then continue with the path subpart.name from each of them. Thus the entire subtree dominated by h2 is explored. This is efficient if the index is small and fits in memory.
• R3: the leading wildcard forces us to consider all nodes in the index tree, which is less efficient than the evaluation of R4.
• Remedy: index the index itself. Retrieve all hash tables that contain a supplier entry and continue a normal search from there.
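The lookups for R1 to R4 can be sketched with nested hash tables. This is only an illustration under stated assumptions, not the XSet implementation: the class `IndexNode`, the functions `build_index` and `evaluate`, the input format (label path plus data node id), and the sample data are all hypothetical.

```python
class IndexNode:
    """One hash table of the index tree: label -> child table, plus an extent."""
    def __init__(self):
        self.children = {}   # label -> IndexNode
        self.extent = []     # pointers (here: ids) to data nodes

def build_index(paths):
    """Build the index from (label_path, data_node_id) pairs (assumed input format)."""
    root = IndexNode()
    for labels, node_id in paths:
        table = root
        for label in labels:
            table = table.children.setdefault(label, IndexNode())
        table.extent.append(node_id)
    return root

def subtree(table):
    """All index nodes at or below `table`: what a wildcard * has to visit."""
    yield table
    for child in table.children.values():
        yield from subtree(child)

def evaluate(root, query):
    """Evaluate a list of labels, where '*' matches any sequence of labels."""
    frontier = [root]
    for step in query:
        if step == "*":
            candidates = [t for f in frontier for t in subtree(f)]
        else:
            candidates = [f.children[step] for f in frontier if step in f.children]
        seen, frontier = set(), []
        for t in candidates:          # drop duplicates, keep order
            if id(t) not in seen:
                seen.add(id(t))
                frontier.append(t)
    return sorted(n for t in frontier for n in t.extent)

# Hypothetical sample data: one entry per data node, keyed by its label path.
sample_paths = [
    (("part",), "p1"),
    (("part", "name"), "n1"),
    (("part", "supplier"), "s1"),
    (("part", "supplier", "name"), "n2"),
    (("part", "subpart"), "sp1"),
    (("part", "subpart", "name"), "n3"),
    (("part", "subpart", "subpart"), "sp2"),
    (("part", "subpart", "subpart", "name"), "n4"),
]
xset_root = build_index(sample_paths)
r1 = evaluate(xset_root, ["part", "name"])
r3 = evaluate(xset_root, ["*", "supplier", "name"])
r4 = evaluate(xset_root, ["part", "*", "subpart", "name"])
```

R1 touches two hash tables, R4 visits every table below part, and R3 visits every table in the index, which mirrors the cost ordering described on the slide.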

Region Algebras • Structured text = text with tags (like XML) • New Oxford English Dictionary • Critical limitation: ordered data only (like text) • Assume: data is given as an XML text file, with the implicit ordering of the file. • Less critical limitation: restricted regular expressions

Region Algebras: Definitions
• data = sequence of characters [c_1 c_2 c_3 …]
• region = segment of the text in a file
• representation: (x, y) = [c_x, c_(x+1), …, c_y], where x is the start position and y the end position of the region
• example: <section> … </section>
• region set = a set of regions such that any two regions are either disjoint or one is included in the other
• example: all <section> regions (may be nested)
• Tree data: each node defines a region and each set of nodes defines a region set.
• example: region p2 consists of the text under p2; the set {p2, s2, s1} is a region set with three regions
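The nesting condition in the definition of a region set can be checked directly. A minimal sketch, assuming regions are (start, end) tuples with inclusive character positions; the function name `is_region_set` is hypothetical.

```python
def is_region_set(regions):
    """True if any two regions are either disjoint or one includes the other."""
    ordered = sorted(regions)                 # by start position, then end position
    for i, (x1, y1) in enumerate(ordered):
        for x2, y2 in ordered[i + 1:]:        # here x1 <= x2 always holds
            disjoint = y1 < x2
            nested = y2 <= y1 or x1 == x2     # second inside first, or same start
            if not (disjoint or nested):
                return False
    return True

nested_ok = is_region_set({(0, 100), (10, 20), (30, 90), (40, 50)})
overlap_bad = is_region_set({(0, 10), (5, 15)})
```

Regions that come from well-formed tags always pass this check, because tags are properly nested.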

Representation of a region set • Example: the <subpart> region set: • region algebra = operators on region sets; s1 op s2 defines a new region set

Region algebra: some operators
• s1 intersect s2 = {r | r ∈ s1, r ∈ s2}
• s1 included s2 = {r | r ∈ s1, ∃ r′ ∈ s2, r ⊆ r′}
• s1 including s2 = {r | r ∈ s1, ∃ r′ ∈ s2, r ⊇ r′}
• s1 parent s2 = {r | r ∈ s1, ∃ r′ ∈ s2, r is a parent of r′}
• s1 child s2 = {r | r ∈ s1, ∃ r′ ∈ s2, r is a child of r′}
Examples:
<subpart> included <part> = {s1, s2, s3, s5}
<part> including <subpart> = {p2, p3}
<name> child <part> = {n1, n3, n12}
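The operators can be sketched over Python sets. Assumptions in this sketch: regions are (start, end) tuples, and parent and child take an extra `universe` argument (the set of all regions in the document) so that the direct parent can be told apart from an ancestor. The function names and the sample regions are hypothetical.

```python
def contains(outer, inner):
    """True if region `inner` lies within region `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def intersect(s1, s2):
    return s1 & s2

def included(s1, s2):
    return {r for r in s1 if any(contains(q, r) for q in s2)}

def including(s1, s2):
    return {r for r in s1 if any(contains(r, q) for q in s2)}

def is_parent(r, c, universe):
    """r is the parent of c: r contains c and no other region lies in between."""
    if r == c or not contains(r, c):
        return False
    return not any(m != r and m != c and contains(r, m) and contains(m, c)
                   for m in universe)

def parent(s1, s2, universe):
    return {r for r in s1 if any(is_parent(r, q, universe) for q in s2)}

def child(s1, s2, universe):
    return {r for r in s1 if any(is_parent(q, r, universe) for q in s2)}

# Hypothetical document: <part><name/><subpart><name/></subpart></part>
PART, NAME1, SUBPART, NAME2 = (0, 100), (5, 15), (20, 90), (25, 35)
universe = {PART, NAME1, SUBPART, NAME2}
part_set, name_set, subpart_set = {PART}, {NAME1, NAME2}, {SUBPART}

names_in_part = included(name_set, part_set)          # both names
names_of_part = child(name_set, part_set, universe)   # only the direct child
```

The difference between `included` and `child` on the same inputs is the difference between a descendant step and a child step in a path expression.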

From path expressions to region expressions
• Use region algebra operators to answer regular path expressions.
• Only restricted forms of regular path expressions can be translated into region algebra operators: expressions of the form R1.R2…Rn, where each Ri is either a label constant or the Kleene closure *.
part.name → name child (part child root)
part.supplier.name → name child (supplier child (part child root))
*.supplier.name → name child supplier
part.*.subpart.name → name child (subpart included (part child root))
• Region expressions correspond to simple XPath expressions.
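The four translations on this slide follow one pattern: a label that follows another label becomes a child step, and a label that follows * becomes an included step. A minimal sketch of that pattern, producing the region expression as a string; the function name `to_region_expr` is hypothetical, and a trailing * is deliberately left unhandled.

```python
def to_region_expr(path):
    """Translate R1.R2...Rn (labels or '*') into a region algebra expression."""
    expr = "root"            # the region covering the whole document
    after_star = False
    for step in path.split("."):
        if step == "*":
            after_star = True
            continue
        if after_star and expr == "root":
            expr = step      # every region is included in root, so no operator is needed
        else:
            op = "included" if after_star else "child"
            rhs = f"({expr})" if " " in expr else expr
            expr = f"{step} {op} {rhs}"
        after_star = False
    if after_star:
        raise ValueError("a trailing * is not handled by this sketch")
    return expr

example_expr = to_region_expr("part.*.subpart.name")
```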

From path expressions to region expressions
• Answering more complex queries:
Select X From *.subpart: {name: X, *.supplier.address: "Philadelphia"}
• This translates into the following region algebra expression:
name child (subpart including (supplier parent (address intersect "Philadelphia")))
• "Philadelphia" denotes a region set consisting of all regions corresponding to the word "Philadelphia" in the text.
• Such a region set can be computed dynamically using a full-text index.

Indexes for Arbitrary Semistructured Data • A semistructured data instance that is a DAG

Indexes for Arbitrary Semistructured Data • The data represents employees and projects in a company. • Two kinds of employees: programmers and statisticians • Three kinds of links to projects: leads, workson, consults • Index graph: a reduced graph that summarizes all paths from the root in the data graph • Example: node p1 is reached from the root by paths labeled with the following five sequences: project, employee.leads, employee.workson, programmer.employee.leads, programmer.employee.workson • Node p2 is reached from the root by paths labeled with the same five sequences • Hence p1 and p2 are language-equivalent

Indexes for Arbitrary Semistructured Data
• For each node x in the data graph G, L_x = {w | ∃ a path from the root to x labeled w}
• Note that L_x will be infinite if the graph has a cycle!
• Two nodes x and y are language-equivalent: x ≡ y ⇔ L_x = L_y
• Equivalence class of x: [x] = {y | x ≡ y}
• The index I is the graph with
Nodes(I) = {[x] | x ∈ Nodes(G)}
Edges(I) = {[x] -a-> [y] | x′ ∈ [x], y′ ∈ [y], (x′ -a-> y′) ∈ Edges(G)}
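For an acyclic data graph the definitions can be applied literally: enumerate L_x for every node, group nodes with equal languages, and lift the edges to the classes. A minimal sketch, assuming the graph is a DAG given as (source, label, target) triples; it enumerates every path, so it illustrates the definition rather than an efficient construction. The function names and the small sample graph are hypothetical.

```python
from collections import defaultdict

def label_languages(edges, root):
    """L_x for every reachable node of a DAG: label sequences of root-to-x paths."""
    out = defaultdict(list)
    for src, label, dst in edges:
        out[src].append((label, dst))
    langs = defaultdict(set)
    stack = [(root, ())]
    while stack:                       # terminates only because the graph is acyclic
        node, word = stack.pop()
        langs[node].add(word)
        for label, dst in out[node]:
            stack.append((dst, word + (label,)))
    return langs

def build_language_index(edges, root):
    """Return (Nodes(I), Edges(I)); each index node is a frozenset of data nodes."""
    langs = label_languages(edges, root)
    by_language = defaultdict(set)
    for x, lang in langs.items():
        by_language[frozenset(lang)].add(x)
    class_of = {x: frozenset(by_language[frozenset(lang)]) for x, lang in langs.items()}
    index_nodes = set(class_of.values())
    index_edges = {(class_of[s], a, class_of[d]) for s, a, d in edges if s in class_of}
    return index_nodes, index_edges

# Hypothetical data graph, much smaller than the one on the slide.
data_edges = [
    ("root", "employee", "e1"), ("root", "employee", "e2"),
    ("e1", "leads", "p1"), ("e2", "leads", "p2"),
    ("root", "project", "p1"), ("root", "project", "p2"),
    ("root", "project", "p3"),
]
index_nodes, index_edges = build_language_index(data_edges, "root")
```

In this sample e1 and e2 collapse into one index node, as do p1 and p2, while p3 stays alone because no employee.leads path reaches it.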

Indexes for Arbitrary Semistructured Data • We have the following equivalences: e1 ≡ e2; e3 ≡ e4 ≡ e5; p1 ≡ p2; p3 ≡ p4; p5 ≡ p6 ≡ p7

Indexes for Arbitrary Semistructured Data
• Computing path expression queries:
1. Compute the query on I and obtain a set of index nodes.
2. Compute the union of their extents; the extent of an index node is a list of pointers to all data nodes in its equivalence class.
• Example: Select X From statistician.employee.(leads|consults): X
Evaluated on I, the query returns the index nodes h8 and h9. Their extents are [p5, p6, p7] and [p8], respectively, so the result set is [p5, p6, p7, p8].
• Always: size(I) ≤ size(G)
• Efficient when I can be stored in main memory
• Checking x ≡ y is expensive.
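The two evaluation steps can be sketched as follows. Only h8, h9, and their extents come from the slide; the other index node names (h0, h3, h5), the edge list, and the function name `eval_on_index` are hypothetical stand-ins for the figure, and the query is assumed to be a sequence of steps where each step is a set of alternative labels.

```python
def eval_on_index(index_edges, extents, root, steps):
    """Step 1: match the query on the index. Step 2: union the extents."""
    frontier = {root}
    for labels in steps:
        frontier = {dst for src, a, dst in index_edges
                    if src in frontier and a in labels}
    result = []
    for h in sorted(frontier):
        result.extend(extents[h])
    return sorted(frontier), result

# Assumed fragment of the index graph I.
index_fragment = [
    ("h0", "statistician", "h3"),
    ("h3", "employee", "h5"),
    ("h5", "leads", "h8"),
    ("h5", "consults", "h9"),
]
extents = {"h8": ["p5", "p6", "p7"], "h9": ["p8"]}

# statistician.employee.(leads|consults)
query = [{"statistician"}, {"employee"}, {"leads", "consults"}]
matched, answer = eval_on_index(index_fragment, extents, "h0", query)
```

The data graph is never traversed: only the index is searched, and the data nodes are read off the extents at the end.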

DataGuides • Goldman & Widom [VLDB 97] • graph data • arbitrary regular expressions

DataGuides • Definition: given a semistructured data instance DB, a DataGuide for DB is a graph G such that: - every path in DB also occurs in G - every path in G occurs in DB - every path in G is unique

Dataguides Example:

DataGuides • Multiple DataGuides for the same data:

DataGuides • Definition: let w, w′ be two words (i.e. word queries) and G a graph; then w ≡_G w′ if w(G) = w′(G) • Definition: G is a strong DataGuide for a database DB if ≡_G is the same as ≡_DB

DataGuides Example: • G1 is a strong DataGuide • G2 is not strong: person.project ≢_DB dept.project, but person.project ≡_G2 dept.project

DataGuides
• Constructing the strong DataGuide G:
Nodes(G) = {{root}}
Edges(G) = ∅
while changes do
  choose s in Nodes(G), a in Labels
  let s′ = {y | x in s, (x -a-> y) in Edges(DB)}
  if s′ is not empty: add s′ to Nodes(G) and add (s -a-> s′) to Edges(G)
• Use a hash table for Nodes(G)
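The construction is a subset construction over the data graph, and can be sketched with a worklist in place of the "while changes" loop. A minimal sketch, assuming the database is given as (source, label, target) triples; each DataGuide node is a frozenset of data nodes (its extent), and a Python set plays the role of the hash table for Nodes(G). The function name and the sample database are hypothetical.

```python
from collections import deque

def strong_dataguide(edges, root):
    """Return (Nodes(G), Edges(G)) of the strong DataGuide of a data graph."""
    out = {}                                   # src -> label -> set of targets
    for src, a, dst in edges:
        out.setdefault(src, {}).setdefault(a, set()).add(dst)

    start = frozenset([root])
    nodes = {start}                            # hash table of DataGuide nodes
    guide_edges = set()
    queue = deque([start])
    while queue:
        s = queue.popleft()
        targets = {}                           # label a -> s' = {y | x in s, x -a-> y}
        for x in s:
            for a, ys in out.get(x, {}).items():
                targets.setdefault(a, set()).update(ys)
        for a, ys in targets.items():
            s_prime = frozenset(ys)
            guide_edges.add((s, a, s_prime))
            if s_prime not in nodes:           # only new target sets are expanded
                nodes.add(s_prime)
                queue.append(s_prime)
    return nodes, guide_edges

# Hypothetical database: two persons and one department pointing at projects.
db_edges = [
    ("root", "person", "a"), ("root", "person", "b"), ("root", "dept", "d"),
    ("a", "project", "p1"), ("b", "project", "p2"), ("d", "project", "p1"),
]
guide_nodes, guide_edges = strong_dataguide(db_edges, "root")
```

Here person.project reaches {p1, p2} while dept.project reaches {p1}, so the two paths lead to different DataGuide nodes. The loop terminates on cyclic data as well, since there are only finitely many sets of data nodes, but the number of such sets can grow exponentially with the size of the database.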

DataGuides • How large are the DataGuides? • If DB is a tree, then size(G) <= size(DB). Why? Because every node of DB is in exactly one extent of G. • In this case the DataGuide coincides with the XSet index. • DataGuides usually fail on data with cyclic schemas, like:

T-Indexes • Milo & Suciu [ICDT 99] • 1-index: • data graph • arbitrary regular expressions • 2-index, T-index: for more complex queries, consisting of several regular expressions.

T-Indexes • T-index: template index • Trades space for generality • The class of paths associated with a given T-index is specified by a path template • Example 1: P x P y. Here P can be replaced by any regular expression. • Example 2: (*.Restaurant) x P y. The first regular expression is fixed; this T-index takes less space but is less general. • T-indexes can be generated efficiently. • The size of a T-index associated with a single regular expression is at most linear in the size of the database.

1-Indexes
• Database: DB = (V, E, Roots), where V is a finite set of nodes, E is a set of labeled edges, and Roots is a set of root nodes.
• Regular path expressions: P ::=  |  | ƒ | (P|P) | (P.P) | P.* where ƒ ranges over formulas defined over predicates p1, p2, … on the set of data values.
• A path: p = v0 -a1-> v1 -a2-> v2 … vn-1 -an-> vn
• Queries: regular path expressions q(DB)
• A query path is an expression of the form P1 x1 P2 x2 … Pn xn, where the xi are variable names and the Pi are path expressions.
• A query has the form Select x1, x2, …, xn from P1 x1 P2 x2 … Pn xn

1-Indexes
• Path template t = T1 x1 T2 x2 … Tn xn, where each Ti is a regular expression or one of the placeholders P or F
• Instantiating query paths: a query path q is obtained by instantiating each P and F in template t with a regular path expression and a formula, respectively
• Example: path template t = (*.Restaurant) x1 P x2 Name x3 F x4
• Query path instantiations:
q1 = (*.Restaurant) x1 * x2 Name x3 Fridays x4
q2 = (*.Restaurant) x1 * x2 Name x3 _ x4 ( _ is a predicate that is always true)
q3 = (*.Restaurant) x1 (  | _ ) x2 Name x3 Fridays x4

1-Indexes
• Goal: efficiently compute queries q ∈ inst(P x)
• A first attempt:
• L_u is the set of words labeling the paths from a root to u, that is, all the path queries that lead to u:
∀u ∈ V. L_u = {a1…an | ∃ v0 -a1-> … -an-> vn in DB, v0 ∈ Roots, vn = u}
∀u, v ∈ V. u ≡ v ⇔ L_u = L_v
That is, u and v are indistinguishable by path queries from the root.
∀u ∈ V. [u] = {v | u ≡ v} is the equivalence class containing u
