Lore: A Database Management System for Semistructured Data

Lore: A Database Management System for Semistructured Data

Why? • Although data may exhibit some structure it may be too varied or irregular to map to a fixed schema. • Relational DBMS might use null values in this case. • May be difficult to decide in advance on a specific schema. • Data elements may change types. • Structure changes a lot (lots of schema modifications).

Semistructured Data • Examples: • Data from the web • Overall site structure may change often. • It would be nice to be able to query a web site. • Data integrated from multiple, heterogeneous data sources. • Information sources change, or new sources added.

Object Exchange Model (OEM) • Data in this model can be thought of as a labeled directed graph. • Schema-less and self-describing. • Vertices in graph are objects. • Each object has a unique object identifier (oid), such as &5. • Atomic objects have no outgoing edges and are types such as int, real, string, gif, java, etc. • All other objects that have outgoing edges are called complex objects.

OEM (Cont.) • Examples: • Object &3 is complex, and its subobjects are &8, &9, &10, and &11. • Object &7 is atomic and has value “Clark”. • DBGroup is a name that denotes object &1.(Names are entry points into the database).

OEM to XML • Example: • <Member project=“&5 &6”> <name>Jones</name> <age>46</age> <office> <building>gates</building> <room>252</room> </office></member> • This corresponds to rightmost member in the example OEM, where project is an attribute.

Lorel Query Language • Need query language that supports path expressions for traversing graph data and handling of ‘typeless’ data. • A simple path expression is a name followed by a sequence of labels. • DBGroup.Member.Office. • Set of objects that can be reached starting with the DBGroup object, following edges labels member and then office.

Lorel (cont.) • Example: • select DBGroup.Member.Officewhere DBGroup.Member.Age < 30 • Result: • Office “Gates 252” • Office Building “CIS” Room “411”

Lorel Query Rewrite • Previous query rewritten to: • select Ofrom DBGroup.Member M, M.Office Owhere exists y in M.Age : y < 30 • Comparison on age transformed to existential condition. • Since all properties are set-valued in OEM. • A user can ask DBGroup.Member.Age < 30 regardless of whether Age is single valued, set valued, or unknown.

Lorel Query Rewrite • Why? • Breaking query into simple path expressions necessary for query optimization. • Need to explicitly handle coercion. • Atomic objects and values. 0.5 < “0.9” should return true • Comparing objects and sets of objects. DBGroup.Member.Age is a set of objects.

Lorel (cont.) • General path expressions are loosely specified patterns for labels in the database.(‘|’ disjunction, ‘?’ label pattern optional) • Example: • select DBGroup.Member.Namewhere DBGroup.Member.Office(.Room%|.Cubicle)? like “%252” • Result: • Name “Jones”Name “Smith”

Query and Update Processing • Query is parsed • Parse tree is preprocessed and translated to new OQL-like query. • Query plan constructed. • Query optimization. • Opt. query plan executed.

System Architecture

Iterators and Object Assignments • Use recursive iterator approach: • execution begins at top of query plan • each node in the plan requests a tuple at a time from its children and performs some operation on the tuple(s). • pass result tuples up to parent.

Object Assignments (OAs) • OA is a data structure containing slots for range variables with additional slots depending on the query. • Each slot within an OA will holds the oid of a vertex on a path being considered by the query engine. • Example: if OA1 holds oid for “Smith” then OA2 and OA3 can hold the oids for one of Smiths Office objects and Age objects.

Query Operators • For example, the Scan operator returns all oids that are subobjects of a given object following a specified path expression. • Scan (StartingOASlot, Path_expression, TargetOASlot) • For each oid in StartingOASlot, check to see if object satisfies path_expression and place oid into TargetOASlot. • Other operators include Join, Project, Select, Aggregation, etc. • Join node like nested-loop join in relational DBMS.

Query Optimization • Does only a few optimizations: • Push selection ops down query tree. • Eliminate/combine redundant query operators. • Explores query plans that use indexes where possible. • Two kinds of indexes: • Lindex (link index) provide parent pointers impl. as hashing. • Vindex (value index) impl. as B+-trees

Indexes • Because of non-strict typing system, have String Vindex, Real Vindex, and String-coerced-to-real Vindex. • Separate B-Trees for each type are constructed. • Using Vindex for comparison (e.g. Age < 30) consider the following: • If type is string, do lookup in String Vindex • If can convert to real the do lookup in String-coerced-to-real Vindex. • If type is real?

Other issues • Update query operator example: • Update(Create_Edge, OA1, OA5, “Member”) • Create edge from results in OA1 to OA5 labeled “Member”. • Lore arranges objects in physical disk pages, each page with a number of slots with a single object in each slot. • Objects placed according to first-fit algorithm. • Supports large objects spanning multiple pages. • Objects clustered in depth-first manner (since Scan traverses depth-first). • Garbage collector removes unreachable objects.

External Data Manager • Enables retrieval of information from other data sources, transparent to the user. • An external object in Lore is a “placeholder” for the external data and specifies how lore interacts with an external data source. • The spec for an external object includes: • Location of a wrapper program to fetch and convert data to OEM, time interval until fetched information becomes stale, and a set of arguments used to limit info fetched from external source.

Data Guides • A DataGuide is a concise and accurate summary of the structure of an OEM database (stored as OEM database itself, kind of like the system catalog). • Why? • No explicit schema, how do we formulate meaningful queries? • Large databases (can’t just view graph structure). • What if a path expression doesn’t exist (waste). • Each possible path expression is encoded once.

{9, 13}

DataGuides As Histograms • Each object in the dataguide can have a link to its corresponding target set. • A target set is a set of oids reachable by that path. • TS of DBGroup.Member.Age is {9, 13}. • This is a path index. Can find set of objects reachable by a particular path. • Can store statistics in DataGuide (more in next paper). • For example, the # of atomic objects of each type reachable by p.

Conclusions • Takes advantage of the structure where it exists. • Handles lack of structure well (data type coercion, general path expressions). • Query language allows users to get and update data from semistructured sources. • DataGuide allows users to determine what paths exist, and gives useful statistical information

Query Optimization for Semistructured Data

OEM vs. XML • OEM’s objects correspond to elements in XML • Sub-elements in XML are inherently ordered. • XML elements may optionally include a list of attribute value pairs. • Graph structure for multiple incoming edges specified in XML with references (ID, IDREF attributes). i.e. the Project attribute.

Indexes • Vindex(op, value, l, x) places into x all atomic objects that satisfy the “op value” condition with an incoming edge labeled l. • Vindex(“Age”, <, 30,y) places into y objects with age < 30. • Lindex(x, l, y) places into x all objects that are parents of y via edge labeled l. • Lindex(x, “Age”, y) places into x all parents of y via label “Age”.

Indexes (cont.) • Bindex(l, x, y) finds all parent-child object pairs connected by a label l. • Bindex(“Age”, x, y) locates all parent-child pairs with label Age. • Pindex(PathExpression, x) placed into x all objects reachable via the path expression. • Pindex(“A.B x, x.C y”, y) places into y all objects reachable by going from A to B to C. • Uses DataGuide.

Simple Query • select Ofrom DBGroup.Member M, M.Office Owhere exists y in M.Age : y < 30 • Possible plans: • Top-down (similar to pointer-chasing, nested-loops join) • Use Vindex to check y < 30, traverse backwards from child to parent using Lindex(bottom-up). • Hybrid, both top down and bottom up. Meet in middle.

Select xFrom A.B xWhere exists y in x.C: y = 5

Query Plan Generation (Overview) • Logical query plan generator creates high-level execution strategy. • Physical query plan enumerator uses statistics and a cost model to transform logical query plan into an estimated best physical plan that lies within their search space.

Logical Query Plans (cont.) • Glue node represents a ‘rotation point’ that has as its children two independent subplans. • Rotating the order between independent components yields different plans. • Marks place where execution order is not fixed. • Discover node chooses best way to bind variables x and y. • Chain node chooses best evaluation of a path expression.

Logical query plan for:Select xFrom DBGroup.Member xWhere exists y in x.Age: y<30 from clause where clause

Physical Query Plans

Physical Query Plans (cont.) • Scan(x, l, y) places into y all objects that are subojects of x via edge labeled l. • Top-down (pointer chasing). • Lindex plan is bottom-up approach. • Bindex: Locate edges whose label appears infrequently in database. • NLJ: left subplan passes variables to right subplan.

Statistics • I/O metric uses estimated # of objects fetched. • For every label subpath p of length <= k: • # Of atomic objects of each type reachable by p • Min, and max values of all atomic objects of each type reachable by p • # Of instances of path p, denoted |p| • # Of distinct objects reachable by p, denoted |p|d • # Of l-labeled subobjects of all objects reachable by p • # Of incoming l-labeled edges to any instance of p, denoted |pl|

Plan Enumeration • Doesn’t consider joining two simple path expressions together unless they share a common variable. • Pindex is used only when path expression begins with a name and no variable except the last is used in the query. • Select clause always executes last. • Doesn’t try to reorder multiple independent path expressions.

Results • Used XML database about movies. Database graph contained 62,256 nodes and 130,402 edges. • Experiment 1: Select DB.Movie.Title • Best plan is Pindex, followed by top-down • Worst plan is Bindex, with hash joins.

Results (cont.) • Experiment 2: All Movies with a Genre of “Comedy” • Where clause is very selective, bottom-up does a Vindex for “Comedy” with incoming edge Genre

Results (cont.) • Experiment 3: Query with two existentially quantified variables in the where clause. • Errors due to bad estimates of atomic value distributions and set operation costs.

Results (cont.) • Experiment 4: Select movies with certain quality rating. • Quality ratings uncommon in database so optimizer chooses to find all ratings via Bindex, and then work bottom-up.

Conclusions • Cost estimates are accurate and select the best plan most of the time • Execution times of best and worst plans for a given query can differ by many orders of magnitude. • Best strategy is highly dependent upon the query and database (Query optimization is good for XML data).

Lore: A Database Management System for Semistructured Data