280 likes | 463 Views
Lore: A Database Management System for Semi-structured Data. Jason McHugh, Serge Abiteboul, Roy Goldman, Dallan Quass, Jennifer Widom Stanford University. Mustafa Emre ERKO Ç 20.11.2002. What is Lore (Lightweight Object Repository):
E N D
Lore: A Database Management System for Semi-structured Data Jason McHugh, Serge Abiteboul, Roy Goldman, Dallan Quass, Jennifer Widom Stanford University Mustafa Emre ERKOÇ 20.11.2002
What is Lore (Lightweight Object Repository): A DBMS designed specifically for managing semi structured data. In the paper: Storage manager Indexing Query Processing and Optimization User Interfaces is described. Needs for such a DBMS: 1)The data can be irregular and not conform to a rigid schema. 2)There can be changes in the structure of the schema such as: data elements can change types, data not conforming previous schema can be edit and other schema modifications What Lore Does: since it is designed for managing semi structured data, there is no existence of schema. Lore data model is very simple, self describing and nested object model called OEM(Object Exchange Model)
Outline of Presentation : Representing and Querying Semi Structured Model Data Model Lorel Query Language System Architecture Query and Update Processing Query Operators and Query Plan Query Indexing Lindex and Vindex Index Query Plan Update Query Plan Physical Storage Novel Features External Data Manager Data Guides Interfaces API Web System Status and Future Work
What has done so far? The Lorel query language for semi structured data. 11/96 Enabling query formulation and optimization in semi structured database. 08/97 Representing and querying changes in semi structured data. 02/97 Querying semi structured heterogeneous information.12/96 Query optimization for semi structured data. Integrating dynamically-fetched external information into a dbms for semi structured data. 05/97
Representing and querying The object exchange model • Oem is designed for semistructured data Data can be thought as labeled directed graph • The vertices are objects and names are special labels serve as aliases for objects and as • entry points into database • Every object has a unique object identifier (oid) Example : &5 • Atomic Object: object that has no outgoing edge. Contain a value from one of the basic • atomic types such as integer, real,string, gif etc • Example: Object &7 with value Clark • Complex Objects: all other objects that may have subobjects. • Example: &3 and subobjects &8 &9 &10 and &11 • Any object that cannot be accessed by a path from some name is considered to be deleted. • Members have zero, one or more offices • An office is sometimes string and sometimes integer and sometimes complex object • Room may be a string and integer
Representing and querying semistuctured dataLorel query language Lorel is an extension of OQL. Basic building block of lorel is the simple path expression which is a name followed by a sequence of labels. Ex : DBGruop.Member.Office Ex: select DBGruop.Member.Office Where DBGruop.Member.Age>30 Result is office “Gates 252” Office Building ”CIS” Room “411” This query will not yield a run-time error if an Age object has a string value or is complex or They are singled-value, set-value or even empty for some group members. The OQL style of this query is: Select O From DBGruop.Member M, M.Office O Where exits A in M.Age : A>30 As will be seen later the first step of query processing is to rewriting the query into OQL style
Representing and querying semistuctured datalorel query language One can specify pattern for path pattern for labels pattern for atomic values. Example: Select DBGroup.Member.Name Where DBGroup.Member.Office(.Room%|.Cubicle)? like”%252” Result Name “Jones” Name “ Smith” The following query illustrates subqueries and constructed results. Select N.Name, (select M.Project.Title where M.Project.Title !=“lore”) From DBGroup.Member M Where M.Project.Title = “lore” Result: Name “ Jones” Title “ Tsimmis”
System Architecture • Access to Lore is through a variety of applications or directly from application • Program Interface (API ) • There is a simple textual interface generally used by developers. • The graphical interface is the primary interface for the end users. It provides powerful • tools for query results, formulating simple queries and mechanism for viewing the • multimedia atomic types • The query compilation layer is consist of • The parser -Preprocessor • Query plan generator -Query optimizer. • Query optimizer, in addition to doing some transformations on the query plan, decides • whether the use of indexes is feasible. • The Data Engine Layer houses • - OEM object manager - Query operators • -External data manager - Various utilities. • the object manager functions as the translation layer between OEM and the low level file • constructs. • It supports : Fetching an object, comparing two objects, performing simple coercions, • cache of frequently accessed objects
Query and Update Processing in Lore • Query processing in Lorel is fairly conventional with some exceptions: • Becuase of the flexibility of Lorel, the preprocessing of the parse tree to produce OQL • like query is complex. • Although the Lore engine is built around standard operators, some take an original flavor • Ex: scan operator. • A unique feature of Lore is its automatic coercion of atomic values. Coercion has an • impact on the implementation of comparators. • The result of a Lorel query is always a set of OEM object which become subobject of • a newly created result object.
Query and Update Processing in LoreQuery processing Query execution strategy is based on familiar database operators. A recursive iterator approach is used in query processing. With the iterators execution begins at the top of the query plan,with each node in the plan requesting a tuple from its children and performing some operations on the tuples After a nodes complete its operation, it passes a resulting tuple up to its parent. The tuples on which the operation takes places is Object Assignments. OA is a simple Data structure containing slots corresponding to range variables in the query If OA1holds the oid for member “Smith” then OA2 and Oa3 can hold the oid’s For one of the Smith’s Office subobject and one of the Smith’s Age subobject respectively
Query and Update Processing in LoreQuery operators Join, project, select operators are nearly identical to their corresponding relational operators. Aggregation operator calls its child exhaustively, storing the result temporarily or computing the aggregate incrementally. When the child can produce no more valid OA’s, a new object is created whose value the final aggregation. Other operators : SetOp, ArithOp, CreateSet, and Gruopby. SetOp handles the set operations; Union, intersect and except. ArithOp handles arithmatic operators as; addition, multiplication etc. CreateSet is used to package the result of an arbitrary subquery before proceeding. Gruopby handles queries that include a groupby expression.
Example: select M.Name, count (M.Publication) from DBGroup.Member M Where M/Dept = ‘CS’ OQL like translation is select (select N from M.Name N), count (select P from M.Publication) from DB.Group.Member M where exist D in M.Dept : D = ‘CS’
Query and update processing in LoreQuery Optimization & Indexing Query processor currently implement only a few simple heuristic optimization technique E.g Selection operators are pushed down the query tree, and in some cases redundant operator are eliminated or combined Lore explored query plans that use indexes when feasible. In a relational DBMS, an index is created on an attribute in order to locate tuples with particular attribute values quickly. In Lore, such a value index, alone is not sufficient since path to an object is as important as the value of object. So there are two types of indexes: Link (Edge) Index :also called Lindex, takes an oid and a label , and returns the oid’s of All parents via specified label. It essentially provides “parent pointers” Value Index: also called Vindex, takes a label, operator and value. It returns all atomic objects having an incoming edge with specified label and a value satisfying the specified operator and value.
Indexing Value Indexing Value indexing in Lore requires some novel features due to its non-strict typing system. Indexing system deals with coercion involving integers, reals, and strings only. But the situation is simplified by always coercing integers to reals as shown in the table. Three types of Vindexes are maintained: String vindex Real vindex String coerced to real vindex When using a Vindex for comparison there are two cases: 1- If the comparison type string then: i-do a look up in the string vindex ii- if the Value can be coerced to a real then look up fir the coerced value in the real vindex 2- If the value is real or iteger: i-do a look up in the real vindex; ii- also look up in the String coerced to real vindex
Ex:Select OFrom DBGruop.Member M, M.Office OWhere exits A in M.Age : A>30 Indexing Index query plan If the query contains a comparison between a path expression and an integer, real or string and if the appropriate indexes are exist then a query plan that uses indexes will be created Query plans using indexes are different in the shape from those based on Scan operators index plans traverse the database bottom-up , while scan based plans perform a top-down traversal.
IndexingIndex query plan The new operators are: Vindex: iteratively finds all atomic objects with the value bigger than 30 and an incoming edge labeled Age, placing their oids in slot OA2 Lindex: iteratively places into OA1 all parents of the object in OA2 via an Age edge. Once: the object could potentially have several partners via Age, however; since Age is existentially quantified in the query, we only want to consider each parent once, even if the it has several Age subobject; this is why we use once here. Name_Obj: in the second lindex, operator finds all parents of OA1 object via member edge and place them in OA0. The objects in AO0 should be named as DBGroup and the purpose of Name_Obj operator is to do this.
Query and Update processing in LoreUpdate query plan Update is done by adding an Update operator to the query execution engine. Ex: update P.Member += (select DBGroup.Member where DBGroup.Member.Name=‘Clark’) from DBGroup.Project p where P.Title = ‘Lore’ or P.Title = ‘Tsimmis’ This update operator add an edge between the projects whose titles are Lore and Tsimmis, and members called Clark. Here we use Create_edge operation. The other valid operations are: Destroy_edge, modify_atomic
Query and Update Processing in Lore Bulk Loading and Physical Storage • Bulk Loading • Data can be added to a Lore database in two ways: • By using update statement • By load file • In the second case, a textual description of an OEM database is accepted by a loas utility • which includes useful features such as symbolic references for shared subobject, as well • as the ability to incorporate new data into an existing database. • Physical Structure • Lore arranges objects in a physical disk pages; each page has a number of slots with a • single object in each slot. • Object are variable length and so Lore places object according to a first-fit algorithm. Lore • use an object forwarding mechanism to handle object that grow too large objects that may • span page. • In addition, Lore supports for large objects that span many pages.
Novel Features External Data External Data Manager enables dynamic retrieval of information from other data sources based on the queries issued to Lore. During the query evaluation, externally obtained data is combined with resident data and so The distinction between two type of data is invisible to end user. During the query processing, when the execution engine discover an external object information is fetched from external source to answer query. Also the fetched information is stored in database until it becomes stale.
Novel FeaturesExternal Data Specifications for an external objects: i-The location of wrapper program that fetches the external data and translate it into OEM ii-A quantum that indicates that indicates the time interval until the fetched information become stale. iii-A set of arguments that are used to limit the information fetched in a call to the external source Arguments sent to the external source can come from 3 places: i-the query being processed (query defined) ii-values of other objects in the local database(data defined) iii-constant values tied to external object(hard-coded)
Novel FeaturesExternal Data Many calls to an external sources quickly dominate query processing time. To limit the number of calls some mechanism is used. One of them is to track the argument sets used by previous queries and determine when previously fetched information or entirely subsumes information required by the current argument set
Novel Features Data Guides Lore does not have an explicit schema so query formulation and optimization are particularly challenging. A dataguide is a concise and accurate structure of an OEM database, stored as an OEM object. Each possible path expression of a database is encoded exactly once in the dataguides. Considering the role of missing schema, the Dataguides can guide the query processor In relational or O2 systems the schema is explicitly created before any data is loaded, However, dataguides are dynamically generated or maintained.
Interfaces to Lore Application Programming Interface • API provides a gateway between Lore and any user interface or client application. • For instance, it is used by the system’s textual interface, which passes user commands to • Lore and present result of queries in a hierarchical display. • API is composed of a small collection of C++ classes. At the highest level it is used : • as a client program to connect to a Lore database. • To submit queries • To process query result. • LoreConnection class is to connect a specific database • Submit function is to submit Lorel queries. It is also used for other Lore command such • as index creation and updates. When submit is called with a query, it returns query results • as a LoreOem object. LoreOem only contains oid and the actual value is fetched from • database on demand.
User Interfaces to LoreWeb interface By visiting an url and choosing a database, a user can connect to web interface. The user may submit a textual Lorel query or select a sample prewritten query. Result are displayed in an HTML, hierarchical format.
System status and Future work Lore is consist of approximately 60.000 lines of C++ codes. Some language features such as external predicates and functions are still under implementation. General path expression are not implemented in their full generality, although a substantial and very useful is. It is planned to explore how Lorel could be translated to SQL 3 and thus implemented on top of an object relational dbms. Performance Issues There is a little performance analysis are done for Lore. Comparing the performance of lore against the implemented lorel on top of O2 There is a significant additional research to do in query optimization, including query writing, operation ordering, selecting the best use of indexes in query plans. Currently all expansions of path expressions in query path are done at run-time. It is planned to explore the compile-time approach and compare its performance against the Run-time approach.
System Status and Future Work It is also considering to investigate more sophisticated techniques for customizing the presentation of OEM object in a Web environment. Finally, it is planned to incorporate a special text type along with a full-text indexing System because many applications appropriate for a DMBS such as Lore, include a Significant amount of text data.