580 likes | 747 Views
Lore: A Database Management System for Semistructured Data. LORE. Lore : Lightweight Object Repository http://WWW-DB.Stanford.EDU/lore/demo/. Lore - motivation. The data may be irregular and thus not conform to a rigid schema.
E N D
LORE • Lore : Lightweight Object Repository • http://WWW-DB.Stanford.EDU/lore/demo/
Lore - motivation • The data may be irregular and thus not conform to a rigid schema. • Relational data model has null values, and OO models have inheritance and complex objects. Both have difficulties in designing schemas to incorporate irregular data. • It may be difficult to decide in advance on a single, correct schema, The structure of the data may evolve rapidly, data elements may change types, or data not conforming to previous structure may be added.
Thus, there is a need for management of semi-structured data! • Lore system manages semi-structured data. The data managed by Lore is not confined to a schema and it may be irregular or incomplete. • OEM is the Lore’s data model. OEM - object Exchange Model - graph based self-describing object instance model where nodes are objects and edges are labeled with attribute names and leaf nodes have atomic values • Lore is light weight object repository and Lorel is Lore’s query language.
Path expression queries • Automatic type coercion • Use of Data guides – Structural summary of the database
Objects with Oid • Atomic objects – no outgoing edge and contain a value from one of the basic atomic types such as Integer, real, string, gif, java, audio etc. • Complex objects – may have outgoing edges • Names – Sp. Labels that serves as aliases for objects and as entry points to the database.
Object Exchange Model - OEM • Motivation - information exchange and extraction • Each value exchanged is given an explicit label. Object temp-in-Fahrenheit, integer, 80 - “temp-in-Fahrenheit” is the label. Each object is self-describing, with a label, type and value. set-of-temps, set, {cmpnt1, cmpnt2} cmpnt1 is temp-in-Fahrenheit, integer, 80 cmpnt2 is temp-in-Celsius, integer, 20
Labels • Plays two roles • identifying an object (component) • identifying the meaning of an object (component) person-record, set, {cmpnt1, cmpnt2, cmpnt3} cmpnt1 is person-name, string, ``Fred’’ cmpnt2 is office-num-in-bldg-5, integer, 333 cmpnt3 is department, string, ``toy’’ • Person-name both identifies cmpnt1 and coveys its meaning. • In relational data this corresponds to ….
Labels - Issues • What does the label mean? • Database of labels • Ontology of labels - within each source • Labels are relative (more specific) to the source of the data object. • Similar labels from different sources need to be resolved. • Labels provide the flexibility in representing object structure
Self-describing data models • Have been in existence for a long time? Why additional interest now? • Use the ``nature’’ of self-describing data model for information exchange, and to extend the model to include object nesting. • To provide an appropriate object request language (query facility)
Label Type Value Object-ID OEM - Specification • Each object in OEM has the following structure: • Label: A variable character string describing what the object represents. • Type: The data type of the object’s value. Each is either an atom type, or type set. • Value: A variable-length value of the object. • Object-ID: A unique variable-length identifier for the object or null.
OEM - Summary • OEM is an information exchange model. It does not specify how objects are stored at source. • OEM does specify how objects are received at a client, but after objects are received they can be stored in any way the client likes. • Each source has a distinguished object with lexical identifier ``root’’. • Note the schema-less nature of OEM is particularly useful when a client does not know in advance the labels or structure of OEM objects.
<biblio,set,{doc1,doc2,…,docn}> • doc1 is <doc, set, {auths1, topic1, call-no1}> • auths1 is <auth-set,set {auth11}> • auth11 is <auth-ln, string, ``Ullman’’> • topic1 is <topic, string,``Databases’’> • call-no1 is <internal-call-no, integer, 25> • doc2 is <doc, set, {auths2, topic2, call-no2}> • auths2 is <auth-set,set {auth21, auth22, auth23}> • auth21 is <auth-ln, string, ``Aho’’> • auth22 is <auth-ln, string, ``Hopcroft’’> • auth23 is <auth-ln, string, ``Ullman’’> Example • topic2 is <topic, string,``Algorithms’’> • call-no1 is <dewey-decimal, string, ``BR273’’> • docn is <doc, set, {authsn, topicn, call-non}> • authsn is <auth,string, ``Crichton’’> • topic1 is <topic, string,``Dinosaurs’’> • call-no1 is <fictional-call-no, integer, 95> • biblio is the root object.
OEM - QL SELECT Fetch-expression FROM Object WHERE Condition • The result of this query is itself an object, with special label ``answer’’: answer, set, {obj1, obj2, …, objn} • Each returned obji is a component of object specified in the From clause of the query, where the component is located by the Fetch-expression and satisfies the Condition.
Path • The notion of path is used in both Fetch-Expression in the Select clause and the condition in the Where clause. • Path describes traversals through an object using subobject structure and labels. • Example: ``biblio.doc.auth’’ • Paths are used in Fetch-Expression to specify which components are are returned in the answer object. • Paths are used in the condition to qualify the fetched objects or other (related) components in the same object structure.
Queries - Simple • Retrieve the topic of each document for which ``Ullman’’ is one of the authors: SELECT biblio.doc.topic FROM root WHERE biblio.doc.auth-set.auth-ln = ``Ullman’’ • Intuitively, the query’s where clause finds all paths through subobject structure with the sequence of labels [biblio,doc,auth-set,auth-ln] such that the object at the end of the path has value ``Ullman.’’ <answer, set, {obj1, obj2}> obj1 is <topic, string, ``Databases’’> obj2 is <topic, string, “Algorithms”>
Queries - ``wild-cards’’ • Retrieve all documents with internal call number: SELECT biblio.?.topic FROM root WHERE biblio.?.internal-call-no • ``?’’ label matches any label. For this query, the doc labels can be replaced by any other strings and query would produce the same result. By convention, two occurrences of ? In the same query must match the same label unless variables are used. <answer, set, {obj1}> obj1 is <topic, string, ``Databases’’>
Queries - ``wild-paths’’ • Retrieve all documents with internal call number: SELECT *.topic FROM root WHERE *.internal-call-no • Symbol ``*’’ matches any path of length one or more. The use of * followed by a single label is a convenient and common way to locate objects with a certain label in complex structure. Similar to ?, two occurrences of * in the same query must match the same sequence of labels, unless variables are used. <answer, set, {obj1}> obj1 is <topic, string, ``Databases’’>
Queries - variables • Retrieve each document for which both ``Hopcroft’’ and ``Aho’’ are co-authors: SELECT biblio.doc FROM root WHERE biblio.doc.auth-set.auth-ln(a1)=``Aho’’ and biblio.doc.auth-set.auth-ln(a1)=``Hopcroft’’ • Here, the query finds all the paths with structure [biblio, doc, auth-set], and with two distinct path completions with label auth with values ``Aho’’ and ``Hopcroft’’ <answer, set, {obj1}> obj1 is the complete doc2
OEM (Cont.) • Examples: • Object &3 is complex, and its subobjects are &8, &9, &10, and &11. • Object &7 is atomic and has value “Clark”. • DBGroup is a name that denotes object &1.(Names are entry points into the database).
An OEM Database DBGroup &1 Member Project Member Member Project Member &2 &3 &4 &5 &6 Name Project Name Office Project Name Age Age Office Office &9 &7 &8 &10 &11 &12 &13 &14 &15 &16 “Clark” “Smith” 46 “Gates 252” “Lore” “Tsimmis” “Jones” 28 Building Room Room Building &17 &18 &19 &20 “CIS” “411” “CIS” 252
Lorel Queries - Simple Path Expression • Retrieve the offices of members with age greater than 30 years: Query SELECT DBGroup.Member.Office WHERE DBGroup.Member.Age > 30 Result Office “Gates 252” Office Building “CIS” Room “411”
Lorel Query Rewrite • Previous query rewritten to: • select Ofrom DBGroup.Member M, M.Office Owhere exists y in M.Age : y < 30 • Comparison on age transformed to existential condition. • Since all properties are set-valued in OEM. • A user can ask DBGroup.Member.Age < 30 regardless of whether Age is single valued, set valued, or unknown.
Lorel Query Rewrite • Why? • Breaking query into simple path expressions necessary for query optimization. • Need to explicitly handle coercion. • Atomic objects and values. 0.5 < “0.9” should return true • Comparing objects and sets of objects. DBGroup.Member.Age is a set of objects.
Queries - General Path Expression Query SELECT DBGroup.Member.Name WHERE DBGroup.Member.Office(.Room%|.Cubicle)? Like “%252” Result Name “Jones” Name “Smith” • Room% matches all labels starting from Room, like Room68. “|” stands for disjunction. “?” indicates that the label pattern is optional. “like %252” specifies that the data value should end with string “252”.
Queries - SubQueries Retrieve Lore project members who work on other projects Query SELECT M.Name, ( SELECT M.Project.Title WHERE M.Project.Title != “Lore”) FROM DBGroup.Member M WHERE M.Project.Title = “Lore” Result Member Name “Jones” Title “Tsimmis”
Data Guides • A DataGuide is a concise and accurate summary of the structure of an OEM database (stored as OEM database itself, kind of like the system catalog). • Why? • No explicit schema, how do we formulate meaningful queries? • Large databases (can’t just view graph structure). • What if a path expression doesn’t exist (waste). • Each possible path expression is encoded once.
DataGuides As Histograms • Each object in the dataguide can have a link to its corresponding target set. • A target set is a set of oids reachable by that path. • TS of DBGroup.Member.Age is {9, 13}. • This is a path index. Can find set of objects reachable by a particular path. • Can store statistics in DataGuide • For example, the # of atomic objects of each type reachable by p.
Conclusions • Takes advantage of the structure where it exists. • Handles lack of structure well (data type coercion, general path expressions). • Query language allows users to get and update data from semistructured sources. • DataGuide allows users to determine what paths exist, and gives useful statistical information • Lore does facilitate query and updates on semi-structural databases
OEM vs. XML • OEM’s objects correspond to elements in XML • Sub-elements in XML are inherently ordered. • XML elements may optionally include a list of attribute value pairs. • Graph structure for multiple incoming edges specified in XML with references (ID, IDREF attributes). i.e. the Project attribute.
OEM to XML • Example: • <Member project=“&5 &6”> <name>Jones</name> <age>46</age> <office> <building>gates</building> <room>252</room> </office></member> • This corresponds to rightmost member in the example OEM, where project is an attribute.
External Data Manager • Enables retrieval of information from other data sources, transparent to the user. • An external object in Lore is a “placeholder” for the external data and specifies how lore interacts with an external data source. • The spec for an external object includes: • Location of a wrapper program to fetch and convert data to OEM, time interval until fetched information becomes stale, and a set of arguments used to limit info fetched from external source.
Layer 0 • Access to Lore – API, Applications • Parser – textual query as input and parse tree as output • Preprocessor – input parse tree to OQL like query • Query plan and query optimizer (index etc)
Layer 1 • Object manager – translation layer between OEM to lower level files, compare objects, • Query operator – execute query plan, perform simple coercion, iterating over subobjects of a complex object
QUERY select M.Name, (select M.Project.Title where M.Project.Title != "Lore" ) from DBGroup.Member M where M.Project.Title = "Lore" RESULT Member Name "Jones" Title "Tsimmis"
update P.Member += ( select DBGroup.Member where DBGroup.Member.Name = "Clark" ) from DBGroup.Project P where P.Title = "Lore" or P.Title = "Tsimmis" P.member += specifies to add member edges between P and every object returned by subquery.
Query Execution Plan • Iterative approach in query processing • Execution begins at the top of the query plan, with each node in the plan requesting a tuple at a time from its children and performing some operation on the tuple. • After a node completes its operation, it passes a resulting tuple to its parent.
Object assignment – OA is a simple data structures containing slots corresponding to range variables in the query and some additional slots • Each slot within an OA will hold the oid of a vertex on a data path currently being considered by the query engine • Example – OA1 holds oid for member “smith”, then OA2 and OA3 can hold the oids for one of smith’s office subobjects and one of his age subobjects resp.
Query Operators • Each operator takes a number of arguments with the last argument being OA slot that will contain the result of the operation. • Select , project has no target slot. • Scan returns all oids that are subobjects of a given object • Scan (starting OA slot, path exp., Target OA slot) • Scan until no subobjects that satisfies path expression • Scan (OA1, “office”, OA2) : place into slot OA2 one at a time all office subobjects appearing in slot OA1.
Join/select/project – nearly identical to RDBMS • Project is to limit which objects should be returned • Select applied predicate to the object identified by the oid in the OA slot specified • Aggregate – implements quantification and aggregation node calls its child exhaustively , storing the results temp. or computing aggregation • A new object is created when no more valid OAs
select O from DBGroup.Member M, M.Office O where exists A in M.Age : A > 30
Primary Operators • Setop, ArithOp, Create Set and Groupby • Setop handles union, intersect,except • Arithop – addition, multiplication • Createset – is to package the results of an arbitrary subquery before proceeding: its called its child exhaustively , storing each oid returned as part of a newly created complex object. • It stores oid for the new set of objects within target slot • Group by handles subquery that includes a groupby expression
select M.Name, count(M.Publication) from DBGroup.Member M where M.Dept = "CS" select (select N from M.Name N), count(select P from M.Publication P) from DBGroup.Member M where exists D in M.Dept : D = "CS"
Indexing • Value Index – vindex • Lindex – a link (edge) index • Lindex – (oid, label) and returns the oid’s of all parents via the specified label (provides parent pointers) • Vindex (label,operator,value) – returns all atomic objects having an incoming edge with the specified label and a value satisfying the specified operator (<)
Index Query Plan • Bottem – Up • Locate all objects with desired values and appropriately labeled incoming edges via vindex • Using Lindex then traverse up from these objects to match the path exp