ORDB Implementation Discussion
From RDB to ORDB: Issues to address when adding OO extensions to a DBMS
Layout of Data • Deal with large data types: ADTs/blobs • special-purpose file space for such data, with special access methods • Large fields in one tuple: • One single tuple may not even fit on one disk page • Must break it into sub-tuples and link them via disk pointers • Flexible layout: • constructed types may have flexibly sized sets, e.g., one attribute can be a set of strings • Need to provide meta-data inside each type concerning the layout of fields within the tuple • Insertion/deletion will cause problems when a contiguous layout of ‘tuples’ is assumed
Layout of Data • More layout design choices (clustering on disk): • Lay out complex objects nested and clustered on disk (if nested and not pointer-based) • Where to store objects that are referenced (shared) by possibly several other and different structures • Many design options for objects that are in a type hierarchy with inheritance • Constructed types such as arrays require novel methods, like chunking the array into (4x4) subarrays for non-contiguous access (see the sketch below)
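To make the chunking idea concrete, here is a minimal sketch of mapping a logical array cell to the disk chunk that stores it; the 4x4 chunk size and all names are illustrative assumptions, not any particular system's layout:

```python
# 4x4 array chunking: map a logical (row, col) cell to the disk chunk holding
# it, so row-wise and column-wise scans each touch few chunks instead of
# dragging in entire rows stored contiguously.

CHUNK = 4  # chunk side length (illustrative)

def cell_to_chunk(row: int, col: int, ncols: int) -> tuple[int, int]:
    """Return (chunk_id, offset_within_chunk) for a cell of an ncols-wide array."""
    chunks_per_row = (ncols + CHUNK - 1) // CHUNK
    chunk_id = (row // CHUNK) * chunks_per_row + (col // CHUNK)
    offset = (row % CHUNK) * CHUNK + (col % CHUNK)
    return chunk_id, offset

# Cells (0,0)..(3,3) all land in chunk 0; cell (0,4) starts chunk 1.
assert cell_to_chunk(0, 0, 16) == (0, 0)
assert cell_to_chunk(3, 3, 16) == (0, 15)
assert cell_to_chunk(0, 4, 16) == (1, 0)
```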
Why (Object) Identifiers? • Distinguish objects regardless of content and location • Evolution of an object over time • Sharing of objects without copying • Continuity of identity (persistence) • Versions of a single object
Objects/OIDs/Keys • Relational keys (RDB): human-meaningful names (mix data value with identity) • Variable names (PL): give names to objects in a program (mix addressability with identity) • Object identifiers (ODB): system-assigned, globally unique names (location- and data-independent)
OIDs • System generated • Globally unique • Logical identifier (not physical representation; flexibility in relocation) • Remains valid for lifetime of object (persistent)
OID Support • OID generation: • uniqueness across time and systems • Object handling: • Operations to test equality/identity • Operations to manipulate OIDs for object merging and copying • Avoid dangling references
OID Implementation • By address (physical) • 32 bits; direct fast access like a pointer • By structured address • E.g., page and slot number • Both some physical and logical information • By surrogates • Purely logical oid • Use some algorithm to assure uniqueness • By typed surrogates • Contains both type id and object id • Determine type of object without fetching it
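A minimal sketch of two of these encodings; the field widths and helper names are illustrative assumptions, not any particular system's format:

```python
import itertools

def structured_oid(page: int, slot: int) -> int:
    """Structured address: high bits = page number, low 16 bits = slot number."""
    return (page << 16) | slot

def split_structured_oid(oid: int) -> tuple[int, int]:
    return oid >> 16, oid & 0xFFFF

_counter = itertools.count(1)  # stand-in algorithm assuring uniqueness

def typed_surrogate(type_id: int) -> int:
    """Typed surrogate: purely logical; packs the type id with a unique
    counter so an object's type is known without fetching the object."""
    return (type_id << 32) | next(_counter)

oid = typed_surrogate(type_id=7)
print(oid >> 32)  # -> 7: the type, recovered without touching the object
assert split_structured_oid(structured_oid(page=12, slot=3)) == (12, 3)
```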
ADTs • Type representation: size/storage • Type access : import/export • Type manipulation: special methods to serve as filter predicates and join predicates • Special-purpose index structures : efficiency
ADTs • Mechanism to add index support along with ADT: • External storage of index file outside DBMS • Provide “access method interface” a la: • Open(), close(), search(x), retrieve-next() • Plus, statistics on external index • Or, generic ‘template’ index structure • Generalized Search Tree (GiST) – user-extensible • Concurrency/recovery provided
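A minimal sketch of what such an access-method interface could look like; the operation names follow the slide, while the rendering as a Python abstract class and the statistics() hook are assumptions:

```python
from abc import ABC, abstractmethod
from typing import Any

class AccessMethod(ABC):
    """Interface an externally implemented ADT index (e.g., an R-tree
    for OVERLAPS) exposes so the DBMS can drive it."""

    @abstractmethod
    def open(self) -> None: ...            # open/pin the external index file

    @abstractmethod
    def close(self) -> None: ...           # release resources

    @abstractmethod
    def search(self, x: Any) -> None: ...  # position a scan for predicate x

    @abstractmethod
    def retrieve_next(self) -> Any: ...    # next matching OID, None when done

    @abstractmethod
    def statistics(self) -> dict: ...      # e.g., height/pages, for the optimizer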
Query Processing • Query Parsing : • Type checking for methods • Subtyping/Overriding • Query Rewriting: • May translate path expressions into join operators • Deal with collection hierarchies (UNION?) • Indices or extraction out of collection hierarchy
Query Optimization Core • New algebra operators must be designed : • such as nest, unnest, array-ops, values/objects, etc. • Query optimizer must integrate them into optimization process : • New Rewrite rules • New Costing • New Heuristics
Query Optimization Revisited • Existing algebra operators revisited: SELECT • WHERE clause expressions can be expensive • So SELECT pushdown may be a bad heuristic
Selection Condition Rewriting • Example: • (tuple.attribute < 50) • only CPU time (evaluated on the fly) • (tuple.location OVERLAPS lake-object) • possibly complex, CPU-heavy computations • may involve both IO and CPU costs • State of the art: consider the reduction factor only • Now we must consider both factors: • cost factor: dramatic variations • reduction factor: unrelated to the cost factor
Operator Ordering • [diagram: alternative orderings of operators op1 and op2]
Ordering of SELECT Operators • Cost factor: now could have dramatic variations • Reduction factor: orthogonal to the cost factor • We want maximal reduction at minimal cost: Rank(operator) = reduction × (1/cost) • Order operators by decreasing ‘rank’ (best first; see the sketch below) • High rank (good): low in cost and large reduction • Low rank (bad): high in cost and small reduction
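A small sketch of rank-based ordering, assuming reduction is the fraction of tuples an operator eliminates and cost is its per-tuple cost; the predicate names and numbers are made up for illustration:

```python
def rank(reduction: float, cost: float) -> float:
    """reduction = fraction of tuples eliminated; cost = per-tuple cost."""
    return reduction / cost

predicates = [
    ("t.attribute < 50",         0.50, 1.0),    # cheap CPU-only check
    ("t.location OVERLAPS lake", 0.90, 500.0),  # expensive spatial test
]

# Apply high-rank predicates first: the most reduction per unit of cost.
ordered = sorted(predicates, key=lambda p: rank(p[1], p[2]), reverse=True)
print([name for name, *_ in ordered])
# -> ['t.attribute < 50', 't.location OVERLAPS lake']
```

Note how the spatial predicate ends up last even though it eliminates more tuples: its per-tuple cost dwarfs the extra reduction, which is exactly why reduction factor alone is no longer a sufficient heuristic.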
Access Structures/Indices (on what?) • Indices that are ADT-specific • Indices on navigation paths • Indices on methods, not just on columns • Indices over collection hierarchies (trade-offs) • Indices for new WHERE clause expressions: not just =, <, >, but also “overlaps”, “similar”
Registering New Index (to Optimizer) • What WHERE conditions it supports • Estimated cost for “matching tuple” (IO/CPU) • Given by index designer (user?) • Monitor statistics; even construct test plans • Estimation of reduction factors/join factors • Register auxiliary function to estimate factor • Provide simple defaults
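A minimal sketch of such a registration; the registry layout, field names, and the 10% default are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Callable

def default_reduction(cond: str) -> float:
    return 0.1   # simple default: assume 10% of tuples qualify

@dataclass
class IndexRegistration:
    name: str
    supported_ops: set[str]                 # WHERE conditions the index serves
    match_cost: Callable[[int], float]      # designer-supplied IO/CPU estimate
    reduction_factor: Callable[[str], float] = default_reduction  # auxiliary estimator

optimizer_registry: dict[str, IndexRegistration] = {}

optimizer_registry["rtree_on_location"] = IndexRegistration(
    name="rtree_on_location",
    supported_ops={"OVERLAPS", "CONTAINS"},
    match_cost=lambda n_matches: 4.0 + 0.01 * n_matches,
)
```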
Methods • Use ADT/methods in query specification • Achieves: • flexibility • extensibility
Methods • Extensibility: dynamic linking of methods defined outside the DB • Flexibility: overriding methods in a type hierarchy • Semantics: • Use of “methods” with implied semantics? • Incorporation of methods into query processing may cause side-effects? • Performance of methods may be unpredictable? • Termination may not be guaranteed?
Methods • “Untrusted” methods may: • corrupt the server • modify DB content (side effects) • Handling of “untrusted” methods: • restrict the language • interpret rather than compile • run in a separate address space from the DB server
Query Optimization with Methods • Estimation of the “cost” of method predicates • see earlier discussion • Optimization of method execution: • methods may be very expensive to execute • Idea: similar to handling correlated nested subqueries • recognize repetition and rewrite the physical plan • provide some level of pre-computation and reuse
Strategies for Method Execution • 1. If called on the same input, cache that one result • 2. If applied to a full column, presort the column first (group-by) • 3. Or, in general, use full precomputation: • precompute results for all domain values (parameters) • store them in a hash table mapping val → fct(val) • during query processing, look up fct(val) by val in the hash table • or possibly even perform a join with this table (sketch below)
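A minimal sketch of strategies 1 and 3; the function names are illustrative, and strategy 1 is rendered with Python's functools.lru_cache:

```python
from functools import lru_cache

# Strategy 1: cache the most recent result, so repeated calls
# on the same input do not recompute.
@lru_cache(maxsize=1)
def expensive_method(val):
    # stand-in for real work, e.g., decoding and analyzing an image
    return hash(val)

# Strategy 3: full precomputation over a (small) parameter domain.
domain = range(100)
lookup = {val: expensive_method(val) for val in domain}

# During query processing a method call becomes a hash-table probe
# (or the lookup table can even be joined with the input column).
result = lookup[42]
```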
Query Processing • User-defined methods • User-defined aggregate functions: • e.g., “second largest” or “brightest picture” • Distributive aggregates: • incremental computation
Query Processing: Distributive Aggregates • For incremental computation of distributive aggregates, provide: • Initialize(): set up the state space • Iterate(): per tuple, update the state • Terminate(): compute the final result based on the state; clean up the state • For example, “second largest”: • Initialize(): allocate 2 fields • Iterate(): per tuple, compare numbers and update the 2 fields • Terminate(): return the result and remove the 2 fields (sketch below)
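A minimal sketch of the Initialize()/Iterate()/Terminate() protocol for “second largest”; the class rendering and state layout are illustrative assumptions:

```python
class SecondLargest:
    def initialize(self) -> None:
        # state space: the two largest values seen so far
        self.largest = self.second = float("-inf")

    def iterate(self, value: float) -> None:
        # per-tuple update of the state
        if value > self.largest:
            self.largest, self.second = value, self.largest
        elif value > self.second:
            self.second = value

    def terminate(self) -> float:
        # compute the final result from state, then clean up the state
        result = self.second
        del self.largest, self.second
        return result

agg = SecondLargest()
agg.initialize()
for v in [3, 9, 4, 7]:
    agg.iterate(v)
print(agg.terminate())  # -> 7
```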
Following Disk Pointers? • Complex object structures with object pointers may exist (~ disk pointers) • Navigate complex objects by following pointers • Long-running transactions, as in CAD design, may work with a complex object for a long duration • Question: what to do about “pointers” between subobjects or related objects?
Following Disk Pointers: Options • Swizzle: • swizzle = replace OID references by in-memory pointers • unswizzle = convert back to disk pointers when flushing to disk • Issues: • in-memory table of OIDs and their state • indicate each pointer’s type (swizzled or not) via a bit • Different policies for swizzling (sketch below): • never • on access • as soon as the object holding the pointer is brought in
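A minimal sketch of swizzle-on-access; all names are illustrative, and a tagged reference stands in for the per-pointer bit:

```python
# A reference is either ["oid", disk_oid] or ["ptr", in_memory_object];
# the tag plays the role of the per-pointer "swizzled?" bit.

object_table = {}   # OID -> in-memory object, for already-resident objects

def fetch_from_disk(oid):
    # stand-in for: read page, locate slot, build the object
    return {"oid": oid}

def deref(ref):
    """Follow a reference, swizzling it on first access."""
    tag, val = ref
    if tag == "ptr":
        return val                      # already swizzled: direct pointer
    obj = object_table.get(val)
    if obj is None:
        obj = fetch_from_disk(val)      # fault the object in
        object_table[val] = obj
    ref[0], ref[1] = "ptr", obj         # swizzle in place
    return obj

def unswizzle(ref):
    """Convert back to a disk pointer before flushing to disk."""
    if ref[0] == "ptr":
        ref[0], ref[1] = "oid", ref[1]["oid"]

r = ["oid", 4711]
print(deref(r) is deref(r))  # -> True: second call follows the memory pointer
```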
Persistence? • We may want both persistent and transient data • Why? • Programming language variables • Handle intermediate data • May want to apply queries to transient data
Properties for Persistence? • Orthogonal to types : • Data of any type can be persistent • Transparent to programmer : • Programmer can treat persistent and non-persistent objects the same way • Independent from mass storage: • No explicit read and write to persistent database
Models of Persistence • Persistence by type • Persistence by call • Persistence by reachability
Model of Persistence: by type • Parallel type systems: • persistence by type, e.g., int vs. dbint • Programmer is responsible for making objects persistent • Programmer must decide at object creation time • Allow for user control by “casting” between types
Model of Persistence: by call • Persistence by explicit call • explicit create/delete in persistent space • e.g., objects must be placed into “persistent containers” such as relations in order to be kept around • e.g., insert an object into collection MyBooks • could be rather dynamic control, without casting • relatively simple for the DBMS to implement (sketch below)
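A minimal sketch of persistence by explicit call; the container API is an illustrative assumption:

```python
# Objects are transient until explicitly inserted into a persistent container.

class PersistentCollection:
    def __init__(self, name: str):
        self.name, self.members = name, []

    def insert(self, obj) -> None:
        # a real system would assign an OID and write to stable storage here
        self.members.append(obj)

    def delete(self, obj) -> None:
        # explicit removal from persistent space
        self.members.remove(obj)

my_books = PersistentCollection("MyBooks")
book = {"title": "ORDB Implementation"}   # transient so far
my_books.insert(book)                     # persistent from here on
```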
Model of Persistence: by reachability • Persistence by reachability: • use global (or named) variables as persistent roots referring to objects and structures • objects referenced by other objects that are reachable from the application are themselves persistent, by transitivity • no explicit deletes; instead, garbage collection removes objects once they are no longer referenced • Garbage collection techniques (sketch below): • mark&sweep: mark all objects reachable from the persistent roots, then delete the others • scavenging: copy all reachable objects from one space to the other; may suffer in a disk-based environment due to IO overhead and destruction of clustering
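A minimal sketch of mark&sweep from persistent roots; the object structure is an assumption, with each object listing its references in a .refs list:

```python
class Obj:
    def __init__(self, name):
        self.name, self.refs = name, []

def mark_and_sweep(persistent_roots, all_objects):
    # Mark phase: find everything reachable from the persistent roots.
    reachable, stack = set(), list(persistent_roots)
    while stack:
        obj = stack.pop()
        if id(obj) not in reachable:
            reachable.add(id(obj))
            stack.extend(obj.refs)
    # Sweep phase: everything not marked is garbage and gets dropped.
    return [obj for obj in all_objects if id(obj) in reachable]

a, b, c = Obj("a"), Obj("b"), Obj("c")
a.refs.append(b)   # c is referenced by nothing reachable
print([o.name for o in mark_and_sweep([a], [a, b, c])])  # -> ['a', 'b']
```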
Summary • A lot of work to get to OO support: from physical database design/layout issues up to logical query optimizer extensions • ORDB: reuses the existing implementation base and incrementally adds new features on top (but the relation remains the first-class citizen)