480 likes | 706 Views
Comparison of Access Methods for Time-Evolving Data. Betty Salzberg and Vassilis Tsotras CS599, Temporal and Spatial Databases Course. Presented by: Atousa Golpayegani 11/16/2000. Temporal database design problem. Temporal Queries. Access Methods. R-tree. snapshot-index.
E N D
Comparison of Access Methods for Time-Evolving Data Betty Salzberg and Vassilis Tsotras CS599, Temporal and Spatial Databases Course. Presented by: Atousa Golpayegani 11/16/2000
Temporal database design problem Temporal Queries Access Methods R-tree snapshot-index DBMS Software B+-tree Time Dimension time-index Access Method? Database Designer
Temporal database design problem Temporal Queries Access Methods B+-tree R-tree Access Method Selection Criteria snapshot-index DBMS Software time-index Selected Time Dimension Access Method Database Designer
Outline • Brief introduction to temporal databases • Criteria for comparison of access methods • Efficient method design for transaction/Bitemporal data • Method classification and comparison
Conventional Vs. Temporal Databases • Conventional database captures only a single snapshot of the modeled reality, usually the most current. • Can not support past and future data. • Ex: “ the current salary of an employee”. • Temporal database supports some time domain and is thus able to manage time varying data. • User-defined time is excluded. • Ex: “ the current and the previous salaries of an employee since hiring”.
What is Stored in a Temporal Database? • Tuple-versioning Temporal model is used. In this model Database is a set of records that store the versions of the real life objects. Each record has: • A time invariant key • A number of time variant attributes (for simplicity just one) • One or two intervals • Start time • End time • Ex: Employee (ss#, name, salary, [t1, now) )
Taxonomy of Time in a Temporal Database • Transaction Time • Is defined as the time when a fact is stored in the database. • Valid Time • Is defined as the time when a fact becomes effective (valid) in reality. • Bitemporal Time • Is the combination of the above two types. • Time is assumed to be discrete, and consecutive nonnegative integers • Any change is assumed to occur only at an indicated time • Addition of an object • Deletion of an object • Value change of an object’s attribute
Logical state T4 a b b T1 T2 T3 T4 T5 Time Transaction Time Database • Time Interval shows when an object was added and deleted. • An object is alive from the time it is added to the database until it is deleted. • Ex. Object ‘a’ is the only one that is currently alive. • No record is physically deleted (logical deletion). • Therefore database has logical state • Ex. Object ‘b’ is deleted. • Logical state at time t consists of those records whose transaction time interval contains t. • Ex. S(T4)= (‘a’ , ‘b’)
Assumptions in a Transaction Time Database • Past states can not be changed. • Linear transaction time evolution • A New database state is created by updating only the current database state. • Another option is called branching transaction time evolution, where new states can be created from any past database state. • Implicit updating assumption • If an object is updated at time t, the database system will be updated at the same time. There is no delay.
Transaction Time Database Example key Salary Transaction Time Name A [ T1, now) SS1 X1 B [ T2, T3) SS2 X2 B [T4, T5) X2 SS2 (SS1,A,X1) (SS2,B,X2) (SS12B,X2) T1 T2 T3 T4 T5 Time
Logical state T4 a b b T1 T2 T3 T4 T5 Time Transaction Time Database Access Method Characteristic • Store the past logical states. • Support addition/deletion/modification changes on the objects of the current logical state. • Efficiently access and query objects in any of logical states.
I2 I1 Valid Time Valid Time State i-1 State i Valid Time Database • Time Interval shows the validity period of an object in reality. • Ex. The period that a contract is valid. • An object that is not valid anymore will be physically deleted from the database. • Only the latest “snapshot” of the collection of interval-objects are kept.
I2 Valid Time Valid Time Database Access Method Characteristic • Store the latest collection of interval-objects. • Support addition/deletion/modification changes to this collection. • Efficiently query the interval-objects contained in the collection.
Bitemporal Database • Supports both transaction time and valid time. • Instead of a single collection of interval-objects, there is a sequence of collections, C(ti), indexed by transaction time. • Transaction time and valid time are two orthogonal time dimensions. • Ex. (Contract#, amount, duration, [t1, now)) C(t2) C(t3) C(t4) C(t1) t2 t3 t4 t1
Bitemporal Database Access Method Characteristic • Store its past logical states. • Support addition/deletion/modification changes on the interval-objects of its current logical state. • Efficiently access and query the interval-objects on any of its states.
Criteria for Comparison of Access Methods • Query • Access Method Costs • Index Pagination and Data Clustering • Migration of Past Data to Another Location • Lower Bounds on I/O Complexity
Query • One of the criteria for the comparison is how efficient an access method answers a query. • Three general classes of queries are chosen. • They can be used for both valid time and transaction time since from a query perspective, valid time and transaction time are simply collections of intervals.
Query Type I Given a contiguous interval T, find all objects alive during this interval. • Representative query for this type: • Pure-timeslicequery: A special case of type I when interval T is reduced to a single time instant t Ex.” Find all employees working at the company at time T’ “ T’ T a b b T1 T2 T3 T8 T9 Time
Query Type II Given a key range and contiguous time interval T, find the objects with keys in the given range that are alive during interval T. • Representative query for this type: Range-timeslice query: A special case of type II when interval T is reduced to a single time instant t Ex. “ Find the employees working at the company at time T’ and whose ssn belongs in range K “ T’ Range = {a,b} T c a b b T1 T2 T3 T8 T9 Time
Query Type III Given a key range, find a history of objects in this range. • Representative query for this type: • Pure-Key query: A special case of type III where the key range is reduced to a single key Ex. “Find the salary history of employee with ssn b“ Range = {b} c a b b T1 T2 T3 T8 T9 Time
Bitemporal Queries • Bitemporal queries will be a combination of the previous types. • Ex.“ Find all the company contracts that were valid on v = Jan,1,94 as recorded in the database during transaction time interval T = May,1 - May,20, 1993.
Three-Entry Notation Query-Type Representation Key / Valid / Transaction • Values for Each entry: • Point • Range • ‘ * ’ • ‘ - ‘ • Example • Transaction time Pure-Timeslice: “ * / - / point “ • Valid Pure-Key: “ point / * / - “
Access Method Costs • The performance of an access method is characterized by three costs: • 1) Storage Space, to physically store the data records and the structures of the access method • 2) Update Processing Time, isthe time to update the method’s data structures as a result of a change. • 3) Query Time, for each of the basic queries.
Storage Space • For Transaction time and Bitemporal methods, storage space is a function of ‘n’, O(n). ‘n’ isthe number of changes in the database. • Ex. For 1000 insertions and 1000 deletions, n is 2000. • For Valid time methods, storage space is a function of ‘l ’, O(l). ‘l ’ isthe number of interval-objects currently stored in the method.. • Ex. If seven Objects are stored in the database, l = 7.
Update Processing Time • Update processing time is dependent on the access method index structure. • For tree structured access method, it is a logarithmic time. • Ex. The best in-core algorithm for valid pure-timeslice query, requires O (log l ) update processing time per change.
Query Time • A method’s query time is a function of the query answer size, denoted by ‘ a ‘. • It is usually consists of two parts: • Logarithmic time for searching the access method structure to find the answer records. • Time that is required to retrieve the result. • Ex. The best in-core algorithm for valid pure-timeslice query, requires O (log l + a) query time.
Index Pagination and Data clustering • In a database environment the major cost of computation is based on how many pages are transferred between main memory and secondary memory. • Index pagination. How well index nodes of a method are paginated on the secondary memory. • Since index is used to search for and update data, its pagination greatly improves the performance of the access method. • Ex. B+-tree is a well paginated index, since it needs O(logB r) page accesses for searching or updating r objects, using pages of size B.
Index Pagination and Data clustering • Data Clustering. Is to store ‘logically’ related data, physically close. • It can substantially improve the performance of an access method since in order to answer query, fewer pages are accessed. • Ex. A transaction pure-timeslice query takes O(logB n + a/B) page accesses if the result are clustered in pages of size B. ‘n’ is the number of changes in the database and ‘a’ is the answer size. If result is not clustered O(logB n + a) page accesses are required.
Migration of Past Data to Another Location • Methods that support transaction time maintain all their past states. This property can easily result in excessive amounts of data. • Methods that support the following two criteria will have a better performance: • Whether or not past data can be separated from the current data, so that the smaller collection of the current data can be accessed more efficiently. • Whether data is appended sequentially to the method and never changed, so that write-once read-many (WORM) devices can be used.
Migration of Past Data to Another Location • Since current data is more frequently accessed than past data, If a method supports these criteria, past data can be moved to a higher capacity, but slower medium such as optical disk. • There are two ways for this separation: • WithManual approach, a process vacuums all records that are ‘dead’ when the process is called and moves them to the optical disk. • With Automated approach, any objects that become ‘dead’ (logically deleted) will be moved to the optical disk
Lower Bound on I/O Complexity • The minimum number of I/Os to solve a transaction time query is at least (logB n + a/B) • Since in apaginated environment O(logB n) is the minimum number of I/Os to search for the answer in the index structure and O(a/B) is the minimum number of pages to store the answer. • This lower bound is under the assumption that all time instants have the same probability of being in the query predicate. • If queries have special probabilities, the logarithmic search part will be improved.Ex. If most queries ask for the recent times.
Lower Bound on I/O Complexity • The minimum number of I/Os to solve a valid time query is also at least (logB l + a/B). • A method is called “ I/O Optimal “ if it achieves: • O(logB n + a/B) query time and O(n/B) space for transaction time • O(logB l + a/B) query time and O(l/B) space for valid time. • O(logB m) update processing per change. ‘m’ is the number of alive objects when the update takes place.
Copy Copy Copy Copy Copy a b b T1 T2 T3 T4 T5 Time Efficient Method Design for Transaction/Bitemporal Data • TheQuestion is how to efficiently store large amount of data produced by a transaction time database. • There are two approaches for storing the data to answer pure-timeslice queries efficiently (copy, log): • Copy approach stores a copy of the transaction database state (timeslice) for each transaction time that at least one change occurred.
Copy 1 Rec. Copy 2 Rec. Copy 2 Rec. Copy 2 Rec. Copy 2 Rec. a b b T1 T2 T3 T4 T5 Time Efficient Method Design for Transaction/Bitemporal Data • These copies are indexed by time t. Since changes arrive in order, therefore indexes are paginated (which provides a minimal query time). • It means O(logB n) page accesses are needed to find the timeslice and O(a/B) pages to retrieve the result. Therefore the page accesses and also query time is O(logB n + a/B)
Copy 1 Rec. Copy 2 Rec. Copy 3 Rec. Copy 4 Rec. Copy 5 Rec. . . . a b c d e T1 T2 T3 T4 T5 Time Efficient Method Design for Transaction/Bitemporal Data • Disadvantages: • The space used can in the worst case be O(n2/B) pages. 1 + 2 + 3 + 4 + 5 + . . . + n= n *(n+1)/2 = (n2 + n)/2 O(n2/B). • Update processing per change in the worst: O(n/B).
Copy 1 Rec. Copy 1 Rec. Copy 1 Rec. Copy 1 Rec. Copy 1 Rec. a b b T1 T2 T3 T4 T5 Time Efficient Method Design for Transaction/Bitemporal Data • Log approach stores only the changes that occur in the database time-stamped by the time instant on which they occurred. • The space used is reduced to O(n/ B) pages. • The Update processing per change is also reduces to O(1). • The Query time increased to O(n/ B) Since to find a timeslice all the past records may have to be searched
Efficient Method Design for Transaction/Bitemporal Data • For the Pure-Key queries a better approach than copy and log is to store history of each key separately, This creates a key-Only method. • For the Range-Timeslice queries it is best to cluster by transaction time and key within pages. • For the Bitemporal queries the copy approach can be used to store the collections at given transaction time and a log approach to store changes between copies. • Another approach is to index the bitemporal objects on a single time axis. This approach is not efficient because very few of the accessed objects may satisfy the valid-time predicate.
Method Classification • Transaction access method is classified into three categories: • Time-only : clustering data by time only. • Key-only : clustering data by key only. • Time-key : clustering data by both time and key.
Examples of Method Classes Transaction Time time-only key-only time-keyappend-only tree reverse chaining composite indexes time-index accession lists segment-R tree differential file approach time sequence array write-once B-tree achievable time index C-lists time-split B-tree snap shot index multiversion access window method structure overlapping B-tree
Time Index Access Method (Time Only) • Based B+-Tree access method on the time axis. • New nodes are always added on the rightmost leaf of the index. • It is equivalent to Copy approach
Key Key Current Past B+-tree Index B+-tree Index (K3,A3’,I3’) ø (K2,A2’,I2’) ø (K2,A2”,I2”) (K4,A4,I4) ø (K4,A4’,I4’) (K2,A2,I2) (K4,A4”,I4”) (K1,A1,I1) (K3,A3,I3) P3 P2 ø Reverse chaining (Key-Only)
Reverse Chaining • The current and past data are stored separately. • Previous version of a given key are linked together in reverse chronological order. • Current data is assume to be queried more often.(smaller data size means faster query) • Each tuple includes • Key • Attribute value • Life span interval • Pointer (which points to the previous version, if any, of this key). • Update processing is O(logB n) , the query time is O(logB n + a), and the used space is O(n / B)