470 likes | 582 Views
ArchIS: An Efficient Transaction-Time Temporal Database System Built on Relational Databases and XML. Fusheng Wang University of California, Los Angeles. Motivation: Temporal Applications. Financial applications Record-keeping applications Scheduling applications Scientific applications.
E N D
ArchIS: An Efficient Transaction-Time Temporal Database SystemBuilt on Relational Databases and XML Fusheng Wang University of California, Los Angeles
Motivation: Temporal Applications • Financial applications • Record-keeping applications • Scheduling applications • Scientific applications Most database applications are temporal in nature:
Temporal Databases: the Reality • Over 40 temporal data models and query languages have been proposed in the past • A long struggle to get around the limitations of RDBMS • No DBMS vendors have moved aggressively to extend SQL with temporal support
What’s Needed? • Expressive temporal representations and data models with minimal or no extension • Powerful languages for temporal queries with minimal or no extension • Indexing, clustering and query optimization techniques for efficient query support • Architectures that bring these together A temporal database system that provides:
Outline • Motivation • Viewing Relation History in XML • Temporal Queries with XQuery • The ArchIS System • Performance Study • Database Compression • Conclusion
Background: Publishing Relational Database as XML • Publishing relational DBs as XML • as actual XML documents: SQL/XML • as XML views: SilkRoute, XPeranto
Viewing Relation History in XML • Our proposal: view the history of relational DBs as XML documents: • Such history can be naturally represented in XML, without any extension to the data model • Temporal queries can be expressed in XQuery as is—without any extension to the language • Amenable for efficiently implementations
Temporal Grouping in XML • Temporal data models can be classified as: • Temporally ungrouped • Temporally grouped • Temporally grouped data models have more expressive power and are more natural for users • It is difficult to fit temporally grouped models into RDBMS • Temporally grouped data model can be represented well in XML
Example: Transaction-Time History of Tables • Timestamped tuple snapshots (temporally ungrouped) • Temporally grouped history of employees
XML Representation of DB History <employees tstart="1995-01-01" tend="1996-12-31"> <employee tstart="1995-01-01" tend="1996-12-31"> <empno tstart="1995-01-01" tend="1996-12-31">10003</empno> <name tstart="1995-01-01" tend="1996-12-31">Bob</name> <salary tstart="1995-01-01" tend="1995-05-31">60000</salary> <salary tstart="1995-06-01" tend="1996-12-31">70000</salary> <title tstart="1995-01-01" tend="1995-09-30">Engineer</title> <title tstart="1995-10-01" tend="1996-01-31">Sr Engineer</title> <title tstart="1996-02-01" tend="1996-12-31">Tech Leader</title> <deptno tstart="1995-01-01" tend="1995-09-30">d01</deptno> <deptno tstart="1995-10-01" tend="1996-12-31">d02</deptno> <DOB tstart="1995-01-01" tend="1996-12-31">1945-04-09</DOB> </employee> <!-- … --> </employees>
Advantages of XML Representations • The attribute value history is grouped, and can be queried directly • The H-document has a well-defined schema generated from the current table • The interval constraints are maintained in the updates
Outline • Motivation • Viewing Relation History in XML • Temporal Queries with XQuery • The ArchIS System • Performance Study • Database Compression • Conclusion
Temporal Queries with XQuery • XQuery: the coming standard query language for XML • With XQuery, we can specify temporal queries without any extension: • Temporal projection, snapshot queries, temporal joins, interval queries • Complex queries: ASINCE B, continuous periods, period containment
Temporal Queries with XQuery • Temporal projection: retrieve the salary history of “Bob”: element salary_history { for$s in doc("employees.xml")/ employees/employee/[name=“Bob”]/salary return$s } • Snapshot queries: retrieve the departments on 1996-01-31: for$din doc("depts.xml")/depts/dept [tstart(.) <= "1996-01-31" and tend(.) >= "1996-01-31"] let$n := $d/name[tstart(.)<="1996-01-31" and tend(.)>="1996-01-31"] let$m := $d/manager[tstart(.)<="1996-01-31" and tend(.)>= "1996-01-31"] return( element dept{$n,$m } )
Temporal Functions • Shield the user from the low-level details used in representing time, e.g., “now” • Eliminate the need for the user to write complex functions, e.g., coalescing • Predefined functions: • Restructuring: coalese($l) • Period comparison : toverlaps, tprecedes, tcontains, tequals, tmeets • Duration and date/time: tstart($e), tend($e), timespan($e) • telement(Ts, Te): constructs an empty element element timestamped as tstart=Ts, tend=Te
Support for ‘now’ • ‘now’: no change until now • Internally, “end of time” values are used to denote ‘now’, e.g., 9999-12-31 • Intervals are only accessed through built-in functions: tstart() returns the start of an interval, tend() returns the end or CURRENT_DATE if it’s different from 9999-12-31 • In the output, tend value can be: • “9999-12-31” • CURRENT_DATE by using rtend($e) that recursively replaces all the occurrence of 9999-12-31 with the current date, • “now”, using externalnow($e) that recursively replaces all the occurrence of \9999-12-31" with the string \now".
Outline • Motivation • Viewing Relation History in XML • Temporal Queries with XQuery • The ArchIS System • Performance Study • Database Compression • Conclusion
The ArchIS System • Two approaches are possible for storing and querying H-documents (H-views) • Native XML database approach: store H-documents directly into XML DB • XML-enabled RDBMS. Design issues include: • mapping (shredding) the XML views representing the H-documents into tables (H-tables) • translation of queries from the XML views to the H-tables • indexing, clustering and query mapping techniques • ArchIS: Archival Information System
The ArchIS System: Architecture Current Database Relational Data SQL Queries Active Rules/ update logs A R C H I S Temporal XML Data H-views (H-documents) Temporal XML Queries H-tables
H-tables • Assumptions • Each entity or relation has a unique key ( or composite keys) to identify it which will not change along the history. e.g., employee: empno • H-tables: • attribute history table: store history of each attribute • key table: built for the key • global relation table: record the history of relations • e.g.: current database: • employee(empno, name, sex, DOB, deptno, salary, title)
H-tables (cont’d) • Sample contents of employee_salary: ID SALARY TSTART TEND ======= ======= ========== ========== 100022 58805 02/04/1985 02/04/1986 100022 61118 02/05/1986 02/04/1987 100022 65103 02/05/1987 02/04/1988 100022 64712 02/05/1988 02/03/1989 100022 65245 02/04/1989 02/03/1990 100023 43162 07/13/1988 07/13/1989 ...
Updating Table Histories • Changes in the current database can be tracked with either update logs or triggers • DB2: triggers • ArchIS: update logs
Query Mapping • General purpose query mapping: XPeranto • In ArchIS, we have well-defined mapping between H-documents (or H-views) and H-tables • We map temporal XQuery queries into SQL, utilizing SQL/XML • SQL/XML is a new standard to map between RDBMS and XML • Both tag-binding and structure construction is pushed inside the relational engine, thus be very efficient
SQL/XML Publishing Functions • XMLElement and XMLAttribute select XMLElement (Name "dept", XMLAttributes (tstart as "tstart", tend as "tend"), deptname) from dept where deptname = ‘Sales’ <dept tstart ="02/04/1985" tend ="12/31/9999"> Sales </dept> • XMLAgg select XMLElement (Name as "new_employees", XMLAttributes ("02/04/2003" as "Since") XMLAgg (XMLElement (Name as "employee", e.name)) from employee_name as e where e.tstart >= ‘02/04/2003’ <new_employees Since ="02/04/2003"> <employee>Bob</employee> <employee>Jack</employee> </new_employees>
XQuery Mapping to SQL with SQL/XML • Temporal projection: retrieve the salary history of “Bob”: element salary_history { for$s in doc("employees.xml")/ employees/employee/[name=“Bob”]/salary return$s } select XMLElement (Name "salaryhistory", XMLAgg (XMLElement (Name as "salary", XMLAttributes (S.tstart as tstart, S.tend as "tend"), S.salary))) from employee_salary as S, employee_name as N where N.id = S.id and N.name = 'Bob' group by N.id
XQuery Mapping to SQL with SQL/XML: Steps • Identification of variable range • Map variables in FOR/LET clause into underlying H-tables • Generation of join conditions • There is a join condition any pair of distinct tuple variables: join them by ids • Translation of built in functions • Map built-in temporal functions in XQuery into functions in ArchIS • Output generation • use XMLElement and XMLAgg constructs
Temporal Clustering and Indexing • Tuples in H-tables are stored in the order of updates, thus neither temporally clustered nor clustered by objects • Traditional indexes such as B+ Tree will not help on snapshot queries, and better temporal clustering is needed • For every segment, usefulness: U = Nlive/Nall • At the beginning, U =100%, and it decreases with updates • The minimum tolerable usefulness: Umin
Segment-based Clustering Scheme Live Live All All All Segment 1 Segment 2 Segment 3 segstart1 segstart2 segstart3 segend1 segend2 segend3 tstarttuple <= segendSEG tendtuple >= segstartSEG
Segment-based Clustering Scheme • Initially all tuples for an attribute history table are archived in a live segment SEGlive with usefulness U =100%. With updates, when U drops below Umin: 1. A new segment is allocated; 2. The interval of this segment is recorded in the table segment(segno, segstart, segend); 3. All tuples in SEGlive are copied into a new segment Si sorted by id; 4. All live tuples in SEGlive are copied into a new live segment SEGlive', and the old live segment is dropped; After that, the new segment SEGlive’ becomes the new starting segment for updates
Segment-based Clustering Scheme (cont’d) • Sample segments: Segment1 (01/01/1985 - 10/17/1991): ID SALARY TSTART TEND 100002 40000 02/20/1988 02/19/1989 100002 42010 02/20/1989 02/19/1990 100002 42525 02/20/1990 02/19/1991 100002 42727 02/20/1991 12/31/9999 ... Segment2 (10/18/1991 - 07/08/1995): ID SALARY TSTART TEND 100002 42727 02/20/1991 02/19/1992 100002 45237 02/20/1992 02/18/1993 100002 46465 02/19/1993 02/18/1994 100002 47418 02/19/1994 02/18/1995 100002 47273 02/19/1995 12/31/9999 ...
Advantages of Segment-based Clustering Scheme • The current live segment always has a high usefulness, assuring efficient updates; • Records are globally temporally clustered on segments; • For snapshot queries, only one segment is used; for interval queries, only segments involved are used; • Flexibility to control the number of redundant tuples in segments with Umin
Storage Usage of Segment-based Clustering Relative storage size with different Umin Nseg <= N0/(1-Umin) NS
Query Performance on Temporal Data with Segment-based Clustering Queries: Point: Q1 Snapshot: Q2 Interval: Q5 History: Q3, Q4, Q6
Outline • Motivation • Viewing Relation History in XML • Temporal Queries with XQuery • The ArchIS System • Performance Study • Database Compression • Conclusion
Performance Study: Experimental Setup • Systems: Tamino, DB2, and ArchIS • ArchIS uses BerkeleyDB as its storage manager, and it builds on top of it a SQL query engine • Temporal data set: the history of 300,024 employees over 17 years • The simulation models real world salary increases, changes of titles, and changes of departments • The size of the XML data is 334MB • The single large XML document is cut into a collection of 15,000 small XML documents with around 25KB each • Machine: Pentium IV 2.4GHz PC with RedHat 8.0
Performance Study: Query Performance DB2 and ArchIS: with clustering Tamino: without clustering snapshot query Q2 on ArchIS is 137 times faster than that on Tamino; interval query Q5 is 91 times faster; history Q6 is 25 times faster; Q4 4 times faster, and Q3 near 3 times faster. Tamino with clustering: snapshot Q2 is 3.3 times faster than without clustering ( still 41 times slower than archIS); interval query Q5 is 2.9 times faster than without clustering ( still 31 times slower than on ArchIS); history queries are much slower
Outline • Motivation • Viewing Relation History in XML • Temporal Queries with XQuery • The ArchIS System • Performance Study • Database Compression • Conclusion
Database Compression • The disparity between CPU/memory and disk speeds is becoming larger and larger • Cost to read one IDE disk page: 14ms • Cost to uncompress one page: 1.1ms(500MHz CPU) 0.26ms(2.4GHz CPU) • Cost to retrieve one compressed page: 14ms + 0.26ms = 14.3ms • Cost to retrieve uncompressed pages (3.6 pages): 14ms x 3.6 = 50.4ms
Page-based Compression: PageZIP • Traditional data compression tools: compress a file as a whole • PageZIP: page-based compression and uncompression at the granularity of a page • Based on gzip library: zlib • Benefit: save space; point, snapshot or interval queries only retrieve a small fraction of the history, and can be efficient
PageZIP Segment 1 Segment n page 1 ID: 1001 - 1100 page 2 ID: 1100 - 1203 page 3 ID: 1203 - 1331 … …
Storage Utilization with Compression • For each attribute history table, we compress it as a sequence of pages and store each page as a BLOB in a RDBMS employee_salary (sid, salary, tstart, tend) => employee_salary_blob(pageno, startsid, endsid, pageblob)
Update Performance • For RDBMS, only the current segment is used for updates. For Tamino, current data and historical data are clustered together • Update an employee’s salary: • DB2: 0.29 seconds; Tamino: 1.2 seconds • Assume that every employee gets updated once a year: about 1/260 of the total employee get updated every day on average • DB2: 1.52 seconds; Tamino: 15 seconds • In the worse case for segment-based archiving: 39 seconds for copying segments and 36 segments for compression: but only once
Summary • We built a transaction time temporal database on RDBMS and XML, with: • XML to support temporally grouped (virtual) representations of the database history • XQuery to express powerful temporal queries on such views • temporal clustering for managing the actual historical data in a RDBMS • SQL/XML for executing the queries on the XML views as equivalent queries on the relational DB • compression as option for efficient storage • ArchIS provides a unified solution for a wide spectrum of temporal application problems
Future Work • Friendly temporal query interfaces based on temporally grouped models • Other clustering and indexing techniques to be investigated • Other efficient data compression techniques proposed for XML data to be investigated • Apply the approach to valid-time DB and bi-temporal DB • Apply the approach to OODBMS and semi-structured data model