340 likes | 447 Views
ArchIS: An Efficient Transaction-Time Temporal Database System Built on Relational Databases and XML. Fusheng Wang University of California, Los Angeles. Temporal Databases: the Reality. Over 40 temporal data models and query languages have been proposed in the past
E N D
ArchIS: An Efficient Transaction-Time Temporal Database SystemBuilt on Relational Databases and XML Fusheng Wang University of California, Los Angeles
Temporal Databases: the Reality • Over 40 temporal data models and query languages have been proposed in the past • A long struggle to get around the limitations of RDBMS • No DBMS vendors have moved aggressively to extend SQL with temporal support • The problem is not due to a lack of applications. • SQL is at least in part to blame for the problem.
Outline • Transaction Time History of Database relations in XML • and Temporal Queries with XQuery • The Performance Problem • The ArchIS System
Background: Publishing Relational Database as XML • Publishing relational DBs as XML • as actual XML documents: SQL/XML • as XML views: SilkRoute, XPeranto • Here we publish the history of relational content as XML documents or XML views • Using a temporally grouped representation.
Example: Transaction-Time History of Tables • Timestamped tuple snapshots (temporally ungrouped) • Temporally grouped history of employees
XML Representation of DB History <employees tstart="1995-01-01" tend="1996-12-31"> <employee tstart="1995-01-01" tend="1996-12-31"> <empno tstart="1995-01-01" tend="1996-12-31">10003</empno> <name tstart="1995-01-01" tend="1996-12-31">Bob</name> <salary tstart="1995-01-01" tend="1995-05-31">60000</salary> <salary tstart="1995-06-01" tend="1996-12-31">70000</salary> <title tstart="1995-01-01" tend="1995-09-30">Engineer</title> <title tstart="1995-10-01" tend="1996-01-31">Sr Engineer</title> <title tstart="1996-02-01" tend="1996-12-31">Tech Leader</title> <deptno tstart="1995-01-01" tend="1995-09-30">d01</deptno> <deptno tstart="1995-10-01" tend="1996-12-31">d02</deptno> <DOB tstart="1995-01-01" tend="1996-12-31">1945-04-09</DOB> </employee> <!-- … --> </employees>
Departments <?xml version="1.0" encoding="UTF-8"?> <departments tend="9999-12-31 tstart="1985-01-01"> <department tend="9999-12-31" tstart="1985-01-01"> <deptno tend="9999-12-31" tstart="1985-01-01">d001</deptno> <deptname tend="9999-12-31"tstart="1985-01-01">Marketing</deptname> <mgrno tend="1991-10-01" tstart="1985-01-01">110022</mgrno> <mgrno tend="9999-12-31" tstart="1991-10-01">110039</mgrno> </department> <department tend="9999-12-31" tstart="1985-01-01"> <deptno tend="9999-12-31" tstart="1985-01-01">d002</deptno> <deptname tend="9999-12-31” tstart="1985-01-01">Finance</deptname> <mgrno tend="1989-12-17" tstart="1985-01-01">110085</mgrno> <mgrno tend="9999-12-31” tstart="1989-12-17">110114</mgrno> </department> ….etc.
Advantages of XML Representations • The attribute value history is grouped, and can be queried directly without coalescing • The H-document has a well-defined schema generated from the current table
Temporal Queries with XQuery • XQuery: the coming standard query language for XML • With XQuery, we can specify temporal queries without any extension: • Temporal projection, snapshot queries, temporal joins, interval queries • Complex queries: ASINCE B, continuous periods, period containment
Temporal Queries with XQuery • Temporal projection: retrieve the salary history of “Bob”: element salary_history { for$s in doc("employees.xml")/ employees/employee/[name=“Bob”]/salary return$s }
Temporal Queries with XQuery • Snapshot queries: retrieve the departments on 1996-01-31: for$din doc("depts.xml")/depts/dept [tstart(.) <= "1996-01-31" and tend(.) >= "1996-01-31"] let$n := $d/name[tstart(.)<="1996-01-31" and tend(.)>="1996-01-31"] let$m := $d/manager[tstart(.)<="1996-01-31" and tend(.)>= "1996-01-31"] return( elementdept{$n,$m } )
Temporal Functions • Shield the user from the low-level details used in representing time, e.g., “now” • Eliminate the need for the user to write complex functions, e.g., coalescing • Predefined functions: • Restructuring: coalese($l) • Period comparison : toverlaps, tprecedes, tcontains, tequals, tmeets • Duration and date/time: tstart($e), tend($e), timespan($e) • telement(Ts, Te): constructs an empty element element timestamped as tstart=Ts, tend=Te
Support for ‘now’ • ‘now’: no change until now • Internally, “end of time” values are used to denote ‘now’, e.g., 9999-12-31 • Intervals are only accessed through built-in functions: tstart() returns the start of an interval, tend() returns the end or CURRENT_DATE if it’s different from 9999-12-31 • In the output, tend value can be: • “9999-12-31” • CURRENT_DATE by using rtend($e) that recursively replaces all the occurrence of 9999-12-31 with the current date, • “now”, using externalnow($e) that recursively replaces all the occurrence of \9999-12-31" with the string \now".
Outline • Viewing Relation History in XML • Temporal Queries with XQuery • Performance Issues • The ArchIS (Archival Information System) Project
Architecture: Two approaches • Native XML database approach: store H-documents directly into XML DB—limited performance and scalability • XML-enabled RDBMS. Design issues include: • mapping (shredding) the XML views representing the H-documents into tables (H-tables) • translation of queries from the XML views to the H-tables • indexing, clustering and query mapping techniques • ArchIS: Archival Information System
The ArchIS System: Architecture Current Database Relational Data SQL Queries Active Rules/ update logs A R C H I S Temporal XML Data H-views (H-documents) Temporal XML Queries H-tables
H-tables • Assumptions • Each entity or relation has a unique key ( or composite keys) to identify it which will not change along the history. e.g., employee: empno • H-tables: • attribute history table: store history of each attribute • key table: built for the key • global relation table: record the history of relations • e.g.: current database: • employee(empno, name, sex, DOB, deptno, salary, title)
H-tables (cont’d) • Sample contents of employee_salary: ID SALARY TSTART TEND ======= ======= ========== ========== 100022 58805 02/04/1985 02/04/1986 100022 61118 02/05/1986 02/04/1987 100022 65103 02/05/1987 02/04/1988 100022 64712 02/05/1988 02/03/1989 100022 65245 02/04/1989 02/03/1990 100023 43162 07/13/1988 07/13/1989 ...
Query Mapping • General purpose query mapping: XPeranto • In ArchIS, we have well-defined mapping between H-documents (or H-views) and H-tables • We map temporal XQuery queries into SQL, utilizing SQL/XML • SQL/XML is a new standard to map between RDBMS and XML • Both tag-binding and structure construction is pushed inside the relational engine, thus be very efficient
XQuery Mapping to SQL with SQL/XML • Temporal projection: retrieve the salary history of “Bob”: element salary_history { for$s in doc("employees.xml")/ employees/employee/[name=“Bob”]/salary return$s } select XMLElement (Name "salaryhistory", XMLAgg (XMLElement (Name as "salary", XMLAttributes (S.tstart as tstart, S.tend as "tend"), S.salary))) from employee_salary as S, employee_name as N where N.id = S.id and N.name = 'Bob' group by N.id
SQL/XML Publishing Functions • XMLElement and XMLAttribute select XMLElement (Name "dept", XMLAttributes (tstart as "tstart", tend as "tend"), deptname) from dept where deptname = ‘Sales’ <dept tstart ="02/04/1985" tend ="12/31/9999"> Sales </dept>
SQL/XML Publishing Functions • XMLAgg select XMLElement (Name as "new_employees", XMLAttributes ("02/04/2003" as "Since") XMLAgg (XMLElement (Name as "employee", e.name)) from employee_name as e where e.tstart >= ‘02/04/2003’ <new_employees Since ="02/04/2003"> <employee>Bob</employee> <employee>Jack</employee> </new_employees>
Performance Study: Experimental Setup • Systems: Tamino, DB2, and ArchIS • ArchIS uses BerkeleyDB as its storage manager, and it builds on top of it its SQL query engine • Temporal data set: the history of 300,024 employees over 17 years • The simulation models real world salary increases, changes of titles, and changes of departments • The size of the XML data is 334MB • The single large XML document is cut into a collection of 15,000 small XML documents with around 25KB each • Machine: Pentium IV 2.4GHz PC with RedHat 8.0
Performance Study: Query Performance DB2 and ArchIS: with clustering Tamino: without clustering snapshot query Q2 on ArchIS is 137 times faster than that on Tamino; interval query Q5 is 91 times faster; history Q6 is 25 times faster; Q4 4 times faster, and Q3 near 3 times faster. Tamino with clustering: snapshot Q2 is 3.3 times faster than without clustering ( still 41 times slower than archIS); interval query Q5 is 2.9 times faster than without clustering ( still 31 times slower than on ArchIS); history queries are much slower
Update Performance • For RDBMS, only the current segment is used for updates. For Tamino, current data and historical data are clustered together • Update an employee’s salary: • DB2: 0.29 seconds; Tamino: 1.2 seconds • Assume that every employee gets updated once a year: about 1/260 of the total employee get updated every day on average • DB2: 1.52 seconds; Tamino: 15 seconds • In the worse case for segment-based archiving: 39 seconds for copying segments and 36 segments for compression: but only once
ArchIS Summary • We built a transaction time temporal database on RDBMS and XML, with: • XML to support temporally grouped (virtual) representations of the database history • XQuery to express powerful temporal queries on such views • The XML historical views are shredded back into relations (H-tables) • SQL/XML for executing the queries on the XML views as equivalent queries on the relational DB • Temporal clustering schemes based on the concept of page useful deliver good performance.
Temporal Clustering and Indexing • Tuples in H-tables are stored in the order of updates, thus neither temporally clustered nor clustered by objects • Traditional indexes such as B+ Tree will not help on snapshot queries, and better temporal clustering is needed • For every segment, usefulness: U = Nlive/Nall • At the beginning, U =100%, and it decreases with updates • The minimum tolerable usefulness: Umin
Segment-based Clustering Scheme Live Live All All All Segment 1 Segment 2 Segment 3 segstart1 segstart2 segstart3 segend1 segend2 segend3 tstarttuple <= segendSEG tendtuple >= segstartSEG
Segment-based Clustering Scheme • Initially all tuples for an attribute history table are archived in a live segment SEGlive with usefulness U =100%. With updates, when U drops below Umin: 1. A new segment is allocated; 2. The interval of this segment is recorded in the table segment(segno, segstart, segend); 3. All tuples in SEGlive are copied into a new segment Si sorted by id; 4. All live tuples in SEGlive are copied into a new live segment SEGlive', and the old live segment is dropped; After that, the new segment SEGlive’ becomes the new starting segment for updates
Segment-based Clustering Scheme (cont’d) • Sample segments: Segment1 (01/01/1985 - 10/17/1991): ID SALARY TSTART TEND 100002 40000 02/20/1988 02/19/1989 100002 42010 02/20/1989 02/19/1990 100002 42525 02/20/1990 02/19/1991 100002 42727 02/20/1991 12/31/9999 ... Segment2 (10/18/1991 - 07/08/1995): ID SALARY TSTART TEND 100002 42727 02/20/1991 02/19/1992 100002 45237 02/20/1992 02/18/1993 100002 46465 02/19/1993 02/18/1994 100002 47418 02/19/1994 02/18/1995 100002 47273 02/19/1995 12/31/9999 ...
Advantages of Segment-based Clustering Scheme • The current live segment always has a high usefulness, assuring efficient updates; • Records are globally temporally clustered on segments; • For snapshot queries, only one segment is used; for interval queries, only segments involved are used; • Flexibility to control the number of redundant tuples in segments with Umin
Storage Usage of Segment-based Clustering Relative storage size with different Umin Nseg <= N0/(1-Umin) NS
Query Performance on Temporal Data with Segment-based Clustering Queries: Point: Q1 Snapshot: Q2 Interval: Q5 History: Q3, Q4, Q6