Fusheng Wang University of California, Los Angeles

ArchIS: An Efficient Transaction-Time Temporal Database SystemBuilt on Relational Databases and XML Fusheng Wang University of California, Los Angeles

Temporal Databases: the Reality • Over 40 temporal data models and query languages have been proposed in the past • A long struggle to get around the limitations of RDBMS • No DBMS vendors have moved aggressively to extend SQL with temporal support • The problem is not due to a lack of applications. • SQL is at least in part to blame for the problem.

Outline • Transaction Time History of Database relations in XML • and Temporal Queries with XQuery • The Performance Problem • The ArchIS System

Background: Publishing Relational Database as XML • Publishing relational DBs as XML • as actual XML documents: SQL/XML • as XML views: SilkRoute, XPeranto • Here we publish the history of relational content as XML documents or XML views • Using a temporally grouped representation.

Example: Transaction-Time History of Tables • Timestamped tuple snapshots (temporally ungrouped) • Temporally grouped history of employees

XML Representation of DB History <employees tstart="1995-01-01" tend="1996-12-31"> <employee tstart="1995-01-01" tend="1996-12-31"> <empno tstart="1995-01-01" tend="1996-12-31">10003</empno> <name tstart="1995-01-01" tend="1996-12-31">Bob</name> <salary tstart="1995-01-01" tend="1995-05-31">60000</salary> <salary tstart="1995-06-01" tend="1996-12-31">70000</salary> <title tstart="1995-01-01" tend="1995-09-30">Engineer</title> <title tstart="1995-10-01" tend="1996-01-31">Sr Engineer</title> <title tstart="1996-02-01" tend="1996-12-31">Tech Leader</title> <deptno tstart="1995-01-01" tend="1995-09-30">d01</deptno> <deptno tstart="1995-10-01" tend="1996-12-31">d02</deptno> <DOB tstart="1995-01-01" tend="1996-12-31">1945-04-09</DOB> </employee>  </employees>

Departments <?xml version="1.0" encoding="UTF-8"?> <departments tend="9999-12-31 tstart="1985-01-01"> <department tend="9999-12-31" tstart="1985-01-01"> <deptno tend="9999-12-31" tstart="1985-01-01">d001</deptno> <deptname tend="9999-12-31"tstart="1985-01-01">Marketing</deptname> <mgrno tend="1991-10-01" tstart="1985-01-01">110022</mgrno> <mgrno tend="9999-12-31" tstart="1991-10-01">110039</mgrno> </department> <department tend="9999-12-31" tstart="1985-01-01"> <deptno tend="9999-12-31" tstart="1985-01-01">d002</deptno> <deptname tend="9999-12-31” tstart="1985-01-01">Finance</deptname> <mgrno tend="1989-12-17" tstart="1985-01-01">110085</mgrno> <mgrno tend="9999-12-31” tstart="1989-12-17">110114</mgrno> </department> ….etc.

Advantages of XML Representations • The attribute value history is grouped, and can be queried directly without coalescing • The H-document has a well-defined schema generated from the current table

Temporal Queries with XQuery • XQuery: the coming standard query language for XML • With XQuery, we can specify temporal queries without any extension: • Temporal projection, snapshot queries, temporal joins, interval queries • Complex queries: ASINCE B, continuous periods, period containment

Temporal Queries with XQuery • Temporal projection: retrieve the salary history of “Bob”: element salary_history { for$s in doc("employees.xml")/ employees/employee/[name=“Bob”]/salary return$s }

Temporal Queries with XQuery • Snapshot queries: retrieve the departments on 1996-01-31: for$din doc("depts.xml")/depts/dept [tstart(.) <= "1996-01-31" and tend(.) >= "1996-01-31"] let$n := $d/name[tstart(.)<="1996-01-31" and tend(.)>="1996-01-31"] let$m := $d/manager[tstart(.)<="1996-01-31" and tend(.)>= "1996-01-31"] return( elementdept{$n,$m } )

Temporal Functions • Shield the user from the low-level details used in representing time, e.g., “now” • Eliminate the need for the user to write complex functions, e.g., coalescing • Predefined functions: • Restructuring: coalese($l) • Period comparison : toverlaps, tprecedes, tcontains, tequals, tmeets • Duration and date/time: tstart($e), tend($e), timespan($e) • telement(Ts, Te): constructs an empty element element timestamped as tstart=Ts, tend=Te

Support for ‘now’ • ‘now’: no change until now • Internally, “end of time” values are used to denote ‘now’, e.g., 9999-12-31 • Intervals are only accessed through built-in functions: tstart() returns the start of an interval, tend() returns the end or CURRENT_DATE if it’s different from 9999-12-31 • In the output, tend value can be: • “9999-12-31” • CURRENT_DATE by using rtend($e) that recursively replaces all the occurrence of 9999-12-31 with the current date, • “now”, using externalnow($e) that recursively replaces all the occurrence of \9999-12-31" with the string \now".

Outline • Viewing Relation History in XML • Temporal Queries with XQuery • Performance Issues • The ArchIS (Archival Information System) Project

Architecture: Two approaches • Native XML database approach: store H-documents directly into XML DB—limited performance and scalability • XML-enabled RDBMS. Design issues include: • mapping (shredding) the XML views representing the H-documents into tables (H-tables) • translation of queries from the XML views to the H-tables • indexing, clustering and query mapping techniques • ArchIS: Archival Information System

The ArchIS System: Architecture Current Database Relational Data SQL Queries Active Rules/ update logs A R C H I S Temporal XML Data H-views (H-documents) Temporal XML Queries H-tables

H-tables • Assumptions • Each entity or relation has a unique key ( or composite keys) to identify it which will not change along the history. e.g., employee: empno • H-tables: • attribute history table: store history of each attribute • key table: built for the key • global relation table: record the history of relations • e.g.: current database: • employee(empno, name, sex, DOB, deptno, salary, title)

H-tables(cont’d)

H-tables (cont’d) • Sample contents of employee_salary: ID SALARY TSTART TEND ======= ======= ========== ========== 100022 58805 02/04/1985 02/04/1986 100022 61118 02/05/1986 02/04/1987 100022 65103 02/05/1987 02/04/1988 100022 64712 02/05/1988 02/03/1989 100022 65245 02/04/1989 02/03/1990 100023 43162 07/13/1988 07/13/1989 ...

Query Mapping • General purpose query mapping: XPeranto • In ArchIS, we have well-defined mapping between H-documents (or H-views) and H-tables • We map temporal XQuery queries into SQL, utilizing SQL/XML • SQL/XML is a new standard to map between RDBMS and XML • Both tag-binding and structure construction is pushed inside the relational engine, thus be very efficient

XQuery Mapping to SQL with SQL/XML • Temporal projection: retrieve the salary history of “Bob”: element salary_history { for$s in doc("employees.xml")/ employees/employee/[name=“Bob”]/salary return$s } select XMLElement (Name "salaryhistory", XMLAgg (XMLElement (Name as "salary", XMLAttributes (S.tstart as tstart, S.tend as "tend"), S.salary))) from employee_salary as S, employee_name as N where N.id = S.id and N.name = 'Bob' group by N.id

SQL/XML Publishing Functions • XMLElement and XMLAttribute select XMLElement (Name "dept", XMLAttributes (tstart as "tstart", tend as "tend"), deptname) from dept where deptname = ‘Sales’ <dept tstart ="02/04/1985" tend ="12/31/9999"> Sales </dept>

SQL/XML Publishing Functions • XMLAgg select XMLElement (Name as "new_employees", XMLAttributes ("02/04/2003" as "Since") XMLAgg (XMLElement (Name as "employee", e.name)) from employee_name as e where e.tstart >= ‘02/04/2003’ <new_employees Since ="02/04/2003"> <employee>Bob</employee> <employee>Jack</employee> </new_employees>

Performance Study: Experimental Setup • Systems: Tamino, DB2, and ArchIS • ArchIS uses BerkeleyDB as its storage manager, and it builds on top of it its SQL query engine • Temporal data set: the history of 300,024 employees over 17 years • The simulation models real world salary increases, changes of titles, and changes of departments • The size of the XML data is 334MB • The single large XML document is cut into a collection of 15,000 small XML documents with around 25KB each • Machine: Pentium IV 2.4GHz PC with RedHat 8.0

Performance Study: Query Performance DB2 and ArchIS: with clustering Tamino: without clustering snapshot query Q2 on ArchIS is 137 times faster than that on Tamino; interval query Q5 is 91 times faster; history Q6 is 25 times faster; Q4 4 times faster, and Q3 near 3 times faster. Tamino with clustering: snapshot Q2 is 3.3 times faster than without clustering ( still 41 times slower than archIS); interval query Q5 is 2.9 times faster than without clustering ( still 31 times slower than on ArchIS); history queries are much slower

Update Performance • For RDBMS, only the current segment is used for updates. For Tamino, current data and historical data are clustered together • Update an employee’s salary: • DB2: 0.29 seconds; Tamino: 1.2 seconds • Assume that every employee gets updated once a year: about 1/260 of the total employee get updated every day on average • DB2: 1.52 seconds; Tamino: 15 seconds • In the worse case for segment-based archiving: 39 seconds for copying segments and 36 segments for compression: but only once

ArchIS Summary • We built a transaction time temporal database on RDBMS and XML, with: • XML to support temporally grouped (virtual) representations of the database history • XQuery to express powerful temporal queries on such views • The XML historical views are shredded back into relations (H-tables) • SQL/XML for executing the queries on the XML views as equivalent queries on the relational DB • Temporal clustering schemes based on the concept of page useful deliver good performance.

Temporal Clustering and Indexing • Tuples in H-tables are stored in the order of updates, thus neither temporally clustered nor clustered by objects • Traditional indexes such as B+ Tree will not help on snapshot queries, and better temporal clustering is needed • For every segment, usefulness: U = Nlive/Nall • At the beginning, U =100%, and it decreases with updates • The minimum tolerable usefulness: Umin

Segment-based Clustering Scheme Live Live All All All Segment 1 Segment 2 Segment 3 segstart1 segstart2 segstart3 segend1 segend2 segend3 tstarttuple <= segendSEG tendtuple >= segstartSEG

Segment-based Clustering Scheme • Initially all tuples for an attribute history table are archived in a live segment SEGlive with usefulness U =100%. With updates, when U drops below Umin: 1. A new segment is allocated; 2. The interval of this segment is recorded in the table segment(segno, segstart, segend); 3. All tuples in SEGlive are copied into a new segment Si sorted by id; 4. All live tuples in SEGlive are copied into a new live segment SEGlive', and the old live segment is dropped; After that, the new segment SEGlive’ becomes the new starting segment for updates

Segment-based Clustering Scheme (cont’d) • Sample segments: Segment1 (01/01/1985 - 10/17/1991): ID SALARY TSTART TEND 100002 40000 02/20/1988 02/19/1989 100002 42010 02/20/1989 02/19/1990 100002 42525 02/20/1990 02/19/1991 100002 42727 02/20/1991 12/31/9999 ... Segment2 (10/18/1991 - 07/08/1995): ID SALARY TSTART TEND 100002 42727 02/20/1991 02/19/1992 100002 45237 02/20/1992 02/18/1993 100002 46465 02/19/1993 02/18/1994 100002 47418 02/19/1994 02/18/1995 100002 47273 02/19/1995 12/31/9999 ...

Advantages of Segment-based Clustering Scheme • The current live segment always has a high usefulness, assuring efficient updates; • Records are globally temporally clustered on segments; • For snapshot queries, only one segment is used; for interval queries, only segments involved are used; • Flexibility to control the number of redundant tuples in segments with Umin

Storage Usage of Segment-based Clustering Relative storage size with different Umin Nseg <= N0/(1-Umin) NS

Query Performance on Temporal Data with Segment-based Clustering Queries: Point: Q1 Snapshot: Q2 Interval: Q5 History: Q3, Q4, Q6

Fusheng Wang University of California, Los Angeles

Fusheng Wang University of California, Los Angeles

Presentation Transcript

Los Angeles California

Judea Pearl University of California Los Angeles

California State University, Los Angeles

Los Angeles California

University of California, Los Angeles ( UCLA)

Claire Chik (University of California, Los Angeles)

University of California, Los Angeles

Los Angeles, California

University of California, Los Angeles

Los Angeles, California

Anne J. Gilliland University of California Los Angeles

University of California: Los Angeles

Jennifer A. Lindholm University of California, Los Angeles

California State University, Los Angeles

Silvia Nittel University of California, Los Angeles

Steve Horvath University of California, Los Angeles Acknowledgement:

Fusheng Wang University of California, Los Angeles

California State University, Los Angeles

Steve Horvath University of California, Los Angeles Acknowledgement:

Los Angeles California