This research paper discusses the problem of outdated and inconsistent data in a distributed monitoring system and proposes a solution that allows users to interpret and understand the inconsistencies while maintaining system scalability and autonomy.
TRAC: Toward Recency And Consistency Reporting in a Database with Distributed Data Sources
Jiansheng Huang, Jeffrey F. Naughton, Miron Livny
Motivating Scenario • In a distributed monitoring system, autonomous nodes report in at unpredictable intervals, so the state captured at the central site is always somewhat out of date and may be inconsistent.
Specific Scenario • A cluster of machines • A job can be submitted to any node. • The submit node may schedule the job to run on another machine. • The state is captured in a centralized database.
An Example • [Figure: job j is submitted to machine m1; m1 schedules job j to run on m2; m2 starts running job j. Meanwhile, the central site's captured state of m1 progresses from "no info about job j" to "job j received here & scheduled to run on m2", while its state of m2 still shows "no info about job j" even after m2 is running the job.]
Enforcing Consistency • Option 1: Do everything in distributed transactions • Won't scale to large systems. • At odds with the autonomous nature of the nodes. • Option 2: Only present the latest consistent snapshot • Can present very out-of-date information to the user.
Problem Addressed • Question: how can we help users cope with inconsistencies in the collected data while retaining the scalability and autonomy of the system? • Our answer: instead of enforcing consistency, allow inconsistency and help the user interpret what they see. • Issue: how to do so efficiently, without swamping the user with irrelevant information?
Reporting Recency
• A user asks: "Has the machine that m1 scheduled job j to run on started running it?"
• [Figure: the information system at the central site holds the state of m1 ("job j received here & scheduled to run on m2"), the state of m2 ("no info about job j"), and 9,998 more machine states.]
• Answer without recency reporting: NOT YET
• Naïve way to report recency: m1 last reported in at …, m2 last reported in at …, m3 last reported in at …, …, m10000 last reported in at …
• Our idea is to report only: m1 last reported in at 09/12/2006 15:20, m2 last reported in at 09/11/2006 09:30, and nothing else.
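To make the idea concrete, here is a minimal sketch of the focused recency query behind that report, assuming a lastReport table with columns source_id and recency_ts (the schema model is introduced later); the column names are illustrative.

-- Illustrative focused recency query: report recency only for the data
-- sources relevant to the user's question (m1 and m2), not all 10,000 machines.
SELECT source_id, recency_ts
FROM   lastReport
WHERE  source_id IN ('m1', 'm2');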
Roadmap • Background • Definitions and Techniques • Prototype and Evaluation • Conclusion and Future Work
RDBMS Terminology • Data source: an abstraction for a node being monitored; each data source streams facts (updates) to the central RDBMS. • Recency timestamp: the most recent time a data source reported in.
Goals • Completeness: all "relevant" data sources are in the report. • Precision: reduce the number of "irrelevant" data sources included in the report, using efficient techniques.
Schema Model • lastReport table (source id, recency ts) • Other relevant relations (c1, c2, …, source id) • Assumption: updates from a data source can only change tuples that carry its own source id in the data source field.
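A minimal DDL sketch of this schema model, using the tables from the examples that follow (column names and types are assumptions, not the prototype's exact definitions):

-- Illustrative schema (names/types assumed for the sketch).
CREATE TABLE lastReport (
    source_id  VARCHAR(32) PRIMARY KEY,   -- the monitored data source
    recency_ts TIMESTAMP   NOT NULL       -- last time this source reported in
);

CREATE TABLE Activity (
    mach_id    VARCHAR(32),               -- data source column
    value      VARCHAR(16),               -- e.g. 'idle' or 'busy'
    event_time TIMESTAMP
);

CREATE TABLE Routing (
    mach_id    VARCHAR(32),               -- data source column
    neighbor   VARCHAR(32),
    event_time TIMESTAMP
);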
Roadmap • Background • Definitions and Techniques • Prototype and Evaluation • Conclusion and Future Work
Definitions
Definition 1. If Q references a single relation R, we say a data source s is relevant for Q if there exists a potential tuple t for R from s such that t satisfies Q's predicates.
Definition 2. For a query Q referencing relations R1, R2, …, Rn, we say that a data source s is relevant for Q if there exist a j and a potential tuple tj for Rj from s, and for every k ≠ j a tuple tk for Rk, such that these tuples together satisfy Q's predicates. In this case we say that s is relevant for Q via Rj.
Theorem 1. No single update from an irrelevant data source can change the result of a query.
Example 1
Suppose we keep track of machine activities in a table called Activity. The attributes of Activity are mach_id, the activity value, and the time when the activity value became valid. We treat mach_id as the data source column.
Activity(mach_id, value, event_time)
Table 1: An example instance for Activity
  Mach_id   Value   Event_time
  m1        Idle    03/11/2006 20:37:46
  m2        Busy    02/10/2006 18:22:01
  m3        Idle    03/12/2006 10:23:05
SELECT mach_id FROM Activity WHERE mach_id IN ('m1', 'm2') AND value = 'idle';
The query result is {'m1'}. The set of relevant data sources for the query is {'m1', 'm2'}.
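For this query the focused recency query falls directly out of the predicate on the data source column; a sketch against the illustrative lastReport schema above:

-- Only m1 and m2 can contribute a potential tuple satisfying the predicates,
-- so only their recency timestamps need to be reported.
SELECT l.source_id, l.recency_ts
FROM   lastReport l
WHERE  l.source_id IN ('m1', 'm2');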
Example 2
Consider a P2P system where we use a Routing table to capture neighboring relationships. Mach_id is treated as the data source column. Activity is the same as in Example 1.
Routing(mach_id, neighbor, event_time)
Table 2: A sample instance for Routing
  Mach_id   Neighbor   Event_time
  m1        m3         03/12/2006 23:20
  m2        m3         02/10/2006 03:34
Table 3: An example instance for Activity
  Mach_id   Value   Event_time
  m1        Idle    03/11/2006 20:37
  m2        Busy    02/10/2006 18:22
  m3        Idle    03/12/2006 10:23
SELECT A.mach_id FROM Routing R, Activity A WHERE R.mach_id = 'm1' AND R.neighbor = A.mach_id AND A.value = 'idle';
The query result is {'m3'}. The set of data sources relevant via R is {'m1'}; the set of data sources relevant via A is {'m3'}.
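One plausible focused recency query for this join (a sketch, not necessarily the exact query our generator emits): sources relevant via R come from the predicate on R's data source column, and sources relevant via A are those that can join with an existing Routing tuple of m1.

-- Relevant via R: the predicate R.mach_id = 'm1' pins the data source to m1.
-- Relevant via A: any data source listed as a neighbor of m1 in Routing (here m3).
SELECT l.source_id, l.recency_ts
FROM   lastReport l
WHERE  l.source_id = 'm1'
UNION
SELECT l.source_id, l.recency_ts
FROM   lastReport l, Routing r
WHERE  r.mach_id = 'm1'
  AND  r.neighbor = l.source_id;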
The Focused Method • [Figure: system pipeline. The user query is analyzed into query parts; from these parts and the lastReport table a focused recency query is generated. The recency query is evaluated to produce the recency report, and the user query is evaluated to produce the query result.]
Roadmap • Background • Definitions and Techniques • Prototype and Evaluation • Conclusion and Future Work
Recency Reporting Prototype
• PL/pgSQL table function recencyReport: accepts a user query, evaluates it, and reports recency information
• Usage spec: SELECT * FROM recencyReport($$SQL TEXT$$);
• Recency information includes:
  • The name of a temporary table holding the exceptional relevant data sources and their recency timestamps
  • The name of another temporary table holding the other relevant data sources and their recency timestamps
  • The least recent data source and its recency timestamp
  • The most recent data source and its recency timestamp
  • A bound on the inconsistency
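A hypothetical usage session (the temporary table names used below are assumptions; the report returns the actual names, which can then be queried directly in the same session):

-- Ask for recency information alongside a monitoring query (illustrative).
SELECT *
FROM   recencyReport($$
           SELECT A.mach_id
           FROM   Routing R, Activity A
           WHERE  R.mach_id = 'm1'
             AND  R.neighbor = A.mach_id
             AND  A.value = 'idle'
       $$);

-- Suppose the report names the temporary tables exceptional_sources and
-- other_sources; they can then be inspected directly:
SELECT * FROM exceptional_sources;   -- outlier sources and their recency timestamps
SELECT * FROM other_sources;         -- remaining relevant sources and their timestamps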
Goals of Experiments • Our approach raises many questions: • Is it expensive to analyze user queries and generate focused recency queries? • Is it expensive to evaluate the recency query in addition to each user query? • Does the focused recency query really succeed in reducing the number of irrelevant data sources in the recency report? • Our experiments are an attempt to begin answering these questions empirically.
Methods Evaluated • Naïve method: the recency query simply returns all data sources as relevant • Focused method: the recency query is generated automatically • A variant of the Focused method: the recency query is hand-coded
Evaluation Metrics • False positive rate: the ratio of the number of irrelevant data sources included in the report to the number of relevant data sources, i.e. fpr = (# irrelevant sources reported) / (# relevant sources) • Response time overhead: the additional response time incurred by recency and consistency reporting
Schema and Data • lastReport, Activity, Routing as in the earlier examples • Synthetic data: the total row count of Activity is fixed at 10,000,000; the number of data sources and the number of rows per data source (the data ratio) are varied inversely.
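A sketch of how such synthetic data could be generated in PostgreSQL (illustrative only; the actual generator and value distribution used in the experiments are not described here). For n data sources, each source gets 10,000,000 / n Activity rows; the example below uses n = 100,000, i.e. a data ratio of 100.

-- Illustrative data generation: 100,000 sources x 100 rows = 10,000,000 rows.
INSERT INTO Activity (mach_id, value, event_time)
SELECT 'Tao' || s.id::text,
       CASE WHEN random() < 0.5 THEN 'idle' ELSE 'busy' END,   -- distribution assumed
       now() - random() * interval '30 days'
FROM   generate_series(1, 100000) AS s(id),
       generate_series(1, 100)    AS r(n);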
Test Queries • Selection queries (Q1, Q2) and two-way join queries (Q3, Q4) are measured • Q3: joins Routing and Activity with a very selective predicate on Routing • Q4: similar to Q3, but with a non-selective predicate on Routing
Overhead Comparisons: Q3 • Figure 1: Q3's performance overhead for recency and consistency reporting w.r.t. data ratio and # of data sources ((data ratio) × (# of data sources) = 10,000,000).
High Overhead Region, Q3 • Figure 2: Response times for Q3 with and without the recency report w.r.t. data ratio and # of data sources ((data ratio) × (# of data sources) = 10,000,000). The Focused method with automatic generation of the recency query is used here.
Overhead Comparisons, Q4 • Figure 3: Q4's performance overhead for recency and consistency reporting w.r.t. data ratio and # of data sources ((data ratio) × (# of data sources) = 10,000,000).
Performance Evaluation Summary • The overhead for analyzing a user query and generating a recency query is insignificant. • The overhead for evaluating a recency query is insignificant, and the Focused method costs no more than the Naïve method unless • the data ratio is very low, and • the query has a join, and • the query is not selective on data sources.
False Positive Rates • Naïve method (depends on the exact number of data sources; assume 100,000 for illustration): • Q3: fpr = (100,000 − 6)/6 ≈ 16,666 • Q4: fpr = 6/(100,000 − 6) ≈ 0.00006 • Focused methods: 0 in all cases, because the precise sets of relevant data sources are found
Roadmap • Background • Definitions and Techniques • Prototype and Evaluation • Conclusion and Future Work
Conclusion • In a large-scale asynchronous system, reporting recency rather than enforcing it is a viable solution • Defining "relevance" is non-trivial. Our solution: a data source is relevant if a single update from it can change the result of a query • Evaluation of our prototype showed that our methods incur insignificant overhead in most cases and are more precise than the naïve method
Future Work • Key constraints • Maintenance cost • Other definitions of relevance: • A data source is "relevant" if N updates from it may change the query result • A data source is "relevant" if a sequence of updates from it may change the query result
Another Example [1]
[1] http://www.cs.wisc.edu/condor/map
Example 3
Consider another sample instance for Activity (one in which no machine is currently idle); the Routing instance is the same as in Example 2.
SELECT A.mach_id FROM Routing R, Activity A WHERE R.mach_id = 'm1' AND A.value = 'idle' AND R.neighbor = A.mach_id;
The query result is empty. The set of data sources relevant via R is Ø; the set of data sources relevant via A is {'m3'}. Therefore m1 is not relevant, yet two updates from m1 would change the query result: 1) m1 is updated to 'idle' in Activity, and 2) m1 is added as a neighbor of m1 itself in Routing.
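Expressed against the illustrative schema above, those two updates from m1 would be (a sketch; the timestamp is arbitrary):

-- Update 1: m1 reports that it is now idle.
UPDATE Activity
SET    value = 'idle', event_time = '2006-03-13 08:00'
WHERE  mach_id = 'm1';

-- Update 2: m1 reports itself as one of its own neighbors.
INSERT INTO Routing (mach_id, neighbor, event_time)
VALUES ('m1', 'm1', '2006-03-13 08:00');

-- After both updates, the query in Example 3 returns {'m1'}.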
Reporting Recency • The recency information of the relevant data sources is stored in a session-duration temporary table • Since even the number of relevant sources can be large, we provide additional summary information: the minimum recency timestamp, the maximum recency timestamp, and the range of recency • Since machine failures are common in Condor, we also report exceptional data sources, detected with a z-score outlier test
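A sketch of the z-score outlier test over the relevant sources' recency timestamps (the threshold of 2 and the table name relevant_sources are assumptions; the prototype's exact computation may differ):

-- Flag a relevant source as exceptional if its recency timestamp is more than
-- 2 standard deviations older than the mean recency of all relevant sources.
SELECT r.source_id, r.recency_ts
FROM   relevant_sources r,
       (SELECT avg(extract(epoch FROM recency_ts))    AS mean_epoch,
               stddev(extract(epoch FROM recency_ts)) AS sd_epoch
        FROM   relevant_sources) stats
WHERE  stats.sd_epoch > 0
  AND  (stats.mean_epoch - extract(epoch FROM r.recency_ts)) / stats.sd_epoch > 2;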
Experiment Setup • System: Tao Linux 1.0, 2.4 GHz Intel, 512 MB memory • Database: PostgreSQL 8.0.0 • Shared buffer pool: 8 MB • Working memory size: 1 MB • Each query is run 11 times and the average time of the last 10 runs is used, to minimize fluctuation.
Test Queries
• Q1: SELECT COUNT(*) FROM Activity A WHERE A.mach_id IN ('Tao1', 'Tao10', 'Tao100', 'Tao1000', 'Tao10000', 'Tao100000') AND A.value = 'idle';
• Q2: SELECT COUNT(*) FROM Activity A WHERE A.mach_id NOT IN ('Tao1', 'Tao10', 'Tao100', 'Tao1000', 'Tao10000', 'Tao100000') AND A.value = 'idle';
• Q3: SELECT COUNT(*) FROM Routing R, Activity A WHERE R.mach_id IN ('Tao1', 'Tao10', 'Tao100', 'Tao1000', 'Tao10000', 'Tao100000') AND R.neighbor = A.mach_id AND A.value = 'idle';
• Q4: SELECT COUNT(*) FROM Routing R, Activity A WHERE R.mach_id NOT IN ('Tao1', 'Tao10', 'Tao100', 'Tao1000', 'Tao10000', 'Tao100000') AND R.neighbor = A.mach_id AND A.value = 'idle';
Roadmap • Background • Definitions and Techniques • Prototype and Evaluation • Conclusion and Future Work • Related Work
Enforcing currency and consistency • R. Alonso et al., Quasi-copies: Efficient data sharing for information retrieval systems. In EDBT, pages 443-468, 1988. • H. Garcia-Molina et al., Read-only transactions in a distributed database. ACM Trans. Database Syst., 7(2):209-234, 1982. • R. Lenz. Adaptive distributed data management with weak consistent replicated data. In SAC, pages 178-185, 1996. • A. Segev and W. Fang. Currency-based updates to distributed materialized views. In ICDE, pages 512-520, 1990. • A. Labrinidis et al., Balancing performance and data freshness in web database servers. In VLDB, pages 393-404, 2003. • L. Bright et al., Using latency-recency profiles for data delivery on the web. In VLDB, pages 550-561, 2002. • H. Guo et al., Relaxed currency and consistency: How to say "good enough" in SQL. In SIGMOD Conference, pages 815-826, 2004. A common theme here is to enforce recency constraints, either by choosing the correct version of an object to query (i.e., the cached or the primary copy) or by refreshing "stale" objects, synchronously "pulling" new data in response to a query.
Data Lineage • Y. Cui et al., Lineage tracing for general data warehouse transformations. In VLDB, pages 471-480, 2001. This work identifies the set of source data items that produced a view item; our work differs in that even if a data source does not contribute any lineage data items (possibly due to latency in reporting in, or some error), it may still be "relevant".
Distributed Query Processing • R. Munz et al., Application of sub-predicate tests in database systems. In A. L. Furtado and H. L. Morgan, editors, VLDB, pages 426-435, 1979. Their problem statement relies on data items being placed in a distributed environment so that they satisfy various predicates; the interaction of these data placement predicates with the query predicates is then used to identify where data satisfying a simple query might be located.
Partitioning Pruning • D. J. DeWitt et al., Gamma - a high performance dataflow database machine. In VLDB, pages 228-237, 1986. • IBM. DB2 UDB for z/OS Version 8 Performance Topics, 2005. • Oracle Corporation. Oracle Database Concepts, 10g Release 1, 2003. These systems choose partitions by matching certain types of selection predicates against the "partitioning predicates".