470 likes | 619 Views
Databases With Uncertainty And Lineage. Written by : Omar Benjelloun · Anish Das Sarma · Alon Halevy · Martin Theobald · Jennifer Widom Presented by : Alex Gorodetsky. Outline. Lineage? Uncertainty? Combined together? The logic behind the given solution
E N D
Databases With Uncertainty And Lineage Written by : Omar Benjelloun · Anish Das Sarma · Alon Halevy · Martin Theobald · Jennifer Widom Presented by : Alex Gorodetsky
Outline • Lineage? Uncertainty? Combined together? • The logic behind the given solution • The solution – The Trio system • Conclusions and future work
Databases with lineage • Definition: LDB Dis a triple (, S, λ), where • = • S = I () ∪ Eis a set of symbols. • E is external lineage - a set of external symbols. • λis a lineage function from S to
Databases with lineageExample :“crime-solver” database • In this example, the lineage function λ is an obvious lineage function for the tuples in the result. • Some operations have less obvious lineage functions, such as: • Negation • Duplicate-elimination • Aggregation • The operations we consider in this paper: • Simple well-behaved lineage functions.
Uncertain databases • An uncertain database represents a set of possible instances, each of which is one possible state of the database. • Definition: An x-tuple is a multiset of tuples, called alternatives. • Definition: A maybe x-tuple is a x-tuple annotated with a ‘?’. • Definition: An x-relation is a multiset of x-tuples.
Uncertain databases • X-relations is a specific formalism for uncertain databases. • X-relations provide a good balance of simplicity and expressiveness, and are orthogonal to the capabilities brought in by lineage.
Uncertain databasesExample • Here, Amy may have seen a Mazda, a Toyota, or no car at all, and the relation has three possible instances. • x-relations are not a complete formalism. • For example, the join Accuses cannot be represented as an x-relation: • x-tuples are independent, so they cannot express the fact that if Amy accuses Jimmy (due to the Mazda), then she must accuse Billy as well. • We will soon see that although x-relations alone are incomplete, adding lineage makes them complete.
Data Integration Systems • Systems that offer a uniform interface to a multitude of data sources. • We will focus on the semantics integration.
Data Integration - Any Problem? • “Subjectivity” effect - No standard. • As data integration applications strive to offer a single “objective” and coherent integrated view of data sources, uncertainty is bound to appear. • 3 main kinds of uncertainty in data integration applications: • Data – automatic extraction from unstructured data. • Mappings between the schemas. • Mappings between data objects in different sources.
Uncertainty, Lineage and Data Integration • Consider an Uncertainty-generating data integration operations are performed. • Lineage keeps track of the origins of data: • Manages uncertainty • Explains uncertainty • Potentially correct the uncertainty • So maybe we need to consider combining uncertainty with lineage in databases?
Outline • Lineage? Uncertainty? Combined together? • The logic behind the given solution • The solution – The Trio system • Conclusions and future work
Combining lineage and uncertainty • Definition: A ULDBD is a triple (, S, λ), where • = , =x-relation • Sis a set of symbols containing I () • λis a lineage function from S to
Combining lineage and uncertainty Example • We combine the uncertain Saw x-relation with the earlier Drives relation to create a new version of Accuses that has both uncertainty and lineage:
Combining lineage and uncertainty • Definition: Let D = (, S, λ) be a ULDB. A possible LDB of D is obtained as follows. • Pick a set of symbols ⊆ S such that:
Combining lineage and uncertainty • Definition: Let D = (, S, λ) be a ULDB. A possible LDB of D is obtained as follows. • Pick a set of symbols ⊆ S such that: • Alternatives of the same x-tuple are mutually exclusive.
Combining lineage and uncertainty • Definition: Let D = (, S, λ) be a ULDB. A possible LDB of D is obtained as follows. • Pick a set of symbols ⊆ S such that: • If an alternative is present in a possible instance, so must be the alternatives it was derived from.
Combining lineage and uncertainty • Definition: Let D = (, S, λ) be a ULDB. A possible LDB of D is obtained as follows. • Pick a set of symbols ⊆ S such that: • An x-tuple must yield a tuple in a possible instance unless: • It is a maybe x-tuple, and • None of its alternatives has a nonempty lineage that would have been consistent with condition 2.
Combining lineage and uncertaintyExample • Consider the choices for x-tuple 21 of Saw, do we satisfy all the conditions? • Why did we add them? • To satisfy condition 3
Combining lineage and uncertaintyExample • Can we add (42,1)? • No. Why? • condition 2 would be violated
Combining lineage and uncertaintyExample • All in all we have 3 possible LDBs:
Combining lineage and uncertainty -Completeness • Definition: a formalism is complete if it is possible to represent any set of possible instances within the formalism. • Theorem – ULDBs Completeness: Given any set of possible LDB’s P={P1,P2,..., Pm} over relations R={R1, R2,..., Rn}, there exists a ULDB D = (, S, λ) whose possible LDB’s are P.
Combining lineage and uncertainty -Well-behaved lineage • Definition: The lineage of an x-tuple is well-behaved if it satisfies : • Acyclic • Deterministic • Uniform • Definition: a ULDB D = (, S, λ) is well-behaved if all its x-tuples have well-behaved lineage. • Well-behaved lineage is well-suited for database queries.
Combining lineage and uncertainty –DL-monotonic queries • Intuitively, any operation that can produce its results in a “tuple-by-tuple” fashion is DL-monotonic. • Aggregation, duplicate-elimination, and some set operators are not DL-monotonic. • From now on we assume all queries Q to be DL-monotonic.
Combining lineage and uncertainty – Query evaluation(Algorithm 1) ? Q ?
Combining lineage and uncertainty – Applying a query to a ULDB • Theorem: Given a ULDB D and a query Q: • Algorithm 1 returns Q(D). • If D is a well-behaved ULDB, then so is Q(D).
Combining lineage and uncertainty –ULDB minimality • ULDBs do not have a unique representation. • We can have two different x-relations that have exactly the same set of possible instances. • Two notions of ULDB minimality: • Data minimality • Lineage minimality
Combining lineage and uncertainty –Data minimality • D-minimality: A ULDB D is D-minimal if it does not include any extraneous alternatives or ‘?’s.
Combining lineage and uncertainty –Lazy Algorithm for D-minimization • Extraneous ‘?’: • (44,1) • Extraneous alternative - Search recursively : • (Carol,Acura,Lexus)
Combining lineage and uncertainty –Extraction • Extraction is important in the context of data integration. • A flexible way to bring into the ULDB just the data that is needed from multiple external sources. • The extracted x-relations preserves their information, while discarding irrelevant data and lineage.
Outline • Lineage? Uncertainty? Combined together? • The logic behind the given solution • The solution – The Trio system • Conclusions and future work
The Trio system • The Trio system: • A relational DBMS that supports uncertainty and lineage. • Based on the ULDB data model, and accepts queries in the TriQLlanguage. • TriQL: extension of SQL with uncertainty and lineage-specific features.
The Trio system Standard SQL Standard relational DBMS
The Trio system • Trio API and translator • (Python) • The Python layer presents a simple Trio API that extends the standard Python DB 2.0 API for database access (Python’s analog of JDBC). • The Trio API accepts TriQL queries in addition to regular SQL, and query results may be x-tuples as well as regular tuples. • The API also exposes lineage tracing, along with the other ULDB-specific operations.
TrioExplorer • (GUI client) • Command-line • client The Trio system • Using the Trio API they constructed: • A generic command-line interactive client similar to that provided by most DBMS’s. • A fully-featured graphical user interface called TrioExplorer.
The Trio system –Encoding ULDB data • New Attribute functions: • aid - is a unique alternative identifier. • xid- identifies the x-tuple that this alternative belongs to. • conf- stores the confidence of the alternative. • num- is a nonnegative integer that tracks whether the alternative’s x-tuple has a “?”.
The Trio system –Encoding ULDB data • The lineage information for each table T is stored in a separate table lin_T(aid, src_aid, src_table). • A tuple (, , ) lin_T denotes that T’s alternative a1 has alternative a2 from table in its lineage.
The Trio system –TriQL to SQL query • Send the translated SQL query to the underlying DBMS, and opens a cursor on the result. • Trio stored procedures: • Tfetch: a cursor call to the Trio API for the original TriQL query. • Sfetch: a cursor call to the underlying DBMS for the translated SQL query.
The Trio system –TriQL to SQL query • Enable Tfetch to collect all SQL result tuples for a single pair. • In order to propagate the “?” annotations, Multiply the num values underlying base tuples. • These values, together with the table names, comprise the lineage for the alternatives in the result x-tuple.
The Trio system –Trio queries • Definition: TriQL(Trio’s query language for ULDBs): • Extension of SQL. • TriQL queries return: • Uncertain relations • Lineage that connects query result data to the queried data.
The Trio system –Built-in predicates and functions • TriQL provides three built-in predicates and functions: • Conf()- Filter query results based on: • The confidence of the input data - Conf(Saw) • The confidence of the result - Conf(*). • Maybe() - takes no arguments and is true if and only if the current x-tuple has a “?”. • Lineage(X,Y)– is true whenever Y is reachable from X by one or more lineage steps.
The Trio system –More possibilities • TrioExplorer – visualize the lineage. • The trio system demo • Coexistence checks • Extraneous data removal
Outline • Lineage? Uncertainty? Combined together? • The logic behind the given solution • The solution – The Trio system • Conclusions and future work
Conclusions and future work • The writers are not aware of any previously proposed formal data representation that integrates both lineage and uncertainty. • We have seen that ULDBs are a good way to solve data integration semantic problems. • Boost in performance may be achieved over computing query operators and confidences in tandem. • ULDBs are not expressive enough to fully represent complex operations. • Extend ULDBs with richer primitives to support data integration – cost in complexity. • Keep data integration “outside the ULDB box” and transfer the lineage and uncertainty primitives to data integration systems.