Query Language Constructs for Provenance
Murali Mani, Mohamad Alawa, Arunlal Kalyanasundaram
University of Michigan, Flint
Presented at IDEAS 2011
Provenance Metadata
• Data about origins of data
• Applications:
  • Check whether a data item is valid – in health records
  • How much do we trust an inference/observation – scientific computation
  • Audit trails – manufacturing/shipping/trading
• The database community has found provenance useful in:
  • updating views
  • maintenance of materialized views
  • interpretation of query results
  • querying probabilistic/uncertain data
• In short, numerous applications …
OPM (Open Provenance Model) http://openprovenance.org/
• Developed by several researchers who have been involved with provenance
• Describes a logical representation of provenance information for a wide variety of applications
• Provenance information is represented as a directed graph consisting of:
  • Nodes (each an artifact, a process, or an agent)
  • Edges, or dependencies, of 5 types:
    • used: a process used an artifact
    • wasGeneratedBy: an artifact was generated by a process
    • wasControlledBy: a process was controlled by an agent
    • wasTriggeredBy: a process was triggered by another process
    • wasDerivedFrom: an artifact was derived from another artifact
• Nodes and edges have annotations (attribute-value pairs)
OPM: A Simple Example
• A1, A2 are artifacts
• P is a process that performs division (A1/A2); note the used edges between P and A1, A2
• A3, A4 are artifacts generated by P (representing quotient and remainder); note the wasGeneratedBy edges between P and A3, A4
[Figure: OPM graph with artifacts A1, A2, A3, A4 and process P (type=division). Edges: used(dividend) between P and A1, used(divisor) between P and A2, wasGeneratedBy(quotient) between A3 and P, wasGeneratedBy(remainder) between A4 and P.]
Example taken from http://openprovenance.org/tutorial/
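The division example can be written down as a small annotated graph. The following is a minimal Python sketch, not an OPM library: the `Node` and `Edge` classes, their field names, the `role` annotation key, and the `roles` helper are all assumptions made for illustration. Edges are assumed to point from effect to cause, so a used edge runs from the process to the artifact and a wasGeneratedBy edge runs from the artifact to the process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    kind: str                # "artifact", "process", or "agent"
    annotations: tuple = ()  # attribute-value pairs, e.g. (("type", "division"),)


@dataclass(frozen=True)
class Edge:
    kind: str                # one of the five OPM edge types
    source: Node             # the effect (assumed direction: effect -> cause)
    target: Node             # the cause
    annotations: tuple = ()  # attribute-value pairs, e.g. (("role", "dividend"),)


# Artifacts A1..A4 and the division process P from the slide.
a1, a2, a3, a4 = (Node(f"A{i}", "artifact") for i in range(1, 5))
p = Node("P", "process", (("type", "division"),))

edges = [
    Edge("used", p, a1, (("role", "dividend"),)),
    Edge("used", p, a2, (("role", "divisor"),)),
    Edge("wasGeneratedBy", a3, p, (("role", "quotient"),)),
    Edge("wasGeneratedBy", a4, p, (("role", "remainder"),)),
]


def roles(edges, kind, process):
    """Map each role annotation to an artifact name, for `kind` edges touching `process`."""
    found = {}
    for e in edges:
        if e.kind == kind and process in (e.source, e.target):
            artifact = e.target if e.source == process else e.source
            found[dict(e.annotations)["role"]] = artifact.name
    return found


print(roles(edges, "used", p))
print(roles(edges, "wasGeneratedBy", p))
```

Storing annotations as tuples of pairs keeps nodes and edges hashable, which is convenient when they are later placed in sets.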
Queries for OPM
• We can write complex “multi-step inference” queries using Datalog/SQL based on the different edges in OPM
  • Example: find artifacts directly or indirectly derived from another artifact (a recursive query over wasDerivedFrom edges)
• However, is that sufficient? We may also need to express:
  • Sub-graph isomorphism (given a graph query pattern, check whether the pattern appears in a provenance graph)
    • Studied in graph query languages ([Graph-QL], [OPQL], …)
  • Shortest path queries (using some notion of distance)
    • Typically not studied in graph query languages
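The recursive derivation query mentioned above can be sketched outside Datalog/SQL as a plain graph traversal. The Python below is an illustration only: the function name and the `(derived, origin)` pair encoding of wasDerivedFrom edges are assumptions, and the sample edges are made up.

```python
def derived_from_transitively(source, was_derived_from):
    """Return every artifact directly or indirectly derived from `source`.

    `was_derived_from` is an iterable of (derived, origin) pairs, one per
    wasDerivedFrom edge (hypothetical encoding for this sketch).
    """
    # Index the edges by origin so each step looks up direct derivations.
    derived_by_origin = {}
    for derived, origin in was_derived_from:
        derived_by_origin.setdefault(origin, set()).add(derived)

    result, frontier = set(), [source]
    while frontier:
        current = frontier.pop()
        for d in derived_by_origin.get(current, ()):
            if d not in result:  # visit each artifact once, so cycles terminate
                result.add(d)
                frontier.append(d)
    return result


# Made-up edges: B derived from A, C from B, D from C, and an unrelated E from X.
sample = [("B", "A"), ("C", "B"), ("D", "C"), ("E", "X")]
print(sorted(derived_from_transitively("A", sample)))
```

This corresponds to the transitive closure that a recursive Datalog rule or a recursive SQL query would compute over the wasDerivedFrom relation.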
Our approach
• Two sets of constructs
• Constructs for Querying Content
  • Select nodes and edges based on the annotations (attribute values) associated with them
  • Operators include typical relational algebra operators: select, project, union, …
• Constructs for Querying Structure
  • 6 basic functions
    • from(e) / to(e): the node where edge e starts / ends
    • from⁻¹(n) / to⁻¹(n): the edges that start at node n / end at node n
    • next(n): the nodes to which there is an edge from n
    • prev(n): the nodes from which there is an edge to n
  • Generalized selection operator over a graph G, specified by two conditions
    • a node condition that specifies what nodes in G must appear in the result
    • an edge condition that specifies what edges in G must appear in the result
    • Result: a sub-graph of G (i.e., its nodes are a subset of the nodes of G, and its edges are a subset of the edges of G)
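One possible reading of the six basic functions and the generalized selection operator is sketched below in Python. This is not the authors' implementation: the `Graph` class, the method names (`from_` is used because `from` is a Python keyword), the edge-id dictionary, and the choice to pass each condition as a function from the graph to a set are all assumptions for illustration.

```python
class Graph:
    """Directed graph: a set of nodes and a dict of edge id -> (start, end)."""

    def __init__(self, nodes, edges):
        self.nodes = set(nodes)
        self.edges = dict(edges)

    # The six basic structural functions.
    def from_(self, e):
        return self.edges[e][0]

    def to(self, e):
        return self.edges[e][1]

    def from_inv(self, n):
        return {e for e, (start, _) in self.edges.items() if start == n}

    def to_inv(self, n):
        return {e for e, (_, end) in self.edges.items() if end == n}

    def next(self, n):
        return {self.to(e) for e in self.from_inv(n)}

    def prev(self, n):
        return {self.from_(e) for e in self.to_inv(n)}

    def select(self, node_cond, edge_cond):
        """Generalized selection (sketch): each condition maps the graph to the
        set of nodes / edge ids that must appear in the result."""
        kept_nodes = node_cond(self) & self.nodes
        kept_edges = {e: self.edges[e] for e in edge_cond(self) if e in self.edges}
        return Graph(kept_nodes, kept_edges)


g = Graph({"a", "b", "c"}, {"e1": ("a", "b"), "e2": ("b", "c"), "e3": ("a", "c")})

# Select the edges leaving "a" together with "a" and its successors.
sub = g.select(lambda G: {"a"} | G.next("a"), lambda G: G.from_inv("a"))
print(sorted(sub.nodes), sorted(sub.edges))
```

Intersecting with the nodes and edges of the input graph guarantees that the result is contained in G. The sketch does not check that every kept edge has both endpoints among the kept nodes; that is left to the two conditions.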
Examples of Generalized Selection Operator
• Descendant graph, given a set of nodes S
  • node condition = set of nodes n such that there is a path from some s ∈ S to n
  • edge condition = set of edges between the nodes selected by the node condition
• Shortest path graph between s and t
  • edge condition = set of edges on the shortest path between s and t
  • node condition = set of nodes adjacent to an edge selected by the edge condition
• Note: the constructs for querying content and for querying structure can be integrated to yield a powerful query model that can express a wide range of queries.
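The two examples can be sketched in Python as follows. Several choices here are assumptions rather than part of the slides: edges are plain `(start, end)` pairs, paths follow edge direction and contain at least one edge, distance is the number of edges (the slides only say "some notion of distance"), and when several shortest paths exist only one is returned.

```python
from collections import deque


def _successors(edges):
    succ = {}
    for start, end in edges:
        succ.setdefault(start, []).append(end)
    return succ


def descendant_graph(edges, sources):
    """Nodes reachable from some source by a path of at least one edge,
    plus the edges between those nodes."""
    succ = _successors(edges)
    reached, queue = set(), deque(sources)
    while queue:
        n = queue.popleft()
        for m in succ.get(n, ()):
            if m not in reached:
                reached.add(m)
                queue.append(m)
    kept = {(a, b) for a, b in edges if a in reached and b in reached}
    return reached, kept


def shortest_path_graph(edges, s, t):
    """Edges on one shortest s-to-t path (fewest edges), plus their endpoints."""
    succ = _successors(edges)
    parent = {s: None}  # breadth-first search tree rooted at s
    queue = deque([s])
    while queue:
        n = queue.popleft()
        if n == t:
            break
        for m in succ.get(n, ()):
            if m not in parent:
                parent[m] = n
                queue.append(m)
    if t not in parent:
        return set(), set()  # t is not reachable from s
    path_edges, n = set(), t
    while parent[n] is not None:
        path_edges.add((parent[n], n))
        n = parent[n]
    path_nodes = {x for edge in path_edges for x in edge}
    return path_nodes, path_edges


sample = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "c"), ("c", "t"), ("t", "d")]
print(descendant_graph(sample, {"b"}))
print(shortest_path_graph(sample, "s", "t"))
```

Note the mirrored structure of the two definitions: the descendant graph picks nodes first and derives the edges from them, while the shortest path graph picks edges first and derives the nodes.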
Conclusions and Future Work
• Observation: a provenance query language should not be restricted to Datalog/SQL.
• Developed a query model that provides constructs for querying structure and for querying content.
• Using our query model, we can express a wide range of queries, including shortest path queries (not expressible using SQL/Datalog).
References
• [Graph-QL]: He, H., and Singh, A. K. 2008. Graphs-at-a-time: Query Language and Access Methods for Graph Databases. ACM SIGMOD (2008).
• [OPQL]: Lim, C., Lu, S., Chebotko, A., and Fotouhi, F. 2011. OPQL: A First OPM-Level Query Language for Scientific Workflow Provenance. IEEE SCC (2011).
• [OPM]: The Open Provenance Model (OPM), available at http://openprovenance.org/