
Storing, Indexing and Querying Large Provenance Data Sets as RDF Graphs in Apache HBase

Artem Chebotko. Joint work with John Abraham and Pearl Brazier (University of Texas – Pan American), Anthony Piazza (Piazza Consulting), Andrey Kashlev and Shiyong Lu


Presentation Transcript


  1. Was Derived From: Storing, Indexing and Querying Large Provenance Data Sets as RDF Graphs in Apache HBase • Artem Chebotko • Joint work with John Abraham and Pearl Brazier, University of Texas – Pan American • Anthony Piazza, Piazza Consulting • Andrey Kashlev and Shiyong Lu, Wayne State University • 7th IEEE International Workshop on Scientific Workflows, July 2, 2013

  2. Provenance in eScience • Metadata that captures history of an experiment • Problem diagnosis • Result interpretation • Experiment reproducibility • Scientific Workflow Community Provenance Challenges • 2006: understanding and sharing information about provenance representations and capabilities • 2006: interoperability of different provenance • 2009: evaluating various aspects of OPM • 2010: showcase OPM in the context of novel applications • Open Provenance Model (2007 - 2010) • PROV-DM: The PROV Data Model (W3C Recommendation 30 April 2013)

  3. SWFMS and Provenance • Support provenance collection • Use proprietary or third-party systems to manage provenance • Differ in provenance models, provenance vocabularies, inference support, and query languages • May eventually converge to W3C PROV specifications • Taverna • Kepler • VIEW • VisTrails • Pegasus • Swift • Galaxy • Triana • OPMProv • Karma • RDFProv • etc.

  4. Sample OPM Provenance Graph • Nodes: • artifacts • processes • agents • Edges: • used • wasGeneratedBy • wasControlledBy • wasTriggeredBy • wasDerivedFrom

  5. Sample Graph Serialization: OPMV and Terse RDF Triple Language (Turtle) utpb:schema rdf:type opmv:Artifact . utpb:instance rdf:type opmv:Artifact . utpb:dataset rdf:type opmv:Artifact . utpb:loadData rdf:type opmv:Process . utpb:loadData opmv:used utpb:schema, utpb:dataset . utpb:instance opmv:wasGeneratedBy utpb:loadData . utpb:instance opmv:wasDerivedFrom utpb:schema, utpb:dataset .
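
The Turtle statements on this slide use comma-separated object lists, so the seven statements stand for nine individual triples. A minimal Python sketch of that expansion follows; it assumes triples are kept as plain tuples of prefixed names, and the names `STATEMENTS`, `expand` and `TRIPLES` are illustrative rather than taken from the talk.

```python
# Each statement is (subject, predicate, [objects]). A Turtle object list
# "a, b" is shorthand for one triple per listed object.
STATEMENTS = [
    ("utpb:schema",   "rdf:type", ["opmv:Artifact"]),
    ("utpb:instance", "rdf:type", ["opmv:Artifact"]),
    ("utpb:dataset",  "rdf:type", ["opmv:Artifact"]),
    ("utpb:loadData", "rdf:type", ["opmv:Process"]),
    ("utpb:loadData", "opmv:used", ["utpb:schema", "utpb:dataset"]),
    ("utpb:instance", "opmv:wasGeneratedBy", ["utpb:loadData"]),
    ("utpb:instance", "opmv:wasDerivedFrom", ["utpb:schema", "utpb:dataset"]),
]


def expand(statements):
    """Flatten object lists into individual (s, p, o) triples, keeping order."""
    return [(s, p, o) for s, p, objects in statements for o in objects]


TRIPLES = expand(STATEMENTS)
```

The position of a triple in `TRIPLES` is the kind of numeric handle that an index over triple positions can refer to instead of the triple itself.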

  6. Provenance Serialization and Querying • Both OPM and PROV-DM can be serialized in RDF • Queried in SPARQL • Example query: find all artifacts and their values, if any, in a provenance graph with identifier http://cs.panam.edu/utpb#opmGraph

  7. This Work - Motivation • Single provenance graph as an RDF graph • In general, readily manageable in main memory of a single machine • Hundreds of thousands or even millions of provenance graphs as a provenance (RDF) dataset • Challenging to manage • Our Focus/Problem: Efficient and scalable storage and querying of large collections of provenance graphs serialized as RDF graphs (in an Apache HBase database)

  8. This Work - Contributions • Novel storage and indexing schemes for RDF data in HBase that are suitable for provenance datasets • Novel and efficient querying algorithms to evaluate SPARQL queries in HBase that are optimized to make use of bitmap indices and numeric values instead of triples • Empirical evaluation of our approach using provenance graphs and test queries of the University of Texas Provenance Benchmark (UTPB)

  9. Talk Outline • RDF Data and Queries • Indexing Scheme • Storage Scheme • Query Processing • Performance Study • Related Work • Summary and Future work

  10. RDF Data and Queries

  11. RDF Data and Queries

  12. Indexing Scheme • Selection Indices: Is, Ip, Io • Find a triple with known s, p and o:
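
The lookup formula for this slide did not survive in the transcript. The sketch below shows one plausible reading, assuming each selection index maps a term to a bitmap over triple positions and that a lookup with known s, p and o intersects the three bitmaps. Bitmaps are modelled as Python integers, positions are zero-based, and all function names are hypothetical.

```python
from collections import defaultdict


def build_selection_indices(triples):
    """Return (Is, Ip, Io): term -> bitmap; bit i is set if triple i has that term."""
    Is, Ip, Io = defaultdict(int), defaultdict(int), defaultdict(int)
    for i, (s, p, o) in enumerate(triples):
        Is[s] |= 1 << i
        Ip[p] |= 1 << i
        Io[o] |= 1 << i
    return Is, Ip, Io


def positions(bitmap):
    """Decode a bitmap into a sorted list of triple positions."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


def select(indices, n, s=None, p=None, o=None):
    """AND together the bitmaps of the known terms; None means 'any term'."""
    result = (1 << n) - 1  # start with all n positions set
    for index, term in zip(indices, (s, p, o)):
        if term is not None:
            result &= index.get(term, 0)  # unknown term -> empty bitmap
    return positions(result)


triples = [
    ("utpb:schema", "rdf:type", "opmv:Artifact"),
    ("utpb:loadData", "rdf:type", "opmv:Process"),
    ("utpb:loadData", "opmv:used", "utpb:schema"),
    ("utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData"),
]
indices = build_selection_indices(triples)
```

Here `select(indices, len(triples), s="utpb:loadData", p="opmv:used", o="utpb:schema")` intersects three bitmaps and decodes the surviving positions, so no triple is compared term by term at query time.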

  13. Indexing Scheme • Join Indices: Iss, Iso, Ios, Ioo • Find triples with the same object as subject in triple at position i: Iso(i)
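
As a rough illustration of the join indices, the sketch below precomputes, for every triple position i, a bitmap of the positions j that share a term with it. It follows the slide's definition of Iso(i) (object of triple j equals subject of triple i) and assumes the other three indices are defined symmetrically; that symmetry, the inclusion of i itself in Iss(i) and Ioo(i), the zero-based positions, and all names are assumptions.

```python
def build_join_indices(triples):
    """Return {"ss", "so", "os", "oo"} -> list of bitmaps, one per position i.

    For index "xy", bit j of entry i is set when the y-term of triple j
    equals the x-term of triple i (so "so" pairs subject of i with object of j).
    """
    role = {"s": 0, "o": 2}  # tuple offsets of subject and object
    indices = {}
    for name in ("ss", "so", "os", "oo"):
        x, y = role[name[0]], role[name[1]]
        rows = []
        for ti in triples:
            bitmap = 0
            for j, tj in enumerate(triples):
                if tj[y] == ti[x]:
                    bitmap |= 1 << j
            rows.append(bitmap)
        indices[name] = rows
    return indices


def positions(bitmap):
    """Decode a bitmap into a sorted list of triple positions."""
    return [j for j in range(bitmap.bit_length()) if (bitmap >> j) & 1]


triples = [
    ("utpb:loadData", "opmv:used", "utpb:schema"),               # position 0
    ("utpb:loadData", "opmv:used", "utpb:dataset"),              # position 1
    ("utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData"),   # position 2
    ("utpb:instance", "opmv:wasDerivedFrom", "utpb:schema"),     # position 3
]
J = build_join_indices(triples)
```

With these four triples, `positions(J["so"][0])` yields the triples whose object is `utpb:loadData`, the subject of the triple at position 0. The construction here is quadratic in the number of triples, which is tolerable only because a single provenance graph is small.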

  14. Storage Scheme • One table with two column families for data and indices • Each row stores one complete provenance graph
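
The layout on this slide can be pictured with nested dictionaries standing in for the HBase table. This is only a sketch: the family names `data` and `index`, the qualifier formats, and the choice to store selection bitmaps as integers are all assumptions, and no HBase client API is used.

```python
def graph_to_row(graph_id, triples):
    """Build one row: key = graph identifier, value = two column families."""
    data, index = {}, {}
    for i, (s, p, o) in enumerate(triples):
        # Hypothetical qualifier "t<i>": the triple stored at position i.
        data[f"t{i}"] = f"{s} {p} {o}"
        # Hypothetical qualifiers "Is:<term>", "Ip:<term>", "Io:<term>":
        # one bitmap per term, with bit i set for triple position i.
        for prefix, term in (("Is", s), ("Ip", p), ("Io", o)):
            key = f"{prefix}:{term}"
            index[key] = index.get(key, 0) | (1 << i)
    return graph_id, {"data": data, "index": index}


table = {}  # stand-in for the single table: row key -> column families
row_key, families = graph_to_row(
    "http://cs.panam.edu/utpb#opmGraph",
    [
        ("utpb:schema", "rdf:type", "opmv:Artifact"),
        ("utpb:instance", "rdf:type", "opmv:Artifact"),
    ],
)
table[row_key] = families
```

Keeping a whole graph and its indices in one row means a query against a single graph needs one row lookup by key, which fits the slide's point that each row stores one complete provenance graph.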

  15. Query Processing • Four efficient algorithms/functions: • application of selection indices • application of join indices • handling of special cases not supported by the indices • basic graph pattern evaluation
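
To make the last item concrete, here is a naive nested-loop evaluator for a basic graph pattern over the triples of one graph. It is not the index-based algorithm from the talk (whose details are not in this transcript); it only shows what the result of basic graph pattern evaluation looks like. The `?`-prefix convention for variables and all function names are assumptions.

```python
def is_var(term):
    """Treat terms starting with '?' as SPARQL-style variables."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check it against an earlier binding.
            if new.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new


def eval_bgp(patterns, triples):
    """Join the solutions of each triple pattern, one pattern at a time."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


graph = [
    ("utpb:schema", "rdf:type", "opmv:Artifact"),
    ("utpb:instance", "rdf:type", "opmv:Artifact"),
    ("utpb:loadData", "rdf:type", "opmv:Process"),
    ("utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData"),
]
bgp = [
    ("?a", "rdf:type", "opmv:Artifact"),
    ("?a", "opmv:wasGeneratedBy", "?p"),
]
result = eval_bgp(bgp, graph)
```

An index-based evaluator would replace the inner scan over `triples` with selection-index lookups for constant terms and join-index lookups for shared variables, working on triple positions rather than on the triples themselves.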

  16. Query Processing

  17. Query Processing

  18. Query Processing

  19. Query Processing

  20. Query Processing

  21. Query Processing

  22. Performance Study • Implementation • Java, Hadoop 1.0.0, HBase 0.94 • Cluster setup • One HBase Master • Eight HBase Region Servers • All commodity machines • Benchmark – UTPB (5 datasets, 11 queries)

  23. Performance Study • Q1 – simplest, yet most expensive query due to a large result set • Q1. Find all provenance graph identifiers. PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX owl: <http://www.w3.org/2002/07/owl#> SELECT * WHERE { ?graph rdf:type owl:Thing . }

  24. Performance Study • Q2 – Q11 – different complexity, yet similar performance • Example: Q8. Find all artifacts and their values, if any, in a particular provenance graph. PREFIX opmv: <http://purl.org/net/opmv/ns#> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX opmo: <http://openprovenance.org/model/opmo#> PREFIX utpb: <http://cs.panam.edu/utpb#> SELECT ?artifact ?value FROM NAMED <http://cs.panam.edu/utpb#opmGraph> WHERE { GRAPH utpb:opmGraph { ?artifact rdf:type opmv:Artifact . OPTIONAL { ?artifact opmo:annotation ?annotation . ?annotation opmo:property ?property . ?property opmo:value ?value . } . OPTIONAL { ?artifact opmo:avalue ?artifactValue . ?artifactValue opmo:content ?value . } . } }

  25. Performance Study • Please see other queries in the paper – very efficient and scalable (nearly constant scalability due to minimal data transfers and fast index-based join processing)

  26. Related Work • HBase, BigTable, Cassandra • Hadoop, Hive, Pig, CouchDB, MongoDB, etc. • NoSQL solutions to RDF data management • Provenance management systems • RDF data indexing

  27. Summary and Future Work • Designed novel storage and indexing schemes for RDF data in HBase that are suitable for provenance datasets • Empirical evaluation results are promising • Future work • Compare, compare, compare • More experiments with multi-user workloads • More optimizations • PROV-DM benchmark anyone?

  28. THANK YOU! Questions? • My contact information: • Artem Chebotko, Department of Computer Science, University of Texas – Pan American • chebotkoa@utpa.edu • http://www.cs.panam.edu/~artem
