July 8th, 2010

Presentation Transcript


  1. Distributed Storage and Querying Techniques for a Semantic Web of Scientific Workflow Provenance: The ProvBase System
Artem Chebotko (joint work with John Abraham, Pearl Brazier, Jaime Navarro, and Anthony Piazza)
Department of Computer Science, University of Texas – Pan American
artem@cs.panam.edu, http://www.cs.panam.edu/~artem
July 8th, 2010

  2. Background
• Semantic Web:
  • Web of Data
  • Machine-processable semantic data: metadata that describes resources and the relationships among them
  • Semantic Web standards: RDF, RDFS, OWL, SPARQL
• Scientific Workflows & Provenance:
  • A powerful paradigm for formalizing and automating complex and data-intensive scientific processes
  • In-silico experiments, e-science
  • Provenance: metadata that captures the origin and derivation history of data products
  • Scientific discovery reproducibility, result interpretation, and problem diagnosis depend primarily on provenance
• Semantic Web of Scientific Workflow Provenance:
  • Semantic Web technologies, scientific workflow provenance, interoperability, and integration

  3. Motivation, Goals, Challenges
• Scientific workflows generate a lot of provenance
  • A scientific workflow can be executed numerous times with different settings, parameters, and inputs to obtain interesting results
  • The TangoInSilico workflow designed in VIEW has over 20 different parameters and can generate around 500 RDF triples every 3 seconds; that is over 14 million triples per day!
• Provenance from different projects can be integrated
  • The Open Provenance Model (http://openprovenance.org) and the Third Provenance Challenge (http://twiki.ipaw.info)
• There exists a growing need for efficient database systems that employ distributed storage and querying techniques to cope with large-scale provenance data management
  • Most existing solutions assume a single-machine deployment

  4. Motivation, Goals, Challenges
• Shared-disk and shared-nothing clustering
• Google's Bigtable is an eye-catcher!
  • Caches most data on the Web; 16+ billion webpages according to http://www.worldwidewebsize.com
  • Has an open-source implementation, HBase (http://hbase.apache.org); HBase builds on top of Hadoop (http://hadoop.apache.org)
• Hadoop: a Java framework that supports intensive data communication among computers in a cluster
  • Capable of connecting and coordinating thousands of nodes inside a cluster
  • Distributes data to obtain the best performance
• HBase: a scalable, distributed database that supports structured data storage for large tables
  • Not a relational database; "a sparse, distributed multi-dimensional sorted map"

  5. Motivation, Goals, Challenges
• Goal: store and query scientific workflow provenance (in RDF) using HBase
• Challenges:
  • Data partitioning in HBase is based on row keys; RDF triples have no keys
  • A table cell can contain a set of values with different timestamps; a relationship between two values from different cell sets in different columns but for the same row key is not captured
  • No high-level, declarative query language like SQL; a simple API instead; querying is based on row keys
• Questions:
  • What database schema is suitable for storing RDF triples to efficiently support triple pattern matching?
  • How can SPARQL queries be evaluated against an HBase database?

  6. Contributions
• Architect the ProvBase system, which incorporates an HBase/Hadoop backend for distributed storage and querying of provenance triples
• Design a three-table storage schema that can be instantiated in HBase to hold provenance triples
• Explore querying algorithms to evaluate SPARQL queries in HBase using its native API
• Conduct an experimental study using the Third Provenance Challenge queries

  7. Organization of This Talk
• Related Work
• ProvBase Architecture
• Storage Schema
• Querying Algorithms
• Performance Study
• Concluding Remarks & Future Work

  8. Related Work: Scientific Workflow Provenance Data Management

  9. Related Work: Distributed RDF Data Management
• Heart (Highly Extensible & Accumulative RDF Table); http://rdf-proj.blogspot.com and http://heart.korea.ac.kr
• SPIDER (http://dbserver.korea.ac.kr/projects/spider/)
• M. F. Husain, P. Doshi, L. Khan, and B. M. Thuraisingham, "Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce," in Proc. of CLOUD, 2009, pp. 680–686.
• M. F. Husain, L. Khan, M. Kantarcioglu, and B. M. Thuraisingham, "Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools," in Proc. of CLOUD, 2010.
• J. Urbani, S. Kotoulas, E. Oren, and F. van Harmelen, "Scalable Distributed Reasoning Using MapReduce," in Proc. of ISWC, 2009, pp. 634–649.
• RDFCube and RDFPeers
• More related works in the paper

  10. ProvBase Architecture
• Clients collect and query provenance
• ProvBase servers process all client requests
• An active master coordinates an HBase cluster
• Region servers store provenance

  11. Storage Schema
• An HBase table has rows and columns
• A row is uniquely identified by a row key
• A table cell can contain a set of values
• A cell value has a timestamp
Figure courtesy of Google, Inc.

  12. Storage Schema
• Sample RDF triples:
  <D> <generatedArtifact> <A> .
  <D> <generatedByProcess> <P> .
  <C> <usedByProcess> <P> .
• When stored in an HBase database, each triple should be searchable by subject, predicate, and object:
  • Triple pattern <D> ?p ?o matches two triples
  • Triple pattern ?s <usedByProcess> ?o matches one triple
  • Triple pattern ?s ?p <P> matches two triples
  • Other variations, including ?s ?p ?o, which should match all the triples

  13. Storage Schema
• Sample RDF triples:
  <D> <generatedArtifact> <A> .
  <D> <generatedByProcess> <P> .
  <C> <usedByProcess> <P> .
• Three-table schema:
  • Ts – search by subject
  • Tp – search by predicate
  • To – search by object
• Other considerations:
  • Possible to combine Ts, Tp, and To into one table
  • Tables that allow searching by both subject and object, subject and predicate, and so forth
  • Row key hashing
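The three-table layout above can be sketched with in-memory maps standing in for the Ts, Tp, and To HBase tables; every triple is stored three times, once under each row key. Class, record, and field names here are illustrative assumptions, not the paper's actual code:

```java
import java.util.*;

// Illustrative in-memory model of the three-table schema: each "table" is a
// map from a row key to the list of triples stored under that key.
public class ThreeTableSchema {
    record Triple(String s, String p, String o) {}

    final Map<String, List<Triple>> ts = new HashMap<>(); // row key = subject
    final Map<String, List<Triple>> tp = new HashMap<>(); // row key = predicate
    final Map<String, List<Triple>> to = new HashMap<>(); // row key = object

    // Each inserted triple becomes reachable by subject, predicate, and object.
    void insert(Triple t) {
        ts.computeIfAbsent(t.s(), k -> new ArrayList<>()).add(t);
        tp.computeIfAbsent(t.p(), k -> new ArrayList<>()).add(t);
        to.computeIfAbsent(t.o(), k -> new ArrayList<>()).add(t);
    }

    public static void main(String[] args) {
        ThreeTableSchema db = new ThreeTableSchema();
        db.insert(new Triple("<D>", "<generatedArtifact>", "<A>"));
        db.insert(new Triple("<D>", "<generatedByProcess>", "<P>"));
        db.insert(new Triple("<C>", "<usedByProcess>", "<P>"));
        System.out.println(db.ts.get("<D>").size()); // triples with subject <D>: 2
        System.out.println(db.to.get("<P>").size()); // triples with object <P>: 2
    }
}
```

The triple storage cost of this redundancy is what buys key-based lookup for any of the three term positions, which is the point of the slide's schema.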

  14. Querying Algorithms
• Three algorithms in the paper:
  • matchTP-T – matching a triple pattern over a triple
  • matchTP-DB – matching a triple pattern over a database
  • matchBGP-DB – matching a basic graph pattern over a database
• matchTP-T checks that three conditions are satisfied: (1) a variable can match anything, (2) a URI or literal must match itself, and (3) a variable that occurs more than once must match the same term for all occurrences
• Returns true or false
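The three conditions can be sketched as follows (a minimal illustration in which patterns and triples are arrays of three terms and variables are prefixed with `?`; the representation and names are our own assumptions):

```java
import java.util.*;

// Sketch of matchTP-T: does the triple match the pattern? Returns true/false.
public class MatchTPT {
    static boolean isVar(String term) { return term.startsWith("?"); }

    static boolean match(String[] pattern, String[] triple) {
        Map<String, String> bindings = new HashMap<>(); // variable -> matched term
        for (int i = 0; i < 3; i++) {
            if (isVar(pattern[i])) {
                // Conditions 1 and 3: a variable matches anything, but every
                // occurrence of the same variable must match the same term.
                String bound = bindings.putIfAbsent(pattern[i], triple[i]);
                if (bound != null && !bound.equals(triple[i])) return false;
            } else if (!pattern[i].equals(triple[i])) {
                return false; // condition 2: a URI/literal must match itself
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] t = {"<D>", "<generatedByProcess>", "<P>"};
        System.out.println(match(new String[]{"<D>", "?p", "?o"}, t)); // true
        System.out.println(match(new String[]{"<C>", "?p", "?o"}, t)); // false: subject differs
        System.out.println(match(new String[]{"?x", "?p", "?x"}, t));  // false: ?x bound twice
    }
}
```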

  15. Querying Algorithms
• matchTP-DB distinguishes four cases:
  • If the subject pattern is not a variable, retrieve a row from Ts with the corresponding key
  • Otherwise, if the object pattern is not a variable, retrieve a row from To with the corresponding key
  • Otherwise, if the predicate pattern is not a variable, retrieve a row from Tp with the corresponding key
  • Otherwise, retrieve all rows from Ts, Tp, or To
• Each retrieved row contains one or more triples, each of which must be tried against the input triple pattern using matchTP-T
• matchTP-DB returns a set of matching triples
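The case analysis above can be sketched with maps standing in for the Ts, To, and Tp tables (a simplified illustration under our own naming; the final filter here omits matchTP-T's repeated-variable check for brevity):

```java
import java.util.*;
import java.util.function.Function;

// Sketch of matchTP-DB: pick the most specific "table" for the pattern
// (Ts by subject, else To by object, else Tp by predicate, else scan all),
// then filter the retrieved triples position by position.
public class MatchTPDB {
    static boolean isVar(String t) { return t.startsWith("?"); }

    static List<String[]> matchTPDB(List<String[]> allTriples, String[] pat) {
        // Row-key indexes standing in for the Ts / To / Tp HBase tables.
        Map<String, List<String[]>> ts = index(allTriples, t -> t[0]);
        Map<String, List<String[]>> to = index(allTriples, t -> t[2]);
        Map<String, List<String[]>> tp = index(allTriples, t -> t[1]);

        List<String[]> row;
        if (!isVar(pat[0]))      row = ts.getOrDefault(pat[0], List.of());
        else if (!isVar(pat[2])) row = to.getOrDefault(pat[2], List.of());
        else if (!isVar(pat[1])) row = tp.getOrDefault(pat[1], List.of());
        else                     row = allTriples; // ?s ?p ?o: scan everything

        List<String[]> result = new ArrayList<>();
        for (String[] t : row) if (matches(pat, t)) result.add(t);
        return result;
    }

    static Map<String, List<String[]>> index(List<String[]> triples,
                                             Function<String[], String> key) {
        Map<String, List<String[]>> m = new HashMap<>();
        for (String[] t : triples)
            m.computeIfAbsent(key.apply(t), k -> new ArrayList<>()).add(t);
        return m;
    }

    // Simplified matchTP-T: non-variables must equal the triple's term.
    static boolean matches(String[] pat, String[] t) {
        for (int i = 0; i < 3; i++)
            if (!isVar(pat[i]) && !pat[i].equals(t[i])) return false;
        return true;
    }

    public static void main(String[] args) {
        List<String[]> db = List.of(
            new String[]{"<D>", "<generatedArtifact>", "<A>"},
            new String[]{"<D>", "<generatedByProcess>", "<P>"},
            new String[]{"<C>", "<usedByProcess>", "<P>"});
        System.out.println(matchTPDB(db, new String[]{"?s", "?p", "<P>"}).size()); // 2
    }
}
```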

  16. Querying Algorithms
• matchBGP-DB steps:
  • Sort triple patterns by selectivity so that triple patterns expected to yield smaller results appear first in the list: (1) a triple pattern that contains only variables is the least selective, (2) a triple pattern with a non-variable only at the predicate pattern position is moderately selective, and (3) a triple pattern with a non-variable at the subject and/or object pattern positions is the most selective
  • Evaluate each triple pattern using matchTP-DB to obtain matching triple sets
  • Join the resulting sets using a nested-loops-like join strategy
• N triple patterns in a basic graph pattern require N loops nested inside each other
• Intermediate results are not materialized; all joins are performed concurrently: if a triple from the first set joins with a triple from the second set, attempt to join the result with a triple from the third set, and so forth
• Join conditions require that triples agree on the values (bindings) of shared variables
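Under the stated simplifications (in-memory triple sets instead of HBase rows, recursion standing in for the N nested loops), the ordering and join phases might look like this sketch; all names are illustrative:

```java
import java.util.*;

// Sketch of matchBGP-DB: (1) order patterns by selectivity, (2) compute the
// matching triple set per pattern, (3) nested-loops join on shared variables.
public class MatchBGPDB {
    static boolean isVar(String t) { return t.startsWith("?"); }

    // Lower rank = more selective = evaluated earlier (the slide's ordering).
    static int rank(String[] p) {
        if (!isVar(p[0]) || !isVar(p[2])) return 0; // subject and/or object bound
        if (!isVar(p[1])) return 1;                 // only the predicate bound
        return 2;                                   // all variables
    }

    static List<Map<String, String>> matchBGPDB(List<String[]> db, List<String[]> patterns) {
        List<String[]> pats = new ArrayList<>(patterns);
        pats.sort(Comparator.comparingInt(MatchBGPDB::rank)); // step 1
        List<List<String[]>> sets = new ArrayList<>();        // step 2 (stand-in for matchTP-DB)
        for (String[] p : pats) {
            List<String[]> s = new ArrayList<>();
            for (String[] t : db) {
                boolean ok = true;
                for (int i = 0; i < 3; i++)
                    if (!isVar(p[i]) && !p[i].equals(t[i])) ok = false;
                if (ok) s.add(t);
            }
            sets.add(s);
        }
        List<Map<String, String>> out = new ArrayList<>();
        join(pats, sets, 0, new HashMap<>(), out);            // step 3
        return out;
    }

    // Recursion replaces the N nested loops; bindings of shared variables
    // must agree, and intermediate results are never materialized.
    static void join(List<String[]> pats, List<List<String[]>> sets, int depth,
                     Map<String, String> bind, List<Map<String, String>> out) {
        if (depth == pats.size()) { out.add(new HashMap<>(bind)); return; }
        for (String[] t : sets.get(depth)) {
            Map<String, String> b = new HashMap<>(bind);
            boolean ok = true;
            for (int i = 0; i < 3 && ok; i++) {
                if (isVar(pats.get(depth)[i])) {
                    String prev = b.putIfAbsent(pats.get(depth)[i], t[i]);
                    if (prev != null && !prev.equals(t[i])) ok = false; // shared variable disagrees
                }
            }
            if (ok) join(pats, sets, depth + 1, b, out);
        }
    }

    public static void main(String[] args) {
        List<String[]> db = List.of(
            new String[]{"<D>", "<generatedArtifact>", "<A>"},
            new String[]{"<D>", "<generatedByProcess>", "<P>"},
            new String[]{"<C>", "<usedByProcess>", "<P>"});
        List<Map<String, String>> res = matchBGPDB(db, List.of(
            new String[]{"?d", "<generatedArtifact>", "<A>"},
            new String[]{"?d", "<generatedByProcess>", "?p"}));
        System.out.println(res.size()); // one solution: ?d = <D>, ?p = <P>
    }
}
```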

  17. Querying Algorithms
• Other SPARQL features:
  • Projection (SELECT)
    • Performed during the triple pattern or basic graph pattern matching phase
  • Filtering (FILTER)
    • Logical connectives, inequality and equality operators, unary predicates, etc.
    • Performed during the triple pattern or basic graph pattern matching phase
  • Alternative graph patterns (UNION)
    • Easy if triple sets are union-compatible; need to extend "schemas" otherwise
  • Optional graph patterns (OPTIONAL)
    • Most complicated: nested optional and parallel optional constructs
    • Top-down and bottom-up evaluation approaches

  18. Performance Study
• Algorithms were implemented in Java
• Cluster setup:
  • 5 nodes: 1 master and ProvBase server, 4 region servers
  • Gateway E3600 computers: 1.8 GHz Pentium 4 processor, 1 GB RAM, IDE hard drives with 16+ GB of free space, gigabit Ethernet adapter
  • D-Link DGS-2208 gigabit switch
  • Debian 5.0.3 operating system, OpenJDK 1.6.0, Hadoop 0.20.1, HBase 0.20.3
• Datasets and queries:
  • Load Workflow from the 3rd Provenance Challenge
  • Three queries from the 3rd Provenance Challenge
  • Tupelo's OWL vocabulary
  • Each workflow run generated ~700 RDF triples

  19. Performance Study Datasets

  20. Performance Study Queries

  21. Performance Study Data ingest performance

  22. Performance Study Query performance

  23. Performance Study
• Query optimization: triple patterns from Q2
  • ?table opm:generatedArtifact p:tableID
    • Tp returns ~3 million triples for the largest dataset; To returns 1 triple
  • ?table opm:generatedByProcess ?process
    • Tp returns millions of triples
    • Ts can be used only if the binding of ?table is known from the previous triple pattern, so that ?table opm:generatedByProcess ?process becomes p:table1 opm:generatedByProcess ?process
    • Ts then returns few triples
• Performance of Q2 and Q3 can be substantially improved using this variable substitution technique
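The substitution step above can be illustrated as follows (a hypothetical helper, not the paper's code): once a selective pattern binds ?table, the binding is pushed into the next pattern, turning its subject into a constant so matchTP-DB can use a Ts row-key lookup instead of scanning Tp.

```java
import java.util.*;

// Sketch of the variable-substitution optimization: replace already-bound
// variables in a triple pattern with their values before evaluating it.
public class VarSubstitution {
    static String[] substitute(String[] pattern, Map<String, String> bindings) {
        String[] out = new String[3];
        for (int i = 0; i < 3; i++)
            out[i] = bindings.getOrDefault(pattern[i], pattern[i]);
        return out;
    }

    public static void main(String[] args) {
        // ?table was bound to p:table1 by the previous (selective) pattern.
        Map<String, String> bindings = Map.of("?table", "p:table1");
        String[] next = {"?table", "opm:generatedByProcess", "?process"};
        // Subject is now constant, so Ts can serve this pattern by row key.
        System.out.println(Arrays.toString(substitute(next, bindings)));
        // prints [p:table1, opm:generatedByProcess, ?process]
    }
}
```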

  24. Concluding Remarks & Future Work
• Provenance of 100,000 workflow executions was efficiently stored and queried on a small cluster of commodity machines; very cost-effective
• Future Work:
  • Optional graph patterns
  • Distributing workload among ProvBase servers
  • Row key hashing
  • Encoding multiple triple pattern terms or even graph patterns as row keys
  • Experimental comparison with existing relational and native RDF stores
  • Experimental comparison with a relational RDF store deployed on a MySQL cluster
  • Inference
  • Region size optimizations

  25. Acknowledgement We would like to thank David Kirtley, the Software Systems Specialist for the Department of Computer Science at UTPA, for his assistance with various technical issues that occurred during this research.

  26. Thank You! Questions?
Artem Chebotko
Department of Computer Science, University of Texas – Pan American
artem@cs.panam.edu
http://www.cs.panam.edu/~artem
