1 / 42

Scalable SPARQL Querying of Large RDF Graphs

Scalable SPARQL Querying of Large RDF Graphs. Xu Bo 2012.06.11. In PVLDB, 4(21), 2011. Outline. About Presenter Semantic Web Previous Work New Problem SYSTEM ARCHITECTURE EXPERIMENTS CONCLUSIONS AND FUTURE WORK. About Presenter. Daniel Abadi Associate Professor of Computer

abeni
Download Presentation

Scalable SPARQL Querying of Large RDF Graphs

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Scalable SPARQL Querying of Large RDF Graphs Xu Bo 2012.06.11 In PVLDB, 4(21), 2011

  2. Outline • About Presenter • Semantic Web • Previous Work • New Problem • SYSTEM ARCHITECTURE • EXPERIMENTS • CONCLUSIONS AND FUTURE WORK

  3. About Presenter • Daniel Abadi • Associate Professor of Computer Science in Yale University • Research • Column-Oriented Database Systems • Petascale Parallel Database Systems (HadoopDB) • Semantic Web Data Management

  4. Semantic Web • The vision of Semantic Web is to build a "web of data" that enables machines to understand the semantics of information on the Web

  5. Google Knowledge Graph

  6. Key Technology • HTML • XML

  7. The Disadvantage of XML • David Billington is a lecturer of Discrete Mathematics. • there is no standard way of assigning meaning to tag nesting

  8. The Disadvantage of Xpath • Suppose we want to collect all academic staff members. A path expression in Xpath might be //academicStaffMember • XML is semantically unsatisfactory

  9. RDF • Resource Description Framework • 用Web标识符(称作统一资源标识符,Uniform Resource Identifiers或URIs)来标识事物,用简单的属性(property)及属性值来描述资源

  10. RDF as Triples and a Graph

  11. SPARQL • RDF query language • A basic graph pattern • Answering SPARQL can be seen as finding subgraphs in the RDF data that match the graph pattern

  12. Example for Star Pattern • Find the names of the strikers that play for FC Barcelona.

  13. Another Example • Find football players playing for clubs in a populous region where they were born.

  14. Previous Work • RDF In RDBMSs • Property Tables • Vertically Partitioned Approach

  15. RDF In RDBMSs • Get the title of the book(s) Joe Fox wrote in 2001

  16. Property Tables

  17. Vertically Partitioned Approach

  18. New Problem • Single node RDF management systems are abundant • Sesame • Jena • RDF3X • 3store • Research in clustered RDF management is less significantly explored: The focus of the talk

  19. SYSTEM ARCHITECTURE

  20. Graph Partitioning • Hash vs. Graph partitioning • Hash: Only efficient for star patterns • Graph: Taking advantage of graph model

  21. Graph Partitioning • Edge vs. Vertex partitioning • Edge: Natural but inefficient for query execution • Vertex: Superior for common graph patterns

  22. Vertex Partitioning • Preprocess • remove triples whose predicate is rdf:type • METIS partitioner

  23. Triple Placement • Minimizing data shuffling/exchange • Allowing data overlap • N-hop guarantee • The extent of data overlap • If a vertex is assigned to a machine, any vertex that is within n-hop of this vertex is also stored in this machine

  24. DIRECTED N-HOP GUARANTEE

  25. A potential problem • triples (s, p, o) and (o, p’, o’) • 2-hop guarantee • triples (s, p, o) and (s’, p’, o) • not guaranteed • “object-connected” is not unusual • undirected n-hop guarantee

  26. Triple Placement Algorithm

  27. Query Processing • Queries are executed in RDF-stores and/or Hadoop • Query execution is more efficient in RDF-stores than in Hadoop • Pushing as much of the processing as possible into RDF-stores • Minimizing the number of Hadoop jobs • The larger the hop guarantee, the more work is done in RDF-stores

  28. To Communicate, or not to Communicate • Given a query and n-hop guarantee, is communication (Hadoop job) between nodes needed? • Choose the “center” of the query graph • Calculate the distance from the “center” to the furthest edge • If distance > n, communication is needed; not needed otherwise

  29. Determining whether a Query is PWOC • PWOC Query • parallelizable without communication • DoFE • distance of farthest edge • the vertex in a graph with the smallest DoFE will be the most central in a graph

  30. The algorithm

  31. the issue of duplicate results • naive approach • remove duplicates after the query has completed • owner-computes model • add triples (v, ‘<isOwned>’, ‘Yes’) to a partition • For each query issued to the RDF-stores, add an additional pattern (core, ‘<isOwned>’, ‘Yes’)

  32. A query is not PWOC • decompose the query into PWOC subqueries • use Hadoop jobs to join the results of the PWOC subqueries • The number of Hadoop jobs required to complete the query increases as the number of subqueries increases

  33. minimal number of subqueries • reduces to the problem of finding minimal edge partitioning of a graph into subgraphs of bounded diameter • brute-force

  34. Examlple DoFEs for manager, footballClub, Barcelona and club are 2, 2, 2 and 1 the DoFEs for footballer, pop, region, player and club are 3, 3, 2, 2 and 2,

  35. Decompose Example

  36. EXPERIMENTS • 20-machine cluster • Leigh University Benchmark (LUBM): 270 million triples • Competitors: • Single-node RDF-3X • SHARD: triple-store system in Hadoop • Graph partitioning (the proposed system) • Hash partitioning on subjects

  37. Data Load Time

  38. Performance Comparison

  39. Varying Number of Machines

  40. Summary

  41. Thanks !

More Related