Scalable SPARQL Querying of Large RDF Graphs

Scalable SPARQL Querying of Large RDF Graphs Xu Bo 2012.06.11 In PVLDB, 4(21), 2011

Outline • About Presenter • Semantic Web • Previous Work • New Problem • SYSTEM ARCHITECTURE • EXPERIMENTS • CONCLUSIONS AND FUTURE WORK

About Presenter • Daniel Abadi • Associate Professor of Computer Science in Yale University • Research • Column-Oriented Database Systems • Petascale Parallel Database Systems (HadoopDB) • Semantic Web Data Management

Semantic Web • The vision of Semantic Web is to build a "web of data" that enables machines to understand the semantics of information on the Web

Google Knowledge Graph

Key Technology • HTML • XML

The Disadvantage of XML • David Billington is a lecturer of Discrete Mathematics. • there is no standard way of assigning meaning to tag nesting

The Disadvantage of Xpath • Suppose we want to collect all academic staff members. A path expression in Xpath might be //academicStaffMember • XML is semantically unsatisfactory

RDF • Resource Description Framework • 用Web标识符（称作统一资源标识符，Uniform Resource Identifiers或URIs）来标识事物，用简单的属性（property）及属性值来描述资源

RDF as Triples and a Graph

SPARQL • RDF query language • A basic graph pattern • Answering SPARQL can be seen as finding subgraphs in the RDF data that match the graph pattern

Example for Star Pattern • Find the names of the strikers that play for FC Barcelona.

Another Example • Find football players playing for clubs in a populous region where they were born.

Previous Work • RDF In RDBMSs • Property Tables • Vertically Partitioned Approach

RDF In RDBMSs • Get the title of the book(s) Joe Fox wrote in 2001

Property Tables

Vertically Partitioned Approach

New Problem • Single node RDF management systems are abundant • Sesame • Jena • RDF3X • 3store • Research in clustered RDF management is less significantly explored: The focus of the talk

SYSTEM ARCHITECTURE

Graph Partitioning • Hash vs. Graph partitioning • Hash: Only efficient for star patterns • Graph: Taking advantage of graph model

Graph Partitioning • Edge vs. Vertex partitioning • Edge: Natural but inefficient for query execution • Vertex: Superior for common graph patterns

Vertex Partitioning • Preprocess • remove triples whose predicate is rdf:type • METIS partitioner

Triple Placement • Minimizing data shuffling/exchange • Allowing data overlap • N-hop guarantee • The extent of data overlap • If a vertex is assigned to a machine, any vertex that is within n-hop of this vertex is also stored in this machine

DIRECTED N-HOP GUARANTEE

A potential problem • triples (s, p, o) and (o, p’, o’) • 2-hop guarantee • triples (s, p, o) and (s’, p’, o) • not guaranteed • “object-connected” is not unusual • undirected n-hop guarantee

Triple Placement Algorithm

Query Processing • Queries are executed in RDF-stores and/or Hadoop • Query execution is more efficient in RDF-stores than in Hadoop • Pushing as much of the processing as possible into RDF-stores • Minimizing the number of Hadoop jobs • The larger the hop guarantee, the more work is done in RDF-stores

To Communicate, or not to Communicate • Given a query and n-hop guarantee, is communication (Hadoop job) between nodes needed? • Choose the “center” of the query graph • Calculate the distance from the “center” to the furthest edge • If distance > n, communication is needed; not needed otherwise

Determining whether a Query is PWOC • PWOC Query • parallelizable without communication • DoFE • distance of farthest edge • the vertex in a graph with the smallest DoFE will be the most central in a graph

The algorithm

the issue of duplicate results • naive approach • remove duplicates after the query has completed • owner-computes model • add triples (v, ‘<isOwned>’, ‘Yes’) to a partition • For each query issued to the RDF-stores, add an additional pattern (core, ‘<isOwned>’, ‘Yes’)

A query is not PWOC • decompose the query into PWOC subqueries • use Hadoop jobs to join the results of the PWOC subqueries • The number of Hadoop jobs required to complete the query increases as the number of subqueries increases

minimal number of subqueries • reduces to the problem of finding minimal edge partitioning of a graph into subgraphs of bounded diameter • brute-force

Examlple DoFEs for manager, footballClub, Barcelona and club are 2, 2, 2 and 1 the DoFEs for footballer, pop, region, player and club are 3, 3, 2, 2 and 2,

Decompose Example

EXPERIMENTS • 20-machine cluster • Leigh University Benchmark (LUBM): 270 million triples • Competitors: • Single-node RDF-3X • SHARD: triple-store system in Hadoop • Graph partitioning (the proposed system) • Hash partitioning on subjects

Data Load Time

Performance Comparison

Varying Number of Machines

Summary

Thanks !

Scalable SPARQL Querying of Large RDF Graphs