Efficient SPARQL Query Processing in MapReduce through Data Partitioning and Indexing

Nie Zhi niezhixuesen@163.com

Outline • Introduction • Related work • SPARQL Query Processing in MapReduce • Experiments • Conclusion

RDF • Resource Description Framework • subject-predicate-object expressions (S-P-O) • Example graph (namespace http://www.mpii.de/yago/resource/): • Albert Einstein (S) isCalled (P) Albert Einstein (O) • Albert Einstein isCalled 阿尔伯特•爱因斯坦 • Albert Einstein wasBornIn Ulm • Albert Einstein hasWonPrize Nobel Prize in Physics

SPARQL Query Language for RDF • Query: PREFIX source: <http://www.mpii.de/yago/resource/> SELECT ?name ?where WHERE { ?who source:hasWonPrize Nobel Prize in Physics. ?who source:isCalled ?name. ?who source:wasBornIn ?where } • Example graph the query is matched against (namespace http://www.mpii.de/yago/resource/): Albert Einstein isCalled Albert Einstein; isCalled 阿尔伯特•爱因斯坦; wasBornIn Ulm; hasWonPrize Nobel Prize in Physics

RDF knowledge base… • Semantic Web, Web 2.0 • Extract knowledge from the Web • YAGO • DBpedia • Freebase • Billion Triple Challenge…

RDF knowledge base • 295 data sets • 31 billion RDF triples • 504 million RDF links (September 2011)

Challenge and Opportunity • Challenge • RDF data is growing rapidly; researchers are working with billions of triples. • Relational databases have limited scalability. • Opportunity • Google GFS, MapReduce, BigTable • Hadoop: an implementation of the MapReduce framework and HDFS • Successful deployments: Yahoo!, Amazon, Tencent, Baidu, Taobao... • We need to build on these recent achievements in handling massive-scale Web data on clusters

MapReduce: word count • Map(k1,v1) → list(k2,v2) • Reduce(k2, list(v2)) → list(k3,v3)
Input • file1: the weather is good • file2: today is good • file3: good weather is good.
Map output • Worker 1: (the 1), (weather 1), (is 1), (good 1) • Worker 2: (today 1), (is 1), (good 1) • Worker 3: (good 1), (weather 1), (is 1), (good 1)
Reduce input • Worker 1: (the 1) • Worker 2: (is 1), (is 1), (is 1) • Worker 3: (weather 1), (weather 1) • Worker 4: (today 1) • Worker 5: (good 1), (good 1), (good 1), (good 1)
Reduce output • Worker 1: (the 1) • Worker 2: (is 3) • Worker 3: (weather 2) • Worker 4: (today 1) • Worker 5: (good 4)
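The word-count flow above can be imitated in a single process. The following is a minimal Python sketch, not Hadoop code: the function names map_phase, shuffle, and reduce_phase are illustrative assumptions, and the grouping step stands in for the shuffle that the framework performs between map and reduce.

```python
from collections import defaultdict

def map_phase(text):
    """Map: emit a (word, 1) pair for every word in one input file."""
    return [(word, 1) for word in text.lower().replace(".", "").split()]

def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

# The three input files from the slide.
files = {
    "file1": "the weather is good",
    "file2": "today is good",
    "file3": "good weather is good.",
}

map_output = [pair for text in files.values() for pair in map_phase(text)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(map_output).items())
print(counts)
```

The resulting counts match the reduce output on the slide: the 1, is 3, weather 2, today 1, good 4.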

Solution 1 • Directly map the SPARQL query into a sequence of MapReduce jobs • Pro. • Scalable • Con. • A burden on the user in terms of usage and maintenance • Does not support complex queries • No index • Does not consider the characteristics of RDF data

Solution 2 • Map the SPARQL query to Pig -> MapReduce jobs • Pro. • Scalable • Supports complex queries • Con. • No index • Does not consider the characteristics of RDF data

Architecture overview (components in the diagram, top to bottom) • SPARQL Translator • RDF 2 JSON Loader • SPARQL constructs: BGP, Union, Filter, Optional • JAQL operators: Transform, Filter, Join, Sort, Group, Built-in Functions • JAQL Query Language • Optimizer • JSON Data Model • Map-Reduce Runtime • HDFS • Cluster Deployment and Management

JSON • JSON (JavaScript Object Notation) is a lightweight data-interchange format • It is based on a subset of the JavaScript Programming Language • JSON is built on two structures: • A collection of name/value (Key/value) pairs • An ordered list of values (array)

RDF to JSON • JSON is built on two structures: • name/value (Key/value) pairs {s:Albert Einstein} • list of values(array) [{s:Albert Einstein},{}…]
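As a rough illustration of the RDF-to-JSON step, the sketch below turns each triple into one JSON object and collects the objects in an array. The keys s, p, and o follow the record shown later in the deck; the helper name triple_to_record is an assumption made for this example, and the quoted keys are what strict JSON requires (the slides use an unquoted shorthand).

```python
import json

def triple_to_record(s, p, o):
    """Represent one RDF triple as a JSON object with keys s, p, o."""
    return {"s": s, "p": p, "o": o}

# Triples from the example graph on the RDF slide.
triples = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
]

# An array of name/value objects, one per triple.
records = [triple_to_record(*t) for t in triples]
print(json.dumps(records, indent=2))
```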

JAQL • JAQL is an open-source language for querying JSON (JavaScript Object Notation) data • It provides a general parallel data processing platform on Hadoop • Developed by IBM

Basic Idea • SPARQL can be supported on Hadoop by translating queries into JAQL operators

SPARQL to JAQL Transformation • Fig. Numbered query steps (1 to 4) mapped to MapReduce job1, job2, job3, and job4 • Example record: {s:Albert Einstein, p:isCalled, o:Albert Einstein}
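To make the translation idea concrete, here is a minimal Python sketch of how the three triple patterns of the earlier example query could be evaluated with filter, transform, and join steps over s/p/o records. It is not JAQL and not the authors' translator: the functions scan and join are illustrative assumptions, and the "Person X" records are hypothetical data added only so that the filter has something to exclude. On Hadoop, each such join would typically run as one or more MapReduce jobs.

```python
def scan(records, p, s_var, o_var=None, o_const=None):
    """Filter + transform: keep records with predicate p and emit variable bindings."""
    out = []
    for r in records:
        if r["p"] != p:
            continue
        if o_const is not None and r["o"] != o_const:
            continue
        binding = {s_var: r["s"]}
        if o_var is not None:
            binding[o_var] = r["o"]
        out.append(binding)
    return out

def join(left, right, var):
    """Join two binding lists on a shared variable (simple in-memory hash join)."""
    index = {}
    for b in right:
        index.setdefault(b[var], []).append(b)
    return [{**l, **r} for l in left for r in index.get(l[var], [])]

records = [
    {"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"},
    {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"},
    {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"},
    # Hypothetical records, not from the slides:
    {"s": "Person X", "p": "isCalled", "o": "Person X"},
    {"s": "Person X", "p": "wasBornIn", "o": "Town Y"},
]

# One scan per triple pattern, then joins on the shared variable ?who.
winners = scan(records, "hasWonPrize", "who", o_const="Nobel Prize in Physics")
names = scan(records, "isCalled", "who", o_var="name")
places = scan(records, "wasBornIn", "who", o_var="where")
rows = join(join(winners, names, "who"), places, "who")

result = [(r["name"], r["where"]) for r in rows]
print(result)
```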

Data storage • In the Hadoop framework, a file is the smallest unit of input that a MapReduce job reads from disk. • One straightforward strategy is to store all the data in one file, but then every read operation must scan the entire dataset. • This motivates a data partitioning strategy.

Data Partitioning Strategy • Horizontal partitioning • Vertical partitioning • Clustered property partitioning

Horizontal partitioning with JSON • For example • Store in HDFS

Vertical Partitioning with JSON • For example • Store in HDFS
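The slide's own example is not available here, so the following Python sketch assumes the common meaning of vertical partitioning for RDF: one partition per predicate, each holding the (s, o) pairs for that predicate. The function names are illustrative, and a local temporary directory stands in for HDFS. Using the predicate as a file name is only safe for simple names like the ones in this example.

```python
import json
import os
import tempfile
from collections import defaultdict

def vertical_partition(records):
    """Group s/p/o records by predicate; each group keeps only s and o."""
    parts = defaultdict(list)
    for r in records:
        parts[r["p"]].append({"s": r["s"], "o": r["o"]})
    return dict(parts)

def write_partitions(parts, directory):
    """Write one JSON file per predicate and return predicate -> path."""
    paths = {}
    for predicate, rows in parts.items():
        path = os.path.join(directory, predicate + ".json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(rows, f, ensure_ascii=False)
        paths[predicate] = path
    return paths

records = [
    {"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"},
    {"s": "Albert Einstein", "p": "isCalled", "o": "阿尔伯特•爱因斯坦"},
    {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"},
    {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"},
]

parts = vertical_partition(records)

# A local temporary directory is used here in place of HDFS.
with tempfile.TemporaryDirectory() as tmp:
    paths = write_partitions(parts, tmp)
    with open(paths["wasBornIn"], encoding="utf-8") as f:
        born = json.load(f)

print(sorted(parts))
print(born)
```

With this layout, a triple pattern with a known predicate only needs to read that predicate's file rather than the whole dataset.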

Clustered property partitioning with JSON • For example • Store in HDFS

Partition Index: Vertical Partitioning
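The index structure itself is not shown on this slide. As one plausible reading, the Python sketch below assumes the index is a lookup from predicate to the file that stores that predicate's partition, so a query only opens the files for the predicates it mentions. The functions build_index and files_for_query and the /rdf/vertical/ paths are hypothetical.

```python
def build_index(predicates):
    """Map each predicate to the (hypothetical) file holding its partition."""
    return {p: "/rdf/vertical/" + p + ".json" for p in predicates}

def files_for_query(index, query_predicates):
    """Return the partition files a query needs, skipping unknown predicates."""
    return sorted({index[p] for p in query_predicates if p in index})

# Predicates that appear in the deck's example graph and queries.
predicates = ["isCalled", "wasBornIn", "hasWonPrize", "diedIn"]
index = build_index(predicates)

# The example query uses three predicates, so only three files are read.
needed = files_for_query(index, ["hasWonPrize", "isCalled", "wasBornIn"])
print(needed)
```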

Partition Index: Horizontal partitioning

Partition Index: Clustered property partitioning

Experiments • Dataset: Billion Triples Challenge 2010 (BTC10) • 3.2B <s, p, o, q> quads, 624 GB; the resulting dataset has 1,426,823,976 unique triples • Hadoop 0.20.2, Ubuntu 10.04, Linux 2.6.32-24-server 64-bit • 30 nodes: one node is the master and the others are slaves • 47 GB memory, 4.3 TB disk space, and 24 processors (Intel(R) Xeon(R) CPU E5645 @ 2.40GHz) • "dfs.replication" is 2 • JAQL version 0.5.1 • Java 1.6

Experiments Fig. Distribution of data

Experiments Fig. Execution time of each query

Conclusion • A solution for SPARQL queries in MapReduce: transform the queries into JAQL operators running on Hadoop • Transformation of SPARQL to JAQL: Filter, Transform, Join ... • Data partitioning strategies: horizontal partitioning, vertical partitioning, clustered property partitioning • Experiments show the performance: clustered property partitioning performs best; horizontal partitioning performs worst

Scalability • RDBMS • Waits and deadlocks increase nonlinearly with transaction size and concurrency • Scale-up (vertical scaling): commercial RDBMSes are very expensive • Schema: structured data • MapReduce • Linear scaling, high throughput • Scale-out (horizontal scaling) • Schema-free: unstructured data

RDBMS vs. MapReduce Table. RDBMS compared to MapReduce

Limits of Hadoop • The Apache Hadoop MapReduce framework has hit a scalability limit at around 4,000 machines • The MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability, and performance

The Next Generation of Apache Hadoop MapReduce • Divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components. • ResourceManager • ApplicationMaster

Conclusion • SPARQL on Cloud • Pro. • Scalable • High throughput • Con. • Expense of latency • Complex query: JAQL • Join operation • Hadoop (MapReduce) • Pro. • Scalable • High throughput • Con. • Expense of latency • No index • No more than 4,000 nodes

Thanks!

SPARQL queries
Q1: select ?X ?Y where { ?X rdfs:label "Albert Einstein". ?X smc:page ?Y. ?X rdf:type smc:Subject. }
Q2: select ?x ?y ?z where { dbsc:Ulm rdf:type ?x. ?x rdfs:label ?y. ?x rdfs:comment ?z. }
Q3: select ?who ?Y ?date1 ?Z ?date2 ?prize where { ?who source:bornIn ?Y. ?who source:bornOnDate ?date1. ?who source:diedIn ?Z. ?who source:diedOnDate ?date2. ?who source:hasWonPrize ?prize. }
Q4: select ?x ?author ?title where { ?x purl:hasAuthor ?author. ?x purl:hasBooktitle "ISWC 2009". ?x purl:hasTitle ?title. }
Q5: select distinct ?name ?lat ?long ?pop where { ?a property:name ?name. ?a property:region dbsc:Nord-Pas-de-Calais. ?a pos:lat ?lat. ?a pos:long ?long. ?a property:population ?pop. }

SPARQL queries
Q6: select ?bn ?b ?p where { ?a property:name ?bn. ?a property:dateOfBirth ?b. ?a property:placeOfBirth ?p. }
Q7: select ?Y ?type ?prize where { source:Albert_Einstein source:bornIn ?Y. source:Albert_Einstein rdf:type ?type. source:Albert_Einstein source:hasWonPrize ?prize. }
Q8: select ?a ?type ?pub where { ?a rdf:type ?type. ?a semweb:publisher ?pub. ?a semweb:periodical_title "Theory of Computing Systems". }
Q9: select distinct ?a ?lat ?long ?pop where { ?a geo:ontology#name "Chevilly". ?a geo:ontology#inCountry geo:countries#FR. ?a pos:lat ?lat. ?a pos:long ?long. ?a geo:ontology#population ?pop. }
Q10: select distinct ?l ?long ?lat where { ?a property:placeOfBirth ?l. ?l pos:lat ?lat. ?l pos:long ?long. }

SPARQL queries • Q3 and Q10 are star-join queries with popular predicates and unspecified objects • Q1, Q4, Q5, Q6, Q8, and Q9 are also star joins, but with one or more known objects • Q2 is a chain query • In Q7 the subject is a fixed value rather than a variable
