Efficient SPARQL Query Processing in MapReduce through Data Partitioning and Indexing Nie Zhi niezhixuesen@163.com
Outline • Introduction • Related work • SPARQL Query Processing in MapReduce • Experiments • Conclusion
RDF
• Resource Description Framework
• Subject-predicate-object (S-P-O) expressions
• Example graph from YAGO (http://www.mpii.de/yago/resource/), with Albert Einstein as the subject S:
  (Albert Einstein, isCalled, Albert Einstein)
  (Albert Einstein, isCalled, 阿尔伯特•爱因斯坦)
  (Albert Einstein, wasBornIn, Ulm)
  (Albert Einstein, hasWonPrize, Nobel Prize in Physics)
SPARQL
• Query Language for RDF
• Query:
  PREFIX source: <http://www.mpii.de/yago/resource/>
  SELECT ?name ?where
  WHERE {
    ?who source:hasWonPrize source:Nobel_Prize_in_Physics .
    ?who source:isCalled ?name .
    ?who source:wasBornIn ?where }
• Figure: the query pattern matched against the Albert Einstein example graph (isCalled, wasBornIn, hasWonPrize).
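To make the matching step concrete, here is a minimal sketch of how the three triple patterns of the query bind variables against the example graph. It is a naive in-memory matcher in plain Python, not the approach of the talk; the function names `match_pattern` and `evaluate` are hypothetical, and node labels are taken from the example graph with prefixes omitted.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None on failure."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check it agrees with an earlier binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            # Constant: must be equal to the triple's component.
            return None
    return result


def evaluate(patterns, triples):
    """Evaluate a list of triple patterns by extending bindings pattern by pattern."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Triples from the example graph (prefixes omitted for brevity).
triples = [
    ("Albert_Einstein", "isCalled", "Albert Einstein"),
    ("Albert_Einstein", "isCalled", "阿尔伯特•爱因斯坦"),
    ("Albert_Einstein", "wasBornIn", "Ulm"),
    ("Albert_Einstein", "hasWonPrize", "Nobel Prize in Physics"),
]

query = [
    ("?who", "hasWonPrize", "Nobel Prize in Physics"),
    ("?who", "isCalled", "?name"),
    ("?who", "wasBornIn", "?where"),
]

# Project the SELECT variables ?name and ?where.
results = [(b["?name"], b["?where"]) for b in evaluate(query, triples)]
print(results)
```

Each pattern after the first acts as a join on the shared variable ?who, which is why queries of this shape are called star joins.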
RDF knowledge bases
• Semantic Web, Web 2.0
• Extracting knowledge from the Web
• YAGO, DBpedia, Freebase, Billion Triple Challenge, …
RDF knowledge base
• 295 data sets
• 31 billion RDF triples
• 504 million RDF links
• (September 2011)
Challenge and Opportunity
• Challenge
  • RDF data is growing rapidly; researchers are working with billions of triples.
  • Relational databases have limited scalability.
• Opportunity
  • Google: GFS, MapReduce, BigTable
  • Hadoop: an implementation of the MapReduce framework and HDFS
  • Adopters: Yahoo!, Amazon, Tencent, Baidu, Taobao, ...
• We need to build on these recent achievements in handling massive-scale Web data on clusters.
MapReduce: word count
• Map(k1, v1) → list(k2, v2)
• Reduce(k2, list(v2)) → list(k3, v3)
• Input
  • file1: the weather is good
  • file2: today is good
  • file3: good weather is good.
• Map output
  • Worker 1: (the 1), (weather 1), (is 1), (good 1)
  • Worker 2: (today 1), (is 1), (good 1)
  • Worker 3: (good 1), (weather 1), (is 1), (good 1)
• Reduce input
  • Worker 1: (the 1)
  • Worker 2: (is 1), (is 1), (is 1)
  • Worker 3: (weather 1), (weather 1)
  • Worker 4: (today 1)
  • Worker 5: (good 1), (good 1), (good 1), (good 1)
• Reduce output
  • Worker 1: (the 1)
  • Worker 2: (is 3)
  • Worker 3: (weather 2)
  • Worker 4: (today 1)
  • Worker 5: (good 4)
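The word-count example can be simulated in a few lines of plain Python. This is only a single-process sketch of the map, shuffle and reduce steps, not Hadoop code; the function names `map_phase`, `shuffle` and `reduce_phase` are hypothetical, and the trailing period in file3 is stripped as an assumption about tokenization.

```python
from collections import defaultdict


def map_phase(filename, text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the file."""
    return [(word.strip(".").lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce(k2, list(v2)) -> list(k3, v3): sum the counts of one word."""
    return [(word, sum(counts))]


files = {
    "file1": "the weather is good",
    "file2": "today is good",
    "file3": "good weather is good.",
}

# Map: one call per input file.
map_output = [pair for name, text in files.items() for pair in map_phase(name, text)]

# Shuffle: every word's counts end up at the same reducer.
reduce_input = shuffle(map_output)

# Reduce: one call per distinct word.
word_counts = dict(
    result
    for word, counts in reduce_input.items()
    for result in reduce_phase(word, counts)
)
print(word_counts)
```

In a real cluster each `map_phase` call runs on a different worker and the shuffle moves data over the network; the per-key grouping is the same.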
Outline • Introduction • Related work • SPARQL Query Processing in MapReduce • Experiments • Conclusion
Solution 1
• Directly map the SPARQL query into a sequence of MapReduce jobs
• Pro
  • Scalable
• Con
  • A burden on the user in terms of usage and maintenance
  • Does not support complex queries
  • No index
  • Does not consider the characteristics of RDF data
Solution 2
• Map the SPARQL query to Pig, which is then compiled to MapReduce jobs
• Pro
  • Scalable
  • Supports complex queries
• Con
  • No index
  • Does not consider the characteristics of RDF data
Outline • Introduction • Related work • SPARQL Query Processing in MapReduce • Experiments • Conclusion
Architecture overview
• SPARQL Translator (BGP, Union, Filter, Optional) and RDF-to-JSON Loader
• JAQL Query Language (Transform, Filter, Join, Sort, Group, Built-in Functions)
• Optimizer
• JSON Data Model
• Map-Reduce Runtime
• HDFS
• Cluster Deployment and Management
JSON • JSON (JavaScript Object Notation) is a lightweight data-interchange format • It is based on a subset of the JavaScript Programming Language • JSON is built on two structures: • A collection of name/value (Key/value) pairs • An ordered list of values (array)
RDF to JSON • JSON is built on two structures: • name/value (Key/value) pairs {s:Albert Einstein} • list of values(array) [{s:Albert Einstein},{}…]
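As a rough illustration of the mapping, the sketch below turns RDF triples into JSON records and serializes them as an array. The helper name `triple_to_record` is hypothetical; the keys `p` and `o` are assumed by analogy with the `s` key above and the {s, p, o} record shown on the transformation slide.

```python
import json


def triple_to_record(subject, predicate, obj):
    """Represent one RDF triple as a JSON object of name/value pairs."""
    return {"s": subject, "p": predicate, "o": obj}


# Triples from the Albert Einstein example graph.
triples = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
]

# A data set becomes an ordered list (JSON array) of such objects.
records = [triple_to_record(*triple) for triple in triples]
json_text = json.dumps(records, ensure_ascii=False)
print(json_text)
```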
JAQL
• JAQL is an open-source language for querying JSON (JavaScript Object Notation) data.
• It provides a general parallel data processing platform on Hadoop.
• Developed by IBM.
Basic Idea • SPARQL can be supported on Hadoop by translating queries into JAQL operators
SPARQL to JAQL Transformation
• Figure: the numbered parts of the query are translated into a sequence of MapReduce jobs (job1 to job4).
• Example record: {s:Albert Einstein, p:isCalled, o:Albert Einstein}
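The idea of expressing a query as Filter, Transform and Join operators over {s, p, o} records can be sketched in plain Python. This is a stand-in for illustration only, not JAQL syntax and not the translator from the talk; the helpers `filter_records`, `transform` and `join` are hypothetical, and how operators are grouped into MapReduce jobs is not modeled.

```python
def filter_records(records, predicate, obj=None):
    """Filter: keep records with the given predicate (and object, if fixed)."""
    return [
        r for r in records
        if r["p"] == predicate and (obj is None or r["o"] == obj)
    ]


def transform(records, subject_var, object_var=None):
    """Transform: rename record fields to the query's variable names."""
    rows = []
    for r in records:
        row = {subject_var: r["s"]}
        if object_var is not None:
            row[object_var] = r["o"]
        rows.append(row)
    return rows


def join(left, right, key):
    """Join: hash join of two row lists on a shared variable."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


records = [
    {"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"},
    {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"},
    {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"},
]

# One filter + transform per triple pattern of the Nobel Prize query.
winners = transform(filter_records(records, "hasWonPrize", "Nobel Prize in Physics"), "who")
names = transform(filter_records(records, "isCalled"), "who", "name")
places = transform(filter_records(records, "wasBornIn"), "who", "where")

# Join the patterns on the shared variable ?who, then project ?name and ?where.
rows = join(join(winners, names, "who"), places, "who")
answer = [{"name": r["name"], "where": r["where"]} for r in rows]
print(answer)
```

On a cluster, the joins are the expensive step because rows sharing a key must be brought to the same worker.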
Data storage
• In the Hadoop framework, a file is the smallest unit of input that a MapReduce job reads from disk.
• One straightforward strategy is to store all the data in one file.
• Every read operation must then scan the entire dataset.
• This calls for a data partitioning strategy.
Data Partitioning Strategy
• Horizontal partitioning
• Vertical partitioning
• Clustered property partitioning
Horizontal partitioning with JSON • For example • Store in HDFS
Vertical Partitioning with JSON • For example • Store in HDFS
Clustered property partitioning with JSON • For example • Store in HDFS
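The worked examples for the three strategies are not preserved in this transcript, so the following is only a sketch under assumed readings of their names: horizontal partitioning splits records into files regardless of content (round-robin here), vertical partitioning keeps one file per predicate, and clustered property partitioning keeps one file per group of predicates that are often queried together. All function names and the `clusters` mapping are hypothetical.

```python
from collections import defaultdict


def horizontal_partition(records, num_parts):
    """Assumed reading: spread records over num_parts files, round-robin."""
    parts = defaultdict(list)
    for i, record in enumerate(records):
        parts[f"part-{i % num_parts}"].append(record)
    return dict(parts)


def vertical_partition(records):
    """Assumed reading: one file per predicate."""
    parts = defaultdict(list)
    for record in records:
        parts[record["p"]].append(record)
    return dict(parts)


def clustered_property_partition(records, clusters):
    """Assumed reading: one file per cluster of related predicates."""
    predicate_to_cluster = {
        predicate: name
        for name, predicates in clusters.items()
        for predicate in predicates
    }
    parts = defaultdict(list)
    for record in records:
        # Predicates outside every cluster fall into a catch-all file.
        parts[predicate_to_cluster.get(record["p"], "other")].append(record)
    return dict(parts)


records = [
    {"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"},
    {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"},
    {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"},
]

# Hypothetical clustering: name and birthplace are assumed to be queried together.
clusters = {"person": ["isCalled", "wasBornIn"], "award": ["hasWonPrize"]}

horizontal = horizontal_partition(records, 2)
vertical = vertical_partition(records)
clustered = clustered_property_partition(records, clusters)
print(sorted(horizontal), sorted(vertical), sorted(clustered))
```

Under these assumptions, a query that touches only isCalled and wasBornIn reads a single file with clustered property partitioning, two files with vertical partitioning, and every file with horizontal partitioning.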
Outline • Introduction • Related work • SPARQL Query Processing in MapReduce • Experiments • Conclusion
Experiments
• Dataset: Billion Triples Challenge 2010 (BTC10)
• 3.2B <s, p, o, q> quads, 624 GB; the resulting dataset has 1,426,823,976 unique triples
• Hadoop 0.20.2, Ubuntu 10.04, Linux 2.6.32-24-server 64-bit
• 30 nodes: one node is the master, the others are slaves
• 47 GB memory, 4.3 TB disk space and 24 processors (Intel(R) Xeon(R) CPU E5645 @ 2.40GHz)
• "dfs.replication" is 2
• JAQL 0.5.1
• Java 1.6
Experiments Fig. Distribution of data
Experiments Fig. Execution time of each query
Outline • Introduction • Related work • SPARQL Query Processing in MapReduce • Experiments • Conclusion
Conclusion
• A solution for SPARQL queries in MapReduce
  • Queries are transformed to JAQL operators running on Hadoop.
• Transformation of SPARQL to JAQL
  • Filter, Transform, Join, ...
• Data partitioning strategies
  • Horizontal partitioning
  • Vertical partitioning
  • Clustered property partitioning
• Experimental results
  • Clustered property partitioning has the best performance.
  • Horizontal partitioning has the worst.
Scalability
• RDBMS
  • Waits and deadlocks increase nonlinearly with transaction size and concurrency.
  • Scale-up (vertical scaling): commercial RDBMSes are very expensive.
  • Schema: structured data
• MapReduce
  • Linear scalability, high throughput
  • Scale-out (horizontal scaling)
  • Schema-free: unstructured data
RDBMS vs. MapReduce Table. RDBMS compared to MapReduce
Limits of Hadoop
• The Apache Hadoop MapReduce framework has hit a scalability limit at around 4,000 machines.
• The MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability and performance.
The Next Generation of Apache Hadoop MapReduce
• Divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components.
• ResourceManager
• ApplicationMaster
Conclusion
• SPARQL on Cloud
  • Pro
    • Scalable
    • High throughput
  • Con
    • At the expense of latency
    • Complex query: JAQL
    • Join operation
• Hadoop (MapReduce)
  • Pro
    • Scalable
    • High throughput
  • Con
    • At the expense of latency
    • No index
    • No more than 4,000 nodes
SPARQL queries
Q1: select ?X ?Y where { ?X rdfs:label "Albert Einstein" . ?X smc:page ?Y . ?X rdf:type smc:Subject . }
Q2: select ?x ?y ?z where { dbsc:Ulm rdf:type ?x . ?x rdfs:label ?y . ?x rdfs:comment ?z . }
Q3: select ?who ?Y ?date1 ?Z ?date2 ?prize where { ?who source:bornIn ?Y . ?who source:bornOnDate ?date1 . ?who source:diedIn ?Z . ?who source:diedOnDate ?date2 . ?who source:hasWonPrize ?prize . }
Q4: select ?x ?author ?title where { ?x purl:hasAuthor ?author . ?x purl:hasBooktitle "ISWC 2009" . ?x purl:hasTitle ?title . }
Q5: select distinct ?name ?lat ?long ?pop where { ?a property:name ?name . ?a property:region dbsc:Nord-Pas-de-Calais . ?a pos:lat ?lat . ?a pos:long ?long . ?a property:population ?pop . }
SPARQL queries (continued)
Q6: select ?bn ?b ?p where { ?a property:name ?bn . ?a property:dateOfBirth ?b . ?a property:placeOfBirth ?p . }
Q7: select ?Y ?type ?prize where { source:Albert_Einstein source:bornIn ?Y . source:Albert_Einstein rdf:type ?type . source:Albert_Einstein source:hasWonPrize ?prize . }
Q8: select ?a ?type ?pub where { ?a rdf:type ?type . ?a semweb:publisher ?pub . ?a semweb:periodical_title "Theory of Computing Systems" . }
Q9: select distinct ?a ?lat ?long ?pop where { ?a geo:ontology#name "Chevilly" . ?a geo:ontology#inCountry geo:countries#FR . ?a pos:lat ?lat . ?a pos:long ?long . ?a geo:ontology#population ?pop . }
Q10: select distinct ?l ?long ?lat where { ?a property:placeOfBirth ?l . ?l pos:lat ?lat . ?l pos:long ?long . }
SPARQL queries: query types
• Q3 and Q10 are star-join queries with popular predicates and unspecified objects.
• Q1, Q4, Q5, Q6, Q8 and Q9 are also star joins, but with one or more known objects.
• Q2 is a chain query.
• In Q7 the subject is a constant (source:Albert_Einstein).