Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce
HyeongSik Kim, Padmashree Ravindra, Kemafor Anyanwu
{hkim22, pravind2, kogan}@ncsu.edu
COUL – Semantic COmpUting research Lab
Outline
• Background
• RDF Graph Pattern Matching
• Graph Pattern Matching on MapReduce
• Queries with Repeated Properties (QRP)
• Nested TripleGroup Algebra (NTGA)
• Challenges: Processing QRP with NTGA
• Approach: TripleGroup Cloning
• Well-formed, Ambiguous, and Perfect TripleGroups
• TripleGroup Cloning in TG_GroupFilter
• Evaluation
• Related Work
The Growing Amount of RDF Data
• The amount of RDF data on the web is growing rapidly.
• Example: DBpedia (http://dbpedia.org)
  • A dataset extracted from Wikipedia.
  • Contains 1 billion RDF triples.
• Linked Data on the web:
  • May 2007: 12 datasets
  • Sep 2011: 295 datasets
  • Growing number of RDF triples: currently 31 billion
RDF Data Model (Resource Description Framework)
• How is knowledge represented in the Semantic Web? E.g., information on mobile device products.
• The Resource Description Framework (RDF) is used: the W3C standard data model for the Semantic Web.
• Information is represented in the form of triples. Ex.: "product1 has a name called iphone4" as RDF:
  • Subject: "product1"
  • Property: "name"
  • Object: "iphone4"
  • (:Product1, :name, :iphone4)
• The data model is a directed labeled graph.
  • Node: subject, object
  • Labeled edge: property
[Figure: RDF graph with :Product1 (:name "iphone4", :price $499, :date), :Product2 (:name "iphone5", :date), and :Producer1 (:homepage www.apple.com), connected by :design edges.]
Processing RDF Query (from the Viewpoint of Graph Pattern Matching)
1. Example RDF dataset: RDF graph on mobile devices
[Figure: RDF graph with :Product1 (:name "iphone4", :price $499, :date "2011-10-14"), :Product2 (:name "iphone5", :date "2012-09-12"), and :Producer1 (:homepage www.apple.com), connected by :design edges.]
• Oval: resources on the Web
• Rectangle: literals
2. Example RDF query:
    SELECT * WHERE {
      ?product :name ?productName .
      ?product :price ?productPrice .
      ?product :date ?productDate .
    }
• The graph pattern consists of three triple patterns.
• A query variable is denoted with a question mark (e.g., ?product).
• This is a star pattern whose subject variable is ?product.
Processing RDF Query (based on Relational Algebra)
1. Example RDF dataset: relation R
2. Example RDF query:
    SELECT * WHERE {
      ?product :name ?productName .
      ?product :price ?productPrice .
      ?product :date ?productDate .
    }
3. Conceptual execution plan: three scans of relation R (first, second, and third scan), combined by joins on (Subject = Subject).
• Implicit joins on ?product
4. (Intermediate) results:
  • (:Product1, :name, "iphone 4")
  • (:Product2, :name, "iphone 5")
  • (:Product1, :name, "iphone 4", :Product1, :price, "$499")
  • (:Product1, :name, "iphone 4", :Product1, :price, "$499", :Product1, :date, "2011-10-14")
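The conceptual plan on this slide can be sketched in Python as three scans of R followed by subject joins. This is only an illustration, not the authors' implementation: the helper names `scan` and `join_on_subject` are hypothetical, and the sample rows are taken from the example graph (the direction of the :design and :homepage edges is assumed).

```python
# Relation R: one (subject, property, object) row per RDF triple.
R = [
    (":Product1", ":name", "iphone4"),
    (":Product1", ":price", "$499"),
    (":Product1", ":date", "2011-10-14"),
    (":Product2", ":name", "iphone5"),
    (":Product2", ":date", "2012-09-12"),
    (":Producer1", ":design", ":Product1"),
    (":Producer1", ":design", ":Product2"),
    (":Producer1", ":homepage", "www.apple.com"),
]


def scan(relation, prop):
    """One full scan of the relation, keeping triples with the given property."""
    return [t for t in relation if t[1] == prop]


def join_on_subject(left, right):
    """Nested-loop join on (Subject = Subject); rows are flat tuples."""
    return [l + r for l in left for r in right if l[0] == r[0]]


names = scan(R, ":name")    # first scan of R
prices = scan(R, ":price")  # second scan of R
dates = scan(R, ":date")    # third scan of R

intermediate = join_on_subject(names, prices)
result = join_on_subject(intermediate, dates)
```

Each triple pattern costs one pass over R, which is the scan overhead the later slides aim to share. :Product2 has no :price triple in this sample, so it drops out at the first join.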
Overview of MapReduce
• MapReduce (MR): a large-scale data processing system running on a cluster of machines. [DEAN08]
• Tasks are encoded in low-level code as map/reduce functions, which are executed in parallel across the cluster.
  1. Map(k1, v1) → list(k2, v2)
  2. Reduce(k2, list(v2)) → list(v3)
[Figure: MR data flow between HDFS and local disk, with sort and merge steps. [NYKIEL10]]
• Map side: read the data; execute the user's map function; sort and write intermediate data.
• Reduce side: sort and merge input; execute the user's reduce function; transfer the result to HDFS.
Join Processing on MapReduce [BLANAS10]
• Example: equi-join on the first column of relations L and R
  • L = {(k1, v5), (k2, v4)}, R = {(k2, v1), (k3, v6)}
  • Map output: (k1, (L: k1, v5)), (k2, (L: k2, v4)), (k2, (R: k2, v1)), (k3, (R: k3, v6))
  • Reduce input: (k1, ((L: k1, v5))), (k2, ((L: k2, v4), (R: k2, v1))), (k3, ((R: k3, v6)))
  • Result: (k2, v4, k2, v1)
• Map:
  • Extract the join column.
  • Add a tag of either L or R.
  • Annotate tuples with the join key.
• Reduce:
  • Separate and buffer the input records into two sets according to the table tag (L or R).
  • Perform a cross-product.
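The map and reduce steps listed on this slide can be simulated in plain Python. This is a single-process sketch of the tagging-and-cross-product idea, not Hadoop code; `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names, and `shuffle` merely stands in for the framework's sort/merge step.

```python
from collections import defaultdict
from itertools import product

L = [("k1", "v5"), ("k2", "v4")]
R = [("k2", "v1"), ("k3", "v6")]


def map_phase(rows, tag):
    """Emit (join key, (table tag, original tuple)) for every input tuple."""
    return [(row[0], (tag, row)) for row in rows]


def shuffle(pairs):
    """Group map output by key, as the framework's sort/merge would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Buffer records by tag, then emit the cross-product of the two buffers."""
    left = [row for tag, row in values if tag == "L"]
    right = [row for tag, row in values if tag == "R"]
    return [l + r for l, r in product(left, right)]


groups = shuffle(map_phase(L, "L") + map_phase(R, "R"))
joined = [out for key in sorted(groups) for out in reduce_phase(groups[key])]
```

Keys k1 and k3 appear in only one relation, so one of their buffers is empty and the cross-product yields nothing; only k2 contributes a joined tuple.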
Processing Multi-Join Query on MapReduce
1. (Extended) example query:
    SELECT * WHERE {
      ?product :name ?productName .
      ?product :price ?productPrice .
      ?producer :design ?product .
      ?producer :type ?ProducerType .
    }
2. Vertical Partitioning (VP) [ABADI07]:
  • Partition relation R vertically based on the value of the property attribute.
  • E.g., the property relations name, price, design, and type can each be generated from R using selection or split operators.
3. Corresponding logical plan based on VP: joins over name, price, design, and type with the conditions (subject = subject), (subject = object), and (subject = subject).
4. MapReduce plan: three MR jobs.
  • MR job 1: name and price → temp1
  • MR job 2: temp1 and design → temp2
  • MR job 3: temp2 and type → output
[Cost formula: not preserved in the slide text.]
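A small Python sketch of vertical partitioning and the three-job plan follows. It is illustrative only: `vertical_partition` is a hypothetical helper, the sample triples are made up in the style of the running example, and the join order reflects one reading of the slide's plan (temp1, then temp2, then output).

```python
from collections import defaultdict

R = [
    (":Product1", ":name", "iphone4"),
    (":Product1", ":price", "$499"),
    (":Producer1", ":design", ":Product1"),
    (":Producer1", ":type", ":Producer"),
]


def vertical_partition(triples):
    """Split the triple relation into one (subject, object) relation per property."""
    relations = defaultdict(list)
    for s, p, o in triples:
        relations[p].append((s, o))
    return dict(relations)


vp = vertical_partition(R)

# MR job 1: name JOIN price on (subject = subject) -> temp1
temp1 = [(s, n, pr) for s, n in vp[":name"] for s2, pr in vp[":price"] if s == s2]

# MR job 2: temp1 JOIN design on (temp1.subject = design.object) -> temp2
temp2 = [
    (producer, product, n, pr)
    for product, n, pr in temp1
    for producer, designed in vp[":design"]
    if designed == product
]

# MR job 3: temp2 JOIN type on (subject = subject) -> output
output = [row + (ty,) for row in temp2 for s, ty in vp[":type"] if s == row[0]]
```

Every intermediate result (temp1, temp2) would be written to and re-read from HDFS in a real workflow, which is why the number of MR jobs dominates the cost.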
Query Optimization on MapReduce
• Heuristic: group operations so that a workflow needs fewer MR jobs.
• Group multiple join operations on the same key into the same MR cycle. (Pig)
1. (Extended) example query:
    SELECT * WHERE {
      ?product :name ?productName .
      ?product :price ?productPrice .
      ?product :date ?productDate .
      ?producer :design ?product .
      ?producer :type ?ProducerType .
    }
2. Corresponding logical plan based on VP, executed as three MR jobs:
  • MR job 1: name, price, date joined on (subject = subject) → temp1
  • MR job 2: design, type joined on (subject = subject) → temp2
  • MR job 3: temp1, temp2 joined on (subject = object) → output
• Finding the optimal grouping is NP-hard; more advanced techniques use a greedy approach that groups as many non-conflicting joins as possible. [HUSAIN11]
Queries with "Repeated" Properties
• Query: list the products with their detailed information, together with their producer information (e.g., the company name, the type of company, and its foundation date).
Example query:
    SELECT * WHERE {
      ?product :name ?prodName .
      ?product :type ?prodType .
      ?product :date ?prodDate .
      ?product :price ?prodPrice .
      ?producer :design ?product .
      ?producer :name ?prcName .
      ?producer :type ?prcType .
      ?producer :date ?prcDate .
    }
[Figure: VP-based MR plan. J1: TS(R) and SPLIT into the property relations price, name, type, date, design. J2: JOIN over TS(price), TS(name), TS(type), TS(date). J3: JOIN over TS(name), TS(type), TS(date), TS(design). J4: JOIN over TS(price, name, …) and TS(name, type, …). TS: TableScan (Load) operator.]
• Issue: name, type, and date are scanned repeatedly across MR jobs J2 and J3.
• Possible optimization considerations:
  • Minimize scan overhead using indexes; however, MapReduce does not support indexes by default.
  • Buffer such relations across multiple joins (memory intensive).
• Another approach: algebraic optimization
  • Rewrite queries into equivalent but less expensive queries.
General Intuition in NTGA
• Nested TripleGroup Algebra (NTGA): re-interpret multiple star-joins as a grouping operation.
• This leads to "groups of triples" (TripleGroups) instead of n-tuples. [RAVINDRA11]
1. Example query:
    SELECT * WHERE {
      1: ?x :p1 ?o1 .
      2: ?x :p2 ?o2 .
      3: ?y :p3 ?o2 .
      4: ?y :p4 ?o3 .
    }
2. Input triples: (:s1, :p1, :o1), (:s1, :p2, :o2), (:s2, :p3, :o3), (:s2, :p4, :o4)
• NTGA: tg1 = {(:s1, :p1, :o1), (:s1, :p2, :o2)}, tg2 = {(:s2, :p3, :o3), (:s2, :p4, :o4)}
• VP: p1 ⋈(subject=subject) p2 gives t1 = (:s1, :p1, :o1, :s1, :p2, :o2); p3 ⋈(subject=subject) p4 gives t2 = (:s2, :p3, :o3, :s2, :p4, :o4)
• Different structure BUT "content equivalent".
• VP: 1 MR job for each star pattern → 2 MR jobs (one for the star pattern with subject variable ?x, one for ?y).
• NTGA: 1 MR job for all star patterns!
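The grouping intuition can be sketched in a few lines of Python: one pass that groups triples by subject yields both TripleGroups at once. The function names `tg_group_by` and `flatten` are chosen here for illustration and only mirror the NTGA operator names; this is not the RAPID+ implementation.

```python
from collections import defaultdict

triples = [
    (":s1", ":p1", ":o1"),
    (":s1", ":p2", ":o2"),
    (":s2", ":p3", ":o3"),
    (":s2", ":p4", ":o4"),
]


def tg_group_by(triples):
    """Group triples by subject; each group is one TripleGroup."""
    groups = defaultdict(list)
    for triple in triples:
        groups[triple[0]].append(triple)
    return dict(groups)


def flatten(triplegroup):
    """Concatenate the triples of a TripleGroup into one n-tuple."""
    return tuple(term for triple in triplegroup for term in triple)


triplegroups = tg_group_by(triples)
t1 = flatten(triplegroups[":s1"])
t2 = flatten(triplegroups[":s2"])
```

Flattening a TripleGroup gives the same content as the corresponding star-join tuple, which is what "content equivalent" means on this slide.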
Processing RDF Query with NTGA
Example query:
    SELECT * WHERE {
      ?product :name ?prodName .
      ?product :type ?prodType .
      ?product :date ?prodDate .
      ?product :price ?prodPrice .
      ?producer :design ?product .
      ?producer :name ?prcName .
      ?producer :type ?prcType .
      ?producer :date ?prcDate .
    }
• VP: 4 MR jobs (4 HDFS reads)
[Figure: VP plan. J1: TS(R) and SPLIT into price, name, type, date, design. J2: JOIN over TS(price), TS(name), TS(type), TS(date). J3: JOIN over TS(name), TS(type), TS(date), TS(design). J4: JOIN over TS(price, name, …) and TS(name, type, …).]
• NTGA: 2 MR jobs (2 HDFS reads)
[Figure: NTGA plan with jobs J1 and J2 over the star patterns (:name, :type, :date, :price) and (:design, :name, :type, :date), using the operators TS, TG_GroupBy, TG_GroupFilter, TG_Unnest, TG_Flatten, and TG_JOIN.]
• TS: TableScan (Load) operator
A "Key" NTGA Operator: TG_GroupFilter
• Retain only TripleGroups that satisfy the required query substructure.
• Check for an "exact" match between the set of properties in a star pattern and the set of properties in a TripleGroup.
• Example query:
    SELECT * WHERE {
      1: ?x :p1 :o1 .
      2: ?x :p2 ?y .
      3: ?y :p3 :o2 .
      4: ?y :p4 :o3 .
    }
• Star patterns: (:p1, :p2) and (:p3, :p4)
• Input TripleGroups:
  • tg1 = {(:s1, :p1, :o1), (:s1, :p2, :o2)}: properties (:p1, :p2) match the star pattern (:p1, :p2). Correct match; therefore, tg1 passes.
  • tg2 = {(:s2, :p2, :o2), (:s2, :p3, :o3)}: properties (:p2, :p3) match neither (:p1, :p2) nor (:p3, :p4). No matches; therefore, tg2 is filtered out.
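The exact-match check can be illustrated with property sets in Python. This sketch compares property sets only and ignores the constant objects in the query (such as :o1); `properties` and `tg_group_filter` are hypothetical names used for illustration.

```python
def properties(triplegroup):
    """Set of properties that occur in a TripleGroup."""
    return frozenset(p for _, p, _ in triplegroup)


def tg_group_filter(triplegroups, star_patterns):
    """Keep TripleGroups whose property set exactly equals some star pattern."""
    wanted = {frozenset(star) for star in star_patterns}
    return [tg for tg in triplegroups if properties(tg) in wanted]


star_patterns = [(":p1", ":p2"), (":p3", ":p4")]
tg1 = [(":s1", ":p1", ":o1"), (":s1", ":p2", ":o2")]
tg2 = [(":s2", ":p2", ":o2"), (":s2", ":p3", ":o3")]

kept = tg_group_filter([tg1, tg2], star_patterns)
```

Here tg1 survives and tg2 is dropped, matching the outcome described on the slide.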
TG_GroupFilter Semantics and Repeated Properties
• TG_GroupFilter assumes a 1-1 correspondence between TripleGroups and star subpatterns.
• But with repeated properties there can be ambiguities.
1. Given triple patterns (stp1: star pattern on ?product; stp2: star pattern on ?producer):
    SELECT * WHERE {
      ?product :name ?prodname .
      ?product :type ?prodType .
      ?product :date ?prodDate .
      ?product :price ?prodPrice .
      ?producer :design ?product .
      ?producer :name ?prcName .
      ?producer :type ?prcType .
      ?producer :date ?prcDate .
    }
2. A TripleGroup from TG_GroupBy:
  • tg0 = {(s1, :type, o1), (s1, :name, o2), (s1, :date, o3), (s1, :price, o4), (s1, :design, o5)}
  • Does tg0 correspond to stp1 or to stp2? (Partial match with both stp1 and stp2.)
Overview of the Solution
• Issue: mappings between TripleGroups and star patterns become ambiguous if repeated properties exist across multiple star patterns.
• Goal: produce TripleGroups that are an exact match for a star pattern in the query.
• Solution: split the filtering process into two steps.
  1. Remove incomplete TripleGroups that do not match any star pattern (i.e., eliminate non-well-formed TripleGroups).
  2. Resolve the ambiguity of the remaining TripleGroups that may match multiple star patterns (Ambiguous TripleGroups), and generate TripleGroups that are an exact match for a star pattern (Perfect TripleGroups).
Well-formed TripleGroup
• Well-formed TripleGroup: a TripleGroup consisting of triples that contain all the properties of some star subpattern.
1. Example query (stp1: star pattern on ?product; stp2: star pattern on ?producer):
    SELECT * WHERE {
      ?product :name ?prodname .
      ?product :date ?prodDate .
      ?product :price ?prodPrice .
      ?producer :design ?product .
      ?producer :name ?prcname .
      ?producer :date ?prcdate .
    }
2. TripleGroups generated from TG_GroupBy:
  • tg1 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3)}: well-formed (contains the properties from stp1).
  • tg2 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3), (s1, :design, :o4)}: well-formed (contains the properties from stp2).
  • tg3 = {(s1, :name, :o4), (s1, :design, :o3)}: NOT well-formed (does not contain all the properties of any star subpattern).
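The well-formedness test amounts to a subset check: some star subpattern's property set must be contained in the TripleGroup's property set. A minimal Python sketch, with the hypothetical helpers `covered_stars` and `is_well_formed` and the star patterns written as property sets:

```python
STARS = {
    "stp1": {":name", ":date", ":price"},
    "stp2": {":design", ":name", ":date"},
}


def covered_stars(triplegroup, stars):
    """Names of the star patterns whose properties all occur in the TripleGroup."""
    props = {p for _, p, _ in triplegroup}
    return [name for name, star in stars.items() if star <= props]


def is_well_formed(triplegroup, stars):
    """A TripleGroup is well-formed if it covers at least one star pattern."""
    return bool(covered_stars(triplegroup, stars))


tg1 = [("s1", ":name", ":o1"), ("s1", ":date", ":o2"), ("s1", ":price", ":o3")]
tg2 = tg1 + [("s1", ":design", ":o4")]
tg3 = [("s1", ":name", ":o4"), ("s1", ":design", ":o3")]

results = [is_well_formed(tg, STARS) for tg in (tg1, tg2, tg3)]
```

In this sketch tg2 covers both stp1 and stp2, which is exactly the situation that makes it ambiguous.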
Ambiguous & Perfect TripleGroups
• Ambiguous TripleGroup: a well-formed TripleGroup that can be matched with multiple star subpatterns in a query, e.g., tg2.
• Perfect TripleGroup: a well-formed TripleGroup that is an exact match for a single star pattern* (valid intermediate answers).
1. Example query (stp1: star pattern on ?product; stp2: star pattern on ?producer):
    SELECT * WHERE {
      ?product :name ?prodname .
      ?product :date ?prodDate .
      ?product :price ?prodPrice .
      ?producer :design ?product .
      ?producer :name ?prcname .
      ?producer :date ?prcdate .
    }
2. TripleGroups generated from TG_GroupBy:
  • tg1 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3)}: Perfect TripleGroup ("exact" match with stp1).
  • tg2 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3), (s1, :design, :o4)}: Ambiguous TripleGroup (can be matched with stp1 and stp2).
* a single star pattern "class"
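The two definitions can be turned into a small classifier. This is a sketch under stated assumptions: `classify` is a hypothetical function, and the fallback label "well-formed" (for a TripleGroup that covers exactly one star pattern but carries extra properties) is a case the slides do not discuss.

```python
STARS = {
    "stp1": {":name", ":date", ":price"},
    "stp2": {":design", ":name", ":date"},
}


def classify(triplegroup, stars):
    """Label a TripleGroup following the slide's definitions."""
    props = {p for _, p, _ in triplegroup}
    covered = [name for name, star in stars.items() if star <= props]
    if not covered:
        return "not well-formed"
    if len(covered) > 1:
        return "ambiguous"   # matches multiple star subpatterns
    if props == stars[covered[0]]:
        return "perfect"     # exact match for a single star pattern
    return "well-formed"     # assumption: case not covered by the slides


tg1 = [("s1", ":name", ":o1"), ("s1", ":date", ":o2"), ("s1", ":price", ":o3")]
tg2 = tg1 + [("s1", ":design", ":o4")]

labels = {"tg1": classify(tg1, STARS), "tg2": classify(tg2, STARS)}
```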
Dealing with Ambiguous TripleGroups
• Perfect TripleGroups are cloned from the ambiguous TripleGroup, and the non-perfect TripleGroup is rejected.
Example query (stp1: ?product; stp2: ?producer; stp3: ?seller):
    SELECT * WHERE {
      ?product :name ?prodname .
      ?product :date ?prodDate .
      ?product :price ?prodPrice .
      ?producer :design ?product .
      ?producer :name ?prcname .
      ?producer :date ?prcdate .
      ?seller :sell ?product .
      ?seller :name ?selName .
    }
• Ambiguous TripleGroup: tg0 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3), (s1, :design, :o4)}
• Clone (:name, :date, :price) for stp1: tg1 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3)}, a Perfect TripleGroup.
• Clone (:design, :name, :date) for stp2: tg2 = {(s1, :design, :o4), (s1, :name, :o1), (s1, :date, :o2)}, a Perfect TripleGroup.
• Clone (:sell, :name) for stp3: tg3 = {(s1, :sell, ??), (s1, :name, :o1)}, non-perfect because :sell is missing; rejected.
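Cloning can be sketched as projecting the ambiguous TripleGroup onto each star pattern's properties and keeping only the complete projections. The function name `clone_perfect` is hypothetical, and this single-process sketch only illustrates the idea behind the revised TG_GroupFilter; it is not the authors' code.

```python
STARS = {
    "stp1": {":name", ":date", ":price"},
    "stp2": {":design", ":name", ":date"},
    "stp3": {":sell", ":name"},
}


def clone_perfect(triplegroup, stars):
    """Return {star name: clone} for every star the TripleGroup fully covers."""
    props = {p for _, p, _ in triplegroup}
    clones = {}
    for name, star in stars.items():
        if star <= props:
            # Complete projection: a Perfect TripleGroup for this star pattern.
            clones[name] = [t for t in triplegroup if t[1] in star]
        # Otherwise the projection would be incomplete (non-perfect): rejected.
    return clones


tg0 = [
    ("s1", ":name", ":o1"),
    ("s1", ":date", ":o2"),
    ("s1", ":price", ":o3"),
    ("s1", ":design", ":o4"),
]

clones = clone_perfect(tg0, STARS)
```

The projection for stp3 would lack a :sell triple, so no clone is produced for it, mirroring the rejected tg3 on the slide.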
NTGA-based MapReduce Plan
• Example query:
    SELECT * WHERE {
      ?product :name ?prodname .
      ?product :date ?prodDate .
      ?product :price ?prodPrice .
      ?producer :design ?product .
      ?producer :name ?prcname .
      ?producer :date ?prcdate .
      ?seller :sell ?product .
      ?seller :name ?selName .
    }
• Generated MR plan:
  • J1: Map: m:TG_GroupBy
  • J1: Reduce: r:TG_GroupBy, r:TG_GroupFilter* (revised, with cloning)
  • J2: Map: m:TG_JOIN (?o1 = ?o1)
  • J2: Reduce: r:TG_JOIN
  • (…)
  • m:op: map-side operator; r:op: reduce-side operator
• Clone in TG_GroupFilter:
  • Input: tg0 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3), (s1, :design, :o4)}
  • Output: { tg1 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3)}, tg2 = {(s1, :design, :o4), (s1, :name, :o1), (s1, :date, :o2)} }
Losslessness of Revised TG_GroupFilter
• Filter out non-well-formed TripleGroups.
  • An incomplete TripleGroup that does not contain all the properties of any star pattern clearly does not match any star pattern in the query.
• Generate multiple Perfect TripleGroups from an ambiguous TripleGroup.
• Example dataset: (s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3), (s1, :design, :o4)
1. Relational algebra (VP):
  1) name ⋈(subject=subject) date ⋈(subject=subject) price: t1 = (:s1, :name, :o1, :s1, :date, :o2, :s1, :price, :o3)
  2) design ⋈(subject=subject) name ⋈(subject=subject) date: t2 = (:s1, :design, :o4, :s1, :name, :o1, :s1, :date, :o2)
2. NTGA:
  • tg0 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3), (s1, :design, :o4)}
  • Cloning produces tg1 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3)} and tg2 = {(s1, :design, :o4), (s1, :name, :o1), (s1, :date, :o2)}, which correspond to t1 and t2.
• No valid intermediate results are destroyed, nor are spurious results introduced, by cloning.
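The content-equivalence claim can be checked on the slide's example by computing both sides and comparing them as sets of triples. This is a toy check, not a proof: `vp_star_join` and `ntga_clones` are hypothetical names, the :s2 triple is an invented incomplete group, and the comparison assumes each property occurs at most once per subject.

```python
from itertools import product

triples = [
    (":s1", ":name", ":o1"),
    (":s1", ":date", ":o2"),
    (":s1", ":price", ":o3"),
    (":s1", ":design", ":o4"),
    (":s2", ":name", ":o5"),  # hypothetical subject with an incomplete star
]
STARS = {
    "stp1": [":name", ":date", ":price"],
    "stp2": [":design", ":name", ":date"],
}


def vp_star_join(triples, star):
    """VP side: join the star's property relations on (subject = subject)."""
    results = set()
    for s in {t[0] for t in triples}:
        per_prop = [[t for t in triples if t[0] == s and t[1] == p] for p in star]
        for combo in product(*per_prop):  # empty if any property is missing
            results.add(frozenset(combo))
    return results


def ntga_clones(triples, star):
    """NTGA side: group by subject, then clone groups that cover the star."""
    results = set()
    for s in {t[0] for t in triples}:
        group = [t for t in triples if t[0] == s]
        if set(star) <= {p for _, p, _ in group}:
            results.add(frozenset(t for t in group if t[1] in star))
    return results


equivalent = all(
    vp_star_join(triples, star) == ntga_clones(triples, star)
    for star in STARS.values()
)
```

Under the single-valued assumption, every VP join tuple has exactly one clone with the same triples and vice versa, so nothing is lost and nothing spurious appears.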
Setup and Testbed
• Setup:
  • VP and NTGA implemented on top of Apache Pig.
  • 10-node Hadoop clusters on NCSU's VCL*.
• Three approaches were considered:
  • 1-join-per-cycle (SHARD) [ROHLOFF10]
  • 1-star-join-per-cycle (Pig-Def or VP)
  • all-star-joins-1-cycle (NTGA)
• Evaluation of the redundant scans during star-join computations:
  • Task 1a – varying the ratio of repeated properties to fixed ones.
  • Task 1b – varying the selectivity of repeated properties.
  • Task 2 – scaling up subpatterns with repeated properties.
  • Task 3 – scalability test with varying data size.
* https://vcl.ncsu.edu
Dataset
• Dataset: synthetic benchmark dataset generated using BSBM*
  • From 22GB (250k products, BSBM-250k, ~86M triples)
  • Up to 87GB (1M products, BSBM-1000k, ~350M triples)
• 7 repeated properties:
  • Across all classes, e.g., type, publisher
  • Only for a smaller subset of classes, e.g., name
• Size and selectivity** per property in BSBM-250k:
  • :publisher: 1.7GB, 0.091
  • :type: 1.8GB, 0.105
  • :name: 49MB, 0.003
  • :date: 1.4GB, 0.091
* http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/
** Selectivity of a property P = |T_P| / |T|, where T_P denotes the triples containing P and T denotes all triples.
Task 1a: Varying the Ratio of Repeated Properties to Fixed Ones
• Test queries (dq0 to dq4):
  • Two star patterns with a fixed subset of unique properties, plus a varying number of repeated properties in the second star pattern (from 0 to 4).
  • The overall number of triple patterns increases from 8 to 12.
  • dq0: 2 star patterns, 0 repeated properties.
  • dq1: 1 repeated property.
  • dq2: 2 repeated properties.
  • dq3: 3 repeated properties.
  • dq4: 2 star patterns, 4 repeated properties (:type, :publisher, :name, :date).
[Figure: query graphs for dq0 to dq4. Black edge: arbitrary unique property. Red edge: repeated property.]
Task 1a: Varying the Ratio of Repeated Properties to Fixed Ones
[Figure: results for 1-join-per-cycle (SHARD), 1-star-join-per-cycle (Pig-Def), and all-star-joins-1-cycle (NTGA).]
• With an increasing number of repeated properties:
  1. NTGA: constant HDFS reads and execution time; fewer HDFS writes because fewer MR jobs are required.
  2. SHARD: the number of scans of the whole relation increases.
  3. Pig-Def or VP: the number of scans of the property relations increases.
• MR cycles: Pig-Def (4 cycles), NTGA (2 cycles), SHARD (13 cycles).
Task 1b: Varying the Size of Repeated Properties
• Test queries: rq1 and rq2
  • Identical queries with two star subpatterns, but each contains a different repeated property.
  • rq1: two star patterns with the repeated property :publisher (1.7GB, 9.1%)
  • rq2: two star patterns with the repeated property :name (49MB, 0.3%)
• NTGA has around a 42% performance gain over Pig-Def for rq2, increasing to around a 48% gain for rq1.
• With rq2, Pig-Def always takes an additional 70 seconds compared with rq1.
Task 2: Scaling up Subpatterns with Repeated Properties
• Four queries (mq1 to mq4):
  • Two repeated properties (:type, :publisher) occur in each of the star subpatterns.
  • The number of star patterns varies from 1 to 4.
  • The total number of repeated properties across the graph pattern query increases from 2 (in mq1) to 8 (in mq4).
[Figure: query graphs. mq1: a single star pattern; mq2: two star patterns; mq3: three star patterns.]
Task 2: Scaling up Subpatterns with Repeated Properties
[Figure: results for 1-join-per-cycle (SHARD), 1-star-join-per-cycle (Pig-Def), and all-star-joins-1-cycle (NTGA), with labels ≈40G, ≈80G, and ≈120G.]
• From mq1 to mq4: as the number of star patterns increases, the number of repeated properties across star patterns increases (from 2 to 8), and so does the amount of scan-sharing across star patterns (from around 40G to 120G).
• Execution time increases due to the join operations that connect the sub-stars.
Task 3: Varying Size of Graphs
• Increase the number of RDF triples for query dq4 used in Task 1.
  • From BSBM-250k (22GB) to BSBM-1000k (86GB)
• The NTGA approach scales well.
  • The observed performance gain ranges from 52% to 58%.
• The size of the relations containing repeated properties does not increase linearly with the size of the data.
Related Work
• RDF data processing on MapReduce:
  • SHARD [Rohloff10]: the clause-iteration algorithm (n + 1 jobs to process n triple patterns).
  • HadoopDB [Huang11]: a hybrid architecture of a database (RDF-3X) and Hadoop with a graph partitioning scheme.
  • HadoopRDF [Husain10]: a customized storage format and plan generation based on a heuristic greedy approach.
• Work sharing on MapReduce:
  • MRShare [NYKIEL10]: inter-query sharing scheme customized for the MapReduce framework.
  • NOVA [Olston11]: shares the initial load operation if multiple copies of a workflow use the identical input.
  • CoScan [Wang11]: minimizes redundant data loading by merging multiple Pig scripts.
Relevant Publications
• Kim, H., Ravindra, P., Anyanwu, K.: Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce. In: Proc. CLOUD (2012)
• Anyanwu, K., Kim, H., Ravindra, P.: Algebraic Optimization for Processing Graph Pattern Queries in the Cloud. IEEE Internet Computing (2012)
• Kim, H., Ravindra, P., Anyanwu, K.: From SPARQL to MapReduce: The Journey using a Nested TripleGroup Algebra. In: Proc. International Conference on Very Large Data Bases (2011) (Demonstration)
• Ravindra, P., Kim, H., Anyanwu, K.: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Platforms. In: Proc. Extended Semantic Web Conference (2011)
References
[DEAN08] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Commun. ACM, vol. 51, pp. 107–113, 2008.
[OLSTON08] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing," in Proc. International Conference on Management of Data, 2008.
[HUSAIN11] M. F. Husain, J. McGlothlin et al., "Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing," TKDE, vol. 23, pp. 1312–1327, 2011.
[HUANG11] J. Huang, D. J. Abadi et al., "Scalable SPARQL Querying of Large RDF Graphs," Proc. VLDB, vol. 4, no. 11, 2011.
[NYKIEL10] T. Nykiel, M. Potamias et al., "MRShare: Sharing across Multiple Queries in MapReduce," Proc. VLDB, vol. 3, pp. 494–505, 2010.
[OLSTON11] C. Olston, G. Chiou et al., "Nova: Continuous Pig/Hadoop Workflows," in Proc. SIGMOD, 2011, pp. 1081–1090.
[WANG11] X. Wang, C. Olston et al., "CoScan: Cooperative Scan Sharing in the Cloud," in Proc. SOCC, 2011, pp. 11:1–11:12.
[RAVINDRA11] P. Ravindra, H. Kim et al., "An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce," in Proc. ESWC, 2011, vol. 6644, pp. 46–61.
[ABADI07] D. J. Abadi, A. Marcus et al., "Scalable Semantic Web Data Management using Vertical Partitioning," in Proc. VLDB, 2007.
[ROHLOFF10] K. Rohloff and R. E. Schantz, "High-performance, Massively Scalable Distributed Systems using the MapReduce Software Framework: the SHARD Triple-store," in PSI EtA, 2010, pp. 4:1–4:5.
[NEUMANN10] T. Neumann and G. Weikum, "The RDF-3X Engine for Scalable Management of RDF Data," The VLDB Journal, vol. 19, pp. 91–113, 2010.
[WEISS08] C. Weiss, P. Karras, and A. Bernstein, "Hexastore: Sextuple Indexing for Semantic Web Data Management," Proc. VLDB, vol. 1, no. 1, 2008.
[HERODOTOU11] H. Herodotou and S. Babu, "Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs," Proc. VLDB, vol. 4, 2011.
[BLANAS10] S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian, "A Comparison of Join Algorithms for Log Processing in MapReduce," Proc. SIGMOD, 2010.
RDF Data Model (Resource Description Framework)
1. Statements (triples)
2. Graph representation
[Figure: RDF graph with :Product1 (:name "iphone4", :date "2011-10-14", :color :white, :publisher :Producer1) and :Producer1 (:type :Producer, :name "Apple", :date "1976-04-01", :homepage apple.com).]
• Star subgraphs: sets of edges with the same subject, e.g., :Product1 and :Producer1.
• Oval: resources, i.e., URIs
• Rectangle: literals
Relationship between TripleGroups and n-tuples
• TripleGroups are not structurally equivalent to n-tuples but are "content equivalent".
1. TripleGroup in NTGA (TG_GroupBy and TG_GroupFilter):
  • tg1 = {(:Product1, :type, :Product), (:Product1, :date, "1976-04-01"), (:Product1, :name, "iphone 4")}
2. n-tuple in VP (SPLIT and JOIN), built from the triples t1, t2, t3:
  • (:Product1, :type, :Product, :Product1, :date, "1976-04-01", :Product1, :name, "iphone 4")
• Different structure BUT "content equivalent".
NTGA Quick Reference
• Consider a set of TripleGroups TG = {tg1, tg2} such that:
  • tg1 = {(:Prdct1, :name, "iphone4"), (:Prdct1, :publisher, :prdcr1), (:Prdct1, :price, "100")}
  • tg2 = {(:Prdcr1, :type, :Prdcr), (:Prdcr1, :date, "1976-04-01"), (:Prdcr1, :hpage, "apple.com")}
Execution on MapReduce Platform
• MapReduce (MR): a popular large-scale data processing system running on a cluster of commodity-grade machines. [DEAN08]
• Tasks are encoded in low-level code as map/reduce functions, which are executed in parallel across the cluster.
• Apache Hadoop*: open-source implementation.
• Extended systems provide high-level languages for specifying tasks, along with optimizing compilers that generate map/reduce code, à la database systems.
  • Pig Latin for Apache Pig**, HiveQL for Apache Hive***.
* http://hadoop.apache.org
** http://pig.apache.org
*** http://hive.apache.org
Architecture of RAPID+
[Figure: architecture of RAPID+, showing the following components.]
• Query Parser Layer: Pig Latin parser, SPARQL parser, (…)
• Logical Plan Generator/Optimizer
• Query Analyzer
• Pig Latin Plan Generator (operators: LOAD, SPLIT, JOIN, STORE)
• NTGA Plan Generator (operators: LOAD, TG_GroupBy, TG_GroupFilter, TG_Join, STORE)
• Logical-to-Physical Plan Translator
• MapReduce Job Compiler
• Hadoop Job Tracker