Towards Scalable RDF Graph Analytics on MapReduce
Padmashree Ravindra, Vikas V. Deshpande, Kemafor Anyanwu
{pravind2, vvdeshpa, kogan}@ncsu.edu
COUL - semantic COmpUting research Lab
Introduction
• Growing interest in exploiting RDF data for decision-making
• Requires support for analytical-style querying
  - More complex than traditional SPJ (select-project-join) queries
• Analytical queries often include multiple groupings and/or aggregations
• Next release of SPARQL is expected to include such constructs
• Example (from [1]) over the relation Sales(Cust, prod, price, loc, month, year):
  - For each prod, count, for each month of 2008, the sales that were between the previous month's average sale and the next month's average sale (prev_avg_sale, next_avg_sale)
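The Sales example needs two dependent groupings: monthly averages first, then a count that compares each sale with the averages of the neighbouring months. The sketch below is one possible reading of that query in plain Python. The sample rows, the function name, the inclusive bounds, and the choice to skip months without a previous or next month in the data are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows: (cust, prod, price, loc, month, year)
sales = [
    ("c1", "p1", 10.0, "NC", 1, 2008),
    ("c2", "p1", 20.0, "NC", 1, 2008),
    ("c1", "p1", 18.0, "NC", 2, 2008),
    ("c3", "p1", 40.0, "NC", 2, 2008),
    ("c2", "p1", 30.0, "NC", 3, 2008),
]

def count_between_neighbor_avgs(rows, year):
    # First grouping: collect prices per (prod, month) and average them.
    prices = defaultdict(list)
    for _cust, prod, price, _loc, month, yr in rows:
        if yr == year:
            prices[(prod, month)].append(price)
    avg = {key: mean(vals) for key, vals in prices.items()}

    # Second grouping: count the sales that lie between the averages
    # of the previous and the next month (bounds inclusive, an assumption).
    counts = {}
    for (prod, month), vals in prices.items():
        prev_avg = avg.get((prod, month - 1))
        next_avg = avg.get((prod, month + 1))
        if prev_avg is None or next_avg is None:
            continue  # no neighbouring month in the data: skipped here
        lo, hi = sorted((prev_avg, next_avg))
        counts[(prod, month)] = sum(lo <= p <= hi for p in vals)
    return counts

result = count_between_neighbor_avgs(sales, 2008)
print(result)
```

Only month 2 has both neighbours in this sample, so it is the only month that appears in the result.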
Analytical Query Processing
• Traditional OLAP techniques
  - Require a star / snowflake schema
  - Enterprise-scale
• But Semantic Web data (RDF)
  - Semi-structured (labeled graphs)
  - Absence of a star-like schema
  - Billion-triple data sets
Goal: Exploit MapReduce-based frameworks to develop a scalable, cost-effective platform for Semantic Web analytics.
MapReduce-based Data Processing
• High-level dataflow languages: Pig Latin, DryadLINQ, HiveQL, JAQL
• Hybrid approach: HadoopDB [5]
• MapReduce in RDF processing
  - Graph pattern queries [8], [9]
  - Graph closure computation [10]
• RAPID [6]
  - Succinct expression of complex queries
  - Optimizes multiple groupings / aggregations
RDF data model
• Graph representation
• Statements (triples)
• Groups = Stars
[Figure: example RDF graph with two star groups, Rankings and UserVisits]
Traditional Querying of RDF
• Graph pattern matching
• E.g., get details about all pages visited by particular users between "1979/12/01" and "1979/12/30"
[Figure: SPARQL query and the matching graph pattern]
Example Analytical Query on RDF data
Compute the average pageRank and total adRevenue for all pages visited by a particular srcIP with visitDate between 1979/12/01 and 1979/12/30
• Pattern matching
  - Star subgraphs: Rankings, UserVisits
  - Join between the stars
• Grouping based on the value of the srcIP property
• Aggregation on the values of pageRank and adRevenue
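To make the three stages of this query concrete (pattern matching, grouping, aggregation), here is a small single-machine sketch in Python. It is a reference computation only, not the MapReduce plan discussed in the slides. The helper names are hypothetical, the sample triples reuse values from the slide tables, and the code assumes each property has a single value per subject and that every star is complete.

```python
from collections import defaultdict

# Sample triples (subject, property, object), values taken from the slide tables.
triples = [
    ("R1", "type", "Ranking"), ("R1", "pageURL", "url1"), ("R1", "pageRank", 11),
    ("R2", "type", "Ranking"), ("R2", "pageURL", "url2"), ("R2", "pageRank", 27),
    ("UV1", "type", "UserVisits"), ("UV1", "destURL", "url1"),
    ("UV1", "visitDate", "1979/12/12"), ("UV1", "adRevenue", 339.08142),
    ("UV1", "srcIP", "158.112.27.3"),
    ("UV2", "type", "UserVisits"), ("UV2", "destURL", "url1"),
    ("UV2", "visitDate", "1980/02/02"), ("UV2", "adRevenue", 330.51248),
    ("UV2", "srcIP", "159.222.21.9"),
]

def stars_by_subject(rows):
    # One star per subject: property -> value (single-valued, an assumption).
    stars = defaultdict(dict)
    for sub, prop, obj in rows:
        stars[sub][prop] = obj
    return stars

def analyze(rows, date_lo, date_hi):
    stars = stars_by_subject(rows)
    # Pattern matching: the two star patterns plus the date filter.
    rankings = [s for s in stars.values() if s.get("type") == "Ranking"]
    visits = [s for s in stars.values()
              if s.get("type") == "UserVisits"
              and date_lo <= s["visitDate"] <= date_hi]  # yyyy/mm/dd strings sort by date
    # Join between the stars (pageURL = destURL), grouped by srcIP.
    groups = defaultdict(list)
    for v in visits:
        for r in rankings:
            if r["pageURL"] == v["destURL"]:
                groups[v["srcIP"]].append((r["pageRank"], v["adRevenue"]))
    # Aggregation: average pageRank and total adRevenue per srcIP.
    return {ip: (sum(pr for pr, _ in pairs) / len(pairs),
                 sum(rev for _, rev in pairs))
            for ip, pairs in groups.items()}

result = analyze(triples, "1979/12/01", "1979/12/30")
print(result)
```

UV2 falls outside the date range, so only the visit from 158.112.27.3 contributes to the result.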
Pig : Data Processing
• Express data processing tasks using high-level query primitives
  - Usability, code reuse, automatic optimization
• Pig Latin data model: atom, tuple, bag (nesting)
• Operators: LOAD, STORE, JOIN, GROUP BY, COGROUP, FOREACH, SPLIT, aggregate functions
• Extensibility support via UDFs
• Operators compile into MapReduce jobs
Partition relation A using the values in the age column ($1):
SPLIT A INTO minors IF $1 < 18, majors IF $1 >= 18;
Equijoin on relation A (column 0) and relation B (column 1):
JOIN A BY $0, B BY $1;
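For readers who do not know Pig Latin, the two statements above can be read as the following plain Python. This is only a sketch of their meaning on in-memory lists; the sample relations and the equijoin helper are hypothetical and say nothing about how Pig executes them.

```python
from collections import defaultdict

# Hypothetical relations: A = (name, age), B = (id, name)
A = [("alice", 17), ("bob", 30)]
B = [(1, "alice"), (2, "bob"), (3, "bob")]

# SPLIT A INTO minors IF $1 < 18, majors IF $1 >= 18;
minors = [t for t in A if t[1] < 18]
majors = [t for t in A if t[1] >= 18]

def equijoin(left, lcol, right, rcol):
    # Index the right relation by its join column, then probe with the left one.
    index = defaultdict(list)
    for r in right:
        index[r[rcol]].append(r)
    # Each output tuple is the left tuple followed by the matching right tuple.
    return [l + r for l in left for r in index.get(l[lcol], [])]

# JOIN A BY $0, B BY $1;
joined = equijoin(A, 0, B, 1)
print(minors, majors, joined)
```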
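Compiling Pig Latin's JOIN to MapReduce
JOIN A BY $1, B BY $0;
• map: annotate the tuples of REL A and REL B based on the join key ($1)
• reduce: package the tuples (Reducer 1, Reducer 2)
[Figure: partitions P1 and P2 of REL A and REL B flowing through the map and reduce phases]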
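The map and reduce roles in this join can be simulated in a few lines of Python. The sketch below is a generic reduce-side join under stated assumptions: the relation tags "A" and "B", the sample tuples, and the in-memory shuffle function are hypothetical stand-ins, and the code is not Pig's actual compiler output.

```python
from collections import defaultdict
from itertools import product

def map_phase(rel_name, rows, key_col):
    # Annotate each tuple with its relation and emit it under the join key.
    for row in rows:
        yield row[key_col], (rel_name, row)

def shuffle(pairs):
    # Stand-in for the framework's shuffle: collect values per key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(tagged):
    # Package tuples by source relation, then emit their cross product.
    left = [row for tag, row in tagged if tag == "A"]
    right = [row for tag, row in tagged if tag == "B"]
    return [l + r for l, r in product(left, right)]

# JOIN A BY $1, B BY $0;
A = [("a1", "k1"), ("a2", "k2")]
B = [("k1", "b1"), ("k1", "b2"), ("k3", "b3")]
pairs = list(map_phase("A", A, 1)) + list(map_phase("B", B, 0))
joined = []
for key, tagged in sorted(shuffle(pairs).items()):
    joined.extend(reduce_phase(tagged))
print(joined)
```

Keys that occur in only one relation (k2 and k3 here) reach a reducer but produce no output.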
Pattern Matching in Pig : Approach 1
RankingsStarPattern = JOIN triples1 BY Sub, triples2 BY Sub, triples3 BY Sub;
[Figure: Rankings star R1 with type = Ranking, pageRank = 11, pageURL = url1; the triple store is read three times as triples1, triples2, triples3]
• Rankings star pattern = 3-way self-join
• UserVisits star pattern = 5-way self-join
Issues
- Self-joins on very large relations: high I/O costs
- Meaningless tuples are generated and need an additional filtering step, e.g. (R1, type, Ranking, R1, type, Ranking, R1, type, Ranking)
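The cost of the self-join approach shows up even on a single subject. The following Python sketch joins a three-triple relation with itself on the subject and then filters for the intended property combination. The variable names and the fixed property order used as the filter are assumptions chosen for illustration.

```python
from itertools import product

# The three triples of the Rankings star R1 from the slide.
triples = [
    ("R1", "type", "Ranking"),
    ("R1", "pageURL", "url1"),
    ("R1", "pageRank", 11),
]

# 3-way self-join on the subject: every combination of three triples
# that share a subject becomes one 9-column tuple.
joined = [t1 + t2 + t3
          for t1, t2, t3 in product(triples, repeat=3)
          if t1[0] == t2[0] == t3[0]]

# Additional filtering step: keep only the tuple whose property columns
# are (type, pageURL, pageRank), in that order.
wanted = ("type", "pageURL", "pageRank")
matches = [row for row in joined if (row[1], row[4], row[7]) == wanted]

print(len(joined), len(matches))
```

One useful tuple survives out of 27 joined tuples for a single subject, and the meaningless tuple quoted on the slide, (R1, type, Ranking) repeated three times, is among the 27.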
Approach 2: Vertical Partitioning
• LOAD all the RDF triples
• SPLIT into one relation per property: typeRanking, pageURL, pageRank, typeUV, destURL, visitDate, adRev, srcIP
• Filter on visitDate
• Ranking = JOIN (compute star pattern); UserVisits = JOIN (compute star pattern)
• JOIN between Ranking and UserVisits
• GROUP BY srcIP
• FOREACH group GENERATE aggregations
Sample rows (Sub, Prop, Obj) per partition:
  typeRanking: (R1, type, Ranking), (R2, type, Ranking)
  pageURL: (R1, pageURL, url1), (R2, pageURL, url2)
  pageRank: (R1, pageRank, 11), (R2, pageRank, 27)
  typeUV: (UV1, type, userVisits), (UV2, type, userVisits)
  destURL: (UV1, destURL, url1), (UV2, destURL, url1)
  visitDate: (UV1, visitDate, 1979/12/12), (UV2, visitDate, 1980/02/02)
  visitDate after Filter: (UV1, visitDate, 1979/12/12), (UV4, visitDate, 1979/12/02)
  adRev: (UV1, adRev, 339.08142), (UV2, adRev, 330.51248)
  srcIP: (UV1, srcIP, 158.112.27.3), (UV2, srcIP, 159.222.21.9)
Approach 2: Vertical Partitioning
[Figure: the same dataflow and sample partitions as on the previous slide, shown up to Ranking = JOIN (compute star pattern)]
• Issues
  - SPLIT: concurrent sub-flows
  - Risk of disk spills: I/O costs
  - Structure of intermediate relations
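A minimal in-memory sketch of the vertical-partitioning idea follows: split the triples into one table per property, then rebuild a star by joining those tables on the subject. The function names, the "type:&lt;class&gt;" partition keys, and the assumption of one value per subject and property are illustrative choices, not details from the slides.

```python
from collections import defaultdict

triples = [
    ("R1", "type", "Ranking"), ("R1", "pageURL", "url1"), ("R1", "pageRank", 11),
    ("R2", "type", "Ranking"), ("R2", "pageURL", "url2"), ("R2", "pageRank", 27),
    ("UV1", "type", "UserVisits"), ("UV1", "destURL", "url1"),
]

def split_by_property(rows):
    # One partition per property; type triples get one partition per class.
    partitions = defaultdict(dict)
    for sub, prop, obj in rows:
        key = f"type:{obj}" if prop == "type" else prop
        partitions[key][sub] = obj  # single-valued properties assumed
    return partitions

def star_join(partitions, type_key, props):
    # Join the per-property tables on the subject to rebuild each star.
    rows = []
    for sub in sorted(partitions[type_key]):
        if all(sub in partitions[p] for p in props):
            rows.append((sub,) + tuple(partitions[p][sub] for p in props))
    return rows

partitions = split_by_property(triples)
rankings = star_join(partitions, "type:Ranking", ["pageURL", "pageRank"])
print(rankings)
```

Every star pattern needs its own multi-way join over these partitions, which is where the concurrent sub-flows listed under Issues come from.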
Compilation to MapReduce Jobs
[Figure: the query compiles into four MapReduce jobs]
• Step 1: Pattern Matching
  - map1 / reduce1: FILTER, JOIN (Rankings)
  - map2 / reduce2: FILTER, JOIN (UserVisits)
  - map3 / reduce3: JOIN
• Step 2: Grouping
  - map4 / reduce4: GROUP BY
• Step 3: Aggregation
  - FOREACH
Our Approach : RAPID+
• Goal: Minimize I/O costs
• Strategy:
  - Concurrent computation of star patterns using a grouping-based algorithm
  - Efficiency can be improved using operator coalescing and look-ahead processing
Concurrent Star Pattern Matching
• Use a grouping-based algorithm on a triple storage model
  - GROUP BY Subject
• More efficient if irrelevant triples are filtered out beforehand
Example query: Compute the average pageRank and total adRevenue for all pageURLs visited by a particular srcIP with visitDate between 1979/12/01 and 1979/12/30
[Figure: Ranking and UserVisits stars after filtering irrelevant properties]
Concurrent Star Pattern Matching - 2
• Filter irrelevant triples by coalescing the LOAD and FILTER operators
[Figure: using Pig Latin, map1 runs LOAD followed by FILTER; in our approach, operator coalescing replaces both with a single loadFilter in map1]
input = LOAD '\data' USING loadFilter(pageRank, pageURL, type:Ranking, destURL, adRevenue, srcIP, visitDate, type:UserVisits);
• Savings by coalescing:
  - Context switching
  - Parameter passing
  - Multiple handling of the same data
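The effect of coalescing LOAD and FILTER can be illustrated with a generator that parses and filters in the same pass, so irrelevant triples are never handed to a second operator. This is a hypothetical Python stand-in, not the loadFilter UDF from the slides; the tab-separated input format and the function signature are assumptions.

```python
def load_filter(lines, properties, types):
    # Parse and filter in one pass: a triple is kept only if its property
    # is relevant, or if it is a type triple for a relevant class.
    for line in lines:
        sub, prop, obj = line.rstrip("\n").split("\t")
        if prop == "type":
            if obj in types:
                yield sub, prop, obj
        elif prop in properties:
            yield sub, prop, obj

# Hypothetical input: one tab-separated triple per line.
raw = [
    "R1\ttype\tRanking",
    "R1\tpageRank\t11",
    "R1\tavgDuration\t5",   # irrelevant property, dropped while loading
    "X1\ttype\tOther",      # irrelevant class, dropped while loading
]

kept = list(load_filter(raw, {"pageRank", "pageURL"}, {"Ranking", "UserVisits"}))
print(kept)
```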
Grouping-based Pattern Matching
starSubgraphs = GROUP input BY $0;
• GROUP BY Subject
• BUT the resulting bags are heterogeneous
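A single grouping pass over the triple relation yields one bag per subject, which is why all star patterns can be computed at once. The Python sketch below mimics GROUP BY Subject on an in-memory list. The sample triples follow the slide values, except UV3, which is a hypothetical incomplete star added to show that the bags differ in shape.

```python
from collections import defaultdict

def group_by_subject(triples):
    # One bag per subject, holding all triples of that subject.
    bags = defaultdict(list)
    for triple in triples:
        bags[triple[0]].append(triple)
    return dict(bags)

triples = [
    ("R1", "type", "Ranking"), ("R1", "pageURL", "url1"), ("R1", "pageRank", 11),
    ("UV1", "type", "UserVisits"), ("UV1", "destURL", "url1"),
    ("UV1", "visitDate", "1979/12/12"), ("UV1", "adRevenue", 339.08142),
    ("UV1", "srcIP", "158.112.27.3"),
    ("UV3", "type", "UserVisits"), ("UV3", "destURL", "url2"),  # hypothetical, incomplete
]

bags = group_by_subject(triples)
for subject, bag in bags.items():
    print(subject, sorted(prop for _, prop, _ in bag))
```

The bags hold different property sets and sizes, which is the heterogeneity noted on the slide.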
Filtering the Groups
• BUT all possible sub-patterns are computed, so non-matching sub-patterns must be filtered out
• Structure-based filtering: eliminate subgraphs with missing properties (e.g., missing srcIP)
• Value-based filtering: validate each subgraph against the filter condition (e.g., visitDate between 1979/12/01 and 1979/12/30)
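The two filters can be expressed as simple predicates over a bag of triples. The sketch below is one way to write them in Python; the function names and the required-property set are assumptions, and UV3 is a hypothetical star that lacks srcIP.

```python
# Properties a UserVisits star must have for this query (an assumption).
REQUIRED = {"type", "destURL", "visitDate", "adRevenue", "srcIP"}

def structure_ok(bag, required):
    # Structure-based filtering: every required property must be present.
    return required <= {prop for _, prop, _ in bag}

def value_ok(bag, lo, hi):
    # Value-based filtering: visitDate must lie in the range
    # (yyyy/mm/dd strings compare in date order).
    return all(lo <= obj <= hi for _, prop, obj in bag if prop == "visitDate")

def visit_star(sub, date, with_src_ip=True):
    bag = [(sub, "type", "UserVisits"), (sub, "destURL", "url1"),
           (sub, "visitDate", date), (sub, "adRevenue", 1.0)]
    if with_src_ip:
        bag.append((sub, "srcIP", "158.112.27.3"))
    return bag

bags = {
    "UV1": visit_star("UV1", "1979/12/12"),
    "UV2": visit_star("UV2", "1980/02/02"),                     # fails the date range
    "UV3": visit_star("UV3", "1979/12/05", with_src_ip=False),  # missing srcIP
}

kept = [sub for sub, bag in bags.items()
        if structure_ok(bag, REQUIRED) and value_ok(bag, "1979/12/01", "1979/12/30")]
print(kept)
```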
Joining the Stars : Look-ahead Processing
• Star Pattern Matching Cycle
  - map: annotate based on Subject
  - reduce: group by Subject; process each bag (structure-based and value-based filtering); annotate based on the value of the join property
• Next Cycle (Joining the Stars)
  - map: no repeated processing (each bag is already annotated with the value of the join property)
  - reduce: join between the star subgraphs
[Figure: map and reduce phases of the two cycles, without and with look-ahead processing]
Example : Look-ahead Processing
• Star Pattern Matching
  - Structure-based filtering
  - Value-based filtering
  - Look-ahead: annotate each bag based on the join key
  - Eliminate properties irrelevant for future processing (join and filter properties) to minimize the size of intermediate results
• Joining the Stars
  - Join between the star subgraphs
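The look-ahead step can be pictured as a reduce function that, after filtering, emits each star already keyed by its join value and stripped of properties that later steps do not need. The Python sketch below is a hypothetical illustration; the function name, its parameters, and the dictionary output format are assumptions rather than the RAPID+ implementation.

```python
def reduce_with_look_ahead(subject, bag, join_prop, keep_props):
    # Runs at the end of the star-matching cycle, after filtering.
    props = {prop: obj for _, prop, obj in bag}
    # Keep only what later steps need (grouping and aggregation inputs).
    slim = {prop: props[prop] for prop in keep_props}
    # Look-ahead: key the output by the join value, so the next cycle
    # does not have to re-read the bag to find it.
    return props[join_prop], (subject, slim)

r1 = [("R1", "type", "Ranking"), ("R1", "pageURL", "url1"), ("R1", "pageRank", 11)]
uv1 = [("UV1", "type", "UserVisits"), ("UV1", "destURL", "url1"),
       ("UV1", "visitDate", "1979/12/12"), ("UV1", "adRevenue", 339.08142),
       ("UV1", "srcIP", "158.112.27.3")]

out_r = reduce_with_look_ahead("R1", r1, "pageURL", ["pageRank"])
out_uv = reduce_with_look_ahead("UV1", uv1, "destURL", ["srcIP", "adRevenue"])
print(out_r)
print(out_uv)
```

Both outputs carry the key url1, so the join cycle can bring them together directly, and the type, visitDate, and URL properties no longer travel with the intermediate results.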
Case Study
• Setup: 5-node and 20-node Hadoop clusters on NCSU's Virtual Computing Lab [13]
• Dataset: synthetic benchmark data set [4]
• Tasks (baseline case):
  - Task A (PM): basic pattern matching (two star patterns and a join between the stars)
  - Task B (PM+GA): pattern matching with grouping and aggregation (two look-ahead processing opportunities)
Experimental Results
[Figure: cost analysis for Task A (PM) and for Task B (PM+GA), both on the 5-node cluster]
Experimental Results
• Scalability study: 5-node vs. 20-node cluster
[Figure: scalability results; panel labels: 2.8 GB per node, 1.8 GB per node]
Conclusion and Ongoing Work
• Promising results even for the baseline case
• Further opportunities for improvement
  - First-class operators vs. UDFs
  - Exploit combiners during aggregations
  - More efficient data structures for processing bags
  - Further look-ahead optimizations during multiple groupings and aggregations
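As background for the combiner item, the sketch below shows the general idea for an average: each mapper pre-aggregates its values into a (sum, count) pair, and the reducer merges the pairs. This is a generic illustration with hypothetical function names and sample values, not a description of how RAPID+ uses or would use combiners.

```python
def combine(values):
    # Combiner: collapse one mapper's values into a partial (sum, count).
    return (sum(values), len(values))

def reduce_avg(partials):
    # Reducer: merge the partial sums and counts, then divide once.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Hypothetical pageRank values for one srcIP, spread over two mappers.
mapper_outputs = [[11, 27], [22]]
partials = [combine(values) for values in mapper_outputs]
average = reduce_avg(partials)
print(partials, average)
```

Averaging the per-mapper averages would give the wrong answer when mappers hold different numbers of values, which is why the partial result carries the count.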
References
[1] D. Chatziantoniou, M. Akinde, T. Johnson, and S. Kim. "The MD-join: an operator for Complex OLAP". ICDE 2001, 108–121
[2] J. Dean and S. Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters". In Proc. of OSDI'04, 2004
[3] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. "Pig Latin: a not-so-foreign language for data processing". In Proc. of ACM SIGMOD 2008, pp. 1099-1110
[4] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. "A Comparison of Approaches to Large-Scale Data Analysis". In Proc. of SIGMOD 2009
[5] Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A.: HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. VLDB 2009
[6] Sridhar, R., Ravindra, P., Anyanwu, K.: RAPID: Enabling scalable ad-hoc analytics on the semantic web. ISWC 2009
[7] Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, U., Gunda, P.K., and Currey, J.: DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. OSDI 2008
[8] A. Newman, Y. Li, J. Hunter. Scalable Semantics – The Silver Lining of Cloud Computing. IEEE Fourth International Conference on eScience (eScience '08), 2008
[9] Newman, A., Hunter, J., Li, Y-F., Bouton, C., Davis, M.: A Scale-Out RDF Molecule Store for Distributed Processing of Biomedical Data. HCLS'08 at WWW 2008
[10] J. Urbani, S. Kotoulas, E. Oren, and F. van Harmelen. "Scalable Distributed Reasoning using MapReduce". In Proceedings of ISWC '09, 2009
[11] Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: Scalable Semantic Web Data Management Using Vertical Partitioning. VLDB 2007
[12] Prud'hommeaux, E., Seaborne, A.: SPARQL query language for RDF. Technical report, World Wide Web Consortium (2005) http://www.w3.org/TR/rdf-sparql-quer
[13] VCL Setup at NC State University, https://vcl.ncsu.edu/
[14] HiveQL, http://hadoop.apache.org/hive/
[15] JAQL, http://code.google.com/p/jaql
[16] RDF, http://www.w3.org/RDF/