Research Meeting 2009-10-22 Jaeseok Myung
Summary
• TA
  • DB: project 3, midterm (24 students took it)
  • WEC: report, project (Android), classroom, class (Director 정재목)
• Research
  • DESWeb 2010
    • 1st International Workshop on Data Engineering meets the Semantic Web, in conjunction with ICDE 2010
    • Submission: Nov 15th, 6 pages
  • Draft the paper outline
  • LUBM conversion, complex query selection
Center for E-Business Technology
SPARQL Basic Graph Pattern Processing with Iterative MapReduce
• Abstract
  • In this paper, we propose an iterative MapReduce (MR) algorithm for processing SPARQL Basic Graph Patterns (BGPs). A BGP may contain many self-joins, but MR's shared-nothing architecture makes such joins difficult to process within the MR framework: an expensive MR iteration is needed to obtain a shared join key between two graph patterns. We therefore suggest an algorithm that reduces the number of MR iterations, and we evaluate it with the Lehigh University Benchmark (LUBM). Our experiments use physically separated RDF storage and a parallel data processing framework, and the results show that the algorithm provides scalable access to large RDF data.
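
The point about needing one MR iteration per shared join key can be illustrated with a minimal sketch. This is not the algorithm from the paper: it simulates a single reduce-side join in plain Python rather than on Hadoop, and the helper names (`bind`, `map_phase`, `shuffle`, `reduce_phase`, `mr_join`) and the toy triples are assumptions made only for illustration.

```python
from collections import defaultdict

def bind(pattern, triple):
    """Return variable bindings if the triple matches the pattern, else None."""
    binding = {}
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding

def map_phase(triples, patterns, join_var):
    """Emit (join key, (pattern index, binding)) for every matching triple."""
    for triple in triples:
        for idx, pattern in enumerate(patterns):
            b = bind(pattern, triple)
            if b is not None and join_var in b:
                yield b[join_var], (idx, b)

def shuffle(pairs):
    """Group mapper output by key, as the MR framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Per key, combine bindings of pattern 0 with compatible bindings of pattern 1."""
    for values in groups.values():
        left = [b for idx, b in values if idx == 0]
        right = [b for idx, b in values if idx == 1]
        for l in left:
            for r in right:
                if all(l[v] == r[v] for v in l.keys() & r.keys()):
                    yield {**l, **r}

def mr_join(triples, left, right, join_var):
    """One simulated MR iteration: join two triple patterns on one variable."""
    return list(reduce_phase(shuffle(map_phase(triples, [left, right], join_var))))

# Hypothetical toy data using LUBM-style predicate names.
TRIPLES = [
    ("s1", "rdf:type", "ub:GraduateStudent"),
    ("s1", "ub:advisor", "p1"),
    ("s1", "ub:takesCourse", "c1"),
    ("p1", "ub:teacherOf", "c1"),
    ("s2", "ub:advisor", "p1"),
    ("s2", "ub:takesCourse", "c2"),
]

joined = mr_join(TRIPLES, ("?x", "ub:advisor", "?y"),
                 ("?x", "ub:takesCourse", "?c"), "?x")
```

Here the records are keyed by `?x`. Joining the result with a further pattern such as `?y ub:teacherOf ?c` would require re-keying it by `?y`, which in this sketch means running `map_phase`, `shuffle`, and `reduce_phase` again, i.e. another MR iteration.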
Outline
• Introduction
• Related Work
• BGP Processing with MR
  • MR Iteration (why joins cause MR iterations, N-Triples storage structure)
  • Naïve Approach (Single-Random)
• Our Approach
  • Multi-Greedy Algorithm
  • Discussion (edge preserving, performance by type, key selection)
• Experiments
  • Environment Settings (Hadoop, LUBM, Complex Query, Amazon EC2, Converter)
  • SPARQL Processing Results (varying number of nodes, varying data size)
  • Dealing with Intermediate Results (intermediate file I/O is costly, CGL-MR)
• Conclusion (storage structures and indexing that are more complex than N-Triples and compressible need further research)
• Reference
Outline2
• Introduction
• Related Work
• BGP Processing with MR
  • MR Iteration (why joins cause MR iterations, N-Triples storage structure)
  • Naïve Approach (Single-Point Random Selection)
  • Multi-Point Greedy Selection Algorithm
• Experiments
  • Environment Settings (Hadoop, LUBM, Complex Query, Amazon EC2, Converter)
  • SPARQL Processing Results (varying number of nodes, varying data size)
• Discussion
  • Discussion (edge preserving, performance by type, key selection)
  • Dealing with Intermediate Results (intermediate file I/O is costly, CGL-MR)
• Conclusion (storage structures and indexing that are more complex than N-Triples and compressible need further research)
• Reference
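
The outline names a single-point random selection and a multi-point greedy selection but does not define them. The sketch below is only one possible reading, written to show why choosing several join keys per iteration can lower the iteration count. It assumes that one MR iteration may join on several variables at once as long as no triple pattern is claimed by two keys, and that the greedy order prefers variables shared by the fewest patterns. The function names, the ordering heuristic, and the sample BGP are all hypothetical, not taken from the paper.

```python
import random

def shared_vars(relations):
    """Count, for each variable, how many relations mention it (keep counts >= 2)."""
    counts = {}
    for rel in relations:
        for v in rel:
            counts[v] = counts.get(v, 0) + 1
    return {v: n for v, n in counts.items() if n >= 2}

def one_iteration(relations, keys):
    """Simulate one MR iteration: for each key, merge the relations containing it."""
    used, merged = set(), []
    for key in keys:
        group = [i for i, r in enumerate(relations) if key in r]
        assert not used.intersection(group), "a relation may serve only one key"
        used.update(group)
        merged.append(frozenset().union(*(relations[i] for i in group)))
    untouched = [r for i, r in enumerate(relations) if i not in used]
    return untouched + merged

def pick_single_random(relations, rng):
    """Assumed naive strategy: one randomly chosen join variable per iteration."""
    candidates = sorted(shared_vars(relations))
    return [rng.choice(candidates)] if candidates else []

def pick_multi_greedy(relations):
    """Assumed greedy strategy: accept keys in order while their groups stay disjoint."""
    counts = shared_vars(relations)
    claimed, keys = set(), []
    for v in sorted(counts, key=lambda v: (counts[v], v)):
        group = {i for i, r in enumerate(relations) if v in r}
        if group.isdisjoint(claimed):
            keys.append(v)
            claimed |= group
    return keys

def count_iterations(patterns, pick):
    """Number of simulated MR iterations until all patterns are joined."""
    relations = [frozenset(t for t in p if t.startswith("?")) for p in patterns]
    n = 0
    while len(relations) > 1:
        keys = pick(relations)
        if not keys:  # disconnected BGP: no shared variable left
            break
        relations = one_iteration(relations, keys)
        n += 1
    return n

# Hypothetical BGP built from LUBM-style terms.
BGP = [
    ("?x", "rdf:type", "ub:GraduateStudent"),
    ("?x", "ub:advisor", "?y"),
    ("?y", "ub:teacherOf", "?c"),
    ("?c", "rdf:type", "ub:Course"),
    ("?y", "ub:worksFor", "?d"),
    ("?d", "rdf:type", "ub:Department"),
]

rng = random.Random(0)
single = count_iterations(BGP, lambda rels: pick_single_random(rels, rng))
multi = count_iterations(BGP, pick_multi_greedy)
```

For this sample BGP, every one of the four variables has to be joined in its own iteration under the single-key strategy, so `single` is 4, while the multi-key strategy joins on `?c`, `?d`, and `?x` together and then on `?y`, so `multi` is 2. The model tracks only which variables each intermediate result carries and ignores data sizes and I/O cost.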
Introduction (1/2)
• SPARQL is a W3C recommendation for querying RDF data
  • Explain that SPARQL is essential for using RDF and that the BGP is the basis of SPARQL pattern matching
• SPARQL BGP processing is difficult because a BGP may contain a significant number of self-joins, which are expensive
• Much prior research has focused on single-machine triplestores
• However, some tasks may need multiple machines and federated query processing techniques
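
To make the self-join point concrete, here is a minimal single-machine sketch that evaluates a BGP against one flat list of triples. Every additional triple pattern scans the same list again, so a BGP with n patterns behaves like n-1 self-joins on one triple table. The function names and the sample data are assumptions for illustration, and the nested-loop evaluation is deliberately naive.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new

def evaluate_bgp(patterns, triples):
    """Naive BGP evaluation: one pass over the whole triple list per pattern."""
    bindings = [{}]
    for pattern in patterns:
        # Joining the partial results with the same triple list is a self-join.
        bindings = [
            extended
            for b in bindings
            for t in triples
            if (extended := match_pattern(pattern, t, b)) is not None
        ]
    return bindings

# Hypothetical toy data using LUBM-style names.
TRIPLE_TABLE = [
    ("s1", "rdf:type", "ub:GraduateStudent"),
    ("s1", "ub:advisor", "p1"),
    ("s1", "ub:takesCourse", "c1"),
    ("p1", "ub:teacherOf", "c1"),
    ("s2", "rdf:type", "ub:GraduateStudent"),
    ("s2", "ub:advisor", "p1"),
    ("s2", "ub:takesCourse", "c2"),
]

# Students who take a course taught by their own advisor.
QUERY = [
    ("?x", "rdf:type", "ub:GraduateStudent"),
    ("?x", "ub:advisor", "?y"),
    ("?x", "ub:takesCourse", "?c"),
    ("?y", "ub:teacherOf", "?c"),
]

solutions = evaluate_bgp(QUERY, TRIPLE_TABLE)
```

With this data only `s1` qualifies, because `s2` takes `c2`, which `p1` does not teach. The cost grows with the number of patterns times the size of the triple list, which is the expense the slide refers to.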
Introduction (2/2)
• MR is a distributed, parallel data processing framework that is well suited to large-scale data analysis
• Unfortunately, MR has not been considered the best option for join operations, which are inherent in graph pattern matching algorithms
  • Because it is heterogeneous and shared-nothing
• Some researchers have employed iterative MR, but the iterations are expensive
• In this paper, we propose an algorithm that reduces the number of MR iterations for BGP processing
• The rest of the paper is organized as follows
Related Work
• SPARQL Processing
  • BGP, Join (single machine), Triplestore
• Data Processing with MR
  • Google, Hadoop, Hive, Pig
  • PDBMS vs. MR
• Federated SPARQL Processing
  • DARQ, YARS2, Virtuoso, …
• SPARQL processing with MR is a new approach, but it builds on the research above
An Example of BGPs
[Figure: an example BGP drawn as a query graph over LUBM terms. Variables: ?x, ?y, ?j1, ?j3, ?d1, ?n1, ?p1, ?m3, ?o1, ?l1. Classes (linked by rdf:type): ub:Faculty, ub:Chair, ub:GraduateStudent, ub:Lecturer, ub:Course, ub:Department, ub:Person. Predicates: ub:advisor, ub:publicationAuthor, ub:teacherOf, ub:memberOf, ub:worksFor, ub:takesCourse, ub:hasAlumnus, ub:subOrganizationOf.]
References
• M. Stocker et al., SPARQL Basic Graph Pattern Optimization Using Selectivity Estimation, WWW 2008
• C. Weiss et al., Hexastore: Sextuple Indexing for Semantic Web Data Management, VLDB 2008
• D. J. Abadi et al., SW-Store: A Vertically Partitioned DBMS for Semantic Web Data Management, VLDB Journal 2009
• T. Neumann et al., Scalable Join Processing on Very Large RDF Graphs, SIGMOD 2009
• H. Yang et al., Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters, SIGMOD 2007
• A. Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009
• A. Abouzeid et al., HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads, VLDB 2009
• C. Olston et al., Pig Latin: A Not-So-Foreign Language for Data Processing, SIGMOD 2008
• J. Ekanayake et al., MapReduce for Data Intensive Scientific Analyses, ESCIENCE 2008
• J. Cohen, Graph Twiddling in a MapReduce World, CISE 2009
• B. Quilitz et al., Querying Distributed RDF Data Sources with SPARQL, ESWC 2008
• A. Harth et al., YARS2: A Federated Repository for Querying Graph Structured Data from the Web, ISWC 2007