Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools

Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools • Mohammad Farhan Husain • Dr. Latifur Khan • Dr. Bhavani Thuraisingham • Department of Computer Science, University of Texas at Dallas

Outline • Semantic Web Technologies & Cloud Computing Frameworks • Goal & Motivation • Current Approaches • System Architecture & Storage Schema • SPARQL Query by MapReduce • Query Plan Generation • Experiment • Future Works

Semantic Web Technologies • Data in machine understandable format • Infer new knowledge • Standards • Data representation – RDF • Triples • Example: • Ontology – OWL, DAML • Query language - SPARQL

Cloud Computing Frameworks • Proprietary • Amazon S3 • Amazon EC2 • Force.com • Open source tool • Hadoop – Apache open source framework whose file system, HDFS, is modeled on Google's proprietary GFS • MapReduce – programming paradigm, borrowed from functional programming, that processes data as key-value pairs

Goal • To build efficient storage using Hadoop for large amounts of data (e.g. a billion triples) • To build an efficient query mechanism • To publish the system as an open source project • http://code.google.com/p/hadooprdf/ • To integrate with Jena as a Jena Model

Motivation • Current Semantic Web frameworks do not scale to a large number of triples, e.g. • Jena In-Memory, Jena RDB, Jena SDB • AllegroGraph • Virtuoso Universal Server • BigOWLIM • There is a lack of a distributed framework and of persistent storage • Hadoop runs on low-end hardware and provides a distributed framework with high fault tolerance and reliability

Current Approaches • State-of-the-art approach • Store RDF data in HDFS and query through MapReduce programming (our approach) • Traditional approach • Store data in HDFS and process queries outside of Hadoop • Done in the BIOMANTA project (http://biomanta.org/); details of its querying could not be found

System Architecture • [Figure: system architecture. The LUBM Data Generator feeds RDF/XML to the Preprocessor (N-Triples Converter, Predicate Based Splitter, Object Type Based Splitter), which stores Preprocessed Data in the Hadoop Distributed File System / Hadoop Cluster. The MapReduce Framework (Query Rewriter, Query Plan Generator, Plan Executor) receives 1. Query, submits 2. Jobs to the cluster, and returns 3. Answer.]

Storage Schema • Data in N-Triples • Using namespaces • Example: http://utdallas.edu/res1 → utd:resource1 • Predicate based Splits (PS) • Split data according to Predicates • Predicate Object based Splits (POS) • Split further according to rdf:type of Objects

Example • Input triples: • D0U0:GraduateStudent20 rdf:type lehigh:GraduateStudent • lehigh:University0 rdf:type lehigh:University • D0U0:GraduateStudent20 lehigh:memberOf lehigh:University0 • PS • File rdf_type: D0U0:GraduateStudent20 lehigh:GraduateStudent; lehigh:University0 lehigh:University • File lehigh_memberOf: D0U0:GraduateStudent20 lehigh:University0 • POS • File rdf_type_GraduateStudent: D0U0:GraduateStudent20 • File rdf_type_University: lehigh:University0 • File lehigh_memberOf_University: D0U0:GraduateStudent20 lehigh:University0
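The two splitting steps can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the project's Hadoop code: the function names, the use of dictionaries in place of HDFS files, and the rule that a subject with several types keeps only the last one seen are all assumptions made for the example.

```python
from collections import defaultdict


def local_name(term):
    """Return the part after the namespace prefix, e.g. 'lehigh:University' -> 'University'."""
    return term.split(":", 1)[1]


def predicate_split(triples):
    """PS: one 'file' per predicate; the predicate itself is dropped from each row."""
    files = defaultdict(list)
    for subject, predicate, obj in triples:
        files[predicate.replace(":", "_")].append((subject, obj))
    return dict(files)


def predicate_object_split(ps_files):
    """POS: split each PS file further by the rdf:type of the object (hypothetical rule set)."""
    # Assumption: each resource has a single type; a later rdf:type row overwrites an earlier one.
    types = {s: local_name(o) for s, o in ps_files.get("rdf_type", [])}
    pos = defaultdict(list)
    for name, rows in ps_files.items():
        for subject, obj in rows:
            if name == "rdf_type":
                # The object is encoded in the file name, so only the subject is stored.
                pos[f"{name}_{local_name(obj)}"].append((subject,))
            elif obj in types:
                pos[f"{name}_{types[obj]}"].append((subject, obj))
            else:
                pos[name].append((subject, obj))  # object type unknown: keep the PS file
    return dict(pos)


triples = [
    ("D0U0:GraduateStudent20", "rdf:type", "lehigh:GraduateStudent"),
    ("lehigh:University0", "rdf:type", "lehigh:University"),
    ("D0U0:GraduateStudent20", "lehigh:memberOf", "lehigh:University0"),
]
ps = predicate_split(triples)
pos = predicate_object_split(ps)
```

Dropping the predicate (and, for rdf:type, the object) from each row is where the space gain of the schema comes from.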

Space Gain • Example Data size at various steps for LUBM1000

SPARQL Query • SPARQL – SPARQL Protocol And RDF Query Language • Example: SELECT ?x ?y WHERE { ?z foaf:name ?x . ?z foaf:age ?y } • [Figure: the query, sample data, and the result]

SPARQL Query by MapReduce • Example query: SELECT ?p WHERE { ?x rdf:type lehigh:Department . ?p lehigh:worksFor ?x . ?x subOrganizationOf <http://University0.edu> } • Rewritten query: SELECT ?p WHERE { ?p lehigh:worksFor_Department ?x . ?x subOrganizationOf <http://University0.edu> }

Inside Hadoop MapReduce Job • INPUT • File subOrganizationOf_University: (Department1, http://University0.edu), (Department2, http://University1.edu) • File worksFor_Department: (Professor1, Department1), (Professor2, Department2) • MAP • Filtering: Object == http://University0.edu • Emits (Department1, SO#http://University0.edu), (Department1, WF#Professor1), (Department2, WF#Professor2) • SHUFFLE & SORT • Groups the values by key • REDUCE • Department1: SO#http://University0.edu, WF#Professor1 • Department2: WF#Professor2 • OUTPUT • WF#Professor1
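The job on this slide can be imitated in plain Python to make the data flow concrete. This is a minimal single-process sketch, not Hadoop code: the function names are hypothetical, and the reduce rule (emit the WF# values of a key only when an SO# value is also present) is inferred from the slide's output.

```python
from collections import defaultdict

# Input "files" from the slide, as (subject, object) rows.
SUB_ORGANIZATION_OF_UNIVERSITY = [
    ("Department1", "http://University0.edu"),
    ("Department2", "http://University1.edu"),
]
WORKS_FOR_DEPARTMENT = [
    ("Professor1", "Department1"),
    ("Professor2", "Department2"),
]


def map_sub_organization(rows, university):
    """Map: keep only rows whose object is the requested university, keyed by department."""
    for department, univ in rows:
        if univ == university:
            yield department, "SO#" + univ


def map_works_for(rows):
    """Map: key each professor by the department (the joining variable ?x)."""
    for professor, department in rows:
        yield department, "WF#" + professor


def shuffle_and_sort(pairs):
    """Shuffle and sort: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(groups[key]) for key in sorted(groups)}


def reduce_join(values):
    """Reduce: a department joins only if it survived the subOrganizationOf filter."""
    if any(v.startswith("SO#") for v in values):
        return [v for v in values if v.startswith("WF#")]
    return []


def run_job(university):
    pairs = list(map_sub_organization(SUB_ORGANIZATION_OF_UNIVERSITY, university))
    pairs += list(map_works_for(WORKS_FOR_DEPARTMENT))
    output = []
    for _key, values in shuffle_and_sort(pairs).items():
        output.extend(reduce_join(values))
    return output


result = run_job("http://University0.edu")
```

The SO# and WF# prefixes let the reducer tell which input file a value came from, since all values for a key arrive in a single list.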

Query Plan Generation • Challenge • One Hadoop job may not be sufficient to answer a query • In a single Hadoop job, a single triple pattern cannot take part in joins on more than one variable simultaneously • Solution • Algorithm for query plan generation • Query plan is a sequence of Hadoop jobs which answers the query • Exploit the fact that in a single Hadoop job, a single triple pattern can take part in more than one join on a single variable simultaneously

Example • Example query: SELECT ?X ?Y ?Z WHERE { ?X pred1 obj1 . subj2 ?Z obj2 . subj3 ?X ?Z . ?Y pred4 obj4 . ?Y pred5 ?X } • Simplified view (variables in each triple pattern): • 1: X • 2: Z • 3: XZ • 4: Y • 5: XY

Join Graph & Hadoop Jobs • [Figure: the Join Graph of triple patterns 1 to 5, with edges labeled by the joining variable (X, Y, or Z), followed by three candidate jobs: Valid Job 1, Valid Job 2, and Invalid Job]

Possible Query Plans • A. job1: (x, xz, xy) = yz, job2: (yz, y) = z, job3: (z, z) = done • [Figure: Join Graph, Job 1, Job 2, Job 3, Result]

Possible Query Plans • B. job1: (y, xy) = x; (z, xz) = x, job2: (x, x, x) = done • [Figure: Join Graph, Job 1, Job 2, Result]

Query Plan Generation • Goal: generate a minimum cost job plan • Backtracking approach • Exhaustively generates all possible plans • Uses a two-coloring scheme (WHITE and BLACK) on a graph to find jobs • Two WHITE nodes cannot be adjacent • User-defined cost model • Chooses the best plan according to the cost model

Some Definitions • Triple Pattern, TP: A triple pattern is an ordered collection of subject, predicate and object which appears in a SPARQL query WHERE clause. The subject, predicate and object can each be either a variable (unbound) or a concrete value (bound). • Triple Pattern Join, TPJ: A triple pattern join is a join between two TPs on a variable. • MapReduceJoin, MRJ: A MapReduceJoin is a join between two or more triple patterns on a variable.

Some Definitions • Job, JB: A job JB is a Hadoop job where one or more MRJs are done. JB has a set of input files and a set of output files. • Conflicting MapReduceJoins, CMRJ: Conflicting MapReduceJoins is a pair of MRJs on different variables sharing a triple pattern. • Non-Conflicting MapReduceJoins, NCMRJ: Non-conflicting MapReduceJoins is a pair of MRJs either not sharing any triple pattern, or sharing a triple pattern with both MRJs on the same variable.

Example • LUBM Query • SELECT ?X WHERE { • 1 ?X rdf:type ub:Chair . • 2 ?Y rdf:type ub:Department . • 3 ?X ub:worksFor ?Y . • 4 ?Y ub:subOrganizationOf <http://www.U0.edu> }

Example (contd.) • Triple Pattern Graph and Join Graph for the LUBM Query • [Figure panels: Triple Pattern Graph (TPG) #1, Join Graph (JG) #1, Triple Pattern Graph (TPG) #2, Join Graph (JG) #2]

Example (contd.) • The figure shows the TPG and JG for the query. • On the left, we have the TPG, where each node represents a triple pattern in the query; nodes are named in the order the patterns appear. • In the middle, we have the JG. Each node in the JG represents an edge in the TPG. • For the query, a feasible query plan (FQP) can have two jobs • The first deals with the NCMRJ between triple patterns 2, 3 and 4 • The second deals with the NCMRJ between triple pattern 1 and the output of the first join • An infeasible query plan (IQP) would have a first job with CMRJs between triple patterns 1, 3 and 4, and a second job with an MRJ between triple pattern 2 and the output of the first join.
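The difference between the two plans can be checked mechanically. The sketch below assumes that two MRJs conflict exactly when they are on different variables and share a triple pattern (the complement of the NCMRJ definition); the `MRJ` class and function names are hypothetical, and triple patterns are referred to by their numbers in the LUBM query.

```python
from itertools import combinations
from typing import NamedTuple


class MRJ(NamedTuple):
    """A MapReduceJoin: the joining variable and the ids of the triple patterns it joins."""
    variable: str
    patterns: frozenset


def conflicting(a, b):
    """Two MRJs conflict if they share a triple pattern but join on different variables."""
    return a.variable != b.variable and bool(a.patterns & b.patterns)


def job_is_conflict_free(joins):
    """A single Hadoop job may only contain pairwise non-conflicting MRJs."""
    return not any(conflicting(a, b) for a, b in combinations(joins, 2))


# LUBM query: 1 = ?X type Chair, 2 = ?Y type Department,
#             3 = ?X worksFor ?Y, 4 = ?Y subOrganizationOf U0
first_job_feasible = [MRJ("Y", frozenset({2, 3, 4}))]
first_job_infeasible = [MRJ("X", frozenset({1, 3})), MRJ("Y", frozenset({3, 4}))]

feasible_ok = job_is_conflict_free(first_job_feasible)
infeasible_ok = job_is_conflict_free(first_job_infeasible)
```

In the infeasible job, triple pattern 3 would have to be keyed on both ?X and ?Y within the same job, which is the situation the challenge on the earlier Query Plan Generation slide rules out.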

Query Plan Generation: Backtracking

Query Plan Generation: Backtracking • Drawbacks of the backtracking approach • Computationally intractable • Search space is exponential in size

Steps a Hadoop Job Goes Through • Executable file (containing MapReduce code) is transferred from client machine to JobTracker • JobTracker decides which TaskTrackers will execute the job • Executable file is distributed to TaskTrackers over network • Map processes start by reading data from HDFS • Map outputs are written to discs • Map outputs are read from discs, shuffled (transferred over the network to TaskTrackers which would run Reduce processes), sorted and written to discs • Reduce processes start by reading the input from the discs • Reduce outputs are written to discs

MapReduce Data Flow http://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow

Observations & an Approximate Solution • Observations • Fixed overheads of a Hadoop job • Multiple read-writes to disc • Data transfer over network multiple times • Even a “Hello World” MapReduce job takes a couple of seconds because of the fixed overheads • Approximate solution • Minimize number of jobs • This is a good approximation since the overhead of each job (e.g. jar file distribution, multiple disc read-writes, multiple network data transfer) and job switching is huge

Greedy Algorithm: Terms • Joining variable: • A variable that is common to two or more triples • Ex: x, y, xy, xz, za -> x, y, z are joining variables; a is not • Complete elimination: • A join operation that eliminates a joining variable • y can be completely eliminated if we join (xy, y) • Partial elimination: • A join that partially eliminates a joining variable • After complete elimination of y, x can be partially eliminated by joining (xz, x)

Greedy Algorithm: Terms • E-count: • Number of joining variables in the resultant triple after a complete elimination • In the example x, y, z, xy, xz: • E-count of x = 2 (resultant triple: yz) • E-count of y = 1 (resultant triple: x) • E-count of z = 1 (resultant triple: x)
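The terms above can be computed directly. In this minimal sketch each triple is written as a string of single-letter variables, as on the slides; the function names are hypothetical, and a variable of the resultant triple is counted as joining only if it still occurs in some triple outside the join, which is one reading of the definition.

```python
from collections import Counter


def joining_variables(triples):
    """Variables that occur in two or more triples."""
    counts = Counter(v for t in triples for v in set(t))
    return {v for v, n in counts.items() if n >= 2}


def e_count(triples, var):
    """Joining variables left in the resultant triple after completely eliminating var."""
    joined = [set(t) for t in triples if var in t]
    others = [set(t) for t in triples if var not in t]
    resultant = set().union(*joined) - {var}
    still_shared = set().union(*others)  # variables the resultant triple can still join on
    return len(resultant & still_shared)


terms_example = ["x", "y", "xy", "xz", "za"]
e_count_example = ["x", "y", "z", "xy", "xz"]

joining = joining_variables(terms_example)
counts = {v: e_count(e_count_example, v) for v in "xyz"}
```

With the slide examples, `joining` contains x, y and z but not a, and `counts` gives x an e-count of 2 and y and z an e-count of 1.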

Greedy Algorithm: Proposition • Maximum number of jobs required for any SPARQL query: • $K$, if $K \le 1$; $\min(\lceil 1.71 \log_2 K \rceil, N)$, if $K > 1$ • where $K$ is the number of triples in the query • and $N$ is the total number of joining variables
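The bound is easy to evaluate for concrete queries. The helper below is a direct transcription of the proposition; the function name `max_jobs` is hypothetical.

```python
import math


def max_jobs(k, n):
    """Upper bound on the number of Hadoop jobs for a query with k triples and n joining variables."""
    if k <= 1:
        return k
    return min(math.ceil(1.71 * math.log2(k)), n)


bound_100 = max_jobs(100, 50)  # 1.71 * log2(100) is about 11.36, so the bound is 12
bound_few_vars = max_jobs(100, 3)  # with only 3 joining variables, N is the tighter limit
```

For small queries the number of joining variables N is usually the binding term; the logarithmic term only matters for queries with many triples.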

Greedy Algorithm: Proof • If we make just one join with each joining variable, then all joins can be done in $N$ jobs (one join per job) • Special case: • Suppose each joining variable is common to exactly two triples • Example: ab, bc, cd, de, ef, … (like a chain) • At each job, we can make $K/2$ joins, which reduces the number of triples to half (i.e., $K/2$) • So, each job halves the number of triples • Therefore, the total number of jobs required is $\log_2 K < 1.71 \log_2 K$

Greedy Algorithm: Proof (Continued) • General case: • Suppose we sort the variables in decreasing order of their frequency in different triples • Let $v_i$ have frequency $f_i$ • Therefore, $f_i \le f_{i-1} \le f_{i-2} \le \dots \le f_1$ • Note that if $f_1 = 2$, then it reduces to the special case • Therefore, $f_1 > 2$ in the general case; also, $f_N \ge 2$ • Now, we keep joining on $v_1, v_2, \dots, v_N$ as long as there is no conflict

Greedy Algorithm: Proof (Continued) • Suppose $L$ triples could not be reduced because each of them is left with one or more joining variables that are conflicting (e.g. try reducing xy, yz, zx) • Therefore, $M \ge L$ joins have been performed, producing $M$ triples (a total of $M + L$ triples remaining) • Since each join involved at least 2 triples, • $2M + L \le K$ • $2(L + e) + L \le K$ (letting $M = L + e$, $e \ge 0$) • $3L + 2e \le K$ • $2L + \frac{4}{3}e \le \frac{2}{3}K$ (multiplying both sides by 2/3)

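Greedy Algorithm: Proof (Continued) • $2L + e \le \frac{2}{3}K$ (since $e \le \frac{4}{3}e$), and $2L + e = M + L$ is the number of triples remaining • So each job reduces the number of triples to at most 2/3 of its previous value • Therefore, if $Q$ jobs are required, • $K (2/3)^Q \ge 1 \ge K (2/3)^{Q+1}$ • $(3/2)^Q \le K \le (3/2)^{Q+1}$, so $Q \le \log_{3/2} K \approx 1.71 \log_2 K \le Q + 1$ • In most real world scenarios, we can assume that 100 triples in a query is extremely rare • So, the maximum number of jobs required in this case is 12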

Greedy Algorithm • Greedy algorithm • Early elimination heuristic: • Make as many complete eliminations in each job as possible • This leaves the fewest variables to join in the next job • Must first choose the join that has the least e-count (least number of joining variables in the resultant triple)

Greedy Algorithm

Greedy Algorithm • Step I: remove non-joining variables • Step II: sort the vars according to e-count • Step III: choose a var for elimination as long as complete or partial elimination is possible – these joins make a job • Step IV: continue to step II if more triples are available
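Steps I to IV can be sketched as follows. This is one possible reading of the slide, not the authors' implementation: triples are modeled as sets of variable names, a triple may take part in only one join per job, a join is complete when every triple containing the variable is still free and partial otherwise, and ties in e-count are broken alphabetically. All function names are hypothetical.

```python
from collections import Counter


def joining_vars(triples):
    """Variables that occur in two or more triples."""
    counts = Counter(v for t in triples for v in t)
    return {v for v, n in counts.items() if n >= 2}


def e_count(triples, var):
    """Joining variables left in the resultant triple after completely eliminating var."""
    joined = [t for t in triples if var in t]
    others = [t for t in triples if var not in t]
    resultant = frozenset().union(*joined) - {var}
    return len(resultant & frozenset().union(*others))


def greedy_plan(patterns):
    """Return a list of jobs; each job is a list of (variable, 'complete' | 'partial') joins."""
    triples = [frozenset(p) for p in patterns]
    jobs = []
    while True:
        shared = joining_vars(triples)
        triples = [t & shared for t in triples]  # Step I: drop non-joining variables
        if not shared:
            return jobs
        # Step II: sort variables by e-count (ties broken by name)
        order = sorted(shared, key=lambda v: (e_count(triples, v), v))
        used, produced, job = set(), [], []
        for var in order:  # Step III: pick joins that do not reuse a triple
            holders = [i for i, t in enumerate(triples) if var in t]
            free = [i for i in holders if i not in used]
            if len(free) < 2:
                continue
            complete = len(free) == len(holders)
            merged = frozenset().union(*(triples[i] for i in free))
            if complete:
                merged -= {var}  # var is eliminated only when every holder was joined
            used.update(free)
            produced.append(merged)
            job.append((var, "complete" if complete else "partial"))
        jobs.append(job)
        # Step IV: continue with the untouched triples plus the join outputs
        triples = [t for i, t in enumerate(triples) if i not in used] + produced


plan = greedy_plan(["x", "z", "xz", "y", "xy"])
```

For the five-pattern example (X, Z, XZ, Y, XY) this sketch eliminates y and z in the first job and x in the second, matching plan B rather than the three-job plan A.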

Experiment • Dataset and queries • Cluster description • Comparison with Jena In-Memory, SDB and BigOWLIM frameworks • Experiments with number of Reducers • Algorithm runtimes: Greedy vs. Exhaustive • Some query results

Dataset And Queries • LUBM • Dataset generator • 14 benchmark queries • Generates data of some imaginary universities • Used for query execution performance comparison by many researchers

Our Clusters • 10 node cluster in SAIAL lab • 4 GB main memory • Intel Pentium IV 3.0 GHz processor • 640 GB hard drive • OpenCirrus HP labs test bed
