This presentation discusses the application of semantic web technologies and cloud computing frameworks to processing data-intensive queries over large RDF graphs. It explores the use of tools such as Hadoop and MapReduce for efficient storage and query mechanisms, with a focus on scalability and fault tolerance.
Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools • Mohammad Farhan Husain, Dr. Latifur Khan, Dr. Bhavani Thuraisingham • Department of Computer Science, University of Texas at Dallas
Outline • Semantic Web Technologies & Cloud Computing Frameworks • Goal & Motivation • Current Approaches • System Architecture & Storage Schema • SPARQL Query by MapReduce • Query Plan Generation • Experiment • Future Work
Semantic Web Technologies • Data in a machine-understandable format • Infer new knowledge • Standards • Data representation – RDF • Triples, e.g. (subject, predicate, object) • Ontology – OWL, DAML • Query language – SPARQL
Cloud Computing Frameworks • Proprietary • Amazon S3 • Amazon EC2 • Force.com • Open source tools • Hadoop – Apache's open source counterpart to Google's proprietary GFS file system • MapReduce – a functional programming paradigm using key-value pairs
Outline • Semantic Web Technologies & Cloud Computing Frameworks • Goal & Motivation • Current Approaches • System Architecture & Storage Schema • SPARQL Query by MapReduce • Query Plan Generation • Experiment • Future Work
Goal • To build efficient storage using Hadoop for large amounts of data (e.g. a billion triples) • To build an efficient query mechanism • Publish as an open source project • http://code.google.com/p/hadooprdf/ • Integrate with Jena as a Jena Model
Motivation • Current Semantic Web frameworks do not scale to a large number of triples, e.g. • Jena In-Memory, Jena RDB, Jena SDB • AllegroGraph • Virtuoso Universal Server • BigOWLIM • There is a lack of distributed frameworks with persistent storage • Hadoop runs on low-end hardware and provides a distributed framework with high fault tolerance and reliability
Outline • Semantic Web Technologies & Cloud Computing Frameworks • Goal & Motivation • Current Approaches • System Architecture & Storage Schema • SPARQL Query by MapReduce • Query Plan Generation • Experiment • Future Work
Current Approaches • State-of-the-art approach • Store RDF data in HDFS and query through MapReduce programming (our approach) • Traditional approach • Store data in HDFS and process queries outside of Hadoop • Done in the BIOMANTA project (http://biomanta.org/); details of its querying could not be found
Outline • Semantic Web Technologies & Cloud Computing Frameworks • Goal & Motivation • Current Approaches • System Architecture & Storage Schema • SPARQL Query by MapReduce • Query Plan Generation • Experiment • Future Work
System Architecture (diagram): RDF/XML from the LUBM Data Generator enters the Preprocessor (N-Triples Converter, Predicate Based Splitter, Object Type Based Splitter), which writes the preprocessed data to the Hadoop Distributed File System / Hadoop Cluster. A query (1) goes to the MapReduce Framework (Query Rewriter, Query Plan Generator, Plan Executor), which submits jobs (2) to the cluster and returns the answer (3).
Storage Schema • Data in N-Triples • Using namespaces • Example: http://utdallas.edu/res1 → utd:resource1 • Predicate based Splits (PS) • Split data according to predicates • Predicate Object based Splits (POS) • Split further according to the rdf:type of objects
Example • Input triples:
D0U0:GraduateStudent20 rdf:type lehigh:GraduateStudent
lehigh:University0 rdf:type lehigh:University
D0U0:GraduateStudent20 lehigh:memberOf lehigh:University0
• PS split:
File rdf_type: D0U0:GraduateStudent20 lehigh:GraduateStudent | lehigh:University0 lehigh:University
File lehigh_memberOf: D0U0:GraduateStudent20 lehigh:University0
• POS split:
File rdf_type_GraduateStudent: D0U0:GraduateStudent20
File rdf_type_University: lehigh:University0
File lehigh_memberOf_University: D0U0:GraduateStudent20 lehigh:University0
(A splitting sketch follows.)
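To make the splitting concrete, here is a minimal sketch of the PS/POS partitioning. This is not the authors' code: file naming follows the example above, input is assumed to already use substituted namespace prefixes, and the further split of non-rdf:type predicate files by object type (which needs a second pass over the rdf:type splits) is omitted.

```java
import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

/** Illustrative PS/POS splitter; file naming follows the slide's example. */
public class TripleSplitter {
    public static void main(String[] args) throws IOException {
        Map<String, PrintWriter> files = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.endsWith(".")) {                 // strip the N-Triples terminator
                    line = line.substring(0, line.length() - 1).trim();
                }
                // After namespace substitution: subject predicate object
                String[] t = line.split("\\s+", 3);
                if (t.length < 3) continue;
                if (t[1].equals("rdf:type")) {
                    // POS split: file named by predicate and object type; object dropped
                    write(files, "rdf_type_" + localName(t[2]), t[0]);
                } else {
                    // PS split: file named by predicate; subject and object kept
                    write(files, t[1].replace(':', '_'), t[0] + "\t" + t[2]);
                }
            }
        }
        for (PrintWriter w : files.values()) w.close();
    }

    private static void write(Map<String, PrintWriter> files, String name, String row)
            throws IOException {
        PrintWriter w = files.get(name);
        if (w == null) {
            w = new PrintWriter(new FileWriter(name));
            files.put(name, w);
        }
        w.println(row);
    }

    private static String localName(String term) {
        int i = term.lastIndexOf(':');    // e.g. lehigh:University -> University
        return i >= 0 ? term.substring(i + 1) : term;
    }
}
```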
Space Gain • Example: data size at the various preprocessing steps for LUBM1000
Outline • Semantic Web Technologies & Cloud Computing Frameworks • Goal & Motivation • Current Approaches • System Architecture & Storage Schema • SPARQL Query by MapReduce • Query Plan Generation • Experiment • Future Work
SPARQL Query • SPARQL – SPARQL Protocol And RDF Query Language • Example: SELECT ?x ?y WHERE { ?z foaf:name ?x . ?z foaf:age ?y } • The query is matched against the data to produce the result (an execution sketch follows)
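For comparison with the MapReduce approach, executing the example query against an in-memory Jena model (the framework the project plans to integrate with, per the goals slide) looks roughly like this. Package names assume a recent Apache Jena release and the data file name is hypothetical.

```java
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class SparqlDemo {
    public static void main(String[] args) {
        // Load RDF data into an in-memory model (file name is hypothetical)
        Model model = ModelFactory.createDefaultModel();
        model.read("file:data.nt", "N-TRIPLES");

        String queryString =
            "PREFIX foaf: <http://xmlns.com/foaf/0.1/> "
            + "SELECT ?x ?y WHERE { ?z foaf:name ?x . ?z foaf:age ?y }";

        Query query = QueryFactory.create(queryString);
        try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution sol = results.next();   // one row of variable bindings
                System.out.println(sol.get("x") + "\t" + sol.get("y"));
            }
        }
    }
}
```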
SPARQL Query by MapReduce • Example query: SELECT ?p WHERE { ?x rdf:type lehigh:Department . ?p lehigh:worksFor ?x . ?x lehigh:subOrganizationOf <http://University0.edu> } • Rewritten query: SELECT ?p WHERE { ?p lehigh:worksFor_Department ?x . ?x lehigh:subOrganizationOf <http://University0.edu> }
Inside a Hadoop MapReduce Job • INPUT • File subOrganizationOf_University: Department1 http://University0.edu | Department2 http://University1.edu • File worksFor_Department: Professor1 Department1 | Professor2 Department2 • MAP (filtering Object == http://University0.edu): Department1 → SO#http://University0.edu; Department1 → WF#Professor1; Department2 → WF#Professor2 • SHUFFLE & SORT: Department1 → SO#http://University0.edu, WF#Professor1; Department2 → WF#Professor2 • REDUCE: Department1 has both SO# and WF# values, so it joins; Department2 has no SO# value, so it is dropped • OUTPUT: WF#Professor1 (i.e. Professor1) • (A code sketch follows.)
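A hedged sketch of this job as Hadoop MapReduce code (Hadoop 2.x mapreduce API; class names and the SO#/WF# tags are illustrative, not the authors' implementation). The mapper tags each record with its source split and filters subOrganizationOf records on the bound university; the reducer joins on the department key.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class JoinJob {

    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // The source split tells us which triple pattern this record belongs to
            String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
            String[] t = value.toString().split("\\s+");
            if (file.startsWith("subOrganizationOf_University")) {
                // t = {department, university}: filter on the bound object in the map phase
                if (t[1].equals("http://University0.edu")) {
                    ctx.write(new Text(t[0]), new Text("SO#" + t[1]));
                }
            } else if (file.startsWith("worksFor_Department")) {
                // t = {professor, department}: key by department for the join
                ctx.write(new Text(t[1]), new Text("WF#" + t[0]));
            }
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            boolean matched = false;
            List<String> professors = new ArrayList<>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("SO#")) matched = true;  // the department passed the filter
                else professors.add(s.substring(3));      // strip the "WF#" tag
            }
            if (matched) {                                // emit ?p only for joined departments
                for (String p : professors) {
                    ctx.write(new Text(p), NullWritable.get());
                }
            }
        }
    }
}
```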
Outline • Semantic Web Technologies & Cloud Computing Frameworks • Goal & Motivation • Current Approaches • System Architecture & Storage Schema • SPARQL Query by MapReduce • Query Plan Generation • Experiment • Future Work
Query Plan Generation • Challenge • One Hadoop job may not be sufficient to answer a query • In a single Hadoop job, a single triple pattern cannot take part in joins on more than one variable simultaneously • Solution • An algorithm for query plan generation • A query plan is a sequence of Hadoop jobs that answers the query • Exploit the fact that in a single Hadoop job, a single triple pattern can take part in more than one join on a single variable simultaneously
Example • Example query: SELECT ?X ?Y ?Z WHERE { ?X pred1 obj1 . subj2 ?Z obj2 . subj3 ?X ?Z . ?Y pred4 obj4 . ?Y pred5 ?X } • Simplified view (joining variables per pattern): • X • Z • XZ • Y • XY
Join Graph & Hadoop Jobs • (Figure: the join graph for the example query — nodes 1–5 are the triple patterns, edges labeled with the shared variables X, Y, Z — together with two valid job groupings and one invalid job.)
Possible Query Plans • A. job1: (x, xz, xy) = yz; job2: (yz, y) = z; job3: (z, z) = done • (Figure: the join graph collapsing over Jobs 1–3 to the final result.)
Possible Query Plans • B. job1: (y, xy) = x and (z, xz) = x; job2: (x, x, x) = done • (Figure: the join graph collapsing over Jobs 1–2 to the result.)
Query Plan Generation • Goal: generate a minimum-cost job plan • Backtracking approach • Exhaustively generates all possible plans • Uses a two-coloring scheme (WHITE and BLACK) on a graph to find jobs • Two WHITE nodes cannot be adjacent • Chooses the best plan according to a user-defined cost model
Some Definitions • Triple Pattern, TP A triple pattern is an ordered collection of subject, predicate and object that appears in a SPARQL query WHERE clause. The subject, predicate and object can each be either a variable (unbound) or a concrete value (bound). • Triple Pattern Join, TPJ A triple pattern join is a join between two TPs on a variable • MapReduceJoin, MRJ A MapReduceJoin is a join between two or more triple patterns on a variable
Some Definitions • Job, JB A job JB is a Hadoop job in which one or more MRJs are done. A JB has a set of input files and a set of output files. • Conflicting MapReduceJoins, CMRJ A conflicting MapReduceJoin is a pair of MRJs on different variables sharing a triple pattern. • Non-conflicting MapReduceJoins, NCMRJ A non-conflicting MapReduceJoin is a pair of MRJs either not sharing any triple pattern, or sharing a triple pattern with both MRJs on the same variable.
Example • LUBM Query • SELECT ?X WHERE { • 1: ?X rdf:type ub:Chair . • 2: ?Y rdf:type ub:Department . • 3: ?X ub:worksFor ?Y . • 4: ?Y ub:subOrganizationOf <http://www.U0.edu> }
Example (contd.) • Figure: the Triple Pattern Graphs (TPG #1, TPG #2) and Join Graphs (JG #1, JG #2) for the LUBM query
Example (contd.) • The figure shows the TPG and JG for the query • On the left is the TPG, where each node represents a triple pattern in the query, named in the order of appearance • In the middle is the JG; each node in the JG represents an edge of the TPG • For this query, a feasible query plan (FQP) can have two jobs • The first handles the NCMRJ among triple patterns 2, 3 and 4 • The second handles the NCMRJ between triple pattern 1 and the output of the first join • An infeasible query plan (IQP) would have a first job with CMRJs among patterns 1, 3 and 4, and a second job with an MRJ between triple pattern 2 and the output of the first join • (A pairwise conflict check is sketched below.)
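Grounded in the CMRJ/NCMRJ definitions above, whether a set of MRJs can go into one Hadoop job reduces to a pairwise conflict check. A minimal illustrative sketch (the types are invented for illustration; pattern numbering follows the LUBM example):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative check that a set of MapReduceJoins is non-conflicting (all NCMRJ). */
public class ConflictCheck {

    /** An MRJ: a join variable plus the triple patterns (by number) it joins. */
    static final class MRJ {
        final String variable;
        final Set<Integer> patterns;
        MRJ(String variable, Integer... patterns) {
            this.variable = variable;
            this.patterns = new HashSet<>(Arrays.asList(patterns));
        }
    }

    /** CMRJ: two MRJs conflict iff they share a triple pattern but join on different variables. */
    static boolean conflicting(MRJ a, MRJ b) {
        if (a.variable.equals(b.variable)) return false;
        Set<Integer> shared = new HashSet<>(a.patterns);
        shared.retainAll(b.patterns);
        return !shared.isEmpty();
    }

    /** A set of MRJs can form a single valid Hadoop job iff no pair conflicts. */
    static boolean validJob(List<MRJ> joins) {
        for (int i = 0; i < joins.size(); i++)
            for (int j = i + 1; j < joins.size(); j++)
                if (conflicting(joins.get(i), joins.get(j))) return false;
        return true;
    }

    public static void main(String[] args) {
        // LUBM example: the join on ?Y over patterns 2, 3, 4 is fine on its own,
        // but adding a join on ?X over patterns 1, 3 conflicts (both use pattern 3).
        MRJ onY = new MRJ("Y", 2, 3, 4);
        MRJ onX = new MRJ("X", 1, 3);
        System.out.println(validJob(List.of(onY)));       // true
        System.out.println(validJob(List.of(onY, onX)));  // false
    }
}
```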
Query Plan Generation: Backtracking • Drawbacks of the backtracking approach • Computationally intractable • The search space is exponential in size
Steps a Hadoop Job Goes Through • The executable file (containing MapReduce code) is transferred from the client machine to the JobTracker • The JobTracker decides which TaskTrackers will execute the job • The executable file is distributed to the TaskTrackers over the network • Map processes start by reading data from HDFS • Map outputs are written to disk • Map outputs are read from disk, shuffled (transferred over the network to the TaskTrackers that will run the Reduce processes), sorted and written to disk • Reduce processes start by reading their input from disk • Reduce outputs are written to disk • (A minimal driver sketch follows.)
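For context, the client-side code that kicks off this sequence is an ordinary driver. A minimal sketch using the Hadoop 2.x API (paths are hypothetical, and the mapper/reducer classes refer to the earlier join sketch):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sparql-join");   // job set up on the client
        job.setJarByClass(Driver.class);                  // jar shipped to the cluster
        job.setMapperClass(JoinJob.JoinMapper.class);
        job.setReducerClass(JoinJob.JoinReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Inputs are the two POS split files; paths are hypothetical
        FileInputFormat.addInputPath(job, new Path("input/subOrganizationOf_University"));
        FileInputFormat.addInputPath(job, new Path("input/worksFor_Department"));
        FileOutputFormat.setOutputPath(job, new Path("output/join"));
        // Blocks while the job runs through map, shuffle/sort and reduce
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```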
MapReduce Data Flow • Figure from http://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow
Observations & an Approximate Solution • Observations • A Hadoop job has fixed overheads • Multiple read-writes to disk • Data transferred over the network multiple times • Even a “Hello World” MapReduce job takes a couple of seconds because of these fixed overheads • Approximate solution • Minimize the number of jobs • This is a good approximation since the overhead of each job (e.g. jar file distribution, multiple disk read-writes, multiple network data transfers) and of job switching is large
Greedy Algorithm: Terms • Joining variable: • A variable that is common to two or more triple patterns • Example: for patterns x, y, xy, xz, za, the variables x, y, z are joining; a is not • Complete elimination: • A join operation that eliminates a joining variable • y can be completely eliminated by joining (xy, y) • Partial elimination: • A join that partially eliminates a joining variable • After the complete elimination of y, x can be partially eliminated by joining (xz, x)
Greedy Algorithm: Terms • E-count: • The number of joining variables in the resultant pattern after a complete elimination • For the example patterns x, y, z, xy, xz: • E-count of x is 2 (resultant pattern: yz) • E-count of y is 1 (resultant pattern: x) • E-count of z is 1 (resultant pattern: x) • (A computation sketch follows.)
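E-count can be computed directly from the patterns' variable sets: union the sets of all patterns containing the candidate variable, remove that variable, and count the remaining joining variables. A minimal sketch reproducing the slide's numbers (the set-based representation is illustrative):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative e-count computation over triple patterns viewed as variable sets. */
public class ECount {

    /** Variables appearing in two or more patterns are joining variables. */
    static Set<String> joiningVariables(List<Set<String>> patterns) {
        Map<String, Integer> freq = new HashMap<>();
        for (Set<String> p : patterns)
            for (String v : p) freq.merge(v, 1, Integer::sum);
        Set<String> joining = new HashSet<>();
        for (Map.Entry<String, Integer> e : freq.entrySet())
            if (e.getValue() >= 2) joining.add(e.getKey());
        return joining;
    }

    /** E-count of v: joining variables left in the resultant pattern after eliminating v. */
    static int eCount(String v, List<Set<String>> patterns) {
        Set<String> joining = joiningVariables(patterns);
        Set<String> resultant = new HashSet<>();
        for (Set<String> p : patterns)
            if (p.contains(v)) resultant.addAll(p);   // union of all patterns joined on v
        resultant.remove(v);                          // v itself is completely eliminated
        resultant.retainAll(joining);
        return resultant.size();
    }

    public static void main(String[] args) {
        // The slide's example: patterns x, y, z, xy, xz
        List<Set<String>> patterns = List.of(
            Set.of("x"), Set.of("y"), Set.of("z"),
            Set.of("x", "y"), Set.of("x", "z"));
        System.out.println(eCount("x", patterns)); // 2 (resultant: yz)
        System.out.println(eCount("y", patterns)); // 1 (resultant: x)
        System.out.println(eCount("z", patterns)); // 1 (resultant: x)
    }
}
```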
Greedy Algorithm: Proposition • The maximum number of jobs required for any SPARQL query is • K, if K <= 1; min(ceil(1.71 * log2 K), N), if K > 1 • where K is the number of triple patterns in the query • and N is the total number of joining variables
Greedy Algorithm: Proof • If we make just one join per joining variable, all joins can be done in N jobs (one join per job) • Special case: • Suppose each joining variable is common to exactly two triple patterns • Example: ab, bc, cd, de, ef, … (a chain) • In each job we can make K/2 joins, which reduces the number of patterns to half (K/2) • So each job halves the number of patterns • Therefore the total number of jobs required is log2 K < 1.71 * log2 K
Greedy Algorithm: Proof (Continued) • General case: • Sort the joining variables in decreasing order of their frequency across triple patterns • Let vi have frequency fi • Then fi <= fi-1 <= fi-2 <= … <= f1 • Note that if f1 = 2, the problem reduces to the special case • Therefore f1 > 2 in the general case; also fN >= 2 • Now we keep joining on v1, v2, …, vN as long as there is no conflict
Greedy Algorithm: Proof (Continued) • Suppose L triple patterns could not be reduced because each is left with one or more conflicting joining variables (e.g. try reducing xy, yz, zx) • Then M >= L joins have been performed, producing M patterns (M + L patterns remain in total) • Since each join involves at least 2 patterns: • 2M + L <= K • 2(L + e) + L <= K (letting M = L + e, e >= 0) • 3L + 2e <= K • 2L + (4/3)e <= (2/3)K (multiplying both sides by 2/3)
Greedy Algorithm: Proof (Continued) • Since e >= 0, 2L + e <= (2/3) * K • So each job reduces the number of triple patterns to at most 2/3 • Therefore, after Q jobs: K * (2/3)^Q >= 1 >= K * (2/3)^(Q+1) • Equivalently (3/2)^Q <= K <= (3/2)^(Q+1), so Q <= log3/2 K = 1.71 * log2 K <= Q + 1 • In most real-world scenarios, queries with more than 100 triple patterns are extremely rare • So the maximum number of jobs required in that case is ceil(1.71 * log2 100) = 12
Greedy Algorithm • Early elimination heuristic: • Make as many complete eliminations in each job as possible • This leaves the fewest variables to join in the next job • Hence we choose first the join with the least e-count (the fewest joining variables in the resultant pattern)
Greedy Algorithm • Step I: remove non-joining variables • Step II: sort the variables by e-count • Step III: choose a variable for elimination as long as a complete or partial elimination is possible; these joins make up one job • Step IV: if more triple patterns remain, continue from Step II • (A sketch of the planner follows.)
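Putting the four steps together, here is a hedged sketch of the greedy planner. It reuses the e-count helpers above, treats patterns as joining-variable sets, and simplifies Step III to complete eliminations only (partial-elimination details are omitted):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hedged sketch of the greedy job planner over patterns viewed as variable sets. */
public class GreedyPlanner {

    /** Returns a plan: one list of eliminated variables per Hadoop job. */
    static List<List<String>> plan(List<Set<String>> patterns) {
        List<List<String>> jobs = new ArrayList<>();
        while (true) {
            Set<String> joining = ECount.joiningVariables(patterns);  // Step I
            if (joining.isEmpty()) break;                 // no joins left: plan complete

            // Step II: sort joining variables by e-count (fewest leftover variables first)
            List<String> vars = new ArrayList<>(joining);
            List<Set<String>> ps = patterns;
            vars.sort(Comparator.comparingInt(v -> ECount.eCount(v, ps)));

            // Step III: join on each variable whose patterns are still unused in this job
            List<String> job = new ArrayList<>();
            boolean[] used = new boolean[patterns.size()];
            List<Set<String>> next = new ArrayList<>();
            for (String v : vars) {
                List<Integer> group = new ArrayList<>();
                for (int i = 0; i < patterns.size(); i++)
                    if (!used[i] && patterns.get(i).contains(v)) group.add(i);
                if (group.size() >= 2) {                  // a join on v is possible
                    Set<String> resultant = new HashSet<>();
                    for (int i : group) { resultant.addAll(patterns.get(i)); used[i] = true; }
                    resultant.remove(v);                  // complete elimination of v
                    next.add(resultant);
                    job.add(v);
                }
            }
            if (job.isEmpty()) break;                     // nothing joinable (simplified)
            for (int i = 0; i < patterns.size(); i++)
                if (!used[i]) next.add(patterns.get(i));  // untouched patterns carry over
            jobs.add(job);                                // Step IV: next job if needed
            patterns = next;
        }
        return jobs;
    }

    public static void main(String[] args) {
        // The earlier 5-pattern example: X, Z, XZ, Y, XY -> two jobs, matching plan B
        List<Set<String>> q = List.of(
            Set.of("x"), Set.of("z"), Set.of("x", "z"),
            Set.of("y"), Set.of("x", "y"));
        System.out.println(plan(q));  // e.g. [[z, y], [x]] (order of ties may vary)
    }
}
```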
Outline • Semantic Web Technologies & Cloud Computing Frameworks • Goal & Motivation • Current Approaches • System Architecture & Storage Schema • SPARQL Query by MapReduce • Query Plan Generation • Experiment • Future Work
Experiment • Dataset and queries • Cluster description • Comparison with the Jena In-Memory, SDB and BigOWLIM frameworks • Experiments with the number of Reducers • Algorithm runtimes: Greedy vs. Exhaustive • Some query results
Dataset And Queries • LUBM • Dataset generator • 14 benchmark queries • Generates data for a number of imaginary universities • Used by many researchers for query execution performance comparison
Our Clusters • 10-node cluster in the SAIAL lab • Each node: 4 GB main memory, Intel Pentium IV 3.0 GHz processor, 640 GB hard drive • OpenCirrus HP Labs test bed