Storing RDF Data in Hadoop And Retrieval
Pankil Doshi, Asif Mohammed, Mohammad Farhan Husain, Dr. Latifur Khan, Dr. Bhavani Thuraisingham
Goal
• To build efficient storage using Hadoop for petabytes of data
• To build an efficient query mechanism
• Possible outcomes
  • Open source framework for RDF
  • Integration with Jena
Possible Approaches
• Store RDF data in HDFS and query through MapReduce programming
  • Our current approach
• Store RDF data in HDFS and process queries outside of Hadoop
  • Done in the BIOMANTA [1] project; no details published, however
• HBase
  • Currently being worked on by another team in the Semantic Web lab
Dataset and Queries
• LUBM [2]
  • Dataset generator
  • 14 benchmark queries
  • Generates data for imaginary universities
  • Used for query execution performance comparison by many researchers
Our Clusters
• 4-node cluster in the Semantic Web lab
• 10-node cluster in the SAIAL lab
  • 4 GB main memory
  • Intel Pentium IV 3.0 GHz processor
  • 640 GB hard drive
• OpenCirrus HP Labs test bed
  • Sponsor: Andy Seaborne, HP Labs
Tasks Completed / In Progress
• Set up Hadoop cluster
• Generate, preprocess & insert data
• Devise an algorithm to produce MapReduce code for a SPARQL query
• Code for the 14 queries
• Cascading the output of one job to another job as input without using the hard disk
Two Storage Approaches
• Multiple-file approach:
  • Dump the files as generated by the LUBM generator, possibly merging some
  • Each line in a file contains subject, predicate and object
• Predicate-based approach:
  • Divide the files based on predicate
  • The file name is the predicate name
  • Each line then contains only subject and object
  • On average there are about 20 different types of predicates
• Common preprocessing: adding prefixes, e.g. http://www.University10Department5:.... == U10D5:….
Example of Predicate-Based File Division
Input triples:
D0U0:Graduate20 ub:type lehigh:GraduateStudent
D0U0:Graduate20 ub:memberOf lehigh:University0

Filename: type
D0U0:Graduate20 lehigh:GraduateStudent … … …

Filename: memberOf
D0U0:Graduate20 lehigh:University0 … …

Filename: type_GraduateStudent
D0U0:Graduate20 …

Filename: memberOf_University
D0U0:Graduate20 lehigh:University0 …
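The split itself is mechanical; below is a minimal, hypothetical Java sketch of this preprocessing step (not the project's actual code), assuming one whitespace-separated triple per line as in the example above. Class and method names are ours.

import java.io.*;
import java.util.*;

public class PredicateSplitter {
    public static void main(String[] args) throws IOException {
        // One output writer per predicate file, created on demand.
        Map<String, PrintWriter> outputs = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.trim().split("\\s+", 3);
                if (parts.length < 3) continue;         // skip malformed lines
                String predicate = localName(parts[1]); // ub:memberOf -> memberOf
                if (predicate.equals("type")) {
                    // rdf:type triples go to type_<Class> files holding only the subject
                    String file = "type_" + localName(parts[2]);
                    outputs.computeIfAbsent(file, PredicateSplitter::newWriter)
                           .println(parts[0]);
                } else {
                    // The predicate is the file name; only subject and object are stored
                    outputs.computeIfAbsent(predicate, PredicateSplitter::newWriter)
                           .println(parts[0] + "\t" + parts[2]);
                }
            }
        }
        outputs.values().forEach(PrintWriter::close);
    }

    private static String localName(String uri) {
        int i = Math.max(uri.lastIndexOf(':'), uri.lastIndexOf('#'));
        return i >= 0 ? uri.substring(i + 1) : uri;
    }

    private static PrintWriter newWriter(String file) {
        try {
            return new PrintWriter(new FileWriter(file, true));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}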
Sample Query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT ?X WHERE {
  ?X rdf:type ub:Publication .
  ?X ub:publicationAuthor D0U0:AssistantProfessor0 }

• Map function:
  • Look at which file (key) the data (value) comes from and filter it according to the conditions. For example:
    • If the data is from file “type_Publication”, output the (subject, file) pair
    • If the data is from a “publicationAuthor_*” file, look for D0U0:AssistantProfessor0 as the object
• Reduce function:
  • Collect all the required values per key according to the conditions and output the key as the result
  • E.g., keep only those subjects matched by both ub:Publication and D0U0:AssistantProfessor0
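To make this concrete, here is a hedged sketch of what such a map/reduce pair could look like against the Hadoop MapReduce API. The class names are ours, and the file-name checks assume the predicate-based layout shown earlier.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SampleQuery {
    public static class QueryMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The input file name tells us which triple pattern this line can match.
            String file = ((FileSplit) context.getInputSplit()).getPath().getName();
            String[] parts = value.toString().trim().split("\\s+");
            if (file.equals("type_Publication")) {
                // Line holds only the subject; tag it as matching the type pattern.
                context.write(new Text(parts[0]), new Text("T"));
            } else if (file.startsWith("publicationAuthor")) {
                // Line holds subject and object; keep only the wanted author.
                if (parts.length > 1 && parts[1].equals("D0U0:AssistantProfessor0")) {
                    context.write(new Text(parts[0]), new Text("A"));
                }
            }
        }
    }

    public static class QueryReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            boolean hasType = false, hasAuthor = false;
            for (Text v : values) {
                if (v.toString().equals("T")) hasType = true;
                if (v.toString().equals("A")) hasAuthor = true;
            }
            // ?X answers the query only if both triple patterns matched.
            if (hasType && hasAuthor) context.write(key, NullWritable.get());
        }
    }
}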
Algorithm
SELECT ?X, ?Y WHERE {
  1. ?X rdf:type ub:Chair .
  2. ?Y rdf:type ub:Department .
  3. ?X ub:worksFor ?Y .
  4. ?Y ub:subOrganizationOf <http://www.University0.edu> }

[Query graph omitted: pattern 1 has variable X, patterns 2 and 4 have Y, pattern 3 has X and Y; |E| = 4]

• Job 1 map output keys:
  • Y – patterns 2, 3, 4 (3 joins)
• Job 1 joins: 3
• 1 join left, so another job is needed
Algorithm (contd.)
[Query graph omitted: intermediate result A (patterns 2, 3, 4) has variables X, Y; B (pattern 1) has X; they share X]

• Job 2 map output key:
  • X – joins A and B (1 join)
• Job 2 joins: 1
• No joins left, so no more jobs are needed
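The following illustrative sketch captures one reading of this planning step (not necessarily the authors' exact algorithm): each job greedily joins on the variable shared by the most remaining triple patterns, and jobs are added until no joins remain. On the example query it yields the same two jobs.

import java.util.*;

public class JoinPlanner {
    // Each triple pattern is represented by its set of variables, e.g. {"X","Y"}.
    public static int countJobs(List<Set<String>> patterns) {
        List<Set<String>> remaining = new ArrayList<>();
        for (Set<String> p : patterns) remaining.add(new HashSet<>(p));
        int jobs = 0;
        while (remaining.size() > 1) {
            // Pick the variable occurring in the most remaining patterns.
            Map<String, Integer> counts = new HashMap<>();
            for (Set<String> p : remaining)
                for (String v : p) counts.merge(v, 1, Integer::sum);
            String best = Collections.max(counts.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            if (counts.get(best) < 2) break; // nothing left to join on
            // One job: merge all patterns containing the chosen variable.
            Set<String> merged = new HashSet<>();
            Iterator<Set<String>> it = remaining.iterator();
            while (it.hasNext()) {
                Set<String> p = it.next();
                if (p.contains(best)) { merged.addAll(p); it.remove(); }
            }
            remaining.add(merged);
            jobs++;
        }
        return jobs;
    }

    public static void main(String[] args) {
        // The four patterns of the example query: 1:{X}, 2:{Y}, 3:{X,Y}, 4:{Y}
        List<Set<String>> q = List.of(
                Set.of("X"), Set.of("Y"), Set.of("X", "Y"), Set.of("Y"));
        System.out.println(countJobs(q)); // prints 2, matching the two jobs above
    }
}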
Some Query Results
[Charts omitted. Horizontal axis: number of triples; vertical axis: time in milliseconds]
Query Preprocessing
• Original query 2:
  ?X rdf:type ub:GraduateStudent .
  ?Y rdf:type ub:University .
  ?Z rdf:type ub:Department .
  ?X ub:memberOf ?Z .
  ?Z ub:subOrganizationOf ?Y .
  ?X ub:undergraduateDegreeFrom ?Y
• Rewritten (see the sketch below):
  ?X rdf:type ub:GraduateStudent .
  ?X ub:memberOf_Department ?Z .
  ?Z ub:subOrganizationOf_University ?Y .
  ?X ub:undergraduateDegreeFrom_University ?Y
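A minimal sketch of this rewriting step, under our reading of the slide: an rdf:type pattern is dropped when its variable's class can be folded into the predicate of another pattern that has the variable as object, while type patterns on otherwise unconstrained variables (such as ?X here) are kept. All names below are illustrative.

import java.util.*;

public class QueryRewriter {
    // A triple pattern as three strings, e.g. ("?X", "ub:memberOf", "?Z").
    record Pattern(String s, String p, String o) {}

    public static List<Pattern> rewrite(List<Pattern> query) {
        // 1. Collect variable -> class bindings from rdf:type patterns.
        Map<String, String> classOf = new HashMap<>();
        for (Pattern t : query)
            if (t.p().equals("rdf:type")) classOf.put(t.s(), localName(t.o()));

        // 2. Fold the object's class into each non-type predicate; drop the
        //    type patterns that are now redundant.
        List<Pattern> out = new ArrayList<>();
        for (Pattern t : query) {
            if (t.p().equals("rdf:type")) {
                boolean foldedElsewhere = query.stream().anyMatch(
                        u -> u != t && !u.p().equals("rdf:type")
                                && u.o().equals(t.s()));
                if (!foldedElsewhere) out.add(t); // e.g. ?X rdf:type ub:GraduateStudent
            } else if (classOf.containsKey(t.o())) {
                out.add(new Pattern(t.s(), t.p() + "_" + classOf.get(t.o()), t.o()));
            } else {
                out.add(t);
            }
        }
        return out;
    }

    private static String localName(String uri) {
        int i = Math.max(uri.lastIndexOf(':'), uri.lastIndexOf('#'));
        return i >= 0 ? uri.substring(i + 1) : uri;
    }
}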
Parallel Experiment with Pig
• Script for query 2:

/* Load statements */
GS  = LOAD 'type_GraduateStudent' AS (gs_subject:chararray);
MO  = LOAD 'memberOf_Department' AS (mo_subject:chararray, mo_object:chararray);
SOF = LOAD 'subOrganizationOf_University' AS (sof_subject:chararray, sof_object:chararray);
UDF = LOAD 'undergraduateDegreeFrom_University' AS (udf_subject:chararray, udf_object:chararray);

/* Joins */
MO_UDF_GS = JOIN GS BY gs_subject, UDF BY udf_subject, MO BY mo_subject PARALLEL 8;
MO_UDF_GS = FOREACH MO_UDF_GS GENERATE mo_subject, udf_object, mo_object;
MO_UDF_GS_SOF = JOIN SOF BY (sof_subject, sof_object), MO_UDF_GS BY (mo_object, udf_object);
MO_UDF_GS_SOF = FOREACH MO_UDF_GS_SOF GENERATE mo_subject, udf_object, mo_object;

/* Store query answer */
STORE MO_UDF_GS_SOF INTO 'Query2' USING PigStorage('\t');
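Such a script can be submitted to the cluster with, for instance, pig -x mapreduce query2.pig (the script file name is ours); the PARALLEL 8 clause sets the number of reduce tasks used for the first join.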
Parallel Experiment with Pig (contd.)
• 2 jobs created for query 2
• For 330 million triples, Pig answers in 20 minutes
• The direct MapReduce approach takes 10 minutes
Future Work
• Run all 14 queries for 100 million, 200 million, …, 1 billion triples and compare with the Jena In-Memory, RDB, SDB and TDB models
• Cascading the output of one job to another job as input without using the hard disk
• Generic MapReduce code
• Proof of the algorithm
• Modification of the algorithm for queries with optional triple patterns
• Indexing, summary statistics
References
[1] BIOMANTA: http://www.biomanta.org/
[2] LUBM: http://swat.cse.lehigh.edu/projects/lubm/