
Storing RDF Data in Hadoop And Retrieval


Presentation Transcript


  1. Storing RDF Data in Hadoop And Retrieval Pankil Doshi, Asif Mohammed, Mohammad Farhan Husain, Dr. Latifur Khan, Dr. Bhavani Thuraisingham

  2. Goal • To build efficient storage using Hadoop for petabytes of data • To build an efficient query mechanism • Possible outcomes • Open-source framework for RDF • Integration with Jena

  3. Possible Approaches • Store RDF data in HDFS and query through MapReduce programming • Our current approach • Store RDF data in HDFS and process queries outside of Hadoop • Done in the BIOMANTA [1] project, though no details are published • HBase • Currently being worked on by another team in the Semantic Web lab

  4. Dataset And Queries • LUBM [2] • Dataset generator • 14 benchmark queries • Generates data about imaginary universities • Used for query-execution performance comparison by many researchers

  5. Our Clusters • 4 node cluster in Semantic Web lab • 10 node cluster in SAIAL lab • 4 GB main memory • Intel Pentium IV 3.0 GHz processor • 640 GB hard drive • OpenCirrus HP labs test bed • Sponsor: Andy Seaborne, HP Labs

  6. Tasks Completed/In Progress • Set up Hadoop cluster • Generate, preprocess & insert data • Devise an algorithm to produce MapReduce code for a SPARQL query • Code for the 14 queries • Cascade the output of one job to another job as input without using the hard disk

  7. Two Storage Approaches • Multiple File Approach: • Dump the files as generated by the LUBM generator, possibly merging some • Each line in a file contains subject, predicate and object • Predicate Based Approach: • Divide the files based on predicate • The file name is the predicate name • Each line then contains only subject and object • On average there are about 20 different predicates • Common preprocessing: add prefixes, e.g. http://www.University10Department5:.... == U10D5:.... (a sketch of this step follows below)
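A minimal sketch of the predicate-based split and prefix shortening, assuming triples arrive one per line as whitespace-separated subject, predicate, object; the class name, prefix table, and output layout here are illustrative assumptions, not the project's actual code:

import java.io.*;
import java.util.*;

// Sketch: split triple lines into one file per predicate, keeping only
// subject and object, after shortening common URI prefixes.
public class PredicateSplitter {
    // Hypothetical prefix table; the real preprocessing derives these from
    // the LUBM naming conventions (e.g. University10 / Department5 -> U10D5).
    static final Map<String, String> PREFIXES = Map.of(
        "http://www.Department5.University10.edu/", "U10D5:");

    static String shorten(String uri) {
        for (Map.Entry<String, String> e : PREFIXES.entrySet())
            if (uri.startsWith(e.getKey()))
                return e.getValue() + uri.substring(e.getKey().length());
        return uri;
    }

    public static void main(String[] args) throws IOException {
        Map<String, PrintWriter> files = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] t = line.trim().split("\\s+", 3); // subject predicate object
                if (t.length < 3) continue;
                String predicate = shorten(t[1]).replaceAll("[^A-Za-z0-9_]", "_");
                // One output file per predicate; each line keeps subject and object only.
                PrintWriter out = files.computeIfAbsent(predicate, p -> {
                    try { return new PrintWriter(new FileWriter(p, true)); }
                    catch (IOException ex) { throw new UncheckedIOException(ex); }
                });
                out.println(shorten(t[0]) + "\t" + shorten(t[2]));
            }
        }
        files.values().forEach(PrintWriter::close);
    }
}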

  8. Example Of Predicate Based File Division
Input triples:
D0U0:Graduate20 ub:type lehigh:GraduateStudent
D0U0:Graduate20 ub:memberOf lehigh:University0
Filename: type — D0U0:Graduate20 lehigh:GraduateStudent …
Filename: memberOf — D0U0:Graduate20 lehigh:University0 …
Filename: type_GraduateStudent — D0U0:Graduate20 …
Filename: memberOf_University — D0U0:Graduate20 lehigh:University0 …

  9. Sample Query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT ?X WHERE { ?X rdf:type ub:Publication . ?X ub:publicationAuthor D0U0:AssistantProfessor0 }
• Map function: • Look at which file (key) the data (value) comes from and filter it according to the conditions. For example: • If the data is from file "type_Publication", output the (subject, file tag) pair • If the data is from file "publicationAuthor_*", look for D0U0:AssistantProfessor0 as the object and output the same kind of pair • Reduce function: • Collect all values for each subject and output the subject as a result only if every condition is met • E.g. keep only those subjects having both ub:Publication & D0U0:AssistantProfessor0 • A Java sketch of this pair follows below
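A hedged Java sketch of that map/reduce pair, assuming Hadoop's standard Mapper/Reducer API, tab-separated lines in the predicate files, and illustrative class and tag names (this is our reading of the slide, not the authors' code):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map: emit (subject, tag) pairs from the two relevant predicate files.
class SelectionMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
        String[] t = value.toString().split("\t"); // subject [object]
        if (file.equals("type_Publication")) {
            ctx.write(new Text(t[0]), new Text("PUB"));
        } else if (file.startsWith("publicationAuthor")
                && t.length > 1 && t[1].equals("D0U0:AssistantProfessor0")) {
            ctx.write(new Text(t[0]), new Text("AUTH"));
        }
    }
}

// Reduce: a subject answers the query only if both tags are present.
class SelectionReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text subject, Iterable<Text> tags, Context ctx)
            throws IOException, InterruptedException {
        boolean pub = false, auth = false;
        for (Text tag : tags) {
            if (tag.toString().equals("PUB")) pub = true;
            else if (tag.toString().equals("AUTH")) auth = true;
        }
        if (pub && auth) ctx.write(subject, new Text("")); // binding for ?X
    }
}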

  10. Algorithm SELECT ?X, ?Y WHERE {
• 1: ?X rdf:type ub:Chair .
• 2: ?Y rdf:type ub:Department .
• 3: ?X ub:worksFor ?Y .
• 4: ?Y ub:subOrganizationOf <http://www.University0.edu> }
[Diagram omitted: the query as a graph with |E| = 4 triple patterns; pattern 1 binds X, patterns 2 and 4 bind Y, pattern 3 binds X and Y]
• Job 1 map output keys: • Y – patterns 2, 3, 4 (3 joins) • Job 1 joins: 3 • 1 join left, so one more job is needed

  11. Algorithm (contd.)
[Diagram omitted: A = output of job 1 (patterns 2, 3, 4), binding X and Y; B = pattern 1, binding X]
• Job 2 map output key: • X – A, B (1 join) • Job 2 joins: 1 • No joins left, no more jobs needed • A sketch of this greedy join planning follows below
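A small sketch of the join-planning idea from slides 10 and 11, under the simplifying assumption (ours, not the authors') that each triple pattern is represented just by the set of variables it binds: each round, i.e. each MapReduce job, joins all patterns sharing the most frequent variable, until no joins remain.

import java.util.*;

// Greedy join planner: each pattern is the set of variables it binds.
public class JoinPlanner {
    public static int planJobs(List<Set<String>> patterns) {
        int jobs = 0;
        while (patterns.size() > 1) {
            // Count how many patterns mention each variable.
            Map<String, Integer> freq = new HashMap<>();
            for (Set<String> p : patterns)
                for (String v : p) freq.merge(v, 1, Integer::sum);
            String best = Collections.max(freq.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            if (freq.get(best) < 2) break; // no shared variable: nothing to join
            // Merge every pattern containing the chosen variable into one.
            Set<String> merged = new HashSet<>();
            List<Set<String>> rest = new ArrayList<>();
            for (Set<String> p : patterns)
                if (p.contains(best)) merged.addAll(p); else rest.add(p);
            rest.add(merged);
            patterns = rest;
            jobs++;
        }
        return jobs;
    }

    public static void main(String[] args) {
        // The query from slide 10: patterns 1..4 over variables X and Y.
        List<Set<String>> q = new ArrayList<>(List.of(
            Set.of("X"), Set.of("Y"), Set.of("X", "Y"), Set.of("Y")));
        System.out.println(planJobs(q)); // prints 2: job 1 joins on Y, job 2 on X
    }
}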

  12. Some Query Results [Chart omitted: query runtimes; horizontal axis: number of triples, vertical axis: time in milliseconds]

  13. Query Preprocessing • Original query 2:
?X rdf:type ub:GraduateStudent .
?Y rdf:type ub:University .
?Z rdf:type ub:Department .
?X ub:memberOf ?Z .
?Z ub:subOrganizationOf ?Y .
?X ub:undergraduateDegreeFrom ?Y
• Rewritten (the rdf:type patterns for ?Y and ?Z are folded into the predicate file names; a sketch of this rule follows below):
?X rdf:type ub:GraduateStudent .
?X ub:memberOf_Department ?Z .
?Z ub:subOrganizationOf_University ?Y .
?X ub:undergraduateDegreeFrom_University ?Y
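A hypothetical sketch of this rewriting rule, generalizing from the one example above: an rdf:type pattern is dropped when its variable also appears as the object of another pattern, and that pattern's predicate gets the class name appended to match the split-file names. The representation and all names here are ours:

import java.util.*;

// Rewrite sketch: fold "?V rdf:type Class" into the predicate of the
// pattern whose object is ?V, yielding names like memberOf_Department.
public class QueryRewriter {
    record Pattern(String s, String p, String o) {}

    public static List<Pattern> rewrite(List<Pattern> in) {
        Map<String, String> typeOf = new HashMap<>(); // variable -> class local name
        for (Pattern t : in)
            if (t.p().equals("rdf:type") && t.s().startsWith("?"))
                typeOf.put(t.s(), t.o().substring(t.o().indexOf(':') + 1));
        List<Pattern> out = new ArrayList<>();
        for (Pattern t : in) {
            if (t.p().equals("rdf:type")) {
                // Keep the type pattern only if its variable is never an object
                // elsewhere (like ?X above); otherwise it is folded away.
                boolean folded = in.stream().anyMatch(u -> u.o().equals(t.s()));
                if (!folded) out.add(t);
            } else {
                String cls = typeOf.get(t.o());
                out.add(cls == null ? t : new Pattern(t.s(), t.p() + "_" + cls, t.o()));
            }
        }
        return out;
    }
}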

  14. Parallel Experiment with Pig • Script for query 2:
/* Load statements */
GS = LOAD 'type_GraduateStudent' AS (gs_subject:chararray);
MO = LOAD 'memberOf_Department' AS (mo_subject:chararray, mo_object:chararray);
SOF = LOAD 'subOrganizationOf_University' AS (sof_subject:chararray, sof_object:chararray);
UDF = LOAD 'undergraduateDegreeFrom_University' AS (udf_subject:chararray, udf_object:chararray);
/* Joins */
MO_UDF_GS = JOIN GS BY gs_subject, UDF BY udf_subject, MO BY mo_subject PARALLEL 8;
MO_UDF_GS = FOREACH MO_UDF_GS GENERATE mo_subject, udf_object, mo_object;
MO_UDF_GS_SOF = JOIN SOF BY (sof_subject, sof_object), MO_UDF_GS BY (mo_object, udf_object);
MO_UDF_GS_SOF = FOREACH MO_UDF_GS_SOF GENERATE mo_subject, udf_object, mo_object;
/* Store query answer */
STORE MO_UDF_GS_SOF INTO 'Query2' USING PigStorage('\t');
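Note how the script mirrors the two MapReduce jobs reported on the next slide: the first JOIN combines the three relations keyed on the student subject (?X) in one job, and the second JOIN matches the resulting (?Z, ?Y) pairs against subOrganizationOf_University in another.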

  15. Parallel Experiment with Pig • 2 jobs created for query 2 • For 330 million triples, Pig answers in 20 minutes • The direct MapReduce approach takes 10 minutes

  16. Future Work • Run all 14 queries for 100 million, 200 million, …, 1 billion triples and compare with the Jena In-Memory, RDB, SDB, and TDB models • Cascade the output of one job to another job as input without using the hard disk • Generic MapReduce code • Proof of the algorithm • Modification of the algorithm for queries with optional triple patterns • Indexing, summary statistics

  17. References • [1] BIOMANTA: http://www.biomanta.org/ • [2] LUBM: http://swat.cse.lehigh.edu/projects/lubm/
