Storing RDF Data in Hadoop And Retrieval
Pankil Doshi, Asif Mohammed, Mohammad Farhan Husain, Dr. Latifur Khan, Dr. Bhavani Thuraisingham
Goal
• To build efficient storage using Hadoop for petabytes of data
• To build an efficient query mechanism
• Possible outcomes
  • Open source framework for RDF
  • Integration with Jena
Possible Approaches
• Store RDF data in HDFS and query through MapReduce programming
  • Our current approach
• Store RDF data in HDFS and process queries outside of Hadoop
  • Done in the BIOMANTA [1] project; no details published, however
• HBase
  • Currently being worked on by another team in the Semantic Web lab
Dataset and Queries
• LUBM [2]
  • Dataset generator
  • 14 benchmark queries
  • Generates data for imaginary universities
  • Used for query execution performance comparison by many researchers
Our Clusters
• 4-node cluster in the Semantic Web lab
• 10-node cluster in the SAIAL lab
  • 4 GB main memory
  • Intel Pentium IV 3.0 GHz processor
  • 640 GB hard drive
• OpenCirrus HP Labs test bed
  • Sponsor: Andy Seaborne, HP Labs
Tasks Completed / In Progress
• Set up Hadoop cluster
• Generate, preprocess & insert data
• Devise an algorithm to produce MapReduce code for a SPARQL query
• Code for the 14 queries
• Cascading the output of one job to another job as input without using the hard disk
Two Storage Approaches
• Multiple-file approach:
  • Dump the files as generated by the LUBM generator, possibly merging some
  • Each line in a file contains subject, predicate and object
• Predicate-based approach:
  • Divide the files based on predicate
  • The file name is the predicate name
  • Each line then contains only subject and object
  • On average there are about 20 different types of predicates
• Common preprocessing: adding prefixes, e.g. http://www.University10Department5:.... == U10D5:….
Example of Predicate-Based File Division
Input triples:
D0U0:Graduate20 ub:type lehigh:GraduateStudent
D0U0:Graduate20 ub:memberOf lehigh:University0

Filename: type
D0U0:Graduate20 lehigh:GraduateStudent … … …

Filename: memberOf
D0U0:Graduate20 lehigh:University0 … …

Filename: type_GraduateStudent
D0U0:Graduate20 …

Filename: memberOf_University
D0U0:Graduate20 lehigh:University0 …
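The split itself is mechanical; below is a minimal, hypothetical Java sketch of this preprocessing step (not the project's actual code), assuming one whitespace-separated triple per line as in the example above. Class and method names are ours.

import java.io.*;
import java.util.*;

public class PredicateSplitter {
    public static void main(String[] args) throws IOException {
        // One output writer per predicate file, created on demand.
        Map<String, PrintWriter> outputs = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.trim().split("\\s+", 3);
                if (parts.length < 3) continue;         // skip malformed lines
                String predicate = localName(parts[1]); // ub:memberOf -> memberOf
                if (predicate.equals("type")) {
                    // rdf:type triples go to type_<Class> files holding only the subject
                    String file = "type_" + localName(parts[2]);
                    outputs.computeIfAbsent(file, PredicateSplitter::newWriter)
                           .println(parts[0]);
                } else {
                    // The predicate is the file name; only subject and object are stored
                    outputs.computeIfAbsent(predicate, PredicateSplitter::newWriter)
                           .println(parts[0] + "\t" + parts[2]);
                }
            }
        }
        outputs.values().forEach(PrintWriter::close);
    }

    private static String localName(String uri) {
        int i = Math.max(uri.lastIndexOf(':'), uri.lastIndexOf('#'));
        return i >= 0 ? uri.substring(i + 1) : uri;
    }

    private static PrintWriter newWriter(String file) {
        try {
            return new PrintWriter(new FileWriter(file, true));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}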
Sample Query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT ?X WHERE {
  ?X rdf:type ub:Publication .
  ?X ub:publicationAuthor D0U0:AssistantProfessor0 }

• Map function:
  • Look at which file (key) the data (value) comes from and filter it according to the conditions. For example:
    • If the data is from file “type_Publication”, output the (subject, file) pair
    • If the data is from a “publicationAuthor_*” file, look for D0U0:AssistantProfessor0 as the object
• Reduce function:
  • Collect all the required values per key according to the conditions and output the key as the result
  • E.g., keep only those subjects matched by both ub:Publication and D0U0:AssistantProfessor0
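To make this concrete, here is a hedged sketch of what such a map/reduce pair could look like against the Hadoop MapReduce API. The class names are ours, and the file-name checks assume the predicate-based layout shown earlier.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SampleQuery {
    public static class QueryMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The input file name tells us which triple pattern this line can match.
            String file = ((FileSplit) context.getInputSplit()).getPath().getName();
            String[] parts = value.toString().trim().split("\\s+");
            if (file.equals("type_Publication")) {
                // Line holds only the subject; tag it as matching the type pattern.
                context.write(new Text(parts[0]), new Text("T"));
            } else if (file.startsWith("publicationAuthor")) {
                // Line holds subject and object; keep only the wanted author.
                if (parts.length > 1 && parts[1].equals("D0U0:AssistantProfessor0")) {
                    context.write(new Text(parts[0]), new Text("A"));
                }
            }
        }
    }

    public static class QueryReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            boolean hasType = false, hasAuthor = false;
            for (Text v : values) {
                if (v.toString().equals("T")) hasType = true;
                if (v.toString().equals("A")) hasAuthor = true;
            }
            // ?X answers the query only if both triple patterns matched.
            if (hasType && hasAuthor) context.write(key, NullWritable.get());
        }
    }
}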
Algorithm
SELECT ?X, ?Y WHERE {
  1. ?X rdf:type ub:Chair .
  2. ?Y rdf:type ub:Department .
  3. ?X ub:worksFor ?Y .
  4. ?Y ub:subOrganizationOf <http://www.University0.edu> }

[Query graph omitted: pattern 1 has variable X, patterns 2 and 4 have Y, pattern 3 has X and Y; |E| = 4]

• Job 1 map output keys:
  • Y – patterns 2, 3, 4 (3 joins)
• Job 1 joins: 3
• 1 join left, so another job is needed
Algorithm (contd.)
[Query graph omitted: intermediate result A (patterns 2, 3, 4) has variables X, Y; B (pattern 1) has X; they share X]

• Job 2 map output key:
  • X – joins A and B (1 join)
• Job 2 joins: 1
• No joins left, so no more jobs are needed
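The following illustrative sketch captures one reading of this planning step (not necessarily the authors' exact algorithm): each job greedily joins on the variable shared by the most remaining triple patterns, and jobs are added until no joins remain. On the example query it yields the same two jobs.

import java.util.*;

public class JoinPlanner {
    // Each triple pattern is represented by its set of variables, e.g. {"X","Y"}.
    public static int countJobs(List<Set<String>> patterns) {
        List<Set<String>> remaining = new ArrayList<>();
        for (Set<String> p : patterns) remaining.add(new HashSet<>(p));
        int jobs = 0;
        while (remaining.size() > 1) {
            // Pick the variable occurring in the most remaining patterns.
            Map<String, Integer> counts = new HashMap<>();
            for (Set<String> p : remaining)
                for (String v : p) counts.merge(v, 1, Integer::sum);
            String best = Collections.max(counts.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            if (counts.get(best) < 2) break; // nothing left to join on
            // One job: merge all patterns containing the chosen variable.
            Set<String> merged = new HashSet<>();
            Iterator<Set<String>> it = remaining.iterator();
            while (it.hasNext()) {
                Set<String> p = it.next();
                if (p.contains(best)) { merged.addAll(p); it.remove(); }
            }
            remaining.add(merged);
            jobs++;
        }
        return jobs;
    }

    public static void main(String[] args) {
        // The four patterns of the example query: 1:{X}, 2:{Y}, 3:{X,Y}, 4:{Y}
        List<Set<String>> q = List.of(
                Set.of("X"), Set.of("Y"), Set.of("X", "Y"), Set.of("Y"));
        System.out.println(countJobs(q)); // prints 2, matching the two jobs above
    }
}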
Some Query Results
[Charts omitted. Horizontal axis: number of triples; vertical axis: time in milliseconds]
Query Preprocessing
• Original query 2:
  ?X rdf:type ub:GraduateStudent .
  ?Y rdf:type ub:University .
  ?Z rdf:type ub:Department .
  ?X ub:memberOf ?Z .
  ?Z ub:subOrganizationOf ?Y .
  ?X ub:undergraduateDegreeFrom ?Y
• Rewritten (see the sketch below):
  ?X rdf:type ub:GraduateStudent .
  ?X ub:memberOf_Department ?Z .
  ?Z ub:subOrganizationOf_University ?Y .
  ?X ub:undergraduateDegreeFrom_University ?Y
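A minimal sketch of this rewriting step, under our reading of the slide: an rdf:type pattern is dropped when its variable's class can be folded into the predicate of another pattern that has the variable as object, while type patterns on otherwise unconstrained variables (such as ?X here) are kept. All names below are illustrative.

import java.util.*;

public class QueryRewriter {
    // A triple pattern as three strings, e.g. ("?X", "ub:memberOf", "?Z").
    record Pattern(String s, String p, String o) {}

    public static List<Pattern> rewrite(List<Pattern> query) {
        // 1. Collect variable -> class bindings from rdf:type patterns.
        Map<String, String> classOf = new HashMap<>();
        for (Pattern t : query)
            if (t.p().equals("rdf:type")) classOf.put(t.s(), localName(t.o()));

        // 2. Fold the object's class into each non-type predicate; drop the
        //    type patterns that are now redundant.
        List<Pattern> out = new ArrayList<>();
        for (Pattern t : query) {
            if (t.p().equals("rdf:type")) {
                boolean foldedElsewhere = query.stream().anyMatch(
                        u -> u != t && !u.p().equals("rdf:type")
                                && u.o().equals(t.s()));
                if (!foldedElsewhere) out.add(t); // e.g. ?X rdf:type ub:GraduateStudent
            } else if (classOf.containsKey(t.o())) {
                out.add(new Pattern(t.s(), t.p() + "_" + classOf.get(t.o()), t.o()));
            } else {
                out.add(t);
            }
        }
        return out;
    }

    private static String localName(String uri) {
        int i = Math.max(uri.lastIndexOf(':'), uri.lastIndexOf('#'));
        return i >= 0 ? uri.substring(i + 1) : uri;
    }
}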
Parallel Experiment with Pig
• Script for query 2:

/* Load statements */
GS  = LOAD 'type_GraduateStudent' AS (gs_subject:chararray);
MO  = LOAD 'memberOf_Department' AS (mo_subject:chararray, mo_object:chararray);
SOF = LOAD 'subOrganizationOf_University' AS (sof_subject:chararray, sof_object:chararray);
UDF = LOAD 'undergraduateDegreeFrom_University' AS (udf_subject:chararray, udf_object:chararray);

/* Joins */
MO_UDF_GS = JOIN GS BY gs_subject, UDF BY udf_subject, MO BY mo_subject PARALLEL 8;
MO_UDF_GS = FOREACH MO_UDF_GS GENERATE mo_subject, udf_object, mo_object;
MO_UDF_GS_SOF = JOIN SOF BY (sof_subject, sof_object), MO_UDF_GS BY (mo_object, udf_object);
MO_UDF_GS_SOF = FOREACH MO_UDF_GS_SOF GENERATE mo_subject, udf_object, mo_object;

/* Store query answer */
STORE MO_UDF_GS_SOF INTO 'Query2' USING PigStorage('\t');
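Such a script can be submitted to the cluster with, for instance, pig -x mapreduce query2.pig (the script file name is ours); the PARALLEL 8 clause sets the number of reduce tasks used for the first join.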
Parallel Experiment with Pig (contd.)
• 2 jobs created for query 2
• For 330 million triples, Pig answers in 20 minutes
• The direct MapReduce approach takes 10 minutes
Future Work
• Run all 14 queries for 100 million, 200 million, …, 1 billion triples and compare with the Jena In-Memory, RDB, SDB and TDB models
• Cascading the output of one job to another job as input without using the hard disk
• Generic MapReduce code
• Proof of the algorithm
• Modification of the algorithm for queries with optional triple patterns
• Indexing, summary statistics
References
[1] BIOMANTA: http://www.biomanta.org/
[2] LUBM: http://swat.cse.lehigh.edu/projects/lubm/